An edited excerpt from a series on optimizing BigQuery costs and performance
The first part of this series outlined what you need to have in place to run queries in BigQuery, and the second part covered the most common querying mistakes you are likely to encounter when optimizing your BigQuery usage. In this article, we explore how to identify the costliest queries in your environment. All of the code used in the series is referenced in this GitHub repository.
Most expensive queries overall
The first query to look at is in the top_costly_queries.sql file. Arguably the most important query of all with regard to cost, it calculates the total cost a query incurs over the interval, across all of its runs. Its purpose is twofold: to identify the overall most expensive query in your environment and to determine whether an expensive query is running more often than necessary.
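As a rough illustration of the idea (this is a simplified sketch, not the repository's exact top_costly_queries.sql, and it assumes on-demand pricing of roughly $5 per TiB billed at the time of writing — adjust to current rates):

```sql
-- Sketch: approximate total cost per query text across all runs in the last 7 days.
-- Columns come from INFORMATION_SCHEMA.JOBS_BY_PROJECT; the pricing constant is an assumption.
SELECT
  query,
  COUNT(*) AS runCount,
  SUM(total_bytes_billed) AS totalBytesBilled,
  SUM(total_bytes_billed) / POW(1024, 4) * 5.0 AS approxCostUsd
FROM `<project-name>`.`<dataset-region>`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
  AND state = 'DONE'
GROUP BY query
ORDER BY approxCostUsd DESC
LIMIT 20;
```

Grouping by the full query text means each distinct statement appears once with its total cost and run count, which is exactly what lets you spot a cheap query that runs thousands of times.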
Most expensive individual queries
The second query for this section is in the top_billed_queries.sql file. It lists the most expensive queries in your environment, sorted from most to least expensive. Unlike the first query, this one doesn’t factor in a query being run multiple times; it simply displays the cost per run.
Running this query will immediately push the most expensive queries up to the top of the result set. Find the top query and first see how much data it processed (the total*Billed columns). Then take that number and look at the query itself to determine what’s making it cost so much.
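A per-run version can be sketched as follows (again illustrative, not the repository's exact top_billed_queries.sql):

```sql
-- Sketch: individual query runs ordered by bytes billed, most expensive first.
SELECT
  job_id,
  user_email,
  total_bytes_processed AS totalBytesProcessed,
  total_bytes_billed AS totalBytesBilled,
  query
FROM `<project-name>`.`<dataset-region>`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
  AND state = 'DONE'
ORDER BY total_bytes_billed DESC
LIMIT 20;
```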
Note on duplicate queries
If the same query appears multiple times, skewing the results or making them difficult to read, look at the top_billed_queries_deduplicated.sql file instead. This is the same query, but it deduplicates on the exact query text so each query appears only once. It takes longer to run and scans more data in doing so, but it returns deduplicated results.
BigQuery doesn’t currently provide a native “string similarity” function, such as Hamming distance or Levenshtein distance, so filtering out similar queries that differ only in small details such as a date is relatively difficult. If required, you can find a few UDF solutions that implement some of the “string similarity” algorithms for BigQuery, but it is preferable to handle this outside of BigQuery instead, because UDF implementations of computationally expensive algorithms like these are quite slow.
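One cheap workaround, short of true string-similarity matching, is to normalize the literals that commonly vary between runs (dates, numbers) before grouping. This is an illustrative sketch, not a query from the repository, and the regexes are deliberately crude — they will also mangle identifiers containing digits:

```sql
-- Sketch: collapse queries that differ only in date or numeric literals.
SELECT
  REGEXP_REPLACE(
    REGEXP_REPLACE(query, r"'\d{4}-\d{2}-\d{2}'", "'<date>'"),
    r'\b\d+\b', '<n>') AS normalizedQuery,
  COUNT(*) AS runCount,
  SUM(total_bytes_billed) AS totalBytesBilled
FROM `<project-name>`.`<dataset-region>`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
GROUP BY normalizedQuery
ORDER BY totalBytesBilled DESC;
```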
Most expensive users
A third query you can use to examine cost is in the top_cost_users.sql file. It lists the most expensive queries in your environment, sorted by user and then by cost.
The purpose of this query is to show which users or service accounts are spending the most money and on what queries. This list often includes processes running inefficient queries you may be unaware of; in some extreme cases, these processes live in other projects or clouds. Adding filters to exclude these can also be beneficial.
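A per-user rollup can be sketched like this (illustrative only; the repository's version also breaks the spend down by query, and the $5/TiB on-demand rate is an assumption to adjust):

```sql
-- Sketch: approximate spend per user or service account, biggest spenders first.
SELECT
  user_email,
  COUNT(*) AS queryCount,
  SUM(total_bytes_billed) / POW(1024, 4) * 5.0 AS approxCostUsd
FROM `<project-name>`.`<dataset-region>`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
  AND state = 'DONE'
GROUP BY user_email
ORDER BY approxCostUsd DESC;
```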
Other queries in repository
Also included in the GitHub repo are additional queries for more specialized purposes, such as finding Looker-originated queries, the number of times a query is run, the cost of specific-labeled queries, etc. These will be pretty specific, but they are ones that we at DoiT have used, so we are sharing them with the greater community.
Finding performance problem queries
The next major topic is finding queries that are eating up more resources than needed and not performing as expected. These queries are often also the most expensive queries, so there may be some overlap.
In this section, we will define the term “complexity” as the number of slots a query uses during its execution: the total slot time spent performing work divided by the total execution time of the query. Here is an example of this calculation for the JOBS_BY_PROJECT view:
SELECT
  SAFE_DIVIDE(total_slot_ms,
              TIMESTAMP_DIFF(end_time, start_time, MILLISECOND)) AS slotCount
FROM `<project-name>`.`<dataset-region>`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
This is an approximation because, unfortunately, BigQuery doesn’t expose the exact values; it is also the calculation behind the figure the BigQuery UI displays.
Complexity and slot counts
In any sort of database system, the increased logical complexity of a query usually correlates directly with the complexity of running the query internally. BigQuery is no exception. In addition, there is usually a direct correlation between this complexity and query cost. So generally speaking, reducing complexity also reduces costs.
That said, the primary query from the GitHub repository for determining query complexity is the top_complex_queries.sql query. It returns the queries that use the most slots over their runtime, so grabbing the top entries will help identify queries that could potentially have performance issues. This is not an exact science, because complex queries often must remain complex to work correctly, but it is a way to find some problematic candidates.
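Using the slot-count calculation from earlier, a minimal sketch of such a query might look like this (illustrative, not the repository's exact file):

```sql
-- Sketch: queries with the highest average slot usage over their runtime.
SELECT
  job_id,
  query,
  SAFE_DIVIDE(total_slot_ms,
              TIMESTAMP_DIFF(end_time, start_time, MILLISECOND)) AS slotCount
FROM `<project-name>`.`<dataset-region>`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
  AND state = 'DONE'
ORDER BY slotCount DESC
LIMIT 20;
```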
Another potential way to identify performance issues in queries is to look at the queries that run the longest. However, a query often takes a while because it simply must.
The query in longest_running_queries.sql will return all queries over the timeframe, ordered from longest-running to shortest. This pushes the queries that take the longest time to the top for examination. The runtimeToBytesBilledRatio column in that query gives the ratio of a query’s runtime in milliseconds to its bytes billed. Generally, if this number is greater than 1, the query is worth examining to see whether something could be optimized to reduce how much data it processes.
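A sketch of this approach (illustrative only; the actual longest_running_queries.sql may differ in detail):

```sql
-- Sketch: longest-running queries with a runtime-to-bytes-billed ratio.
SELECT
  job_id,
  TIMESTAMP_DIFF(end_time, start_time, MILLISECOND) AS runtimeMs,
  total_bytes_billed AS totalBytesBilled,
  SAFE_DIVIDE(TIMESTAMP_DIFF(end_time, start_time, MILLISECOND),
              total_bytes_billed) AS runtimeToBytesBilledRatio,
  query
FROM `<project-name>`.`<dataset-region>`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
  AND state = 'DONE'
ORDER BY runtimeMs DESC;
```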
A note on performance problems
In a future part of this series, I will cover performance tuning in greater detail, showing some of the most common, unknown pitfalls of BigQuery performance and methods to overcome them.
General purpose queries from the repository
The last topic touched on in this part of the series is a set of queries in the GitHub repository that show information or metadata that is more general than those covered so far.
Job type queries
By far the most common of the multiple job types out there (queries, loads, extracts and table copies) are queries and loads. Accordingly, the load_job_information.sql and query_job_information.sql files in the repository will return these types of jobs along with useful metadata about them. A more generic version, general_job_information.sql, will return all types of jobs and generic metadata about them.
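Filtering by job type is a single predicate on the same view used throughout this article; a minimal sketch (illustrative, not the repository's exact files):

```sql
-- Sketch: recent load jobs and some generally useful metadata.
SELECT
  job_id,
  user_email,
  creation_time,
  state,
  error_result
FROM `<project-name>`.`<dataset-region>`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'LOAD'  -- swap for 'QUERY', 'EXTRACT' or 'COPY'
ORDER BY creation_time DESC;
```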
In December 2022, Google changed how BigQuery handles concurrent queries. In anticipation of upcoming changes and the rollout of query queues, it replaced the hard limit of 100 concurrent queries per project with a limit that scales with the number of slots available. At the time of writing, this was not fully documented, but we have gathered this information from support and the existing documentation.
By default, this value is set to zero, which means that BigQuery dynamically determines the concurrency based on the resources that can be allocated to the project. When switching to flat-rate pricing, a setting on the reservation called maximum concurrency can be set to request this number as the top concurrency. Google can’t guarantee it will be honored exactly, but they attempt to meet it. The official documentation can be found on the query queues page. With that disclaimer out of the way, let’s jump into the issues with concurrent queries.
If you run too many queries in parallel, you will get the dreaded 503 message “Service Unavailable” or the newer one, “Query was not admitted as the maximum concurrency has been reached.” This means your concurrency level has been breached and the queries will not be scheduled. This is bad for obvious reasons and should be avoided.
Once the new changes implemented by Google can be fully dissected and understood, we will present some strategies to mitigate this or to design so it doesn’t happen. In the meantime, use the concurrent_queries_by_minute.sql and concurrent_queries_by_seconds.sql queries to determine your concurrency averages across minutes and seconds. These return the average number of concurrently running queries over each minute (or second) in the defined interval.
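The general shape of such a concurrency query can be sketched by joining the jobs view against a generated time spine; this simplified version counts queries in flight at each minute boundary rather than computing a true per-minute average, and it is not the repository's exact file:

```sql
-- Sketch: number of queries running at each minute boundary over the last day.
WITH minutes AS (
  SELECT minute
  FROM UNNEST(GENERATE_TIMESTAMP_ARRAY(
         TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY),
         CURRENT_TIMESTAMP(),
         INTERVAL 1 MINUTE)) AS minute
)
SELECT
  m.minute,
  COUNT(j.job_id) AS concurrentQueries
FROM minutes AS m
LEFT JOIN `<project-name>`.`<dataset-region>`.INFORMATION_SCHEMA.JOBS_BY_PROJECT AS j
  ON j.job_type = 'QUERY'
 AND j.start_time <= m.minute
 AND j.end_time > m.minute
GROUP BY m.minute
ORDER BY concurrentQueries DESC;
```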
Queries are often run multiple times either by a user or through some sort of automated process. The query_counts.sql query displays the number of times a query has run within the time interval. This is very useful to determine if a specific query is running more often than it should, thus wasting money. It also includes the total cost of the query over the specified interval.
What to do next
This post is condensed from my series of articles on optimizing your BigQuery queries, which will be expanded as changes are rolled out in BigQuery. If you are a current DoiT customer, you can perform all these steps within the BQ Lens part of the DoiT Console. Otherwise, you can get in touch with us at DoiT to draw on our deep, broad expertise in BigQuery and across machine learning and business intelligence.