Primer on BigQuery Cost and Performance Optimizations
Previous Articles in this Series
The “Doing” Introduction
The past two sections have covered a lot of topics on prerequisite knowledge, general good-knowledge-to-have, and setting up some infrastructure for analyzing the data.
All of the code used throughout this series is located in the following GitHub repository: https://github.com/doitintl/bigquery-optimization-queries
Since the two methods of monitoring generate vastly different schemas I have broken up the SQL files into two folders (audit_log and information_schema) in the repository.
Each directory will have a set of SQL files for each purpose and there is a README.md in each folder explaining the purpose of each script to assist in this.
In the root directory is a python script named generate_sql_files.py that will generate a copy of the SQL files with your project and dataset information.
Index of Queries in GitHub Repository
Due to this GitHub repository being a dynamic creation I have created an article listing out each query in it along with the descriptions here.
This will be updated as more queries are added.
Some advice before you run queries…
Starting in the next section and continuing throughout the rest of this series, I will be referencing many queries and examining a lot of data. The BigQuery UI is decent at coping with lots of queries and analysis, but I HIGHLY recommend exporting this data to a spreadsheet.
Google Sheets is the ideal option because BigQuery can export results directly to it (with a few size limitations). In addition, exporting data once to a spreadsheet is cheaper than rerunning the same query multiple times to do your analysis.
You also need to keep an eye on costs. Some of these queries can be very expensive, so I always recommend you look at the cost of the query before running it. I have put a variable called interval_in_days to each query, which controls how far back the query will look and this can be changed to reduce the amount of data it scans. Some customers of ours have just gone through and modified the WHERE clauses to grab ranges for them so the sky's the limit when it comes to these queries.
Quick Note for DoiT customers
All of the steps in this part can be performed inside of the BQ Lens part of the CMP. The Explorer pane will allow you to do these steps visually, in fact if you are using the audit log sink queries many of them are almost identical to those the CMP runs to display the data.
Finding Costly Queries
The first step of cost optimization for BigQuery is to find the costliest queries in your environment. Many times these are not known to exist at all so in the infamous words that GI Joe taught us from Saturday morning cartoons: “Knowing is half the battle.”
Once the queries are determined then it’s time to optimize the queries, determine the problem processes/users running them, and then fixing these.
I will be referencing SQL files in the GitHub repository mentioned above (link). Note that you can use either the audit log or information_schema files as they will achieve the same result.
Most Expensive Queries Overall
The first query you will be using in our analysis is in the top_costly_queries.sql file and is arguably the most important query out of them all in regards to cost. This query includes the total cost that a query incurs over the interval throughout all of its iterations. So this will tell you the actual most expensive queries that may be run multiple times over the interval costing more than they initially appear to.
The purpose of this query is twofold: to determine the overall most expensive query from your environment and to determine if an expensive query is running more times than it needs to. Oftentimes a process such as a cron job, a task inside of a DAG on an Airflow instance, a Cloud Function, etc. are running multiple times, or more times than they need to in many cases, which run a query. Even more often is that the creator, maintainer, or the person paying the cloud bill has no idea how much these queries cost per iteration and/or per month. That’s where this query comes in handy!
Most Expensive Individual Queries
The second query you will be using for this section is in the top_billed_queries.sql file. This query will list out the most expensive queries in your environment sorted by the most expensive descending. Note that unlike the first query this doesn’t take into account the query being run multiple times, it just displays the cost per run of the query.
Running this query will immediately push the most expensive queries up to the top of the result set. One of the best things to do here is to find the top query and first see how much data it processed (the total*Billed columns). Now take that number and look at the query itself to determine what’s going on to make it cost so much.
Many times the culprits will be the issues I listed above, but there may be other things going on here. So I will recommend looking at the queries to see if anything is obvious in them going on.
Note on duplicate queries
If the results are returning the exact same query multiple times and are skewing the results or making it difficult to view then look at the top_billed_queries_deduplicated.sql file instead. This is the exact same query, but it deduplicates the exact query being run so only one result will appear. It will take more time to run than and scan more data doing so, but it will return deduplicated results.
Note that at the time of writing BigQuery doesn’t provide a native “string similarity” aggregate, such as a Hamming distance or Levenshtein distance, so filtering out similar queries with small changes such as a date is relatively difficult to do . If it’s needed, there are a few UDF solutions that implement some of the “string similarity” algorithms for BigQuery out there, though I suggest instead creating a solution outside of BigQuery, as UDF implementations of computationally expensive algorithms like this are pretty slow.
Most Expensive Users
The third and last query you will be using to look at cost is the top_cost_users.sql file. This query will list out the most expensive queries in your environment but sorted by the user followed by the most expensive query.
The purpose of this query is to show which users or service accounts are spending the most amount of money and on what queries. Many times included in this list are processes running inefficient queries that you are unaware of. In some extreme cases these may be processes located in other projects or clouds. Adding additional filters to remove these might also be beneficial to filter them out.
Other Queries in Repository
Those are three main queries that are pretty general for helping discover queries costing more money than expected. Included in the GitHub repo are some additional queries for more special purposes such as finding Looker originated queries, how many times a query is run, how expensive specific-labeled queries cost, etc. These will be pretty specific, but are ones that myself and other members of my team at DoiT have used in the past so we are sharing with the greater community.
Finding Performance Problem Queries
The next major topic is finding queries that are eating up more resources than needed and might not be performing as expected. In many cases the queries found here will go hand-in-hand with the most expensive queries so there might be some common queries found.
In this section the term complexity will come up quite a bit. Defining the term complexity is, well, complex, but for the sake of simplicity I define it here as “how many slots a query uses during its execution. This value is defined as the total amount of time slots spent performing work divided by the total execution time of the query. Here is an example of this being done in for the JOBS_BY_PROJECT view:
SELECT SAFE_DIVIDE(total_slot_ms, TIMESTAMP_DIFF(end_time, start_time, MILLISECOND)) AS slotCount FROM `<project-name>`.`<dataset-region>`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
To answer a question asked here often, this is an approximation as BigQuery unfortunately doesn’t return out the exact values and this calculation is what the BigQuery UI displays as well.
Complexity and Slot Counts
In any sort of database system when a query is more logically complex that usually has a direct correlation with the complexity of running the query internally. BigQuery is no exception to this and in addition there generally is a direct correlation between this complexity and query cost. So generally speaking, reducing complexity also reduces costs.
With that said the primary query from the GitHub repository to use for determining query complexity is the top_complex_queries.sql query. This query will return back the slots that use the most slots over their runtime so grabbing the top ones from this will help identify queries that could potentially have performance issues.
This is not an exact science as many times complex queries must remain that way and work correctly, but this is a manner to find some candidates that could have some issues.
Another potential way to identify performance issues in queries is to look at the queries that run the longest. As mentioned above this is not an exact science as well because many times a query may take a while because it just has to.
The query in longest_running_queries.sql will return out all queries over the timeframe ordered by longest running to least. This will push the queries that take the longest time up to the top for looking at. The runtimeToBytesBilledRatio column in that query will give a good idea of how many bytes the query processed per millisecond. Generally if this number is greater than 1 then it should be looked at to see if there is something that could be optimized to reduce how much data is processed.
More Performance Problems
In a future part of this series performance tuning will be covered much more heavily showing some of the most commonly unknown pitfalls of BigQuery performance and methods to overcome them.
General Purpose Queries from Repository
The last topic touched on by this part of the series is a set of queries in the GitHub repository that show information or metadata that is more general than those covered so far.
Job Type Queries
There are multiple types of jobs out there such as queries, loads, extracts, and table copies. The most common of these are by far queries and loads so there exists load_job_information.sql and query_job_information.sql files in the repository which will return out these types of jobs and metadata about them that will be useful.
There is also a more generic version of this called general_job_information.sql that will return all types of jobs and generic metadata about them.
I am going to preface this section with Google just made a change in the month of December 2022 to the behavior of BigQuery in relation to concurrent queries. It used to have a hard limit of 100 concurrent queries per project, but in anticipation of upcoming changes and the rollout of query queues they have changed this value to scale with the number of slots available.
Note this is not fully documented at the time of writing, but I have gathered this info from support and the existing documentation.
By default this value is set to zero, which means that BigQuery dynamically determines the concurrency based upon available resources that can be allocated to the project. When switching to a flat-rate pricing a setting called maximum concurrency on the reservation can be set to request this number being the top concurrency. Google can’t guarantee it will be set at that, but they attempt to meet it. The official documentation can be found on the query queues page here.
With that disclaimer out of the way let’s jump into the issues with concurrent queries and how to determine the issues.
While running queries there are times you will run too many in parallel then get the dreaded 503 “Service Unavailable” or the newer “Query was not admitted as the maximum concurrency has been reached" message. This means your concurrency level has been breached and the queries will not be scheduled. This is bad for obvious reasons.
Even if your organization hasn’t hit this limit in some projects then it’s good to be aware of it so it can be preemptively avoided. I will discuss some strategies to mitigate this or to design so it doesn’t happen in a later part of this series after the new changes just implemented by Google can be fully dissected and understood.
In the meantime to determine your concurrency averages across minutes and seconds use the concurrent_queries_by_minute.sql and concurrent_queries_by_seconds.sql queries. These will return the average amount of concurrent running queries over each minute (or second) in the defined interval.
Many times a query is run multiple times either by a user or through some sort of automated process. Since each query costs money it’s a good idea to know how often a query is run so that it can be determined if it’s running too often costing excess money.
The query_counts.sql query will display the number of times a query has run within the time interval. This is very useful to determine if a specific query is running more often than it should thus costing more money. It also includes the total cost of the query over the specified interval as well to assist in your determination of if it is running and costing too much or not.
This concludes the first three sections of writing on the subject of BigQuery optimization. In the next section I have placed a table showing the contents of the GitHub repository that will be updated as new queries are added documenting the functionality of each.
More sections to this series will be added shortly as BigQuery goes through a phase of change and thus I will be here to dig into how to keep your usage optimized as these are rolled out.
Previous Articles in this Series