Presenting a new guide to BigQuery cost and performance optimization
Sudden cost jumps are not unusual if you’re running multiple BigQuery datasets with multiple analyst teams querying them. But they’re not inevitable.
In our latest ebook, The BigQuery Optimization Handbook: Preparing to Save, DoiT Senior Cloud Architect and data specialist Sayle Matthews delivers the first part in a series of expert insights into how you can spend less to achieve more with BigQuery and make your Google Cloud bills more predictable.
The ebook covers
- Getting the basics right
- Avoiding common query mistakes
- Identifying your costliest queries
Getting the basics right
The data warehouse BigQuery is fully managed and provides built-in features like machine learning, geospatial analysis and business intelligence to help you manage and analyze big data. But it can be easy to run up unexpected costs if you don’t do some initial groundwork. Before you address your BigQuery costs, you will need to know some basics. First, slots and pricing models:
BigQuery’s computation rests on a construct called a slot, which is a vCPU with some memory attached. Theoretically, the more slots allocated and available, the faster queries run. Standard slots are always purchased in groups of 100 within a commitment for terms of one month or one year.
Slots perform the undocumented job of shuffling the data. Simply put, shuffling is redistributing processed data to a new location for the current or next step in the query plan to run faster. Up to 60% of the slots allocated to your project can act as shuffle slots at any given time. Shuffling boosts query efficiency, but it uses valuable slots to perform operations that don’t progress the query execution directly.
Pricing models are either on-demand or flat-rate. The default model is on-demand pricing, which sets a rate of $5 per TB of data scanned by queries. For most companies, this is the key driver of costs in BigQuery.
Flat-rate or slot pricing sets a flat price for BigQuery scanning at the potential expense of performance. It works by limiting the number of slots available to what you pre-purchase and removing the $5-per-TB scan price.
With flat-rate pricing, you can adjust your architecture designs to add a pool of slots called Flex slots to increase or replace your existing slots. Flex slots offer shorter and more flexible commitments, with a minimum commitment time of 60 seconds, and the option to eliminate the slots by canceling at any time.
To match the performance of on-demand, you need to purchase 20 100-slot chunks at a cost of $40,000 per month or $34,000 for one year. It’s not cost-effective to purchase 2,000 slots unless you are spending at least that much in BigQuery scan costs.
Before you can optimize your costs, you need to analyze your projects’ data usage. This requires access to the query data in your projects/datasets.
There are two methods of doing this: audit log sinks and the INFORMATION_SCHEMA tables. The audit log sink is the preferred method because the data is much richer. DoiT customers who already use the BigQuery Lens feature in our Cloud Management Platform (CMP) have the audit log sink set up in their environments. You can find more information here.
Avoiding common query mistakes
Before I jump into the queries that will give you the actual data you want, let’s look at some examples of common mistakes people make writing BigQuery queries. These can make queries take longer and cost more than necessary to process.
1. SELECT *
This is probably the biggest source of unnecessary additional costs with BigQuery queries. It’s usually unnecessary to select all the columns in a table or view. Remember BigQuery bills are based on the amount of data you scan in your queries, so minimize your selection to reduce this cost.
2. Unnecessary or larger joins
In data warehouses that focus on an OLAP strategy (such as BigQuery) the best practice is to denormalize the schemas in the database. This essentially flattens out the data structures and reduces the number of joins required compared with a traditional relational database.
3. Cross joins
Cross joins are required for several use cases in BigQuery, but problems arise when people do them as the innermost operation in their query, causing them to pull in far more data than will be passed to the output.
4. Using Common Table Expressions (CTEs) incorrectly
Common Table Expressions (CTEs) are amazing devices that simplify SQL code immensely. However, using a CTE in a query and referencing it multiple times so the CTE query is run multiple times means you will be billed for reading the data multiple times.
5. Not using partitions in WHERE clauses
Partitions are one of the most important features of BigQuery for reducing costs and optimizing read performance. However, they are frequently omitted, adding unnecessary costs to queries.
6. Using over-complicated views
Creating complex views can degrade performance. If the logic in a view is too complicated, it might be more suited to pre-calculation in another table or living in a materialized view to enhance performance.
7. Small inserts
If you need to insert a small number of records into a table, insert the data as a batch rather than doing numerous small inserts.
8. Overusing DML Statements
People often overuse DML statements if they treat BigQuery as a traditional RDBMS system and simply recreate data at will. A better alternative is to use an “additive model,” which inserts new rows with a timestamp to denote the latest and removes older rows periodically if history is not needed.
Identifying your costliest queries
The ebook links to a GitHub repository containing the SQL files you will need to find your most expensive queries. Here are the three main queries you will be using:
- Most expensive queries overall
- Most expensive individual queries
- Most expensive users
Other queries in the repository
The GitHub repo also includes some additional queries for more specialized purposes such as finding Looker-originated queries, the number of times a query is run, the cost of specific-labeled queries, etc.
Finding performance problem queries
After you have found the costliest queries, you will need to uncover those queries that consume more resources than needed and don’t perform as expected. These queries often align with the most expensive queries, so there may be some overlap.
Performance tuning is covered in greater depth in a future part of this series, focusing on some of the least known pitfalls of BigQuery performance and methods to overcome them.
The last topic touched on in this part of the series is a set of queries in the GitHub repository that show more general information or metadata:
- Job type queries
- Concurrent queries
- Query counts
What to do next
Download The BigQuery Optimization Handbook: Preparing to Save. As this and the future parts of the series contain a considerable amount of detailed material, it is recommended you start an audit log sink for your BigQuery projects to gather data over time and make these queries and the rest of this series more valuable. Remember, existing DoiT customers using the BigQuery Lens feature in our Cloud Management Platform (CMP) already have the audit log sink set up in their environments.