
Avoiding eight common BigQuery query mistakes

  • Sayle Matthews
  • Date: January 10, 2023

How to save time and money processing your BigQuery queries

Part of the process of spending less to achieve more with BigQuery is recognizing the common mistakes people make when writing BigQuery queries. If you want to speed up query processing and reduce the costs involved, avoid these eight common mistakes:

1. SELECT *

SELECT * is probably the biggest source of unnecessary additional costs with BigQuery queries.

When you select all the columns in a table or view, you’re usually just scanning excess data. A SELECT * is warranted in only a few cases: when you have already filtered down a view, used a Common Table Expression (CTE) to reduce the data needed, or have a small table where all of the data is actually required (such as a fact table).

Otherwise, performing a SELECT * simply inflates your bill, because on the on-demand pricing model BigQuery charges for the volume of data your queries scan.

For example, if you perform a SELECT * on a 5 TB table with five columns containing equal volumes of data, all of which has to be scanned, that query will cost $25, whereas a query selecting only the two columns you actually need would cost just $10. The costs quickly add up for queries run multiple times a day.

Example of using a SELECT * on a very large public dataset query
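
The original screenshot isn’t reproduced here, but the idea can be sketched against any large public table; the table and columns below (bigquery-public-data.crypto_bitcoin.transactions, with its hash and fee columns) are just one familiar example and not necessarily the dataset from the screenshot:

-- Scans every column of a multi-terabyte table; you are billed for all of it
-- (note that LIMIT does not reduce the bytes scanned):
SELECT *
FROM `bigquery-public-data.crypto_bitcoin.transactions`
LIMIT 10;

-- Scans only the columns you actually need, for a fraction of the cost:
SELECT hash, fee
FROM `bigquery-public-data.crypto_bitcoin.transactions`
LIMIT 10;

Running both as a dry run (or checking the “This query will process …” estimate in the console) shows the difference in bytes scanned before you spend anything.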

2. Unnecessary or large joins

For data warehouses with an OLAP focus, like BigQuery, the advice is to denormalize your schemas, flattening data structures to minimize the number of joins required compared with a traditional relational database. This is because a join operation is much slower in BigQuery than in a traditional database, due to the way data is stored in the underlying system. Joining large tables takes more time and scans more data than simply storing the required data (or a copy of it) in the same table.

You should also avoid the “self-join,” often used to break a table’s data into windows of time or to impose an internal ordering on duplicate rows (called ranking in many database systems). Self-joins are exceedingly slow in BigQuery, so use its window (analytic) functions instead.

Here is an example of ranking duplicate job IDs in your INFORMATION_SCHEMA view:

 

-- Rank duplicate job IDs with a window function rather than a self-join
SELECT
 query,
 job_id AS jobId,
 COALESCE(total_bytes_billed, 0) AS totalBytesBilled,
 ROW_NUMBER() OVER(PARTITION BY job_id ORDER BY end_time DESC) AS _rnk
FROM
 `<project-name>`.`<dataset-region>`.INFORMATION_SCHEMA.JOBS_BY_PROJECT

3. Cross joins

Cross joins are not something anyone with a Relational Database Management System (RDBMS) or software engineering background would normally reach for, but they are required for several things in BigQuery. The principal use case is unnesting arrays into rows, a fairly common operation when working with analytical data.

Example of unnesting a RECORD typed column using a CROSS JOIN

However, if you use cross joins as the innermost operation in your query, they pull in far more data than will be passed to the output, and BigQuery bills you for scanning and reading a lot of data that may be thrown away at a later stage of the query. Instead, perform your cross joins at the outermost possible point in the query, so that the volume of data read before the cross join is minimized. This reduces both your slot usage and the volume of data you pay for.
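
As a sketch of that advice, assume a hypothetical orders table with a repeated RECORD column called line_items; filtering rows in a subquery first means far less data flows into the CROSS JOIN:

SELECT
 o.order_id,
 item.sku,
 item.quantity
FROM (
 -- Restrict the rows (and columns) before unnesting
 SELECT order_id, line_items
 FROM `<project-name>`.`<dataset-name>`.orders
 WHERE order_date >= DATE '2023-01-01'
) AS o
CROSS JOIN UNNEST(o.line_items) AS item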

4. Using Common Table Expressions (CTEs) incorrectly

Common Table Expressions (CTEs) are wonderful for breaking up SQL code that’s getting deep into multiple levels of subqueries. They are generally used for readability rather than performance: they do not materialize their data and are re-run each time they are used. The biggest cost and performance issue I see is defining a CTE in a query and then referencing it multiple times; the CTE’s query runs once per reference, so you are billed for reading the data multiple times.
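
A hedged illustration of the difference, using a hypothetical transactions table: the CTE version scans (and bills for) the underlying table once per reference, while materializing the intermediate result into a temporary table pays for the heavy scan only once:

-- The CTE is re-evaluated for each reference below, so the aggregation runs twice:
WITH spend_per_user AS (
 SELECT user_id, SUM(amount) AS total
 FROM `<project-name>`.`<dataset-name>`.transactions
 GROUP BY user_id
)
SELECT * FROM spend_per_user WHERE total > 1000
UNION ALL
SELECT * FROM spend_per_user WHERE total < 10;

-- Materializing once in a script means the expensive scan happens a single time:
CREATE TEMP TABLE spend_per_user AS
SELECT user_id, SUM(amount) AS total
FROM `<project-name>`.`<dataset-name>`.transactions
GROUP BY user_id;

SELECT * FROM spend_per_user WHERE total > 1000
UNION ALL
SELECT * FROM spend_per_user WHERE total < 10;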

5. Not using partitions in WHERE clauses

Although partitions are one of the most important features of BigQuery for reducing costs and optimizing read performance, they are frequently omitted, incurring unnecessary spending on queries. A partition breaks up a table on disk into different physical partitions, based on an integer or timestamp/datetime/date value in a specific column, so when you read data from a partitioned table and specify a range on that column, it only needs to scan over the partitions that contain the data in that range – not the whole table.

The following query pulls the total billed bytes for all queries in the past 14 days. The JOBS_BY_PROJECT view is partitioned by the creation_time column (see its schema documentation) and, when run against a sample table with a total size of about 17 GB, it processes 884 MB of data.

 

DECLARE interval_in_days INT64 DEFAULT 14;

SELECT
 query,
 total_bytes_billed AS totalBytesBilled
FROM
 `<project-name>`.`<dataset-region>`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
 creation_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL interval_in_days DAY)
   AND CURRENT_TIMESTAMP()

The following query uses the start_time column instead. That column is not the partitioning column, even though its value is usually within fractions of a second of creation_time, so against the same sample dataset the query processes 15 GB of data: BigQuery has to scan the whole table to pull the requested values out.

DECLARE interval_in_days INT64 DEFAULT 14;

SELECT
 query,
 total_bytes_billed AS totalBytesBilled
FROM
  `<project-name>`.`<dataset-region>`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
 start_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL interval_in_days DAY)
   AND CURRENT_TIMESTAMP()

The contrast is stark, even on a smaller dataset: at on-demand rates the first query costs about $0.004 and the second about $0.07, so failing to filter on the partitioned column makes this query roughly 17 times more expensive.

Performance is also an issue, with the first query taking about two seconds to run and the second about five. Scaling up to a multi-TB table could easily generate multiple-minute differences per query run.
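
If you control the table’s schema, you can make this kind of pruning the default by partitioning the table when you create it, and optionally requiring a partition filter on every query. A minimal sketch with hypothetical names:

CREATE TABLE `<project-name>`.`<dataset-name>`.events
(
 event_id STRING,
 payload STRING,
 created_at TIMESTAMP
)
PARTITION BY DATE(created_at)
-- Reject any query that does not filter on the partitioning column
OPTIONS (require_partition_filter = TRUE);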

6. Using overcomplicated views

As with most of its pseudo- and real-relational brethren, BigQuery supports a construct called a view. Essentially, a view is a query that masquerades its results as a table for easier querying. If the view contains very heavy computations that are executed each time the view is queried, it can degrade the query’s performance substantially. If the logic in a view is excessively complicated, it might be better suited to pre-calculation into another table or living in a materialized view to enhance performance.
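
For example, if a view repeatedly re-aggregates a large table, a materialized view can precompute and incrementally maintain that result. A minimal sketch, with hypothetical table and column names:

CREATE MATERIALIZED VIEW `<project-name>`.`<dataset-name>`.daily_spend_mv AS
SELECT
 DATE(created_at) AS day,
 SUM(cost) AS total_cost
FROM `<project-name>`.`<dataset-name>`.billing_events
GROUP BY day;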

7. Small inserts

BigQuery works best when it processes substantial chunks of data at a time, but sometimes it’s necessary to insert a small number of records into a table, especially in streaming-style applications.

For small inserts, inserting 1 KB or 10 MB generally takes a similar amount of time and slot usage, so doing 1,000 separate inserts of a 1 KB row can consume up to 1,000 times as much slot time as a single insert of 10 MB of rows. Rather than issuing many small inserts, accumulate your data and insert it in one batch. The same goes for streaming workloads: instead of using streaming inserts row by row, batch your data (with a deadline so it still arrives on time) before inserting it.
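
As a rough sketch of the batching idea (the events table below is hypothetical):

-- Many tiny inserts: each statement is its own job, with its own overhead and slot time
INSERT INTO `<project-name>`.`<dataset-name>`.events (event_id, payload) VALUES ('e1', 'a');
INSERT INTO `<project-name>`.`<dataset-name>`.events (event_id, payload) VALUES ('e2', 'b');

-- One batched insert: the same rows arrive in a single job
INSERT INTO `<project-name>`.`<dataset-name>`.events (event_id, payload)
VALUES ('e1', 'a'), ('e2', 'b'), ('e3', 'c');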

8. Overusing DML Statements

This is a big issue that usually arises when someone approaches BigQuery as a traditional RDBMS and rewrites or recreates data at will.

These are three relatively common examples:

-- Pattern 1: drop, recreate and reload the table
DROP TABLE IF EXISTS <table-name>;
CREATE TABLE <table-name> ...;
INSERT INTO <table-name> (<columns>) VALUES (<values>);

-- Pattern 2: truncate and reload the table
TRUNCATE TABLE <table-name>;
INSERT INTO <table-name> (<columns>) VALUES (<values>);

 

-- Pattern 3: delete matching rows, then re-insert replacements
DELETE FROM <table-name> WHERE <condition>;
INSERT INTO <table-name> (<columns>) VALUES (<values>);

Although running these on an RDBMS such as SQL Server or MySQL would be relatively inexpensive, they perform very poorly in BigQuery, which is not optimized for DML in the way a traditional RDBMS is. Consider using an “additive model” instead: new rows are inserted with a timestamp so the most recent row represents the current state, and older rows are deleted periodically if history is not needed.

BigQuery is a data warehouse tuned for analytics, so it’s designed for working with existing data rather than modifying data in a transactional manner.
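
A sketch of the additive model with a hypothetical customers table: every change is appended as a new row, and readers pick the most recent row per key instead of updating in place:

-- Append a new version rather than UPDATE-ing the existing row
INSERT INTO `<project-name>`.`<dataset-name>`.customers (customer_id, status, updated_at)
VALUES ('c-123', 'active', CURRENT_TIMESTAMP());

-- Read the latest version of each customer
SELECT * EXCEPT(_rnk)
FROM (
 SELECT
  *,
  ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) AS _rnk
 FROM `<project-name>`.`<dataset-name>`.customers
)
WHERE _rnk = 1;

Older versions can then be cleared out periodically with a single scheduled DELETE if history isn’t needed.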

What to do next

This article is condensed from my series of articles on optimizing your BigQuery queries.

At DoiT, we have deep, broad expertise in BigQuery and across machine learning and business intelligence. To take advantage of our support, get in touch.
