BigQuery is one of the most powerful data warehouses available, but its serverless, pay-as-you-go model can generate bills that even experienced engineers may not expect. The mistakes below are not theoretical. They show up in production environments regularly and cost real money.
Mistake 1: Running SELECT * on Large Tables
This is the single most common and immediately expensive habit in BigQuery. Because BigQuery charges by bytes scanned on on-demand pricing, pulling every column from a wide table is the fastest way to generate a bill you did not plan for.
BigQuery stores data in a columnar format. When you query only the columns you need, the engine reads only those columns. When you run SELECT *, it reads everything, regardless of what your application actually uses.
What this looks like in practice:
| Query | Table Size | Columns Needed | Data Scanned | Cost at $6.25/TB |
| --- | --- | --- | --- | --- |
| SELECT * | 2 TB | 4 of 60 | 2 TB | $12.50 |
| SELECT only needed cols | 2 TB | 4 of 60 | ~133 GB | ~$0.83 |
The fix is straightforward: always specify the columns you need. If you are working with a very wide table, audit your scheduled queries for SELECT * statements before the next billing cycle.
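As a minimal sketch with the BigQuery Python client (the table and column names here are hypothetical), the column-specific version of a query looks like this, and printing the bytes billed after the run makes the savings easy to verify:

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes default project and credentials

# Pull only the columns the application actually uses, never SELECT *.
query = """
    SELECT event_id, user_id, event_type, event_timestamp
    FROM `my-project.analytics.events`
"""

job = client.query(query)
job.result()  # wait for the query to finish

# Compare this figure between the SELECT * and column-list versions.
print(f"Bytes billed: {job.total_bytes_billed:,}")
```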
Mistake 2: Misunderstanding How LIMIT Works in BigQuery
Engineers coming from traditional databases assume that LIMIT 10 means BigQuery only reads 10 rows. It does not work that way. BigQuery scans the full table, applies the query logic, and only then applies LIMIT when returning rows, so you are billed for the entire scan.
A preview query on a 500 TB table using SELECT * with LIMIT 10 will cost you over $3,000 for what looked like a quick data check. This is a well-documented source of unexpected billing spikes, particularly in organisations where analysts have direct access to production datasets.
How to preview data without paying for a full scan (a client-library sketch follows the list):
- Use the BigQuery table preview feature in the console, which shows rows without executing a query
- Query a sample using TABLESAMPLE SYSTEM (1 PERCENT) to read a fraction of the data
- Create a partitioned or clustered view for analysts to query safely during exploration
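The first two options can be sketched with the Python client (the table name is hypothetical). list_rows reads rows directly, much like the console preview, rather than running a query job, so nothing is billed as bytes scanned; TABLESAMPLE reads only a fraction of the table's storage blocks:

```python
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.analytics.events"  # hypothetical table

# Preview rows without running a query job -- no bytes scanned.
for row in client.list_rows(table_id, max_results=10):
    print(dict(row))

# Or sample roughly 1% of the table's storage blocks instead of scanning it all.
sample_job = client.query(f"""
    SELECT event_id, user_id, event_type
    FROM `{table_id}` TABLESAMPLE SYSTEM (1 PERCENT)
""")
for row in sample_job.result():
    print(dict(row))
```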
Mistake 3: Not Using Partitioning and Clustering
Tables without partitioning force BigQuery to scan the entire dataset for every query, regardless of how narrow your filter is. This is one of the highest-leverage fixes available and one of the most frequently skipped during initial table design.
Partitioning by a date or timestamp column means queries with a date filter only scan the relevant partition. Clustering by frequently filtered columns narrows the scan further within each partition.
Common partitioning and clustering mistakes:
- Partitioning on a high-cardinality column that creates too many partitions and defeats the purpose
- Wrapping the partition column in a function like DATE(timestamp_column), which forces a full scan because BigQuery cannot apply partition pruning through a function
- Not clustering on the columns that appear most frequently in WHERE clauses
If your execution details show "Partitions scanned: 365 of 365" on a query filtering a single day, your partition filter is not working as intended.
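A sketch of the intended setup, with illustrative names: partition on a DATE column, cluster on the columns that dominate your WHERE clauses, and filter on the partition column directly so pruning can apply.

```python
from google.cloud import bigquery

client = bigquery.Client()

# One-time DDL: daily partitions on event_date, clustered on common filters.
client.query("""
    CREATE TABLE IF NOT EXISTS `my-project.analytics.events_partitioned` (
      event_id    STRING,
      customer_id STRING,
      event_type  STRING,
      event_date  DATE
    )
    PARTITION BY event_date
    CLUSTER BY customer_id, event_type
""").result()

# Filter on the partition column directly; wrapping it in a function risks
# defeating pruning and scanning every partition.
job = client.query("""
    SELECT event_id, customer_id, event_type
    FROM `my-project.analytics.events_partitioned`
    WHERE event_date = '2024-01-15'
""")
job.result()
print(f"Bytes billed: {job.total_bytes_billed:,}")
```

After a query like the second one, the execution details should report a single partition scanned rather than all of them.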
Mistake 4: Using Streaming Inserts When Batch Loading Would Work
Streaming inserts in BigQuery cost $0.01 per 200 MB, which works out to $50 per TB ingested. Batch loading from Google Cloud Storage is free. For teams ingesting large volumes of data, the choice of ingestion method has a direct and often overlooked impact on the monthly bill.
The common justification for streaming inserts is freshness. But most use cases that cite real-time requirements actually need data available within a few minutes, not seconds. A micro-batch approach using Cloud Functions or Dataflow to buffer records into GCS and load them every 5 to 10 minutes delivers near-real-time freshness at zero ingestion cost.
When streaming inserts are justified:
- Genuine real-time requirements where sub-minute data availability is a product or compliance requirement
- Low-volume, high-frequency event streams where the ingestion cost is genuinely minimal
- Workloads where the operational complexity of buffering would cost more than the streaming fees
For everything else, batch loading is the right default.
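A minimal sketch of that default with the Python client, assuming a Cloud Storage path that your micro-batching step writes to: a load job appends the batch without the per-volume streaming charge.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical bucket path written by the micro-batching step.
uri = "gs://my-ingest-bucket/events/2024-01-15/*.json"
table_id = "my-project.analytics.events"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

# Batch load from Cloud Storage instead of streaming inserts.
load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # wait for the batch to land

print(f"Loaded {load_job.output_rows} rows")
```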
Mistake 5: Choosing the Wrong Pricing Model for Your Workload
BigQuery offers on-demand pricing at $6.25 per TB scanned and capacity-based pricing through BigQuery Editions using slot reservations. Teams that start on on-demand and never revisit that decision often end up significantly overpaying as query volume grows.
On-demand pricing works well for unpredictable workloads and low query volumes, typically under 20 TB per month. Once a team reaches consistent, predictable query patterns, capacity pricing through slot reservations almost always delivers better unit economics.
| Pricing Model | Best For | Cost Model |
| --- | --- | --- |
| On-demand | Exploratory, low-volume, unpredictable | $6.25 per TB scanned |
| Standard Edition (slots) | Predictable workloads, medium scale | Per slot-hour with autoscaling |
| Enterprise Edition | Production workloads needing idle slot sharing | Per slot-hour with commitment options |
| Enterprise Plus | Compliance-sensitive environments | Per slot-hour with the highest feature set |
The transition decision should be based on at least 30 days of query pattern data, not on a point-in-time estimate. You can read more about GCP pricing models and how to evaluate them before making the switch.
Mistake 6: Misconfiguring Autoscaling Slots
Autoscaling sounds like the ideal configuration. Pay only for what you need, scale up when demand spikes. In practice, BigQuery's autoscaling engine is designed to run queries as quickly as possible, which means it will use every available slot up to your configured maximum, regardless of whether that speed is necessary for the workload.
A reservation with 0 baseline and 500 maximum autoscaling slots will regularly consume close to 500 slots on scheduled pipelines that could comfortably run on 100. Autoscaling adjusts in 100-slot increments, recalculating once per minute, so a workload that needs 30 slots consistently gets allocated 100 slots.
How to configure autoscaling more deliberately:
- Audit whether each workload actually needs sub-minute completion or whether a longer runtime at lower slot allocation is acceptable
- Set a baseline slot count that reflects steady-state demand rather than leaving it at zero (a sketch of this follows the list)
- Apply workload management groups to separate interactive dashboards from batch pipelines so each gets the slot allocation appropriate to its latency requirements
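A rough sketch of the second point, assuming the reservation DDL option names slot_capacity, edition, and autoscale_max_slots (the project, region, and numbers are illustrative): a non-zero baseline for steady-state pipelines with a deliberately modest autoscaling ceiling.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Illustrative reservation: 100 baseline slots for steady-state demand and
# autoscaling capped well below an open-ended maximum. Option names are
# assumed from the reservation DDL; adjust project, region, and numbers.
client.query(
    """
    CREATE RESERVATION `admin-project.region-us.batch-pipelines`
    OPTIONS (
      edition = 'ENTERPRISE',
      slot_capacity = 100,
      autoscale_max_slots = 200
    )
    """,
    location="US",  # run the DDL in the reservation's region
).result()
```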
Mistake 7: Inefficient JOIN Operations on Large Tables
Poorly structured joins are one of the most common sources of unexpectedly high slot consumption. When large tables are joined without filters applied beforehand, BigQuery must process the full dataset from both tables before it can return results.
The fix involves a handful of habits that experienced BigQuery engineers treat as standard practice but that are frequently skipped under delivery pressure.
Best practices for JOIN cost reduction:
- Apply WHERE filters to individual tables before joining them, not after (sketched below this list)
- Place the largest table on the left side of the JOIN clause so BigQuery's optimiser can distribute the smaller table across worker nodes
- Cluster both tables on the join keys so BigQuery can apply block pruning and skip irrelevant data segments before the join runs
- Avoid joining tables that are replicated unnecessarily across regions, as cross-region joins carry both compute and potential egress costs
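A sketch of the first habit with hypothetical tables: reduce both inputs in CTEs before the join, rather than filtering the joined result afterwards.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Filter each side before joining; filtering only after the join risks
# processing both tables in full first.
query = """
    WITH recent_orders AS (
      SELECT order_id, customer_id, order_total
      FROM `my-project.sales.orders`
      WHERE order_date >= '2024-01-01'
    ),
    active_customers AS (
      SELECT customer_id, region
      FROM `my-project.crm.customers`
      WHERE status = 'active'
    )
    SELECT o.order_id, o.order_total, c.region
    FROM recent_orders AS o          -- larger input listed first
    JOIN active_customers AS c
      ON o.customer_id = c.customer_id
"""

job = client.query(query)
job.result()
print(f"Bytes billed: {job.total_bytes_billed:,}")
```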
Mistake 8: No Query Cost Caps or Spending Quotas
BigQuery on-demand pricing has no default spending cap. A single misconfigured scheduled query that joins two large, unfiltered tables can scan hundreds of terabytes before anyone notices. Documented production incidents have traced $2,000 in charges to a single rogue query, and environments without any monitoring in place have fared far worse.
Setting custom quotas at the project and user level is a straightforward control that most teams skip because it requires a small amount of upfront configuration.
Controls worth implementing immediately:
- Set a maximum number of bytes billed per query at the project level to prevent runaway scans (a per-query sketch follows below)
- Apply user-level quotas to limit daily data processed by individual analysts
- Create a dedicated sandbox project with stricter quotas for exploratory analysis, separate from production
- Enable budget alerts in GCP Billing to notify the team before spend crosses a defined threshold
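The first control has a per-query counterpart in the client library: setting maximum_bytes_billed makes BigQuery fail a query up front rather than scan past the limit. A minimal sketch, with an arbitrary 100 GB cap and a hypothetical table:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Any query run with this config fails before scanning if it would bill
# more than 100 GB, instead of quietly reading a huge table.
job_config = bigquery.QueryJobConfig(maximum_bytes_billed=100 * 1024**3)

job = client.query(
    "SELECT user_id, event_type FROM `my-project.analytics.events`",
    job_config=job_config,
)
job.result()  # raises an error instead of running if the cap would be exceeded
```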
For teams looking to build more structured GCP cost control practices, starting with query quotas is one of the highest-impact, lowest-effort first steps.
Mistake 9: Ignoring Long-Term Storage Discounts on Inactive Data
BigQuery automatically applies a 50% storage discount to tables and partitions that have not been modified for 90 consecutive days, bringing the cost from $0.02 per GB per month down to $0.01 per GB per month. Many teams are sitting on this discount without realising it, because they have never audited which datasets are actually inactive.
The missed opportunity goes in both directions. Teams that are unaware of the 90-day threshold sometimes run unnecessary updates or table modifications to keep them "fresh," inadvertently resetting the clock on the long-term discount without any operational reason to do so.
Storage cost practices worth building:
- Run a regular audit of dataset modification timestamps to identify tables already qualifying for long-term storage rates (a query sketch follows the list)
- Archive or delete datasets from completed projects rather than leaving them in active storage indefinitely
- Avoid touching tables that have reached or are approaching the 90-day threshold unless there is a genuine operational reason to do so
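A sketch of that audit against the per-dataset __TABLES__ metadata (the dataset name is hypothetical): it lists large tables that have gone untouched for 90 days and should therefore already be billing at the long-term rate.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Tables untouched for 90+ days should already be on the long-term rate;
# leave them alone unless there is a real operational reason to modify them.
job = client.query("""
    SELECT
      table_id,
      TIMESTAMP_MILLIS(last_modified_time) AS last_modified,
      ROUND(size_bytes / POW(1024, 3), 2) AS size_gb
    FROM `my-project.analytics.__TABLES__`
    WHERE TIMESTAMP_MILLIS(last_modified_time)
          < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
    ORDER BY size_bytes DESC
""")

for row in job.result():
    print(row.table_id, row.last_modified, row.size_gb)
```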
Mistake 10: Running High-Cost Queries Without a Dry Run First
BigQuery's dry run feature lets you submit a query to estimate how much data it will scan without actually running it or incurring a charge. It takes seconds to run and surfaces the exact byte estimate that determines the cost of the actual execution.
Despite being a native, free feature, dry runs are rarely part of the standard engineering workflow. Teams discover expensive queries after they have run, not before.
How to make dry runs part of standard practice:
- Use the --dry_run flag in the bq command-line tool before executing any large ad hoc query
- In the BigQuery console, check the query validator in the top-right corner before running, which shows the estimated bytes scanned
- Build dry run checks into data pipeline code review processes so expensive queries are caught before they reach production (a client-library sketch follows this list)
- Use GCP cost monitoring tooling to surface high-cost query patterns across the team over time, not just at the point of individual query execution
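A dry run in the Python client is the kind of check a pipeline or review step can wrap; a minimal sketch with a hypothetical table, using the on-demand rate quoted above:

```python
from google.cloud import bigquery

client = bigquery.Client()

# dry_run estimates the scan without executing or billing anything; disabling
# the cache keeps the estimate honest about what a fresh run would read.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

job = client.query(
    "SELECT user_id, event_type FROM `my-project.analytics.events`",
    job_config=job_config,
)

tb_scanned = job.total_bytes_processed / 1024**4
print(f"Estimated scan: {tb_scanned:.3f} TB (~${tb_scanned * 6.25:.2f} on-demand)")
```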
To Sum Up
BigQuery cost overruns are seldom the result of a single catastrophic decision; instead, they accumulate through a series of small habits that made sense when tables were small and query volumes were low, and that became expensive as the environment scaled. The ten mistakes above are the most consistently cited across production engineering teams, and each has a fix that does not require rearchitecting your data platform.
Start with the highest-leverage changes: eliminate SELECT *, partition your tables correctly, set query cost caps, and audit your ingestion method. Those four changes alone account for the majority of addressable BigQuery waste in most environments. For teams managing GCP spend at scale, CloudKeeper's Lens for GCP surfaces BigQuery cost patterns alongside the rest of your GCP footprint, giving engineers and FinOps teams the visibility to act before bills arrive.

