BigQuery Interview Questions

BigQuery interview questions for data engineer roles at GCP- native companies (Google, Spotify, Snap, Etsy, Twitter partial). 50+ questions covering BigQuery internals (slot- based pricing, partitioning, clustering, materialized views, BI Engine), SQL dialect specifics (ARRAY_AGG and UNNEST, STRUCT manipulation, JSON functions), cost optimization, and system design with BigQuery as the warehouse. Pair with the data engineer interview prep guide and the the GCP data engineer interview guide.

BigQuery Topics That Come Up in Interviews

Topics roughly ordered by how often they appear in GCP data engineer loops.

Topic	Frequency	Depth Expected
Partitioning (date, integer range)	Very common	Strategy choice and pruning behavior
Clustering (up to 4 columns)	Very common	Order matters; aligns with WHERE / JOIN patterns
Slot-based vs on-demand pricing	Common	When each is right, cost projection
ARRAY_AGG and UNNEST	Common	One-to-many flattening, nested aggregation
STRUCT and nested fields	Common	Schema design with nested types vs flat tables
JSON functions	Common	JSON_EXTRACT, JSON_VALUE, JSON_QUERY trade-offs
Materialized views	Common	Auto-refresh, supported SQL subset, when to use
BI Engine for dashboards	Occasional	Sub-second query layer for production Looker
APPROX_COUNT_DISTINCT (HLL++)	Occasional	When approximate is acceptable
Window functions	Common	QUALIFY clause, frame clauses, dialect-specific behavior
Time travel	Occasional	AT (TIMESTAMP), AT (DATETIME) for historical queries
Authorized views	Occasional	Cross-dataset access patterns with row-level security
Streaming inserts	Occasional	Streaming buffer, exactly-once via insertId
Storage Write API	Occasional	Newer, exactly-once batch writes

BigQuery Internals: Pricing and Performance

BigQuery's pricing model is the central interview topic. Slot-based: pay per slot per second of dedicated compute capacity. Predictable cost. On-demand: pay per byte scanned (currently $5 per TB). Variable cost, potentially much higher or lower than slot-based depending on query volume and shape. Most production workloads use slots above some volume threshold; ad-hoc and small workloads use on-demand.

Slot capacity sizing is the senior-level question. A 100-slot reservation can run roughly 100 medium queries concurrently. Most teams under-provision and hit queueing; some over-provision and waste money. The answer is to look at peak slot usage in INFORMATION_SCHEMA.JOBS and size to handle 95th percentile load.

Partitioning and clustering interact with pricing directly. A partitioned and clustered table that prunes effectively scans a small fraction of total bytes; the same query without partition pruning scans the whole table and costs orders of magnitude more. The interview prompt "why does this query cost $50 instead of $5" usually has the answer "because the WHERE clause didn't hit the partition column."

Eight BigQuery Interview Questions With Worked Answers

Why does this query cost $50? Find and fix the issue.

Common diagnostic. SELECT * scans every column; filtering on a non-partition column scans every partition. The fix: project only needed columns (column pruning saves 10-90% depending on schema), filter on the partition column (partition pruning saves the proportional amount of skipped partitions), ensure clustering aligns with frequently-filtered columns.

-- Bad: scans 10TB
SELECT *
FROM `project.dataset.events`
WHERE event_type = 'purchase'
  AND user_id = '12345';

-- Good: scans ~100GB
SELECT user_id, event_ts, amount_usd
FROM `project.dataset.events`
WHERE _PARTITIONDATE >= '2026-01-01'
  AND _PARTITIONDATE < '2026-02-01'
  AND event_type = 'purchase'
  AND user_id = '12345';
-- Required: table partitioned by _PARTITIONDATE
-- and clustered by (event_type, user_id)

Use ARRAY_AGG to compute user event sequences

ARRAY_AGG with ORDER BY produces an ordered array of values per group. Common pattern for user-event sequence analysis. Combined with STRUCT for multi-field rows.

SELECT
  user_id,
  ARRAY_AGG(
    STRUCT(event_type, event_ts, page_url)
    ORDER BY event_ts
  ) AS event_sequence
FROM events
WHERE _PARTITIONDATE >= CURRENT_DATE() - 7
GROUP BY user_id;

-- Downstream UNNEST to flatten back to rows when needed:
SELECT user_id, e.event_type, e.event_ts
FROM user_events_with_sequence,
UNNEST(event_sequence) AS e
WHERE e.event_type = 'purchase';

Compute approximate distinct users at billion-row scale

COUNT(DISTINCT) at billion-row scale is expensive because BigQuery materializes the distinct set. Use APPROX_COUNT_DISTINCT (HyperLogLog++) for estimates within ~2% of true count at constant memory. For rolling distinct counts, store HLL_SKETCH per bucket and merge.

-- Approximate count distinct (~2% error, constant memory)
SELECT
  date_trunc(event_ts, DAY) AS day,
  APPROX_COUNT_DISTINCT(user_id) AS approx_dau
FROM events
WHERE _PARTITIONDATE >= '2026-01-01'
GROUP BY day;

-- Mergeable HLL sketches for rolling unique counts
WITH daily_sketches AS (
  SELECT
    date_trunc(event_ts, DAY) AS day,
    HLL_COUNT.INIT(user_id) AS sketch
  FROM events
  WHERE _PARTITIONDATE >= '2026-01-01'
  GROUP BY day
)
SELECT
  day,
  HLL_COUNT.MERGE(sketch) OVER (
    ORDER BY day
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS rolling_7d_unique_users
FROM daily_sketches
ORDER BY day;

Use QUALIFY to filter window function results

QUALIFY is BigQuery’s shortcut for filtering on window function output without a CTE wrapper. Cleaner than the SELECT FROM (SELECT ... ROW_NUMBER ...) WHERE rn = 1 pattern. Standard in BigQuery and Snowflake.

-- Top 1 event per user
SELECT user_id, event_ts, event_type
FROM events
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY user_id
  ORDER BY event_ts DESC
) = 1;

-- Equivalent without QUALIFY (more verbose):
WITH ranked AS (
  SELECT
    user_id, event_ts, event_type,
    ROW_NUMBER() OVER (
      PARTITION BY user_id
      ORDER BY event_ts DESC
    ) AS rn
  FROM events
)
SELECT user_id, event_ts, event_type
FROM ranked
WHERE rn = 1;

When would you use a materialized view vs a scheduled query?

Materialized views: BigQuery auto-refreshes when source data changes; queries automatically rewrite to use the view when applicable. Best for high-frequency aggregates with low-frequency source changes. Limitations: only support a subset of SQL, can’t use window functions (until 2025), can’t join across datasets without authorized views. Scheduled queries: run on a cron, write to a regular table, full SQL support, you control refresh frequency. Best for complex transformations or when MV restrictions block the use case.

Design a clustering strategy for a 5TB clickstream table

Cluster on the columns most-filtered in queries. Order matters: clustering keys are sorted in the declared order, so the first key has the strongest pruning effect. Common pattern: cluster by (event_type, user_id) when event_type filter is common and user_id is the secondary filter. Don’t cluster on too many columns (BigQuery limit is 4 but diminishing returns past 2 in most cases).

Design a streaming insert pipeline with exactly-once semantics

Producer sets insertId per record. BigQuery deduplicates within a 1-minute window using insertId for streaming inserts. For exactly-once at scale, use the BigQuery Storage Write API with default streams in committed mode, which provides true exactly-once semantics across the entire pipeline. Discuss why streaming inserts have a per-row cost ($0.05 per GB) on top of storage cost; for high-volume use cases, switch to Storage Write API or batch loading.

Design a multi-tenant BigQuery deployment with cost attribution

Per-tenant project for isolation: each tenant gets its own project, shared dataset via authorized views. Cost attribution via labels on jobs and datasets. Slot reservations per tenant tier (Tier 1 customers get dedicated slots; Tier 2 share a flex pool). Row-level security via BigQuery row access policies for fine-grained isolation. Discuss the trade-off: per-project isolation is most secure but adds operational overhead; shared dataset with row-level security is simpler but requires careful policy management.

BigQuery vs Other Warehouses

BigQuery is one of three major cloud warehouses. The choice depends on cost model preference, operational tolerance, and feature requirements.

Dimension	BigQuery	Snowflake	Redshift
Pricing model	Slot or on-demand per byte	Credit-based per warehouse second	Cluster-based per node hour
Compute-storage separation	Yes (always)	Yes	Yes (RA3) or No (DC2)
Concurrency model	Slot-based queueing	Auto-scaled warehouses	Concurrency scaling clusters
Native streaming inserts	Yes (Storage Write API)	Yes (Snowpipe)	Yes (streaming ingestion)
Time travel	7 days default	1 to 90 days configurable	No native, via snapshots
ML in warehouse	BigQuery ML (mature)	Snowpark ML	Redshift ML (limited)
Geo-replication	Multi-region (slow)	Replicated databases	Cross-region snapshots
Best fit	GCP shops, ad-hoc heavy	Multi-cloud, predictable load	AWS-native, sustained load

How BigQuery Connects to the Rest of the Cluster

BigQuery is the warehouse component in any the GCP data engineer interview guide stack. The broader the SQL interview questions hub hub covers SQL fluency at the dialect-portable level; this guide is the BigQuery-specific layer on top. For the round-level framework, see the SQL round prep guide.

For warehouse comparison, see Snowflake vs Databricks interview comparison (the most-tested decision in 2024-2026 lakehouse interviews) and the company guides for the Snowflake data engineer interview guide and the Databricks data engineer interview guide.

Prepare for the interview

01 / Open invite

02min.

Know BigQuery the way the interviewer who asks it knows it.

a BigQuery query, the same shape a screen would give you.

The diff against expected. Where ties broke. What you missed.

sandbox

1SELECT user_id,

2 COUNT(*) AS sessions

3FROM events

4WHERE ts >= NOW() - INTERVAL '7 day'

Execute your solution0.4s avg.

TikTokInterview question

Solve a BigQuery problem

Data engineer interview prep FAQ

How important is BigQuery knowledge for non-GCP roles?+

Less critical than for GCP-native roles, but increasingly important as a ‘modern warehouse’ literacy signal. Knowing BigQuery’s columnar storage and pricing model demonstrates fluency that transfers to Snowflake and Redshift conceptually.

What’s the difference between standard SQL and legacy SQL in BigQuery?+

Legacy SQL was BigQuery’s original dialect; standard SQL (ANSI-compliant) replaced it as the default in 2017. All new code should use standard SQL. Legacy SQL appears occasionally in old systems but is no longer the interview default.

Should I learn BigQuery ML?+

Helpful but not required for most data engineer roles. BigQuery ML lets you train models in SQL; it’s useful for analytics-engineer-leaning roles and for prototyping. For production ML, most teams move to Vertex AI or external ML platforms.

How is BigQuery pricing different from Snowflake?+

BigQuery: slot-based (predictable) or on-demand (per byte scanned). Snowflake: per warehouse second of runtime. BigQuery rewards efficient queries (less scan = less cost); Snowflake rewards short queries (less runtime = less cost). Different optimization mindsets.

Are BigQuery materialized views production-ready?+

Yes for the supported SQL subset. The 2024-2025 updates added significant capabilities (more SQL functions, partition-aligned MVs). For complex aggregations with window functions, scheduled queries to regular tables remain the more common pattern.

How important is the BigQuery Storage Write API?+

Increasingly important at L5+. Storage Write API replaces legacy streaming inserts with better cost, performance, and exactly-once semantics. Senior candidates know when to use it (high-volume streaming) vs streaming inserts (low-volume) vs batch loading (highest volume).

Is BigQuery Slots the right capacity model for my company?+

Slots are right when you have predictable, high query volume (hundreds of queries per hour) and want cost predictability. On-demand is right for ad-hoc and low-volume workloads. Many teams use both: slots for production pipelines, on-demand for analyst exploration.

What’s BI Engine and when do I need it?+

BI Engine is BigQuery’s in-memory cache for sub-second dashboard queries. You need it when: serving production Looker or Looker Studio dashboards, when query latency matters more than cost, when underlying tables are 10GB to 100GB. For larger tables, materialized views into BI Engine are the pattern.

02 / Why practice

Practice BigQuery SQL in the Browser

01
Active recall beats re-reading by 50%
Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom
02
76% of hiring managers reject on the coding task, not the resume
From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice
03
Five problem shapes cover 80% of data engineer loops
Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition

Start with a problem

More data engineer interview prep guides

the SQL interview questions hub→

The full SQL interview problem set, indexed by topic, difficulty, and company.

Redshift interview questions→

Redshift sort keys, dist keys, compression, and RA3 architecture interview prep.

Postgres interview questions→

Postgres MVCC, indexing, partitioning, and replication interview prep.

Flink interview questions→

Apache Flink stateful streaming, watermarks, exactly-once, checkpointing interview prep.

Hadoop interview questions→

Hadoop ecosystem (HDFS, MapReduce, YARN, Hive) interview prep, including modern relevance.

AWS Glue interview questions→

AWS Glue ETL jobs, crawlers, Data Catalog, and PySpark-on-Glue interview prep.