Tech-Specific Question Hub

BigQuery Interview Questions

BigQuery interview questions for data engineer roles at GCP-native companies (Google, Spotify, Snap, Etsy, Twitter partial). 50+ questions covering BigQuery internals (slot-based pricing, partitioning, clustering, materialized views, BI Engine), SQL dialect specifics (ARRAY_AGG and UNNEST, STRUCT manipulation, JSON functions), cost optimization, and system design with BigQuery as the warehouse. Pair with the data engineer interview prep guide and the GCP data engineer interview guide.

The Short Answer
BigQuery questions appear in 100% of GCP data engineer loops and 30% of cross-cloud data engineer loops where BigQuery knowledge signals modern lakehouse fluency. The depth ranges from L3 syntax (writing standard SQL in BigQuery dialect) to L6 architecture (designing multi-tenant BigQuery deployments with cost attribution). Strong candidates know slot-based pricing intuitively, can predict query cost from query shape, and design tables with partitioning and clustering aligned to query patterns.
Updated April 2026 · By The DataDriven Team

BigQuery Topic Frequency in Interviews

From 78 reported GCP DE loops in 2024-2026.

Topic | Test Frequency | Depth Expected
Partitioning (date, integer range) | 94% | Strategy choice and pruning behavior
Clustering (up to 4 columns) | 87% | Order matters; aligns with WHERE / JOIN patterns
Slot-based vs on-demand pricing | 78% | When each is right, cost projection
Window functions | 76% | QUALIFY clause, frame clauses, dialect-specific behavior
ARRAY_AGG and UNNEST | 73% | One-to-many flattening, nested aggregation
STRUCT and nested fields | 65% | Schema design with nested types vs flat tables
JSON functions | 58% | JSON_EXTRACT, JSON_VALUE, JSON_QUERY trade-offs
Materialized views | 52% | Auto-refresh, supported SQL subset, when to use
BI Engine for dashboards | 47% | Sub-second query layer for production Looker
Streaming inserts | 47% | Streaming buffer, best-effort dedup via insertId
APPROX_COUNT_DISTINCT (HLL++) | 44% | When approximate is acceptable
Authorized views | 38% | Cross-dataset access patterns with row-level security
Storage Write API | 36% | Newer write path; exactly-once streaming writes
Time travel | 32% | FOR SYSTEM_TIME AS OF for historical queries

BigQuery Internals: Pricing and Performance

BigQuery's pricing model is the central interview topic. Slot-based (capacity) pricing: pay per slot per second of dedicated compute capacity, so cost is predictable. On-demand pricing: pay per byte scanned (currently $6.25 per TiB), so cost is variable and can be much higher or lower than slot-based depending on query volume and shape. Most production workloads move to slots above some volume threshold; ad-hoc and small workloads stay on-demand.
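A quick way to ground the slot-vs-on-demand decision in data is to price last month's scan volume from the job history. A sketch, assuming the `region-us` job-history view and an illustrative $6.25/TiB on-demand rate:

```sql
-- Estimate last 30 days of on-demand spend from job history,
-- to compare against a flat slot reservation quote.
-- Assumes jobs run in the US multi-region; adjust the region qualifier.
SELECT
  SUM(total_bytes_billed) / POW(1024, 4)        AS tib_billed,
  SUM(total_bytes_billed) / POW(1024, 4) * 6.25 AS approx_on_demand_usd
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND job_type = 'QUERY';
```

If the projected on-demand figure consistently exceeds the cost of a reservation sized for your concurrency, slots win; otherwise stay on-demand.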

Slot capacity sizing is the senior signal. A 100-slot reservation can run roughly 100 medium queries concurrently. Most teams under-provision and run into queueing; some over-provision and waste money. The analytical answer is to look at peak slot usage in INFORMATION_SCHEMA.JOBS and size to handle 95th percentile load.
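The p95 sizing analysis can be sketched directly against the jobs view. This is an approximation, not an official sizing formula: it treats each job's average slot usage as `total_slot_ms` spread evenly over its runtime.

```sql
-- Approximate per-job average slot usage, then take the 95th percentile.
-- Average slots per job ≈ total_slot_ms / job duration in ms.
SELECT
  APPROX_QUANTILES(
    total_slot_ms / NULLIF(TIMESTAMP_DIFF(end_time, start_time, MILLISECOND), 0),
    100
  )[OFFSET(95)] AS p95_avg_slots_per_job
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
  AND state = 'DONE';
```

Multiply the per-job figure by expected peak concurrency to get a starting reservation size, then validate against observed queueing.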

Partitioning and clustering interact with pricing directly. A partitioned + clustered table that prunes effectively scans 1-10% of total bytes; the same query without partition pruning scans 100% and costs 10-100x more. The interview prompt "why does this query cost $50 instead of $5" usually has the answer "because the WHERE clause didn't hit the partition column."

Eight Real BigQuery Interview Questions With Worked Answers

L4 · Why does this query cost $50? Find and fix the issue.

Common diagnostic. SELECT * scans every column; filtering on a non-partition column scans every partition. The fix: project only needed columns (column pruning saves 10-90% depending on schema), filter on the partition column (partition pruning saves the proportional amount of skipped partitions), ensure clustering aligns with frequently-filtered columns.
-- Bad: scans 10TB
SELECT *
FROM `project.dataset.events`
WHERE event_type = 'purchase'
  AND user_id = '12345';

-- Good: scans ~100GB
SELECT user_id, event_ts, amount_usd
FROM `project.dataset.events`
WHERE _PARTITIONDATE >= '2026-01-01'
  AND _PARTITIONDATE < '2026-02-01'
  AND event_type = 'purchase'
  AND user_id = '12345';
-- Requires: ingestion-time partitioned table (pseudo-column _PARTITIONDATE)
-- clustered by (event_type, user_id)
L4 · Use ARRAY_AGG to compute user event sequences

ARRAY_AGG with ORDER BY produces an ordered array of values per group. Common pattern for user-event sequence analysis. Combined with STRUCT for multi-field rows.
SELECT
  user_id,
  ARRAY_AGG(
    STRUCT(event_type, event_ts, page_url)
    ORDER BY event_ts
  ) AS event_sequence
FROM events
WHERE _PARTITIONDATE >= CURRENT_DATE() - 7
GROUP BY user_id;

-- Downstream UNNEST to flatten back to rows when needed
-- (user_events_with_sequence = the query above saved as a table):
SELECT user_id, e.event_type, e.event_ts
FROM user_events_with_sequence,
UNNEST(event_sequence) AS e
WHERE e.event_type = 'purchase';
L5 · Compute approximate distinct users at billion-row scale

COUNT(DISTINCT) at billion-row scale is expensive because BigQuery materializes the distinct set. Use APPROX_COUNT_DISTINCT (HyperLogLog++) for estimates within ~2% of true count at constant memory. For rolling distinct counts, store HLL_SKETCH per bucket and merge.
-- Approximate count distinct (~2% error, constant memory)
SELECT
  DATE(event_ts) AS day,  -- event_ts is a TIMESTAMP; DATE() buckets it by day
  APPROX_COUNT_DISTINCT(user_id) AS approx_dau
FROM events
WHERE _PARTITIONDATE >= '2026-01-01'
GROUP BY day;

-- Mergeable HLL sketches for rolling 7-day unique counts.
-- Note: HLL_COUNT.MERGE is aggregate-only (no OVER clause),
-- so the rolling window is expressed as a self-join.
WITH daily_sketches AS (
  SELECT
    DATE(event_ts) AS day,
    HLL_COUNT.INIT(user_id) AS sketch
  FROM events
  WHERE _PARTITIONDATE >= '2026-01-01'
  GROUP BY day
)
SELECT
  a.day,
  HLL_COUNT.MERGE(b.sketch) AS rolling_7d_unique_users
FROM daily_sketches a
JOIN daily_sketches b
  ON b.day BETWEEN a.day - 6 AND a.day
GROUP BY a.day
ORDER BY a.day;
L5 · Use QUALIFY to filter window function results

QUALIFY is BigQuery's shortcut for filtering on window function output without a CTE wrapper. Cleaner than the SELECT FROM (SELECT ... ROW_NUMBER ...) WHERE rn = 1 pattern. Standard in BigQuery and Snowflake.
-- Top 1 event per user
SELECT user_id, event_ts, event_type
FROM events
WHERE TRUE  -- BigQuery requires WHERE, GROUP BY, or HAVING alongside QUALIFY
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY user_id
  ORDER BY event_ts DESC
) = 1;

-- Equivalent without QUALIFY (more verbose):
WITH ranked AS (
  SELECT
    user_id, event_ts, event_type,
    ROW_NUMBER() OVER (
      PARTITION BY user_id
      ORDER BY event_ts DESC
    ) AS rn
  FROM events
)
SELECT user_id, event_ts, event_type
FROM ranked
WHERE rn = 1;
L5 · When would you use a materialized view vs a scheduled query?

Materialized views: BigQuery refreshes them automatically when source data changes, and the optimizer rewrites eligible queries to read from the view. Best for high-frequency reads over aggregates whose sources change infrequently. Limitations: only a subset of SQL is supported (window functions were unsupported until 2025), and cross-dataset joins require authorized views. Scheduled queries: run on a cron, write to a regular table, support full SQL, and you control refresh frequency. Best for complex transformations or when MV restrictions block the use case.
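A minimal materialized-view sketch for the daily-aggregate case, assuming a hypothetical `project.dataset.events` table with `event_ts`, `event_type`, and `amount_usd` columns; note the body stays within the MV-supported subset (simple aggregates, GROUP BY, no window functions):

```sql
-- Daily purchase rollup; BigQuery keeps it incrementally refreshed
-- and rewrites matching queries to read from it automatically.
CREATE MATERIALIZED VIEW `project.dataset.daily_purchases_mv` AS
SELECT
  DATE(event_ts)  AS day,
  COUNT(*)        AS purchases,
  SUM(amount_usd) AS revenue
FROM `project.dataset.events`
WHERE event_type = 'purchase'
GROUP BY day;
```

If the rollup needed a window function or a multi-step transformation, this is exactly where you would fall back to a scheduled query writing a regular table.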
L5 · Design a clustering strategy for a 5TB clickstream table

Cluster on the columns most frequently filtered in queries. Order matters: data is sorted by the clustering keys in declared order, so the first key has the strongest pruning effect. Common pattern: cluster by (event_type, user_id) when an event_type filter is common and user_id is the secondary filter. Don't cluster on too many columns: BigQuery allows up to four, but returns diminish past two in most cases.
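The strategy above can be stated as DDL. A sketch with hypothetical table and column names, combining date partitioning with the (event_type, user_id) clustering order discussed:

```sql
-- Partition by day for time-range pruning; cluster by the two
-- most-filtered columns, highest-selectivity filter first.
CREATE TABLE `project.dataset.clickstream` (
  event_ts   TIMESTAMP NOT NULL,
  event_type STRING,
  user_id    STRING,
  page_url   STRING,
  amount_usd NUMERIC
)
PARTITION BY DATE(event_ts)
CLUSTER BY event_type, user_id
OPTIONS (partition_expiration_days = 365);  -- illustrative retention choice
```

A query filtering on a date range plus event_type then prunes both partitions and clustered blocks, which is the behavior the cost questions above probe for.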
L5 · Design a streaming insert pipeline with exactly-once semantics

Producer sets insertId per record; BigQuery uses it to deduplicate streaming inserts within a roughly one-minute window, but this is best-effort, not guaranteed exactly-once. For true exactly-once at scale, use the BigQuery Storage Write API with application-created streams and stream offsets (the default stream provides at-least-once semantics only). Discuss why legacy streaming inserts carry a per-row ingestion cost ($0.05 per GB) on top of storage; for high-volume use cases, switch to the Storage Write API or batch loading.
L6 · Design a multi-tenant BigQuery deployment with cost attribution

Per-tenant project for isolation: each tenant gets its own project, shared dataset via authorized views. Cost attribution via labels on jobs and datasets. Slot reservations per tenant tier (Tier 1 customers get dedicated slots; Tier 2 share a flex pool). Row-level security via BigQuery row access policies for fine-grained isolation. Discuss the trade-off: per-project isolation is most secure but adds operational overhead; shared dataset with row-level security is simpler but requires careful policy management.
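The row-level-security half of that design can be sketched with BigQuery row access policy DDL. Table, policy, column, and group names here are hypothetical:

```sql
-- Each tenant's analysts see only their own rows in the shared table.
-- One policy per tenant; rows matching no policy are invisible.
CREATE ROW ACCESS POLICY tenant_a_filter
ON `shared-project.analytics.events`
GRANT TO ('group:tenant-a-analysts@example.com')
FILTER USING (tenant_id = 'tenant_a');
```

The trade-off discussion then writes itself: one policy per tenant is simple at ten tenants and an operational liability at a thousand, which is when per-project isolation starts to win.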

BigQuery vs Other Warehouses

BigQuery is one of three major cloud warehouses. The choice depends on cost model preference, operational tolerance, and feature requirements.

Dimension | BigQuery | Snowflake | Redshift
Pricing model | Slot or on-demand per byte | Credit-based per warehouse second | Cluster-based per node hour
Compute-storage separation | Yes (always) | Yes | Yes (RA3) or No (DC2)
Concurrency model | Slot-based queueing | Auto-scaled warehouses | Concurrency scaling clusters
Native streaming inserts | Yes (Storage Write API) | Yes (Snowpipe) | Yes (streaming ingestion)
Time travel | 7 days default | 1 to 90 days configurable | No native; via snapshots
ML in warehouse | BigQuery ML (mature) | Snowpark ML | Redshift ML (limited)
Geo-replication | Multi-region (slow) | Replicated databases | Cross-region snapshots
Best fit | GCP shops, ad-hoc heavy | Multi-cloud, predictable load | AWS-native, sustained load

How BigQuery Connects to the Rest of the Cluster

BigQuery is the warehouse component in any GCP data engineer interview stack. The broader SQL interview questions hub covers SQL fluency at the dialect-portable level; this guide is the BigQuery-specific layer on top. For the round-level framework, see the SQL round prep guide.

For warehouse comparison, see the Snowflake vs Databricks interview comparison (the most-tested decision in 2024-2026 lakehouse interviews) and the company-specific Snowflake and Databricks data engineer interview guides.

Data Engineer Interview Prep FAQ

How important is BigQuery knowledge for non-GCP roles?
Less critical than for GCP-native roles, but increasingly important as a 'modern warehouse' literacy signal. Knowing BigQuery's columnar storage and pricing model demonstrates fluency that transfers to Snowflake and Redshift conceptually.
What's the difference between standard SQL and legacy SQL in BigQuery?
Legacy SQL was BigQuery's original dialect; standard SQL (ANSI-compliant) replaced it as the default in 2017. All new code should use standard SQL. Legacy SQL appears occasionally in old systems but is no longer the interview default.
Should I learn BigQuery ML?
Helpful but not required for most data engineer roles. BigQuery ML lets you train models in SQL; it's useful for analytics-engineer-leaning roles and for prototyping. For production ML, most teams move to Vertex AI or external ML platforms.
How is BigQuery pricing different from Snowflake?
BigQuery: slot-based (predictable) or on-demand (per byte scanned). Snowflake: per warehouse second of runtime. BigQuery rewards efficient queries (less scan = less cost); Snowflake rewards short queries (less runtime = less cost). Different optimization mindsets.
Are BigQuery materialized views production-ready?
Yes for the supported SQL subset. The 2024-2025 updates added significant capabilities (more SQL functions, partition-aligned MVs). For complex aggregations with window functions, scheduled queries to regular tables remain the more common pattern.
How important is the BigQuery Storage Write API?
Increasingly important at L5+. Storage Write API replaces legacy streaming inserts with better cost, performance, and exactly-once semantics. Senior candidates know when to use it (high-volume streaming) vs streaming inserts (low-volume) vs batch loading (highest volume).
Are BigQuery slots the right capacity model for my company?
Slots are right when you have predictable, high query volume (hundreds of queries per hour) and want cost predictability. On-demand is right for ad-hoc and low-volume workloads. Many teams use both: slots for production pipelines, on-demand for analyst exploration.
What's BI Engine and when do I need it?
BI Engine is BigQuery's in-memory cache for sub-second dashboard queries. You need it when: serving production Looker or Looker Studio dashboards, when query latency matters more than cost, when underlying tables are 10GB to 100GB. For larger tables, materialized views into BI Engine are the pattern.

Practice BigQuery SQL in the Browser

Work through real BigQuery-flavored SQL problems against realistic schemas. Build the BigQuery-specific fluency that wins GCP data engineer loops.

Start BigQuery Practice

More Data Engineer Interview Prep Guides

Continue your prep

Data Engineer Interview Prep: explore the full guide

50+ guides covering every round, company, role, and technology in the data engineer interview loop. Grounded in 2,817 verified interview reports across 929 companies, collected from real candidates.
