Google Data Engineer Interview Questions
Google Data Engineer Interview Questions
Google-tagged data engineer interview questions with live grading.
Google data engineer interview questions tagged based on reported interview shape. BigQuery-flavored SQL with ARRAY and STRUCT and QUALIFY. Large-scale Dataflow pipeline design. Algorithm-adjacent Python with explicit complexity reasoning that distinguishes Google from most data engineer loops. GCP-native architectures with cost reasoning in slot consumption.
Google's data engineer interview loop is 5-6 rounds with the GCP data stack as the assumed default. BigQuery is the warehouse (columnar, separated storage and compute, no DISTKEY/SORTKEY because partitioning and clustering replace them, ARRAY and STRUCT for semi-structured data). Dataflow runs Apache Beam pipelines (unified batch and streaming, the Google-canonical stream-processing answer). Pub/Sub is event ingestion (at-least-once with deduplication via message_id). Dataproc is Spark on GCP for migration scenarios. Cloud Storage with BigLake provides data lake with BigQuery query federation.
The Google data engineer SQL bar is BigQuery-flavored. Window functions, CTEs, and aggregation are standard. BigQuery-specific syntax that comes up: QUALIFY (native in BigQuery, filters window results without a wrapping CTE), ARRAY_AGG and UNNEST for semi-structured data, STRUCT for nested records, date partitioning (PARTITION BY DATE(timestamp)), clustering (CLUSTER BY user_id for high-cardinality lookup). Practice in Postgres here is portable for ~85 percent of patterns; BigQuery-specific syntax is tagged on the problems where it applies.
What makes Google distinct from other data engineer loops is the algorithm-adjacent thread in the Python round. Google's overall engineering culture comes through in the bar. Candidates report standard pipeline questions (parse, dedup, sessionize) but also Big-O-aware questions: implement a sliding-window aggregator in O(n) with a deque, explain the time complexity of your dedup approach, identify when a generator-based stream beats a list-based accumulator, defend why your data structure is O(1) lookup versus O(n) scan. It is not LeetCode-hard, but it is harder on complexity reasoning than Amazon or Stripe. Prepare to articulate Big-O for every data structure choice.
The Google data engineer design round expects a GCP-native architecture. For a streaming clickstream: Pub/Sub for ingest (with shard-equivalent sizing in subscriber count), Dataflow streaming job for windowed aggregation and dedup, BigQuery for serving plus Cloud Storage for raw archive. For a batch warehouse: Cloud Storage to Dataproc Spark for heavy joins to BigQuery for serving. For ML feature store: Dataflow streaming to Bigtable for online plus BigQuery for offline. The design rubric weights the streaming-versus-batch decision (when does Dataflow streaming make sense versus batch loading to BigQuery every 15 minutes), exactly-once semantics in Pub/Sub plus Dataflow (the at-least-once plus dedup pattern), and cost reasoning at scale (BigQuery slot consumption, Dataflow worker hours, Cloud Storage storage class trade-offs).
Google levels its data engineers L3 (entry, rare for DE), L4 (mid), L5 (senior, most common hire for experienced data engineers), L6 (staff), L7 (senior staff). L5 typically targets 5+ years experience. Rubric depth scales: L5 expects trade-off articulation and ownership of pipelines, L6 expects org-level design influence.
The behavioral round at Google is "Googleyness and Leadership", less rigid than Amazon's LP framing but with consistent themes: ownership, collaboration, ambiguity tolerance, and growth mindset. STAR format works. Specific numbers matter. The interviewer is also assessing communication clarity and the ability to make a point concisely; Google interviewers are often impatient with rambling answers.
- What SQL dialect does Google use in data engineer interviews?
- BigQuery Standard SQL. Window functions, CTEs, and aggregation are standard. BigQuery-specific syntax to know: QUALIFY (native, filters window results without a wrapping CTE), ARRAY_AGG and UNNEST (for semi-structured data), STRUCT (for nested records), date partitioning (PARTITION BY DATE(timestamp)), clustering (CLUSTER BY user_id for high-cardinality lookup). Practice in Postgres ports for ~85 percent of patterns.
- Why is the Python round 'algorithm-adjacent' at Google?
- Google's broader engineering culture weights Big-O reasoning more than most companies. Data engineer Python rounds at Google include standard pipeline questions (parse, dedup, sessionize) plus complexity-aware questions: implement a sliding-window aggregator in O(n) with a deque, explain why your dedup is O(n) versus the sort-then-iterate O(n log n) alternative, identify when a generator beats a list. Not LeetCode hard, but candidates report being asked complexity for every data structure choice.
- What is Dataflow and why does Google ask about it?
- Dataflow is Google's managed Apache Beam service: unified batch and streaming programming model, the GCP-canonical answer for stream processing. In design rounds, Pub/Sub to Dataflow streaming to BigQuery is the standard streaming architecture. The interview tests whether you understand windowing (fixed, sliding, sessions), triggers (when to emit results), and watermarks (how Dataflow handles late-arriving events).
- How does BigQuery pricing affect data engineer design discussions?
- BigQuery prices on slot consumption (compute) and storage. The design rubric weights cost reasoning: partition pruning (querying only the relevant date partitions cuts cost N-fold), clustering (CLUSTER BY user_id makes lookups cheap on a 100TB table), materialized views (precomputed for repeated queries), and BI Engine (in-memory acceleration for dashboards). Mention slot reservations versus on-demand pricing for predictable workloads.
- What does the system design round look like at Google?
- GCP-native architecture, 45-60 minutes. Common scenarios: streaming clickstream (Pub/Sub to Dataflow to BigQuery plus Cloud Storage), batch warehouse with daily refresh (Cloud Storage to Dataproc Spark to BigQuery), ML feature store (Dataflow to Bigtable for online plus BigQuery for offline), or migration scenario (existing Spark cluster on EMR, move to GCP with Dataproc-then-Dataflow). Rubric weights streaming-vs-batch choice, exactly-once semantics, and cost reasoning.
- Do Google data engineer candidates need to know algorithms beyond DSA basics?
- Yes for the Python round, more than most data engineer loops. Beyond basic data structures (dict, set, list, generator), expect sliding-window aggregators, heap-based merging (heapq), graph traversal for lineage questions, and explicit time/space complexity for every approach. The bar is not LeetCode-hard, but the rubric weights complexity articulation.
- How does Google handle the behavioral round for data engineers?
- Google's behavioral round is 'Googleyness and Leadership' theme, less rigid than Amazon's LP framing but with consistent themes: ownership, collaboration, ambiguity tolerance, and growth mindset. STAR format works. Specific numbers matter. The interviewer is also assessing communication clarity and the ability to make a point concisely; Google interviewers are often impatient with rambling answers.
- What levels does Google hire data engineers at?
- L3 (entry, rare for DE), L4 (mid), L5 (senior, most common hire for experienced data engineers), L6 (staff), L7 (senior staff). L5 typically targets 5+ years experience. Rubric depth scales: L5 expects trade-off articulation and ownership of pipelines, L6 expects org-level design influence.
102 practice problems matching this filter. Domains: SQL (68), Python (34). Difficulty: medium (47), easy (37), hard (18).
SQL (68)
- 10 Lowest Uptime Services - medium - Ten services at the bottom of the reliability chart.
- 80th Percentile API Latency - medium - The 80th percentile tells the real story.
- Active Users With April Transactions - easy - Active accounts that also opened their wallets. How many?
- The Tag Order - hard - Tags arrived in chaos. The system needs them in line.
- Average Review Comments by Author - medium - Some authors get more feedback than others.
- Average Session Duration by Device - easy - Session length, device by device.
- Average Sessions Per User - hard - How often do users come back?
- Build Success vs Failure by Repo - medium - Green versus red, repo by repo.
- Campaign Revenue Totals - easy - Every campaign has a price tag. Total them up.
- Cost Share Within Category - medium - Each entry's slice of the category total.
- Cross-Variant User Pairs - medium - Same experiment. Different variants. Who overlaps?
- Custom Message Type Counts - medium - Not all messages are created equal.
- Days with More Edited Than Unedited Messages - medium - Some days, more messages get edited than sent.
- Deploy Reliability Scores - medium - A reliability scoreboard for deploy teams.
- The Apprentices Still in the Forge - easy - A model is not a model until it stops learning and starts earning.
- Device Types With Chrome Users - easy - Power users and their devices.
- Distinct Product Categories - easy - A quick category inventory.
- Duplicate DQ Check Records - medium - Passed QA twice. That's the problem.
- Exact Keyword Counts in Logs - hard - Errors and warnings. Count every single one.
- The A/B Verdict - medium - Variant A or Variant B. The conversion numbers pick the winner.
- Feature Name Intersection - hard - Training names versus serving names. The overlap.
- Find Deploy Authors - easy - Same person. Many different spellings.
- First Migration Record - easy - The very first migration. Where it all began.
- Health Checks per Service - easy - Some services get checked constantly.
- High Price Products - easy - Everything above 100.
- Inactive Android Control Users - medium - Android control cohort. Gone quiet.
- Intra-Region Latency Diff - hard - Same region. Different latency.
- iOS Adoption by Age Bucket - medium - The install numbers don't match the hype.
- iOS Sessions by Device Type - medium - iOS engagement, device by device.
- Last Migration Record - easy - The most recent migration. Is it the last?
- Mentorship User Pairs - medium - Pair them up. Mentor and mentee.
- Messages Containing Keyword - easy - Flagged terms in the messages.
- The Floor Price - medium - Before the negotiation, find what each provider really charges at its cheapest.
- Model Accuracy Drift - hard - Accuracy used to be higher.
- Most-Allocated Service - hard - The service every big team keeps paying for.
- Most Efficient API Endpoint - medium - Best throughput per call.
- Multi-Host Regions by Node Type - medium - Some regions are quietly building empires.
- Mutual Channel Connections - medium - Two users. What channels do they share?
- Never-Ordered Products - easy - In the catalog. Never purchased.
- New Services With Poor Health - hard - New services, already struggling.
- Nodes by Region and Type - medium - Broken down by region. Broken down by type.
- Noisiest Tables by DQ Failures - medium - The tables that fail the most checks.
- Non-Draft Content - easy - Everything except drafts.
- Oldest Alert per Service - hard - The oldest unresolved alert per service.
- The Regulars - medium - Past a certain threshold, casual becomes committed.
- The Notification Lifecycle - medium - Sent, opened, ignored. What happened after the alert went out?
- Quarterly Consolidated Cloud Costs - medium - Quarterly cloud spend, weighted.
- The Relentless Searchers - medium - Most users look once and leave. A few never stop looking.
- Reviewer Performance Metrics - medium - Some reviewers are thorough. Others are fast.
- Reviews Per Reviewer - easy - The workload split across reviewers.
- Search Terms Starting With G - easy - Queries starting with 'g'.
- Senior to Junior Ratio - medium - The ratio tells you a lot about the department.
- Services at Median Uptime - medium - Exactly at the median. Not above, not below.
- Session Page View Distance - hard - Page view distance per session.
- Sessions Per Device Type - easy - Sessions, device by device.
- Shared Channel Contacts - hard - User networks mapped through messages.
- The Middle Ground - medium - Strip the outliers from both ends. What does the core actually add up to?
- Symmetric Reply Network - medium - Who replies to whom? Both directions.
- The Legacy Hunt - easy - Old data. Still matters.
- The Podium Finish - medium - Top two products per category.
- The Token Census - easy - How many tokens are out there?
- Top 10 Rated Products - medium - The ten highest-rated items.
- Top Models by Framework - hard - Every framework has a star model.
- Top Percentile API Tokens - hard - The most suspicious tokens.
- Unclicked Searches by Campaign - medium - Searched but never clicked.
- Unique Hosts by Node Type - easy - How many unique hosts per node type?
- Unique Reporters per Content - medium - How many people flagged each item?
- Word Count Per Message - medium - How wordy are the messages?
Python (34)
- Batch Records - medium - Too many at once. Break them into groups.
- Batch With Metadata - easy - The list gets chopped.
- Column Max - easy - One value rules the column.
- Column Range - easy - From minimum to maximum. What is the spread?
- All Told - easy - Every shift leaves a number behind. Total the fleet.
- Cumulative Sum - medium - The total grows with every row.
- Diagonal Extract - medium - Not every value sits in a row or column.
- Even Filter - easy - Only the even ones survive.
- Explode List - easy - One row holds many values. Unpack it.
- Find Indices - medium - It is in there somewhere. Where exactly?
- Flatten the Feed - easy - Nested lists, all the way down.
- Full Outer Zip - medium - Two sides. No value left behind.
- Null Counter - easy - How many holes in the data?
- Quality Gate - easy - Not everything passes inspection.
- Full Circle - medium - Load has to keep moving. Pass it down the line.
- Sort Descending - easy - Biggest first. No exceptions.
- The Anomaly Detector - hard - Spot the outliers before they page someone.
- The DAG Executor - hard - Wire up a mini pipeline and watch it run.
- The Dependency Resolver - medium - Everything depends on everything.
- The Email Ranker - medium - Some inboxes see more action.
- The Event Aggregator - medium - Bucket a firehose of events into tidy time windows.
- The Log Pulse - easy - Some lines repeat themselves.
- The One-of-Each - easy - Strip the repeats, keep the originals.
- The Pipeline Filter - easy - In the door as one thing, out the door as another.
- The Running Total - easy - Each position holds the sum of everything before it.
- The Schema Differ - medium - Schema from yesterday vs today. Something changed.
- The Schema Migrator - hard - Old schema in, new schema out.
- The Streak Breaker - easy - It has a problem with repetition.
- The Stream Joiner - hard - Events don't wait for each other. This does.
- The Throttle Ceiling - medium - Too many requests in too short a timeframe. Throttle it.
- The Throttle Wall - hard - Stop the abusers. Let the rest through.
- Transform Column - easy - Same data, new shape.
- Transpose Table - medium - Rows become columns. Columns become rows.
- Value Count - easy - How many of each? Count them.