Data Engineer System Design Interview Prep
Prep for the system design round of a data engineer interview loop with rubric-scored practice scenarios.
System design interview prep for data engineer roles. End-to-end pipeline design rounds last 45 to 60 minutes. Scenarios include a 10B-events-per-day clickstream, a 15-minute-freshness dashboard, a 28-day late-arriving conversion window, and multi-region replication. The rubric weights SLA match, cost reasoning, 3 failure modes per component, tool fit, and adapt-on-fly when the interviewer changes a requirement.
The system design round shows up on 52 percent of senior-and-above data engineer interview loops. Format is 45 to 60 minutes, end-to-end pipeline design on a whiteboard or canvas, with a concrete scenario like "10 million events per day, 15-minute dashboard freshness SLA, downstream BI tool that cannot handle table swaps". The interviewer expects a high-level architecture, then drills on three failure modes per component, cost reasoning at scale, and the on-call story (what gets paged, who responds, what the runbook says).
Six scenario shapes recur across data engineer system design rounds in 2026:
- Clickstream ingestion: 10B events per day, web SDK to local buffer to Kafka to Spark Structured Streaming or Flink to Parquet on S3 partitioned by date and hour to dbt to gold star schemas in Snowflake.
- Daily ETL from Postgres to Snowflake: Debezium CDC to Kafka to S3 raw immutable to Spark daily ETL with run_id baked into output partitions to Snowflake MERGE on composite natural key.
- ML feature store: real-time path uses Flink to Redis with 10ms reads, batch path uses Spark to S3 Parquet to Feast catalog, training uses as-of joins with feature_ts <= label_ts to prevent leakage.
- Daily reconciliation pipeline for payments: Postgres to Debezium to Kafka to S3 raw immutable to idempotent Spark with run_id to Snowflake MERGE on (txn_id, run_id).
- Multi-region active-active warehouse: region-local writes with async cross-region CDC replication, conflict resolution via last-writer-wins or CRDT for counters, SLA tiers (real-time within region, eventually-consistent across regions), 2x storage minimum.
- Real-time analytics dashboard: micro-batch with Spark Structured Streaming on 1-minute trigger, Materialize or Druid for serving, hourly Spark to Snowflake for historical.
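The as-of join named in the ML feature store scenario above is easy to state and easy to get wrong on a whiteboard, so a minimal sketch may help. It is plain Python rather than Spark or Feast, and the tuple layouts, the function name `as_of_join`, and the sample values are illustrative assumptions, not anything prescribed by a particular feature store.

```python
from bisect import bisect_right

def as_of_join(labels, features):
    """For each label row, attach the latest feature value with feature_ts <= label_ts.

    labels:   list of (entity_id, label_ts, label)
    features: list of (entity_id, feature_ts, value)
    Rows with no earlier feature get None, never a value from the future.
    """
    # Index feature rows per entity, sorted by feature_ts.
    by_entity = {}
    for entity_id, feature_ts, value in sorted(features, key=lambda r: (r[0], r[1])):
        ts_list, val_list = by_entity.setdefault(entity_id, ([], []))
        ts_list.append(feature_ts)
        val_list.append(value)

    joined = []
    for entity_id, label_ts, label in labels:
        ts_list, val_list = by_entity.get(entity_id, ([], []))
        # bisect_right counts how many feature_ts values are <= label_ts.
        idx = bisect_right(ts_list, label_ts)
        value = val_list[idx - 1] if idx > 0 else None
        joined.append((entity_id, label_ts, label, value))
    return joined

# Hypothetical sample data: timestamps are plain integers for readability.
features = [("u1", 100, 0.2), ("u1", 200, 0.9), ("u2", 150, 0.5)]
labels = [("u1", 150, 1), ("u1", 200, 0), ("u2", 100, 1)]

for row in as_of_join(labels, features):
    print(row)
# ('u1', 150, 1, 0.2)   feature at ts=200 is in the future, so ts=100 is used
# ('u1', 200, 0, 0.9)   equality is allowed: feature_ts <= label_ts
# ('u2', 100, 1, None)  the only feature is later than the label, so no match
```

The point to make in the round is the third row: a naive equi-join on entity would have attached the ts=150 feature to a label observed at ts=100, which is exactly the leakage the as-of condition rules out.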
The L5+ rubric explicitly weights three failure modes per component. For each box on the whiteboard, name what happens when it dies, when it gets backed up, and when the upstream schema changes. For Kafka: broker dies (replication factor 3 handles N-1 failures), partition skew (key distribution review, repartitioning), consumer lag (autoscale consumer group, check downstream). For Spark Structured Streaming: executor OOM (memory tuning, broadcast threshold), watermark too aggressive (late data drops, increase watermark), checkpoint corruption (delete checkpoint and reprocess from earliest acceptable offset). For Snowflake MERGE: deadlock with concurrent writer (serialize via lock or queue, use insert_overwrite pattern), partition not yet committed (delay merge until processing watermark advances), schema drift (schema registry enforcement at producer side).
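To make the "watermark too aggressive" failure mode above concrete, here is a deliberately simplified model in plain Python. It is a sketch under stated assumptions: the watermark is recomputed per event as the maximum event time seen minus an allowed lateness, whereas Spark Structured Streaming advances its watermark between micro-batches and applies it to stateful operators. The function name `apply_watermark` and the sample events are hypothetical.

```python
def apply_watermark(events, allowed_lateness):
    """Split events into (accepted, dropped) under a simple event-time watermark.

    events: iterable of (event_time, payload) in ARRIVAL order.
    The watermark trails the largest event_time seen so far by allowed_lateness;
    anything older than the watermark when it arrives is treated as too late.
    """
    max_event_time = None
    accepted, dropped = [], []
    for event_time, payload in events:
        if max_event_time is None or event_time > max_event_time:
            max_event_time = event_time
        watermark = max_event_time - allowed_lateness
        if event_time < watermark:
            dropped.append((event_time, payload))
        else:
            accepted.append((event_time, payload))
    return accepted, dropped

# Event "c" has event_time 110 but arrives after event "b" (event_time 160).
events = [(100, "a"), (160, "b"), (110, "c"), (155, "d")]

_, dropped_tight = apply_watermark(events, allowed_lateness=10)
_, dropped_loose = apply_watermark(events, allowed_lateness=60)
print(dropped_tight)  # [(110, 'c')]
print(dropped_loose)  # []
```

The same input loses a record with a 10-unit lateness allowance and loses nothing with 60, which is the trade-off to articulate: a wider watermark keeps late data but holds state longer and delays final results.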
Companies that emphasize system design heavily in data engineer loops: Netflix (Spark and Iceberg with streaming and late-arriving data), Stripe (idempotent reconciliation, financial-data audit, multi-region for global payments), Meta (ads attribution with 28-day windows, feed-ranking signals pipeline at 10B+ events per day), Amazon (AWS-native architectures: Kinesis to Firehose to S3 to Glue and Athena and Redshift), Google (GCP-native: Pub/Sub to Dataflow to BigQuery), Databricks (Spark expertise, AQE, Delta MERGE INTO with optimize). The senior data engineer who has practiced 5 of these architectures end-to-end with explicit failure-mode articulation usually clears the system design round at any of them.
- What percentage of data engineer interviews include a system design round?
- Roughly 52 percent of senior-and-above data engineer interview loops include an explicit system design round. The share rises with seniority: nearly all L5+ data engineer loops include system design, and L6+ loops often include two design rounds (one pipeline-specific, one platform-level). Format is 45 to 60 minutes on a whiteboard or canvas.
- What does the system design rubric score for data engineer interviews?
- Five dimensions in most companies' rubrics. SLA match (25 percent): does the design meet the freshness, throughput, and latency requirements. Cost reasoning (20 percent): back-of-envelope numbers for Kafka partitions, Spark workers, Snowflake credits. Failure modes (20 percent): 3 per component, with detection and recovery. Tool fit (15 percent): why this technology and not the alternative. Adapt-on-fly (20 percent): when the interviewer changes a requirement mid-round, does the candidate modify the design in place or restart from scratch.
- What scenarios are most common in data engineer system design rounds?
- Clickstream ingestion (10B events per day), daily ETL Postgres to Snowflake, ML feature store with online and offline paths, daily payment reconciliation, multi-region active-active warehouse, real-time analytics dashboard with micro-batch trigger. Each has a canonical architecture with company-specific variations: AWS-native at Amazon, GCP-native at Google, Spark-and-Iceberg at Netflix, idempotent-reconciliation at Stripe.
- How does the cost reasoning part of the rubric work?
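- The L5+ rubric expects rough back-of-envelope numbers. For 10B events per day on Kafka: throughput is 116k events per second average, peak roughly 5x; with 1KB per event that is 116 MB/s average, 580 MB/s peak; assuming roughly 100 MB/s of sustained throughput per partition or shard (an illustrative figure; a real Kinesis shard ingests about 1 MB/s, so the count there is far higher), that is about 2 partitions at average load and 6 at peak, plus headroom for skewed partition keys. For the warehouse: Snowflake credits per warehouse-hour by warehouse size, or BigQuery on-demand cost per TB scanned versus slot reservations. For S3: storage class trade-offs (Standard vs Infrequent Access vs Glacier).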
- What is the 3-failure-modes-per-component expectation?
- For each box on the whiteboard, name what happens when it dies (replication, failover), when it gets backed up (autoscale, backpressure), and when the upstream schema changes (schema registry, schema-on-read, dead-letter queue). Senior data engineer candidates do this proactively; junior candidates wait to be asked.
- How does a data engineer prep for the mid-round pivot?
- Practice with a peer or AI mock interviewer that explicitly changes the requirements halfway through. Common pivots: SLA tightens from 15 minutes to 1 minute (requires moving from micro-batch to streaming), data volume jumps 100x (requires partitioning strategy review, broadcast vs sort-merge join decision flip), the BI tool cannot handle table swaps (requires insert_overwrite or materialized view pattern instead of CTAS). The L5 signal is articulating what changes and what stays in the existing design without throwing it out.
- How long is a system design round?
- 45 to 60 minutes for one round at most companies. Senior+ loops sometimes add a second design round (a 'design the platform' meta-question). The 45-minute version expects high-level architecture in 15 minutes, drill on 2-3 components in 20 minutes, and adapt-on-fly plus questions in the final 10. Pacing matters: spending 30 minutes on the high-level architecture means no time for the drill.
- What stack should a data engineer assume in design rounds?
- Depends on the company. AWS at Amazon (Kinesis, Glue, EMR, S3, Athena, Redshift). GCP at Google (Pub/Sub, Dataflow, BigQuery, Dataproc). Spark+Iceberg at Netflix. Presto+Hive+Spark+internal tools at Meta. Stack-neutral at smaller companies and at most non-FAANG: pick a stack you can defend and use it. Mention alternatives when they would be more appropriate.
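The back-of-envelope throughput numbers in the cost-reasoning answer above can be reproduced with a few lines of arithmetic. This is a minimal sketch: the function name `throughput_estimate` is hypothetical, 1KB is taken as 1,000 bytes, and the 5x peak factor and 100 MB/s per-shard figure are the assumptions stated in that answer, not measured limits of any specific system.

```python
import math

SECONDS_PER_DAY = 86_400

def throughput_estimate(events_per_day, bytes_per_event, peak_factor, shard_mb_per_s):
    """Return average and peak throughput plus the shard count needed for each."""
    avg_events_per_s = events_per_day / SECONDS_PER_DAY
    avg_mb_per_s = avg_events_per_s * bytes_per_event / 1_000_000
    peak_mb_per_s = avg_mb_per_s * peak_factor
    return {
        "avg_events_per_s": avg_events_per_s,
        "avg_mb_per_s": avg_mb_per_s,
        "peak_mb_per_s": peak_mb_per_s,
        # Round up: a fractional shard still has to be provisioned as a whole one.
        "shards_avg": math.ceil(avg_mb_per_s / shard_mb_per_s),
        "shards_peak": math.ceil(peak_mb_per_s / shard_mb_per_s),
    }

est = throughput_estimate(
    events_per_day=10_000_000_000,  # 10B events per day
    bytes_per_event=1_000,          # ~1KB per event
    peak_factor=5,                  # assumed peak-to-average ratio
    shard_mb_per_s=100,             # assumed per-shard throughput
)
print(round(est["avg_events_per_s"]))  # 115741, i.e. roughly 116k events/s
print(round(est["avg_mb_per_s"]))      # 116
print(round(est["peak_mb_per_s"]))     # 579, i.e. roughly 580 MB/s
print(est["shards_avg"], est["shards_peak"])  # 2 6
```

Changing `shard_mb_per_s` is the quickest way to show the interviewer how sensitive the shard count is to the per-shard assumption, which is usually the number they push on.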
124 practice problems matching this filter. Difficulty: medium (57), hard (67).
Pipeline Architecture (124)
- 45 Minutes Turned Into 3.5 Hours - medium - Spark jobs are running. Just not fast enough.
- 600 Million Events a Day - hard - 600 million events a day. Two years of retention.
- A Clean Number for Every Merchant - hard - Raw payment logs in. Clean merchant summaries out.
- A Million Cars Phoning Home - hard - Every vehicle is a sensor. Deploy the pipeline to catch it all.
- Analysts Are Slowing the Store Down - medium - Orders placed. Data warehouse hungry.
- A New Column on a Billion Rows - hard - Add and backfill a new column to a billion-row production table with zero downtime.
- A Shared Drive Full of Contracts - medium - Buried in PDFs. The data is in there somewhere.
- A Stream All Day and a File at Midnight - hard - Real-time and batch. Same pipeline. No compromises.
- Badging Items That Already Sold Out - hard - Same-day delivery. The features have to be faster.
- Basel, CCAR, and Monday Morning - medium - The regulator does not accept 'eventually consistent.'
- Bikes Before Rush Hour - hard - Bikes in, bikes out. The city needs to predict demand.
- Credit for Every Touch - medium - They saw the ad, clicked the email, then bought. Who gets credit?
- Doubling Every Six Months - hard - Tuesdays are quiet. Black Friday is not.
- Eight-Hour-Old Positions - hard - Positions shift by the second. The math cannot lag.
- Eight Teams, Eight Latencies - medium - Millions of gamers. The architecture decision changes everything.
- End of Day Is Too Late - medium - Every swipe tells a story.
- Equities, ETFs, and the SEC - hard - Fractional shares, multi-currency, point-in-time. All of it.
- Event System for Multiple Consumers - hard - One event, many hungry consumers.
- Every Dataset Needs a Paper Trail - hard - The FDA has opinions about your data pipeline.
- Every Deal Is a Financial Transaction - hard - Real money on the table. Reconstruct every hand.
- Every Device, Every Impression - hard - Every ad seen. Every second watched. Real-time.
- Every Device Has Its Own Dialect - medium - Three sources. Three formats. Same workout.
- Every Firm Formats It Differently - medium - The regulator changed the format. Again. Handle it.
- Every Format Imaginable - hard - PDFs, HL7, JSON. All of it lands in the same lake.
- Everyone Wants the Same Data, Differently - hard - How you store it decides how fast you can read it.
- Every Region Exports Its Own Way - medium - Sales data, BigQuery, Dataflow. Make it all sing.
- Every Scan, Every Parcel, Every Pin Code - medium - Out for delivery. Delivered. Except the events arrived backwards.
- Fifty Thousand Retailers - medium - Retail data at CPG scale. Every SKU, every store.
- The Box That Won't Fit the Data - hard
- Five Times the Traffic, Five Times the Bill - hard - Scale up when needed. Do not bankrupt the team.
- Five Years of Cron Jobs - hard - Half the jobs run on cron. Half run on events. All of it has to move.
- Flying Blind Until Midnight - hard - Intraday risk, full lineage. The regulator is watching.
- Four Teams, One Topic, No Agreement - hard - Everybody is writing to it. Nobody documented it. Now production is fragile.
- Greenfield Build for Six Sources - hard - Infrastructure as code. Meaning as a service.
- Half a Million Rental Cars - medium - Every vehicle is reporting. Every rental matters.
- The Identity Problem - hard - Old systems. New demands. The same customer appears under three different names.
- Listens From Everywhere, Counted Once - hard - Phones, tablets, laptops. And some of them report late.
- Live Viewers, Live Billing - hard - The stream is live. The data cannot wait.
- Near-Real-Time Trending Dishes Dashboard - hard - The dish rankings update faster than the kitchen.
- Nested Docs, Flat Reports - medium - Two databases. One direction. No data left behind.
- Nightly Exports Are Too Slow - medium - Healthcare claims change constantly. The warehouse cannot fall behind.
- 4,500 Stores Before Sunrise - medium - The shelves open at 7. The data better be there.
- Not Every Team Can See Every Row - hard - Everyone can see the bucket. Not everyone should.
- One Bill Across Three Clouds - medium - AWS, Azure, GCP. Three bills. One truth.
- One Earthquake, Ten Thousand Tweets - hard - The firehose is on. Separate signal from noise.
- Out of the Data Center - medium - The on-prem servers are not getting any younger.
- The Speed Layer - medium - Dashboards can't wait for raw logs. Something has to happen upstream.
- Prove the Number Is Right - hard - Bad data in fintech is not just messy. It is expensive.
- Real Data, Fake Patients - hard - Dev needs production data. HIPAA says absolutely not.
- The Register Never Sleeps - medium - Every swipe lands in the warehouse. The table has to stay current without breaking.
- Recommendations Now, Royalties Later - medium - The catalog updated. Did anyone notice?
- Replicate It Without Breaking It - hard - The source changed. The lake needs to know immediately.
- Risk Models on Week-Old Data - medium - Loan approved. Loan denied. Every decision is an event.
- SaaS API Connector with Incremental Sync - medium - The API has rate limits. You have deadlines.
- Same-Day Sales, Every Store - medium - The cash register data needs to be queryable by morning.
- The Living Table - medium - Data lands continuously. History must survive every update.
- Score It Before It Clears - hard - The fraudsters move fast. Your pipeline has to move faster.
- Ship Before Fraud Finishes Checking - hard - The claim looks clean. The fraud model disagrees.
- Six Hours to Miss a Deadline - medium - The rebuild works. It just doesn't finish in time.
- Six Hours to Refresh Every Number - medium - Ratings change. The incremental model has to keep pace.
- Six Million Rows Before the Market Opens - medium - One massive CSV. Millions of timestamps.
- Six Sources, One Platform - medium - ADF orchestrates. Unity Catalog governs. Nothing leaks.
- Sixty Minutes, Every Hour - medium - Every hour, on the hour. No excuses.
- Stores and the Site, Together - hard - The registers never stop ringing.
- Store, Site, and Distributor - medium - Sales data is piling up. Someone has to make sense of it.
- The Acquisition Still Taking Bookings - hard - Two systems, two schemas. One truth.
- The Agency That Changes the Columns - medium - The schema changed overnight. Again.
- The Analysts Cannot Touch Production - medium - Production is the source. Analytics needs its own copy.
- The Analyst Who Saw the Salary Data - hard - Two incidents. One shared lake. The access model was never designed, just assumed.
- The API Drip Feed - medium - The API gives you 100 records at a time. You need millions.
- The Bad Row That Broke the Dashboard - medium - Bad records cannot reach the warehouse.
- The Binding and the Claim - medium - Policies are instant. Claims take their time.
- The Booking That Came Three Ways - hard - PMS, OTA, and website all think they took the reservation first.
- The Boutique That Sold in Six Currencies - hard - Every sale is real. The rate it was converted at depends on who is asking.
- The Bucket Full of Resumes - medium - A thousand resumes. Structured data inside each one.
- The Carrier Moving to Azure - medium - Claims arrive messy. The medallion cleans them up.
- The Claim That Picks Its Own Lane - medium - Three entry points. Different workflows. All must route correctly.
- The Clicks We Throw Away - hard - Every tap, swipe, and scroll. At scale.
- The Clock That Runs Two Ways - hard - Nightly batch and live events. One dashboard.
- The Consent Stitcher - medium - Consent was given. Or was it? Stitch the records together.
- The Dashboard and the Attribution Model - hard - Streaming and batch. One pipeline to rule them.
- The Decision Before the Door Closes - hard - The window to stop it is smaller than you think.
- The Distributor Filing Problem - medium - Hundreds of suppliers. One warehouse. One deadline.
- The Event Pile - hard - 600 million clicks a day. The budget is not infinite.
- The Fare Aggregator - medium - Airfares shift every minute. Catch the best ones.
- The Fleet That Never Stops - hard - Every truck is talking. Not everyone can hear them yet.
- The Leaderboard That Costs $25K a Month - hard - Product wants it live. Engineering has a price tag.
- The Meal Kit That Knows You - medium - What they ordered says a lot about what they want next.
- The Migration That Cannot Break Morning - hard - It all works today. Moving it without losing a single report is the hard part.
- The Models Going Stale - hard - The model is only as good as what you feed it.
- The Panel and the Set-Top Boxes - hard - Set-top boxes tell you who watched. Projection tells you how many.
- The Patients We Cannot Move - hard - Patient data stays local. Insights have to be global.
- The Points Arrive Two Days Late - medium - The bank data shows up late. The rewards were already sent.
- The Provider That Sometimes Sleeps - medium - The models run at dawn. The data has to be there first.
- The Query That Used to Be Fast - medium - Queries used to be fast. Something changed.
- The Queue That Wouldn't Stop Growing - medium - 500,000 messages behind and the number keeps climbing.
- The Revenue That Was Wrong for Two Weeks - medium - Nobody caught it until the CFO asked a question. Design the system that catches it first.
- The Sale That Needs to Land Now - medium - Three channels feeding one view. Not all of them speak the same language.
- The Signals That Power Recommendations - medium - Fresh signals, many teams, one pipeline.
- The User Who Asked to Be Forgotten - hard - Users want their data erased. Completely.
- The Vendor Who Never Warns You - medium - Every month, something is different. The dashboards have no idea.
- The What-If Machine - hard - A million slots. A thousand campaigns. Every combination matters.
- The Whiteboard Exercise - medium - Marker in hand. Draw the whole thing.
- Thirty Cities, One Forecast - hard - Five cities. Five data formats. One prediction.
- Thirty Countries, One Solvency Number - hard - Premiums collected globally. Losses happen locally.
- Thirty Million Unique Jobs a Year - hard - One press run, many orders. Group them right.
- Thousands of Practices, One Dataset - hard - Patient records in, operational insights out.
- Three Providers, One Workout - hard - The same ride, reported three times.
- Three Regions, One Finance Team - hard - Payments from everywhere. One consistent report.
- Three Regions, One Report - hard - Three regions, billions of payments, one merchant summary by 6 AM.
- Towers and Phones, Same Story - hard - Tower signals meet app events. Somewhere in between is the truth.
- Traders, Risk, and the Regulators - medium - Markets move in milliseconds. The pipeline has to keep up.
- Two Hundred Million Redirects - medium - Billions of clicks. One tiny code. Two very different clocks.
- Two Million Boxes by Monday Morning - hard - Shipped, maybe. Delivered, debatable.
- Two Systems, One Room Count - hard - Two booking systems. Rooms do not duplicate themselves.
- Two Ways to Catch a Change - medium - Two ways to watch the database. Each has a cost.
- Two Years of Every Click - hard - Every click, every aisle, every day for two years.
- Two Years of Clicks, Cheap - hard - Two years of clicks. Every query has to be affordable.
- What Everyone Is Watching - hard - Someone is watching. Capture everything.
- What Should We Recommend Tonight - hard - They ordered pad thai twice. That means something.
- Where Is Every Truck, Right Now - medium - Trucks are moving. Every ping counts.
- Which Promotion Is Actually Working - hard - Was the promotion worth it? The data knows.
- Who Is Churning and Why - medium - Subscribers churn. The pipeline cannot.
- Who Saw the Ad Twice - hard - TV and digital. Same viewer, two measurement worlds.