Data Pipeline Interview Questions
End-to-end pipeline design and operations problems for data engineer interview prep.
Data pipeline interview questions for data engineer roles, covering the design round and the operational round. Topics include end-to-end design across ingest, transform, and serve layers; streaming and batch architectures; idempotent transformation with MERGE INTO; multi-source ingestion; and multi-region replication. These are the architecture rounds that data engineer L5+ loops expect.
Data pipeline design interview questions span ingest, transform, and serve layers end-to-end. The ingest layer captures data from sources (CDC, event streams, batch dumps, API pulls); the transform layer applies dedup, conformed-dimension joins, business rules, and aggregation; the serve layer makes the transformed data accessible (analytical warehouse for BI, online feature store for ML serving, materialized view for dashboards). Each layer has design questions that a senior data engineer is expected to answer.
Ingest layer design questions. Source type: transactional database (use CDC via Debezium), event stream (use Kafka/Kinesis/Pub/Sub directly), API (use scheduled batch pulls with a high-water mark), file drop (use an S3 event trigger to start a Spark job). Throughput sizing: events per second, peak factor, bytes per event. As planning rules of thumb, Kafka sustains roughly 10-20 MB/sec per partition and Kinesis accepts 1 MB/sec of writes per shard. Durability: replication factor (3 for Kafka), at-least-once delivery, snapshot recovery for new sources. Schema: registry-enforced contracts, additive-only evolution, raw payload preserved in bronze for replay.
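The throughput sizing above can be worked as simple arithmetic. This is a minimal sketch, not a capacity-planning tool: the workload (10B events per day, 1 KB per event, 3x peak factor) is a hypothetical example, the function name `units_needed` is made up for illustration, and the per-partition and per-shard limits are the rule-of-thumb figures quoted above, with the conservative 10 MB/sec end used for Kafka.

```python
import math

def units_needed(events_per_sec, bytes_per_event, peak_factor, mb_per_sec_per_unit):
    """Partitions (Kafka) or shards (Kinesis) needed to absorb peak write throughput."""
    peak_bytes_per_sec = events_per_sec * bytes_per_event * peak_factor
    peak_mb_per_sec = peak_bytes_per_sec / 1_000_000
    return math.ceil(peak_mb_per_sec / mb_per_sec_per_unit)

# Hypothetical workload: 10B events/day, 1 KB per event, peak is 3x the average.
avg_events_per_sec = 10_000_000_000 / 86_400  # about 115,741 events/sec

# Rule-of-thumb limits: 10 MB/sec per Kafka partition (conservative end of 10-20),
# 1 MB/sec of writes per Kinesis shard.
kafka_partitions = units_needed(avg_events_per_sec, 1_000, 3, 10)
kinesis_shards = units_needed(avg_events_per_sec, 1_000, 3, 1)

print(kafka_partitions)  # 35
print(kinesis_shards)    # 348
```

In an interview, stating the three inputs (average rate, event size, peak factor) out loud before quoting a partition count tends to matter more than the exact number.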
Transform layer design questions. Compute engine: Spark for distributed batch, Flink for stateful streaming, dbt for in-warehouse SQL transformations, custom Python for lightweight glue. Idempotency: run_id baked into output partitions, MERGE INTO on composite natural keys, append-only with version column for slowly-changing facts. Late-arriving data: MERGE-ADD-not-REPLACE for windowed aggregations, watermark plus allowed lateness for streaming, backfill plan for batch. Dependency management: orchestrator (Airflow, Dagster) handles cross-pipeline dependencies; downstream pipelines wait on upstream completion via sensors or asset-based triggers.
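The idempotency point can be illustrated without a warehouse. The sketch below imitates MERGE INTO on a composite natural key with an in-memory dict; the table shape, the column names (`order_id`, `line_no`, `amount`, `run_id`), and the helper `merge_into` are all hypothetical, and a real pipeline would issue the equivalent MERGE statement in its warehouse or table format.

```python
def merge_into(target, batch, key_cols):
    """Upsert rows keyed on a composite natural key.

    Matched keys are overwritten and unmatched keys are inserted,
    so replaying the same batch leaves the target unchanged.
    """
    for row in batch:
        key = tuple(row[col] for col in key_cols)
        target[key] = row
    return target

batch = [
    {"order_id": 1, "line_no": 1, "amount": 10.0, "run_id": "2024-01-01"},
    {"order_id": 1, "line_no": 2, "amount": 5.0, "run_id": "2024-01-01"},
]
key_cols = ("order_id", "line_no")

# MERGE-style load: first run, then a retry of the same run.
merged = {}
merge_into(merged, batch, key_cols)
after_first_run = dict(merged)
merge_into(merged, batch, key_cols)

# Naive append-only load with no dedup, for contrast.
appended = []
appended.extend(batch)
appended.extend(batch)

print(merged == after_first_run)  # True: the retry changed nothing
print(len(merged))                # 2
print(len(appended))              # 4: the retry duplicated every row
```

The same reasoning is what an interviewer is probing when they ask what happens if the orchestrator retries a task that already partially succeeded.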
Serve layer design questions. Analytical warehouse for BI: Snowflake, BigQuery, Redshift, Databricks with star schemas in the gold layer. Online feature store for ML: Redis or DynamoDB for 10ms read latency with batch backfill. Materialized view for dashboards: Snowflake materialized views, BigQuery materialized views, or pre-computed gold tables refreshed by dbt. Multi-region serving: read replicas with eventual consistency, or active-active with CRDT-based conflict resolution. Caching: CDN-fronted for static, ElastiCache-fronted for dynamic. Operational concerns: SLA monitoring, query performance dashboards, cost attribution per consumer.
The 45-60 minute data pipeline design round expects the data engineer to cover all three layers in the time allotted, with explicit failure-mode articulation at each. Pacing is critical: 10 minutes on the ingest layer, 15 minutes on transform, 10 minutes on serve, with the remaining 10-25 minutes for failure-mode drills and adapt-on-fly pivots. Spending 30 minutes on the ingest layer at the expense of transform and serve is a pacing failure even if the ingest design is excellent.
Companies that emphasize end-to-end data pipeline design in data engineer interviews: Netflix (streaming-heavy with Iceberg and Spark), Stripe (idempotent reconciliation across all three layers), Meta (large-scale clickstream and ads attribution), Amazon (AWS-native end-to-end), Google (GCP-native with Pub/Sub plus Dataflow plus BigQuery), Uber (Kafka plus Flink plus Pinot for real-time serving plus Spark for batch).
- What does a data pipeline design interview round cover?
- End-to-end design across ingest, transform, and serve layers in 45-60 minutes. Typical scenarios: a 10B-events-per-day clickstream, a daily Postgres-to-Snowflake ETL, or an ML feature store. Senior data engineer rubrics weight idempotency at each layer, failure-mode articulation, cost reasoning, and adapt-on-fly when the interviewer flips a requirement.
- How does a data engineer pace a 45-minute pipeline design round?
- 10 minutes ingest, 15 minutes transform, 10 minutes serve. That leaves about 10 minutes in a 45-minute round (up to 25 in a 60-minute round) for failure-mode drills and pivots. Spending 30 minutes on the ingest layer at the expense of transform and serve is a pacing failure even if the ingest design is excellent.
- What is the ingest layer in a data pipeline?
- The layer that captures data from sources. Mechanisms: CDC via Debezium for transactional databases, Kafka/Kinesis/Pub/Sub for event streams, scheduled batch SELECT for warehouses, API pulls for SaaS sources, S3 event triggers for file drops. Sizing: throughput in events per second, peak factor, bytes per event.
- What is the transform layer in a data pipeline?
- The layer that applies dedup, conformed-dimension joins, business rules, and aggregation. Compute engines: Spark for distributed batch, Flink for stateful streaming, dbt for in-warehouse SQL, custom Python for lightweight glue. Idempotency design (run_id, MERGE INTO) is the senior data engineer rubric item.
- What is the serve layer in a data pipeline?
- The layer that makes transformed data accessible. Analytical warehouse for BI (Snowflake, BigQuery). Online feature store for ML (Redis, DynamoDB). Materialized view for dashboards. Multi-region serving for global products. Caching, SLA monitoring, and cost attribution are operational concerns.
- How does a data engineer handle multi-source ingestion?
- One bronze layer per source type (transactional, event stream, batch, API), with consistent metadata: load_date, source_system, ingestion_method, raw_payload. Silver layer applies conformed dimensions across sources (one dim_customer used by orders from Postgres, events from Kafka, leads from Salesforce). Gold layer presents unified analytical models.
- What is the typical failure mode in pipeline design rounds?
- Not articulating failure modes at every component. A high-level architecture without 'what happens when Kafka brokers die, when Spark executors OOM, when Snowflake MERGE deadlocks' falls short of the L5 rubric. The L4 candidate produces a working architecture; the L5 candidate names 3 failure modes per component proactively.
- How does a data engineer prepare for the adapt-on-fly pivot?
- Practice with a peer or AI mock interviewer that flips a requirement mid-round. Common flips: SLA tightens from 15 min to 1 min, volume jumps 100x, downstream BI tool cannot handle table swaps, multi-region requirement added. The L5 signal is modifying the existing design in place and articulating what changes; the L4 signal is restarting from scratch.
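The consistent bronze metadata described in the multi-source ingestion answer above can be sketched as a small envelope function. This is an illustrative assumption, not a prescribed schema: the four field names come from that answer, while the function name `to_bronze_record`, the sample payloads, and the source and method labels are hypothetical.

```python
import json
from datetime import date

def to_bronze_record(payload, source_system, ingestion_method, load_date):
    """Wrap a source record in the shared bronze metadata, keeping the raw payload for replay."""
    return {
        "load_date": load_date.isoformat(),
        "source_system": source_system,
        "ingestion_method": ingestion_method,
        # Stored as an untouched JSON string so silver logic can be re-run later.
        "raw_payload": json.dumps(payload, sort_keys=True),
    }

load_date = date(2024, 1, 15)
bronze_rows = [
    to_bronze_record({"order_id": 42, "customer_id": 7}, "postgres", "cdc", load_date),
    to_bronze_record({"event": "page_view", "customer_id": 7}, "kafka", "stream", load_date),
    to_bronze_record({"lead_id": "L-9", "customer_id": 7}, "salesforce", "api_pull", load_date),
]

# Every source lands with the same column set, whatever its payload looks like.
column_sets = {tuple(row) for row in bronze_rows}
print(len(column_sets))  # 1
```

Because the payload is stored verbatim, the silver layer can map `customer_id` from each source onto one conformed `dim_customer` without the bronze tables needing a shared business schema.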
133 practice problems matching this filter. Difficulty: medium (65), hard (68).
Pipeline Architecture (133)
- 45 Minutes Turned Into 3.5 Hours - medium - Spark jobs are running. Just not fast enough.
- 600 Million Events a Day - hard - 600 million events a day. Two years of retention.
- A Clean Number for Every Merchant - hard - Raw payment logs in. Clean merchant summaries out.
- A Million Cars Phoning Home - hard - Every vehicle is a sensor. Deploy the pipeline to catch it all.
- A Million Moving Dots - medium
- Analysts Are Slowing the Store Down - medium - Orders placed. Data warehouse hungry.
- A New Column on a Billion Rows - hard - Add and backfill a new column to a billion-row production table with zero downtime.
- A Shared Drive Full of Contracts - medium - Buried in PDFs. The data is in there somewhere.
- A Stream All Day and a File at Midnight - hard - Real-time and batch. Same pipeline. No compromises.
- Badging Items That Already Sold Out - hard - Same-day delivery. The features have to be faster.
- Basel, CCAR, and Monday Morning - medium - The regulator does not accept 'eventually consistent.'
- Bikes Before Rush Hour - hard - Bikes in, bikes out. The city needs to predict demand.
- Credit for Every Touch - medium - They saw the ad, clicked the email, then bought. Who gets credit?
- Doubling Every Six Months - hard - Tuesdays are quiet. Black Friday is not.
- Eight-Hour-Old Positions - hard - Positions shift by the second. The math cannot lag.
- Eight Teams, Eight Latencies - medium - Millions of gamers. The architecture decision changes everything.
- End of Day Is Too Late - medium - Every swipe tells a story.
- Equities, ETFs, and the SEC - hard - Fractional shares, multi-currency, point-in-time. All of it.
- Event System for Multiple Consumers - hard - One event, many hungry consumers.
- Every Dataset Needs a Paper Trail - hard - The FDA has opinions about your data pipeline.
- Every Deal Is a Financial Transaction - hard - Real money on the table. Reconstruct every hand.
- Every Device, Every Impression - hard - Every ad seen. Every second watched. Real-time.
- Every Device Has Its Own Dialect - medium - Three sources. Three formats. Same workout.
- Every Firm Formats It Differently - medium - The regulator changed the format. Again. Handle it.
- Every Format Imaginable - hard - PDFs, HL7, JSON. All of it lands in the same lake.
- Everyone Wants the Same Data, Differently - hard - How you store it decides how fast you can read it.
- Every Region Exports Its Own Way - medium - Sales data, BigQuery, Dataflow. Make it all sing.
- Every Scan, Every Parcel, Every Pin Code - medium - Out for delivery. Delivered. Except the events arrived backwards.
- Every Version of You - medium
- Fifty Thousand Retailers - medium - Retail data at CPG scale. Every SKU, every store.
- Five Times the Traffic, Five Times the Bill - hard - Scale up when needed. Do not bankrupt the team.
- Five Years of Cron Jobs - hard - Half the jobs run on cron. Half run on events. All of it has to move.
- Flying Blind Until Midnight - hard - Intraday risk, full lineage. The regulator is watching.
- Four Teams, One Topic, No Agreement - hard - Everybody is writing to it. Nobody documented it. Now production is fragile.
- Fresh and Forever - medium
- Greenfield Build for Six Sources - hard - Infrastructure as code. Meaning as a service.
- Half a Million Rental Cars - medium - Every vehicle is reporting. Every rental matters.
- The Identity Problem - hard - Old systems. New demands. The same customer appears under three different names.
- Listens From Everywhere, Counted Once - hard - Phones, tablets, laptops. And some of them report late.
- Live Viewers, Live Billing - hard - The stream is live. The data cannot wait.
- Mark to Market - medium
- Near-Real-Time Trending Dishes Dashboard - hard - The dish rankings update faster than the kitchen.
- Nested Docs, Flat Reports - medium - Two databases. One direction. No data left behind.
- Nightly Exports Are Too Slow - medium - Healthcare claims change constantly. The warehouse cannot fall behind.
- 4,500 Stores Before Sunrise - medium - The shelves open at 7. The data better be there.
- Not Every Team Can See Every Row - hard - Everyone can see the bucket. Not everyone should.
- One Bill Across Three Clouds - medium - AWS, Azure, GCP. Three bills. One truth.
- One Earthquake, Ten Thousand Tweets - hard - The firehose is on. Separate signal from noise.
- Out of the Data Center - medium - The on-prem servers are not getting any younger.
- The Speed Layer - medium - Dashboards can't wait for raw logs. Something has to happen upstream.
- Prove the Number Is Right - hard - Bad data in fintech is not just messy. It is expensive.
- Real Data, Fake Patients - hard - Dev needs production data. HIPAA says absolutely not.
- The Register Never Sleeps - medium - Every swipe lands in the warehouse. The table has to stay current without breaking.
- Recommendations Now, Royalties Later - medium - The catalog updated. Did anyone notice?
- Replicate It Without Breaking It - hard - The source changed. The lake needs to know immediately.
- Risk Models on Week-Old Data - medium - Loan approved. Loan denied. Every decision is an event.
- SaaS API Connector with Incremental Sync - medium - The API has rate limits. You have deadlines.
- Same-Day Sales, Every Store - medium - The cash register data needs to be queryable by morning.
- The Living Table - medium - Data lands continuously. History must survive every update.
- Score It Before It Clears - hard - The fraudsters move fast. Your pipeline has to move faster.
- Seconds and Months - medium
- Seconds to Trend - medium
- Ship Before Fraud Finishes Checking - hard - The claim looks clean. The fraud model disagrees.
- Six Hours to Miss a Deadline - medium - The rebuild works. It just doesn't finish in time.
- Six Hours to Refresh Every Number - medium - Ratings change. The incremental model has to keep pace.
- Six Million Rows Before the Market Opens - medium - One massive CSV. Millions of timestamps.
- Six Sources, One Platform - medium - ADF orchestrates. Unity Catalog governs. Nothing leaks.
- Sixty Minutes, Every Hour - medium - Every hour, on the hour. No excuses.
- Stores and the Site, Together - hard - The registers never stop ringing.
- Store, Site, and Distributor - medium - Sales data is piling up. Someone has to make sense of it.
- The Acquisition Still Taking Bookings - hard - Two systems, two schemas. One truth.
- The Agency That Changes the Columns - medium - The schema changed overnight. Again.
- The Analysts Cannot Touch Production - medium - Production is the source. Analytics needs its own copy.
- The Analyst Who Saw the Salary Data - hard - Two incidents. One shared lake. The access model was never designed, just assumed.
- The API Drip Feed - medium - The API gives you 100 records at a time. You need millions.
- The Bad Row That Broke the Dashboard - medium - Bad records cannot reach the warehouse.
- The Binding and the Claim - medium - Policies are instant. Claims take their time.
- The Booking That Came Three Ways - hard - PMS, OTA, and website all think they took the reservation first.
- The Boutique That Sold in Six Currencies - hard - Every sale is real. The rate it was converted at depends on who is asking.
- The Bucket Full of Resumes - medium - A thousand resumes. Structured data inside each one.
- The Carrier Moving to Azure - medium - Claims arrive messy. The medallion cleans them up.
- The Claim That Picks Its Own Lane - medium - Three entry points. Different workflows. All must route correctly.
- The Clicks We Throw Away - hard - Every tap, swipe, and scroll. At scale.
- The Clock That Runs Two Ways - hard - Nightly batch and live events. One dashboard.
- The Consent Stitcher - medium - Consent was given. Or was it? Stitch the records together.
- The Dashboard and the Attribution Model - hard - Streaming and batch. One pipeline to rule them.
- The Decision Before the Door Closes - hard - The window to stop it is smaller than you think.
- The Distributor Filing Problem - medium - Hundreds of suppliers. One warehouse. One deadline.
- The Early Warning - medium
- The Event Pile - hard - 600 million clicks a day. The budget is not infinite.
- The Fare Aggregator - medium - Airfares shift every minute. Catch the best ones.
- The Fleet That Never Stops - hard - Every truck is talking. Not everyone can hear them yet.
- The Leaderboard That Costs $25K a Month - hard - Product wants it live. Engineering has a price tag.
- The Meal Kit That Knows You - medium - What they ordered says a lot about what they want next.
- The Migration That Cannot Break Morning - hard - It all works today. Moving it without losing a single report is the hard part.
- The Models Going Stale - hard - The model is only as good as what you feed it.
- The Panel and the Set-Top Boxes - hard - Set-top boxes tell you who watched. Projection tells you how many.
- The Patients We Cannot Move - hard - Patient data stays local. Insights have to be global.
- The Points Arrive Two Days Late - medium - The bank data shows up late. The rewards were already sent.
- The Provider That Sometimes Sleeps - medium - The models run at dawn. The data has to be there first.
- The Query That Used to Be Fast - medium - Queries used to be fast. Something changed.
- The Queue That Wouldn't Stop Growing - medium - 500,000 messages behind and the number keeps climbing.
- The Revenue That Was Wrong for Two Weeks - medium - Nobody caught it until the CFO asked a question. Design the system that catches it first.
- The Sale That Needs to Land Now - medium - Three channels feeding one view. Not all of them speak the same language.
- The Same Stream Twice - hard
- The Signals That Power Recommendations - medium - Fresh signals, many teams, one pipeline.
- Counted Once, Remembered Forever - hard
- The User Who Asked to Be Forgotten - hard - Users want their data erased. Completely.
- The Vendor Who Never Warns You - medium - Every month, something is different. The dashboards have no idea.
- The What-If Machine - hard - A million slots. A thousand campaigns. Every combination matters.
- The Whiteboard Exercise - medium - Marker in hand. Draw the whole thing.
- Thirty Cities, One Forecast - hard - Five cities. Five data formats. One prediction.
- Thirty Countries, One Solvency Number - hard - Premiums collected globally. Losses happen locally.
- Thirty Million Unique Jobs a Year - hard - One press run, many orders. Group them right.
- Thousands of Practices, One Dataset - hard - Patient records in, operational insights out.
- Three Providers, One Workout - hard - The same ride, reported three times.
- Three Regions, One Finance Team - hard - Payments from everywhere. One consistent report.
- Three Regions, One Report - hard - Three regions, billions of payments, one merchant summary by 6 AM.
- Towers and Phones, Same Story - hard - Tower signals meet app events. Somewhere in between is the truth.
- Traders, Risk, and the Regulators - medium - Markets move in milliseconds. The pipeline has to keep up.
- Two Hundred Million Redirects - medium - Billions of clicks. One tiny code. Two very different clocks.
- Two Million Boxes by Monday Morning - hard - Shipped, maybe. Delivered, debatable.
- Two Sources of Truth - medium
- Two Systems, One Room Count - hard - Two booking systems. Rooms do not duplicate themselves.
- Two Ways to Catch a Change - medium - Two ways to watch the database. Each has a cost.
- Two Years of Every Click - hard - Every click, every aisle, every day for two years.
- Two Years of Clicks, Cheap - hard - Two years of clicks. Every query has to be affordable.
- What Everyone Is Watching - hard - Someone is watching. Capture everything.
- What Should We Recommend Tonight - hard - They ordered pad thai twice. That means something.
- Where Is Every Truck, Right Now - medium - Trucks are moving. Every ping counts.
- Which Promotion Is Actually Working - hard - Was the promotion worth it? The data knows.
- Who Is Churning and Why - medium - Subscribers churn. The pipeline cannot.
- Who Saw the Ad Twice - hard - TV and digital. Same viewer, two measurement worlds.