Data pipeline practice problems for data engineer interview prep. End-to-end design across ingest, transform, and serve layers. Scenarios include 10B-event-per-day clickstream, daily warehouse load, ML feature pipeline, payment reconciliation, multi-region active-active warehouse. Rubric-scored verdicts mirror the senior data engineer pipeline design rubric.
Data pipeline practice problems for data engineer roles cover six scenarios that recur in 2026 interview reports. Each problem is a 60-75 minute exercise: 45-60 minutes for the design, 15 minutes for rubric review. The rubric scores ingest design, transform design, serve design, idempotency at each layer, failure-mode articulation, and cost reasoning.
Scenario one: 10B-event-per-day clickstream. Web SDK to local buffer to CDN-fronted ingest to Kafka (24 partitions) to Spark Structured Streaming (1-min trigger) to Parquet on S3 (date/hour partition) to dbt micro-batch hourly to gold star schemas in Snowflake. Failure modes: SDK buffer overflow, CDN edge failure, Kafka broker death, consumer lag, late-arriving events. Cost: Kafka brokers, Spark workers, Snowflake credits per query.
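A back-of-envelope sizing pass is a reasonable first step for this scenario. The sketch below takes the 10B events per day and 24 partitions from the scenario; the average event size and the peak-to-average factor are hypothetical inputs, not figures from any rubric.

```python
EVENTS_PER_DAY = 10_000_000_000  # from the scenario
PARTITIONS = 24                  # from the scenario
SECONDS_PER_DAY = 86_400


def sizing(avg_event_bytes: int, peak_factor: float) -> dict:
    """Rough throughput and volume numbers for the clickstream scenario.

    avg_event_bytes and peak_factor are assumptions supplied by the caller.
    """
    avg_eps = EVENTS_PER_DAY / SECONDS_PER_DAY
    peak_eps = avg_eps * peak_factor
    return {
        "avg_events_per_sec": avg_eps,
        "peak_events_per_sec": peak_eps,
        "peak_events_per_sec_per_partition": peak_eps / PARTITIONS,
        "peak_mb_per_sec_per_partition": peak_eps * avg_event_bytes / PARTITIONS / 1e6,
        "raw_tb_per_day": EVENTS_PER_DAY * avg_event_bytes / 1e12,
    }


if __name__ == "__main__":
    # Assumed: 500-byte events and a 3x peak over the daily average.
    for name, value in sizing(avg_event_bytes=500, peak_factor=3.0).items():
        print(f"{name}: {value:,.2f}")
```

The daily average works out to roughly 115,741 events per second regardless of the assumptions; the per-partition peak is the number to defend when asked whether 24 partitions is enough.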
Scenario two: daily Postgres-to-Snowflake CDC. Debezium to Kafka to S3 raw immutable to Spark daily ETL to Snowflake MERGE INTO. Idempotency via run_id and composite-key MERGE. Failure modes: Debezium falls behind, schema change at source, dedup misses late event, MERGE deadlock. Backfill via insert-overwrite per partition or MERGE with fresh run_id.
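The idempotency claim can be checked with a small in-memory model of the MERGE step. This is a sketch, not Snowflake or Debezium code: the composite key (order_id, line_no), the record fields, and the use of a log sequence number (lsn) to pick the latest change are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Tuple

Key = Tuple[int, int]  # hypothetical composite key: (order_id, line_no)


def latest_per_key(changes: Iterable[dict]) -> Dict[Key, dict]:
    """Dedup step: keep only the change with the highest lsn for each key."""
    latest: Dict[Key, dict] = {}
    for change in changes:
        key = (change["order_id"], change["line_no"])
        if key not in latest or change["lsn"] > latest[key]["lsn"]:
            latest[key] = change
    return latest


def merge(target: Dict[Key, dict], changes: Iterable[dict], run_id: str) -> Dict[Key, dict]:
    """Model of a composite-key MERGE: upsert or delete, skipping stale changes."""
    for key, change in latest_per_key(changes).items():
        current = target.get(key)
        if current is not None and current["lsn"] >= change["lsn"]:
            continue  # already applied by an earlier run, or older than the target row
        if change["op"] == "d":
            target.pop(key, None)
        else:
            target[key] = {"qty": change["qty"], "lsn": change["lsn"], "run_id": run_id}
    return target


if __name__ == "__main__":
    batch = [
        {"order_id": 1, "line_no": 1, "op": "c", "qty": 2, "lsn": 10},
        {"order_id": 1, "line_no": 1, "op": "u", "qty": 3, "lsn": 12},
        {"order_id": 1, "line_no": 2, "op": "c", "qty": 1, "lsn": 11},
        {"order_id": 1, "line_no": 2, "op": "d", "qty": None, "lsn": 13},
    ]
    table: Dict[Key, dict] = {}
    merge(table, batch, run_id="run-1")
    merge(table, batch, run_id="run-2")  # replay of the same batch changes nothing
    print(table)
```

Replaying the same batch leaves the target unchanged, which is the property an interviewer is usually probing when asking about retries and backfills.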
Scenario three: ML feature store. Real-time path uses Flink to Redis with 10ms reads. Batch path uses Spark to S3 Parquet to Feast catalog. Training uses as-of joins with feature_ts less-than-or-equal-to label_ts to prevent leakage. Failure modes: Flink and Spark compute features differently, Redis OOM, training-serving skew. Solution: centralized feature library shared between both paths, eviction policy on Redis, feature distribution monitoring with drift alerts.
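The as-of join rule (feature_ts less-than-or-equal-to label_ts) is easy to get subtly wrong, so a minimal standard-library sketch may help. The tuple layouts and the function name are hypothetical; a real pipeline would typically do this in Spark or through the feature store's own retrieval API.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Tuple

Feature = Tuple[str, int, float]  # (entity_id, feature_ts, value)
Label = Tuple[str, int, int]      # (entity_id, label_ts, label)


def as_of_join(labels: List[Label], features: List[Feature]) -> List[Tuple[str, int, int, Optional[float]]]:
    """Attach to each label the latest feature value with feature_ts <= label_ts."""
    timestamps: Dict[str, List[int]] = {}
    values: Dict[str, List[float]] = {}
    # Sort by (entity, timestamp) so each entity's timestamps are ascending.
    for entity_id, feature_ts, value in sorted(features, key=lambda f: (f[0], f[1])):
        timestamps.setdefault(entity_id, []).append(feature_ts)
        values.setdefault(entity_id, []).append(value)

    joined = []
    for entity_id, label_ts, label in labels:
        ts_list = timestamps.get(entity_id, [])
        # bisect_right counts the features with feature_ts <= label_ts.
        idx = bisect_right(ts_list, label_ts) - 1
        feature_value = values[entity_id][idx] if idx >= 0 else None
        joined.append((entity_id, label_ts, label, feature_value))
    return joined


if __name__ == "__main__":
    feats = [("u1", 100, 0.1), ("u1", 200, 0.2), ("u1", 300, 0.3)]
    labs = [("u1", 50, 0), ("u1", 200, 1), ("u1", 250, 0)]
    for row in as_of_join(labs, feats):
        print(row)
```

A label that predates every feature gets None rather than a future value; silently picking the nearest feature in either direction is the leakage the rule is meant to prevent.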
Scenario four: daily payment reconciliation. Postgres transactions via Debezium to Kafka to S3 bronze. Settlement reports delivered via SFTP nightly land in S3. Spark daily reconciliation reads both, joins on (txn_id, settlement_id), produces reconciled fact with status and discrepancy. Snowflake MERGE on (txn_id, run_id). Failure modes: settlement file late or missing, discrepancy alert false positive, MERGE deadlock with concurrent writer.
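The reconciliation join can be sketched as a full outer join on (txn_id, settlement_id). The field names, the status labels, and the choice of integer cents are assumptions for illustration; the scenario only specifies the join key and that the output carries a status and a discrepancy.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, str]  # (txn_id, settlement_id), the join key from the scenario


def _index(rows: List[dict]) -> Dict[Key, int]:
    """Map each (txn_id, settlement_id) to its amount in integer cents."""
    return {(r["txn_id"], r["settlement_id"]): r["amount_cents"] for r in rows}


def reconcile(transactions: List[dict], settlements: List[dict]) -> List[dict]:
    """Full outer join; discrepancy is settled amount minus ledger amount."""
    ledger = _index(transactions)
    settled = _index(settlements)
    result = []
    for key in sorted(ledger.keys() | settled.keys()):
        ledger_amt = ledger.get(key)
        settled_amt = settled.get(key)
        if ledger_amt is None:
            status, discrepancy = "missing_transaction", settled_amt
        elif settled_amt is None:
            status, discrepancy = "missing_settlement", -ledger_amt
        elif ledger_amt == settled_amt:
            status, discrepancy = "matched", 0
        else:
            status, discrepancy = "amount_mismatch", settled_amt - ledger_amt
        result.append({
            "txn_id": key[0],
            "settlement_id": key[1],
            "status": status,
            "discrepancy_cents": discrepancy,
        })
    return result
```

Keeping unmatched rows on both sides matters for the late-or-missing settlement file case: a missing file then shows up as a spike in missing_settlement rows rather than as an empty output.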
Scenario five: multi-region active-active warehouse. Region-local writes with async cross-region CDC replication. Conflict resolution via last-writer-wins for ordered data or CRDTs for counters. SLA tiers: real-time within a region, eventually consistent across regions. Cost: at least 2x storage. Failure modes: region failure (regional failover, RPO and RTO), cross-region lag, conflict resolution edge cases. Most companies do not need multi-region, but senior data engineer (L6+) rubrics test the design anyway.
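The two conflict-resolution strategies named above can be illustrated with a minimal sketch. The tuple layout for versioned values, the region names, and the tie-break on region id are assumptions; the counter is a standard grow-only counter (G-Counter), one of the simplest CRDTs.

```python
from typing import Dict, Optional, Tuple

Versioned = Tuple[int, str, str]  # hypothetical layout: (write_ts_ms, region, value)


def lww_merge(a: Versioned, b: Versioned) -> Versioned:
    """Last-writer-wins: highest timestamp wins; region id breaks exact ties."""
    return max(a, b, key=lambda v: (v[0], v[1]))


class GCounter:
    """Grow-only counter: each region increments only its own slot."""

    def __init__(self, counts: Optional[Dict[str, int]] = None) -> None:
        self.counts: Dict[str, int] = dict(counts or {})

    def increment(self, region: str, n: int = 1) -> None:
        if n < 0:
            raise ValueError("a grow-only counter cannot be decremented")
        self.counts[region] = self.counts.get(region, 0) + n

    def merge(self, other: "GCounter") -> "GCounter":
        # Per-region max makes merge commutative, associative, and idempotent.
        regions = self.counts.keys() | other.counts.keys()
        return GCounter({
            r: max(self.counts.get(r, 0), other.counts.get(r, 0)) for r in regions
        })

    def value(self) -> int:
        return sum(self.counts.values())


if __name__ == "__main__":
    us, eu = GCounter(), GCounter()
    us.increment("us-east", 3)
    eu.increment("eu-west", 2)
    print(us.merge(eu).value())
    print(lww_merge((100, "us-east", "a"), (105, "eu-west", "b")))
```

Last-writer-wins depends on comparable timestamps across regions, so clock skew is one of the edge cases worth raising; the counter merge avoids that dependency because it never compares clocks.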
Scenario six: real-time analytics dashboard. Spark Structured Streaming with 1-minute trigger reads Kafka, aggregates, writes to Druid or ClickHouse for sub-second BI query latency. Hourly Spark batch from Kafka to S3 to Snowflake for historical. Materialized views in Snowflake refresh on schedule for dashboard queries. Failure modes: Druid hot partition, Spark micro-batch lag, dashboard query timeout on cold cache.
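The scenario fixes a 1-minute trigger but does not say how the aggregates are bucketed. Assuming 1-minute event-time buckets keyed by a hypothetical page dimension, the aggregation each micro-batch performs can be sketched in plain Python as follows; this models the logic only and is not Spark, Druid, or ClickHouse code.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

WINDOW_SECONDS = 60  # assumed bucket size; the scenario only fixes the trigger interval

Event = Tuple[int, str]      # (event_ts_seconds, page)
WindowKey = Tuple[int, str]  # (window_start_seconds, page)


def window_start(event_ts: int) -> int:
    """Floor an event timestamp to the start of its 1-minute bucket."""
    return event_ts - event_ts % WINDOW_SECONDS


def aggregate(events: Iterable[Event]) -> Dict[WindowKey, int]:
    """Count events per (window, page) for one micro-batch."""
    counts: Counter = Counter()
    for event_ts, page in events:
        counts[(window_start(event_ts), page)] += 1
    return dict(counts)


def merge_counts(store: Dict[WindowKey, int], batch: Dict[WindowKey, int]) -> Dict[WindowKey, int]:
    """Fold one micro-batch's partial counts into the serving store."""
    for key, count in batch.items():
        store[key] = store.get(key, 0) + count
    return store
```

Folding two micro-batches gives the same totals as aggregating all events at once, which is what lets a late event land in an already-published bucket. Additive merging is not replay-safe on its own, so a retried micro-batch would double count unless the write is deduplicated by batch id or replaced per window.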
Each problem ships with a rubric verdict that identifies what scored well and what was missed. The gap analysis between candidate solution and rubric verdict is where practice value compounds.
Data Pipeline Practice Problems
End-to-end data pipeline practice problems for data engineer interview prep.
123 practice problems matching this filter. Difficulty: medium (57), hard (66).
Pipeline Architecture (123)
- 45 Minutes Turned Into 3.5 Hours - medium - Spark jobs are running. Just not fast enough.
- 600 Million Events a Day - hard - 600 million events a day. Two years of retention.
- A Clean Number for Every Merchant - hard - Raw payment logs in. Clean merchant summaries out.
- A Million Cars Phoning Home - hard - Every vehicle is a sensor. Deploy the pipeline to catch it all.
- Analysts Are Slowing the Store Down - medium - Orders placed. Data warehouse hungry.
- A New Column on a Billion Rows - hard - Add and backfill a new column to a billion-row production table with zero downtime.
- A Shared Drive Full of Contracts - medium - Buried in PDFs. The data is in there somewhere.
- A Stream All Day and a File at Midnight - hard - Real-time and batch. Same pipeline. No compromises.
- Badging Items That Already Sold Out - hard - Same-day delivery. The features have to be faster.
- Basel, CCAR, and Monday Morning - medium - The regulator does not accept 'eventually consistent.'
Common questions
- How are data pipeline practice problems graded?
- Rubric-scored on six dimensions: ingest layer design (mechanism, sizing, durability), transform layer design (engine choice, idempotency, late-arriving handling), serve layer design (warehouse, feature store, materialized view), failure-mode articulation per component, cost reasoning, and adapt-on-fly. Multiple valid architectures score well if the data engineer can defend each component choice.
- How long does a data pipeline practice problem take?
- 60-75 minutes including rubric review. 45-60 minutes for the design exercise, 15 minutes for the rubric verdict comparison and gap analysis.
- What scenarios are most common in data pipeline practice?
- Six recur most: 10B-event-per-day clickstream, daily Postgres-to-Snowflake CDC, ML feature store with online and offline, daily payment reconciliation, multi-region active-active warehouse, real-time analytics dashboard. Each appears across multiple companies' data engineer interview reports.
- Do these practice problems test specific cloud vendors?
- Most stay vendor-neutral or offer multiple variants. AWS-native variants for Amazon prep (Kinesis, Glue, EMR, S3, Athena, Redshift). GCP-native variants for Google prep (Pub/Sub, Dataflow, BigQuery, Dataproc). Spark+Iceberg variants for Netflix and Databricks prep. Practice both AWS-native and GCP-native variants of clickstream and CDC.
- What is the most common failure mode in pipeline practice?
- Spending too long on the high-level architecture and running out of time for failure-mode drills. The rubric weights failure modes per component at 20 percent; skipping them is a guaranteed sub-L5 score. Pace the round: high-level in 15-20 minutes, failure modes for 20-25 minutes.
- How many pipeline practice problems should I solve before a senior data engineer onsite?
- Six well-practiced scenarios, one for each recurring scenario shape, beat fifteen rushed, similar ones. Aim for one of each over 3 weeks. The signal interviewers test is whether the data engineer can transfer the pattern to a new combination of source, transform, and serve.
- Do these practice problems include the cost reasoning at L5+ rubric weight?
- Yes. Each problem's rubric verdict includes back-of-envelope cost numbers: Kafka brokers at $X per broker-hour, Spark workers at $Y per worker-hour, Snowflake credits at $Z per credit, S3 storage class trade-offs. Compare your numbers to the rubric's; order-of-magnitude correctness counts.
- How does the adapt-on-fly part of practice work?
- Each problem's rubric includes a 'pivot' section: a requirement that changes halfway through (SLA tightens, volume jumps, downstream constraint added). Walk through how your design would change. The L5 signal is articulating what changes and what stays in the existing design without throwing it out.