Data pipeline design interview questions for data engineer roles. End-to-end design across ingest, transform, and serve layers. Streaming and batch architectures. Idempotent transformation with MERGE INTO. Multi-source ingestion. Multi-region replication. The architecture rounds that data engineer L5+ loops expect.
Data pipeline design interview questions span ingest, transform, and serve layers end-to-end. The ingest layer captures data from sources (CDC, event streams, batch dumps, API pulls); the transform layer applies dedup, conformed-dimension joins, business rules, and aggregation; the serve layer makes the transformed data accessible (analytical warehouse for BI, online feature store for ML serving, materialized view for dashboards). Each layer has design questions that a senior data engineer is expected to answer.
Ingest layer design questions. Source type: transactional database (use CDC via Debezium), event stream (consume Kafka/Kinesis/Pub/Sub directly), API (use scheduled batch pulls with a high-water mark), file drop (use an S3 event notification to trigger a Spark job). Throughput sizing: events per second, peak factor, bytes per event. For Kafka, plan on roughly 10-20 MB/sec per partition as a rule of thumb; the real figure depends on hardware, message size, and configuration. For Kinesis, the write limit is 1 MB/sec per shard. Durability: replication factor (3 for Kafka), at-least-once delivery, snapshot recovery for new sources. Schema: registry-enforced contracts, additive-only evolution, raw payload preserved in bronze for replay.
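The throughput sizing above reduces to a short calculation. The sketch below is illustrative only: the function name `required_units` and the workload numbers (50,000 events/sec, 1 KB per event, 3x peak factor) are assumptions, and the per-partition figure uses the low end of the Kafka rule of thumb quoted above.

```python
import math


def required_units(events_per_sec: float, bytes_per_event: int,
                   peak_factor: float, unit_mb_per_sec: float) -> int:
    """Partitions (or shards) needed to absorb peak write throughput."""
    peak_bytes_per_sec = events_per_sec * bytes_per_event * peak_factor
    peak_mb_per_sec = peak_bytes_per_sec / 1_000_000
    return math.ceil(peak_mb_per_sec / unit_mb_per_sec)


# Hypothetical workload: 50,000 events/sec average, 1 KB events, 3x peak.
kafka_partitions = required_units(50_000, 1_000, 3.0, unit_mb_per_sec=10.0)
kinesis_shards = required_units(50_000, 1_000, 3.0, unit_mb_per_sec=1.0)

print(kafka_partitions, kinesis_shards)  # 15 150
```

In an interview, stating the peak figure first (here 150 MB/sec) and then dividing by the per-unit limit makes the partition count easy to defend. Kinesis also caps writes at 1,000 records/sec per shard, so small events can make the record limit the binding constraint rather than bytes.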
Transform layer design questions. Compute engine: Spark for distributed batch, Flink for stateful streaming, dbt for in-warehouse SQL transformations, custom Python for lightweight glue. Idempotency: run_id baked into output partitions, MERGE INTO on composite natural keys, append-only with version column for slowly-changing facts. Late-arriving data: an additive MERGE for windowed aggregations (add the late delta to the existing aggregate rather than replacing it), watermark plus allowed lateness for streaming, backfill plan for batch. Dependency management: an orchestrator (Airflow, Dagster) handles cross-pipeline dependencies; downstream pipelines wait on upstream completion via sensors or asset-based triggers.
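The idempotency point can be shown with a small runnable sketch. It uses SQLite's `INSERT ... ON CONFLICT DO UPDATE` (available in SQLite 3.24 and later) as a stand-in for a warehouse `MERGE INTO`; the table `fact_order_line`, its columns, and the helper `merge_batch` are hypothetical names chosen for illustration.

```python
import sqlite3


def merge_batch(conn: sqlite3.Connection, rows, run_id: str) -> None:
    """Upsert rows keyed on the composite natural key (order_id, line_no)."""
    conn.executemany(
        """
        INSERT INTO fact_order_line (order_id, line_no, amount, run_id)
        VALUES (?, ?, ?, ?)
        ON CONFLICT (order_id, line_no) DO UPDATE SET
            amount = excluded.amount,
            run_id = excluded.run_id
        """,
        [(order_id, line_no, amount, run_id)
         for (order_id, line_no, amount) in rows],
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE fact_order_line (
        order_id INTEGER NOT NULL,
        line_no  INTEGER NOT NULL,
        amount   REAL    NOT NULL,
        run_id   TEXT    NOT NULL,
        PRIMARY KEY (order_id, line_no)
    )
    """
)

batch = [(1, 1, 10.0), (1, 2, 5.5), (2, 1, 7.25)]
merge_batch(conn, batch, run_id="2024-01-01")
merge_batch(conn, batch, run_id="2024-01-01")  # retry of the same run

row_count, total = conn.execute(
    "SELECT COUNT(*), SUM(amount) FROM fact_order_line"
).fetchone()
print(row_count, total)  # 3 22.75
```

Because the write is keyed on the natural key, a retried run overwrites rows with identical values instead of appending duplicates, so the row count and the total are the same after one run or two.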
Serve layer design questions. Analytical warehouse for BI: Snowflake, BigQuery, Redshift, or Databricks, with star schemas in the gold layer. Online feature store for ML: Redis or DynamoDB for read latency on the order of 10 ms, populated by batch backfill. Materialized view for dashboards: Snowflake materialized views, BigQuery materialized views, or pre-computed gold tables refreshed by dbt. Multi-region serving: read replicas with eventual consistency, or active-active with CRDT-based conflict resolution. Caching: a CDN in front of static content, ElastiCache in front of dynamic queries. Operational concerns: SLA monitoring, query performance dashboards, cost attribution per consumer.
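For the active-active option, a grow-only counter (G-Counter) is one of the simplest CRDTs and shows why replicas converge without coordination. This is a minimal sketch, not a production design: the class `GCounter` and the region names are illustrative assumptions, and real serving stores usually need richer CRDT types than a counter.

```python
from dataclasses import dataclass, field


@dataclass
class GCounter:
    """Grow-only counter: one slot per region, merged by element-wise max."""
    counts: dict = field(default_factory=dict)

    def increment(self, region: str, n: int = 1) -> None:
        # Each region only ever increments its own slot.
        self.counts[region] = self.counts.get(region, 0) + n

    def merge(self, other: "GCounter") -> "GCounter":
        # Max per slot is commutative, associative, and idempotent,
        # so replicas converge regardless of merge order or repetition.
        regions = self.counts.keys() | other.counts.keys()
        return GCounter({
            r: max(self.counts.get(r, 0), other.counts.get(r, 0))
            for r in regions
        })

    @property
    def value(self) -> int:
        return sum(self.counts.values())


us = GCounter()
eu = GCounter()
us.increment("us-east", 3)  # writes accepted locally in each region
eu.increment("eu-west", 2)

merged_us = us.merge(eu)  # each replica receives the other's state
merged_eu = eu.merge(us)
print(merged_us.value, merged_eu.value)  # 5 5
```

The design choice worth naming in an interview is that the merge function, not a lock or a leader, resolves conflicts; merging the same state twice changes nothing.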
The 45-60 minute data pipeline design round expects the data engineer to cover all three layers in the time allotted, with explicit failure-mode articulation at each. Pacing is critical: 10 minutes on the ingest layer, 15 minutes on transform, 10 minutes on serve, with the remaining 10-25 minutes for failure-mode drills and adapt-on-the-fly pivots. Spending 30 minutes on the ingest layer at the expense of transform and serve is a pacing failure even if the ingest design is excellent.
Companies that emphasize end-to-end data pipeline design in data engineer interviews include Netflix (streaming-heavy with Iceberg and Spark), Stripe (idempotent reconciliation across all three layers), Meta (large-scale clickstream and ads attribution), Amazon (AWS-native end-to-end), Google (GCP-native with Pub/Sub, Dataflow, and BigQuery), and Uber (Kafka, Flink, and Pinot for real-time serving, plus Spark for batch).
Data Pipeline Design Interview Questions
End-to-end pipeline design problems for data engineer interview prep.
123 practice problems matching this filter. Difficulty: medium (57), hard (66).
Pipeline Architecture (123)
- 45 Minutes Turned Into 3.5 Hours - medium - Spark jobs are running. Just not fast enough.
- 600 Million Events a Day - hard - 600 million events a day. Two years of retention.
- A Clean Number for Every Merchant - hard - Raw payment logs in. Clean merchant summaries out.
- A Million Cars Phoning Home - hard - Every vehicle is a sensor. Deploy the pipeline to catch it all.
- Analysts Are Slowing the Store Down - medium - Orders placed. Data warehouse hungry.
- A New Column on a Billion Rows - hard - Add a new column to a billion-row production table and backfill it with zero downtime.
- A Shared Drive Full of Contracts - medium - Buried in PDFs. The data is in there somewhere.
- A Stream All Day and a File at Midnight - hard - Real-time and batch. Same pipeline. No compromises.
- Badging Items That Already Sold Out - hard - Same-day delivery. The features have to be faster.
- Basel, CCAR, and Monday Morning - medium - The regulator does not accept 'eventually consistent.'
Common questions
- What does a data pipeline design interview round cover?
- End-to-end design across ingest, transform, and serve layers in 45-60 minutes. Typical scenarios: a clickstream at 10B events per day, a daily Postgres-to-Snowflake ETL, or an ML feature store. Senior data engineer rubrics weight idempotency at each layer, failure-mode articulation, cost reasoning, and adapting on the fly when the interviewer flips a requirement.
- How does a data engineer pace a 45-minute pipeline design round?
- 10 minutes ingest, 15 minutes transform, 10 minutes serve. That leaves 10 minutes of a 45-minute round (up to 25 minutes of a 60-minute round) for failure-mode drills and pivots. Spending 30 minutes on the ingest layer at the expense of transform and serve is a pacing failure even if the ingest design is excellent.
- What is the ingest layer in a data pipeline?
- The layer that captures data from sources. Mechanisms: CDC via Debezium for transactional databases, Kafka/Kinesis/Pub/Sub for event streams, scheduled batch SELECT for warehouses, API pulls for SaaS sources, S3 event triggers for file drops. Sizing: throughput in events per second, peak factor, bytes per event.
- What is the transform layer in a data pipeline?
- The layer that applies dedup, conformed-dimension joins, business rules, and aggregation. Compute engines: Spark for distributed batch, Flink for stateful streaming, dbt for in-warehouse SQL, custom Python for lightweight glue. Idempotency design (run_id, MERGE INTO) is the senior data engineer rubric item.
- What is the serve layer in a data pipeline?
- The layer that makes transformed data accessible. Analytical warehouse for BI (Snowflake, BigQuery). Online feature store for ML (Redis, DynamoDB). Materialized view for dashboards. Multi-region serving for global products. Caching, SLA monitoring, and cost attribution are operational concerns.
- How does a data engineer handle multi-source ingestion?
- One bronze layer per source type (transactional, event stream, batch, API), with consistent metadata: load_date, source_system, ingestion_method, raw_payload. Silver layer applies conformed dimensions across sources (one dim_customer used by orders from Postgres, events from Kafka, leads from Salesforce). Gold layer presents unified analytical models.
- What is the typical failure mode in pipeline design rounds?
- Not articulating failure modes at every component. A high-level architecture without 'what happens when Kafka brokers die, when Spark executors OOM, when Snowflake MERGE deadlocks' falls short of the L5 rubric. The L4 candidate produces a working architecture; the L5 candidate names 3 failure modes per component proactively.
- How does a data engineer prepare for the adapt-on-the-fly pivot?
- Practice with a peer or AI mock interviewer that flips a requirement mid-round. Common flips: SLA tightens from 15 min to 1 min, volume jumps 100x, downstream BI tool cannot handle table swaps, multi-region requirement added. The L5 signal is modifying the existing design in place and articulating what changes; the L4 signal is restarting from scratch.
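The multi-source ingestion answer above lists four metadata fields shared by every bronze table: load_date, source_system, ingestion_method, and raw_payload. A minimal sketch of that envelope follows; the helper `to_bronze_record` and the source names `postgres_orders` and `kafka_clickstream` are hypothetical, and storing the payload as a JSON string is an assumption made so the example stays self-contained.

```python
import json
from datetime import date


def to_bronze_record(payload: dict, source_system: str,
                     ingestion_method: str, load_date: date) -> dict:
    """Wrap a source payload in the shared bronze metadata envelope."""
    return {
        "load_date": load_date.isoformat(),
        "source_system": source_system,
        "ingestion_method": ingestion_method,
        # Keep the payload untouched (serialized as-is) so it can be replayed.
        "raw_payload": json.dumps(payload, sort_keys=True),
    }


records = [
    to_bronze_record({"order_id": 1, "total": 20.0},
                     "postgres_orders", "cdc", date(2024, 1, 1)),
    to_bronze_record({"event": "click", "user": "u1"},
                     "kafka_clickstream", "stream", date(2024, 1, 1)),
]

for record in records:
    print(record["source_system"], record["ingestion_method"])
```

Both records share the same four keys even though their payloads differ, which is what lets the silver layer apply conformed dimensions across sources without per-source parsing of the metadata.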