Data pipeline design interview questions for data engineer roles. End-to-end design across ingest, transform, and serve layers. Streaming and batch architectures. Idempotent transformation with MERGE INTO. Multi-source ingestion. Multi-region replication. The architecture rounds that data engineer L5+ loops expect.

Data pipeline design interview questions span ingest, transform, and serve layers end-to-end. The ingest layer captures data from sources (CDC, event streams, batch dumps, API pulls); the transform layer applies dedup, conformed-dimension joins, business rules, and aggregation; the serve layer makes the transformed data accessible (analytical warehouse for BI, online feature store for ML serving, materialized view for dashboards). Each layer has design questions that a senior data engineer is expected to answer.

Ingest layer design questions. Source type determines the mechanism: transactional database (CDC via Debezium), event stream (consume from Kafka, Kinesis, or Pub/Sub directly), API (scheduled batch pulls tracked by a high-water mark), file drop (S3 event notification that triggers a Spark job). Throughput sizing: events per second, peak factor, bytes per event. As rules of thumb, plan for 10-20 MB/sec per Kafka partition and 1 MB/sec of writes per Kinesis shard. Durability: replication factor (3 for Kafka), at-least-once delivery, snapshot recovery for new sources. Schema: registry-enforced contracts, additive-only evolution, raw payload preserved in bronze for replay.
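The throughput sizing above reduces to simple arithmetic. The sketch below is a rough estimate only: the workload (10B events per day, 3x peak factor, 1 KB per event) is a hypothetical example, `required_units` is an illustrative helper name, and 1 MB is taken as 1,000,000 bytes. It ignores per-record limits, consumer-side throughput, and headroom for growth.

```python
import math

def required_units(events_per_sec: float, peak_factor: float,
                   bytes_per_event: int, unit_mb_per_sec: float) -> int:
    """Partitions or shards needed to absorb peak write throughput."""
    peak_bytes_per_sec = events_per_sec * peak_factor * bytes_per_event
    peak_mb_per_sec = peak_bytes_per_sec / 1_000_000  # 1 MB = 10^6 bytes here
    return max(1, math.ceil(peak_mb_per_sec / unit_mb_per_sec))

# Hypothetical workload: 10B events/day, 3x peak factor, 1 KB per event.
avg_events_per_sec = 10_000_000_000 / 86_400  # about 115,741 events/sec

# Conservative end of the 10-20 MB/sec per-partition rule of thumb.
kafka_partitions = required_units(avg_events_per_sec, 3, 1_000, 10)

# 1 MB/sec of writes per Kinesis shard.
kinesis_shards = required_units(avg_events_per_sec, 3, 1_000, 1)

print(kafka_partitions, kinesis_shards)  # 35 348
```

In a round, stating the inputs (average rate, peak factor, event size) before the result tends to matter more than the exact number, since the interviewer can then change one input and watch the estimate move.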

Transform layer design questions. Compute engine: Spark for distributed batch, Flink for stateful streaming, dbt for in-warehouse SQL transformations, custom Python for lightweight glue. Idempotency: run_id baked into output partitions, MERGE INTO on composite natural keys, append-only with version column for slowly-changing facts. Late-arriving data: for windowed aggregations, a MERGE that adds late contributions to the existing window aggregate instead of replacing it; for streaming, a watermark plus allowed lateness; for batch, a backfill plan. Dependency management: an orchestrator (Airflow, Dagster) handles cross-pipeline dependencies; downstream pipelines wait on upstream completion via sensors or asset-based triggers.
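The idempotency property behind MERGE INTO on a composite natural key can be illustrated without a warehouse. The sketch below is an in-memory stand-in, not any engine's MERGE implementation: the key columns (`order_id`, `line_id`), the `updated_at` column, and the `merge_into` helper are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Tuple

Key = Tuple[str, str]  # hypothetical composite natural key: (order_id, line_id)

def merge_into(target: Dict[Key, dict], batch: Iterable[dict]) -> Dict[Key, dict]:
    """Upsert each row by natural key; a strictly newer row replaces the old one."""
    for row in batch:
        key = (row["order_id"], row["line_id"])
        existing = target.get(key)
        # WHEN NOT MATCHED -> insert; WHEN MATCHED and newer -> update.
        if existing is None or row["updated_at"] > existing["updated_at"]:
            target[key] = dict(row)
    return target

batch = [
    {"order_id": "o1", "line_id": "1", "amount": 10, "updated_at": 1},
    {"order_id": "o1", "line_id": "1", "amount": 12, "updated_at": 2},
    {"order_id": "o2", "line_id": "1", "amount": 5, "updated_at": 1},
]

table: Dict[Key, dict] = {}
merge_into(table, batch)
after_first_run = {k: dict(v) for k, v in table.items()}

# Replaying the same batch (for example after a retried task) changes nothing.
merge_into(table, batch)
assert table == after_first_run
assert len(table) == 2
```

The point to articulate is that the natural key, not the run, decides row identity, so a retry or a backfill over the same input converges to the same table state instead of duplicating rows.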

Serve layer design questions. Analytical warehouse for BI: Snowflake, BigQuery, Redshift, or Databricks, with star schemas in the gold layer. Online feature store for ML: Redis or DynamoDB for roughly 10 ms read latency, populated by batch backfill. Materialized view for dashboards: Snowflake materialized views, BigQuery materialized views, or pre-computed gold tables refreshed by dbt. Multi-region serving: read replicas with eventual consistency, or active-active with CRDT-based conflict resolution. Caching: a CDN in front of static content, ElastiCache in front of dynamic queries. Operational concerns: SLA monitoring, query performance dashboards, cost attribution per consumer.
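To make the active-active option concrete, here is a minimal sketch of one simple CRDT, a grow-only counter (G-Counter). It is only an illustration of how CRDT merges resolve conflicts without coordination; the text above does not say which CRDT a given design would use, and the region names are hypothetical.

```python
from typing import Dict

class GCounter:
    """Grow-only counter: each region increments only its own slot."""

    def __init__(self, region: str) -> None:
        self.region = region
        self.counts: Dict[str, int] = {}

    def increment(self, n: int = 1) -> None:
        if n < 0:
            raise ValueError("a G-Counter can only grow")
        self.counts[self.region] = self.counts.get(self.region, 0) + n

    def merge(self, other: "GCounter") -> None:
        # Element-wise max is commutative, associative, and idempotent,
        # so replicas converge regardless of merge order or repetition.
        for region, count in other.counts.items():
            self.counts[region] = max(self.counts.get(region, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())

us = GCounter("us-east")  # hypothetical region names
eu = GCounter("eu-west")
us.increment(3)           # writes accepted locally in each region
eu.increment(2)

us.merge(eu)
eu.merge(us)
us.merge(eu)              # a repeated merge is harmless

assert us.value() == eu.value() == 5
```

Counters like this fit metrics such as view counts; state that needs deletes or overwrites requires a different CRDT or a different consistency strategy, which is a trade-off worth naming in the round.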

The 45-60 minute data pipeline design round expects the data engineer to cover all three layers in the time allotted, with explicit failure-mode articulation at each. Pacing is critical: 10 minutes on the ingest layer, 15 minutes on transform, and 10 minutes on serve, which leaves 10 minutes of a 45-minute round (up to 25 minutes of a 60-minute round) for failure-mode drills and adapt-on-fly pivots. Spending 30 minutes on the ingest layer at the expense of transform and serve is a pacing failure even if the ingest design is excellent.

Companies that emphasize end-to-end data pipeline design in data engineer interviews include Netflix (streaming-heavy with Iceberg and Spark), Stripe (idempotent reconciliation across all three layers), Meta (large-scale clickstream and ads attribution), Amazon (AWS-native end-to-end), Google (GCP-native with Pub/Sub plus Dataflow plus BigQuery), and Uber (Kafka plus Flink plus Pinot for real-time serving, plus Spark for batch).

Data Pipeline Design Interview Questions

End-to-end pipeline design problems for data engineer interview prep.

123 practice problems matching this filter. Difficulty: medium (57), hard (66).

Common questions

What does a data pipeline design interview round cover?
End-to-end design across ingest, transform, and serve layers in 45-60 minutes. The prompt is a specific scenario, such as a clickstream at 10B events per day, a daily Postgres-to-Snowflake ETL, or an ML feature store. Senior data engineer rubrics weight idempotency at each layer, failure-mode articulation, cost reasoning, and adapt-on-fly when the interviewer flips a requirement.
How does a data engineer pace a 45-minute pipeline design round?
10 minutes ingest, 15 minutes transform, 10 minutes serve, and the remaining 10 minutes for failure-mode drills and pivots (up to 25 minutes in a 60-minute round). Spending 30 minutes on the ingest layer at the expense of transform and serve is a pacing failure even if the ingest design is excellent.
What is the ingest layer in a data pipeline?
The layer that captures data from sources. Mechanisms: CDC via Debezium for transactional databases, Kafka/Kinesis/Pub/Sub for event streams, scheduled batch SELECT for warehouses, API pulls for SaaS sources, S3 event triggers for file drops. Sizing: throughput in events per second, peak factor, bytes per event.
What is the transform layer in a data pipeline?
The layer that applies dedup, conformed-dimension joins, business rules, and aggregation. Compute engines: Spark for distributed batch, Flink for stateful streaming, dbt for in-warehouse SQL, custom Python for lightweight glue. Idempotency design (run_id, MERGE INTO) is the senior data engineer rubric item.
What is the serve layer in a data pipeline?
The layer that makes transformed data accessible. Analytical warehouse for BI (Snowflake, BigQuery). Online feature store for ML (Redis, DynamoDB). Materialized view for dashboards. Multi-region serving for global products. Caching, SLA monitoring, and cost attribution are operational concerns.
How does a data engineer handle multi-source ingestion?
One bronze layer per source type (transactional, event stream, batch, API), with consistent metadata: load_date, source_system, ingestion_method, raw_payload. Silver layer applies conformed dimensions across sources (one dim_customer used by orders from Postgres, events from Kafka, leads from Salesforce). Gold layer presents unified analytical models.
What is the typical failure mode in pipeline design rounds?
Not articulating failure modes at every component. A high-level architecture without 'what happens when Kafka brokers die, when Spark executors OOM, when Snowflake MERGE deadlocks' falls short of the L5 rubric. The L4 candidate produces a working architecture; the L5 candidate names 3 failure modes per component proactively.
How does a data engineer prepare for the adapt-on-fly pivot?
Practice with a peer or AI mock interviewer that flips a requirement mid-round. Common flips: SLA tightens from 15 min to 1 min, volume jumps 100x, downstream BI tool cannot handle table swaps, multi-region requirement added. The L5 signal is modifying the existing design in place and articulating what changes; the L4 signal is restarting from scratch.
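For the multi-source ingestion answer above, the consistent bronze metadata can be sketched as a small envelope function. The four field names (load_date, source_system, ingestion_method, raw_payload) come from that answer; the `to_bronze` helper, the example source names, and the choice to store the payload as a UTF-8 string are assumptions for illustration.

```python
import json
from datetime import date, datetime, timezone
from typing import Optional

def to_bronze(raw_payload: bytes, source_system: str, ingestion_method: str,
              load_date: Optional[date] = None) -> dict:
    """Wrap an untouched source payload in the shared bronze metadata."""
    effective_date = load_date or datetime.now(timezone.utc).date()
    return {
        "load_date": effective_date.isoformat(),
        "source_system": source_system,
        "ingestion_method": ingestion_method,
        # Kept verbatim (not parsed) so the record can be replayed later.
        "raw_payload": raw_payload.decode("utf-8"),
    }

# Hypothetical records from two different source types share one shape.
lead = to_bronze(b'{"id": 1}', "salesforce", "api_pull", load_date=date(2024, 1, 1))
order = to_bronze(b'{"order_id": "o1"}', "postgres", "cdc", load_date=date(2024, 1, 1))

assert lead.keys() == order.keys()
assert json.loads(lead["raw_payload"]) == {"id": 1}
```

Keeping parsing out of the bronze write means a schema change in one source cannot break ingestion for the others; conformance happens later in silver.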