ETL design interview prep for data engineer roles. Idempotent ETL with run_id baked into output partitions. MERGE INTO on composite natural key for upsert. Schema evolution with additive-only contracts. Late-arriving data with MERGE-ADD-not-REPLACE. Orchestration with Airflow or Dagster. The patterns that compose 80 percent of data engineer ETL design rounds in 2026.
ETL design questions in 2026 data engineer interviews center on five concerns. Idempotency: a re-run of yesterday's ETL must produce the same answer as the original run. The standard pattern is run_id baked into output partitions plus MERGE INTO on composite natural keys; restarting a failed Spark job overwrites the same partition with the same result. Schema evolution: source schemas change; ETL pipelines must propagate additive changes (new nullable columns) automatically and refuse to deploy breaking changes (renamed columns, type narrowing). Schema registries (Confluent Schema Registry for Avro, AWS Glue Schema Registry for Parquet) enforce contracts at the producer side; downstream bronze layer keeps the raw payload for replay.
Late-arriving data handling. Events arrive at the ETL pipeline after their event_time has passed: refunds processed days after the original transaction, click events from offline mobile devices, conversions in the 28-day attribution window. The standard pattern is MERGE INTO with ADD semantics, not REPLACE. MERGE INTO daily_revenue d USING (SELECT date, SUM(amount) AS revenue FROM events WHERE processed_at greater-than-or-equal-to since GROUP BY date) src ON d.date = src.date WHEN MATCHED THEN UPDATE SET revenue = d.revenue plus src.revenue WHEN NOT MATCHED THEN INSERT. The ADD (revenue = existing plus new) preserves yesterday's correct total when only late events arrive today; REPLACE would zero out the old total.
Orchestration choice. Airflow is the incumbent (Python-native DAGs, large operator ecosystem, mature but heavy). Dagster is the modern alternative (asset-based, software-defined assets concept, better type safety). Prefect for lighter workflows. dbt is not an orchestrator but is the de-facto in-warehouse transformation tool, called by Airflow or Dagster. Data engineer interview rounds rarely demand a specific orchestrator; what matters is the dependency model (asset-based, task-based) and the retry-on-failure semantics.
Backfill capability is the senior data engineer ETL question. A backfill re-runs ETL for a past date range to correct an error, fill a gap, or add a new column to historical data. The design constraint: backfills must not corrupt downstream consumers reading from the same tables. Two patterns. Insert-overwrite per partition: the backfill writes to a staging partition, then atomically swaps with the production partition. MERGE INTO with run_id: the backfill writes with a fresh run_id; downstream queries filter on the latest run_id per partition. Both require explicit partition design and tend to fail on non-partitioned tables.
Schema-on-read versus schema-on-write. Schema-on-write enforces the schema at ingest (parquet files with strict types, warehouse tables with column constraints). Schema-on-read defers enforcement to query time (raw JSON in S3, parsed at SELECT). Modern data engineer pipelines use schema-on-write for silver and gold layers (typed, validated, business-rule-applied) and schema-on-read for bronze (raw source format preserved for replay).
Companies whose data engineer interviews emphasize ETL design: Stripe (idempotent reconciliation, financial-data audit), Airbnb (ETL framework Airflow was created here; deep ETL expertise expected), Uber (large-scale Spark and Hive ETL), Snowflake and Databricks (vendor-specific ETL patterns).
ETL Design Interview Prep
Prep for the ETL design portion of a data engineer interview loop.
123 practice problems matching this filter. Difficulty: medium (57), hard (66).
Pipeline Architecture (123)
- 45 Minutes Turned Into 3.5 Hours - medium - Spark jobs are running. Just not fast enough.
- 600 Million Events a Day - hard - 600 million events a day. Two years of retention.
- A Clean Number for Every Merchant - hard - Raw payment logs in. Clean merchant summaries out.
- A Million Cars Phoning Home - hard - Every vehicle is a sensor. Deploy the pipeline to catch it all.
- Analysts Are Slowing the Store Down - medium - Orders placed. Data warehouse hungry.
- A New Column on a Billion Rows - hard - Add and backfill a new column to a billion-row production table with zero downtime.
- A Shared Drive Full of Contracts - medium - Buried in PDFs. The data is in there somewhere.
- A Stream All Day and a File at Midnight - hard - Real-time and batch. Same pipeline. No compromises.
- Badging Items That Already Sold Out - hard - Same-day delivery. The features have to be faster.
- Basel, CCAR, and Monday Morning - medium - The regulator does not accept 'eventually consistent.'
Common questions
- What is the idempotent ETL pattern?
- Run_id baked into output partitions plus MERGE INTO on composite natural key. A failed ETL job that restarts overwrites the same partition with the same run_id, producing the same result. The principle is that re-running yesterday's ETL today must produce yesterday's answer. Without idempotency, retries corrupt data and on-call data engineer responses become risky.
- How does a data engineer handle schema evolution in an ETL pipeline?
- Schema registry (Confluent Schema Registry for Avro, AWS Glue Schema Registry for Parquet) at the producer side. Additive changes (new nullable column) propagate automatically. Breaking changes (renamed column, narrowed type) refuse to deploy. Downstream bronze layer keeps the raw payload as JSON or original format for replay after writing a new mapping. Producer-side contract tests in CI catch breaking changes before deploy.
- How does ETL handle late-arriving data?
- MERGE INTO with ADD semantics, not REPLACE. MERGE INTO daily_revenue d USING src ON d.date = src.date WHEN MATCHED THEN UPDATE SET revenue = d.revenue + src.revenue. The ADD preserves yesterday's correct total when late events arrive; REPLACE would zero out the old total. Conversion windows (28 days for ads attribution) require this pattern by default.
- What is the difference between Airflow and Dagster?
- Airflow is the incumbent: Python DAGs, task-based, large operator ecosystem, mature but operationally heavy. Dagster is asset-based with software-defined assets concept and better type safety. Prefect targets lighter workflows. Data engineer interview rounds rarely demand a specific orchestrator; what matters is the dependency model and retry semantics. Mention both as options.
- How does a data engineer design backfill capability?
- Two patterns. Insert-overwrite per partition: backfill writes to a staging partition, then atomically swaps with the production partition. MERGE INTO with run_id: backfill writes with fresh run_id; downstream queries filter on latest run_id per partition. Both require explicit partition design. Non-partitioned tables generally cannot be safely backfilled without coordination.
- What is schema-on-read versus schema-on-write?
- Schema-on-write enforces schema at ingest: parquet files with strict types, warehouse tables with column constraints. Schema-on-read defers enforcement to query time: raw JSON in S3 parsed at SELECT. Modern data engineer pipelines use schema-on-write for silver and gold layers (typed, validated) and schema-on-read for bronze (raw source format preserved for replay).
- How long does the senior data engineer ETL design round take?
- 45 to 60 minutes typically. The scenario usually combines: a specific source database or stream, a freshness SLA, an idempotency requirement, a schema evolution constraint, and a downstream consumer with quirks (BI tool that cannot handle table swaps, ML serving layer that needs feature parity). Senior rubrics weight idempotency design, schema evolution story, and backfill capability.
- When does a data engineer choose ETL versus ELT?
- ETL transforms before landing in the warehouse: useful when the warehouse is expensive to scan or when source data is genuinely too large to land raw. ELT lands raw, transforms in-warehouse: dominates in 2026 because columnar warehouse compute is cheap and the raw bronze layer enables replay. Most modern data engineer pipelines are ELT with dbt or Spark on the bronze layer.