# ETL Interview Questions

> ETL pattern questions for data engineer interview prep.

Canonical URL: <https://datadriven.io/etl-interview-questions>

Breadcrumb: [Home](https://datadriven.io/) > [ETL Interview Questions](https://datadriven.io/etl-interview-questions)

## Summary

ETL interview questions for data engineer roles. Extract transform load patterns from operational databases to analytical warehouses. Idempotent design with run_id and MERGE. Schema evolution with additive contracts. Late-arriving data with MERGE-ADD semantics. Orchestration with Airflow and Dagster.

## What this page covers

ETL (extract, transform, load) is the foundational pattern of data engineer work. Source databases (Postgres, MySQL, MongoDB, Salesforce, third-party APIs) are extracted; the data is transformed (joined, deduplicated, aggregated, conformed to a target schema); the result is loaded into a warehouse (Snowflake, BigQuery, Redshift, Databricks). In 2026, ETL has largely been displaced by ELT (extract, load, transform in warehouse) for most workloads, but the term remains in interview rounds and the underlying patterns persist.

Six ETL patterns appear most in data engineer interviews. Full-refresh ETL: read the entire source, truncate the destination, insert all rows. Simple but expensive at scale; useful for small dimensions and reference data. Incremental ETL with high-water-mark: SELECT WHERE updated_at greater than last_run_max. Captures changes since last run; misses soft deletes and intermediate updates. CDC-based ETL: Debezium watching source WAL, emit changes to Kafka, downstream consumes (the modern default). Snapshot-and-diff: full snapshot at intervals, diff against previous snapshot for changes. Useful when CDC is unavailable but full-refresh is too expensive. SCD merge ETL: incremental ETL with SCD Type 2 update logic on dimension tables. Backfill ETL: re-run a date range to fill gaps or correct errors; requires idempotent design.

Idempotency in ETL. The standard pattern is run_id baked into output partitions or MERGE INTO on composite natural keys. A run_id is a stable identifier per ETL execution (typically run_date plus a sequence). Output partitions are keyed by run_id so a re-run overwrites the same partition with the same result. MERGE INTO on composite keys (pk, run_id) upserts to a target table where re-running produces the same upsert result. Without idempotency, retrying a failed ETL job corrupts data and the data engineer on-call response becomes risky.

Schema evolution in ETL. Source schemas change: a new column is added, a column is renamed, a type is widened. ETL pipelines must propagate additive changes (new nullable column flows to bronze, silver, gold without code change) and refuse to deploy breaking changes (renamed column, narrowed type would break downstream consumers). Schema registries (Confluent Schema Registry for Avro, AWS Glue Schema Registry for Parquet) enforce contracts at the producer side. Downstream bronze keeps the raw payload as JSON or original format for replay after a new mapping is written.

Orchestration in ETL. Airflow is the incumbent (Python DAGs, large operator ecosystem). Dagster is the asset-based alternative (better type safety, software-defined assets). Prefect for lighter workflows. dbt is not an orchestrator but is the de-facto in-warehouse transformation tool called by Airflow or Dagster. The orchestrator's job is to schedule, retry on failure, and propagate dependencies; the orchestrator should never contain transformation logic itself.

Companies whose data engineer interviews emphasize ETL: Airbnb (Airflow was created here; ETL expertise expected), Uber (large-scale Spark and Hive ETL), Snowflake and Databricks (vendor-specific patterns), Stripe (idempotent reconciliation), traditional enterprise (Inmon-style ETL into integrated data warehouse).

## Frequently asked questions

### What is the difference between ETL and ELT?

ETL transforms data in flight before landing in the warehouse (clean data only). ELT lands raw data in the warehouse first (bronze layer) and transforms inside the warehouse with dbt or Spark. ELT dominates in 2026 because columnar warehouse compute is cheap and the raw bronze layer enables replay. ETL persists for use cases where landing raw is genuinely too expensive (massive event streams) or when source-system loading is unacceptable.

### What is the idempotent ETL pattern?

Run_id baked into output partitions plus MERGE INTO on composite natural keys. A failed ETL job that restarts overwrites the same partition with the same run_id, producing the same result. The principle: re-running yesterday's ETL today must produce yesterday's answer. Without idempotency, retries corrupt data and on-call response becomes risky.

### How does a data engineer handle schema changes in ETL?

Schema registry at the producer side. Additive changes propagate automatically. Breaking changes refuse to deploy. Bronze layer keeps the raw payload as JSON for replay after writing a new mapping. Producer-side contract tests in CI catch breaking changes before deploy. The schema-evolution story is a senior data engineer ETL rubric item.

### What is the high-water-mark pattern for incremental ETL?

SELECT WHERE updated_at greater than last_run_max. The ETL job tracks the maximum updated_at it has processed; the next run picks up from there. Simple and cheap. Misses soft deletes (the row is gone from the source). Misses intermediate updates (only the latest state is visible). CDC handles both cases at the cost of operational complexity.

### When should a data engineer use full-refresh ETL?

For small dimensions (under 1M rows) and reference data (countries, currencies, lookup tables). Truncating and reloading is simple, side-effect-free, and avoids the complexity of incremental tracking. For fact tables or large dimensions, full-refresh is too expensive; use incremental or CDC.

### What is backfill ETL and how is it designed?

Backfill re-runs ETL for a past date range to fill gaps or correct errors. Two patterns. Insert-overwrite per partition: backfill writes to staging, atomically swaps with production partition. MERGE INTO with run_id: backfill writes with fresh run_id, downstream queries filter on latest run_id. Both require explicit partition design.

### What orchestrator should a data engineer pick for ETL?

Airflow for incumbent ecosystem and large operator library. Dagster for asset-based modeling and type safety. Prefect for lighter workflows. dbt as the in-warehouse transformation tool called by any of them. The orchestrator's job is to schedule, retry, and propagate dependencies; the orchestrator should not contain transformation logic.

### How does ETL fit into a medallion architecture?

ETL feeds the bronze layer (raw immutable). Subsequent transformations (silver: cleaned and conformed; gold: business-ready star schemas) are ELT done in-warehouse by dbt or Spark. The bronze layer enables replay: a bug in silver does not require re-ingesting from the source.

## How a data engineer designs an ETL pipeline

Six-step ETL design framework for data engineer interview rounds.

### Step 1: Pick the extract mechanism

Full refresh for small dims. High-water-mark incremental for stable schemas. CDC (Debezium) for transactional source with completeness needs.

### Step 2: Design idempotency

Run_id baked into output partitions. MERGE INTO on composite natural keys. Re-running produces the same result.

### Step 3: Handle schema evolution

Schema registry at the producer side. Additive changes propagate. Breaking changes refuse to deploy. Bronze keeps raw for replay.

### Step 4: Handle late-arriving data

MERGE INTO with ADD semantics, not REPLACE. Conversion windows of 28 days require this by default.

### Step 5: Plan backfill

Insert-overwrite per partition or MERGE with run_id. Both require explicit partition design.

### Step 6: Pick the orchestrator

Airflow, Dagster, or Prefect. Orchestrator schedules and retries; never contains transformation logic.

## Related practice catalogs

- [ETL design interview prep guide](https://datadriven.io/etl-design-interview-prep): Full design prep covering idempotency, schema evolution, backfill.
- [ETL practice problems](https://datadriven.io/etl-practice-problems): Hands-on ETL design with rubric-scored verdicts.
- [ELT interview questions](https://datadriven.io/elt-interview-questions): ELT as the modern alternative to ETL.
- [CDC pipeline interview questions](https://datadriven.io/cdc-pipeline-interview-questions): CDC as an ETL ingest mechanism.
- [Data pipeline design interview questions](https://datadriven.io/data-pipeline-interview-questions): End-to-end pipeline design including ETL.
- [System design interview prep](https://datadriven.io/system-design-interview-prep): Full system design including ETL.
- [Advanced SQL: MERGE patterns for ETL](https://datadriven.io/advanced-sql-interview-questions): Idempotent MERGE INTO syntax.
- [Data warehouse interview questions](https://datadriven.io/data-warehouse-interview-questions): Warehouse design that ETL targets.
- [SCD interview questions for ETL merge logic](https://datadriven.io/scd-interview-questions): Type 2 SCD merge in ETL pipelines.

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.