ELT interview questions for data engineer roles. Extract, load raw to warehouse, transform in-warehouse with dbt or Spark. Bronze, silver, gold layers. Idempotent MERGE INTO on composite natural keys. The dominant pattern in 2026 data engineer pipelines.
ELT (extract, load, transform) is the dominant data engineer pipeline pattern in 2026. The change from ETL to ELT happened over the late 2010s and early 2020s as columnar warehouse compute became cheap enough that transforming inside the warehouse beat transforming in flight. The modern stack: extract from source (CDC via Debezium, batch SELECT, API pull), land raw immutable in the warehouse bronze layer (Snowflake variant or external table to S3 Parquet), transform in-warehouse with dbt (SQL-based) or Spark (Python or Scala).
Six ELT-specific concerns appear in data engineer interview rounds. Bronze layer design: raw immutable, schema-on-read for flexibility, partitioned by load_date for replay. Silver layer design: cleaned, typed, deduplicated, conformed dimensions. Gold layer design: business-ready star schemas with aggregates and materialized views. dbt versus Spark for transformation: dbt is SQL-only, runs in the warehouse, has the better developer experience for analytical transformations; Spark is general-purpose, handles non-SQL workloads (ML feature engineering, complex Python logic), more operational overhead. Most modern data engineer pipelines use both: dbt for analytics, Spark for ML and heavy transforms.
Idempotency in ELT. Bronze layer is append-only with load_date partition; re-running an extract for a date overwrites the same partition. Silver and gold use MERGE INTO on composite natural keys with run_id. dbt's incremental materialization handles this pattern automatically with the unique_key config option. Spark with Delta or Iceberg MERGE INTO similarly. The senior data engineer signal is articulating idempotency at every layer, not just at the top.
Schema evolution in ELT. The bronze layer's raw immutable nature handles additive changes naturally: new columns flow into bronze without code change. Silver layer transformations need explicit schema handling: dbt source freshness checks plus schema tests catch breaking changes; Spark schema-on-read with explicit column references propagates additive changes and breaks on rename. Schema registries at the producer side enforce contracts.
The dbt-specific patterns appear in data engineer interview rounds where the company uses dbt heavily (most 2026 startups, many enterprise teams). Incremental models with the unique_key config for idempotency. Snapshots for SCD Type 2 dimension tracking. Sources with freshness tests. Macros for cross-project DRY (a custom_dedup macro that wraps ROW_NUMBER for the team's standard dedup logic). Tests (not_null, unique, accepted_values, custom singular tests) gating deploys.
Companies whose data engineer interviews emphasize ELT and dbt: Stripe (dbt-heavy), Airbnb (Airflow + dbt-equivalent internal tool), most modern startups (Snowflake or BigQuery + dbt Cloud), Snowflake itself (vendor-specific dbt patterns), Databricks (Delta + dbt or Spark).
ELT Interview Questions
ELT pattern questions for data engineer interview prep.
123 practice problems matching this filter. Difficulty: medium (57), hard (66).
Pipeline Architecture (123)
- 45 Minutes Turned Into 3.5 Hours - medium - Spark jobs are running. Just not fast enough.
- 600 Million Events a Day - hard - 600 million events a day. Two years of retention.
- A Clean Number for Every Merchant - hard - Raw payment logs in. Clean merchant summaries out.
- A Million Cars Phoning Home - hard - Every vehicle is a sensor. Deploy the pipeline to catch it all.
- Analysts Are Slowing the Store Down - medium - Orders placed. Data warehouse hungry.
- A New Column on a Billion Rows - hard - Add and backfill a new column to a billion-row production table with zero downtime.
- A Shared Drive Full of Contracts - medium - Buried in PDFs. The data is in there somewhere.
- A Stream All Day and a File at Midnight - hard - Real-time and batch. Same pipeline. No compromises.
- Badging Items That Already Sold Out - hard - Same-day delivery. The features have to be faster.
- Basel, CCAR, and Monday Morning - medium - The regulator does not accept 'eventually consistent.'
Common questions
- What is the difference between ETL and ELT?
- ETL transforms in flight before landing clean data in the warehouse. ELT lands raw data in the warehouse first (bronze layer) and transforms in-warehouse with dbt or Spark. ELT dominates in 2026 because columnar warehouse compute is cheap and the raw bronze layer enables replay. ETL persists for use cases where landing raw is too expensive or source-system loading is unacceptable.
- Why has ELT become the default for data engineer pipelines?
- Three reasons. Columnar warehouse compute (Snowflake, BigQuery, Redshift, Databricks) is cheap enough to transform in-warehouse without breaking the budget. The raw bronze layer enables replay: a bug in silver does not require re-ingesting from the source. The dbt ecosystem and similar in-warehouse tools have matured to the point that SQL-based transformation is more productive than custom Spark code for most workloads.
- When does a data engineer pick dbt versus Spark for in-warehouse transformation?
- dbt for analytical transformations: aggregations, dimensional modeling, business-rule applications, dashboards. SQL-only, runs in the warehouse, fast developer iteration. Spark for non-SQL workloads: ML feature engineering, complex Python logic, very large transforms that benefit from Spark's distributed compute. Most modern data engineer pipelines use both: dbt for analytics, Spark for ML and heavy transforms.
- What is the bronze, silver, gold layer pattern in ELT?
- Bronze: raw immutable from source, partitioned by load_date, schema-on-read for flexibility. Silver: cleaned, typed, deduplicated, conformed dimensions, schema-on-write. Gold: business-ready star schemas with aggregates and materialized views, optimized for query. Each layer has its own ownership and quality contract; a bug in silver does not require re-ingesting from source.
- How does dbt handle idempotency?
- Incremental models with the unique_key config option. The model's transformation logic produces rows keyed by unique_key; dbt MERGE INTO the destination table replacing or upserting matching rows. Re-running with the same source data produces the same destination. dbt also supports the insert_overwrite materialization for partition-replace idempotency.
- What is a dbt snapshot and when does a data engineer use it?
- dbt snapshot is the built-in SCD Type 2 mechanism. The snapshot captures the state of a source table at the snapshot time and tracks changes with valid_from, valid_to columns. Re-running the snapshot adds new rows for any source changes. Useful for tracking slowly-changing dimensions (customer attributes, product catalog) without writing custom merge logic.
- How does ELT handle schema evolution?
- Bronze layer's raw immutable nature handles additive changes naturally: new columns flow into bronze without code change. Silver layer transformations need explicit schema handling: dbt source freshness checks and schema tests catch breaking changes; Spark schema-on-read with explicit column references propagates additive and breaks on rename. Schema registries enforce contracts at producer.
- What is the difference between dbt Cloud and dbt Core?
- dbt Core is the open-source CLI tool; runs on any infrastructure. dbt Cloud is the managed service with web UI, scheduling, alerting, IDE, CI integration. Most data engineer interviews are stack-agnostic about which dbt; the patterns are the same. Mention dbt Cloud if the company uses it (most cloud-first companies do).