ETL practice problems for data engineer interview prep. Hands-on idempotent ETL design with run_id and MERGE INTO. Schema evolution scenarios with additive and breaking changes. Late-arriving data with MERGE-ADD semantics. Backfill scenarios with insert-overwrite and run_id patterns. Each problem ships with a rubric-scored verdict matching the senior data engineer ETL design rubric.
ETL practice problems for data engineer roles cover six scenario families that recur in 2026 interview reports. Each problem has a rubric-scored verdict covering five dimensions: extract mechanism choice, idempotency design, schema evolution story, late-arriving data handling, and orchestration plus backfill plan.
Daily ETL from Postgres to Snowflake. Source: a 50TB Postgres production database with 200 tables. Constraint: 4-hour load window, full backfill capability for the last 30 days, no impact to source production load. Canonical solution: Debezium CDC connector watches the Postgres WAL, emits change events to Kafka, Kafka Connect S3 Sink lands raw to S3 bronze, Spark daily ETL reads S3 partitions and dedups on (table_pk, op_ts), Snowflake MERGE INTO upserts to gold with run_id baked in. Backfill: re-run Spark with the same run_id over the date range, MERGE overwrites the affected partitions.
Incremental ETL with high-water-mark and audit. Source: a Salesforce instance with 50 tables and 100M total rows. Constraint: hourly freshness, no schema changes deploy without contract review, audit log for every row inserted or updated. Canonical solution: incremental SELECT WHERE LastModifiedDate greater than last_run_max for each table, dedup on (pk, LastModifiedDate), MERGE INTO with audit INSERT to a side table for every operation. Schema evolution: contract validated against a JSON schema in CI before deploy.
ML feature pipeline with backfill. Source: clickstream events landed in S3 by an upstream pipeline. Constraint: compute 50 features per user per day, support 30-day backfill when a feature definition changes, online and offline parity. Canonical solution: Spark daily batch reads S3 partitions, computes features with shared feature library (used by both batch and streaming), writes to S3 feature parquet with date partition. Backfill: re-run with new feature definition, partition-overwrite per affected date. Online parity: Flink streaming job using the same feature library writes to Redis.
Daily payment reconciliation. Source: internal transactions in Postgres, processor settlement reports in S3 (delivered by SFTP nightly). Constraint: daily reconciliation, alert on any discrepancy greater than 0.01 percent of total volume, idempotent re-runs for late settlement files. Canonical solution: Spark daily ETL reads internal transactions (via CDC from Postgres) and settlement reports (from S3), joins on (txn_id, settlement_id), produces reconciled fact with status and discrepancy. Snowflake MERGE on (txn_id, run_id) for idempotency. Late settlement files trigger backfill MERGE on the affected date.
Schema evolution scenario. Source: a JSON event stream from a third-party SDK. Constraint: handle additive changes automatically (new nullable field), refuse to deploy breaking changes (renamed field, type narrowing), preserve raw event payload for 90 days. Canonical solution: bronze layer stores raw JSON in S3 with date partition. Silver layer extracts typed fields via a Spark job with explicit schema; new nullable fields propagate without code change. Breaking changes blocked by schema registry contract validation at the producer side; the data engineer team writes a new mapping when intentional schema change happens, deploys mapping plus bronze re-parse together.
Each practice problem includes a 45-60 minute design exercise plus rubric review. The rubric review identifies what scored well (idempotency design, schema story) and what was missed (backfill plan, alerting strategy). The gap analysis is the practice value.
ETL Practice Problems
Hands-on ETL design practice for data engineer interview prep.
123 practice problems matching this filter. Difficulty: medium (57), hard (66).
Pipeline Architecture (123)
- 45 Minutes Turned Into 3.5 Hours - medium - Spark jobs are running. Just not fast enough.
- 600 Million Events a Day - hard - 600 million events a day. Two years of retention.
- A Clean Number for Every Merchant - hard - Raw payment logs in. Clean merchant summaries out.
- A Million Cars Phoning Home - hard - Every vehicle is a sensor. Deploy the pipeline to catch it all.
- Analysts Are Slowing the Store Down - medium - Orders placed. Data warehouse hungry.
- A New Column on a Billion Rows - hard - Add and backfill a new column to a billion-row production table with zero downtime.
- A Shared Drive Full of Contracts - medium - Buried in PDFs. The data is in there somewhere.
- A Stream All Day and a File at Midnight - hard - Real-time and batch. Same pipeline. No compromises.
- Badging Items That Already Sold Out - hard - Same-day delivery. The features have to be faster.
- Basel, CCAR, and Monday Morning - medium - The regulator does not accept 'eventually consistent.'
Common questions
- How are ETL practice problems graded?
- Rubric-scored on five dimensions: extract mechanism choice (CDC vs incremental vs full refresh), idempotency design (run_id, MERGE INTO), schema evolution story (registry, additive-only contracts), late-arriving data handling (MERGE-ADD-not-REPLACE), orchestration plus backfill plan (insert-overwrite vs run_id-filter). Multiple valid designs score well if the data engineer can defend each choice.
- What scenarios are most common in ETL practice problems?
- Daily Postgres-to-Snowflake CDC, incremental Salesforce ETL with audit, ML feature pipeline with backfill, daily payment reconciliation, schema evolution with breaking-change refusal, multi-source-to-data-vault ingestion. Each appears across multiple companies' data engineer interview reports.
- How long does an ETL practice session take?
- 60-75 minutes including rubric review: 45-60 minutes for the design exercise, 15 minutes for the rubric verdict comparison. The gap analysis between the candidate's solution and the rubric-scored verdict is the practice value.
- What is the most common failure mode in ETL practice?
- Skipping the idempotency design. The data engineer designs a working extract-transform-load flow without addressing what happens on retry. A failed Spark job that restarts must produce the same warehouse state, not a partial double-write. Run_id baked into output partitions plus MERGE INTO is the standard fix.
- How does a data engineer practice schema evolution scenarios?
- Practice with a specific source schema change: a column added (additive), a column renamed (breaking), a type widened (additive in some systems, breaking in others). Walk through what the bronze, silver, and gold layers need to do. The senior signal is articulating where the schema registry sits, what the contract validation looks like, and how the team coordinates a breaking change.
- What is the rubric weight on backfill capability?
- 15-20 percent at L5+. A pipeline without backfill capability is operationally fragile: bugs cannot be corrected without manual data surgery. The standard backfill patterns are insert-overwrite per partition and MERGE INTO with run_id. Both require explicit partition design upfront.
- Do these ETL practice problems cover specific vendors?
- Most stay vendor-neutral (Spark, dbt, MERGE INTO patterns work across Snowflake, BigQuery, Redshift, Databricks). Vendor-specific variants tagged on problems where they matter: Snowflake-specific MERGE syntax, BigQuery partition-replace, Databricks Delta MERGE INTO with optimize, Redshift COPY-then-UPSERT pattern.
- How many ETL practice problems should a senior data engineer solve before an onsite?
- Six well-practiced scenarios across six distinct ETL shapes (CDC, incremental, full refresh, ML feature, reconciliation, schema evolution) beats fifteen similar ones. Aim for one of each over 2-3 weeks. The signal interviewers test is whether the data engineer can transfer the pattern to a new source-and-destination pair.