Six real data engineer take-home assignments (paraphrased and de-identified), each with the prompt, an example solution, the rubric the company graded against, and the annotated walkthrough of where most candidates lose points. Sourced from candidate-shared take-homes between 2024 and 2026 across 19 companies. Pair with the complete data engineer interview preparation framework for round-by-round prep.
A small payments company asks: 'Build a reconciliation pipeline that ingests transactions, refunds, and chargebacks and produces a daily report of any mismatch.'
recon-pipeline/
├── README.md # 5-min walkthrough, runs in 30 sec
├── Makefile # make install, make run, make test, make report
├── pyproject.toml # pinned Python 3.11, polars, click, pytest
├── data/
│ ├── input/ # the 3 CSVs
│ └── output/ # generated reports (gitignored)
├── src/
│ └── recon/
│ ├── __init__.py
│ ├── ingest.py # CSV -> polars DataFrame, schema validated
│ ├── transform.py # joins, merchant aggregation
│ ├── checks.py # refund > txn anomaly detection
│ ├── report.py # CSV + markdown output
│ └── cli.py # click-based entrypoint
├── tests/
│ ├── conftest.py
│ ├── test_transform.py
│ └── fixtures/ # 30-row sample of each input
└── docs/
    └── design.md # how I would productionize this
Trap 1: Not handling refunds across day boundaries. A transaction on day 1 with a refund on day 3 needs to be reported in day 1's revenue (most candidates report it in day 3). Document your choice.
Trap 2: Using floats for currency. Use Decimal or integer cents. Half the failed submissions hit precision bugs at scale.
Trap 3: Not documenting how you would deploy this. Graders explicitly look for the productionization section in the README. Half a page covering the Airflow DAG, on-call setup, and SLAs is enough.
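Trap 2 in practice, as a stdlib sketch: float arithmetic drifts on currency sums, while Decimal or integer cents stay exact. The `to_cents` helper is illustrative (and assumes non-negative amounts), not part of any prompt.

```python
from decimal import Decimal

# Float arithmetic drifts: summing 0.1 ten times does not hit 1.0 exactly.
float_total = sum([0.1] * 10)
print(float_total == 1.0)  # False on IEEE-754 floats

# Decimal keeps currency exact.
decimal_total = sum([Decimal("0.10")] * 10)
print(decimal_total == Decimal("1.00"))  # True

def to_cents(amount: str) -> int:
    """Parse a non-negative amount like '12.34' into integer cents, never touching float."""
    dollars, _, cents = amount.partition(".")
    return int(dollars) * 100 + int(cents.ljust(2, "0")[:2])

print(to_cents("12.34"))  # 1234
```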
A travel marketplace asks: 'Build a search index pipeline that ingests listing updates and powers a queryable search service with 5-second freshness.'
Two-phase design: ingest writes to SQLite with an FTS5 full-text index. Updates write to a small change-log table; a background process applies them every 5 seconds. The query layer is FastAPI with three endpoints (search, filter, nearby).
Trade-off documented: SQLite chosen over Elasticsearch for self-contained simplicity at this data size. Production version would use Elasticsearch or OpenSearch with a Kafka-fed update stream.
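A minimal sketch of the two-phase design using stdlib `sqlite3` (requires an FTS5-enabled SQLite build, which ships with most CPython distributions). Table and column names are illustrative, and the 5-second background loop is collapsed into a single function call:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE VIRTUAL TABLE listings USING fts5(listing_id, title, description);
    CREATE TABLE change_log (listing_id TEXT, title TEXT, description TEXT);
""")
conn.execute("INSERT INTO listings VALUES ('l1', 'Beach bungalow', 'steps from the sand')")

def apply_pending_updates(conn):
    """The 5-second background pass: upsert each logged change, then clear the log."""
    for listing_id, title, desc in conn.execute("SELECT * FROM change_log").fetchall():
        conn.execute("DELETE FROM listings WHERE listing_id = ?", (listing_id,))
        conn.execute("INSERT INTO listings VALUES (?, ?, ?)", (listing_id, title, desc))
    conn.execute("DELETE FROM change_log")

# An update lands in the change log; after the next pass it is searchable.
conn.execute("INSERT INTO change_log VALUES ('l1', 'Beach bungalow', 'renovated, steps from the sand')")
apply_pending_updates(conn)
rows = conn.execute("SELECT listing_id FROM listings WHERE listings MATCH 'renovated'").fetchall()
print(rows)  # the updated listing matches the new term
```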
Trap 1: Building a clever inverted index from scratch instead of using SQLite FTS5 or Whoosh. Wrong choice: graders care about the end-to-end design, not your custom index code.
Trap 2: Skipping the update path. Many candidates ingest the initial dump and forget the 5-second freshness requirement. The freshness path is half the grade.
Trap 3: Geo distance with naive Pythagoras. Use haversine. Document that you considered geohash bucketing for performance at larger scale.
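The haversine formula from Trap 3, sketched with the stdlib `math` module (the mean Earth radius of 6371 km is the usual approximation):

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; naive Pythagoras on lat/lon degrades away from the equator."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# Paris -> London is roughly 344 km great-circle.
print(round(haversine_km(48.8566, 2.3522, 51.5074, -0.1278)))
```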
An analytics platform asks: 'Given event JSON in S3, build a PySpark job that produces a daily user-level aggregate table optimized for downstream analytics.'
Use spark.read.json with an explicit schema (schema auto-inference costs an extra full scan of the data). Repartition by date before writing. Use approx_count_distinct for distinct event types (call it out as a cost-saving choice in the README).
Output Parquet with snappy compression. Partition by date. Coalesce to control file count (avoid the small-file problem).
README discusses: why approx vs exact distinct (cost vs accuracy), why Parquet over CSV (column pruning, compression), why partition by date (downstream queries almost always filter by date).
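The pieces above fit together in a short PySpark sketch. This is not runnable outside a Spark environment, and the bucket paths, field names, and coalesce factor are all illustrative assumptions:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("daily-user-agg").getOrCreate()

# Explicit schema: avoids the extra full scan that schema inference costs on S3.
schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("ts", TimestampType()),
])

events = spark.read.json("s3://bucket/events/", schema=schema)

daily = (
    events
    .withColumn("event_date", F.to_date("ts"))
    .groupBy("event_date", "user_id")
    .agg(
        F.count("*").alias("event_count"),
        # approx_count_distinct trades a small error bound for a much cheaper aggregation.
        F.approx_count_distinct("event_type").alias("distinct_event_types"),
    )
)

(
    daily
    .repartition("event_date")        # one shuffle partition per date -> few files per directory
    .write.mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3://bucket/daily_user_agg/")  # snappy-compressed Parquet by default
)
```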
An analytics-engineer-focused company asks: 'Given raw event tables in Snowflake, build a dbt project that produces a star schema with conformed dimensions.'
models/
├── staging/
│   ├── stg_events.sql # 1:1 from raw, type cast, rename
│   ├── stg_users.sql
│   └── stg_products.sql
├── intermediate/
│   └── int_user_actions_enriched.sql
└── marts/
    ├── core/
    │   ├── dim_user.sql # SCD Type 2 via dbt snapshots
    │   ├── dim_product.sql
    │   ├── dim_date.sql
    │   └── fact_user_actions.sql
    └── core.yml # tests + docs
snapshots/
└── dim_user_snapshot.sql # SCD Type 2 logic via dbt snapshot
tests/
└── assert_no_duplicate_user_action_per_ts.sql
docs/
└── README.md
A real-time analytics company asks: 'Build a streaming consumer that sessionizes events from Kafka and emits session-end records to a downstream topic.'
Use a Faust agent with @app.agent stream processing. Maintain an in-memory dict keyed by user_id holding the last event timestamp and the accumulating events. On each new event: if the gap exceeds 30 minutes, emit the previous session and start a new one. A background task expires idle sessions on a TTL.
State persistence: Faust uses RocksDB by default (mention in README). Without persistence, restart loses in-progress sessions; with it, sessions resume cleanly.
Failure modes documented in README: consumer crash mid-session (state recovers from RocksDB), producer outage (consumer stays alive, processes whatever arrives), late events (more than 1 hour past the current head: emit to a dead-letter topic).
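The core gap-based sessionization logic, sketched in plain Python with Kafka, Faust, and state persistence stripped out. The event shape and the `emitted` list (a stand-in for the downstream session-end topic) are illustrative:

```python
SESSION_GAP_SECONDS = 30 * 60  # a gap over 30 minutes closes the session

class Sessionizer:
    def __init__(self):
        self.open_sessions = {}  # user_id -> list of event timestamps (seconds)
        self.emitted = []        # stand-in for the downstream session-end topic

    def on_event(self, user_id, ts):
        events = self.open_sessions.get(user_id)
        if events and ts - events[-1] > SESSION_GAP_SECONDS:
            # Gap exceeded: emit the previous session, then start a new one.
            self.emitted.append({"user_id": user_id, "start": events[0], "end": events[-1]})
            events = None
        if events is None:
            self.open_sessions[user_id] = events = []
        events.append(ts)

s = Sessionizer()
for ts in (0, 60, 120, 120 + 31 * 60):  # fourth event arrives 31 min after the third
    s.on_event("u1", ts)
print(s.emitted)  # one closed session covering the first three events
```

A production version would also run the TTL sweep described above, so sessions close even when a user simply stops sending events.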
A FinTech company asks: 'Given a SQLite database with 5 tables, write SQL that answers 8 business questions and explain your approach.'
One file per question (q1.sql ... q8.sql) instead of one giant file. Each file has a top comment with the question, the approach, and any assumptions. README has a table linking questions to file paths.
For the gap-and-island question (consecutive spending days), candidates who use the ROW_NUMBER subtraction trick score 5/5. Candidates who use a self-join approach score 4/5. Candidates who use a procedural loop score 2/5.
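The ROW_NUMBER subtraction trick, sketched on SQLite via stdlib `sqlite3` (table and column names are illustrative, not from the actual assignment): consecutive dates share a constant `julianday(date) - ROW_NUMBER()` value, so grouping by that difference isolates each island of consecutive days.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE spend_days (user_id TEXT, spend_date TEXT);
    INSERT INTO spend_days VALUES
        ('u1', '2026-01-01'), ('u1', '2026-01-02'), ('u1', '2026-01-03'),
        ('u1', '2026-01-07'), ('u1', '2026-01-08');
""")

longest_streak = conn.execute("""
    WITH numbered AS (
        SELECT user_id,
               spend_date,
               -- constant within a run of consecutive days, so it labels the island
               julianday(spend_date)
                 - ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY spend_date) AS grp
        FROM spend_days
    )
    SELECT user_id, COUNT(*) AS streak_len
    FROM numbered
    GROUP BY user_id, grp
    ORDER BY streak_len DESC
    LIMIT 1
""").fetchone()
print(longest_streak)  # ('u1', 3): Jan 1-3 is the longest consecutive run
```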
For the LTV ranking question, DENSE_RANK over RANK is the right choice (preserves tied users). Mentioning the distinction is a senior signal.
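The RANK vs DENSE_RANK distinction in one query, again on SQLite with an illustrative table: after a tie, RANK skips positions while DENSE_RANK keeps tiers contiguous.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_ltv (user_id TEXT, ltv INTEGER);
    INSERT INTO user_ltv VALUES ('a', 500), ('b', 500), ('c', 300);
""")

rows = conn.execute("""
    SELECT user_id,
           RANK()       OVER (ORDER BY ltv DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY ltv DESC) AS dense_rnk
    FROM user_ltv
    ORDER BY ltv DESC, user_id
""").fetchall()

# 'a' and 'b' tie at 500: both functions give 1.
# 'c' gets RANK 3 (position skipped) but DENSE_RANK 2 (tier preserved),
# so a "top 2 LTV tiers" filter keeps 'c' only under DENSE_RANK.
print(rows)
```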
Across all six examples, four patterns appear in every winning submission: a README with a productionization section, a Makefile with one-line commands, at least three unit tests, and a documented trade-off for one major architectural choice. If your submission is missing any of these four, it caps at 3/5 regardless of code quality.
The framework for tackling any take-home is in how to pass the Data Engineer take-home. The question patterns the take-home is testing for come from how to pass the SQL round and how to pass the Python round. If you're targeting a specific company, the take-home flavor matches their stack: how to pass the Airbnb Data Engineer interview, how to pass the Stripe Data Engineer interview, how to pass the Databricks Data Engineer interview.
Build the muscle memory of a take-home submission by running the patterns in our sandbox. SQL, Python, modeling, and design problems with instant feedback.
Rubric, repo structure, and the README pattern that wins.
Practice the patterns that show up inside take-homes.
Pillar guide covering every round in the Data Engineer loop, end to end.
Free downloadable PDF of 100+ data engineer interview questions and answers, updated 2026.
The 50 most frequently asked data engineer interview questions, with worked answers.
100 of the most asked data engineer interview questions across all four domains.
Real questions from Meta, Amazon, Apple, Netflix, and Google Data Engineer loops, with answers.
Window functions, gap-and-island, and the patterns interviewers test in 95% of Data Engineer loops.
JSON flattening, sessionization, and vanilla-Python data wrangling in the Data Engineer coding round.
Continue your prep
50+ guides covering every round, company, role, and technology in the data engineer interview loop. Grounded in 2,817 verified interview reports across 929 companies, collected from real candidates.