Data Engineer Take-Home Examples
Six real data engineer take-home assignments (paraphrased and de-identified), each with the prompt, an example solution, the rubric the company graded against, and an annotated walkthrough of where most candidates lose points. Sourced from candidate-shared take-homes between 2024 and 2026 across 19 companies. Pair with the complete data engineer interview preparation framework for round-by-round prep.
Example 1: Stripe-Style Reconciliation Pipeline (8 hours)
A small payments company asks: 'Build a reconciliation pipeline that ingests transactions, refunds, and chargebacks and produces a daily report of any mismatch.'
What the company sent
What 5/5 looked like
recon-pipeline/
├── README.md              # 5-min walkthrough, runs in 30 sec
├── Makefile               # make install, make run, make test, make report
├── pyproject.toml         # pinned Python 3.11, polars, click, pytest
├── data/
│   ├── input/             # the 3 CSVs
│   └── output/            # generated reports (gitignored)
├── src/
│   └── recon/
│       ├── __init__.py
│       ├── ingest.py      # CSV -> polars DataFrame, schema validated
│       ├── transform.py   # joins, merchant aggregation
│       ├── checks.py      # refund > txn anomaly detection
│       ├── report.py      # CSV + markdown output
│       └── cli.py         # click-based entrypoint
├── tests/
│   ├── conftest.py
│   ├── test_transform.py
│   └── fixtures/          # 30-row sample of each input
└── docs/
    └── design.md          # how I would productionize this
The traps in this take-home
Trap 1: Not handling refunds across day boundaries. A transaction on day 1 with a refund on day 3 needs the refund reported against day 1's revenue (most candidates report it against day 3). Document your choice.
Trap 2: Using floats for currency. Use Decimal or integer cents. Half the failed submissions hit precision bugs at scale.
Trap 3: Not documenting how you would deploy this. Reviewers explicitly look for the productionization section in the README. Aim for about half a page covering the Airflow DAG, on-call setup, and SLA.
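Traps 1 and 2 can be illustrated together. The following is a minimal sketch, not the graded solution: the function names (to_cents, net_revenue_by_txn_day) and the tuple layouts are assumptions made for illustration. It parses amounts into integer cents via Decimal and books each refund against the day of the original transaction.

```python
from collections import defaultdict
from decimal import Decimal


def to_cents(amount: str) -> int:
    """Parse a decimal string like '19.99' into integer cents, never via float."""
    return int((Decimal(amount) * 100).to_integral_value())


def net_revenue_by_txn_day(transactions, refunds):
    """Return {day: net cents}, booking refunds on the original transaction's day.

    transactions: iterable of (txn_id, day, amount_str)        # hypothetical layout
    refunds:      iterable of (txn_id, refund_day, amount_str)  # hypothetical layout
    """
    txn_day = {}
    revenue = defaultdict(int)
    for txn_id, day, amount in transactions:
        txn_day[txn_id] = day
        revenue[day] += to_cents(amount)
    for txn_id, _refund_day, amount in refunds:
        # The refund day is ignored on purpose: the refund reduces the revenue
        # of the day on which the original transaction happened.
        revenue[txn_day[txn_id]] -= to_cents(amount)
    return dict(revenue)
```

If the opposite convention is chosen (refunds booked on the refund day), only the last loop changes, which is why the choice is worth stating in the README.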
Example 2: Airbnb-Style Listing Search Backend (8 hours)
A travel marketplace asks: 'Build a search index pipeline that ingests listing updates and powers a queryable search service with 5-second freshness.'
What the company sent
Architecture choice (in README)
Two-phase: ingest writes to SQLite with an FTS5 full-text index. Updates are written to a small change log table, and a background process applies them every 5 seconds. The query layer is FastAPI with three endpoints (search, filter, nearby).
Trade-off documented: SQLite chosen over Elasticsearch for self-contained simplicity at this data size. Production version would use Elasticsearch or OpenSearch with a Kafka-fed update stream.
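The change-log update path described above can be sketched as follows. This is an illustrative outline under stated assumptions: the table and column names (listings, change_log, applied) are hypothetical, and a plain table stands in for the FTS5 index so the snippet runs on any SQLite build. A background loop would call apply_pending_changes every 5 seconds.

```python
import sqlite3


def init_db(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) listings table and its change log."""
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS listings (
            id INTEGER PRIMARY KEY,
            title TEXT NOT NULL,
            price_cents INTEGER NOT NULL
        );
        CREATE TABLE IF NOT EXISTS change_log (
            seq INTEGER PRIMARY KEY AUTOINCREMENT,
            listing_id INTEGER NOT NULL,
            title TEXT NOT NULL,
            price_cents INTEGER NOT NULL,
            applied INTEGER NOT NULL DEFAULT 0
        );
        """
    )


def apply_pending_changes(conn: sqlite3.Connection) -> int:
    """Apply unapplied change-log rows in order; return how many were applied."""
    with conn:  # one transaction: either every pending change lands or none does
        rows = conn.execute(
            "SELECT seq, listing_id, title, price_cents "
            "FROM change_log WHERE applied = 0 ORDER BY seq"
        ).fetchall()
        for seq, listing_id, title, price_cents in rows:
            conn.execute(
                "INSERT OR REPLACE INTO listings (id, title, price_cents) "
                "VALUES (?, ?, ?)",
                (listing_id, title, price_cents),
            )
            conn.execute("UPDATE change_log SET applied = 1 WHERE seq = ?", (seq,))
    return len(rows)
```

Applying rows in seq order means the last update for a listing wins, and marking rows as applied inside the same transaction keeps a crash from applying a change twice.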
The traps in this take-home
Trap 1: Building a clever inverted index from scratch instead of using SQLite FTS5 or Whoosh. Wrong choice: reviewers care about the end-to-end design, not your custom index code.
Trap 2: Skipping the update path. Many candidates ingest the initial dump and forget the 5-second freshness requirement. The freshness path is half the score.
Trap 3: Geo distance with naive Pythagoras. Use haversine. Document that you considered geohash bucketing for performance at larger scale.
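For Trap 3, a minimal haversine sketch is shown here. The function name and the choice of 6371 km as the mean Earth radius are assumptions for illustration; the formula itself is the standard great-circle distance.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an approximation


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
```

Naive Pythagoras on raw degrees treats a degree of longitude as the same length everywhere, which is increasingly wrong away from the equator; haversine avoids that at the cost of a few trigonometric calls per pair.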
Example 3: Databricks-Style PySpark ETL (4 hours)
An analytics platform asks: 'Given event JSON in S3, build a PySpark job that produces a daily user-level aggregate table optimized for downstream analytics.'
What the company sent
What to optimize for
Use spark.read.json with an explicit schema (skip schema inference; it needs an extra pass over the data and is slow). Repartition by date before the write. Use approx_count_distinct for distinct counts, and mention it in the README as a cost-saving choice.
Output Parquet with snappy compression. Partition by date. Coalesce to control file count (avoid the small-file problem).
README discusses: why approx vs exact distinct (cost vs accuracy), why Parquet over CSV (column pruning, compression), why partition by date (downstream queries almost always filter by date).
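The choices above could be combined roughly as follows. This is a sketch under assumptions, not the reference solution: the event fields (user_id, event_type, event_ts), the output column names, and the function name are hypothetical, and the paths are supplied by the caller. PySpark is imported inside the function so the module can be imported on a machine without Spark installed.

```python
PARQUET_COMPRESSION = "snappy"


def build_daily_user_aggregates(spark, input_path: str, output_path: str) -> None:
    """Read event JSON with an explicit schema and write date-partitioned Parquet."""
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType, StructField, StructType, TimestampType

    # Hypothetical event schema; declaring it avoids the inference pass.
    schema = StructType(
        [
            StructField("user_id", StringType(), False),
            StructField("event_type", StringType(), True),
            StructField("event_ts", TimestampType(), True),
        ]
    )
    events = spark.read.schema(schema).json(input_path)

    daily = (
        events.withColumn("event_date", F.to_date("event_ts"))
        .groupBy("event_date", "user_id")
        .agg(
            F.count(F.lit(1)).alias("event_count"),
            # Approximate distinct count: cheaper than an exact countDistinct.
            F.approx_count_distinct("event_type").alias("approx_event_types"),
        )
    )

    (
        daily.repartition("event_date")  # co-locate each date before the write
        .write.mode("overwrite")
        .partitionBy("event_date")
        .option("compression", PARQUET_COMPRESSION)
        .parquet(output_path)
    )
```

Repartitioning on the partition column means each date directory is written by a single task, which is one way to keep the file count per date small.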
Example 4: dbt + Snowflake Modeling Take-Home (6 hours)
An analytics-engineer-focused company asks: 'Given raw event tables in Snowflake, build a dbt project that produces a star schema with conformed dimensions.'
What the company sent
dbt project layout that scores 5/5
models/
├── staging/
│   ├── stg_events.sql                 # 1:1 from raw, type cast, rename
│   ├── stg_users.sql
│   └── stg_products.sql
├── intermediate/
│   └── int_user_actions_enriched.sql
├── marts/
│   ├── core/
│   │   ├── dim_user.sql               # SCD Type 2 via dbt snapshots
│   │   ├── dim_product.sql
│   │   ├── dim_date.sql
│   │   └── fact_user_actions.sql
│   └── core.yml                       # tests + docs
snapshots/
└── dim_user_snapshot.sql              # SCD Type 2 logic via dbt snapshot
tests/
└── assert_no_duplicate_user_action_per_ts.sql
docs/
└── README.md
Example 5: Streaming Take-Home With Kafka and Faust (8 hours)
A real-time analytics company asks: 'Build a streaming consumer that sessionizes events from Kafka and emits session-end records to a downstream topic.'
What the company sent
Stateful consumer with TTL
Use a Faust agent (@app.agent) to process the stream. Maintain an in-memory dict from user_id to last_event_ts and the events accumulated so far. On each new event, if the gap since last_event_ts exceeds 30 minutes, emit the previous session and start a new one. A background task expires sessions that pass the TTL.
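State persistence: Faust tables can persist state to RocksDB when the app is configured with a rocksdb:// store (Faust's default store is in-memory), so state you want to survive a restart belongs in a table, not a plain dict. Mention the choice in the README. Without persistence, a restart loses in-progress sessions; with it, sessions resume cleanly.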
Failure modes documented in README: consumer crash mid-session (state recovers from RocksDB), producer outage (consumer stays alive and processes whatever arrives), late events (those more than 1 hour behind the current head go to a dead-letter topic).
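The 30-minute gap rule and the TTL expiry can be sketched in plain Python. This is deliberately not Faust code, so it runs without a broker; the class and field names (Sessionizer, Session, open_sessions) are hypothetical, timestamps are assumed to be seconds, and the TTL is assumed to equal the session gap. In a Faust app, on_event would be called from the agent and expire from a periodic task.

```python
from dataclasses import dataclass
from typing import Optional

SESSION_GAP_SECONDS = 30 * 60


@dataclass
class Session:
    user_id: str
    start_ts: float
    end_ts: float
    event_count: int = 1


class Sessionizer:
    def __init__(self, gap_seconds: float = SESSION_GAP_SECONDS):
        self.gap_seconds = gap_seconds
        self.open_sessions = {}  # user_id -> Session

    def on_event(self, user_id: str, ts: float) -> Optional[Session]:
        """Record an event; return the closed session if the gap was exceeded."""
        current = self.open_sessions.get(user_id)
        if current is None:
            self.open_sessions[user_id] = Session(user_id, ts, ts)
            return None
        if ts - current.end_ts > self.gap_seconds:
            # Gap exceeded: close the old session and start a new one.
            self.open_sessions[user_id] = Session(user_id, ts, ts)
            return current
        current.end_ts = max(current.end_ts, ts)
        current.event_count += 1
        return None

    def expire(self, now: float) -> list:
        """Close and return sessions idle for longer than the gap."""
        expired = [
            s for s in self.open_sessions.values()
            if now - s.end_ts > self.gap_seconds
        ]
        for session in expired:
            del self.open_sessions[session.user_id]
        return expired
```

The expire step matters because a user who never sends another event would otherwise never trigger the gap check, and the session-end record would never be emitted.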
Example 6: SQL-Heavy Take-Home (4 hours)
A FinTech company asks: 'Given a SQLite database with 5 tables, write SQL that answers 8 business questions and explain your approach.'
What the company sent
What reviewers look for
One file per question (q1.sql ... q8.sql) instead of one giant file. Each file has a top comment with the question, the approach, and any assumptions. README has a table linking questions to file paths.
For the gap-and-island question (consecutive spending days), candidates who use the ROW_NUMBER subtraction trick score 5/5. Candidates who use a self-join approach score 4/5. Candidates who use a procedural loop score 2/5.
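For reference, the ROW_NUMBER subtraction trick can be sketched as follows, run through Python's sqlite3 module. The table and column names (spending, user_id, spend_date) are assumptions, dates are assumed to be ISO strings, and the query needs a SQLite build with window functions (3.25 or later). Subtracting the row number from the day number gives a value that stays constant within a run of consecutive days, so grouping on it isolates each streak.

```python
import sqlite3

STREAKS_SQL = """
WITH days AS (
    SELECT DISTINCT user_id, spend_date FROM spending
),
numbered AS (
    SELECT user_id,
           spend_date,
           CAST(julianday(spend_date) AS INTEGER)
             - ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY spend_date) AS island
    FROM days
)
SELECT user_id,
       MIN(spend_date) AS streak_start,
       MAX(spend_date) AS streak_end,
       COUNT(*)        AS streak_days
FROM numbered
GROUP BY user_id, island
ORDER BY user_id, streak_start
"""


def spending_streaks(conn: sqlite3.Connection) -> list:
    """Return (user_id, streak_start, streak_end, streak_days) per consecutive run."""
    return conn.execute(STREAKS_SQL).fetchall()
```

The DISTINCT step is there because several purchases on the same day would otherwise consume extra row numbers and split a streak.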
For the LTV ranking question, DENSE_RANK over RANK is the right choice: both give tied users the same rank, but RANK skips the following rank numbers after a tie while DENSE_RANK does not. Mentioning the distinction is a senior signal.
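The difference between the two ranking functions is easiest to see side by side. The sketch below uses Python's sqlite3 module with a hypothetical user_ltv table; it assumes a SQLite build with window functions (3.25 or later).

```python
import sqlite3

RANKS_SQL = """
SELECT user_id,
       RANK()       OVER (ORDER BY ltv_cents DESC) AS rnk,
       DENSE_RANK() OVER (ORDER BY ltv_cents DESC) AS dense_rnk
FROM user_ltv
ORDER BY ltv_cents DESC, user_id
"""


def ltv_ranks(conn: sqlite3.Connection) -> list:
    """Return (user_id, rank, dense_rank) ordered by descending LTV."""
    return conn.execute(RANKS_SQL).fetchall()
```

With two users tied at the top, the third user gets rank 3 under RANK and rank 2 under DENSE_RANK, which changes the answer to any question phrased as "top N ranks".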
Cross-Example Patterns
Four elements appear in nearly every top-scored data engineer take-home submission: a README that includes a section on how the system would be productionized, a Makefile that exposes setup and run commands as one-liners, a small but real unit test suite, and an explicit trade-off discussion for at least one architectural choice. Submissions missing more than one of these tend to cap below the top of the rubric, regardless of code quality.
For a framework that applies to any take-home, see how to pass the Data Engineer take-home. The technical patterns these assignments test come from how to pass the SQL round and how to pass the Python round. Company-specific take-home flavors track each company's stack: how to pass the Airbnb Data Engineer interview, how to pass the Stripe Data Engineer interview, how to pass the Databricks Data Engineer interview.
Data engineer interview prep FAQ
Are these the actual take-home prompts?
Can I submit my own take-home for review?
How long should I spend on a 4-hour take-home?
Should I use AI tools on the take-home?
What if I get stuck on the take-home?
Should I use pandas, PySpark, polars, or vanilla Python?
How do I make my take-home stand out?
Practice the patterns from these take-homes
01. Active recall beats re-reading by 50%
Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom.
02. 76% of hiring managers reject on the coding task, not the resume
From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice.
03. Five problem shapes cover 80% of data engineer loops
Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition.