Six real data engineer take-home assignments (paraphrased and de-identified), each with the prompt, an example solution, the rubric the company graded against, and the annotated walkthrough of where most candidates lose points. Sourced from candidate-shared take-homes between 2024 and 2026 across 19 companies. Pair with the complete data engineer interview preparation framework for round-by-round prep.
A small payments company asks: 'Build a reconciliation pipeline that ingests transactions, refunds, and chargebacks and produces a daily report of any mismatch.'
recon-pipeline/
├── README.md # 5-min walkthrough, runs in 30 sec
├── Makefile # make install, make run, make test, make report
├── pyproject.toml # pinned Python 3.11, polars, click, pytest
├── data/
│ ├── input/ # the 3 CSVs
│ └── output/ # generated reports (gitignored)
├── src/
│ └── recon/
│ ├── __init__.py
│ ├── ingest.py # CSV -> polars DataFrame, schema validated
│ ├── transform.py # joins, merchant aggregation
│ ├── checks.py # refund > txn anomaly detection
│ ├── report.py # CSV + markdown output
│ └── cli.py # click-based entrypoint
├── tests/
│ ├── conftest.py
│ ├── test_transform.py
│ └── fixtures/ # 30-row sample of each input
└── docs/
    └── design.md # how I would productionize this
Trap 1: Not handling refunds across day boundaries. A transaction on day 1 with a refund on day 3 needs to be reported in day 1's revenue (most candidates report it in day 3). Document your choice.
Trap 2: Using floats for currency. Use Decimal or integer cents. Half the failed submissions hit precision bugs at scale.
Trap 3: Not documenting how you would deploy this. Graders explicitly look for the productionization section in the README. Half a page covering the Airflow DAG, on-call setup, and SLAs is enough.
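Trap 2 in practice, as a stdlib sketch: float arithmetic drifts on currency sums, while Decimal or integer cents stay exact. The `to_cents` helper is illustrative (and assumes non-negative amounts), not part of any prompt.

```python
from decimal import Decimal

# Float arithmetic drifts: summing 0.1 ten times does not hit 1.0 exactly.
float_total = sum([0.1] * 10)
print(float_total == 1.0)  # False on IEEE-754 floats

# Decimal keeps currency exact.
decimal_total = sum([Decimal("0.10")] * 10)
print(decimal_total == Decimal("1.00"))  # True

def to_cents(amount: str) -> int:
    """Parse a non-negative amount like '12.34' into integer cents, never touching float."""
    dollars, _, cents = amount.partition(".")
    return int(dollars) * 100 + int(cents.ljust(2, "0")[:2])

print(to_cents("12.34"))  # 1234
```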
A travel marketplace asks: 'Build a search index pipeline that ingests listing updates and powers a queryable search service with 5-second freshness.'
Two-phase design: ingest writes to SQLite with an FTS5 full-text index. Updates write to a small change-log table; a background process applies them every 5 seconds. The query layer is FastAPI with three endpoints (search, filter, nearby).
Trade-off documented: SQLite chosen over Elasticsearch for self-contained simplicity at this data size. Production version would use Elasticsearch or OpenSearch with a Kafka-fed update stream.
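A minimal sketch of the two-phase design using stdlib `sqlite3` (requires an FTS5-enabled SQLite build, which ships with most CPython distributions). Table and column names are illustrative, and the 5-second background loop is collapsed into a single function call:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE VIRTUAL TABLE listings USING fts5(listing_id, title, description);
    CREATE TABLE change_log (listing_id TEXT, title TEXT, description TEXT);
""")
conn.execute("INSERT INTO listings VALUES ('l1', 'Beach bungalow', 'steps from the sand')")

def apply_pending_updates(conn):
    """The 5-second background pass: upsert each logged change, then clear the log."""
    for listing_id, title, desc in conn.execute("SELECT * FROM change_log").fetchall():
        conn.execute("DELETE FROM listings WHERE listing_id = ?", (listing_id,))
        conn.execute("INSERT INTO listings VALUES (?, ?, ?)", (listing_id, title, desc))
    conn.execute("DELETE FROM change_log")

# An update lands in the change log; after the next pass it is searchable.
conn.execute("INSERT INTO change_log VALUES ('l1', 'Beach bungalow', 'renovated, steps from the sand')")
apply_pending_updates(conn)
rows = conn.execute("SELECT listing_id FROM listings WHERE listings MATCH 'renovated'").fetchall()
print(rows)  # the updated listing matches the new term
```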
Trap 1: Building a clever inverted index from scratch instead of using SQLite FTS5 or Whoosh. Wrong choice: graders care about the end-to-end design, not your custom index code.
Trap 2: Skipping the update path. Many candidates ingest the initial dump and forget the 5-second freshness requirement. The freshness path is half the grade.
Trap 3: Geo distance with naive Pythagoras. Use haversine. Document that you considered geohash bucketing for performance at larger scale.
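The haversine formula from Trap 3, sketched with the stdlib `math` module (the mean Earth radius of 6371 km is the usual approximation):

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; naive Pythagoras on lat/lon degrades away from the equator."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# Paris -> London is roughly 344 km great-circle.
print(round(haversine_km(48.8566, 2.3522, 51.5074, -0.1278)))
```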
An analytics platform asks: 'Given event JSON in S3, build a PySpark job that produces a daily user-level aggregate table optimized for downstream analytics.'
Use spark.read.json with an explicit schema (schema auto-inference costs an extra full scan of the data). Repartition by date before writing. Use approx_count_distinct for distinct event types (call it out as a cost-saving choice in the README).
Output Parquet with snappy compression. Partition by date. Coalesce to control file count (avoid the small-file problem).
README discusses: why approx vs exact distinct (cost vs accuracy), why Parquet over CSV (column pruning, compression), why partition by date (downstream queries almost always filter by date).
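The pieces above fit together in a short PySpark sketch. This is not runnable outside a Spark environment, and the bucket paths, field names, and coalesce factor are all illustrative assumptions:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("daily-user-agg").getOrCreate()

# Explicit schema: avoids the extra full scan that schema inference costs on S3.
schema = StructType([
    StructField("user_id", StringType()),
    StructField("event_type", StringType()),
    StructField("ts", TimestampType()),
])

events = spark.read.json("s3://bucket/events/", schema=schema)

daily = (
    events
    .withColumn("event_date", F.to_date("ts"))
    .groupBy("event_date", "user_id")
    .agg(
        F.count("*").alias("event_count"),
        # approx_count_distinct trades a small error bound for a much cheaper aggregation.
        F.approx_count_distinct("event_type").alias("distinct_event_types"),
    )
)

(
    daily
    .repartition("event_date")        # one shuffle partition per date -> few files per directory
    .write.mode("overwrite")
    .partitionBy("event_date")
    .parquet("s3://bucket/daily_user_agg/")  # snappy-compressed Parquet by default
)
```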
An analytics-engineer-focused company asks: 'Given raw event tables in Snowflake, build a dbt project that produces a star schema with conformed dimensions.'
models/
├── staging/
│   ├── stg_events.sql # 1:1 from raw, type cast, rename
│   ├── stg_users.sql
│   └── stg_products.sql
├── intermediate/
│   └── int_user_actions_enriched.sql
└── marts/
    ├── core/
    │   ├── dim_user.sql # SCD Type 2 via dbt snapshots
    │   ├── dim_product.sql
    │   ├── dim_date.sql
    │   └── fact_user_actions.sql
    └── core.yml # tests + docs
snapshots/
└── dim_user_snapshot.sql # SCD Type 2 logic via dbt snapshot
tests/
└── assert_no_duplicate_user_action_per_ts.sql
docs/
└── README.md
A real-time analytics company asks: 'Build a streaming consumer that sessionizes events from Kafka and emits session-end records to a downstream topic.'
Use a Faust agent with @app.agent stream processing. Maintain an in-memory dict keyed by user_id holding the last event timestamp and the accumulating events. On each new event: if the gap exceeds 30 minutes, emit the previous session and start a new one. A background task expires idle sessions on a TTL.
State persistence: Faust uses RocksDB by default (mention in README). Without persistence, restart loses in-progress sessions; with it, sessions resume cleanly.
Failure modes documented in README: consumer crash mid-session (state recovers from RocksDB), producer outage (consumer stays alive, processes whatever arrives), late events (more than 1 hour past the current head: emit to a dead-letter topic).
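The core gap-based sessionization logic, sketched in plain Python with Kafka, Faust, and state persistence stripped out. The event shape and the `emitted` list (a stand-in for the downstream session-end topic) are illustrative:

```python
SESSION_GAP_SECONDS = 30 * 60  # a gap over 30 minutes closes the session

class Sessionizer:
    def __init__(self):
        self.open_sessions = {}  # user_id -> list of event timestamps (seconds)
        self.emitted = []        # stand-in for the downstream session-end topic

    def on_event(self, user_id, ts):
        events = self.open_sessions.get(user_id)
        if events and ts - events[-1] > SESSION_GAP_SECONDS:
            # Gap exceeded: emit the previous session, then start a new one.
            self.emitted.append({"user_id": user_id, "start": events[0], "end": events[-1]})
            events = None
        if events is None:
            self.open_sessions[user_id] = events = []
        events.append(ts)

s = Sessionizer()
for ts in (0, 60, 120, 120 + 31 * 60):  # fourth event arrives 31 min after the third
    s.on_event("u1", ts)
print(s.emitted)  # one closed session covering the first three events
```

A production version would also run the TTL sweep described above, so sessions close even when a user simply stops sending events.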
A FinTech company asks: 'Given a SQLite database with 5 tables, write SQL that answers 8 business questions and explain your approach.'
One file per question (q1.sql ... q8.sql) instead of one giant file. Each file has a top comment with the question, the approach, and any assumptions. README has a table linking questions to file paths.
For the gap-and-island question (consecutive spending days), candidates who use the ROW_NUMBER subtraction trick score 5/5. Candidates who use a self-join approach score 4/5. Candidates who use a procedural loop score 2/5.
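The ROW_NUMBER subtraction trick, sketched on SQLite via stdlib `sqlite3` (table and column names are illustrative, not from the actual assignment): consecutive dates share a constant `julianday(date) - ROW_NUMBER()` value, so grouping by that difference isolates each island of consecutive days.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE spend_days (user_id TEXT, spend_date TEXT);
    INSERT INTO spend_days VALUES
        ('u1', '2026-01-01'), ('u1', '2026-01-02'), ('u1', '2026-01-03'),
        ('u1', '2026-01-07'), ('u1', '2026-01-08');
""")

longest_streak = conn.execute("""
    WITH numbered AS (
        SELECT user_id,
               spend_date,
               -- constant within a run of consecutive days, so it labels the island
               julianday(spend_date)
                 - ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY spend_date) AS grp
        FROM spend_days
    )
    SELECT user_id, COUNT(*) AS streak_len
    FROM numbered
    GROUP BY user_id, grp
    ORDER BY streak_len DESC
    LIMIT 1
""").fetchone()
print(longest_streak)  # ('u1', 3): Jan 1-3 is the longest consecutive run
```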
For the LTV ranking question, DENSE_RANK over RANK is the right choice (preserves tied users). Mentioning the distinction is a senior signal.
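The RANK vs DENSE_RANK distinction in one query, again on SQLite with an illustrative table: after a tie, RANK skips positions while DENSE_RANK keeps tiers contiguous.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_ltv (user_id TEXT, ltv INTEGER);
    INSERT INTO user_ltv VALUES ('a', 500), ('b', 500), ('c', 300);
""")

rows = conn.execute("""
    SELECT user_id,
           RANK()       OVER (ORDER BY ltv DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY ltv DESC) AS dense_rnk
    FROM user_ltv
    ORDER BY ltv DESC, user_id
""").fetchall()

# 'a' and 'b' tie at 500: both functions give 1.
# 'c' gets RANK 3 (position skipped) but DENSE_RANK 2 (tier preserved),
# so a "top 2 LTV tiers" filter keeps 'c' only under DENSE_RANK.
print(rows)
```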
Across all six examples, four patterns appear in every winning submission: a README with a productionization section, a Makefile with one-line commands, at least three unit tests, and a documented trade-off for one major architectural choice. If your submission is missing any of these four, it caps at 3/5 regardless of code quality.
The framework for tackling any take-home is in how to pass the Data Engineer take-home. The question patterns the take-home is testing for come from how to pass the SQL round and how to pass the Python round. If you're targeting a specific company, the take-home flavor matches their stack: how to pass the Airbnb Data Engineer interview, how to pass the Stripe Data Engineer interview, how to pass the Databricks Data Engineer interview.
Build the muscle memory of a take-home submission by running the patterns in our sandbox. SQL, Python, modeling, and design problems with instant feedback.
Rubric, repo structure, and the README pattern that wins.
Practice the patterns that show up inside take-homes.
Pillar guide covering every round in the Data Engineer loop, end to end.
Free downloadable PDF of 100+ data engineer interview questions and answers, updated 2026.
The 50 most frequently asked data engineer interview questions, with worked answers.
100 of the most asked data engineer interview questions across all four domains.
Real questions from Meta, Amazon, Apple, Netflix, and Google Data Engineer loops, with answers.
Window functions, gap-and-island, and the patterns interviewers test in 95% of Data Engineer loops.
JSON flattening, sessionization, and vanilla-Python data wrangling in the Data Engineer coding round.
Continue your prep
50+ guides covering every round, company, role, and technology in the data engineer interview loop. Grounded in 2,817 verified interview reports across 929 companies, collected from real candidates.