Annotated Take-Home Walkthroughs

Data Engineer Take-Home Examples

Six real data engineer take-home assignments (paraphrased and de-identified), each with the prompt, an example solution, the rubric the company graded against, and an annotated walkthrough of where most candidates lose points. Sourced from candidate-shared take-homes between 2024 and 2026 across 19 companies. Pair with the complete data engineer interview preparation framework for round-by-round prep.

The Short Answer
Take-home assignments are graded on a hidden rubric: correctness (30%), code quality (20%), repo structure (15%), written explanation (20%), and one differentiator (15%). The examples below show each rubric dimension scored 5/5 in context. The single biggest pattern across winning submissions: the README is graded as heavily as the code. See how to pass the Data Engineer take-home for the framework, then use these examples as reference.
Updated April 2026 · By The DataDriven Team

Example 1: Stripe-Style Reconciliation Pipeline (8 hours)

A small payments company asks: 'Build a reconciliation pipeline that ingests transactions, refunds, and chargebacks and produces a daily report of any mismatch.'

Prompt

What the company sent

Three CSV files: transactions (1M rows), refunds (50K rows), chargebacks (5K rows). Build a Python pipeline that produces (1) a per-merchant daily revenue report and (2) a list of any transactions where the refund amount exceeds the original transaction. Time budget: 8 hours. Deliver as a git repo. README required.
Winning Repo Structure

What 5/5 looked like

recon-pipeline/
├── README.md            # 5-min walkthrough, runs in 30 sec
├── Makefile             # make install, make run, make test, make report
├── pyproject.toml       # pinned Python 3.11, polars, click, pytest
├── data/
│   ├── input/           # the 3 CSVs
│   └── output/          # generated reports (gitignored)
├── src/
│   └── recon/
│       ├── __init__.py
│       ├── ingest.py    # CSV -> polars DataFrame, schema validated
│       ├── transform.py # joins, merchant aggregation
│       ├── checks.py    # refund > txn anomaly detection
│       ├── report.py    # CSV + markdown output
│       └── cli.py       # click-based entrypoint
├── tests/
│   ├── conftest.py
│   ├── test_transform.py
│   └── fixtures/        # 30-row sample of each input
└── docs/
    └── design.md        # how I would productionize this
Where Candidates Lose Points

The traps in this take-home

Trap 1: Not handling refunds across day boundaries. A transaction on day 1 that is refunded on day 3 should net against day 1's revenue (most candidates net it against day 3's). Whichever attribution you choose, document it.

Trap 2: Using floats for currency. Use Decimal or integer cents. Half the failed submissions hit precision bugs at scale.
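A minimal illustration of the precision trap, using the standard-library Decimal type:

```python
from decimal import Decimal

# Floats drift: three 10-cent charges do not sum to exactly 0.30.
float_total = 0.10 + 0.10 + 0.10      # 0.30000000000000004
assert float_total != 0.30

# Integer cents and Decimal stay exact.
cents_total = 10 + 10 + 10             # 30 cents
decimal_total = Decimal("0.10") * 3    # Decimal('0.30')
```

The error is tiny per row but compounds over a million transactions; integer cents are the cheapest fix.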

Trap 3: Not documenting how you would deploy this. Graders explicitly look for a productionization section in the README. Half a page covering the Airflow DAG, on-call setup, and SLAs is enough.

Example 2: Airbnb-Style Listing Search Backend (8 hours)

A travel marketplace asks: 'Build a search index pipeline that ingests listing updates and powers a queryable search service with 5-second freshness.'

Prompt

What the company sent

A JSON file with 100K listings (id, title, description, price, lat, lon, amenities). A second JSON with 20K listing updates over the last 24 hours. Build a Python pipeline that ingests listings, applies updates, and exposes a search-by-text + filter-by-price + filter-by-distance API. 8 hours. Local-only is fine.
Winning Approach

Architecture choice (in README)

Two-phase: ingest writes to SQLite with FTS5 full-text index. Updates write to a small change log table; a background process applies them every 5 seconds. Query layer is FastAPI with three endpoints (search, filter, nearby).

Trade-off documented: SQLite chosen over Elasticsearch for self-contained simplicity at this data size. Production version would use Elasticsearch or OpenSearch with a Kafka-fed update stream.
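A minimal sketch of the FTS5 core of that design, assuming a standard Python sqlite3 build with FTS5 compiled in (table and column names are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# FTS5 virtual table gives tokenized full-text search over listing text.
con.execute("CREATE VIRTUAL TABLE listings_fts USING fts5(id, title, description)")
con.executemany(
    "INSERT INTO listings_fts VALUES (?, ?, ?)",
    [("1", "Cozy beach bungalow", "Steps from the sand, ocean view"),
     ("2", "Downtown loft", "Walk to restaurants and nightlife")],
)
# MATCH runs the full-text query; rank orders by FTS5's built-in relevance.
hits = con.execute(
    "SELECT id, title FROM listings_fts WHERE listings_fts MATCH ? ORDER BY rank",
    ("ocean",),
).fetchall()
```

Price and distance filters then run as ordinary WHERE clauses against a regular table joined on id.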

Where Candidates Lose Points

The traps in this take-home

Trap 1: Building a clever inverted index from scratch instead of using SQLite FTS5 or Whoosh. Wrong choice: graders care about the end-to-end design, not your custom index code.

Trap 2: Skipping the update path. Many candidates ingest the initial dump and forget the 5-second freshness requirement. The freshness path is half the grade.

Trap 3: Geo distance with naive Pythagoras. Use haversine. Document that you considered geohash bucketing for performance at larger scale.
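A standard haversine implementation (6371 km mean Earth radius; the function name is illustrative):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    # Haversine formula: accurate on a sphere, unlike planar Pythagoras,
    # which increasingly overstates east-west distance away from the equator.
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))
```

One degree of latitude comes out near 111 km, a quick sanity check worth putting in a test.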

Example 3: Databricks-Style PySpark ETL (4 hours)

An analytics platform asks: 'Given event JSON in S3, build a PySpark job that produces a daily user-level aggregate table optimized for downstream analytics.'

Prompt

What the company sent

Sample S3 path with 10 GB of nested event JSON (one file per hour, 24 files). Schema in a separate file. Build a PySpark job that produces fact_user_daily (user_id, date, event_count, distinct_event_types, first_event_ts, last_event_ts, total_revenue). Output as Parquet, partitioned by date. 4 hours. Run locally with spark-submit.
Winning Approach

What to optimize for

Use spark.read.json with explicit schema (skip auto-inference; it's slow). Repartition by date before write. Use approx_count_distinct for distinct types (mention as cost-saving choice in README).

Output Parquet with snappy compression. Partition by date. Coalesce to control file count (avoid the small-file problem).

README discusses: why approx vs exact distinct (cost vs accuracy), why Parquet over CSV (column pruning, compression), why partition by date (downstream queries almost always filter by date).

Example 4: dbt + Snowflake Modeling Take-Home (6 hours)

An analytics-engineer-focused company asks: 'Given raw event tables in Snowflake, build a dbt project that produces a star schema with conformed dimensions.'

Prompt

What the company sent

Snowflake credentials to a sandbox with raw_events, raw_users, raw_products tables. Build a dbt project that produces fact_user_actions, dim_user, dim_product, dim_date. Implement SCD Type 2 on dim_user. 6 hours.
Winning Project Structure

dbt project layout that scores 5/5

models/
├── staging/
│   ├── stg_events.sql       # 1:1 from raw, type cast, rename
│   ├── stg_users.sql
│   └── stg_products.sql
├── intermediate/
│   └── int_user_actions_enriched.sql
├── marts/
│   ├── core/
│   │   ├── dim_user.sql      # SCD Type 2 via dbt snapshots
│   │   ├── dim_product.sql
│   │   ├── dim_date.sql
│   │   └── fact_user_actions.sql
│   └── core.yml              # tests + docs
snapshots/
└── dim_user_snapshot.sql    # SCD Type 2 logic via dbt snapshot
tests/
└── assert_no_duplicate_user_action_per_ts.sql
docs/
└── README.md
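A minimal sketch of what snapshots/dim_user_snapshot.sql might contain, assuming stg_users carries an updated_at column (names beyond the prompt are illustrative):

```sql
{% snapshot dim_user_snapshot %}
{{
    config(
        target_schema='snapshots',
        unique_key='user_id',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}
select * from {{ ref('stg_users') }}
{% endsnapshot %}
```

dim_user.sql then selects from the snapshot, exposing dbt_valid_from and dbt_valid_to as the SCD Type 2 validity window.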

Example 5: Streaming Take-Home With Kafka and Faust (8 hours)

A real-time analytics company asks: 'Build a streaming consumer that sessionizes events from Kafka and emits session-end records to a downstream topic.'

Prompt

What the company sent

Docker compose with Kafka + Zookeeper + a producer that emits 100 events/sec. Build a Python consumer that groups events into sessions (30-min inactivity gap) and emits session-end records to a 'sessions' topic. 8 hours. Faust or kafka-python both acceptable.
Winning Approach

Stateful consumer with TTL

Use a Faust agent (@app.agent) for stream processing. Maintain an in-memory mapping from user_id to last_event_ts plus the accumulating events. On each new event: if the gap exceeds 30 minutes, emit the previous session and start a new one. A background task expires idle sessions on a TTL.

State persistence: Faust tables are in-memory by default but can persist to RocksDB (mention your choice in the README). Without persistence, a restart loses in-progress sessions; with it, sessions resume cleanly.

Failure modes documented in README: consumer crash mid-session (state recovers from RocksDB), producer outage (consumer stays alive and processes whatever arrives), late events (more than 1 hour past the current head: emit to a dead-letter topic).
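The gap check at the heart of the agent can be sketched framework-independently (Faust wiring omitted; all names are illustrative):

```python
from dataclasses import dataclass, field

GAP_SECONDS = 30 * 60  # session ends after 30 minutes of inactivity

@dataclass
class Session:
    user_id: str
    events: list = field(default_factory=list)

sessions: dict[str, Session] = {}   # user_id -> in-progress session
last_seen: dict[str, float] = {}    # user_id -> timestamp of last event

def on_event(user_id: str, ts: float, payload: dict, emit) -> None:
    """Append to the user's session; emit the old one first if the gap exceeded."""
    prev = last_seen.get(user_id)
    if prev is not None and ts - prev > GAP_SECONDS:
        emit(sessions.pop(user_id))  # session-end record to the 'sessions' topic
    sessions.setdefault(user_id, Session(user_id)).events.append(payload)
    last_seen[user_id] = ts
```

In the Faust version, emit is a producer send to the sessions topic and the two dicts become a Faust table so state survives restarts.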

Example 6: SQL-Heavy Take-Home (4 hours)

A FinTech company asks: 'Given a SQLite database with 5 tables, write SQL that answers 8 business questions and explain your approach.'

Prompt

What the company sent

SQLite file with users, accounts, transactions, merchants, fraud_flags. 8 questions: from simple aggregations to gap-and-island consecutive-day-spend, fraud rate by merchant category, and a window-function rank by user lifetime value. 4 hours. Submit as a single .sql file plus a README with explanations.
Winning Submission Structure

What graders look for

One file per question (q1.sql ... q8.sql) instead of one giant file. Each file has a top comment with the question, the approach, and any assumptions. README has a table linking questions to file paths.

For the gap-and-island question (consecutive spending days), candidates who use the ROW_NUMBER subtraction trick score 5/5. Candidates who use a self-join approach score 4/5. Candidates who use a procedural loop score 2/5.
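The ROW_NUMBER subtraction trick is runnable against SQLite directly (table and dates are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE daily_spend (user_id TEXT, spend_date TEXT)")
con.executemany(
    "INSERT INTO daily_spend VALUES (?, ?)",
    [("u1", "2026-01-01"), ("u1", "2026-01-02"), ("u1", "2026-01-03"),
     ("u1", "2026-01-06"), ("u1", "2026-01-07")],
)

# Consecutive dates share the same (date - row_number) anchor,
# so grouping on that anchor yields one row per "island" of days.
rows = con.execute("""
    WITH numbered AS (
        SELECT user_id, spend_date,
               DATE(spend_date, '-' || ROW_NUMBER() OVER (
                   PARTITION BY user_id ORDER BY spend_date) || ' days') AS island
        FROM daily_spend
    )
    SELECT user_id, MIN(spend_date), MAX(spend_date), COUNT(*) AS streak_len
    FROM numbered
    GROUP BY user_id, island
    ORDER BY streak_len DESC
""").fetchall()
```

Here the first row is the three-day streak Jan 1-3; the two-day streak Jan 6-7 follows.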

For the LTV ranking question, DENSE_RANK over RANK is the right choice (tied users share a rank, and no rank numbers are skipped after a tie). Mentioning the distinction is a senior signal.

Cross-Example Patterns

Across all six examples, four patterns appear in every winning submission: a README with a productionization section, a Makefile with one-line commands, at least three unit tests, and a documented trade-off for one major architectural choice. If your submission is missing any of these four, it caps at 3/5 regardless of code quality.

The framework for tackling any take-home is in how to pass the Data Engineer take-home. The question patterns the take-home is testing for come from how to pass the SQL round and how to pass the Python round. If you're targeting a specific company, the take-home flavor matches their stack: how to pass the Airbnb Data Engineer interview, how to pass the Stripe Data Engineer interview, how to pass the Databricks Data Engineer interview.

Data Engineer Interview Prep FAQ

Are these the actual take-home prompts?
These are paraphrased and de-identified versions of real take-home prompts shared by candidates. The structures, time budgets, and rubric expectations are accurate. The exact wording differs to protect the originating companies.
Can I submit my own take-home for review?
Not directly through this page yet, but our coaching service offers take-home reviews against the same rubric we used to score the examples here. The framework guide shows you how to self-score.
How long should I actually spend on a 4-hour take-home?
4 to 6 hours total: 4 of focused work, 1 of README polish, 1 of testing. Landing close to the stated time is a green flag. Spending 2x is a yellow flag (over-engineering). Coming in far under reads as lack of effort.
Should I use AI tools on the take-home?
Most companies allow it but expect disclosure. Use AI for boilerplate, not for design. The README, the trade-off analysis, and the architectural decisions must reflect your reasoning. Graders can usually tell when code-level sophistication exceeds reasoning-level sophistication, and that gap is an instant downgrade.
What if I get stuck on the take-home?
Document where you got stuck in the README. Show what you tried and why it didn't work. Graders explicitly look for this kind of transparency; a partial submission with honest documentation often beats a complete submission that papers over a hack.
Should I use pandas, PySpark, polars, or vanilla Python?
Match the tool to the data size in the prompt. Under 1 GB: pandas or polars. 1-10 GB: PySpark in local mode. Over 10 GB: PySpark with cluster setup documented. Vanilla Python is fine for very small data and shows fluency.
How do I make my take-home stand out?
Three patterns we've seen consistently differentiate: a Makefile with single-command operations, a docs/design.md that describes the production version of the system, and an ADR (architecture decision record) explaining one non-obvious choice. None take more than an hour and all signal senior judgment.

Practice the Take-Home Patterns

Build muscle memory for take-home submissions by running the patterns in our sandbox. SQL, Python, modeling, and design problems with instant feedback.

Start Practicing Now

More Data Engineer Interview Prep Guides

Continue your prep

Data Engineer Interview Prep: explore the full guide

50+ guides covering every round, company, role, and technology in the data engineer interview loop. Grounded in 2,817 verified interview reports across 929 companies, collected from real candidates.
