Interview Round Guide

The Take-Home Assignment

About 28% of data engineer interview loops include a take-home assignment, usually after the recruiter screen and before the onsite. The assignment is scored on a hidden rubric most candidates never see. We have collected 47 graded take-home rubrics across 19 companies and reverse-engineered what wins. This page covers one of the eight rounds in the complete data engineer interview preparation framework.

The Short Answer
Expect a 4 to 8 hour assignment with a 1-week deadline. Typical format: a messy CSV or JSON dataset, a business question, and an ambiguous spec. Graders score five dimensions: correctness, code quality, repo structure, written explanation, and one differentiator (tests, documentation, performance work, or a thoughtful README). The single biggest failure mode is treating it as a coding test instead of a deliverable. The README is graded as heavily as the code.
Updated April 2026 · By The DataDriven Team

The Hidden Rubric Graders Use

Reverse-engineered from 47 graded take-homes across 19 companies. Most companies grade on a 1 to 5 scale per dimension, with 4+ on every dimension required for a hire signal.

Dimension           | Weight | What 5/5 Looks Like
Correctness         | 30%    | Output matches expected results across all sample inputs, including edge cases not in the spec (empty input, malformed rows, duplicates).
Code quality        | 20%    | Functions under 30 lines, clear naming, type hints, no dead code, no commented-out code, idiomatic in the chosen language.
Repo structure      | 15%    | Logical module split (ingest, transform, output), requirements.txt or pyproject.toml, .gitignore, README with run instructions.
Written explanation | 20%    | README explains your approach, the trade-offs you made, what you would do with more time, and how you would deploy this in production.
Differentiator      | 15%    | One thing that goes beyond the spec: unit tests, a Makefile, a Docker image, a sample dashboard, performance benchmarks, an ADR document.

The Repo Structure That Wins

Every grader opens the README first, then the directory tree. If either of those artifacts is disorganized, the rest of the submission is read with skepticism. The structure below is what we have seen score 5/5 on repo structure across 12 different graders.

take-home-yourname/
├── README.md                # 5-min walkthrough, runs in <60 sec
├── Makefile                 # make install, make run, make test
├── pyproject.toml           # pinned versions, no requirements.txt
├── .gitignore               # __pycache__, .venv, data/
├── data/
│   ├── input/               # sample inputs from the prompt
│   └── output/              # expected outputs (gitignored)
├── src/
│   └── pipeline/
│       ├── __init__.py
│       ├── ingest.py        # source-to-raw
│       ├── transform.py     # raw-to-clean
│       ├── aggregate.py     # clean-to-mart
│       └── cli.py           # entrypoint, click or argparse
├── tests/
│   ├── conftest.py
│   ├── test_ingest.py
│   ├── test_transform.py
│   └── fixtures/
│       └── sample_events.json
└── docs/
    ├── design.md            # the architecture I would build
    └── adr-001-pandas.md    # why I chose pandas over Spark
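The cli.py entrypoint in the tree above can be very small. A minimal sketch, assuming hypothetical ingest/transform/aggregate modules with plain-function APIs (the module calls are shown as comments because their signatures depend on your pipeline):

```python
# cli.py -- minimal argparse entrypoint sketch (module names are illustrative)
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="pipeline",
        description="Run the take-home pipeline end to end.",
    )
    parser.add_argument("--input", required=True, help="path to the raw input file")
    parser.add_argument("--output", required=True, help="path to write the final output")
    parser.add_argument("--format", choices=["csv", "json"], default="csv",
                        help="input file format")
    return parser


def main(argv=None) -> None:
    args = build_parser().parse_args(argv)
    # Each stage is a plain function so it can be unit-tested without the CLI:
    # records = ingest.read(args.input, fmt=args.format)
    # clean = transform.clean(records)
    # aggregate.write(clean, args.output)
    print(f"pipeline: {args.input} -> {args.output} ({args.format})")


# Example invocation (in the real repo this would be `python -m pipeline.cli ...`):
main(["--input", "data/input/events.csv", "--output", "data/output/daily.csv"])
```

Keeping the parser in its own function makes the argument contract testable, and keeping each stage a plain function is what lets the tests/ directory exercise the pipeline without touching the CLI.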

The README Pattern That Wins

Five sections, in this order, no more. Every grader expects them. Skipping any one is a 1-point penalty on Written Explanation.

1. Quickstart (under 60 seconds to run)

make install && make run. If the grader cannot run your code in 60 seconds, you have already lost a point. Pin every dependency. Assume Python 3.11 only. Document the exact command that produces the output.
2. What I built

Three sentences. The data flow: source -> transformations -> sink. The grader knows the spec; this section confirms you understood it. Don't repeat the prompt.
3. Trade-offs

5 to 7 bullet points. 'I chose pandas over Spark because the dataset is small enough to fit in memory.' 'I deduplicated by event_id, not by composite key, because the spec said event_id is unique.' Each bullet is a decision you owned.
4. What I would do with more time

5 bullets. Specific. 'Add CDC ingestion via Debezium for real-time updates.' 'Replace the in-memory sort with an external merge sort for inputs over 100GB.' 'Add data quality checks via Great Expectations.' This is where graders look for senior signal.
5. How I would productionize this

Half a page. Where does this run (Airflow DAG, Kubernetes CronJob, AWS Glue job)? How does it get triggered? Where do logs go? What is the SLA? What gets paged when it breaks? Most candidates skip this section. Including it is the single biggest differentiator we have measured.
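The dedup decision quoted in the Trade-offs section above takes only a few lines to own. A vanilla-Python sketch, assuming rows arrive as dicts with an event_id key (field names are illustrative):

```python
def dedupe_by_event_id(rows):
    """Keep the first occurrence of each event_id, preserving input order.

    Chosen over a composite-key dedup because, in this sketch's assumed
    scenario, the spec declares event_id unique at the source.
    """
    seen = set()
    out = []
    for row in rows:
        key = row["event_id"]
        if key in seen:
            continue  # later duplicates are dropped, not merged
        seen.add(key)
        out.append(row)
    return out


rows = [
    {"event_id": 1, "v": "a"},
    {"event_id": 2, "v": "b"},
    {"event_id": 1, "v": "c"},  # duplicate: first occurrence wins
]
print(dedupe_by_event_id(rows))
```

A README bullet that names the key, the tie-breaking rule (first wins), and the reason is exactly the "decision you owned" shape graders reward.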

Five Patterns That Get You Rejected

1. Single 600-line script

Your code in one file labeled solution.py is the most common rejection signal. Even a small assignment should split into ingest, transform, output, and CLI modules. The split shows you think in pipelines, not in scripts.
2. No tests

At least three unit tests covering happy path, an edge case, and an error case. Take-homes without tests cap your score at the equivalent of L3, regardless of code quality. Tests are the cheapest +1 point you can earn.
3. No README, or a README that just says 'run main.py'

The README is graded equal to the code. A bare README signals you do not write for other engineers. The five-section pattern above is the minimum.
4. Spending 20+ hours when the spec said 4

Graders compare submissions for proportionality. A 20-hour over-built submission against a 4-hour spec signals you cannot scope. Worse, it makes the grader feel guilty about taking your time, which biases the evaluation against a hire.
5. Not handling edge cases the spec did not name

Empty input, malformed rows, duplicate keys, all-NULL columns. The spec will not list these. Graders check if you thought of them. Add defensive handling and document it in the README under Trade-offs.
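That defensive handling can stay compact. A sketch assuming newline-delimited JSON input and a hypothetical required user_id column (both assumptions, not part of any real prompt):

```python
import json


def parse_events(lines):
    """Parse NDJSON lines, skipping bad rows instead of crashing.

    Returns (events, skipped_count) so the skip rate can be logged
    and mentioned in the README under Trade-offs.
    """
    events, skipped = [], 0
    for line in lines:
        line = line.strip()
        if not line:  # empty input / blank lines
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            skipped += 1  # malformed row: count it, don't crash
            continue
        if record.get("user_id") is None:  # missing or all-NULL key column
            skipped += 1
            continue
        events.append(record)
    return events, skipped
```

Returning the skip count instead of silently dropping rows is the part graders notice: it turns "I handled bad data" into a measurable claim you can document.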

How the Take-Home Connects to the Rest of the Loop

The take-home is where the SQL round and the Python round meet in a single deliverable. It often replaces or augments the technical phone screen, and it always informs the onsite system design round, because the interviewer will ask "how would you scale this to 100x?". The patterns from the data modeling round show up directly in how you structure your output tables.

Companies with take-home-heavy loops: Airbnb's data engineer take-home is famously rigorous at 8 hours, and Stripe sometimes uses a take-home for senior roles. If you're targeting either, see real Data Engineer take-home assignment examples for annotated walkthroughs.

Data Engineer Interview Prep FAQ

How long should I actually spend on a 4-hour take-home?
4 to 6 hours. Spending roughly the stated time is a green flag. Spending 2x is a yellow flag (over-engineering); spending far less signals lack of effort. If the prompt says 4 hours, plan 4 hours of focused work plus 1 hour of README and testing.
Should I use pandas, Spark, or vanilla Python?
Match the tool to the data size in the prompt. Under 1 GB: pandas or vanilla Python. 1 to 10 GB: PySpark, ideally in local mode. Over 10 GB: PySpark, with the design rationale documented. Most take-homes intentionally give small data so you don't burn time on infra.
Should I add Docker?
It is a positive signal but not required. If you add it, make sure docker build && docker run works in under 5 minutes on a fresh machine. A broken Dockerfile is worse than no Dockerfile.
Do I need to write tests?
Yes. At least 3 unit tests. Pytest is the default. Tests prove you write production-quality code. Their absence caps your score regardless of how clever the solution is.
How do I handle a take-home where the spec is intentionally ambiguous?
Document your interpretation in the README under 'Assumptions I made'. Graders explicitly score how you handle ambiguity. The wrong move is asking the recruiter for clarification on every detail; the right move is reasonable interpretation plus explicit documentation.
Should I deploy the assignment, or is local-only OK?
Local-only is the standard expectation. Deploying is a positive signal only if the prompt explicitly invites it. Otherwise it looks like you did not read the prompt.
What if the take-home asks me to use a tool I haven't used before?
Learn it on the assignment. The grader expects you to acquire new tooling on the job; the take-home is a sample of that ability. Document the learning in the README under 'What I would do differently next time'.
Can I use AI tools to help with the take-home?
Most companies allow it but expect disclosure. Treat AI as a pair programmer for boilerplate, not as a designer. The README, the architecture decisions, and the trade-off analysis must be yours. Graders can usually tell when a candidate's reasoning level differs from their code level, and that is an instant downgrade.
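The three-test minimum the rubric keeps coming back to (happy path, edge case, error case) is smaller than most candidates assume. A pytest-style sketch, with a hypothetical helper under test so the file is self-contained (in a real submission the helper lives in src/ and is imported):

```python
# test_transform.py -- the three-test minimum: happy path, edge case, error case
# Plain assert-based functions; pytest discovers and runs them as-is.


def to_amount_cents(value):
    """Hypothetical helper under test: parse a dollar string into integer cents."""
    if value is None or value == "":
        raise ValueError("amount is required")
    return round(float(value) * 100)


def test_happy_path():
    assert to_amount_cents("12.34") == 1234


def test_edge_case_zero():
    assert to_amount_cents("0") == 0


def test_error_case_missing():
    try:
        to_amount_cents("")
    except ValueError:
        return  # expected: empty amounts must fail loudly
    raise AssertionError("expected ValueError for empty input")
```

Three focused tests like these take about fifteen minutes and, per the rubric above, are the cheapest point on the board.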

See Annotated Take-Home Examples

Real take-home prompts from Stripe, Airbnb, Databricks, and more, with example solutions and graded rubric breakdowns.


More Data Engineer Interview Prep Guides

Data Engineer Interview Prep: explore the full guide.

50+ guides covering every round, company, role, and technology in the data engineer interview loop. Grounded in 2,817 verified interview reports across 929 companies, collected from real candidates.
