Interview Round Guide

The Take-Home Assignment

About 28% of data engineer interview loops include a take-home assignment, usually after the recruiter screen and before the onsite. The assignment is scored on a hidden rubric most candidates never see. We have collected 47 graded take-home rubrics across 19 companies and reverse-engineered what wins. This page covers one of the eight rounds in the complete data engineer interview preparation framework.

The Short Answer
Expect a 4 to 8 hour assignment with a 1-week deadline. Typical format: a messy CSV or JSON dataset, a business question, and an ambiguous spec. Graders score five dimensions: correctness, code quality, repo structure, written explanation, and one differentiator (tests, documentation, performance work, or a thoughtful README). The single biggest failure mode is treating it as a coding test instead of a deliverable. The README is graded as heavily as the code.
Updated April 2026 · By The DataDriven Team

The Hidden Rubric Graders Use

Reverse-engineered from 47 graded take-homes across 19 companies. Most companies grade on a 1 to 5 scale per dimension, with 4+ on every dimension required for a hire signal.

Dimension           | Weight | What 5/5 Looks Like
Correctness         | 30%    | Output matches expected results across all sample inputs, including edge cases not in the spec (empty input, malformed rows, duplicates).
Code quality        | 20%    | Functions under 30 lines, clear naming, type hints, no dead code, no commented-out code, idiomatic in the chosen language.
Repo structure      | 15%    | Logical module split (ingest, transform, output), requirements.txt or pyproject.toml, .gitignore, README with run instructions.
Written explanation | 20%    | README explains your approach, the trade-offs you made, what you would do with more time, and how you would deploy this in production.
Differentiator      | 15%    | One thing that goes beyond the spec: unit tests, a Makefile, a Docker image, a sample dashboard, performance benchmarks, an ADR document.

The Repo Structure That Wins

Every grader opens the README first, then the directory tree. If either of those artifacts is disorganized, the rest of the submission is read with skepticism. The structure below is what we have seen score 5/5 on repo structure across 12 different graders.

take-home-yourname/
├── README.md                # 5-min walkthrough, runs in <60 sec
├── Makefile                 # make install, make run, make test
├── pyproject.toml           # pinned versions, no requirements.txt
├── .gitignore               # __pycache__, .venv, data/
├── data/
│   ├── input/               # sample inputs from the prompt
│   └── output/              # expected outputs (gitignored)
├── src/
│   └── pipeline/
│       ├── __init__.py
│       ├── ingest.py        # source-to-raw
│       ├── transform.py     # raw-to-clean
│       ├── aggregate.py     # clean-to-mart
│       └── cli.py           # entrypoint, click or argparse
├── tests/
│   ├── conftest.py
│   ├── test_ingest.py
│   ├── test_transform.py
│   └── fixtures/
│       └── sample_events.json
└── docs/
    ├── design.md            # the architecture I would build
    └── adr-001-pandas.md    # why I chose pandas over Spark
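The cli.py entrypoint in the tree above can be very small. A minimal sketch, assuming hypothetical ingest/transform/aggregate modules with plain-function APIs (the module calls are shown as comments because their signatures depend on your pipeline):

```python
# cli.py -- minimal argparse entrypoint sketch (module names are illustrative)
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="pipeline",
        description="Run the take-home pipeline end to end.",
    )
    parser.add_argument("--input", required=True, help="path to the raw input file")
    parser.add_argument("--output", required=True, help="path to write the final output")
    parser.add_argument("--format", choices=["csv", "json"], default="csv",
                        help="input file format")
    return parser


def main(argv=None) -> None:
    args = build_parser().parse_args(argv)
    # Each stage is a plain function so it can be unit-tested without the CLI:
    # records = ingest.read(args.input, fmt=args.format)
    # clean = transform.clean(records)
    # aggregate.write(clean, args.output)
    print(f"pipeline: {args.input} -> {args.output} ({args.format})")


# Example invocation (in the real repo this would be `python -m pipeline.cli ...`):
main(["--input", "data/input/events.csv", "--output", "data/output/daily.csv"])
```

Keeping the parser in its own function makes the argument contract testable, and keeping each stage a plain function is what lets the tests/ directory exercise the pipeline without touching the CLI.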

The README Pattern That Wins

Five sections, in this order, no more. Every grader expects them. Skipping any one is a 1-point penalty on Written Explanation.

1. Quickstart (under 60 seconds to run)

make install && make run. If the grader cannot run your code in 60 seconds, you have already lost a point. Pin every dependency. Assume Python 3.11 only. Document the exact command that produces the output.
2. What I built

Three sentences. The data flow: source -> transformations -> sink. The grader knows the spec; this section confirms you understood it. Don't repeat the prompt.
3. Trade-offs

5 to 7 bullet points. 'I chose pandas over Spark because the dataset is small enough to fit in memory.' 'I deduplicated by event_id, not by composite key, because the spec said event_id is unique.' Each bullet is a decision you owned.
4. What I would do with more time

5 bullets. Specific. 'Add CDC ingestion via Debezium for real-time updates.' 'Replace the in-memory sort with an external merge sort for inputs over 100GB.' 'Add data quality checks via Great Expectations.' This is where graders look for senior signal.
5. How I would productionize this

Half a page. Where does this run (Airflow DAG, Kubernetes CronJob, AWS Glue job)? How does it get triggered? Where do logs go? What is the SLA? What gets paged when it breaks? Most candidates skip this section. Including it is the single biggest differentiator we have measured.
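The dedup decision quoted in the Trade-offs section above takes only a few lines to own. A vanilla-Python sketch, assuming rows arrive as dicts with an event_id key (field names are illustrative):

```python
def dedupe_by_event_id(rows):
    """Keep the first occurrence of each event_id, preserving input order.

    Chosen over a composite-key dedup because, in this sketch's assumed
    scenario, the spec declares event_id unique at the source.
    """
    seen = set()
    out = []
    for row in rows:
        key = row["event_id"]
        if key in seen:
            continue  # later duplicates are dropped, not merged
        seen.add(key)
        out.append(row)
    return out


rows = [
    {"event_id": 1, "v": "a"},
    {"event_id": 2, "v": "b"},
    {"event_id": 1, "v": "c"},  # duplicate: first occurrence wins
]
print(dedupe_by_event_id(rows))
```

A README bullet that names the key, the tie-breaking rule (first wins), and the reason is exactly the "decision you owned" shape graders reward.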

Five Patterns That Get You Rejected

1. Single 600-line script

Your code in one file labeled solution.py is the most common rejection signal. Even a small assignment should split into ingest, transform, output, and CLI modules. The split shows you think in pipelines, not in scripts.
2. No tests

At least three unit tests covering happy path, an edge case, and an error case. Take-homes without tests cap your score at the equivalent of L3, regardless of code quality. Tests are the cheapest +1 point you can earn.
3. No README, or a README that just says 'run main.py'

The README is graded equal to the code. A bare README signals you do not write for other engineers. The five-section pattern above is the minimum.
4. Spending 20+ hours when the spec said 4

Graders compare submissions for proportionality. A 20-hour over-built submission against a 4-hour spec signals you cannot scope. Worse, it makes the grader feel guilty about taking your time, which biases the evaluation against a hire.
5. Not handling edge cases the spec did not name

Empty input, malformed rows, duplicate keys, all-NULL columns. The spec will not list these. Graders check if you thought of them. Add defensive handling and document it in the README under Trade-offs.
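That defensive handling can stay compact. A sketch assuming newline-delimited JSON input and a hypothetical required user_id column (both assumptions, not part of any real prompt):

```python
import json


def parse_events(lines):
    """Parse NDJSON lines, skipping bad rows instead of crashing.

    Returns (events, skipped_count) so the skip rate can be logged
    and mentioned in the README under Trade-offs.
    """
    events, skipped = [], 0
    for line in lines:
        line = line.strip()
        if not line:  # empty input / blank lines
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            skipped += 1  # malformed row: count it, don't crash
            continue
        if record.get("user_id") is None:  # missing or all-NULL key column
            skipped += 1
            continue
        events.append(record)
    return events, skipped
```

Returning the skip count instead of silently dropping rows is the part graders notice: it turns "I handled bad data" into a measurable claim you can document.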

How the Take-Home Connects to the Rest of the Loop

The take-home is where the SQL round and the Python round meet in a single deliverable. It often replaces or augments the technical phone screen, and it always informs the onsite system design round, because the interviewer will ask "how would you scale this to 100x?". The patterns from the data modeling round show up directly in how you structure your output tables.

Companies with take-home-heavy loops: Airbnb's data engineer take-home is famously rigorous at 8 hours, and Stripe sometimes uses a take-home for senior roles. If you're targeting either, see real Data Engineer take-home assignment examples for annotated walkthroughs.

Data Engineer Interview Prep FAQ

How long should I actually spend on a 4-hour take-home?
4 to 6 hours. Spending roughly the stated time is a green flag. Spending 2x is a yellow flag (over-engineering); spending far less signals lack of effort. If the prompt says 4 hours, plan 4 hours of focused work plus 1 hour of README and testing.
Should I use pandas, Spark, or vanilla Python?
Match the tool to the data size in the prompt. Under 1 GB: pandas or vanilla Python. 1 to 10 GB: PySpark, ideally in local mode. Over 10 GB: PySpark, with the design rationale documented. Most take-homes intentionally give small data so you don't burn time on infra.
Should I add Docker?
It is a positive signal but not required. If you add it, make sure docker build && docker run works in under 5 minutes on a fresh machine. A broken Dockerfile is worse than no Dockerfile.
Do I need to write tests?
Yes. At least 3 unit tests. Pytest is the default. Tests prove you write production-quality code. Their absence caps your score regardless of how clever the solution is.
How do I handle a take-home where the spec is intentionally ambiguous?
Document your interpretation in the README under 'Assumptions I made'. Graders explicitly score how you handle ambiguity. The wrong move is asking the recruiter for clarification on every detail; the right move is reasonable interpretation plus explicit documentation.
Should I deploy the assignment, or is local-only OK?
Local-only is the standard expectation. Deploying is a positive signal only if the prompt explicitly invites it. Otherwise it looks like you did not read the prompt.
What if the take-home asks me to use a tool I haven't used before?
Learn it on the assignment. The grader expects you to acquire new tooling on the job; the take-home is a sample of that ability. Document the learning in the README under 'What I would do differently next time'.
Can I use AI tools to help with the take-home?
Most companies allow it but expect disclosure. Treat AI as a pair programmer for boilerplate, not as a designer. The README, the architecture decisions, and the trade-off analysis must be yours. Graders can usually tell when a candidate's reasoning level differs from their code level, and that is an instant downgrade.
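The three-test minimum the rubric keeps coming back to (happy path, edge case, error case) is smaller than most candidates assume. A pytest-style sketch, with a hypothetical helper under test so the file is self-contained (in a real submission the helper lives in src/ and is imported):

```python
# test_transform.py -- the three-test minimum: happy path, edge case, error case
# Plain assert-based functions; pytest discovers and runs them as-is.


def to_amount_cents(value):
    """Hypothetical helper under test: parse a dollar string into integer cents."""
    if value is None or value == "":
        raise ValueError("amount is required")
    return round(float(value) * 100)


def test_happy_path():
    assert to_amount_cents("12.34") == 1234


def test_edge_case_zero():
    assert to_amount_cents("0") == 0


def test_error_case_missing():
    try:
        to_amount_cents("")
    except ValueError:
        return  # expected: empty amounts must fail loudly
    raise AssertionError("expected ValueError for empty input")
```

Three focused tests like these take about fifteen minutes and, per the rubric above, are the cheapest point on the board.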

See Annotated Take-Home Examples

Real take-home prompts from Stripe, Airbnb, Databricks, and more, with example solutions and graded rubric breakdowns.


More Data Engineer Interview Prep Guides

Data Engineer Interview Prep: explore the full guide.

50+ guides covering every round, company, role, and technology in the data engineer interview loop. Grounded in 2,817 verified interview reports across 929 companies, collected from real candidates.
