Data Engineer Coding Practice
A DE coding interview tests 4 surfaces: SQL on a warehouse, Python in a notebook or script, PySpark on a cluster, and a pipeline design on a whiteboard. Practice that only covers SQL leaves you cold on the other 3. The 1,317 problems here split across all 4 surfaces, each scored against the real engine the surface targets.
Know the patterns before the interviewer asks them.
4 coding surfaces, with what each grader actually does
The shape of the surface drives the shape of the practice. SQL is scored on rows. Python is scored on test cases. PySpark is scored on DataFrame output. Design is scored on rubric dimensions.
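As an illustration of what "scored on rows" can mean in practice, here is a minimal sketch of an order-insensitive row comparison. It is not the site's actual grader, which the page does not describe; the function name `rows_match` and the `ordered` flag are hypothetical, and rows are assumed to arrive as tuples of hashable values.

```python
from collections import Counter


def rows_match(expected, actual, ordered=False):
    """Compare two query results given as lists of row tuples.

    Hypothetical helper: assumes every row is a tuple of hashable values.
    """
    if ordered:
        # For queries with a required ORDER BY, position matters.
        return list(expected) == list(actual)
    # Otherwise treat each result as a multiset of rows: duplicate rows
    # still count, but row order does not.
    return Counter(map(tuple, expected)) == Counter(map(tuple, actual))


expected = [(1, "a"), (2, "b"), (2, "b")]
same_rows_reordered = [(2, "b"), (1, "a"), (2, "b")]
missing_duplicate = [(1, "a"), (2, "b")]

print(rows_match(expected, same_rows_reordered))                # True
print(rows_match(expected, missing_duplicate))                  # False
print(rows_match(expected, same_rows_reordered, ordered=True))  # False
```

Comparing multisets rather than sets is a deliberate choice in this sketch: a query that accidentally drops or multiplies duplicate rows should not pass.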
SQL:

    WITH ranked AS (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY user_id ORDER BY event_at DESC
        ) AS rn
        FROM events
    )
    SELECT * FROM ranked WHERE rn = 1;

Python:

    def validate(records, schema):
        # check() is not defined in this snippet; it is assumed to
        # return a list of errors for one record (empty if valid).
        clean, bad = [], []
        for i, r in enumerate(records):
            errs = check(r, schema)
            (bad if errs else clean).append(
                {"i": i, "errs": errs, "r": r}
                if errs else r)
        return clean, bad

PySpark:

    from pyspark.sql import Window, functions as F

    # Latest event per user: rank within each user_id, keep rank 1.
    w = Window.partitionBy("user_id") \
        .orderBy(F.desc("event_at"))
    df.withColumn("rn", F.row_number().over(w)) \
        .filter("rn = 1") \
        .drop("rn")

Design:

    [Source] -> [Kafka topic] -> [Flink job]
    ↓
    [DLQ topic]
    ↓
    [S3 bronze]
    ↓
    [dbt incremental]
    ↓
    [Snowflake gold]

Where the prep time should go
Chart: share (%) of DE coding interview surface area by category. The most common mistake candidates make is over-investing in algorithm-style DSA (data structures and algorithms) problems.
6-week DE coding prep plan
Calibrated to a candidate with working SQL/Python knowledge targeting a mid-to-senior DE role.
| Week | Surface focus | Daily time | Target volume | Pass criteria |
|---|---|---|---|---|
| Weeks 1-2 | SQL foundations + topic coverage | 60-90 min | 40-50 SQL problems across joins, GROUP BY, basic window functions | Solve any Easy in <5 min, any Medium in <15 min |
| Weeks 3-4 | SQL window functions + Python patterns | 90-120 min | 30 more SQL (window-heavy) + 25-30 Python (parsing, dedup, validation) | Top-N per group automatic. Write a structured-error validator from memory. |
| Week 5 | PySpark (if relevant) + pipeline design | 2 hr | 20-30 PySpark + 6-8 design canvas problems | Recognize the broadcast vs sort-merge join threshold. Pick batch vs streaming for a stated SLA. |
| Week 6 | Mocks + weak spots | 60-90 min + 1-2 mocks | Drill mode on weakest topic. 3-4 AI mock loops. | Pass 8 of 10 timed Mediums. Mock verdict consistent across runs. |
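The weeks 3-4 pass criterion asks for a structured-error validator written from memory. A minimal sketch of one possible shape follows. The schema format (a dict mapping field name to expected Python type), the error keys `field`, `error`, `expected`, and `got`, and the sample records are all assumptions made for illustration, not a format the page specifies.

```python
def check(record, schema):
    """Return a list of structured errors for one record (empty if valid).

    Assumption: schema maps field name -> expected Python type.
    """
    errs = []
    for field, expected_type in schema.items():
        if field not in record:
            errs.append({"field": field, "error": "missing"})
        elif not isinstance(record[field], expected_type):
            errs.append({
                "field": field,
                "error": "wrong_type",
                "expected": expected_type.__name__,
                "got": type(record[field]).__name__,
            })
    return errs


def validate(records, schema):
    """Split records into clean rows and rejected rows with their errors."""
    clean, bad = [], []
    for i, record in enumerate(records):
        errs = check(record, schema)
        if errs:
            # Keep the row index and the raw record so rejects can be replayed.
            bad.append({"i": i, "errs": errs, "r": record})
        else:
            clean.append(record)
    return clean, bad


schema = {"user_id": int, "email": str}
records = [
    {"user_id": 1, "email": "a@example.com"},
    {"user_id": "2", "email": "b@example.com"},  # user_id has the wrong type
    {"email": "c@example.com"},                  # user_id is missing
]
clean, bad = validate(records, schema)
```

Returning errors as dicts rather than strings keeps them machine-readable, so rejected rows can be grouped by field or error kind downstream.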
DE coding practice FAQ
Is this LeetCode for data engineers?
Do I need Docker or any local install?
What languages are supported?
How long does it take to get DE-interview-ready?
Should I skip PySpark?
Are the problems based on real interviews?
Start week 1, day 1
- 01. Active recall beats re-reading by 50%
  Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom.
- 02. 76% of hiring managers reject on the coding task, not the resume
  From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice.
- 03. Five problem shapes cover 80% of data engineer loops
  Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition.
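Sessionization is one of the five shapes named above. As a rough sketch of writing it by hand in plain Python: the 30-minute inactivity gap, the `(user_id, timestamp)` event layout, and the function name `sessionize` are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def sessionize(events, gap=timedelta(minutes=30)):
    """Assign a per-user session number to (user_id, timestamp) events.

    A new session starts when the time since the user's previous event
    exceeds `gap`. Returns (user_id, timestamp, session_no) tuples,
    sorted by user and then by time.
    """
    out = []
    last_seen = {}   # user_id -> timestamp of that user's previous event
    session_no = {}  # user_id -> current session number
    for user_id, ts in sorted(events):
        prev = last_seen.get(user_id)
        if prev is None or ts - prev > gap:
            session_no[user_id] = session_no.get(user_id, 0) + 1
        last_seen[user_id] = ts
        out.append((user_id, ts, session_no[user_id]))
    return out


t0 = datetime(2024, 1, 1, 9, 0)
events = [
    ("u1", t0 + timedelta(minutes=50)),  # 40 min after u1's previous event
    ("u1", t0),
    ("u2", t0 + timedelta(minutes=5)),
    ("u1", t0 + timedelta(minutes=10)),
]
sessions = sessionize(events)
```

The same logic in SQL is usually written with `LAG` to find the previous event and a running `SUM` over a new-session flag; the loop above makes the state those window functions carry explicit.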