Data Engineer Coding Practice

Q: Is this LeetCode for data engineers?

Functionally similar with a different catalog. LeetCode leans on algorithm puzzles (DSA) that account for ~4% of DE rounds. The catalog here is built around the patterns that account for the other 96%: SQL windows, Python parsing and validation, PySpark joins and skew, pipeline design. The time-allocation breakdown by surface area makes the argument concrete.

Q: Do I need Docker or any local install?

No. SQL runs in a browser Postgres sandbox, Python runs in a browser Python sandbox, PySpark runs in a browser Spark sandbox. The pipeline canvas is in-browser. The first problem you start is also the first the grader judges; there's no setup phase.

Q: What languages are supported?

SQL (Postgres 16 dialect with portability tags), Python 3.11, PySpark 3.5. Pipeline design uses a domain-specific canvas with named tools (Kafka, Flink, Spark, Snowflake, dbt, Airflow). Scala and Java aren't in the catalog because they're rarely tested in DE interviews.

Q: How long does it take to get DE-interview-ready?

Phone-screen ready in 3-4 weeks at 5-7 hr/wk (~80 problems). Onsite-ready in 6-8 weeks at 8-10 hr/wk (~200 problems). FAANG / staff-level ready in 10-12 weeks with deeper PySpark and design coverage (~300 problems). A 6-week schedule is the standard mid-to-senior target.

Q: Should I skip PySpark?

If your target isn't a Spark shop, yes. Snowflake-on-dbt companies and most analytics-leaning DE roles don't test PySpark. Check /companies for the actual tech stack of your target. If PySpark matters, plan 25-35 hours of focused practice; the learning curve is steeper than SQL.

Q: Are the problems based on real interviews?

Yes. The catalog sources from interview write-ups across 76 named companies. Problems are paraphrased to remove identifying details but preserve the technical shape. Company tags appear when at least 1 write-up cited the company in that question shape; the tags are visible on the problem itself.

A DE coding interview tests 4 surfaces: SQL on a warehouse, Python in a notebook or script, PySpark on a cluster, and a pipeline design on a whiteboard. Practice that only covers SQL leaves you cold on the other 3. The 1,317 problems here split across all 4 surfaces, each scored against the real engine the surface targets.


1,317 coding problems total
4 surfaces: SQL, Python, PySpark, design
76 named companies tagged
Forever, no signup

4 coding surfaces, with what each grader actually does

The shape of the surface drives the shape of the practice. SQL is scored on rows. Python is scored on test cases. PySpark is scored on DataFrame output. Design is scored on rubric dimensions.

SQL: 854 problems
Engine: Postgres 16
Grader: 10 random seeds per submission
Frequency: 95% of DE loops include at least one SQL round

-- Latest event per user: rank each user's events newest-first, keep rank 1
WITH ranked AS (
  SELECT *, ROW_NUMBER() OVER (
    PARTITION BY user_id ORDER BY event_at DESC
  ) AS rn
  FROM events
)
SELECT * FROM ranked WHERE rn = 1;
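
One way to read "scored on rows" together with "10 random seeds per submission" is that the grader regenerates the input for each seed and compares the submitted rows to a reference result. The sketch below is an assumption about how such a check could work, not the site's actual grader. It uses Python's built-in sqlite3 (3.25 or newer, for window functions) as a stand-in for Postgres 16, and make_events, run_query, and grade are hypothetical names.

```python
import random
import sqlite3
from collections import Counter

# Reference answer: latest event per user, same pattern as the query above.
REFERENCE = """
WITH ranked AS (
  SELECT user_id, event_at, ROW_NUMBER() OVER (
    PARTITION BY user_id ORDER BY event_at DESC
  ) AS rn
  FROM events
)
SELECT user_id, event_at FROM ranked WHERE rn = 1
"""

def make_events(seed, n=50, users=8):
    """Hypothetical generator: n events with distinct timestamps, so there are no ties."""
    rng = random.Random(seed)
    stamps = rng.sample(range(10_000), n)  # sample() returns distinct values
    return [(rng.randrange(users), ts) for ts in stamps]

def run_query(sql, rows):
    """Load rows into a fresh in-memory table and return the result as a multiset."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE events (user_id INTEGER, event_at INTEGER)")
    con.executemany("INSERT INTO events VALUES (?, ?)", rows)
    result = Counter(con.execute(sql).fetchall())  # row order is ignored
    con.close()
    return result

def grade(submission_sql, seeds=range(10)):
    """Pass only if the submission matches the reference rows on every seed."""
    return all(
        run_query(submission_sql, make_events(s)) == run_query(REFERENCE, make_events(s))
        for s in seeds
    )

# An equivalent rewrite passes; a query that picks the oldest event does not.
good = "SELECT user_id, MAX(event_at) FROM events GROUP BY user_id"
bad = "SELECT user_id, MIN(event_at) FROM events GROUP BY user_id"
print(grade(good), grade(bad))
```

Comparing multisets rather than ordered lists means a correct query is not penalized for row order unless the problem asks for one.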

Python: 388 problems
Engine: Python 3.11 sandbox
Grader: 5-15 test cases, including performance budgets
Frequency: 78% include a Python coding round

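def validate(records, schema):
    """Split records into clean rows and structured error entries."""
    clean, bad = [], []
    for i, r in enumerate(records):
        errs = check(r, schema)  # check() is assumed to return a list of error strings
        if errs:
            bad.append({"i": i, "errs": errs, "r": r})
        else:
            clean.append(r)
    return clean, bad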
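
The validator above calls a check helper that it does not define. Here is a minimal sketch of one possible check, assuming the schema is a dict that maps field names to expected Python types. That schema format is an assumption for illustration; the page does not specify one.

```python
def check(record, schema):
    """Hypothetical helper: return a list of error strings for one record."""
    errs = []
    for field, expected in schema.items():
        if field not in record:
            errs.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            errs.append(f"{field}: expected {expected.__name__}, "
                        f"got {type(record[field]).__name__}")
    return errs

def validate(records, schema):
    """Split records into clean rows and structured error entries."""
    clean, bad = [], []
    for i, r in enumerate(records):
        errs = check(r, schema)
        if errs:
            # Keep the index, the reasons, and the original record together.
            bad.append({"i": i, "errs": errs, "r": r})
        else:
            clean.append(r)
    return clean, bad

schema = {"user_id": int, "email": str}
records = [
    {"user_id": 1, "email": "a@example.com"},
    {"user_id": "2", "email": "b@example.com"},  # wrong type
    {"email": "c@example.com"},                  # missing field
]
clean, bad = validate(records, schema)
print(len(clean), [(b["i"], b["errs"]) for b in bad])
```

Returning every error per record, instead of stopping at the first, is what makes the output "structured": the caller can report or quarantine bad rows without re-running the check.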

PySpark: 45 problems
Engine: PySpark 3.5
Grader: 10 seeded Parquet inputs
Frequency: 30% of loops at Spark shops

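from pyspark.sql import Window, functions as F

# Keep the latest event per user (df is an existing DataFrame)
w = Window.partitionBy("user_id").orderBy(F.desc("event_at"))
latest = (
    df.withColumn("rn", F.row_number().over(w))
      .filter("rn = 1")
      .drop("rn")
)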

Pipeline design: 30 problems
Engine: Interactive canvas
Grader: Rubric (SLA, cost, failure modes)
Frequency: 52% of senior+ loops

[Source] -> [Kafka topic] -> [Flink job]
              ↓
        [DLQ topic]
                       ↓
                  [S3 bronze]
                       ↓
                 [dbt incremental]
                       ↓
               [Snowflake gold]
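
The dead-letter branch in the diagram is easy to draw and easy to leave vague. Here is a minimal sketch of the routing decision, assuming the DLQ receives records the stream job cannot parse while good records continue to the bronze layer. Plain Python lists stand in for the Kafka topics and S3, and parse_event and route are hypothetical names.

```python
import json

def parse_event(raw):
    """Parse one raw message; raise ValueError on malformed input."""
    event = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(event, dict) or "user_id" not in event:
        raise ValueError("missing user_id")
    return event

def route(messages):
    """Send parseable events to bronze and everything else to the DLQ."""
    bronze, dlq = [], []
    for offset, raw in enumerate(messages):
        try:
            bronze.append(parse_event(raw))
        except ValueError as exc:
            # Keep the raw payload and the reason so the record can be replayed.
            dlq.append({"offset": offset, "raw": raw, "error": str(exc)})
    return bronze, dlq

messages = ['{"user_id": 1, "event": "click"}', "not json", '{"event": "view"}']
bronze, dlq = route(messages)
print(len(bronze), [d["offset"] for d in dlq])
```

Storing the offset and the untouched payload alongside the error is the design choice that matters: it lets a later job replay the DLQ after the parser is fixed.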

Where the prep time should go

Share of DE coding interview surface area by category. The most common mistake is over-investing in LeetCode-style algorithm puzzles (DSA).

SQL (windows, CTEs, aggregation): 38%
Python pipeline patterns: 24%
PySpark (Spark-shop companies): 16%
System / pipeline design: 14%
Data modeling exercises
Algorithm DSA (LeetCode-style): roughly 4%

Algorithm DSA is roughly 4%. That share is enough to get you through the occasional LeetCode-flavored question that drifts into a DE round. Budgeting more than that is a misallocation.
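
To turn the percentages into a schedule, multiply each share by your total hour budget. This small sketch uses only the shares stated in this section. The 60-hour budget is an illustrative assumption, and data modeling is omitted because no figure is given for it.

```python
# Shares of interview surface area, in percent, as stated in this section.
SHARES = {
    "SQL": 38,
    "Python": 24,
    "PySpark": 16,
    "Pipeline design": 14,
    "Algorithm DSA": 4,
}

def allocate(total_hours, shares=SHARES):
    """Return hours per category, proportional to each category's percentage."""
    return {name: round(total_hours * pct / 100, 1) for name, pct in shares.items()}

plan = allocate(60)  # 60 hours is an illustrative budget, not a recommendation
for name, hours in plan.items():
    print(f"{name}: {hours} h")
```

If your target is not a Spark shop, one option is to drop the PySpark entry and spread its hours across the remaining categories.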

6-week DE coding prep plan

Calibrated to a candidate with working SQL/Python knowledge targeting a mid-to-senior DE role.

Week	Surface focus	Daily time	Target volume	Pass criteria
Weeks 1-2	SQL foundations + topic coverage	60-90 min	40-50 SQL problems across joins, GROUP BY, basic window functions	Solve any Easy in <5 min, any Medium in <15 min
Weeks 3-4	SQL window functions + Python patterns	90-120 min	30 more SQL (window-heavy) + 25-30 Python (parsing, dedup, validation)	Top-N per group automatic. Write a structured-error validator from memory.
Week 5	PySpark (if relevant) + pipeline design	2 hr	20-30 PySpark + 6-8 design canvas problems	Recognize broadcast vs sort-merge threshold. Pick batch vs streaming for a stated SLA.
Week 6	Mocks + weak spots	60-90 min + 1-2 mocks	Drill mode on weakest topic. 3-4 AI mock loops.	Pass 8 of 10 timed Mediums. Mock verdict consistent across runs.

Why practice

Start week 1, day 1

01. Active recall beats re-reading by 50%
Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom.
02. 76% of hiring managers reject on the coding task, not the resume
From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice.
03. Five problem shapes cover 80% of data engineer loops
Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition.
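
Of the shapes listed above, sessionization benefits most from being written out by hand. This minimal sketch in plain Python assumes a session ends after 30 minutes of inactivity (a common convention, not something this page specifies) and that events arrive as (user_id, timestamp) pairs. The name sessionize is hypothetical.

```python
from datetime import datetime, timedelta
from itertools import groupby

def sessionize(events, gap=timedelta(minutes=30)):
    """Assign a per-user session number; a new session starts after `gap` of inactivity."""
    out = []
    ordered = sorted(events)  # sort by user_id, then timestamp
    for user, rows in groupby(ordered, key=lambda e: e[0]):
        session, prev = 0, None
        for _, ts in rows:
            # First event for the user, or too long since the previous one.
            if prev is None or ts - prev > gap:
                session += 1
            out.append((user, ts, session))
            prev = ts
    return out

t0 = datetime(2024, 1, 1, 9, 0)
events = [
    ("u1", t0),
    ("u1", t0 + timedelta(minutes=10)),
    ("u1", t0 + timedelta(minutes=55)),  # 45 minutes after the previous event
    ("u2", t0 + timedelta(minutes=5)),
]
for row in sessionize(events):
    print(row)
```

The same logic maps onto SQL as a LAG over each user's events followed by a running sum of the "new session" flags.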


Adjacent practice

DE Coding Interview Mode: AI mock interviewer across SQL, Python, PySpark, and design, with verdicts.

Full DE Loop Simulator: Loop simulator including data modeling and behavioral rounds.

DE Interview Prep Hub: Round-by-round breakdown of what each DE interview surface tests.