Data Engineer Coding Practice

A DE coding interview tests 4 surfaces: SQL on a warehouse, Python in a notebook or script, PySpark on a cluster, and a pipeline design on a whiteboard. Practice that only covers SQL leaves you cold on the other 3. The 1,317 problems here split across all 4 surfaces, each scored against the real engine the surface targets.


Know the patterns before the interviewer asks them.

A SQL query in the same shape a screen would give you.
The diff against expected: where ties broke and what you missed.
Sandbox:
SELECT user_id,
       COUNT(*) AS sessions
FROM events
WHERE ts >= NOW() - INTERVAL '7 day'
GROUP BY user_id;
1,317 coding problems total
4 surfaces: SQL, Python, PySpark, design
76 named companies tagged
$0 forever, no signup

4 coding surfaces, with what each grader actually does

The shape of the surface drives the shape of the practice. SQL is scored on rows. Python is scored on test cases. PySpark is scored on DataFrame output. Design is scored on rubric dimensions.
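To make "scored on rows" concrete, here is a minimal sketch of a row-level comparison. The page does not say how the grader compares results, so treating results as unordered multisets of rows (unless the problem demands an ordering) is an assumption, and `rows_match` and `row_diff` are hypothetical names.

```python
from collections import Counter


def rows_match(actual, expected, ordered=False):
    """Compare two result sets row by row.

    Assumption: row order matters only when the problem asks for it,
    so the default compares the rows as multisets (duplicates count).
    """
    actual = [tuple(r) for r in actual]
    expected = [tuple(r) for r in expected]
    if ordered:
        return actual == expected
    return Counter(actual) == Counter(expected)


def row_diff(actual, expected):
    """Return (missing, unexpected) rows, respecting duplicate counts."""
    a = Counter(tuple(r) for r in actual)
    e = Counter(tuple(r) for r in expected)
    missing = list((e - a).elements())     # expected but not returned
    unexpected = list((a - e).elements())  # returned but not expected
    return missing, unexpected
```

A submission that returns the right rows in a different order passes under this sketch unless `ordered=True`; a submission that drops or duplicates a row fails, and `row_diff` shows which rows are off.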

SQL: 854 problems
Engine: Postgres 16
Grader: 10 random seeds per submission
Frequency: 95% of DE loops include >= 1 SQL round
WITH ranked AS (
  SELECT *, ROW_NUMBER() OVER (
    PARTITION BY user_id ORDER BY event_at DESC
  ) AS rn
  FROM events
)
SELECT * FROM ranked WHERE rn = 1;
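The latest-row-per-user query above can be tried without a warehouse. The sketch below runs the same CTE through Python's built-in `sqlite3` module, which is an assumption made for portability (the site itself targets Postgres 16, and window functions need SQLite 3.25 or newer). The `action` column, the sample rows, and the final `ORDER BY` are invented for illustration.

```python
import sqlite3

QUERY = """
WITH ranked AS (
  SELECT *, ROW_NUMBER() OVER (
    PARTITION BY user_id ORDER BY event_at DESC
  ) AS rn
  FROM events
)
SELECT user_id, event_at, action FROM ranked WHERE rn = 1 ORDER BY user_id;
"""


def latest_event_per_user(rows):
    """Load (user_id, event_at, action) rows and keep the newest per user."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE events (user_id INTEGER, event_at TEXT, action TEXT)"
        )
        conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


# Hypothetical sample data; ISO-8601 strings sort chronologically as text.
sample = [
    (1, "2024-01-01T09:00:00", "login"),
    (1, "2024-01-02T10:30:00", "purchase"),
    (2, "2024-01-01T08:15:00", "login"),
]
print(latest_event_per_user(sample))
# → [(1, '2024-01-02T10:30:00', 'purchase'), (2, '2024-01-01T08:15:00', 'login')]
```

If two events for one user share the same `event_at`, `ROW_NUMBER()` picks one of them arbitrarily; adding a tiebreaker column to the `ORDER BY` makes the result deterministic.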
Python: 388 problems
Engine: Python 3.11 sandbox
Grader: 5-15 test cases incl. perf budgets
Frequency: 78% include a Python coding round
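# check() is not defined here; it is assumed to return a list of
# error strings for one record (empty when the record is valid).
def validate(records, schema):
    clean, bad = [], []
    for i, r in enumerate(records):
        errs = check(r, schema)
        (bad if errs else clean).append(
            {"i": i, "errs": errs, "r": r}
            if errs else r)
    return clean, bad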
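The validator above calls a `check` helper that the page does not show. The sketch below fills it in with a hypothetical implementation so the pattern runs end to end; the schema format (field name mapped to a Python type) and the error messages are assumptions, not the site's reference solution.

```python
def check(record, schema):
    """Hypothetical helper: list the problems found in one record.

    Assumes `schema` maps each required field name to a Python type.
    """
    errs = []
    for field, expected_type in schema.items():
        if field not in record:
            errs.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errs.append(
                f"bad type for {field}: expected {expected_type.__name__}"
            )
    return errs


def validate(records, schema):
    """Split records into clean rows and structured error entries."""
    clean, bad = [], []
    for i, r in enumerate(records):
        errs = check(r, schema)
        if errs:
            # Keep the index and the raw record so bad rows can be replayed.
            bad.append({"i": i, "errs": errs, "r": r})
        else:
            clean.append(r)
    return clean, bad


schema = {"id": int, "email": str}
records = [
    {"id": 1, "email": "a@example.com"},
    {"id": "2", "email": "b@example.com"},  # wrong type for id
    {"id": 3},                              # missing email
]
clean, bad = validate(records, schema)
```

Here `clean` holds only the first record, and `bad` holds two entries whose `i` values (1 and 2) point back at the offending input positions.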
PySpark: 45 problems
Engine: PySpark 3.5
Grader: 10 seeded Parquet inputs
Frequency: 30% of loops at Spark shops
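from pyspark.sql import Window, functions as F

# Keep only the most recent event per user; df is an existing DataFrame.
w = Window.partitionBy("user_id") \
          .orderBy(F.desc("event_at"))
latest = df.withColumn("rn", F.row_number().over(w)) \
  .filter("rn = 1") \
  .drop("rn")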
Pipeline design: 30 problems
Engine: Interactive canvas
Grader: Rubric: SLA, cost, failure modes
Frequency: 52% of senior+ loops
[Source] -> [Kafka topic] -> [Flink job]
              ↓
        [DLQ topic]
                       ↓
                  [S3 bronze]
                       ↓
                 [dbt incremental]
                       ↓
               [Snowflake gold]

Where the prep time should go

Share of DE coding interview surface area by category. The most common mistake candidates make is over-investing in algorithm DSA.

SQL (windows, CTEs, aggregation): 38%
Python pipeline patterns: 24%
PySpark (Spark-shop companies): 16%
System / pipeline design: 14%
Data modeling exercises: 4%
Algorithm DSA (LeetCode-style): 4%
Algorithm DSA is roughly 4% of the surface area. That share is enough to cover the occasional LeetCode-flavored question that drifts into a DE round. Budgeting more prep time than that is a misallocation.
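The percentages above translate directly into an hours budget. A minimal sketch, assuming you split a fixed number of prep hours in proportion to surface area; `allocate_hours` is a hypothetical helper and the 60-hour total is only an example.

```python
# Surface-area shares as listed in the breakdown above (percent).
SURFACE_SHARE = {
    "SQL (windows, CTEs, aggregation)": 38,
    "Python pipeline patterns": 24,
    "PySpark (Spark-shop companies)": 16,
    "System / pipeline design": 14,
    "Data modeling exercises": 4,
    "Algorithm DSA (LeetCode-style)": 4,
}


def allocate_hours(total_hours, shares=SURFACE_SHARE):
    """Split total_hours proportionally to each category's share.

    Dividing by the sum of the shares (rather than by 100) keeps the
    split proportional even when a category is left out.
    """
    total_share = sum(shares.values())
    return {
        name: round(total_hours * pct / total_share, 1)
        for name, pct in shares.items()
    }


for name, hours in allocate_hours(60).items():
    print(f"{hours:5.1f} h  {name}")
```

With a 60-hour budget, SQL gets 22.8 hours and algorithm DSA gets 2.4. If your target is not a Spark shop, pass a copy of the shares without the PySpark entry and the remaining categories absorb its time proportionally.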

6-week DE coding prep plan

Calibrated to a candidate with working SQL/Python knowledge targeting a mid-to-senior DE role.

| Week | Surface focus | Daily volume | Target volume | Pass criteria |
| --- | --- | --- | --- | --- |
| Weeks 1-2 | SQL foundations + topic coverage | 60-90 min | 40-50 SQL problems across joins, GROUP BY, basic window functions | Solve any Easy in <5 min, any Medium in <15 min |
| Weeks 3-4 | SQL window functions + Python patterns | 90-120 min | 30 more SQL (window-heavy) + 25-30 Python (parsing, dedup, validation) | Top-N per group automatic. Write a structured-error validator from memory. |
| Week 5 | PySpark (if relevant) + pipeline design | 2 hr | 20-30 PySpark + 6-8 design canvas problems | Recognize broadcast vs sort-merge threshold. Pick batch vs streaming for a stated SLA. |
| Week 6 | Mocks + weak spots | 60-90 min + 1-2 mocks | Drill mode on weakest topic. 3-4 AI mock loops. | Pass 8 of 10 timed Mediums. Mock verdict consistent across runs. |

DE coding practice FAQ

Is this LeetCode for data engineers?
Functionally similar with a different catalog. LeetCode leans on algorithm puzzles (DSA) that account for ~4% of DE rounds. The catalog here is built around the patterns that account for the other 96%: SQL windows, Python parsing and validation, PySpark joins and skew, pipeline design. The time-allocation breakdown by surface area makes the argument concrete.
Do I need Docker or any local install?
No. SQL runs in a browser Postgres sandbox, Python runs in a browser Python sandbox, PySpark runs in a browser Spark sandbox. The pipeline canvas is in-browser. The first problem you start is also the first the grader judges; there's no setup phase.
What languages are supported?
SQL (Postgres 16 dialect with portability tags), Python 3.11, PySpark 3.5. Pipeline design uses a domain-specific canvas with named tools (Kafka, Flink, Spark, Snowflake, dbt, Airflow). Scala and Java aren't in the catalog because they're rarely tested in DE interviews.
How long does it take to get DE-interview-ready?
Phone-screen ready in 3-4 weeks at 5-7 hr/wk (~80 problems). Onsite-ready in 6-8 weeks at 8-10 hr/wk (~200 problems). FAANG / staff-level ready in 10-12 weeks with deeper PySpark and design coverage (~300 problems). A 6-week schedule is the standard mid-to-senior target.
Should I skip PySpark?
If your target isn't a Spark shop, yes. Snowflake-on-dbt companies and most analytics-leaning DE roles don't test PySpark. Check /companies for the actual tech stack of your target. If PySpark matters, plan 25-35 hours of focused practice; the learning curve is steeper than SQL.
Are the problems based on real interviews?
Yes. The catalog sources from interview write-ups across 76 named companies. Problems are paraphrased to remove identifying details but preserve the technical shape. Company tags appear when at least 1 write-up cited the company in that question shape; the tags are visible on the problem itself.
Why practice

  1. Active recall beats re-reading by 50%

    Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom.

  2. 76% of hiring managers reject on the coding task, not the resume

    From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice.

  3. Five problem shapes cover 80% of data engineer loops

    Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition.
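Of the five shapes listed above, sessionization is the one most worth writing by hand once. A minimal pure-Python sketch, assuming events arrive as `(user_id, timestamp)` pairs and that a gap of more than 30 minutes starts a new session; both the input shape and the 30-minute threshold are assumptions, and `sessionize` is a hypothetical name.

```python
from datetime import datetime, timedelta


def sessionize(events, gap=timedelta(minutes=30)):
    """Number each user's sessions in time order.

    A new session starts when the time since that user's previous
    event is strictly greater than `gap`.
    """
    out = []
    last_seen = {}  # user_id -> (previous timestamp, current session number)
    for user_id, ts in sorted(events):  # sort by user, then by time
        prev = last_seen.get(user_id)
        if prev is None:
            session = 1                 # first event for this user
        elif ts - prev[0] > gap:
            session = prev[1] + 1       # gap exceeded: open a new session
        else:
            session = prev[1]           # still inside the current session
        last_seen[user_id] = (ts, session)
        out.append((user_id, ts, session))
    return out


# Hypothetical sample events.
events = [
    ("u1", datetime(2024, 1, 1, 9, 0)),
    ("u1", datetime(2024, 1, 1, 10, 0)),
    ("u2", datetime(2024, 1, 1, 9, 5)),
    ("u1", datetime(2024, 1, 1, 9, 10)),
]
sessions = [s for _, _, s in sessionize(events)]
print(sessions)
# → [1, 1, 2, 1]
```

The same logic in SQL is usually written with `LAG` to find the gap and a running `SUM` over a new-session flag; the loop here mirrors that shape one row at a time.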
