Python Coding Practice for Data Engineering Interviews

Python for data engineering looks different from Python for software engineering. A DE interview wants to see you parse a malformed CSV without crashing, deduplicate an event stream by composite key, validate records with field-level errors, and walk a nested JSON document. The 388 problems in this catalog run in a real Python 3.11 sandbox in the browser; each ships with public and hidden test cases.
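As a rough illustration of the first and third tasks, the sketch below parses CSV text and collects field-level errors instead of raising. The column names (`order_id`, `amount`), the `parse_orders` helper, and the error format are all assumptions made for this example, not part of any catalog problem.

```python
import csv
import io

def parse_orders(text):
    """Parse CSV text; return (good_rows, errors) instead of raising."""
    good, errors = [], []
    reader = csv.DictReader(io.StringIO(text))
    for row in reader:
        problems = {}
        # DictReader stores surplus columns under the None key and fills
        # missing columns with None, so both cases are detectable.
        if None in row:
            problems["_row"] = "too many columns"
        if not row.get("order_id"):
            problems["order_id"] = "missing"
        try:
            amount = float(row.get("amount") or "")
        except ValueError:
            problems["amount"] = f"not a number: {row.get('amount')!r}"
        if problems:
            # line_num is the physical line the reader has consumed so far.
            errors.append({"line": reader.line_num, "errors": problems})
        else:
            good.append({"order_id": row["order_id"], "amount": amount})
    return good, errors

# Hypothetical input: one clean row followed by four malformed ones.
sample = "order_id,amount\nA1,19.99\n,5.00\nA3,abc\nA4\nA5,1.00,extra\n"
good, errors = parse_orders(sample)
for err in errors:
    print(err["line"], err["errors"])
```

Returning the errors alongside the good rows lets the caller decide whether to quarantine bad records or fail the whole load.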

Know the patterns before the interviewer asks them.

Evaluated Python problems: 388
Sandbox runtime: Python 3.11
Tests per problem (public + hidden): 5-15
Setup steps: 0

The shape of DE Python vs. SWE Python

Two problems with similar surface area that test very different things.

SWE-style (LeetCode), 4% of DE rounds
def longest_substring_without_repeat(s: str) -> int:
    seen = {}
    start = best = 0
    for i, ch in enumerate(s):
        if ch in seen and seen[ch] >= start:
            start = seen[ch] + 1
        seen[ch] = i
        best = max(best, i - start + 1)
    return best

Algorithm on a toy input. Pointer state, integer counting, complexity reasoning. Necessary for SWE; rarely shows up in DE rounds.

DE-style, 90%+ of DE rounds
def dedup_events(events: list[dict]) -> list[dict]:
    latest = {}
    for ev in events:
        key = (ev["user_id"], ev["event_type"])
        cur = latest.get(key)
        if (cur is None or
            ev["event_time"] > cur["event_time"] or
            (ev["event_time"] == cur["event_time"]
             and ev["event_id"] > cur["event_id"])):
            latest[key] = ev
    return list(latest.values())

A dict as the keyed store, a composite key, a tiebreaker on equal timestamps, and a list-of-dicts return shape. The same skill lands in production code.
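One way to shorten the tiebreaker is to lean on tuple comparison, which checks elements left to right. The sketch below is an equivalent variant of the function above, not a replacement for it; the name `dedup_events_by_tuple` and the sample events are made up for illustration, and it assumes `event_time` values are directly comparable (for example, ISO-8601 strings that all use the same UTC offset).

```python
def dedup_events_by_tuple(events):
    """Keep the latest event per (user_id, event_type); ties go to the higher event_id."""
    latest = {}
    for ev in events:
        key = (ev["user_id"], ev["event_type"])
        cur = latest.get(key)
        # Tuples compare element by element: event_time first, then event_id.
        if cur is None or (ev["event_time"], ev["event_id"]) > (cur["event_time"], cur["event_id"]):
            latest[key] = ev
    return list(latest.values())

# Hypothetical events: a timestamp tie, a second key, and a late-arriving older event.
sample_events = [
    {"user_id": "u1", "event_type": "click", "event_time": "2026-01-01T10:00:00Z", "event_id": 881},
    {"user_id": "u1", "event_type": "click", "event_time": "2026-01-01T10:00:00Z", "event_id": 882},
    {"user_id": "u2", "event_type": "view", "event_time": "2026-01-01T09:00:00Z", "event_id": 5},
    {"user_id": "u1", "event_type": "click", "event_time": "2026-01-01T09:59:00Z", "event_id": 900},
]
deduped = dedup_events_by_tuple(sample_events)
print([ev["event_id"] for ev in deduped])
```

The late-arriving event with id 900 has an older timestamp, so it does not displace event 882 even though it appears later in the list.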

Catalog topic distribution

388 problems sorted by what they teach. Algorithm DSA is the smallest slice on purpose.

Topic share, 388 problems
Data transformation: 112 (29%)
Dict and set operations: 62 (16%)
File parsing / IO: 45 (12%)
String manipulation: 38 (10%)
ETL flow control: 32 (8%)
Error handling: 28 (7%)
Date and time: 24 (6%)
OOP and context managers: 22 (6%)
Generators and lazy iteration: 15 (4%)
Algorithms (rare): 10 (2%)

How the test runner shows up after submit

# Schema for the prompt:
#   events:  list[dict] with keys user_id, event_type, event_time, event_id
#   returns: list[dict], 1 per (user_id, event_type), most recent
#
# Public tests (visible to you):
#   test_empty_input
#   test_single_event
#   test_no_duplicates
#   test_basic_dedup
#
# Hidden tests (revealed only after passing public):
#   test_timestamp_ties_use_event_id
#   test_late_arriving_event
#   test_100k_events_under_1_second
#   test_unicode_user_ids
#   test_event_time_at_dst_boundary

# Submission output format:

submit @ 2026-05-26T16:42:18Z
  test_empty_input             PASS  0.4 ms
  test_single_event            PASS  0.5 ms
  test_no_duplicates           PASS  0.8 ms
  test_basic_dedup             PASS  1.2 ms
  test_timestamp_ties...       FAIL  expected event_id 882, got 881
                               (your code returns the first event seen on a tie)
  test_late_arriving_event     PASS  2.1 ms
  test_100k_events...          FAIL  exceeded 1000 ms (your impl is O(n^2))
  test_unicode_user_ids        PASS  0.9 ms
  test_event_time_at_dst...    PASS  1.4 ms

verdict: 7/9 pass. fix the tiebreaker and the quadratic loop. resubmit.

Public tests are visible up front; hidden tests are revealed after the public ones pass. Performance budgets are explicit.
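To make the performance failure above concrete, here is a minimal sketch that contrasts a quadratic dedup with a linear one and times each against a budget. The `run_with_budget` helper is hypothetical and is not the site's actual test harness; the synthetic events use integer timestamps purely for brevity.

```python
import time

def _newer(a, b):
    """True if event a should replace event b (later time, then higher id)."""
    return (a["event_time"], a["event_id"]) > (b["event_time"], b["event_id"])

def dedup_quadratic(events):
    """Rescans the kept list for every event: O(n^2) when most keys are unique."""
    kept = []
    for ev in events:
        for i, cur in enumerate(kept):
            if (cur["user_id"], cur["event_type"]) == (ev["user_id"], ev["event_type"]):
                if _newer(ev, cur):
                    kept[i] = ev
                break
        else:
            # No event with this key has been kept yet.
            kept.append(ev)
    return kept

def dedup_linear(events):
    """One dict lookup per event: O(n)."""
    latest = {}
    for ev in events:
        key = (ev["user_id"], ev["event_type"])
        if key not in latest or _newer(ev, latest[key]):
            latest[key] = ev
    return list(latest.values())

def run_with_budget(func, events, budget_ms):
    """Return (result, elapsed_ms, within_budget) for one timed call."""
    start = time.perf_counter()
    result = func(events)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms, elapsed_ms <= budget_ms

# Synthetic input: 2,000 events spread over 1,000 keys.
events = [
    {"user_id": f"u{i % 1000}", "event_type": "click", "event_time": i, "event_id": i}
    for i in range(2000)
]
slow, slow_ms, _ = run_with_budget(dedup_quadratic, events, budget_ms=1000)
fast, fast_ms, _ = run_with_budget(dedup_linear, events, budget_ms=1000)
print(f"quadratic: {slow_ms:.1f} ms, linear: {fast_ms:.1f} ms, same output: {slow == fast}")
```

Both functions return the same events in the same order; only the cost of finding the existing entry for a key differs, and that gap widens as the input grows.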

Where to actually practice Python for DE interviews

Pricing reflects May 2026 public tiers. 'DE share' is a rough estimate of how much of each catalog matches the interview shapes the topic chart describes.

| Site | Catalog | Test runner | DE-shaped share | Performance tests | Free tier |
|---|---|---|---|---|---|
| DataDriven (this site) | 388 problems, all free | Real Python 3.11, 5-15 tests per problem | 100% DE-shaped | Yes, with time budgets | Yes, no signup |
| LeetCode | ~2400 problems, ~30% free | Real Python, fixed test cases | Maybe 5% DE-shaped | Yes | Easy + slice of Medium |
| HackerRank Python | ~125 problems | Real Python, fixed tests | Maybe 10% | Limited | Most free |
| PYnative | 630+ exercises | Self-check via solution | Maybe 15% (some pandas) | No | Free |
| Exercism Python | 146 exercises | Mentor review + tests | ~10% | No | Free |

Python coding practice FAQ

What kind of Python do data engineers actually write?
Dict and set work, file I/O, string and JSON parsing, validation with descriptive errors, pipeline composition (router, batcher, retry decorator). Algorithm DSA is a small share of the surface area. The topic chart shows the distribution drawn from real DE interview write-ups.
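Two of the pipeline pieces named above, a batcher and a retry decorator, can be sketched in a few lines. The names `batched`, `retry`, and `flaky_load` and the parameters `times`, `delay`, and `exceptions` are choices made for this example. `itertools.batched` only arrived in Python 3.12, so on a 3.11 runtime a hand-written version like this one is needed.

```python
import functools
import time
from itertools import islice

def batched(iterable, size):
    """Yield lists of up to `size` items; the last batch may be shorter."""
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch

def retry(times=3, delay=0.0, exceptions=(Exception,)):
    """Re-run the wrapped function on the listed exceptions, up to `times` attempts."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, times + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == times:
                        raise  # out of attempts: surface the last error
                    time.sleep(delay)
        return wrapper
    return decorator

# Hypothetical loader that fails once, then succeeds.
calls = {"n": 0}

@retry(times=3, exceptions=(ConnectionError,))
def flaky_load(batch):
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("transient failure")
    return len(batch)

loaded = [flaky_load(b) for b in batched(range(7), 3)]
print(loaded, calls["n"])
```

Restricting `exceptions` to transient error types keeps the decorator from hiding genuine bugs such as a `KeyError` in the transform step.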
Should I prep with LeetCode for a data engineer interview?
Largely no. Roughly 4% of DE interview Python rounds resemble LeetCode-style algorithm puzzles. The other 96% are pipeline-shaped. If you're also interviewing for SWE or ML roles, LeetCode is useful; for pure DE rounds, the time goes further on DE-specific practice.
Do I need to know pandas or PySpark?
Depends on the company. Pandas appears across most DE interviews as a generic library question. PySpark dominates at Spark-first companies (Databricks, Netflix, Uber, Airbnb). Polars is rare in interviews but signals fluency. The catalog covers both pandas and PySpark; check the company guides at /companies for what each one tests.
How many Python problems should I solve before an interview?
30-50 problems for a phone screen, 80-120 for an onsite, 150+ for FAANG-level loops. Distribute them across the high-frequency topics (data transformation, dict/set, file parsing, error handling). Quality matters more than count; one problem you've solved, debugged, and re-derived a week later beats five you skimmed.
How does the test runner handle hidden test cases?
Public tests are visible in the problem statement so you can read the shape of the input and the expected output. Hidden tests stay hidden until you pass the public ones, the way most company evaluators work. The hidden bucket usually includes empty input, edge data (Unicode, timezone, duplicates), and a performance test with a time budget.
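As one guess at what a timezone edge case might probe, the sketch below shows two timestamps around a fall-back transition where string order and chronological order disagree. The timestamps and the `to_utc` helper are invented for illustration; the actual hidden tests are not shown here.

```python
from datetime import datetime, timezone

def to_utc(ts):
    """Parse an ISO-8601 timestamp with an explicit offset and convert it to UTC."""
    dt = datetime.fromisoformat(ts)
    if dt.tzinfo is None:
        raise ValueError(f"timestamp has no UTC offset: {ts!r}")
    return dt.astimezone(timezone.utc)

# Hypothetical wall-clock times on either side of a one-hour fall-back.
before_fallback = "2026-11-01T01:30:00-04:00"  # 05:30 UTC
after_fallback = "2026-11-01T01:10:00-05:00"   # 06:10 UTC

latest_by_string = max(before_fallback, after_fallback)
latest_by_instant = max(before_fallback, after_fallback, key=to_utc)

print("string comparison picks: ", latest_by_string)
print("instant comparison picks:", latest_by_instant)
```

Comparing the raw strings picks the 01:30 reading because "3" sorts after "1", while comparing the parsed instants picks the 01:10 reading, which happened 40 minutes later.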
Can I run my own scratch code against the test inputs?
Yes. Each problem has a scratch panel that runs the same Python 3.11 environment as the test runner. Print, debug, and explore the test inputs before submitting. The scratch panel doesn't count against any submission limit; there isn't one.
02 / Why practice

Open the editor and write a function

  1. Active recall beats re-reading by 50%

    Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom.

  2. 76% of hiring managers reject on the coding task, not the resume

    From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice.

  3. Five problem shapes cover 80% of data engineer loops

    Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition.
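Two of the shapes named in the last item, sessionization and top-N-per-group, can be sketched as follows. This is a minimal version under stated assumptions: events carry a `ts` field holding a `datetime`, a session ends after a 30-minute inactivity gap, and the field names, sample data, and `top_n_per_group` signature are all made up for the example.

```python
import heapq
from collections import defaultdict
from datetime import datetime, timedelta

def sessionize(events, gap=timedelta(minutes=30)):
    """Split each user's events into sessions wherever the inactivity gap is exceeded."""
    by_user = defaultdict(list)
    for ev in events:
        by_user[ev["user_id"]].append(ev)

    sessions = []
    for user_id, evs in by_user.items():
        evs.sort(key=lambda e: e["ts"])
        current = [evs[0]]
        for prev, ev in zip(evs, evs[1:]):
            if ev["ts"] - prev["ts"] > gap:
                # Gap too long: close the running session and start a new one.
                sessions.append({"user_id": user_id, "events": current})
                current = []
            current.append(ev)
        sessions.append({"user_id": user_id, "events": current})
    return sessions

def top_n_per_group(rows, group_key, value_key, n):
    """Return the n rows with the largest value_key within each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)
    return {g: heapq.nlargest(n, rs, key=lambda r: r[value_key]) for g, rs in groups.items()}

# Hypothetical sample data.
clicks = [
    {"user_id": "u1", "ts": datetime(2026, 1, 1, 10, 0)},
    {"user_id": "u2", "ts": datetime(2026, 1, 1, 10, 5)},
    {"user_id": "u1", "ts": datetime(2026, 1, 1, 10, 10)},
    {"user_id": "u1", "ts": datetime(2026, 1, 1, 11, 0)},
]
sales = [
    {"region": "eu", "sku": "a", "revenue": 10},
    {"region": "eu", "sku": "b", "revenue": 30},
    {"region": "eu", "sku": "c", "revenue": 20},
    {"region": "us", "sku": "d", "revenue": 5},
]
sessions = sessionize(clicks)
top = top_n_per_group(sales, "region", "revenue", n=2)
print([(s["user_id"], len(s["events"])) for s in sessions])
print({g: [r["sku"] for r in rs] for g, rs in top.items()})
```

Both functions group with a dict first and then work within each group, which is the same move the dedup shape relies on.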

Adjacent practice