Python Coding Practice for Data Engineering Interviews

Python for data engineering looks different from Python for software engineering. A DE interview wants to see you parse a malformed CSV without crashing, deduplicate an event stream by composite key, validate records with field-level errors, and walk nested JSON. The 388 problems in this catalog run in a real Python 3.11 sandbox in the browser; each ships with public and hidden test cases.
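As a rough illustration of the first task in that list, the sketch below parses CSV text and collects bad rows instead of raising. The function name `parse_csv_lenient`, the column names, and the choice to return a `(rows, errors)` pair are assumptions made for this example, not the catalog's reference solution.

```python
import csv
import io


def parse_csv_lenient(text: str, expected_fields: list[str]) -> tuple[list[dict], list[dict]]:
    """Return (good_rows, errors) without raising on a malformed data row."""
    good, errors = [], []
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header != expected_fields:
        # A bad or missing header is reported once and no rows are parsed.
        return [], [{"line": 1, "error": f"unexpected header: {header!r}"}]
    for row in reader:
        line_no = reader.line_num  # line of the source text just consumed
        if not row:
            continue  # csv.reader yields [] for a blank line; skip it
        if len(row) != len(expected_fields):
            errors.append({
                "line": line_no,
                "error": f"expected {len(expected_fields)} fields, got {len(row)}",
            })
            continue
        good.append(dict(zip(expected_fields, row)))
    return good, errors


# Hypothetical input: one short row, one long row, one blank line.
sample = "user_id,amount\nu1,10\nu2\nu3,7,extra\n\nu4,3\n"
rows, errs = parse_csv_lenient(sample, ["user_id", "amount"])
print(rows)
print(errs)
```

Returning the errors alongside the good rows lets the caller decide whether two bad lines out of a million should fail the load or just be logged.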


Prepare for the interview

01 / Open invite


Know the patterns before the interviewer asks them.

A Python query, the same shape a screen would give you.

The diff against expected. Where ties broke. What you missed.

sandbox

def sessionize(events):
    sessions = []
    for e in events:
        if gap_minutes(e) > 30:

Execute your solution · 0.4s avg.

Shopify interview question


388 · Evaluated Python problems

Python 3.11 · Sandbox runtime

5-15 · Tests per problem (public + hidden)

Setup steps

The shape of DE Python vs. SWE Python

Two problems with similar surface area that test very different skills.

SWE-style (LeetCode), 4% of DE rounds

def longest_substring_without_repeat(s: str) -> int:
    seen = {}
    start = best = 0
    for i, ch in enumerate(s):
        if ch in seen and seen[ch] >= start:
            start = seen[ch] + 1
        seen[ch] = i
        best = max(best, i - start + 1)
    return best

Algorithm on a toy input. Pointer state, integer counting, complexity reasoning. Necessary for SWE; rarely shows up in DE rounds.

DE-style, 90%+ of DE rounds

def dedup_events(events: list[dict]) -> list[dict]:
    latest = {}
    for ev in events:
        key = (ev["user_id"], ev["event_type"])
        cur = latest.get(key)
        if (cur is None or
            ev["event_time"] > cur["event_time"] or
            (ev["event_time"] == cur["event_time"]
             and ev["event_id"] > cur["event_id"])):
            latest[key] = ev
    return list(latest.values())

Dict as keyed store, composite key, tiebreaker on equal timestamps, and a list-of-dicts return shape. The same skill lands in production code.
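The tiebreak can also be written as a single tuple comparison, since Python compares tuples element by element. This is a sketch of the same logic under the same assumed field names, not a replacement for the version above; `dedup_events_by_tuple` is a hypothetical name and the sample events are made up.

```python
def dedup_events_by_tuple(events: list[dict]) -> list[dict]:
    latest: dict[tuple, dict] = {}
    for ev in events:
        key = (ev["user_id"], ev["event_type"])
        cur = latest.get(key)
        # (event_time, event_id) orders by time first, then by id on equal times.
        if cur is None or (ev["event_time"], ev["event_id"]) > (cur["event_time"], cur["event_id"]):
            latest[key] = ev
    return list(latest.values())


# Made-up events: two share a timestamp, one arrives late with an older time.
events = [
    {"user_id": "u1", "event_type": "click", "event_time": 100, "event_id": 881},
    {"user_id": "u1", "event_type": "click", "event_time": 100, "event_id": 882},
    {"user_id": "u1", "event_type": "view", "event_time": 90, "event_id": 7},
    {"user_id": "u1", "event_type": "click", "event_time": 95, "event_id": 900},
]
result = dedup_events_by_tuple(events)
print([e["event_id"] for e in result])
```

The late-arriving event with id 900 loses because its `event_time` is older, even though its id is larger; the id only matters when the times are equal.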

Catalog topic distribution

388 problems sorted by what they teach. Algorithm DSA is the smallest slice on purpose.

Topic share, 388 problems

Topic	Problems	Share
Data transformation	112	29%
Dict and set operations	62	16%
File parsing / IO	45	12%
String manipulation	38	10%
ETL flow control	32	8%
Error handling	28	7%
Date and time	24	6%
OOP and context managers	22	6%
Generators and lazy iteration	15	4%
Algorithms (rare)	10	2%

What the test runner shows after you submit

# Schema for the prompt:
#   events:  list[dict] with keys user_id, event_type, event_time, event_id
#   returns: list[dict], 1 per (user_id, event_type), most recent
#
# Public tests (visible to you):
#   test_empty_input
#   test_single_event
#   test_no_duplicates
#   test_basic_dedup
#
# Hidden tests (revealed only after passing public):
#   test_timestamp_ties_use_event_id
#   test_late_arriving_event
#   test_100k_events_under_1_second
#   test_unicode_user_ids
#   test_event_time_at_dst_boundary

# Submission output format:

submit @ 2026-05-26T16:42:18Z
  test_empty_input             PASS  0.4 ms
  test_single_event            PASS  0.5 ms
  test_no_duplicates           PASS  0.8 ms
  test_basic_dedup             PASS  1.2 ms
  test_timestamp_ties...       FAIL  expected event_id 882, got 881
                               (your code returns the first event seen on a tie)
  test_late_arriving_event     PASS  2.1 ms
  test_100k_events...          FAIL  exceeded 1000 ms (your impl is O(n^2))
  test_unicode_user_ids        PASS  0.9 ms
  test_event_time_at_dst...    PASS  1.4 ms

verdict: 7/9 pass. fix the tiebreaker and the quadratic loop. resubmit.

Public tests visible upfront; hidden tests revealed after the public ones pass. Performance budgets are explicit.
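To make the tie and time-budget checks concrete, here is a minimal sketch of how two such tests could be written locally. The test bodies, the sample data, and the 1000-user spread are assumptions for illustration; the site's actual hidden tests are not shown in this page. A compact `dedup_events` is included only so the sketch runs on its own.

```python
import time


def dedup_events(events):
    # Compact keyed-store implementation so the tests below have something to call.
    latest = {}
    for ev in events:
        key = (ev["user_id"], ev["event_type"])
        cur = latest.get(key)
        if cur is None or (ev["event_time"], ev["event_id"]) > (cur["event_time"], cur["event_id"]):
            latest[key] = ev
    return list(latest.values())


def test_timestamp_ties_use_event_id():
    # Same key, same time: the higher event_id should win regardless of arrival order.
    events = [
        {"user_id": "u1", "event_type": "click", "event_time": 5, "event_id": 882},
        {"user_id": "u1", "event_type": "click", "event_time": 5, "event_id": 881},
    ]
    assert dedup_events(events)[0]["event_id"] == 882


def test_100k_events_within_budget(budget_seconds=1.0):
    # 100,000 synthetic events spread over 1000 users (assumed shape).
    events = [
        {"user_id": f"u{i % 1000}", "event_type": "click", "event_time": i, "event_id": i}
        for i in range(100_000)
    ]
    start = time.perf_counter()
    result = dedup_events(events)
    elapsed = time.perf_counter() - start
    assert len(result) == 1000
    assert elapsed < budget_seconds, f"took {elapsed:.3f}s"


test_timestamp_ties_use_event_id()
test_100k_events_within_budget()
```

A single pass over a dict is linear in the number of events, which is what keeps an input of this size inside a one-second budget; a nested scan over previously seen events would not be.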

Where to actually practice Python for DE interviews

Pricing reflects May 2026 public tiers. 'DE share' is a rough estimate of how much of each catalog matches the interview shapes the topic chart describes.

Site	Catalog	Test runner	DE-shaped share	Performance tests	Free tier
DataDriven (this site)	388 problems, all free	Real Python 3.11, 5-15 tests per problem	100% DE-shaped	Yes, with time budgets	Yes, no signup
LeetCode	~2400 problems, ~30% free	Real Python, fixed test cases	Maybe 5% DE-shaped	Yes	Easy + slice of Medium
HackerRank Python	~125 problems	Real Python, fixed tests	Maybe 10%	Limited	Most free
PYnative	630+ exercises	Self-check via solution	Maybe 15% (some pandas)	No	Free
Exercism Python	146 exercises	Mentor review + tests	~10%	No	Free

Python coding practice FAQ

What kind of Python do data engineers actually write?

Dict and set work, file I/O, string and JSON parsing, validation with descriptive errors, pipeline composition (router, batcher, retry decorator). Algorithm DSA is a small share of the surface area. The topic chart shows the distribution drawn from real DE interview write-ups.
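For "validation with descriptive errors", a minimal sketch might look like the following. The record fields (`user_id`, `amount`), the rules, and the `FieldError` type are all hypothetical; the point is only that every problem is collected and named rather than the first one raising.

```python
from dataclasses import dataclass


@dataclass
class FieldError:
    field: str
    message: str


def validate_record(record: dict) -> list[FieldError]:
    """Collect every problem in a record instead of stopping at the first one."""
    errors = []
    user_id = record.get("user_id")
    if not isinstance(user_id, str) or not user_id.strip():
        errors.append(FieldError("user_id", "must be a non-empty string"))
    amount = record.get("amount")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        errors.append(FieldError("amount", f"must be a number, got {type(amount).__name__}"))
    elif amount < 0:
        errors.append(FieldError("amount", f"must be >= 0, got {amount}"))
    return errors


print(validate_record({"user_id": "", "amount": "12"}))
print(validate_record({"user_id": "u1", "amount": 12.5}))
```

Returning a list keeps the caller in control: a pipeline can route records with a non-empty error list to a dead-letter store and pass the rest downstream.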

Should I prep with LeetCode for a data engineer interview?

Largely no. Roughly 4% of DE interview Python rounds resemble LeetCode-style algorithm puzzles. The other 96% are pipeline-shaped. If you're also interviewing for SWE or ML roles, LeetCode is useful; for pure DE rounds, the time goes further on DE-specific practice.

Do I need to know pandas or PySpark?

Depends on the company. Pandas appears across most DE interviews as a generic library question. PySpark dominates at Spark-first companies (Databricks, Netflix, Uber, Airbnb). Polars is rare in interviews but signals fluency. The bank covers both pandas and PySpark; check company guides at /companies for what each tests.

How many Python problems should I solve before an interview?

Aim for 30-50 problems for a phone screen, 80-120 for an onsite, and 150+ for FAANG-level loops. Distribute them across the high-frequency topics (data transformation, dict/set, file parsing, error handling). Quality matters more than count; one problem you've solved, debugged, and re-derived a week later beats five you skimmed.

How does the test runner handle hidden test cases?

Public tests are visible in the problem statement so you can read the shape of the input and the expected output. Hidden tests stay hidden until you pass the public ones, the way most company evaluators work. The hidden bucket usually includes empty input, edge data (Unicode, timezone, duplicates), and a performance test with a time budget.
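The timezone edge case is a common place to lose a hidden test. The sketch below assumes timestamps arrive as ISO 8601 strings with a UTC offset (an assumption; a given problem may use another format) and shows why comparing the raw strings can order events incorrectly around a daylight-saving shift. The helper name `to_utc` is hypothetical.

```python
from datetime import datetime, timezone


def to_utc(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp that carries an offset and normalize it to UTC."""
    dt = datetime.fromisoformat(ts)
    if dt.tzinfo is None:
        raise ValueError(f"timestamp has no UTC offset: {ts!r}")
    return dt.astimezone(timezone.utc)


# US Eastern fall-back on 2025-11-02: the wall clock repeats the 01:00 hour.
before_shift = "2025-11-02T01:50:00-04:00"  # EDT
after_shift = "2025-11-02T01:10:00-05:00"   # EST, 20 minutes later in real time

print(before_shift < after_shift)                  # compares text, not instants
print(to_utc(before_shift) < to_utc(after_shift))  # compares instants
```

Normalizing to UTC once, at the edge of the function, means every later comparison and tiebreak works on instants rather than on wall-clock text.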

Can I run my own scratch code against the test inputs?

Yes. Each problem has a scratch panel that runs the same Python 3.11 environment as the test runner. Print, debug, and explore the test inputs before submitting. The scratch panel doesn't count against any submission limit; there isn't one.

02 / Why practice

Open the editor and write a function

01
Active recall beats re-reading by 50%
Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom.

02
76% of hiring managers reject on the coding task, not the resume
From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice.

03
Five problem shapes cover 80% of data engineer loops
Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition.
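As one example of these shapes, here is a minimal sessionization sketch. It assumes each event is a dict with a `user_id` and a `datetime` `event_time`, and that a session ends after more than 30 minutes of inactivity; the field names and the default gap are assumptions for illustration, not a specification from the catalog.

```python
from datetime import datetime, timedelta


def sessionize(events: list[dict], gap: timedelta = timedelta(minutes=30)) -> list[list[dict]]:
    """Group each user's events into sessions split by inactivity longer than `gap`."""
    sessions: list[list[dict]] = []
    # Per user: (time of the last event seen, the session list it belongs to).
    last_seen: dict[str, tuple[datetime, list[dict]]] = {}
    for ev in sorted(events, key=lambda e: (e["user_id"], e["event_time"])):
        prev = last_seen.get(ev["user_id"])
        if prev is None or ev["event_time"] - prev[0] > gap:
            current = [ev]        # first event for this user, or gap exceeded
            sessions.append(current)
        else:
            current = prev[1]     # still inside the open session
            current.append(ev)
        last_seen[ev["user_id"]] = (ev["event_time"], current)
    return sessions


# Made-up events for two users.
t0 = datetime(2026, 1, 1, 9, 0)
events = [
    {"user_id": "a", "event_time": t0 + timedelta(minutes=45)},
    {"user_id": "b", "event_time": t0 + timedelta(minutes=5)},
    {"user_id": "a", "event_time": t0},
    {"user_id": "a", "event_time": t0 + timedelta(minutes=10)},
]
print([len(s) for s in sessionize(events)])
```

Sorting by `(user_id, event_time)` first means out-of-order input is handled before any gap is measured; a gap of exactly 30 minutes stays in the same session because the comparison is strict.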


Adjacent practice

Browse all 388 problems→

Full catalog with topic, difficulty, and pattern filters.

DE-pattern Python→

Pipeline-shaped problems organized by interview pattern.

PySpark Practice→

DataFrame transformations, joins, window functions in a Spark sandbox.