Python Practice Problems


Live Python practice problems for data engineer roles with hidden test cases including Unicode, DST, and performance budgets.

394 Python practice problems shaped like real data engineer work: dict and set fluency, file parsing and validation, sessionization, retry logic with jittered backoff, generator-based streaming. Live Python 3.11 sandbox with public and hidden test cases.

Python practice for data engineer roles is structurally different from Python practice for software engineer interviews. The catalog here optimizes for pipeline-shaped work: parsing the kind of malformed CSV that ships from a third-party export, deduplicating events with composite keys and tiebreakers, and walking nested JSON with a recursive flattener that handles both lists-as-records and lists-as-attributes. It also covers sessionizing event streams with itertools.groupby or a plain for-loop, writing a retry decorator with exponential backoff plus random jitter (the no-jitter version is a self-DDoS waiting to happen, because clients that fail together retry in lockstep), streaming a 50GB file with a generator so memory stays constant regardless of input size, and implementing SCD Type 2 merge logic in pandas the way a real warehouse pipeline would.
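As a rough illustration of the retry pattern described above, here is a minimal sketch of a decorator with exponential backoff and "full" jitter. The name `retry`, its parameters, and the choice of full jitter (sleeping a uniform random time between zero and the exponential ceiling) are assumptions made for this example; the catalog's problems may specify a different signature or jitter strategy.

```python
import functools
import random
import time


def retry(max_attempts=5, base_delay=0.5, max_delay=30.0,
          exceptions=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Retry the wrapped call with exponential backoff and full jitter.

    All names and defaults here are illustrative, not a catalog API.
    `sleep` is injectable so tests can record delays instead of waiting.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    # Exponential ceiling (base, 2x, 4x, ...), capped at max_delay.
                    ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
                    # Full jitter: a random delay in [0, ceiling] spreads out
                    # clients that all failed at the same moment.
                    sleep(random.uniform(0, ceiling))
        return wrapper
    return decorator
```

Catching a narrow tuple of exceptions rather than a bare `except` keeps programming errors such as `KeyError` from being retried silently.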

Public tests are visible in the problem statement so the data engineer can read the input shape and the expected output. Hidden tests are revealed after the public ones pass and typically include empty input, single-element edge cases, Unicode user IDs (because emoji and CJK characters take multiple bytes in UTF-8, which breaks code that treats byte length as character count), event timestamps at the DST boundary (which break naive timezone math), and a 100k-record performance test with a wall-clock budget that fails O(n-squared) solutions. The performance test is not a separate "optimize this" problem; it is woven into the same problem statement and fails the same submission if your solution is quadratic.
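Two of the hidden-test categories above can be reproduced in a few lines. The sketch below uses an invented user ID and the 2024 spring-forward transition in the America/New_York zone as sample data; neither is taken from an actual test case. It relies on the standard-library `zoneinfo` module, which needs time zone data to be available on the machine.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Unicode: character count and UTF-8 byte count diverge.
user_id = "用户🙂"                       # hypothetical user ID: two CJK characters and one emoji
char_count = len(user_id)                 # 3 code points
byte_count = len(user_id.encode("utf-8")) # 3 + 3 + 4 = 10 bytes

# DST: clocks in New York jumped from 02:00 to 03:00 on 2024-03-10.
ny = ZoneInfo("America/New_York")
before = datetime(2024, 3, 10, 1, 59, tzinfo=ny)  # still on standard time
after = datetime(2024, 3, 10, 3, 1, tzinfo=ny)    # already on daylight time

# Subtracting two datetimes that share the same tzinfo compares wall-clock
# readings only, so the skipped hour is counted as if it had elapsed.
wall_gap = after - before                 # 1:02:00

# Converting both to UTC first gives the real elapsed time.
real_gap = after.astimezone(timezone.utc) - before.astimezone(timezone.utc)  # 0:02:00
```

For sessionization with a 30-minute inactivity timeout, the wall-clock gap would split these two events into separate sessions even though only two minutes passed between them.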

Scratch panel mode runs the same Python 3.11 environment as the test runner. Print, debug, and explore the test inputs before submitting. Scratch runs do not count against a submission limit, because there is no submission limit. Library coverage: pandas, polars, numpy, pyarrow, json, csv, re, itertools, collections, heapq, datetime, requests, asyncio, and tenacity. PySpark problems run in a separate sandbox at /pyspark-interview-questions with the standard pyspark imports.

Submissions are graded on five dimensions in submission analytics: correctness against public tests, correctness against hidden tests, performance against the wall-clock budget, error handling for malformed input, and Pythonic style (idiomatic data structure choice, clear naming, narrow exception catching). The first three are binary pass-or-fail. The last two are inferred from the submission code and surfaced as feedback after the submission completes. Data engineer candidates who pass correctness but fail performance typically need to switch from nested loops to dict lookups (O(n-squared) time to O(n) time) or from list accumulation to generator streaming (O(n) memory to O(1) memory).
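The nested-loop-to-dict-lookup switch mentioned above can be sketched as follows. The enrichment task, the function names, and the record fields (`user_id`, `country`) are invented for this illustration and are not drawn from a specific catalog problem.

```python
def enrich_nested(events, users):
    """O(len(events) * len(users)): rescans the user list for every event."""
    enriched = []
    for event in events:
        for user in users:
            if user["user_id"] == event["user_id"]:
                enriched.append({**event, "country": user["country"]})
                break  # first matching user wins
    return enriched


def enrich_indexed(events, users):
    """O(len(events) + len(users)): builds a dict index once, then looks up."""
    country_by_user = {}
    for user in users:
        # setdefault keeps the first record per user_id, matching the
        # first-match-wins behavior of the nested version.
        country_by_user.setdefault(user["user_id"], user["country"])
    enriched = []
    for event in events:
        if event["user_id"] in country_by_user:
            enriched.append({**event, "country": country_by_user[event["user_id"]]})
    return enriched
```

Both functions return the same result and drop events with no matching user; only the second is likely to fit a wall-clock budget on a 100k-record input.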

What kind of Python do these practice problems test?
Pipeline-shaped Python: parsing malformed CSVs without crashing, deduplicating events with composite keys and tiebreakers, walking nested JSON, sessionizing event streams, writing retry decorators with jittered backoff, streaming large files with generators. About 4 percent of problems are algorithm puzzles for breadth; the other 96 percent are pipeline work that mirrors real production code on data engineering teams.
How do public and hidden test cases work?
Public tests are visible in the problem statement so a data engineer can read the input shape and expected output. Hidden tests are revealed one at a time after the public tests pass. The hidden bucket typically includes empty input, Unicode (emoji, CJK characters), DST boundary timestamps, and a 100k-record performance test with a wall-clock budget that fails O(n-squared) solutions.
What libraries are available in the Python sandbox?
pandas, polars, numpy, pyarrow, json, csv, re, itertools, collections, heapq, datetime, requests, asyncio, tenacity. PySpark runs in a separate sandbox. ML libraries (scikit-learn, torch) are not in scope; this is data engineer practice, not ML engineer practice.
How do I see why a hidden test failed?
After a failure, the runner reveals the input shape, the expected output shape, and the specific assertion that failed. For performance tests, it shows the wall-clock time and the threshold. For correctness tests, it shows the first divergent element of the diff. Submit, read the failure, fix, resubmit: the same loop as a real grading harness in a data engineer take-home.
Are there time limits on submissions?
There is no per-problem time limit on writing code. The performance tests have wall-clock budgets (typically 1 to 5 seconds for a 100k-record input), enforced inside the sandbox. Quadratic solutions usually fail performance tests; linear or O(n log n) solutions pass with margin.
Can I use pandas for problems that do not explicitly require it?
Yes. The grader accepts any valid Python solution. Pandas is often the cleanest expression: an SCD Type 2 merge in pandas is roughly 10 lines versus roughly 30 in vanilla Python. The interview rubric varies by company. Some prefer vanilla Python to verify you understand the underlying data structures. Others want pandas because production code uses it. A data engineer should be ready to discuss both approaches in an interview.
What is the difference from LeetCode Python practice?
LeetCode emphasizes algorithm puzzles (graph traversal, DP, sliding window). This catalog emphasizes pipeline patterns (parsing, dedup, validation, ETL flow). LeetCode performance tests are tight on algorithmic complexity. This catalog's performance tests are tight on memory bounds (streaming versus buffering) and IO patterns more than on Big-O. There is about 4 percent topical overlap; the rest is different territory for data engineer prep.
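The streaming-versus-buffering distinction above can be sketched with a generator that yields one parsed CSV row at a time instead of loading the whole file into a list. The column name `status` and both function names are assumptions made for this example.

```python
import csv
import io
from collections import Counter


def iter_rows(lines):
    """Yield one parsed row at a time; only the current row is held in memory."""
    for row in csv.DictReader(lines):
        yield row


def count_by_status(rows):
    """Aggregate while streaming; memory grows with distinct statuses, not rows."""
    counts = Counter()
    for row in rows:
        counts[row["status"]] += 1
    return counts


# In-memory stand-in for a large file. With a real file, pass the handle from
# open(path, newline="") instead, and the same code streams it line by line.
sample = io.StringIO("id,status\n1,ok\n2,fail\n3,ok\n")
status_counts = count_by_status(iter_rows(sample))
```

The buffering alternative, `rows = list(csv.DictReader(f))`, produces the same counts but holds every row in memory at once, which is what a memory-bound test is designed to catch.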

394 practice problems matching this filter. Difficulty: easy (184), medium (184), hard (26).
