# Python Practice Problems

> Live Python practice problems for data engineer roles with hidden test cases including Unicode, DST, and performance budgets.

Canonical URL: <https://datadriven.io/python-practice-problems>

## Summary

390 Python practice problems shaped like real data engineer work: dict and set fluency, file parsing and validation, sessionization, retry logic with jittered backoff, generator-based streaming. Live Python 3.11 sandbox with public and hidden test cases.

## What this page covers

Python practice for data engineer roles is structurally different from Python practice for software engineer interviews. The catalog here optimizes for pipeline-shaped work:

- parsing the kind of malformed CSV that ships from a third-party export
- deduplicating events with composite keys and tiebreakers
- walking nested JSON with a recursive flattener that handles both lists-as-records and lists-as-attributes
- sessionizing event streams with `itertools.groupby` or a plain for-loop
- writing a retry decorator with exponential backoff plus random jitter (the no-jitter version is a self-DDoS waiting to happen)
- streaming a 50GB file with a generator so memory stays constant regardless of input size
- implementing pandas SCD Type 2 merge logic the way a real warehouse pipeline would
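As a rough illustration of the retry pattern named above, here is a minimal sketch of a decorator with exponential backoff and full jitter. It is not taken from any catalog problem: the name `retry`, its parameters, and their defaults are all assumptions made for this example, and the injectable `sleep` argument exists only so the delays can be inspected without actually waiting.

```python
import functools
import random
import time


def retry(max_attempts=5, base_delay=0.5, max_delay=30.0,
          exceptions=(Exception,), sleep=time.sleep):
    """Retry a function with exponential backoff and full jitter.

    Every parameter name and default here is an illustrative assumption.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    # Exponential ceiling: base, 2*base, 4*base, ... capped.
                    ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
                    # Full jitter: wait a random amount below the ceiling so
                    # many clients do not retry in lockstep.
                    sleep(random.uniform(0, ceiling))
        return wrapper
    return decorator
```

Catching a narrow exception tuple (for example `exceptions=(ConnectionError,)`) rather than the default keeps genuine bugs from being retried silently.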

Public tests are visible in the problem statement so the data engineer can read the input shape and the expected output. Hidden tests are revealed after the public ones pass and typically include empty input, single-element edge cases, Unicode user IDs (because emoji and CJK characters break naive byte-counting), event timestamps at the DST boundary (which break naive timezone math), and a 100k-record performance test with a wall-clock budget that fails O(n²) solutions. The performance test is not a separate "optimize this" problem; it is part of the same problem statement, so a quadratic solution fails the same submission.
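To make the Unicode point concrete, the sketch below shows how character count and UTF-8 byte count diverge for a user ID containing CJK characters and an emoji. The sample ID and the helper `truncate_utf8` are made up for illustration; they are not part of any catalog problem.

```python
def truncate_utf8(text, max_bytes):
    """Truncate to at most max_bytes of UTF-8 without splitting a character."""
    encoded = text.encode("utf-8")[:max_bytes]
    # errors="ignore" drops a trailing partial multi-byte sequence, if any.
    return encoded.decode("utf-8", errors="ignore")


user_id = "用户_🚀"  # hypothetical user ID

char_count = len(user_id)                  # 4 code points
byte_count = len(user_id.encode("utf-8"))  # 11 bytes: 3 + 3 + 1 + 4

# Cutting at 8 bytes would land inside the emoji; the helper backs off.
short_id = truncate_utf8(user_id, 8)       # "用户_"
```

Code that assumes one byte per character passes ASCII-only public tests and then fails on exactly this kind of input.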

Scratch panel mode runs the same Python 3.11 environment as the test runner. Use it to print, debug, and explore the test inputs before submitting. Scratch runs are not counted against anything, because there is no submission limit. Available libraries: pandas, polars, numpy, pyarrow, json, csv, re, itertools, collections, heapq, datetime, requests, asyncio, and tenacity. PySpark problems run in a separate sandbox at /pyspark-interview-questions with the standard pyspark imports.

Submission analytics grade each solution on five dimensions: correctness against public tests, correctness against hidden tests, performance against the wall-clock budget, error handling for malformed input, and Pythonic style (idiomatic data structure choice, clear naming, narrow exception catching). The first three are binary pass-or-fail. The last two are inferred from the submission code and surfaced as feedback after the submission completes. Data engineer candidates who pass correctness but fail performance typically need to switch from nested loops to dict lookups (O(n²) time to O(n)) or from list accumulation to generator streaming (O(n) memory to O(1)).
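The two performance fixes mentioned above can be sketched on a small enrichment task. The record shapes (`user_id`, `country`) and the function names are assumptions chosen for this example; the point is only the contrast between a nested scan, a dict lookup, and a generator that avoids building the output list.

```python
def enrich_quadratic(events, users):
    """Nested loops: scans the user list once per event, O(n * m)."""
    out = []
    for event in events:
        for user in users:
            if user["user_id"] == event["user_id"]:
                out.append({**event, "country": user["country"]})
                break  # first matching user wins
    return out


def build_country_index(users):
    """Index users once so each later lookup is O(1) on average."""
    index = {}
    for user in users:
        # setdefault keeps the first match, mirroring the nested-loop version.
        index.setdefault(user["user_id"], user["country"])
    return index


def enrich_linear(events, users):
    """Dict lookup: one pass over users, one pass over events."""
    index = build_country_index(users)
    return [
        {**event, "country": index[event["user_id"]]}
        for event in events
        if event["user_id"] in index
    ]


def iter_enriched(events, users):
    """Generator variant: yields one enriched event at a time."""
    index = build_country_index(users)
    for event in events:
        if event["user_id"] in index:
            yield {**event, "country": index[event["user_id"]]}
```

The generator still holds the user index in memory; what stays constant is the memory used for the event stream and its output.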

## Frequently asked questions

### What kind of Python do these practice problems test?

Pipeline-shaped Python: parsing malformed CSVs without crashing, deduplicating events with composite keys and tiebreakers, walking nested JSON, sessionizing event streams, writing retry decorators with jittered backoff, streaming large files with generators. About 4 percent of problems are algorithm puzzles for breadth; the other 96 percent are pipeline work that mirrors real production code on data engineering teams.
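For a sense of what "composite keys and tiebreakers" looks like in practice, here is a minimal sketch. The field names (`user_id`, `event_type`, `updated_at`, `event_id`) and the tiebreak rule (latest `updated_at`, then highest `event_id`) are assumptions for illustration; a real problem statement defines its own.

```python
def dedupe_events(events):
    """Keep one event per (user_id, event_type), preferring the newest.

    Ties on updated_at are broken by the higher event_id. Timestamps are
    assumed to be ISO-8601 strings in one consistent format, so plain
    string comparison orders them correctly.
    """
    best = {}
    for event in events:
        key = (event["user_id"], event["event_type"])    # composite key
        rank = (event["updated_at"], event["event_id"])  # tiebreaker tuple
        if key not in best or rank > best[key][0]:
            best[key] = (rank, event)
    # Dicts preserve insertion order, so output follows first-seen key order.
    return [event for _, event in best.values()]
```

This runs in a single pass and returns an empty list for empty input without any special-casing.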

### How do public and hidden test cases work?

Public tests are visible in the problem statement so a data engineer can read the input shape and expected output. Hidden tests are revealed one at a time after the public tests pass. The hidden set typically includes empty input, Unicode (emoji, CJK characters), DST boundary timestamps, and a 100k-record performance test with a wall-clock budget that fails O(n²) solutions.

### What libraries are available in the Python sandbox?

pandas, polars, numpy, pyarrow, json, csv, re, itertools, collections, heapq, datetime, requests, asyncio, tenacity. PySpark runs in a separate sandbox. ML libraries (scikit-learn, torch) are not in scope; this is data engineer practice, not ML engineer practice.

### How do I see why a hidden test failed?

After a failure, the runner reveals the input shape, the expected output shape, and the specific assertion that failed. For performance tests, it shows the wall-clock time and the threshold. For correctness tests, it shows the first divergent element of the diff. Submit, read the failure, fix, resubmit: the same loop as a real grading harness in a data engineer take-home.

### Are there time limits on submissions?

There is no per-problem time limit on writing code. The performance tests have wall-clock budgets (typically 1 to 5 seconds for a 100k-record input), enforced inside the sandbox. Quadratic solutions usually fail performance tests; linear or O(n log n) solutions pass with margin.
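A wall-clock budget can be approximated locally before submitting. The sketch below is only a stand-in: the sandbox's actual harness is not described here, and `run_with_budget`, `count_events_per_user`, and the synthetic 100k-record input are all assumptions made for this example.

```python
import time
from collections import Counter


def run_with_budget(func, payload, budget_seconds):
    """Run func(payload) and report whether it finished within the budget."""
    start = time.perf_counter()
    result = func(payload)
    elapsed = time.perf_counter() - start
    return result, elapsed, elapsed <= budget_seconds


def count_events_per_user(events):
    """Linear-time aggregation: one pass, one dict update per event."""
    return Counter(event["user_id"] for event in events)


# Synthetic input: 100,000 events spread evenly over 1,000 users.
events = [{"user_id": f"user_{i % 1000}"} for i in range(100_000)]
counts, elapsed, within_budget = run_with_budget(
    count_events_per_user, events, budget_seconds=5.0
)
```

Local timings depend on the machine, so treat the result as a sanity check on complexity rather than a prediction of the sandbox's verdict.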

### Can I use pandas for problems that do not explicitly require it?

Yes. The grader accepts any valid Python solution. Pandas is often the cleanest expression: an SCD Type 2 merge in pandas is roughly 10 lines versus roughly 30 in vanilla Python. The interview rubric varies by company. Some prefer vanilla to verify you understand the underlying data structures. Others want pandas because production code uses it. A data engineer should mention both in interview discussion.

### What is the difference from LeetCode Python practice?

LeetCode emphasizes algorithm puzzles (graph traversal, DP, sliding window). This catalog emphasizes pipeline patterns (parsing, dedup, validation, ETL flow). LeetCode performance tests are tight on algorithmic complexity. This catalog's performance tests still fail quadratic solutions, but they put as much weight on memory bounds (streaming versus buffering) and IO patterns as on Big-O. Topical overlap is about 4 percent; the rest is different territory for data engineer prep.
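The streaming-versus-buffering distinction can be shown with a small generator pipeline. This sketch assumes newline-delimited JSON input and a `bytes` field purely for illustration, and it uses an in-memory `io.StringIO` in place of a real file; `iter_records` and `total_bytes` are hypothetical names.

```python
import io
import json


def iter_records(lines):
    """Yield one parsed dict per valid line; only one line is held at a time."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed rows instead of crashing
        if isinstance(record, dict):
            yield record


def total_bytes(records):
    """Consume the stream lazily; no intermediate list is built."""
    return sum(record.get("bytes", 0) for record in records)


# A file object opened with open(path) could be passed the same way.
sample = io.StringIO('{"bytes": 10}\n\nnot json\n{"bytes": 32}\n[1, 2]\n')
total = total_bytes(iter_records(sample))  # 42
```

The buffering alternative, `json.loads` on every line into one big list before summing, produces the same total but its memory grows with the file.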

## How a data engineer approaches a Python practice problem

A five-step loop that matches how the grader expects you to develop and submit.

### Step 1: Read public tests to infer input shape

Public tests show input keys, value types, and expected output. Read them before reading the prompt; they are the type signature.

### Step 2: Sketch in the scratch panel

Scratch runs the same Python 3.11 as the grader, with no submission limit. Print intermediate state, explore the input, find the edge cases before writing the final solution.

### Step 3: Implement against the public tests first

Pass all public tests before submitting. The runner only reveals hidden tests after public ones pass.

### Step 4: Read hidden test failures

Hidden tests typically include empty input, Unicode, DST boundary, and a performance test. The runner shows input shape, expected output, and the failing assertion.

### Step 5: Fix the failure mode, not just the test

An empty-input failure usually means your code assumes `len(records) > 0`; fix the assumption, not just the one test. A DST failure usually means the code does arithmetic on naive or local wall-clock datetimes; switch to timezone-aware datetimes and compare them in UTC. A performance failure usually means switching from nested loops to dict lookups.
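Two of these fixes are small enough to sketch. The example assumes the standard-library `zoneinfo` module (available since Python 3.9, and dependent on time zone data being present on the system); the function names, the sample zone, and the timestamps are illustrative choices, not taken from a specific problem.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def average(values):
    """Return the mean, or None for empty input instead of dividing by zero."""
    if not values:
        return None
    return sum(values) / len(values)


def elapsed_seconds(start, end):
    """Real elapsed time between two timezone-aware datetimes.

    Converting to UTC first matters: subtracting two datetimes that share
    the same tzinfo compares wall-clock readings and ignores the DST shift.
    """
    delta = end.astimezone(timezone.utc) - start.astimezone(timezone.utc)
    return delta.total_seconds()


new_york = ZoneInfo("America/New_York")
# US clocks moved from 02:00 to 03:00 on 2024-03-10.
before = datetime(2024, 3, 10, 1, 30, tzinfo=new_york)
after = datetime(2024, 3, 10, 3, 30, tzinfo=new_york)

wall_clock_gap = (after - before).total_seconds()  # 7200.0, off by an hour
real_gap = elapsed_seconds(before, after)          # 3600.0
```

A sessionizer with a fixed inactivity threshold would split or merge sessions incorrectly around the boundary if it used the wall-clock gap.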

## Related practice catalogs

- [Python interview questions catalog](https://datadriven.io/python-interview-questions): Same 390 problems organized by interview-frequency tag for data engineer prep.
- [Python coding practice for data engineer rounds](https://datadriven.io/python-coding-practice): Topic-by-topic entry points: parsing, dedup, sessionization, retry.
- [PySpark practice problems for Spark-first companies](https://datadriven.io/pyspark-interview-questions): Spark-flavored Python in a live Spark sandbox for Databricks, Netflix, Uber.
- [SQL practice paired with Python](https://datadriven.io/sql-practice-problems): Multi-seed Postgres grading. Pair SQL practice with Python for full data engineer prep.
- [Full data engineer coding practice across surfaces](https://datadriven.io/data-engineer-coding-practice): SQL, Python, PySpark in one practice catalog.
- [Python practice problems with worked solutions](https://datadriven.io/python-practice-problems-with-solutions): Twenty worked solutions for pipeline-shaped problems with the why behind each.
- [Full data engineer interview question catalog](https://datadriven.io/data-engineer-interview-questions): 1,400+ problems across all 5 rounds.

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.