Question 1

What kind of Python do these practice problems test?

Accepted Answer

Pipeline-shaped Python: parsing malformed CSVs without crashing, deduplicating events with composite keys and tiebreakers, walking nested JSON, sessionizing event streams, writing retry decorators with jittered backoff, and streaming large files with generators. About 4 percent of problems are algorithm puzzles for breadth; the other 96 percent are pipeline work that mirrors production code at companies that hire data engineers.
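As an illustration of the first pattern in that list, here is a minimal sketch of parsing a malformed CSV without crashing. The function name `parse_orders` and the `amount` column are hypothetical and chosen only for this example; they are not taken from any specific problem in the catalog.

```python
import csv
import io


def parse_orders(text):
    """Return (rows, rejects). A bad row is recorded, never raised.

    Assumes a header row and a numeric 'amount' column (hypothetical schema).
    """
    rows, rejects = [], []
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header is None:  # empty input: nothing to parse
        return rows, rejects
    for fields in reader:
        line_no = reader.line_num  # physical line of the row just read
        if len(fields) != len(header):
            rejects.append((line_no, "wrong field count"))
            continue
        record = dict(zip(header, fields))
        try:
            record["amount"] = float(record["amount"])
        except (KeyError, ValueError):
            rejects.append((line_no, "missing or non-numeric amount"))
            continue
        rows.append(record)
    return rows, rejects


sample = "order_id,amount\n1,9.50\n2,abc\n3\n4,2\n"
good, bad = parse_orders(sample)
```

Returning the rejects alongside the good rows, rather than logging and dropping them, keeps the function easy to assert against in a test.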

Question 2

How do public and hidden test cases work?

Accepted Answer

Public tests are visible in the problem statement so a data engineer can read the input shape and expected output. Hidden tests are revealed one at a time after the public tests pass. The hidden bucket typically includes empty input, Unicode (emoji, CJK characters), DST boundary timestamps, and a 100k-record performance test with a wall-clock budget that fails O(n^2) solutions.

Question 3

What libraries are available in the Python sandbox?

Accepted Answer

pandas, polars, numpy, pyarrow, json, csv, re, itertools, collections, heapq, datetime, requests, asyncio, tenacity. PySpark runs in a separate sandbox. ML libraries (scikit-learn, torch) are not in scope; this is data engineer practice, not ML engineer practice.

Question 4

How do I see why a hidden test failed?

Accepted Answer

After a failure, the runner reveals the input shape, the expected output shape, and the specific assertion that failed. For performance tests, it shows the wall-clock time and the threshold. For correctness tests, it shows the first divergent element of the diff. Submit, read the failure, fix, resubmit: the same loop as a real grading harness in a data engineer take-home.
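To make "first divergent element" concrete, here is a small sketch of how such a comparison could be computed locally before resubmitting. It is not the runner's actual implementation; `first_divergence` and the `MISSING` sentinel are hypothetical names.

```python
from itertools import zip_longest

# Sentinel marking a position where one sequence has already ended.
MISSING = object()


def first_divergence(expected, actual):
    """Return (index, expected_item, actual_item) at the first mismatch, or None."""
    pairs = zip_longest(expected, actual, fillvalue=MISSING)
    for index, (exp, act) in enumerate(pairs):
        if exp is MISSING or act is MISSING or exp != act:
            return index, exp, act
    return None


mismatch = first_divergence([1, 2, 3], [1, 2, 4])
```

A length mismatch shows up as a divergence at the first index where one side has run out.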

Question 5

Are there time limits on submissions?

Accepted Answer

There is no per-problem time limit on writing code. The performance tests have wall-clock budgets (typically 1 to 5 seconds for a 100k-record input), enforced inside the sandbox. Quadratic solutions usually fail performance tests; linear or O(n log n) solutions pass with margin.
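The difference is easiest to see on deduplication. The sketch below keeps one event per composite key in a single pass with a dict, where a nested-loop comparison of every pair would be quadratic. The field names (`user_id`, `event_type`, `ts`, `version`) and the tiebreak rule are assumptions made for the example.

```python
def dedupe_latest(events):
    """Keep one event per (user_id, event_type) in O(n).

    The latest 'ts' wins; ties on 'ts' go to the higher 'version'.
    Output order follows the first appearance of each key.
    """
    best = {}
    for event in events:
        key = (event["user_id"], event["event_type"])
        kept = best.get(key)
        rank = (event["ts"], event["version"])
        if kept is None or rank > (kept["ts"], kept["version"]):
            best[key] = event
    return list(best.values())


events = [
    {"user_id": 1, "event_type": "click", "ts": 10, "version": 1},
    {"user_id": 1, "event_type": "click", "ts": 12, "version": 1},
    {"user_id": 2, "event_type": "view", "ts": 5, "version": 1},
    {"user_id": 1, "event_type": "click", "ts": 12, "version": 2},
    {"user_id": 2, "event_type": "view", "ts": 4, "version": 9},
]
deduped = dedupe_latest(events)
```

Each event costs one dict lookup and one tuple comparison, so runtime grows linearly with the number of records.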

Question 6

Can I use pandas for problems that do not explicitly require it?

Accepted Answer

Yes. The grader accepts any valid Python solution. Pandas is often the cleanest expression: an SCD Type 2 merge in pandas is roughly 10 lines versus roughly 30 in vanilla Python. The interview rubric varies by company. Some prefer vanilla Python to verify that you understand the underlying data structures. Others want pandas because production code uses it. A data engineer should be ready to discuss both approaches in an interview.

Question 7

What is the difference from LeetCode Python practice?

Accepted Answer

LeetCode emphasizes algorithm puzzles (graph traversal, DP, sliding window). This catalog emphasizes pipeline patterns (parsing, dedup, validation, ETL flow). LeetCode performance tests are tight on algorithmic complexity. This catalog's performance tests are tight on memory bounds (streaming versus buffering) and IO patterns more than on Big-O. Topical overlap is about 4 percent; the rest is different territory for data engineer prep.
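The streaming-versus-buffering distinction can be sketched with a generator over newline-delimited JSON. The record layout and the `bytes` field are hypothetical; the point is that only one record is held in memory at a time, whereas reading every line into a list first would buffer the whole input.

```python
import json
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Lazily yield one parsed JSON object per non-blank line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def total_bytes(lines: Iterable[str]) -> int:
    """Sum the 'bytes' field without materializing the records in a list."""
    return sum(record.get("bytes", 0) for record in iter_records(lines))


sample_lines = ['{"bytes": 10}\n', "\n", '{"bytes": 5}\n', '{"other": 1}\n']
total = total_bytes(sample_lines)
```

Because an open file object is itself an iterable of lines, passing a file handle to `total_bytes` processes the file line by line rather than loading it whole.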

Python Practice Problems
