Python Interview Questions

Live-executed Python interview questions for data engineer roles, shaped like real pipeline code.

394 Python interview questions pulled from data engineer interview reports. Pipeline-shaped work: parsing malformed CSVs without crashing, deduplicating event streams by composite key with a proper tiebreaker, walking nested JSON with a recursive flattener, sessionizing event logs with itertools.groupby, writing retry decorators with exponential backoff plus jitter. Not LeetCode algorithm prep.
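
As one illustration of the deduplication task named above, here is a minimal sketch. It assumes events are dicts with hypothetical fields `user_id`, `event_type`, `ts`, and `event_id`; the composite key and the tiebreaker rule (newest timestamp wins, then highest `event_id`) are assumptions chosen for the example, not a rule taken from any specific interview.

```python
from typing import Iterable


def dedupe_latest(events: Iterable[dict]) -> list[dict]:
    """Keep one event per (user_id, event_type), preferring the newest.

    Tiebreaker: when timestamps are equal, the higher event_id wins, so the
    chosen event does not depend on input order.
    """
    best: dict[tuple, dict] = {}
    for ev in events:
        key = (ev["user_id"], ev["event_type"])   # composite key
        rank = (ev["ts"], ev["event_id"])         # timestamp first, then tiebreaker
        current = best.get(key)
        if current is None or rank > (current["ts"], current["event_id"]):
            best[key] = ev
    return list(best.values())


# Hypothetical sample data: two "click" events share the same timestamp.
events = [
    {"user_id": "u1", "event_type": "click", "ts": 100, "event_id": 1},
    {"user_id": "u1", "event_type": "click", "ts": 100, "event_id": 2},
    {"user_id": "u1", "event_type": "view", "ts": 90, "event_id": 3},
    {"user_id": "u1", "event_type": "click", "ts": 95, "event_id": 4},
]
deduped = dedupe_latest(events)
kept_ids = [ev["event_id"] for ev in deduped]
```

A single pass with a dict keeps the work linear in the number of events, which matters once the input reaches the 100k-record range.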

Python for data engineer roles is structurally different from Python for software engineer roles. About 4 percent of data engineer Python rounds resemble LeetCode algorithm puzzles. The other 96 percent are pipeline-shaped. The interviewer wants to see you parse a malformed CSV without crashing, deduplicate an event stream by composite key with a proper tiebreaker, walk nested JSON with a recursive flattener that handles both lists-as-records and lists-as-attributes, and sessionize an event log with itertools.groupby or a plain for-loop. Expect to write a retry decorator with exponential backoff plus random jitter (without jitter, every worker that failed together retries at the same later moment, and you have built a self-DDoS) and to stream a 50 GB file with a generator so memory stays constant regardless of input size.
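
The retry decorator described above can be sketched roughly as follows. This is a hand-rolled version using the "full jitter" variant (sleep a random amount between zero and the exponential cap); the parameter names and defaults are illustrative assumptions, and the injectable `sleep` argument is added here only so the example runs without actually waiting.

```python
import functools
import random
import time


def retry(max_attempts=5, base_delay=0.5, max_delay=30.0,
          exceptions=(Exception,), sleep=time.sleep):
    """Retry the wrapped function with exponential backoff and full jitter."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    # The cap doubles each attempt: base, 2*base, 4*base, ...
                    cap = min(max_delay, base_delay * 2 ** (attempt - 1))
                    # Full jitter: a random delay in [0, cap], so workers
                    # that failed together do not all retry together.
                    sleep(random.uniform(0, cap))
        return wrapper
    return decorator


# Usage with a hypothetical flaky call; delays are recorded instead of slept.
calls = {"n": 0}
delays = []


@retry(max_attempts=4, base_delay=0.5, exceptions=(ConnectionError,),
       sleep=delays.append)
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


result = flaky_fetch()
```

Here the call fails twice and succeeds on the third attempt, so two jittered delays are recorded. In production code the tenacity library covers the same ground; the hand-rolled form is shown because it makes the backoff arithmetic visible.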

The 394-question catalog mirrors the data engineer interview surface area. Data transformation problems are 29 percent of the bank (112 problems). Dict and set operations 16 percent (62). File parsing and IO 12 percent (45). String manipulation 10 percent (38). ETL flow control (router, batcher, retry decorator) 8 percent (32). Error handling with descriptive field-level errors 7 percent (28). Date and time including DST and timezone 6 percent (24). OOP and context managers 6 percent (22). Generators and lazy iteration 4 percent (15). Traditional DSA algorithms 2 percent (10), present for breadth but deliberately rare.
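
To make the "batcher" and "lazy iteration" categories concrete, here is a small sketch of a generator that groups any iterable into fixed-size batches without loading it all into memory. The function name and signature are assumptions for illustration; Python 3.12 adds `itertools.batched` to the standard library, which yields tuples rather than lists.

```python
from itertools import islice
from typing import Iterable, Iterator


def batched(records: Iterable, size: int) -> Iterator[list]:
    """Yield lists of up to `size` items, pulling from `records` lazily."""
    if size < 1:
        # Raised on first iteration, because this function is a generator.
        raise ValueError("size must be at least 1")
    it = iter(records)
    # islice pulls at most `size` items; an empty list ends the loop.
    while batch := list(islice(it, size)):
        yield batch


batches = list(batched(range(7), 3))
```

Because the input is consumed through an iterator, the same function works on a file handle or a database cursor, and only one batch is held in memory at a time.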

Library coverage. Vanilla Python is preferred in most data engineer rounds; the interviewer wants to see that you understand what pandas does under the hood. Pandas appears across most data engineer loops as a generic library question; a typical prompt is "implement SCD Type 2 merge logic in pandas". PySpark dominates at Spark-first companies (Databricks, Netflix, Uber, Airbnb, DoorDash, Spotify). Polars is rare in interviews but signals fluency when it comes up. The sandbox ships pandas, polars, numpy, pyarrow, json, csv, re, itertools, collections, heapq, datetime, requests, asyncio, and tenacity. ML libraries (scikit-learn, torch) are not in scope; this is data engineer practice, not ML engineer practice.

Every Python question runs in a real Python 3.11 sandbox against 5 to 15 test cases. Public tests are visible in the problem statement so you can read the input shape. Hidden tests are revealed after the public ones pass and typically include empty input, single-element edge cases, Unicode user IDs (emoji and CJK characters break naive byte-counting), event timestamps at the DST boundary (which break naive timezone math), and a 100k-record performance test with a wall-clock budget that fails O(n-squared) solutions. The performance test belongs to the same problem rather than a separate exercise, so a quadratic implementation fails the submission even when its output is correct.
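
The Unicode edge case can be seen in a few lines. The sketch below uses made-up user IDs to show that character count and UTF-8 byte count diverge, plus one possible way to truncate to a byte budget without splitting a character; `truncate_utf8` is a hypothetical helper, not part of any library.

```python
def utf8_len(s: str) -> int:
    """Number of bytes the string occupies when encoded as UTF-8."""
    return len(s.encode("utf-8"))


def truncate_utf8(s: str, max_bytes: int) -> str:
    """Cut to at most max_bytes of UTF-8 without splitting a character."""
    # Slicing bytes may leave a partial multi-byte sequence at the end;
    # errors="ignore" drops that incomplete tail instead of raising.
    return s.encode("utf-8")[:max_bytes].decode("utf-8", errors="ignore")


# Hypothetical user IDs: ASCII, CJK, and ASCII plus an emoji.
user_ids = ["alice", "张伟", "dev🚀"]
char_counts = [len(uid) for uid in user_ids]       # code points
byte_counts = [utf8_len(uid) for uid in user_ids]  # encoded bytes

short_cjk = truncate_utf8("张伟", 4)
short_emoji = truncate_utf8("dev🚀", 5)
```

Code that assumes one byte per character passes on "alice" and fails on the other two IDs, which is the kind of gap a hidden test is built to expose.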

Is this LeetCode-style Python algorithm prep?
No. About 4 percent of data engineer Python rounds resemble LeetCode algorithm puzzles. The other 96 percent are pipeline-shaped: parsing malformed CSVs, deduplicating event streams by composite key, validating records with field-level errors, sessionizing event logs, writing retry decorators. The catalog reflects that distribution. The algorithms tier exists for breadth but is intentionally small.
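
For a sense of what "validating records with field-level errors" can look like, here is a minimal sketch. The record fields (`user_id`, `amount`, `ts`) and the message wording are assumptions made up for the example.

```python
from datetime import datetime


def validate_record(record: dict) -> list[str]:
    """Return one message per invalid field; an empty list means valid."""
    errors = []

    user_id = record.get("user_id")
    if not isinstance(user_id, str) or not user_id.strip():
        errors.append("user_id: required non-empty string")

    amount = record.get("amount")
    try:
        if float(amount) < 0:
            errors.append(f"amount: must be >= 0, got {amount!r}")
    except (TypeError, ValueError):  # None or non-numeric text
        errors.append(f"amount: not a number, got {amount!r}")

    ts = record.get("ts")
    try:
        datetime.fromisoformat(ts)
    except (TypeError, ValueError):  # None or unparseable text
        errors.append(f"ts: not an ISO 8601 timestamp, got {ts!r}")

    return errors


good = {"user_id": "u1", "amount": "12.50", "ts": "2024-03-10T01:30:00"}
bad = {"user_id": "", "amount": "abc", "ts": "yesterday"}
good_errors = validate_record(good)
bad_errors = validate_record(bad)
```

Collecting every error for a record, rather than raising on the first one, lets a pipeline route bad rows to a reject file with a complete explanation.
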
Does the Python code actually execute in the sandbox?
Yes. Every submission runs in a real Python 3.11 sandbox against 5 to 15 test cases. Public tests are visible in the problem statement; hidden tests are revealed after the public ones pass and include empty input, Unicode, DST boundary timestamps, and a performance budget that fails quadratic solutions on 100k records.
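
As an illustration of why DST boundary timestamps catch naive code, consider the sketch below. It assumes the IANA time zone database is available to `zoneinfo` and uses the US spring-forward date of 10 March 2024 as the example; the specific timestamps are made up.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

NY = ZoneInfo("America/New_York")

# Clocks in New York jumped from 02:00 to 03:00 on 2024-03-10.
start = datetime(2024, 3, 10, 0, 30, tzinfo=NY)
end = datetime(2024, 3, 10, 3, 30, tzinfo=NY)

# Subtracting two datetimes that share the same tzinfo object compares
# wall-clock values only, so the skipped hour is counted as if it happened.
wall_clock_gap = end - start

# Converting both to UTC first measures the time that actually elapsed.
elapsed_gap = end.astimezone(timezone.utc) - start.astimezone(timezone.utc)

wall_clock_hours = wall_clock_gap.total_seconds() / 3600
elapsed_hours = elapsed_gap.total_seconds() / 3600
```

The wall-clock difference comes out one hour longer than the elapsed time, which is enough to break a sessionization gap or a duration metric computed on local timestamps.
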
Do I need to know pandas or PySpark for a data engineer Python round?
It depends on the company. Pandas shows up across most data engineer interviews as a generic library question (a typical prompt: "implement SCD Type 2 merge logic in pandas"). PySpark dominates at Spark-first companies such as Databricks, Netflix, Uber, and Airbnb. Polars is rare in interviews but signals fluency. The catalog covers both pandas and PySpark; the company-specific lists show which library each company actually tests.
What Python idioms come up most in data engineer interview rounds?
defaultdict and OrderedDict for grouped accumulation. csv.DictReader for tabular ingest. json with recursive flattening for nested payloads. itertools.groupby (on input already sorted by the grouping key) for sessionization. heapq.merge for streaming sorted merges. asyncio.Semaphore for rate-limited fetch. tenacity (or a hand-rolled decorator) for exponential backoff with random jitter. Most candidates write the longer for-loop version of these; that is fine, as long as you mention the library equivalent.
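
One of these idioms, sessionization with itertools.groupby, can be sketched as follows. The event fields (`user_id`, `ts` in epoch seconds) and the 30-minute inactivity gap are assumptions for the example. Note the explicit sort: groupby only groups adjacent items, so unsorted input would silently produce split groups.

```python
from itertools import groupby
from operator import itemgetter


def sessionize(events, gap_seconds=1800):
    """Return {user_id: [[ts, ...], ...]}, splitting on inactivity gaps."""
    sessions = {}
    # Sort by user, then time, so each user's events are adjacent and ordered.
    ordered = sorted(events, key=itemgetter("user_id", "ts"))
    for user_id, user_events in groupby(ordered, key=itemgetter("user_id")):
        user_sessions = []
        for ev in user_events:
            last = user_sessions[-1][-1] if user_sessions else None
            if last is not None and ev["ts"] - last <= gap_seconds:
                user_sessions[-1].append(ev["ts"])  # continue current session
            else:
                user_sessions.append([ev["ts"]])    # gap exceeded: new session
        sessions[user_id] = user_sessions
    return sessions


# Hypothetical, deliberately unsorted input.
events = [
    {"user_id": "u1", "ts": 4000},
    {"user_id": "u2", "ts": 100},
    {"user_id": "u1", "ts": 0},
    {"user_id": "u1", "ts": 600},
]
sessions = sessionize(events)
```

The plain for-loop version keyed on a dict gives the same result; the groupby form mainly saves the bookkeeping of detecting when the user changes.
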
How many Python problems should a data engineer solve before an interview?
30 to 50 for a phone screen, 80 to 120 for an onsite, 150 or more for FAANG-level loops. Distribute across the high-frequency topics: data transformation, dict and set work, file parsing, error handling, date and time. Quality beats count. A problem you have solved, debugged, and re-derived a week later is worth five you skimmed once.
How does the test runner show hidden case failures?
After public tests pass, the runner reveals hidden tests one at a time with the input shape, the expected output shape, and the specific assertion that failed. For performance tests, it shows the wall-clock time and the threshold. For correctness tests, it shows the diff between your output and the expected output, with the first divergent element highlighted.
Can I use third-party libraries beyond pandas and the stdlib?
The sandbox ships pandas, polars, numpy, pyarrow, json, csv, re, itertools, collections, heapq, datetime, requests, asyncio, and tenacity. Spark problems run in a separate PySpark sandbox with the standard pyspark imports. ML libraries (scikit-learn, torch) are not in scope; this is data engineer practice, not ML engineer practice.

394 practice problems matching this filter. Difficulty: easy (184), medium (184), hard (26).
