# Python Interview Questions

> Live-executed Python interview questions for data engineer roles, shaped like real pipeline code.

Canonical URL: <https://datadriven.io/python-interview-questions>

## Summary

399 Python interview questions pulled from data engineer interview reports. Pipeline-shaped work: parsing malformed CSVs without crashing, deduplicating event streams by composite key with a proper tiebreaker, walking nested JSON with a recursive flattener, sessionizing event logs with itertools.groupby, writing retry decorators with exponential backoff plus jitter. Not LeetCode algorithm prep.

## What this page covers

Python for data engineer roles is structurally different from Python for software engineer roles. About 4 percent of data engineer Python rounds resemble LeetCode algorithm puzzles. The other 96 percent are pipeline-shaped. The interviewer wants to see you parse a malformed CSV without crashing, deduplicate an event stream by composite key with a proper tiebreaker, walk a nested JSON with a recursive flattener that handles both lists-as-records and lists-as-attributes, sessionize an event log with itertools.groupby or a plain for-loop, write a retry decorator with exponential backoff plus random jitter (because without jitter every worker retries at the same later time and you have built a self-DDoS), and stream a 50GB file with a generator so memory stays constant regardless of input size.
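One way the recursive flattener could look is sketched below. It assumes that a list of dicts is a list of records (one output row per element) and that any other list is an attribute (kept as a single value). The function name `flatten`, the dotted key style, and the sample payload are illustrative choices, not something the catalog prescribes.

```python
def flatten(obj, prefix=""):
    """Flatten a nested dict into a list of flat rows with dotted keys.

    Assumption: a non-empty list whose items are all dicts is a list of
    records and fans out into one row per item; any other list is an
    attribute and is kept as a single value.
    """
    rows = [{}]
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            parts = flatten(value, prefix=f"{name}.")
        elif isinstance(value, list) and value and all(isinstance(v, dict) for v in value):
            # lists-as-records: each element contributes its own row(s)
            parts = [row for item in value for row in flatten(item, prefix=f"{name}.")]
        else:
            # scalars and lists-as-attributes stay as one value
            parts = [{name: value}]
        # combine every row built so far with every new part
        rows = [{**row, **part} for row in rows for part in parts]
    return rows


payload = {
    "order_id": 7,
    "customer": {"id": "c1", "tags": ["vip", "eu"]},
    "items": [{"sku": "a", "qty": 1}, {"sku": "b", "qty": 2}],
}
flat_rows = flatten(payload)  # two rows, one per item
```

Note that two sibling lists of records would be cross-joined by this sketch; whether that is the intended behavior is worth asking the interviewer.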

The 399-question catalog mirrors the data engineer interview surface area:

- Data transformation: 29 percent of the bank (112 problems)
- Dict and set operations: 16 percent (62)
- File parsing and IO: 12 percent (45)
- String manipulation: 10 percent (38)
- ETL flow control (router, batcher, retry decorator): 8 percent (32)
- Error handling with descriptive field-level errors: 7 percent (28)
- Date and time, including DST and timezone: 6 percent (24)
- OOP and context managers: 6 percent (22)
- Generators and lazy iteration: 4 percent (15)
- Traditional DSA algorithms: 2 percent (10), present for breadth but deliberately rare

Library coverage. Vanilla Python is preferred in most data engineer rounds; the interviewer wants to see that you understand what pandas does under the hood. Pandas appears across most data engineer loops as a generic library question; a typical prompt is "implement SCD Type 2 merge logic in pandas". PySpark dominates at Spark-first companies (Databricks, Netflix, Uber, Airbnb, DoorDash, Spotify). Polars is rare in interviews but signals fluency when it comes up. The sandbox ships pandas, polars, numpy, pyarrow, json, csv, re, itertools, collections, heapq, datetime, requests, asyncio, and tenacity. ML libraries (scikit-learn, torch) are not in scope; this is data engineer practice, not ML engineer practice.

Every Python question runs in a real Python 3.11 sandbox with 5 to 15 test cases. Public tests are visible in the problem statement so you can read the input shape. Hidden tests are revealed after the public ones pass and typically include empty input, single-element edge cases, Unicode user IDs (emoji and CJK characters break naive byte-counting), event timestamps at the DST boundary (which break naive timezone math), and a 100k-record performance test with a wall-clock budget that fails O(n-squared) solutions. The performance test is part of the same problem, not a separate exercise: a submission is graded on correctness and performance together, so a quadratic implementation fails even when its output is correct.
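Two of those hidden-test traps can be reproduced in a few lines. The sketch below assumes the standard-library `zoneinfo` database is available on the machine. The sample user ID and the `America/New_York` dates are illustrative, not taken from the catalog.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

# Unicode: characters and bytes are different counts.
user_id = "用户🙂"
char_count = len(user_id)                  # 3 code points
byte_count = len(user_id.encode("utf-8"))  # 3 + 3 + 4 = 10 bytes

# DST: on 2024-03-10, New York clocks jump from 02:00 to 03:00.
ny = ZoneInfo("America/New_York")
start = datetime(2024, 3, 10, 1, 30, tzinfo=ny)
end = datetime(2024, 3, 10, 3, 30, tzinfo=ny)

# Subtracting two datetimes that share a tzinfo compares wall-clock
# values and ignores the offset change.
wall_clock_gap = end - start
# Converting to UTC first gives the time that actually elapsed.
elapsed_gap = end.astimezone(timezone.utc) - start.astimezone(timezone.utc)

print(char_count, byte_count)       # 3 10
print(wall_clock_gap, elapsed_gap)  # 2:00:00 1:00:00
```

For session gaps and durations, converting to UTC before subtracting is usually the safer default.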

## Frequently asked questions

### Is this LeetCode-style Python algorithm prep?

No. About 4 percent of data engineer Python rounds resemble LeetCode algorithm puzzles. The other 96 percent are pipeline-shaped: parsing malformed CSVs, deduplicating event streams by composite key, validating records with field-level errors, sessionizing event logs, writing retry decorators. The catalog reflects that distribution. The algorithms tier exists for breadth but is intentionally small.

### Does the Python code actually execute in the sandbox?

Yes. Every submission runs in a real Python 3.11 sandbox with 5 to 15 test cases. Public tests are visible in the problem statement; hidden tests are revealed after the public ones pass and include empty input, Unicode, DST boundary timestamps, and a performance budget that fails quadratic solutions on 100k records.

### Do I need to know pandas or PySpark for a data engineer Python round?

It depends on the company. Pandas shows up across most data engineer interviews as a generic library question (a typical prompt: "implement SCD Type 2 merge logic in pandas"). PySpark dominates at Spark-first companies such as Databricks, Netflix, Uber, and Airbnb. Polars is rare in interviews but signals fluency. The catalog covers both pandas and PySpark; the company-specific lists show which library each company actually tests.

### What Python idioms come up most in data engineer interview rounds?

defaultdict and OrderedDict for grouped accumulation. csv.DictReader for tabular ingest. json with recursive flattening for nested payloads. itertools.groupby (on input already sorted by the grouping key) for sessionization. heapq.merge for streaming sorted merges. asyncio.Semaphore for capping the number of concurrent fetches. tenacity (or a hand-rolled decorator) for exponential backoff with random jitter. Most candidates write the longer for-loop version of these, which is fine as long as you mention the library equivalent.
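Two of these idioms are less familiar than the rest, so here is a minimal sketch of each. The names `merge_sorted_streams`, `fetch_all`, and `fetch_one` are hypothetical. Note that a semaphore caps how many calls are in flight at once; it does not enforce a requests-per-second limit by itself.

```python
import asyncio
import heapq


def merge_sorted_streams(*streams):
    """Lazily merge (ts, payload) streams that are each already sorted by ts."""
    return heapq.merge(*streams, key=lambda event: event[0])


async def fetch_all(ids, fetch_one, max_concurrent=3):
    """Await fetch_one(id) for every id, with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(item_id):
        async with semaphore:  # waits here while the cap is reached
            return await fetch_one(item_id)

    # gather returns results in the same order as the input ids
    return await asyncio.gather(*(guarded(item_id) for item_id in ids))


stream_a = [(1, "a"), (4, "d")]
stream_b = [(2, "b"), (3, "c")]
merged = list(merge_sorted_streams(stream_a, stream_b))
print(merged)  # [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]
```

`fetch_all` would be driven with `asyncio.run(fetch_all(ids, some_coroutine_function))`, where the coroutine function is whatever performs the actual request.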

### How many Python problems should a data engineer solve before an interview?

30 to 50 for a phone screen, 80 to 120 for an onsite, 150 or more for FAANG-level loops. Distribute across the high-frequency topics: data transformation, dict and set work, file parsing, error handling, date and time. Quality beats count. A problem you have solved, debugged, and re-derived a week later is worth five you skimmed once.

### How does the test runner show hidden case failures?

After public tests pass, the runner reveals hidden tests one at a time, showing the input shape, the expected output shape, and the specific assertion that failed. For performance tests, it shows the wall-clock time and the threshold. For correctness failures, it shows the diff between your output and the expected output, with the first divergent element highlighted.

### Can I use third-party libraries beyond pandas and the stdlib?

The sandbox ships pandas, polars, numpy, pyarrow, json, csv, re, itertools, collections, heapq, datetime, requests, asyncio, and tenacity. Spark problems run in a separate PySpark sandbox with the standard pyspark imports. ML libraries (scikit-learn, torch) are not in scope; this is data engineer practice, not ML engineer practice.

## How a data engineer prepares for the Python round

Python idioms and patterns that show up across data engineer interview Python rounds, in the order most candidates work through them.

### Step 1: Master dict and set operations

defaultdict(list) for grouped accumulation, dict comprehensions, set arithmetic for membership and dedup. These make up 16 percent of data engineer Python questions.
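A minimal sketch of those three idioms together, using hypothetical event dicts with `user_id` and `amount` fields:

```python
from collections import defaultdict


def group_amounts_by_user(events):
    """Collect amounts per user_id, keeping input order within each group."""
    grouped = defaultdict(list)
    for event in events:
        grouped[event["user_id"]].append(event["amount"])
    return dict(grouped)


def unseen_ids(incoming_ids, known_ids):
    """Return incoming IDs that are not already known, deduplicated,
    in first-seen order. Set lookups keep this linear, not quadratic."""
    known = set(known_ids)
    seen = set()
    fresh = []
    for item in incoming_ids:
        if item not in known and item not in seen:
            seen.add(item)
            fresh.append(item)
    return fresh


events = [
    {"user_id": "u1", "amount": 10},
    {"user_id": "u2", "amount": 5},
    {"user_id": "u1", "amount": 7},
]
# dict comprehension over the grouped result
totals = {user: sum(amounts) for user, amounts in group_amounts_by_user(events).items()}
print(totals)  # {'u1': 17, 'u2': 5}
```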

### Step 2: Master file parsing

csv.DictReader, json with nested structures, generator-based chunked reading for memory bounds, dead-letter handling for malformed lines.
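Dead-letter handling with csv.DictReader could be sketched as follows. The column names `order_id` and `amount`, and the shape of the dead-letter entries, are assumptions for illustration. The sketch relies on documented DictReader behavior: extra fields are stored under the key `None`, and missing fields get the value `None`.

```python
import csv
import io


def parse_orders(lines):
    """Parse CSV text lines into (good_rows, dead_letters) without raising.

    `lines` is any iterable of text lines, such as an open file.
    """
    reader = csv.DictReader(lines)
    good, dead = [], []
    for row in reader:
        line_no = reader.line_num
        try:
            # extra columns land under key None; missing ones have value None
            if None in row or None in row.values():
                raise ValueError("wrong number of columns")
            good.append({"order_id": row["order_id"], "amount": float(row["amount"])})
        except (ValueError, KeyError) as exc:
            # keep the bad row and the reason; do not crash the whole file
            dead.append({"line": line_no, "error": str(exc), "raw": dict(row)})
    return good, dead


# io.StringIO stands in for a real file here
sample = io.StringIO("order_id,amount\nA1,10.5\nA2,not-a-number\nA3\nA4,3,extra\n")
good, dead = parse_orders(sample)
print(good)                       # [{'order_id': 'A1', 'amount': 10.5}]
print([d["line"] for d in dead])  # [3, 4, 5]
```

Catching only ValueError and KeyError is deliberate: an unexpected exception type should still surface as a bug.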

### Step 3: Master dedup by composite key

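Dict keyed on a tuple of the key fields; replace the stored record when the incoming one has a newer timestamp, and break timestamp ties on a secondary field so the result is deterministic. This is the SQL ROW_NUMBER OVER pattern translated to Python.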
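A minimal sketch of this pattern, assuming records carry hypothetical `user_id`, `event_type`, `updated_at`, and `seq` fields, with `updated_at` as same-format ISO 8601 strings (so string comparison matches time order) and `seq` as the tiebreaker:

```python
def dedupe_latest(records):
    """Keep one record per (user_id, event_type).

    The newest updated_at wins; ties go to the higher seq. Roughly:
    ROW_NUMBER() OVER (PARTITION BY user_id, event_type
                       ORDER BY updated_at DESC, seq DESC) = 1
    """
    latest = {}
    for rec in records:
        key = (rec["user_id"], rec["event_type"])
        rank = (rec["updated_at"], rec["seq"])
        current = latest.get(key)
        if current is None or rank > (current["updated_at"], current["seq"]):
            latest[key] = rec
    return list(latest.values())


records = [
    {"user_id": "u1", "event_type": "click", "updated_at": "2024-01-01T10:00:00", "seq": 1},
    {"user_id": "u1", "event_type": "click", "updated_at": "2024-01-01T10:00:00", "seq": 2},
    {"user_id": "u2", "event_type": "view", "updated_at": "2024-01-01T09:00:00", "seq": 1},
    {"user_id": "u1", "event_type": "click", "updated_at": "2024-01-01T09:59:59", "seq": 9},
]
deduped = dedupe_latest(records)
print([(r["user_id"], r["seq"]) for r in deduped])  # [('u1', 2), ('u2', 1)]
```

Comparing the `(updated_at, seq)` tuple makes the tiebreaker explicit and keeps the result independent of input order.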

### Step 4: Master sessionization

Sort by (user, ts), walk with previous_ts variable or itertools.groupby, start new session when gap exceeds threshold.
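The walk described above can be sketched like this, assuming events are `(user_id, ts)` tuples with `ts` in epoch seconds and a 30-minute gap threshold. Both are illustrative choices.

```python
from itertools import groupby
from operator import itemgetter


def sessionize(events, gap_seconds=1800):
    """Return (user_id, ts, session_number) tuples, numbered per user from 0."""
    # groupby only groups adjacent items, so sort by (user, ts) first
    ordered = sorted(events, key=itemgetter(0, 1))
    out = []
    for user, user_events in groupby(ordered, key=itemgetter(0)):
        session = 0
        previous_ts = None
        for _, ts in user_events:
            # a gap strictly larger than the threshold starts a new session
            if previous_ts is not None and ts - previous_ts > gap_seconds:
                session += 1
            out.append((user, ts, session))
            previous_ts = ts
    return out


events = [("u1", 0), ("u2", 100), ("u1", 600), ("u1", 2401), ("u2", 1900)]
sessions = sessionize(events)
print(sessions)
# [('u1', 0, 0), ('u1', 600, 0), ('u1', 2401, 1), ('u2', 100, 0), ('u2', 1900, 0)]
```

Whether a gap exactly equal to the threshold starts a new session is a boundary case worth confirming with the interviewer; this sketch treats it as the same session.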

### Step 5: Master retry decorators

Catch narrow exceptions (requests.RequestException, not bare Exception), sleep 2-to-the-attempt plus random jitter, give up after N tries. Without jitter you build a self-DDoS.
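A hand-rolled version might look like the sketch below. To stay standard-library only, it retries on the built-in ConnectionError and TimeoutError, not requests.RequestException. The `sleep` parameter is an assumption added so the delays can be observed without actually waiting, and `flaky_fetch` is a made-up function that fails twice before succeeding.

```python
import functools
import random
import time


def retry(exceptions, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Retry only on `exceptions`, with exponential backoff plus random jitter."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts - 1:
                        raise  # out of attempts: surface the last error
                    # base * 2**attempt, plus up to one base_delay of jitter
                    # so workers do not all retry at the same instant
                    delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
                    sleep(delay)
        return wrapper
    return decorator


calls = {"n": 0}
delays = []  # records requested delays; no real sleeping happens


@retry((ConnectionError, TimeoutError), max_attempts=4, base_delay=0.01, sleep=delays.append)
def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


result = flaky_fetch()  # succeeds on the third call, after two backoff delays
```

Any exception type outside the tuple propagates immediately, which is the point of catching narrowly: a bug in the wrapped function should not be retried.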

### Step 6: Master generators for streaming

Yield rows from a large CSV one at a time so that memory stays constant regardless of file size. Pandas equivalent: read_csv(chunksize=N).
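A minimal sketch of generator-based streaming, with a hypothetical `bytes` column and an in-memory `io.StringIO` standing in for the large file:

```python
import csv
import io
from itertools import islice


def stream_rows(lines):
    """Yield one parsed CSV row (a dict) at a time; nothing is buffered."""
    yield from csv.DictReader(lines)


def in_batches(rows, size):
    """Lazily group any iterable into lists of at most `size` items."""
    iterator = iter(rows)
    while batch := list(islice(iterator, size)):
        yield batch


# With a real file, pass open(path, newline="") where `sample` is used below.
sample = io.StringIO("user_id,bytes\nu1,100\nu2,250\nu1,50\nu3,75\nu2,25\n")
total = sum(int(row["bytes"]) for row in stream_rows(sample))

sample.seek(0)
batch_sizes = [len(batch) for batch in in_batches(stream_rows(sample), size=2)]
print(total, batch_sizes)  # 500 [2, 2, 1]
```

Only one row (or one batch) is held in memory at a time, so peak memory depends on the batch size, not on the file size.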

### Step 7: Master pandas SCD Type 2 merge

Identify changed rows, expire the current version of each (set valid_to=now, is_current=False), and insert the new version (valid_from=now, is_current=True). Use merge with indicator to separate new keys from existing ones.
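One possible shape for this merge is sketched below. It assumes a dimension table with a single tracked attribute; the column names `customer_id` and `city`, the function name `scd2_merge`, and the sample data are illustrative. Deleted keys and multi-column change detection are out of scope for the sketch.

```python
import pandas as pd


def scd2_merge(dim, incoming, now):
    """Apply an SCD Type 2 update.

    dim columns: customer_id, city, valid_from, valid_to, is_current
    incoming columns: customer_id, city
    """
    current = dim[dim["is_current"]]
    merged = incoming.merge(
        current[["customer_id", "city"]],
        on="customer_id", how="left", suffixes=("", "_old"), indicator=True,
    )
    is_new = merged["_merge"] == "left_only"
    is_changed = (merged["_merge"] == "both") & (merged["city"] != merged["city_old"])
    changed_ids = merged.loc[is_changed, "customer_id"]

    # expire the current version of every changed key
    out = dim.copy()
    expire = out["is_current"] & out["customer_id"].isin(changed_ids)
    out.loc[expire, "valid_to"] = now
    out.loc[expire, "is_current"] = False

    # insert a fresh current version for new and changed keys
    inserts = merged.loc[is_new | is_changed, ["customer_id", "city"]].assign(
        valid_from=now, valid_to=pd.NaT, is_current=True
    )
    return pd.concat([out, inserts], ignore_index=True)


now = pd.Timestamp("2024-06-01")
dim = pd.DataFrame({
    "customer_id": [1, 2],
    "city": ["Paris", "Rome"],
    "valid_from": pd.to_datetime(["2024-01-01", "2024-01-01"]),
    "valid_to": [pd.NaT, pd.NaT],
    "is_current": [True, True],
})
incoming = pd.DataFrame({"customer_id": [1, 2, 3], "city": ["Paris", "Milan", "Oslo"]})
result = scd2_merge(dim, incoming, now)
```

In this example, customer 1 is unchanged, customer 2 gets its Rome row expired and a Milan row inserted, and customer 3 is inserted as new, for four rows in total.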

## Related practice catalogs

- [Python practice problems with hidden tests](https://datadriven.io/python-practice-problems): Public plus hidden test cases including Unicode, DST, and performance budgets.
- [PySpark interview problems for Spark-first roles](https://datadriven.io/pyspark-interview-questions): DataFrame fluency, join strategy, skew diagnosis, Spark UI reading.
- [SQL interview problems for data engineer prep](https://datadriven.io/sql-interview-questions): 927 problems with 10-seed grading. Pair SQL with Python for full data engineer prep.
- [Full data engineer interview catalog](https://datadriven.io/data-engineer-interview-questions): 1,400+ problems across SQL, Python, modeling, and pipeline design.
- [Netflix data engineer interview questions](https://datadriven.io/netflix-data-engineer-interview-questions): Spark-heavy Python, streaming patterns, late-arriving data reconciliation.
- [Amazon data engineer interview questions](https://datadriven.io/amazon-data-engineer-interview-questions): Pipeline-shaped Python with idempotency and retry-with-jitter patterns.

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.