The Python Round
Python shows up in 65% of data engineer interview loops, usually as a 45 to 60 minute live coding round or a take-home assignment. The format catches candidates off guard because it is not LeetCode. Interviewers want to see if you can parse a messy file, deduplicate records, and write production-quality data transformation code without reaching for pandas on every problem. This page is one of eight rounds in the complete data engineer interview preparation framework.
What the Python Round Actually Tests
Pattern frequency across 1,042 interview reports. Algorithmic, LeetCode-style problems are a small fraction (7.9%); data manipulation dominates.
| Pattern | Share of Python Questions | Common In |
|---|---|---|
| For loops over records | 13.1% | Every loop |
| Function definition with type hints | 9.0% | L4+, FAANG |
| List comprehensions | 8.2% | Every loop |
| Dict-as-index lookup | 7.1% | Every loop |
| if/else branching on row state | 6.3% | Every loop |
| Algorithm fundamentals (sort, hash) | 7.9% | FAANG only |
| Class definition (lightweight) | 4.4% | L4+, infra-heavy roles |
| Sorting with custom keys | 3.6% | Every loop |
| Generator and yield | 3.2% | L4+, scale-focused roles |
| JSON and CSV parsing | 8.7% | Every loop |
| collections.defaultdict, Counter | 6.1% | L4+ |
| File I/O with context managers | 5.4% | Every loop |
| Date and timezone handling | 4.2% | Analytics roles |
| Pandas DataFrame ops (when allowed) | 5.8% | Analytics, take-homes |
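Two of the patterns in the table, dict-as-index lookup and sorting with custom keys, can be sketched together. This is a minimal illustration, not a question from a real loop: the sample data and the field names (`user_id`, `country`, `amount`, `order_id`) are assumptions.

```python
# Hypothetical sample data; field names are assumptions for illustration.
users = [
    {"user_id": 1, "country": "US"},
    {"user_id": 2, "country": "DE"},
]
orders = [
    {"order_id": "a", "user_id": 2, "amount": 30.0},
    {"order_id": "b", "user_id": 1, "amount": 75.5},
    {"order_id": "c", "user_id": 3, "amount": 12.0},  # no matching user
    {"order_id": "d", "user_id": 1, "amount": 20.0},
]

# Dict-as-index: build the lookup once, then each probe is a hash lookup
# instead of a rescan of the users list for every order.
user_by_id = {u["user_id"]: u for u in users}

enriched = []
for o in orders:
    user = user_by_id.get(o["user_id"])  # None when the user is unknown
    country = user["country"] if user else None
    enriched.append({**o, "country": country})

# Custom sort key: known countries first (alphabetical), then amount descending.
# Tuples compare element by element; negating amount reverses that field only.
enriched.sort(key=lambda r: (r["country"] is None, r["country"] or "", -r["amount"]))

ordered_ids = [r["order_id"] for r in enriched]
print(ordered_ids)  # ['a', 'b', 'd', 'c']
```

The unmatched order is kept with `country` set to `None` rather than dropped; whether to keep or drop it is a question worth asking the interviewer.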
Five Worked Solutions From Real Loops
Each solution uses standard library Python only. Pandas equivalents are noted in the explanation but not used in the answer because most live coding rounds disallow it.
Flatten a nested JSON object into one level
def flatten(obj, prefix="", sep="."):
    """Flatten nested dicts into one level, joining keys with sep."""
    out = {}
    if isinstance(obj, dict):
        for k, v in obj.items():
            key = f"{prefix}{sep}{k}" if prefix else k
            out.update(flatten(v, key, sep))
    elif isinstance(obj, list):
        # Lists are collapsed into one comma-joined string, not expanded by index.
        out[prefix] = ",".join(str(x) for x in obj)
    else:
        out[prefix] = obj
    return out

# Example
nested = {"user": {"id": 1, "addr": {"city": "NYC", "zip": "10001"}}}
print(flatten(nested))
# {'user.id': 1, 'user.addr.city': 'NYC', 'user.addr.zip': '10001'}

Deduplicate by (user_id, event_type), keep latest
def dedup_latest(records):
    """Keep the record with the greatest ts for each (user_id, event_type)."""
    latest = {}
    for r in records:
        key = (r["user_id"], r["event_type"])
        if key not in latest or r["ts"] > latest[key]["ts"]:
            latest[key] = r
    return list(latest.values())

# Edge cases to mention:
# - Empty input returns []
# - Equal timestamps: the strict > keeps whichever comes first in input order
#   (use >= to keep the last instead)

Group events into sessions with 30-min inactivity gap
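from datetime import timedelta

def sessionize(events, gap_minutes=30):
    # Sort so each user's events are contiguous and in time order.
    # Assumes every ts is a datetime, so subtraction yields a timedelta.
    events = sorted(events, key=lambda e: (e["user_id"], e["ts"]))
    threshold = timedelta(minutes=gap_minutes)
    out = []
    last_user = None
    last_ts = None
    session_id = 0
    for e in events:
        # New session on a user change or a gap longer than the threshold.
        # The user check short-circuits on the first event, while last_ts is still None.
        if e["user_id"] != last_user or (e["ts"] - last_ts) > threshold:
            session_id += 1
        out.append({**e, "session_id": session_id})
        last_user, last_ts = e["user_id"], e["ts"]
    return out

Stream a large CSV with chunked transformation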
import csv
from typing import Iterator

def stream_clean_csv(path: str, chunk_size: int = 10_000) -> Iterator[list[dict]]:
    """Yield lists of cleaned rows, at most chunk_size rows at a time."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        chunk = []
        for row in reader:
            # Normalize first so whitespace-only emails are dropped too.
            email = (row.get("email") or "").strip().lower()
            if not email:
                continue
            row["email"] = email
            chunk.append(row)
            if len(chunk) >= chunk_size:
                yield chunk
                chunk = []
        if chunk:
            yield chunk  # final partial chunk

# bulk_insert stands in for your sink; it is not defined here.
for batch in stream_clean_csv("users.csv"):
    bulk_insert(batch)  # production: writes to Postgres or S3

Inner join two lists of dicts on a shared key
def inner_join(left, right, key):
    # Index the right side once so each left row is a single dict lookup.
    right_index = {}
    for r in right:
        right_index.setdefault(r[key], []).append(r)
    out = []
    for left_row in left:
        for right_row in right_index.get(left_row[key], []):
            merged = {**left_row, **right_row}
            out.append(merged)
    return out

# Note: when both sides share a non-key field name, the right value overwrites the left.
# Mention this and ask the interviewer if it is acceptable.

What Interviewers Watch For
- 01 Standard library fluency
  collections.defaultdict, collections.Counter, itertools.groupby, functools.reduce, datetime, json, csv, re. If you import any of these reflexively when they fit the problem, the interviewer notes it as a senior signal.
- 02 Edge cases stated upfront
  Empty input, single-row input, all-None input, malformed row, missing key. State two or three before writing the function. Many candidates write a working function, then realize at the end that empty input crashes it.
- 03 Type hints on signatures
  Not required, but a strong signal at L4+. Show you think about the contract. def dedup(records: list[dict]) -> list[dict] beats def dedup(records).
- 04 No premature optimization
  Write the clear version first. Then ask the interviewer if performance matters before refactoring. Candidates who rewrite a working function for performance without being asked waste 10 minutes and signal poor judgment.
- 05 Generators when scale is mentioned
  If the problem says 'large file', 'streaming', or 'cannot fit in memory', use yield. Materializing the full list disqualifies an otherwise correct answer.
- 06 No pandas without permission
  Pandas in a vanilla Python round is the single most common L3 rejection signal. Ask before importing. If allowed, use it sparingly. If not, drop down to dict and list.
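Several of these signals can show up in one short function. The sketch below is one way to combine them, assuming a hypothetical task (report each user's most common event type) and hypothetical field names `user_id` and `event_type`. It is an illustration, not a question taken from a real loop.

```python
from collections import Counter, defaultdict


def top_event_per_user(records: list[dict]) -> dict[str, tuple[str, int]]:
    """Return {user_id: (most_common_event_type, count)}.

    Edge cases, stated before coding:
    - Empty input returns {}.
    - Rows missing user_id or event_type are skipped, not fatal.
    - Ties go to the event type seen first for that user.
    """
    counts = defaultdict(Counter)
    for r in records:
        user_id = r.get("user_id")
        event_type = r.get("event_type")
        if user_id is None or event_type is None:
            continue  # malformed row
        counts[user_id][event_type] += 1
    # most_common(1) returns a one-element list of (value, count) pairs.
    return {user: c.most_common(1)[0] for user, c in counts.items()}


events = [
    {"user_id": "u1", "event_type": "click"},
    {"user_id": "u1", "event_type": "view"},
    {"user_id": "u1", "event_type": "click"},
    {"user_id": "u2"},  # missing event_type, skipped
]
print(top_event_per_user(events))  # {'u1': ('click', 2)}
print(top_event_per_user([]))      # {}
```

Writing the docstring before the loop is a cheap way to state the contract and the edge cases out loud before any logic exists.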
When Pandas Is Right (and When It Is Wrong)
Pandas is right for take-home assignments where the dataset is small to medium (under 10 GB), the problem involves a lot of groupby and pivoting, and the interviewer evaluates on output not on the code. It is also right for analytics-engineer style rounds where the problem reads like a SQL query in disguise. See how to pass the analytics engineer interview for how pandas-heavy that loop runs.
Pandas is wrong for live coding when the interviewer wants to see Python fluency, for streaming or generator problems, for anything where the problem says "process this 50 GB file", and for low-level transformation logic where one .apply() call hides the real algorithm. If you find yourself reaching for .apply with a lambda, write the loop instead so the interviewer can see your thinking. The same instinct applies in how to pass the SQL round, where reaching for a window function on a problem solvable with GROUP BY is the parallel mistake.
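To make the .apply point concrete, here is a sketch of row-level logic written as an explicit loop. The pandas version appears only as a comment for contrast. The tiering rule and the field names (`amount`, `is_refund`) are hypothetical.

```python
# Pandas version that hides the branching inside a lambda (for contrast only):
#   df["tier"] = df.apply(
#       lambda r: "refund" if r["is_refund"]
#       else ("high" if r["amount"] >= 100 else "low"),
#       axis=1,
#   )


def tier_for(row: dict) -> str:
    """Classify one row; each branch is a separate, nameable rule."""
    if row.get("is_refund"):
        return "refund"
    if row.get("amount", 0) >= 100:
        return "high"
    return "low"


def add_tier(rows: list[dict]) -> list[dict]:
    """Return new rows with a 'tier' field; the input rows are not mutated."""
    out = []
    for row in rows:
        out.append({**row, "tier": tier_for(row)})
    return out


rows = [
    {"amount": 250, "is_refund": False},
    {"amount": 40, "is_refund": False},
    {"amount": 500, "is_refund": True},
]
print([r["tier"] for r in add_tier(rows)])  # ['high', 'low', 'refund']
```

Splitting the rule into `tier_for` gives the interviewer one place to ask about edge cases, such as a missing `amount`, which this sketch treats as 0.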
How the Python Round Connects to the Rest of the Loop
Python is the connective tissue of the Data Engineer loop. The sessionization pattern in this round is the same gap-and-island pattern from how to pass the SQL round, just expressed in procedural code instead of declarative SQL. The composite-key dedup pattern is the same logic you defend in how to pass the data modeling round when you argue for SCD Type 2. Generators and chunked I/O are the scale-down version of the partitioning and shuffle patterns from how to pass the system design round.
Take-home assignments often combine SQL and Python in one artifact, which is why how to pass the Data Engineer take-home is the highest-leverage prep page for take-home-heavy companies. If you're targeting Airbnb (where the take-home is the loop) or Databricks (where PySpark replaces vanilla Python), read those pages next.
How to Prepare in Four Weeks
- 01 Week 1: Standard library mastery
  20 problems using only collections, itertools, datetime, json, csv. Goal: write a working solution to any data manipulation problem without imports beyond these.
- 02 Week 2: Patterns and parsing
  JSON flattening, CSV streaming, log parsing, deduplication, sessionization. 15 problems. Time yourself: medium under 15 minutes, hard under 25. Always state edge cases first.
- 03 Week 3: Pandas and numpy basics
  Only if take-home rounds are likely. groupby, merge, pivot_table, apply, melt. 10 problems. Focus on writing readable transformation chains, not one-liners.
- 04 Week 4: Mock rounds out loud
  Run the Python mock interview 10 times. Speak every line. Narrate the type signature, the edge cases, the algorithm choice, the time complexity. Silence is the most common failure mode.
Python Round FAQ
Do I need to know data structures and algorithms for the data engineer Python round?
Is Python 3.10 syntax accepted in interviews?
Should I use type hints?
How is the Python round different at FAANG vs other companies?
Can I use Jupyter or do I get a plain editor?
What if I forget a method name during the round?
How long is a typical Python round?
Pass the Python Round in 4 Weeks
- 01 Active recall beats re-reading by 50%
  Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom.
- 02 76% of hiring managers reject on the coding task, not the resume
  From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice.
- 03 Five problem shapes cover 80% of data engineer loops
  Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition.
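Of the shapes listed, top-N-per-group is one this page has not shown in code. The following is a minimal standard-library sketch, assuming the records fit in memory. The function name, its `group_key` and `sort_key` parameters, and the sample data are all hypothetical.

```python
import heapq
from collections import defaultdict


def top_n_per_group(records, group_key, sort_key, n=3):
    """Return {group: up to n records with the largest sort_key, descending}."""
    groups = defaultdict(list)
    for r in records:
        groups[r[group_key]].append(r)
    # heapq.nlargest avoids fully sorting each group when n is small.
    return {
        g: heapq.nlargest(n, rows, key=lambda r: r[sort_key])
        for g, rows in groups.items()
    }


sales = [
    {"region": "east", "rep": "a", "revenue": 10},
    {"region": "east", "rep": "b", "revenue": 30},
    {"region": "east", "rep": "c", "revenue": 20},
    {"region": "west", "rep": "d", "revenue": 5},
]
top = top_n_per_group(sales, "region", "revenue", n=2)
print([r["rep"] for r in top["east"]])  # ['b', 'c']
print([r["rep"] for r in top["west"]])  # ['d']
```

In SQL the same shape is usually written with ROW_NUMBER() over a partition; here the dict of lists plays the role of the partition.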