Python shows up in 65% of data engineer interview loops, usually as a 45-to-60-minute live coding round or a take-home assignment. The format catches candidates off guard because it is not LeetCode: interviewers want to see whether you can parse a messy file, deduplicate records, and write production-quality data transformation code without reaching for pandas on every problem. This page covers one of the eight rounds in the complete data engineer interview preparation framework.
Pattern frequency below is drawn from 1,042 interview reports. Note that algorithmic, LeetCode-style problems are a small fraction; data manipulation dominates.
| Pattern | Share of Python Questions | Common In |
|---|---|---|
| For loops over records | 13.1% | Every loop |
| Function definition with type hints | 9.0% | L4+, FAANG |
| JSON and CSV parsing | 8.7% | Every loop |
| List comprehensions | 8.2% | Every loop |
| Algorithm fundamentals (sort, hash) | 7.9% | FAANG only |
| Dict-as-index lookup | 7.1% | Every loop |
| if/else branching on row state | 6.3% | Every loop |
| collections.defaultdict, Counter | 6.1% | L4+ |
| Pandas DataFrame ops (when allowed) | 5.8% | Analytics, take-homes |
| File I/O with context managers | 5.4% | Every loop |
| Class definition (lightweight) | 4.4% | L4+, infra-heavy roles |
| Date and timezone handling | 4.2% | Analytics roles |
| Sorting with custom keys | 3.6% | Every loop |
| Generator and yield | 3.2% | L4+, scale-focused roles |
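Two of the rows above, `collections.defaultdict` and `Counter`, compress the most common grouping idioms into one-liners. A generic sketch (the sample rows here are made up, not from a graded question):

```python
from collections import Counter, defaultdict

rows = [("us", "click"), ("us", "view"), ("de", "click")]

# Counter: frequency of a single field in one pass
country_counts = Counter(country for country, _ in rows)

# defaultdict: group values under a key without "if key not in d" checks
by_country = defaultdict(list)
for country, event in rows:
    by_country[country].append(event)

print(country_counts)    # Counter({'us': 2, 'de': 1})
print(dict(by_country))  # {'us': ['click', 'view'], 'de': ['click']}
```

Both are standard library, so they are safe to use even in rounds that ban pandas.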
Each solution uses standard library Python only. Pandas equivalents are noted in the explanation but not used in the answer because most live coding rounds disallow it.
```python
def flatten(obj, prefix="", sep="."):
    """Flatten nested dicts into a single level of dot-delimited keys."""
    out = {}
    if isinstance(obj, dict):
        for k, v in obj.items():
            key = f"{prefix}{sep}{k}" if prefix else k
            out.update(flatten(v, key, sep))
    elif isinstance(obj, list):
        # Lists collapse to a comma-joined string; ask the interviewer
        # whether index-keyed expansion is preferred instead.
        out[prefix] = ",".join(str(x) for x in obj)
    else:
        out[prefix] = obj
    return out

# Example
nested = {"user": {"id": 1, "addr": {"city": "NYC", "zip": "10001"}}}
print(flatten(nested))
# {'user.id': 1, 'user.addr.city': 'NYC', 'user.addr.zip': '10001'}
```

```python
def dedup_latest(records):
    """Keep the most recent record per (user_id, event_type) composite key."""
    latest = {}
    for r in records:
        key = (r["user_id"], r["event_type"])
        if key not in latest or r["ts"] > latest[key]["ts"]:
            latest[key] = r
    return list(latest.values())
```
Edge cases to mention:

- Empty input returns `[]`.
- Equal timestamps: the strict `>` comparison keeps whichever record appears first in the input order, not last.
```python
from datetime import timedelta

def sessionize(events, gap_minutes=30):
    """Assign session IDs: a new session starts when the user changes
    or the gap since the previous event exceeds the threshold."""
    events = sorted(events, key=lambda e: (e["user_id"], e["ts"]))
    threshold = timedelta(minutes=gap_minutes)
    out = []
    last_user = None
    last_ts = None
    session_id = 0
    for e in events:
        if e["user_id"] != last_user or (e["ts"] - last_ts) > threshold:
            session_id += 1
        out.append({**e, "session_id": session_id})
        last_user, last_ts = e["user_id"], e["ts"]
    return out
```
```python
import csv
from typing import Iterator

def stream_clean_csv(path: str, chunk_size: int = 10_000) -> Iterator[list[dict]]:
    """Stream a CSV in fixed-size chunks, dropping rows with no email and
    normalizing the email field, without loading the whole file into memory."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        chunk = []
        for row in reader:
            if not row.get("email"):
                continue
            row["email"] = row["email"].lower().strip()
            chunk.append(row)
            if len(chunk) >= chunk_size:
                yield chunk
                chunk = []
        if chunk:
            yield chunk

# Usage -- bulk_insert is a stand-in for the real sink:
for batch in stream_clean_csv("users.csv"):
    bulk_insert(batch)  # production: writes to Postgres or S3
```

```python
def inner_join(left, right, key):
    """Hash join: index the right side once, then probe it per left row."""
    right_index = {}
    for r in right:
        right_index.setdefault(r[key], []).append(r)
    out = []
    for l in left:
        for r in right_index.get(l[key], []):
            # {**l, **r}: right-hand values win on duplicate column names
            merged = {**l, **r}
            out.append(merged)
    return out
```
Note: on a key collision, the right value overwrites the left. Mention this and ask the interviewer whether it is acceptable.

Pandas is the right choice for take-home assignments where the dataset is small to medium (under 10 GB), the problem involves heavy groupby and pivoting, and the interviewer evaluates the output rather than the code. It is also right for analytics-engineer style rounds where the problem reads like a SQL query in disguise. See how to pass the analytics engineer interview for how pandas-heavy that loop runs.
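When pandas is the right tool, the take-home work usually collapses to a few groupby lines. A minimal sketch with invented column names (not from any specific assignment):

```python
import pandas as pd

# Hypothetical take-home aggregation: total amount per user per day.
df = pd.DataFrame({
    "user_id": [1, 1, 2],
    "day": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "amount": [10.0, 5.0, 7.5],
})
daily = df.groupby(["user_id", "day"], as_index=False)["amount"].sum()
```

`as_index=False` keeps the group keys as columns, which matters when the next step is writing the result back out as a flat file.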
Pandas is the wrong choice for live coding when the interviewer wants to see Python fluency, for streaming or generator problems, for anything where the problem says "process this 50 GB file", and for low-level transformation logic where a single .apply() call hides the real algorithm. If you find yourself reaching for .apply with a lambda, write the loop instead so the interviewer can see your thinking. The same instinct applies in how to pass the SQL round, where reaching for a window function on a problem solvable with GROUP BY is the parallel mistake.
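A minimal illustration of that advice, with made-up event dicts: the explicit loop below does the work of something like `df.groupby("user_id").size()`, but every step of the algorithm stays visible.

```python
# Per-user event counts with a plain dict loop -- the shape interviewers
# want to see instead of a .apply(lambda ...) one-liner.
events = [
    {"user_id": 1, "event": "click"},
    {"user_id": 1, "event": "view"},
    {"user_id": 2, "event": "click"},
]
counts = {}
for e in events:
    counts[e["user_id"]] = counts.get(e["user_id"], 0) + 1
print(counts)  # {1: 2, 2: 1}
```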
Python is the connective tissue of the Data Engineer loop. The sessionization pattern in this round is the same gap-and-island pattern from how to pass the SQL round, just expressed in procedural code instead of declarative SQL. The composite-key dedup pattern is the same logic you defend in how to pass the data modeling round when you argue for SCD Type 2. Generators and chunked I/O are the scale-down version of the partitioning and shuffle patterns from how to pass the system design round.
Take-home assignments often combine SQL and Python in one artifact, which is why the how to pass the Data Engineer take-home is the highest-leverage prep page for take-home heavy companies. If you're targeting Airbnb (where the take-home is the loop) or Databricks (where PySpark replaces vanilla Python), read those pages next.
Practice in a real Python sandbox in the browser. Write code, run it, see results. Build the speed and instincts you need to write clean Python under interview pressure.
Start the Python Mock Interview

100+ Python questions tagged by company and pattern.
What to learn in Python before any Data Engineer interview.
Pillar guide covering every round in the Data Engineer loop, end to end.
Window functions, gap-and-island, and the patterns interviewers test in 95% of Data Engineer loops.
Star schema, SCD Type 2, fact-table grain, and how to defend a model against pushback.
Pipeline architecture, exactly-once semantics, and the framing that gets you to L5.
STAR-D answers tailored to data engineering, with example responses for impact and conflict.
What graders look for in a 4 to 8 hour Data Engineer take-home, with a rubric breakdown.
How to think out loud, handle silence, and avoid the traps that sink fluent coders.
Continue your prep
50+ guides covering every round, company, role, and technology in the data engineer interview loop. Grounded in 2,817 verified interview reports across 929 companies, collected from real candidates.