Interview Round Guide

The Python Round

Python shows up in 65% of data engineer interview loops, usually as a 45 to 60 minute live coding round or a take-home assignment. The format catches candidates off guard because it is not LeetCode: interviewers want to see whether you can parse a messy file, deduplicate records, and write production-quality data transformation code without reaching for pandas on every problem. This page covers one of eight rounds in the complete data engineer interview preparation framework.

The Short Answer
Expect data wrangling, not algorithms. The most common problems are JSON flattening, CSV parsing, deduplication by composite key, sessionization with a timeout, and joining two lists of dicts. Use vanilla Python with the standard library by default. Pandas only when the interviewer explicitly allows it. Write functions that handle empty input, missing keys, and malformed rows without crashing. Speed matters: medium problems in 15 minutes, hard in 25.
Updated April 2026 · By The DataDriven Team

What the Python Round Actually Tests

Pattern frequency from 1,042 interview reports. Note that algorithmic LeetCode-style problems are a small fraction. Data manipulation dominates.

Pattern                              | Share of Python Questions | Common In
For loops over records               | 13.1% | Every loop
Function definition with type hints  | 9.0%  | L4+, FAANG
List comprehensions                  | 8.2%  | Every loop
Dict-as-index lookup                 | 7.1%  | Every loop
if/else branching on row state       | 6.3%  | Every loop
Algorithm fundamentals (sort, hash)  | 7.9%  | FAANG only
Class definition (lightweight)       | 4.4%  | L4+, infra-heavy roles
Sorting with custom keys             | 3.6%  | Every loop
Generator and yield                  | 3.2%  | L4+, scale-focused roles
JSON and CSV parsing                 | 8.7%  | Every loop
collections.defaultdict, Counter     | 6.1%  | L4+
File I/O with context managers       | 5.4%  | Every loop
Date and timezone handling           | 4.2%  | Analytics roles
Pandas DataFrame ops (when allowed)  | 5.8%  | Analytics, take-homes

Five Worked Solutions From Real Loops

Each solution uses standard library Python only. Pandas equivalents are noted in the explanation but not used in the answer because most live coding rounds disallow it.

Pattern: JSON Flattening

Flatten a nested JSON object into one level

Recurse over keys. Concatenate parent keys with a separator. Decide upfront how to handle lists: explode into rows or serialize to a string. State the decision out loud.
def flatten(obj, prefix="", sep="."):
    out = {}
    if isinstance(obj, dict):
        for k, v in obj.items():
            key = f"{prefix}{sep}{k}" if prefix else k
            out.update(flatten(v, key, sep))
    elif isinstance(obj, list):
        out[prefix] = ",".join(str(x) for x in obj)
    else:
        out[prefix] = obj
    return out

# Example
nested = {"user": {"id": 1, "addr": {"city": "NYC", "zip": "10001"}}}
print(flatten(nested))
# {"user.id": 1, "user.addr.city": "NYC", "user.addr.zip": "10001"}
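If the interviewer prefers exploding lists rather than serializing them, a hedged variant appends positional indices to the key path (assumption: index-suffixed keys like user.tags.0 are an acceptable convention; confirm before writing):

```python
def flatten_explode(obj, prefix="", sep="."):
    """Variant of flatten: list elements get positional indices in the key."""
    out = {}
    if isinstance(obj, dict):
        for k, v in obj.items():
            key = f"{prefix}{sep}{k}" if prefix else k
            out.update(flatten_explode(v, key, sep))
    elif isinstance(obj, list):
        # Explode: each element becomes its own flattened key.
        # Note: an empty list produces no key at all -- state that tradeoff.
        for i, v in enumerate(obj):
            key = f"{prefix}{sep}{i}" if prefix else str(i)
            out.update(flatten_explode(v, key, sep))
    else:
        out[prefix] = obj
    return out

print(flatten_explode({"user": {"tags": ["a", "b"]}}))
# {'user.tags.0': 'a', 'user.tags.1': 'b'}
```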
Pattern: Composite Key Dedup

Deduplicate by (user_id, event_type), keep latest

Two valid approaches. The dict-keyed-on-tuple approach is O(n) and uses O(n) extra memory. The sort-then-iterate approach is O(n log n) with O(1) extra memory if you sort in place. Mention the tradeoff before writing.
def dedup_latest(records):
    latest = {}
    for r in records:
        key = (r["user_id"], r["event_type"])
        if key not in latest or r["ts"] > latest[key]["ts"]:
            latest[key] = r
    return list(latest.values())

# Edge cases to mention:
# - Empty input returns []
# - Equal timestamps: the strict > keeps whichever comes FIRST in input
#   order; switch to >= if the last duplicate should win
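For completeness, a sketch of the sort-then-iterate alternative mentioned above. Note that it sorts in place, mutating the caller's list; copy first if that matters.

```python
def dedup_latest_sorted(records):
    # Sort so the latest record for each (user_id, event_type) comes last.
    records.sort(key=lambda r: (r["user_id"], r["event_type"], r["ts"]))
    out = []
    for i, r in enumerate(records):
        key = (r["user_id"], r["event_type"])
        nxt = records[i + 1] if i + 1 < len(records) else None
        # Keep r only if it is the last record of its key group.
        if nxt is None or (nxt["user_id"], nxt["event_type"]) != key:
            out.append(r)
    return out
```

O(n log n) time, O(1) extra memory beyond the output. Ties on ts keep the duplicate that sorts last, which (because Python's sort is stable) is the later one in input order, the opposite of the dict approach above.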
Pattern: Sessionization

Group events into sessions with 30-min inactivity gap

Real production code from a web analytics pipeline. Sort by user_id and ts, then walk the sorted list. Increment session_id when the gap exceeds the threshold or when the user changes.
from datetime import timedelta

def sessionize(events, gap_minutes=30):
    events = sorted(events, key=lambda e: (e["user_id"], e["ts"]))
    threshold = timedelta(minutes=gap_minutes)
    out = []
    last_user = None
    last_ts = None
    session_id = 0
    for e in events:
        if e["user_id"] != last_user or (e["ts"] - last_ts) > threshold:
            session_id += 1
        out.append({**e, "session_id": session_id})
        last_user, last_ts = e["user_id"], e["ts"]
    return out
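The same logic can also be sketched with itertools.groupby, walking each user's events separately. This is a hedged alternative, not the canonical answer, shown with hypothetical sample data:

```python
from datetime import datetime, timedelta
from itertools import groupby

def sessionize_grouped(events, gap_minutes=30):
    # Same contract as sessionize above, but groups by user first.
    threshold = timedelta(minutes=gap_minutes)
    events = sorted(events, key=lambda e: (e["user_id"], e["ts"]))
    out, session_id = [], 0
    for _, user_events in groupby(events, key=lambda e: e["user_id"]):
        last_ts = None
        for e in user_events:
            # New session on the user's first event or after a long gap.
            if last_ts is None or (e["ts"] - last_ts) > threshold:
                session_id += 1
            out.append({**e, "session_id": session_id})
            last_ts = e["ts"]
    return out

events = [
    {"user_id": "u1", "ts": datetime(2026, 4, 1, 9, 0)},
    {"user_id": "u1", "ts": datetime(2026, 4, 1, 9, 45)},  # 45-min gap: new session
    {"user_id": "u2", "ts": datetime(2026, 4, 1, 9, 10)},
]
print([(e["user_id"], e["session_id"]) for e in sessionize_grouped(events)])
# [('u1', 1), ('u1', 2), ('u2', 3)]
```

groupby requires the input to be pre-sorted by the grouping key, which the sort on the first line guarantees; say that out loud, because forgetting it is a classic groupby bug.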
Pattern: Generator for Scale

Stream a large CSV with chunked transformation

Loading a 50 GB CSV into memory crashes the process. A generator yields one chunk at a time. The interviewer is looking for the yield keyword and an explanation of why it matters.
import csv
from typing import Iterator

def stream_clean_csv(path: str, chunk_size: int = 10_000) -> Iterator[list[dict]]:
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        chunk = []
        for row in reader:
            if not row.get("email"):
                continue
            row["email"] = row["email"].lower().strip()
            chunk.append(row)
            if len(chunk) >= chunk_size:
                yield chunk
                chunk = []
        if chunk:
            yield chunk

for batch in stream_clean_csv("users.csv"):
    bulk_insert(batch)  # production: writes to Postgres or S3
Pattern: List-of-Dict Join

Inner join two lists of dicts on a shared key

The naive O(n*m) double loop is wrong for any non-trivial size. Build a dict index on the smaller list, then iterate the larger: O(n + m) time, O(min(n, m)) space. The example below indexes the right side; say out loud that in practice you would index whichever side is smaller.
def inner_join(left, right, key):
    right_index = {}
    for r in right:
        right_index.setdefault(r[key], []).append(r)
    out = []
    for l in left:
        for r in right_index.get(l[key], []):
            merged = {**l, **r}
            out.append(merged)
    return out

# Note: right value overwrites left on key collision.
# Mention this and ask the interviewer if it is acceptable.
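If the interviewer says the collision behavior is not acceptable, one hedged fix is to prefix the right side's non-key columns so nothing is clobbered (the r_ prefix here is an arbitrary choice, not a convention from the problem):

```python
def inner_join_prefixed(left, right, key, right_prefix="r_"):
    # Index the right side by join key: O(m) build, O(1) lookup per probe.
    right_index = {}
    for r in right:
        right_index.setdefault(r[key], []).append(r)
    out = []
    for l in left:
        for r in right_index.get(l[key], []):
            merged = dict(l)
            for k, v in r.items():
                if k != key:
                    merged[right_prefix + k] = v  # keep left columns intact
            out.append(merged)
    return out

print(inner_join_prefixed([{"id": 1, "name": "a"}], [{"id": 1, "name": "b"}], "id"))
# [{'id': 1, 'name': 'a', 'r_name': 'b'}]
```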

What Interviewers Watch For

1. Standard library fluency

collections.defaultdict, collections.Counter, itertools.groupby, functools.reduce, datetime, json, csv, re. Reaching for these reflexively when they fit the problem reads as a senior signal.
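A sketch of the kind of reflexive usage that earns that signal: counting event types per user with defaultdict and Counter, no key-existence checks anywhere (event data here is hypothetical):

```python
from collections import Counter, defaultdict

events = [
    {"user_id": "u1", "event_type": "click"},
    {"user_id": "u1", "event_type": "click"},
    {"user_id": "u1", "event_type": "view"},
    {"user_id": "u2", "event_type": "view"},
]

# defaultdict(Counter) creates a fresh Counter on first access per user,
# so the loop body is a single increment with no setdefault boilerplate.
per_user = defaultdict(Counter)
for e in events:
    per_user[e["user_id"]][e["event_type"]] += 1

print(per_user["u1"].most_common(1))
# [('click', 2)]
```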
2. Edge cases stated upfront

Empty input, single-row input, all-None input, malformed row, missing key. State two or three before writing the function. Many candidates write a working function then realize at the end that empty input crashes it.
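For example, guarding the dedup pattern from earlier against those cases might look like this (a hedged sketch; silently skipping malformed rows is one policy, raising is another, so ask which the interviewer wants):

```python
def dedup_latest_safe(records):
    if not records:  # edge case: empty or None input
        return []
    latest = {}
    for r in records:
        # Edge case: malformed row -- skip rows missing any required key.
        if not all(k in r for k in ("user_id", "event_type", "ts")):
            continue
        key = (r["user_id"], r["event_type"])
        if key not in latest or r["ts"] > latest[key]["ts"]:
            latest[key] = r
    return list(latest.values())
```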
3. Type hints on signatures

Not required, but a strong signal at L4+. Show you think about the contract. def dedup(records: list[dict]) -> list[dict] beats def dedup(records).
4. No premature optimization

Write the clear version first. Then ask the interviewer if performance matters before refactoring. Candidates who rewrite a working function for performance without being asked waste 10 minutes and signal poor judgment.
5. Generators when scale is mentioned

If the problem says 'large file', 'streaming', or 'cannot fit in memory', use yield. Materializing the full list disqualifies an otherwise correct answer.
6. No pandas without permission

Pandas in a vanilla Python round is the single most common L3 rejection signal. Ask before importing. If allowed, use it sparingly. If not, drop down to dict and list.

When Pandas Is Right (and When It Is Wrong)

Pandas is right for take-home assignments where the dataset is small to medium (under 10 GB), the problem involves a lot of groupby and pivoting, and the interviewer evaluates on output not on the code. It is also right for analytics-engineer style rounds where the problem reads like a SQL query in disguise. See how to pass the analytics engineer interview for how pandas-heavy that loop runs.

Pandas is wrong for live coding when the interviewer wants to see Python fluency, for streaming or generator problems, for anything where the problem says "process this 50 GB file", and for low-level transformation logic where one .apply() call hides the real algorithm. If you find yourself reaching for .apply with a lambda, write the loop instead so the interviewer can see your thinking. The same instinct applies in how to pass the SQL round, where reaching for a window function on a problem solvable with GROUP BY is the parallel mistake.
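To make the .apply point concrete, here is a hedged sketch of the same kind of cleanup rule written as an explicit loop over plain records, so every branch is visible (field names are hypothetical):

```python
def clean_emails(rows):
    # The loop makes the drop policy explicit; a df["email"].apply(lambda ...)
    # would hide both the normalization and the missing-value handling.
    out = []
    for row in rows:
        email = row.get("email")
        if not email:  # drop rows with no email -- a visible, stated policy
            continue
        out.append({**row, "email": email.strip().lower()})
    return out

print(clean_emails([{"email": " A@B.com "}, {"email": None}]))
# [{'email': 'a@b.com'}]
```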

How the Python Round Connects to the Rest of the Loop

Python is the connective tissue of the Data Engineer loop. The sessionization pattern in this round is the same gap-and-island pattern from how to pass the SQL round, just expressed in procedural code instead of declarative SQL. The composite-key dedup pattern is the same logic you defend in how to pass the data modeling round when you argue for SCD Type 2. Generators and chunked I/O are the scale-down version of the partitioning and shuffle patterns from how to pass the system design round.

Take-home assignments often combine SQL and Python in one artifact, which is why the how to pass the Data Engineer take-home is the highest-leverage prep page for take-home heavy companies. If you're targeting Airbnb (where the take-home is the loop) or Databricks (where PySpark replaces vanilla Python), read those pages next.

How to Prepare in Four Weeks

1. Week 1: Standard library mastery

20 problems using only collections, itertools, datetime, json, csv. Goal: write a working solution to any data manipulation problem without imports beyond these.
2. Week 2: Patterns and parsing

JSON flattening, CSV streaming, log parsing, deduplication, sessionization. 15 problems. Time yourself: medium under 15 minutes, hard under 25. Always state edge cases first.
3. Week 3: Pandas and numpy basics

Only if take-home rounds are likely. groupby, merge, pivot_table, apply, melt. 10 problems. Focus on writing readable transformation chains, not one-liners.
4. Week 4: Mock rounds out loud

Run 10 mock sessions in the Python mock interview. Speak every line. Narrate the type signature, the edge cases, the algorithm choice, the time complexity. Silence is the most common failure mode.

Data Engineer Interview Prep FAQ

Do I need to know data structures and algorithms for the data engineer Python round?
Light algorithmic knowledge helps: hash maps, sorting, basic recursion, and the occasional dynamic programming problem. You do not need LeetCode hard; you need to recognize when a problem reduces to a hash map lookup or a sort. Spend 80% of prep time on data manipulation, 20% on basic algorithms.
Is Python 3.10 syntax accepted in interviews?
Yes. match/case, walrus operator, f-strings, dict union with the | operator, and PEP 604 type hints (X | Y instead of Union[X, Y]) are all fair game. Some interviewers run on Python 3.9, so avoid match/case unless you confirm. Walrus and f-strings are universally accepted.
Should I use type hints?
At L4 and above, yes. Type hints on the function signature show you care about contracts and that you have written production code. Inline type hints on every variable is overkill and slows you down.
How is the Python round different at FAANG vs other companies?
FAANG companies (especially Meta and Amazon) sometimes ask one DSA-style problem alongside a data manipulation problem. Stripe, Airbnb, Databricks, and most data-heavy companies stick to data wrangling. Read the recruiter prep doc carefully. If it mentions LeetCode, prepare a few medium problems on hash maps, two pointers, and BFS.
Can I use Jupyter or do I get a plain editor?
Most live rounds use a plain shared editor (CoderPad, HackerRank, Google Docs). Take-home assignments allow Jupyter. Practice in a plain text editor with no autocomplete to build the muscle memory.
What if I forget a method name during the round?
Say it out loud: 'I cannot remember the exact signature of itertools.groupby, but the idea is...'. The interviewer will tell you. Pretending you know and writing the wrong code is worse. Forgetting is human. Faking is a trust issue.
How long is a typical Python round?
45 to 60 minutes, with one or two problems. The first problem is usually a warmup (10 to 15 minutes). The second is the real evaluation (30 to 40 minutes), with follow-ups on edge cases, performance, and how you would test the function.

Pass the Python Round in 4 Weeks

Practice in a real Python sandbox in the browser. Write code, run it, see results. Build the speed and instincts you need to write clean Python under interview pressure.

Start the Python Mock Interview

More Data Engineer Interview Prep Guides

Continue your prep

Data Engineer Interview Prep, explore the full guide

50+ guides covering every round, company, role, and technology in the data engineer interview loop. Grounded in 2,817 verified interview reports across 929 companies, collected from real candidates.
