Interview Round Guide

The Python Round

Python shows up in 65% of data engineer interview loops, usually as a 45 to 60 minute live coding round or a take-home assignment. The format catches candidates off guard because it is not LeetCode: interviewers want to see whether you can parse a messy file, deduplicate records, and write production-quality data transformation code without reaching for pandas on every problem. This page covers one of eight rounds in the complete data engineer interview preparation framework.

The Short Answer
Expect data wrangling, not algorithms. The most common problems are JSON flattening, CSV parsing, deduplication by composite key, sessionization with a timeout, and joining two lists of dicts. Use vanilla Python with the standard library by default. Pandas only when the interviewer explicitly allows it. Write functions that handle empty input, missing keys, and malformed rows without crashing. Speed matters: medium problems in 15 minutes, hard in 25.
Updated April 2026 · By The DataDriven Team

What the Python Round Actually Tests

Pattern frequency from 1,042 interview reports. Note that algorithmic LeetCode-style problems are a small fraction. Data manipulation dominates.

Pattern                              | Share of Python Questions | Common In
For loops over records               | 13.1% | Every loop
Function definition with type hints  | 9.0%  | L4+, FAANG
List comprehensions                  | 8.2%  | Every loop
Dict-as-index lookup                 | 7.1%  | Every loop
if/else branching on row state       | 6.3%  | Every loop
Algorithm fundamentals (sort, hash)  | 7.9%  | FAANG only
Class definition (lightweight)       | 4.4%  | L4+, infra-heavy roles
Sorting with custom keys             | 3.6%  | Every loop
Generator and yield                  | 3.2%  | L4+, scale-focused roles
JSON and CSV parsing                 | 8.7%  | Every loop
collections.defaultdict, Counter     | 6.1%  | L4+
File I/O with context managers       | 5.4%  | Every loop
Date and timezone handling           | 4.2%  | Analytics roles
Pandas DataFrame ops (when allowed)  | 5.8%  | Analytics, take-homes

Five Worked Solutions From Real Loops

Each solution uses standard library Python only. Pandas equivalents are noted in the explanation but not used in the answer because most live coding rounds disallow it.

Pattern: JSON Flattening

Flatten a nested JSON object into one level

Recurse over keys. Concatenate parent keys with a separator. Decide upfront how to handle lists: explode into rows or serialize to a string. State the decision out loud.
def flatten(obj, prefix="", sep="."):
    out = {}
    if isinstance(obj, dict):
        for k, v in obj.items():
            key = f"{prefix}{sep}{k}" if prefix else k
            out.update(flatten(v, key, sep))
    elif isinstance(obj, list):
        out[prefix] = ",".join(str(x) for x in obj)
    else:
        out[prefix] = obj
    return out

# Example
nested = {"user": {"id": 1, "addr": {"city": "NYC", "zip": "10001"}}}
print(flatten(nested))
# {"user.id": 1, "user.addr.city": "NYC", "user.addr.zip": "10001"}
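If the interviewer prefers exploding lists rather than serializing them, a hedged variant appends positional indices to the key path (assumption: index-suffixed keys like user.tags.0 are an acceptable convention; confirm before writing):

```python
def flatten_explode(obj, prefix="", sep="."):
    """Variant of flatten: list elements get positional indices in the key."""
    out = {}
    if isinstance(obj, dict):
        for k, v in obj.items():
            key = f"{prefix}{sep}{k}" if prefix else k
            out.update(flatten_explode(v, key, sep))
    elif isinstance(obj, list):
        # Explode: each element becomes its own flattened key.
        # Note: an empty list produces no key at all -- state that tradeoff.
        for i, v in enumerate(obj):
            key = f"{prefix}{sep}{i}" if prefix else str(i)
            out.update(flatten_explode(v, key, sep))
    else:
        out[prefix] = obj
    return out

print(flatten_explode({"user": {"tags": ["a", "b"]}}))
# {'user.tags.0': 'a', 'user.tags.1': 'b'}
```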
Pattern: Composite Key Dedup

Deduplicate by (user_id, event_type), keep latest

Two valid approaches. The dict-keyed-on-tuple approach is O(n) and uses O(n) extra memory. The sort-then-iterate approach is O(n log n) with O(1) extra memory if you sort in place. Mention the tradeoff before writing.
def dedup_latest(records):
    latest = {}
    for r in records:
        key = (r["user_id"], r["event_type"])
        if key not in latest or r["ts"] > latest[key]["ts"]:
            latest[key] = r
    return list(latest.values())

# Edge cases to mention:
# - Empty input returns []
# - Equal timestamps: the strict > keeps whichever comes FIRST in input
#   order; switch to >= if the last duplicate should win
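For completeness, a sketch of the sort-then-iterate alternative mentioned above. Note that it sorts in place, mutating the caller's list; copy first if that matters.

```python
def dedup_latest_sorted(records):
    # Sort so the latest record for each (user_id, event_type) comes last.
    records.sort(key=lambda r: (r["user_id"], r["event_type"], r["ts"]))
    out = []
    for i, r in enumerate(records):
        key = (r["user_id"], r["event_type"])
        nxt = records[i + 1] if i + 1 < len(records) else None
        # Keep r only if it is the last record of its key group.
        if nxt is None or (nxt["user_id"], nxt["event_type"]) != key:
            out.append(r)
    return out
```

O(n log n) time, O(1) extra memory beyond the output. Ties on ts keep the duplicate that sorts last, which (because Python's sort is stable) is the later one in input order, the opposite of the dict approach above.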
Pattern: Sessionization

Group events into sessions with 30-min inactivity gap

Real production code from a web analytics pipeline. Sort by user_id and ts, then walk the sorted list. Increment session_id when the gap exceeds the threshold or when the user changes.
from datetime import timedelta

def sessionize(events, gap_minutes=30):
    events = sorted(events, key=lambda e: (e["user_id"], e["ts"]))
    threshold = timedelta(minutes=gap_minutes)
    out = []
    last_user = None
    last_ts = None
    session_id = 0
    for e in events:
        if e["user_id"] != last_user or (e["ts"] - last_ts) > threshold:
            session_id += 1
        out.append({**e, "session_id": session_id})
        last_user, last_ts = e["user_id"], e["ts"]
    return out
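The same logic can also be sketched with itertools.groupby, walking each user's events separately. This is a hedged alternative, not the canonical answer, shown with hypothetical sample data:

```python
from datetime import datetime, timedelta
from itertools import groupby

def sessionize_grouped(events, gap_minutes=30):
    # Same contract as sessionize above, but groups by user first.
    threshold = timedelta(minutes=gap_minutes)
    events = sorted(events, key=lambda e: (e["user_id"], e["ts"]))
    out, session_id = [], 0
    for _, user_events in groupby(events, key=lambda e: e["user_id"]):
        last_ts = None
        for e in user_events:
            # New session on the user's first event or after a long gap.
            if last_ts is None or (e["ts"] - last_ts) > threshold:
                session_id += 1
            out.append({**e, "session_id": session_id})
            last_ts = e["ts"]
    return out

events = [
    {"user_id": "u1", "ts": datetime(2026, 4, 1, 9, 0)},
    {"user_id": "u1", "ts": datetime(2026, 4, 1, 9, 45)},  # 45-min gap: new session
    {"user_id": "u2", "ts": datetime(2026, 4, 1, 9, 10)},
]
print([(e["user_id"], e["session_id"]) for e in sessionize_grouped(events)])
# [('u1', 1), ('u1', 2), ('u2', 3)]
```

groupby requires the input to be pre-sorted by the grouping key, which the sort on the first line guarantees; say that out loud, because forgetting it is a classic groupby bug.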
Pattern: Generator for Scale

Stream a large CSV with chunked transformation

Loading a 50 GB CSV into memory crashes the process. A generator yields one chunk at a time. The interviewer is looking for the yield keyword and an explanation of why it matters.
import csv
from typing import Iterator

def stream_clean_csv(path: str, chunk_size: int = 10_000) -> Iterator[list[dict]]:
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        chunk = []
        for row in reader:
            if not row.get("email"):
                continue
            row["email"] = row["email"].lower().strip()
            chunk.append(row)
            if len(chunk) >= chunk_size:
                yield chunk
                chunk = []
        if chunk:
            yield chunk

for batch in stream_clean_csv("users.csv"):
    bulk_insert(batch)  # production: writes to Postgres or S3
Pattern: List-of-Dict Join

Inner join two lists of dicts on a shared key

The naive O(n*m) double loop is wrong for any non-trivial size. Build a dict index on the smaller list, then iterate the larger: O(n + m) time, O(min(n, m)) space. The example below indexes the right side; say out loud that in practice you would index whichever side is smaller.
def inner_join(left, right, key):
    right_index = {}
    for r in right:
        right_index.setdefault(r[key], []).append(r)
    out = []
    for l in left:
        for r in right_index.get(l[key], []):
            merged = {**l, **r}
            out.append(merged)
    return out

# Note: right value overwrites left on key collision.
# Mention this and ask the interviewer if it is acceptable.
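If the interviewer says the collision behavior is not acceptable, one hedged fix is to prefix the right side's non-key columns so nothing is clobbered (the r_ prefix here is an arbitrary choice, not a convention from the problem):

```python
def inner_join_prefixed(left, right, key, right_prefix="r_"):
    # Index the right side by join key: O(m) build, O(1) lookup per probe.
    right_index = {}
    for r in right:
        right_index.setdefault(r[key], []).append(r)
    out = []
    for l in left:
        for r in right_index.get(l[key], []):
            merged = dict(l)
            for k, v in r.items():
                if k != key:
                    merged[right_prefix + k] = v  # keep left columns intact
            out.append(merged)
    return out

print(inner_join_prefixed([{"id": 1, "name": "a"}], [{"id": 1, "name": "b"}], "id"))
# [{'id': 1, 'name': 'a', 'r_name': 'b'}]
```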

What Interviewers Watch For

1. Standard library fluency

collections.defaultdict, collections.Counter, itertools.groupby, functools.reduce, datetime, json, csv, re. Reaching for these reflexively when they fit the problem reads as a senior signal.
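A sketch of the kind of reflexive usage that earns that signal: counting event types per user with defaultdict and Counter, no key-existence checks anywhere (event data here is hypothetical):

```python
from collections import Counter, defaultdict

events = [
    {"user_id": "u1", "event_type": "click"},
    {"user_id": "u1", "event_type": "click"},
    {"user_id": "u1", "event_type": "view"},
    {"user_id": "u2", "event_type": "view"},
]

# defaultdict(Counter) creates a fresh Counter on first access per user,
# so the loop body is a single increment with no setdefault boilerplate.
per_user = defaultdict(Counter)
for e in events:
    per_user[e["user_id"]][e["event_type"]] += 1

print(per_user["u1"].most_common(1))
# [('click', 2)]
```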
2. Edge cases stated upfront

Empty input, single-row input, all-None input, malformed row, missing key. State two or three before writing the function. Many candidates write a working function then realize at the end that empty input crashes it.
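For example, guarding the dedup pattern from earlier against those cases might look like this (a hedged sketch; silently skipping malformed rows is one policy, raising is another, so ask which the interviewer wants):

```python
def dedup_latest_safe(records):
    if not records:  # edge case: empty or None input
        return []
    latest = {}
    for r in records:
        # Edge case: malformed row -- skip rows missing any required key.
        if not all(k in r for k in ("user_id", "event_type", "ts")):
            continue
        key = (r["user_id"], r["event_type"])
        if key not in latest or r["ts"] > latest[key]["ts"]:
            latest[key] = r
    return list(latest.values())
```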
3. Type hints on signatures

Not required, but a strong signal at L4+. Show you think about the contract. def dedup(records: list[dict]) -> list[dict] beats def dedup(records).
4. No premature optimization

Write the clear version first. Then ask the interviewer if performance matters before refactoring. Candidates who rewrite a working function for performance without being asked waste 10 minutes and signal poor judgment.
5. Generators when scale is mentioned

If the problem says 'large file', 'streaming', or 'cannot fit in memory', use yield. Materializing the full list disqualifies an otherwise correct answer.
6. No pandas without permission

Pandas in a vanilla Python round is the single most common L3 rejection signal. Ask before importing. If allowed, use it sparingly. If not, drop down to dict and list.

When Pandas Is Right (and When It Is Wrong)

Pandas is right for take-home assignments where the dataset is small to medium (under 10 GB), the problem involves a lot of groupby and pivoting, and the interviewer evaluates on output not on the code. It is also right for analytics-engineer style rounds where the problem reads like a SQL query in disguise. See how to pass the analytics engineer interview for how pandas-heavy that loop runs.

Pandas is wrong for live coding when the interviewer wants to see Python fluency, for streaming or generator problems, for anything where the problem says "process this 50 GB file", and for low-level transformation logic where one .apply() call hides the real algorithm. If you find yourself reaching for .apply with a lambda, write the loop instead so the interviewer can see your thinking. The same instinct applies in how to pass the SQL round, where reaching for a window function on a problem solvable with GROUP BY is the parallel mistake.
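To make the .apply point concrete, here is a hedged sketch of the same kind of cleanup rule written as an explicit loop over plain records, so every branch is visible (field names are hypothetical):

```python
def clean_emails(rows):
    # The loop makes the drop policy explicit; a df["email"].apply(lambda ...)
    # would hide both the normalization and the missing-value handling.
    out = []
    for row in rows:
        email = row.get("email")
        if not email:  # drop rows with no email -- a visible, stated policy
            continue
        out.append({**row, "email": email.strip().lower()})
    return out

print(clean_emails([{"email": " A@B.com "}, {"email": None}]))
# [{'email': 'a@b.com'}]
```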

How the Python Round Connects to the Rest of the Loop

Python is the connective tissue of the Data Engineer loop. The sessionization pattern in this round is the same gap-and-island pattern from how to pass the SQL round, just expressed in procedural code instead of declarative SQL. The composite-key dedup pattern is the same logic you defend in how to pass the data modeling round when you argue for SCD Type 2. Generators and chunked I/O are the scale-down version of the partitioning and shuffle patterns from how to pass the system design round.

Take-home assignments often combine SQL and Python in one artifact, which is why the how to pass the Data Engineer take-home is the highest-leverage prep page for take-home heavy companies. If you're targeting Airbnb (where the take-home is the loop) or Databricks (where PySpark replaces vanilla Python), read those pages next.

How to Prepare in Four Weeks

1. Week 1: Standard library mastery

20 problems using only collections, itertools, datetime, json, csv. Goal: write a working solution to any data manipulation problem without imports beyond these.
2. Week 2: Patterns and parsing

JSON flattening, CSV streaming, log parsing, deduplication, sessionization. 15 problems. Time yourself: medium under 15 minutes, hard under 25. Always state edge cases first.
3. Week 3: Pandas and numpy basics

Only if take-home rounds are likely. groupby, merge, pivot_table, apply, melt. 10 problems. Focus on writing readable transformation chains, not one-liners.
4. Week 4: Mock rounds out loud

Run 10 mock sessions in the Python mock interview. Speak every line. Narrate the type signature, the edge cases, the algorithm choice, the time complexity. Silence is the most common failure mode.

Data Engineer Interview Prep FAQ

Do I need to know data structures and algorithms for the data engineer Python round?
Light algorithmic knowledge helps: hash maps, sorting, basic recursion, and the occasional dynamic programming problem. You do not need LeetCode hard; you need to recognize when a problem reduces to a hash map lookup or a sort. Spend 80% of prep time on data manipulation, 20% on basic algorithms.
Is Python 3.10 syntax accepted in interviews?
Yes. match/case, walrus operator, f-strings, dict union with the | operator, and PEP 604 type hints (X | Y instead of Union[X, Y]) are all fair game. Some interviewers run on Python 3.9, so avoid match/case unless you confirm. Walrus and f-strings are universally accepted.
Should I use type hints?
At L4 and above, yes. Type hints on the function signature show you care about contracts and that you have written production code. Inline type hints on every variable is overkill and slows you down.
How is the Python round different at FAANG vs other companies?
FAANG companies (especially Meta and Amazon) sometimes ask one DSA-style problem alongside a data manipulation problem. Stripe, Airbnb, Databricks, and most data-heavy companies stick to data wrangling. Read the recruiter prep doc carefully. If it mentions LeetCode, prepare a few medium problems on hash maps, two pointers, and BFS.
Can I use Jupyter or do I get a plain editor?
Most live rounds use a plain shared editor (CoderPad, HackerRank, Google Docs). Take-home assignments allow Jupyter. Practice in a plain text editor with no autocomplete to build the muscle memory.
What if I forget a method name during the round?
Say it out loud: 'I cannot remember the exact signature of itertools.groupby, but the idea is...'. The interviewer will tell you. Pretending you know and writing the wrong code is worse. Forgetting is human. Faking is a trust issue.
How long is a typical Python round?
45 to 60 minutes, with one or two problems. The first problem is usually a warmup (10 to 15 minutes). The second is the real evaluation (30 to 40 minutes), with follow-ups on edge cases, performance, and how you would test the function.

Pass the Python Round in 4 Weeks

Practice in a real Python sandbox in the browser. Write code, run it, see results. Build the speed and instincts you need to write clean Python under interview pressure.

Start the Python Mock Interview

More Data Engineer Interview Prep Guides

Continue your prep

Data Engineer Interview Prep, explore the full guide

50+ guides covering every round, company, role, and technology in the data engineer interview loop. Grounded in 2,817 verified interview reports across 929 companies, collected from real candidates.
