Python Interview Questions for Data Engineers

Over half of data engineering (DE) interview loops include a Python round. In the debriefs collected, the four buckets that show up most are loops, function design, lists, and dictionaries. Almost nothing looks like a LeetCode tree problem. Almost everything looks like data you'd see on the job.

How the DE Python round differs from the SWE one

The Python is the same. The questions are completely different. Software engineering rounds lean on algorithms: tree traversal, dynamic programming, graph search. Data engineering rounds lean on messy data: parse this file, transform these dicts, dedupe these rows.

The input is messy. SWE problems give you a clean array of integers. DE problems give you JSON with inconsistent field names, half the timestamps in ISO and half in epoch milliseconds, and one row where amount is the string "null". Half the answer is noticing.
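A minimal sketch of what "noticing" can look like in code. It assumes numeric timestamps are epoch milliseconds and that naive ISO strings are already in UTC; both are assumptions to confirm with the interviewer, and the function names are hypothetical.

```python
from datetime import datetime, timezone


def normalize_ts(value):
    """Return a timezone-aware UTC datetime from an ISO string or epoch milliseconds."""
    if isinstance(value, (int, float)):
        # Assumption: numeric timestamps are epoch milliseconds.
        return datetime.fromtimestamp(value / 1000, tz=timezone.utc)
    text = str(value).strip()
    if text.isdigit():
        # Epoch milliseconds that arrived as a string.
        return datetime.fromtimestamp(int(text) / 1000, tz=timezone.utc)
    # fromisoformat rejects a trailing "Z" before Python 3.11, so rewrite it.
    parsed = datetime.fromisoformat(text.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Assumption: naive timestamps are already in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def normalize_amount(value):
    """Return a float, or None for missing values such as None, "" or "null"."""
    if value is None:
        return None
    text = str(value).strip()
    if text == "" or text.lower() == "null":
        return None
    return float(text)
```

Keeping each normalizer as a small pure function makes it easy to say out loud which inputs it accepts and to test the odd rows one at a time.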

Memory matters. If the prompt says ten gigabytes and your first move is to read() the whole file into memory, the round is already off track. Generators, streaming reads, and chunked iteration are the difference between a solution that runs and one that gets OOM-killed on the test data.

Standard library, not pandas. Default to the standard library: csv, json, collections, itertools. Pandas is fine if the prompt is genuinely tabular and you ask first; reaching for it on a five-row dedup reads as overkill. If you can't write a GROUP BY with defaultdict in under five minutes, that's the gap to close first.


Topics, ranked by what shows up

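Eight topic clusters, each with its share of Python questions or the setting where it tends to show up. Dictionaries and lists together account for about 15% of questions; the near-forty-percent figure covers loops, function design, lists, and dicts combined.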

Topic	Share / when it shows up	Difficulty
Dictionary Operations	About 7% of Python questions	Medium
List and Set Manipulation	About 8% of Python questions	Easy-Medium
String Processing	Common in take-homes	Easy-Medium
File I/O (JSON and CSV)	On most take-homes	Medium
ETL Patterns	Standard at senior levels	Medium-Hard
Error Handling and Edge Cases	Sometimes	Medium
Generators and Memory Efficiency	Sometimes, mostly at senior	Medium-Hard
Testing and Code Quality	Mostly take-homes	Medium

What each topic asks

Dictionary Operations. Dict is the Python data structure that does the most work in DE code: lookups, grouping, aggregating, dedup. Questions are usually short and test whether you can manipulate dicts without reaching for pandas.

Given two lists of equal length, build a dictionary mapping the first list to the second.
Invert a dictionary by swapping keys and values. When values collide, collect the original keys into a list.
Merge two dictionaries so that overlapping keys sum instead of overwrite.
Group a list of dictionaries by a chosen key and return a dictionary of lists.
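The last three prompts above can be sketched with nothing beyond collections. This is one possible shape, not the only accepted answer; note that this invert always returns lists, even for values that never collide, which is a simplification worth stating out loud. The first prompt is a one-liner, dict(zip(keys, values)).

```python
from collections import defaultdict


def invert(mapping):
    """Swap keys and values; every value maps to the list of keys that pointed at it."""
    inverted = defaultdict(list)
    for key, value in mapping.items():
        inverted[value].append(key)
    return dict(inverted)


def merge_sum(left, right):
    """Merge two dicts of numbers; overlapping keys are summed, not overwritten."""
    merged = dict(left)  # Copy so the caller's dict is not mutated.
    for key, value in right.items():
        merged[key] = merged.get(key, 0) + value
    return merged


def group_by(rows, key):
    """Group a list of dicts by row[key], returning a dict of lists."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)
```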

List and Set Manipulation. List comprehensions, set operations, and the choice between O(n) hashing approaches and slower O(n*log n) sorting or O(n^2) nested-loop approaches. The interviewer is usually checking whether you reach for a set when you need uniqueness instead of looping.

Dedupe a list of dictionaries by one specific key, keeping the last occurrence.
Find elements in list A that aren't in list B, preserving the order from A (a plain set difference loses order).
Flatten a list of lists into a single list without itertools.
On a sorted list of integers, find all pairs that sum to a target value in linear time.
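A sketch of the four prompts above, assuming hashable keys and, for the pair search, that each distinct value pair should be reported once. Those choices are assumptions; an interviewer may want index pairs or duplicates instead.

```python
def dedupe_keep_last(rows, key):
    """Dedupe dicts by row[key], keeping the last occurrence of each key."""
    latest = {}
    for row in rows:
        # Later rows overwrite earlier ones; output order follows first appearance of each key.
        latest[row[key]] = row
    return list(latest.values())


def ordered_difference(a, b):
    """Items of a that are not in b, in a's original order."""
    exclude = set(b)  # O(1) membership checks instead of scanning b each time.
    return [item for item in a if item not in exclude]


def flatten(nested):
    """Flatten one level of nesting with a comprehension."""
    return [item for inner in nested for item in inner]


def pairs_with_sum(sorted_nums, target):
    """Two-pointer scan over a sorted list; returns each distinct value pair once."""
    pairs = []
    lo, hi = 0, len(sorted_nums) - 1
    while lo < hi:
        total = sorted_nums[lo] + sorted_nums[hi]
        if total < target:
            lo += 1
        elif total > target:
            hi -= 1
        else:
            pairs.append((sorted_nums[lo], sorted_nums[hi]))
            lo += 1
            hi -= 1
            # Skip repeated values so the same pair is not reported twice.
            while lo < hi and sorted_nums[lo] == sorted_nums[lo - 1]:
                lo += 1
            while lo < hi and sorted_nums[hi] == sorted_nums[hi + 1]:
                hi -= 1
    return pairs
```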

String Processing. Almost every parsing question is fundamentally a string problem. Log lines, CSV rows with embedded quotes, semi-structured fields. The questions look like data cleaning because in the job, they are.

Parse an Apache access-log line and pull out the IP, timestamp, method, path, and status code.
Convert snake_case strings to camelCase.
Parse a string of key=value pairs separated by semicolons into a dictionary, handling quoted values that themselves contain semicolons.
Validate that a string is a well-formed ISO 8601 datetime without using dateutil.
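Two of the prompts above, sketched under stated assumptions: the key=value parser treats double quotes as the only quoting character, drops the quotes from the result, and does not handle escaped quotes. A real prompt may specify different rules.

```python
def snake_to_camel(name):
    """Convert snake_case to camelCase: user_id -> userId."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


def parse_pairs(text):
    """Parse 'k=v;k2="a;b"' into a dict; semicolons inside double quotes are literal."""
    fields = []
    current = []
    in_quotes = False
    for ch in text:
        if ch == '"':
            # Toggle quoting state and drop the quote character itself.
            in_quotes = not in_quotes
        elif ch == ";" and not in_quotes:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))

    result = {}
    for field in fields:
        if not field.strip():
            continue  # Ignore empty segments such as a trailing semicolon.
        key, _, value = field.partition("=")
        result[key.strip()] = value.strip()
    return result
```

A character-by-character scan is more code than str.split(";"), but split has no way to know that a semicolon sits inside quotes.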

File I/O (JSON and CSV). Reading, transforming, and writing structured files is the actual content of half of all DE Python work. The questions surface encoding gotchas, malformed rows, and the memory limits that make a naive read() impossible.

Read a JSON file of nested objects and flatten it into a flat list of dicts suitable for CSV output.
Compute the average of a numeric column in a 10 GB CSV without loading the whole file into memory.
Filter a CSV by a row predicate and write the result, handling input rows with inconsistent column counts.
Merge two JSON arrays, deduping by an id field and preferring values from the second file when there's a conflict.
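For the large-CSV average, a streaming sketch might look like the following. It assumes a header row and UTF-8 encoding, and it silently skips values float() cannot parse; in a real round, whether to skip, count, or fail on bad values is worth asking about.

```python
import csv


def column_average(path, column):
    """Average a CSV column one row at a time, skipping values float() cannot parse."""
    total = 0.0
    count = 0
    with open(path, newline="", encoding="utf-8") as handle:
        # DictReader pulls rows lazily, so memory use stays flat regardless of file size.
        for row in csv.DictReader(handle):
            try:
                value = float(row.get(column))
            except (TypeError, ValueError):
                continue  # Missing column, blank cell, or text such as "null".
            total += value
            count += 1
    return total / count if count else None
```

Only a running total and a count are kept, which is why the same function works on five rows or ten gigabytes.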

ETL Patterns. Less about syntax, more about whether your code looks like something that would survive in a real pipeline. Pure functions, testable units, sensible failure handling.

Take a list of raw event dicts, validate required fields, normalize timestamps to UTC, and group by user_id.
Write a schema validator that checks a dict against a definition (field names, types, required-or-not).
Filter records to only those newer than a watermark, the kind of incremental-load primitive every pipeline needs.
Join two lists of dicts on a common key with LEFT JOIN semantics, including the handling for missing right-side rows.
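A sketch of the watermark filter and the LEFT JOIN. The field name updated_at is hypothetical, and letting left-side values win on shared column names is a design choice to state explicitly, not a rule.

```python
from collections import defaultdict


def newer_than(records, watermark, ts_field="updated_at"):
    """Yield only records whose timestamp is strictly newer than the watermark."""
    for record in records:
        if record[ts_field] > watermark:
            yield record


def left_join(left_rows, right_rows, key):
    """LEFT JOIN two lists of dicts on key; unmatched left rows get None for right-side fields."""
    index = defaultdict(list)
    right_fields = set()
    for row in right_rows:
        index[row[key]].append(row)
        right_fields.update(row)
    right_fields.discard(key)

    joined = []
    for left in left_rows:
        matches = index.get(left[key])
        if not matches:
            # No right-side match: keep the left row and fill right columns with None.
            joined.append({**dict.fromkeys(right_fields), **left})
            continue
        for right in matches:
            # Left values win on any shared non-key column name.
            joined.append({**right, **left})
    return joined
```

Indexing the right side once keeps the join at roughly O(len(left) + len(right)) plus output size, instead of a nested loop over both lists.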

Error Handling and Edge Cases. Production pipelines fail constantly. The questions here test whether your default style is defensive (try / log / partial output) or fragile (one bad row crashes the run).

Retry an HTTP request up to three times with exponential backoff and jitter. Return the response on success or raise after the last attempt.
Process a list of records where some are malformed. Return two lists: clean and quarantined with the reason for each rejection.
Write a context manager that logs the start, end, and duration of the block it wraps.
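The three prompts above, sketched with the standard library only. The HTTP call is passed in as a plain callable so that no particular client API is assumed, and the validation rules in rejection_reason (required user_id and a numeric amount) are hypothetical examples.

```python
import logging
import random
import time
from contextlib import contextmanager


def retry(call, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Run call() up to attempts times with exponential backoff plus jitter between tries."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last error.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))


def rejection_reason(record):
    """Return why a record is rejected, or None if it is clean (hypothetical rules)."""
    for field in ("user_id", "amount"):
        if record.get(field) is None:
            return f"missing {field}"
    try:
        float(record["amount"])
    except (TypeError, ValueError):
        return "amount is not numeric"
    return None


def partition(records, check=rejection_reason):
    """Split records into (clean, quarantined); quarantined holds (record, reason) pairs."""
    clean, quarantined = [], []
    for record in records:
        reason = check(record)
        if reason is None:
            clean.append(record)
        else:
            quarantined.append((record, reason))
    return clean, quarantined


@contextmanager
def timed(label, log=logging.getLogger(__name__)):
    """Log the start, end, and duration of the wrapped block, even if it raises."""
    start = time.perf_counter()
    log.info("%s: start", label)
    try:
        yield
    finally:
        log.info("%s: end after %.3fs", label, time.perf_counter() - start)
```

Injecting sleep as a parameter is a small choice that pays off in tests: the backoff can be asserted without actually waiting.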

Generators and Memory Efficiency. Once a dataset is large enough that it won't fit in memory, generators stop being a stylistic preference and become the only option. These questions check whether yield is in your reflexes.

Generator that reads a large file and yields batches of N lines.
Lazily chain multiple iterables one element at a time, without materializing an intermediate list.
Read two pre-sorted files in parallel and yield records in sorted order. The classic merge step of an external sort.
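One way to sketch the three generator prompts above. heapq.merge and itertools.islice are standard-library calls; the wrapper names are hypothetical. The merge assumes both inputs are already sorted by the same key.

```python
import heapq
from itertools import islice


def batches(lines, size):
    """Yield lists of up to size items from any iterable, such as an open file."""
    iterator = iter(lines)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def chain_lazily(*iterables):
    """Yield items from each iterable in turn without building an intermediate list."""
    for iterable in iterables:
        yield from iterable


def merge_sorted(left, right, key=None):
    """Lazily merge two already-sorted iterables into one sorted stream."""
    return heapq.merge(left, right, key=key)
```

Because an open file is itself an iterator of lines, batches(handle, 1000) inside a with open(...) block reads a large file a thousand lines at a time.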

Testing and Code Quality. Take-homes increasingly ask you to write the tests as well as the function. It's a fast signal for whether the candidate would actually be useful on day one.

Unit tests for a function that parses date strings in several formats, including the negative cases.
A property-based test that confirms a transformation function always returns the same row count as it received.
A fixture that writes a known CSV to a temp file, runs the function under test, and asserts on the output.
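A sketch of the unit-test and fixture prompts using only unittest and tempfile. The date formats and both functions under test are hypothetical stand-ins. A property-based test would normally use a third-party library, so it is left out here.

```python
import csv
import os
import tempfile
import unittest
from datetime import date, datetime

# Hypothetical formats for illustration; a real prompt would list its own.
FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")


def parse_date(text):
    """Try each known format in order; raise ValueError if none match."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def total_amount(path):
    """Sum the amount column of a CSV file (the function under test for the fixture)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return sum(float(row["amount"]) for row in csv.DictReader(handle))


class ParseDateTests(unittest.TestCase):
    def test_known_formats(self):
        for text in ("2024-03-05", "05/03/2024", "20240305"):
            with self.subTest(text=text):
                self.assertEqual(parse_date(text), date(2024, 3, 5))

    def test_rejects_bad_input(self):
        # Negative cases: empty string, impossible month, free text.
        for text in ("", "2024-13-01", "not a date"):
            with self.subTest(text=text):
                with self.assertRaises(ValueError):
                    parse_date(text)


class TotalAmountTests(unittest.TestCase):
    def setUp(self):
        # Fixture: write a known CSV to a temp file and remove it after the test.
        handle = tempfile.NamedTemporaryFile(
            "w", suffix=".csv", delete=False, newline="", encoding="utf-8"
        )
        with handle:
            handle.write("id,amount\n1,10.5\n2,4.5\n")
        self.path = handle.name
        self.addCleanup(os.remove, self.path)

    def test_total(self):
        self.assertEqual(total_amount(self.path), 15.0)
```

Saved as a module, these run with python -m unittest; subTest keeps one failing format from hiding the results for the others.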


Worked example: group records by key, vanilla Python

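from collections import defaultdict

def group_orders(orders):
    """Return {customer_id: {"total": ..., "count": ...}} for a list of order dicts."""
    # Each new customer_id starts with a fresh zeroed accumulator.
    agg = defaultdict(lambda: {"total": 0, "count": 0})
    for row in orders:
        cid = row["customer_id"]
        agg[cid]["total"] += row["amount"]
        agg[cid]["count"] += 1
    # Convert to a plain dict so later lookups of missing keys raise KeyError.
    return dict(agg)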

The task: take a list of order rows and produce total spend plus order count per customer, with no pandas. This is the question most candidates underestimate because it sounds trivial, then they write twenty lines of nested conditionals. defaultdict skips the if-key-not-in-dict dance; the lambda creates a fresh accumulator the first time a new customer_id shows up. The result is the standard-library answer to a SQL GROUP BY with SUM and COUNT. If you can write this from a blank file in under three minutes, the dictionary section of a Python round is solved.

Common questions

How is the DE Python interview different from the SWE one?

The SWE version leans into algorithms and data structures: tree traversals, graph search, dynamic programming. The DE version leans into messy data: parse this file, transform these dicts, dedupe these rows. Binary tree problems almost never come up. The bar is generally lower on raw cleverness and higher on whether your code looks like something that would survive in a pipeline.

Do I need pandas?

Usually not, and reaching for it without asking can hurt you. Most interviewers want to see you handle the problem with the standard library: dict, list, set, csv, json, itertools, collections. Some shops are fine with pandas if you ask first. Some roles, especially the ML-adjacent ones, expect it. The job description is the tell.

How many Python problems should I do before a loop?

Thirty to fifty is a reasonable range if you solve them properly. The four buckets that pay off most are loops (around 13% of questions in our corpus), function design (9%), lists (8%), and dicts (7%). Those four together cover almost forty percent of what gets asked.

Should I practice in an IDE or in a plain editor?

At least half your practice should be in a plain editor with no autocomplete. CoderPad and HackerRank-style sandboxes are stark. If you've only ever practiced with PyCharm intellisense, the first ten minutes of the real interview will be spent fighting muscle memory.

Why practice

Active recall beats re-reading by 50%. Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom.

76% of hiring managers reject on the coding task, not the resume. From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice.

Five problem shapes cover 80% of data engineer loops. Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition.
