Data Engineering Interview Prep
More than half of data engineering interviews include a Python round. The most tested topics: for loops, function design, list manipulation, algorithms, and dictionary operations. This is data manipulation, not LeetCode.
Based on DataDriven's analysis of verified interview data. Every question below can be practiced with real Python execution on DataDriven.
Python appears in a majority of data engineering interviews, but the questions look nothing like those in software engineering rounds. You will parse files, transform nested data structures, handle encoding errors, and write code that processes data in batches. If/else logic (6.3%), classes (4.4%), and sorting (3.6%) round out the top concepts tested.
The input is messy. Software engineering problems give you clean arrays of integers. Data engineering problems give you a JSON file with inconsistent field names, missing values, and timestamps in three different formats.
Memory matters. Data engineering interviewers care whether your solution can handle a 10GB file. If your first instinct is to load everything into a list, that is a red flag. Generators and streaming patterns separate strong candidates from average ones.
No external libraries. Most interviewers want vanilla Python. The csv module, json module, collections module, and itertools are fair game. Pandas is usually not. If you cannot write a GROUP BY equivalent using dictionaries, practice that before anything else.
Dictionaries appear in 7.1% of Python interview questions. Interviewers test your ability to merge, filter, group, and transform dictionary data without reaching for external libraries.
Given two lists of equal length, create a dictionary mapping elements from the first list to the second.
Write a function that inverts a dictionary (swap keys and values), handling duplicate values by collecting keys into lists.
Merge two dictionaries where overlapping keys sum their values instead of overwriting.
Given a list of dictionaries representing rows, group them by a specified key and return a dictionary of lists.
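As one illustration, the dictionary-inversion question above might be sketched like this (function name is ours, not a fixed interview API); `setdefault` handles the duplicate-value case without a pre-check:

```python
def invert_with_duplicates(d):
    """Invert a dict; keys sharing a value are collected into a list."""
    inverted = {}
    for key, value in d.items():
        inverted.setdefault(value, []).append(key)
    return inverted

# Two keys share the value 1, so they end up in one list
print(invert_with_duplicates({"a": 1, "b": 2, "c": 1}))
# {1: ['a', 'c'], 2: ['b']}
```

The same `setdefault` (or `collections.defaultdict`) pattern answers the group-by question as well.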
Lists appear in 8.2% of Python interview questions. List comprehensions, set operations, and efficient iteration patterns test whether you write Pythonic code or verbose loops.
Deduplicate a list of dictionaries based on a specific key, keeping the last occurrence.
Find elements that appear in list A but not in list B, preserving order (sets lose order).
Flatten a list of lists into a single list without using itertools.
Given a sorted list of integers, find all pairs that sum to a target value in O(n) time.
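The O(n) pair-sum question is usually solved with a two-pointer scan over the sorted list. A minimal sketch (assuming distinct values; repeated values would need extra handling):

```python
def pairs_with_sum(sorted_nums, target):
    """Two-pointer scan: O(n) time, O(1) extra space, input must be sorted."""
    pairs = []
    lo, hi = 0, len(sorted_nums) - 1
    while lo < hi:
        s = sorted_nums[lo] + sorted_nums[hi]
        if s == target:
            pairs.append((sorted_nums[lo], sorted_nums[hi]))
            lo += 1
            hi -= 1
        elif s < target:
            lo += 1   # sum too small: advance the low pointer
        else:
            hi -= 1   # sum too large: pull in the high pointer
    return pairs

print(pairs_with_sum([1, 2, 3, 4, 6, 8], 9))  # [(1, 8), (3, 6)]
```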
Data engineers parse log files, extract fields from semi-structured text, and clean messy string data. These questions reflect real pipeline work.
Parse an Apache log line and extract the IP address, timestamp, HTTP method, path, and status code.
Write a function that converts snake_case strings to camelCase.
Given a string containing key=value pairs separated by semicolons, parse it into a dictionary. Handle quoted values that may contain semicolons.
Validate that a string is a properly formatted ISO 8601 datetime without using dateutil.
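The snake_case question is a good warm-up for `split` and comprehension fluency. One possible sketch:

```python
def snake_to_camel(s):
    """Convert snake_case to camelCase: first word stays lowercase."""
    head, *rest = s.split("_")
    return head + "".join(word.capitalize() for word in rest)

print(snake_to_camel("user_created_at"))  # userCreatedAt
print(snake_to_camel("id"))               # id (single word unchanged)
```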
Reading, transforming, and writing structured files is core to data engineering. Interviewers want to see you handle encoding issues, malformed rows, and memory constraints.
Read a JSON file containing nested objects and flatten it into a list of dictionaries suitable for CSV export.
Process a 10GB CSV file line by line, computing the average of a numeric column without loading the entire file into memory.
Write a function that reads a CSV, filters rows based on a condition, and writes the result to a new CSV. Handle the case where the input file has inconsistent column counts.
Merge two JSON files containing arrays of objects, deduplicating by an ID field and preferring values from the second file.
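The 10GB-average question comes down to iterating the file handle instead of materializing rows. A sketch using only the stdlib `csv` module (column-skipping policy for blank cells is our assumption):

```python
import csv

def streaming_average(path, column):
    """Compute a column's mean one row at a time -- constant memory."""
    total = 0.0
    count = 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):   # iterates lazily, row by row
            value = row.get(column)
            if value not in (None, ""):  # skip missing/blank cells
                total += float(value)
                count += 1
    return total / count if count else None
```

Because `csv.DictReader` wraps the file iterator, memory usage stays flat no matter how large the file is.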
These questions test your ability to write clean, testable data transformation code. The interviewer evaluates your pipeline thinking, not just your Python syntax.
Write a pipeline function that takes raw event data (list of dicts), validates required fields, converts timestamps to UTC, and groups events by user_id.
Implement a simple schema validation function that checks a dictionary against a schema definition (field names, types, required/optional).
Write an incremental processing function that takes a list of records with timestamps and a watermark, returning only records newer than the watermark.
Build a function that joins two datasets (lists of dicts) on a common key, similar to a SQL LEFT JOIN, including handling of missing keys.
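For the join question, the standard trick is to index one side in a dictionary so lookups are O(1). A simplified one-to-one sketch (column-clash and duplicate-key policies are our assumptions, worth stating aloud in an interview):

```python
def left_join(left, right, key):
    """LEFT JOIN two lists of dicts on `key`; unmatched left rows pass through."""
    # Index the right side once; on duplicate keys, the last row wins
    index = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = index.get(row[key], {})
        joined.append({**match, **row})  # left values win on column clashes
    return joined

users = [{"id": 1, "name": "Ana"}, {"id": 2, "name": "Bo"}]
orders = [{"id": 1, "total": 50}]
print(left_join(users, orders, "id"))
# [{'id': 1, 'total': 50, 'name': 'Ana'}, {'id': 2, 'name': 'Bo'}]
```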
Production pipelines break. Interviewers test whether you write defensive code that fails gracefully, logs useful information, and handles partial failures.
Write a function that retries an HTTP request up to 3 times with exponential backoff. Return the response on success or raise after all retries are exhausted.
Process a list of records where some have missing or malformed fields. Return two lists: successfully processed records and error records with reasons.
Write a context manager that logs the start time, end time, and duration of a code block.
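The timing context manager might be sketched as a small class (a real pipeline would use the `logging` module; `print` stands in here):

```python
import time

class Timer:
    """Context manager that records start, end, and duration of a block."""
    def __enter__(self):
        self.start = time.monotonic()
        return self
    def __exit__(self, exc_type, exc, tb):
        self.end = time.monotonic()
        self.duration = self.end - self.start
        print(f"block took {self.duration:.3f}s")
        return False  # never swallow exceptions from the block

with Timer() as t:
    sum(range(100_000))
```

Returning `False` from `__exit__` is the part interviewers watch for: a timing wrapper that silently eats exceptions is a bug.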
Data engineers process large datasets. Generators, iterators, and lazy evaluation are the tools that keep memory usage constant regardless of input size.
Write a generator that reads a large file and yields batches of N lines at a time.
Implement a function that lazily chains multiple iterables, yielding elements one at a time without creating an intermediate list.
Write a generator that reads from two sorted files simultaneously and yields records in sorted order (merge sort pattern).
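The batching question above can be sketched with a plain generator; only one batch lives in memory at a time:

```python
def batched_lines(path, n):
    """Yield lists of up to n lines from a file, lazily."""
    batch = []
    with open(path) as f:
        for line in f:       # the file object itself is a lazy iterator
            batch.append(line)
            if len(batch) == n:
                yield batch
                batch = []
    if batch:                 # don't drop the final partial batch
        yield batch

# Usage (lazy -- nothing is read until you iterate):
# for batch in batched_lines("events.log", 1000):
#     process(batch)
```

Forgetting the final partial batch is the classic mistake here.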
Some interviews include a testing component or ask you to write tests for code you just wrote. This signals senior-level thinking about data quality.
Write unit tests for a function that parses date strings in multiple formats. Include tests for invalid inputs.
Given a data transformation function, write a property-based test that verifies the output always has the same number of rows as the input.
Write a test fixture that creates a temporary CSV file with known data, runs a processing function, and asserts on the output.
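For the multi-format date question, tests and implementation might look like this (the format list and function names are illustrative, not a spec):

```python
from datetime import datetime

def parse_date(s):
    """Try several known formats; raise ValueError if none match."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%b %d, %Y"):
        try:
            return datetime.strptime(s, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {s!r}")

def test_parse_date():
    assert parse_date("2024-01-31") == datetime(2024, 1, 31)
    assert parse_date("31/01/2024") == datetime(2024, 1, 31)
    assert parse_date("Jan 31, 2024") == datetime(2024, 1, 31)
    try:
        parse_date("not a date")
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for invalid input")

test_parse_date()
```

The invalid-input case is what separates this from a happy-path test; in a real suite you would use `pytest.raises` instead of the try/except dance.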
Given a list of dictionaries representing order rows, group them by customer_id and compute the total spend and order count per customer. No pandas allowed.
```python
from collections import defaultdict

def group_orders(orders):
    agg = defaultdict(lambda: {"total": 0, "count": 0})
    for row in orders:
        cid = row["customer_id"]
        agg[cid]["total"] += row["amount"]
        agg[cid]["count"] += 1
    return dict(agg)

# Example usage
orders = [
    {"customer_id": "c1", "amount": 50},
    {"customer_id": "c2", "amount": 120},
    {"customer_id": "c1", "amount": 75},
    {"customer_id": "c2", "amount": 30},
]
result = group_orders(orders)
# {"c1": {"total": 125, "count": 2},
#  "c2": {"total": 150, "count": 2}}
```

defaultdict avoids repeated key-existence checks. The lambda creates a fresh accumulator for each new customer_id on first access. This is the vanilla-Python equivalent of SQL's GROUP BY with SUM and COUNT. Interviewers use this pattern to test whether you can do basic aggregation without reaching for pandas.
Write real Python. Run it. See if your output matches. Build the muscle memory you need for interview day.