Data Engineering Interview Prep

Python Interview Questions for Data Engineers

More than half of data engineering interviews include a Python round. The most tested topics: for loops, function design, list manipulation, algorithms, and dictionary operations. This is data manipulation, not LeetCode.

Based on DataDriven's analysis of verified interview data. Every question below can be practiced with real Python execution on DataDriven.

How Python Interviews Differ for Data Engineers vs Software Engineers

Python appears in a majority of data engineering interviews, but the questions look nothing like software engineering rounds. You will parse files, transform nested data structures, handle encoding errors, and write code that processes data in batches. If/else logic (6.3%), classes (4.4%), and sorting (3.6%) round out the top concepts tested.

The input is messy. Software engineering problems give you clean arrays of integers. Data engineering problems give you a JSON file with inconsistent field names, missing values, and timestamps in three different formats.

Memory matters. Data engineering interviewers care whether your solution can handle a 10GB file. If your first instinct is to load everything into a list, that is a red flag. Generators and streaming patterns separate strong candidates from average ones.

No external libraries. Most interviewers want vanilla Python. The csv module, json module, collections module, and itertools are fair game. Pandas is usually not. If you cannot write a GROUP BY equivalent using dictionaries, practice that before anything else.

1. Dictionary Operations
Difficulty: Medium. Interview frequency: 7.1% of Python questions.

Dictionaries appear in 7.1% of Python interview questions. Interviewers test your ability to merge, filter, group, and transform dictionary data without reaching for external libraries.

Q1

Given two lists of equal length, create a dictionary mapping elements from the first list to the second.

Q2

Write a function that inverts a dictionary (swap keys and values), handling duplicate values by collecting keys into lists.

Q3

Merge two dictionaries where overlapping keys sum their values instead of overwriting.

Q4

Given a list of dictionaries representing rows, group them by a specified key and return a dictionary of lists.
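For Q2 above, one common approach is a defaultdict of lists. This is a sketch; the function name and sample data are illustrative:

```python
from collections import defaultdict

def invert_dict(d):
    """Swap keys and values; duplicate values collect their keys in a list."""
    inverted = defaultdict(list)
    for key, value in d.items():
        inverted[value].append(key)
    return dict(inverted)

scores = {"alice": 90, "bob": 85, "carol": 90}
invert_dict(scores)
# {90: ["alice", "carol"], 85: ["bob"]}
```

Because dicts preserve insertion order (Python 3.7+), the keys collected under each value appear in their original order.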

2. List and Set Manipulation
Difficulty: Easy-Medium. Interview frequency: 8.2% of Python questions.

Lists appear in 8.2% of Python interview questions. List comprehensions, set operations, and efficient iteration patterns test whether you write Pythonic code or verbose loops.

Q1

Deduplicate a list of dictionaries based on a specific key, keeping the last occurrence.

Q2

Find elements that appear in list A but not in list B, preserving order (sets lose order).

Q3

Flatten a list of lists into a single list without using itertools.

Q4

Given a sorted list of integers, find all pairs that sum to a target value in O(n) time.
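Q4 is the classic two-pointer pattern: because the list is sorted, you can walk inward from both ends in a single pass. A sketch assuming distinct elements (duplicates would need extra handling):

```python
def pairs_with_sum(sorted_nums, target):
    """Two-pointer scan over a sorted list: O(n) time, O(1) extra space."""
    pairs = []
    lo, hi = 0, len(sorted_nums) - 1
    while lo < hi:
        s = sorted_nums[lo] + sorted_nums[hi]
        if s == target:
            pairs.append((sorted_nums[lo], sorted_nums[hi]))
            lo += 1
            hi -= 1
        elif s < target:
            lo += 1   # sum too small: advance the low pointer
        else:
            hi -= 1   # sum too large: retreat the high pointer
    return pairs

pairs_with_sum([1, 2, 3, 4, 6, 8], 10)
# [(2, 8), (4, 6)]
```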

3. String Processing
Difficulty: Easy-Medium. Interview frequency: High.

Data engineers parse log files, extract fields from semi-structured text, and clean messy string data. These questions reflect real pipeline work.

Q1

Parse an Apache log line and extract the IP address, timestamp, HTTP method, path, and status code.

Q2

Write a function that converts snake_case strings to camelCase.

Q3

Given a string containing key=value pairs separated by semicolons, parse it into a dictionary. Handle quoted values that may contain semicolons.

Q4

Validate that a string is a properly formatted ISO 8601 datetime without using dateutil.
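Q2 has a short idiomatic answer worth knowing cold. A sketch (the function name is illustrative):

```python
def snake_to_camel(s):
    """Convert snake_case to camelCase: first word stays lowercase,
    each subsequent word is capitalized."""
    first, *rest = s.split("_")
    return first + "".join(word.capitalize() for word in rest)

snake_to_camel("user_created_at")
# "userCreatedAt"
```

`str.capitalize()` also lowercases the rest of each word, so `"USER_ID"` becomes `"userId"` only if you lowercase the first segment yourself; decide with your interviewer how mixed-case input should behave.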

4. File I/O (JSON and CSV)
Difficulty: Medium. Interview frequency: Very High.

Reading, transforming, and writing structured files is core to data engineering. Interviewers want to see you handle encoding issues, malformed rows, and memory constraints.

Q1

Read a JSON file containing nested objects and flatten it into a list of dictionaries suitable for CSV export.

Q2

Process a 10GB CSV file line by line, computing the average of a numeric column without loading the entire file into memory.

Q3

Write a function that reads a CSV, filters rows based on a condition, and writes the result to a new CSV. Handle the case where the input file has inconsistent column counts.

Q4

Merge two JSON files containing arrays of objects, deduplicating by an ID field and preferring values from the second file.
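Q2 is the canonical memory-constraint question: iterate the file object directly so only one row is ever in memory. A sketch assuming a header row and a numeric column (the skip-on-malformed policy is one reasonable choice; confirm it with your interviewer):

```python
import csv

def column_average(path, column):
    """Stream a CSV row by row; memory stays constant regardless of file size."""
    total = 0.0
    count = 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            try:
                total += float(row[column])
                count += 1
            except (ValueError, TypeError, KeyError):
                continue  # skip malformed rows and missing values
    return total / count if count else None
```

`csv.DictReader` reads lazily from the file object, so this works the same on a 10GB file as on a 10KB one.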

5. ETL Patterns
Difficulty: Medium-Hard. Interview frequency: High.

These questions test your ability to write clean, testable data transformation code. The interviewer evaluates your pipeline thinking, not just your Python syntax.

Q1

Write a pipeline function that takes raw event data (list of dicts), validates required fields, converts timestamps to UTC, and groups events by user_id.

Q2

Implement a simple schema validation function that checks a dictionary against a schema definition (field names, types, required/optional).

Q3

Write an incremental processing function that takes a list of records with timestamps and a watermark, returning only records newer than the watermark.

Q4

Build a function that joins two datasets (lists of dicts) on a common key, similar to a SQL LEFT JOIN, including handling of missing keys.
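For Q4, the standard in-memory approach is a hash join: index the right side by key, then walk the left side once. A sketch (names and sample data are illustrative; right-side rows without the key are skipped, which is one policy among several):

```python
def left_join(left, right, key):
    """SQL-style LEFT JOIN on lists of dicts: every left row appears at
    least once; unmatched left rows keep only their own fields."""
    index = {}
    for row in right:
        if key in row:
            index.setdefault(row[key], []).append(row)
    joined = []
    for row in left:
        matches = index.get(row.get(key))
        if matches:
            for match in matches:
                joined.append({**row, **match})  # right side wins on conflicts
        else:
            joined.append(dict(row))
    return joined

users = [{"id": 1, "name": "ana"}, {"id": 2, "name": "ben"}]
orders = [{"id": 1, "amount": 40}]
left_join(users, orders, "id")
# [{"id": 1, "name": "ana", "amount": 40}, {"id": 2, "name": "ben"}]
```

Building the index first makes the join O(n + m) instead of the O(n * m) nested-loop version.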

6. Error Handling and Edge Cases
Difficulty: Medium. Interview frequency: Medium.

Production pipelines break. Interviewers test whether you write defensive code that fails gracefully, logs useful information, and handles partial failures.

Q1

Write a function that retries an HTTP request up to 3 times with exponential backoff. Return the response on success or raise after all retries are exhausted.

Q2

Process a list of records where some have missing or malformed fields. Return two lists: successfully processed records and error records with reasons.

Q3

Write a context manager that logs the start time, end time, and duration of a code block.
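The retry-with-backoff pattern from Q1 can be sketched generically: here `func` stands in for the HTTP call (an actual interview answer would wrap something like `requests.get`), and the delays are illustrative:

```python
import time

def retry(func, attempts=3, base_delay=0.1):
    """Call func, retrying on any exception with exponential backoff.
    Returns the result on success; re-raises after the last attempt."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 0.1s, 0.2s, 0.4s, ...
```

In an interview, mention what you would catch more narrowly (e.g. connection errors, 5xx responses) rather than bare `Exception`, and whether retries should add jitter.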

7. Generators and Memory Efficiency
Difficulty: Medium-Hard. Interview frequency: Medium.

Data engineers process large datasets. Generators, iterators, and lazy evaluation are the tools that keep memory usage constant regardless of input size.

Q1

Write a generator that reads a large file and yields batches of N lines at a time.

Q2

Implement a function that lazily chains multiple iterables, yielding elements one at a time without creating an intermediate list.

Q3

Write a generator that reads from two sorted files simultaneously and yields records in sorted order (merge sort pattern).
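Q1 can be answered in a few lines with `itertools.islice`, which pulls lines lazily from the file object so only one batch is in memory at a time. A sketch (the function name is illustrative):

```python
from itertools import islice

def batched_lines(file_obj, n):
    """Yield lists of up to n lines from a file-like object, lazily."""
    while True:
        batch = list(islice(file_obj, n))
        if not batch:
            return  # file exhausted
        yield batch
```

This works with any iterable of lines, so you can test it with `io.StringIO` instead of a real file:

```python
import io

f = io.StringIO("a\nb\nc\nd\ne\n")
list(batched_lines(f, 2))
# [["a\n", "b\n"], ["c\n", "d\n"], ["e\n"]]
```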

8. Testing and Code Quality
Difficulty: Medium. Interview frequency: Medium.

Some interviews include a testing component or ask you to write tests for code you just wrote. This signals senior-level thinking about data quality.

Q1

Write unit tests for a function that parses date strings in multiple formats. Include tests for invalid inputs.

Q2

Given a data transformation function, write a property-based test that verifies the output always has the same number of rows as the input.

Q3

Write a test fixture that creates a temporary CSV file with known data, runs a processing function, and asserts on the output.
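For Q1, a minimal multi-format parser plus plain-assert tests might look like this (the format list and function name are illustrative; in an interview you would likely use pytest or unittest instead of bare asserts):

```python
from datetime import datetime

def parse_date(s, formats=("%Y-%m-%d", "%d/%m/%Y", "%b %d, %Y")):
    """Try each format in turn; raise ValueError if none matches."""
    for fmt in formats:
        try:
            return datetime.strptime(s, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {s!r}")

# Tests: one valid input per format, plus an invalid input.
assert parse_date("2024-03-01") == datetime(2024, 3, 1)
assert parse_date("01/03/2024") == datetime(2024, 3, 1)
assert parse_date("Mar 01, 2024") == datetime(2024, 3, 1)
try:
    parse_date("not a date")
    assert False, "expected ValueError"
except ValueError:
    pass
```

Note that covering the failure path (the invalid input) is what interviewers usually look for; tests that only exercise the happy path are a common miss.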

Worked Example: Group Records by Key (Python GROUP BY)

Given a list of dictionaries representing order rows, group them by customer_id and compute the total spend and order count per customer. No pandas allowed.

from collections import defaultdict

def group_orders(orders):
    agg = defaultdict(lambda: {"total": 0, "count": 0})
    for row in orders:
        cid = row["customer_id"]
        agg[cid]["total"] += row["amount"]
        agg[cid]["count"] += 1
    return dict(agg)

# Example usage
orders = [
    {"customer_id": "c1", "amount": 50},
    {"customer_id": "c2", "amount": 120},
    {"customer_id": "c1", "amount": 75},
    {"customer_id": "c2", "amount": 30},
]
result = group_orders(orders)
# {"c1": {"total": 125, "count": 2},
#  "c2": {"total": 150, "count": 2}}

defaultdict avoids repeated key-existence checks. The lambda creates a fresh accumulator for each new customer_id on first access. This is the vanilla-Python equivalent of SQL's GROUP BY with SUM and COUNT. Interviewers use this pattern to test whether you can do basic aggregation without reaching for pandas.

Python Interview Questions FAQ

How is a Python interview for data engineers different from software engineers?
Data engineering Python interviews focus on data manipulation: parsing files, transforming dictionaries, handling messy input, and writing ETL logic. Software engineering interviews focus on algorithms, data structures, and object-oriented design. You will rarely see binary trees or graph traversal in a data engineering Python round.
Do I need to know pandas for data engineering interviews?
Usually not. Most interviewers want you to use built-in Python: dicts, lists, csv module, json module. Some companies allow pandas if you ask, but defaulting to vanilla Python shows stronger fundamentals. If a role specifically involves pandas-heavy work, the job description will say so.
How many Python questions should I practice?
30-50 questions covering the topics on this page will prepare you well. Focus on loops (13.1% of questions), function design (9.0%), lists (8.2%), and dictionaries (7.1%) first. These four categories alone cover nearly 40% of Python interview content.
Should I practice Python on a whiteboard or in an editor?
Practice in a plain text editor without autocomplete at least half the time. Many companies use shared editors like CoderPad that have no autocomplete. If you rely on IDE hints for dict methods or file handling, you will struggle under interview conditions.

Practice Python for Data Engineering

Write real Python. Run it. See if your output matches. Build the muscle memory you need for interview day.