Python Interview Questions for Data Engineers

More than half of data engineering interviews include a Python round. The most tested topics: for loops, function design, list manipulation, algorithms, and dictionary operations. This is data manipulation, not LeetCode.
Updated April 2026 · By The DataDriven Team

How Python interviews differ for data engineers vs software engineers

Python appears in a majority of data engineering interviews, but the questions look nothing like software engineering rounds. You will parse files, transform nested data structures, handle encoding errors, and write code that processes data in batches. Beyond the core topics, if/else logic (6.3%), classes (4.4%), and sorting (3.6%) round out the concepts tested.

The input is messy

Software engineering problems give you clean arrays of integers. Data engineering problems give you a JSON file with inconsistent field names, missing values, and timestamps in three different formats.

Memory matters

Data engineering interviewers care whether your solution can handle a 10GB file. If your first instinct is to load everything into a list, that is a red flag. Generators and streaming patterns separate strong candidates from average ones.

No external libraries

Most interviewers want vanilla Python. The csv module, json module, collections module, and itertools are fair game. Pandas is usually not. If you cannot write a GROUP BY equivalent using dictionaries, practice that before anything else.
Topic 01 · 7.1% of Python questions · Medium

Dictionary Operations

Dictionaries appear in 7.1% of Python interview questions. Interviewers test your ability to merge, filter, group, and transform dictionary data without reaching for external libraries.

4 questions interviewers ask
  • Given two lists of equal length, create a dictionary mapping elements from the first list to the second.
  • Write a function that inverts a dictionary (swap keys and values), handling duplicate values by collecting keys into lists.
  • Merge two dictionaries where overlapping keys sum their values instead of overwriting.
  • Given a list of dictionaries representing rows, group them by a specified key and return a dictionary of lists.
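As a sketch of two of these questions, here is one way to merge dictionaries with summed values and invert a dictionary with duplicate values (function names are illustrative, not from any interviewer's rubric):

```python
from collections import defaultdict

def merge_sum(d1, d2):
    """Merge two dicts; overlapping keys sum their values instead of overwriting."""
    merged = dict(d1)
    for key, value in d2.items():
        merged[key] = merged.get(key, 0) + value
    return merged

def invert(d):
    """Invert a dict, collecting keys that share a value into a list."""
    inverted = defaultdict(list)
    for key, value in d.items():
        inverted[value].append(key)
    return dict(inverted)
```

`dict.get(key, 0)` handles the overlap case in one line, and `defaultdict(list)` avoids the explicit "is this value already a key?" check in the inversion.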
Topic 02 · 8.2% of Python questions · Easy-Medium

List and Set Manipulation

Lists appear in 8.2% of Python interview questions. List comprehensions, set operations, and efficient iteration patterns test whether you write Pythonic code or verbose loops.

4 questions interviewers ask
  • Deduplicate a list of dictionaries based on a specific key, keeping the last occurrence.
  • Find elements that appear in list A but not in list B, preserving order (sets lose order).
  • Flatten a list of lists into a single list without using itertools.
  • Given a sorted list of integers, find all pairs that sum to a target value in O(n) time.
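The first two questions above might be sketched like this (names are illustrative). Both lean on Python idioms: dict insertion order for the dedupe, and a set for O(1) membership checks in the difference:

```python
def dedupe_by_key(rows, key):
    """Deduplicate a list of dicts by `key`, keeping the last occurrence.

    Later rows overwrite earlier ones in the comprehension; dicts preserve
    insertion order (Python 3.7+), so first-seen key order survives.
    """
    return list({row[key]: row for row in rows}.values())

def ordered_difference(a, b):
    """Elements in a but not in b, preserving a's order (set difference would not)."""
    exclude = set(b)  # O(1) membership checks instead of O(n) list scans
    return [x for x in a if x not in exclude]
```

The dict-comprehension dedupe is the kind of one-liner that signals Pythonic fluency; a verbose loop with a `seen` set also works but keeps the first occurrence unless you iterate in reverse.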
Topic 03 · High · Easy-Medium

String Processing

Data engineers parse log files, extract fields from semi-structured text, and clean messy string data. These questions reflect real pipeline work.

4 questions interviewers ask
  • Parse an Apache log line and extract the IP address, timestamp, HTTP method, path, and status code.
  • Write a function that converts snake_case strings to camelCase.
  • Given a string containing key=value pairs separated by semicolons, parse it into a dictionary. Handle quoted values that may contain semicolons.
  • Validate that a string is a properly formatted ISO 8601 datetime without using dateutil.
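The snake_case question is the quickest of the four; one possible sketch (edge cases like leading underscores or empty strings are worth raising with the interviewer rather than silently handling):

```python
def snake_to_camel(s):
    """Convert a snake_case string to camelCase: first word stays lowercase."""
    first, *rest = s.split("_")
    return first + "".join(word.capitalize() for word in rest)
```

`str.capitalize()` also lowercases the rest of each word, which is usually what you want for identifiers; mention that behavior out loud if the input might contain acronyms like `user_API_key`.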
Topic 04 · Very High · Medium

File I/O (JSON and CSV)

Reading, transforming, and writing structured files is core to data engineering. Interviewers want to see you handle encoding issues, malformed rows, and memory constraints.

4 questions interviewers ask
  • Read a JSON file containing nested objects and flatten it into a list of dictionaries suitable for CSV export.
  • Process a 10GB CSV file line by line, computing the average of a numeric column without loading the entire file into memory.
  • Write a function that reads a CSV, filters rows based on a condition, and writes the result to a new CSV. Handle the case where the input file has inconsistent column counts.
  • Merge two JSON files containing arrays of objects, deduplicating by an ID field and preferring values from the second file.
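The streaming-average question is the memory-constraint test in its purest form. A sketch using only the csv module (the column name and missing-value policy are assumptions to state up front):

```python
import csv

def column_average(path, column):
    """Compute the mean of a numeric column, streaming one row at a time.

    Memory stays constant regardless of file size: only the running total
    and count are kept, never the rows themselves.
    """
    total = 0.0
    count = 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            value = row.get(column)
            if value in (None, ""):  # skip missing values rather than crash
                continue
            total += float(value)
            count += 1
    return total / count if count else None
```

The key interview signal is that `csv.DictReader` iterates lazily over the open file handle; loading `f.read()` or `list(reader)` first would defeat the point.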
Topic 05 · High · Medium-Hard

ETL Patterns

These questions test your ability to write clean, testable data transformation code. The interviewer evaluates your pipeline thinking, not just your Python syntax.

4 questions interviewers ask
  • Write a pipeline function that takes raw event data (list of dicts), validates required fields, converts timestamps to UTC, and groups events by user_id.
  • Implement a simple schema validation function that checks a dictionary against a schema definition (field names, types, required/optional).
  • Write an incremental processing function that takes a list of records with timestamps and a watermark, returning only records newer than the watermark.
  • Build a function that joins two datasets (lists of dicts) on a common key, similar to a SQL LEFT JOIN, including handling of missing keys.
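One way the LEFT JOIN question might be sketched: build an index over the right-hand dataset first so each left row is matched in O(1) instead of rescanning the right list (names and the merge policy are illustrative assumptions):

```python
from collections import defaultdict

def left_join(left, right, key):
    """SQL-style LEFT JOIN of two lists of dicts on `key`.

    Every left row appears at least once. Unmatched left rows keep only
    their own fields; matched rows are merged, right-hand fields winning
    on column-name collisions.
    """
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)  # one pass to index the right side
    joined = []
    for row in left:
        matches = index.get(row[key])
        if matches:
            for match in matches:
                joined.append({**row, **match})
        else:
            joined.append(dict(row))
    return joined
```

Indexing first makes the join O(n + m) instead of the O(n × m) nested-loop version, which is exactly the pipeline-thinking signal this topic tests.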
Topic 06 · Medium · Medium

Error Handling and Edge Cases

Production pipelines break. Interviewers test whether you write defensive code that fails gracefully, logs useful information, and handles partial failures.

3 questions interviewers ask
  • Write a function that retries an HTTP request up to 3 times with exponential backoff. Return the response on success or raise after all retries are exhausted.
  • Process a list of records where some have missing or malformed fields. Return two lists: successfully processed records and error records with reasons.
  • Write a context manager that logs the start time, end time, and duration of a code block.
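The retry question generalizes beyond HTTP: a sketch that wraps any callable, with exponential backoff between attempts (the `attempts` and `base_delay` parameters are illustrative defaults):

```python
import time

def retry(func, attempts=3, base_delay=1.0):
    """Call func(); on failure, retry with exponential backoff (1s, 2s, 4s, ...).

    Returns the result on success; re-raises the last exception once all
    attempts are exhausted.
    """
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the original error
            time.sleep(base_delay * 2 ** attempt)
```

Catching bare `Exception` is usually too broad for production; narrowing it to the transient errors you expect (timeouts, 5xx responses) is a good point to raise in the interview.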
Topic 07 · Medium · Medium-Hard

Generators and Memory Efficiency

Data engineers process large datasets. Generators, iterators, and lazy evaluation are the tools that keep memory usage constant regardless of input size.

3 questions interviewers ask
  • Write a generator that reads a large file and yields batches of N lines at a time.
  • Implement a function that lazily chains multiple iterables, yielding elements one at a time without creating an intermediate list.
  • Write a generator that reads from two sorted files simultaneously and yields records in sorted order (merge sort pattern).
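The batching question is the canonical generator exercise; a sketch where memory is bounded by one batch, never the whole file:

```python
def read_batches(path, batch_size):
    """Yield lists of up to batch_size lines from a file.

    The file handle is iterated lazily, so only one batch is ever held
    in memory at a time.
    """
    batch = []
    with open(path) as f:
        for line in f:
            batch.append(line)
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:  # don't drop the final partial batch
        yield batch
```

Forgetting the final partial batch is the classic bug here; interviewers often probe for it with an input whose length is not a multiple of the batch size.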
Topic 08 · Medium · Medium

Testing and Code Quality

Some interviews include a testing component or ask you to write tests for code you just wrote. This signals senior-level thinking about data quality.

3 questions interviewers ask
  • Write unit tests for a function that parses date strings in multiple formats. Include tests for invalid inputs.
  • Given a data transformation function, write a property-based test that verifies the output always has the same number of rows as the input.
  • Write a test fixture that creates a temporary CSV file with known data, runs a processing function, and asserts on the output.
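As a sketch of the fixture question: the function under test (`filter_high_value` here is a hypothetical example, not a real library function) and a self-contained test that creates a known CSV, runs it, and asserts on the output. A pytest `tmp_path` fixture would be more idiomatic; `tempfile` keeps the sketch dependency-free:

```python
import csv
import os
import tempfile

def filter_high_value(in_path, out_path, min_amount):
    """Hypothetical function under test: keep rows with amount >= min_amount."""
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if float(row["amount"]) >= min_amount:
                writer.writerow(row)

def test_filter_high_value():
    """Fixture pattern: temp CSV with known data in, assertions on the output."""
    with tempfile.TemporaryDirectory() as tmp:  # cleaned up automatically
        in_path = os.path.join(tmp, "in.csv")
        out_path = os.path.join(tmp, "out.csv")
        with open(in_path, "w", newline="") as f:
            f.write("id,amount\n1,50\n2,120\n3,80\n")
        filter_high_value(in_path, out_path, 80)
        with open(out_path, newline="") as f:
            rows = list(csv.DictReader(f))
        assert [r["id"] for r in rows] == ["2", "3"]
```

Writing the input data inline in the test, rather than checking in a fixture file, keeps the expected output visually verifiable next to the assertion.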

Worked example: group records by key (Python GROUP BY)

Given a list of dictionaries representing order rows, group them by customer_id and compute the total spend and order count per customer. No pandas allowed.

from collections import defaultdict

def group_orders(orders):
    agg = defaultdict(lambda: {"total": 0, "count": 0})
    for row in orders:
        cid = row["customer_id"]
        agg[cid]["total"] += row["amount"]
        agg[cid]["count"] += 1
    return dict(agg)

# Example usage
orders = [
    {"customer_id": "c1", "amount": 50},
    {"customer_id": "c2", "amount": 120},
    {"customer_id": "c1", "amount": 75},
    {"customer_id": "c2", "amount": 30},
]
result = group_orders(orders)
# {"c1": {"total": 125, "count": 2},
#  "c2": {"total": 150, "count": 2}}

defaultdict avoids repeated key-existence checks. The lambda creates a fresh accumulator for each new customer_id on first access. This is the vanilla-Python equivalent of SQL's GROUP BY with SUM and COUNT. Interviewers use this pattern to test whether you can do basic aggregation without reaching for pandas.

Python interview questions FAQ

How is a Python interview for data engineers different from software engineers?
Data engineering Python interviews focus on data manipulation: parsing files, transforming dictionaries, handling messy input, and writing ETL logic. Software engineering interviews focus on algorithms, data structures, and object-oriented design. You will rarely see binary trees or graph traversal in a data engineering Python round.
Do I need to know pandas for data engineering interviews?
Usually not. Most interviewers want you to use built-in Python: dicts, lists, csv module, json module. Some companies allow pandas if you ask, but defaulting to vanilla Python shows stronger fundamentals. If a role specifically involves pandas-heavy work, the job description will say so.
How many Python questions should I practice?
30-50 questions covering the topics on this page will prepare you well. Focus on loops (13.1% of questions), function design (9.0%), lists (8.2%), and dictionaries (7.1%) first. These four categories alone cover nearly 40% of Python interview content.
Should I practice Python on a whiteboard or in an editor?
Practice in a plain text editor without autocomplete at least half the time. Many companies use shared editors like CoderPad that have no autocomplete. If you rely on IDE hints for dict methods or file handling, you will struggle under interview conditions.

Practice Python for data engineering

Write real Python. Run it. See if your output matches. Build the muscle memory you need for interview day.

Continue your prep

Data Engineer Interview Prep: explore the full guide

50+ guides covering every round, company, role, and technology in the data engineer interview loop. Grounded in 2,817 verified interview reports across 924 companies, collected from real candidates.
