Most candidates walk into DE Python rounds expecting LeetCode. Then the interviewer hands them a messy CSV and asks them to dedupe it by a composite key. In our corpus of 1,042 verified rounds, 31% of Python questions test for loops, 25% test function definitions, 16% test dictionaries. Only 21% touch algorithms at all, and even those are usually data transformation problems in disguise. Study the wrong thing and you'll walk in prepared for a test that doesn't exist.
[Chart: Python share of DE rounds, for-loop frequency, function-def frequency, and actual algorithm questions. Source: DataDriven analysis of 1,042 verified data engineering interview rounds.]
Most candidates grind algorithms and skip file I/O. Interviewers do the opposite: eight out of ten Python rounds in our dataset started with "here's a file, parse it." The skills below are ranked by how often each shows up in real rounds.
Data engineers read and write files constantly: JSON, CSV, Parquet, line-delimited JSON, YAML config files. You need to know the standard library modules (json, csv, pathlib) and when to use them. Beyond basic reading and writing, you need to handle encoding issues (UTF-8 vs Latin-1), malformed rows that break csv.reader, nested JSON structures that need flattening, and files too large to fit in memory. The ability to process a 10GB CSV line by line using generators is a fundamental DE Python skill.
High. File processing questions appear in roughly 30% of Python DE interviews.
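The streaming pattern above can be sketched in a few lines. This is a minimal illustration, not a library API; the file name and columns are invented for the demo, but the shape (context manager plus a generator) is what interviewers look for:

```python
import csv
import tempfile
from pathlib import Path

def stream_rows(path, encoding="utf-8"):
    """Yield one parsed row at a time; the full file never sits in memory."""
    with open(path, newline="", encoding=encoding) as f:
        yield from csv.DictReader(f)

# Tiny demo file; the same function handles a 10GB CSV unchanged,
# because each row is discarded as soon as the consumer moves on.
demo = Path(tempfile.mkdtemp()) / "events.csv"
demo.write_text("id,country\n1,US\n2,DE\n", encoding="utf-8")
rows = list(stream_rows(demo))
```

Note `newline=""` on the open call: the csv module documentation requires it so that quoted fields containing newlines are parsed correctly.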
Dictionaries are the workhorse data structure for data engineers. Event records arrive as dicts. API responses are dicts. Config files parse to dicts. You need to group records by key, merge dicts with conflict resolution, flatten nested structures, and transform lists of dicts into different shapes. Knowing defaultdict, Counter, and OrderedDict from the collections module saves time and produces cleaner code. Sets matter too: deduplication, intersection, and difference operations on large datasets run in O(1) per lookup instead of O(n) with lists.
Very high. Dictionary manipulation is the most common Python DE interview pattern.
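The three core dict-and-set moves read roughly like this (the sample records are invented for illustration):

```python
from collections import Counter, defaultdict

events = [
    {"user": "a", "country": "US"},
    {"user": "b", "country": "DE"},
    {"user": "a", "country": "US"},   # duplicate user
]

# Group records by a key field -- defaultdict removes the KeyError bookkeeping.
by_country = defaultdict(list)
for e in events:
    by_country[e["country"]].append(e["user"])

# One-pass counting.
counts = Counter(e["country"] for e in events)

# Set-based dedup: O(1) membership checks instead of O(n) list scans.
seen, deduped = set(), []
for e in events:
    if e["user"] not in seen:
        seen.add(e["user"])
        deduped.append(e)
```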
Data engineers pull data from REST APIs, webhook endpoints, and internal services. You need to make GET and POST requests with headers and authentication, handle pagination (offset-based and cursor-based), implement retry logic with exponential backoff, and parse JSON responses that may be nested or inconsistent. The requests library is standard. For production pipelines, you should also know about connection pooling (requests.Session), timeouts (always set them), and rate limiting (sleep between requests or use a token bucket).
Medium. API integration questions appear in take-home assignments more than live interviews.
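A sketch of offset pagination with retry and backoff. To keep it self-contained, the network call is abstracted behind a `fetch_page(offset)` callable (a stand-in for something like `requests.get(..., timeout=10)` inside a `requests.Session`); the function and parameter names are illustrative:

```python
import time

def fetch_all(fetch_page, max_retries=3, base_delay=1.0):
    """Walk an offset-paginated endpoint, retrying transient failures.

    `fetch_page(offset)` is assumed to return a list of records,
    empty when the data is exhausted.
    """
    records, offset = [], 0
    while True:
        for attempt in range(max_retries):
            try:
                page = fetch_page(offset)
                break
            except ConnectionError:           # transient failure: back off and retry
                if attempt == max_retries - 1:
                    raise                     # out of retries: let it propagate
                time.sleep(base_delay * 2 ** attempt)
        if not page:
            return records
        records.extend(page)
        offset += len(page)
```

Injecting the fetch function also makes the pagination logic trivially testable with a fake, which is exactly the kind of design choice take-home reviewers reward.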
The core of data engineering Python is transforming messy input into clean output. This means type casting with validation (is this string actually an integer?), date parsing across multiple formats, null handling (None, empty string, 'NULL', 'N/A' all mean different things), string normalization (strip whitespace, lowercase, remove special characters), and schema validation (does this record have all required fields with correct types?). Write functions that are explicit about what they accept and what they reject.
Very high. Transformation tasks are the most common Python interview pattern after dict operations.
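Being "explicit about what you accept and what you reject" can look like the sketch below. The null-token list and date formats are assumptions you would state out loud in an interview:

```python
from datetime import datetime

NULL_TOKENS = {"", "null", "n/a", "none", "na"}   # domain-specific assumption

def normalize_null(value):
    """Map the many spellings of 'no data' to a single None."""
    if value is None:
        return None
    if isinstance(value, str) and value.strip().lower() in NULL_TOKENS:
        return None
    return value

def parse_int(value):
    """Return an int, None for null markers, or raise ValueError on garbage."""
    value = normalize_null(value)
    if value is None:
        return None
    return int(str(value).strip())

DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")           # extend as the data demands

def parse_date(value):
    value = normalize_null(value)
    if value is None:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {value!r}")
```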
Production pipelines break. Data engineers write code that fails gracefully: catching specific exceptions, logging useful context (not just 'an error occurred'), implementing retry logic, and separating recoverable failures from fatal ones. A pipeline that processes 1 million records and dies on record 500,000 because of one malformed row is a pipeline that was written without error handling. You need to decide: skip the bad record and log it? Collect all errors and report them at the end? Halt processing entirely? The right answer depends on the use case, and interviewers want to hear you think about it.
Medium-high. Error handling questions often appear as follow-ups to data transformation problems.
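The "skip, log, and report" strategy can be sketched as follows; the record shape and logger name are illustrative:

```python
import logging

logger = logging.getLogger("pipeline")

def process_records(records, transform):
    """Apply `transform` to each record; skip and log bad rows instead of dying."""
    good, errors = [], []
    for i, rec in enumerate(records):
        try:
            good.append(transform(rec))
        except (ValueError, KeyError) as exc:   # catch specific, expected failures
            logger.warning("record %d failed: %s: %s", i, type(exc).__name__, exc)
            errors.append({"index": i, "record": rec, "error": str(exc)})
    return good, errors
```

Returning the error list instead of swallowing it lets the caller decide whether 3 bad rows out of a million is acceptable or grounds for halting the run.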
When you process files that are larger than available memory, generators let you read and transform data one chunk at a time. The yield keyword turns a function into a generator that produces values lazily. Chaining generators creates a streaming pipeline: read a line, parse it, transform it, write it, and move to the next line without ever holding the full dataset in memory. This is not an academic concept. It is the difference between a pipeline that runs on a t2.micro and one that requires a memory-optimized instance.
Medium. Generator questions appear in senior-level interviews and system design discussions.
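A minimal streaming chain looks like this (the in-memory list stands in for a file handle over a dataset too large to load):

```python
def parse(lines):
    for line in lines:            # pulls one line at a time from upstream
        yield line.strip().split(",")

def keep_valid(rows, width=2):
    for row in rows:              # nothing upstream is ever buffered
        if len(row) == width:
            yield row

raw = ["1,US\n", "bad\n", "2,DE\n"]       # stand-in for a huge file handle
result = list(keep_valid(parse(raw)))     # each stage holds O(1) rows
```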
Testing data pipelines is different from testing web applications. You need to test transformations against known input/output pairs, validate schema enforcement, test edge cases (empty input, null values, duplicate records), and mock external dependencies (APIs, databases, file systems). pytest is the standard. Fixtures let you set up test data once and reuse it. Parametrize lets you run the same test with different inputs. The ability to write tests for your pipeline code signals senior-level thinking.
Medium. Some interviews include a testing component or ask you to write tests for code you just wrote.
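A taste of what that looks like. To keep this sketch dependency-free it uses plain `test_*` functions with bare asserts, which pytest collects as-is; the table-driven loop is a stand-in for `pytest.mark.parametrize`, and the function under test is invented for the demo:

```python
def normalize(value):
    """Toy transformation under test: trim and lowercase."""
    return value.strip().lower()

# pytest collects any function named test_*; plain asserts give rich failure output.
def test_normalize_basic():
    assert normalize("  Foo ") == "foo"

def test_normalize_edge_cases():
    # Table-driven stand-in for pytest.mark.parametrize.
    cases = [("", ""), ("ABC", "abc"), (" mixed Case ", "mixed case")]
    for raw, expected in cases:
        assert normalize(raw) == expected
```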
Beyond scripts, production data engineering code lives in packages with proper structure: src layout, pyproject.toml or setup.py, requirements files or Poetry lock files, and clear separation between library code and entry points. Knowing how to structure a Python project, manage dependencies, and create reproducible environments (venv, pip-tools, Poetry) matters because data pipelines run in CI/CD systems, Docker containers, and Airflow workers where 'it works on my machine' is not acceptable.
Low in interviews, high on the job. Rarely tested directly but signals production experience.
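For orientation, a minimal layout might look like the hypothetical config below; the package name, entry point, and version pins are invented, and real projects will pin dependencies via a lock file on top of this:

```toml
# Hypothetical minimal pyproject.toml for a pipeline package (src layout assumed).
[project]
name = "events-pipeline"
version = "0.1.0"
requires-python = ">=3.10"
dependencies = ["requests>=2.31"]

[project.scripts]
run-pipeline = "events_pipeline.cli:main"   # entry point, kept separate from library code

[build-system]
requires = ["setuptools>=68"]
build-backend = "setuptools.build_meta"
```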
Every year we watch candidates waste two months on web frameworks, ML libraries, or deep OOP. None of it shows up. Here's what you can skip.
Data engineering interviews do not ask you to implement binary search trees or solve dynamic programming problems. If a question requires knowledge of algorithms, it is a software engineering interview that was mislabeled. DE Python focuses on practical data manipulation: parsing, transforming, validating, and loading data.
Pandas is a great tool for analysis, but data engineering Python leans more on the standard library and purpose-built tools. In interviews, you are often expected to solve problems without pandas to demonstrate that you understand the underlying operations. On the job, you will use pandas for exploratory work and smaller datasets, but production pipelines use Spark, SQL, or plain Python with generators for scale.
ML is for data scientists and ML engineers. Data engineers build the pipelines that feed ML models, but you do not need to understand gradient descent or neural networks. If a DE job description lists ML as a requirement, the role is likely a hybrid position or the company has not clearly defined the boundary between DE and DS roles.
Web frameworks are for backend engineers. Data engineers occasionally build simple APIs to serve data (using FastAPI or Flask), but this is not a core skill and is rarely tested in interviews. If your DE interview asks you to build a REST API, the company may be looking for a generalist backend engineer.
Real-style questions that test the skills covered above.
Open and parse the JSON file. For each record, parse the last_active date and compare it to today minus 30 days. Use a defaultdict(list) to group by country. For each country, open a CSV file with csv.DictWriter, write the header from the record keys, and write all rows. Handle the edge cases: missing last_active field, unparseable dates, empty country values. The interviewer checks whether you validate input before processing and whether you close files properly (use context managers).
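One possible solution sketch. The field names (last_active, country) come from the problem statement; the date format, the "unknown" bucket for empty countries, and silently skipping bad rows (you would log them in real code) are assumptions to state up front:

```python
import csv
import json
from collections import defaultdict
from datetime import date, datetime, timedelta
from pathlib import Path

def export_active_users(json_path, out_dir, today=None):
    """Split users active in the last 30 days into one CSV per country."""
    today = today or date.today()
    cutoff = today - timedelta(days=30)
    groups = defaultdict(list)
    with open(json_path, encoding="utf-8") as f:
        records = json.load(f)
    for rec in records:
        raw = rec.get("last_active")
        country = rec.get("country") or "unknown"   # empty string -> bucket
        if not raw:
            continue                                # missing field: skip (log in real code)
        try:
            last_active = datetime.strptime(raw, "%Y-%m-%d").date()
        except ValueError:
            continue                                # unparseable date: skip
        if last_active >= cutoff:
            groups[country].append(rec)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for country, rows in groups.items():
        with open(out_dir / f"{country}.csv", "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
    return {c: len(rows) for c, rows in groups.items()}
```

The injectable `today` parameter is a small touch that makes the date logic testable, which interviewers tend to notice.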
Write a decorator that wraps the function call in a retry loop with try/except. On each exception, log the attempt number, exception type, and wait time. Sleep for 2^attempt seconds (with a cap). After N failures, re-raise the last exception. Use functools.wraps to preserve the original function's metadata. The interviewer looks for clean decorator syntax, proper logging (not print statements), and whether you handle the case where the function succeeds on a retry (return the result immediately).
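Put together, that description yields something like this (the parameter names and the broad `Exception` catch are choices worth discussing; narrowing to a tuple of retryable exception types is a natural follow-up):

```python
import functools
import logging
import time

logger = logging.getLogger("retry")

def retry(max_attempts=3, base_delay=1.0, max_delay=60.0):
    """Retry the wrapped function with capped exponential backoff."""
    def decorator(func):
        @functools.wraps(func)          # preserve __name__, __doc__, etc.
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)   # success: return immediately
                except Exception as exc:
                    if attempt == max_attempts - 1:
                        raise                      # out of retries: re-raise last exception
                    delay = min(base_delay * 2 ** attempt, max_delay)
                    logger.warning(
                        "attempt %d failed (%s), retrying in %.1fs",
                        attempt + 1, type(exc).__name__, delay,
                    )
                    time.sleep(delay)
        return wrapper
    return decorator
```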
Open the file with csv.DictReader. Accumulate rows into a batch list. When the batch reaches size N, yield it and reset. After the loop, yield any remaining rows. The interviewer checks memory efficiency (you should never hold more than N rows at once), proper file handling (context manager), and whether you handle edge cases: empty file, file with only headers, N larger than the file's row count.
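A compact version of that answer (function and parameter names are illustrative):

```python
import csv

def read_batches(path, batch_size):
    """Yield lists of up to `batch_size` rows; never holds more than one batch."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        batch = []
        for row in reader:
            batch.append(row)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:            # leftover rows after the loop (handles N > row count)
            yield batch
```

The trailing `if batch` also covers the header-only and empty-file cases for free: the loop body never runs, so nothing is yielded.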
Build an index (dictionary) from the right table keyed by the join column for O(1) lookups. Iterate through the left table. For each record, look up the matching record in the right index. If found, merge the dictionaries (handling key conflicts by prefixing). If not found, include the left record with None values for right-side columns. The interviewer tests whether you think about performance (indexing vs nested loops), key conflicts, and handling multiple matches (one-to-many joins).
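A sketch of that hash-join approach. The `right_` prefix for conflicting columns is one convention among several; stating yours explicitly is part of the answer:

```python
def left_join(left, right, key, prefix="right_"):
    """Hash join: index the right table once, then stream the left table.

    One-to-many matches fan out into multiple output rows; right-side columns
    that collide with left-side names get prefixed instead of overwriting.
    """
    index = {}
    for rec in right:
        index.setdefault(rec[key], []).append(rec)   # supports duplicate keys

    right_cols = {c for rec in right for c in rec if c != key}
    result = []
    for lrec in left:
        matches = index.get(lrec[key])
        if not matches:
            merged = dict(lrec)                      # left outer: keep the row
            for col in right_cols:
                merged[prefix + col if col in lrec else col] = None
            result.append(merged)
            continue
        for rrec in matches:                         # one-to-many fan-out
            merged = dict(lrec)
            for col, val in rrec.items():
                if col == key:
                    continue
                merged[prefix + col if col in lrec else col] = val
            result.append(merged)
    return result
```

Building the index is O(right) once; each left-side lookup is O(1), versus O(left × right) for the nested-loop version the interviewer hopes you avoid.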
Iterate through the schema. For each field, check if it exists in the record (if required). If it exists, check isinstance against the expected type. Collect all errors (do not stop at the first one). Return a list of error objects with field name, expected type, actual type, and error description. The interviewer looks for thoroughness: handling None vs missing, nested types, and whether you return useful error messages that a developer could debug with.
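One way to write that validator. The schema shape (field name mapped to a type and a required flag) is an assumption; real interviews may hand you a different contract, but the collect-all-errors structure carries over:

```python
def validate(record, schema):
    """Check one record against a schema, collecting every error (not just the first)."""
    errors = []
    for field, spec in schema.items():
        if field not in record:
            if spec.get("required", True):
                errors.append({
                    "field": field,
                    "expected": spec["type"].__name__,
                    "actual": "missing",
                    "message": f"required field '{field}' is missing",
                })
            continue
        value = record[field]
        if value is None:                    # present-but-None is distinct from missing
            errors.append({
                "field": field,
                "expected": spec["type"].__name__,
                "actual": "None",
                "message": f"field '{field}' is None",
            })
        elif not isinstance(value, spec["type"]):
            errors.append({
                "field": field,
                "expected": spec["type"].__name__,
                "actual": type(value).__name__,
                "message": (
                    f"field '{field}' expected {spec['type'].__name__}, "
                    f"got {type(value).__name__}"
                ),
            })
    return errors
```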
The Python that DE interviewers actually test. No binary trees, no dynamic programming, no linked lists.
Start Practicing