
Python for Data Engineering

Most candidates walk into DE Python rounds expecting LeetCode. Then the interviewer hands them a messy CSV and asks them to dedupe it by a composite key. In our corpus of 1,042 verified rounds, 31% of Python questions test for-loops, 25% test function definitions, and 16% test dictionaries. Only 21% touch algorithms at all, and even those are usually data transformation problems in disguise. Study the wrong thing and you'll walk in prepared for a test that doesn't exist.

35% — Python share of DE rounds

31% — For-loop frequency

25% — Function-def frequency

21% — Actual algorithm questions

Source: DataDriven analysis of 1,042 verified data engineering interview rounds.

Python Skills That Matter for Data Engineers

Most candidates grind algorithms and skip file I/O. Interviewers do the opposite: eight out of ten Python rounds in our dataset started with "here's a file, parse it." Ranked below by how often each skill shows up in real rounds.

File I/O and Data Formats

Data engineers read and write files constantly: JSON, CSV, Parquet, line-delimited JSON, YAML config files. You need to know the standard library modules (json, csv, pathlib) and when to use them. Beyond basic reading and writing, you need to handle encoding issues (UTF-8 vs Latin-1), malformed rows that break csv.reader, nested JSON structures that need flattening, and files too large to fit in memory. The ability to process a 10GB CSV line by line using generators is a fundamental DE Python skill.

json, csv, pathlib, gzip, io.StringIO, json.JSONDecodeError

High. File processing questions appear in roughly 30% of Python DE interviews.
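The streaming pattern described above can be sketched with the standard library alone. The function name `stream_rows` and its malformed-row policy (skip and keep going) are illustrative choices, not a prescribed API:

```python
import csv
from pathlib import Path
from typing import Iterator


def stream_rows(path: Path, expected_cols: int) -> Iterator[dict]:
    """Yield one row dict at a time so memory stays flat, whatever the file size."""
    with path.open(newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        for row in reader:
            if len(row) != expected_cols:
                continue  # skip malformed rows rather than crash mid-file
            yield dict(zip(header, row))
```

Because the function yields lazily, a 10GB file costs the same memory as a 10KB one; only the policy for bad rows (skip, log, or halt) needs to change per use case.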

Dictionary and Data Structure Operations

Dictionaries are the workhorse data structure for data engineers. Event records arrive as dicts. API responses are dicts. Config files parse to dicts. You need to group records by key, merge dicts with conflict resolution, flatten nested structures, and transform lists of dicts into different shapes. Knowing defaultdict, Counter, and OrderedDict from the collections module saves time and produces cleaner code. Sets matter too: deduplication, intersection, and difference operations on large datasets run in O(1) per lookup instead of O(n) with lists.

collections.defaultdict, collections.Counter, dict comprehensions, set operations

Very high. Dictionary manipulation is the most common Python DE interview pattern.
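A minimal sketch of the two most common moves, grouping records by key with `defaultdict` and counting a field in one pass with `Counter` (the sample `events` data is invented for illustration):

```python
from collections import defaultdict, Counter

events = [
    {"user": "ada", "action": "click"},
    {"user": "bob", "action": "view"},
    {"user": "ada", "action": "view"},
]

# Group full records by a key without checking "is the key there yet?"
by_user = defaultdict(list)
for e in events:
    by_user[e["user"]].append(e)

# Count occurrences of one field in a single pass.
action_counts = Counter(e["action"] for e in events)
```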

API Calls and HTTP Clients

Data engineers pull data from REST APIs, webhook endpoints, and internal services. You need to make GET and POST requests with headers and authentication, handle pagination (offset-based and cursor-based), implement retry logic with exponential backoff, and parse JSON responses that may be nested or inconsistent. The requests library is standard. For production pipelines, you should also know about connection pooling (requests.Session), timeouts (always set them), and rate limiting (sleep between requests or use a token bucket).

requests, requests.Session, time.sleep, urllib.parse, json

Medium. API integration questions appear in take-home assignments more than live interviews.
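Offset-based pagination can be sketched without a live endpoint. Here `fetch_page` is a hypothetical stand-in for a real HTTP call (e.g. `requests.Session().get(url, params=..., timeout=10).json()`), so the pagination logic itself stays testable:

```python
import time
from typing import Callable, Iterator


def paginate(fetch_page: Callable[[int, int], list[dict]],
             page_size: int = 100,
             pause: float = 0.0) -> Iterator[dict]:
    """Walk an offset-paginated endpoint until a short page signals the end.

    fetch_page(offset, limit) stands in for a real HTTP request.
    """
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        yield from page
        if len(page) < page_size:
            return  # short (or empty) page means we've reached the end
        offset += page_size
        if pause:
            time.sleep(pause)  # crude rate limiting between requests
```

Cursor-based pagination follows the same shape, except the loop carries forward a cursor token from each response instead of an offset.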

Data Transformation and Cleaning

The core of data engineering Python is transforming messy input into clean output. This means type casting with validation (is this string actually an integer?), date parsing across multiple formats, null handling (None, empty string, 'NULL', 'N/A' all mean different things), string normalization (strip whitespace, lowercase, remove special characters), and schema validation (does this record have all required fields with correct types?). Write functions that are explicit about what they accept and what they reject.

datetime, re, typing, dataclasses, enum

Very high. Transformation tasks are the most common Python interview pattern after dict operations.
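Two sketches of "explicit about what they accept and what they reject": a null-aware integer cast and a multi-format date parser. The null-token set and the date formats are example choices, not a standard:

```python
from datetime import datetime
from typing import Optional

NULL_TOKENS = {"", "null", "n/a", "none"}  # normalize the many spellings of "missing"


def clean_int(raw: Optional[str]) -> Optional[int]:
    """Cast a string to int, treating common null tokens as missing."""
    if raw is None or raw.strip().lower() in NULL_TOKENS:
        return None
    try:
        return int(raw.strip())
    except ValueError:
        raise ValueError(f"not an integer: {raw!r}")


def parse_date(raw: str) -> datetime:
    """Try each known format in turn; reject anything else loudly."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%b %d, %Y"):
        try:
            return datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue
    raise ValueError(f"unparseable date: {raw!r}")
```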

Error Handling and Logging

Production pipelines break. Data engineers write code that fails gracefully: catching specific exceptions, logging useful context (not just 'an error occurred'), implementing retry logic, and separating recoverable failures from fatal ones. A pipeline that processes 1 million records and dies on record 500,000 because of one malformed row is a pipeline that was written without error handling. You need to decide: skip the bad record and log it? Collect all errors and report them at the end? Halt processing entirely? The right answer depends on the use case, and interviewers want to hear you think about it.

logging, try/except/finally, custom exceptions, contextlib

Medium-high. Error handling questions often appear as follow-ups to data transformation problems.
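One of the strategies above, skip the bad record, log it with context, and report all failures at the end, can be sketched like this (`process_all` and its expected-exception tuple are illustrative, not a fixed interface):

```python
import logging

logger = logging.getLogger("pipeline")


def process_all(records, transform):
    """Skip bad records, keep context for each failure, report at the end."""
    good, errors = [], []
    for i, rec in enumerate(records):
        try:
            good.append(transform(rec))
        except (KeyError, ValueError) as exc:  # only the failures we expect
            logger.warning("record %d failed: %s: %s", i, type(exc).__name__, exc)
            errors.append((i, rec, exc))
    return good, errors
```

Note the narrow `except` clause: a `TypeError` from a bug in `transform` still crashes loudly instead of being silently swallowed as "bad data."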

Generators and Memory Efficiency

When you process files that are larger than available memory, generators let you read and transform data one chunk at a time. The yield keyword turns a function into a generator that produces values lazily. Chaining generators creates a streaming pipeline: read a line, parse it, transform it, write it, and move to the next line without ever holding the full dataset in memory. This is not an academic concept. It is the difference between a pipeline that runs on a t2.micro and one that requires a memory-optimized instance.

yield, itertools, functools, generator expressions

Medium. Generator questions appear in senior-level interviews and system design discussions.
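The chained streaming pipeline described above, in miniature. Each stage is a generator, so a line flows through all three stages before the next line is even read (the stage names are illustrative):

```python
def read_lines(source):
    """Stage 1: strip trailing newlines, one line at a time."""
    for line in source:
        yield line.rstrip("\n")


def parse(rows):
    """Stage 2: split each line into fields."""
    for row in rows:
        yield row.split(",")


def keep_valid(records):
    """Stage 3: drop anything without exactly two fields."""
    for rec in records:
        if len(rec) == 2:
            yield rec


# Chaining builds the pipeline; nothing runs until you iterate.
def pipeline(source):
    return keep_valid(parse(read_lines(source)))
```

`source` can be a list in a test or an open file handle in production; either way, the full dataset is never held in memory.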

Testing Pipeline Code

Testing data pipelines is different from testing web applications. You need to test transformations against known input/output pairs, validate schema enforcement, test edge cases (empty input, null values, duplicate records), and mock external dependencies (APIs, databases, file systems). pytest is the standard. Fixtures let you set up test data once and reuse it. Parametrize lets you run the same test with different inputs. The ability to write tests for your pipeline code signals senior-level thinking.

pytest, pytest.fixture, pytest.mark.parametrize, unittest.mock

Medium. Some interviews include a testing component or ask you to write tests for code you just wrote.
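A minimal sketch of the parametrize pattern: one transformation under test, several known input/output pairs including an edge case (the `normalize` function is invented for the example):

```python
import pytest


def normalize(s: str) -> str:
    """Transformation under test: trim whitespace and lowercase."""
    return s.strip().lower()


@pytest.mark.parametrize("raw,expected", [
    ("  Ada ", "ada"),
    ("BOB", "bob"),
    ("", ""),  # edge case: empty input
])
def test_normalize(raw, expected):
    assert normalize(raw) == expected
```

Adding a new edge case is one line in the parameter list, which is exactly why interviewers like seeing this over copy-pasted test functions.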

Packaging and Project Structure

Beyond scripts, production data engineering code lives in packages with proper structure: src layout, pyproject.toml or setup.py, requirements files or Poetry lock files, and clear separation between library code and entry points. Knowing how to structure a Python project, manage dependencies, and create reproducible environments (venv, pip-tools, Poetry) matters because data pipelines run in CI/CD systems, Docker containers, and Airflow workers where 'it works on my machine' is not acceptable.

venv, pip-tools, pyproject.toml, importlib

Low in interviews, high on the job. Rarely tested directly but signals production experience.

What You Do NOT Need to Learn

Every year we watch candidates waste two months on web frameworks, ML libraries, or deep OOP. None of it shows up. Here's what you can skip.

You need to learn algorithms and data structures

Data engineering interviews do not ask you to implement binary search trees or solve dynamic programming problems. If a question requires knowledge of algorithms, it is a software engineering interview that was mislabeled. DE Python focuses on practical data manipulation: parsing, transforming, validating, and loading data.

You need pandas for everything

Pandas is a great tool for analysis, but data engineering Python leans more on the standard library and purpose-built tools. In interviews, you are often expected to solve problems without pandas to demonstrate that you understand the underlying operations. On the job, you will use pandas for exploratory work and smaller datasets, but production pipelines use Spark, SQL, or plain Python with generators for scale.

You need machine learning

ML is for data scientists and ML engineers. Data engineers build the pipelines that feed ML models, but you do not need to understand gradient descent or neural networks. If a DE job description lists ML as a requirement, the role is likely a hybrid position or the company has not clearly defined the boundary between DE and DS roles.

You need Django or Flask

Web frameworks are for backend engineers. Data engineers occasionally build simple APIs to serve data (using FastAPI or Flask), but this is not a core skill and is rarely tested in interviews. If your DE interview asks you to build a REST API, the company may be looking for a generalist backend engineer.

5 Python Interview Questions for Data Engineers

Real-style questions that test the skills covered above.


Write a function that reads a JSON file containing an array of user records, filters to users active in the last 30 days, groups them by country, and writes one CSV file per country.

Open and parse the JSON file. For each record, parse the last_active date and compare to today minus 30 days. Use a defaultdict(list) to group by country. For each country, open a CSV file using csv.DictWriter, write the header from the record keys, and write all rows. Handle edge cases: missing last_active field, unparseable dates, empty country values. The interviewer checks whether you validate input before processing and whether you close files properly (use context managers).
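A sketch of the walkthrough above. The function name and the ISO-date assumption for `last_active` are illustrative; the edge-case policy (skip records with missing or unparseable fields) is one defensible choice among those the interviewer wants you to discuss:

```python
import csv
import json
from collections import defaultdict
from datetime import datetime, timedelta
from pathlib import Path


def split_active_users_by_country(json_path: Path, out_dir: Path,
                                  days: int = 30) -> list[Path]:
    cutoff = datetime.now() - timedelta(days=days)
    by_country = defaultdict(list)
    for user in json.loads(json_path.read_text(encoding="utf-8")):
        raw = user.get("last_active")
        if not raw or not user.get("country"):
            continue  # missing required field: skip and move on
        try:
            last_active = datetime.fromisoformat(raw)
        except ValueError:
            continue  # unparseable date: skip rather than crash
        if last_active >= cutoff:
            by_country[user["country"]].append(user)

    written = []
    for country, rows in by_country.items():
        out = out_dir / f"{country}.csv"
        with out.open("w", newline="", encoding="utf-8") as f:  # context manager closes the file
            writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
            writer.writeheader()
            writer.writerows(rows)
        written.append(out)
    return written
```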


Implement a retry decorator that retries a function up to N times with exponential backoff. The decorator should log each retry attempt.

Write a decorator that wraps the function in a try/except loop. On each exception, log the attempt number, exception type, and wait time. Sleep for 2^attempt seconds (with a cap). After N failures, re-raise the last exception. Use functools.wraps to preserve the original function's metadata. The interviewer looks for clean decorator syntax, proper logging (not print statements), and whether you handle the case where the function succeeds on a retry (return the result immediately).
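The walkthrough above as code. The parameter names (`max_attempts`, `base_delay`, `cap`) are illustrative; the essentials are `functools.wraps`, the immediate return on success, and re-raising after the final attempt:

```python
import functools
import logging
import time

logger = logging.getLogger("retry")


def retry(max_attempts: int = 3, base_delay: float = 1.0, cap: float = 30.0):
    def decorator(fn):
        @functools.wraps(fn)  # preserve the wrapped function's name and docstring
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)  # success: return immediately
                except Exception as exc:
                    if attempt == max_attempts:
                        raise  # out of retries: surface the last exception
                    delay = min(base_delay * 2 ** (attempt - 1), cap)  # capped backoff
                    logger.warning("attempt %d/%d failed (%s); retrying in %.1fs",
                                   attempt, max_attempts, exc, delay)
                    time.sleep(delay)
        return wrapper
    return decorator
```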


Write a generator that reads a large CSV file and yields batches of N rows as lists of dictionaries. Each batch should be exactly N rows except potentially the last batch.

Open the file with csv.DictReader. Accumulate rows into a batch list. When the batch reaches size N, yield it and reset. After the loop, yield any remaining rows. The interviewer checks memory efficiency (you should never hold more than N rows at once), proper file handling (context manager), and whether you handle edge cases: empty file, file with only headers, N larger than the file's row count.
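A sketch of the batching generator described above (`read_batches` is an illustrative name). Note that memory never exceeds one batch of N rows:

```python
import csv
from typing import Iterator


def read_batches(path, n: int) -> Iterator[list[dict]]:
    """Yield lists of at most n row-dicts from a CSV; only the last may be short."""
    with open(path, newline="", encoding="utf-8") as f:  # context manager handles closing
        batch = []
        for row in csv.DictReader(f):
            batch.append(row)
            if len(batch) == n:
                yield batch
                batch = []
        if batch:
            yield batch  # final short batch, if any rows remain
```

A header-only or empty file simply yields nothing, and N larger than the row count produces a single short batch, both edge cases fall out of the same two-line tail.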


Given two lists of dictionaries representing database tables, implement a LEFT JOIN on a specified key. Return the joined records.

Build an index (dictionary) from the right table keyed by the join column for O(1) lookups. Iterate through the left table. For each record, look up the matching record in the right index. If found, merge the dictionaries (handling key conflicts by prefixing). If not found, include the left record with None values for right-side columns. The interviewer tests whether you think about performance (indexing vs nested loops), key conflicts, and handling multiple matches (one-to-many joins).
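The walkthrough above in code. The `right_` conflict prefix is one reasonable convention; the index makes each lookup O(1), and storing lists in the index handles one-to-many matches:

```python
def left_join(left: list[dict], right: list[dict], key: str) -> list[dict]:
    # Index the right table by the join key; lists support one-to-many joins.
    index: dict = {}
    for rec in right:
        index.setdefault(rec[key], []).append(rec)

    right_cols = {k for rec in right for k in rec} - {key}
    joined = []
    for l in left:
        matches = index.get(l[key])
        if matches:
            for r in matches:  # one output row per right-side match
                merged = dict(l)
                for k, v in r.items():
                    if k != key:
                        # Prefix right-side columns that collide with left-side names.
                        merged[f"right_{k}" if k in l else k] = v
                joined.append(merged)
        else:
            merged = dict(l)
            merged.update({c: None for c in right_cols})  # LEFT JOIN: keep unmatched rows
            joined.append(merged)
    return joined
```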


Write a schema validation function that takes a record (dict) and a schema definition (dict mapping field names to expected types and required/optional flags). Return validation errors.

Iterate through the schema. For each field, check if it exists in the record (if required). If it exists, check isinstance against the expected type. Collect all errors (do not stop at the first one). Return a list of error objects with field name, expected type, actual type, and error description. The interviewer looks for thoroughness: handling None vs missing, nested types, and whether you return useful error messages that a developer could debug with.
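A sketch of the validator described above. The schema shape (`{"field": {"type": ..., "required": ...}}`) and the error-dict layout are illustrative choices; the key behaviors are collecting every error rather than stopping at the first, and distinguishing a missing field from an explicit None:

```python
def validate(record: dict, schema: dict) -> list[dict]:
    """Return all validation errors for record against schema (empty list = valid)."""
    errors = []
    for field, spec in schema.items():
        if field not in record:
            if spec.get("required", True):
                errors.append({"field": field, "error": "missing required field"})
            continue  # missing optional field is fine
        value = record[field]
        if value is None:
            # Present-but-None is a different failure than absent.
            errors.append({"field": field, "error": "value is None"})
            continue
        if not isinstance(value, spec["type"]):
            errors.append({
                "field": field,
                "error": f"expected {spec['type'].__name__}, "
                         f"got {type(value).__name__}",
            })
    return errors
```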

Python for Data Engineering FAQ

How much Python do I need to know for DE interviews?
You need solid fundamentals: data structures (lists, dicts, sets), file I/O (JSON, CSV), string manipulation, error handling, and functions (including decorators and generators). You do not need advanced OOP, metaprogramming, or algorithm knowledge. If you can write a function that reads a file, transforms the data, handles edge cases, and writes clean output, you are ready for most DE Python interviews.
Should I learn Python or SQL first for data engineering?
SQL first. It is tested in every single DE interview and is more immediately useful for data work. Once your SQL is solid (window functions, CTEs, optimization), move to Python. Many candidates make the mistake of spending months on Python tutorials before touching SQL, then get stuck in interviews because SQL is always the first filter.
Is Python enough, or do I also need Scala or Java?
Python is enough for the vast majority of DE roles. Scala appears in Spark-heavy roles (especially at companies that run Spark on JVM for performance), and Java appears at some large enterprises. Unless the job description specifically requires Scala or Java, Python covers you. If you do learn a second language, Scala is the most useful for data engineering.
How do I practice Python for DE interviews specifically?
Skip LeetCode. Instead, practice: reading and writing JSON/CSV files, transforming lists of dictionaries, implementing retry logic, building generators for large file processing, and writing schema validation functions. These are the patterns that actually appear in DE interviews. DataDriven has Python challenges specifically designed for data engineering contexts.

Stop Grinding Trees. Start Parsing Files.

The Python DE interviewers actually test. No binary trees, no dynamic programming, no linked lists.

Start Practicing