Python Practice

Python Practice for Data Engineers

Python shows up in 35% of the 1,042 DE interview rounds we analyzed. For loops lead at 31%, followed by function definition at 25%, algorithms at 21%, and dictionary operations at 16%. These 388 challenges match that distribution exactly, which means every hour of practice lands on something interviewers actually ask. Every problem executes in a real Docker sandbox, not a string matcher pretending to run code.

388

Python challenges

35%

Python share of DE rounds

31%

For-loop frequency

16%

Dict-ops frequency

Source: DataDriven analysis of 1,042 verified data engineering interview rounds.

Challenge Types

The 388 problems split four ways. Function implementation and data transformation together account for 64% of the library, mirroring their 64% combined share in real interview rounds.

Function Implementation

~140 problems

Write a function from scratch given a specification. These test your ability to translate requirements into working code. Topics include string manipulation, dictionary operations, list processing, set operations, and basic algorithms.

# Example: Group records by a composite key
def group_by_key(records: list[dict], keys: list[str]) -> dict:
    """Group a list of dicts by composite key.

    Args:
        records: List of dictionaries with consistent keys
        keys: List of key names to group by

    Returns:
        Dict mapping tuple of key values to list of records
    """
    groups = {}
    for record in records:
        key = tuple(record[k] for k in keys)
        if key not in groups:
            groups[key] = []
        groups[key].append(record)
    return groups
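The key-check-then-append pattern above can be compressed with dict.setdefault. A sketch of the same grouping, one line shorter and equivalent in behavior:

```python
def group_by_key(records: list[dict], keys: list[str]) -> dict:
    """Group a list of dicts by composite key, using setdefault."""
    groups = {}
    for record in records:
        # setdefault returns the existing list for this key,
        # or inserts and returns a fresh empty list
        groups.setdefault(tuple(record[k] for k in keys), []).append(record)
    return groups
```

Either version is fine in an interview; mentioning the setdefault (or collections.defaultdict) alternative signals familiarity with the standard library.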

Debugging

~90 problems

You receive broken code and need to find and fix the bug. These test your ability to read code carefully, trace execution, and spot common Python pitfalls: off-by-one errors, mutable default arguments, incorrect type handling, missing edge cases, and silent failures.

# Example: Find and fix the bug
def deduplicate(records, key_field):
    """Remove duplicate records based on key_field.
    Keep the first occurrence."""
    seen = set()
    result = []
    for record in records:
        key = record[key_field]
        if key not in seen:
            result.append(record)
            # BUG: forgot to add key to seen set
            # FIX: seen.add(key)
    return result
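For reference, here is the corrected version with the fix applied, so you can compare it against your own repair:

```python
def deduplicate(records, key_field):
    """Remove duplicate records based on key_field.
    Keep the first occurrence."""
    seen = set()
    result = []
    for record in records:
        key = record[key_field]
        if key not in seen:
            seen.add(key)  # the missing line: mark this key as seen
            result.append(record)
    return result
```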

Data Transformation

~100 problems

Transform data from one structure to another. These mirror real pipeline work: flattening nested JSON, pivoting rows to columns, merging datasets, computing aggregations, and reshaping data for downstream consumers. The input is always a Python data structure (list of dicts, nested dict, etc.) and the output is a different structure.

# Example: Flatten nested event data
def flatten_events(events: list[dict]) -> list[dict]:
    """Flatten nested event payloads into flat records.

    Input:  [{"event": "click", "ts": "2024-01-01",
              "props": {"page": "/home", "button": "cta"}}]
    Output: [{"event": "click", "ts": "2024-01-01",
              "page": "/home", "button": "cta"}]
    """
    flat = []
    for event in events:
        row = {k: v for k, v in event.items() if k != "props"}
        if "props" in event and isinstance(event["props"], dict):
            row.update(event["props"])
        flat.append(row)
    return flat

Pipeline Logic

~58 problems

Implement components of a data pipeline: validators, parsers, routers, retry logic, batching functions, and schema enforcers. These test whether you can write production-quality code that handles failures gracefully.

# Example: Validate and route records
def validate_and_route(records: list[dict],
                       required_fields: list[str]) -> dict:
    """Split records into valid and invalid based on required fields.

    Returns: {"valid": [...], "invalid": [...]}
    """
    valid = []
    invalid = []
    for record in records:
        missing = [f for f in required_fields if f not in record
                   or record[f] is None]
        if missing:
            record["_missing_fields"] = missing
            invalid.append(record)
        else:
            valid.append(record)
    return {"valid": valid, "invalid": invalid}

Topic Breakdown

Problems are organized by the Python concept they test. Here is the distribution across topics, ordered by how often each topic appears in data engineering interviews.

Topic                            Problems   Interview Frequency
Dictionaries and JSON                  68   Very High
String Parsing                         52   Very High
List Operations                        48   High
Error Handling                         35   High
Set Operations                         30   High
Comprehensions and Generators          28   Medium
File I/O and Parsing                   25   Medium
Date and Time                          22   Medium
Regular Expressions                    20   Medium
Classes and OOP                        18   Low-Medium
Functional Patterns                    15   Low-Medium
Algorithms and Data Structures         27   Low

Study priority: Start with dictionaries, string parsing, and list operations. These three topics account for 43% of all problems and appear in the majority of data engineering Python interviews. Once those feel comfortable, move to error handling and set operations.

How Python Practice Works

Every problem follows the same workflow, designed to match what happens in a real interview.

1. Read the Problem

Each problem has a description, input/output specification, example test cases, and constraints. The description is written in the same style as real interview problems: clear enough to solve, vague enough to require clarifying assumptions.

2. Write Your Solution

Type your code in the editor. You get a function signature with type hints as a starting point. The editor supports Python syntax highlighting and basic autocomplete.

3. Run Against Test Cases

Your code runs in a real Python environment (not a syntax checker). You see actual output for each test case, with clear pass/fail indicators. If a test fails, you see the expected vs actual output to help you debug.

4. Review and Iterate

After passing all tests, you can review your solution against the reference approach. The reference solution highlights Pythonic patterns and edge case handling that interviewers look for.

What Interviewers Look For in Python

Data engineering Python interviews evaluate different things than software engineering interviews. Here is what actually matters.

Handling Messy Data

Real data has nulls, empty strings, inconsistent types, and unexpected formats. Interviewers give you slightly messy input on purpose. They want to see if you check for None before accessing attributes, handle empty collections without crashing, and validate input types. A solution that works on clean data but crashes on an empty list scores poorly.
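A minimal sketch of what defensive handling looks like, using a hypothetical safe_mean helper (not from the problem library) that survives nulls, junk values, and empty input:

```python
def safe_mean(values):
    """Average the numeric entries, skipping None and other non-numbers.

    Returns None instead of crashing on empty or all-invalid input.
    """
    # exclude bool explicitly: bool is a subclass of int in Python
    nums = [v for v in values
            if isinstance(v, (int, float)) and not isinstance(v, bool)]
    if not nums:
        return None
    return sum(nums) / len(nums)
```

The empty-input check is the part interviewers watch for; a version that divides by len(values) crashes on `[]` and silently miscounts when nulls are present.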

Pythonic Patterns

Interviewers notice if you write Java-style Python. Using enumerate instead of manual index tracking, dict.get() instead of key-check-then-access, list comprehensions instead of manual loops for simple transforms, and f-strings instead of string concatenation. These patterns signal experience with the language.
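A few of those patterns side by side, as a quick reference:

```python
names = ["ada", "bob"]

# enumerate instead of manual index tracking (i = 0; i += 1)
labeled = [f"{i}: {name}" for i, name in enumerate(names)]

# dict.get() with a default instead of key-check-then-access
config = {"retries": 3}
timeout = config.get("timeout", 30)  # falls back to 30

# comprehension instead of a manual accumulator loop
squares = [n * n for n in range(4)]
```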

Choosing the Right Data Structure

Using a set for membership testing instead of scanning a list. Using a defaultdict instead of checking if a key exists. Using a Counter instead of manual counting. The right data structure choice often simplifies the code by 50% and shows the interviewer you think about performance naturally.
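All three choices in one illustrative snippet (the event data here is made up for the example):

```python
from collections import Counter, defaultdict

events = [("click", "/home"), ("view", "/home"), ("click", "/about")]

# set: O(1) membership testing instead of scanning a list
pages_seen = {page for _, page in events}
has_home = "/home" in pages_seen

# defaultdict: no "if key not in groups" boilerplate
by_page = defaultdict(list)
for name, page in events:
    by_page[page].append(name)

# Counter: counting without manual increment logic
counts = Counter(name for name, _ in events)
```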

Clear Error Messages

In pipeline code, silent failures are worse than crashes. If your function receives invalid input, raising a ValueError with a descriptive message is better than returning None or an empty result. Interviewers who have built real pipelines value this because they have debugged silent data loss caused by swallowed exceptions.
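What that looks like in practice, with a hypothetical config parser as the example:

```python
def parse_batch_size(raw):
    """Parse a batch-size setting, failing loudly on bad input."""
    try:
        size = int(raw)
    except (TypeError, ValueError):
        # descriptive error beats returning None and losing data downstream
        raise ValueError(f"batch_size must be an integer, got {raw!r}") from None
    if size <= 0:
        raise ValueError(f"batch_size must be positive, got {size}")
    return size
```

The error message names the field and echoes the bad value, so whoever reads the pipeline logs knows exactly what to fix.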

Python vs SQL in Data Engineering Interviews

Most data engineering interviews test both Python and SQL, but in different rounds and for different reasons.

Aspect          SQL Round                         Python Round
What it tests   Set-based thinking                Procedural logic
Problem style   Given tables, write a query       Given data, write a function
Edge cases      NULLs, duplicates, empty tables   None, empty lists, type mismatches
Time pressure   15-25 min per problem             20-30 min per problem
Company weight  60-70% of score at most companies 30-40% of score at most companies

If you have limited prep time, prioritize SQL. It carries more weight at most companies. But do not skip Python entirely. Bombing the Python round can knock you out even if your SQL is perfect.

Difficulty Levels

Problems are tagged by difficulty. Here is what each level means in interview context.

Easy

~150 problems

Phone screen level. Single-function problems with clear specs. Basic data structure operations, simple string manipulation, straightforward transformations. If you struggle with these, focus here before moving up.

Medium

~170 problems

On-site level. Multi-step logic, nested data structures, edge case handling required. You need to think about the approach before coding. These match the difficulty of a typical 45-minute coding round at mid-tier to top-tier companies.

Hard

~68 problems

Senior-level and FAANG-level problems. Multi-function solutions, complex state management, performance considerations, and production-quality error handling required. These are stretch problems. If you can solve hard problems cleanly, you are well prepared for any data engineering Python round.

Python Practice FAQ

What Python topics are tested in data engineering interviews?
Data engineering interviews test Python differently than software engineering interviews. The focus is on data manipulation (dictionaries, lists, sets, comprehensions), file I/O (reading CSV, JSON, Parquet), string parsing, error handling, and basic algorithm patterns like grouping, deduplication, and merging datasets. You will rarely see LeetCode-style dynamic programming or graph traversal. Instead, expect problems like 'parse this log file and count errors by hour' or 'deduplicate these records using a composite key.'
How is Python practice different from SQL practice for data engineering?
SQL practice focuses on querying existing data: joins, aggregations, window functions. Python practice focuses on transforming data programmatically: parsing raw inputs, handling edge cases, implementing business logic that is too complex for SQL, and building pipeline components. In interviews, SQL tests your ability to think in sets. Python tests your ability to think in sequences and handle messy real-world data. Most data engineering roles test both.
Do I need to know pandas for data engineering interviews?
It depends on the company. Some companies test pandas explicitly (merge, groupby, apply, pivot_table). Others avoid library-specific questions and test pure Python. If the job description mentions pandas, prepare for it. If it does not, focus on core Python data structures and built-in functions. Knowing pandas well is a bonus in any case because it shows you can work with tabular data efficiently, but it is not a universal requirement.
How many Python problems should I practice before interviewing?
Aim for 50 to 100 problems across different categories. Quality matters more than quantity. If you can solve a function implementation problem, a debugging problem, and a data transformation problem without looking anything up, you are in good shape. Focus on problems that mirror real pipeline work: parsing structured text, grouping records, handling nulls and duplicates, and writing clean functions with proper error handling.

388 Problems. 35% of the Interview. Your Move.

Match your practice distribution to the real interview distribution. Start with a for-loop problem.