Python Practice for Data Engineers

Python for a data engineer is glue plus correctness. The interview tests glue: a parser that handles a malformed row without crashing the batch, a validator that emits structured errors, a dedup that survives late-arriving data, a retry decorator that doesn't amplify partial failures. The 8 patterns below cover the surface; 388 graded problems implement them.

Know the patterns before the interviewer asks them.

A Python query in the same shape a screen would give you.
The diff against the expected output: where ties broke and what you missed.
388 DE Python problems
8 pipeline patterns
29% are data transformation
4% are algorithm DSA (mostly absent)

Where each pattern sits in a real pipeline

The 8 patterns aren't abstract categories; each maps to a specific stage of an actual DE pipeline. The diagram below shows the flow; Patterns 6 (PySpark) and 7 (file formats) are not drawn as separate stages.

A typical DE pipeline and where each Python pattern sits
     ┌──────────────┐
     │  Source      │  CSV / JSON / API / S3 / DB CDC
     └──────┬───────┘
            │
            ▼
     ┌──────────────┐
     │  Parser      │   ◄── Pattern 1: ingestion + parsing
     └──────┬───────┘       Handle malformed rows, encoding, partial success
            │
            ▼
     ┌──────────────┐
     │  Validator   │   ◄── Pattern 3: schema validation
     └──────┬───────┘       Type checks, required fields, structured errors
            │       \
            │        ▼
            │   ┌──────────┐
            │   │ Rejected │     to dead-letter queue
            │   └──────────┘
            ▼
     ┌──────────────┐
     │  Transformer │   ◄── Pattern 4: data transformation
     └──────┬───────┘       Flatten, pivot, derive, normalize
            │
            ▼
     ┌──────────────┐
     │  Dedup       │   ◄── Pattern 2: dedup + idempotency
     └──────┬───────┘       Composite key, tiebreaker on late-arriving
            │
            ▼
     ┌──────────────┐
     │  Reconciler  │   ◄── Pattern 8: late-data reconciliation
     └──────┬───────┘       Event-time vs processing-time semantics
            │
            ▼
     ┌──────────────┐
     │  Sink        │   ◄── Pattern 5: ETL flow / orchestration
     └──────────────┘       Retry, batch, idempotent upsert
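
The Sink stage pairs retry with idempotent upsert, and the pairing is what keeps a retry from amplifying a partial failure. The sketch below is one minimal way to express that, not a reference implementation. The names `retry`, `upsert`, `store`, and the injectable `sleep` parameter are hypothetical, and the choice of which exceptions count as transient is an assumption.

```python
import functools
import time

def retry(attempts=3, base_delay=0.5,
          retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Retry only errors assumed to be transient, with exponential backoff."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except retry_on:
                    if attempt == attempts:
                        raise  # out of attempts: surface the original error
                    sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator

store = {}  # hypothetical sink, keyed by primary key

@retry(attempts=3, base_delay=0.5)
def upsert(rows):
    # Keyed writes are idempotent: if an earlier attempt wrote some rows
    # before failing, the retry overwrites them instead of duplicating them.
    for row in rows:
        store[row["id"]] = row
    return len(store)
```

A non-transient error such as `ValueError` is not in `retry_on`, so it propagates on the first attempt instead of being retried three times against bad data.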

8 patterns, with problem counts and pipeline position

Each pattern has a sample prompt and the skills it actually tests. Together they cover the bulk of DE Python interview questions.

01 Ingestion and parsing
45 problems · Front of pipeline

Parse a 10MB CSV where 0.5% of rows are malformed. Return clean records and an error list with row number and reason.

csv module dialects · encoding errors · partial-success return tuple
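
A minimal sketch of the partial-success shape this prompt asks for, assuming the input is already decoded text and that a malformed row means a wrong field count or a non-numeric value. The function name `parse_csv` and the `columns` and `numeric_fields` parameters are illustrative, not part of any grader.

```python
import csv
import io

def parse_csv(text, columns, numeric_fields=()):
    """Return (clean_rows, errors). A bad row is recorded, never raised."""
    clean, errors = [], []
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header != list(columns):
        return clean, [(1, f"unexpected header: {header!r}")]
    for row in reader:
        line = reader.line_num  # line number in the source text; the header is 1
        if len(row) != len(header):
            errors.append((line, f"expected {len(header)} fields, got {len(row)}"))
            continue
        record = dict(zip(header, row))
        try:
            for name in numeric_fields:
                record[name] = float(record[name])
        except ValueError:
            errors.append((line, f"{name} is not numeric: {record[name]!r}"))
            continue
        clean.append(record)
    return clean, errors

sample = "user_id,amount\nu1,10.5\nu2\nu3,abc\nu4,7\n"
clean, errors = parse_csv(sample, ["user_id", "amount"], numeric_fields=["amount"])
```

Here `clean` holds the rows for `u1` and `u4`, and `errors` holds one `(line, reason)` tuple each for lines 3 and 4. The batch finishes either way.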
02 Dedup and idempotency
32 problems · Mid-pipeline

Dedup an event stream by (user_id, event_type). A late-arriving event replaces the existing record only if its event_time is later.

composite key dicts · tiebreaker semantics · in-place vs return-new
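
One way to sketch this prompt, assuming events are dicts and that `event_time` values are directly comparable (integers here). The tiebreaker is a design choice rather than something the prompt dictates: a strict `>` keeps the first-arrived record on ties, which makes replaying the same batch a no-op.

```python
def dedup_latest(events):
    """Keep one record per (user_id, event_type): the latest by event_time."""
    latest = {}
    for event in events:
        key = (event["user_id"], event["event_type"])  # composite key
        current = latest.get(key)
        # Strict '>' means a tie keeps the record that arrived first,
        # so feeding the same events twice yields the same result.
        if current is None or event["event_time"] > current["event_time"]:
            latest[key] = event
    return list(latest.values())  # returns a new list; the input is untouched
```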
03 Schema validation and enforcement
28 problems · After parsing

Validate 100k records against a 12-field schema. Return clean records and rejected records with field-level error messages.

isinstance checks · structured errors · fail fast vs collect all errors
04 Transformations and reshaping
50 problems · Core transformation

Flatten nested event payloads so each properties key becomes a top-level column. Preserve type, handle missing keys.

dict comprehension · recursion on nested data · dataclasses for typing
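
A rough sketch of the flatten step under two assumptions the prompt leaves open: nested keys become dotted column names (`properties.plan`), and a key missing from one event is filled with `None` so every row has the same columns. `flatten` and `flatten_events` are hypothetical names.

```python
def flatten(record, parent_key="", sep="."):
    """Recursively lift nested dict keys to the top level; values keep their type."""
    flat = {}
    for key, value in record.items():
        name = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict) and value:
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value  # leaf value (or empty dict) stored as-is
    return flat

def flatten_events(events):
    """Flatten every event, then align all rows to the union of columns."""
    rows = [flatten(e) for e in events]
    # dict.fromkeys keeps first-seen order while dropping duplicate names
    columns = list(dict.fromkeys(name for row in rows for name in row))
    return [{c: row.get(c) for c in columns} for row in rows]
```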
05 ETL flow and orchestration logic
32 problems · Glue layer

Implement a batcher that flushes when buffer hits 1000 records or 5 seconds elapse, whichever comes first.

asyncio.wait or threading.Timer · context managers · graceful flush on close
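
A simplified sketch of the batcher, with one loud assumption: it checks elapsed time only when a record is added, so an idle buffer is not flushed until the next `add` or the close. A fuller answer would add a `threading.Timer` or an asyncio task for the idle case. The `Batcher` class, its `sink` callable, and the injectable `clock` are all hypothetical names.

```python
import time

class Batcher:
    """Buffer records; flush on size, on age, or when the context closes."""

    def __init__(self, sink, max_records=1000, max_seconds=5.0, clock=time.monotonic):
        self._sink = sink
        self._max_records = max_records
        self._max_seconds = max_seconds
        self._clock = clock  # injectable so tests need no real waiting
        self._buffer = []
        self._first_added_at = None

    def add(self, record):
        if not self._buffer:
            self._first_added_at = self._clock()  # age counts from the first record
        self._buffer.append(record)
        too_big = len(self._buffer) >= self._max_records
        too_old = self._clock() - self._first_added_at >= self._max_seconds
        if too_big or too_old:
            self.flush()

    def flush(self):
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._sink(batch)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.flush()  # graceful flush on close, including on error
        return False  # never swallow the caller's exception
```

Used as `with Batcher(write_batch) as b: b.add(record)`, where `write_batch` is whatever callable accepts a list, the final partial batch is delivered when the block exits.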
06 PySpark and DataFrame ops
45 problems · Distributed compute

Given an 8M-row events DataFrame and a 2M-row users DataFrame, write the join with the right strategy and explain why.

DataFrame API · broadcast hint · window functions
07 File-format-specific work
20 problems · Storage layer

Read Parquet where a new optional column was added between batches. Handle both schemas without crashing.

pyarrow schema evolution · mergeSchema cost · explicit schema declaration
08 Late-data and reconciliation
15 problems · Pipeline correctness

Implement a function that takes a stream of events with possibly-late records and returns correct daily aggregates that survive reprocessing.

event_time vs processing_time · idempotent upsert · watermark logic
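
A minimal sketch of the reprocessing-safe part of this prompt, leaving watermark logic out. It assumes each event carries a unique `event_id`, an ISO-8601 `event_time` with a UTC offset, and an integer `amount`; all three field names and `apply_batch` are hypothetical. Events are bucketed by event time, not arrival time, and state is upserted by `event_id`, so a replayed batch changes nothing.

```python
from collections import defaultdict
from datetime import datetime, timezone

def apply_batch(state, batch):
    """Upsert a batch into state (event_id -> event), then return daily totals."""
    for event in batch:
        state[event["event_id"]] = event  # idempotent: a replay overwrites itself
    totals = defaultdict(int)
    for event in state.values():
        # Bucket by when the event happened, not by when it arrived.
        event_dt = datetime.fromisoformat(event["event_time"])
        day = event_dt.astimezone(timezone.utc).date().isoformat()
        totals[day] += event["amount"]
    return dict(totals)
```

Recomputing totals from the full state is the simplest correct choice for a sketch; an incremental version would need to subtract an event's old contribution before adding the new one.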

What a Pattern 3 (validation) answer looks like

# Pattern 3: schema validation with structured errors.
# Interviewers reward error design over a bare ValueError because they've
# debugged silent data loss caused by swallowed exceptions.

from dataclasses import dataclass
from typing import Optional, Any

@dataclass
class FieldSpec:
    type: type
    required: bool = True
    max_len: Optional[int] = None
    choices: Optional[list[Any]] = None

@dataclass
class RowError:
    row_index: int
    field: str
    reason: str
    raw_value: Any

def validate_records(
    records: list[dict],
    schema: dict[str, FieldSpec],
) -> tuple[list[dict], list[RowError]]:
    clean, errors = [], []
    for i, record in enumerate(records):
        row_errors = []
        for field, spec in schema.items():
            value = record.get(field)
            if value is None:
                if spec.required:
                    row_errors.append(RowError(i, field, "required field missing", None))
                continue
            if not isinstance(value, spec.type):
                row_errors.append(RowError(i, field,
                    f"expected {spec.type.__name__}, got {type(value).__name__}", value))
                continue
            if spec.max_len is not None and hasattr(value, "__len__"):
                if len(value) > spec.max_len:
                    row_errors.append(RowError(i, field,
                        f"length {len(value)} > max {spec.max_len}", value))
            if spec.choices and value not in spec.choices:
                row_errors.append(RowError(i, field,
                    f"value not in allowed choices {spec.choices}", value))
        if row_errors:
            errors.extend(row_errors)
        else:
            clean.append(record)
    return clean, errors

# Why the structure:
#   row_index = where in the input batch
#   field     = which column to fix
#   reason    = human-readable
#   raw_value = for downstream replay / debugging
# All 4 are needed at 2am during an oncall.

Structured errors with row_index and field-level messages. The interview reward is in the error design, not just the happy path.

3 prep shapes that don't work

Common ways DE candidates burn Python prep time on the wrong material.

Solving LeetCode arrays + strings

Arrays and strings are 4% of DE interview Python. Trees are 1%. DP is 0.3%. Reps on these don't transfer to dedup, parsing, validation, or pipeline composition.

Memorizing 'top 50 Python interview questions' listicles

The listicles aren't wrong; they're not interactive. Reading 'use a defaultdict' isn't the same as choosing defaultdict over Counter under time pressure with a stack trace from a wrong attempt.
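
For concreteness, the choice that sentence alludes to can be sketched as follows, using hypothetical event dicts: `Counter` when the answer is a count per key, `defaultdict(list)` when you need the grouped values themselves.

```python
from collections import Counter, defaultdict

def count_by_type(events):
    """Counter fits when the result is 'how many per key'."""
    return Counter(e["event_type"] for e in events)

def group_by_user(events):
    """defaultdict(list) fits when the grouped values themselves are needed."""
    groups = defaultdict(list)
    for e in events:
        groups[e["user_id"]].append(e["event_type"])
    return dict(groups)  # plain dict so a missing key raises instead of growing
```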

Writing a pandas-only answer

Some interviews allow pandas; many test pure Python first specifically to see whether you can write the logic without library shortcuts. If the prompt doesn't mention pandas, write standard library Python.

DE Python practice FAQ

What's the difference between DE Python and SWE Python?
DE Python is dict- and set-heavy, file-parsing-heavy, and error-handling-heavy. SWE Python leans more on classes, async patterns, and algorithm work. The two overlap in syntax but diverge in problem shape. The diagram above shows where each DE pattern lives in a pipeline.
Which FAANG companies actually test Python the way the topic chart describes?
Meta tests SQL plus Python product analytics (session windows, funnel computation). Amazon tests Python ETL flow (retry, idempotency, partial-success returns). Google tests pure Python transformations under time pressure. Netflix tests PySpark inside the actual platform stack. Stripe tests Python around payments-style state machines. Each company guide at /companies breaks it down.
Do I need to know pandas, polars, or PySpark?
Depends on the company. Pandas: most DE rounds across companies. PySpark: Spark shops (Databricks, Netflix, Uber, Airbnb). Polars: rare in interviews. The bank includes problems in each library; check the per-company emphasis before allocating prep time.
How is the grader different from running Python locally?
It runs test cases beyond the happy path. Locally you can write a function and pass your own single test. The grader runs 5-15 test cases including hidden ones: empty input, single element, performance bound, malformed input, late-arriving data. A solution that works on the happy path and crashes on an empty list fails the way it would in production.
What's the minimum Python prep for a DE phone screen?
About 30-40 problems at Easy and Medium difficulty, split across the 3 highest-frequency patterns: data transformation (15), dict and set (10), file parsing (10). 4-6 hours of focused practice over a week is usually enough to clear a phone screen at most companies.
Is there a senior-track focus for Python?
Yes. The Hard tier (68 problems) covers composite transformations, late-arriving reconciliation, idempotent upserts, and schema evolution. Plan 20-30 Hard problems in the last 2 weeks of senior-level prep.
02 / Why practice

  1. Active recall beats re-reading by 50%

    Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom.

  2. 76% of hiring managers reject on the coding task, not the resume

    From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice.

  3. Five problem shapes cover 80% of data engineer loops

    Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition.
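
As an illustration of writing one of those shapes by hand, here is a sketch of sessionization. It assumes events are dicts with a `user_id` and a `ts` in epoch seconds, and that a session ends after more than 30 minutes of inactivity; the field names and the default gap are assumptions, not a fixed spec.

```python
def sessionize(events, gap_seconds=30 * 60):
    """Split each user's events into sessions wherever the gap exceeds gap_seconds."""
    sessions = []
    prev = None
    # Sort so each user's events are contiguous and in time order.
    for event in sorted(events, key=lambda e: (e["user_id"], e["ts"])):
        starts_new_session = (
            prev is None
            or event["user_id"] != prev["user_id"]
            or event["ts"] - prev["ts"] > gap_seconds
        )
        if starts_new_session:
            sessions.append([])
        sessions[-1].append(event)
        prev = event
    return sessions
```

The strict `>` means a gap of exactly 30 minutes stays in the same session. That boundary is the kind of tie a grader's hidden test cases tend to probe, so state the choice out loud.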
