Python Practice for Data Engineers
Python for a data engineer is glue plus correctness. The interview tests glue: a parser that handles a malformed row without crashing the batch, a validator that emits structured errors, a dedup that survives late-arriving data, a retry decorator that doesn't amplify partial failures. The 8 patterns below cover the surface; 388 graded problems implement them.
Know the patterns before the interviewer asks them.
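To make "a retry decorator that doesn't amplify partial failures" concrete, here is a minimal sketch. It assumes the caller can separate transient failures from permanent ones; the TransientError class, the retry name, and the default delays are hypothetical choices for illustration. It retries only the listed exception types, caps the attempts, and backs off with jitter.

```python
import random
import time
from functools import wraps


class TransientError(Exception):
    """Hypothetical marker for failures that are safe to retry (timeouts, throttling)."""


def retry(max_attempts=3, base_delay=0.5, retry_on=(TransientError,), sleep=time.sleep):
    """Retry only the exception types in retry_on, up to max_attempts calls in total."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except retry_on:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the original error
                    # Exponential backoff plus jitter so retries don't arrive in lockstep.
                    delay = base_delay * 2 ** (attempt - 1)
                    sleep(delay + random.uniform(0, delay / 2))
        return wrapper
    return decorator
```

Anything not listed in retry_on (a ValueError from a malformed row, say) propagates on the first call instead of being retried. The decorator alone doesn't make a retry safe: the wrapped write still has to be idempotent, for example an upsert keyed on a record ID, or a retry after a partial write can duplicate rows.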
Where each pattern sits in a real pipeline
The 8 patterns aren't abstract categories; each maps to a specific stage of an actual DE pipeline. The diagram shows the flow.
┌──────────────┐
│    Source    │      CSV / JSON / API / S3 / DB CDC
└──────┬───────┘
       │
       ▼
┌──────────────┐
│    Parser    │ ◄── Pattern 1: ingestion + parsing
└──────┬───────┘      Handle malformed rows, encoding, partial success
       │
       ▼
┌──────────────┐
│  Validator   │ ◄── Pattern 3: schema validation
└──────┬───────┘      Type checks, required fields, structured errors
       │ \
       │  ▼
       │  ┌──────────┐
       │  │ Rejected │  to dead-letter queue
       │  └──────────┘
       ▼
┌──────────────┐
│ Transformer  │ ◄── Pattern 4: data transformation
└──────┬───────┘      Flatten, pivot, derive, normalize
       │
       ▼
┌──────────────┐
│    Dedup     │ ◄── Pattern 2: dedup + idempotency
└──────┬───────┘      Composite key, tiebreaker on late-arriving
       │
       ▼
┌──────────────┐
│  Reconciler  │ ◄── Pattern 8: late-data reconciliation
└──────┬───────┘      Event-time vs processing-time semantics
       │
       ▼
┌──────────────┐
│     Sink     │ ◄── Pattern 5: ETL flow / orchestration
└──────────────┘      Retry, batch, idempotent upsert
8 patterns, with problem counts and pipeline position
Each pattern has a sample prompt and the skills it actually tests. Together they cover the bulk of DE Python interview questions.
Parse a 10MB CSV where 0.5% of rows are malformed. Return clean records and an error list with row number and reason.
Dedup an event stream by (user_id, event_type). Late-arriving events keep the existing record unless their event_time is later.
Validate 100k records against a 12-field schema. Return clean records and rejected records with field-level error messages.
Flatten nested event payloads so each properties key becomes a top-level column. Preserve type, handle missing keys.
Implement a batcher that flushes when buffer hits 1000 records or 5 seconds elapse, whichever comes first.
Given an 8M-row events DataFrame and a 2M-row users DataFrame, write the join with the right strategy and explain why.
Read Parquet where a new optional column was added between batches. Handle both schemas without crashing.
Implement a function that takes a stream of events with possibly-late records and returns correct daily aggregates that survive reprocessing.
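Taking the dedup prompt from the list above as an example, here is a minimal sketch of one way to answer it. It assumes each event is a dict with user_id, event_type, and a comparable event_time; the dedup_events name and the sample data are illustrative. An equal event_time keeps the existing record, following the prompt's "unless their event_time is later".

```python
from datetime import datetime


def dedup_events(events):
    """Keep one record per (user_id, event_type); a strictly later event_time wins."""
    latest = {}
    for event in events:
        key = (event["user_id"], event["event_type"])
        current = latest.get(key)
        # Older or equal event_time: keep what we already have.
        if current is None or event["event_time"] > current["event_time"]:
            latest[key] = event
    return list(latest.values())


sample_events = [
    {"user_id": 1, "event_type": "click", "event_time": datetime(2024, 1, 1, 12), "v": "first"},
    {"user_id": 1, "event_type": "click", "event_time": datetime(2024, 1, 1, 11), "v": "late-but-older"},
    {"user_id": 1, "event_type": "click", "event_time": datetime(2024, 1, 1, 13), "v": "newer"},
    {"user_id": 2, "event_type": "click", "event_time": datetime(2024, 1, 1, 9), "v": "other-user"},
]
deduped = dedup_events(sample_events)
```

Because the comparison is on event_time rather than arrival order, feeding the same batch through twice gives the same result, which is the idempotency half of the pattern.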
What a Pattern 3 (validation) answer looks like
# Pattern 3: schema validation with structured errors.
# Interviewers reward error design over a bare ValueError because they've
# debugged silent data loss caused by swallowed exceptions.
from dataclasses import dataclass
from typing import Optional, Any


@dataclass
class FieldSpec:
    type: type
    required: bool = True
    max_len: Optional[int] = None
    choices: Optional[list[Any]] = None


@dataclass
class RowError:
    row_index: int
    field: str
    reason: str
    raw_value: Any


def validate_records(
    records: list[dict],
    schema: dict[str, FieldSpec],
) -> tuple[list[dict], list[RowError]]:
    clean, errors = [], []
    for i, record in enumerate(records):
        row_errors = []
        for field, spec in schema.items():
            value = record.get(field)
            if value is None:
                if spec.required:
                    row_errors.append(RowError(i, field, "required field missing", None))
                continue
            if not isinstance(value, spec.type):
                row_errors.append(RowError(i, field,
                    f"expected {spec.type.__name__}, got {type(value).__name__}", value))
                continue
            if spec.max_len is not None and hasattr(value, "__len__"):
                if len(value) > spec.max_len:
                    row_errors.append(RowError(i, field,
                        f"length {len(value)} > max {spec.max_len}", value))
            if spec.choices and value not in spec.choices:
                row_errors.append(RowError(i, field,
                    f"value not in allowed choices {spec.choices}", value))
        if row_errors:
            errors.extend(row_errors)
        else:
            clean.append(record)
    return clean, errors

# Why the structure:
#   row_index = where in the input batch
#   field     = which column to fix
#   reason    = human-readable
#   raw_value = for downstream replay / debugging
# All 4 are needed at 2am during an oncall.
Structured errors with row_index and field-level messages. The interview reward is in the error design, not just the happy path.
3 prep shapes that don't work
Common ways DE candidates burn Python prep time on the wrong material.
Solving LeetCode arrays + strings
Arrays and strings are 4% of DE interview Python. Trees are 1%. DP is 0.3%. Reps on these don't transfer to dedup, parsing, validation, or pipeline composition.
Memorizing 'top 50 Python interview questions' listicles
The listicles aren't wrong; they're not interactive. Reading 'use a defaultdict' isn't the same as choosing defaultdict over Counter under time pressure with a stack trace from a wrong attempt.
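The defaultdict-versus-Counter choice usually comes down to what you need to hold per key. A small sketch on made-up event data (the field names are assumptions for illustration):

```python
from collections import Counter, defaultdict

events = [
    {"user_id": "u1", "event_type": "click"},
    {"user_id": "u2", "event_type": "view"},
    {"user_id": "u1", "event_type": "view"},
    {"user_id": "u1", "event_type": "click"},
]

# Counter: the value per key is a tally, and most_common() comes built in.
type_counts = Counter(e["event_type"] for e in events)

# defaultdict(list): the value per key is a collection of the rows themselves.
events_by_user = defaultdict(list)
for e in events:
    events_by_user[e["user_id"]].append(e["event_type"])
```

If the follow-up question is "how many", Counter is the shorter answer; if it is "which ones", you need the grouped rows and defaultdict(list) fits.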
Writing a pandas-only answer
Some interviews allow pandas; many test pure Python first specifically to see whether you can write the logic without library shortcuts. If the prompt doesn't mention pandas, write standard library Python.
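For a sense of what a standard-library-only answer can look like, here is a minimal sketch of the CSV parsing prompt from the pattern list, using only csv and io. The two-column layout (user_id, amount), the parse_rows name, and the error dict shape are assumptions for illustration, not a required answer.

```python
import csv
import io


def parse_rows(text, expected_fields=("user_id", "amount")):
    """Return (clean, errors); a bad row is recorded in errors, never raised."""
    clean, errors = [], []
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header is None or tuple(header) != tuple(expected_fields):
        return clean, [{"row": 1, "reason": f"unexpected header: {header}"}]
    # Row numbers count records, with the header as row 1.
    for row_num, row in enumerate(reader, start=2):
        if len(row) != len(expected_fields):
            reason = f"expected {len(expected_fields)} fields, got {len(row)}"
            errors.append({"row": row_num, "reason": reason})
            continue
        record = dict(zip(expected_fields, row))
        try:
            record["amount"] = float(record["amount"])
        except ValueError:
            reason = f"amount is not numeric: {record['amount']!r}"
            errors.append({"row": row_num, "reason": reason})
            continue
        clean.append(record)
    return clean, errors
```

The try/except sits inside the loop, around one row's conversion, so a single malformed row costs one error entry instead of the whole batch.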
DE Python practice FAQ
What's the difference between DE Python and SWE Python?
What FAANG companies actually test Python the way the topic chart describes?
Do I need to know pandas, polars, or PySpark?
How is the grader different from running Python locally?
What's the minimum Python prep for a DE phone screen?
Is there a senior-track focus for Python?
- 01
Active recall beats re-reading by 50%
Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom.
- 02
76% of hiring managers reject on the coding task, not the resume
From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice.
- 03
Five problem shapes cover 80% of data engineer loops
Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition.
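Sessionization is one of the shapes named above. A minimal sketch, assuming events arrive as (user_id, timestamp) tuples and that 30 minutes of inactivity starts a new session; the input shape, the gap, and the sessionize name are all assumptions chosen for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def sessionize(events, gap=timedelta(minutes=30)):
    """Number each user's sessions from 0; a new one starts after `gap` of inactivity."""
    by_user = defaultdict(list)
    for user_id, ts in events:
        by_user[user_id].append(ts)

    sessions = []
    for user_id, timestamps in by_user.items():
        timestamps.sort()  # input may arrive out of order
        session_id = 0
        previous = None
        for ts in timestamps:
            if previous is not None and ts - previous > gap:
                session_id += 1
            sessions.append((user_id, ts, session_id))
            previous = ts
    return sessions


sample = [
    ("u1", datetime(2024, 1, 1, 10, 0)),
    ("u2", datetime(2024, 1, 1, 10, 5)),
    ("u1", datetime(2024, 1, 1, 11, 0)),
    ("u1", datetime(2024, 1, 1, 10, 10)),
]
result = sessionize(sample)
```

The same group, sort, compare-to-previous skeleton carries over to top-N-per-group; only the per-group step changes.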