Data Engineering Interview Prep
System design rounds test your ability to architect a data platform on a whiteboard. You will draw components, discuss trade-offs (consistency vs availability, cost vs latency), estimate capacity, and defend your choices. No code. All reasoning.
This guide covers whiteboard architecture and trade-off analysis. For operational questions like debugging production failures, backfill strategies, and error handling, see our pipeline interview questions guide.
Most candidates fail system design by jumping straight to drawing boxes and arrows. The strongest answers follow a consistent structure that demonstrates both technical depth and communication skill.
Step 1: Clarify requirements (3-5 min). Ask about data volume, freshness requirements, query patterns, and who consumes the output. "Is this a daily batch report or a real-time dashboard?" changes everything about your design.
Step 2: Draw the high-level flow (5 min). Sources on the left, storage on the right, processing in the middle. Label each component. Do not pick specific tools yet. Just name the layers: ingestion, transformation, storage, serving.
Step 3: Walk through data flow (10 min). Follow one record from source to output. Explain what happens at each stage: validation, transformation, deduplication, loading. This is where you demonstrate depth.
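The stages above can be sketched in a few lines. This is a minimal illustration, not a real pipeline: the record schema, the cents-to-dollars transform, and the in-memory dedup set are all assumptions made for the example.

```python
seen_ids = set()   # dedup state; in production this would be a keyed store
warehouse = []     # stand-in for the load target

def validate(record):
    """Reject records missing required fields."""
    return "id" in record and "amount" in record

def transform(record):
    """Normalize units: assumed cents -> dollars."""
    return {**record, "amount": record["amount"] / 100}

def load(record):
    """Idempotent load: skip records already seen."""
    if record["id"] in seen_ids:
        return False
    seen_ids.add(record["id"])
    warehouse.append(record)
    return True

# One valid record, one duplicate, one invalid record
for raw in [{"id": 1, "amount": 250}, {"id": 1, "amount": 250}, {"amount": 99}]:
    if validate(raw):
        load(transform(raw))

print(len(warehouse))  # only the first valid record landed
```

In an interview, naming where each stage's state lives (the dedup set, the load target) is what turns this from a diagram into a demonstration of depth.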
Step 4: Discuss failure modes (10 min). What happens when the source is down? When a transformation fails halfway? When data arrives late? Your recovery strategy matters more than your happy-path design.
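One concrete recovery answer for "a transformation fails halfway" is checkpointed batch processing: record progress after each completed batch so a rerun resumes instead of reprocessing. The checkpoint file layout and batch names below are illustrative assumptions.

```python
import json
import os
import tempfile

processed = []  # records which batches actually ran, for demonstration

def process(batch, fail_on=None):
    """Simulated batch transform; fails on demand to model a mid-run crash."""
    if batch == fail_on:
        raise RuntimeError("transform failed halfway")
    processed.append(batch)

def run(batches, ckpt, fail_on=None):
    """Process batches in order, checkpointing after each success."""
    start = 0
    if os.path.exists(ckpt):
        with open(ckpt) as f:
            start = json.load(f)["completed"]
    for i in range(start, len(batches)):
        process(batches[i], fail_on)
        with open(ckpt, "w") as f:
            json.dump({"completed": i + 1}, f)

ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
batches = ["b0", "b1", "b2"]
try:
    run(batches, ckpt, fail_on="b2")  # first run dies on the last batch
except RuntimeError:
    pass
run(batches, ckpt)                    # rerun resumes at b2, no reprocessing
print(processed)                      # each batch processed exactly once
```

The design point to say out loud: checkpointing only works if each batch is idempotent, otherwise a crash between processing and checkpoint-write still double-processes one batch.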
Step 5: Trade-offs and alternatives (10 min). "I chose batch here because the freshness requirement is 1 hour. If that changed to 1 minute, I would swap this component for a streaming consumer." Show that your design is deliberate, not default.
Batch versus streaming is the first fork in any system design answer. Interviewers test whether you can reason about this trade-off from first principles, not just recite definitions.
Data systems force you to choose between consistency and availability. Can your dashboard show slightly stale data? Can your pipeline tolerate duplicate records temporarily? Your answers to these questions shape the entire architecture.
Where your data lives determines how fast you can query it, how much it costs, and how flexible your schema can be. This is where many candidates go shallow. Go deep.
System design interviews expect back-of-envelope math. How many events per second? How much storage per day? What compute do you need? These numbers drive your architecture choices.
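A worked example of the arithmetic, using assumed inputs (10M daily active users, 50 events per user per day, 1 KB per event); swap in the numbers your interviewer gives you:

```python
# Assumed workload parameters for the estimate
dau = 10_000_000        # daily active users
events_per_user = 50    # events per user per day
event_bytes = 1_000     # ~1 KB per event

events_per_day = dau * events_per_user           # 500M events/day
events_per_sec = events_per_day / 86_400         # average throughput
peak_eps = events_per_sec * 3                    # rule of thumb: ~3x average at peak
gb_per_day = events_per_day * event_bytes / 1e9  # raw ingest volume
tb_per_year = gb_per_day * 365 / 1_000           # before compression/replication

print(f"{events_per_sec:,.0f} events/sec avg, {gb_per_day:,.0f} GB/day, "
      f"{tb_per_year:,.0f} TB/year")
```

These numbers drive the design: ~6K events/sec average fits a modest streaming cluster, and ~500 GB/day of raw data argues for columnar, compressed storage with partitioning.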
Naming Kafka or Airflow is not enough. Interviewers want to hear why you chose that component over the alternatives, and under what conditions you would choose differently.
Every architecture decision has a cost dimension. Senior candidates discuss cost trade-offs as naturally as they discuss latency trade-offs.
Practice these end to end. Set a 35-minute timer, talk through your design out loud, and draw the architecture on paper or a whiteboard. Then review against the hints below.
Key considerations: real-time trip events vs daily financial reconciliation (batch + streaming trade-off), surge pricing signals that need sub-second latency, driver/rider dimension tables with SCD for profile changes, geo-partitioned storage for regional query performance, and how the platform serves both operational dashboards and monthly business reviews from the same data.
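For the SCD point above, interviewers usually expect Type 2: a profile change closes the current dimension row and opens a new versioned one. A minimal in-memory sketch, with illustrative column names (`driver_id`, `city`, `valid_from`, `valid_to`):

```python
def scd2_apply(dim_rows, change, ts):
    """Type 2 update: expire the active row for the key, then insert a new version."""
    for row in dim_rows:
        if row["driver_id"] == change["driver_id"] and row["valid_to"] is None:
            if row["city"] == change["city"]:
                return dim_rows            # no-op: attribute unchanged
            row["valid_to"] = ts           # close the old version
    dim_rows.append({**change, "valid_from": ts, "valid_to": None})
    return dim_rows

dim = [{"driver_id": 7, "city": "Austin",
        "valid_from": "2024-01-01", "valid_to": None}]
dim = scd2_apply(dim, {"driver_id": 7, "city": "Denver"}, "2024-06-01")
print(len(dim), dim[0]["valid_to"])  # two versions; old row closed at the change date
```

The payoff to mention: trips that joined the dimension before June still attribute to Austin, which a Type 1 overwrite would silently rewrite.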
Key considerations: fact table grain (one row per order line item?), slowly changing dimensions for seller profiles, a bridge table for orders with multiple items, currency conversion for international transactions, and the trade-off between a fully normalized OLTP-style schema vs a denormalized star schema for analytics.
Key considerations: streaming architecture with sub-30-second latency, feature computation (transaction velocity, geo-distance from last transaction), the consistency vs availability trade-off (blocking a legitimate transaction vs missing fraud), model serving infrastructure, and how to A/B test new fraud models without increasing risk.
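The two features named above (transaction velocity, geo-distance from the last transaction) can be computed with per-card state in the streaming job. A sketch under assumed parameters: a 10-minute sliding window and haversine distance; the state class and field names are illustrative.

```python
import math
from collections import deque

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    r = 6371  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

class CardFeatures:
    """Per-card state kept by the streaming job (keyed by card number)."""
    def __init__(self, window_s=600):
        self.window_s = window_s
        self.times = deque()   # timestamps inside the sliding window
        self.last_loc = None

    def update(self, ts, lat, lon):
        self.times.append(ts)
        while self.times and ts - self.times[0] > self.window_s:
            self.times.popleft()                 # evict expired events
        dist = 0.0 if self.last_loc is None else haversine_km(*self.last_loc, lat, lon)
        self.last_loc = (lat, lon)
        return {"velocity_10m": len(self.times), "km_from_last": dist}

card = CardFeatures()
card.update(0, 40.71, -74.01)       # transaction in New York
f = card.update(60, 51.51, -0.13)   # London, one minute later: impossible travel
print(f["velocity_10m"], round(f["km_from_last"]))
```

This is also where the consistency trade-off bites: keeping this state strictly consistent across regions adds latency, while eventual consistency risks scoring a transaction against stale state.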
Key considerations: the architecture for ingestion at scale (centralized vs per-source connectors), how to normalize different schemas into a common model, the storage layer trade-off (data lake for raw, warehouse for curated), how to handle sources with wildly different data volumes and update frequencies, and the serving layer for self-service analytics.
Key considerations: event collection with experiment assignment tracking, the trade-off between pre-computing experiment results (fast queries, stale data) vs computing on demand (slow queries, fresh data), statistical computation (sample sizes, confidence intervals), how to handle users who switch between experiment groups, and how to prevent peeking bias in the serving layer.
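For the statistical computation above, the serving layer typically reports a two-proportion z-test per metric. A minimal sketch with made-up conversion counts; real inputs come from the assignment-tracked events:

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """z-statistic for the difference in conversion rate between two groups."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)          # pooled rate under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Assumed counts: 5.0% control vs 5.6% treatment conversion, 20K users each
z = two_proportion_z(conv_a=1_000, n_a=20_000, conv_b=1_120, n_b=20_000)
print(round(z, 2))  # |z| > 1.96 -> significant at the 95% level
```

Pre-computing this per experiment is cheap; the harder serving-layer problem is refusing to show it before the planned sample size is reached, which is how peeking bias gets prevented in practice.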
System design interviews test fundamentals. Practice SQL, Python, and data modeling so your design answers are grounded in real implementation experience.
When to use each architecture pattern with cost and complexity trade-offs
Operational questions: debugging, backfills, error handling, and schema evolution in running pipelines
Transformation timing decisions that come up in every system design discussion