Data Engineering Interview Prep

Data Engineering System Design Interview

System design rounds test your ability to architect a data platform on a whiteboard. You will draw components, discuss trade-offs (consistency vs availability, cost vs latency), estimate capacity, and defend your choices. No code. All reasoning.

This guide covers whiteboard architecture and trade-off analysis. For operational questions like debugging production failures, backfill strategies, and error handling, see our pipeline interview questions guide.

How to Structure Your Answer in a System Design Interview

Most candidates fail system design by jumping straight to drawing boxes and arrows. The strongest answers follow a consistent structure that demonstrates both technical depth and communication skill.

Step 1: Clarify requirements (3-5 min). Ask about data volume, freshness requirements, query patterns, and who consumes the output. "Is this a daily batch report or a real-time dashboard?" changes everything about your design.

Step 2: Draw the high-level flow (5 min). Sources on the left, storage on the right, processing in the middle. Label each component. Do not pick specific tools yet. Just name the layers: ingestion, transformation, storage, serving.

Step 3: Walk through data flow (10 min). Follow one record from source to output. Explain what happens at each stage: validation, transformation, deduplication, loading. This is where you demonstrate depth.

Step 4: Discuss failure modes (10 min). What happens when the source is down? When a transformation fails halfway? When data arrives late? Your recovery strategy matters more than your happy-path design.

Step 5: Trade-offs and alternatives (10 min). "I chose batch here because the freshness requirement is 1 hour. If that changed to 1 minute, I would swap this component for a streaming consumer." Show that your design is deliberate, not default.

1. Batch vs Streaming: The First Architecture Decision

This is the first fork in any system design answer. Interviewers test whether you can reason about the trade-off from first principles, not just recite definitions.

  1. Batch is right when data freshness requirements are hours, not seconds. Daily revenue reports, weekly cohort analysis, and monthly aggregations are all batch workloads. Choosing batch when it suffices shows maturity.
  2. Streaming is right when business decisions depend on seconds-to-minutes freshness: fraud detection, real-time recommendations, live dashboards for operations teams.
  3. The Lambda architecture (batch + streaming in parallel) sounds elegant but doubles your maintenance burden. Prefer one or the other unless the business requirement genuinely demands both. State this trade-off explicitly in your answer.
  4. Cost is part of the architecture. Streaming infrastructure runs 24/7 and costs 3-10x more than equivalent batch processing. The interviewer wants to hear that you weigh economics alongside latency requirements.
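The decision rule above can be reduced to a short sketch: start from the stated freshness requirement, not from a default tool. The function name and the 300-second threshold below are illustrative assumptions, not a universal rule.

```python
def choose_processing_mode(freshness_seconds: float) -> str:
    """Map a stated freshness requirement to a processing mode.

    The 300-second threshold is an illustrative assumption; the point
    is to anchor the batch-vs-streaming choice in a requirement.
    """
    if freshness_seconds <= 300:
        return "streaming"   # fraud detection, live ops dashboards
    return "batch"           # hourly/daily/weekly reporting


print(choose_processing_mode(5))      # streaming
print(choose_processing_mode(3600))   # batch
```

In an interview, stating the threshold out loud ("anything under a few minutes pushes me to streaming") makes the same point without code.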

2. Consistency vs Availability Trade-offs

Data systems force you to choose. Can your dashboard show slightly stale data? Can your pipeline tolerate duplicate records temporarily? Your answer to these questions shapes the entire architecture.

  1. Strong consistency (every reader sees the latest write) is expensive. It requires synchronous replication, distributed locks, or serializable transactions. Know when the business actually needs it vs when eventual consistency is fine.
  2. Eventual consistency is cheaper and faster but means downstream consumers may see stale or temporarily inconsistent data. For analytical workloads, this is usually acceptable. Say so in your answer.
  3. Exactly-once delivery is a spectrum, not a binary. At-least-once with idempotent consumers is the practical pattern for most data pipelines. If you claim exactly-once, the interviewer will push back.
  4. When drawing your architecture, label each connection with its consistency guarantee. This demonstrates that you think about data correctness at every boundary, not just at the endpoints.
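The at-least-once-plus-idempotent-consumer pattern from point 3 can be sketched as a consumer that deduplicates on a unique event ID. The class and field names are hypothetical; in production the seen-ID store would be a database or a warehouse `MERGE` on a key, not an in-memory set.

```python
class IdempotentConsumer:
    """Tolerates at-least-once delivery by deduplicating on event ID."""

    def __init__(self):
        self.seen_ids = set()   # stand-in for a durable dedup store
        self.processed = []

    def handle(self, event: dict) -> bool:
        event_id = event["event_id"]
        if event_id in self.seen_ids:
            return False        # duplicate redelivery: safe to drop
        self.seen_ids.add(event_id)
        self.processed.append(event)
        return True


consumer = IdempotentConsumer()
consumer.handle({"event_id": "e1", "amount": 10})
consumer.handle({"event_id": "e1", "amount": 10})  # redelivered duplicate
print(len(consumer.processed))  # 1
```

The effect is exactly-once *processing* on top of at-least-once *delivery*, which is the claim interviewers will accept.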

3. Storage Layer Architecture

Where your data lives determines how fast you can query it, how much it costs, and how flexible your schema can be. This is where many candidates go shallow. Go deep.

  1. Data lake vs data warehouse is not either/or. Most modern architectures use both: raw data in object storage (S3, GCS), curated data in a warehouse (Snowflake, BigQuery). Explain the layering in your answer.
  2. File format matters. Parquet for analytical queries (columnar, compressed, schema-embedded). Avro for streaming (row-based, schema evolution). JSON for flexibility at the cost of performance. Pick the right format for each layer and explain why.
  3. Partitioning strategy drives query performance and cost. Partition by date for time-series workloads. Cluster by high-cardinality filter columns. Always state your partitioning choice when designing a table.
  4. Materialized views and pre-aggregation tables trade storage and refresh cost for query speed. In your design, identify which queries are latency-sensitive and pre-compute those results.
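The date-partitioning advice in point 3 usually surfaces as a Hive-style path layout in object storage, which lets query engines prune partitions on date filters. The bucket and table names below are hypothetical; the `dt=YYYY-MM-DD` convention is the common pattern.

```python
from datetime import date


def partition_path(bucket: str, table: str, event_date: date, part: int) -> str:
    """Build a Hive-style partition path so engines can prune on date."""
    return (f"s3://{bucket}/{table}/"
            f"dt={event_date.isoformat()}/part-{part:05d}.parquet")


print(partition_path("analytics-raw", "trip_events", date(2024, 1, 15), 0))
# s3://analytics-raw/trip_events/dt=2024-01-15/part-00000.parquet
```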

4. Capacity Estimation and Scaling

System design interviews expect back-of-envelope math. How many events per second? How much storage per day? What compute do you need? These numbers drive your architecture choices.

  1. Start with the input rate. If the prompt says "1M events per minute," convert that to ~17K events/second and estimate record size. This gives you throughput in MB/s, which determines whether you need a message queue, what tier of compute, and how much storage per day.
  2. Storage growth compounds. 1M events/minute at 500 bytes each is 720 GB/day raw. With compression (3-5x for Parquet), that is 150-240 GB/day. Over a year, that is 50-85 TB. State these numbers to show you think about operational cost.
  3. Scaling bottlenecks differ by layer. Ingestion is usually network-bound. Transformation is CPU-bound. Storage is IOPS-bound for random access, throughput-bound for scans. Identify which layer is your bottleneck and design around it.
  4. Right-size your compute. A daily batch job that runs for 10 minutes does not need a cluster running 24/7. Mention auto-scaling and ephemeral compute to show cost awareness.
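The back-of-envelope math in points 1 and 2 can be packaged as a small calculator; a 4x compression ratio is assumed here, within the 3-5x Parquet range stated above.

```python
def capacity_estimate(events_per_minute: int, bytes_per_event: int,
                      compression_ratio: float) -> dict:
    """Back-of-envelope throughput and storage math.

    Inputs mirror the 1M events/min, 500-byte example above.
    """
    events_per_second = events_per_minute / 60
    raw_gb_per_day = events_per_minute * bytes_per_event * 60 * 24 / 1e9
    compressed_gb_per_day = raw_gb_per_day / compression_ratio
    return {
        "events_per_second": round(events_per_second),
        "raw_gb_per_day": round(raw_gb_per_day),
        "compressed_gb_per_day": round(compressed_gb_per_day),
        "compressed_tb_per_year": round(compressed_gb_per_day * 365 / 1000),
    }


print(capacity_estimate(1_000_000, 500, 4.0))
# ~17K events/s, 720 GB/day raw, 180 GB/day compressed, ~66 TB/year
```

Practicing this arithmetic until it is automatic frees you to spend interview time on the architecture, not the multiplication.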

5. Component Selection and Justification

Naming Kafka or Airflow is not enough. Interviewers want to hear why you chose that component over the alternatives, and under what conditions you would choose differently.

  1. Lead with the requirement, then justify the tool. "We need a durable message queue with replay capability for consumer recovery, so I would use Kafka here. If we only needed simple task queueing, SQS would be simpler and cheaper."
  2. For orchestration, explain the trade-off between managed services (Step Functions, Cloud Composer) and self-hosted (Airflow, Dagster). Managed reduces ops burden but limits customization.
  3. For compute, distinguish between the SQL engine (warehouse queries), the processing framework (Spark, Flink for heavy transforms), and lightweight scripts (Python for small transforms). Not every stage needs the same engine.
  4. Draw clear boundaries between components. Each component should have a single responsibility. The orchestrator schedules. The queue buffers. The warehouse stores and queries. When you couple responsibilities, explain why.

6. Cost vs Latency Trade-offs

Every architecture decision has a cost dimension. Senior candidates discuss cost trade-offs as naturally as they discuss latency trade-offs.

  1. Streaming costs 3-10x more than batch for the same data volume because the infrastructure runs continuously. Quantify this in your answer when possible. If the business needs sub-minute freshness, the premium is justified. If daily is fine, batch wins.
  2. Storage tiering saves money. Hot data (last 30 days) in the warehouse for fast queries. Cold data (older than 90 days) in object storage for cheap archival. Define your retention policy and tiering strategy.
  3. Compute costs scale with query complexity and data volume. Pre-aggregating common query patterns into materialized views costs refresh compute but saves 10-100x on downstream queries. Identify which queries justify this investment.
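The tiering trade-off in point 2 is easy to quantify on the whiteboard. The per-TB prices below are placeholder assumptions standing in for a warehouse tier vs an object-storage archive tier, not real quotes.

```python
def monthly_storage_cost(total_tb: float, hot_fraction: float,
                         hot_usd_per_tb: float = 23.0,
                         cold_usd_per_tb: float = 4.0) -> float:
    """Compare tiered vs all-hot monthly storage cost.

    Prices are illustrative assumptions, not vendor quotes.
    """
    hot_tb = total_tb * hot_fraction
    cold_tb = total_tb - hot_tb
    return hot_tb * hot_usd_per_tb + cold_tb * cold_usd_per_tb


all_hot = monthly_storage_cost(100, hot_fraction=1.0)
tiered = monthly_storage_cost(100, hot_fraction=0.1)   # last 30 days hot
print(f"all-hot ${all_hot:.0f}/mo vs tiered ${tiered:.0f}/mo")
# all-hot $2300/mo vs tiered $590/mo
```

Even with rough numbers, showing a ~4x saving from tiering signals the cost awareness interviewers are listening for.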

5 Practice Design Scenarios

Practice these end to end. Set a 35-minute timer, talk through your design out loud, and draw the architecture on paper or a whiteboard. Then review against the hints below.

Scenario 1: Design the data platform for a ride-sharing company

Key considerations: real-time trip events vs daily financial reconciliation (batch + streaming trade-off), surge pricing signals that need sub-second latency, driver/rider dimension tables with SCD for profile changes, geo-partitioned storage for regional query performance, and how the platform serves both operational dashboards and monthly business reviews from the same data.

Scenario 2: Design a data warehouse for a marketplace with buyers, sellers, orders, and reviews

Key considerations: fact table grain (one row per order line item?), slowly changing dimensions for seller profiles, a bridge table for orders with multiple items, currency conversion for international transactions, and the trade-off between a fully normalized OLTP-style schema vs a denormalized star schema for analytics.

Scenario 3: Design a real-time fraud detection system for a payments company

Key considerations: streaming architecture with sub-30-second latency, feature computation (transaction velocity, geo-distance from last transaction), the consistency vs availability trade-off (blocking a legitimate transaction vs missing fraud), model serving infrastructure, and how to A/B test new fraud models without increasing risk.

Scenario 4: Design a data platform that unifies 50 SaaS sources into a single analytics layer

Key considerations: the architecture for ingestion at scale (centralized vs per-source connectors), how to normalize different schemas into a common model, the storage layer trade-off (data lake for raw, warehouse for curated), how to handle sources with wildly different data volumes and update frequencies, and the serving layer for self-service analytics.

Scenario 5: Design the analytics backend for an A/B testing platform

Key considerations: event collection with experiment assignment tracking, the trade-off between pre-computing experiment results (fast queries, stale data) vs computing on demand (slow queries, fresh data), statistical computation (sample sizes, confidence intervals), how to handle users who switch between experiment groups, and how to prevent peeking bias in the serving layer.

System Design Interview FAQ

How do I structure a system design answer in an interview?
Follow this order: (1) Ask clarifying questions for 3-5 minutes. Nail down scale, latency requirements, and use cases. (2) Draw the high-level architecture: sources, ingestion, processing, storage, serving. (3) Walk through the data flow end to end. (4) Discuss failure modes and how you handle them. (5) Address monitoring and alerting. (6) Discuss trade-offs and alternatives. Spend roughly equal time on each step.
Should I mention specific tools like Kafka, Airflow, or Spark?
Yes, but only after establishing the requirements and architecture. Say: "For the message queue layer, I would use Kafka because we need high throughput and replay capability." Do not lead with tools. Lead with the problem, then justify the tool choice. If you name a tool, be ready to explain why you chose it over alternatives.
How technical should my system design answer be?
Technical enough to show you can build it, not so technical that you get lost in details. Mention partition keys, serialization formats, and retry strategies. Do not go line-by-line through config files. The interviewer wants to see that you understand the building blocks and how they connect. Depth on one or two components is better than shallow coverage of everything.
How is data engineering system design different from backend system design?
Backend system design focuses on request/response latency, load balancing, caching, and database scaling. Data engineering system design focuses on throughput, data quality, batch vs streaming processing, and analytical query performance. The evaluation criteria are different: backend is measured in p99 latency; data engineering is measured in data freshness, correctness, and cost efficiency.

Ground Your Design Skills in Real Practice

System design interviews test fundamentals. Practice SQL, Python, and data modeling so your design answers are grounded in real implementation experience.