Updated April 2026 · By The DataDriven Team

Data Engineering System Design Interview

System design rounds test your ability to architect a data platform on a whiteboard. You will draw components, discuss trade-offs (consistency vs availability, cost vs latency), estimate capacity, and defend your choices. No code. All reasoning.

This guide covers whiteboard architecture and trade-off analysis. For operational questions like debugging production failures, backfill strategies, and error handling, see our pipeline interview questions guide. For SQL interview prep, see our dedicated SQL guide. And if you need to review data modeling concepts, start there before tackling full system design.

How to Structure Your Answer in a System Design Interview

Most candidates fail system design by jumping straight to drawing boxes and arrows. The strongest answers follow a consistent structure that demonstrates both technical depth and communication skill.

Step 1: Clarify requirements (3-5 min). Ask about data volume, freshness requirements, query patterns, and who consumes the output. "Is this a daily batch report or a real-time dashboard?" changes everything about your design.

Step 2: Draw the high-level flow (5 min). Sources on the left, storage on the right, processing in the middle. Label each component. Do not pick specific tools yet. Just name the layers: ingestion, transformation, storage, serving.

Step 3: Walk through data flow (10 min). Follow one record from source to output. Explain what happens at each stage: validation, transformation, deduplication, loading. This is where you demonstrate depth.

Step 4: Discuss failure modes (10 min). What happens when the source is down? When a transformation fails halfway? When data arrives late? Your recovery strategy matters more than your happy-path design.

Step 5: Trade-offs and alternatives (10 min). "I chose batch here because the freshness requirement is 1 hour. If that changed to 1 minute, I would swap this component for a streaming consumer." Show that your design is deliberate, not default.

Understanding batch vs streaming trade-offs and ETL vs ELT patterns is essential background for every system design conversation.

5-Step Answer Framework (Quick Reference)

  Step    Time     What to Do
  -----   ------   ------------------------------------------
  1       3-5 m    Clarify: volume, latency, consumers, SLAs
  2       5 m      Draw: sources > ingest > process > store > serve
  3       10 m     Walk: trace one record end-to-end
  4       10 m     Failures: late data, partial fails, recovery
  5       10 m     Trade-offs: why this, not that, and when to switch

Print this or memorize it. Every system design answer you give should hit all five steps. The interviewer is evaluating your process as much as your architecture. Skipping Step 1 (clarification) is the single most common reason candidates receive a "no hire" signal.

1. Batch vs Streaming: The First Architecture Decision

The first fork in any system design answer. Interviewers test whether you can reason about this trade-off from first principles, not just recite definitions.

  1. Batch is right when data freshness requirements are hours, not seconds. Daily revenue reports, weekly cohort analysis, and monthly aggregations are all batch workloads. Choosing batch when it suffices shows maturity.
  2. Streaming is right when business decisions depend on seconds-to-minutes freshness: fraud detection, real-time recommendations, live dashboards for operations teams.
  3. The Lambda architecture (batch + streaming in parallel) sounds elegant but doubles your maintenance burden. Prefer one or the other unless the business requirement genuinely demands both. State this trade-off explicitly in your answer.
  4. Cost is part of the architecture. Streaming infrastructure runs 24/7 and costs 3-10x more than equivalent batch processing. The interviewer wants to hear that you weigh economics alongside latency requirements.

2. Consistency vs Availability Trade-offs

Data systems force you to choose. Can your dashboard show slightly stale data? Can your pipeline tolerate duplicate records temporarily? Your answer to these questions shapes the entire architecture.

  1. Strong consistency (every reader sees the latest write) is expensive. It requires synchronous replication, distributed locks, or serializable transactions. Know when the business actually needs it vs when eventual consistency is fine.
  2. Eventual consistency is cheaper and faster but means downstream consumers may see stale or temporarily inconsistent data. For analytical workloads, this is usually acceptable. Say so in your answer.
  3. Exactly-once delivery is a spectrum, not a binary. At-least-once with idempotent consumers is the practical pattern for most data pipelines. If you claim exactly-once, the interviewer will push back. A minimal version of this pattern is sketched after this list.
  4. When drawing your architecture, label each connection with its consistency guarantee. This demonstrates that you think about data correctness at every boundary, not just at the endpoints.
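
A minimal Python sketch of that idempotent-consumer pattern, assuming at-least-once delivery; the event fields are illustrative, and the in-memory set stands in for a durable dedup store:

  seen = set()  # stand-in for a durable dedup store (e.g., a keyed table)

  def process(event):
      print("loaded", event["event_id"])  # stand-in for the real write

  def handle(event):
      key = (event["event_id"], event["event_ts"])  # dedup key
      if key in seen:
          return  # duplicate delivery from a retry; safe to skip
      process(event)
      seen.add(key)  # mark only after the write succeeds

  evt = {"event_id": "e1", "event_ts": "2026-04-01T00:00:00Z"}
  handle(evt)
  handle(evt)  # redelivery is a no-op: effectively-once downstream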

3. Storage Layer Architecture

Where your data lives determines how fast you can query it, how much it costs, and how flexible your schema can be. This is where many candidates go shallow. Go deep.

  1. Data lake vs data warehouse is not either/or. Most modern architectures use both: raw data in object storage (S3, GCS), curated data in a warehouse (Snowflake, BigQuery). Explain the layering in your answer.
  2. File format matters. Parquet for analytical queries (columnar, compressed, schema-embedded). Avro for streaming (row-based, schema evolution). JSON for flexibility at the cost of performance. Pick the right format for each layer and explain why.
  3. Partitioning strategy drives query performance and cost. Partition by date for time-series workloads. Cluster by high-cardinality filter columns. Always state your partitioning choice when designing a table (see the sketch after this list).
  4. Materialized views and pre-aggregation tables trade storage and refresh cost for query speed. In your design, identify which queries are latency-sensitive and pre-compute those results.
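
A minimal PySpark sketch of items 2 and 3 together, writing a date-partitioned Parquet table; the paths and column names are hypothetical:

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("partition-demo").getOrCreate()

  events = spark.read.json("s3://example-lake/raw/events/")  # hypothetical path

  # Columnar Parquet plus a date partition lets the engine prune
  # partitions on time-range filters instead of scanning everything.
  (events.write
      .mode("overwrite")
      .partitionBy("event_date")  # low-cardinality time partition
      .parquet("s3://example-lake/curated/events/"))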

4. Capacity Estimation and Scaling

System design interviews expect back-of-envelope math. How many events per second? How much storage per day? What compute do you need? These numbers drive your architecture choices.

  1. Start with the input rate. If the prompt says "1M events per minute," convert that to ~17K events/second and estimate record size. This gives you throughput in MB/s, which determines whether you need a message queue, what tier of compute, and how much storage per day.
  2. Storage growth compounds. 1M events/minute at 500 bytes each is 720 GB/day raw. With compression (3-5x for Parquet), that is roughly 145-240 GB/day. Over a year, that is roughly 53-88 TB. State these numbers to show you think about operational cost (the sketch after this list reproduces the math).
  3. Scaling bottlenecks differ by layer. Ingestion is usually network-bound. Transformation is CPU-bound. Storage is IOPS-bound for random access, throughput-bound for scans. Identify which layer is your bottleneck and design around it.
  4. Right-size your compute. A daily batch job that runs for 10 minutes does not need a cluster running 24/7. Mention auto-scaling and ephemeral compute to show cost awareness.
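
The back-of-envelope math from items 1 and 2, as a runnable sanity check; the inputs are the assumed prompt numbers:

  events_per_min = 1_000_000
  bytes_per_event = 500

  events_per_sec = events_per_min / 60                        # ~16,700/s
  mb_per_sec = events_per_sec * bytes_per_event / 1e6         # ~8.3 MB/s

  raw_gb_day = events_per_min * 60 * 24 * bytes_per_event / 1e9  # 720 GB/day
  parquet_gb_day = (raw_gb_day / 5, raw_gb_day / 3)              # 144-240 GB/day
  yearly_tb = tuple(g * 365 / 1000 for g in parquet_gb_day)      # ~53-88 TB/yr

  print(f"{events_per_sec:,.0f} ev/s, {mb_per_sec:.1f} MB/s, {raw_gb_day:.0f} GB/day")
  print(f"compressed: {parquet_gb_day[0]:.0f}-{parquet_gb_day[1]:.0f} GB/day, "
        f"{yearly_tb[0]:.0f}-{yearly_tb[1]:.0f} TB/yr")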

5. Component Selection and Justification

Naming Kafka or Airflow is not enough. Interviewers want to hear why you chose that component over the alternatives, and under what conditions you would choose differently.

  1. Lead with the requirement, then justify the tool. "We need a durable message queue with replay capability for consumer recovery, so I would use Kafka here. If we only needed simple task queueing, SQS would be simpler and cheaper."
  2. For orchestration, explain the trade-off between managed services (Step Functions, Cloud Composer) and self-hosted (Airflow, Dagster). Managed reduces ops burden but limits customization.
  3. For compute, distinguish between the SQL engine (warehouse queries), the processing framework (Spark, Flink for heavy transforms), and lightweight scripts (Python for small transforms). Not every stage needs the same engine.
  4. Draw clear boundaries between components. Each component should have a single responsibility. The orchestrator schedules. The queue buffers. The warehouse stores and queries. When you couple responsibilities, explain why.

6. Cost vs Latency Trade-offs

Every architecture decision has a cost dimension. Senior candidates discuss cost trade-offs as naturally as they discuss latency trade-offs.

  1. Streaming costs 3-10x more than batch for the same data volume because the infrastructure runs continuously. Quantify this in your answer when possible. If the business needs sub-minute freshness, the premium is justified. If daily is fine, batch wins.
  2. Storage tiering saves money. Hot data (last 30 days) in the warehouse for fast queries. Cold data (older than 90 days) in object storage for cheap archival. Define your retention policy and tiering strategy.
  3. Compute costs scale with query complexity and data volume. Pre-aggregating common query patterns into materialized views costs refresh compute but saves 10-100x on downstream queries. Identify which queries justify this investment.

5 Practice Design Scenarios

Practice these end to end. Set a 35-minute timer, talk through your design out loud, and draw the architecture on paper or a whiteboard. Each scenario below includes a full walkthrough that covers requirements, architecture, data flow, failure modes, and trade-offs.

These scenarios are designed to mirror the types of prompts you will encounter in real interviews at companies like Meta, Uber, Stripe, and Airbnb. For supporting fundamentals, review our data engineering interview prep guide and pipeline interview questions.

Scenario 1: Design the data platform for a ride-sharing company

Requirements to Clarify

  • How many trips per day? (assume 10M for a mid-scale company)
  • Who are the consumers? Operations dashboards, finance reconciliation, data science for surge pricing?
  • What latency does each consumer need? Ops dashboards want sub-minute. Finance is happy with daily.
  • Are we serving multiple regions with data residency requirements?
  • Do we need to support replay of historical data for model retraining?

Architecture Diagram

  +------------+    +------------+    +----------------+
  | Rider App  |--->|            |    |                |
  +------------+    |   Kafka    |--->| Flink (stream) |---> Redis (surge)
  +------------+    |  (events)  |    |  trip events   |---> Ops Dashboard
  | Driver App |--->|            |    +----------------+
  +------------+    +------+-----+
                           |
                           v
                    +------+------+    +---------------+
                    | S3 Raw Lake |    |               |
                    | (Parquet)   |--->| Spark (batch) |
                    |  partitioned|    | daily ETL     |
                    |  by date    |    +-------+-------+
                    +-------------+            |
                                               v
                                     +---------+--------+
                                     | Warehouse        |
                                     | (fact_trips,     |
                                     |  dim_drivers,    |
                                     |  dim_riders,     |
                                     |  dim_locations)  |
                                     +---------+--------+
                                               |
                                     +---------+--------+
                                     | BI / Finance     |
                                     | Reports          |
                                     +------------------+

Data Flow Walkthrough

A rider requests a trip. The rider app emits a trip_requested event to Kafka. The driver app emits trip_accepted, trip_started, and trip_completed events as the trip progresses. Each event contains a trip ID, timestamps, GPS coordinates, and user IDs.

The streaming path (Flink) consumes these events in real time to compute surge pricing signals (trip request density per geo-cell in the last 5 minutes) and feed the operations dashboard. The surge pricing output goes to Redis for sub-millisecond reads by the pricing service.

In parallel, Kafka writes to S3 as Parquet files partitioned by date and region. A daily Spark job reads the raw data, deduplicates by trip ID, joins with driver and rider dimension tables (using slowly changing dimension type 2 for profile changes), and loads the star schema into the warehouse. Finance queries the warehouse for daily revenue reconciliation.
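
A minimal PySpark sketch of the deduplication step in that daily job, assuming hypothetical paths and column names; it keeps the most recent event per trip ID:

  from pyspark.sql import SparkSession, Window
  from pyspark.sql import functions as F

  spark = SparkSession.builder.appName("daily-trip-etl").getOrCreate()

  raw = spark.read.parquet("s3://rides-lake/raw/dt=2026-04-01/")  # hypothetical path

  # One row per trip_id: rank events by recency, keep the newest.
  w = Window.partitionBy("trip_id").orderBy(F.col("event_ts").desc())
  deduped = (raw.withColumn("rn", F.row_number().over(w))
                .filter(F.col("rn") == 1)
                .drop("rn"))

  deduped.write.mode("overwrite").parquet("s3://rides-lake/staged/trips/")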

Failure Modes and Recovery

  • Kafka broker failure: Kafka replication factor of 3 ensures no data loss. Producers retry with idempotent delivery enabled.
  • Flink job crash: Flink checkpoints to S3 every 60 seconds. On restart, it resumes from the last checkpoint, reprocessing at most 60 seconds of events. Downstream consumers must handle duplicates idempotently.
  • Late-arriving events: A trip_completed event may arrive hours late if the driver had no connectivity. The streaming layer uses a watermark with a 10-minute allowed lateness. Events beyond that are dropped from real-time aggregations but still land in the S3 raw lake. The daily batch job picks them up.
  • Batch job failure: The Spark job is orchestrated by Airflow with automatic retries (3 attempts, 15-minute backoff). Idempotent writes to the warehouse (MERGE on trip ID) ensure retries do not create duplicates.
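
The retry setup from the last bullet, sketched as an Airflow DAG (assuming a recent Airflow 2.x; the IDs and command are hypothetical). Because the warehouse load is a MERGE keyed on trip ID, rerunning the task after a retry is safe:

  from datetime import datetime, timedelta
  from airflow import DAG
  from airflow.operators.bash import BashOperator

  with DAG(
      dag_id="daily_trip_etl",                   # hypothetical DAG id
      start_date=datetime(2026, 4, 1),
      schedule="@daily",
      default_args={
          "retries": 3,                          # automatic retries
          "retry_delay": timedelta(minutes=15),  # 15-minute backoff
      },
  ) as dag:
      BashOperator(
          task_id="run_spark_etl",
          bash_command="spark-submit daily_trip_etl.py",  # hypothetical job
      )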

Key Trade-offs

This design uses both batch and streaming, a pragmatic hybrid: streaming handles hot-path analytics while batch feeds the analytical warehouse, without duplicating the same computation in both paths the way full Lambda does. The alternative is pure streaming into the warehouse using a tool like Materialize or Snowflake Snowpipe. The trade-off: pure streaming simplifies the architecture but costs significantly more and makes backfills harder because you lose the batch reprocessing path.

Geo-partitioning the S3 raw data adds partition pruning benefits for regional queries but increases the number of small files. A compaction job (daily, after the main ETL) merges small Parquet files into larger ones to keep query performance high.

Common Mistakes for This Scenario

  • Designing only the streaming path and ignoring that finance needs a reliable daily batch pipeline.
  • Forgetting GPS coordinate storage and the geo-spatial indexing needed for surge pricing computation.
  • Not mentioning SCD type 2 for driver and rider dimensions, which interviewers specifically look for in ride-sharing prompts.

Scenario 2: Design a data warehouse for a marketplace with buyers, sellers, orders, and reviews

Requirements to Clarify

  • How many orders per day? (assume 5M for a mid-scale marketplace)
  • Is the marketplace multi-currency? If yes, we need conversion rates at time of transaction.
  • What are the primary analytical queries? GMV by category, seller performance, buyer retention cohorts?
  • Do reviews need to be linked at the order-item level or the order level?
  • What is the data freshness requirement for the analytics team?

Architecture Diagram

  +----------------+
  | OLTP Database  |    CDC (Debezium)
  | (orders,       |----------------------+
  |  users,        |                      |
  |  products,     |                      v
  |  reviews)      |              +-------+-------+
  +----------------+              | Kafka Topics  |
                                  | (per table)   |
                                  +-------+-------+
                                          |
                          +---------------+---------------+
                          v                               v
                  +-------+-------+             +---------+---------+
                  | S3 Raw Lake   |             | S3 Raw Lake       |
                  | /orders/      |             | /reviews/         |
                  | dt=YYYY-MM-DD |             | dt=YYYY-MM-DD     |
                  +-------+-------+             +---------+---------+
                          |                               |
                          +---------------+---------------+
                                          |
                                          v
                                  +-------+--------+
                                  | dbt (transform)|
                                  | staging >      |
                                  | intermediate > |
                                  | marts          |
                                  +-------+--------+
                                          |
                                          v
                                  +-------+----------+
                                  | Warehouse        |
                                  | fact_order_items |
                                  | dim_buyers       |
                                  | dim_sellers      |
                                  | dim_products     |
                                  | dim_dates        |
                                  | bridge_reviews   |
                                  +------------------+

Data Flow Walkthrough

Debezium captures row-level changes from the OLTP database and publishes them to Kafka topics (one per source table). A Kafka Connect S3 sink writes these events to the raw lake as Parquet files, partitioned by date.

The fact table grain is one row per order line item. Each row contains the order ID, line item sequence, product ID, buyer ID, seller ID, quantity, unit price, currency code, and the converted USD amount (using a currency rate snapshot table). This grain supports both "total GMV by seller" and "average order value by category" queries without additional joins.

Reviews use a bridge table because one order can have multiple reviews (one per item) and one product can have reviews from multiple orders. The bridge table connects fact_order_items to dim_reviews and enables queries like "average review score for products sold by seller X." Seller dimensions use SCD type 2 to track changes in seller tier, verification status, and store name over time.
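
A sketch of the transaction-time currency conversion at this grain, assuming hypothetical staged tables (order_items plus a daily fx_rates snapshot with currency_code, rate_date, usd_rate):

  from pyspark.sql import SparkSession
  from pyspark.sql import functions as F

  spark = SparkSession.builder.appName("fact-order-items").getOrCreate()

  items = spark.read.parquet("s3://mkt-lake/staged/order_items/")  # hypothetical
  rates = spark.read.parquet("s3://mkt-lake/staged/fx_rates/")     # hypothetical

  # Join on the rate in effect on the order date, never today's rate.
  fact = (items
      .join(rates, (items.currency_code == rates.currency_code) &
                   (F.to_date(items.ordered_at) == rates.rate_date), "left")
      .withColumn("usd_amount",
                  F.col("quantity") * F.col("unit_price") * F.col("usd_rate")))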

Failure Modes and Recovery

  • CDC lag: If Debezium falls behind, the Kafka consumer lag metric triggers an alert. The raw lake still receives data, just delayed. Downstream dbt models wait for a data freshness check (a sensor in Airflow) before running.
  • Currency rate gaps: If the exchange rate snapshot is missing for a date, the dbt model falls back to the most recent available rate and flags those rows with a rate_imputed boolean.
  • Schema drift: When the OLTP team adds a column, Debezium forwards the new schema. The S3 sink writes it as a new Parquet column. The dbt staging model must pin to known columns and ignore unknowns, with an alert when new columns appear.

Key Trade-offs

Star schema vs fully normalized: a star schema denormalizes for read performance at the cost of storage and update complexity. For an analytics warehouse, this is the right choice because queries run 5-20x faster with fewer joins. The alternative (normalized 3NF) is better suited for OLTP workloads where write efficiency matters. State this trade-off explicitly. Also consider whether to use Kimball-style dimensional modeling or a Data Vault approach. For marketplaces with rapidly changing business rules, Data Vault provides more flexibility but adds complexity in the transformation layer.

Common Mistakes for This Scenario

  • Setting the fact table grain at the order level instead of the order-line-item level, making per-product analysis impossible without re-architecting.
  • Ignoring multi-currency. If the marketplace is international, every monetary amount needs a conversion, and the rate used must be the one at transaction time, not query time.
  • Forgetting to mention how reviews link to both orders and products (the bridge table pattern).

Scenario 3: Design a real-time fraud detection system for a payments company

Requirements to Clarify

  • What is the maximum acceptable latency from transaction to decision? (assume under 200ms for blocking, under 30s for flagging)
  • What is the false positive tolerance? Blocking a legitimate transaction loses revenue. Missing fraud loses trust.
  • How many transactions per second? (assume 50K TPS at peak)
  • Do we need to support multiple fraud models running simultaneously for A/B testing?
  • What features does the model need? Transaction velocity, geo-distance, merchant category risk scores?

Architecture Diagram

  +-----------+     +-----------+     +------------------+
  | Payment   |---->| Kafka     |---->| Flink            |
  | Gateway   |     | (txn      |     | Feature Engine   |
  +-----------+     |  events)  |     | (velocity, geo,  |
                    +-----+-----+     |  merchant risk)  |
                          |           +--------+---------+
                          |                    |
                          |                    v
                          |           +--------+---------+
                          |           | Feature Store    |
                          |           | (Redis, <5ms)    |
                          |           +--------+---------+
                          |                    |
                          |                    v
                          |           +--------+---------+
                          |           | Model Service    |
                          |           | (fraud score)    |----> BLOCK / ALLOW
                          |           | A/B router       |      (< 200ms)
                          |           +--------+---------+
                          |                    |
                          v                    v
                    +-----+-----+     +--------+---------+
                    | S3 Raw    |     | Labeled Outcomes |
                    | (all txn) |     | (chargebacks,    |
                    +-----+-----+     |  disputes)       |
                          |           +--------+---------+
                          v                    |
                    +-----+-----+              v
                    | Batch     |     +--------+---------+
                    | Retrain   |<----| Training Dataset |
                    | Pipeline  |     | (features +      |
                    +-----+-----+     |  labels joined)  |
                          |           +------------------+
                          v
                    +-----+-----+
                    | Model     |
                    | Registry  |
                    +-----------+

Data Flow Walkthrough

A transaction event arrives from the payment gateway into Kafka. The Flink feature engine consumes it and computes real-time features: number of transactions from this card in the last 10 minutes (velocity), geo-distance from the previous transaction, and merchant category risk score (looked up from a static table in Redis).

These features are written to the feature store (Redis) and simultaneously passed to the model service. The model service runs the fraud scoring model (a gradient-boosted tree or a neural network, depending on the company) and returns a score between 0 and 1. Transactions scoring above 0.85 are blocked. Between 0.6 and 0.85, they are flagged for manual review. Below 0.6, they pass through. The total end-to-end latency budget is 200ms.

For A/B testing new models, the model service uses a routing layer that assigns each transaction to a model variant based on a hash of the card ID. Both models score the transaction, but only the assigned variant's decision is enforced. The other score is logged for offline comparison. This keeps the comparison clean because the assignment is deterministic and the non-enforced model cannot influence the outcome.
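
A minimal sketch of that deterministic routing; the variant names and the 50/50 split are hypothetical:

  import hashlib

  VARIANTS = ["champion", "challenger"]  # hypothetical model identifiers

  def assign_variant(card_id: str, experiment: str = "fraud_model_v2") -> str:
      # Stable hash of experiment + card ID: the same card always routes
      # to the same variant, with no shared state between services.
      digest = hashlib.sha256(f"{experiment}:{card_id}".encode()).hexdigest()
      return VARIANTS[0] if int(digest, 16) % 100 < 50 else VARIANTS[1]

  print(assign_variant("card_123"))  # identical on every call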

Failure Modes and Recovery

  • Model service latency spike: If the model service exceeds the 200ms SLA, a circuit breaker falls back to a rules-based system (e.g., block transactions over $5,000 from new cards). The rules engine is simpler but catches the most obvious fraud patterns (see the sketch after this list).
  • Feature store unavailable: If Redis is down, the Flink engine cannot look up historical features. The fallback is to score with only the features available from the current transaction (amount, merchant category, time of day). Accuracy degrades, but the system stays online.
  • Kafka consumer lag: If the Flink job falls behind, transactions are not scored in real time. An alerting threshold on consumer lag (e.g., more than 30 seconds behind) triggers pager alerts. The payment gateway should have a configurable timeout: if no fraud decision arrives within 500ms, default to ALLOW (availability over consistency for low-risk merchants).
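
A simplified sketch of the fallback logic from the first bullet; the thresholds and the model call are hypothetical, and a real circuit breaker would also track failure rates and half-open probes:

  import time

  def rules_fallback(txn):
      # Coarse rules engine: block large amounts from new cards.
      if txn["amount_usd"] > 5000 and txn["card_age_days"] < 30:
          return "BLOCK"
      return "ALLOW"

  def decide(txn, model_score, budget_ms=200):
      start = time.monotonic()
      try:
          score = model_score(txn)  # hypothetical model-service call
      except Exception:
          return rules_fallback(txn)  # model down: degrade to rules
      if (time.monotonic() - start) * 1000 > budget_ms:
          return rules_fallback(txn)  # blew the latency budget
      return "BLOCK" if score > 0.85 else "ALLOW"

  print(decide({"amount_usd": 6000, "card_age_days": 5}, lambda txn: 0.91))  # BLOCK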

Key Trade-offs

The fundamental trade-off is false positives vs false negatives. Blocking a legitimate $500 transaction costs the company revenue and customer trust. Missing a $500 fraudulent transaction costs a chargeback fee plus the transaction amount. The threshold between BLOCK and ALLOW must be tuned to the business's risk appetite, and different merchant categories may warrant different thresholds. This is a conversation the interviewer wants to hear you initiate, not wait to be asked about.

Common Mistakes for This Scenario

  • Designing a batch fraud system when the requirement is real-time. If you cannot score a transaction before it completes, the fraud has already happened.
  • Ignoring the model retraining loop. Fraud patterns evolve weekly. Without a pipeline that joins outcomes (chargebacks) back to features and retrains, the model degrades rapidly.
  • Not addressing the A/B testing question. Interviewers at payments companies specifically probe how you would safely test a new fraud model without increasing risk exposure.

Scenario 4: Design a data platform that unifies 50 SaaS sources into a single analytics layer

Requirements to Clarify

  • What are the SaaS sources? CRM (Salesforce), support (Zendesk), marketing (HubSpot), billing (Stripe), HR (Workday)?
  • How do these sources expose data? REST APIs, webhooks, SFTP, database replicas?
  • What is the data freshness requirement? Hourly? Daily?
  • Who consumes the unified data? Business analysts in a BI tool? Data scientists in notebooks?
  • Is there a data governance or PII masking requirement?

Architecture Diagram

  +------------+ +----------+ +----------+ +----------+
  | Salesforce | | Zendesk  | | Stripe   | | 47 more  |
  +-----+------+ +----+-----+ +----+-----+ +----+-----+
        |              |            |            |
        v              v            v            v
  +-----+--------------+------------+------------+-----+
  |            Ingestion Layer (Fivetran / Airbyte)     |
  |  Per-source connectors, rate limit handling,        |
  |  incremental sync where API supports cursors        |
  +----------------------------+-----------------------+
                               |
                               v
                    +----------+----------+
                    | S3 Raw Lake         |
                    | /source=salesforce/ |
                    | /source=zendesk/    |
                    | /source=stripe/     |
                    | partitioned by      |
                    | source + sync_date  |
                    +----------+----------+
                               |
                               v
                    +----------+----------+
                    | dbt (staging)       |
                    | Rename, cast,       |
                    | deduplicate,        |
                    | PII mask            |
                    +----------+----------+
                               |
                               v
                    +----------+----------+
                    | dbt (marts)         |
                    | Unified customer    |
                    | entity, revenue     |
                    | metrics, support    |
                    | metrics             |
                    +----------+----------+
                               |
                    +----------+----------+
                    | Warehouse           |
                    | (Snowflake /        |
                    |  BigQuery)          |
                    +----------+----------+
                               |
                    +-----+----+----+-----+
                    |     |         |     |
                    v     v         v     v
                  Looker Notebooks Alerts Reverse
                                         ETL

Data Flow Walkthrough

Each SaaS source has a connector (Fivetran, Airbyte, or a custom Python script for sources without off-the-shelf connectors). Connectors sync incrementally where the API supports cursor-based pagination (Salesforce, Stripe) and do full syncs where it does not (some legacy SFTP sources). Each connector writes raw JSON or CSV to S3, partitioned by source name and sync date.

The dbt staging layer normalizes column names (e.g., AccountId from Salesforce becomes account_id), casts data types, deduplicates records (using the source system's primary key + a sync timestamp tiebreaker), and applies PII masking rules (hashing email addresses, redacting phone numbers).

The dbt marts layer creates unified entities. The most important is the unified customer entity, which joins the Salesforce account ID, the Stripe customer ID, and the Zendesk requester email into a single customer_id using a mapping table. This entity resolution is the hardest engineering problem in this design and should receive the most discussion time. For more on transformation patterns, see our ETL vs ELT guide.
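
A toy sketch of the mapping-table lookup behind that unified entity; the keys and IDs are hypothetical, and real entity resolution adds fuzzy matching and a manual-review queue:

  # One unified customer_id per (source system, source key) pair.
  mapping = {
      ("salesforce", "001A000123"): "cust_001",
      ("stripe", "cus_9xYz"): "cust_001",
      ("zendesk", "ann@example.com"): "cust_001",
  }

  def resolve(source, source_key):
      # None means unmatched: the weekly reconciliation job flags it.
      return mapping.get((source, source_key))

  print(resolve("stripe", "cus_9xYz"))    # cust_001
  print(resolve("zendesk", "new@x.com"))  # None -> review queue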

Failure Modes and Recovery

  • API rate limiting: Salesforce caps API usage (on the order of 100K calls per 24 hours, depending on edition). The connector must implement exponential backoff and track quota consumption. If a sync exceeds the quota, it pauses and resumes in the next window rather than failing (see the backoff sketch after this list).
  • Schema changes in source SaaS: When Salesforce adds or removes a field, the raw layer captures it without breakage (schema-on-read). The dbt staging model pins to expected columns and raises an alert when unknown columns appear or expected columns disappear.
  • Entity resolution drift: When a customer has a new email in Zendesk that does not match Salesforce, the mapping table fails to join. A weekly reconciliation job flags unmatched records for manual review by the data team.
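
A minimal sketch of that backoff pattern, assuming a hypothetical fetch_page function that raises RateLimitError on HTTP 429:

  import time

  class RateLimitError(Exception):
      pass

  def fetch_with_backoff(fetch_page, cursor, max_retries=5):
      delay = 1.0
      for _ in range(max_retries):
          try:
              return fetch_page(cursor)    # hypothetical API call
          except RateLimitError:
              time.sleep(delay)            # wait before retrying
              delay = min(delay * 2, 300)  # double, cap at 5 minutes
      # Quota likely exhausted: stop and resume in the next sync window.
      raise RuntimeError("rate limited; resume next window")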

Key Trade-offs

Build vs buy for connectors. Fivetran/Airbyte handle 80% of sources out of the box but cost roughly $1-3K per connector per year at scale (50 connectors = $50-150K/year). Custom connectors are cheaper but require engineering maintenance. The hybrid approach (managed connectors for complex APIs like Salesforce, custom scripts for simple SFTP/CSV sources) is usually optimal. State this trade-off and let the interviewer push back.

Common Mistakes for This Scenario

  • Treating all 50 sources as equivalent. In reality, Salesforce and Stripe dominate the data volume and complexity. Spend 70% of your design time on the 3-5 hardest sources.
  • Ignoring entity resolution entirely. The interviewer wants to hear how you join a Salesforce account to a Stripe customer to a Zendesk requester. If you skip this, you miss the core challenge.
  • Designing a single monolithic pipeline instead of per-source connectors. If one source fails, it should not block the other 49.

Scenario 5: Design the analytics backend for an A/B testing platform

Requirements to Clarify

  • How many concurrent experiments? (assume 200)
  • How many events per day across all experiments? (assume 500M)
  • Do experiment owners need real-time results or are daily refreshes acceptable?
  • What statistical methods? Frequentist (p-values, confidence intervals) or Bayesian?
  • How are users assigned to experiment groups? Client-side SDK? Server-side?

Architecture Diagram

  +------------+    +----------+    +-----------------+
  | Client SDK |    |          |    |                 |
  | (assigns   |--->|  Kafka   |--->| S3 Raw Lake     |
  |  variant,  |    | (events) |    | /experiment_id/ |
  |  emits     |    |          |    | /dt=YYYY-MM-DD/ |
  |  events)   |    +----+-----+    +--------+--------+
  +------------+         |                   |
                         |                   v
                         |         +---------+---------+
                         |         | Spark (daily)     |
                         |         | - join events to  |
                         |         |   assignments     |
                         |         | - aggregate per   |
                         |         |   experiment +    |
                         |         |   variant + metric|
                         |         +---------+---------+
                         |                   |
                         |                   v
                         |         +---------+---------+
                         |         | Stats Engine      |
                         |         | - sample sizes    |
                         |         | - confidence      |
                         |         |   intervals       |
                         |         | - p-values        |
                         |         | - sequential      |
                         |         |   testing bounds  |
                         |         +---------+---------+
                         |                   |
                         v                   v
                    +-----+------+   +--------+----------+
                   | Assignment |   | Experiment        |
                   | Log Table  |   | Results Table     |
                   | (user_id,  |   | (experiment_id,   |
                   |  exp_id,   |   |  variant,         |
                   |  variant,  |   |  metric,          |
                   |  ts)       |   |  sample_size,     |
                   +------------+   |  mean, ci_lower,  |
                                    |  ci_upper, pval)  |
                                    +----------+--------+
                                               |
                                               v
                                    +----------+--------+
                                    | Experiment        |
                                    | Dashboard         |
                                    +-------------------+

Data Flow Walkthrough

The client SDK assigns users to experiment variants using a deterministic hash of user ID + experiment ID. This ensures consistent assignment across sessions without a server round-trip. The SDK emits events (page views, clicks, conversions) to Kafka, each tagged with the experiment ID and assigned variant.

The daily Spark job reads events from S3, joins them with the assignment log (to handle cases where events arrive without variant tags), and computes per-experiment, per-variant, per-metric aggregates: count, sum, mean, and variance. These aggregates feed into the statistics engine, which computes confidence intervals, p-values, and minimum detectable effect sizes.

To prevent peeking bias (experiment owners checking results too early and calling winners prematurely), the platform uses sequential testing bounds. The dashboard displays results but disables the "declare winner" button until the pre-registered sample size is reached or the sequential test crosses its boundary. This is a critical design decision that interviewers look for.

Failure Modes and Recovery

  • Assignment leakage: If the SDK hash function is poorly distributed, variants may have unequal sample sizes. A daily check compares variant sizes and alerts if the ratio deviates more than 5% from the expected split (e.g., 50/50). A simplified version of this check is sketched after this list.
  • Users switching groups: If a user clears cookies and gets a new assignment, they appear in two variants. The assignment log tracks user ID (not cookie ID) as the primary key. If a user appears in multiple variants, the system uses their first assignment and discards subsequent ones.
  • Metric pipeline failure: If the daily Spark job fails, experiment results are stale but not wrong. The dashboard shows "last updated" timestamps. Experiment owners see that results are from yesterday, not today, and can decide whether to wait or proceed.
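
A simplified sketch of that variant-balance check; a production platform would use a chi-square sample-ratio-mismatch test, while this version flags relative deviation above 5%:

  def sample_ratio_alert(counts, expected, tolerance=0.05):
      # counts: observed users per variant; expected: planned split fractions.
      total = sum(counts.values())
      for variant, frac in expected.items():
          observed = counts.get(variant, 0) / total
          if abs(observed - frac) / frac > tolerance:
              return True  # alert: possible assignment bug or leakage
      return False

  print(sample_ratio_alert({"control": 5300, "treatment": 4700},
                           {"control": 0.5, "treatment": 0.5}))  # True -> alert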

Key Trade-offs

Pre-computed vs on-demand results. Pre-computing experiment results daily (materialize all metrics for all experiments into a results table) means the dashboard loads instantly but results are up to 24 hours stale. On-demand computation (query raw events at dashboard load time) gives fresh results but takes 30-60 seconds per experiment with 500M events/day. The hybrid approach pre-computes for active experiments and archives completed experiments. State this trade-off and note that most production A/B testing platforms (Statsig, Optimizely, Eppo) use pre-computation.

Common Mistakes for This Scenario

  • Ignoring peeking bias entirely. If experiment owners can check results daily and stop the experiment when they see significance, the false positive rate inflates dramatically. The interviewer will probe this.
  • Using server-side assignment without explaining the latency cost. Client-side hashing is faster but leaks experiment configuration. Server-side is more secure but adds a network round-trip to every page load.
  • Not discussing the statistical methodology. You do not need to derive formulas, but you should know the difference between fixed-horizon tests and sequential testing, and when each is appropriate.

Common Mistakes in System Design Interviews

These mistakes come up repeatedly across hundreds of system design interviews. Each one is a pattern that interviewers notice immediately and that costs candidates points. Study them before your next round.

1. Skipping Requirements Clarification

Jumping to architecture without understanding the problem. The interviewer intentionally leaves the prompt vague. If you do not ask about data volume, latency requirements, and consumer needs, you are designing blind. Spend 3-5 minutes on this. It is never wasted time.

2. Leading with Tools Instead of Requirements

"We will use Kafka, Spark, and Snowflake" is not an architecture. It is a shopping list. Instead: "We need a durable message queue because the source emits 50K events/sec and the consumer processes at 10K/sec, so we need buffering with replay. Kafka fits because..." The requirement justifies the tool, not the other way around.

3. Ignoring Failure Modes Entirely

The happy path is 30% of the evaluation. The interviewer cares more about what happens when Kafka is down, when a Spark job fails at hour 3 of a 4-hour run, when data arrives 6 hours late. If you only discuss the happy path, you cap your rating at "meets expectations." Proactively raise failure modes before the interviewer asks.

4. Over-engineering with Lambda Architecture

Running batch and streaming in parallel for the same data doubles the complexity, doubles the cost, and creates a consistency problem between the two paths. Unless the business genuinely requires both sub-second and daily-aggregated views of the same data, pick one. Mention the Lambda trade-off and explain why you are choosing not to use it. That shows more maturity than defaulting to it. See our batch vs streaming guide for the full analysis.

5. Not Drawing a Diagram

Talking without drawing forces the interviewer to build the architecture in their head. Draw boxes. Draw arrows. Label them. A visual reference keeps the conversation grounded and prevents misunderstandings. Even in a virtual interview, use the shared whiteboard or screen-share a drawing tool.

6. Going Too Deep on One Component

Spending 15 minutes on Kafka partition strategy while never mentioning the storage or serving layer is a signal that you cannot scope a system. Cover the full architecture at medium depth first. Then ask the interviewer which area to drill into. They will tell you. This keeps you from over-investing in one area that the interviewer may not even be evaluating.

7. Forgetting Cost as a Design Constraint

An architecture that processes 1M events/day with a $50K/month cloud bill when a $2K/month solution exists is not a good design. Senior-level candidates discuss cost alongside performance. Mention spot instances for batch workloads, reserved capacity for streaming, and storage tiering for cold data. The interviewer wants to see that you can build something the company can afford to operate.

8. Claiming Exactly-Once Processing

In distributed systems, exactly-once delivery is extremely difficult to achieve end-to-end. If you claim it, the interviewer will drill into how you guarantee it across every component boundary. The practical pattern is at-least-once delivery with idempotent consumers. Deduplication keys (event ID + timestamp) at the consumer level give you effectively-once semantics without the complexity of distributed transactions.

9. Not Stating Trade-offs Explicitly

Every design choice excludes alternatives. If you choose batch processing, state what you give up (freshness) and what you gain (simplicity, cost). If you choose streaming, state what you give up (cost, complexity) and what you gain (low latency). The interviewer is not looking for the "right" answer. They are looking for structured reasoning about trade-offs. Candidates who state trade-offs unprompted consistently score higher.

10. Ignoring Data Quality and Validation

A pipeline that moves data without validating it is a liability. Mention schema validation at ingestion, null checks and range checks in the transformation layer, row count reconciliation between source and destination, and data freshness monitoring. These are operational concerns that separate production-ready designs from whiteboard exercises. For more on pipeline reliability patterns, see our pipeline interview questions guide.

System Design Interview FAQ

How do I structure a system design answer in an interview?
Follow this order: (1) Ask clarifying questions for 3-5 minutes. Nail down scale, latency requirements, and use cases. (2) Draw the high-level architecture: sources, ingestion, processing, storage, serving. (3) Walk through the data flow end to end. (4) Discuss failure modes and how you handle them, including monitoring and alerting. (5) Close with trade-offs and alternatives. After clarification, spend roughly equal time on each remaining step.

Should I mention specific tools like Kafka, Airflow, or Spark?
Yes, but only after establishing the requirements and architecture. Say: "For the message queue layer, I would use Kafka because we need high throughput and replay capability." Do not lead with tools. Lead with the problem, then justify the tool choice. If you name a tool, be ready to explain why you chose it over alternatives.

How technical should my system design answer be?
Technical enough to show you can build it, not so technical that you get lost in details. Mention partition keys, serialization formats, and retry strategies. Do not go line-by-line through config files. The interviewer wants to see that you understand the building blocks and how they connect. Depth on one or two components is better than shallow coverage of everything.

How is data engineering system design different from backend system design?
Backend system design focuses on request/response latency, load balancing, caching, and database scaling. Data engineering system design focuses on throughput, data quality, batch vs streaming processing, and analytical query performance. The evaluation criteria are different: backend is measured in p99 latency; data engineering is measured in data freshness, correctness, and cost efficiency.

How much detail should I give on capacity estimation?
Spend 2-3 minutes on capacity math, not 10. Convert the input rate to events per second, estimate record size, multiply to get MB/s throughput, then project daily and yearly storage. Round aggressively. The interviewer wants to see that you can do the math and that it influences your design choices (e.g., "at 50 TB/year we need columnar compression and tiered storage"). They do not want a spreadsheet.

Should I draw on the whiteboard or on paper?
Use whatever medium the interviewer provides. On a whiteboard, start in the upper left and draw left-to-right (sources to consumers). Leave space between components so you can add detail later. On paper or a virtual whiteboard, the same advice applies. The key habit is to draw first, then talk through it. A visual reference keeps both you and the interviewer anchored.

How do I handle a design scenario I have never seen before?
Decompose the unfamiliar problem into familiar sub-problems. Every data system has sources, ingestion, storage, transformation, and serving. Ask clarifying questions to identify the data volume, latency needs, and consumer patterns. Then map each sub-problem to a pattern you know. If the interviewer asks about genomics pipelines and you have never touched biology, the underlying architecture is still ingestion, batch or stream processing, and analytical storage. State your reasoning aloud so the interviewer sees your approach, not just your answer.

Ground Your Design Skills in Real Practice

System design interviews test fundamentals. Practice SQL, Python, and data engineering concepts so your design answers are grounded in real implementation experience. Strong data modeling skills make every system design answer more credible.
