System Design Interview Prep

Data Engineering System Design Mock Interview

Data engineering system design is not software engineering system design. No load balancers. No API gateways. Instead: data flow, storage layers, processing frameworks, and SLAs measured in data freshness, not response time. DataDriven's interactive discuss mode simulates a real 60-minute design round where the AI pushes back on every decision.

40+ design scenarios. Interactive AI discussion. Follow-up questions that test your reasoning, not just your architecture diagrams.

40+

Design Scenarios

40%

Weight at L5+

60 min

Typical Round Length

AI

Interactive Discuss Mode

DE System Design vs. SWE System Design

If you've studied system design from a software engineering perspective (Grokking, System Design Interview by Alex Xu, etc.), you're starting from the wrong foundation for DE interviews. Those resources teach you to design Twitter, design a URL shortener, or design a chat system. The primitives are HTTP requests, databases, caches, and load balancers.

Data engineering system design uses different primitives. You're designing data pipelines, not request-handling services. The questions sound different: "Design a real-time analytics pipeline" instead of "Design a notification system." The evaluation criteria are different: data freshness instead of API latency, data completeness instead of uptime, transformation correctness instead of response codes.

Some overlap exists. Both care about scalability, fault tolerance, and monitoring. But the specific patterns differ enough that studying SWE system design without DE-specific preparation will leave you unprepared for 60-70% of what DE interviewers test.

Primary Concern

SWE DESIGN

Request handling, API latency, user-facing reliability

DE DESIGN

Data flow, transformation correctness, processing latency, storage cost

Load Balancing

SWE DESIGN

Central to the design. Round-robin, consistent hashing, geographic routing.

DE DESIGN

Rarely discussed. Data pipelines don't have load balancers. Partitioning serves a similar purpose.

Storage Design

SWE DESIGN

Relational DB for OLTP, Redis for caching, CDN for static assets.

DE DESIGN

Data lake (S3/GCS), warehouse (Snowflake/BigQuery), OLAP engine (ClickHouse/Druid), and the data model that connects them.

Failure Handling

SWE DESIGN

Circuit breakers, retries, graceful degradation for users.

DE DESIGN

Idempotent processing, dead letter queues, backfill strategies, data reconciliation after failures.

SLA Definition

SWE DESIGN

p99 response time, uptime percentage, error rate.

DE DESIGN

Data freshness (how stale can it be?), completeness (are all records present?), accuracy (are transformations correct?).

Scaling Strategy

SWE DESIGN

Horizontal scaling of stateless services, database read replicas, sharding.

DE DESIGN

Partition-level parallelism, cluster sizing, storage tiering (hot/warm/cold), incremental processing.

The 5-Step Framework for DE System Design

Every DE system design interview follows a similar structure. Having a framework keeps you organized and ensures you cover what interviewers evaluate. Here's the framework DataDriven teaches, with time allocation for a 60-minute round.

Step 1: Requirements (10 min). Don't start designing. Start asking. What's the data volume? What's the freshness SLA? Who consumes the data and how? What are the access patterns (ad-hoc queries, scheduled reports, ML model serving)? This step is where senior candidates differentiate themselves. L4 candidates skip it and jump to architecture. L5+ candidates spend 10 minutes asking pointed questions that shape every subsequent decision.

Step 2: Data Model (15 min). Before you draw boxes and arrows, define what the data looks like. What are the entities? What are the relationships? What's the grain of your fact tables? How do dimensions change over time? This step catches the candidates who can architect infrastructure but can't model data. At Meta, data modeling is often the deciding factor.

Step 3: Processing Layer (15 min). Now choose how data moves and transforms. Batch, streaming, or hybrid? Which framework (Spark, Flink, dbt, custom Python)? How do you handle late-arriving data? How do you ensure idempotency? The processing layer is where your experience shows. Generic answers like "we'll use Spark" score poorly. Specific answers like "Spark with daily batch jobs, 128MB partitions, repartitioned by date, triggered by Airflow at 03:00 UTC after upstream sources confirm data readiness" score well.
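The idempotency point above can be sketched in plain Python. This is a toy stand-in for what a Spark partition-overwrite would do at scale: the output location is derived only from the run date, and a rerun replaces the partition instead of appending to it, so retries after a failure are safe. All names and paths here are illustrative.

```python
import json
import shutil
from datetime import date
from pathlib import Path

def run_daily_job(run_date: date, records: list[dict], out_root: Path) -> Path:
    """Idempotent daily batch step: the output path is a pure function
    of the run date, and the partition is swapped in whole by writing
    to a temp dir first. Rerunning the same date overwrites rather
    than appends, so duplicate runs cannot duplicate rows."""
    partition = out_root / f"dt={run_date.isoformat()}"
    tmp = out_root / f"_tmp_dt={run_date.isoformat()}"
    if tmp.exists():
        shutil.rmtree(tmp)  # clean up a previous failed attempt
    tmp.mkdir(parents=True)
    (tmp / "part-0000.json").write_text(
        "\n".join(json.dumps(r) for r in records)
    )
    if partition.exists():
        shutil.rmtree(partition)  # replace, never append
    tmp.rename(partition)
    return partition
```

This is the property interviewers probe when they ask "what happens if the job runs twice?" — with deterministic paths and overwrite semantics, the answer is "nothing bad."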

Step 4: Serving Layer (10 min). How does the data reach consumers? If it's analysts running ad-hoc queries, you need a warehouse with a good query engine. If it's a dashboard with sub-second response time, you might need a materialized view or an OLAP engine. If it's an ML model, you need a feature serving layer with p99 latency constraints. The serving layer connects your data pipeline to the business value.

Step 5: Monitoring and Failure Handling (10 min). What breaks? How do you know it broke? How do you fix it? Every system design that doesn't address failure modes is incomplete. Cover: data quality checks (row counts, NULL rates, distribution anomalies), pipeline monitoring (job duration, success/failure alerts), and recovery strategy (backfill process, dead letter queues, manual reconciliation for edge cases).
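The quality checks in Step 5 can be made concrete with a minimal sketch. This checks the two examples named above — row counts and NULL rates — against thresholds; the thresholds and return shape are hypothetical knobs a real pipeline would load from config.

```python
def quality_report(rows: list[dict], min_rows: int, max_null_rate: float) -> dict:
    """Step 5 checks in miniature: a row-count floor plus a per-column
    NULL-rate ceiling, returning a pass/fail summary a scheduler could
    alert on."""
    failures = []
    if len(rows) < min_rows:
        failures.append(f"row_count {len(rows)} below floor {min_rows}")
    if rows:
        for col in rows[0].keys():
            null_rate = sum(1 for r in rows if r.get(col) is None) / len(rows)
            if null_rate > max_null_rate:
                failures.append(
                    f"{col} NULL rate {null_rate:.2%} above {max_null_rate:.2%}"
                )
    return {"passed": not failures, "failures": failures}
```

Distribution-anomaly checks follow the same pattern — compute a statistic, compare it to an expected range, and emit a failure record rather than crashing the pipeline.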

4 System Design Prompts You Will See

These four prompts (or close variations) account for about 70% of DE system design questions at FAANG companies. Practice all four.

Design a Real-Time Analytics Pipeline

"Your company wants to track user behavior in real time across web and mobile. Product managers need dashboards that update within 30 seconds. The system handles 50K events/second at peak. Design the pipeline end to end."

WHAT THEY TEST

Event ingestion architecture, choice between Kafka and Kinesis, stream processing framework (Flink vs. Spark Streaming), real-time serving layer (Druid, ClickHouse, or Pinot), and how you handle late-arriving events. The 30-second SLA drives your architecture. A batch pipeline won't work here.

FRAMEWORK APPLICATION

Requirements: 50K events/sec, 30-second freshness, web + mobile sources. Data model: event schema with user_id, event_type, timestamp, properties JSON. Processing: Kafka for ingestion, Flink for stream processing (windowed aggregations), ClickHouse for real-time OLAP. Serving: direct ClickHouse queries for dashboards. Monitoring: lag alerts on Kafka consumer groups, data freshness checks every 60 seconds.
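The windowed aggregations in that processing layer can be illustrated with a toy tumbling window in pure Python, using the event schema above (event_type, epoch-second timestamp). Events are bucketed by event time, not arrival order — the property a Flink event-time window gives you at scale, and the reason late-but-in-window events still land in the right bucket.

```python
from collections import Counter

def tumbling_counts(events: list[dict], window_s: int = 30) -> dict:
    """Toy tumbling-window aggregation: each event falls into the
    window containing its event time, so arrival order doesn't matter.
    Returns {window_start: Counter of event_type counts}."""
    windows: dict[int, Counter] = {}
    for e in events:
        start = (e["timestamp"] // window_s) * window_s
        windows.setdefault(start, Counter())[e["event_type"]] += 1
    return windows
```

The hard parts a real stream processor adds on top — watermarks to decide when a window is "done," and allowed lateness for stragglers — are exactly what the interviewer's late-event follow-up is probing.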

Design a Data Warehouse for E-Commerce

"An e-commerce company with 10M daily active users needs a data warehouse. They want to track orders, products, customers, inventory, and marketing campaigns. Analytics teams run ad-hoc queries, and the finance team needs daily batch reports. Design the warehouse."

WHAT THEY TEST

Data modeling depth (star schema vs. snowflake, fact table granularity, slowly changing dimensions), ETL vs. ELT approach, choice of warehouse technology (Snowflake, BigQuery, Redshift), partitioning strategy, and how you handle both ad-hoc and scheduled workloads on the same system.

FRAMEWORK APPLICATION

Requirements: 10M DAU, mixed workloads (ad-hoc + batch), data freshness of T+1 for batch and near-real-time for key metrics. Data model: order_fact (grain: one row per order line item), customer_dim (SCD Type 2), product_dim, date_dim, campaign_dim. Processing: ELT with dbt, Airflow orchestration, incremental loads for facts, full refresh for small dims. Serving: Snowflake with separate warehouses for ad-hoc and scheduled workloads. Monitoring: dbt tests for referential integrity, row count anomaly detection.
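The SCD Type 2 choice for customer_dim is worth being able to sketch on demand. This is a minimal in-memory version of the pattern, assuming illustrative column names (valid_from, valid_to, is_current): when a tracked attribute changes, close the current row and open a new one; otherwise leave history untouched.

```python
from datetime import date

def scd2_apply(history: list[dict], update: dict, as_of: date) -> list[dict]:
    """Minimal SCD Type 2 upsert for a dimension like customer_dim.
    `update` carries the natural key and the tracked attributes:
    {"key": ..., "attrs": {...}}."""
    current = next(
        (r for r in history if r["is_current"] and r["key"] == update["key"]),
        None,
    )
    new_row = {**update, "valid_from": as_of, "valid_to": None, "is_current": True}
    if current is None:
        return history + [new_row]          # first version of this entity
    if current["attrs"] == update["attrs"]:
        return history                      # no change: keep the open row
    current["valid_to"] = as_of             # close the old version
    current["is_current"] = False
    return history + [new_row]
```

In a warehouse this runs as a MERGE in dbt or SQL, but the logic — close the old row, open the new one — is the same, and it's what lets the order fact join to the customer attributes that were true at order time.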

Design an ML Feature Store

"Your ML team trains 20 models, each using 50-200 features. Features are computed from event data, transaction history, and user profiles. Training uses historical features (point-in-time correct). Serving needs features with p99 latency under 10ms. Design the feature store."

WHAT THEY TEST

Understanding of the training-serving skew problem, point-in-time correct feature computation for training, online vs. offline store architecture, feature freshness requirements, and how you prevent data leakage. This question also tests whether you can explain why a simple key-value store isn't enough.

FRAMEWORK APPLICATION

Requirements: 20 models, 200 features, p99 < 10ms serving, point-in-time training. Data model: feature definitions as code (schema, computation logic, entity key), feature groups by entity (user, product, session). Processing: offline store in Parquet on S3 (batch features computed by Spark), online store in Redis (low-latency serving). Dual-write pipeline: batch features materialized daily, streaming features updated in real time via Flink. Serving: feature server with Redis reads, batched to minimize round trips. Monitoring: feature drift detection, staleness alerts, serving latency dashboards.
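Point-in-time correctness is the part of this design candidates most often hand-wave, so here is the rule as a sketch: for a training example labelled at time T, take the latest feature value computed at or before T — never after, which would leak future information into training. The row shape (entity_id, ts, value) is a hypothetical offline-store slice.

```python
def point_in_time_value(feature_rows: list[dict], entity_id: str, label_ts: int):
    """Point-in-time correct lookup for training-set construction:
    only feature values computed at or before the label timestamp are
    eligible; the most recent eligible value wins."""
    eligible = [
        r for r in feature_rows
        if r["entity_id"] == entity_id and r["ts"] <= label_ts
    ]
    if not eligible:
        return None  # feature did not exist yet at label time
    return max(eligible, key=lambda r: r["ts"])["value"]
```

A real offline store does this as an "as-of join" across millions of rows, but being able to state the rule this precisely is what separates a candidate who understands leakage from one who just names a feature-store product.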

Design a Data Quality Monitoring System

"Your company has 400 data pipelines feeding a warehouse with 2,000 tables. Data quality issues cause downstream reports to show incorrect numbers, which erodes trust. Design a monitoring system that catches data quality problems before they reach analysts."

WHAT THEY TEST

Understanding of data quality dimensions (completeness, accuracy, consistency, timeliness, uniqueness), where to place quality checks in the pipeline (source, transformation, destination), alerting strategy (who gets paged and when), and how to handle quality issues without blocking the entire pipeline.

FRAMEWORK APPLICATION

Requirements: 400 pipelines, 2,000 tables, catch issues before downstream consumption. Data model: quality rules table (rule_id, table_name, check_type, threshold, severity), quality results table (run_id, rule_id, passed, value, timestamp). Processing: quality checks as dbt tests (row counts, NULL rates, referential integrity, value distributions), custom Python checks for statistical anomalies, freshness monitors on every table. Serving: quality dashboard with pass/fail per table, trend graphs for key metrics. Monitoring: PagerDuty alerts for critical failures, Slack notifications for warnings, weekly quality digest email for data producers.
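The rules-table design above can be sketched as a small evaluator. This assumes each rule's check_type names a precomputed statistic, with counts treated as floors and rates as ceilings; the stats input and that convention are illustrative, not a fixed spec.

```python
import time

def evaluate_rules(rules: list[dict], table_stats: dict, run_id: str) -> list[dict]:
    """Evaluate quality rules against precomputed per-table stats,
    e.g. {"orders": {"row_count": 980, "null_rate": 0.01}}. Emits
    rows shaped like the results table (run_id, rule_id, passed,
    value, timestamp)."""
    results = []
    for rule in rules:
        value = table_stats[rule["table_name"]][rule["check_type"]]
        if rule["check_type"] == "row_count":
            passed = value >= rule["threshold"]  # counts have floors
        else:
            passed = value <= rule["threshold"]  # rates have ceilings
        results.append({
            "run_id": run_id,
            "rule_id": rule["rule_id"],
            "passed": passed,
            "value": value,
            "timestamp": int(time.time()),
        })
    return results
```

The severity column then drives routing: failed critical rules page someone, failed warnings go to Slack — which keeps 400 pipelines from generating 400 pages.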

DataDriven's Interactive Discuss Mode

System design can't be practiced by reading solutions. You need to practice the conversation. In a real interview, the interviewer doesn't sit quietly while you monologue for 60 minutes. They interrupt, ask clarifying questions, challenge your assumptions, and introduce new constraints mid-conversation.

DataDriven's discuss mode replicates this experience. You receive a design prompt. You start by typing your clarifying questions. The AI responds with specific answers about data volume, SLA requirements, team size, and existing infrastructure. As you propose architecture components, the AI pushes back.

"You chose Kafka for event ingestion. The prompt says the system handles 500 events/second. Is Kafka the right choice for that volume, or is it over-engineering? What's the simplest solution that meets the requirements?"

"You mentioned using Snowflake as the serving layer. The prompt requires sub-second dashboard queries. Snowflake's cold query latency is 2-5 seconds. How do you handle that?"

"Your pipeline has no error handling. What happens when the Spark job fails at step 3 of 5? Do you reprocess everything from scratch? How long does that take?"

These aren't generic questions. The AI generates follow-ups based on your specific design choices. If you propose a streaming architecture, it asks about exactly-once semantics and late event handling. If you propose batch, it asks about freshness trade-offs and backfill strategy. This targeted questioning is what makes discuss mode better than practicing with a generic study partner who doesn't know the domain deeply.

5 System Design Mistakes That Fail Interviews

1. Tool-first thinking. "Let's use Kafka, Spark, and Snowflake." Why? Because they're popular tools. This is the most common L4 mistake. Senior interviewers want to see requirements-first thinking. Start with the constraints (volume, freshness, access patterns), then choose tools that meet those constraints. If you're ingesting 500 events/second and daily batch freshness is acceptable, Kafka is over-engineering. A simple S3 file drop with hourly processing does the job at 5% of the operational complexity.

2. Skipping the data model. You draw a beautiful architecture diagram with boxes for ingestion, processing, and storage. The interviewer asks: "What does the data look like?" Silence. The data model is the foundation. Without it, your architecture is a collection of tools with no clear purpose. Define your entities, relationships, grain, and key dimensions before you discuss how to process them.

3. No failure handling. Your design works perfectly when everything goes right. But what happens when the source system sends duplicate records? When a Spark job fails halfway through a 4-hour run? When upstream data arrives 6 hours late? Every production pipeline fails. Interviewers want to see that your design accounts for failure, not just success. Discuss idempotency, dead letter queues, reconciliation processes, and alerting.
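The dead-letter-queue pattern from mistake #3 fits in a few lines. The idea: a record that fails the transform is captured with its error instead of killing the run, so one poison record can't block the whole batch; the DLQ contents can be alerted on and reprocessed after a fix. This sketch uses an in-memory list where production systems use a Kafka topic or an S3 prefix.

```python
def process_with_dlq(records: list, transform) -> tuple[list, list]:
    """Apply `transform` to each record; failures go to a dead letter
    queue with the error attached, and the batch keeps running."""
    processed, dead_letters = [], []
    for record in records:
        try:
            processed.append(transform(record))
        except Exception as exc:
            dead_letters.append({"record": record, "error": repr(exc)})
    return processed, dead_letters
```

In the interview, the follow-up to have an answer for is the non-empty DLQ: who gets alerted, and is reprocessing it idempotent?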

4. Over-engineering. You propose a real-time streaming architecture with Kafka, Flink, and Redis for a use case that requires daily batch reports. The interviewer asks: "The SLA is T+1. Why not a daily Spark job?" Over-engineering signals inexperience, not expertise. Senior engineers choose the simplest architecture that meets the requirements and articulate why complexity is (or isn't) justified.

5. Monologue instead of conversation. You talk for 20 minutes without checking in with the interviewer. System design is a collaborative exercise. After each major decision, pause: "Does this direction make sense, or should I go deeper on any part?" This shows communication skills and gives the interviewer opportunities to redirect you toward what they want to evaluate.

The Batch vs. Streaming Decision

Every DE system design interview includes this decision, either explicitly or implicitly. The interviewer is testing whether you can match the processing model to the requirements instead of defaulting to the one you're most comfortable with.

Choose batch when: Freshness SLA is measured in hours or days. Data volume is large but bounded (you know when today's data is "done"). Consumers are reports, dashboards with T+1 freshness, or ML training pipelines. Cost sensitivity is high (batch compute is 60-80% cheaper than always-on streaming clusters). The data sources deliver data in files or bulk exports.

Choose streaming when: Freshness SLA is measured in seconds or minutes. Data arrives continuously with no natural "end of day." Consumers need real-time alerts, live dashboards, or online ML predictions. The business value degrades rapidly with staleness (fraud detection, dynamic pricing, live personalization).

Choose hybrid when: Some consumers need real-time and others need batch (this is the most common real-world case). The lambda architecture (parallel batch and streaming paths) has fallen out of favor due to maintenance burden. The kappa architecture (streaming-only with reprocessing capability) is simpler but harder to debug. Most modern teams use a practical hybrid: streaming for the critical path and batch for backfills, historical aggregations, and ML training.

DataDriven's design problems explicitly state freshness and volume requirements. The AI evaluates whether your batch/streaming choice matches those requirements and flags over-engineering (streaming when batch would work) and under-engineering (batch when the SLA demands real-time).

System Design FAQ

How is data engineering system design different from software engineering system design?

The fundamental difference: SWE system design is about handling requests, DE system design is about moving and transforming data. SWE designs focus on load balancers, API gateways, caching layers, and database scaling. DE designs focus on ingestion patterns, transformation frameworks, storage layers, and data quality. There's overlap (both care about reliability and monitoring), but the core abstractions are different. Studying SWE system design will leave you unprepared for DE-specific questions about data modeling, pipeline orchestration, and batch vs. streaming trade-offs.

What framework should I use for DE system design interviews?

A five-step framework works well. First, clarify requirements: data volume, freshness SLA, access patterns, and who consumes the data. Second, design the data model: what are the entities, relationships, and grain? Third, design the processing layer: batch, streaming, or hybrid? What framework and why? Fourth, design the serving layer: how does the data reach consumers? Fifth, add monitoring and failure handling: what breaks, how do you detect it, and how do you recover? Spend roughly 10 minutes on requirements, 15 on data model, 15 on processing, 10 on serving, and 10 on monitoring.

How does DataDriven's interactive discuss mode work for system design?

You receive a design prompt with intentionally vague requirements. You start by asking clarifying questions, and the AI responds with specific answers (data volume, SLA, team constraints). As you propose architecture components, the AI asks follow-up questions: "Why Kafka over Kinesis?" "What happens when this component fails?" "Your design assumes events arrive in order. What if they don't?" This simulates the back-and-forth of a real system design interview, where the interviewer actively challenges your decisions.

What are the most common system design mistakes in DE interviews?

Four mistakes are most frequent. First, jumping to tools before understanding requirements ("Let's use Kafka" before knowing the data volume or latency needs). Second, ignoring the data model (designing the pipeline without defining what the data looks like). Third, not addressing failure modes (what happens when Kafka goes down? What happens when a Spark job fails halfway?). Fourth, over-engineering (proposing a real-time streaming architecture for a use case where daily batch would meet the SLA at 10% of the complexity and cost).

How many system design problems should I practice for a senior interview?

8-10 full problems, practiced end to end (45-60 minutes each). That's roughly 10-12 hours of active practice, plus review time. Cover at least one problem from each category: real-time pipeline, batch warehouse, ML infrastructure, data quality, and migration/modernization. If you're targeting a specific company, add 2-3 company-specific problems. DataDriven has 40+ system design scenarios with AI discuss mode, so you won't run out of fresh problems.

Practice the Conversation, Not Just the Diagram

40+ system design scenarios with interactive AI discussion. Follow-up questions that test your reasoning. Feedback on what you missed and why it matters.