Data Engineering System Design Mock Interview (2026)

Data engineering system design is not software engineering system design. No load balancers. No API gateways. Instead: data flow, storage layers, processing frameworks, and SLAs measured in data freshness, not response time. DataDriven's interactive discuss mode simulates a real 60-minute design round where the AI pushes back on every decision.

Design scenarios: 40+
Weight at L5+: 40%
Typical round length: 60 min
Interactive discuss mode: AI

DE System Design vs. SWE System Design

If you have studied system design from a software engineering perspective (Grokking, System Design Interview by Alex Xu, etc.), you are starting from the wrong foundation for DE interviews. Those resources teach you to design Twitter, design a URL shortener, or design a chat system. The primitives are HTTP requests, databases, caches, and load balancers.

Data engineering system design uses different primitives. You are designing data pipelines, not request-handling services. The questions sound different: 'Design a real-time analytics pipeline' instead of 'Design a notification system.' The evaluation criteria are different: data freshness instead of API latency, data completeness instead of uptime, transformation correctness instead of response codes.

Some overlap exists. Both care about scalability, fault tolerance, and monitoring. But the specific patterns differ enough that studying SWE system design without DE-specific preparation will leave you unprepared for 60-70% of what DE interviewers test.

DE Design vs. SWE Design: Key Differences

Aspect | SWE Design | DE Design
Primary Concern | Request handling, API latency, user-facing reliability | Data flow, transformation correctness, processing latency, storage cost
Load Balancing | Central to the design. Round-robin, consistent hashing, geographic routing. | Rarely discussed. Data pipelines do not have load balancers. Partitioning serves a similar purpose.
Storage Design | Relational DB for OLTP, Redis for caching, CDN for static assets. | Data lake (S3/GCS), warehouse (Snowflake/BigQuery), OLAP engine (ClickHouse/Druid), and the data model that connects them.
Failure Handling | Circuit breakers, retries, graceful degradation for users. | Idempotent processing, dead letter queues, backfill strategies, data reconciliation after failures.
SLA Definition | p99 response time, uptime percentage, error rate. | Data freshness (how stale can it be?), completeness (are all records present?), accuracy (are transformations correct?).
Scaling Strategy | Horizontal scaling of stateless services, database read replicas, sharding. | Partition-level parallelism, cluster sizing, storage tiering (hot/warm/cold), incremental processing.

The 5-Step Framework for DE System Design

Every DE system design interview follows a similar structure. Having a framework keeps you organized and ensures you cover what interviewers evaluate. Here is the framework DataDriven teaches, with time allocation for a 60-minute round.

Step 1: Requirements (10 min). Do not start designing. Start asking. What is the data volume? What is the freshness SLA? Who consumes the data and how? What are the access patterns (ad-hoc queries, scheduled reports, ML model serving)? This step is where senior candidates differentiate themselves. L4 candidates skip it and jump to architecture. L5+ candidates spend 10 minutes asking pointed questions that shape every subsequent decision.
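The volume question in Step 1 usually turns into a quick back-of-envelope estimate. The sketch below shows one way to do that arithmetic. The function name, the 1 KB average event size, and the 2x peak-to-average ratio are illustrative assumptions, not figures from any real prompt; in an interview you would ask for them.

```python
SECONDS_PER_DAY = 86_400


def daily_volume_gb(peak_events_per_sec: float,
                    avg_event_bytes: int,
                    peak_to_avg_ratio: float = 1.0) -> float:
    """Estimate raw daily ingest in GB (decimal) from a peak event rate.

    peak_to_avg_ratio is an assumption: traffic is rarely at peak all day.
    """
    avg_events_per_sec = peak_events_per_sec / peak_to_avg_ratio
    return avg_events_per_sec * avg_event_bytes * SECONDS_PER_DAY / 1e9


# Hypothetical inputs: 50K events/sec at peak, ~1 KB per event, peak is 2x average.
estimate = daily_volume_gb(50_000, avg_event_bytes=1_000, peak_to_avg_ratio=2.0)
print(f"{estimate:,.0f} GB/day")  # prints "2,160 GB/day"
```

A number like this is only as good as its inputs, but stating it out loud lets the interviewer correct the assumptions before they shape the storage and processing choices.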

Step 2: Data Model (15 min). Before you draw boxes and arrows, define what the data looks like. What are the entities? What are the relationships? What is the grain of your fact tables? How do dimensions change over time? This step catches the candidates who can architect infrastructure but cannot model data. At Meta, data modeling is often the deciding factor.

Step 3: Processing Layer (15 min). Now choose how data moves and transforms. Batch, streaming, or hybrid? Which framework (Spark, Flink, dbt, custom Python)? How do you handle late-arriving data? How do you ensure idempotency? The processing layer is where your experience shows. Generic answers like 'we will use Spark' score poorly. Specific answers like 'Spark with daily batch jobs, 128MB partitions, repartitioned by date, triggered by Airflow at 03:00 UTC after upstream sources confirm data readiness' score well.
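One common way to get idempotency in a daily batch job is to overwrite the whole date partition on every run instead of appending to it. The sketch below uses an in-memory dict as a stand-in for a partitioned table; `run_daily_job`, the row fields, and the cents-to-dollars transform are hypothetical and only illustrate the overwrite pattern.

```python
def run_daily_job(table: dict, run_date: str, source_rows: list) -> int:
    """Recompute one date partition and overwrite it.

    Overwriting (not appending) means a retry or backfill for the same
    run_date leaves the table in the same state as a single successful run.
    """
    partition = [
        {"order_id": r["order_id"], "amount_usd": r["amount_cents"] / 100}
        for r in source_rows
        if r["event_date"] == run_date
    ]
    table[run_date] = partition  # replace the partition, never extend it
    return len(partition)


source = [
    {"order_id": 1, "event_date": "2026-01-01", "amount_cents": 1250},
    {"order_id": 2, "event_date": "2026-01-01", "amount_cents": 800},
    {"order_id": 3, "event_date": "2026-01-02", "amount_cents": 4300},
]
warehouse = {}
run_daily_job(warehouse, "2026-01-01", source)
run_daily_job(warehouse, "2026-01-01", source)  # simulated retry
print(len(warehouse["2026-01-01"]))  # prints 2, not 4
```

In a real warehouse or lake the equivalent is usually a partition overwrite or a keyed merge; which one applies depends on the engine, so treat this as the shape of the idea rather than a specific API.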

Step 4: Serving Layer (10 min). How does the data reach consumers? If it is analysts running ad-hoc queries, you need a warehouse with a good query engine. If it is a dashboard with sub-second response time, you might need a materialized view or an OLAP engine. If it is an ML model, you need a feature serving layer with p99 latency constraints. The serving layer connects your data pipeline to the business value.

Step 5: Monitoring and Failure Handling (10 min). What breaks? How do you know it broke? How do you fix it? Every system design that does not address failure modes is incomplete. Cover: data quality checks (row counts, NULL rates, distribution anomalies), pipeline monitoring (job duration, success/failure alerts), and recovery strategy (backfill process, dead letter queues, manual reconciliation for edge cases).

4 System Design Prompts You Will See

Design a Real-Time Analytics Pipeline

Prompt: "Your company wants to track user behavior in real-time across web and mobile. Product managers need dashboards that update within 30 seconds. The system handles 50K events/second at peak. Design the pipeline end to end."

What they test: Event ingestion architecture, choice between Kafka and Kinesis, stream processing framework (Flink vs. Spark Streaming), real-time serving layer (Druid, ClickHouse, or Pinot), and how you handle late-arriving events. The 30-second SLA drives your architecture. A batch pipeline will not work here.

Framework application:
Requirements: 50K events/sec, 30-second freshness, web + mobile sources.
Data model: event schema with user_id, event_type, timestamp, properties JSON.
Processing: Kafka for ingestion, Flink for stream processing (windowed aggregations), ClickHouse for real-time OLAP.
Serving: direct ClickHouse queries for dashboards.
Monitoring: lag alerts on Kafka consumer groups, data freshness checks every 60 seconds.
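The windowed aggregation and late-event handling in this prompt can be illustrated without any streaming framework. The following is a minimal pure-Python sketch of tumbling windows with a simplified watermark. It is not Flink's API; the window size, allowed lateness, and the `aggregate` function are assumptions chosen for illustration.

```python
from collections import defaultdict

WINDOW_SEC = 10            # tumbling window size (assumed)
ALLOWED_LATENESS_SEC = 5   # how far behind the newest event we still accept (assumed)


def aggregate(events):
    """Count events per (window_start, event_type).

    events: iterable of (event_time_sec, event_type) in arrival order.
    The watermark trails the largest event time seen so far; an event whose
    window already closed before the watermark is dropped as too late.
    """
    counts = defaultdict(int)
    dropped = []
    watermark = float("-inf")
    for event_time, event_type in events:
        watermark = max(watermark, event_time - ALLOWED_LATENESS_SEC)
        window_start = event_time - event_time % WINDOW_SEC
        if window_start + WINDOW_SEC <= watermark:
            dropped.append((event_time, event_type))
            continue
        counts[(window_start, event_type)] += 1
    return dict(counts), dropped


arrivals = [(1, "click"), (4, "click"), (12, "view"),
            (8, "click"),   # out of order, but within allowed lateness
            (27, "click"),
            (9, "click")]   # arrives after its window is long closed
counts, dropped = aggregate(arrivals)
print(counts)   # {(0, 'click'): 3, (10, 'view'): 1, (20, 'click'): 1}
print(dropped)  # [(9, 'click')]
```

In an interview, the follow-up is usually what happens to the dropped events: discard them, route them to a side output for a later batch correction, or widen the allowed lateness at the cost of freshness.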

Design a Data Warehouse for E-Commerce

Prompt: "An e-commerce company with 10M daily active users needs a data warehouse. They want to track orders, products, customers, inventory, and marketing campaigns. Analytics teams run ad-hoc queries, and the finance team needs daily batch reports. Design the warehouse."

What they test: Data modeling depth (star schema vs. snowflake, fact table granularity, slowly changing dimensions), ETL vs. ELT approach, choice of warehouse technology (Snowflake, BigQuery, Redshift), partitioning strategy, and how you handle both ad-hoc and scheduled workloads on the same system.

Framework application:
Requirements: 10M DAU, mixed workloads (ad-hoc + batch), data freshness of T+1 for batch and near-real-time for key metrics.
Data model: order_fact (grain: one row per order line item), customer_dim (SCD Type 2), product_dim, date_dim, campaign_dim.
Processing: ELT with dbt, Airflow orchestration, incremental loads for facts, full refresh for small dims.
Serving: Snowflake with separate warehouses for ad-hoc and scheduled workloads.
Monitoring: dbt tests for referential integrity, row count anomaly detection.
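Because customer_dim is described as SCD Type 2, it helps to be able to explain the update logic concretely. The sketch below shows the close-and-append pattern on an in-memory list. The column names (`valid_from`, `valid_to`, `attrs`), the open-ended sentinel date, and `apply_scd2` itself are illustrative assumptions; a real warehouse would typically express this as a merge or a dbt snapshot.

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # sentinel meaning "current row" (assumed convention)


def apply_scd2(dim_rows, customer_id, new_attrs, effective_date):
    """Close the current row if attributes changed, then append a new current row."""
    current = next(
        (r for r in dim_rows
         if r["customer_id"] == customer_id and r["valid_to"] == OPEN_END),
        None,
    )
    if current is not None:
        if current["attrs"] == new_attrs:
            return dim_rows  # nothing changed: replaying the update is a no-op
        current["valid_to"] = effective_date  # close out the old version
    dim_rows.append({
        "customer_id": customer_id,
        "attrs": new_attrs,
        "valid_from": effective_date,
        "valid_to": OPEN_END,
    })
    return dim_rows


customer_dim = []
apply_scd2(customer_dim, 42, {"tier": "basic"}, date(2026, 1, 1))
apply_scd2(customer_dim, 42, {"tier": "gold"}, date(2026, 3, 1))
apply_scd2(customer_dim, 42, {"tier": "gold"}, date(2026, 3, 1))  # replay
print(len(customer_dim))  # prints 2
```

Facts then join to the dimension on the customer key plus a date-range condition, which is what lets an order from February still report the "basic" tier after the customer upgrades.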

Design an ML Feature Store

Prompt: "Your ML team trains 20 models, each using 50-200 features. Features are computed from event data, transaction history, and user profiles. Training uses historical features (point-in-time correct). Serving needs features with p99 latency under 10ms. Design the feature store."

What they test: Understanding of the training-serving skew problem, point-in-time correct feature computation for training, online vs. offline store architecture, feature freshness requirements, and how you prevent data leakage. This question also tests whether you can explain why a simple key-value store is not enough.

Framework application:
Requirements: 20 models, 50-200 features per model, p99 < 10ms serving, point-in-time correct training data.
Data model: feature definitions as code (schema, computation logic, entity key), feature groups by entity (user, product, session).
Processing: offline store in Parquet on S3 (batch features computed by Spark), online store in Redis (low-latency serving). Dual-write pipeline: batch features materialized daily, streaming features updated in real time via Flink.
Serving: feature server with Redis reads, batched to minimize round trips.
Monitoring: feature drift detection, staleness alerts, serving latency dashboards.
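Point-in-time correctness is easier to defend in an interview if you can show the lookup rule. The sketch below is a minimal illustration: for a training label at time `ts`, use the latest feature value recorded at or before `ts`, never a later one. The `feature_as_of` helper and the sample history are hypothetical.

```python
from bisect import bisect_right


def feature_as_of(history, ts):
    """Return the latest feature value with timestamp <= ts, or None.

    history: list of (timestamp, value) sorted by timestamp ascending.
    Using values from after ts would leak future information into training.
    Some teams use a strict < instead, to exclude values written at ts itself.
    """
    timestamps = [t for t, _ in history]
    idx = bisect_right(timestamps, ts)
    return history[idx - 1][1] if idx else None


# Hypothetical history of a user's 7-day purchase count, keyed by write time.
purchases_7d = [(100, 1), (200, 3), (300, 7)]
print(feature_as_of(purchases_7d, 250))  # prints 3 (value as of t=200)
print(feature_as_of(purchases_7d, 50))   # prints None (no value existed yet)
```

At scale the same rule is applied as an as-of join between the label table and the offline feature table, but the logic per entity is the one shown here.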

Design a Data Quality Monitoring System

Prompt: "Your company has 400 data pipelines feeding a warehouse with 2,000 tables. Data quality issues cause downstream reports to show incorrect numbers, which erodes trust. Design a monitoring system that catches data quality problems before they reach analysts."

What they test: Understanding of data quality dimensions (completeness, accuracy, consistency, timeliness, uniqueness), where to place quality checks in the pipeline (source, transformation, destination), alerting strategy (who gets paged and when), and how to handle quality issues without blocking the entire pipeline.

Framework application:
Requirements: 400 pipelines, 2,000 tables, catch issues before downstream consumption.
Data model: quality rules table (rule_id, table_name, check_type, threshold, severity), quality results table (run_id, rule_id, passed, value, timestamp).
Processing: quality checks as dbt tests (row counts, NULL rates, referential integrity, value distributions), custom Python checks for statistical anomalies, freshness monitors on every table.
Serving: quality dashboard with pass/fail per table, trend graphs for key metrics.
Monitoring: PagerDuty alerts for critical failures, Slack notifications for warnings, weekly quality digest email for data producers.
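The rules and results tables in this design map naturally onto a small evaluation loop. The sketch below shows how one rule row could produce one result row. The two check types, the extra `column` field on the rule, and the `evaluate` function are assumptions added for illustration; the source only specifies the table columns.

```python
from datetime import datetime, timezone


def null_rate(rows, column):
    """Fraction of rows where column is missing or None (0.0 for an empty table)."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(column) is None) / len(rows)


def evaluate(rule, rows, run_id):
    """Apply one quality rule to a table sample and return a results-table row."""
    if rule["check_type"] == "min_row_count":
        value = float(len(rows))
        passed = value >= rule["threshold"]
    elif rule["check_type"] == "max_null_rate":
        value = null_rate(rows, rule["column"])  # "column" is an assumed extra field
        passed = value <= rule["threshold"]
    else:
        raise ValueError(f"unknown check_type: {rule['check_type']}")
    return {
        "run_id": run_id,
        "rule_id": rule["rule_id"],
        "passed": passed,
        "value": value,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


rule = {"rule_id": "r1", "table_name": "customers", "check_type": "max_null_rate",
        "column": "email", "threshold": 0.10, "severity": "critical"}
sample = [{"email": "a@x.io"}, {"email": None}, {"email": "b@x.io"}, {"email": "c@x.io"}]
result = evaluate(rule, sample, run_id="run-001")
print(result["passed"], result["value"])  # prints: False 0.25
```

Keeping severity on the rule rather than in the check code is what lets the alerting layer decide between a page and a Slack message without touching the checks themselves.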

DataDriven's Interactive Discuss Mode

System design cannot be practiced by reading solutions. You need to practice the conversation. In a real interview, the interviewer does not sit quietly while you monologue for 60 minutes. They interrupt, ask clarifying questions, challenge your assumptions, and introduce new constraints mid-conversation.

DataDriven's discuss mode replicates this experience. You receive a design prompt. You start by typing your clarifying questions. The AI responds with specific answers about data volume, SLA requirements, team size, and existing infrastructure. As you propose architecture components, the AI pushes back.

'You chose Kafka for event ingestion. The prompt says the system handles 500 events/second. Is Kafka the right choice for that volume, or is it over-engineering? What is the simplest solution that meets the requirements?'

'You mentioned using Snowflake as the serving layer. The prompt requires sub-second dashboard queries. Snowflake's cold query latency is 2-5 seconds. How do you handle that?'

'Your pipeline has no error handling. What happens when the Spark job fails at step 3 of 5? Do you reprocess everything from scratch? How long does that take?'

These are not generic questions. The AI generates follow-ups based on your specific design choices. If you propose a streaming architecture, it asks about exactly-once semantics and late event handling. If you propose batch, it asks about freshness trade-offs and backfill strategy.

5 System Design Mistakes That Fail Interviews

Tool-first thinking

'Let's use Kafka, Spark, and Snowflake.' Why? Because they are popular tools. This is the most common L4 mistake. Senior interviewers want to see requirements-first thinking. Start with the constraints (volume, freshness, access patterns), then choose tools that meet those constraints. If 500 events/second need to be ingested and daily batch freshness is acceptable, Kafka is over-engineering. A simple S3 file drop with hourly processing does the job at 5% of the operational complexity.

Skipping the data model

You draw a beautiful architecture diagram with boxes for ingestion, processing, and storage. The interviewer asks: 'What does the data look like?' Silence. The data model is the foundation. Without it, your architecture is a collection of tools with no clear purpose. Define your entities, relationships, grain, and key dimensions before you discuss how to process them.

No failure handling

Your design works perfectly when everything goes right. But what happens when the source system sends duplicate records? When a Spark job fails halfway through a 4-hour run? When upstream data arrives 6 hours late? Every production pipeline fails. Interviewers want to see that your design accounts for failure, not just success. Discuss idempotency, dead letter queues, reconciliation processes, and alerting.
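Two of the failure modes above, duplicate records and malformed records, can be handled in the same processing loop. The sketch below is a minimal illustration using plain lists as stand-ins for the sink and the dead letter queue; `process_batch` and the record fields are hypothetical.

```python
def process_batch(records, seen_ids, sink, dead_letters):
    """Write valid, unseen records to sink; route bad ones to dead_letters.

    seen_ids makes reprocessing safe: a duplicate or replayed record is skipped.
    A bad record is set aside with its error instead of failing the whole batch.
    """
    for rec in records:
        try:
            rec_id = rec["id"]
            if rec_id in seen_ids:
                continue  # duplicate delivery: already processed
            amount = float(rec["amount"])
        except (KeyError, TypeError, ValueError) as exc:
            dead_letters.append({"record": rec, "error": repr(exc)})
            continue
        sink.append({"id": rec_id, "amount": amount})
        seen_ids.add(rec_id)


batch = [
    {"id": 1, "amount": "9.50"},
    {"id": 1, "amount": "9.50"},   # duplicate from the source system
    {"id": 2, "amount": "oops"},   # unparseable amount
    {"amount": "3.00"},            # missing id
]
seen, sink, dlq = set(), [], []
process_batch(batch, seen, sink, dlq)
print(len(sink), len(dlq))  # prints: 1 2
```

The design point to articulate is that the dead letter queue needs an owner and a replay path; otherwise it only moves the data loss somewhere less visible.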

Over-engineering

You propose a real-time streaming architecture with Kafka, Flink, and Redis for a use case that requires daily batch reports. The interviewer asks: 'The SLA is T+1. Why not a daily Spark job?' Over-engineering signals inexperience, not expertise. Senior engineers choose the simplest architecture that meets the requirements and articulate why complexity is (or is not) justified.

Monologue instead of conversation

You talk for 20 minutes without checking in with the interviewer. System design is a collaborative exercise. After each major decision, pause: 'Does this direction make sense, or should I go deeper on any part?' This shows communication skills and gives the interviewer opportunities to redirect you toward what they want to evaluate.

The Batch vs. Streaming Decision

Every DE system design interview includes this decision, either explicitly or implicitly. The interviewer is testing whether you can match the processing model to the requirements instead of defaulting to the one you are most comfortable with.

Choose batch when: Freshness SLA is measured in hours or days. Data volume is large but bounded (you know when today's data is 'done'). Consumers are reports, dashboards with T+1 freshness, or ML training pipelines. Cost sensitivity is high (batch compute is 60-80% cheaper than always-on streaming clusters). The data sources deliver data in files or bulk exports.

Choose streaming when: Freshness SLA is measured in seconds or minutes. Data arrives continuously with no natural 'end of day.' Consumers need real-time alerts, live dashboards, or online ML predictions. The business value degrades rapidly with staleness (fraud detection, dynamic pricing, live personalization).

Choose hybrid when: Some consumers need real-time and others need batch (this is the most common real-world case). The lambda architecture (parallel batch and streaming paths) has fallen out of favor due to maintenance burden. The kappa architecture (streaming-only with reprocessing capability) is simpler but harder to debug. Most modern teams use a practical hybrid: streaming for the critical path and batch for backfills, historical aggregations, and ML training.
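The three cases above reduce to a simple rule over the freshness SLAs of your consumers. The sketch below encodes that rule. The one-hour cutoff between "minutes" and "hours" and the function name are assumptions for illustration; in practice the boundary depends on cost and team context.

```python
STREAMING_CUTOFF_SEC = 3_600  # assumed boundary: SLAs under an hour need streaming


def choose_processing_model(freshness_slas_sec):
    """Pick batch, streaming, or hybrid from per-consumer freshness SLAs (seconds)."""
    needs_streaming = [sla < STREAMING_CUTOFF_SEC for sla in freshness_slas_sec]
    if all(needs_streaming):
        return "streaming"
    if any(needs_streaming):
        return "hybrid"
    return "batch"


print(choose_processing_model([30]))          # prints "streaming"
print(choose_processing_model([86_400]))      # prints "batch"
print(choose_processing_model([30, 86_400]))  # prints "hybrid"
```

Freshness is only one input; volume, cost sensitivity, and how the sources deliver data can all override it, which is why the reasoning matters more than the label.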

DataDriven's design problems explicitly state freshness and volume requirements. The AI evaluates whether your batch/streaming choice matches those requirements and flags over-engineering (streaming when batch would work) and under-engineering (batch when the SLA demands real-time).


System Design FAQ

How is data engineering system design different from software engineering system design?
The fundamental difference: SWE system design is about handling requests, DE system design is about moving and transforming data. SWE designs focus on load balancers, API gateways, caching layers, and database scaling. DE designs focus on ingestion patterns, transformation frameworks, storage layers, and data quality. There is overlap (both care about reliability and monitoring), but the core abstractions are different. Studying SWE system design will leave you unprepared for DE-specific questions about data modeling, pipeline orchestration, and batch vs. streaming trade-offs.
What framework should I use for DE system design interviews?
A five-step framework works well. First, clarify requirements: data volume, freshness SLA, access patterns, and who consumes the data. Second, design the data model: what are the entities, relationships, and grain? Third, design the processing layer: batch, streaming, or hybrid? What framework and why? Fourth, design the serving layer: how does the data reach consumers? Fifth, add monitoring and failure handling: what breaks, how do you detect it, and how do you recover? Spend roughly 10 minutes on requirements, 15 on data model, 15 on processing, 10 on serving, and 10 on monitoring.
How does DataDriven's interactive discuss mode work for system design?
You receive a design prompt with intentionally vague requirements. You start by asking clarifying questions, and the AI responds with specific answers (data volume, SLA, team constraints). As you propose architecture components, the AI asks follow-up questions: 'Why Kafka over Kinesis?' 'What happens when this component fails?' 'Your design assumes events arrive in order. What if they do not?' This simulates the back-and-forth of a real system design interview, where the interviewer actively challenges your decisions.
What are the most common system design mistakes in DE interviews?
Four mistakes are most frequent. First, jumping to tools before understanding requirements ('Let's use Kafka' before knowing the data volume or latency needs). Second, ignoring the data model (designing the pipeline without defining what the data looks like). Third, not addressing failure modes (what happens when Kafka goes down? What happens when a Spark job fails halfway?). Fourth, over-engineering (proposing a real-time streaming architecture for a use case where daily batch would meet the SLA at 10% of the complexity and cost).
How many system design problems should I practice for a senior interview?
8-10 full problems, practiced end to end (45-60 minutes each). That is roughly 6-10 hours of active practice, plus review time. Cover at least one problem from each category: real-time pipeline, batch warehouse, ML infrastructure, data quality, and migration/modernization. If you are targeting a specific company, add 2-3 company-specific problems. DataDriven has 40+ system design scenarios with AI discuss mode, so you will not run out of fresh problems.

Practice the Conversation, Not Just the Diagram

1. Active recall beats re-reading by 50%
   Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom.

2. 76% of hiring managers reject on the coding task, not the resume
   From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice.

3. Five problem shapes cover 80% of data engineer loops
   Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition.
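As an example of one of these shapes written by hand, here is a minimal sessionization sketch: group a user's event timestamps into sessions, starting a new session whenever the gap since the previous event exceeds a threshold. The 30-minute gap and the `sessionize` function are illustrative assumptions.

```python
def sessionize(timestamps, gap_sec=1_800):
    """Split event timestamps (seconds) into sessions separated by gaps > gap_sec."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap_sec:
            sessions[-1].append(ts)   # close enough: same session
        else:
            sessions.append([ts])     # gap too large (or first event): new session
    return sessions


print(sessionize([0, 600, 4_000, 4_100, 10_000]))
# prints [[0, 600], [4000, 4100], [10000]]
```

The SQL version of the same shape uses a lag over the ordered timestamps to flag session starts and a running sum to assign session ids.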
