Pipeline Design
System design rounds test whether you can reason about data flow at scale. You need to know the major architecture patterns (Lambda, Kappa, event-driven, request-driven), the components that make up a production pipeline, and how to draw and explain an architecture clearly. Here you will learn the four major patterns, the six components of a production pipeline, and how to answer architecture questions in interviews.
Each pattern solves the same fundamental problem differently: how to get data from sources to consumers reliably, at the right freshness, at scale.
Lambda Architecture
Runs batch and streaming in parallel. The batch layer processes historical data for accuracy. The speed layer processes real-time data for low latency. A serving layer merges both views. The idea: batch is your source of truth; streaming gives you approximate real-time numbers until the batch catches up.
Strengths
+ Accurate historical data from batch layer
+ Low-latency approximations from speed layer
+ Well-suited for use cases where both accuracy and freshness matter
Weaknesses
- Two codebases: one for batch, one for streaming
- Complexity of maintaining two parallel systems
- The merge logic in the serving layer can be tricky
When to use: When you need both real-time dashboards and accurate historical reports. Example: ad click analytics where the dashboard shows live counts, but the billing system uses batch-computed totals.
Interview angle: Interviewers ask: 'Why not just use streaming for everything?' The answer: streaming gives approximate results due to late-arriving data, out-of-order events, and incomplete windows. Batch gives exact results because it processes complete datasets. Lambda gives you both.
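The serving-layer merge can be sketched in a few lines. This is a minimal illustration, not a production implementation: `merged_count` and the per-day count dictionaries are hypothetical names, and the "watermark" is simply the last day the batch layer has fully recomputed.

```python
from datetime import date

def merged_count(batch_totals, speed_totals, batch_watermark):
    """Serving-layer merge: exact batch counts through the watermark,
    approximate speed-layer counts only for days after it."""
    exact = sum(n for day, n in batch_totals.items() if day <= batch_watermark)
    approx = sum(n for day, n in speed_totals.items() if day > batch_watermark)
    return exact + approx
```

Note the design choice: once the nightly batch recomputes a day, the speed layer's (possibly stale) approximation for that day is simply ignored.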
Kappa Architecture
Streaming only. All data is treated as a stream of events. There is no separate batch layer. Historical reprocessing happens by replaying the event stream from the beginning (or from a checkpoint). The event log (Kafka, Kinesis) is the single source of truth.
Strengths
+ Single codebase for both real-time and historical
+ Simpler than Lambda; no merge logic
+ Natural fit for event-sourced systems
Weaknesses
- Reprocessing the entire stream is expensive for large datasets
- Not all problems fit the event streaming model
- Requires a durable, replayable event log (Kafka with long retention)
When to use: When your use case is naturally event-driven (user activity streams, IoT sensor data) and you can afford to reprocess the stream for corrections.
Interview angle: The interviewer wants to hear: Kappa works when your event log is the source of truth and reprocessing is feasible. It does not work when batch sources (database snapshots, file exports) are your primary input. Know when Kappa is appropriate and when it is not.
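Kappa's core move, replaying the log to rebuild state, can be sketched with an in-memory list standing in for the durable log. The names (`replay`, `count_clicks`) and the event shape are illustrative assumptions.

```python
def replay(log, apply_event, from_offset=0):
    """Rebuild derived state by replaying the event log from an offset.
    In Kappa, 'reprocessing' is just running this from offset 0."""
    state = {}
    for event in log[from_offset:]:
        apply_event(state, event)
    return state

def count_clicks(state, event):
    # example projection: click counts per user
    state[event["user"]] = state.get(event["user"], 0) + 1
```

The same code serves real-time (consume from the current offset) and historical (replay from zero) needs, which is exactly the single-codebase advantage over Lambda.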
Event-Driven Architecture
Components communicate through events (messages). A producer publishes an event. One or more consumers react to it. This decouples producers from consumers: the producer does not know (or care) who processes the event. Message brokers (Kafka, RabbitMQ, Pub/Sub) sit in between.
Strengths
+ Loose coupling between pipeline components
+ Easy to add new consumers without modifying producers
+ Natural backpressure: consumers process at their own speed
Weaknesses
- Debugging is harder; events flow through multiple systems
- Ordering guarantees vary by broker and configuration
- At-least-once delivery means consumers must handle duplicates
When to use: Microservices architectures, real-time data platforms, and any system where multiple teams produce and consume data independently.
Interview angle: Discuss event ordering, at-least-once vs exactly-once semantics, and how you handle duplicate events. These are the technical depth points interviewers look for.
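Duplicate handling is worth being able to sketch on a whiteboard. A minimal deduplicating consumer, assuming each event carries a unique `id`:

```python
class DedupingConsumer:
    """Tolerates at-least-once delivery: processing is keyed on event ID,
    so a redelivered event is a no-op."""

    def __init__(self):
        self.seen_ids = set()  # in production: a bounded or TTL'd store, not an unbounded set
        self.total = 0

    def handle(self, event):
        if event["id"] in self.seen_ids:
            return False  # duplicate delivery, skip
        self.seen_ids.add(event["id"])
        self.total += event["amount"]
        return True
```

The unbounded `set` is the part interviewers probe: in a real system you bound it with a TTL or rely on an idempotent sink instead.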
Request-Driven (Scheduled Batch) Architecture
Traditional scheduled pipelines. An orchestrator (Airflow, Dagster, Prefect) triggers jobs on a schedule or in response to data availability. Each job extracts data from a source, transforms it, and loads it to a target. This is the most common architecture in data warehousing.
Strengths
+ Simple mental model: jobs run on a schedule
+ Mature tooling (Airflow has been production-tested for a decade)
+ Easy to reason about data freshness: the data is current as of the last successful run
Weaknesses
- Latency is bounded by the schedule interval
- Failures require manual intervention or complex retry logic
- Scaling individual jobs is harder than scaling event consumers
When to use: Batch analytics, data warehousing, and any use case where hourly or daily freshness is acceptable.
Interview angle: Most interviewers expect you to be fluent in this pattern. Discuss orchestration (DAGs, dependencies, retries), idempotency (re-running a job produces the same result), and monitoring (how you know a job failed).
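Idempotency is easiest to show with the overwrite-the-partition pattern. A toy sketch with a dict standing in for the warehouse (names are illustrative):

```python
def load_partition(warehouse, partition_key, rows):
    """Idempotent daily load: overwrite the target partition instead of
    appending, so a retry or manual re-run yields the same table."""
    warehouse[partition_key] = list(rows)

warehouse = {}
load_partition(warehouse, "2024-01-05", [{"id": 1}, {"id": 2}])
load_partition(warehouse, "2024-01-05", [{"id": 1}, {"id": 2}])  # re-run: no duplicates
```

An append-based load would double the rows on the second run; overwrite semantics are what make retries and backfills safe.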
Every pipeline, regardless of architecture pattern, has these components. Interviewers expect you to address each one in a system design answer.
Ingestion
Getting data from sources into your platform. Sources include databases (CDC, full exports), APIs (REST, GraphQL), files (S3, SFTP), and event streams (Kafka, Kinesis). The key decisions: push vs pull, full load vs incremental, and how to handle schema changes from upstream.
Common interview question: How do you handle schema evolution in an ingestion pipeline?
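One common answer worth sketching: conform each raw record to a declared target schema, defaulting columns the upstream dropped and setting aside columns it added rather than failing the load. The function name and schema shape here are illustrative assumptions.

```python
def conform(record, schema):
    """Map a raw record onto the target schema.
    schema maps column -> default. A column dropped upstream gets its
    default; a column added upstream is set aside (for review or a
    raw-extras column) instead of breaking the pipeline."""
    row = {col: record.get(col, default) for col, default in schema.items()}
    extras = {k: v for k, v in record.items() if k not in schema}
    return row, extras
```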
Transformation
Converting raw data into analytical models. This is where business logic lives: cleaning, joining, aggregating, and reshaping data. Tools: SQL (dbt), Spark, Python. The key decisions: where to transform (in the warehouse vs in a processing framework), when to transform (on ingest vs on read), and how to test transformations.
Common interview question: Do you prefer ETL or ELT? Why?
Storage
Where data lives at rest. Raw data in a data lake (S3, GCS). Structured data in a warehouse (BigQuery, Snowflake, Redshift). Hot data in a serving store (Redis, DynamoDB). The key decisions: file format (Parquet, Avro, ORC), partitioning strategy, and retention policies.
Common interview question: When would you use a data lake vs a data warehouse?
Serving
Making data accessible to consumers. Dashboards (Looker, Tableau), APIs, ML feature stores, or direct SQL access. The key decisions: materialized views vs on-demand computation, access control, and query performance optimization.
Common interview question: How do you optimize query performance for a dashboard that scans a 10TB table?
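The usual answer is pre-aggregation: materialize a small rollup keyed on the dashboard's dimensions so queries never touch the raw table. A toy sketch (event shape and function name are assumptions):

```python
def build_daily_rollup(events):
    """Pre-aggregate raw events into (day, country) counts so the
    dashboard queries a small rollup instead of scanning raw data."""
    rollup = {}
    for e in events:
        key = (e["day"], e["country"])
        rollup[key] = rollup.get(key, 0) + 1
    return rollup
```

In a warehouse this is a scheduled `GROUP BY` into a summary table or a materialized view; the principle is the same.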
Orchestration
Coordinating when and how jobs run. Tools: Airflow, Dagster, Prefect, Cloud Composer. The key decisions: scheduling strategy (cron vs event-triggered), dependency management (upstream jobs must complete first), and failure handling (retry, skip, alert).
Common interview question: How do you handle a job that depends on three upstream datasets that arrive at different times?
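The sensor-style answer: gate the job until all inputs have landed, wait while the SLA deadline has not passed, and alert with the missing inputs once it has. A minimal sketch of that decision logic (names and the string return values are illustrative):

```python
def gate(required, available, now, deadline):
    """Orchestrator decision while upstream datasets land: run when all
    inputs are present, wait until the SLA deadline, then alert."""
    missing = set(required) - set(available)
    if not missing:
        return "run"
    if now >= deadline:
        return "alert:" + ",".join(sorted(missing))
    return "wait"
```

In Airflow this is what a sensor with a timeout does; the interview point is that you thought about the late-arrival and deadline cases, not just the happy path.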
Monitoring
Knowing when something breaks before your stakeholders do. Data quality checks (row counts, NULL rates, distribution shifts), pipeline health (job duration, failure rates, SLA compliance), and cost monitoring. The key decisions: what to monitor, what thresholds to set, and who gets paged.
Common interview question: What data quality checks would you add to a production pipeline?
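Two of the cheapest, highest-signal checks, row volume and NULL rate on a key column, can be sketched directly. The thresholds are the design decision; the values below are placeholders.

```python
def quality_checks(rows, min_rows, max_null_rate, required_column):
    """Return the names of failed checks after a load: volume and
    NULL-rate on a column that should (almost) always be populated."""
    failures = []
    if len(rows) < min_rows:
        failures.append("row_count")
    if rows:
        null_rate = sum(1 for r in rows if r.get(required_column) is None) / len(rows)
        if null_rate > max_null_rate:
            failures.append("null_rate")
    return failures
```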
These are the system design questions that test pipeline architecture reasoning.
Question: Design an analytics pipeline that serves both a real-time dashboard and accurate historical reporting.
What they test:
Lambda or Kappa architecture decision. The interviewer wants to see: event ingestion (Kafka), stream processing for real-time, batch processing for historical, and a serving layer. They care about your reasoning for choosing Lambda vs Kappa.
Approach:
Start with requirements: what latency does the dashboard need? What accuracy does the historical report need? If the dashboard can tolerate approximate counts, use a single streaming path (Kappa). If billing depends on exact counts, use Lambda with a batch correction layer.
Question: You inherit a monolithic legacy pipeline. How do you modernize it?
What they test:
Practical experience with pipeline evolution. The interviewer wants a phased approach, not a big-bang rewrite. They care about how you handle the transition period when both old and new pipelines coexist.
Approach:
Phase 1: add monitoring to the existing pipeline. Phase 2: decompose into modular jobs (one table per job). Phase 3: introduce orchestration (Airflow). Phase 4: migrate to a modern storage layer (from on-prem to cloud). Each phase delivers value independently.
Question: Design a pipeline that processes 1 billion events per day with a 5-minute freshness SLA.
What they test:
Scale reasoning. 1B events/day = ~11,500 events/second. The interviewer checks whether you do the math, choose appropriate tools (Kafka + Flink, not a cron job), and discuss partitioning, parallelism, and backpressure.
Approach:
Kafka for ingestion (partitioned by event key). Flink or Spark Structured Streaming for processing. Output to a columnar store (Parquet on S3 or BigQuery). Monitor lag to hit the 5-minute SLA. Discuss what happens when throughput exceeds processing capacity.
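Doing the sizing math out loud is the point, so it helps to have the back-of-envelope formula ready. The peak factor and per-partition throughput below are assumptions you would state in the interview, not fixed numbers.

```python
import math

def capacity_plan(events_per_day, peak_factor, per_partition_eps):
    """Back-of-envelope sizing: average rate, assumed peak rate, and the
    Kafka partition count needed at a given per-partition throughput."""
    avg_eps = events_per_day / 86_400              # seconds in a day
    peak_eps = avg_eps * peak_factor               # traffic is never flat
    partitions = math.ceil(peak_eps / per_partition_eps)
    return avg_eps, peak_eps, partitions
```

For 1B events/day with an assumed 3x peak and 5,000 events/s per partition, this gives roughly 11,600 events/s average and 7 partitions, numbers you can then sanity-check against headroom and growth.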
Question: Explain the difference between batch, micro-batch, and true streaming.
What they test:
Conceptual clarity. Batch: process all data at once on a schedule (hourly, daily). Micro-batch: process data in small intervals (every 30 seconds to 5 minutes), implemented by Spark Structured Streaming. True streaming: process each event individually as it arrives, implemented by Flink and Kafka Streams.
Approach:
Explain the spectrum: batch has the highest latency but simplest implementation. True streaming has the lowest latency but the most complex state management. Micro-batch is the practical middle ground that most teams choose.
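The micro-batch idea itself fits in a few lines: bucket timestamped events into fixed intervals and process each bucket as a small batch. This is a conceptual sketch of the model, not how any particular engine is implemented.

```python
def micro_batches(events, interval_s):
    """Bucket (timestamp, payload) events into fixed intervals, the way
    a micro-batch engine groups a stream before processing each group."""
    buckets = {}
    for ts, payload in events:
        buckets.setdefault(int(ts // interval_s), []).append(payload)
    return [buckets[k] for k in sorted(buckets)]
```

Shrink `interval_s` toward zero and you approach per-event (true streaming) semantics; grow it toward a day and you are back at batch, which is why these three are best described as points on one latency spectrum.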
Question: Your pipeline has been producing bad data for three days. What do you do?
What they test:
Incident response and pipeline design. The interviewer wants to see: immediate triage (what is wrong), impact assessment (who consumed the bad data), root cause analysis, fix, backfill, and prevention.
Approach:
Step 1: assess impact (which tables, which consumers). Step 2: disable the pipeline to prevent more bad data. Step 3: identify root cause. Step 4: fix and validate. Step 5: backfill the 3 affected days (your pipeline must be idempotent for this to work). Step 6: add monitoring to catch this earlier.
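Step 5 only works because the daily job is idempotent, which a small sketch makes concrete (dicts stand in for the warehouse and source; names are illustrative):

```python
def run_day(warehouse, day, source):
    """Idempotent daily job: overwrite the day's partition from source."""
    warehouse[day] = [r for r in source if r["day"] == day]

def backfill(warehouse, days, source):
    """Re-run the daily job for each affected partition. Safe to repeat
    only because run_day overwrites rather than appends."""
    for day in days:
        run_day(warehouse, day, source)
```

Running the backfill twice leaves the warehouse unchanged, which is the property that lets you rerun it under pressure without making the incident worse.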
Question: Would you build a custom pipeline or buy a managed ingestion service?
What they test:
Engineering judgment. The interviewer wants to hear cost analysis, maintenance burden, flexibility, and team expertise. They do not want a dogmatic answer.
Approach:
Managed services (Fivetran, Stitch) for standard SaaS connectors. Custom pipelines for proprietary data sources, complex transformation logic, or performance requirements that managed services cannot meet. The tiebreaker is often the team: do you have engineers who can maintain custom code?
Question: How do you achieve exactly-once processing?
What they test:
Deep understanding of distributed systems semantics. Exactly-once is technically at-least-once delivery plus idempotent processing. The interviewer wants you to explain: checkpointing (Flink savepoints), transactional writes (Kafka transactions), and deduplication at the consumer.
Approach:
True exactly-once is an illusion; it is achieved through at-least-once delivery plus idempotent sinks. Use Kafka transactions for produce-and-commit atomicity. Use Flink checkpointing for stateful processing. Use MERGE or upsert at the sink for idempotent writes.
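The idempotent-sink half of this answer is easy to demonstrate: a keyed upsert (what a warehouse MERGE does) makes redelivered batches harmless. A dict stands in for the sink; the record shape is an assumption.

```python
def merge_write(sink, records):
    """MERGE-style idempotent write: keyed upsert, so redelivering a
    batch under at-least-once semantics leaves the sink unchanged."""
    for rec in records:
        sink[rec["key"]] = rec["value"]

sink = {}
batch = [{"key": "a", "value": 1}, {"key": "b", "value": 2}]
merge_write(sink, batch)
merge_write(sink, batch)  # redelivery after a retried commit
```

At-least-once delivery plus this write pattern is what "exactly-once" means end to end: the duplicate arrives, but its effect is applied once.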
Architecture knowledge gets you through the system design round. Hands-on practice with SQL and Python gets you through the coding rounds. Both matter.