Pipeline Design
System design rounds test whether you can reason about data flow at scale. You need to know the major architecture patterns (Lambda, Kappa, event-driven, request-driven), the components that make up a production pipeline, and how to draw and explain an architecture clearly. Here you will learn the four major patterns, the six components of a production pipeline, and how to answer architecture questions in interviews.
Each pattern solves the same fundamental problem differently: how to get data from sources to consumers reliably, at the right freshness, at scale.
Lambda Architecture
Runs batch and streaming in parallel. The batch layer processes historical data for accuracy. The speed layer processes real-time data for low latency. A serving layer merges both views. The idea: batch is your source of truth; streaming gives you approximate real-time numbers until the batch catches up.
Strengths
+ Accurate historical data from batch layer
+ Low-latency approximations from speed layer
+ Well-suited for use cases where both accuracy and freshness matter
Weaknesses
- Two codebases: one for batch, one for streaming
- Complexity of maintaining two parallel systems
- The merge logic in the serving layer can be tricky
When to use: When you need both real-time dashboards and accurate historical reports. Example: ad click analytics where the dashboard shows live counts, but the billing system uses batch-computed totals.
Interview angle: Interviewers ask: 'Why not just use streaming for everything?' The answer: streaming gives approximate results due to late-arriving data, out-of-order events, and incomplete windows. Batch gives exact results because it processes complete datasets. Lambda gives you both.
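The serving-layer merge can be sketched in a few lines. This is a minimal illustration, not a production implementation: `merged_count` and the per-day count dictionaries are hypothetical names, and the "watermark" is simply the last day the batch layer has fully recomputed.

```python
from datetime import date

def merged_count(batch_totals, speed_totals, batch_watermark):
    """Serving-layer merge: exact batch counts through the watermark,
    approximate speed-layer counts only for days after it."""
    exact = sum(n for day, n in batch_totals.items() if day <= batch_watermark)
    approx = sum(n for day, n in speed_totals.items() if day > batch_watermark)
    return exact + approx
```

Note the design choice: once the nightly batch recomputes a day, the speed layer's (possibly stale) approximation for that day is simply ignored.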
Kappa Architecture
Streaming only. All data is treated as a stream of events. There is no separate batch layer. Historical reprocessing happens by replaying the event stream from the beginning (or from a checkpoint). The event log (Kafka, Kinesis) is the single source of truth.
Strengths
+ Single codebase for both real-time and historical
+ Simpler than Lambda; no merge logic
+ Natural fit for event-sourced systems
Weaknesses
- Reprocessing the entire stream is expensive for large datasets
- Not all problems fit the event streaming model
- Requires a durable, replayable event log (Kafka with long retention)
When to use: When your use case is naturally event-driven (user activity streams, IoT sensor data) and you can afford to reprocess the stream for corrections.
Interview angle: The interviewer wants to hear: Kappa works when your event log is the source of truth and reprocessing is feasible. It does not work when batch sources (database snapshots, file exports) are your primary input. Know when Kappa is appropriate and when it is not.
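Kappa's core move, replaying the log to rebuild state, can be sketched with an in-memory list standing in for the durable log. The names (`replay`, `count_clicks`) and the event shape are illustrative assumptions.

```python
def replay(log, apply_event, from_offset=0):
    """Rebuild derived state by replaying the event log from an offset.
    In Kappa, 'reprocessing' is just running this from offset 0."""
    state = {}
    for event in log[from_offset:]:
        apply_event(state, event)
    return state

def count_clicks(state, event):
    # example projection: click counts per user
    state[event["user"]] = state.get(event["user"], 0) + 1
```

The same code serves real-time (consume from the current offset) and historical (replay from zero) needs, which is exactly the single-codebase advantage over Lambda.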
Event-Driven Architecture
Components communicate through events (messages). A producer publishes an event. One or more consumers react to it. This decouples producers from consumers: the producer does not know (or care) who processes the event. Message brokers (Kafka, RabbitMQ, Pub/Sub) sit in between.
Strengths
+ Loose coupling between pipeline components
+ Easy to add new consumers without modifying producers
+ Natural backpressure: consumers process at their own speed
Weaknesses
- Debugging is harder; events flow through multiple systems
- Ordering guarantees vary by broker and configuration
- At-least-once delivery means consumers must handle duplicates
When to use: Microservices architectures, real-time data platforms, and any system where multiple teams produce and consume data independently.
Interview angle: Discuss event ordering, at-least-once vs exactly-once semantics, and how you handle duplicate events. These are the technical depth points interviewers look for.
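Duplicate handling is worth being able to sketch on a whiteboard. A minimal deduplicating consumer, assuming each event carries a unique `id`:

```python
class DedupingConsumer:
    """Tolerates at-least-once delivery: processing is keyed on event ID,
    so a redelivered event is a no-op."""

    def __init__(self):
        self.seen_ids = set()  # in production: a bounded or TTL'd store, not an unbounded set
        self.total = 0

    def handle(self, event):
        if event["id"] in self.seen_ids:
            return False  # duplicate delivery, skip
        self.seen_ids.add(event["id"])
        self.total += event["amount"]
        return True
```

The unbounded `set` is the part interviewers probe: in a real system you bound it with a TTL or rely on an idempotent sink instead.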
Request-Driven (Scheduled Batch) Architecture
Traditional scheduled pipelines. An orchestrator (Airflow, Dagster, Prefect) triggers jobs on a schedule or in response to data availability. Each job extracts data from a source, transforms it, and loads it to a target. This is the most common architecture in data warehousing.
Strengths
+ Simple mental model: jobs run on a schedule
+ Mature tooling (Airflow has been production-tested for a decade)
+ Easy to reason about data freshness: the data is current as of the last successful run
Weaknesses
- Latency is bounded by the schedule interval
- Failures require manual intervention or complex retry logic
- Scaling individual jobs is harder than scaling event consumers
When to use: Batch analytics, data warehousing, and any use case where hourly or daily freshness is acceptable.
Interview angle: Most interviewers expect you to be fluent in this pattern. Discuss orchestration (DAGs, dependencies, retries), idempotency (re-running a job produces the same result), and monitoring (how you know a job failed).
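Idempotency is easiest to show with the overwrite-the-partition pattern. A toy sketch with a dict standing in for the warehouse (names are illustrative):

```python
def load_partition(warehouse, partition_key, rows):
    """Idempotent daily load: overwrite the target partition instead of
    appending, so a retry or manual re-run yields the same table."""
    warehouse[partition_key] = list(rows)

warehouse = {}
load_partition(warehouse, "2024-01-05", [{"id": 1}, {"id": 2}])
load_partition(warehouse, "2024-01-05", [{"id": 1}, {"id": 2}])  # re-run: no duplicates
```

An append-based load would double the rows on the second run; overwrite semantics are what make retries and backfills safe.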
Every pipeline, regardless of architecture pattern, has these components. Interviewers expect you to address each one in a system design answer.
Ingestion
Getting data from sources into your platform. Sources include databases (CDC, full exports), APIs (REST, GraphQL), files (S3, SFTP), and event streams (Kafka, Kinesis). The key decisions: push vs pull, full load vs incremental, and how to handle schema changes from upstream.
Common interview question: How do you handle schema evolution in an ingestion pipeline?
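One common answer worth sketching: conform each raw record to a declared target schema, defaulting columns the upstream dropped and setting aside columns it added rather than failing the load. The function name and schema shape here are illustrative assumptions.

```python
def conform(record, schema):
    """Map a raw record onto the target schema.
    schema maps column -> default. A column dropped upstream gets its
    default; a column added upstream is set aside (for review or a
    raw-extras column) instead of breaking the pipeline."""
    row = {col: record.get(col, default) for col, default in schema.items()}
    extras = {k: v for k, v in record.items() if k not in schema}
    return row, extras
```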
Transformation
Converting raw data into analytical models. This is where business logic lives: cleaning, joining, aggregating, and reshaping data. Tools: SQL (dbt), Spark, Python. The key decisions: where to transform (in the warehouse vs in a processing framework), when to transform (on ingest vs on read), and how to test transformations.
Common interview question: Do you prefer ETL or ELT? Why?
Storage
Where data lives at rest. Raw data in a data lake (S3, GCS). Structured data in a warehouse (BigQuery, Snowflake, Redshift). Hot data in a serving store (Redis, DynamoDB). The key decisions: file format (Parquet, Avro, ORC), partitioning strategy, and retention policies.
Common interview question: When would you use a data lake vs a data warehouse?
Serving
Making data accessible to consumers. Dashboards (Looker, Tableau), APIs, ML feature stores, or direct SQL access. The key decisions: materialized views vs on-demand computation, access control, and query performance optimization.
Common interview question: How do you optimize query performance for a dashboard that scans a 10TB table?
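The usual answer is pre-aggregation: materialize a small rollup keyed on the dashboard's dimensions so queries never touch the raw table. A toy sketch (event shape and function name are assumptions):

```python
def build_daily_rollup(events):
    """Pre-aggregate raw events into (day, country) counts so the
    dashboard queries a small rollup instead of scanning raw data."""
    rollup = {}
    for e in events:
        key = (e["day"], e["country"])
        rollup[key] = rollup.get(key, 0) + 1
    return rollup
```

In a warehouse this is a scheduled `GROUP BY` into a summary table or a materialized view; the principle is the same.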
Orchestration
Coordinating when and how jobs run. Tools: Airflow, Dagster, Prefect, Cloud Composer. The key decisions: scheduling strategy (cron vs event-triggered), dependency management (upstream jobs must complete first), and failure handling (retry, skip, alert).
Common interview question: How do you handle a job that depends on three upstream datasets that arrive at different times?
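The sensor-style answer: gate the job until all inputs have landed, wait while the SLA deadline has not passed, and alert with the missing inputs once it has. A minimal sketch of that decision logic (names and the string return values are illustrative):

```python
def gate(required, available, now, deadline):
    """Orchestrator decision while upstream datasets land: run when all
    inputs are present, wait until the SLA deadline, then alert."""
    missing = set(required) - set(available)
    if not missing:
        return "run"
    if now >= deadline:
        return "alert:" + ",".join(sorted(missing))
    return "wait"
```

In Airflow this is what a sensor with a timeout does; the interview point is that you thought about the late-arrival and deadline cases, not just the happy path.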
Monitoring
Knowing when something breaks before your stakeholders do. Data quality checks (row counts, NULL rates, distribution shifts), pipeline health (job duration, failure rates, SLA compliance), and cost monitoring. The key decisions: what to monitor, what thresholds to set, and who gets paged.
Common interview question: What data quality checks would you add to a production pipeline?
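Two of the cheapest, highest-signal checks, row volume and NULL rate on a key column, can be sketched directly. The thresholds are the design decision; the values below are placeholders.

```python
def quality_checks(rows, min_rows, max_null_rate, required_column):
    """Return the names of failed checks after a load: volume and
    NULL-rate on a column that should (almost) always be populated."""
    failures = []
    if len(rows) < min_rows:
        failures.append("row_count")
    if rows:
        null_rate = sum(1 for r in rows if r.get(required_column) is None) / len(rows)
        if null_rate > max_null_rate:
            failures.append("null_rate")
    return failures
```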
These are the system design questions that test pipeline architecture reasoning.
Question: Design an analytics pipeline that serves both a real-time dashboard and accurate historical reporting.
What they test:
Lambda or Kappa architecture decision. The interviewer wants to see: event ingestion (Kafka), stream processing for real-time, batch processing for historical, and a serving layer. They care about your reasoning for choosing Lambda vs Kappa.
Approach:
Start with requirements: what latency does the dashboard need? What accuracy does the historical report need? If the dashboard can tolerate approximate counts, use a single streaming path (Kappa). If billing depends on exact counts, use Lambda with a batch correction layer.
Question: You inherit a monolithic legacy pipeline. How do you modernize it?
What they test:
Practical experience with pipeline evolution. The interviewer wants a phased approach, not a big-bang rewrite. They care about how you handle the transition period when both old and new pipelines coexist.
Approach:
Phase 1: add monitoring to the existing pipeline. Phase 2: decompose into modular jobs (one table per job). Phase 3: introduce orchestration (Airflow). Phase 4: migrate to a modern storage layer (from on-prem to cloud). Each phase delivers value independently.
Question: Design a pipeline that processes 1 billion events per day with a 5-minute freshness SLA.
What they test:
Scale reasoning. 1B events/day = ~11,500 events/second. The interviewer checks whether you do the math, choose appropriate tools (Kafka + Flink, not a cron job), and discuss partitioning, parallelism, and backpressure.
Approach:
Kafka for ingestion (partitioned by event key). Flink or Spark Structured Streaming for processing. Output to a columnar store (Parquet on S3 or BigQuery). Monitor lag to hit the 5-minute SLA. Discuss what happens when throughput exceeds processing capacity.
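Doing the sizing math out loud is the point, so it helps to have the back-of-envelope formula ready. The peak factor and per-partition throughput below are assumptions you would state in the interview, not fixed numbers.

```python
import math

def capacity_plan(events_per_day, peak_factor, per_partition_eps):
    """Back-of-envelope sizing: average rate, assumed peak rate, and the
    Kafka partition count needed at a given per-partition throughput."""
    avg_eps = events_per_day / 86_400              # seconds in a day
    peak_eps = avg_eps * peak_factor               # traffic is never flat
    partitions = math.ceil(peak_eps / per_partition_eps)
    return avg_eps, peak_eps, partitions
```

For 1B events/day with an assumed 3x peak and 5,000 events/s per partition, this gives roughly 11,600 events/s average and 7 partitions, numbers you can then sanity-check against headroom and growth.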
Question: Explain the difference between batch, micro-batch, and true streaming.
What they test:
Conceptual clarity. Batch: process all data at once on a schedule (hourly, daily). Micro-batch: process data in small intervals (every 30 seconds to 5 minutes), implemented by Spark Structured Streaming. True streaming: process each event individually as it arrives, implemented by Flink and Kafka Streams.
Approach:
Explain the spectrum: batch has the highest latency but simplest implementation. True streaming has the lowest latency but the most complex state management. Micro-batch is the practical middle ground that most teams choose.
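The micro-batch idea itself fits in a few lines: bucket timestamped events into fixed intervals and process each bucket as a small batch. This is a conceptual sketch of the model, not how any particular engine is implemented.

```python
def micro_batches(events, interval_s):
    """Bucket (timestamp, payload) events into fixed intervals, the way
    a micro-batch engine groups a stream before processing each group."""
    buckets = {}
    for ts, payload in events:
        buckets.setdefault(int(ts // interval_s), []).append(payload)
    return [buckets[k] for k in sorted(buckets)]
```

Shrink `interval_s` toward zero and you approach per-event (true streaming) semantics; grow it toward a day and you are back at batch, which is why these three are best described as points on one latency spectrum.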
Question: Your pipeline has been producing bad data for three days. What do you do?
What they test:
Incident response and pipeline design. The interviewer wants to see: immediate triage (what is wrong), impact assessment (who consumed the bad data), root cause analysis, fix, backfill, and prevention.
Approach:
Step 1: assess impact (which tables, which consumers). Step 2: disable the pipeline to prevent more bad data. Step 3: identify root cause. Step 4: fix and validate. Step 5: backfill the 3 affected days (your pipeline must be idempotent for this to work). Step 6: add monitoring to catch this earlier.
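Step 5 only works because the daily job is idempotent, which a small sketch makes concrete (dicts stand in for the warehouse and source; names are illustrative):

```python
def run_day(warehouse, day, source):
    """Idempotent daily job: overwrite the day's partition from source."""
    warehouse[day] = [r for r in source if r["day"] == day]

def backfill(warehouse, days, source):
    """Re-run the daily job for each affected partition. Safe to repeat
    only because run_day overwrites rather than appends."""
    for day in days:
        run_day(warehouse, day, source)
```

Running the backfill twice leaves the warehouse unchanged, which is the property that lets you rerun it under pressure without making the incident worse.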
Question: Would you build a custom pipeline or buy a managed ingestion service?
What they test:
Engineering judgment. The interviewer wants to hear cost analysis, maintenance burden, flexibility, and team expertise. They do not want a dogmatic answer.
Approach:
Managed services (Fivetran, Stitch) for standard SaaS connectors. Custom pipelines for proprietary data sources, complex transformation logic, or performance requirements that managed services cannot meet. The tiebreaker is often the team: do you have engineers who can maintain custom code?
Question: How do you achieve exactly-once processing?
What they test:
Deep understanding of distributed systems semantics. Exactly-once is technically at-least-once delivery plus idempotent processing. The interviewer wants you to explain: checkpointing (Flink savepoints), transactional writes (Kafka transactions), and deduplication at the consumer.
Approach:
True exactly-once is an illusion; it is achieved through at-least-once delivery plus idempotent sinks. Use Kafka transactions for produce-and-commit atomicity. Use Flink checkpointing for stateful processing. Use MERGE or upsert at the sink for idempotent writes.
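The idempotent-sink half of this answer is easy to demonstrate: a keyed upsert (what a warehouse MERGE does) makes redelivered batches harmless. A dict stands in for the sink; the record shape is an assumption.

```python
def merge_write(sink, records):
    """MERGE-style idempotent write: keyed upsert, so redelivering a
    batch under at-least-once semantics leaves the sink unchanged."""
    for rec in records:
        sink[rec["key"]] = rec["value"]

sink = {}
batch = [{"key": "a", "value": 1}, {"key": "b", "value": 2}]
merge_write(sink, batch)
merge_write(sink, batch)  # redelivery after a retried commit
```

At-least-once delivery plus this write pattern is what "exactly-once" means end to end: the duplicate arrives, but its effect is applied once.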
Architecture knowledge gets you through the system design round. Hands-on practice with SQL and Python gets you through the coding rounds. Both matter.