Data Pipeline Architecture Patterns for Interviews
System design rounds test whether you can reason about data flow at scale. This guide covers the four major architecture patterns (Lambda, Kappa, event-driven, request-driven), the six components of a production pipeline, and how to draw, explain, and defend an architecture in an interview.
Four Architecture Patterns
Each pattern solves the same fundamental problem differently: how to get data from sources to consumers reliably, at the right freshness, at scale.
Lambda Architecture
Runs batch and streaming in parallel. The batch layer processes historical data for accuracy. The speed layer processes real-time data for low latency. A serving layer merges both views. The idea: batch is your source of truth; streaming gives you approximate real-time numbers until the batch catches up.
Strengths
+ Accurate historical data from batch layer
+ Low-latency approximations from speed layer
+ Well-suited for use cases where both accuracy and freshness matter
Weaknesses
- Two codebases: one for batch, one for streaming
- Complexity of maintaining two parallel systems
- The merge logic in the serving layer can be tricky
When to use: When you need both real-time dashboards and accurate historical reports. Example: ad click analytics where the dashboard shows live counts, but the billing system uses batch-computed totals.
Interview angle: Interviewers ask: 'Why not just use streaming for everything?' The answer: at any given moment a streaming result can be incomplete because of late-arriving data, out-of-order events, and windows that have not closed yet. Batch runs over a complete, bounded dataset, so its results can be exact. Lambda gives you both.
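The serving-layer merge can be sketched in a few lines. This is a minimal illustration, not a production design: the names merge_views, batch_view, and speed_view are hypothetical, and it assumes the speed layer only holds events that arrived after the last completed batch run, so the two views never overlap.

```python
from typing import Dict


def merge_views(batch_view: Dict[str, int],
                speed_view: Dict[str, int]) -> Dict[str, int]:
    """Return batch totals plus the real-time increments batch has not seen yet."""
    merged = dict(batch_view)  # copy so the batch view stays untouched
    for key, count in speed_view.items():
        merged[key] = merged.get(key, 0) + count
    return merged


# Batch layer: exact click counts up to the last completed batch run.
batch_view = {"ad_1": 1000, "ad_2": 250}
# Speed layer (assumption): counts for events after that batch cutoff only.
speed_view = {"ad_1": 12, "ad_3": 4}

live_counts = merge_views(batch_view, speed_view)
print(live_counts)  # {'ad_1': 1012, 'ad_2': 250, 'ad_3': 4}
```

The tricky part in practice is the cutoff itself: when a new batch run completes, the speed view has to drop the events that batch now covers, otherwise they are counted twice.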
Kappa Architecture
Streaming only. All data is treated as a stream of events. There is no separate batch layer. Historical reprocessing happens by replaying the event stream from the beginning (or from a checkpoint). The event log (Kafka, Kinesis) is the single source of truth.
Strengths
+ Single codebase for both real-time and historical
+ Simpler than Lambda; no merge logic
+ Natural fit for event-sourced systems
Weaknesses
- Reprocessing the entire stream is expensive for large datasets
- Not all problems fit the event streaming model
- Requires a durable, replayable event log (Kafka with long retention)
When to use: When your use case is naturally event-driven (user activity streams, IoT sensor data) and you can afford to reprocess the stream for corrections.
Interview angle: The interviewer wants to hear: Kappa works when your event log is the source of truth and reprocessing is feasible. It does not work when batch sources (database snapshots, file exports) are your primary input. Know when Kappa is appropriate and when it is not.
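The core Kappa idea, one processing function used for both live traffic and historical reprocessing, can be sketched with an in-memory list standing in for the event log. The names process, replay, and log are hypothetical, and a real log such as a Kafka topic would be read through a consumer rather than a list.

```python
from typing import Dict, Iterable, List, Tuple

Event = Tuple[int, str, int]  # (offset, user_id, amount)


def process(events: Iterable[Event], state: Dict[str, int]) -> Dict[str, int]:
    """The single codebase: fold events into per-user totals."""
    for _offset, user_id, amount in events:
        state[user_id] = state.get(user_id, 0) + amount
    return state


def replay(log: List[Event], from_offset: int = 0) -> Dict[str, int]:
    """Rebuild state from scratch by re-reading the log from a given offset."""
    return process((e for e in log if e[0] >= from_offset), {})


# In-memory stand-in for a durable, replayable event log.
log: List[Event] = [(0, "u1", 5), (1, "u2", 3), (2, "u1", 2)]

live_state = process(log, {})  # state built as events arrived
rebuilt_state = replay(log)    # state rebuilt later by the same code
print(live_state == rebuilt_state)  # True
```

Because the replay runs the same function, a logic fix is deployed once and then applied to history by replaying, which is why the log's retention period bounds how far back corrections can reach.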
Event-Driven Architecture
Components communicate through events (messages). A producer publishes an event. One or more consumers react to it. This decouples producers from consumers: the producer does not know (or care) who processes the event. Message brokers (Kafka, RabbitMQ, Pub/Sub) sit in between.
Strengths
+ Loose coupling between pipeline components
+ Easy to add new consumers without modifying producers
+ Built-in buffering: the broker absorbs bursts, so consumers process at their own speed
Weaknesses
- Debugging is harder; events flow through multiple systems
- Ordering guarantees vary by broker and configuration
- At-least-once delivery means consumers must handle duplicates
When to use: Microservices architectures, real-time data platforms, and any system where multiple teams produce and consume data independently.
Interview angle: Discuss event ordering, at-least-once vs exactly-once semantics, and how you handle duplicate events. These are the technical depth points interviewers look for.
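Duplicate handling under at-least-once delivery is often explained with a consumer that remembers which event IDs it has already applied. The sketch below is an assumption-laden illustration: IdempotentConsumer and the event fields (event_id, account, amount) are hypothetical, and the in-memory set stands in for what would normally be a durable store with some expiry policy.

```python
from typing import Dict, Set


class IdempotentConsumer:
    """Applies each event at most once, keyed by its event_id."""

    def __init__(self) -> None:
        self.seen_ids: Set[str] = set()    # assumption: durable in production
        self.balances: Dict[str, int] = {}

    def handle(self, event: dict) -> bool:
        """Return True if the event was applied, False if it was a duplicate."""
        if event["event_id"] in self.seen_ids:
            return False
        account = event["account"]
        self.balances[account] = self.balances.get(account, 0) + event["amount"]
        self.seen_ids.add(event["event_id"])
        return True


consumer = IdempotentConsumer()
deposit = {"event_id": "e1", "account": "acct_9", "amount": 10}
consumer.handle(deposit)  # applied
consumer.handle(deposit)  # redelivered by the broker: ignored
print(consumer.balances)  # {'acct_9': 10}
```

This only works if producers attach a stable ID to each event; an ID generated at consume time would differ between deliveries and defeat the check.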
Request-Driven (ETL/ELT)
Traditional scheduled pipelines. An orchestrator (Airflow, Dagster, Prefect) triggers jobs on a schedule or in response to data availability. Each job extracts data from a source, transforms it, and loads it to a target. This is the most common architecture in data warehousing.
Strengths
+ Simple mental model: jobs run on a schedule
+ Mature tooling (Airflow has been production-tested for a decade)
+ Easy to reason about data freshness: the data is current as of the last successful run
Weaknesses
- Latency is bounded by the schedule interval
- Failures require manual intervention or complex retry logic
- Scaling individual jobs is harder than scaling event consumers
When to use: Batch analytics, data warehousing, and any use case where hourly or daily freshness is acceptable.
Interview angle: Most interviewers expect you to be fluent in this pattern. Discuss orchestration (DAGs, dependencies, retries), idempotency (re-running a job produces the same result), and monitoring (how you know a job failed).
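Idempotency is easiest to show by contrasting an append-style load with a partition overwrite. This is a minimal sketch with hypothetical names (extract_for_day, load_append, load_overwrite) and plain Python containers standing in for warehouse tables.

```python
from typing import Dict, List

Row = Dict[str, object]


def extract_for_day(source: List[Row], run_date: str) -> List[Row]:
    """Select only the rows that belong to the run's logical date."""
    return [row for row in source if row["day"] == run_date]


def load_append(table: List[Row], rows: List[Row]) -> None:
    """Not idempotent: every rerun adds the same rows again."""
    table.extend(rows)


def load_overwrite(partitions: Dict[str, List[Row]],
                   run_date: str, rows: List[Row]) -> None:
    """Idempotent: a rerun replaces the whole partition for run_date."""
    partitions[run_date] = list(rows)


source = [
    {"day": "2024-01-01", "user": "u1", "amount": 5},
    {"day": "2024-01-01", "user": "u2", "amount": 7},
    {"day": "2024-01-02", "user": "u1", "amount": 1},
]

appended: List[Row] = []
partitioned: Dict[str, List[Row]] = {}
for _attempt in range(2):  # simulate a retry of the same scheduled run
    rows = extract_for_day(source, "2024-01-01")
    load_append(appended, rows)
    load_overwrite(partitioned, "2024-01-01", rows)

print(len(appended))                   # 4: the retry duplicated the rows
print(len(partitioned["2024-01-01"]))  # 2: same result after the retry
```

The overwrite version is what makes orchestrator retries safe: the job's output depends only on the source data and the run date, not on how many times it ran.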
Six Components of a Production Pipeline
Every pipeline, regardless of architecture pattern, has these components. Interviewers expect you to address each one in a system design answer.
Ingestion
Getting data from sources into your platform. Sources include databases (CDC, full exports), APIs (REST, GraphQL), files (S3, SFTP), and event streams (Kafka, Kinesis). The key decisions: push vs pull, full load vs incremental, and how to handle schema changes from upstream.
Common interview question: How do you handle schema evolution in an ingestion pipeline?
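One common way to answer the schema evolution question is: fail loudly when a required field disappears, fill defaults for optional fields, and preserve unknown new fields instead of dropping them. The sketch below shows that policy under stated assumptions; the field names, REQUIRED_FIELDS, OPTIONAL_DEFAULTS, and the "_extra" column are all hypothetical.

```python
from typing import Any, Dict

REQUIRED_FIELDS = {"id", "email"}                   # hypothetical contract
OPTIONAL_DEFAULTS: Dict[str, Any] = {"plan": "free"}


def normalize(record: Dict[str, Any]) -> Dict[str, Any]:
    """Map an upstream record onto the expected schema."""
    missing = REQUIRED_FIELDS - set(record)
    if missing:
        # A removed required field is a breaking change: stop, do not guess.
        raise ValueError(f"missing required fields: {sorted(missing)}")

    row: Dict[str, Any] = {name: record[name] for name in REQUIRED_FIELDS}
    for name, default in OPTIONAL_DEFAULTS.items():
        row[name] = record.get(name, default)

    # Fields added upstream are kept so they can be promoted to columns later.
    known = REQUIRED_FIELDS | set(OPTIONAL_DEFAULTS)
    row["_extra"] = {k: v for k, v in record.items() if k not in known}
    return row


print(normalize({"id": 2, "email": "b@example.com", "referrer": "ad"}))
```

The design choice worth naming in an interview is the asymmetry: additive changes are absorbed, while removals and type changes halt the load so bad data never reaches downstream tables.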
Transformation
Converting raw data into analytical models. This is where business logic lives: cleaning, joining, aggregating, and reshaping data. Tools: SQL (dbt), Spark, Python. The key decisions: where to transform (in the warehouse vs in a processing framework), when to transform (on ingest vs on read), and how to test transformations.
Common interview question: Do you prefer ETL or ELT? Why?
Storage
Where data lives at rest. Raw data in a data lake (S3, GCS). Structured data in a warehouse (BigQuery, Snowflake, Redshift). Hot data in a serving store (Redis, DynamoDB). The key decisions: file format (Parquet, Avro, ORC), partitioning strategy, and retention policies.
Common interview question: When would you use a data lake vs a data warehouse?
Serving
Making data accessible to consumers. Dashboards (Looker, Tableau), APIs, ML feature stores, or direct SQL access. The key decisions: materialized views vs on-demand computation, access control, and query performance optimization.
Common interview question: How do you optimize query performance for a dashboard that scans a 10TB table?
Orchestration
Coordinating when and how jobs run. Tools: Airflow, Dagster, Prefect, Cloud Composer. The key decisions: scheduling strategy (cron vs event-triggered), dependency management (upstream jobs must complete first), and failure handling (retry, skip, alert).
Common interview question: How do you handle a job that depends on three upstream datasets that arrive at different times?
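A typical answer to this question is a readiness gate: the downstream job runs only once every required upstream partition for the run date has landed, and otherwise reports what it is still waiting for. The sketch below shows the check itself; missing_inputs, the dataset names, and the landed set are hypothetical, and in a real orchestrator this logic would sit inside a sensor or data-aware trigger.

```python
from typing import List, Set, Tuple

Partition = Tuple[str, str]  # (dataset name, partition date)


def missing_inputs(required: Set[str],
                   landed: Set[Partition],
                   run_date: str) -> List[str]:
    """Return the required datasets whose run_date partition has not landed."""
    return sorted(name for name in required if (name, run_date) not in landed)


required = {"orders", "payments", "users"}
landed: Set[Partition] = {
    ("orders", "2024-01-01"),
    ("users", "2024-01-01"),
    ("payments", "2023-12-31"),  # yesterday's partition only
}

waiting_on = missing_inputs(required, landed, "2024-01-01")
print(waiting_on)  # ['payments']

landed.add(("payments", "2024-01-01"))
print(missing_inputs(required, landed, "2024-01-01"))  # []
```

Pair the gate with a timeout: if an input has not arrived by the SLA deadline, alert rather than wait forever or run on partial data.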
Monitoring and Alerting
Knowing when something breaks before your stakeholders do. Data quality checks (row counts, NULL rates, distribution shifts), pipeline health (job duration, failure rates, SLA compliance), and cost monitoring. The key decisions: what to monitor, what thresholds to set, and who gets paged.
Common interview question: What data quality checks would you add to a production pipeline?
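Two of the checks named above, row count and NULL rate, can be sketched as a single function that returns a list of failures. The function name check_batch, its parameters, and the thresholds are illustrative assumptions; real thresholds come from historical baselines for the table.

```python
from typing import Dict, List, Optional


def check_batch(rows: List[Dict[str, Optional[object]]],
                column: str,
                min_rows: int,
                max_null_rate: float) -> List[str]:
    """Return human-readable failures; an empty list means the batch passed."""
    failures: List[str] = []

    if len(rows) < min_rows:
        failures.append(f"row count {len(rows)} below minimum {min_rows}")

    if rows:  # the NULL rate is undefined for an empty batch
        nulls = sum(1 for row in rows if row.get(column) is None)
        null_rate = nulls / len(rows)
        if null_rate > max_null_rate:
            failures.append(
                f"NULL rate for {column} is {null_rate:.0%}, "
                f"above {max_null_rate:.0%}"
            )
    return failures


batch = [{"user_id": 1}, {"user_id": None}, {"user_id": 3}, {"user_id": 4}]
print(check_batch(batch, "user_id", min_rows=3, max_null_rate=0.10))
# ['NULL rate for user_id is 25%, above 10%']
```

Running checks like this before publishing a table, and blocking the publish on failure, is what turns monitoring into prevention.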
7 Architecture Interview Questions
These are the system design questions that test pipeline architecture reasoning.
Q1: Draw the architecture for a pipeline that ingests clickstream data and makes it available for both real-time dashboards and historical analytics.
What they test:
Lambda or Kappa architecture decision. The interviewer wants to see: event ingestion (Kafka), stream processing for real-time, batch processing for historical, and a serving layer. They care about your reasoning for choosing Lambda vs Kappa.
Approach:
Start with requirements: what latency does the dashboard need? What accuracy does the historical report need? If the dashboard can tolerate approximate counts, use a single streaming path (Kappa). If billing depends on exact counts, use Lambda with a batch correction layer.
Q2: How would you migrate a monolithic ETL pipeline to a modern, modular architecture?
What they test:
Practical experience with pipeline evolution. The interviewer wants a phased approach, not a big-bang rewrite. They care about how you handle the transition period when both old and new pipelines coexist.
Approach:
Phase 1: add monitoring to the existing pipeline. Phase 2: decompose into modular jobs (one table per job). Phase 3: introduce orchestration (Airflow). Phase 4: migrate to a modern storage layer (from on-prem to cloud). Each phase delivers value independently.
Q3: Design a pipeline that handles 1 billion events per day with a 5-minute freshness SLA.
What they test:
Scale reasoning. 1B events/day = ~11,600 events/second on average (1,000,000,000 / 86,400), and peak traffic is usually well above the average. The interviewer checks whether you do the math, choose appropriate tools (Kafka + Flink, not a cron job), and discuss partitioning, parallelism, and backpressure.
Approach:
Kafka for ingestion (partitioned by event key). Flink or Spark Structured Streaming for processing. Output to a columnar store (Parquet on S3 or BigQuery). Monitor lag to hit the 5-minute SLA. Discuss what happens when throughput exceeds processing capacity.
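The capacity math is worth doing out loud. The sketch below works through it; the peak factor and the per-consumer throughput are made-up assumptions for illustration, not benchmarks, and should be replaced with measured numbers.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60           # 86,400
events_per_day = 1_000_000_000

avg_events_per_sec = events_per_day / SECONDS_PER_DAY
print(round(avg_events_per_sec))         # 11574

# Assumption: peak traffic is 3x the daily average.
peak_factor = 3
peak_events_per_sec = avg_events_per_sec * peak_factor

# Assumption: one consumer instance sustains 2,000 events/second.
per_consumer_capacity = 2_000
min_partitions = math.ceil(peak_events_per_sec / per_consumer_capacity)
print(min_partitions)                    # 18


def catch_up_seconds(backlog_events: float,
                     inflow_per_sec: float,
                     capacity_per_sec: float) -> float:
    """Time to drain a backlog while new events keep arriving."""
    if capacity_per_sec <= inflow_per_sec:
        return math.inf                  # the backlog never shrinks
    return backlog_events / (capacity_per_sec - inflow_per_sec)


# After a 60-second outage at average load, with 18 consumers at full speed:
backlog = avg_events_per_sec * 60
print(round(catch_up_seconds(backlog, avg_events_per_sec,
                             min_partitions * per_consumer_capacity)))  # 28
```

The catch-up calculation is the link to the SLA: headroom above the inflow rate determines how quickly lag recovers after a stall, and therefore whether a brief outage breaches the 5-minute freshness target.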
Q4: What is the difference between batch, micro-batch, and true streaming?
What they test:
Conceptual clarity. Batch: process a bounded dataset on a schedule (hourly, daily). Micro-batch: process data in small intervals (seconds to a few minutes); this is the default execution model of Spark Structured Streaming. True streaming: process each event individually as it arrives, as Flink and Kafka Streams do.
Approach:
Explain the spectrum: batch has the highest latency but simplest implementation. True streaming has the lowest latency but the most complex state management. Micro-batch is the practical middle ground that most teams choose.
Q5: You discover that your pipeline has been producing incorrect results for 3 days. How do you handle it?
What they test:
Incident response and pipeline design. The interviewer wants to see: immediate triage (what is wrong), impact assessment (who consumed the bad data), root cause analysis, fix, backfill, and prevention.
Approach:
Step 1: assess impact (which tables, which consumers). Step 2: disable the pipeline to prevent more bad data. Step 3: identify root cause. Step 4: fix and validate. Step 5: backfill the 3 affected days (your pipeline must be idempotent for this to work). Step 6: add monitoring to catch this earlier.
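Step 5 can be sketched as a loop that rebuilds each affected daily partition from source. The names date_range, rebuild_partition, and backfill are hypothetical, and a dictionary stands in for a date-partitioned table; the point is that each day is overwritten, so the backfill can itself be rerun safely.

```python
from datetime import date, timedelta
from typing import Dict, Iterator, List, Tuple

SourceRow = Tuple[date, int]  # (event day, amount)


def date_range(start: date, end: date) -> Iterator[date]:
    """Yield every day from start to end, inclusive."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)


def rebuild_partition(day: date, source: List[SourceRow],
                      table: Dict[date, int]) -> None:
    """Recompute one day's total from source and overwrite the partition."""
    table[day] = sum(amount for d, amount in source if d == day)


def backfill(start: date, end: date, source: List[SourceRow],
             table: Dict[date, int]) -> List[date]:
    """Rebuild every partition in the affected window; return the days touched."""
    days = list(date_range(start, end))
    for day in days:
        rebuild_partition(day, source, table)
    return days


source = [(date(2024, 3, 1), 10), (date(2024, 3, 2), 20),
          (date(2024, 3, 2), 5), (date(2024, 3, 3), 7)]
# Three days of incorrect results written by the buggy pipeline.
table = {date(2024, 3, 1): 0, date(2024, 3, 2): 999, date(2024, 3, 3): -1}

fixed_days = backfill(date(2024, 3, 1), date(2024, 3, 3), source, table)
print(len(fixed_days), table[date(2024, 3, 2)])  # 3 25
```

A backfill like this assumes the source data for the affected window is still available, which is one reason to keep raw data under a retention policy longer than the longest plausible detection delay.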
Q6: How do you decide between building a custom pipeline vs using a managed service?
What they test:
Engineering judgment. The interviewer wants to hear cost analysis, maintenance burden, flexibility, and team expertise. They do not want a dogmatic answer.
Approach:
Managed services (Fivetran, Stitch) for standard SaaS connectors. Custom pipelines for proprietary data sources, complex transformation logic, or performance requirements that managed services cannot meet. The tiebreaker is often the team: do you have engineers who can maintain custom code?
Q7: Explain how you would implement exactly-once processing in a streaming pipeline.
What they test:
Deep understanding of distributed systems semantics. In practice, exactly-once means at-least-once delivery plus idempotent or transactional processing, so that each event's effect is applied once even if the event is delivered more than once. The interviewer wants you to explain: checkpointing (Flink checkpoints), transactional writes (Kafka transactions), and deduplication at the consumer.
Approach:
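Exactly-once delivery is not something a distributed system can guarantee; what you can achieve is exactly-once effect, through at-least-once delivery plus idempotent or transactional sinks. Use Kafka transactions so that output records and consumer offsets are committed atomically. Use Flink checkpointing to snapshot state consistently with input positions. Use MERGE or upsert at the sink for idempotent writes.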
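The underlying trick, committing the output and the input position together, can be shown without Kafka or Flink. The sketch below uses SQLite from the Python standard library as a stand-in sink and stores the next offset in the same transaction as the results. It is an analogy for the transactional approach, not how any particular framework implements it; the table names and process_batch are hypothetical.

```python
import sqlite3
from typing import List, Tuple


def make_sink() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE totals (key TEXT PRIMARY KEY, total INTEGER NOT NULL)")
    conn.execute(
        "CREATE TABLE progress (id INTEGER PRIMARY KEY CHECK (id = 1), "
        "next_offset INTEGER NOT NULL)")
    conn.execute("INSERT INTO progress VALUES (1, 0)")
    conn.commit()
    return conn


def process_batch(conn: sqlite3.Connection,
                  events: List[Tuple[str, int]],
                  start_offset: int) -> None:
    """Apply (key, amount) events whose log offsets begin at start_offset."""
    (committed,) = conn.execute(
        "SELECT next_offset FROM progress WHERE id = 1").fetchone()
    with conn:  # one transaction: commit on success, roll back on error
        for offset, (key, amount) in enumerate(events, start=start_offset):
            if offset < committed:
                continue  # redelivered event: its effect is already stored
            conn.execute("INSERT OR IGNORE INTO totals VALUES (?, 0)", (key,))
            conn.execute(
                "UPDATE totals SET total = total + ? WHERE key = ?",
                (amount, key))
            conn.execute(
                "UPDATE progress SET next_offset = ? WHERE id = 1",
                (offset + 1,))


conn = make_sink()
process_batch(conn, [("a", 1), ("b", 2)], start_offset=0)
process_batch(conn, [("a", 1), ("b", 2)], start_offset=0)  # full redelivery
process_batch(conn, [("b", 2), ("a", 5)], start_offset=1)  # partial overlap
print(conn.execute("SELECT key, total FROM totals ORDER BY key").fetchall())
# [('a', 6), ('b', 2)]
```

Because the totals and the offset change in one transaction, a crash leaves both at their previous values, and the redelivered events are either applied in full or recognized as already applied.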
Design Pipelines That Survive Production
Architecture knowledge gets you through the system design round. Hands-on practice with SQL and Python gets you through the coding rounds. Both matter.