Pipeline Design

Data Pipeline Architecture Patterns for Interviews

System design rounds test whether you can reason about data flow at scale. You need to know the major architecture patterns (Lambda, Kappa, event-driven, request-driven), the components that make up a production pipeline, and how to draw and explain an architecture clearly. Here you will learn the four major patterns, the six components of a production pipeline, and how to answer architecture questions in interviews.

Four Architecture Patterns

Each pattern solves the same fundamental problem differently: how to get data from sources to consumers reliably, at the right freshness, at scale.

Lambda Architecture

Runs batch and streaming in parallel. The batch layer processes historical data for accuracy. The speed layer processes real-time data for low latency. A serving layer merges both views. The idea: batch is your source of truth; streaming gives you approximate real-time numbers until the batch catches up.

Strengths

+ Accurate historical data from batch layer

+ Low-latency approximations from speed layer

+ Well-suited for use cases where both accuracy and freshness matter

Weaknesses

- Two codebases: one for batch, one for streaming

- Complexity of maintaining two parallel systems

- The merge logic in the serving layer can be tricky

When to use: When you need both real-time dashboards and accurate historical reports. Example: ad click analytics where the dashboard shows live counts, but the billing system uses batch-computed totals.

Interview angle: Interviewers ask: 'Why not just use streaming for everything?' The answer: streaming gives approximate results due to late-arriving data, out-of-order events, and incomplete windows. Batch gives exact results because it processes complete datasets. Lambda gives you both.
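The serving-layer merge described above can be sketched in a few lines. This is a hedged illustration, not a production implementation: the dictionaries stand in for batch and speed views, and the hourly keys, the `batch_watermark` cutoff, and the function name are all illustrative.

```python
# Sketch of a Lambda serving-layer merge: batch values are exact up to a
# watermark; speed-layer values fill in the hours batch hasn't reached yet.

def merge_views(batch_counts, speed_counts, batch_watermark):
    """Return counts per hour: batch values up to the watermark (exact),
    speed values after it (approximate)."""
    merged = dict(batch_counts)  # batch is the source of truth
    for hour, count in speed_counts.items():
        if hour > batch_watermark:  # batch hasn't caught up yet
            merged[hour] = count
    return merged

batch = {"2024-01-01T00": 1000, "2024-01-01T01": 950}
speed = {"2024-01-01T01": 948, "2024-01-01T02": 120}  # T01 is a stale approximation
print(merge_views(batch, speed, "2024-01-01T01"))
# batch wins for T01; speed fills in T02
```

The subtlety interviewers probe is exactly this overlap: for any hour both layers cover, the batch value must win once it lands.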

Kappa Architecture

Streaming only. All data is treated as a stream of events. There is no separate batch layer. Historical reprocessing happens by replaying the event stream from the beginning (or from a checkpoint). The event log (Kafka, Kinesis) is the single source of truth.

Strengths

+ Single codebase for both real-time and historical

+ Simpler than Lambda; no merge logic

+ Natural fit for event-sourced systems

Weaknesses

- Reprocessing the entire stream is expensive for large datasets

- Not all problems fit the event streaming model

- Requires a durable, replayable event log (Kafka with long retention)

When to use: When your use case is naturally event-driven (user activity streams, IoT sensor data) and you can afford to reprocess the stream for corrections.

Interview angle: The interviewer wants to hear: Kappa works when your event log is the source of truth and reprocessing is feasible. It does not work when batch sources (database snapshots, file exports) are your primary input. Know when Kappa is appropriate and when it is not.
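The core Kappa idea, that reprocessing is just a replay, can be shown with a toy event log. This is a sketch only: the in-memory list stands in for Kafka, and the `process`/`replay` names and the user-count logic are illustrative.

```python
# Kappa-style reprocessing: the event log is the source of truth, and
# "historical" recomputation is a replay of the log from offset 0
# (or from a checkpoint offset).

def process(event, state):
    """Fold one event into the running state (here: events per user)."""
    state[event["user"]] = state.get(event["user"], 0) + 1
    return state

def replay(log, from_offset=0):
    """Rebuild state by re-reading the log from a given offset."""
    state = {}
    for event in log[from_offset:]:
        state = process(event, state)
    return state

log = [{"user": "a"}, {"user": "b"}, {"user": "a"}]
print(replay(log))                  # full reprocess from offset 0
print(replay(log, from_offset=2))   # resume from a checkpoint
```

Fixing a bug in `process` and re-running `replay` is the Kappa equivalent of a batch backfill, which is why the log must be durable and retained long enough to replay.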

Event-Driven Architecture

Components communicate through events (messages). A producer publishes an event. One or more consumers react to it. This decouples producers from consumers: the producer does not know (or care) who processes the event. Message brokers (Kafka, RabbitMQ, Pub/Sub) sit in between.

Strengths

+ Loose coupling between pipeline components

+ Easy to add new consumers without modifying producers

+ Natural backpressure: consumers process at their own speed

Weaknesses

- Debugging is harder; events flow through multiple systems

- Ordering guarantees vary by broker and configuration

- At-least-once delivery means consumers must handle duplicates

When to use: Microservices architectures, real-time data platforms, and any system where multiple teams produce and consume data independently.

Interview angle: Discuss event ordering, at-least-once vs exactly-once semantics, and how you handle duplicate events. These are the technical depth points interviewers look for.
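The duplicate-handling point above is worth being able to sketch. This is a minimal illustration, not a production pattern: the `seen` set would live in a durable keyed store in a real consumer, and the event shape and class name are assumptions.

```python
# Idempotent consumer sketch: because delivery is at-least-once, the
# consumer tracks processed event IDs and skips redeliveries.

class IdempotentConsumer:
    def __init__(self):
        self.seen = set()   # in production: a persistent keyed store
        self.total = 0

    def handle(self, event):
        if event["id"] in self.seen:  # duplicate redelivery: skip
            return False
        self.seen.add(event["id"])
        self.total += event["amount"]
        return True

c = IdempotentConsumer()
for e in [{"id": 1, "amount": 10}, {"id": 1, "amount": 10}, {"id": 2, "amount": 5}]:
    c.handle(e)
print(c.total)  # 15, not 25: the duplicate was dropped
```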

Request-Driven (ETL/ELT)

Traditional scheduled pipelines. An orchestrator (Airflow, Dagster, Prefect) triggers jobs on a schedule or in response to data availability. Each job extracts data from a source, transforms it, and loads it to a target. This is the most common architecture in data warehousing.

Strengths

+ Simple mental model: jobs run on a schedule

+ Mature tooling (Airflow has been production-tested for a decade)

+ Easy to reason about data freshness: the data is current as of the last successful run

Weaknesses

- Latency is bounded by the schedule interval

- Failures require manual intervention or complex retry logic

- Scaling individual jobs is harder than scaling event consumers

When to use: Batch analytics, data warehousing, and any use case where hourly or daily freshness is acceptable.

Interview angle: Most interviewers expect you to be fluent in this pattern. Discuss orchestration (DAGs, dependencies, retries), idempotency (re-running a job produces the same result), and monitoring (how you know a job failed).
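The DAG-and-dependencies idea at the heart of this pattern can be sketched without any orchestrator. The snippet below resolves job dependencies into a valid run order using Kahn's algorithm; real tools like Airflow add scheduling, retries, and state tracking on top. The job names are illustrative.

```python
# How an orchestrator resolves a DAG into a run order (Kahn's algorithm):
# a job becomes runnable only when all of its upstreams have finished.

from collections import deque

def run_order(deps):
    """deps maps job -> list of upstream jobs that must finish first."""
    indegree = {job: len(up) for job, up in deps.items()}
    downstream = {job: [] for job in deps}
    for job, ups in deps.items():
        for up in ups:
            downstream[up].append(job)
    ready = deque(sorted(j for j, d in indegree.items() if d == 0))
    order = []
    while ready:
        job = ready.popleft()
        order.append(job)
        for nxt in downstream[job]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    return order

deps = {"extract": [], "transform": ["extract"],
        "load": ["transform"], "report": ["load"]}
print(run_order(deps))  # ['extract', 'transform', 'load', 'report']
```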

Six Components of a Production Pipeline

Every pipeline, regardless of architecture pattern, has these components. Interviewers expect you to address each one in a system design answer.

Ingestion

Getting data from sources into your platform. Sources include databases (CDC, full exports), APIs (REST, GraphQL), files (S3, SFTP), and event streams (Kafka, Kinesis). The key decisions: push vs pull, full load vs incremental, and how to handle schema changes from upstream.

Common interview question: How do you handle schema evolution in an ingestion pipeline?
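One hedged answer to the schema-evolution question is to conform records to a known target schema rather than failing the load: missing columns become NULLs, and unexpected new columns are quarantined for review. The schema, field names, and function below are illustrative.

```python
# Tolerant ingestion under schema drift: conform each record to the
# target schema, and route unexpected fields to a quarantine area
# instead of failing the whole load.

TARGET_SCHEMA = ["id", "email", "created_at"]  # illustrative

def conform(record, schema=TARGET_SCHEMA):
    row = {col: record.get(col) for col in schema}          # missing -> None
    extras = {k: v for k, v in record.items() if k not in schema}
    return row, extras  # extras go to a quarantine/landing table

row, extras = conform({"id": 1, "email": "a@b.c", "plan": "pro"})
print(row)     # {'id': 1, 'email': 'a@b.c', 'created_at': None}
print(extras)  # {'plan': 'pro'}
```

The trade-off to mention in an interview: this keeps the pipeline running through additive changes, but renames and type changes still need an explicit migration.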

Transformation

Converting raw data into analytical models. This is where business logic lives: cleaning, joining, aggregating, and reshaping data. Tools: SQL (dbt), Spark, Python. The key decisions: where to transform (in the warehouse vs in a processing framework), when to transform (on ingest vs on read), and how to test transformations.

Common interview question: Do you prefer ETL or ELT? Why?

Storage

Where data lives at rest. Raw data in a data lake (S3, GCS). Structured data in a warehouse (BigQuery, Snowflake, Redshift). Hot data in a serving store (Redis, DynamoDB). The key decisions: file format (Parquet, Avro, ORC), partitioning strategy, and retention policies.

Common interview question: When would you use a data lake vs a data warehouse?
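The partitioning-strategy decision often comes down to the key layout. Below is a sketch of Hive-style date partitioning for a data lake, where the path structure lets query engines prune partitions; the bucket name, table name, and function are illustrative.

```python
# Hive-style partition paths: files land under year=/month=/day= prefixes
# so a query filtered on date only scans the matching partitions.

from datetime import datetime, timezone

def partition_key(table, event_time, fmt="parquet"):
    dt = datetime.fromtimestamp(event_time, tz=timezone.utc)
    return (f"s3://lake/{table}/"
            f"year={dt.year}/month={dt.month:02d}/day={dt.day:02d}/"
            f"part-{int(event_time)}.{fmt}")

print(partition_key("clicks", 1704067200))
# s3://lake/clicks/year=2024/month=01/day=01/part-1704067200.parquet
```

A good interview follow-up: partition on the column you filter by most (usually event date), and avoid high-cardinality partition keys that create millions of tiny files.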

Serving

Making data accessible to consumers. Dashboards (Looker, Tableau), APIs, ML feature stores, or direct SQL access. The key decisions: materialized views vs on-demand computation, access control, and query performance optimization.

Common interview question: How do you optimize query performance for a dashboard that scans a 10TB table?

Orchestration

Coordinating when and how jobs run. Tools: Airflow, Dagster, Prefect, Cloud Composer. The key decisions: scheduling strategy (cron vs event-triggered), dependency management (upstream jobs must complete first), and failure handling (retry, skip, alert).

Common interview question: How do you handle a job that depends on three upstream datasets that arrive at different times?
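The "three upstreams arriving at different times" question has a standard shape: a sensor-style gate that triggers the downstream job only when every required dataset has landed. The sketch below is illustrative; in Airflow this role is played by sensors or dataset-triggered scheduling.

```python
# Sensor-style readiness gate: the downstream job runs only once all
# required upstream datasets have arrived. Dataset names are illustrative.

REQUIRED = {"orders", "customers", "payments"}

def ready_to_run(landed):
    """Return (ready, missing): ready is True only when nothing is missing."""
    missing = REQUIRED - set(landed)
    return len(missing) == 0, sorted(missing)

print(ready_to_run({"orders"}))                           # not ready yet
print(ready_to_run({"orders", "customers", "payments"}))  # (True, [])
```

In the interview, also mention the timeout policy: what happens if one upstream never arrives (alert, run with partial data, or skip the interval).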

Monitoring and Alerting

Knowing when something breaks before your stakeholders do. Data quality checks (row counts, NULL rates, distribution shifts), pipeline health (job duration, failure rates, SLA compliance), and cost monitoring. The key decisions: what to monitor, what thresholds to set, and who gets paged.

Common interview question: What data quality checks would you add to a production pipeline?
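The checks named above (row counts, NULL rates) can be sketched directly. This is a minimal illustration with made-up thresholds; in practice teams express these as dbt tests or a framework like Great Expectations.

```python
# Minimal data quality gate: a row-count floor and a NULL-rate ceiling
# on a key column. Thresholds and the column name are illustrative.

def quality_checks(rows, min_rows=1, max_null_rate=0.05, key="email"):
    failures = []
    if len(rows) < min_rows:
        failures.append("row_count")
    if rows:
        null_rate = sum(1 for r in rows if r.get(key) is None) / len(rows)
        if null_rate > max_null_rate:
            failures.append(f"null_rate:{key}")
    return failures  # empty list means the batch passes

rows = [{"email": "a@b.c"}, {"email": None}, {"email": None}, {"email": "d@e.f"}]
print(quality_checks(rows))  # ['null_rate:email'] — 50% NULLs exceeds 5%
```

A strong interview answer pairs each check with an action: does a failure block the load, quarantine the batch, or just alert?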

Seven Architecture Interview Questions

These are the system design questions that test pipeline architecture reasoning.

Q1: Draw the architecture for a pipeline that ingests clickstream data and makes it available for both real-time dashboards and historical analytics.

What they test:

Lambda or Kappa architecture decision. The interviewer wants to see: event ingestion (Kafka), stream processing for real-time, batch processing for historical, and a serving layer. They care about your reasoning for choosing Lambda vs Kappa.

Approach:

Start with requirements: what latency does the dashboard need? What accuracy does the historical report need? If the dashboard can tolerate approximate counts, use a single streaming path (Kappa). If billing depends on exact counts, use Lambda with a batch correction layer.

Q2: How would you migrate a monolithic ETL pipeline to a modern, modular architecture?

What they test:

Practical experience with pipeline evolution. The interviewer wants a phased approach, not a big-bang rewrite. They care about how you handle the transition period when both old and new pipelines coexist.

Approach:

Phase 1: add monitoring to the existing pipeline. Phase 2: decompose into modular jobs (one table per job). Phase 3: introduce orchestration (Airflow). Phase 4: migrate to a modern storage layer (from on-prem to cloud). Each phase delivers value independently.

Q3: Design a pipeline that handles 1 billion events per day with a 5-minute freshness SLA.

What they test:

Scale reasoning. 1B events/day works out to roughly 11,600 events/second on average, with peaks well above that. The interviewer checks whether you do the math, choose appropriate tools (Kafka + Flink, not a cron job), and discuss partitioning, parallelism, and backpressure.

Approach:

Kafka for ingestion (partitioned by event key). Flink or Spark Structured Streaming for processing. Output to a columnar store (Parquet on S3 or BigQuery). Monitor lag to hit the 5-minute SLA. Discuss what happens when throughput exceeds processing capacity.
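The back-of-envelope math for this question is worth doing out loud. The sketch below makes it explicit; the 3x peak factor and the per-partition throughput ceiling are illustrative assumptions, not sizing rules.

```python
# Throughput back-of-envelope for 1B events/day. Peak factor and
# per-partition ceiling are illustrative assumptions.

events_per_day = 1_000_000_000
avg_eps = events_per_day / 86_400               # seconds per day
peak_eps = avg_eps * 3                          # assume a 3x peak factor
per_partition_eps = 5_000                       # assumed per-partition ceiling
partitions = -(-peak_eps // per_partition_eps)  # ceiling division

print(round(avg_eps))   # ~11574 events/second on average
print(round(peak_eps))  # ~34722 at the assumed 3x peak
print(int(partitions))  # 7 partitions under these assumptions
```

The point is not the exact numbers but showing the interviewer that tool choices (Kafka partitions, stream-processor parallelism) follow from the arithmetic.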

Q4: What is the difference between batch, micro-batch, and true streaming?

What they test:

Conceptual clarity. Batch: process all data at once on a schedule (hourly, daily). Micro-batch: process data in small intervals (every 30 seconds to 5 minutes), implemented by Spark Structured Streaming. True streaming: process each event individually as it arrives, implemented by Flink and Kafka Streams.

Approach:

Explain the spectrum: batch has the highest latency but simplest implementation. True streaming has the lowest latency but the most complex state management. Micro-batch is the practical middle ground that most teams choose.

Q5: You discover that your pipeline has been producing incorrect results for 3 days. How do you handle it?

What they test:

Incident response and pipeline design. The interviewer wants to see: immediate triage (what is wrong), impact assessment (who consumed the bad data), root cause analysis, fix, backfill, and prevention.

Approach:

Step 1: assess impact (which tables, which consumers). Step 2: disable the pipeline to prevent more bad data. Step 3: identify root cause. Step 4: fix and validate. Step 5: backfill the 3 affected days (your pipeline must be idempotent for this to work). Step 6: add monitoring to catch this earlier.
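The idempotency requirement in Step 5 can be shown concretely: a backfill that overwrites each day's partition (delete-then-insert, or partition replacement) converges to the same result no matter how many times it runs. The dict below stands in for a partitioned warehouse table; all names are illustrative.

```python
# Idempotent backfill sketch: each day's partition is rewritten whole,
# never appended to, so re-running the backfill is always safe.

def backfill_day(table, day, recomputed_rows):
    table[day] = list(recomputed_rows)  # overwrite the partition, never append
    return table

table = {"2024-01-01": ["bad"], "2024-01-02": ["bad"]}
for day in ["2024-01-01", "2024-01-02"]:
    backfill_day(table, day, ["good"])
    backfill_day(table, day, ["good"])  # re-run: same result, no duplicates
print(table)  # {'2024-01-01': ['good'], '2024-01-02': ['good']}
```

An append-only pipeline would double the rows on the second run, which is why non-idempotent pipelines make incidents like this so much worse.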

Q6: How do you decide between building a custom pipeline vs using a managed service?

What they test:

Engineering judgment. The interviewer wants to hear cost analysis, maintenance burden, flexibility, and team expertise. They do not want a dogmatic answer.

Approach:

Managed services (Fivetran, Stitch) for standard SaaS connectors. Custom pipelines for proprietary data sources, complex transformation logic, or performance requirements that managed services cannot meet. The tiebreaker is often the team: do you have engineers who can maintain custom code?

Q7: Explain how you would implement exactly-once processing in a streaming pipeline.

What they test:

Deep understanding of distributed systems semantics. Exactly-once is technically at-least-once delivery plus idempotent processing. The interviewer wants you to explain: checkpointing (Flink checkpoints and savepoints), transactional writes (Kafka transactions), and deduplication at the consumer.

Approach:

True exactly-once delivery is impossible in a distributed system; what you can achieve is exactly-once results, via at-least-once delivery plus idempotent sinks. Use Kafka transactions for produce-and-commit atomicity. Use Flink checkpointing for stateful processing. Use MERGE or upsert at the sink for idempotent writes.
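The "idempotent sink" half of that answer can be sketched in a few lines: the sink upserts by event ID (MERGE semantics), so a redelivered event overwrites its own earlier write instead of double-counting. The class below is an illustration only; the dict stands in for a keyed table.

```python
# Exactly-once results from at-least-once delivery: the sink upserts by
# event ID, so redeliveries replace rather than duplicate.

class UpsertSink:
    def __init__(self):
        self.rows = {}  # event_id -> row, standing in for a keyed table

    def write(self, event):
        self.rows[event["id"]] = event  # MERGE semantics: insert or replace

sink = UpsertSink()
sink.write({"id": "e1", "value": 10})
sink.write({"id": "e1", "value": 10})  # redelivery: no duplicate row
sink.write({"id": "e2", "value": 5})
print(len(sink.rows), sum(r["value"] for r in sink.rows.values()))  # 2 15
```

This composes with the delivery layer: Kafka transactions and Flink checkpoints bound how much is replayed after a failure, and the idempotent sink makes the replay harmless.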

Pipeline Architecture FAQ

What pipeline architecture do most companies use?
Request-driven (scheduled ETL/ELT) is by far the most common. Airflow is the dominant orchestrator. Most companies run batch pipelines on hourly or daily schedules. Streaming architectures are used for specific use cases (real-time dashboards, fraud detection) but are not the default.
Should I learn Lambda or Kappa architecture?
Learn both concepts, but focus on Kappa for interviews. Lambda is seen as overly complex for most use cases. In practice, most teams either use pure batch (ELT) or Kappa-style streaming. Lambda was influential as a concept but is rarely implemented in its full form.
What tools should I know for pipeline architecture?
Airflow (orchestration), Kafka (streaming), Spark (batch/micro-batch processing), dbt (SQL transformation), and a cloud warehouse (BigQuery, Snowflake, Redshift). These five cover the vast majority of pipeline architectures you will discuss in interviews.
How do interviewers expect me to draw a pipeline architecture?
Boxes and arrows, drawn left to right from sources to consumers. Each box is a component (source, ingestion, processing, storage, serving); arrows show data flow. Label each component with the specific tool you would use, and annotate the key decisions: partitioning strategy, processing interval, SLA.
What is the most common mistake in pipeline architecture interviews?
Jumping to tools before understanding requirements. The interviewer wants to hear: what is the data volume, what is the freshness requirement, who consumes the output, and what are the correctness guarantees? Once you establish these, the tool choices follow logically.

Design Pipelines That Survive Production

Architecture knowledge gets you through the system design round. Hands-on practice with SQL and Python gets you through the coding rounds. Both matter.
