Data Pipeline Architecture Guide (2026)

Pipeline architecture is the set of contracts between four layers: ingestion, transformation, storage, and serving. Each layer has its own SLA, its own failure mode, and its own blast radius when it breaks. A well-architected pipeline makes those contracts explicit so the team downstream of you knows exactly when to trust the data and when to wait.


This guide covers the batch-versus-streaming decision, the four layers in plain terms, the three patterns interviewers expect you to know, and four design questions, each with the answer that signals senior-level judgment.

Batch, streaming, hybrid

The first decision in a pipeline design cascades into every other one: state management, ordering guarantees, exactly-once semantics, observability tooling. Pick batch when latency can be measured in hours. Pick streaming when sub-minute freshness is the requirement, not the aspiration. Pick hybrid when analytics needs both fresh dashboards and reprocessable history.

Aspect	Batch	Streaming	Hybrid
Latency	Minutes to hours	Seconds to milliseconds	Batch for historical, streaming for real-time
Processing model	Process bounded datasets on a schedule	Process unbounded event streams continuously	Both, unified through a serving layer
Complexity	Lower; well-understood patterns	Higher; ordering, late data, exactly-once	Highest; two code paths to maintain
Tools	Airflow, dbt, Spark, Snowflake Tasks	Kafka, Flink, Spark Streaming, Kinesis	Both tool sets, often with a shared serving layer
Cost model	Pay per run; scales with data volume	Always-on infrastructure; baseline cost is higher	Batch cost + streaming infrastructure cost
Best for	Reporting, analytics, ML training, backfills	Real-time dashboards, fraud detection, alerting	Systems needing both historical accuracy and real-time freshness
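The rule of thumb above can be written down as a small decision function. This is a minimal sketch, not a sizing tool: the function name, its parameters, and the choice to treat anything slower than one minute as batch are assumptions made for illustration.

```python
from datetime import timedelta


def choose_processing_model(
    max_staleness: timedelta,
    needs_reprocessable_history: bool = False,
) -> str:
    """Hypothetical helper encoding the batch / streaming / hybrid rule of thumb.

    max_staleness: how old the data may be before consumers are unhappy.
    needs_reprocessable_history: True when analytics must also be able to
    recompute results over historical data.
    """
    # Assumption: "sub-minute freshness" is the threshold for streaming.
    needs_fresh_data = max_staleness < timedelta(minutes=1)
    if needs_fresh_data and needs_reprocessable_history:
        return "hybrid"
    if needs_fresh_data:
        return "streaming"
    # Minutes-to-hours latency is served by scheduled (or micro-) batches.
    return "batch"
```

For example, a daily report with `max_staleness=timedelta(hours=24)` comes out as batch, while a fraud check with a five-second budget comes out as streaming. Real decisions also weigh cost and team experience, which this sketch ignores.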

The four layers

Ingestion Layer. The ingestion layer extracts data from source systems and lands it in the pipeline. Batch ingestion pulls data on a schedule (daily, hourly) from databases, APIs, and file drops. Streaming ingestion captures events in real time from message queues, CDC streams, and webhooks. The ingestion layer must handle schema drift (source columns changing), late-arriving data, deduplication, and backpressure (when the source produces data faster than the pipeline can consume). Tools: Fivetran, Airbyte, Kafka Connect, Debezium, AWS DMS, custom Python scripts.
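Two of the ingestion responsibilities named above, schema drift and deduplication, can be illustrated in a few lines. The sketch below assumes records arrive as dictionaries and that each carries a business key and a version column; the column names and the `EXPECTED_COLUMNS` contract are hypothetical.

```python
from typing import Iterable

# Hypothetical contract agreed with the source team.
EXPECTED_COLUMNS = {"order_id", "status", "updated_at"}


def detect_schema_drift(record: dict, expected: set = EXPECTED_COLUMNS) -> dict:
    """Report columns the source added or dropped relative to the contract."""
    actual = set(record)
    return {
        "added": sorted(actual - expected),
        "missing": sorted(expected - actual),
    }


def deduplicate_latest(
    records: Iterable[dict],
    key: str = "order_id",
    version: str = "updated_at",
) -> list:
    """Keep one record per key: the one with the highest version value."""
    latest = {}
    for rec in records:
        current = latest.get(rec[key])
        if current is None or rec[version] >= current[version]:
            latest[rec[key]] = rec
    return list(latest.values())


incoming = [
    {"order_id": 1, "status": "pending", "updated_at": 1},
    {"order_id": 1, "status": "shipped", "updated_at": 2},
    {"order_id": 2, "status": "pending", "updated_at": 1},
    {"order_id": 1, "status": "shipped", "updated_at": 2},  # exact duplicate
]
deduplicated = deduplicate_latest(incoming)
```

Whether drift should fail the load or only raise an alert is a policy decision the sketch leaves open.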

Transformation Layer. The transformation layer converts raw ingested data into a format suitable for analysis. This includes cleaning (handling NULLs, deduplication, type casting), enrichment (joining reference data), aggregation (pre-computing metrics), and modeling (building dimensional models, fact tables). In ELT architectures, transformations happen inside the data warehouse using SQL. In ETL architectures, transformations happen before loading, often in Spark or Python. Tools: dbt, Spark, Flink, Snowflake SQL, BigQuery SQL, Dataflow.

Storage Layer. The storage layer persists data at various stages of the pipeline. Raw data lands in object storage (S3, GCS) or a data lake. Transformed data lives in a data warehouse (Snowflake, BigQuery, Redshift) or a lakehouse (Databricks, Iceberg). The storage layer must support partitioning (organize data by date, region), compaction (merge small files), time travel (query historical versions), and access control (restrict who sees what). Storage format matters: Parquet and ORC for columnar analytics, Avro for schema evolution, Delta/Iceberg for ACID transactions on lakes. Tools: S3, GCS, Snowflake, BigQuery, Redshift, Databricks, Apache Iceberg, Delta Lake.

Serving Layer. The serving layer exposes processed data to consumers: BI dashboards, APIs, ML models, and operational applications. The serving layer must optimize for the access patterns of its consumers. Dashboards need fast aggregation queries (pre-aggregated tables, materialized views). APIs need low-latency point lookups (Redis, DynamoDB). ML models need feature stores (Feast, Tecton). The serving layer is where SLAs live: query latency, freshness guarantees, and availability targets. Tools: Looker, Tableau, Metabase, Redis, Feast, dbt metrics, Cube.dev.
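A freshness guarantee is only useful if something checks it. The following is a minimal sketch of such a check, assuming the pipeline records a last-successful-load timestamp per table; the function name and the one-hour SLA are illustrative, not taken from any particular tool.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def freshness_status(
    last_loaded_at: datetime,
    max_staleness: timedelta,
    now: Optional[datetime] = None,
) -> dict:
    """Compare the age of the newest load against a freshness SLA."""
    now = now or datetime.now(timezone.utc)
    lag = now - last_loaded_at
    return {"lag_seconds": lag.total_seconds(), "within_sla": lag <= max_staleness}


# Example: the table was last loaded 45 minutes before the check,
# and the (hypothetical) SLA allows one hour of staleness.
check_time = datetime(2025, 3, 14, 12, 0, tzinfo=timezone.utc)
status = freshness_status(
    last_loaded_at=datetime(2025, 3, 14, 11, 15, tzinfo=timezone.utc),
    max_staleness=timedelta(hours=1),
    now=check_time,
)
```

A dashboard can surface `within_sla` next to the data so that consumers know when to trust it and when to wait.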

Ingestion Layer: CDC and Incremental Load

-- CDC ingestion with Debezium (logical replication)
-- Captures INSERT, UPDATE, DELETE as events

-- Kafka topic: dbserver.public.orders
-- Each message contains:
-- {
--   "op": "u",  -- operation: c=create, u=update, d=delete
--   "before": { "order_id": 1, "status": "pending" },
--   "after":  { "order_id": 1, "status": "shipped" },
--   "ts_ms": 1710000000000
-- }

-- Batch ingestion: incremental load pattern
SELECT *
FROM source_db.orders
WHERE updated_at > :last_watermark
  AND updated_at <= :current_watermark;
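The watermark query above only works if the pipeline persists the upper bound of each run and uses it as the lower bound of the next. Below is a minimal sketch of that bookkeeping using Python's built-in `sqlite3` as a stand-in source database; the table layout and timestamps are invented for the example.

```python
import sqlite3


def incremental_load(conn, last_watermark: str, current_watermark: str):
    """Read rows changed in (last_watermark, current_watermark].

    The lower bound is exclusive and the upper bound inclusive, so
    consecutive runs neither skip nor re-read a row.
    """
    rows = conn.execute(
        "SELECT order_id, status, updated_at FROM orders "
        "WHERE updated_at > ? AND updated_at <= ? "
        "ORDER BY updated_at",
        (last_watermark, current_watermark),
    ).fetchall()
    # The caller stores current_watermark and passes it as last_watermark next run.
    return rows, current_watermark


# Stand-in source table (hypothetical data, ISO-8601 strings sort chronologically).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, status TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "shipped", "2025-03-13T09:00:00"),
        (2, "pending", "2025-03-14T10:00:00"),
        (3, "pending", "2025-03-14T12:30:00"),
    ],
)

first_batch, watermark = incremental_load(
    conn, "2025-03-14T00:00:00", "2025-03-14T12:00:00"
)
second_batch, watermark = incremental_load(conn, watermark, "2025-03-15T00:00:00")
```

This pattern misses hard deletes and depends on `updated_at` being set reliably by the source, which is one reason CDC is often preferred for operational databases.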

Common tools: Fivetran, Airbyte, Kafka Connect, Debezium, AWS DMS, custom Python scripts
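To make the change-event format in this section concrete, here is a minimal sketch of a consumer that folds events into current state. It assumes events shaped like the example message (with `op`, `before`, and `after` fields) and keeps the table as an in-memory dictionary keyed by `order_id`; a real consumer would read from Kafka and write to a store.

```python
def apply_change_event(table: dict, event: dict) -> None:
    """Apply one change event to a table held as {order_id: row}."""
    op = event["op"]
    if op in ("c", "u"):
        # Creates and updates are both upserts of the "after" image.
        row = event["after"]
        table[row["order_id"]] = row
    elif op == "d":
        # Deletes carry the old row in "before"; ignore already-missing keys.
        table.pop(event["before"]["order_id"], None)
    else:
        raise ValueError(f"unsupported operation: {op!r}")


events = [
    {"op": "c", "before": None, "after": {"order_id": 1, "status": "pending"}},
    {"op": "u", "before": {"order_id": 1, "status": "pending"},
     "after": {"order_id": 1, "status": "shipped"}},
    {"op": "c", "before": None, "after": {"order_id": 2, "status": "pending"}},
    {"op": "d", "before": {"order_id": 2, "status": "pending"}, "after": None},
]

orders = {}
for event in events:
    apply_change_event(orders, event)
```

Because each operation is an upsert or a tolerant delete, replaying the same events in order leaves the state unchanged, which is what makes at-least-once delivery safe here.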

Transformation Layer: dbt Fact Table

-- dbt transformation: build a fact table
-- models/marts/fct_orders.sql

WITH orders AS (
  SELECT * FROM {{ ref('stg_orders') }}
),
customers AS (
  SELECT * FROM {{ ref('dim_customers') }}
),
products AS (
  SELECT * FROM {{ ref('dim_products') }}
)
SELECT
  o.order_id,
  o.order_date,
  c.customer_key,
  p.product_key,
  o.quantity,
  o.unit_price,
  o.quantity * o.unit_price AS total_amount
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN products p ON o.product_id = p.product_id
WHERE o.order_date IS NOT NULL

Common tools: dbt, Spark, Flink, Snowflake SQL, BigQuery SQL, Dataflow
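One property of the fact-table model above is worth checking: an inner join silently drops orders whose customer or product is missing from the dimension. The sketch below reproduces the effect with `sqlite3` on invented rows and contrasts it with a left join that maps orphans to a placeholder key. The `-1` "unknown" key is a common dimensional-modeling convention, assumed here rather than taken from the model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE stg_orders (order_id INTEGER, customer_id INTEGER,
                             quantity INTEGER, unit_price REAL);
    CREATE TABLE dim_customers (customer_key INTEGER, customer_id INTEGER);
    -- Order 2 references customer 99, which has not reached the dimension yet.
    INSERT INTO stg_orders VALUES (1, 10, 2, 5.0), (2, 99, 1, 7.5);
    INSERT INTO dim_customers VALUES (1000, 10);
    """
)

# Same join shape as the model: the orphaned order disappears.
inner_rows = conn.execute(
    """
    SELECT o.order_id, c.customer_key, o.quantity * o.unit_price
    FROM stg_orders o
    JOIN dim_customers c ON o.customer_id = c.customer_id
    ORDER BY o.order_id
    """
).fetchall()

# Left join keeps every order and assigns the placeholder key to orphans.
left_rows = conn.execute(
    """
    SELECT o.order_id, COALESCE(c.customer_key, -1), o.quantity * o.unit_price
    FROM stg_orders o
    LEFT JOIN dim_customers c ON o.customer_id = c.customer_id
    ORDER BY o.order_id
    """
).fetchall()
```

Which behavior is right depends on the contract: dropping orphans keeps the fact table clean, while keeping them preserves revenue totals. Either way, a row-count test between staging and the fact table makes the choice visible.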

Storage Layer: Iceberg with Partitioning and Time Travel

-- Iceberg table with partitioning and time travel
CREATE TABLE analytics.fct_orders (
  order_id    BIGINT,
  order_date  DATE,
  customer_id BIGINT,
  amount      DECIMAL(12, 2)
)
USING iceberg
PARTITIONED BY (months(order_date))
TBLPROPERTIES (
  'write.metadata.delete-after-commit.enabled' = 'true',
  'history.expire.max-snapshot-age-ms' = '604800000'
);

-- Time travel: query the table as of a past timestamp
-- (snapshots older than the 7-day max age set above may already be expired)
SELECT * FROM analytics.fct_orders
FOR SYSTEM_TIME AS OF TIMESTAMP '2025-03-14 00:00:00';

Common tools: S3, GCS, Snowflake, BigQuery, Redshift, Databricks, Apache Iceberg, Delta Lake
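The two ideas in the Iceberg example, month partitioning and time travel, can be modeled in a few lines to show what the engine does conceptually. This is a simplified sketch, not Iceberg's implementation: it assumes the month transform maps a date to the number of months since January 1970, and it treats time travel as picking the newest snapshot committed at or before the requested timestamp. The snapshot IDs are invented.

```python
from bisect import bisect_right
from datetime import date, datetime


def month_partition(d: date) -> int:
    """Months elapsed since 1970-01, used as the partition value."""
    return (d.year - 1970) * 12 + (d.month - 1)


def snapshot_as_of(snapshots: list, ts: datetime) -> int:
    """Return the id of the latest snapshot committed at or before ts.

    snapshots: list of (committed_at, snapshot_id), sorted by committed_at.
    """
    commit_times = [committed_at for committed_at, _ in snapshots]
    position = bisect_right(commit_times, ts)
    if position == 0:
        raise LookupError("no snapshot exists at or before the requested time")
    return snapshots[position - 1][1]


# Hypothetical nightly commits.
snapshots = [
    (datetime(2025, 3, 12, 2, 0), 101),
    (datetime(2025, 3, 13, 2, 0), 102),
    (datetime(2025, 3, 14, 2, 0), 103),
]
as_of_midnight = snapshot_as_of(snapshots, datetime(2025, 3, 14, 0, 0))
```

All orders from the same calendar month share one partition value, so a query filtered to a month reads only that partition's files. The `LookupError` branch corresponds to asking for a point in time whose snapshots were never written or have been expired.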

Serving Layer: Materialized View

-- Materialized view for dashboard serving
-- (assumes fct_orders also carries product_id and region, which the Iceberg DDL above omits)
CREATE MATERIALIZED VIEW dashboard.daily_revenue AS
SELECT
  DATE_TRUNC('day', order_date) AS day,
  region,
  product_category,
  SUM(amount) AS total_revenue,
  COUNT(DISTINCT customer_id) AS unique_customers,
  COUNT(*) AS order_count
FROM analytics.fct_orders
JOIN analytics.dim_products USING (product_id)
GROUP BY 1, 2, 3;

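-- Refresh
-- PostgreSQL: REFRESH MATERIALIZED VIEW CONCURRENTLY dashboard.daily_revenue;
--   (CONCURRENTLY requires a unique index on the materialized view)
-- Snowflake: materialized views are maintained automatically and cannot contain joins;
--   for a joined aggregate like this one, use a dynamic table or a scheduled task that rebuilds a table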

Common tools: Looker, Tableau, Metabase, Redis, Feast, dbt metrics, Cube.dev

The three patterns interviewers expect you to know

Lambda Architecture. Lambda architecture maintains two parallel processing paths: a batch layer for accuracy and a speed layer for low latency. The batch layer processes all historical data and produces correct, complete results on a schedule. The speed layer processes real-time events and produces approximate, up-to-the-second results. A serving layer merges outputs from both. The trade-off: you maintain two separate codebases (batch and streaming) that must produce consistent results. This dual-maintenance burden is Lambda's biggest drawback.
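The merge step in the serving layer is the part of Lambda that is easiest to get wrong, so a small sketch may help. It assumes the batch view covers every event before a cutoff and the speed view covers events from the cutoff onward; the event fields and the cutoff value are invented. The same aggregation function is reused for both paths only to keep the example short, whereas in a real Lambda deployment these would be two separate implementations.

```python
from collections import defaultdict


def aggregate_revenue(events) -> dict:
    """Sum amounts per region."""
    totals = defaultdict(float)
    for event in events:
        totals[event["region"]] += event["amount"]
    return dict(totals)


def merge_views(batch_view: dict, speed_view: dict) -> dict:
    """Serving-layer merge: batch totals plus increments since the cutoff."""
    merged = dict(batch_view)
    for region, increment in speed_view.items():
        merged[region] = merged.get(region, 0.0) + increment
    return merged


events = [
    {"ts": 100, "region": "eu", "amount": 10.0},
    {"ts": 200, "region": "us", "amount": 5.0},
    {"ts": 950, "region": "eu", "amount": 2.5},
    {"ts": 990, "region": "apac", "amount": 1.0},
]
BATCH_CUTOFF = 900  # last event time covered by the most recent batch run

batch_view = aggregate_revenue(e for e in events if e["ts"] < BATCH_CUTOFF)
speed_view = aggregate_revenue(e for e in events if e["ts"] >= BATCH_CUTOFF)
serving_view = merge_views(batch_view, speed_view)
```

The cutoff must be applied identically on both sides. If the two paths disagree about which events belong to the batch, the merged view double-counts or drops them.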

Kappa Architecture. Kappa architecture eliminates the batch layer entirely. All data is processed through a single streaming pipeline. Historical reprocessing is handled by replaying events from the stream's log (Kafka with long retention). This solves Lambda's dual-codebase problem: one pipeline, one codebase, one set of logic. The trade-off: streaming infrastructure must handle both real-time events and historical replay, which requires careful capacity planning. Kappa works best when the source system produces events (CDC, clickstream, IoT sensors).
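The replay idea can be shown with a list standing in for the Kafka log. In this minimal sketch, one `run_pipeline` function serves both live processing and historical reprocessing; changing the logic means running the new version from offset 0. The event fields and the bot-filtering rule are hypothetical.

```python
from collections import Counter
from typing import Callable


def run_pipeline(log: list, logic: Callable, from_offset: int = 0) -> Counter:
    """Process the log from a given offset with a single piece of logic."""
    state = Counter()
    for event in log[from_offset:]:
        logic(state, event)
    return state


def count_page_views(state: Counter, event: dict) -> None:
    """Version 1 of the logic: count every page view."""
    state[event["page"]] += 1


def count_human_page_views(state: Counter, event: dict) -> None:
    """Version 2 of the logic: ignore events flagged as bot traffic."""
    if not event.get("is_bot", False):
        state[event["page"]] += 1


# The retained event log (stand-in for a Kafka topic with long retention).
log = [
    {"page": "/home"},
    {"page": "/home", "is_bot": True},
    {"page": "/pricing"},
]

live_counts = run_pipeline(log, count_page_views)
# Logic changed: rebuild the result by replaying the same log from the start.
replayed_counts = run_pipeline(log, count_human_page_views, from_offset=0)
```

In production the replay usually writes to a new output table and consumers switch over once it has caught up, which is where the capacity planning mentioned above comes in.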

Medallion Architecture (Bronze/Silver/Gold). The medallion architecture organizes data into three quality tiers. Bronze is raw, unprocessed data landed directly from sources. Silver is cleaned, validated, and deduplicated data. Gold is business-level aggregations and models ready for consumption. Each tier builds on the previous one. This pattern is popular in lakehouse environments (Databricks, Iceberg) because it provides clear data lineage, easy debugging (you can always go back to bronze), and progressive quality improvement.
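The three tiers can be sketched as two pure functions over in-memory rows. This is an illustration under stated assumptions: the column names, the `_source` metadata field, and the rule that unparseable rows are skipped in silver are all invented for the example.

```python
from collections import defaultdict
from datetime import date
from decimal import Decimal, InvalidOperation

# Bronze: raw strings exactly as received, plus ingestion metadata.
bronze = [
    {"order_id": "1", "order_date": "2025-03-14", "amount": "10.50", "_source": "orders_api"},
    {"order_id": "1", "order_date": "2025-03-14", "amount": "10.50", "_source": "orders_api"},
    {"order_id": "2", "order_date": "2025-03-14", "amount": "oops", "_source": "orders_api"},
    {"order_id": "3", "order_date": "2025-03-15", "amount": "4.00", "_source": "orders_api"},
]


def to_silver(bronze_rows: list) -> list:
    """Cast types, skip rows that fail validation, and deduplicate by order_id."""
    seen = set()
    silver_rows = []
    for row in bronze_rows:
        try:
            clean = {
                "order_id": int(row["order_id"]),
                "order_date": date.fromisoformat(row["order_date"]),
                "amount": Decimal(row["amount"]),
            }
        except (KeyError, ValueError, InvalidOperation):
            continue  # the row stays available in bronze for debugging
        if clean["order_id"] in seen:
            continue
        seen.add(clean["order_id"])
        silver_rows.append(clean)
    return silver_rows


def to_gold(silver_rows: list) -> dict:
    """Business-level aggregate: revenue per day."""
    revenue = defaultdict(Decimal)
    for row in silver_rows:
        revenue[row["order_date"]] += row["amount"]
    return dict(revenue)


silver = to_silver(bronze)
gold = to_gold(silver)
```

Because bronze is never modified, fixing a bug in `to_silver` means re-running it over the same input rather than re-ingesting from the source.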


How to pick

Start with batch. If consumers tolerate hourly or daily freshness, batch is simpler, cheaper, and easier to debug. Most analytics use cases fit. Airflow plus dbt plus Snowflake is a stack that scales further than most teams will need. Don't add streaming complexity until you have a concrete sub-minute latency requirement.

Add streaming for specific use cases. Real-time pricing, fraud detection, live dashboards, alerting. Add a streaming path alongside batch (hybrid or Lambda) rather than replacing batch entirely. Kafka plus Flink for the processing, Redis or a fast-query store for the serving layer.

Consider Kappa when sources are event-native. If every source produces events (clickstream, IoT, CDC) and the team is comfortable with stream processing, Kappa eliminates the batch path. Replay handles historical reprocessing. The catch is long Kafka retention and careful capacity planning for replay.

Medallion is orthogonal. Bronze, silver, gold applies to any processing model. It's about data quality tiers, not about how the data moves. Most modern teams adopt some version regardless of their batch-versus-streaming choice.

Four pipeline architecture interview questions

Design a data pipeline for a ride-sharing company that needs both real-time pricing and daily analytics reports.

What they're testing: System design fundamentals and the ability to choose between batch, streaming, and hybrid approaches. The interviewer wants to see you identify the two different latency requirements and design appropriate paths for each. Real-time pricing needs sub-second data. Daily reports need complete, accurate aggregations.

How to answer: Start by identifying the two access patterns: real-time pricing (streaming) and daily reports (batch). For pricing: CDC or event stream from the rides database into Kafka. Flink or Spark Streaming computes surge metrics in real time and writes to Redis for the pricing service. For reports: Airflow orchestrates a daily batch pipeline that reads from the data lake (where streaming data is also persisted), transforms with dbt/Spark, and loads into the warehouse for BI dashboards. Share the event stream between both paths to avoid duplicate ingestion. This is a hybrid/Lambda approach. Mention the trade-off: two code paths to maintain, but each is optimized for its access pattern.

What is the medallion architecture, and why is it popular in lakehouse environments?

What they're testing: Whether you understand progressive data quality and the lakehouse paradigm. The interviewer wants to hear about bronze (raw), silver (cleaned), gold (modeled) tiers and why this separation matters for debugging, reprocessing, and data quality.

How to answer: Define the three tiers. Bronze: raw data landed exactly as received from sources, with metadata columns (ingestion timestamp, source system). Silver: cleaned, deduplicated, validated data with consistent types and applied business rules. Gold: aggregated, modeled data ready for specific business use cases (fact tables, metric tables, feature tables). Why it works for lakehouses: Iceberg/Delta provide ACID transactions on the lake, so each tier can be a set of tables with schema enforcement. Debugging is easy because you can always inspect the bronze layer. Reprocessing means re-running silver from bronze, not re-ingesting from sources.

How do you handle late-arriving data in a streaming pipeline?

What they're testing: Practical streaming experience. Late data is one of the hardest problems in stream processing: events can arrive seconds, minutes, or hours after they occurred. The interviewer checks whether you know about event time versus processing time, watermarks, and allowed lateness.

How to answer: Distinguish event time (when the event happened) from processing time (when the pipeline sees it). Use a watermark to declare how much lateness the pipeline tolerates: with a 5-minute bound, events that arrive up to 5 minutes behind the latest event time seen are still assigned to their correct window. Events that arrive after the watermark has passed their window are either dropped or sent to a side output for separate handling. In Flink, set a watermark strategy with bounded out-of-orderness. In Spark Structured Streaming, use withWatermark(). Mention the trade-off: a longer watermark delay increases completeness but adds latency, because windows stay open longer. For very late data (hours, days), use batch reprocessing to correct the streaming results.
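The watermark mechanics described in this answer can be modeled without a streaming engine. The sketch below is a simplified model, not Flink's or Spark's implementation: it counts events in tumbling event-time windows, derives the watermark as the maximum event time seen minus a bounded out-of-orderness, closes a window once the watermark reaches its end, and routes events for already-closed windows to a side output. The class name, field names, and the integer-second timestamps are assumptions.

```python
from collections import defaultdict


class TumblingWindowCounter:
    """Counts events per event-time window, with a bounded-lateness watermark."""

    def __init__(self, window_seconds: int = 60, max_out_of_orderness: int = 300):
        self.window_seconds = window_seconds
        self.max_out_of_orderness = max_out_of_orderness
        self.max_event_time = float("-inf")
        self.open_windows = defaultdict(int)  # window start -> running count
        self.closed_windows = {}              # window start -> final count
        self.late_events = []                 # side output for too-late events

    @property
    def watermark(self) -> float:
        # "No event older than this is expected any more."
        return self.max_event_time - self.max_out_of_orderness

    def on_event(self, event_time: int, payload=None) -> None:
        window_start = event_time - event_time % self.window_seconds
        window_end = window_start + self.window_seconds
        if window_end <= self.watermark:
            # The window was already finalized: divert instead of mutating results.
            self.late_events.append((event_time, payload))
            return
        self.open_windows[window_start] += 1
        self.max_event_time = max(self.max_event_time, event_time)
        # Finalize every window whose end the watermark has now passed.
        for start in sorted(self.open_windows):
            if start + self.window_seconds <= self.watermark:
                self.closed_windows[start] = self.open_windows.pop(start)


counter = TumblingWindowCounter(window_seconds=60, max_out_of_orderness=300)
for event_time in [0, 30, 200, 20, 400, 50]:
    counter.on_event(event_time)
```

In this run the event at time 20 arrives out of order but within the bound, so it still lands in the first window. The event at time 50 arrives after the watermark has moved to 100 and the first window has closed, so it goes to `late_events`. A batch reprocessing job could later fold such events back into the corrected result.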

Compare Lambda and Kappa architectures. When would you choose one over the other?

What they're testing: Architecture comparison skills and practical judgment. Lambda provides accuracy guarantees through the batch layer. Kappa simplifies operations with a single streaming pipeline. The interviewer wants trade-off analysis, not just definitions.

How to answer: Lambda: two paths (batch + speed), batch layer is the source of truth, speed layer provides low-latency approximations. Pro: batch results are complete and correct. Con: two codebases computing the same metrics. Kappa: single streaming path, replay from the event log for reprocessing. Pro: one codebase, simpler operations. Con: replay can be slow for large historical datasets, and not all sources produce events naturally. Choose Lambda when: batch accuracy is critical (financial reconciliation, regulatory reporting), and the team can maintain two paths. Choose Kappa when: the source is naturally event-based (clickstream, IoT), latency requirements are uniform, and the team wants operational simplicity.

Common questions

What is data pipeline architecture?

Data pipeline architecture is the design of systems that move data from sources to destinations, transforming it along the way. It includes four layers: ingestion (extracting from sources), transformation (cleaning, enriching, modeling), storage (persisting at various quality tiers), and serving (exposing to consumers). Architecture choices involve batch vs streaming processing, tool selection, error handling, monitoring, and scalability patterns.

What is the difference between batch and streaming pipelines?

Batch pipelines process bounded datasets on a schedule (hourly, daily). They are simpler, cheaper, and produce complete results for each run. Streaming pipelines process unbounded event streams continuously, with latency measured in milliseconds to seconds. They are more complex (ordering, late data, exactly-once semantics) and have higher baseline infrastructure costs. Choose batch for analytics and reporting. Choose streaming for real-time applications like fraud detection and live dashboards.

What is the medallion architecture?

The medallion architecture organizes data into three progressive quality tiers: Bronze (raw, unprocessed), Silver (cleaned, validated, deduplicated), and Gold (business-level models and aggregations). Each tier builds on the previous one. This pattern is popular in lakehouse environments (Databricks, Iceberg, Delta Lake) because it provides clear lineage, easy debugging, and the ability to reprocess from raw data without re-ingesting from sources.

What tools are commonly used in data pipeline architecture?

Orchestration: Airflow, Dagster, Prefect. Ingestion: Fivetran, Airbyte, Kafka Connect, Debezium. Transformation: dbt, Spark, Flink. Storage: S3/GCS (lake), Snowflake/BigQuery/Redshift (warehouse), Iceberg/Delta (lakehouse). Serving: Looker, Tableau, Redis, Feast. Streaming: Kafka, Flink, Spark Streaming, Kinesis. The specific tools matter less than understanding the architecture patterns they implement.

