Data Pipeline Architecture Guide (2026)
Pipeline architecture is the set of contracts between four layers: ingestion, transformation, storage, and serving. Each layer has its own SLA, its own failure mode, and its own blast radius when it breaks. A well-architected pipeline makes those contracts explicit so the team downstream of you knows exactly when to trust the data and when to wait.
Below: the batch-versus-streaming decision, the four layers in plain terms, the three patterns interviewers expect you to know, and four design questions with the version that earns the senior signal.
Batch, streaming, hybrid
The first decision in a pipeline design cascades into every other one: state management, ordering guarantees, exactly-once semantics, observability tooling. Pick batch when latency can be measured in hours. Pick streaming when sub-minute freshness is the requirement, not the aspiration. Pick hybrid when analytics needs both fresh dashboards and reprocessable history.
| Aspect | Batch | Streaming | Hybrid |
|---|---|---|---|
| Latency | Minutes to hours | Seconds to milliseconds | Batch for historical, streaming for real-time |
| Processing model | Process bounded datasets on a schedule | Process unbounded event streams continuously | Both, unified through a serving layer |
| Complexity | Lower; well-understood patterns | Higher; ordering, late data, exactly-once | Highest; two code paths to maintain |
| Tools | Airflow, dbt, Spark, Snowflake Tasks | Kafka, Flink, Spark Streaming, Kinesis | Both tool sets, often with a shared serving layer |
| Cost model | Pay per run; scales with data volume | Always-on infrastructure; baseline cost is higher | Batch cost + streaming infrastructure cost |
| Best for | Reporting, analytics, ML training, backfills | Real-time dashboards, fraud detection, alerting | Systems needing both historical accuracy and real-time freshness |
The four layers
Ingestion Layer. The ingestion layer extracts data from source systems and lands it in the pipeline. Batch ingestion pulls data on a schedule (daily, hourly) from databases, APIs, and file drops. Streaming ingestion captures events in real time from message queues, CDC streams, and webhooks. The ingestion layer must handle schema drift (source columns changing), late-arriving data, deduplication, and backpressure (when the source produces data faster than the pipeline can consume). Tools: Fivetran, Airbyte, Kafka Connect, Debezium, AWS DMS, custom Python scripts.
Transformation Layer. The transformation layer converts raw ingested data into a format suitable for analysis. This includes cleaning (handling NULLs, deduplication, type casting), enrichment (joining reference data), aggregation (pre-computing metrics), and modeling (building dimensional models, fact tables). In ELT architectures, transformations happen inside the data warehouse using SQL. In ETL architectures, transformations happen before loading, often in Spark or Python. Tools: dbt, Spark, Flink, Snowflake SQL, BigQuery SQL, Dataflow.
Storage Layer. The storage layer persists data at various stages of the pipeline. Raw data lands in object storage (S3, GCS) or a data lake. Transformed data lives in a data warehouse (Snowflake, BigQuery, Redshift) or a lakehouse (Databricks, Iceberg). The storage layer must support partitioning (organize data by date, region), compaction (merge small files), time travel (query historical versions), and access control (restrict who sees what). Storage format matters: Parquet and ORC for columnar analytics, Avro for schema evolution, Delta/Iceberg for ACID transactions on lakes. Tools: S3, GCS, Snowflake, BigQuery, Redshift, Databricks, Apache Iceberg, Delta Lake.
Serving Layer. The serving layer exposes processed data to consumers: BI dashboards, APIs, ML models, and operational applications. The serving layer must optimize for the access patterns of its consumers. Dashboards need fast aggregation queries (pre-aggregated tables, materialized views). APIs need low-latency point lookups (Redis, DynamoDB). ML models need feature stores (Feast, Tecton). The serving layer is where SLAs live: query latency, freshness guarantees, and availability targets. Tools: Looker, Tableau, Metabase, Redis, Feast, dbt metrics, Cube.dev.
Ingestion Layer: CDC and Incremental Load
-- CDC ingestion with Debezium (logical replication)
-- Captures INSERT, UPDATE, DELETE as events
-- Kafka topic: dbserver.public.orders
-- Each message contains:
-- {
--   "op": "u", -- operation: c=create, u=update, d=delete
--   "before": { "order_id": 1, "status": "pending" },
--   "after": { "order_id": 1, "status": "shipped" },
--   "ts_ms": 1710000000000
-- }
-- Batch ingestion: incremental load pattern
SELECT *
FROM source_db.orders
WHERE updated_at > :last_watermark
  AND updated_at <= :current_watermark;
Common tools: Fivetran, Airbyte, Kafka Connect, Debezium, AWS DMS, custom Python scripts
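To make the CDC message shape above concrete, here is a minimal Python sketch of how a consumer might apply such change events to a keyed copy of the table. The function names `apply_change_event` and `replay` are hypothetical, the in-memory dict stands in for a real target table, and the sketch assumes events for a given key arrive in commit order (as they would within one Kafka partition).

```python
from typing import Any, Iterable

Row = dict[str, Any]


def apply_change_event(table: dict[int, Row], event: dict[str, Any]) -> None:
    """Apply one Debezium-style change event to an in-memory table keyed by order_id."""
    op = event["op"]
    if op in ("c", "u"):
        # Create and update both carry the new row image in "after": upsert it.
        row = event["after"]
        table[row["order_id"]] = row
    elif op == "d":
        # Delete carries the old row image in "before": remove that key if present.
        table.pop(event["before"]["order_id"], None)
    else:
        raise ValueError(f"unsupported op: {op!r}")


def replay(events: Iterable[dict[str, Any]]) -> dict[int, Row]:
    """Rebuild table state by applying events in arrival order (assumed per-key ordered)."""
    table: dict[int, Row] = {}
    for event in events:
        apply_change_event(table, event)
    return table


events = [
    {"op": "c", "before": None, "after": {"order_id": 1, "status": "pending"}, "ts_ms": 1},
    {"op": "c", "before": None, "after": {"order_id": 2, "status": "pending"}, "ts_ms": 2},
    {"op": "u", "before": {"order_id": 1, "status": "pending"},
     "after": {"order_id": 1, "status": "shipped"}, "ts_ms": 3},
    {"op": "d", "before": {"order_id": 2, "status": "pending"}, "after": None, "ts_ms": 4},
]
state = replay(events)
```

Because each event is an upsert or a delete by key, applying the same event twice leaves the state unchanged, which is what makes at-least-once delivery tolerable at this layer.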
The three patterns interviewers expect you to know
Lambda Architecture. Lambda architecture maintains two parallel processing paths: a batch layer for accuracy and a speed layer for low latency. The batch layer processes all historical data and produces correct, complete results on a schedule. The speed layer processes real-time events and produces approximate, up-to-the-second results. A serving layer merges outputs from both. The trade-off: you maintain two separate codebases (batch and streaming) that must produce consistent results. This dual-maintenance burden is Lambda's biggest drawback.
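As a rough illustration of the serving-layer merge, the Python sketch below adds speed-layer increments on top of batch totals. The name `merge_views` and the sample data are hypothetical, and the sketch assumes the speed view only holds events that arrived after the last batch cutoff (it would be reset whenever the batch layer catches up).

```python
def merge_views(batch_view: dict[str, int], speed_view: dict[str, int]) -> dict[str, int]:
    """Combine batch totals with speed-layer increments accumulated since the batch cutoff."""
    merged = dict(batch_view)  # copy so the batch view stays untouched
    for key, delta in speed_view.items():
        merged[key] = merged.get(key, 0) + delta
    return merged


# Ride counts per day. The batch view is complete up to the last batch cutoff.
batch_view = {"2025-03-13": 120, "2025-03-14": 95}
# The speed view holds only events that arrived after that cutoff (an assumption of this sketch).
speed_view = {"2025-03-14": 7, "2025-03-15": 3}

served = merge_views(batch_view, speed_view)
```

The merge itself is simple; the dual-maintenance cost comes from keeping the logic that produces `batch_view` and `speed_view` consistent.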
Kappa Architecture. Kappa architecture eliminates the batch layer entirely. All data is processed through a single streaming pipeline. Historical reprocessing is handled by replaying events from the stream's log (Kafka with long retention). This solves Lambda's dual-codebase problem: one pipeline, one codebase, one set of logic. The trade-off: streaming infrastructure must handle both real-time events and historical replay, which requires careful capacity planning. Kappa works best when the source system produces events (CDC, clickstream, IoT sensors).
Medallion Architecture (Bronze/Silver/Gold). The medallion architecture organizes data into three quality tiers. Bronze is raw, unprocessed data landed directly from sources. Silver is cleaned, validated, and deduplicated data. Gold is business-level aggregations and models ready for consumption. Each tier builds on the previous one. This pattern is popular in lakehouse environments (Databricks, Iceberg) because it provides clear data lineage, easy debugging (you can always go back to bronze), and progressive quality improvement.
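The three tiers can be illustrated with a small Python sketch in which each tier is a pure function of the previous one. The function names, the metadata columns `_source` and `_ingested_at`, and the order fields are assumptions made for illustration; a real lakehouse would persist each tier as tables.

```python
from datetime import datetime, timezone
from typing import Any


def to_bronze(raw_records: list[dict[str, Any]], source: str) -> list[dict[str, Any]]:
    """Bronze: keep the payload exactly as received, plus ingestion metadata."""
    ingested_at = datetime.now(timezone.utc).isoformat()
    return [{"payload": r, "_source": source, "_ingested_at": ingested_at} for r in raw_records]


def to_silver(bronze: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Silver: cast types, drop invalid rows, deduplicate on order_id (last one wins)."""
    by_id: dict[int, dict[str, Any]] = {}
    for row in bronze:
        payload = row["payload"]
        try:
            order_id = int(payload["order_id"])
            amount = float(payload["amount"])
        except (KeyError, TypeError, ValueError):
            continue  # invalid rows stay inspectable in bronze
        region = str(payload.get("region") or "unknown").lower()
        by_id[order_id] = {"order_id": order_id, "region": region, "amount": amount}
    return list(by_id.values())


def to_gold(silver: list[dict[str, Any]]) -> dict[str, float]:
    """Gold: business-level aggregate, here revenue per region."""
    totals: dict[str, float] = {}
    for row in silver:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return totals


raw = [
    {"order_id": "1", "amount": "10.5", "region": "EU"},
    {"order_id": "1", "amount": "10.5", "region": "EU"},  # duplicate delivery
    {"order_id": "x", "amount": "3"},                      # malformed key
    {"order_id": 2, "amount": 4, "region": "us"},
]
bronze = to_bronze(raw, source="orders_api")
silver = to_silver(bronze)
gold = to_gold(silver)
```

Because silver and gold are derived only from bronze, fixing a cleaning rule means re-running `to_silver` and `to_gold`, not re-ingesting from the source.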
Transformation Layer: dbt Fact Table
-- dbt transformation: build a fact table
-- models/marts/fct_orders.sql
WITH orders AS (
    SELECT * FROM {{ ref('stg_orders') }}
),
customers AS (
    SELECT * FROM {{ ref('dim_customers') }}
),
products AS (
    SELECT * FROM {{ ref('dim_products') }}
)
SELECT
    o.order_id,
    o.order_date,
    c.customer_key,
    p.product_key,
    o.quantity,
    o.unit_price,
    o.quantity * o.unit_price AS total_amount
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN products p ON o.product_id = p.product_id
WHERE o.order_date IS NOT NULL
Common tools: dbt, Spark, Flink, Snowflake SQL, BigQuery SQL, Dataflow
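The `ref()` calls in the model above are what let the tool infer build order. The Python sketch below is a toy illustration of that idea, not dbt's actual implementation: it extracts `ref('...')` names with a regular expression and topologically sorts them with the standard-library `graphlib`. The helper names and the extra `stg_customers` model are hypothetical.

```python
import re
from graphlib import TopologicalSorter

# Matches {{ ref('model_name') }} with single or double quotes and flexible spacing.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")


def extract_refs(sql: str) -> set[str]:
    """Return the names of all models referenced via ref() in a model's SQL."""
    return set(REF_PATTERN.findall(sql))


def build_order(models: dict[str, str]) -> list[str]:
    """Order models so that every model comes after the models it references."""
    graph = {name: extract_refs(sql) for name, sql in models.items()}
    return list(TopologicalSorter(graph).static_order())


models = {
    "fct_orders": (
        "SELECT * FROM {{ ref('stg_orders') }} o "
        "JOIN {{ ref('dim_customers') }} c ON o.customer_id = c.customer_id "
        "JOIN {{ ref('dim_products') }} p ON o.product_id = p.product_id"
    ),
    "dim_customers": "SELECT * FROM {{ ref('stg_customers') }}",
    "dim_products": "SELECT * FROM raw.products",
    "stg_orders": "SELECT * FROM raw.orders",
    "stg_customers": "SELECT * FROM raw.customers",
}
order = build_order(models)
```

`TopologicalSorter` raises `graphlib.CycleError` if two models reference each other, which mirrors why circular references between models cannot be built.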
How to pick
Start with batch. If consumers tolerate hourly or daily freshness, batch is simpler, cheaper, and easier to debug. Most analytics use cases fit. Airflow plus dbt plus Snowflake is a stack that scales further than most teams will need. Don't add streaming complexity until you have a concrete sub-minute latency requirement.
Add streaming for specific use cases. Real-time pricing, fraud detection, live dashboards, alerting. Add a streaming path alongside batch (hybrid or Lambda) rather than replacing batch entirely. Kafka plus Flink for the processing, Redis or a fast-query store for the serving layer.
Consider Kappa when sources are event-native. If every source produces events (clickstream, IoT, CDC) and the team is comfortable with stream processing, Kappa eliminates the batch path. Replay handles historical reprocessing. The catch is long Kafka retention and careful capacity planning for replay.
Medallion is orthogonal. Bronze, silver, gold applies to any processing model. It's about data quality tiers, not about how the data moves. Most modern teams adopt some version regardless of their batch-versus-streaming choice.
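The selection guidance above can be condensed into a small decision helper. This is a deliberately simplified sketch: the function name, its parameters, and the 60-second threshold (taken from the "sub-minute" wording) are assumptions, and it ignores factors such as team experience and cost.

```python
def choose_architecture(
    freshness_seconds: float,
    needs_reprocessable_history: bool,
    all_sources_event_native: bool,
) -> str:
    """Map a freshness requirement and source characteristics to a starting architecture."""
    if freshness_seconds >= 60:
        # No concrete sub-minute requirement: start with batch.
        return "batch"
    if not needs_reprocessable_history:
        return "streaming"
    # Sub-minute freshness plus reprocessable history.
    return "kappa" if all_sources_event_native else "hybrid"


daily_reporting = choose_architecture(86_400, True, False)
fraud_alerts = choose_architecture(5, False, False)
pricing_plus_reports = choose_architecture(1, True, False)
clickstream_only = choose_architecture(1, True, True)
```

Medallion tiers do not appear in the function because, as noted above, they apply regardless of which branch is taken.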
Storage Layer: Iceberg with Partitioning and Time Travel
-- Iceberg table with partitioning and time travel
CREATE TABLE analytics.fct_orders (
    order_id BIGINT,
    order_date DATE,
    customer_id BIGINT,
    amount DECIMAL(12, 2)
)
USING iceberg
PARTITIONED BY (months(order_date))
TBLPROPERTIES (
    'write.metadata.delete-after-commit.enabled' = 'true',
    'history.expire.max-snapshot-age-ms' = '604800000' -- 7 days
);
-- Time travel: query the table as of a past timestamp
-- (the timestamp must fall within the snapshots that have not been expired)
SELECT * FROM analytics.fct_orders
FOR SYSTEM_TIME AS OF TIMESTAMP '2025-03-14 00:00:00';
Common tools: S3, GCS, Snowflake, BigQuery, Redshift, Databricks, Apache Iceberg, Delta Lake
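To show why partitioning by month helps, the following Python sketch models a month transform (months elapsed since 1970-01) and uses it to skip partitions that cannot match a date filter. The helper names and the in-memory dict of partitions are hypothetical stand-ins for table metadata and data files.

```python
from collections import defaultdict
from datetime import date
from typing import Any

Row = dict[str, Any]


def month_partition(d: date) -> int:
    """Partition value for a month transform: whole months elapsed since 1970-01."""
    return (d.year - 1970) * 12 + (d.month - 1)


def partition_rows(rows: list[Row]) -> dict[int, list[Row]]:
    """Group rows by the month partition of their order_date."""
    partitions: dict[int, list[Row]] = defaultdict(list)
    for row in rows:
        partitions[month_partition(row["order_date"])].append(row)
    return dict(partitions)


def scan(partitions: dict[int, list[Row]], start: date, end: date) -> tuple[list[Row], int]:
    """Read only partitions overlapping [start, end]; return matching rows and partitions read."""
    wanted = range(month_partition(start), month_partition(end) + 1)
    touched = [p for p in partitions if p in wanted]  # partition pruning
    rows = [r for p in touched for r in partitions[p] if start <= r["order_date"] <= end]
    return rows, len(touched)


rows = [
    {"order_id": 1, "order_date": date(2025, 1, 20)},
    {"order_id": 2, "order_date": date(2025, 3, 2)},
    {"order_id": 3, "order_date": date(2025, 3, 14)},
]
partitions = partition_rows(rows)
march_rows, partitions_read = scan(partitions, date(2025, 3, 1), date(2025, 3, 31))
```

The row-level filter inside `scan` is still needed because a query range rarely aligns exactly with partition boundaries.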
Serving Layer: Materialized View
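-- Materialized view for dashboard serving
-- (assumes fct_orders also exposes region and product_id)
CREATE MATERIALIZED VIEW dashboard.daily_revenue AS
SELECT
    DATE_TRUNC('day', order_date) AS day,
    region,
    product_category,
    SUM(amount) AS total_revenue,
    COUNT(DISTINCT customer_id) AS unique_customers,
    COUNT(*) AS order_count
FROM analytics.fct_orders
JOIN analytics.dim_products USING (product_id)
GROUP BY 1, 2, 3;
-- Keeping it fresh
-- Snowflake: materialized views are maintained automatically, so there is no manual refresh;
--   ALTER MATERIALIZED VIEW ... CLUSTER BY (day) only changes clustering. Snowflake materialized
--   views also cannot contain joins, so this query would need a different object there.
-- PostgreSQL: run REFRESH MATERIALIZED VIEW CONCURRENTLY dashboard.daily_revenue on a schedule
--   (CONCURRENTLY requires a unique index on the view)
Common tools: Looker, Tableau, Metabase, Redis, Feast, dbt metrics, Cube.dev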
Four pipeline architecture interview questions
Design a data pipeline for a ride-sharing company that needs both real-time pricing and daily analytics reports.
What they're testing: System design fundamentals and the ability to choose between batch, streaming, and hybrid approaches. The interviewer wants to see you identify the two different latency requirements and design appropriate paths for each. Real-time pricing needs sub-second data. Daily reports need complete, accurate aggregations.
How to answer: Start by identifying the two access patterns: real-time pricing (streaming) and daily reports (batch). For pricing: CDC or event stream from the rides database into Kafka. Flink or Spark Streaming computes surge metrics in real time and writes to Redis for the pricing service. For reports: Airflow orchestrates a daily batch pipeline that reads from the data lake (where streaming data is also persisted), transforms with dbt/Spark, and loads into the warehouse for BI dashboards. Share the event stream between both paths to avoid duplicate ingestion. This is a hybrid/Lambda approach. Mention the trade-off: two code paths to maintain, but each is optimized for its access pattern.
What is the medallion architecture, and why is it popular in lakehouse environments?
What they're testing: Whether you understand progressive data quality and the lakehouse paradigm. The interviewer wants to hear about bronze (raw), silver (cleaned), gold (modeled) tiers and why this separation matters for debugging, reprocessing, and data quality.
How to answer: Define the three tiers. Bronze: raw data landed exactly as received from sources, with metadata columns (ingestion timestamp, source system). Silver: cleaned, deduplicated, validated data with consistent types and applied business rules. Gold: aggregated, modeled data ready for specific business use cases (fact tables, metric tables, feature tables). Why it works for lakehouses: Iceberg/Delta provide ACID transactions on the lake, so each tier can be a set of tables with schema enforcement. Debugging is easy because you can always inspect the bronze layer. Reprocessing means re-running silver from bronze, not re-ingesting from sources.
How do you handle late-arriving data in a streaming pipeline?
What they're testing: Practical streaming experience. Late data is one of the hardest problems in stream processing: events can arrive seconds, minutes, or hours after they occurred. The interviewer checks whether you know about event time versus processing time, watermarks, and allowed lateness.
How to answer: Distinguish event time (when the event happened) from processing time (when the pipeline sees it). Use watermarks to bound how long the pipeline waits for stragglers: a watermark delay of 5 minutes means events up to 5 minutes behind the latest event time seen are still included in the correct window. Events that arrive behind the watermark are either dropped or sent to a side output for separate handling. In Flink: set a watermark strategy with bounded out-of-orderness. In Spark Structured Streaming: use withWatermark(). Mention the trade-off: a longer watermark delay increases completeness but adds latency. For very late data (hours, days), use batch reprocessing to correct the streaming results.
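The watermark mechanics can be sketched without a streaming engine. The Python class below is a simplified model, not Flink's or Spark's implementation: `TumblingWindowCounter` and its attributes are hypothetical names, event times are plain integer seconds, and the watermark is defined as the maximum event time seen minus a fixed out-of-orderness bound.

```python
from collections import defaultdict


class TumblingWindowCounter:
    """Count events per event-time tumbling window, closing windows as the watermark advances."""

    def __init__(self, window_seconds: int, max_out_of_orderness: int) -> None:
        self.window_seconds = window_seconds
        self.max_out_of_orderness = max_out_of_orderness
        self.max_event_time = float("-inf")
        self.open_windows: dict[int, int] = defaultdict(int)  # window start -> running count
        self.closed: dict[int, int] = {}                      # window start -> final count
        self.late: list[int] = []                             # side output for too-late events

    @property
    def watermark(self) -> float:
        """Event time up to which the stream is assumed complete."""
        return self.max_event_time - self.max_out_of_orderness

    def process(self, event_time: int) -> None:
        window_start = event_time - event_time % self.window_seconds
        if window_start + self.window_seconds <= self.watermark:
            # The event's window has already been finalized: route it to the side output.
            self.late.append(event_time)
            return
        self.open_windows[window_start] += 1
        self.max_event_time = max(self.max_event_time, event_time)
        # Finalize every open window whose end is at or behind the watermark.
        for start in sorted(self.open_windows):
            if start + self.window_seconds <= self.watermark:
                self.closed[start] = self.open_windows.pop(start)


counter = TumblingWindowCounter(window_seconds=60, max_out_of_orderness=30)
for event_time in [10, 50, 70, 40, 95, 20, 130]:  # arrival order, not event-time order
    counter.process(event_time)
```

In this run the event at time 40 arrives out of order but ahead of the watermark, so it is counted in window [0, 60); the event at time 20 arrives after that window closed and lands in `late`, where a batch correction job could pick it up.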
Compare Lambda and Kappa architectures. When would you choose one over the other?
What they're testing: Architecture comparison skills and practical judgment. Lambda provides accuracy guarantees through the batch layer. Kappa simplifies operations with a single streaming pipeline. The interviewer wants trade-off analysis, not just definitions.
How to answer: Lambda: two paths (batch + speed), batch layer is the source of truth, speed layer provides low-latency approximations. Pro: batch results are complete and correct. Con: two codebases computing the same metrics. Kappa: single streaming path, replay from the event log for reprocessing. Pro: one codebase, simpler operations. Con: replay can be slow for large historical datasets, and not all sources produce events naturally. Choose Lambda when: batch accuracy is critical (financial reconciliation, regulatory reporting), and the team can maintain two paths. Choose Kappa when: the source is naturally event-based (clickstream, IoT), latency requirements are uniform, and the team wants operational simplicity.