Pipeline architecture is the set of contracts between four layers: ingestion, transformation, storage, and serving. Each layer has its own SLA, its own failure mode, and its own blast radius when it breaks. A well-architected pipeline makes those contracts explicit so the team downstream of you knows exactly when to trust the data and when to wait.
This guide maps the three processing models (batch, streaming, hybrid), the four components you must know cold, and the three design patterns that dominate production stacks in 2026: Lambda, Kappa, and Medallion. Each section connects a design decision to the downstream consequence it triggers six months later at scale.
[Chart: where pipeline-architecture questions appear — pure system-design rounds, L6 staff rounds, pipeline-layer deep dives, and pipeline-challenge questions.]
Source: DataDriven analysis of 1,042 verified data engineering interview rounds.
Your processing-model choice cascades through every other architectural decision: state management, ordering guarantees, exactly-once semantics, observability tooling. Pick batch when latency can be measured in hours. Pick streaming when you care about sub-minute freshness. Pick hybrid when your analytics team needs both fresh dashboards and reprocessable history on the same canonical dataset.
| Aspect | Batch | Streaming | Hybrid |
|---|---|---|---|
| Latency | Minutes to hours | Seconds to milliseconds | Batch for historical, streaming for real-time |
| Processing model | Process bounded datasets on a schedule | Process unbounded event streams continuously | Both, unified through a serving layer |
| Complexity | Lower; well-understood patterns | Higher; ordering, late data, exactly-once | Highest; two code paths to maintain |
| Tools | Airflow, dbt, Spark, Snowflake Tasks | Kafka, Flink, Spark Streaming, Kinesis | Both tool sets, often with a shared serving layer |
| Cost model | Pay per run; scales with data volume | Always-on infrastructure; baseline cost is higher | Batch cost + streaming infrastructure cost |
| Best for | Reporting, analytics, ML training, backfills | Real-time dashboards, fraud detection, alerting | Systems needing both historical accuracy and real-time freshness |
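The latency guidance in the table can be condensed into a rough decision rule. A sketch only: the thresholds are illustrative and the function name is hypothetical.

```python
def pick_processing_model(freshness_seconds: float,
                          needs_replayable_history: bool) -> str:
    """Rough decision rule from the comparison table: batch for hour-scale
    freshness, streaming for sub-minute, hybrid when you also need
    reprocessable history alongside real-time results."""
    if freshness_seconds >= 3600:
        return "batch"
    if needs_replayable_history:
        return "hybrid"
    return "streaming"
```

For example, a daily report tolerates 86,400 seconds of staleness and lands squarely in batch; a fraud alert with a replayable history requirement lands in hybrid.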
Every data pipeline, regardless of processing model, has four components: ingestion, transformation, storage, and serving. Each component has distinct responsibilities, tool options, and failure modes.
The ingestion layer extracts data from source systems and lands it in the pipeline. Batch ingestion pulls data on a schedule (daily, hourly) from databases, APIs, and file drops. Streaming ingestion captures events in real time from message queues, CDC streams, and webhooks. The ingestion layer must handle schema drift (source columns changing), late-arriving data, deduplication, and backpressure (the source producing data faster than the pipeline can consume it).
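Of these failure modes, deduplication is the one most teams hand-roll. A minimal sketch of idempotent, keep-latest dedup; the function and field names are hypothetical:

```python
from typing import Dict, List

def dedupe_latest(events: List[dict], key: str = "order_id",
                  version_col: str = "ts_ms") -> List[dict]:
    """Keep only the newest version of each record, so replayed or
    duplicated ingestion batches stay idempotent."""
    latest: Dict[object, dict] = {}
    for event in events:
        k = event[key]
        # A later (or equal) timestamp wins; earlier duplicates are dropped.
        if k not in latest or event[version_col] >= latest[k][version_col]:
            latest[k] = event
    return list(latest.values())
```

Running the same batch through twice yields the same output, which is the property that makes retries safe.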
Common tools: Fivetran, Airbyte, Kafka Connect, Debezium, AWS DMS, custom Python scripts
```sql
-- CDC ingestion with Debezium (logical replication)
-- Captures INSERT, UPDATE, DELETE as events
-- Kafka topic: dbserver.public.orders
-- Each message contains:
-- {
--   "op": "u",                -- operation: c=create, u=update, d=delete
--   "before": { "order_id": 1, "status": "pending" },
--   "after":  { "order_id": 1, "status": "shipped" },
--   "ts_ms": 1710000000000
-- }

-- Batch ingestion: incremental load pattern
SELECT *
FROM source_db.orders
WHERE updated_at > :last_watermark
  AND updated_at <= :current_watermark;
```

The transformation layer converts raw ingested data into a format suitable for analysis. This includes cleaning (handling NULLs, deduplication, type casting), enrichment (joining reference data), aggregation (pre-computing metrics), and modeling (building dimensional models, fact tables). In ELT architectures, transformations happen inside the data warehouse using SQL. In ETL architectures, transformations happen before loading, often in Spark or Python.
Common tools: dbt, Spark, Flink, Snowflake SQL, BigQuery SQL, Dataflow
```sql
-- dbt transformation: build a fact table
-- models/marts/fct_orders.sql
WITH orders AS (
    SELECT * FROM {{ ref('stg_orders') }}
),
customers AS (
    SELECT * FROM {{ ref('dim_customers') }}
),
products AS (
    SELECT * FROM {{ ref('dim_products') }}
)
SELECT
    o.order_id,
    o.order_date,
    c.customer_key,
    p.product_key,
    o.quantity,
    o.unit_price,
    o.quantity * o.unit_price AS total_amount
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN products p ON o.product_id = p.product_id
WHERE o.order_date IS NOT NULL
```

The storage layer persists data at various stages of the pipeline. Raw data lands in object storage (S3, GCS) or a data lake. Transformed data lives in a data warehouse (Snowflake, BigQuery, Redshift) or a lakehouse (Databricks, Iceberg). The storage layer must support partitioning (organize data by date, region), compaction (merge small files), time travel (query historical versions), and access control (restrict who sees what). Storage format matters: Parquet and ORC for columnar analytics, Avro for schema evolution, Delta/Iceberg for ACID transactions on lakes.
Common tools: S3, GCS, Snowflake, BigQuery, Redshift, Databricks, Apache Iceberg, Delta Lake
```sql
-- Iceberg table with partitioning and time travel
CREATE TABLE analytics.fct_orders (
    order_id    BIGINT,
    order_date  DATE,
    customer_id BIGINT,
    amount      DECIMAL(12, 2)
)
USING iceberg
PARTITIONED BY (months(order_date))
TBLPROPERTIES (
    'write.metadata.delete-after-commit.enabled' = 'true',
    'history.expire.max-snapshot-age-ms' = '604800000'  -- 7 days
);

-- Time travel: query yesterday's version
SELECT * FROM analytics.fct_orders
FOR SYSTEM_TIME AS OF TIMESTAMP '2025-03-14 00:00:00';
```

The serving layer exposes processed data to consumers: BI dashboards, APIs, ML models, and operational applications. The serving layer must optimize for the access patterns of its consumers. Dashboards need fast aggregation queries (pre-aggregated tables, materialized views). APIs need low-latency point lookups (Redis, DynamoDB). ML models need feature stores (Feast, Tecton). The serving layer is where SLAs live: query latency, freshness guarantees, and availability targets.
Common tools: Looker, Tableau, Metabase, Redis, Feast, dbt metrics, Cube.dev
```sql
-- Materialized view for dashboard serving
CREATE MATERIALIZED VIEW dashboard.daily_revenue AS
SELECT
    DATE_TRUNC('day', order_date) AS day,
    region,
    product_category,
    SUM(amount) AS total_revenue,
    COUNT(DISTINCT customer_id) AS unique_customers,
    COUNT(*) AS order_count
FROM analytics.fct_orders
JOIN analytics.dim_products USING (product_id)
GROUP BY 1, 2, 3;

-- Refresh on schedule
-- Snowflake: ALTER MATERIALIZED VIEW ... CLUSTER BY (day)
-- PostgreSQL: REFRESH MATERIALIZED VIEW CONCURRENTLY
```

These three patterns define how data flows through the pipeline and how different processing modes interact. Each pattern has clear trade-offs. Interviewers expect you to know all three and explain when each is appropriate.
Lambda architecture maintains two parallel processing paths: a batch layer for accuracy and a speed layer for low latency. The batch layer processes all historical data and produces correct, complete results on a schedule. The speed layer processes real-time events and produces approximate, up-to-the-second results. A serving layer merges outputs from both. The trade-off: you maintain two separate codebases (batch and streaming) that must produce consistent results. This dual-maintenance burden is Lambda's biggest drawback.
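The serving-layer merge is where Lambda's two paths reconcile. A minimal sketch, assuming day-keyed metric dictionaries; the function name and cutoff convention are hypothetical:

```python
def merge_lambda_views(batch_view: dict, speed_view: dict,
                       batch_cutoff: str) -> dict:
    """Merge per-day metrics: batch results are authoritative up to the
    cutoff date; speed-layer results fill in everything after it."""
    merged = dict(batch_view)
    for day, value in speed_view.items():
        if day > batch_cutoff:  # ISO dates compare lexicographically
            merged[day] = value
    return merged
```

Batch output wins wherever both layers cover the same period, which is exactly what makes the batch layer the source of truth.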
```
Sources -> [Batch Layer (Spark/Airflow)] -> Serving DB
Sources -> [Speed Layer (Kafka/Flink)]  -> Serving DB
Serving DB -> Consumers
```

Kappa architecture eliminates the batch layer entirely. All data is processed through a single streaming pipeline. Historical reprocessing is handled by replaying events from the stream's log (Kafka with long retention). This solves Lambda's dual-codebase problem: one pipeline, one codebase, one set of logic. The trade-off: streaming infrastructure must handle both real-time events and historical replay, which requires careful capacity planning. Kappa works best when the source system produces events (CDC, clickstream, IoT sensors).
```
Sources -> Event Log (Kafka) -> Stream Processor (Flink) -> Serving DB -> Consumers
Replay:    re-read from offset 0
```

The medallion architecture organizes data into three quality tiers. Bronze is raw, unprocessed data landed directly from sources. Silver is cleaned, validated, and deduplicated data. Gold is business-level aggregations and models ready for consumption. Each tier builds on the previous one. This pattern is popular in lakehouse environments (Databricks, Iceberg) because it provides clear data lineage, easy debugging (you can always go back to bronze), and progressive quality improvement. The tiers can be implemented as schemas, databases, or directory structures.
```
Sources -> Bronze (raw) -> Silver (cleaned) -> Gold (modeled) -> Consumers
```

The right architecture depends on your latency requirements, team size, data volume, and organizational maturity. Here is a decision framework.
If your consumers can tolerate hourly or daily freshness, batch pipelines are simpler, cheaper, and easier to debug. Most analytics use cases fit here. Airflow + dbt + Snowflake is a proven stack that handles most batch needs. Do not add streaming complexity until you have a concrete sub-minute latency requirement.
Real-time pricing, fraud detection, live dashboards, and alerting require streaming. Add a streaming path alongside batch (hybrid/Lambda) rather than replacing batch entirely. Kafka + Flink for stream processing, writing results to Redis or a fast-query store for serving.
If every data source produces events (clickstream, IoT, CDC), and your team is comfortable with stream processing, Kappa simplifies operations by eliminating the batch path. Historical reprocessing happens by replaying from Kafka. This requires long-retention Kafka topics and careful capacity planning for replay scenarios.
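Replay-based reprocessing is easier to reason about with a toy log standing in for Kafka. A sketch under that assumption; all names are hypothetical:

```python
class EventLog:
    """Append-only log standing in for a long-retention Kafka topic."""
    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)

    def replay(self, from_offset=0):
        # Reprocessing in Kappa = re-reading the log from an earlier offset.
        return iter(self.events[from_offset:])

def build_state(log, from_offset=0):
    """Fold the stream into serving state. Rerunning from offset 0 rebuilds
    all of history through the same single code path."""
    state = {}
    for event in log.replay(from_offset):
        state[event["key"]] = state.get(event["key"], 0) + event["amount"]
    return state
```

The point of the pattern: there is no second batch job to keep consistent; a full replay and live processing run the exact same fold.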
The bronze/silver/gold tier pattern is orthogonal to the processing model. You can apply it to batch, streaming, or hybrid pipelines. It organizes data by quality level and provides clear separation of concerns. Most modern data teams adopt some version of the medallion pattern regardless of their processing model choice.
These questions test system design skills, pattern selection, streaming challenges, and architecture comparison abilities.
What they test:
System design fundamentals and the ability to choose between batch, streaming, and hybrid approaches. The interviewer wants to see you identify the two different latency requirements and design appropriate paths for each. Real-time pricing needs sub-second data. Daily reports need complete, accurate aggregations.
Approach:
Start by identifying the two access patterns: real-time pricing (streaming) and daily reports (batch). For pricing: CDC or event stream from the rides database into Kafka. Flink or Spark Streaming computes surge metrics in real time and writes to Redis for the pricing service. For reports: Airflow orchestrates a daily batch pipeline that reads from the data lake (where streaming data is also persisted), transforms with dbt/Spark, and loads into the warehouse for BI dashboards. Share the event stream between both paths to avoid duplicate ingestion. This is a hybrid/Lambda approach. Mention the trade-off: two code paths to maintain, but each is optimized for its access pattern.
What they test:
Whether you understand progressive data quality and the lakehouse paradigm. The interviewer wants to hear about bronze (raw), silver (cleaned), gold (modeled) tiers and why this separation matters for debugging, reprocessing, and data quality.
Approach:
Define the three tiers. Bronze: raw data landed exactly as received from sources, with metadata columns (ingestion timestamp, source system). Silver: cleaned, deduplicated, validated data with consistent types and applied business rules. Gold: aggregated, modeled data ready for specific business use cases (fact tables, metric tables, feature tables). Why it works for lakehouses: Iceberg/Delta provide ACID transactions on the lake, so each tier can be a set of tables with schema enforcement. Debugging is easy because you can always inspect the bronze layer. Reprocessing means re-running silver from bronze, not re-ingesting from sources.
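The tier contracts can be sketched as pure functions over rows. The schema, validation rules, and function names here are illustrative assumptions, not a fixed medallion API:

```python
def to_silver(bronze_rows):
    """Bronze -> Silver: drop malformed rows, cast types, dedupe on order_id."""
    seen, silver = set(), []
    for row in bronze_rows:
        if row.get("order_id") is None or row.get("amount") is None:
            continue  # validation: reject incomplete records
        if row["order_id"] in seen:
            continue  # deduplication: keep first occurrence
        seen.add(row["order_id"])
        silver.append({"order_id": int(row["order_id"]),
                       "amount": float(row["amount"])})
    return silver

def to_gold(silver_rows):
    """Silver -> Gold: business-level aggregate ready for consumption."""
    return {"order_count": len(silver_rows),
            "total_amount": sum(r["amount"] for r in silver_rows)}
```

Because each step reads only the tier below it, reprocessing is just re-running `to_silver` over bronze, never re-ingesting from sources.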
What they test:
Practical streaming experience. Late data is the hardest problem in stream processing. Events can arrive seconds, minutes, or hours after they occurred. The interviewer checks whether you know about event time vs processing time, watermarks, and allowed lateness.
Approach:
Distinguish event time (when the event happened) from processing time (when the pipeline sees it). Use watermarks to bound how much lateness the pipeline tolerates: a 5-minute watermark means events arriving up to 5 minutes late are still assigned to their correct window. Events arriving after the watermark has passed are either dropped or routed to a side output for separate handling. In Flink, set a watermark strategy with bounded out-of-orderness; in Spark Structured Streaming, use withWatermark(). Mention the trade-off: longer watermarks increase completeness but add latency. For very late data (hours or days), use batch reprocessing to correct the streaming results.
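The watermark mechanics can be simulated in a few lines. This is a sketch, not Flink's API: the tumbling-window assignment, window size, and lateness bound are illustrative assumptions.

```python
def window_counts(events, window_s=60, max_lateness_s=300):
    """Count events per tumbling event-time window, tracking a watermark
    = max event time seen - allowed lateness. Events older than the
    watermark go to a side output instead of an already-emitted window."""
    windows, side_output = {}, []
    watermark = float("-inf")
    for event_time in events:                 # epoch seconds, arrival order
        watermark = max(watermark, event_time - max_lateness_s)
        if event_time < watermark:
            side_output.append(event_time)    # too late: handle separately
            continue
        window_start = (event_time // window_s) * window_s
        windows[window_start] = windows.get(window_start, 0) + 1
    return windows, side_output
```

Note how a single very fresh event (here, one at t=400) advances the watermark and retroactively makes a straggler (t=50) "too late", which is exactly the completeness-versus-latency trade-off the interviewer is probing.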
What they test:
Architecture comparison skills and practical judgment. Lambda provides accuracy guarantees through the batch layer. Kappa simplifies operations with a single streaming pipeline. The interviewer wants trade-off analysis, not just definitions.
Approach:
Lambda: two paths (batch + speed), batch layer is the source of truth, speed layer provides low-latency approximations. Pro: batch results are complete and correct. Con: two codebases computing the same metrics. Kappa: single streaming path, replay from the event log for reprocessing. Pro: one codebase, simpler operations. Con: replay can be slow for large historical datasets, and not all sources produce events naturally. Choose Lambda when: batch accuracy is critical (financial reconciliation, regulatory reporting), and the team can maintain two paths. Choose Kappa when: the source is naturally event-based (clickstream, IoT), latency requirements are uniform, and the team wants operational simplicity.
Practice sketching a Kappa topology from scratch in under ten minutes. That's the bar at staff-level DE rounds.