Batch vs Streaming: Advanced

A streaming media company at IPO scale ran a textbook Lambda architecture for three years. A batch layer in Spark recomputed daily aggregates from the full event log every night. A speed layer in Storm produced near-real-time approximations during the day. A serving layer merged the two at query time. The architecture worked. It also required maintaining two implementations of the same business logic in two languages with two operational profiles. When a regulator asked why a particular metric differed by 0.4 percent between the two layers, nobody had a clean answer. The team spent six months migrating to a Kappa-style architecture: one streaming pipeline, with batch reduced to replays from the same event log. The metric drift went to zero. The cost dropped by 40 percent. The team shrank by two engineers because there was only one codebase to operate. This lesson is about the architectures that try to bridge batch and streaming, and the discipline that determines whether the bridges hold.

What you will be able to do

Distinguish Lambda and Kappa architectures by their treatment of batch and stream layers

Apply freshness tier analysis to identify which nodes in a pipeline need which rhythm

Redesign a Lambda workload as Kappa and name what changes in code, storage, and operations

Lambda Architecture

Identify the three layers of Lambda architecture and explain why the constraints that motivated it have shifted.

Lambda architecture is the first widely adopted attempt to combine batch and streaming in one system. Nathan Marz proposed it around 2011 and later developed it at length in his book Big Data, drawing on his experience at BackType and Twitter. The motivation was specific to the era: batch frameworks (Hadoop MapReduce) were correct but slow; stream frameworks (Storm) were fast but produced approximate results. Lambda combined the two, using batch for the durable correct view and streaming for the live approximate view. Both layers fed a serving layer that merged them at query time. For a few years, Lambda was the canonical way to build a system that needed both freshness and correctness. The architecture became unfashionable not because it was wrong but because the underlying constraints changed.

The Three Layers

Batch layer

Correct, slow, complete

Processes the entire event log on a schedule (nightly or hourly). Produces the canonical view of the data. Tolerates being slow because the speed layer covers the freshness gap.

Speed layer

Fresh, fast, approximate

Processes events as they arrive. Produces a real-time delta over the batch layer. Allowed to be approximate because the next batch run will overwrite it with the correct value.

Serving layer

Merges the two at query time

Reads the batch view for everything older than the last batch run; reads the speed layer for everything since. Returns the union as a single answer to the consumer.

The Architecture in a Diagram

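[Diagram: sources write to an immutable event log, which feeds the batch layer (last run: midnight) and the speed layer (current minute: live); the serving layer (HBase + cache) merges both views for the consumer.]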

The event log is immutable: every event written, ever, in append-only storage. The master data set is the entire history. The batch layer recomputes the canonical view from the master data set on a schedule. The speed layer processes events as they arrive and emits a real-time delta. The serving layer merges the two: for any query, it returns batch_view + speed_view restricted to events after the last batch run. When the next batch run completes, the speed view for the now-covered window is discarded.
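
The merge rule can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a production serving layer: the event log, the cutoff value, and the function names are all hypothetical, and real systems would hold the views in a store such as HBase rather than in memory.

```python
from collections import Counter

# Hypothetical event log: (event_time_seconds, content_id) pairs, append-only.
EVENT_LOG = [
    (100, "show-a"),
    (150, "show-b"),
    (210, "show-a"),
    (260, "show-a"),
]
LAST_BATCH_RUN = 200  # the batch view covers events with time < 200


def build_batch_view(log, cutoff):
    """Canonical counts recomputed from the log up to the last batch run."""
    return Counter(cid for ts, cid in log if ts < cutoff)


def build_speed_view(log, cutoff):
    """Real-time delta: counts for events since the last batch run."""
    return Counter(cid for ts, cid in log if ts >= cutoff)


def serve(content_id, batch_view, speed_view):
    """Serving-layer merge: batch view plus the speed-layer delta."""
    return batch_view[content_id] + speed_view[content_id]


batch_view = build_batch_view(EVENT_LOG, LAST_BATCH_RUN)
speed_view = build_speed_view(EVENT_LOG, LAST_BATCH_RUN)
```

When the next batch run completes, the cutoff moves forward and the speed view for the newly covered window can be dropped, which is why both views must agree on exactly where the boundary sits.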

Why Marz Proposed It

Constraint in 2011	What It Forced	Lambda's Answer
Hadoop was the cheap correct engine	Batch was unavoidable for the canonical view	Use batch for the durable layer
Storm was the fast engine but tradeoffs were harsh	Streaming had at-most-once or at-least-once but rarely exactly-once	Use streaming only as an approximation; batch overwrites it
Storage was cheap; reprocessing was the cure for any bug	Recomputing from the master log was always available	The batch layer is a complete recompute on every run
Serving had to be fast and merge two sources	A serving layer that combined views was needed	HBase or Cassandra plus a cache, with merge logic

Why Lambda Became Unfashionable

Lambda was correct for its constraints, but the constraints changed. By 2018 or so, three things were different. First, streaming engines (Flink, Spark Structured Streaming) supported exactly-once semantics on most sinks, so the need for a separate approximate layer dropped. Second, cloud warehouses and lakehouse platforms (Snowflake, BigQuery, Databricks) made it cheap to run a single SQL transform across both batch and streaming inputs, which made the two-codebase tax of Lambda visible. Third, data contracts and reproducibility expectations rose: a 0.4 percent drift between the batch and speed layers was no longer acceptable when the speed layer was supposed to be a faster version of the same logic.

•Lambda's Strengths

Correctness is guaranteed by the batch layer overwriting the speed layer
Streaming can be approximate, opening up cheaper engines
The master event log is durable; bugs are fixed by reprocessing
Each layer has clear semantics and a clear lifecycle

•Lambda's Weaknesses

Two implementations of the same logic in two languages
Operational complexity doubles: two pipelines, two failure modes
Drift between layers is real and hard to explain to consumers
Serving layer merge logic adds latency and edge cases at the boundary

What Lambda Got Right

The principles that made Lambda work are still right, even when the architecture itself is no longer the default. The immutable event log is the source of truth, not derived views. Recompute from the log is always available; bugs in the transform are recoverable. The serving layer is a separate concern from the processing layer. Modern architectures inherit all three principles. What modern architectures discard is the assumption that batch and streaming require two distinct codebases. The principles persist even though the specific three-layer shape has been mostly replaced.

Lambda's durable contributions to data engineering thinking:

▸The master event log is the source of truth; everything else is a derived view
▸Recomputability is a first-class property; bugs are fixed by reprocessing, not patching
▸Serving is a separate concern from processing
▸Different freshness tiers can coexist on the same data, served from different views

TIP

When inheriting a Lambda system, name which of its three principles still apply (immutable log, recompute, separated serving) and which constraint that drove its three-layer shape (Hadoop slow, Storm approximate) no longer holds. The audit usually reveals which parts to keep and which to retire.

✓Do

Keep the immutable event log as the source of truth, regardless of architecture
Preserve the recompute-from-log property when migrating away from Lambda
Document why Lambda was chosen historically; the constraints may inform the next architecture

✗Don't

Build a new Lambda system in 2026 unless the streaming layer must be approximate
Discard the master log when migrating away from Lambda; it is the most durable piece
Treat Lambda as obsolete; its principles are still the right framing

[Figure: events flow along a nightly batch path or a continuous stream path into storage (warehouse) and on to the consumer (dashboard).]

Two ways data moves: batch processes a whole chunk on a schedule (accurate, delayed); streaming processes each event as it arrives (fast, continuous). The latency SLA decides which.

Kappa: Stream Only, Batch Replay

Apply Kappa architecture: stream-only with batch as replay, and explain what is given up in storage retention to gain a single codebase.

Kappa architecture, proposed by Jay Kreps in 2014, is the answer to Lambda's two-codebase problem. The idea is simple: keep only the streaming layer. The event log is the source of truth, the streaming pipeline produces the canonical view, and batch becomes a special case (replaying the event log through the same streaming pipeline) rather than a separate codebase. One implementation of the logic, one operational profile, one set of failure modes. The simplification is real, and Kappa has become the default architecture for new event-driven systems built since around 2018.

The Architecture in a Diagram

[Diagram: a single stream processor (batch and stream are the same code) reads the event log (Kafka, Pulsar) fed by the source and maintains materialized view 1 (current logic); a replay with new logic builds materialized view 2 in parallel; consumers switch from view 1 to view 2 when it is validated.]

There is no batch layer. Reprocessing happens by replaying the event log through the same streaming pipeline, usually into a new output table. When the new table catches up to the live one and consumers have validated the result, traffic is cut over. The pattern is sometimes called the parallel materialized view: the new version exists alongside the old until cutover. This pattern replaces the Lambda batch-overwrites-speed mechanic with a simpler one: the new code runs against the entire log and produces a canonical replacement view.
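
The parallel-materialized-view pattern can be sketched as follows. This is an illustrative toy, assuming an in-memory list stands in for the event log and that its index plays the role of the offset; `logic_v1`, `logic_v2`, `replay`, and `ServingPointer` are hypothetical names, not part of any streaming engine.

```python
# Hypothetical event log; the list index plays the role of the log offset.
EVENT_LOG = [
    {"content_id": "show-a", "is_bot": False},
    {"content_id": "show-a", "is_bot": True},
    {"content_id": "show-b", "is_bot": False},
]


def logic_v1(view, event):
    """Current logic: every event counts as a view."""
    view[event["content_id"]] = view.get(event["content_id"], 0) + 1


def logic_v2(view, event):
    """New logic: bot traffic no longer counts."""
    if not event["is_bot"]:
        view[event["content_id"]] = view.get(event["content_id"], 0) + 1


def replay(log, logic, from_offset=0):
    """Run the same pipeline code over the log from a chosen offset."""
    view = {}
    for event in log[from_offset:]:
        logic(view, event)
    return view


class ServingPointer:
    """Names the view that consumers currently read."""

    def __init__(self, view):
        self.active = view

    def cut_over(self, candidate, validated):
        """Switch consumers only after the candidate has been validated."""
        if validated:
            self.active = candidate
        return validated


view_v1 = replay(EVENT_LOG, logic_v1)
view_v2 = replay(EVENT_LOG, logic_v2)  # parallel materialized view
serving = ServingPointer(view_v1)
serving.cut_over(view_v2, validated=True)
```

The old view is never mutated during the replay, so a failed validation costs nothing beyond the replay compute: consumers simply keep reading view 1.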

What Kappa Gives Up

Lambda Property	Kappa's Tradeoff
Streaming-log retention can be short; the speed layer needs only the window since the last batch run	Kappa requires keeping the full event log indefinitely or paying for cold replay
The batch layer can use cheap correct engines	Kappa pays streaming compute prices for everything, even the daily aggregate
Reprocessing happens automatically every batch run	Kappa requires explicit replay jobs and parallel materialized views
Approximations in the speed layer are acceptable	Kappa demands correctness in streaming; exactly-once is not optional

Why Kappa Became the Default

Three things changed since Lambda was proposed. Streaming engines (Flink, Kafka Streams, Spark Structured Streaming with Delta sinks) now support exactly-once semantics for most sinks. The approximation excuse for streaming is gone. Storage costs dropped enough that keeping a year of event log in Kafka or Pulsar (with tiered storage to S3) is affordable. The retention concern is gone for most companies. And operational tooling for streaming (Datadog integrations, lag dashboards, deployment patterns) matured to the point where running a single streaming pipeline is no longer harder than running a batch pipeline. The two-codebase tax of Lambda is no longer worth paying when one codebase can do the same job.

Replay as the Universal Backfill

The single defining property of Kappa is that backfill, reprocessing, schema migration, and bug fixing are all the same operation: replay the event log from a chosen offset through the pipeline. There is no separate backfill code path. The batch backfill that used to be a distinct job becomes a streaming job that reads from offset 0, runs through the entire log at high throughput, and writes to a parallel materialized view. The cost of replay is bounded by the log size and the streaming engine's throughput; for most companies that cost is real but predictable, and the benefit is a single code path for all reprocessing scenarios.

	# A Kappa replay job, expressed as a Flink deployment manifest
	apiVersion: flink.apache.org/v1
	kind: FlinkDeployment
	metadata:
	  name: orders-pipeline-v2-replay
	spec:
	  job:
	    parallelism: 16
	    args:          # the four argument values are missing here
	      -
	      -
	      -
	      -
	    upgradeMode: stateless
	  restartNonce: 1
	# When orders_v2 catches up to live and consumers validate, swap them over

What Kappa Cannot Do

Kappa is not the right architecture for every workload. Pipelines that involve genuinely non-incremental joins (a join between the entire customer table and the entire orders table, with no useful temporal semantics) are still better served by batch. Pipelines on data that does not have a clean event-log source (a snapshot of a third-party warehouse pulled daily, with no incremental access) cannot be Kappa because there is no log to replay. Pipelines that require the absolute lowest cost per row (cohort analyses run quarterly, ML training data builds run weekly) are still cheaper as batch. Kappa won the default conversation but did not win every workload.

✓Kappa Wins When

Source data is naturally an event log (Kafka, Pulsar, CDC stream)
Streaming engine supports exactly-once for the sinks in use
Log retention is affordable; tiered storage is in place
Single codebase is operationally cheaper than two codebases

•Lambda or Pure Batch Wins When

Source data is a periodic snapshot (no log to replay)
Workload is rare and reading the log every replay is wasteful
Streaming exactly-once is hard or unsupported for the sink
Cost-per-row is the dominant constraint, not freshness

The Storage Tradeoff

Storage retention is the cost of admission for Kappa. A company that wants to be able to replay the last year of events through a new pipeline version must keep the last year of events. Kafka tiered storage and Pulsar's segmented storage moved this cost from prohibitive to manageable. A modern Kafka cluster with tiered storage to S3 can keep years of events for cents per GB-month. Without tiered storage, Kappa retention costs were the architecture's biggest weakness; with it, the weakness mostly disappears.

Signals that Kappa is the right architecture:

▸Source is a Kafka topic, Pulsar topic, or CDC stream from the start
▸Exactly-once semantics are supported by the engine and the sink
▸Tiered storage or another long-retention mechanism is affordable
▸The team has streaming engineering capacity; one pipeline is one operational profile

TIP

When proposing Kappa for a new system, write the retention requirement in concrete months and the cost per GB of long retention. The retention conversation is the one that decides whether the architecture is feasible, not the streaming engine choice.
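
The retention conversation usually comes down to simple arithmetic, sketched below. Every number here is an assumption for illustration: the event volume, the event size, and especially the price of $0.02 per GB-month, which is a placeholder rather than a quoted rate. The sketch also ignores replication and compression, both of which move the real figure.

```python
def monthly_retention_cost(events_per_day, bytes_per_event,
                           retention_months, usd_per_gb_month,
                           days_per_month=30):
    """Steady-state monthly cost of holding `retention_months` of events."""
    retained_events = events_per_day * days_per_month * retention_months
    retained_gb = retained_events * bytes_per_event / 1e9
    return retained_gb * usd_per_gb_month


# Hypothetical workload: 500 million events/day at 1,000 bytes each,
# 12 months of retention, and an assumed cold-tier price of $0.02 per GB-month.
retained_gb = 500_000_000 * 30 * 12 * 1_000 / 1e9
cost = monthly_retention_cost(500_000_000, 1_000, 12, 0.02)
```

Under these assumptions the log holds about 180,000 GB once the retention window has filled, which works out to roughly $3,600 per month. Writing the proposal in this form makes it easy to see how the cost scales if retention doubles or the event payload grows.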

Unified Engines: Where Lines Blur

Identify which aspects of batch and streaming are unified by modern engines and which still differ at runtime.

The cleanest version of Kappa requires an engine that runs the same code in batch and streaming modes. Modern engines have moved toward this ideal. Spark Structured Streaming exposes a unified DataFrame API where the same query can run as a batch job, a micro-batch streaming job, or a continuous streaming job by changing one configuration. Apache Flink runs streaming as the default and batch as a special case (a bounded stream). Apache Beam abstracts both into a single programming model. The convergence is real and is one of the most consequential architectural shifts of the last decade. But the line between batch and streaming has not disappeared; it has moved underneath the API into the runtime, where it still shapes cost, failure modes, and operational behavior.

What the Unified Engines Unify

Layer	Spark	Flink	Beam
API	Same DataFrame for batch and streaming	DataStream API; bounded stream is batch	Single Pipeline; runtime selects the engine
Trigger model	processingTime, once, continuous	Time-based, count-based, custom	Trigger objects abstracted across runners
Watermarks	withWatermark on event-time columns	Native watermark assigners (WatermarkStrategy)	Watermarks estimated by each source and propagated by the runner
State	Checkpointed RocksDB or HDFS	RocksDB local plus async snapshots	Runner-specific state backend

What the Unified Engines Still Distinguish

The unified API hides differences that still exist underneath. Cost remains different: a unified pipeline running in streaming mode pays continuous compute prices, while the same pipeline running in batch mode pays only for the run window. Failure recovery still differs: a batch run reruns the partition; a streaming run restarts from the latest checkpoint and replays in-flight events. Observability still differs: batch has a binary did-it-finish signal; streaming has lag, throughput, and latency percentiles. Schema evolution still differs: a batch run with a new schema starts fresh next time; a streaming pipeline must drain, redeploy, and restart with state migration. The API is unified; the runtime is not.

•Unified API Hides

Whether the engine reads bounded or unbounded input
Whether the trigger fires once or repeatedly
Whether state is in-memory for one run or persistent across runs
Whether the engine runs to completion or runs forever

•Runtime Still Distinguishes

Cost: continuous compute vs. on-demand compute
Failure mode: rerun partition vs. checkpoint-and-replay
Observability: binary success vs. lag and latency percentiles
Deployment: redeploy at next run vs. drain-and-replace mid-flight

Spark Structured Streaming as a Concrete Case

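	# The same logic, three runtimes, three cost profiles.
	# `events` is assumed to be a streaming DataFrame built with spark.readStream;
	# the checkpoint paths, broker address, and topic name are placeholders.

	# 1. Pure batch: trigger=once, runs once and exits
	# (newer Spark versions prefer availableNow=True over once=True)
	(events.writeStream
	.format('delta')
	.option('checkpointLocation', 's3://lake/_checkpoints/daily_view')
	.trigger(once=True) # one-shot batch
	.start('s3://lake/daily_view'))

	# 2. Micro-batch: trigger=processingTime, runs every 1 minute forever
	(events.writeStream
	.format('delta')
	.option('checkpointLocation', 's3://lake/_checkpoints/live_view')
	.trigger(processingTime='1 minute') # continuous micro-batch
	.start('s3://lake/live_view'))

	# 3. Continuous: trigger=continuous, ultra-low latency, experimental.
	# Continuous mode supports only Kafka, console, and memory sinks (not Delta)
	# and a Kafka or rate source, so this variant publishes JSON rows to Kafka.
	(events.selectExpr('to_json(struct(*)) AS value').writeStream
	.format('kafka')
	.option('kafka.bootstrap.servers', 'broker:9092')
	.option('topic', 'realtime_view')
	.option('checkpointLocation', 's3://lake/_checkpoints/realtime_view')
	.trigger(continuous='100 milliseconds') # true streaming; value is the checkpoint interval
	.start())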

Three lines of trigger configuration produce three different cost profiles, three different latency profiles, and three different failure modes. The transformation logic is identical. This is the unified-engine ideal. It has trade-offs: continuous mode is experimental, much less mature than micro-batch, and supports only a small set of operations, sources, and sinks (Delta is not among them), so not every query can move to it unchanged. The unified engine is most useful as a way to start in batch and graduate to micro-batch, then to streaming, without rewriting the logic. Each step is mostly a configuration change rather than a rewrite.

Where the Line Still Matters

Concern	Why the Line Still Matters
Cost budgeting	Streaming and batch have order-of-magnitude different costs even at identical logic
On-call rotation	Streaming requires lag-based alerting; batch requires schedule-based alerting
Schema migration	Streaming requires drain-and-replace; batch swaps at the next run boundary
Backfill mechanics	Streaming replays from offsets; batch reruns by date partition
Failure-recovery testing	Streaming needs chaos tests during runs; batch needs partition replay tests
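
The on-call row of the table can be made concrete with two small checks. This is a sketch under assumptions: the thresholds, offsets, and timestamps are invented for illustration, and the function names are hypothetical rather than taken from any monitoring tool.

```python
from datetime import datetime, timedelta


def streaming_lag_alert(latest_offset, committed_offset, max_lag_events):
    """Streaming: alert when the consumer falls too far behind the log head."""
    return (latest_offset - committed_offset) > max_lag_events


def batch_schedule_alert(now, last_success, expected_interval, grace):
    """Batch: alert when no successful run landed within the schedule plus grace."""
    return (now - last_success) > (expected_interval + grace)


now = datetime(2026, 1, 2, 6, 0)
stream_alert = streaming_lag_alert(1_050_000, 1_000_000, max_lag_events=10_000)
batch_alert = batch_schedule_alert(
    now,
    last_success=datetime(2026, 1, 1, 2, 0),
    expected_interval=timedelta(hours=24),
    grace=timedelta(hours=2),
)
```

The two checks take different inputs entirely: one compares positions in the log, the other compares wall-clock times against a schedule. That difference is why a unified API does not give a unified alerting setup.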

How Senior Engineers Use Unified Engines

Unified engines do not eliminate the batch-versus-streaming decision. They reduce its cost. The decision can be made later, revisited, and reversed without rewriting the logic. A pipeline that starts as nightly batch can graduate to hourly micro-batch when freshness needs tighten, then to streaming when they tighten further, all while keeping the same SQL or DataFrame code. The senior pattern is to write the logic once, run it batch by default, and graduate the rhythm only when a specific consumer's freshness need requires it. The unified engine pays back its complexity precisely in this case, where the decision is wrong less often because it can be revisited cheaply.
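
One way to keep the rhythm a configuration choice is to derive the trigger from the consumer's freshness target. The sketch below is an assumption-laden illustration: the one-hour threshold and the "half the target" interval rule are arbitrary choices, and `trigger_for` is a hypothetical helper. The returned keys match keyword arguments accepted by PySpark's `DataStreamWriter.trigger`.

```python
def trigger_for(freshness_seconds):
    """Map a freshness target to keyword arguments for writeStream.trigger()."""
    if freshness_seconds >= 3600:
        # Hourly or looser: one-shot run, launched by an external scheduler.
        return {"availableNow": True}
    # Tighter than hourly: micro-batch at half the freshness target.
    interval = max(1, freshness_seconds // 2)
    return {"processingTime": f"{interval} seconds"}


nightly = trigger_for(24 * 3600)
hourly = trigger_for(3600)
fifteen_minutes = trigger_for(15 * 60)
```

With PySpark the result would be applied as `events.writeStream.trigger(**trigger_for(900))`, so graduating a pipeline means changing one number. Note that `availableNow` requires Spark 3.3 or later.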

Questions a unified engine answers cheaply that a split-engine architecture answered expensively:

▸Can this batch pipeline run as micro-batch without a code change? Yes, with a config change.
▸If a team made this streaming pipeline batch for cost reasons, what would the team lose? The unified API lets both modes be run and compared.
▸How does the same logic behave under streaming versus batch failure modes? Run both and compare.
▸Can a streaming pipeline be backfilled by running its code as a one-shot batch? Yes, the unified API supports it.

Unified engines hide the batch-streaming line at the API level; the same code runs as either.

The line still matters in cost, failure mode, observability, schema evolution, and backfill mechanics.

Senior engineers use the unified API to write logic once and revisit the rhythm cheaply as freshness needs change.

TIP

When picking an engine for a new pipeline, choose the unified engine that supports the rhythm range the consumer might want over the next two years, not the one that fits the current rhythm exactly. The optionality compounds.

✓Do

Write logic against a unified engine API so rhythm changes are config changes, not rewrites
Test both batch and streaming runtimes for any pipeline that might graduate between them
Keep observability separate per rhythm; unified API does not unify lag and schedule semantics

✗Don't

Treat the unified API as a guarantee of identical operational behavior
Pick a streaming-only engine for a workload that may always be batch
Skip the cost conversation because the API is the same; runtime cost differs by an order of magnitude

Per-Node Freshness Tier Analysis

Annotate every node in a pipeline with an explicit freshness tier and identify mismatches that produce hidden cost or unmet consumer expectations.

A single pipeline rarely needs one freshness tier across every node. The source might produce events continuously. The raw landing layer might lag the source by seconds. The curated layer might rebuild hourly. The serving layer might refresh on a per-consumer schedule. Treating the entire pipeline as one tier (the strictest one) overbuilds most nodes; treating it as the loosest tier underbuilds the consumer-facing edge. Senior engineers tier each node explicitly and label it on the architecture diagram. The discipline is the difference between a pipeline that meets its consumers' needs at minimum cost and one that does not.

Tiers Per Layer in a Layered Pipeline

Layer	Typical Freshness Tier	Why
Source	Continuous (what the producer emits)	Cannot be tightened by the pipeline; bound is set upstream
Raw landing	Tier 2 to 3 (under 15 min to under 2 hr)	Lags source by ingestion overhead; cheap to keep tight
Curated	Tier 3 to 4 (under 2 hr to daily)	Joins and aggregations are expensive; refreshed when consumers actually read
Serving	Tier 1 to 4 (per consumer)	Consumer-facing; tier matches the specific consumer's need

Why Mixing Tiers Is Fine

A pipeline whose raw layer is tier 2 (under 15 minutes) and whose curated layer is tier 4 (daily) is not contradictory. It means the raw data is available within 15 minutes for any consumer that wants to read directly from it, and the curated rollups are refreshed once a day for consumers that read the precomputed shapes. Two consumers can read from the same pipeline at different freshness, served from different layers. The architecture supports both as long as the layers are labeled and the consumers know which one they read. Mixing tiers becomes a problem only when the labels are missing and consumers assume the wrong tier.

Raw layer tier

As fresh as ingestion allows

Typically tier 2 to 3. Cheap to keep tight because no transform runs here. Ingestion lag plus storage commit time sets the floor.

Curated layer tier

As fresh as the most demanding shared consumer

Typically tier 3 to 4. Refreshed when the joins and aggregations are worth the compute. Multiple downstream consumers share this tier.

Serving layer tier

Exactly as fresh as the named consumer requires

Per consumer; ranges tier 1 to 4. The serving layer is where the freshness-cost tradeoff is exposed and tuned to a specific dashboard, model, or application.

The Tier Label as a Diagram Element

	Kafka events
	  | freshness: continuous
	  v
	S3 raw/events/dt=YYYY-MM-DD/hr=HH
	  | freshness: tier 2 (5 min lag)
	  |
	  +-- freshness: tier 2 (10 min lag)    consumer: live ops dashboard
	  |
	  +-- freshness: tier 4 (next morning)  consumer: executive dashboard
	  |
	  +-- freshness: tier 3 (under 2 hr)    consumer: ML training, online inference

Each edge carries an explicit freshness label. Each consumer-facing node names its tier. The architecture is one pipeline with multiple tiers downstream, each fed from the same raw layer. The shape is more efficient than three independent pipelines and more legible than a single tier-1 pipeline that meets all consumers' needs at peak cost. The labels are what make the shape work; without them, consumers and operators cannot tell which tier they are reading.

How to Pick a Tier per Node

The tier-per-node algorithm:

▸Start at the consumer. What freshness does each consumer actually need?
▸Walk backward. Each upstream node must be at least as fresh as its strictest downstream consumer.
▸Allow upstream nodes to be tighter if they have other consumers with stricter needs.
▸Allow upstream nodes to be looser only if no downstream consumer needs the tighter cadence.
▸Document each node's tier; mismatches between adjacent tiers are the most common bug.
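
The backward walk can be sketched as a small graph traversal. This is a minimal illustration, assuming the pipeline is a DAG described as a dictionary of upstream-to-downstream edges and that tiers are numbered so that 1 is the freshest and 4 the loosest; the node names and the `required_tiers` helper are hypothetical.

```python
LOOSEST_TIER = 4  # tier 1 is the freshest, tier 4 the loosest


def required_tiers(edges, consumer_needs):
    """Walk backward from consumers: a node is as fresh as its strictest reader."""
    nodes = set(edges) | set(consumer_needs)
    nodes |= {d for downstreams in edges.values() for d in downstreams}
    memo = {}

    def tier(node):
        if node not in memo:
            needs = [tier(d) for d in edges.get(node, [])]
            if node in consumer_needs:
                needs.append(consumer_needs[node])
            # A smaller tier number means fresher, so the strictest need is the min.
            memo[node] = min(needs, default=LOOSEST_TIER)
        return memo[node]

    return {node: tier(node) for node in nodes}


# Hypothetical pipeline: upstream node -> downstream nodes.
EDGES = {
    "raw": ["curated", "live_ops_dashboard"],
    "curated": ["executive_dashboard", "ml_training"],
}
CONSUMER_NEEDS = {
    "live_ops_dashboard": 2,
    "executive_dashboard": 4,
    "ml_training": 3,
}
tiers = required_tiers(EDGES, CONSUMER_NEEDS)
```

In this example the curated layer lands at tier 3 because ML training reads it, and the raw layer lands at tier 2 because the live ops dashboard reads it directly. Neither node is forced to the other's tier.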

Tier Mismatches as a Failure Mode

A tier mismatch is when an upstream node refreshes less often than a downstream consumer expects. If the curated layer rebuilds hourly and the serving layer runs every minute, the serving layer cannot be fresher than the curated layer feeding it. The consumer sees data that is up to an hour stale, regardless of how often the serving layer queries. Tier mismatches are usually invisible until a consumer asks why the dashboard is not as fresh as it should be. Annotating tiers on the diagram exposes the mismatch at design time. Skipping the annotation produces post-launch surprises.

Tier Mismatch	Symptom	Fix
Upstream daily, downstream hourly	Hourly view does not change between daily upstream runs	Tighten upstream cadence or relax downstream tier
Upstream tier-2 streaming, downstream nightly batch	Streaming work is wasted; nightly only sees what daily batch would see	Either consumer reads streaming directly, or upstream is downgraded to nightly
Two consumers at different tiers reading the same node	One consumer overpays for freshness, or the other underreceives	Split into two serving nodes at the appropriate tiers
Source faster than the rest of the pipeline	End-to-end latency floor is much higher than source latency	Tighten the slowest node; end-to-end latency can be no better than the slowest node on the path
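
A design-time check for the first two rows of the table can be sketched as below. It assumes each node already carries a declared tier (1 is the freshest) and that edges run from upstream to downstream; the finding labels and the `find_tier_mismatches` helper are hypothetical. The second finding is only a prompt for review, since a direct reader of the upstream node may simply be missing from the graph.

```python
def find_tier_mismatches(edges, declared):
    """Flag edges where declared tiers (1 = freshest) cannot work as labeled."""
    findings = []
    for upstream, downstreams in edges.items():
        for downstream in downstreams:
            # A downstream node cannot be fresher than the input feeding it.
            if declared[upstream] > declared[downstream]:
                findings.append(("stale_input", upstream, downstream))
        # Upstream is fresher than every reader: the extra freshness may be unused.
        if downstreams and declared[upstream] < min(declared[d] for d in downstreams):
            findings.append(("unused_freshness", upstream, None))
    return findings


# Hypothetical declared tiers on a three-node chain.
EDGES = {"raw": ["curated"], "curated": ["serving"]}
DECLARED = {"raw": 2, "curated": 4, "serving": 3}
findings = find_tier_mismatches(EDGES, DECLARED)
```

Here the curated node (tier 4) feeds a serving node labeled tier 3, which is the stale-input case, and the raw node (tier 2) is fresher than its only listed reader, which is worth a second look.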

The Cost Story Per Tier

Cost grows roughly geometrically as tiers tighten. Tier 4 to tier 3 is roughly 2x to 5x. Tier 3 to tier 2 is another 2x to 5x. Tier 2 to tier 1 is another 5x to 20x. A pipeline that runs one tier-1 path for the single consumer that asked for it, while its shared curated rollups stay at tier 4, pays tier-4 prices for most of the work and tier-1 prices only along that one path. Tier-per-node analysis is the discipline that lets that cost shape be deliberate. Without it, engineers tend to either tier everything to the strictest consumer (overspending) or tier everything to the loosest consumer (under-delivering).

•Single-Tier Pipeline

Every node refreshes at the strictest consumer's tier
Cost is dominated by the strictest tier multiplied by every node
Consumers with looser needs overpay for unused freshness
Architecture is simpler but more expensive

✓Tier-Per-Node Pipeline

Each node refreshes at the tier its downstream consumers actually need
Cost is the sum across nodes, each at its appropriate tier
Consumers pay for the freshness they read; no excess
Architecture is more complex but operates at minimum cost
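
The comparison can be made concrete with a small calculation. The multipliers below are assumptions: tier 4 is taken as one cost unit, and the steps of 3x, 3x, and 10x are hypothetical picks inside the ranges quoted earlier in this section. The node tiers are likewise invented for illustration.

```python
# Assumed relative cost of running one node at each tier (tier 4 = 1 unit).
RELATIVE_COST = {4: 1, 3: 3, 2: 9, 1: 90}

# Hypothetical per-node tiers, each set by its strictest downstream reader.
NODE_TIERS = {
    "raw": 2,
    "curated": 3,
    "live_ops_dashboard": 2,
    "executive_dashboard": 4,
    "ml_training": 3,
}


def single_tier_cost(node_tiers):
    """Every node runs at the strictest tier found anywhere in the pipeline."""
    strictest = min(node_tiers.values())
    return RELATIVE_COST[strictest] * len(node_tiers)


def tier_per_node_cost(node_tiers):
    """Each node runs at its own tier; total cost is the sum across nodes."""
    return sum(RELATIVE_COST[tier] for tier in node_tiers.values())


single = single_tier_cost(NODE_TIERS)
per_node = tier_per_node_cost(NODE_TIERS)
```

Under these assumed multipliers the single-tier design costs 45 units against 25 for the tier-per-node design. The gap widens sharply if any one consumer needs tier 1, because the single-tier design then pays the tier-1 price at every node.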

TIP

When designing a new pipeline, tier each node explicitly on the architecture diagram. The label is the difference between a pipeline whose cost shape is deliberate and one whose cost shape is accidental.

Lambda to Kappa Worked Example

Walk through a Lambda-to-Kappa migration on a real workload and name what changes in code, storage, and operations.

The synthesis exercise walks through a real-shaped migration: a workload originally designed as Lambda, redesigned as Kappa, with explicit notes on what changes in code, in storage, and in operations. The example is a streaming media company's content engagement pipeline. The exercise shows that the migration is not a rewrite; it is a careful retirement of the batch layer and a tightening of the streaming layer, with the immutable event log surviving as the architectural anchor.

The Lambda Starting Point

The original architecture has three layers. A batch layer, written in Spark, recomputes daily content engagement aggregates (views, likes, completion rate) from the full event log every night. A speed layer, written in Storm, processes the live event stream and produces near-real-time approximations. A serving layer, in HBase, merges the two on read: the batch view for everything older than midnight, the speed view for the current day. A 0.4 percent drift between the two layers occasionally appears and is hard to explain. The team operates two codebases (Spark for batch, Storm for streaming) and two deployment pipelines.

[Diagram: Lambda starting point. Kafka content_events feeds both a Spark batch layer and a Storm speed layer, whose views are merged in HBase. Operations: two codebases (Scala/Spark + Java/Storm), two on-calls, drift.]

The Kappa Redesign

The redesign keeps the immutable Kafka event log as the source of truth and replaces the two processing layers with one Flink streaming pipeline. Flink processes events end to end with exactly-once semantics, writes to a single materialized view in Apache Iceberg on S3, and serves consumers through Trino. Backfill, schema migration, and bug fixes all become replay jobs: Flink starts from offset 0 with the new code, writes to a parallel Iceberg table, and the serving layer cuts over when the parallel view catches up. There is one codebase, one operational profile, and no drift between layers because there is only one layer.

[Diagram: Kappa redesign. Kafka content_events feeds a single Flink streaming pipeline; a replay job starts Flink from offset 0 and writes to Iceberg content_engagement_v2. Operations: one codebase (Java/Flink), one on-call, no drift.]

What Changes in Code

Concern	Lambda	Kappa
Aggregation logic	Implemented twice: Spark Scala and Storm Java	Implemented once in Flink Java
Window semantics	Daily windows in batch; rolling windows in speed	Single windowing model in Flink with watermarks
Idempotency	Batch overwrites the day; speed approximates	Flink exactly-once with Iceberg ACID transactions
Backfill	Re-run Spark for the date range	Replay Flink from a chosen offset, write to parallel table
Schema migration	Coordinate two codebases; redeploy both	One Flink redeploy with state migration

What Changes in Storage

Storage shifts from a two-tier layout (HDFS for the master log plus HBase for the merged view) to a single immutable log in Kafka with tiered storage to S3, plus a single Iceberg table for the canonical materialized view. Kafka tiered storage keeps two years of events at S3 prices for the cold tier. Iceberg gives ACID transactions, schema evolution, and time travel on the materialized view. The total storage cost decreases roughly 30 to 50 percent because the duplicate state in HBase is gone, and the log is now stored once with cold-tier economics. Schema evolution becomes a first-class operation rather than a coordinated migration across two systems.

What Changes in Operations

Operational Concern	Lambda	Kappa
On-call rotation	Two rotations: batch on-call, streaming on-call	One rotation: streaming pipeline plus replay jobs
Failure recovery	Batch reruns the night; streaming restarts from checkpoint	Streaming restarts from checkpoint; backfill is a replay
Drift investigation	Reconcile batch vs speed; trace the discrepancy	No drift exists; eliminated by single source of truth
Cost attribution	Two cost centers (batch cluster, speed cluster)	One cost center; cost per pipeline is direct
Deployment cadence	Two pipelines, two release cycles	One pipeline, one release cycle

What Stays the Same

The principles that made Lambda work are preserved: the immutable event log is the source of truth, recompute from the log is always available, and serving is separate from processing. The shape changes; the principles do not. This continuity is the reason a Lambda-to-Kappa migration is a redesign, not a rewrite from scratch. Teams that try to do too much at once (replace the engine, change the storage format, redesign the schema, migrate consumers) fail; teams that preserve the principles and migrate incrementally succeed. The immutable log is the architectural anchor that makes the migration possible.

The Migration Path

A typical Lambda-to-Kappa migration order:

▸Confirm exactly-once semantics in the streaming engine for the existing sinks
▸Extend Kafka retention to cover the longest backfill window the team needs
▸Build the Kappa pipeline alongside Lambda, writing to a parallel materialized view
▸Validate the Kappa view matches the Lambda merged view within an acceptable tolerance
▸Migrate consumers one by one; the Lambda system runs alongside until the last consumer has cut over
▸Retire the Lambda batch layer and speed layer; the Kappa pipeline owns the workload
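
The validation step in the list above can be sketched as a keyed comparison of the two views. This is a minimal sketch under stated assumptions: both views are taken to be loadable as plain mappings from a key (for example a day or a content id) to a numeric metric, and the function names, sample keys, sample values, and the 0.1 percent tolerance are all made up for illustration. The acceptable tolerance is a decision for the team, not something the code can supply.

```python
def relative_diff(a: float, b: float) -> float:
    """Relative difference scaled by the larger magnitude; 0.0 when both are zero."""
    denom = max(abs(a), abs(b))
    return 0.0 if denom == 0 else abs(a - b) / denom


def compare_views(lambda_view: dict, kappa_view: dict, tolerance: float) -> dict:
    """Return {key: (lambda_value, kappa_value)} for every key that disagrees.

    A key missing from either view counts as a mismatch, reported with None.
    """
    mismatches = {}
    for key in sorted(set(lambda_view) | set(kappa_view)):
        old = lambda_view.get(key)
        new = kappa_view.get(key)
        if old is None or new is None or relative_diff(old, new) > tolerance:
            mismatches[key] = (old, new)
    return mismatches


# Made-up sample data: one matching key, one drifting key, one key missing on each side.
lambda_view = {"day-1": 1000.0, "day-2": 2000.0, "day-3": 500.0}
kappa_view = {"day-1": 1000.0, "day-2": 2010.0, "day-4": 42.0}

mismatches = compare_views(lambda_view, kappa_view, tolerance=0.001)
```

Reporting missing keys separately from out-of-tolerance values is deliberate: a missing key usually points at a retention or windowing gap, while a small numeric difference points at logic drift.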

When the Migration Fails

Migrations fail when teams underestimate one of three components. Exactly-once semantics in the streaming engine sound simple but require sink support and careful checkpoint configuration. Storage retention sounds simple but requires real money and operational discipline. Consumer migration sounds simple but every consumer has implicit assumptions about freshness, schema, and shape that surface only when they are actually moved. The failures are rarely about the streaming code; they are about the things the migration plan did not name. The plan should explicitly call out exactly-once boundaries, retention costs, and the consumer migration order before the first line of new code is written.

•Lambda Workload

Two codebases (Spark + Storm)
Drift between batch and speed layers
Two on-call rotations
Backfill is rerun-the-night
Storage in HDFS plus HBase
0.4 percent unexplained drift

✓Kappa Workload

One codebase (Flink)
Single source of truth; no drift
One on-call rotation
Backfill is replay from offset
Storage in Kafka tiered plus Iceberg
Drift eliminated by single layer

The senior framing of the migration is not "Kappa is better than Lambda"; it is "the constraints that motivated Lambda have changed, and the redesign reflects the new constraints." Streaming engines now offer exactly-once. Storage retention is affordable. Tooling for streaming has matured. The right architecture today is different from the right architecture in 2011, and the difference is structural, not stylistic. A senior engineer reads the constraints first and the architecture second; the architecture is a function of the constraints, not a fashion choice.

TIP

When inheriting any Lambda system, write the original constraints (engine maturity, storage cost, exactly-once support) and the current constraints in two columns. The columns reveal which parts of the architecture are still load-bearing and which are vestigial. The vestigial parts are the migration candidates.

PUTTING IT ALL TOGETHER

> A Series E retail platform inherited a Lambda content engagement pipeline from its founding-team era. Two codebases, two on-calls, 0.4 percent drift between layers, and a CFO who asked last week why the data infrastructure bill grew 60 percent year over year. The new principal data engineer is asked to redesign the system, write a migration plan, and make the cost shape defensible.

Start with freshness tier analysis per consumer. Executive dashboards are tier 4. Operational dashboards are tier 2. ML inference is tier 3. The Lambda speed layer runs every node at one tier, the strictest, so it overserves each of these consumers, the tier 4 dashboards most of all. That is why the cost shape is wrong: nobody reads at the freshness the system pays for.
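
The per-consumer analysis can be sketched as a comparison between the tier each consumer requires and the tier the shared layer actually serves. This is a minimal sketch with two assumptions: lower tier numbers mean fresher data, and the Lambda speed layer serves at tier 1, which the later mention of unused tier-1 capacity suggests. The class, function, and consumer identifiers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Consumer:
    name: str
    required_tier: int  # assumed convention: 1 is freshest, larger numbers tolerate more staleness


def tier_gaps(consumers: list[Consumer], served_tier: int) -> dict[str, int]:
    """Gap per consumer: positive means overserved, negative underserved, zero matched."""
    return {c.name: c.required_tier - served_tier for c in consumers}


consumers = [
    Consumer("executive_dashboard", 4),
    Consumer("operations_dashboard", 2),
    Consumer("ml_inference", 3),
]

gaps = tier_gaps(consumers, served_tier=1)  # assumed: the speed layer serves everyone at tier 1
overserved = sorted(name for name, gap in gaps.items() if gap > 0)
```

The size of each gap is a rough ranking of where to loosen the rhythm first; the executive dashboard, three tiers looser than what it is served, is the obvious starting point.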

Apply the layered architecture from Lesson 1. Raw, curated, and serving layers from the same Kafka log; each layer at the tier its consumers need. The four pipeline roles (source, transform, storage, consumer) stay the same; what changes is the rhythm at each layer.

Migrate Lambda to Kappa. The two codebases collapse to one Flink pipeline with exactly-once semantics. The drift problem disappears because there is only one source of truth. Backfill becomes replay from a Kafka offset, not a separate Spark job.
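
The claim that backfill becomes a replay can be illustrated without any streaming engine. The sketch below uses an in-memory list as a stand-in for the retained log; it is not a Kafka or Flink client, and every name and sample event in it is hypothetical. The point it shows is narrow: the live path and the backfill path call the same function, so there is no second implementation that could drift.

```python
from collections import defaultdict


def apply_event(counts, event):
    """The single implementation of the business logic: sum views per content id."""
    counts[event["content_id"]] += event["views"]


class EventLog:
    """In-memory stand-in for an append-only, offset-addressed log."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)
        return len(self._events) - 1  # offset of the appended event

    def read_from(self, offset):
        return list(self._events[offset:])


def replay(log, from_offset=0):
    """Backfill: rebuild the view by running the same logic over the retained log."""
    counts = defaultdict(int)
    for event in log.read_from(from_offset):
        apply_event(counts, event)
    return dict(counts)


log = EventLog()
live_counts = defaultdict(int)
for event in [
    {"content_id": "a", "views": 1},
    {"content_id": "b", "views": 2},
    {"content_id": "a", "views": 3},
]:
    log.append(event)
    apply_event(live_counts, event)  # live path: process each event as it arrives

rebuilt = replay(log)  # backfill path: same function, starting at offset 0
```

Replaying from a later offset rebuilds only the tail of the view, which is why the retention window bounds how far back a backfill can reach.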

Run the unified engine in three modes for three tiers. The same Flink code runs as nightly batch for the executive dashboard (tier 4), as 1-minute micro-batch for the operations dashboard (tier 2), and as hourly batch for the ML feature store (tier 3). Three rhythms, one codebase, three cost profiles.
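
The three rhythms translate directly into how often the same logic executes, which is a first-order driver of the three cost profiles. The sketch below only does that arithmetic; the mapping name and consumer identifiers are hypothetical, and cost per run is deliberately left out because the text does not give one.

```python
from datetime import timedelta

# Tier and trigger interval per consumer, as described in the paragraph above.
RHYTHMS = {
    "executive_dashboard": (4, timedelta(days=1)),      # nightly batch
    "ml_feature_store": (3, timedelta(hours=1)),        # hourly batch
    "operations_dashboard": (2, timedelta(minutes=1)),  # 1-minute micro-batch
}


def runs_per_day(interval: timedelta) -> int:
    """Number of times a job with this trigger interval fires in 24 hours."""
    return int(timedelta(days=1) / interval)


daily_runs = {name: runs_per_day(interval) for name, (_tier, interval) in RHYTHMS.items()}
```

Multiplying each count by a measured cost per run gives the per-consumer figure; a run over one minute of data is much cheaper than a run over a full day, so the counts alone are not a cost ranking.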

Cost is now defensible. Each consumer pays for the freshness they read. The unused tier-1 capacity from Lambda's speed layer is gone. The CFO has a per-pipeline-per-consumer cost breakdown rather than two unattributed cluster bills.

The bridge move (one sentence): a senior engineer's framing is that batch and streaming are points on a freshness-cost frontier, not philosophies; tier each node, run the same logic at the rhythm each consumer requires, and the cost shape follows.

KEY TAKEAWAYS

Lambda solved 2011 constraints, not 2026 constraints: Hadoop slow plus Storm approximate forced a two-layer split. Modern streaming engines and tiered storage retired both excuses.

Kappa is one codebase against an immutable event log: backfill, schema migration, and bug fixes are all replays. The cost is keeping the log retained long enough for replay to be feasible.

Unified engines hide the line at the API but not the runtime: Spark, Flink, and Beam let the same code run as batch, micro-batch, or streaming with a config change; cost and operations still differ underneath.

Tier each node explicitly: a single pipeline can have nodes at different freshness tiers; mismatches between adjacent tiers are the most common hidden cost or unmet expectation.

Architecture follows constraints, not fashion: the right rhythm for a workload is a function of the consumer's tier, the engine's capabilities, and the storage budget. Senior engineers read the constraints first.

Lambda, Kappa, and unified engines: architectures live or die on freshness tier discipline

Category: Pipeline Architecture
Difficulty: advanced
Duration: 35 minutes
Challenges: 0 hands-on challenges

Topics covered: Lambda Architecture, Kappa: Stream Only, Batch Replay, Unified Engines: Where Lines Blur, Per-Node Freshness Tier Analysis, Lambda to Kappa Worked Example
