A logistics company at Series D scale ran a nightly batch that produced the next morning's delivery routes. The pipeline took three hours to run, processing about 18 million events from the previous day. As the company grew to 80 million events a day, the same pipeline took eleven hours and missed the 6am cutoff. The product team's first proposal was to migrate everything to Kafka and Flink. The senior engineer pushed back. The actual problem was not the rhythm; it was the daily window plus a non-incremental transform. Switching to micro-batch every five minutes solved the freshness problem without rewriting the entire stack. The cost stayed within 30 percent of the original budget. The wrong instinct was the dramatic migration; the right one was naming exactly which dimension was failing. This lesson is the dimensions: latency, throughput, state, cost, and the middle ground where most production pipelines actually live.
Latency vs Throughput Tradeoff
Daily Life
Interviews
Distinguish latency and throughput as separate metrics and identify which one constrains a given pipeline.
Batch and streaming are usually framed as a single axis, fast versus slow. The framing hides the actual engineering decision, which has two axes. Latency is the time from event arrival to event being processed and visible. Throughput is how many events the pipeline can process per unit of time. The two are not the same and are often in tension: optimizing for one usually costs the other. A pipeline that processes one event in 100 milliseconds has low latency but may have low throughput because the per-event overhead dominates. A pipeline that processes ten million events in five minutes has high throughput but a per-event latency of five minutes. Naming both dimensions before picking architecture is the difference between a deliberate choice and a default.
Latency and Throughput Defined
Metric
What It Measures
Unit
End-to-end latency
Time from event arrival at the source to event visible at the consumer
Milliseconds, seconds, minutes
Processing latency
Time the pipeline spends actively transforming the event
Milliseconds, microseconds
Throughput
Events processed per unit time
Events per second, GB per hour
Cost per event
Compute and storage cost amortized over each event
Dollars per million events
Concrete Numbers from Real Workloads
Workload
Throughput
Latency
Nightly Spark batch on 100GB
Roughly 5 million events processed per minute during the run
Up to 24 hours from arrival to result
Hourly batch on 4GB
Roughly 700K events per minute during the run
Up to 1 hour
Spark Structured Streaming, 1-min trigger
Roughly 80K events per second sustained
60 to 90 seconds end to end
Flink, true streaming on Kafka
Roughly 100K events per second per task slot
100ms to a few seconds
Custom low-latency C++ trading system
Lower aggregate throughput, dedicated path
Sub-millisecond
The numbers are illustrative, not gospel; the actual values vary by hardware, configuration, and event size. The pattern is durable across the variation. Higher throughput at the same hardware cost almost always comes with higher latency because the system batches work to amortize per-event overhead. Lower latency almost always costs throughput because per-event work cannot amortize. The two are a Pareto frontier; moving along it is a deliberate engineering choice.
Why Batch Wins on Throughput
A batch pipeline reads a large block of data, applies the transform once across the whole block, and writes the output once. Per-event overhead (function call, serialization, network round trip, transaction commit) amortizes across millions of events. A streaming pipeline pays the overhead per event or per tiny micro-batch. The same Spark code that processes a billion rows in batch can run at a fraction of the throughput in streaming mode against the same data. The difference is not the engine; it is the per-batch overhead being divided among many fewer events.
1
# Batch: one network round trip, one transaction, one output write per million events
2
rows=read_million_orders(window)
3
transformed=transform_all(rows)
4
write_million(transformed,partition='2026-04-25')
5
6
# Streaming: one network round trip, one update per event
7
fororderinconsume_kafka_topic('orders'):
8
transformed=transform_one(order)
9
write_one(transformed)# per-event overhead, repeated 1M times
Why Streaming Wins on Latency
A streaming pipeline starts processing the moment an event arrives. There is no scheduled wait, no chunk to fill, no minimum block size. End-to-end latency is bounded by the time to read, transform, and write a single event, plus a small queue. A batch pipeline cannot beat its own scheduling interval. An hourly pipeline has a worst-case latency of one hour because an event arriving immediately after a run completes waits for the next run. Streaming removes the scheduling wait at the cost of paying for compute to be ready at all times.
•Batch Optimizes For
Throughput per dollar of compute
Amortized per-event overhead
Predictable resource usage during the run window
Simple failure recovery: rerun the chunk
•Streaming Optimizes For
Latency from arrival to processed
Steady, predictable end-to-end time
Continuous resource availability
Continuous progress instead of stop-and-go cycles
The Tradeoff in One Number
The exact multiplier varies by workload, but the order of magnitude holds. A pipeline that processes one billion events a day at $200 in nightly batch typically costs $2,000 to $10,000 a day in pure streaming for the same logic. The 10-50x multiplier is not a fixed number; it depends on per-event work, on engine maturity, and on how much state the streaming version maintains. But the multiplier is real, and it is why a freshness conversation has to happen before a streaming conversation.
Questions to ask before choosing the rhythm:
▸What is the latency target in concrete units (milliseconds, seconds, minutes)?
▸What is the throughput target in events per second at peak?
▸What is the budget per million events processed?
▸Which dimension is firm and which is negotiable?
TIP
When the latency target can be met by hourly batch and the throughput target is comfortably under the hourly run budget, batch wins. When the latency target requires sub-15-minute freshness, the conversation moves to micro-batch or streaming.
Micro-Batch: The Middle Ground
Daily Life
Interviews
Apply the micro-batch pattern by setting an explicit trigger interval and explain why it sits between pure batch and pure streaming.
Most production pipelines that look like streaming are not pure streaming. They are micro-batch: very small batches, often every few seconds or every minute, processed by an engine that exposes a streaming API on top. Spark Structured Streaming is the largest example. Flink can run in batch or streaming mode with a tunable trigger interval. The pattern exists because pure streaming is expensive and pure batch cannot meet sub-15-minute freshness. Micro-batch sits in the middle: latency low enough to feel real-time, throughput high enough to amortize per-batch overhead, complexity simpler than full streaming.
How Micro-Batch Works
Aspect
Pure Streaming
Micro-Batch
Trigger
Every event arrival
A configured interval (10 sec, 1 min, 5 min)
Per-event work
Paid per event
Paid per micro-batch; amortized within the batch
End-to-end latency
Bounded by single-event processing time (100ms to a few seconds)
Bounded by the trigger interval (10 sec to 5 min)
Engine examples
Flink, Kafka Streams, custom
Spark Structured Streaming, Flink in batch mode, dbt incremental
State handling
Continuous in-memory state
State checkpointed each batch; smaller surface for bugs
The trigger argument is the key knob. Setting processingTime to one minute tells Spark to process whatever has accumulated in the last minute and write the result, then wait for the next minute. The application code does not change. The latency-throughput tradeoff is exposed as a single number. A trigger of one second behaves more like streaming and costs more. A trigger of ten minutes behaves more like batch and costs less. Production pipelines tune the trigger to the actual freshness requirement and stop there.
Why Micro-Batch Exists
Operational simplicityThroughput at low latencyTooling continuity
Operational simplicity
Closer to batch in failure mode
A failed micro-batch reruns from a checkpoint. There is no in-flight event hanging in the system. Recovery has the same shape as a batch job that failed and restarted.
Throughput at low latency
Per-batch overhead amortizes
A 1-minute micro-batch processing 60K events per minute amortizes the per-batch overhead 60K times. The throughput-per-dollar approaches batch while latency stays sub-2-minute.
Tooling continuity
Same engine, same code
Spark and Flink let the same SQL or DataFrame code run as batch or as micro-batch. Migration between modes does not require a rewrite, only a config change.
When Micro-Batch Is the Right Answer
Micro-batch is right when the freshness target is between one minute and fifteen minutes, the volume is high enough to amortize per-batch overhead, and the team prefers operational simplicity over the absolute lowest latency. Most live operational dashboards, fraud retrospective pipelines, and CDC consumers fit this description. Setting up Spark Structured Streaming or Flink in batch mode with a tunable trigger gives most of the latency win of streaming for most of the cost win of batch. For a freshness target of one second or below, micro-batch is too slow; pure streaming is required. For a freshness target of one hour or above, hourly batch is cheaper; micro-batch is overkill.
Signals that micro-batch is the right architecture:
▸The freshness target is in the 1-to-15-minute range
▸Volume is high (more than a few thousand events per second sustained)
▸The team has Spark or Flink experience and wants to reuse it
▸State is bounded and checkpointable; not millions of independent keys
When Micro-Batch Is the Wrong Answer
Situation
Why Micro-Batch Fails
Better Choice
Sub-second latency target
Trigger interval cannot go below the per-batch overhead
Pure streaming with Flink, Kafka Streams, or custom
Very low volume
Per-batch overhead dominates; cost is wasteful
Hourly batch is cheaper and simpler
Per-event side effects
Batches can fail and reprocess; side effects must be idempotent
Pure streaming with at-least-once semantics and idempotent sinks
Unbounded state
Each batch must checkpoint state; growth is not sustainable
True streaming with key-managed state and TTLs
Tuning the Trigger Interval
The trigger interval is the most important knob in a micro-batch pipeline. Setting it too low forces per-batch overhead to dominate cost. Setting it too high pushes latency above the consumer's threshold. The right interval is usually the largest value the consumer can tolerate, because larger intervals cost less. A consumer asking for under 5 minutes is best served by a 4-minute trigger, not a 30-second trigger. The 30-second version is technically faster and costs five to ten times more. Naming the consumer's actual freshness floor pushes the interval up to the right place.
Spark Structured Streaming defaults to a 'micro-batch ASAP' mode if no trigger is set. The default is rarely right; it produces tiny batches with high per-batch overhead. Always set processingTime explicitly.
TIP
When in doubt, start with a 1-minute trigger. It produces sub-2-minute end-to-end latency, amortizes per-batch overhead reasonably, and aligns with most operational dashboard freshness targets.
✓Do
Use micro-batch as the default for tier 2 freshness (under 15 minutes)
Set the trigger interval to the largest value the consumer accepts
Treat checkpointed state as recoverable; design transforms to survive a restart
✗Don't
Default to a sub-second trigger because lower sounds better; cost compounds fast
Use micro-batch when the freshness target is sub-second; the architecture cannot reach it
Mix per-event side effects with micro-batch without idempotent sinks; replays will duplicate
Why Streaming Costs More
Daily Life
Interviews
Estimate the cost premium of a streaming pipeline over a batch equivalent and decide when the latency justifies the spend.
Streaming costs more than batch for the same logic on the same data. The factor is rarely 10 percent; it is more often 5x to 50x. The cost difference is real and measurable, and it is the single most important variable in batch-versus-streaming decisions after freshness. Engineers who skip the cost conversation end up with streaming pipelines that consume budget the company does not want to spend, on freshness consumers do not need. The cost story has three components: continuous compute, state storage, and operational overhead.
Component 1: Continuous Compute
A streaming pipeline runs all the time. A Flink cluster of four task managers at 16 cores each is provisioned for peak load and stays provisioned at 3am. Cloud compute is paid per hour of provisioned capacity. The same logic running as nightly batch on a Snowflake warehouse pays for forty minutes of warehouse time and zero for the other twenty-three hours and twenty minutes. The ratio is roughly 36 to 1 in raw compute hours. Real-world cost ratios are smaller because streaming clusters are usually smaller than peak batch clusters, but the gap is structural.
Workload
Compute Hours per Day
Approximate Daily Cost
Nightly Spark batch, 4-node, 40 min run
About 2.7 compute hours
$30 to $60 per day at AWS on-demand
Hourly Spark batch, 4-node, 5 min runs
About 8 compute hours
$80 to $200 per day
Spark Structured Streaming, 1-min trigger, 4-node
96 compute hours per day
$800 to $2,000 per day
Flink streaming, 4-node, always-on
96 compute hours per day plus state nodes
$1,200 to $3,500 per day
Component 2: State Storage
Stateful streaming pipelines maintain state across events. A pipeline computing rolling 7-day counts per user holds counts for every active user in memory or on local disk. State storage costs money: the local SSD on a Flink task manager, the RocksDB instance, the snapshot to S3 for fault tolerance. State also grows. A million active users with 100 features per user is 100 million data points kept hot. Batch pipelines do not maintain state between runs; each run reads the source data and recomputes. The state cost is one of the silent components of streaming bills, not visible at design time but very visible six months in.
1
# Streaming state grows with active users
2
# 1M users * 100 features * 8 bytes = 800MB just for the values
3
# Plus indexes, plus checkpoints, plus replication overhead
4
# Hot state can grow to 5-10x the raw data size
5
6
# Batch state
7
# Always zero between runs
8
# Each run reads the source and recomputes from scratch
9
# State cost: zero
Component 3: Operational Overhead
Streaming pipelines require more operational tooling. Monitoring is more complex because the pipeline is always running and lag is the dominant signal. Alerting must distinguish between a slow consumer (latency lag growing) and a dead consumer (lag stuck). Failure handling is more complex because there is no clean rerun-the-partition. Schema changes are more disruptive because the running consumer cannot pause and migrate; it has to be drained and replaced with care. The engineering cost of running a streaming pipeline in production is roughly twice the engineering cost of running an equivalent batch pipeline, in time and in headcount.
•Batch Operational Profile
Failures rerun the partition; no in-flight events to worry about
Schema changes deploy with the next run; old runs are unaffected
Lag does not exist as a concept; freshness is bounded by schedule
On-call sees binary signals: ran or failed
•Streaming Operational Profile
Failures must handle in-flight events; replay logic is required
Schema changes require a careful drain-and-redeploy
Lag is the dominant signal; growing lag means the consumer is falling behind
On-call sees graded signals: latency, throughput, lag percentile
When the Cost Is Worth Paying
Streaming costs are worth paying when the latency the streaming pipeline provides has explicit dollar value. A fraud detection pipeline that catches a fraudulent card within seconds prevents losses that would have happened over the next minutes. The dollar value of the latency is the prevented loss. A real-time bidding system that decides ad placements within 100 milliseconds bids on impressions that batch could not bid on. The dollar value of the latency is the won impressions. A live operational dashboard that helps the on-call engineer fix an outage twenty minutes faster prevents twenty minutes of revenue loss. The dollar value of the latency is the prevented loss. Without a dollar value attached to the latency, streaming is paying for a feature nobody uses.
The cost-justification questions for streaming:
▸What does each minute of latency cost the business?
▸Above what latency threshold does the cost become non-zero?
▸What is the streaming pipeline's marginal cost over a batch alternative?
▸Does the cost of the latency, integrated over a year, exceed the streaming bill?
The Cost Profile in One Number
The exact multiplier depends on volume, state size, and engine. The shape is structural: continuous compute plus state plus operational overhead always exceeds the cost of a function that runs and exits. Treating the multiplier as part of the design conversation, not a surprise on the cloud bill, is the difference between a pipeline that earns its budget and one that gets cut in the next quarterly review.
TIP
Before approving a streaming pipeline, write the latency value the streaming provides, in dollars per year, and compare it to the streaming cost premium over the batch alternative. If the value is smaller, the batch alternative is the right answer.
✓Do
Estimate streaming costs at design time; the surprise on the bill is avoidable
Write the dollar value of the latency the streaming provides; if zero, build batch
Right-size streaming clusters for actual traffic, not peak-of-peak
✗Don't
Compare streaming to batch on raw compute hours alone; state and operations are real
Build streaming because batch was slow once at peak; profile first
Default to streaming for new pipelines; the cost premium compounds across many pipelines
Stateful vs Stateless Transforms
Daily Life
Interviews
Classify each transform as stateful or stateless and explain how the category changes cost and recovery behavior in streaming.
Transforms divide into two categories that matter much more in streaming than in batch. A stateless transform processes one event at a time and produces output that depends only on that event. A stateful transform produces output that depends on more than one event: a count, a sum, a window, a join with another stream. The category changes the cost, the complexity, and the failure-recovery story. In batch, both categories look about the same because the engine has all the data in memory at once. In streaming, the difference is structural and shapes nearly every design decision.
Stateless Transforms
Transform
What It Does
Why It Is Stateless
Filter (where clause)
Drops events that fail a predicate
Decision uses only the current event
Projection (select columns)
Reshapes the event without combining with others
Output is a function of the single input
Per-event enrichment from a static lookup
Joins with a small reference table loaded into memory
Reference data is constant; not derived from event history
Type conversion
Casts string to int, parses JSON
Single-event operation; no history needed
Field-level redaction
Hashes or removes PII fields
Per-event transformation
Stateless transforms are cheap and easy in both batch and streaming. They scale linearly with event count, recover trivially from failure, and require no state store. A streaming pipeline of pure stateless transforms behaves almost like batch in cost and complexity, only with a different rhythm. The interesting cases, the ones that justify the streaming infrastructure, are stateful.
Stateful Transforms
Transform
What It Does
Why It Is Stateful
GROUP BY aggregations
Counts, sums, averages over a key
Result depends on every event for that key seen so far
Windowed aggregation
Rolling counts over a time window
Result depends on which events fall inside the window
Stream-stream join
Matches events from two streams within a time window
Requires holding both streams in state until matched
Deduplication
Drops repeated events by key
Requires remembering keys already seen
Sessionization
Groups events into sessions by gap timeout
Each session is open until a gap closes it
How Streaming Engines Handle State
Modern streaming engines maintain state in a key-value store that is local to the worker for performance and snapshotted to durable storage for fault tolerance. Flink uses RocksDB local on the task manager and asynchronously checkpoints to S3 or HDFS. Spark Structured Streaming uses HDFS or S3 directly with a write-ahead log. Kafka Streams uses RocksDB plus a Kafka topic as the changelog. The pattern is the same: keep state hot for read performance, persist it for recovery. The cost is the dual storage layer, and the complexity is making the snapshot consistent with what has been emitted.
1
# A stateful streaming transform: rolling 5-minute count per page
The withWatermark call is the bound on how long state is kept. Without it, state grows without limit because the engine cannot decide when a window is closed. With a 10-minute watermark, the engine drops state for windows whose end is more than 10 minutes behind the maximum event timestamp seen. Watermarks are the streaming equivalent of taking out the trash; without them, every state-laden pipeline eventually runs out of memory.
The Cost Difference Per Category
•Stateless Streaming
Memory and disk usage scale with batch size, not history
Recovery: replay last batch or two; state is reconstructed from data
Schema changes: relatively safe; no embedded state to migrate
Cost premium over batch: roughly 2x to 5x
•Stateful Streaming
Memory and disk usage scale with key cardinality and window size
Recovery: restore from checkpoint; in-flight state must be consistent
Schema changes: state migrations are required; not always trivial
Cost premium over batch: roughly 5x to 50x
Stateful Transforms in Batch
Batch handles stateful transforms naturally because the engine has all the data in memory at once. A GROUP BY in SQL is stateful in the engineering sense (the result depends on every row in the group), but the engine sees the whole input and produces the result in one pass. There is no checkpointing, no watermark, no out-of-order arrival to worry about. The cost is correctness without complexity. The downside is that batch cannot react to events in motion; it can only react to events that have already landed. Streaming pays for the ability to react in motion with the cost and complexity of state management.
Questions to ask before adding a stateful transform to a streaming pipeline:
▸What is the key cardinality, and how fast does it grow?
▸What is the window size, and how long is state kept after the window closes?
▸What is the watermark strategy, and how late can events arrive?
▸What happens to state on a failure-and-restart?
Stateless transforms are cheap and recover trivially; they look the same in batch and streaming.
Stateful transforms drive most of the streaming cost premium; key cardinality and window size dominate.
Watermarks bound how long state is kept; without them, stateful pipelines run out of memory.
TIP
When designing a streaming pipeline, list every stateful transform and write the key cardinality, the window size, and the watermark strategy next to each. The list reveals the cost and the failure modes before the pipeline is built.
When Batch Outgrows Itself
Daily Life
Interviews
Diagnose a slow batch pipeline and pick the smallest change that meets the consumer's freshness tier without overbuilding.
The exercise below walks through a real-shaped scenario: a pipeline that started as nightly batch, grew, and stopped meeting its freshness target. The redesign is not a wholesale switch to streaming. The redesign is a careful examination of which dimension is failing and the smallest change that fixes it. Most batch-to-streaming migrations in production look like this exercise, not like a rewrite.
The Starting Pipeline
An e-commerce company's nightly pipeline reads orders from a Postgres database, joins with a product catalog, computes daily aggregates by category and region, and writes a fact_daily_orders table consumed by the executive dashboard. The pipeline runs at 2am, finishes by 5am, and the dashboard is fresh by 6am Pacific. Volume is 4 million orders per day. Compute cost is roughly $40 per night. The architecture has worked unchanged for two years.
1
INSERTINTOmart.fact_daily_orders
2
SELECT
3
DATE(o.order_timestamp)ASorder_date,
4
p.category,
5
o.region,
6
COUNT(*)ASorders,
7
SUM(o.amount_cents)/100.0ASrevenue
8
FROMraw.orderso
9
JOINdim.productp
10
ONo.product_id=p.product_id
11
WHEREDATE(o.order_timestamp)=:run_date
12
GROUPBY1,2,3;
The Symptoms
Symptom
Numbers
When It Started
Pipeline runtime stretching
From 3 hours to 11 hours
Began drifting after a 4x volume increase
Missed 6am SLA
Dashboard now fresh at noon, not 6am
After volume hit 18M orders/day
Operational pages
On-call paged 3-4 mornings a week
When run started bumping into 9am compute window
Marketing team built shadow pipeline
Hourly streaming consumer feeding their own dashboard
After two months of missed SLA
The Wrong Instinct
The product team's first proposal is to migrate everything to a Flink streaming pipeline. The proposal sounds reasonable: streaming is real-time, batch is slow, the freshness problem is obvious. The proposal is wrong. Migrating a billion-row-per-day join to streaming costs roughly 20x the current spend, takes nine months of engineering, and requires building state management for a stateful join that does not currently exist. The freshness problem is real but the diagnosis is incomplete. The actual problem is two separable issues that look like one.
The Right Diagnosis
Issue
What Is Happening
Right Fix
Volume outgrew the cadence
11 hours of batch cannot fit in a 7-hour overnight window
Run more often (hourly or micro-batch), each run does less work
Non-incremental transform
The transform reads a full day of orders even when only the last hour matters
Make the transform incremental on order_timestamp
Marketing wants tier-2 freshness
Batch tier-4 cannot meet a tier-2 need
Add a tier-2 path for the marketing dashboard only, leave the executive on tier 4
Single pipeline serving multiple consumers
Different consumers have different freshness needs but share one pipeline
Split the consumers; let each have the cadence its tier requires
The Redesign
The redesign keeps batch where it works and adds streaming only where the consumer requires it. The executive dashboard stays tier 4 (daily) and uses an hourly micro-batch instead of a single nightly run. The marketing team gets a tier-2 path that uses Spark Structured Streaming with a 1-minute trigger reading from a CDC stream of the orders table. The two paths share the curated layer concept from Lesson 1: both write to the same fact_daily_orders table family, partitioned in a way that lets each path update only its own partitions. The redesign respects the freshness tiers; it does not impose one rhythm on consumers who want different things.
# Reads only the last hour, writes hourly partition
4
spark.sql('''
INSERT OVERWRITE mart.fact_hourly_orders PARTITION (hour_ts = :run_hour_ts)
SELECT
DATE_TRUNC('hour', o.order_timestamp) AS hour_ts,
p.category,
o.region,
COUNT(*) AS orders,
SUM(o.amount_cents) / 100.0 AS revenue
FROM raw.orders o
JOIN dim.product p ON o.product_id = p.product_id
WHERE o.order_timestamp >= :run_hour_ts
AND o.order_timestamp < :run_hour_ts + INTERVAL '1 hour'
GROUP BY 1, 2, 3
''')
5
6
# Path 2: Streaming for marketing (tier 2, micro-batch every 1 min)
Tier 4 for executive dashboard; tier 2 for marketing dashboard; both consumers happy
One canonical fact table family; shadow pipeline retired
On-call paged less than once a week; both paths run independently
What This Example Teaches
The example is not a story about streaming winning. It is a story about diagnosing which dimension is failing. Volume outgrew cadence; the fix was a tighter cadence, not a wholesale rhythm change. The non-incremental transform read a full day every time; the fix was incrementality, not a new engine. Different consumers had different freshness needs; the fix was to give each consumer the rhythm it actually needs, not a single shared rhythm. Senior engineers fix freshness problems by naming the dimension that is failing. The wrong instinct is to swap the architecture; the right one is to swap the smallest piece that solves the actual problem.
The diagnosis questions for a slow batch:
▸Has volume outgrown the available run window? If yes, run more often, not differently.
▸Is the transform non-incremental? If yes, make it incremental before changing engines.
▸Do consumers have different freshness needs? If yes, split the paths by tier.
▸Is the slowness in the transform itself or in the read/write paths? If reads and writes dominate, swapping the engine does not help.
TIP
Before migrating any batch pipeline to streaming, write the four diagnosis answers in plain prose. Most slow batch problems resolve to incrementality and cadence, not rhythm change.
❯❯❯PUTTING IT ALL TOGETHER
> A growth-stage logistics company has a nightly batch that processes 80M delivery events and computes routes for the next morning. The pipeline is missing its 6am SLA and the cost has tripled in twelve months. The new tech lead is asked to redesign without rewriting everything as streaming.
Start with the dimensions. Latency target stays at 6am (tier 3 freshness). Throughput is the bottleneck: the run window cannot fit 80M events in seven hours. The dimension failing is throughput-per-run, not latency floor.
The fix is incrementality plus cadence, not a rhythm change. Make the transform incremental on event_timestamp; run hourly instead of nightly. Each hourly run does 1/24th of the work and finishes in minutes.
If a separate consumer (the operations team) wants tier-2 freshness, give that consumer a separate path: a Spark Structured Streaming job with a 1-minute trigger. The original consumer keeps tier-3 hourly batch. Both feed a shared curated fact-event family, applying the layered architecture from Lesson 1.
Cost it explicitly. The tier-3 hourly batch costs roughly 30 percent more than the original nightly. The tier-2 streaming path adds another tier of cost on top. The four pipeline roles (source, transform, storage, consumer) stay the same; the rhythm of each role is what changes.
The outcome is two paths, one tier each, both operationally distinct, each meeting its own freshness target. The wrong move would have been one streaming pipeline serving everyone at the cost of the highest-freshness consumer; the right move is per-consumer tiering.
KEY TAKEAWAYS
Latency and throughput are separate dimensions: optimizing one usually costs the other. Most production decisions resolve to which dimension is firm.
Micro-batch is the middle ground most production pipelines occupy: Spark Structured Streaming and Flink in batch mode let the trigger interval tune the latency-throughput-cost balance.
Streaming costs 5x to 50x more than batch for the same logic: continuous compute, state storage, and operational overhead combine. Cost the latency in dollars before approving the spend.
Stateful transforms drive most of the cost premium: key cardinality and window size dominate. Watermarks bound how long state is kept and prevent unbounded growth.
When batch outgrows itself, fix the dimension that is failing: incrementality and cadence usually beat a rhythm change. Different consumers can have different tiers on the same source data.
Batch vs Streaming: Intermediate
Latency, throughput, state, and cost are the dimensions; pick deliberately, not by default
Category
Pipeline Architecture
Difficulty
intermediate
Duration
30 minutes
Challenges
0 hands-on challenges
Topics covered: Latency vs Throughput Tradeoff, Micro-Batch: The Middle Ground, Why Streaming Costs More, Stateful vs Stateless Transforms, When Batch Outgrows Itself
Batch and streaming are usually framed as a single axis, fast versus slow. The framing hides the actual engineering decision, which has two axes. Latency is the time from event arrival to event being processed and visible. Throughput is how many events the pipeline can process per unit of time. The two are not the same and are often in tension: optimizing for one usually costs the other. A pipeline that processes one event in 100 milliseconds has low latency but may have low throughput because t
Most production pipelines that look like streaming are not pure streaming. They are micro-batch: very small batches, often every few seconds or every minute, processed by an engine that exposes a streaming API on top. Spark Structured Streaming is the largest example. Flink can run in batch or streaming mode with a tunable trigger interval. The pattern exists because pure streaming is expensive and pure batch cannot meet sub-15-minute freshness. Micro-batch sits in the middle: latency low enough
Streaming costs more than batch for the same logic on the same data. The factor is rarely 10 percent; it is more often 5x to 50x. The cost difference is real and measurable, and it is the single most important variable in batch-versus-streaming decisions after freshness. Engineers who skip the cost conversation end up with streaming pipelines that consume budget the company does not want to spend, on freshness consumers do not need. The cost story has three components: continuous compute, state
Transforms divide into two categories that matter much more in streaming than in batch. A stateless transform processes one event at a time and produces output that depends only on that event. A stateful transform produces output that depends on more than one event: a count, a sum, a window, a join with another stream. The category changes the cost, the complexity, and the failure-recovery story. In batch, both categories look about the same because the engine has all the data in memory at once.
The exercise below walks through a real-shaped scenario: a pipeline that started as nightly batch, grew, and stopped meeting its freshness target. The redesign is not a wholesale switch to streaming. The redesign is a careful examination of which dimension is failing and the smallest change that fixes it. Most batch-to-streaming migrations in production look like this exercise, not like a rewrite. The Starting Pipeline An e-commerce company's nightly pipeline reads orders from a Postgres databas