Batch vs Streaming: Intermediate

A logistics company at Series D scale ran a nightly batch that produced the next morning's delivery routes. The pipeline took three hours to run, processing about 18 million events from the previous day. As the company grew to 80 million events a day, the same pipeline took eleven hours and missed the 6am cutoff. The product team's first proposal was to migrate everything to Kafka and Flink. The senior engineer pushed back. The actual problem was not the rhythm; it was the daily window plus a non-incremental transform. Switching to micro-batch every five minutes solved the freshness problem without rewriting the entire stack. The cost stayed within 30 percent of the original budget. The wrong instinct was the dramatic migration; the right one was naming exactly which dimension was failing. This lesson is the dimensions: latency, throughput, state, cost, and the middle ground where most production pipelines actually live.

Latency vs Throughput Tradeoff

Daily Life
Interviews

Distinguish latency and throughput as separate metrics and identify which one constrains a given pipeline.

Batch and streaming are usually framed as a single axis, fast versus slow. The framing hides the actual engineering decision, which has two axes. Latency is the time from event arrival to event being processed and visible. Throughput is how many events the pipeline can process per unit of time. The two are not the same and are often in tension: optimizing for one usually costs the other. A pipeline that processes one event in 100 milliseconds has low latency but may have low throughput because the per-event overhead dominates. A pipeline that processes ten million events in five minutes has high throughput but a per-event latency of five minutes. Naming both dimensions before picking architecture is the difference between a deliberate choice and a default.

Latency and Throughput Defined

MetricWhat It MeasuresUnit
End-to-end latencyTime from event arrival at the source to event visible at the consumerMilliseconds, seconds, minutes
Processing latencyTime the pipeline spends actively transforming the eventMilliseconds, microseconds
ThroughputEvents processed per unit timeEvents per second, GB per hour
Cost per eventCompute and storage cost amortized over each eventDollars per million events

Concrete Numbers from Real Workloads

WorkloadThroughputLatency
Nightly Spark batch on 100GBRoughly 5 million events processed per minute during the runUp to 24 hours from arrival to result
Hourly batch on 4GBRoughly 700K events per minute during the runUp to 1 hour
Spark Structured Streaming, 1-min triggerRoughly 80K events per second sustained60 to 90 seconds end to end
Flink, true streaming on KafkaRoughly 100K events per second per task slot100ms to a few seconds
Custom low-latency C++ trading systemLower aggregate throughput, dedicated pathSub-millisecond
The numbers are illustrative, not gospel; the actual values vary by hardware, configuration, and event size. The pattern is durable across the variation. Higher throughput at the same hardware cost almost always comes with higher latency because the system batches work to amortize per-event overhead. Lower latency almost always costs throughput because per-event work cannot amortize. The two are a Pareto frontier; moving along it is a deliberate engineering choice.

Why Batch Wins on Throughput

A batch pipeline reads a large block of data, applies the transform once across the whole block, and writes the output once. Per-event overhead (function call, serialization, network round trip, transaction commit) amortizes across millions of events. A streaming pipeline pays the overhead per event or per tiny micro-batch. The same Spark code that processes a billion rows in batch can run at a fraction of the throughput in streaming mode against the same data. The difference is not the engine; it is the per-batch overhead being divided among many fewer events.
1# Batch: one network round trip, one transaction, one output write per million events
2rows = read_million_orders(window)
3transformed = transform_all(rows)
4write_million(transformed, partition='2026-04-25')
5
6# Streaming: one network round trip, one update per event
7for order in consume_kafka_topic('orders'):
8 transformed = transform_one(order)
9 write_one(transformed) # per-event overhead, repeated 1M times

Why Streaming Wins on Latency

A streaming pipeline starts processing the moment an event arrives. There is no scheduled wait, no chunk to fill, no minimum block size. End-to-end latency is bounded by the time to read, transform, and write a single event, plus a small queue. A batch pipeline cannot beat its own scheduling interval. An hourly pipeline has a worst-case latency of one hour because an event arriving immediately after a run completes waits for the next run. Streaming removes the scheduling wait at the cost of paying for compute to be ready at all times.
Batch Optimizes For
  • Throughput per dollar of compute
  • Amortized per-event overhead
  • Predictable resource usage during the run window
  • Simple failure recovery: rerun the chunk
Streaming Optimizes For
  • Latency from arrival to processed
  • Steady, predictable end-to-end time
  • Continuous resource availability
  • Continuous progress instead of stop-and-go cycles

The Tradeoff in One Number

The exact multiplier varies by workload, but the order of magnitude holds. A pipeline that processes one billion events a day at $200 in nightly batch typically costs $2,000 to $10,000 a day in pure streaming for the same logic. The 10-50x multiplier is not a fixed number; it depends on per-event work, on engine maturity, and on how much state the streaming version maintains. But the multiplier is real, and it is why a freshness conversation has to happen before a streaming conversation.
Questions to ask before choosing the rhythm:
  • What is the latency target in concrete units (milliseconds, seconds, minutes)?
  • What is the throughput target in events per second at peak?
  • What is the budget per million events processed?
  • Which dimension is firm and which is negotiable?
TIP
When the latency target can be met by hourly batch and the throughput target is comfortably under the hourly run budget, batch wins. When the latency target requires sub-15-minute freshness, the conversation moves to micro-batch or streaming.

Micro-Batch: The Middle Ground

Daily Life
Interviews

Apply the micro-batch pattern by setting an explicit trigger interval and explain why it sits between pure batch and pure streaming.

Most production pipelines that look like streaming are not pure streaming. They are micro-batch: very small batches, often every few seconds or every minute, processed by an engine that exposes a streaming API on top. Spark Structured Streaming is the largest example. Flink can run in batch or streaming mode with a tunable trigger interval. The pattern exists because pure streaming is expensive and pure batch cannot meet sub-15-minute freshness. Micro-batch sits in the middle: latency low enough to feel real-time, throughput high enough to amortize per-batch overhead, complexity simpler than full streaming.

How Micro-Batch Works

AspectPure StreamingMicro-Batch
TriggerEvery event arrivalA configured interval (10 sec, 1 min, 5 min)
Per-event workPaid per eventPaid per micro-batch; amortized within the batch
End-to-end latencyBounded by single-event processing time (100ms to a few seconds)Bounded by the trigger interval (10 sec to 5 min)
Engine examplesFlink, Kafka Streams, customSpark Structured Streaming, Flink in batch mode, dbt incremental
State handlingContinuous in-memory stateState checkpointed each batch; smaller surface for bugs

The Spark Structured Streaming Pattern

1from pyspark.sql import SparkSession
2from pyspark.sql.functions import col, count
3
4spark = SparkSession.builder.getOrCreate()
5
6events = (
7 spark.readStream
8 .format('kafka')
9 .option('subscribe', 'raw.events')
10 .load()
11 .select(parse_json(col('value')).alias('event'))
12)
13
14running = (
15 events.groupBy('event.country')
16 .agg(count('*').alias('events'))
17)
18
19query = (
20 running.writeStream
21 .outputMode('complete')
22 .format('delta')
23 .option('checkpointLocation', 's3://chk/country_counts')
24 .trigger(processingTime='1 minute') # micro-batch interval
25 .start('s3://lake/country_counts')
26)
The trigger argument is the key knob. Setting processingTime to one minute tells Spark to process whatever has accumulated in the last minute and write the result, then wait for the next minute. The application code does not change. The latency-throughput tradeoff is exposed as a single number. A trigger of one second behaves more like streaming and costs more. A trigger of ten minutes behaves more like batch and costs less. Production pipelines tune the trigger to the actual freshness requirement and stop there.

Why Micro-Batch Exists

Operational simplicityThroughput at low latencyTooling continuity
Operational simplicity
Closer to batch in failure mode
A failed micro-batch reruns from a checkpoint. There is no in-flight event hanging in the system. Recovery has the same shape as a batch job that failed and restarted.
Throughput at low latency
Per-batch overhead amortizes
A 1-minute micro-batch processing 60K events per minute amortizes the per-batch overhead 60K times. The throughput-per-dollar approaches batch while latency stays sub-2-minute.
Tooling continuity
Same engine, same code
Spark and Flink let the same SQL or DataFrame code run as batch or as micro-batch. Migration between modes does not require a rewrite, only a config change.

When Micro-Batch Is the Right Answer

Micro-batch is right when the freshness target is between one minute and fifteen minutes, the volume is high enough to amortize per-batch overhead, and the team prefers operational simplicity over the absolute lowest latency. Most live operational dashboards, fraud retrospective pipelines, and CDC consumers fit this description. Setting up Spark Structured Streaming or Flink in batch mode with a tunable trigger gives most of the latency win of streaming for most of the cost win of batch. For a freshness target of one second or below, micro-batch is too slow; pure streaming is required. For a freshness target of one hour or above, hourly batch is cheaper; micro-batch is overkill.
Signals that micro-batch is the right architecture:
  • The freshness target is in the 1-to-15-minute range
  • Volume is high (more than a few thousand events per second sustained)
  • The team has Spark or Flink experience and wants to reuse it
  • State is bounded and checkpointable; not millions of independent keys

When Micro-Batch Is the Wrong Answer

SituationWhy Micro-Batch FailsBetter Choice
Sub-second latency targetTrigger interval cannot go below the per-batch overheadPure streaming with Flink, Kafka Streams, or custom
Very low volumePer-batch overhead dominates; cost is wastefulHourly batch is cheaper and simpler
Per-event side effectsBatches can fail and reprocess; side effects must be idempotentPure streaming with at-least-once semantics and idempotent sinks
Unbounded stateEach batch must checkpoint state; growth is not sustainableTrue streaming with key-managed state and TTLs

Tuning the Trigger Interval

The trigger interval is the most important knob in a micro-batch pipeline. Setting it too low forces per-batch overhead to dominate cost. Setting it too high pushes latency above the consumer's threshold. The right interval is usually the largest value the consumer can tolerate, because larger intervals cost less. A consumer asking for under 5 minutes is best served by a 4-minute trigger, not a 30-second trigger. The 30-second version is technically faster and costs five to ten times more. Naming the consumer's actual freshness floor pushes the interval up to the right place.

Spark Structured Streaming defaults to a 'micro-batch ASAP' mode if no trigger is set. The default is rarely right; it produces tiny batches with high per-batch overhead. Always set processingTime explicitly.

TIP
When in doubt, start with a 1-minute trigger. It produces sub-2-minute end-to-end latency, amortizes per-batch overhead reasonably, and aligns with most operational dashboard freshness targets.
Do
  • Use micro-batch as the default for tier 2 freshness (under 15 minutes)
  • Set the trigger interval to the largest value the consumer accepts
  • Treat checkpointed state as recoverable; design transforms to survive a restart
Don't
  • Default to a sub-second trigger because lower sounds better; cost compounds fast
  • Use micro-batch when the freshness target is sub-second; the architecture cannot reach it
  • Mix per-event side effects with micro-batch without idempotent sinks; replays will duplicate

Why Streaming Costs More

Daily Life
Interviews

Estimate the cost premium of a streaming pipeline over a batch equivalent and decide when the latency justifies the spend.

Streaming costs more than batch for the same logic on the same data. The factor is rarely 10 percent; it is more often 5x to 50x. The cost difference is real and measurable, and it is the single most important variable in batch-versus-streaming decisions after freshness. Engineers who skip the cost conversation end up with streaming pipelines that consume budget the company does not want to spend, on freshness consumers do not need. The cost story has three components: continuous compute, state storage, and operational overhead.

Component 1: Continuous Compute

A streaming pipeline runs all the time. A Flink cluster of four task managers at 16 cores each is provisioned for peak load and stays provisioned at 3am. Cloud compute is paid per hour of provisioned capacity. The same logic running as nightly batch on a Snowflake warehouse pays for forty minutes of warehouse time and zero for the other twenty-three hours and twenty minutes. The ratio is roughly 36 to 1 in raw compute hours. Real-world cost ratios are smaller because streaming clusters are usually smaller than peak batch clusters, but the gap is structural.
WorkloadCompute Hours per DayApproximate Daily Cost
Nightly Spark batch, 4-node, 40 min runAbout 2.7 compute hours$30 to $60 per day at AWS on-demand
Hourly Spark batch, 4-node, 5 min runsAbout 8 compute hours$80 to $200 per day
Spark Structured Streaming, 1-min trigger, 4-node96 compute hours per day$800 to $2,000 per day
Flink streaming, 4-node, always-on96 compute hours per day plus state nodes$1,200 to $3,500 per day

Component 2: State Storage

Stateful streaming pipelines maintain state across events. A pipeline computing rolling 7-day counts per user holds counts for every active user in memory or on local disk. State storage costs money: the local SSD on a Flink task manager, the RocksDB instance, the snapshot to S3 for fault tolerance. State also grows. A million active users with 100 features per user is 100 million data points kept hot. Batch pipelines do not maintain state between runs; each run reads the source data and recomputes. The state cost is one of the silent components of streaming bills, not visible at design time but very visible six months in.
1# Streaming state grows with active users
2# 1M users * 100 features * 8 bytes = 800MB just for the values
3# Plus indexes, plus checkpoints, plus replication overhead
4# Hot state can grow to 5-10x the raw data size
5
6# Batch state
7# Always zero between runs
8# Each run reads the source and recomputes from scratch
9# State cost: zero

Component 3: Operational Overhead

Streaming pipelines require more operational tooling. Monitoring is more complex because the pipeline is always running and lag is the dominant signal. Alerting must distinguish between a slow consumer (latency lag growing) and a dead consumer (lag stuck). Failure handling is more complex because there is no clean rerun-the-partition. Schema changes are more disruptive because the running consumer cannot pause and migrate; it has to be drained and replaced with care. The engineering cost of running a streaming pipeline in production is roughly twice the engineering cost of running an equivalent batch pipeline, in time and in headcount.
Batch Operational Profile
  • Failures rerun the partition; no in-flight events to worry about
  • Schema changes deploy with the next run; old runs are unaffected
  • Lag does not exist as a concept; freshness is bounded by schedule
  • On-call sees binary signals: ran or failed
Streaming Operational Profile
  • Failures must handle in-flight events; replay logic is required
  • Schema changes require a careful drain-and-redeploy
  • Lag is the dominant signal; growing lag means the consumer is falling behind
  • On-call sees graded signals: latency, throughput, lag percentile

When the Cost Is Worth Paying

Streaming costs are worth paying when the latency the streaming pipeline provides has explicit dollar value. A fraud detection pipeline that catches a fraudulent card within seconds prevents losses that would have happened over the next minutes. The dollar value of the latency is the prevented loss. A real-time bidding system that decides ad placements within 100 milliseconds bids on impressions that batch could not bid on. The dollar value of the latency is the won impressions. A live operational dashboard that helps the on-call engineer fix an outage twenty minutes faster prevents twenty minutes of revenue loss. The dollar value of the latency is the prevented loss. Without a dollar value attached to the latency, streaming is paying for a feature nobody uses.
The cost-justification questions for streaming:
  • What does each minute of latency cost the business?
  • Above what latency threshold does the cost become non-zero?
  • What is the streaming pipeline's marginal cost over a batch alternative?
  • Does the cost of the latency, integrated over a year, exceed the streaming bill?

The Cost Profile in One Number

The exact multiplier depends on volume, state size, and engine. The shape is structural: continuous compute plus state plus operational overhead always exceeds the cost of a function that runs and exits. Treating the multiplier as part of the design conversation, not a surprise on the cloud bill, is the difference between a pipeline that earns its budget and one that gets cut in the next quarterly review.
TIP
Before approving a streaming pipeline, write the latency value the streaming provides, in dollars per year, and compare it to the streaming cost premium over the batch alternative. If the value is smaller, the batch alternative is the right answer.
Do
  • Estimate streaming costs at design time; the surprise on the bill is avoidable
  • Write the dollar value of the latency the streaming provides; if zero, build batch
  • Right-size streaming clusters for actual traffic, not peak-of-peak
Don't
  • Compare streaming to batch on raw compute hours alone; state and operations are real
  • Build streaming because batch was slow once at peak; profile first
  • Default to streaming for new pipelines; the cost premium compounds across many pipelines

Stateful vs Stateless Transforms

Daily Life
Interviews

Classify each transform as stateful or stateless and explain how the category changes cost and recovery behavior in streaming.

Transforms divide into two categories that matter much more in streaming than in batch. A stateless transform processes one event at a time and produces output that depends only on that event. A stateful transform produces output that depends on more than one event: a count, a sum, a window, a join with another stream. The category changes the cost, the complexity, and the failure-recovery story. In batch, both categories look about the same because the engine has all the data in memory at once. In streaming, the difference is structural and shapes nearly every design decision.

Stateless Transforms

TransformWhat It DoesWhy It Is Stateless
Filter (where clause)Drops events that fail a predicateDecision uses only the current event
Projection (select columns)Reshapes the event without combining with othersOutput is a function of the single input
Per-event enrichment from a static lookupJoins with a small reference table loaded into memoryReference data is constant; not derived from event history
Type conversionCasts string to int, parses JSONSingle-event operation; no history needed
Field-level redactionHashes or removes PII fieldsPer-event transformation
Stateless transforms are cheap and easy in both batch and streaming. They scale linearly with event count, recover trivially from failure, and require no state store. A streaming pipeline of pure stateless transforms behaves almost like batch in cost and complexity, only with a different rhythm. The interesting cases, the ones that justify the streaming infrastructure, are stateful.

Stateful Transforms

TransformWhat It DoesWhy It Is Stateful
GROUP BY aggregationsCounts, sums, averages over a keyResult depends on every event for that key seen so far
Windowed aggregationRolling counts over a time windowResult depends on which events fall inside the window
Stream-stream joinMatches events from two streams within a time windowRequires holding both streams in state until matched
DeduplicationDrops repeated events by keyRequires remembering keys already seen
SessionizationGroups events into sessions by gap timeoutEach session is open until a gap closes it

How Streaming Engines Handle State

Modern streaming engines maintain state in a key-value store that is local to the worker for performance and snapshotted to durable storage for fault tolerance. Flink uses RocksDB local on the task manager and asynchronously checkpoints to S3 or HDFS. Spark Structured Streaming uses HDFS or S3 directly with a write-ahead log. Kafka Streams uses RocksDB plus a Kafka topic as the changelog. The pattern is the same: keep state hot for read performance, persist it for recovery. The cost is the dual storage layer, and the complexity is making the snapshot consistent with what has been emitted.
1# A stateful streaming transform: rolling 5-minute count per page
2from pyspark.sql.functions import window, count
3
4stream = (
5 spark.readStream
6 .format('kafka')
7 .option('subscribe', 'raw.clicks')
8 .load()
9 .select(parse_json('value').alias('event'))
10)
11
12windowed = (
13 stream.withWatermark('event.timestamp', '10 minutes')
14 .groupBy(
15 window('event.timestamp', '5 minutes'),
16 'event.page'
17 )
18 .agg(count('*').alias('clicks'))
19)
20
21# State per (page, window) is held until the watermark passes the window's end
22query = (
23 windowed.writeStream
24 .outputMode('update')
25 .format('delta')
26 .option('checkpointLocation', 's3://chk/page_5min')
27 .trigger(processingTime='30 seconds')
28 .start('s3://lake/page_5min')
29)
The withWatermark call is the bound on how long state is kept. Without it, state grows without limit because the engine cannot decide when a window is closed. With a 10-minute watermark, the engine drops state for windows whose end is more than 10 minutes behind the maximum event timestamp seen. Watermarks are the streaming equivalent of taking out the trash; without them, every state-laden pipeline eventually runs out of memory.

The Cost Difference Per Category

Stateless Streaming
  • Memory and disk usage scale with batch size, not history
  • Recovery: replay last batch or two; state is reconstructed from data
  • Schema changes: relatively safe; no embedded state to migrate
  • Cost premium over batch: roughly 2x to 5x
Stateful Streaming
  • Memory and disk usage scale with key cardinality and window size
  • Recovery: restore from checkpoint; in-flight state must be consistent
  • Schema changes: state migrations are required; not always trivial
  • Cost premium over batch: roughly 5x to 50x

Stateful Transforms in Batch

Batch handles stateful transforms naturally because the engine has all the data in memory at once. A GROUP BY in SQL is stateful in the engineering sense (the result depends on every row in the group), but the engine sees the whole input and produces the result in one pass. There is no checkpointing, no watermark, no out-of-order arrival to worry about. The cost is correctness without complexity. The downside is that batch cannot react to events in motion; it can only react to events that have already landed. Streaming pays for the ability to react in motion with the cost and complexity of state management.
Questions to ask before adding a stateful transform to a streaming pipeline:
  • What is the key cardinality, and how fast does it grow?
  • What is the window size, and how long is state kept after the window closes?
  • What is the watermark strategy, and how late can events arrive?
  • What happens to state on a failure-and-restart?
Stateless transforms are cheap and recover trivially; they look the same in batch and streaming.
Stateful transforms drive most of the streaming cost premium; key cardinality and window size dominate.
Watermarks bound how long state is kept; without them, stateful pipelines run out of memory.
TIP
When designing a streaming pipeline, list every stateful transform and write the key cardinality, the window size, and the watermark strategy next to each. The list reveals the cost and the failure modes before the pipeline is built.

When Batch Outgrows Itself

Daily Life
Interviews

Diagnose a slow batch pipeline and pick the smallest change that meets the consumer's freshness tier without overbuilding.

The exercise below walks through a real-shaped scenario: a pipeline that started as nightly batch, grew, and stopped meeting its freshness target. The redesign is not a wholesale switch to streaming. The redesign is a careful examination of which dimension is failing and the smallest change that fixes it. Most batch-to-streaming migrations in production look like this exercise, not like a rewrite.

The Starting Pipeline

An e-commerce company's nightly pipeline reads orders from a Postgres database, joins with a product catalog, computes daily aggregates by category and region, and writes a fact_daily_orders table consumed by the executive dashboard. The pipeline runs at 2am, finishes by 5am, and the dashboard is fresh by 6am Pacific. Volume is 4 million orders per day. Compute cost is roughly $40 per night. The architecture has worked unchanged for two years.
1INSERT INTO mart.fact_daily_orders
2SELECT
3 DATE(o.order_timestamp) AS order_date,
4 p.category,
5 o.region,
6 COUNT(*) AS orders,
7 SUM(o.amount_cents) / 100.0 AS revenue
8FROM raw.orders o
9JOIN dim.product p
10 ON o.product_id = p.product_id
11WHERE DATE(o.order_timestamp) = : run_date
12GROUP BY 1, 2, 3 ;

The Symptoms

SymptomNumbersWhen It Started
Pipeline runtime stretchingFrom 3 hours to 11 hoursBegan drifting after a 4x volume increase
Missed 6am SLADashboard now fresh at noon, not 6amAfter volume hit 18M orders/day
Operational pagesOn-call paged 3-4 mornings a weekWhen run started bumping into 9am compute window
Marketing team built shadow pipelineHourly streaming consumer feeding their own dashboardAfter two months of missed SLA

The Wrong Instinct

The product team's first proposal is to migrate everything to a Flink streaming pipeline. The proposal sounds reasonable: streaming is real-time, batch is slow, the freshness problem is obvious. The proposal is wrong. Migrating a billion-row-per-day join to streaming costs roughly 20x the current spend, takes nine months of engineering, and requires building state management for a stateful join that does not currently exist. The freshness problem is real but the diagnosis is incomplete. The actual problem is two separable issues that look like one.

The Right Diagnosis

IssueWhat Is HappeningRight Fix
Volume outgrew the cadence11 hours of batch cannot fit in a 7-hour overnight windowRun more often (hourly or micro-batch), each run does less work
Non-incremental transformThe transform reads a full day of orders even when only the last hour mattersMake the transform incremental on order_timestamp
Marketing wants tier-2 freshnessBatch tier-4 cannot meet a tier-2 needAdd a tier-2 path for the marketing dashboard only, leave the executive on tier 4
Single pipeline serving multiple consumersDifferent consumers have different freshness needs but share one pipelineSplit the consumers; let each have the cadence its tier requires

The Redesign

The redesign keeps batch where it works and adds streaming only where the consumer requires it. The executive dashboard stays tier 4 (daily) and uses an hourly micro-batch instead of a single nightly run. The marketing team gets a tier-2 path that uses Spark Structured Streaming with a 1-minute trigger reading from a CDC stream of the orders table. The two paths share the curated layer concept from Lesson 1: both write to the same fact_daily_orders table family, partitioned in a way that lets each path update only its own partitions. The redesign respects the freshness tiers; it does not impose one rhythm on consumers who want different things.
1# Path 1: Hourly micro-batch (executive dashboard, tier 4)
2# Spark batch job, runs at the top of every hour
3# Reads only the last hour, writes hourly partition
4spark.sql(''' INSERT OVERWRITE mart.fact_hourly_orders PARTITION (hour_ts = :run_hour_ts) SELECT DATE_TRUNC('hour', o.order_timestamp) AS hour_ts, p.category, o.region, COUNT(*) AS orders, SUM(o.amount_cents) / 100.0 AS revenue FROM raw.orders o JOIN dim.product p ON o.product_id = p.product_id WHERE o.order_timestamp >= :run_hour_ts AND o.order_timestamp < :run_hour_ts + INTERVAL '1 hour' GROUP BY 1, 2, 3 ''')
5
6# Path 2: Streaming for marketing (tier 2, micro-batch every 1 min)
7# Reads CDC stream of the orders table
8stream = spark.readStream.format('delta').load('s3://cdc/raw_orders')
9live = (
10 stream.join(broadcast(product_dim), 'product_id')
11 .groupBy(window('order_timestamp', '1 minute'),
12 'category', 'region')
13 .agg(count('*').alias('orders'),
14 sum('amount_cents').alias('cents'))
15)
16live.writeStream.outputMode('update') \
17 .trigger(processingTime='1 minute') \
18 .toTable('mart.live_minute_orders')

The Result

Before
  • One nightly batch, 11 hours, $40 per run
  • Tier 4 for everyone; tier 2 needs unmet
  • Marketing built shadow pipeline that drifts from canon
  • On-call paged 3-4 mornings a week
After
  • Hourly micro-batch (5 min/run, $4) plus streaming micro-batch (1-min trigger, $200/day)
  • Tier 4 for executive dashboard; tier 2 for marketing dashboard; both consumers happy
  • One canonical fact table family; shadow pipeline retired
  • On-call paged less than once a week; both paths run independently

What This Example Teaches

The example is not a story about streaming winning. It is a story about diagnosing which dimension is failing. Volume outgrew cadence; the fix was a tighter cadence, not a wholesale rhythm change. The non-incremental transform read a full day every time; the fix was incrementality, not a new engine. Different consumers had different freshness needs; the fix was to give each consumer the rhythm it actually needs, not a single shared rhythm. Senior engineers fix freshness problems by naming the dimension that is failing. The wrong instinct is to swap the architecture; the right one is to swap the smallest piece that solves the actual problem.
The diagnosis questions for a slow batch:
  • Has volume outgrown the available run window? If yes, run more often, not differently.
  • Is the transform non-incremental? If yes, make it incremental before changing engines.
  • Do consumers have different freshness needs? If yes, split the paths by tier.
  • Is the slowness in the transform itself or in the read/write paths? If reads and writes dominate, swapping the engine does not help.
TIP
Before migrating any batch pipeline to streaming, write the four diagnosis answers in plain prose. Most slow batch problems resolve to incrementality and cadence, not rhythm change.
PUTTING IT ALL TOGETHER

> A growth-stage logistics company has a nightly batch that processes 80M delivery events and computes routes for the next morning. The pipeline is missing its 6am SLA and the cost has tripled in twelve months. The new tech lead is asked to redesign without rewriting everything as streaming.

Start with the dimensions. Latency target stays at 6am (tier 3 freshness). Throughput is the bottleneck: the run window cannot fit 80M events in seven hours. The dimension failing is throughput-per-run, not latency floor.
The fix is incrementality plus cadence, not a rhythm change. Make the transform incremental on event_timestamp; run hourly instead of nightly. Each hourly run does 1/24th of the work and finishes in minutes.
If a separate consumer (the operations team) wants tier-2 freshness, give that consumer a separate path: a Spark Structured Streaming job with a 1-minute trigger. The original consumer keeps tier-3 hourly batch. Both feed a shared curated fact-event family, applying the layered architecture from Lesson 1.
Cost it explicitly. The tier-3 hourly batch costs roughly 30 percent more than the original nightly. The tier-2 streaming path adds another tier of cost on top. The four pipeline roles (source, transform, storage, consumer) stay the same; the rhythm of each role is what changes.
The outcome is two paths, one tier each, both operationally distinct, each meeting its own freshness target. The wrong move would have been one streaming pipeline serving everyone at the cost of the highest-freshness consumer; the right move is per-consumer tiering.
KEY TAKEAWAYS
Latency and throughput are separate dimensions: optimizing one usually costs the other. Most production decisions resolve to which dimension is firm.
Micro-batch is the middle ground most production pipelines occupy: Spark Structured Streaming and Flink in batch mode let the trigger interval tune the latency-throughput-cost balance.
Streaming costs 5x to 50x more than batch for the same logic: continuous compute, state storage, and operational overhead combine. Cost the latency in dollars before approving the spend.
Stateful transforms drive most of the cost premium: key cardinality and window size dominate. Watermarks bound how long state is kept and prevent unbounded growth.
When batch outgrows itself, fix the dimension that is failing: incrementality and cadence usually beat a rhythm change. Different consumers can have different tiers on the same source data.

Batch vs Streaming: Intermediate

Latency, throughput, state, and cost are the dimensions; pick deliberately, not by default

Category
Pipeline Architecture
Difficulty
intermediate
Duration
30 minutes
Challenges
0 hands-on challenges

Topics covered: Latency vs Throughput Tradeoff, Micro-Batch: The Middle Ground, Why Streaming Costs More, Stateful vs Stateless Transforms, When Batch Outgrows Itself

Lesson Sections

  1. Latency vs Throughput Tradeoff (concepts: paLatencyVsThroughput)

    Batch and streaming are usually framed as a single axis, fast versus slow. The framing hides the actual engineering decision, which has two axes. Latency is the time from event arrival to event being processed and visible. Throughput is how many events the pipeline can process per unit of time. The two are not the same and are often in tension: optimizing for one usually costs the other. A pipeline that processes one event in 100 milliseconds has low latency but may have low throughput because t

  2. Micro-Batch: The Middle Ground (concepts: paMicroBatchVsTrue)

    Most production pipelines that look like streaming are not pure streaming. They are micro-batch: very small batches, often every few seconds or every minute, processed by an engine that exposes a streaming API on top. Spark Structured Streaming is the largest example. Flink can run in batch or streaming mode with a tunable trigger interval. The pattern exists because pure streaming is expensive and pure batch cannot meet sub-15-minute freshness. Micro-batch sits in the middle: latency low enough

  3. Why Streaming Costs More (concepts: paStreamingCost)

    Streaming costs more than batch for the same logic on the same data. The factor is rarely 10 percent; it is more often 5x to 50x. The cost difference is real and measurable, and it is the single most important variable in batch-versus-streaming decisions after freshness. Engineers who skip the cost conversation end up with streaming pipelines that consume budget the company does not want to spend, on freshness consumers do not need. The cost story has three components: continuous compute, state

  4. Stateful vs Stateless Transforms (concepts: paStatefulVsStateless, paWatermarks)

    Transforms divide into two categories that matter much more in streaming than in batch. A stateless transform processes one event at a time and produces output that depends only on that event. A stateful transform produces output that depends on more than one event: a count, a sum, a window, a join with another stream. The category changes the cost, the complexity, and the failure-recovery story. In batch, both categories look about the same because the engine has all the data in memory at once.

  5. When Batch Outgrows Itself (concepts: paBatchOutgrowsItself, paIncrementalTransforms)

    The exercise below walks through a real-shaped scenario: a pipeline that started as nightly batch, grew, and stopped meeting its freshness target. The redesign is not a wholesale switch to streaming. The redesign is a careful examination of which dimension is failing and the smallest change that fixes it. Most batch-to-streaming migrations in production look like this exercise, not like a rewrite. The Starting Pipeline An e-commerce company's nightly pipeline reads orders from a Postgres databas