Batch vs Streaming: Intermediate

Latency, throughput, state, and cost are the dimensions; pick deliberately, not by default

Category: Pipeline Architecture
Difficulty: Intermediate
Duration: 30 minutes
Challenges: 0 hands-on challenges

Topics covered: Latency vs Throughput Tradeoff, Micro-Batch: The Middle Ground, Why Streaming Costs More, Stateful vs Stateless Transforms, When Batch Outgrows Itself

Lesson Sections

  1. Latency vs Throughput Tradeoff (concepts: paLatencyVsThroughput)

    Batch and streaming are usually framed as a single axis, fast versus slow. That framing hides the actual engineering decision, which has two axes. Latency is the time from an event's arrival to its result being processed and visible. Throughput is how many events the pipeline can process per unit of time. The two are not the same and are often in tension: optimizing for one usually costs the other. A pipeline that processes one event in 100 milliseconds has low latency but may have low throughput, because at 100 milliseconds per event a single worker tops out at ten events per second.
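The tension is easiest to see with numbers. This is a minimal sketch, not a benchmark: the 10 ms fixed overhead and 0.1 ms per-event cost are assumed figures chosen to make the tradeoff visible.

```python
# Sketch: why per-event processing trades throughput for latency.
# Assumed numbers: each processing call pays 10 ms of fixed overhead
# (network round trip, commit) plus 0.1 ms of work per event.

OVERHEAD_MS = 10.0      # fixed cost per processing call (assumption)
PER_EVENT_MS = 0.1      # marginal cost per event (assumption)

def pipeline_profile(batch_size: int) -> tuple[float, float]:
    """Return (latency_ms, throughput_events_per_sec) for a batch size.

    Latency here is the time for one whole batch to be processed;
    queueing time while the batch fills is ignored for simplicity.
    """
    batch_ms = OVERHEAD_MS + PER_EVENT_MS * batch_size
    throughput = batch_size / (batch_ms / 1000.0)
    return batch_ms, throughput

for size in (1, 100, 10_000):
    lat, tput = pipeline_profile(size)
    print(f"batch={size:>6}: latency ~ {lat:8.1f} ms, throughput ~ {tput:10.0f} ev/s")
```

At batch size 1 the fixed overhead dominates (about 99 events per second); at batch size 10,000 the overhead amortizes away, but the first event in the batch now waits over a second.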

  2. Micro-Batch: The Middle Ground (concepts: paMicroBatchVsTrue)

    Most production pipelines that look like streaming are not pure streaming. They are micro-batch: very small batches, often every few seconds or every minute, processed by an engine that exposes a streaming API on top. Spark Structured Streaming is the best-known example, with a tunable trigger interval that sets the batch cadence; Flink, by contrast, processes events one at a time but can also run the same job in batch execution mode. The pattern exists because pure streaming is expensive and pure batch cannot meet sub-15-minute freshness. Micro-batch sits in the middle: latency low enough for most freshness targets, at a cost closer to batch than to pure streaming.
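Stripped of the engine machinery, a micro-batch loop is just this: drain the source in fixed-size chunks and pay the processing overhead once per chunk instead of once per event. The in-memory queue and `trigger_size` knob below stand in for a real source and a real trigger interval; production engines add durable offsets and checkpointing on top.

```python
# A minimal micro-batch loop, assuming an in-memory queue as the source.
# Real engines (e.g. Spark Structured Streaming with a processingTime
# trigger) do the same thing with durable offsets and checkpoints.

from collections import deque

def run_micro_batches(source: deque, process, trigger_size: int):
    """Drain `source` in micro-batches of up to `trigger_size` events.

    `trigger_size` plays the role of a trigger interval: instead of
    handling each event as it arrives, overhead is amortized across
    the batch. Returns one result per batch.
    """
    results = []
    while source:
        batch = [source.popleft() for _ in range(min(trigger_size, len(source)))]
        results.append(process(batch))   # one call per batch, not per event
    return results

events = deque(range(10))
print(run_micro_batches(events, sum, trigger_size=4))  # → [6, 22, 17]
```

Shrinking `trigger_size` toward 1 recovers per-event latency and per-event cost; growing it recovers batch economics and batch staleness. That dial is the whole point of micro-batch.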

  3. Why Streaming Costs More (concepts: paStreamingCost)

    Streaming costs more than batch for the same logic on the same data. The factor is rarely 10 percent; it is more often 5x to 50x. The cost difference is real and measurable, and it is the single most important variable in batch-versus-streaming decisions after freshness. Engineers who skip the cost conversation end up with streaming pipelines that consume budget the company does not want to spend, on freshness consumers do not need. The cost story has three components: continuous compute, state
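A back-of-envelope model makes the multiplier concrete. All prices below are illustrative assumptions, not quotes: the structural point is that batch pays for compute one hour a day while streaming pays for a (smaller) cluster around the clock, plus a state store.

```python
# Back-of-envelope daily cost model with assumed, illustrative prices.
# Batch: 10 nodes for 1 hour per night. Streaming: 4 nodes, 24/7,
# plus state-store / checkpoint storage.

BATCH_NODE_HOURLY = 2.00       # assumed $/node-hour
BATCH_NODES = 10
BATCH_HOURS_PER_DAY = 1

STREAM_NODE_HOURLY = 2.00      # same instance type, assumed
STREAM_NODES = 4               # smaller cluster, but always on
STATE_STORE_DAILY = 5.00       # assumed checkpoint/state storage cost

batch_daily = BATCH_NODE_HOURLY * BATCH_NODES * BATCH_HOURS_PER_DAY
stream_daily = STREAM_NODE_HOURLY * STREAM_NODES * 24 + STATE_STORE_DAILY

print(f"batch  ~ ${batch_daily:.2f}/day")
print(f"stream ~ ${stream_daily:.2f}/day ({stream_daily / batch_daily:.1f}x)")
```

Even with the streaming cluster less than half the size of the batch one, the always-on factor alone lands this sketch near 10x, squarely inside the 5x-to-50x range the section cites.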

  4. Stateful vs Stateless Transforms (concepts: paStatefulVsStateless, paWatermarks)

    Transforms divide into two categories that matter much more in streaming than in batch. A stateless transform processes one event at a time and produces output that depends only on that event. A stateful transform produces output that depends on more than one event: a count, a sum, a window, a join with another stream. The category changes the cost, the complexity, and the failure-recovery story. In batch, both categories look about the same because the engine has all the data available at once.
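The two categories can be shown side by side on the same event stream. Events, field names, and the 60-second tumbling window below are all illustrative; the point is what each function has to remember between events.

```python
# Stateless vs stateful transforms over the same stream of
# (timestamp_sec, user_id) events. Names are illustrative.

from collections import defaultdict

def anonymize(event):
    """Stateless: output depends on this one event alone."""
    ts, user = event
    return (ts, user.upper())

def windowed_counts(events, window_sec=60):
    """Stateful: per-user counts per tumbling window.

    A streaming engine must hold `state` until each window closes,
    which is why stateful streaming needs a state store, and
    watermarks to decide when a window can be finalized and dropped.
    """
    state = defaultdict(int)          # state that must survive restarts
    for ts, user in events:
        window_start = ts - ts % window_sec
        state[(window_start, user)] += 1
    return dict(state)

events = [(5, "a"), (30, "a"), (61, "b"), (70, "a")]
print([anonymize(e) for e in events])
print(windowed_counts(events))
# → {(0, 'a'): 2, (60, 'b'): 1, (60, 'a'): 1}
```

`anonymize` can be restarted anywhere with no recovery story; `windowed_counts` cannot, because losing `state` mid-window loses output. In batch, both run over a complete dataset and the distinction nearly disappears.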

  5. When Batch Outgrows Itself (concepts: paBatchOutgrowsItself, paIncrementalTransforms)

    The exercise below walks through a real-shaped scenario: a pipeline that started as nightly batch, grew, and stopped meeting its freshness target. The redesign is not a wholesale switch to streaming. It is a careful examination of which dimension is failing and the smallest change that fixes it. Most batch-to-streaming migrations in production look like this exercise, not like a rewrite.

    The Starting Pipeline

    An e-commerce company's nightly pipeline reads orders from a Postgres database.
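The usual smallest change is an incremental transform: keep the batch cadence but stop rescanning the whole table, reading only rows past a persisted cursor. This sketch uses a hypothetical `id` cursor column and a stand-in transform; it is not the exercise's actual solution, just the shape of the technique.

```python
# Incremental batch sketch: instead of a full nightly scan, process
# only rows past the last persisted cursor. Column names ("id",
# "amount") and the doubling transform are illustrative assumptions.

def incremental_run(rows, last_cursor):
    """Process rows with id > last_cursor; return (results, new_cursor).

    Persisting `new_cursor` between runs is what makes each run's
    work proportional to new data, not total data.
    """
    new_rows = [r for r in rows if r["id"] > last_cursor]
    results = [r["amount"] * 2 for r in new_rows]       # stand-in transform
    new_cursor = max((r["id"] for r in new_rows), default=last_cursor)
    return results, new_cursor

orders = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}, {"id": 3, "amount": 5}]
out, cursor = incremental_run(orders, last_cursor=1)
print(out, cursor)  # → [40, 10] 3
```

Run more often with a smaller per-run scan and the same batch pipeline meets a much tighter freshness target, with no streaming engine, no state store, and no always-on cluster.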