Batch vs Streaming: Beginner

An e-commerce company at Series C scale ran every analytical pipeline once a night at 2am Pacific. The CFO read the daily revenue dashboard at 8am with morning coffee, satisfied. Then a flash sale launched at noon, and the marketing team needed to know within minutes whether discount codes were converting before they bought another $50,000 in ad spend. The 2am pipeline was useless for that question; the freshest number it could provide was already at least ten hours stale. The team patched together a Kafka consumer in three hours, paid for it in panic and sleep loss, and shipped a hotfix that drifted from the canonical revenue number for the next six weeks. The mistake was not the pipeline. The mistake was assuming one processing model fit every consumer. This lesson gives a picture of the two main ways data moves through a pipeline, when each fits, and why the marketing word "real-time" almost never means what it sounds like.

Two Ways Data Can Move

Recognize the two processing rhythms, batch and streaming, and name what changes between them.

Data moves through a pipeline in one of two basic rhythms. The first rhythm is scheduled. Data piles up for a while, then a job wakes up, processes everything that has accumulated since the last run, and goes back to sleep. The second rhythm is continuous. Each new event flows through the pipeline as it arrives, with no waiting for a scheduled wake-up. Almost every pipeline in production fits into one of these two rhythms, or a hybrid that explicitly mixes them. Naming the rhythm is the first useful skill, because every other architectural choice (compute shape, cost profile, failure handling, freshness expectation) depends on it.

The Two Rhythms

| Rhythm | How Data Moves | What Drives the Cadence |
| --- | --- | --- |
| Batch | Data sits in a source until a job runs and processes the accumulated chunk | A schedule (every hour, every night) or a manual trigger |
| Streaming | Each event flows through transforms as it arrives, one at a time or in tiny groups | The arrival of new data; the pipeline is always running |
| Hybrid (micro-batch) | Tiny scheduled batches that feel continuous from the outside | A short interval (every 10 seconds, every minute) the engine sets |

Both rhythms produce the same end result if the inputs and the logic are the same. A daily count of signups by country can be computed by summing every signup since midnight at 2am, or by incrementing a counter every time a signup event arrives. The number on the dashboard at 9am is the same. What differs is when the result is available, how much compute it costs, and what the pipeline looks like when something breaks. The choice between batch and streaming is not about correctness; it is about tradeoffs.
The defining differences in plain language:
  • Batch waits and processes a chunk; streaming processes each event as it arrives
  • Batch runs sometimes; streaming runs always
  • Batch fails and restarts the chunk; streaming has to handle failure mid-flight
  • Batch optimizes for throughput; streaming optimizes for latency
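
The claim that both rhythms produce the same number can be checked with a small sketch. The event list, the country codes, and the function names below are illustrative assumptions, not part of any real pipeline; the point is only that a one-pass count over the accumulated chunk and a counter incremented per event agree.

```python
from collections import Counter

# Hypothetical signup events as (timestamp, country) pairs. Illustrative data only.
signups = [
    ("2024-05-01T00:15:00", "US"),
    ("2024-05-01T03:40:00", "DE"),
    ("2024-05-01T09:05:00", "US"),
    ("2024-05-01T17:30:00", "BR"),
    ("2024-05-01T22:10:00", "US"),
]


def batch_count(events):
    """Batch rhythm: wait for the whole chunk, then count it in one pass."""
    return Counter(country for _, country in events)


def streaming_count(events):
    """Streaming rhythm: update a running total as each event arrives."""
    totals = Counter()
    for _, country in events:  # stands in for an unbounded event feed
        totals[country] += 1
    return totals


batch_result = batch_count(signups)
stream_result = streaming_count(signups)
print(batch_result == stream_result)
```

The totals match; what the sketch cannot show is the part that actually differs in production, namely when the result becomes available and what it costs to keep the second version running.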

An Everyday Analogy

A washing machine is batch. Clothes pile up in a hamper through the week. On Sunday a wash cycle runs, processing all the clothes at once. The machine is idle the rest of the week. Cost per wash is low because the cycle amortizes over a full load. A laundromat dryer with a coin slot, running for one shirt at a time as customers walk in, is closer to streaming. The dryer is on whenever anyone is there. Cost per shirt is much higher because the dryer is also drying air between customers. Neither is wrong; the right answer depends on whether the goal is the cheapest cost per pound or the shortest time from dirty shirt to dry shirt.
Batch Mindset
  • Chunks of data, processed when scheduled
  • Reads everything that has accumulated, then sleeps
  • Compute is cheap because the engine spins up and shuts down
  • Freshness is bounded by the schedule (last hour, last day)
Streaming Mindset
  • Individual events, processed as they arrive
  • Always running; never sleeps
  • Compute is more expensive because nothing shuts down
  • Freshness is bounded by the time to push one event through

The Smallest Possible Comparison

    # Batch: a job that runs once an hour and processes the last hour of orders.
    # read_orders, aggregate_by_country, and write_output are placeholder helpers.
    def hourly_batch_job(start_ts, end_ts):
        rows = read_orders(start_ts, end_ts)
        totals = aggregate_by_country(rows)
        write_output(totals, partition=end_ts)

    # Streaming: a process that runs forever and updates totals on every event.
    # consume_kafka_topic, update_running_total, and flush_to_output_if_dirty
    # are placeholder helpers.
    def streaming_consumer():
        for order in consume_kafka_topic('orders'):
            update_running_total(order.country, order.amount)
            flush_to_output_if_dirty()

The two snippets do roughly the same job. The batch version has a beginning, a middle, and an end; it is called and it returns. The streaming version is an infinite loop. Every line of operational difference between batch and streaming pipelines flows from this one structural fact. A batch pipeline is a function. A streaming pipeline is a service.

Most companies start with batch and add streaming only when a specific consumer cannot tolerate the wait. Streaming-first architectures are rare and almost always justified by a freshness requirement that batch literally cannot meet.

Batch processes a chunk of accumulated data on a schedule; streaming processes each event as it arrives.
The two rhythms produce the same numbers when given the same inputs and logic; the difference is when, how much, and how it fails.
A batch pipeline is a function with a start and end; a streaming pipeline is a service that never stops running.
TIP
Before reaching for streaming, name the freshness expectation in concrete terms. If the answer is anything coarser than a few minutes, batch is almost always the cheaper and simpler choice.

Batch: Picture, Rhythm, Example

Walk through a batch pipeline run end to end and name the points at which compute starts, runs, and shuts down.

Batch processing is the older of the two rhythms and still the dominant pattern in production. Most analytical work in most companies runs as a batch job, often nightly, sometimes hourly. The pattern is so common that the word pipeline used without qualification almost always means a batch pipeline. Knowing the shape of a batch run cold is the foundation for everything else, because streaming is largely defined by what it changes about that shape.

The Shape of a Batch Run

| Step | What Happens | Typical Duration |
| --- | --- | --- |
| Wake up | Orchestrator triggers the job at the scheduled time | Seconds |
| Read | Job reads the input window (last hour, last day) from the source | Seconds to minutes depending on volume |
| Transform | Job applies cleaning, joins, aggregations to the chunk | Most of the run; minutes to hours at scale |
| Write | Job writes the output to a partition or table | Seconds to minutes |
| Shut down | Compute resources are released; the job ends | Seconds |

The Nightly Run

The canonical example is the nightly run. Sometime between 1am and 4am Pacific, when production traffic is at its lowest, an orchestrator wakes up dozens or hundreds of jobs in dependency order. Each job reads its inputs, applies its transforms, and writes its outputs. The work finishes by 6am or 7am, in time for the morning consumers to read fresh data. By 9am the dashboards show the previous day's numbers. The cycle repeats the next night. Batch is the rhythm of the office workweek: the work happens off-hours so the morning has answers.

    INSERT INTO mart.daily_signups (signup_date, country, signup_count)
    SELECT
        DATE(signup_timestamp) AS signup_date,
        country_code,
        COUNT(*) AS signup_count
    FROM raw.signups
    WHERE signup_timestamp >= :run_date::DATE
      AND signup_timestamp < :run_date::DATE + INTERVAL '1 day'
    GROUP BY 1, 2;

The query reads exactly one day of data, aggregates it, and writes the result. The next day's run does the same for the next day. The pattern is durable: it has run nearly unchanged in data warehouses since the 1990s, because the basic shape is right. Read a window, transform, write a partition, end.
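
The "read a window" step relies on a half-open interval: the start of the day is included and the start of the next day is excluded, so consecutive runs neither skip nor double-count an event at midnight. A minimal Python sketch of that boundary logic follows; the function names `run_window` and `in_window` are hypothetical and chosen only for illustration.

```python
from datetime import date, datetime, timedelta


def run_window(run_date: date) -> tuple[datetime, datetime]:
    """Return the half-open window [start, end) covering one calendar day."""
    start = datetime.combine(run_date, datetime.min.time())
    end = start + timedelta(days=1)
    return start, end


def in_window(ts: datetime, window: tuple[datetime, datetime]) -> bool:
    """Mirror the SQL predicate: ts >= start AND ts < end."""
    start, end = window
    return start <= ts < end


monday = run_window(date(2024, 5, 6))
tuesday = run_window(date(2024, 5, 7))
midnight = datetime(2024, 5, 7, 0, 0, 0)

# Monday's window ends exactly where Tuesday's begins, and an event at
# midnight belongs to Tuesday only.
print(monday[1] == tuesday[0], in_window(midnight, monday), in_window(midnight, tuesday))
```

Using `>=` on the lower bound and `<` on the upper bound is the design choice that makes the windows tile the timeline cleanly.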

Why Batch Is Cheap

Batch jobs spin compute up at the start of the run and shut it down at the end. A Snowflake warehouse bills per second of compute time while it is running, with a 60-second minimum each time it resumes. A nightly job that runs for forty minutes pays for forty minutes of compute, period. The other twenty-three hours and twenty minutes of the day cost nothing in compute, provided the warehouse suspends between runs. Cloud compute is elastic specifically so that batch workloads can ramp up to large clusters when work is happening and ramp down to zero between runs. The cost per unit of data of a well-designed batch pipeline is typically the lowest of any architecture.
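
The forty-minute arithmetic can be made concrete with a rough sketch. The hourly rate below is a made-up placeholder rather than a real price, and the model ignores storage and any minimum billing increment; it only compares paying for the run against paying for the whole day.

```python
MINUTES_PER_DAY = 24 * 60

# Hypothetical price for one hour of compute; not a real vendor rate.
RATE_PER_HOUR = 3.00


def compute_cost(minutes_running: float, rate_per_hour: float) -> float:
    """Cost of compute that is billed only while it is running."""
    return rate_per_hour * minutes_running / 60


batch_daily = compute_cost(40, RATE_PER_HOUR)                   # forty-minute nightly run
always_on_daily = compute_cost(MINUTES_PER_DAY, RATE_PER_HOUR)  # same cluster left up all day
billed_fraction = 40 / MINUTES_PER_DAY

print(f"batch: ${batch_daily:.2f}/day, always-on: ${always_on_daily:.2f}/day")
print(f"batch pays for {billed_fraction:.1%} of the day")
```

Under these assumptions the nightly run pays for under 3 percent of the day, which is the whole cost argument for batch in one number.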

Common Batch Cadences

| Cadence | When It Fits | Typical Use |
| --- | --- | --- |
| Daily (overnight) | Consumer reads the next morning; cost matters | Executive dashboards, financial reports, ML training data |
| Hourly | Consumer wants same-day freshness without paying for streaming | Operational dashboards, marketing reports, fraud retrospective |
| Every 15 minutes | Near-real-time feel without the streaming infrastructure cost | Quasi-live ad spend dashboards, customer support queues |
| Weekly or monthly | Data changes slowly; reading more often is wasted work | Cohort analyses, retention reports, long-horizon trends |

What Batch Cannot Do

Batch cannot beat its own schedule. A daily pipeline cannot tell anyone what happened in the last hour. An hourly pipeline cannot tell anyone what happened in the last minute. The freshness floor is the schedule itself. For a finance team that reads the dashboard once a morning, that floor is fine. For a fraud team that needs to react to a suspicious transaction within seconds, the floor is unusable. The mismatch between consumer freshness needs and batch cadence is the most common reason streaming gets pulled into a system.
The batch contract in three sentences:
  • The pipeline runs on a schedule and processes the chunk that has accumulated since the last run
  • The result is available some time after the run starts, bounded by how long the run takes
  • Compute is paid for only during the run; idle time is free
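
The freshness floor can be written as a one-line formula under a simple assumed model: each run starts when its window closes and publishes results only when it finishes. The function name and the example durations are illustrative assumptions.

```python
from datetime import timedelta


def worst_case_staleness(interval: timedelta, run_duration: timedelta) -> timedelta:
    """Longest wait between an event happening and its result being readable.

    Assumed model: each run starts when its window closes and publishes
    results only when it finishes. An event that lands just after a window
    opens waits a full interval for the window to close, then the run time.
    """
    return interval + run_duration


nightly = worst_case_staleness(timedelta(days=1), timedelta(minutes=40))
hourly = worst_case_staleness(timedelta(hours=1), timedelta(minutes=5))
print(nightly)  # 1 day, 0:40:00
print(hourly)   # 1:05:00
```

Shortening the run helps only a little; the schedule interval dominates, which is why changing the cadence matters more than tuning the job.
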
Do
  • Use batch as the default unless a specific consumer cannot wait for the next scheduled run
  • Match the cadence to the consumer's freshness need; do not over-run if hourly is enough
  • Partition outputs by the run window so failed runs replay one partition cleanly
Don't
  • Reach for streaming because it sounds more modern; the cost difference is real
  • Run a batch job continuously by scheduling it every minute; that is just expensive streaming
  • Confuse a slow batch with a streaming need; profiling first beats redesigning later
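
The advice to partition outputs by run window is what makes a failed run safe to replay. A minimal sketch, assuming an in-memory dictionary stands in for a partitioned table and using hypothetical names throughout: the job overwrites its own partition instead of appending, so running it twice leaves the same result as running it once.

```python
import copy
from collections import Counter

# In-memory stand-in for a partitioned output table: {partition_key: {country: count}}.
output_table: dict[str, dict[str, int]] = {}


def run_daily_job(run_date: str, events: list[tuple[str, str]]) -> None:
    """Aggregate one day's signups and overwrite that day's partition."""
    counts = Counter(country for day, country in events if day == run_date)
    output_table[run_date] = dict(counts)  # overwrite, never append


# Hypothetical (day, country) signup events.
events = [("2024-05-01", "US"), ("2024-05-01", "DE"), ("2024-05-02", "US")]

run_daily_job("2024-05-01", events)
after_first_run = copy.deepcopy(output_table)

run_daily_job("2024-05-01", events)  # simulated replay after a failure
print(output_table == after_first_run)
```

An append-style write would have doubled the counts on replay; overwriting by partition key is the property that keeps batch recovery simple.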

Streaming: Picture, Rhythm, Example

Trace a single event through a streaming pipeline from source queue to output sink.

Stream processing is the second basic rhythm. A streaming pipeline runs continuously. Each new event arrives at the source and flows through the transforms within milliseconds or seconds. There is no concept of a chunk and no concept of a scheduled wake-up. The pipeline is a long-running service, more like a web server than a script. The shape is more recent than batch in mainstream use, dating roughly from the rise of Apache Kafka in the early 2010s and the stream processors that grew up around it: Spark Streaming, Flink, Kafka Streams, Beam.

The Shape of a Streaming Pipeline

| Element | What It Does | What It Looks Like |
| --- | --- | --- |
| Source | Produces events continuously into a queue or log | Kafka topic, Kinesis stream, Pub/Sub topic |
| Consumer process | Reads events as they arrive, applies transforms, emits results | Long-running JVM, Python service, Spark cluster |
| State store | Holds running totals, windows, joins between events | RocksDB on local disk, in-memory cache, external KV store |
| Output sink | Where transformed events land for downstream consumption | Another Kafka topic, a database, a feature store |

The Live Event Feed

The canonical example is a live event feed. A user clicks a button on a website. The click is recorded as an event, sent to a Kafka topic, and within a few seconds shows up on an internal dashboard, a fraud system, an ad attribution model, and a customer support tool. The same event triggers downstream consequences without any nightly run, any 2am wake-up, or any human waiting until morning. The feeling from the consumer side is that the system is alive: things happen, the dashboard updates.

    # Sketch of a streaming consumer using the confluent_kafka Python client.
    # parse, should_flush, and write_to_dashboard are placeholder helpers.
    from confluent_kafka import Consumer

    consumer = Consumer({
        'bootstrap.servers': 'kafka.internal:9092',
        'group.id': 'click-aggregator',
        'enable.auto.commit': False,
    })
    consumer.subscribe(['raw.clicks'])

    running_totals = {}
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue  # no event arrived within the timeout
        if msg.error():
            continue  # skip error events; a real service would log or raise
        event = parse(msg.value())
        running_totals[event.page] = running_totals.get(event.page, 0) + 1
        if should_flush(event):
            write_to_dashboard(running_totals)
        consumer.commit(message=msg)  # commit the offset only after processing

The shape is an infinite loop. A consumer reads the next event, updates state, possibly emits an output, commits the offset, and goes back to read the next event. The loop runs forever. There is no return value. There is no scheduled end. The process is started once and expected to keep running. Stopping it is a deliberate operational action, not the natural conclusion of the work.

Why Streaming Costs More

Streaming pays for compute around the clock. The Kafka consumer process is up at 3am whether or not events are arriving. The Flink cluster is provisioned for peak load even at off-peak hours. State stores like RocksDB hold data in memory or on local disk that costs money to keep around. There is no equivalent of the batch idle period. Cost is roughly proportional to provisioned peak capacity multiplied by wall-clock time, rather than to the total volume of events actually processed. For low-volume topics this can still be cheap, but it rarely approaches the cost per event of a well-amortized batch job.
Batch Cost Profile
  • Compute is on for the run, off the rest of the day
  • Cost scales with the size of each chunk plus overhead per run
  • Idle hours are free; ramp-up amortizes across the chunk
  • A 1 percent traffic dip lowers cost the next night
Streaming Cost Profile
  • Compute is on every minute of every day
  • Cost scales with provisioned capacity, not actual traffic
  • Idle hours still cost the same as peak hours
  • A 1 percent traffic dip lowers nothing; capacity is fixed
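
The two cost profiles can be contrasted in a toy model. Every number below (event volume, per-million price, node count, node price) is a hypothetical placeholder; the sketch only illustrates that traffic is an input to the batch formula and is absent from the streaming one.

```python
HOURS_PER_MONTH = 720  # 30 days x 24 hours


def batch_monthly_cost(events_per_day: int, cost_per_million_events: float,
                       overhead_per_run: float, runs_per_month: int = 30) -> float:
    """Batch cost tracks the data actually processed, plus a fixed per-run overhead."""
    processing = events_per_day / 1_000_000 * cost_per_million_events
    return runs_per_month * (processing + overhead_per_run)


def streaming_monthly_cost(provisioned_nodes: int, node_cost_per_hour: float) -> float:
    """Streaming cost tracks provisioned capacity; traffic does not appear at all."""
    return provisioned_nodes * node_cost_per_hour * HOURS_PER_MONTH


normal_traffic = 50_000_000
dipped_traffic = int(normal_traffic * 0.99)  # a 1 percent dip

batch_normal = batch_monthly_cost(normal_traffic, cost_per_million_events=0.04, overhead_per_run=0.10)
batch_dipped = batch_monthly_cost(dipped_traffic, cost_per_million_events=0.04, overhead_per_run=0.10)
streaming_any = streaming_monthly_cost(provisioned_nodes=3, node_cost_per_hour=0.50)

print(f"batch: {batch_normal:.2f} -> {batch_dipped:.2f}; streaming: {streaming_any:.2f} either way")
```

Autoscaling can soften the streaming side in practice, but the cluster still has to be sized for the peak it might see.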

Common Streaming Use Cases

| Use Case | Why Streaming Fits | Tolerable Latency |
| --- | --- | --- |
| Fraud detection | Decisions must happen before the transaction settles | Sub-second to a few seconds |
| Live operational dashboards | Operators react to events as they happen | Seconds to a minute |
| Real-time personalization | User session is short; recommendations must adapt within the session | A few hundred milliseconds |
| IoT telemetry | Volume is too high to batch economically; alarms are time-critical | Seconds to a minute |
| Change data capture (CDC) | Each row write in an operational database becomes an event in a stream; downstream replicas must reflect upstream changes within seconds | Seconds to a few minutes |

What Streaming Is Not

Streaming is not a magic faster batch. The latency win is real but bounded; the streaming pipeline still has to read, transform, and write each event, and physics imposes a floor on how fast that can happen. Network hops, serialization, state lookups, and downstream writes each add milliseconds. A typical end-to-end streaming latency in production is between 100 milliseconds and several seconds. Anything sub-100ms requires careful engineering and dedicated hardware. The marketing word real-time hides this floor; the engineering reality respects it.
The streaming contract in three sentences:
  • Each event flows through the pipeline as it arrives, with no waiting for a scheduled run
  • End-to-end latency is bounded by the time to read, transform, and write one event
  • Compute is paid for around the clock; cost scales with provisioned capacity, not actual volume
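
The latency floor is easiest to see as a budget that sums each hop. The hop names and millisecond values below are invented for illustration and will differ widely between systems; the sketch only shows how quickly small per-hop delays add up against a target.

```python
# Hypothetical per-hop latencies in milliseconds; illustrative values only.
latency_budget_ms = {
    "producer batching": 20,
    "broker hop": 10,
    "consumer poll": 50,
    "state lookup": 5,
    "transform": 5,
    "sink write": 50,
}


def end_to_end_latency_ms(budget: dict[str, int]) -> int:
    """End-to-end latency is the sum of every hop an event passes through."""
    return sum(budget.values())


def meets_target(budget: dict[str, int], target_ms: int) -> bool:
    """True when the summed hops fit within the stated latency target."""
    return end_to_end_latency_ms(budget) <= target_ms


total = end_to_end_latency_ms(latency_budget_ms)
print(total, meets_target(latency_budget_ms, 500), meets_target(latency_budget_ms, 100))
```

With these made-up numbers a 500 ms target is comfortable and a 100 ms target is not, without any single hop looking slow on its own.
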
TIP
When introducing streaming for the first time, name the latency target in milliseconds before picking the engine. Engines have radically different latency floors, and getting that wrong wastes weeks.

What Real-Time Actually Means

Translate a real-time request into a concrete freshness tier and name the architecture each tier requires.

Real-time is the most overloaded phrase in data engineering. A product manager asks for a real-time dashboard and means within an hour. A finance executive asks for real-time revenue and means by the start of the workday. A trading firm asks for real-time and means within five microseconds. The word is so elastic that it carries almost no information. The only useful response to a real-time request is to ask for the actual freshness target in concrete units of time, then translate that target into the simplest pipeline that can meet it.

Five Freshness Tiers

| Tier | Freshness Target | Typical Architecture |
| --- | --- | --- |
| Sub-second | Under 100 milliseconds end to end | Specialized streaming with co-located compute and storage |
| Near real-time | Under 15 minutes | Streaming or micro-batch (Spark Structured Streaming, Flink) |
| Same day | Under 2 hours | Hourly batch or micro-batch every 15 minutes |
| Daily | By the next morning | Nightly batch, runs at 2am, ready by 7am |
| Weekly or slower | On a calendar cadence | Weekly batch, often on a Sunday or Monday morning |

Each tier costs several times more than the slower tier beneath it. A daily batch pipeline is cheap. An hourly version of the same pipeline is more expensive because of the per-run overhead repeated 24 times a day. A streaming version is more expensive again because compute runs continuously. A sub-second streaming pipeline that meets a 50ms target is more expensive again because it requires careful engineering to remove every avoidable millisecond. Picking a tier that the consumer does not actually need is one of the most common forms of overengineering in data work.

    Approximate monthly cost shape for the same logic on the same data
    ==================================================================
    Tier 5: Weekly batch          ~$10/month      (4 runs total)
    Tier 4: Daily batch           ~$60/month      (30 runs)
    Tier 3: Hourly batch          ~$300/month     (720 runs)
    Tier 2: Micro-batch (1 min)   ~$1,500/month   (always-on, modest cluster)
    Tier 1: True streaming        ~$6,000/month   (always-on, larger cluster + state)

    Each step up is roughly 4x to 6x the cost of the tier below.

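The step-to-step multipliers implied by those approximate figures can be computed directly. The dollar amounts are the lesson's own rough numbers; the variable names are illustrative.

```python
# Approximate monthly costs from the listing above, slowest tier first.
monthly_cost = [
    ("Tier 5: Weekly batch", 10),
    ("Tier 4: Daily batch", 60),
    ("Tier 3: Hourly batch", 300),
    ("Tier 2: Micro-batch (1 min)", 1_500),
    ("Tier 1: True streaming", 6_000),
]

# Ratio of each tier's cost to the next slower tier's cost.
step_ratios = [
    faster_cost / slower_cost
    for (_, slower_cost), (_, faster_cost) in zip(monthly_cost, monthly_cost[1:])
]
end_to_end_ratio = monthly_cost[-1][1] / monthly_cost[0][1]

print(step_ratios)       # [6.0, 5.0, 5.0, 4.0]
print(end_to_end_ratio)  # 600.0
```

The individual steps look manageable; the compounded gap between weekly batch and true streaming is what makes an unnecessary tier upgrade expensive.
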
Tier 1: Sub-second
Genuinely real-time
Sub-100ms end-to-end. Trading systems, ad bidding, real-time fraud blocking. Streaming with specialized hardware and code paths.
Tier 2: Near real-time
A human notices
Under 15 minutes. Live operational dashboards, customer support queues, fraud retrospective. Streaming or micro-batch fits.
Tier 3: Same day
Same workday
Under 2 hours. Hourly executive dashboards, ad spend reports, marketing campaigns. Hourly batch is usually enough.
Tier 4: Daily
Tomorrow morning
Next day. Standard executive dashboards, ML training data, finance reports. Nightly batch is the workhorse.
Tier 5: Slower
Weekly or monthly
Calendar-driven. Cohort analyses, board reports, retention curves. Weekly or monthly batch on a slow schedule.

Why Real-Time Usually Means Tier 2 or 3

When non-engineers say real-time, they almost always mean tier 2 or tier 3: within fifteen minutes, or within a couple of hours. The dashboard the marketing team wanted at noon is a tier 2 problem, not a tier 1 problem. The CFO's morning revenue is a tier 4 problem dressed up as a tier 2 ask. Translating the request into the right tier saves an enormous amount of engineering work. A tier 4 ask handled with a tier 1 architecture wastes infrastructure money continuously. A tier 2 ask handled with a tier 4 architecture produces an angry product manager.
Questions that translate real-time into a tier:
  • What decision will be made with this data, and how often does that decision happen?
  • How long can the consumer wait between event and action without harm?
  • What does the consumer do today when they cannot get this data?
  • Is the bound a hard SLA or a fuzzy preference?
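
Once the questions produce a number with units, mapping it to a tier is mechanical. The sketch below assumes the tier boundaries from the five-tier table, treats each bound as inclusive, and reads "by the next morning" as up to 24 hours; those boundary choices and the function name are assumptions made for illustration.

```python
from datetime import timedelta

# (upper bound, tier number, tier name), fastest tier first.
# Treating "by the next morning" as 24 hours is an assumption.
TIER_BOUNDS = [
    (timedelta(milliseconds=100), 1, "Sub-second"),
    (timedelta(minutes=15), 2, "Near real-time"),
    (timedelta(hours=2), 3, "Same day"),
    (timedelta(days=1), 4, "Daily"),
]


def freshness_tier(target: timedelta) -> tuple[int, str]:
    """Map a concrete freshness target to the fastest tier that covers it."""
    for upper_bound, number, name in TIER_BOUNDS:
        if target <= upper_bound:
            return number, name
    return 5, "Weekly or slower"


print(freshness_tier(timedelta(minutes=5)))   # (2, 'Near real-time')
print(freshness_tier(timedelta(hours=20)))    # (4, 'Daily')
```

The function is trivial on purpose: the hard part is getting the consumer to state the `timedelta`, not classifying it.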

The Cost of Misnaming the Tier

Tier Mismatch: Underbuilt
  • Consumer needs sub-15-minute freshness; pipeline runs nightly
  • Symptom: an angry consumer who builds a shadow pipeline of their own
  • Cost shows up as drift between two pipelines and reconciliation work
  • Fix: upgrade to streaming or micro-batch for that one consumer
Tier Mismatch: Overbuilt
  • Consumer needs daily freshness; pipeline runs streaming
  • Symptom: a Flink cluster that costs $4,000 a month for a daily dashboard
  • Cost shows up as a cloud bill nobody can explain
  • Fix: replace streaming with a nightly batch; same numbers, 1/20th the cost

The Tier Conversation

Senior engineers do not start architecture conversations with tools. They start with the freshness conversation. A consumer says they want a real-time dashboard. The engineer asks what the dashboard is for, who reads it, and what they do with the answer. Those questions reliably surface the actual tier. Once the tier is named, the architecture follows: tier 4 is a nightly job, tier 3 is hourly, tier 2 is streaming or micro-batch, tier 1 is genuine streaming. Any architecture conversation that skips the tier conversation is going to produce the wrong shape.

Almost every real-time request that reaches a data engineer translates to tier 2 (under 15 minutes) or tier 3 (under 2 hours). Genuine tier 1 (sub-second) is rare and usually has a specific dollar value attached to the latency.

TIP
When a consumer says real-time, write the request back as a number with units before agreeing to anything. The conversation that produces that number is more valuable than the architecture that follows it.

Picking Batch or Streaming

Pick batch or streaming for a simple use case based on the consumer's freshness tier and cost tolerance.

Vocabulary becomes useful when applied to a specific decision. The exercise below picks between batch and streaming for three small concrete cases. The cases are intentionally simple so the choice is visible. Real production decisions are messier, but the same questions apply: what does the consumer need, when do they need it, and what does each option cost.

Case 1: A Marketing Team's Daily Signup Count

The marketing team wants a chart of new signups by country, by day, for the trailing 30 days. The chart is read once a morning at the marketing standup. The numbers do not change after the day closes. The consumer is patient: yesterday's number is fine, today's morning number is a bonus. The freshness tier is 4 (daily). The right architecture is a nightly batch that aggregates the day's signups and writes a small partition. Streaming would work but cost much more for no extra value the consumer cares about.

Case 2: A Fraud Team's Suspicious Transaction Alert

The fraud team wants to be alerted within seconds when a card has been used at more than five distinct merchants in the last minute. The decision is whether to freeze the card before another transaction lands. A nightly batch produces the answer the next morning, when the card has already been used twenty more times. An hourly batch produces the answer up to an hour after the fact, which is still far too slow. The freshness tier is 1 or 2. The right architecture is a streaming consumer that maintains a per-card sliding window over recent transactions. The cost is real but the alternative is unacceptable.
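
A per-card sliding window can be sketched in a few lines. This is a simplified in-memory illustration, not a production design: it assumes transactions for a card arrive in timestamp order, keeps all state in one process, and uses hypothetical names (`process_transaction`, `recent`) and integer second timestamps.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_DISTINCT_MERCHANTS = 5

# card_id -> deque of (timestamp_seconds, merchant_id), oldest first.
recent: dict[str, deque] = defaultdict(deque)


def process_transaction(card_id: str, merchant_id: str, ts: int) -> bool:
    """Record one transaction; return True if the card should be flagged."""
    window = recent[card_id]
    window.append((ts, merchant_id))
    # Evict transactions that have fallen out of the last-minute window.
    while window and window[0][0] <= ts - WINDOW_SECONDS:
        window.popleft()
    distinct_merchants = {merchant for _, merchant in window}
    return len(distinct_merchants) > MAX_DISTINCT_MERCHANTS


# Six different merchants within 50 seconds: the sixth transaction trips the alert.
burst = [process_transaction("card-1", f"m{i}", i * 10) for i in range(1, 7)]
# Six different merchants spread over three minutes: never more than two in any window.
spread = [process_transaction("card-2", f"m{i}", i * 30) for i in range(1, 7)]

print(burst)
print(spread)
```

A real stream processor would keep this state in a fault-tolerant store and handle out-of-order events; the sketch shows only the windowing logic that a batch job cannot run quickly enough.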

Case 3: An Ad Spend Dashboard That Updates Hourly

The growth team wants a dashboard of cost per acquisition by ad campaign, updated hourly during business hours. The team uses the dashboard to pause underperforming campaigns within the same workday. Daily is too slow; sub-second is overkill. The freshness tier is 3 (same day). The right architecture is an hourly batch that runs at the top of every hour, reads the last hour of clicks and conversions, and writes the result. Streaming would work but cost several times as much. Daily would lead the team to keep paying for bad campaigns until tomorrow.

| Case | Tier | Right Choice |
| --- | --- | --- |
| Daily signups dashboard | Tier 4 (daily) | Nightly batch; cheapest, simplest, fits the consumer |
| Fraud transaction alert | Tier 1 or 2 (seconds) | Streaming; the latency justifies the cost |
| Ad spend dashboard | Tier 3 (same day) | Hourly batch; balances freshness with cost |

The Three-Question Test

The simple rule for picking batch or streaming:
  • What decision is made with this data, and how often is that decision made?
  • How long can the consumer wait between event and answer without harm?
  • Does the consumer's tolerance match a tier 4 or 5 (batch) or a tier 1 or 2 (streaming)?
If the answers point at tier 4 or 5, batch wins on cost and simplicity. If the answers point at tier 1 or 2, streaming wins because batch literally cannot meet the freshness floor. Tier 3 is the genuinely interesting case: an hourly batch usually beats a streaming pipeline on cost while still meeting the consumer's need, but the right answer depends on the volume and the existing infrastructure. Asking the three questions in order resolves the choice in nearly every case.
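
The last step of the test, going from tier to rhythm, can be written down as a small lookup. The function name and the wording of the returned notes are illustrative assumptions; the mapping itself follows the rule stated above, including the caveat for tier 3.

```python
def pick_rhythm(tier: int) -> tuple[str, str]:
    """Map a freshness tier (1 = fastest, 5 = slowest) to a rhythm and a short reason."""
    if tier in (1, 2):
        return "streaming", "batch cannot meet the freshness floor"
    if tier == 3:
        return "batch", "hourly usually suffices; revisit if volume or existing infrastructure favors streaming"
    if tier in (4, 5):
        return "batch", "wins on cost and simplicity"
    raise ValueError(f"unknown tier: {tier}")


# The three cases from this section, by the tier each was assigned.
cases = {
    "Daily signups dashboard": 4,
    "Fraud transaction alert": 1,
    "Ad spend dashboard": 3,
}
choices = {name: pick_rhythm(tier)[0] for name, tier in cases.items()}
print(choices)
```

The lookup is deliberately blunt; the judgment lives in the tier assignment and in the tier 3 caveat, not in the code.
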
Batch Wins When
  • Consumer reads on a schedule (morning standup, end-of-day report)
  • Freshness tolerance is hours or days
  • Cost efficiency matters more than latency
  • Failure recovery means rerunning a clean partition
Streaming Wins When
  • Consumer reacts to events as they happen
  • Freshness tolerance is seconds or single-digit minutes
  • A late answer is worse than no answer
  • Volume is high enough that buffering hours of data hurts

What This Means in Practice

Most companies need batch for almost everything and streaming for a few specific consumers. The mistake is treating the two as competing philosophies. They are tools with different cost profiles, suited to different freshness tiers. A mature data platform has both, used where they fit. The conversation about which to use is not a religious debate; it is a freshness conversation followed by a cost conversation. Naming the tier before naming the tool keeps the conversation honest.
Batch fits tier 3 to 5 freshness needs and dominates on cost.
Streaming fits tier 1 to 2 freshness needs and is the only option when batch cannot meet the floor.
The three-question test resolves nearly every batch-vs-streaming decision in practice.
Do
  • Default to batch and graduate to streaming for the specific consumers that need it
  • Translate any real-time request into a numeric freshness tier before picking an architecture
  • Name the cost difference explicitly so consumers can opt in or out of the tier they think they want
Don't
  • Build streaming for everything because it sounds modern; the cost compounds
  • Build batch for tier 1 needs because it is simpler; consumers will work around the pipeline
  • Skip the freshness conversation; tools chosen without it are usually the wrong tools
PUTTING IT ALL TOGETHER

> A media subscription company has three new dashboard requests in the same week. The CFO wants daily revenue at 7am Pacific. The growth team wants signup performance during a flash sale, updated within minutes. The product team wants weekly retention curves on Monday mornings. The data engineer is asked to design all three with a clear story for batch versus streaming.

Each request maps to a freshness tier. Daily revenue is tier 4 (next morning). Flash-sale signups are tier 2 (under 15 minutes). Weekly retention is tier 5 (calendar-driven). The tiers, not the tools, drive the architecture.
Tier 4 and tier 5 use batch. The CFO's revenue dashboard runs nightly; the retention report runs weekly on Sunday. Both fit the four pipeline roles from Lesson 1: source, transform, storage, consumer, with the transform on a scheduled cadence.
Tier 2 uses streaming. The flash-sale dashboard reads from the same Kafka topic of signup events but with a continuous consumer that emits aggregated counts every few seconds. The pipeline still has the four roles; the rhythm is the only thing that changes.
Cost is named explicitly. The streaming pipeline costs roughly five times what an hourly batch would. The growth team is told this and confirms the latency is worth it for the duration of the sale. After the sale, the streaming consumer can be retired or downgraded to hourly.
KEY TAKEAWAYS
Two basic rhythms move data: batch processes a chunk on a schedule; streaming processes each event as it arrives. The same numbers can come out of either.
Batch is a function; streaming is a service: every operational difference (cost, failure handling, freshness floor) flows from this one structural fact.
Real-time means almost nothing without a number: translate the request into one of five freshness tiers (sub-second, under 15 min, under 2 hr, daily, slower) before picking architecture.
Most consumers live at tier 3 or 4: batch is the default and meets nearly every freshness need at the lowest cost. Streaming is reserved for tiers 1 and 2.
The three-question test picks the rhythm: what decision, how often, how long can the consumer wait. The answers point at the tier, and the tier picks the architecture.

Batch vs Streaming: Beginner

Data moves in scheduled chunks or in a continuous flow; the choice changes everything downstream

Category: Pipeline Architecture
Difficulty: Beginner
Duration: 25 minutes
Challenges: 0 hands-on challenges
