An e-commerce company at Series C scale ran every analytical pipeline once a night at 2am Pacific. The CFO read the daily revenue dashboard at 8am with morning coffee, satisfied. Then a flash sale launched at noon, and the marketing team needed to know within minutes whether discount codes were converting before they bought another $50,000 in ad spend. The 2am pipeline was useless for that question; the freshest number it could provide was eighteen hours stale. The team patched together a Kafka consumer in three hours, paid for it in panic and sleep loss, and shipped a hotfix that drifted from the canonical revenue number for the next six weeks. The mistake was not the pipeline. The mistake was assuming one processing model fit every consumer. This lesson gives a picture of the two main ways data moves through a pipeline, when each fits, and why the marketing word real-time almost never means what it sounds like.
Two Ways Data Can Move
Recognize the two processing rhythms, batch and streaming, and name what changes between them.
Data moves through a pipeline in one of two basic rhythms. The first rhythm is scheduled. Data piles up for a while, then a job wakes up, processes everything that has accumulated since the last run, and goes back to sleep. The second rhythm is continuous. Each new event flows through the pipeline as it arrives, with no waiting for a scheduled wake-up. Almost every pipeline in production fits into one of these two rhythms, or a hybrid that explicitly mixes them. Naming the rhythm is the first useful skill, because every other architectural choice (compute shape, cost profile, failure handling, freshness expectation) depends on it.
The Two Rhythms
| Rhythm | How Data Moves | What Drives the Cadence |
| --- | --- | --- |
| Batch | Data sits in a source until a job runs and processes the accumulated chunk | A schedule (every hour, every night) or a manual trigger |
| Streaming | Each event flows through transforms as it arrives, one at a time or in tiny groups | The arrival of new data; the pipeline is always running |
| Hybrid (micro-batch) | Tiny scheduled batches that feel continuous from the outside | A short interval (every 10 seconds, every minute) the engine sets |
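The hybrid row can be made concrete with a small sketch. It assumes events carry a timestamp in seconds and arrive already sorted; the `micro_batches` function and the sample events are illustrative, not taken from any particular engine.

```python
from itertools import groupby

def micro_batches(events, interval_s=10):
    """Group (timestamp_seconds, payload) events into fixed-interval batches.

    Assumes events are already sorted by timestamp.
    """
    for window, group in groupby(events, key=lambda e: e[0] // interval_s):
        yield window * interval_s, [payload for _, payload in group]

# Hypothetical events: (seconds since start, payload)
events = [(1, "a"), (4, "b"), (12, "c"), (27, "d"), (29, "e")]

for window_start, batch in micro_batches(events):
    print(window_start, batch)
```

Each yielded batch is processed as one small chunk, which is why micro-batch looks continuous from the outside while still behaving like batch on the inside.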
Both rhythms produce the same end result if the inputs and the logic are the same. A daily count of signups by country can be computed by a 2am job that sums every signup from the previous day, or by incrementing a counter every time a signup event arrives. Either way, the number for that day on the 9am dashboard is the same. What differs is when the result is available, how much compute it costs, and what the pipeline looks like when something breaks. The choice between batch and streaming is not about correctness; it is about tradeoffs.
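A minimal sketch of that equivalence, assuming a small in-memory list of signup events (the data and both function names are made up for illustration):

```python
from collections import Counter

# Hypothetical signup events for one day: (country, timestamp)
signups = [
    ("US", "2024-05-01T09:15"),
    ("DE", "2024-05-01T10:02"),
    ("US", "2024-05-01T13:40"),
    ("FR", "2024-05-01T18:55"),
]

def batch_count(events):
    """Batch: count the whole accumulated chunk in one pass."""
    return Counter(country for country, _ in events)

def streaming_count(events):
    """Streaming: bump a running counter as each event arrives."""
    running = Counter()
    for country, _ in events:
        running[country] += 1  # an up-to-date total exists after every event
    return running

# Same inputs, same logic, same numbers.
assert batch_count(signups) == streaming_count(signups)
print(dict(batch_count(signups)))
```

The only difference is when each total becomes readable: once at the end for the batch version, after every event for the streaming version.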
The defining differences in plain language:
▸Batch waits and processes a chunk; streaming processes each event as it arrives
▸Batch runs sometimes; streaming runs always
▸Batch fails and restarts the chunk; streaming has to handle failure mid-flight
▸Batch optimizes for throughput; streaming optimizes for latency
An Everyday Analogy
A washing machine is batch. Clothes pile up in a hamper through the week. On Sunday a wash cycle runs, processing all the clothes at once. The machine is idle the rest of the week. Cost per wash is low because the cycle amortizes over a full load. A laundromat dryer with a coin slot, running for one shirt at a time as customers walk in, is closer to streaming. The dryer is on whenever anyone is there. Cost per shirt is much higher because the dryer is also drying air between customers. Neither is wrong; the right answer depends on whether the goal is the lowest cost per pound or the shortest time from dirty shirt to dry shirt.
•Batch Mindset
Chunks of data, processed when scheduled
Reads everything that has accumulated, then sleeps
Compute is cheap because the engine spins up and shuts down
Freshness is bounded by the schedule (last hour, last day)
•Streaming Mindset
Individual events, processed as they arrive
Always running; never sleeps
Compute is more expensive because nothing shuts down
Freshness is bounded by the time to push one event through
The Smallest Possible Comparison
```python
# Batch: a job that runs once an hour and processes the last hour of orders.
# read_orders, aggregate_by_country, and write_output are placeholder helpers.
def hourly_batch_job(start_ts, end_ts):
    rows = read_orders(start_ts, end_ts)
    totals = aggregate_by_country(rows)
    write_output(totals, partition=end_ts)


# Streaming: a process that runs forever and updates totals on every event.
# consume_kafka_topic is a placeholder that yields orders as they arrive.
def streaming_consumer():
    for order in consume_kafka_topic('orders'):
        update_running_total(order.country, order.amount)
        flush_to_output_if_dirty()
```
The two snippets do roughly the same job. The batch version has a beginning, a middle, and an end; it is called and it returns. The streaming version is an infinite loop. Every line of operational difference between batch and streaming pipelines flows from this one structural fact. A batch pipeline is a function. A streaming pipeline is a service.
Most companies start with batch and add streaming only when a specific consumer cannot tolerate the wait. Streaming-first architectures are rare and almost always justified by a freshness requirement that batch literally cannot meet.
Batch processes a chunk of accumulated data on a schedule; streaming processes each event as it arrives.
The two rhythms produce the same numbers when given the same inputs and logic; the difference is when, how much, and how it fails.
A batch pipeline is a function with a start and end; a streaming pipeline is a service that never stops running.
TIP
Before reaching for streaming, name the freshness expectation in concrete terms. If the answer is anything coarser than a few minutes, batch is almost always the cheaper and simpler choice.
Batch: Picture, Rhythm, Example
Walk through a batch pipeline run end to end and name the points at which compute starts, runs, and shuts down.
Batch processing is the older of the two rhythms and still the dominant pattern in production. Most analytical work in most companies runs as a batch job, often nightly, sometimes hourly. The pattern is so common that the word pipeline used without qualification almost always means a batch pipeline. Knowing the shape of a batch run cold is the foundation for everything else, because streaming is largely defined by what it changes about that shape.
The Shape of a Batch Run
| Step | What Happens | Typical Duration |
| --- | --- | --- |
| Wake up | Orchestrator triggers the job at the scheduled time | Seconds |
| Read | Job reads the input window (last hour, last day) from the source | Seconds to minutes depending on volume |
| Transform | Job applies cleaning, joins, aggregations to the chunk | Most of the run; minutes to hours at scale |
| Write | Job writes the output to a partition or table | Seconds to minutes |
| Shut down | Compute resources are released; the job ends | Seconds |
The Nightly Run
The canonical example is the nightly run. Sometime between 1am and 4am Pacific, when production traffic is at its lowest, an orchestrator wakes up dozens or hundreds of jobs in dependency order. Each job reads its inputs, applies its transforms, and writes its outputs. The work finishes by 6am or 7am, in time for the morning consumers to read fresh data. By 9am the dashboards show the previous day's numbers. The cycle repeats the next night. Batch is the rhythm of the office workweek: the work happens off-hours so the morning has answers.
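A nightly job of this kind often boils down to one windowed aggregate query. The sketch below is one guess at that shape, using Python's built-in `sqlite3` as a stand-in warehouse; the `orders` and `daily_revenue` tables, their columns, and the `nightly_run` function are all hypothetical.

```python
import sqlite3

# In-memory stand-in for a warehouse; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_ts TEXT, country TEXT, amount REAL)")
conn.execute("CREATE TABLE daily_revenue (run_date TEXT, country TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("2024-05-01 09:15:00", "US", 40.0),
        ("2024-05-01 17:30:00", "US", 60.0),
        ("2024-05-01 22:05:00", "DE", 25.0),
        ("2024-05-02 00:10:00", "US", 99.0),  # belongs to the next day's run
    ],
)

def nightly_run(conn, run_date, next_date):
    """Read one day of orders, aggregate by country, write one day of output."""
    conn.execute(
        """
        INSERT INTO daily_revenue
        SELECT ?, country, SUM(amount)
        FROM orders
        WHERE order_ts >= ? AND order_ts < ?
        GROUP BY country
        """,
        (run_date, run_date, next_date),
    )
    conn.commit()

nightly_run(conn, "2024-05-01", "2024-05-02")
rows = conn.execute(
    "SELECT country, revenue FROM daily_revenue ORDER BY country"
).fetchall()
print(rows)
```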
The query reads exactly one day of data, aggregates it, and writes the result. The next day's run does the same for the next day. The pattern is durable: it has run nearly unchanged in data warehouses since the 1990s, because the basic shape is right. Read a window, transform, write a partition, end.
Why Batch Is Cheap
Batch jobs spin compute up at the start of the run and shut it down at the end. A Snowflake warehouse bills per second of compute time while it is running. A nightly job that runs for forty minutes pays for forty minutes of compute, provided the warehouse suspends when the job finishes. The other twenty-three hours and twenty minutes of the day cost nothing. Cloud compute is elastic specifically so that batch workloads can ramp up to large clusters when work is happening and ramp down to zero between runs. The cost per unit of data of a well-designed batch pipeline is typically the lowest of any architecture.
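The forty-minute arithmetic can be worked through in a few lines. The hourly rate below is a made-up figure, and the comparison assumes the same cluster size in both cases, which is a simplification.

```python
def daily_compute_cost(hours_running, rate_per_hour):
    """Cost of compute that is billed only while it is running."""
    return hours_running * rate_per_hour

RATE_PER_HOUR = 4.0  # hypothetical price, in dollars

nightly_batch = daily_compute_cost(40 / 60, RATE_PER_HOUR)  # forty-minute run
always_on = daily_compute_cost(24, RATE_PER_HOUR)           # never shuts down

print(f"batch: ${nightly_batch:.2f}/day, always-on: ${always_on:.2f}/day")
print(f"ratio: {always_on / nightly_batch:.0f}x")
```

The ratio does not depend on the rate chosen: it is simply 24 hours divided by 40 minutes.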
Common Batch Cadences
| Cadence | When It Fits | Typical Use |
| --- | --- | --- |
| Daily (overnight) | Consumer reads the next morning; cost matters | Executive dashboards, financial reports, ML training data |
| Hourly | Consumer wants same-day freshness without paying for streaming | |
Batch cannot beat its own schedule. A daily pipeline cannot tell anyone what happened in the last hour. An hourly pipeline cannot tell anyone what happened in the last minute. The freshness floor is the schedule itself. For a finance team that reads the dashboard once a morning, that floor is fine. For a fraud team that needs to react to a suspicious transaction within seconds, the floor is unusable. The mismatch between consumer freshness needs and batch cadence is the most common reason streaming gets pulled into a system.
The batch contract in three sentences:
▸The pipeline runs on a schedule and processes the chunk that has accumulated since the last run
▸The result is available some time after the run starts, bounded by how long the run takes
▸Compute is paid for only during the run; idle time is free
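The first two sentences of the contract imply a simple bound on how stale a batch result can get. A rough sketch, assuming runs start exactly on schedule and each run takes the same time (`worst_case_staleness` is an illustrative name):

```python
from datetime import timedelta

def worst_case_staleness(interval, run_duration):
    """Upper bound on data age just before the next run's output lands.

    An event that arrives right after a window closes waits one full
    interval for the next run, plus that run's duration.
    """
    return interval + run_duration

nightly = worst_case_staleness(timedelta(days=1), timedelta(minutes=40))
hourly = worst_case_staleness(timedelta(hours=1), timedelta(minutes=5))

print(nightly)
print(hourly)
```

Shortening the schedule lowers the bound, but the run duration never disappears from it.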
✓Do
Use batch as the default unless a specific consumer cannot wait for the next scheduled run
Match the cadence to the consumer's freshness need; do not over-run if hourly is enough
Partition outputs by the run window so failed runs replay one partition cleanly
✗Don't
Reach for streaming because it sounds more modern; the cost difference is real
Run a batch job continuously by scheduling it every minute; that is just expensive streaming
Confuse a slow batch with a streaming need; profiling first beats redesigning later
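The advice to partition outputs by run window is easiest to see as an overwrite. The sketch below uses a plain dictionary as a hypothetical output table keyed by partition; a real warehouse would use a partition overwrite or a delete-then-insert, but the idea is the same.

```python
# Hypothetical in-memory "table": one entry per run window (partition key).
output_table = {}

def run_batch(partition_key, rows):
    """Aggregate one window and overwrite its partition in a single assignment."""
    totals = {}
    for country, amount in rows:
        totals[country] = totals.get(country, 0) + amount
    output_table[partition_key] = totals  # overwrite, never append

rows = [("US", 40), ("DE", 25), ("US", 60)]

run_batch("2024-05-01", rows)
run_batch("2024-05-01", rows)  # replay after a failure: no double counting

print(output_table)
```

Because a rerun replaces the partition instead of adding to it, replaying a failed window is safe to do any number of times.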
Streaming: Picture, Rhythm, Example
Trace a single event through a streaming pipeline from source queue to output sink.
Stream processing is the second basic rhythm. A streaming pipeline runs continuously. Each new event arrives at the source and flows through the transforms within milliseconds or seconds. There is no concept of a chunk and no concept of a scheduled wake-up. The pipeline is a long-running service, more like a web server than a script. The shape is more recent than batch in mainstream use, dating roughly from the rise of Apache Kafka in the early 2010s and the stream processors that grew up around it: Spark Streaming, Flink, Kafka Streams, Beam.
The Shape of a Streaming Pipeline
| Element | What It Does | What It Looks Like |
| --- | --- | --- |
| Source | Produces events continuously into a queue or log | Kafka topic, Kinesis stream, Pub/Sub topic |
| Consumer process | Reads events as they arrive, applies transforms, emits results | Long-running JVM, Python service, Spark cluster |
| State store | Holds running totals, windows, joins between events | RocksDB on local disk, in-memory cache, external KV store |
| Output sink | Where transformed events land for downstream consumption | Another Kafka topic, a database, a feature store |
The Live Event Feed
The canonical example is a live event feed. A user clicks a button on a website. The click is recorded as an event, sent to a Kafka topic, and within a few seconds shows up on an internal dashboard, a fraud system, an ad attribution model, and a customer support tool. The same event triggers downstream consequences without any nightly run, any 2am wake-up, or any human waiting until morning. The feeling from the consumer side is that the system is alive: things happen, the dashboard updates.
# Sketch of a streaming consumer in Kafka Streams style
The shape is an infinite loop. A consumer reads the next event, updates state, possibly emits an output, commits the offset, and goes back to read the next event. The loop runs forever. There is no return value. There is no scheduled end. The process is started once and expected to keep running. Stopping it is a deliberate operational action, not the natural conclusion of the work.
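The loop described above can be sketched in plain Python. This is not Kafka Streams code: the topic is replaced by a finite list so the example terminates, and the `RunningTotalConsumer` class and `run` function are illustrative names only.

```python
from collections import defaultdict

class RunningTotalConsumer:
    """Keeps per-country running totals; all names here are illustrative."""

    def __init__(self):
        self.totals = defaultdict(float)  # stand-in for a state store
        self.sink = []                    # stand-in for an output topic or table
        self.committed_offset = -1        # last event fully processed

    def handle(self, offset, event):
        country, amount = event["country"], event["amount"]
        self.totals[country] += amount                      # update state
        self.sink.append((country, self.totals[country]))   # emit the new total
        self.committed_offset = offset                      # commit after the work


def run(consumer, source):
    """Read events one at a time.

    With a real topic this loop would block waiting for the next event
    and never return; a finite list ends it here.
    """
    for offset, event in enumerate(source):
        consumer.handle(offset, event)


events = [
    {"country": "US", "amount": 40.0},
    {"country": "DE", "amount": 25.0},
    {"country": "US", "amount": 60.0},
]
consumer = RunningTotalConsumer()
run(consumer, events)
print(consumer.sink[-1], consumer.committed_offset)
```

Committing the offset only after the state update and the emit means a crash mid-event causes that event to be re-read on restart instead of silently skipped.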
Why Streaming Costs More
Streaming pays for compute around the clock. The Kafka consumer process is up at 3am whether or not events are arriving. The Flink cluster is provisioned for peak load even at off-peak hours. State stores like RocksDB hold data in memory or on local disk that costs money to keep around. There is no equivalent of the batch idle period. Cost is roughly proportional to provisioned peak capacity multiplied by wall-clock time, whereas batch cost tracks the volume actually processed. For low-volume topics streaming can still be cheap, but it rarely approaches the cost per event of a well-amortized batch job.
•Batch Cost Profile
Compute is on for the run, off the rest of the day
Cost scales with the size of each chunk plus overhead per run
Idle hours are free; ramp-up amortizes across the chunk
A 1 percent traffic dip lowers cost the next night
•Streaming Cost Profile
Compute is on every minute of every day
Cost scales with provisioned capacity, not actual traffic
Idle hours still cost the same as peak hours
A 1 percent traffic dip lowers nothing; capacity is fixed
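The two cost profiles can be contrasted with a toy model. Every price and size below is invented for illustration; the only point is which inputs each formula depends on.

```python
def batch_cost(events, cost_per_million_events, per_run_overhead, runs=1):
    """Batch: pay for the volume processed plus a fixed overhead per run."""
    return events / 1_000_000 * cost_per_million_events + per_run_overhead * runs

def streaming_cost(provisioned_nodes, node_rate_per_hour, hours=24):
    """Streaming: pay for provisioned capacity; event volume does not appear."""
    return provisioned_nodes * node_rate_per_hour * hours

# Hypothetical day: 100M events, then the same day with a 1 percent dip.
normal_day = batch_cost(100_000_000, cost_per_million_events=0.5, per_run_overhead=1.0)
dip_day = batch_cost(99_000_000, cost_per_million_events=0.5, per_run_overhead=1.0)
stream_day = streaming_cost(provisioned_nodes=4, node_rate_per_hour=2.0)

print(normal_day, dip_day, stream_day)
```

The batch figure moves with traffic; the streaming figure stays put until someone resizes the cluster.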
Common Streaming Use Cases
| Use Case | Why Streaming Fits | Tolerable Latency |
| --- | --- | --- |
| Fraud detection | Decisions must happen before the transaction settles | Sub-second to a few seconds |
| Live operational dashboards | Operators react to events as they happen | Seconds to a minute |
| Real-time personalization | User session is short; recommendations must adapt within the session | A few hundred milliseconds |
| IoT telemetry | Volume is too high to batch economically; alarms are time-critical | Seconds to a minute |
| Change data capture (CDC) | Turns each row write in an operational database into an event stream; downstream replicas must reflect upstream changes within seconds | Seconds to a few minutes |
What Streaming Is Not
Streaming is not a magic faster batch. The latency win is real but bounded; the streaming pipeline still has to read, transform, and write each event, and physics imposes a floor on how fast that can happen. Network hops, serialization, state lookups, and downstream writes each add milliseconds. A typical end-to-end streaming latency in production is between 100 milliseconds and several seconds. Anything sub-100ms requires careful engineering and dedicated hardware. The marketing word real-time hides this floor; the engineering reality respects it.
The streaming contract in three sentences:
▸Each event flows through the pipeline as it arrives, with no waiting for a scheduled run
▸End-to-end latency is bounded by the time to read, transform, and write one event
▸Compute is paid for around the clock; cost scales with provisioned capacity, not actual volume
TIP
When introducing streaming for the first time, name the latency target in milliseconds before picking the engine. Engines have radically different latency floors, and getting that wrong wastes weeks.
What Real-Time Actually Means
Translate a real-time request into a concrete freshness tier and name the architecture each tier requires.
Real-time is the most overloaded phrase in data engineering. A product manager asks for a real-time dashboard and means within an hour. A finance executive asks for real-time revenue and means by the start of the workday. A trading firm asks for real-time and means within five microseconds. The word is so elastic that it carries almost no information. The only useful response to a real-time request is to ask for the actual freshness target in concrete units of time, then translate that target into the simplest pipeline that can meet it.
Five Freshness Tiers
| Tier | Freshness Target | Typical Architecture |
| --- | --- | --- |
| Sub-second | Under 100 milliseconds end to end | Specialized streaming with co-located compute and storage |
| Near real-time | Under 15 minutes | Streaming or micro-batch (Spark Structured Streaming, Flink) |
| Same day | Under 2 hours | Hourly batch or micro-batch every 15 minutes |
| Daily | By the next morning | Nightly batch, runs at 2am, ready by 7am |
| Weekly or slower | On a calendar cadence | Weekly batch, often on a Sunday or Monday morning |
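The tier thresholds can be turned into a small lookup, which is handy when writing a request back as a number. The cut-offs for tiers 1 to 3 come from the table; treating anything up to one day as tier 4 is an assumption, and `freshness_tier` is an illustrative name.

```python
from datetime import timedelta

def freshness_tier(target):
    """Map a freshness target (how stale data may be) to a tier from 1 to 5."""
    if target < timedelta(milliseconds=100):
        return 1  # sub-second
    if target < timedelta(minutes=15):
        return 2  # near real-time
    if target < timedelta(hours=2):
        return 3  # same day
    if target <= timedelta(days=1):
        return 4  # daily (assumed cut-off: up to 24 hours)
    return 5      # weekly or slower

print(freshness_tier(timedelta(minutes=5)))
print(freshness_tier(timedelta(hours=1)))
```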
Each step up in freshness roughly doubles or triples the cost of the tier it replaces. A daily batch pipeline is cheap. An hourly version of the same pipeline costs more because the per-run overhead is repeated 24 times a day. A streaming version costs more again because compute runs continuously. A sub-second streaming pipeline that meets a 50ms target costs more still because it requires careful engineering to remove every avoidable millisecond. Picking a tier that the consumer does not actually need is one of the most common forms of overengineering in data work.
▸Tier 1: Sub-second (genuinely real-time). Sub-100ms end to end. Trading systems, ad bidding, real-time fraud blocking. Streaming with specialized hardware and code paths.
▸Tier 2: Near real-time (a human notices). Under 15 minutes. Live operational dashboards, customer support queues, retrospective fraud review. Streaming or micro-batch fits.
▸Tier 3: Same day (same workday). Under 2 hours. Hourly executive dashboards, ad spend reports, marketing campaigns. Hourly batch is usually enough.
▸Tier 4: Daily (tomorrow morning). Next day. Standard executive dashboards, ML training data, finance reports. Nightly batch is the workhorse.
▸Tier 5: Slower (weekly or monthly). Calendar-driven. Cohort analyses, board reports, retention curves. Weekly or monthly batch on a slow schedule.
Why Real-Time Usually Means Tier 2 or 3
When non-engineers say real-time, they almost always mean tier 2 or tier 3: within fifteen minutes, or within a couple of hours. The dashboard the marketing team wanted at noon is a tier 2 problem, not a tier 1 problem. The CFO's morning revenue is a tier 4 problem dressed up as a tier 2 ask. Translating the request into the right tier saves an enormous amount of engineering work. A tier 4 ask handled with a tier 1 architecture wastes infrastructure money continuously. A tier 2 ask handled with a tier 4 architecture produces an angry product manager.
Questions that translate real-time into a tier:
▸What decision will be made with this data, and how often does that decision happen?
▸How long can the consumer wait between event and action without harm?
▸What does the consumer do today when they cannot get this data?
Symptom: a Flink cluster that costs $4,000 a month for a daily dashboard
Cost shows up as a cloud bill nobody can explain
Fix: replace streaming with a nightly batch; same numbers, 1/20th the cost
The Tier Conversation
Senior engineers do not start architecture conversations with tools. They start with the freshness conversation. A consumer says they want a real-time dashboard. The engineer asks what the dashboard is for, who reads it, and what they do with the answer. Those questions reliably surface the actual tier. Once the tier is named, the architecture follows: tier 4 is a nightly job, tier 3 is hourly, tier 2 is streaming or micro-batch, tier 1 is genuine streaming. Any architecture conversation that skips the tier conversation is going to produce the wrong shape.
Almost every real-time request that reaches a data engineer translates to tier 2 (under 15 minutes) or tier 3 (under 2 hours). Genuine tier 1 (sub-second) is rare and usually has a specific dollar value attached to the latency.
TIP
When a consumer says real-time, write the request back as a number with units before agreeing to anything. The conversation that produces that number is more valuable than the architecture that follows it.
Picking Batch or Streaming
Pick batch or streaming for a simple use case based on the consumer's freshness tier and cost tolerance.
Vocabulary becomes useful when applied to a specific decision. The exercise below picks between batch and streaming for three small concrete cases. The cases are intentionally simple so the choice is visible. Real production decisions are messier, but the same questions apply: what does the consumer need, when do they need it, and what does each option cost.
Case 1: A Marketing Team's Daily Signup Count
The marketing team wants a chart of new signups by country, by day, for the trailing 30 days. The chart is read once a morning at the marketing standup. The numbers do not change after the day closes. The consumer is patient: yesterday's number is fine, today's morning number is a bonus. The freshness tier is 4 (daily). The right architecture is a nightly batch that aggregates the day's signups and writes a small partition. Streaming would work but cost much more for no extra value the consumer cares about.
Case 2: A Fraud Team's Suspicious Transaction Alert
The fraud team wants to be alerted within seconds when a card has more than five distinct merchants in the last minute. The decision is whether to freeze the card before another transaction lands. A nightly batch produces the answer the next morning, when the card has already been used twenty more times. An hourly batch produces the answer up to an hour after the fact, which is still far too slow. The freshness tier is 1 or 2. The right architecture is a streaming consumer that maintains a per-card sliding window over recent transactions. The cost is real but the alternative is unacceptable.
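A per-card sliding window of this kind might look like the sketch below. It assumes transactions for a card arrive in timestamp order and that timestamps are in seconds; the `CardWindow` class and its field names are hypothetical, and a production version would keep this state in a stream processor's state store instead of a Python dictionary.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_DISTINCT_MERCHANTS = 5

class CardWindow:
    """Tracks recent transactions per card and flags merchant bursts."""

    def __init__(self):
        self.recent = defaultdict(deque)  # card_id -> deque of (ts, merchant)

    def observe(self, card_id, merchant, ts):
        """Record one transaction; return True if the card should be flagged."""
        window = self.recent[card_id]
        window.append((ts, merchant))
        # Evict transactions that are a full window old or older.
        while window and window[0][0] <= ts - WINDOW_SECONDS:
            window.popleft()
        distinct = {m for _, m in window}
        return len(distinct) > MAX_DISTINCT_MERCHANTS


detector = CardWindow()
# Six different merchants within 50 seconds on the same card.
flags = [detector.observe("card-1", f"merchant-{i}", ts=i * 10) for i in range(6)]
print(flags)
```

The decision is available the moment the sixth transaction is observed, which is the property no scheduled job can offer.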
Case 3: An Ad Spend Dashboard That Updates Hourly
The growth team wants a dashboard of cost per acquisition by ad campaign, updated hourly during business hours. The team uses the dashboard to pause underperforming campaigns within the same workday. Daily is too slow; sub-second is overkill. The freshness tier is 3 (same day). The right architecture is an hourly batch that runs at the top of every hour, reads the last hour of clicks and conversions, and writes the result. Streaming would work but cost two or three times as much. Daily would lead the team to keep paying for bad campaigns until tomorrow.
| Case | Tier | Right Choice |
| --- | --- | --- |
| Daily signups dashboard | Tier 4 (daily) | Nightly batch; cheapest, simplest, fits the consumer |
| Fraud transaction alert | Tier 1 or 2 (seconds) | Streaming; the latency justifies the cost |
| Ad spend dashboard | Tier 3 (same day) | Hourly batch; balances freshness with cost |
The Three-Question Test
The simple rule for picking batch or streaming:
▸What decision is made with this data, and how often is that decision made?
▸How long can the consumer wait between event and answer without harm?
▸Does the consumer's tolerance match a tier 4 or 5 (batch) or a tier 1 or 2 (streaming)?
If the answers point at tier 4 or 5, batch wins on cost and simplicity. If the answers point at tier 1 or 2, streaming wins because batch literally cannot meet the freshness floor. Tier 3 is the genuinely interesting case: an hourly batch usually beats a streaming pipeline on cost while still meeting the consumer's need, but the right answer depends on the volume and the existing infrastructure. Asking the three questions in order resolves the choice in nearly every case.
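Once the questions have produced a tier, the rule in the paragraph above is small enough to write down. A sketch, with `pick_rhythm` as an illustrative name and the tier 3 wording paraphrased from the text:

```python
def pick_rhythm(tier):
    """Map a freshness tier (1 = sub-second ... 5 = weekly or slower) to a rhythm."""
    if tier in (1, 2):
        return "streaming"
    if tier == 3:
        # The judgment call: hourly batch is the usual answer, not the only one.
        return "hourly batch by default; revisit if volume or existing infrastructure favors streaming"
    if tier in (4, 5):
        return "batch"
    raise ValueError(f"unknown tier: {tier}")

for tier in range(1, 6):
    print(tier, pick_rhythm(tier))
```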
✓Batch Wins When
Consumer reads on a schedule (morning standup, end-of-day report)
Freshness tolerance is hours or days
Cost efficiency matters more than latency
Failure recovery means rerunning a clean partition
✓Streaming Wins When
Consumer reacts to events as they happen
Freshness tolerance is seconds or single-digit minutes
A late answer is worse than no answer
Volume is high enough that buffering hours of data hurts
What This Means in Practice
Most companies need batch for almost everything and streaming for a few specific consumers. The mistake is treating the two as competing philosophies. They are tools with different cost profiles, suited to different freshness tiers. A mature data platform has both, used where they fit. The conversation about which to use is not a religious debate; it is a freshness conversation followed by a cost conversation. Naming the tier before naming the tool keeps the conversation honest.
Batch fits tier 3 to 5 freshness needs and dominates on cost.
Streaming fits tier 1 to 2 freshness needs and is the only option when batch cannot meet the floor.
The three-question test resolves nearly every batch-vs-streaming decision in practice.
✓Do
Default to batch and graduate to streaming for the specific consumers that need it
Translate any real-time request into a numeric freshness tier before picking an architecture
Name the cost difference explicitly so consumers can opt in or out of the tier they think they want
✗Don't
Build streaming for everything because it sounds modern; the cost compounds
Build batch for tier 1 needs because it is simpler; consumers will work around the pipeline
Skip the freshness conversation; tools chosen without it are usually the wrong tools
❯❯❯PUTTING IT ALL TOGETHER
> A media subscription company has three new dashboard requests in the same week. The CFO wants daily revenue at 7am Pacific. The growth team wants signup performance during a flash sale, updated within minutes. The product team wants weekly retention curves on Monday mornings. The data engineer is asked to design all three with a clear story for batch versus streaming.
Each request maps to a freshness tier. Daily revenue is tier 4 (next morning). Flash-sale signups are tier 2 (under 15 minutes). Weekly retention is tier 5 (calendar-driven). The tiers, not the tools, drive the architecture.
Tier 4 and tier 5 use batch. The CFO's revenue dashboard runs nightly; the retention report runs weekly on Sunday. Both fit the four pipeline roles from Lesson 1: source, transform, storage, consumer, with the transform on a scheduled cadence.
Tier 2 uses streaming. The flash-sale dashboard reads from the same Kafka topic of signup events but with a continuous consumer that emits aggregated counts every few seconds. The pipeline still has the four roles; the rhythm is the only thing that changes.
Cost is named explicitly. The streaming pipeline costs roughly five times what an hourly batch would. The growth team is told this and confirms the latency is worth it for the duration of the sale. After the sale, the streaming consumer can be retired or downgraded to hourly.
KEY TAKEAWAYS
Two basic rhythms move data: batch processes a chunk on a schedule; streaming processes each event as it arrives. The same numbers can come out of either.
Batch is a function; streaming is a service: every operational difference (cost, failure handling, freshness floor) flows from this one structural fact.
Real-time means almost nothing without a number: translate the request into one of five freshness tiers (sub-second, under 15 min, under 2 hr, daily, slower) before picking architecture.
Most consumers live at tier 3 or 4: batch is the default and meets nearly every freshness need at the lowest cost. Streaming is reserved for tiers 1 and 2.
The three-question test picks the rhythm: what decision, how often, how long can the consumer wait. The answers point at the tier, and the tier picks the architecture.
Batch vs Streaming: Beginner
Data moves in scheduled chunks or in a continuous flow; the choice changes everything downstream
Category: Pipeline Architecture
Difficulty: beginner
Duration: 25 minutes
Challenges: 0 hands-on challenges
Topics covered: Two Ways Data Can Move, Batch: Picture, Rhythm, Example, Streaming: Picture, Rhythm, Example, What Real-Time Actually Means, Picking Batch or Streaming