Pipeline Architecture

Stream Processing vs Batch Processing

At the ingestion layer, your choice between batch and streaming becomes a contract with every downstream system. Batch promises eventual correctness and cheaper compute; streaming promises fresh data and event-time reasoning. Real platforms run both, side by side, with a shared storage layer that tolerates reprocessing. This guide maps the four frameworks that occupy the batch-to-streaming spectrum and tells you where each one slots into a production pipeline.

For the conceptual overview of batch vs streaming (when to use each, cost trade-offs, interview scenarios), see Batch vs Streaming Pipelines. This page focuses on specific frameworks and implementation trade-offs.

By the numbers: 4 frameworks compared · 120 pipeline challenges · 17% of L6 staff rounds · 3% pure system-design share.

Source: DataDriven analysis of 1,042 verified data engineering interview rounds.

Framework Profiles

Each framework below occupies a specific point on the batch-streaming continuum defined by three axes: latency budget, state size, and exactly-once guarantees. Pick the one whose defaults match your SLA and your team's operational maturity.

Apache Spark (Batch Mode)

Batch

Spark's batch mode is the workhorse of large-scale data processing. It reads a bounded dataset, processes it across a cluster, and writes the result. Most production data engineering pipelines use Spark batch because it handles terabyte-scale transforms reliably and the ecosystem is mature.

Strengths

  • Extremely well-documented with a large talent pool
  • Handles terabyte-scale batch workloads efficiently
  • Supports SQL (Spark SQL), Python (PySpark), Scala, and Java
  • Catalyst optimizer automatically improves query plans
  • Integrates with every major data warehouse and lake

Weaknesses

  • Not designed for sub-second latency
  • Job startup overhead makes it inefficient for small datasets
  • Shuffle-heavy operations can be expensive and hard to tune

Best for:

Daily/hourly warehouse refreshes, ETL at scale, ML feature engineering, backfills. If your data arrives in bounded chunks and hours of latency is acceptable, Spark batch is the default choice.
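
The "idempotent reruns" property that makes batch the safe default can be sketched in plain Python (no Spark required; the function name and file layout here are illustrative, not any real API): a job that recomputes and overwrites its whole output partition can be rerun after a failure without double-counting.

```python
import json
from pathlib import Path

def refresh_partition(events, out_dir, ds):
    """Recompute one date partition from scratch and overwrite it.

    Rewriting the whole partition (rather than appending) makes the job
    idempotent: rerunning it after a failure yields the same output.
    """
    partition = Path(out_dir) / f"ds={ds}"
    partition.mkdir(parents=True, exist_ok=True)
    rows = [e for e in events if e["ds"] == ds]   # bounded input slice
    totals = {}
    for e in rows:                                # example transform: per-user totals
        totals[e["user"]] = totals.get(e["user"], 0) + e["amount"]
    (partition / "part-0000.json").write_text(json.dumps(totals))
    return totals

events = [
    {"ds": "2024-05-01", "user": "a", "amount": 3},
    {"ds": "2024-05-01", "user": "a", "amount": 4},
    {"ds": "2024-05-02", "user": "b", "amount": 1},
]
print(refresh_partition(events, "/tmp/warehouse/daily_totals", "2024-05-01"))
# A second run overwrites the same partition with the same result.
```

This overwrite-partition pattern is why the comparison table below lists Spark batch's exactly-once story as "N/A (idempotent reruns)": correctness comes from rerunnability, not from transactional delivery.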

Apache Spark Structured Streaming

Streaming

Spark Structured Streaming extends the batch Spark API to handle continuous data. It uses micro-batches by default: it collects events for a short window (100ms to minutes), processes them as a small batch, then repeats. This is not true event-at-a-time processing, but it is close enough for many use cases.
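
The micro-batch model can be sketched in a few lines of plain Python. This is a simplification of Structured Streaming's trigger intervals, not its actual API: events are grouped by trigger interval and each group is processed as a small bounded job, which is why end-to-end latency can never drop below the interval itself.

```python
def micro_batch(stream, interval_ms):
    """Group a timestamped event stream into micro-batches.

    Each batch covers one trigger interval; the engine then runs each
    group as a small bounded job. Latency is bounded below by the
    interval: an event arriving at t=50ms waits until its batch fires.
    """
    batches = {}
    for arrival_ms, event in stream:
        batch_id = arrival_ms // interval_ms       # which trigger interval?
        batches.setdefault(batch_id, []).append(event)
    return [batches[k] for k in sorted(batches)]

stream = [(50, "a"), (240, "b"), (310, "c"), (620, "d")]
print(micro_batch(stream, 200))   # -> [['a'], ['b', 'c'], ['d']]
```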

Strengths

  • Same API as Spark batch, so the learning curve is minimal
  • Exactly-once processing with checkpointing
  • Good for teams already using Spark who need streaming capabilities
  • Supports watermarks for late-arriving events
  • Continuous processing mode available for lower latency (experimental)

Weaknesses

  • Micro-batch architecture adds latency (typically 100ms-1s minimum)
  • Higher resource consumption than purpose-built streaming engines
  • Checkpoint management can be complex during schema evolution

Best for:

Teams already using Spark batch who need near-real-time capabilities without adopting an entirely new framework. Good for latencies in the seconds-to-minutes range.

Apache Flink

Streaming

Flink is a true event-at-a-time streaming engine. Unlike Spark Structured Streaming, Flink processes each event as it arrives, without micro-batching. This gives it lower latency and more precise event-time semantics. Flink also handles batch workloads, but its strength is streaming.

Strengths

  • True event-at-a-time processing (millisecond latency)
  • Best-in-class event-time handling with watermarks and late data
  • Exactly-once state consistency with lightweight checkpointing
  • Flink SQL for stream processing without Java/Scala code
  • Scales to millions of events per second

Weaknesses

  • Smaller community and talent pool compared to Spark
  • Steeper learning curve for stateful stream processing
  • Operational overhead: managing checkpoints, savepoints, and state backends
  • Debugging is harder than batch processing

Best for:

Low-latency event processing where sub-second response times matter. Fraud detection, real-time recommendations, anomaly detection, session windowing. If Spark Structured Streaming's micro-batch latency is not fast enough, Flink is the answer.
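
Flink's core event-time mechanics, tumbling windows closed by a watermark that trails the highest event time seen, can be sketched in plain Python. This is a toy model, not Flink's API; the window size and allowed lateness are illustrative.

```python
def window_with_watermark(events, window_ms, lateness_ms):
    """Tumbling event-time windows closed by a watermark.

    events: (event_time_ms, value) pairs in arrival order. The watermark
    trails the max event time seen by lateness_ms; a window is emitted
    once the watermark passes its end, and events arriving after that
    are dropped as too late.
    """
    open_windows, emitted, dropped = {}, [], []
    max_seen = 0
    for ts, value in events:
        max_seen = max(max_seen, ts)
        watermark = max_seen - lateness_ms
        start = ts - ts % window_ms
        if start + window_ms <= watermark:
            dropped.append(value)                  # arrived after its window closed
        else:
            open_windows.setdefault(start, []).append(value)
        for s in sorted(open_windows):
            if s + window_ms <= watermark:         # watermark passed window end
                emitted.append((s, open_windows.pop(s)))
    for s in sorted(open_windows):                 # end of stream: flush the rest
        emitted.append((s, open_windows.pop(s)))
    return emitted, dropped

emitted, dropped = window_with_watermark(
    [(10, "a"), (120, "b"), (90, "c"), (260, "d"), (30, "e")],
    window_ms=100, lateness_ms=50)
print(emitted)  # -> [(0, ['a', 'c']), (100, ['b']), (200, ['d'])]
print(dropped)  # -> ['e']
```

Note that "c" (event time 90) arrives out of order but still lands in the [0, 100) window because the watermark has not yet passed 100, while "e" arrives after that window closed and is dropped. This is exactly the late-data reasoning interviewers probe for.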

Kafka Streams

Streaming

Kafka Streams is a client library (not a cluster framework) for processing data stored in Kafka topics. It runs as part of your application, not as a separate cluster. This makes it lightweight and easy to deploy, but it is limited to Kafka-to-Kafka processing.

Strengths

  • No separate cluster to manage (runs inside your application)
  • Exactly-once semantics within the Kafka ecosystem
  • Simple deployment: just another Java/Kotlin application
  • Good for lightweight transformations, enrichments, and aggregations
  • Scales horizontally by adding more application instances

Weaknesses

  • Tied to Kafka: input and output must be Kafka topics
  • Not suitable for complex event processing or heavy computation
  • Limited windowing compared to Flink
  • No built-in support for batch processing

Best for:

Lightweight event transformations between Kafka topics. Enrichment, filtering, simple aggregations, and routing. If your data is already in Kafka and the processing is not computationally heavy, Kafka Streams avoids the overhead of a full streaming cluster.
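
The enrich-filter-route pattern looks roughly like this. Kafka Streams itself is a Java/Kotlin DSL (`stream().filter().mapValues().to(topic)`); this plain-Python sketch, with a hypothetical user lookup table standing in for a KTable, only mimics the per-event flow.

```python
# Conceptual sketch of a Kafka-Streams-style topology in plain Python.
# USER_TABLE plays the role of a lookup table; topic names are invented.
USER_TABLE = {"u1": {"country": "DE"}, "u2": {"country": "US"}}

def process(event, topics):
    user = USER_TABLE.get(event["user_id"])
    if user is None:                    # filter: drop events we cannot enrich
        return
    enriched = {**event, **user}        # enrich with lookup data
    topic = f"clicks-{enriched['country'].lower()}"   # route by country
    topics.setdefault(topic, []).append(enriched)

topics = {}
for e in [{"user_id": "u1", "page": "/home"}, {"user_id": "u3", "page": "/x"}]:
    process(e, topics)
print(sorted(topics))  # -> ['clicks-de']
```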

Side-by-Side Comparison

| Dimension | Spark Batch | Spark Streaming | Flink | Kafka Streams |
| --- | --- | --- | --- | --- |
| Processing Model | Bounded datasets processed in full | Micro-batches (100ms-1s intervals) | True event-at-a-time | Event-at-a-time (within Kafka) |
| Latency | Minutes to hours | Seconds to minutes | Milliseconds to seconds | Milliseconds to seconds |
| Throughput | Very high (optimized for bulk) | High | High | Moderate (single-app model) |
| Exactly-Once | N/A (idempotent reruns) | Yes (with checkpointing) | Yes (lightweight checkpoints) | Yes (Kafka transactions) |
| Deployment | Cluster (YARN, K8s, EMR) | Same Spark cluster | Dedicated Flink cluster | Application-embedded (no cluster) |
| Learning Curve | Moderate | Low (if you know Spark) | Steep | Low-Moderate |
| Ecosystem | Massive (PySpark, SQL, MLlib) | Same as Spark batch | Growing (Flink SQL, PyFlink) | Limited to Kafka ecosystem |

Which Framework for Which Scenario?

Six real scenarios and the framework that fits each one. In interviews, the reasoning matters more than the answer. Explain why you chose the framework, not just which one.

"Daily warehouse refresh processing 500 GB"

Spark Batch

Bounded dataset, hours of latency acceptable, well-understood workload. Spark batch is the obvious choice. No reason to introduce streaming complexity.

"Near-real-time dashboard updated every 30 seconds"

Spark Structured Streaming

30-second latency is well within micro-batch range. If the team already uses Spark, Structured Streaming avoids adopting a new framework.

"Fraud detection: block transactions in under 100ms"

Flink

Sub-100ms latency requires true event-at-a-time processing. Flink's event-time semantics and exactly-once guarantees make it the right tool.

"Enrich Kafka events with lookup data before routing to topics"

Kafka Streams

Lightweight transformation within the Kafka ecosystem. No need for a full cluster framework. Kafka Streams handles this with minimal operational overhead.

"ML feature pipeline computing 200+ features from click events"

Flink or Spark Streaming

Complex computation on event streams. Flink if latency matters, Spark Streaming if the team already has Spark expertise. Both handle stateful aggregations well.

"Backfill 3 years of historical data into a new schema"

Spark Batch

Backfills are the definition of batch processing. Large bounded dataset, no latency requirement, needs high throughput. Spark batch excels here.

Hybrid Architectures

In practice, most data platforms combine batch and streaming. These three patterns show how.

Lambda Architecture

Runs both a batch layer (complete, accurate, slow) and a speed layer (approximate, fast) in parallel. The batch layer periodically corrects the speed layer. This was the original approach for combining batch and streaming, but it is operationally complex because you maintain two codepaths.

High accuracy and low latency at the cost of maintaining two separate systems with potentially different code.
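
A minimal sketch of the Lambda read path, assuming hypothetical batch and speed views keyed the same way: the serving layer sums the exact batch result with the speed layer's delta for events the batch layer has not yet covered.

```python
def serve(key, batch_view, speed_view):
    """Lambda-style read path: exact batch result plus speed-layer delta.

    The batch view holds exact counts up to the last batch run; the speed
    view holds approximate counts since then. When the next batch run
    lands, it recomputes that range and the speed delta is discarded.
    """
    return batch_view.get(key, 0) + speed_view.get(key, 0)

batch_view = {"page:/home": 1000}   # recomputed nightly from raw events
speed_view = {"page:/home": 37}     # incremental counts since last batch run
print(serve("page:/home", batch_view, speed_view))  # -> 1037
```

The operational pain lives outside this function: the batch and speed layers compute `batch_view` and `speed_view` with two separate codebases that must agree on semantics.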

Kappa Architecture

Uses a single streaming pipeline for everything. Batch workloads are treated as a special case of streaming (replay the full topic from the beginning). Simplifies operations because there is one codebase, but requires a streaming engine capable of handling both real-time and reprocessing workloads.

Simpler operations and one codebase at the cost of higher infrastructure requirements for the streaming layer.
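
The Kappa idea reduces to one processing function applied at different offsets. A plain-Python sketch, with the topic modeled as a list and names illustrative:

```python
def build_view(topic, from_offset=0):
    """Kappa-style processing: one function for both live and reprocess.

    Live consumption starts from the latest committed offset; a backfill
    or schema migration is the same code replaying from offset 0.
    """
    view = {}
    for record in topic[from_offset:]:
        view[record["key"]] = view.get(record["key"], 0) + record["value"]
    return view

topic = [{"key": "a", "value": 1}, {"key": "a", "value": 2}, {"key": "b", "value": 5}]
assert build_view(topic) == build_view(topic, 0)   # full replay == "batch" result
print(build_view(topic))  # -> {'a': 3, 'b': 5}
```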

Streaming Ingest + Batch Transform

The most common hybrid in practice. Events stream into a landing zone (S3, GCS) or a message queue (Kafka) in real time. A batch job runs periodically (hourly, daily) to transform and load the data into the warehouse. Combines the reliability of batch with the freshness of streaming ingest.

Practical and reliable. Latency is bounded by the batch interval, but that is acceptable for most analytics workloads.
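
A plain-Python sketch of the pattern, with the landing zone modeled as an in-memory dict keyed by hourly partition (field names and the hourly granularity are illustrative):

```python
from collections import defaultdict

def land(event, landing_zone):
    """Streaming ingest: append each event to its hourly partition on arrival."""
    hour = event["ts"][:13]                      # e.g. '2024-05-01T14'
    landing_zone[hour].append(event)

def batch_transform(landing_zone, hour):
    """Periodic batch job: read one closed partition, aggregate, load."""
    totals = defaultdict(int)
    for e in landing_zone[hour]:
        totals[e["page"]] += 1
    return dict(totals)

landing_zone = defaultdict(list)
for e in [{"ts": "2024-05-01T14:02:11", "page": "/home"},
          {"ts": "2024-05-01T14:40:09", "page": "/home"},
          {"ts": "2024-05-01T15:01:00", "page": "/docs"}]:
    land(e, landing_zone)
print(batch_transform(landing_zone, "2024-05-01T14"))  # -> {'/home': 2}
```

The batch job only reads partitions whose hour has fully elapsed, which is what bounds end-to-end latency at the batch interval.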

The interview signal interviewers look for

Interviewers do not want you to name every framework. They want you to match the right tool to the latency requirement and explain the trade-offs. Saying "I would use Spark batch here because the data can be hours old and batch is simpler, cheaper, and easier to debug" is a stronger answer than proposing Flink for every scenario.

Frequently Asked Questions

Should I learn batch or stream processing first?

Batch first, always. Batch processing is simpler, more common in production, and tested more frequently in interviews. Most data engineering pipelines are batch. Streaming is a specialization you add once batch fundamentals are solid. If you skip batch and go straight to Flink, you will struggle with the foundational concepts that streaming builds on.

Is Spark Structured Streaming real streaming?

Technically, no. It uses micro-batches by default, processing events in small intervals rather than one at a time. For many use cases, this distinction does not matter. If you need latencies under 1 second, Flink's event-at-a-time model is a better fit. If seconds-level latency is acceptable, Structured Streaming works well and has the advantage of using the same API as Spark batch.

When do interviewers ask about stream processing?

System design rounds at companies processing large event streams (Uber, Netflix, LinkedIn, Spotify). The interviewer gives you a scenario and you need to decide between batch and streaming, then name specific tools. The most important signal is knowing when NOT to use streaming. Defaulting to batch when latency permits shows engineering maturity.

Can Flink replace Spark entirely?

Flink can handle both batch and streaming workloads, but in practice, most organizations use Spark for batch and Flink for streaming. Spark's batch ecosystem is more mature, has a larger talent pool, and integrates with more tools. Flink's batch support is good but not yet the industry default. Over time, Flink may close this gap, but in 2026, the typical setup is still Spark for batch and Flink for streaming.

Four Frameworks. One Architecture Diagram.

Know where each one plugs into the pipeline. That's the staff-level answer.