Batch vs Streaming Data Pipelines
System design appears in only 2.8% of data engineering (DE) interview rounds, but when it does, batch vs streaming is the defining question. Most pipelines should be batch. Streaming adds real value only when low latency is a hard requirement.
Two Processing Models
Batch Processing
Collect data over a time window (hourly, daily, weekly), then process it all at once. The pipeline runs on a schedule, produces output, and stops. Think: "run every night at 2 AM."
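A minimal sketch of the batch model, assuming events are plain dictionaries with hypothetical `ts`, `product`, and `amount` fields. The function name `run_daily_batch` is illustrative, not part of any scheduler's API. It processes one day's window in a single pass and then returns, which is the "run, produce output, stop" shape described above.

```python
from collections import defaultdict
from datetime import date, datetime


def run_daily_batch(events, run_date):
    """Aggregate revenue per product for one calendar day, then stop."""
    totals = defaultdict(float)
    for event in events:
        # Only events that fall inside the batch window are processed.
        if event["ts"].date() == run_date:
            totals[event["product"]] += event["amount"]
    return dict(totals)


# Hypothetical sample data standing in for a day's collected input.
events = [
    {"ts": datetime(2024, 1, 1, 9, 0), "product": "a", "amount": 10.0},
    {"ts": datetime(2024, 1, 1, 17, 30), "product": "a", "amount": 5.0},
    {"ts": datetime(2024, 1, 2, 8, 0), "product": "b", "amount": 7.0},
]

daily_totals = run_daily_batch(events, date(2024, 1, 1))
print(daily_totals)
```

In practice a scheduler such as cron or Airflow would invoke a job like this once per window. The job itself holds no state between runs.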
Stream Processing
Process each event as it arrives, continuously. The pipeline is always running, consuming from an event stream. Think: "process every click the moment it happens."
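By contrast, a streaming consumer can be sketched as a loop that updates state after every event. This is an illustration only: `fake_click_stream` is a hypothetical stand-in for a real event source such as a Kafka or Kinesis consumer, and the `user` field is assumed.

```python
from collections import defaultdict


def process_stream(event_source):
    """Consume events one at a time and emit the updated count immediately."""
    clicks_per_user = defaultdict(int)
    for event in event_source:  # with a real source this loop never ends
        clicks_per_user[event["user"]] += 1
        yield event["user"], clicks_per_user[event["user"]]


def fake_click_stream():
    # Hypothetical finite stream so the example terminates.
    for user in ["u1", "u2", "u1"]:
        yield {"user": user}


updates = list(process_stream(fake_click_stream()))
print(updates)
```

The key difference from the batch model is that state lives inside a long-running process and each result is available as soon as its event arrives.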
Detailed Comparison
| Dimension | Batch | Streaming |
|---|---|---|
| Latency | Minutes to hours. Data is collected over a window (hourly, daily), then processed all at once. Most analytics workloads are fine with this delay. | Seconds to milliseconds. Events are processed as they arrive. Necessary when stale data has real consequences. |
| Complexity | Lower. Easier to reason about, test, and debug. A batch job either succeeds or fails. Retries are straightforward: re-run the whole batch. | Higher. You deal with late-arriving events, ordering guarantees, exactly-once semantics, and backpressure. More failure modes, harder to test. |
| Cost | Generally cheaper. Compute spins up, processes the batch, and shuts down. You pay for what you use. Spot instances work well for batch jobs. | Generally more expensive. Consumers must run continuously, even during low-traffic periods. Infrastructure is always on. |
| Tooling | Well-established: Airflow, dbt, Spark (batch mode), cron jobs, cloud-native schedulers. Large talent pool. Plenty of documentation. | More specialized: Kafka, Flink, Spark Streaming, Kinesis, Pub/Sub. Smaller talent pool. Operational overhead is higher. |
| Debugging | Easier. You can inspect the input, re-run the job, and compare outputs. Logs are scoped to a single run. Failed batches are isolated. | Harder. Issues may be intermittent, tied to event ordering, or caused by late data. Reproducing a bug often requires replaying events from a specific offset. |
| When to Use | Daily/hourly reports, data warehouse refreshes, ML training pipelines, backfills, any workload where hours-old data is acceptable. | Fraud detection, real-time dashboards, alerting, session tracking, any workload where data must be acted on within seconds. |
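The "late-arriving events" entry in the Complexity row is easier to see with a small example. The sketch below assumes 60-second event-time windows and a 30-second lateness tolerance. Both constants, and the simple "max event time seen minus tolerance" watermark, are assumptions chosen for illustration rather than the behavior of any particular engine.

```python
from collections import defaultdict

WINDOW_SECONDS = 60      # assumed tumbling-window size
ALLOWED_LATENESS = 30    # assumed tolerance for out-of-order events


def window_counts(event_times):
    """Count events per window; drop events whose window has already closed.

    event_times: event timestamps in seconds, listed in arrival order.
    """
    counts = defaultdict(int)
    dropped = []
    max_seen = None
    for event_time in event_times:
        window_start = (event_time // WINDOW_SECONDS) * WINDOW_SECONDS
        window_end = window_start + WINDOW_SECONDS
        # A window is closed once the watermark has passed its end.
        if max_seen is not None and window_end <= max_seen - ALLOWED_LATENESS:
            dropped.append(event_time)
        else:
            counts[window_start] += 1
        max_seen = event_time if max_seen is None else max(max_seen, event_time)
    return dict(counts), dropped


# Events arrive out of order: 50 is slightly late, 20 is far too late.
counts, dropped = window_counts([10, 70, 50, 130, 20])
print(counts, dropped)
```

A batch job never faces this decision, because it runs after the window has been fully collected. A streaming job has to pick a tolerance and accept that some events will miss it.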
When to Use Each
Batch
- Daily data warehouse refresh (most common pipeline type)
- Weekly or monthly reporting rollups
- ML model training on historical data
- Backfilling a new table from years of source data
- Data quality checks that run after each load
- Cost allocation and billing calculations
Streaming
- Fraud detection (must block a transaction in milliseconds)
- Real-time dashboards for operational monitoring
- Alerting on anomalies (server errors, traffic spikes)
- User session tracking for personalization
- IoT sensor data where delays mean safety risks
- Event-driven microservice communication
How Interviewers Test Batch vs Streaming Knowledge
System design makes up only 2.8% of DE interview rounds, but it carries outsized weight when it appears. You'll get a scenario and need to choose batch or streaming with clear reasoning. The strongest signal is knowing when streaming is unnecessary.
"Design a pipeline for a daily sales report"
Batch. The report is consumed once per day. Hourly or daily batch processing is simpler, cheaper, and sufficient. Proposing streaming here signals you don't consider cost or complexity.
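One reason batch is the simpler choice here is that a failed run can be re-run in full. A minimal sketch, assuming the report is written as one partition per day and that a plain dictionary stands in for the warehouse table (`load_daily_sales` and the `amount` field are hypothetical names):

```python
def load_daily_sales(warehouse, run_date, rows):
    """Overwrite the partition for run_date so a re-run never double counts."""
    warehouse[run_date] = sum(row["amount"] for row in rows)
    return warehouse


warehouse = {}
rows = [{"amount": 20.0}, {"amount": 5.0}]

load_daily_sales(warehouse, "2024-01-01", rows)
load_daily_sales(warehouse, "2024-01-01", rows)  # retry after a failure
print(warehouse)
```

Because the job overwrites rather than appends, running it twice for the same date leaves the same result as running it once.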
"Design a system to detect fraudulent credit card transactions"
Streaming. A fraudulent transaction must be flagged before the charge completes. With batch processing, even a 1-hour delay means every fraudulent charge made in that hour goes through before it is detected. Low latency is a hard requirement.
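A per-event check of this kind might look like the following velocity rule, which flags a card that makes too many transactions within a short window. The thresholds and the names `make_fraud_checker` and `check` are assumptions for illustration. Real fraud systems combine many more signals.

```python
from collections import defaultdict, deque

MAX_TXNS = 3          # assumed limit per window
WINDOW_SECONDS = 60   # assumed sliding-window length


def make_fraud_checker():
    recent = defaultdict(deque)  # card_id -> timestamps inside the window

    def check(card_id, ts):
        """Return True if this transaction should be flagged."""
        window = recent[card_id]
        # Evict timestamps that have fallen out of the sliding window.
        while window and ts - window[0] >= WINDOW_SECONDS:
            window.popleft()
        window.append(ts)
        return len(window) > MAX_TXNS

    return check


check = make_fraud_checker()
decisions = [check("card-1", ts) for ts in (0, 10, 20, 30, 100)]
print(decisions)
```

The decision is made inside the call for each transaction, which is what allows the charge to be blocked before it completes.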
"Design a pipeline to populate a search index"
Depends on the requirements. If search results can be a few hours stale (product catalog), batch is fine. If users expect to see their own content immediately after posting (social feed), you need streaming or near-real-time micro-batching.
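Micro-batching sits between the two models: events are buffered briefly and written in small groups. The sketch below flushes on batch size only. The function name `micro_batches` is hypothetical, and production systems typically also flush on a timer so that a quiet stream does not leave events waiting.

```python
def micro_batches(events, max_batch_size):
    """Group a stream of events into small batches of at most max_batch_size."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) >= max_batch_size:
            yield batch
            batch = []
    if batch:
        # Flush whatever is left when the stream ends.
        yield batch


# Hypothetical document IDs waiting to be written to the search index.
batches = list(micro_batches(range(5), max_batch_size=2))
print(batches)
```

Each yielded batch could be sent to the index in one bulk request, trading a little latency for far fewer writes.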
"Design a data pipeline for ML model training"
Batch. Model training runs on historical data. You collect a training dataset, train the model, evaluate it, and deploy. Streaming adds no value here. However, model serving (inference) might need low latency, which is a separate pipeline.
The most common interview mistake
Know batch vs streaming the way the interviewer who asks it knows it.
Frequently Asked Questions
Should I default to batch or streaming?
Can you combine batch and streaming in one system?
What is micro-batching?
Do data engineering interviews always ask about batch vs streaming?
Build Interview-Ready Pipeline Skills
- 01. Active recall beats re-reading by 50%. Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom.
- 02. 76% of hiring managers reject on the coding task, not the resume. From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice.
- 03. Five problem shapes cover 80% of data engineer loops. Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition.