Pipeline Architecture

Batch vs Streaming Data Pipelines

System design appears in only 2.8% of DE interview rounds. But when it does, batch vs streaming is the defining question. Most pipelines should be batch. Streaming adds real value only when low latency is a hard requirement.

SQL and Python dominate the vast majority of DE interview loops. System design is rare, but getting it wrong signals you cannot reason about trade-offs.

Batch Processing

The Reliable Workhorse

Collect data over a time window (hourly, daily, weekly), then process it all at once. The pipeline runs on a schedule, produces output, and stops. Think: "run every night at 2 AM."
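The pattern can be sketched in a few lines of Python. This is a toy in-memory example, not a real pipeline: in practice the events would be rows read from a source database or object store, and a scheduler (cron, Airflow) would invoke the job once per night for the previous day's window.

```python
from datetime import datetime, timedelta

# Toy event log: (event_time, amount). A real job would read these rows
# from a source system instead of an in-memory list.
events = [
    (datetime(2024, 1, 1, 23, 50), 10.0),
    (datetime(2024, 1, 2, 0, 5), 20.0),
    (datetime(2024, 1, 2, 13, 30), 5.0),
    (datetime(2024, 1, 3, 1, 0), 7.5),
]

def run_daily_batch(events, day):
    """Process one day's window all at once: filter, aggregate, emit output."""
    start = datetime(day.year, day.month, day.day)
    end = start + timedelta(days=1)
    window = [amt for ts, amt in events if start <= ts < end]
    return {"day": start.date().isoformat(), "count": len(window), "total": sum(window)}

# The scheduler runs this once, it produces output, and it stops.
report = run_daily_batch(events, datetime(2024, 1, 2))
print(report)  # {'day': '2024-01-02', 'count': 2, 'total': 25.0}
```

Note the defining property: the job has a clear start, a bounded input window, and a clear end.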

Stream Processing

The Speed Specialist

Process each event as it arrives, continuously. The pipeline is always running, consuming from an event stream. Think: "process every click the moment it happens."
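The contrast with batch shows up directly in code shape: a streaming consumer is an always-on loop that updates state per event. Below is a minimal sketch where a Python generator stands in for a real event source (a Kafka topic or Kinesis shard); the loop body is where a real consumer would emit an alert or update a live dashboard.

```python
def event_stream():
    """Stand-in for a real event source; a production consumer would poll a broker."""
    for click in ({"user": "a", "page": "/home"},
                  {"user": "b", "page": "/buy"},
                  {"user": "a", "page": "/buy"}):
        yield click

# State is updated the moment each event arrives -- there is no "end of batch".
page_counts = {}
for event in event_stream():
    page_counts[event["page"]] = page_counts.get(event["page"], 0) + 1
    # react immediately here: fire an alert, push to a dashboard, etc.

print(page_counts)  # {'/home': 1, '/buy': 2}
```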

Detailed Comparison

Latency

Batch

Minutes to hours. Data is collected over a window (hourly, daily), then processed all at once. Most analytics workloads are fine with this delay.

Streaming

Seconds to milliseconds. Events are processed as they arrive. Necessary when stale data has real consequences.

Complexity

Batch

Lower. Easier to reason about, test, and debug. A batch job either succeeds or fails. Retries are straightforward: re-run the whole batch.

Streaming

Higher. You deal with late-arriving events, ordering guarantees, exactly-once semantics, and backpressure. More failure modes, harder to test.

Cost

Batch

Generally cheaper. Compute spins up, processes the batch, and shuts down. You pay for what you use. Spot instances work well for batch jobs.

Streaming

Generally more expensive. Consumers must run continuously, even during low-traffic periods. Infrastructure is always on.

Tooling

Batch

Well-established: Airflow, dbt, Spark (batch mode), cron jobs, cloud-native schedulers. Large talent pool. Plenty of documentation.

Streaming

More specialized: Kafka, Flink, Spark Streaming, Kinesis, Pub/Sub. Smaller talent pool. Operational overhead is higher.

Debugging

Batch

Easier. You can inspect the input, re-run the job, and compare outputs. Logs are scoped to a single run. Failed batches are isolated.

Streaming

Harder. Issues may be intermittent, tied to event ordering, or caused by late data. Reproducing a bug often requires replaying events from a specific offset.
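Offset-based replay is worth illustrating, because it is the main tool you have for reproducing streaming bugs. Systems like Kafka let a consumer seek back to a specific offset in a partitioned log; the toy list below stands in for that log.

```python
# A partitioned log addresses each event by offset. Replaying from a known
# offset reproduces exactly the input sequence that triggered a bug.
log = [{"offset": i, "value": v} for i, v in enumerate([3, 1, 4, 1, 5, 9])]

def replay(log, from_offset):
    """Re-consume events starting at a specific offset (what Kafka's seek enables)."""
    return [e["value"] for e in log if e["offset"] >= from_offset]

print(replay(log, from_offset=3))  # [1, 5, 9]
```

Compare this with batch debugging, where "replay" is just re-running the job on the same input files.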

When to Use

Batch

Daily/hourly reports, data warehouse refreshes, ML training pipelines, backfills, any workload where hours-old data is acceptable.

Streaming

Fraud detection, real-time dashboards, alerting, session tracking, any workload where data must be acted on within seconds.

When to Use Each

Batch

The right choice most of the time

  • Daily data warehouse refresh (most common pipeline type)
  • Weekly or monthly reporting rollups
  • ML model training on historical data
  • Backfilling a new table from years of source data
  • Data quality checks that run after each load
  • Cost allocation and billing calculations

Streaming

When latency is a hard constraint

  • Fraud detection (must block a transaction in milliseconds)
  • Real-time dashboards for operational monitoring
  • Alerting on anomalies (server errors, traffic spikes)
  • User session tracking for personalization
  • IoT sensor data where delays mean safety risks
  • Event-driven microservice communication

How Interviewers Test Batch vs Streaming Knowledge

System design makes up only 2.8% of DE interview rounds, but it carries outsized weight when it appears. You'll get a scenario and need to choose batch or streaming with clear reasoning. The strongest signal is knowing when streaming is unnecessary.

"Design a pipeline for a daily sales report"

Batch. The report is consumed once per day. Hourly or daily batch processing is simpler, cheaper, and sufficient. Proposing streaming here signals you don't consider cost or complexity.

"Design a system to detect fraudulent credit card transactions"

Streaming. A fraudulent transaction must be flagged before the charge completes. Batch processing with even a 1-hour delay means thousands of fraudulent charges go through. Low latency is a hard requirement.

"Design a pipeline to populate a search index"

Depends on the requirements. If search results can be a few hours stale (product catalog), batch is fine. If users expect to see their own content immediately after posting (social feed), you need streaming or near-real-time micro-batching.

"Design a data pipeline for ML model training"

Batch. Model training runs on historical data. You collect a training dataset, train the model, evaluate it, and deploy. Streaming adds no value here. However, model serving (inference) might need low latency, which is a separate pipeline.

The most common interview mistake

Proposing streaming when batch is sufficient. It signals that you optimize for technical complexity over practical trade-offs. Experienced engineers default to the simplest solution that meets the latency requirement. If the data can be hours old, batch wins on every other dimension: cost, complexity, debuggability, and reliability.

Frequently Asked Questions

Should I default to batch or streaming?

Default to batch. Most data engineering workloads do not require sub-second latency. Batch pipelines are simpler to build, test, debug, and operate. Only add streaming when the business requirement genuinely demands low latency. In interviews, saying "batch is sufficient here" shows maturity.
Can you combine batch and streaming in one system?

Yes. This is common. For example, a streaming pipeline handles real-time alerts while a batch pipeline loads the same data into a warehouse for historical analysis. Some teams use the Lambda architecture (parallel batch and streaming paths) or the Kappa architecture (streaming only, with replay for batch-like workloads).
What is micro-batching?

Micro-batching processes data in very small batches (every few seconds or minutes) instead of true event-by-event streaming. Spark Structured Streaming uses this approach. It gives near-real-time latency with batch-like simplicity. It is a good middle ground when sub-second latency is not required but hourly batches are too slow.
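As a toy illustration of the idea (not how Spark implements it), micro-batching amounts to bucketing events into short fixed intervals and processing each bucket as one small batch:

```python
# Toy events: (arrival_time_seconds, value).
events = [(0.2, "a"), (0.9, "b"), (2.1, "c"), (2.5, "d"), (4.8, "e")]

def micro_batches(events, interval_s=2.0):
    """Bucket events into fixed windows; each bucket is processed as one batch."""
    batches = {}
    for t, v in events:
        batches.setdefault(int(t // interval_s), []).append(v)
    return [batches[k] for k in sorted(batches)]

print(micro_batches(events))  # [['a', 'b'], ['c', 'd'], ['e']]
```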
Do data engineering interviews always ask about batch vs streaming?

No. System design appears in only 2.8% of DE interview rounds, but when it does, batch vs streaming is the most common trade-off question. The interviewer describes a scenario and expects you to choose with clear reasoning. The strongest answers acknowledge that batch is the simpler default and only propose streaming when latency is a genuine constraint.

Build Interview-Ready Pipeline Skills

SQL and Python are the two most-tested skills in DE interviews. Every challenge runs real code against real data.