Pipeline Architecture

Stream Processing vs Batch Processing

At the ingestion layer, your choice between batch and streaming becomes a contract with every downstream system. Batch promises eventual correctness and cheaper compute; streaming promises fresh data and event-time reasoning. Real platforms run both, side by side, with a shared storage layer that tolerates reprocessing. This guide maps the four frameworks that occupy the batch-to-streaming spectrum and tells you where each one slots into a production pipeline.

For the conceptual overview of batch vs streaming (when to use each, cost trade-offs, interview scenarios), see Batch vs Streaming Pipelines. This page focuses on specific frameworks and implementation trade-offs.

By the numbers: 4 frameworks compared · 120 pipeline challenges · 17% of L6 staff rounds · 3% pure system-design share.

Source: DataDriven analysis of 1,042 verified data engineering interview rounds.

Framework Profiles

Each framework below occupies a specific point on the batch-streaming continuum defined by three axes: latency budget, state size, and exactly-once guarantees. Pick the one whose defaults match your SLA and your team's operational maturity.

Apache Spark (Batch Mode)

Batch

Spark's batch mode is the workhorse of large-scale data processing. It reads a bounded dataset, processes it across a cluster, and writes the result. Most production data engineering pipelines use Spark batch because it handles terabyte-scale transforms reliably and the ecosystem is mature.

Strengths

  • Extremely well-documented with a large talent pool
  • Handles terabyte-scale batch workloads efficiently
  • Supports SQL (Spark SQL), Python (PySpark), Scala, and Java
  • Catalyst optimizer automatically improves query plans
  • Integrates with every major data warehouse and lake

Weaknesses

  • Not designed for sub-second latency
  • Job startup overhead makes it inefficient for small datasets
  • Shuffle-heavy operations can be expensive and hard to tune

Best for:

Daily/hourly warehouse refreshes, ETL at scale, ML feature engineering, backfills. If your data arrives in bounded chunks and hours of latency is acceptable, Spark batch is the default choice.
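
The "idempotent reruns" property that makes batch the safe default can be sketched in plain Python (no Spark required; the function name and file layout here are illustrative, not any real API): a job that recomputes and overwrites its whole output partition can be rerun after a failure without double-counting.

```python
import json
from pathlib import Path

def refresh_partition(events, out_dir, ds):
    """Recompute one date partition from scratch and overwrite it.

    Rewriting the whole partition (rather than appending) makes the job
    idempotent: rerunning it after a failure yields the same output.
    """
    partition = Path(out_dir) / f"ds={ds}"
    partition.mkdir(parents=True, exist_ok=True)
    rows = [e for e in events if e["ds"] == ds]   # bounded input slice
    totals = {}
    for e in rows:                                # example transform: per-user totals
        totals[e["user"]] = totals.get(e["user"], 0) + e["amount"]
    (partition / "part-0000.json").write_text(json.dumps(totals))
    return totals

events = [
    {"ds": "2024-05-01", "user": "a", "amount": 3},
    {"ds": "2024-05-01", "user": "a", "amount": 4},
    {"ds": "2024-05-02", "user": "b", "amount": 1},
]
print(refresh_partition(events, "/tmp/warehouse/daily_totals", "2024-05-01"))
# A second run overwrites the same partition with the same result.
```

This overwrite-partition pattern is why the comparison table below lists Spark batch's exactly-once story as "N/A (idempotent reruns)": correctness comes from rerunnability, not from transactional delivery.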

Apache Spark Structured Streaming

Streaming

Spark Structured Streaming extends the batch Spark API to handle continuous data. It uses micro-batches by default: it collects events for a short window (100ms to minutes), processes them as a small batch, then repeats. This is not true event-at-a-time processing, but it is close enough for many use cases.
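
The micro-batch model can be sketched in a few lines of plain Python. This is a simplification of Structured Streaming's trigger intervals, not its actual API: events are grouped by trigger interval and each group is processed as a small bounded job, which is why end-to-end latency can never drop below the interval itself.

```python
def micro_batch(stream, interval_ms):
    """Group a timestamped event stream into micro-batches.

    Each batch covers one trigger interval; the engine then runs each
    group as a small bounded job. Latency is bounded below by the
    interval: an event arriving at t=50ms waits until its batch fires.
    """
    batches = {}
    for arrival_ms, event in stream:
        batch_id = arrival_ms // interval_ms       # which trigger interval?
        batches.setdefault(batch_id, []).append(event)
    return [batches[k] for k in sorted(batches)]

stream = [(50, "a"), (240, "b"), (310, "c"), (620, "d")]
print(micro_batch(stream, 200))   # -> [['a'], ['b', 'c'], ['d']]
```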

Strengths

  • Same API as Spark batch, so the learning curve is minimal
  • Exactly-once processing with checkpointing
  • Good for teams already using Spark who need streaming capabilities
  • Supports watermarks for late-arriving events
  • Continuous processing mode available for lower latency (experimental)

Weaknesses

  • Micro-batch architecture adds latency (typically 100ms-1s minimum)
  • Higher resource consumption than purpose-built streaming engines
  • Checkpoint management can be complex during schema evolution

Best for:

Teams already using Spark batch who need near-real-time capabilities without adopting an entirely new framework. Good for latencies in the seconds-to-minutes range.

Apache Flink

Streaming

Flink is a true event-at-a-time streaming engine. Unlike Spark Structured Streaming, Flink processes each event as it arrives, without micro-batching. This gives it lower latency and more precise event-time semantics. Flink also handles batch workloads, but its strength is streaming.

Strengths

  • True event-at-a-time processing (millisecond latency)
  • Best-in-class event-time handling with watermarks and late data
  • Exactly-once state consistency with lightweight checkpointing
  • Flink SQL for stream processing without Java/Scala code
  • Scales to millions of events per second

Weaknesses

  • Smaller community and talent pool compared to Spark
  • Steeper learning curve for stateful stream processing
  • Operational overhead: managing checkpoints, savepoints, and state backends
  • Debugging is harder than batch processing

Best for:

Low-latency event processing where sub-second response times matter. Fraud detection, real-time recommendations, anomaly detection, session windowing. If Spark Structured Streaming's micro-batch latency is not fast enough, Flink is the answer.
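
Flink's core event-time mechanics, tumbling windows closed by a watermark that trails the highest event time seen, can be sketched in plain Python. This is a toy model, not Flink's API; the window size and allowed lateness are illustrative.

```python
def window_with_watermark(events, window_ms, lateness_ms):
    """Tumbling event-time windows closed by a watermark.

    events: (event_time_ms, value) pairs in arrival order. The watermark
    trails the max event time seen by lateness_ms; a window is emitted
    once the watermark passes its end, and events arriving after that
    are dropped as too late.
    """
    open_windows, emitted, dropped = {}, [], []
    max_seen = 0
    for ts, value in events:
        max_seen = max(max_seen, ts)
        watermark = max_seen - lateness_ms
        start = ts - ts % window_ms
        if start + window_ms <= watermark:
            dropped.append(value)                  # arrived after its window closed
        else:
            open_windows.setdefault(start, []).append(value)
        for s in sorted(open_windows):
            if s + window_ms <= watermark:         # watermark passed window end
                emitted.append((s, open_windows.pop(s)))
    for s in sorted(open_windows):                 # end of stream: flush the rest
        emitted.append((s, open_windows.pop(s)))
    return emitted, dropped

emitted, dropped = window_with_watermark(
    [(10, "a"), (120, "b"), (90, "c"), (260, "d"), (30, "e")],
    window_ms=100, lateness_ms=50)
print(emitted)  # -> [(0, ['a', 'c']), (100, ['b']), (200, ['d'])]
print(dropped)  # -> ['e']
```

Note that "c" (event time 90) arrives out of order but still lands in the [0, 100) window because the watermark has not yet passed 100, while "e" arrives after that window closed and is dropped. This is exactly the late-data reasoning interviewers probe for.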

Kafka Streams

Streaming

Kafka Streams is a client library (not a cluster framework) for processing data stored in Kafka topics. It runs as part of your application, not as a separate cluster. This makes it lightweight and easy to deploy, but it is limited to Kafka-to-Kafka processing.

Strengths

  • No separate cluster to manage (runs inside your application)
  • Exactly-once semantics within the Kafka ecosystem
  • Simple deployment: just another Java/Kotlin application
  • Good for lightweight transformations, enrichments, and aggregations
  • Scales horizontally by adding more application instances

Weaknesses

  • Tied to Kafka: input and output must be Kafka topics
  • Not suitable for complex event processing or heavy computation
  • Limited windowing compared to Flink
  • No built-in support for batch processing

Best for:

Lightweight event transformations between Kafka topics. Enrichment, filtering, simple aggregations, and routing. If your data is already in Kafka and the processing is not computationally heavy, Kafka Streams avoids the overhead of a full streaming cluster.
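
The enrich-filter-route pattern looks roughly like this. Kafka Streams itself is a Java/Kotlin DSL (`stream().filter().mapValues().to(topic)`); this plain-Python sketch, with a hypothetical user lookup table standing in for a KTable, only mimics the per-event flow.

```python
# Conceptual sketch of a Kafka-Streams-style topology in plain Python.
# USER_TABLE plays the role of a lookup table; topic names are invented.
USER_TABLE = {"u1": {"country": "DE"}, "u2": {"country": "US"}}

def process(event, topics):
    user = USER_TABLE.get(event["user_id"])
    if user is None:                    # filter: drop events we cannot enrich
        return
    enriched = {**event, **user}        # enrich with lookup data
    topic = f"clicks-{enriched['country'].lower()}"   # route by country
    topics.setdefault(topic, []).append(enriched)

topics = {}
for e in [{"user_id": "u1", "page": "/home"}, {"user_id": "u3", "page": "/x"}]:
    process(e, topics)
print(sorted(topics))  # -> ['clicks-de']
```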

Side-by-Side Comparison

| Dimension | Spark Batch | Spark Streaming | Flink | Kafka Streams |
| --- | --- | --- | --- | --- |
| Processing Model | Bounded datasets processed in full | Micro-batches (100ms-1s intervals) | True event-at-a-time | Event-at-a-time (within Kafka) |
| Latency | Minutes to hours | Seconds to minutes | Milliseconds to seconds | Milliseconds to seconds |
| Throughput | Very high (optimized for bulk) | High | High | Moderate (single-app model) |
| Exactly-Once | N/A (idempotent reruns) | Yes (with checkpointing) | Yes (lightweight checkpoints) | Yes (Kafka transactions) |
| Deployment | Cluster (YARN, K8s, EMR) | Same Spark cluster | Dedicated Flink cluster | Application-embedded (no cluster) |
| Learning Curve | Moderate | Low (if you know Spark) | Steep | Low-Moderate |
| Ecosystem | Massive (PySpark, SQL, MLlib) | Same as Spark batch | Growing (Flink SQL, PyFlink) | Limited to Kafka ecosystem |

Which Framework for Which Scenario?

Six real scenarios and the framework that fits each one. In interviews, the reasoning matters more than the answer. Explain why you chose the framework, not just which one.

"Daily warehouse refresh processing 500 GB"

Spark Batch

Bounded dataset, hours of latency acceptable, well-understood workload. Spark batch is the obvious choice. No reason to introduce streaming complexity.

"Near-real-time dashboard updated every 30 seconds"

Spark Structured Streaming

30-second latency is well within micro-batch range. If the team already uses Spark, Structured Streaming avoids adopting a new framework.

"Fraud detection: block transactions in under 100ms"

Flink

Sub-100ms latency requires true event-at-a-time processing. Flink's event-time semantics and exactly-once guarantees make it the right tool.

"Enrich Kafka events with lookup data before routing to topics"

Kafka Streams

Lightweight transformation within the Kafka ecosystem. No need for a full cluster framework. Kafka Streams handles this with minimal operational overhead.

"ML feature pipeline computing 200+ features from click events"

Flink or Spark Streaming

Complex computation on event streams. Flink if latency matters, Spark Streaming if the team already has Spark expertise. Both handle stateful aggregations well.

"Backfill 3 years of historical data into a new schema"

Spark Batch

Backfills are the definition of batch processing. Large bounded dataset, no latency requirement, needs high throughput. Spark batch excels here.

Hybrid Architectures

In practice, most data platforms combine batch and streaming. These three patterns show how.

Lambda Architecture

Runs both a batch layer (complete, accurate, slow) and a speed layer (approximate, fast) in parallel. The batch layer periodically corrects the speed layer. This was the original approach for combining batch and streaming, but it is operationally complex because you maintain two codepaths.

High accuracy and low latency at the cost of maintaining two separate systems with potentially different code.
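
A minimal sketch of the Lambda read path, assuming hypothetical batch and speed views keyed the same way: the serving layer sums the exact batch result with the speed layer's delta for events the batch layer has not yet covered.

```python
def serve(key, batch_view, speed_view):
    """Lambda-style read path: exact batch result plus speed-layer delta.

    The batch view holds exact counts up to the last batch run; the speed
    view holds approximate counts since then. When the next batch run
    lands, it recomputes that range and the speed delta is discarded.
    """
    return batch_view.get(key, 0) + speed_view.get(key, 0)

batch_view = {"page:/home": 1000}   # recomputed nightly from raw events
speed_view = {"page:/home": 37}     # incremental counts since last batch run
print(serve("page:/home", batch_view, speed_view))  # -> 1037
```

The operational pain lives outside this function: the batch and speed layers compute `batch_view` and `speed_view` with two separate codebases that must agree on semantics.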

Kappa Architecture

Uses a single streaming pipeline for everything. Batch workloads are treated as a special case of streaming (replay the full topic from the beginning). Simplifies operations because there is one codebase, but requires a streaming engine capable of handling both real-time and reprocessing workloads.

Simpler operations and one codebase at the cost of higher infrastructure requirements for the streaming layer.
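
The Kappa idea reduces to one processing function applied at different offsets. A plain-Python sketch, with the topic modeled as a list and names illustrative:

```python
def build_view(topic, from_offset=0):
    """Kappa-style processing: one function for both live and reprocess.

    Live consumption starts from the latest committed offset; a backfill
    or schema migration is the same code replaying from offset 0.
    """
    view = {}
    for record in topic[from_offset:]:
        view[record["key"]] = view.get(record["key"], 0) + record["value"]
    return view

topic = [{"key": "a", "value": 1}, {"key": "a", "value": 2}, {"key": "b", "value": 5}]
assert build_view(topic) == build_view(topic, 0)   # full replay == "batch" result
print(build_view(topic))  # -> {'a': 3, 'b': 5}
```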

Streaming Ingest + Batch Transform

The most common hybrid in practice. Events stream into a landing zone (S3, GCS) or a message queue (Kafka) in real time. A batch job runs periodically (hourly, daily) to transform and load the data into the warehouse. Combines the reliability of batch with the freshness of streaming ingest.

Practical and reliable. Latency is bounded by the batch interval, but that is acceptable for most analytics workloads.
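
A plain-Python sketch of the pattern, with the landing zone modeled as an in-memory dict keyed by hourly partition (field names and the hourly granularity are illustrative):

```python
from collections import defaultdict

def land(event, landing_zone):
    """Streaming ingest: append each event to its hourly partition on arrival."""
    hour = event["ts"][:13]                      # e.g. '2024-05-01T14'
    landing_zone[hour].append(event)

def batch_transform(landing_zone, hour):
    """Periodic batch job: read one closed partition, aggregate, load."""
    totals = defaultdict(int)
    for e in landing_zone[hour]:
        totals[e["page"]] += 1
    return dict(totals)

landing_zone = defaultdict(list)
for e in [{"ts": "2024-05-01T14:02:11", "page": "/home"},
          {"ts": "2024-05-01T14:40:09", "page": "/home"},
          {"ts": "2024-05-01T15:01:00", "page": "/docs"}]:
    land(e, landing_zone)
print(batch_transform(landing_zone, "2024-05-01T14"))  # -> {'/home': 2}
```

The batch job only reads partitions whose hour has fully elapsed, which is what bounds end-to-end latency at the batch interval.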

The interview signal interviewers look for

Interviewers do not want you to name every framework. They want you to match the right tool to the latency requirement and explain the trade-offs. Saying "I would use Spark batch here because the data can be hours old and batch is simpler, cheaper, and easier to debug" is a stronger answer than proposing Flink for every scenario.

Frequently Asked Questions

Should I learn batch or stream processing first?

Batch first, always. Batch processing is simpler, more common in production, and tested more frequently in interviews. Most data engineering pipelines are batch. Streaming is a specialization you add once batch fundamentals are solid. If you skip batch and go straight to Flink, you will struggle with the foundational concepts that streaming builds on.

Is Spark Structured Streaming real streaming?

Technically, no. It uses micro-batches by default, processing events in small intervals rather than one at a time. For many use cases, this distinction does not matter. If you need latencies under 1 second, Flink's event-at-a-time model is a better fit. If seconds-level latency is acceptable, Structured Streaming works well and has the advantage of using the same API as Spark batch.

When do interviewers ask about stream processing?

System design rounds at companies processing large event streams (Uber, Netflix, LinkedIn, Spotify). The interviewer gives you a scenario and you need to decide between batch and streaming, then name specific tools. The most important signal is knowing when NOT to use streaming. Defaulting to batch when latency permits shows engineering maturity.

Can Flink replace Spark entirely?

Flink can handle both batch and streaming workloads, but in practice, most organizations use Spark for batch and Flink for streaming. Spark's batch ecosystem is more mature, has a larger talent pool, and integrates with more tools. Flink's batch support is good but not yet the industry default. Over time, Flink may close this gap, but in 2026, the typical setup is still Spark for batch and Flink for streaming.

Four Frameworks. One Architecture Diagram.

Know where each one plugs into the pipeline. That's the staff-level answer.