At the ingestion layer, your choice between batch and streaming becomes a contract with every downstream system. Batch promises eventual correctness and cheaper compute; streaming promises fresh data and event-time reasoning. Real platforms run both, side by side, with a shared storage layer that tolerates reprocessing. This guide maps the four frameworks that occupy the batch-to-streaming spectrum and tells you where each one slots into a production pipeline.
For the conceptual overview of batch vs streaming (when to use each, cost trade-offs, interview scenarios), see Batch vs Streaming Pipelines. This page focuses on specific frameworks and implementation trade-offs.
Frameworks compared
[Stat cards: pipeline challenges · L6 staff rounds · pure system-design share. Source: DataDriven analysis of 1,042 verified data engineering interview rounds.]
Each framework below occupies a specific point on the batch-streaming continuum defined by three axes: latency budget, state size, and exactly-once guarantees. Pick the one whose defaults match your SLA and your team's operational maturity.
Spark's batch mode is the workhorse of large-scale data processing. It reads a bounded dataset, processes it across a cluster, and writes the result. Most production data engineering pipelines use Spark batch because it handles terabyte-scale transforms reliably and the ecosystem is mature.
Strengths
Weaknesses
Best for:
Daily/hourly warehouse refreshes, ETL at scale, ML feature engineering, backfills. If your data arrives in bounded chunks and hours of latency is acceptable, Spark batch is the default choice.
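The key property that makes batch reruns safe is idempotent output: overwrite the target partition rather than append to it. Here is a minimal pure-Python sketch of that pattern (standing in for a Spark job; the function names and the toy warehouse dict are illustrative, not a real API):

```python
# Conceptual sketch of an idempotent daily batch refresh. A real Spark job
# would read a bounded dataset and overwrite an output partition; here a
# dict models the warehouse so reruns are easy to see.

def transform(rows):
    """Aggregate revenue per user -- the kind of bounded transform
    a daily batch job performs."""
    totals = {}
    for row in rows:
        totals[row["user"]] = totals.get(row["user"], 0) + row["amount"]
    return totals

def refresh_partition(warehouse, date, rows):
    """Overwrite the output partition so reruns are idempotent:
    running the job twice yields the same warehouse state."""
    warehouse[date] = transform(rows)

warehouse = {}
day1 = [{"user": "a", "amount": 10},
        {"user": "a", "amount": 5},
        {"user": "b", "amount": 7}]
refresh_partition(warehouse, "2024-01-01", day1)
refresh_partition(warehouse, "2024-01-01", day1)  # rerun: same result
print(warehouse["2024-01-01"])  # {'a': 15, 'b': 7}
```

This is why the comparison table lists Spark batch's exactly-once story as "N/A (idempotent reruns)": you do not need streaming-style delivery guarantees if rerunning the job cannot double-count.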
Spark Structured Streaming extends the batch Spark API to handle continuous data. It uses micro-batches by default: it collects events for a short window (100ms to minutes), processes them as a small batch, then repeats. This is not true event-at-a-time processing, but it is close enough for many use cases.
Strengths
Weaknesses
Best for:
Teams already using Spark batch who need near-real-time capabilities without adopting an entirely new framework. Good for latencies in the seconds-to-minutes range.
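The micro-batch model is easy to see in miniature: buffer events for an interval, process the buffer as a small batch, repeat. A pure-Python sketch (real Structured Streaming triggers on wall-clock intervals; this uses a count-based trigger for determinism, and all names are illustrative):

```python
# Minimal micro-batch loop illustrating the Structured Streaming model:
# the stream is processed as a sequence of small bounded batches.

def micro_batches(events, batch_size):
    """Group an event iterator into small batches. Spark would trigger
    on a time interval; a fixed count keeps this example deterministic."""
    batch = []
    for e in events:
        batch.append(e)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch

counts = []
for batch in micro_batches(range(7), 3):
    counts.append(sum(batch))  # per-batch aggregate, like a streaming query
print(counts)  # [3, 12, 6]
```

Latency is bounded below by the trigger interval: an event that arrives just after a batch closes waits a full interval before it is processed, which is why micro-batching lands in the seconds-to-minutes range rather than milliseconds.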
Flink is a true event-at-a-time streaming engine. Unlike Spark Structured Streaming, Flink processes each event as it arrives, without micro-batching. This gives it lower latency and more precise event-time semantics. Flink also handles batch workloads, but its strength is streaming.
Strengths
Weaknesses
Best for:
Low-latency event processing where sub-second response times matter. Fraud detection, real-time recommendations, anomaly detection, session windowing. If Spark Structured Streaming's micro-batch latency is not fast enough, Flink is the answer.
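What "precise event-time semantics" means in practice is windowing by the timestamp inside the event, with a watermark deciding when late data is dropped. A pure-Python sketch of tumbling event-time windows with a watermark (the semantics Flink provides natively; this toy function is illustrative, not Flink's API):

```python
# Event-at-a-time processing with event-time tumbling windows and a
# watermark: events are counted into the window their own timestamp
# belongs to, and events later than the watermark are dropped.

def tumbling_counts(events, window, lateness):
    """Count events per event-time window. The watermark is the maximum
    timestamp seen so far minus the allowed lateness."""
    counts, max_ts = {}, float("-inf")
    for ts in events:                  # one event at a time, no batching
        max_ts = max(max_ts, ts)
        if ts < max_ts - lateness:
            continue                   # beyond the watermark: dropped
        start = (ts // window) * window
        counts[start] = counts.get(start, 0) + 1
    return counts

# Out-of-order stream: 13 arrives after 15 but is still within lateness;
# 2 arrives far too late and is dropped by the watermark.
print(tumbling_counts([1, 5, 12, 15, 13, 2], window=10, lateness=5))
# {0: 2, 10: 3}
```

Note that the out-of-order event 13 still lands in the correct window, while the very late event 2 is discarded. That distinction (processing order vs event order) is exactly what micro-batch systems approximate and true streaming engines handle natively.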
Kafka Streams is a client library (not a cluster framework) for processing data stored in Kafka topics. It runs as part of your application, not as a separate cluster. This makes it lightweight and easy to deploy, but it is limited to Kafka-to-Kafka processing.
Strengths
Weaknesses
Best for:
Lightweight event transformations between Kafka topics. Enrichment, filtering, simple aggregations, and routing. If your data is already in Kafka and the processing is not computationally heavy, Kafka Streams avoids the overhead of a full streaming cluster.
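The shape of a Kafka Streams topology (consume, filter, enrich, produce) can be sketched as a chain of Python generators. This is a conceptual stand-in, not the Kafka Streams DSL; the topic contents and enrichment table are invented for illustration:

```python
# A Kafka Streams-style stateless topology as composed generators:
# read from an input "topic", drop invalid records, join against a
# lookup table, and emit to an output "topic".

def filter_valid(records):
    for r in records:
        if r.get("amount", 0) > 0:  # drop malformed/negative orders
            yield r

def enrich(records, user_table):
    for r in records:
        # table lookup, like a stream-table join against a KTable
        yield {**r, "country": user_table.get(r["user"], "unknown")}

orders = [{"user": "a", "amount": 10},
          {"user": "b", "amount": -1},   # filtered out
          {"user": "c", "amount": 3}]
users = {"a": "DE", "c": "US"}

out_topic = list(enrich(filter_valid(orders), users))
print(out_topic)
```

Because the topology runs inside your application process, scaling out means starting more instances of the app, with Kafka's consumer-group protocol dividing partitions among them. That is the sense in which it is a library, not a cluster.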
| Dimension | Spark Batch | Spark Streaming | Flink | Kafka Streams |
|---|---|---|---|---|
| Processing Model | Bounded datasets processed in full | Micro-batches (100ms to minutes) | True event-at-a-time | Event-at-a-time (within Kafka) |
| Latency | Minutes to hours | Seconds to minutes | Milliseconds to seconds | Milliseconds to seconds |
| Throughput | Very high (optimized for bulk) | High | High | Moderate (single-app model) |
| Exactly-Once | N/A (idempotent reruns) | Yes (with checkpointing) | Yes (lightweight checkpoints) | Yes (Kafka transactions) |
| Deployment | Cluster (YARN, K8s, EMR) | Same Spark cluster | Dedicated Flink cluster | Application-embedded (no cluster) |
| Learning Curve | Moderate | Low (if you know Spark) | Steep | Low-Moderate |
| Ecosystem | Massive (PySpark, SQL, MLlib) | Same as Spark batch | Growing (Flink SQL, PyFlink) | Limited to Kafka ecosystem |
Six real scenarios and the framework that fits each one. In interviews, the reasoning matters more than the answer. Explain why you chose the framework, not just which one.
Spark Batch
Bounded dataset, hours of latency acceptable, well-understood workload. Spark batch is the obvious choice. No reason to introduce streaming complexity.
Spark Structured Streaming
30-second latency is well within micro-batch range. If the team already uses Spark, Structured Streaming avoids adopting a new framework.
Flink
Sub-100ms latency requires true event-at-a-time processing. Flink's event-time semantics and exactly-once guarantees make it the right tool.
Kafka Streams
Lightweight transformation within the Kafka ecosystem. No need for a full cluster framework. Kafka Streams handles this with minimal operational overhead.
Flink or Spark Streaming
Complex computation on event streams. Flink if latency matters, Spark Streaming if the team already has Spark expertise. Both handle stateful aggregations well.
Spark Batch
Backfills are the definition of batch processing. Large bounded dataset, no latency requirement, needs high throughput. Spark batch excels here.
In practice, most data platforms combine batch and streaming. These three patterns show how.
Runs both a batch layer (complete, accurate, slow) and a speed layer (approximate, fast) in parallel. The batch layer periodically corrects the speed layer. This was the original approach for combining batch and streaming, but it is operationally complex because you maintain two codepaths.
High accuracy and low latency at the cost of maintaining two separate systems with potentially different code.
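The batch-corrects-speed dynamic is the essence of Lambda. A miniature pure-Python sketch (the "lossy" speed layer is a contrived stand-in for real-world approximation or at-least-once drift; all names are illustrative):

```python
# Lambda architecture in miniature: a fast, possibly-inaccurate speed
# layer is periodically replaced by an exact batch recomputation over
# the raw log.

speed_view = {}   # updated per event; may drift (here it misses events)
batch_view = {}   # recomputed from the full log; always exact
log = []          # immutable raw event log

def on_event(user, seen_by_speed_layer):
    log.append(user)
    if seen_by_speed_layer:
        speed_view[user] = speed_view.get(user, 0) + 1

def run_batch():
    """Full reprocess over the raw log -- the slow, accurate layer."""
    batch_view.clear()
    for user in log:
        batch_view[user] = batch_view.get(user, 0) + 1

def serve(user):
    # the serving layer prefers the corrected batch value when present
    return batch_view.get(user, speed_view.get(user, 0))

on_event("a", True); on_event("a", False); on_event("a", True)
approx = serve("a")   # 2: the speed layer missed one event
run_batch()
exact = serve("a")    # 3: the batch layer corrects it
```

The operational cost is visible even at this scale: the counting logic exists twice (once incrementally, once in `run_batch`), and keeping the two codepaths semantically identical is the part that hurts in production.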
Uses a single streaming pipeline for everything. Batch workloads are treated as a special case of streaming (replay the full topic from the beginning). Simplifies operations because there is one codebase, but requires a streaming engine capable of handling both real-time and reprocessing workloads.
Simpler operations and one codebase at the cost of higher infrastructure requirements for the streaming layer.
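In Kappa, reprocessing is just the same streaming job started from offset zero. A pure-Python sketch of that idea (the `Log` class is a toy stand-in for a retained Kafka topic; everything here is illustrative):

```python
# Kappa architecture sketch: one streaming codepath; "batch" is a replay
# of the retained log from the beginning.

class Log:
    """A retained, append-only log, like a Kafka topic with long retention."""
    def __init__(self):
        self.records = []
    def append(self, record):
        self.records.append(record)
    def read(self, from_offset=0):
        return self.records[from_offset:]

def streaming_job(records):
    """The single codepath: a running total per key."""
    state = {}
    for key, value in records:
        state[key] = state.get(key, 0) + value
    return state

log = Log()
for r in [("a", 1), ("b", 2), ("a", 3)]:
    log.append(r)

live_state = streaming_job(log.read(from_offset=2))  # tail consumer
reprocessed = streaming_job(log.read(0))             # "batch" = full replay
print(reprocessed)  # same code over the complete history: {'a': 4, 'b': 2}
```

The prerequisite this makes concrete: the log must retain enough history to replay, and the streaming engine must sustain replay throughput far above the live event rate, which is where the "higher infrastructure requirements" come from.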
The most common hybrid in practice. Events stream into a landing zone (S3, GCS) or a message queue (Kafka) in real time. A batch job runs periodically (hourly, daily) to transform and load the data into the warehouse. Combines the reliability of batch with the freshness of streaming ingest.
Practical and reliable. Latency is bounded by the batch interval, but that is acceptable for most analytics workloads.
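The hybrid pattern reduces to an append-only landing zone plus a periodic batch job that processes everything landed since its last run. A minimal pure-Python sketch (a list stands in for the S3 prefix or Kafka topic, and a high-water-mark offset stands in for partition/file tracking; all names are illustrative):

```python
# Streaming ingest + batch transform in miniature: events land in a raw
# zone continuously; a periodic batch job loads only the new arrivals
# into the warehouse.

landing_zone = []   # e.g. an S3 prefix or Kafka topic (append-only)
warehouse = {}
loaded_upto = 0     # high-water mark: how far the last batch run got

def ingest(event):
    landing_zone.append(event)  # real-time, cheap, no transformation

def hourly_batch():
    """Transform and load everything landed since the last run."""
    global loaded_upto
    for user, amount in landing_zone[loaded_upto:]:
        warehouse[user] = warehouse.get(user, 0) + amount
    loaded_upto = len(landing_zone)

ingest(("a", 10)); ingest(("b", 5))
hourly_batch()
ingest(("a", 2))
hourly_batch()
print(warehouse)  # {'a': 12, 'b': 5}
```

The high-water mark is what makes each batch run incremental rather than a full rescan, and it is also why end-to-end latency equals the batch interval: an event landing just after a run waits until the next one.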
Interviewers do not want you to name every framework. They want you to match the right tool to the latency requirement and explain the trade-offs. Saying "I would use Spark batch here because the data can be hours old and batch is simpler, cheaper, and easier to debug" is a stronger answer than proposing Flink for every scenario.
Know where each one plugs into the pipeline. That's the staff-level answer.