Streaming Systems: Beginner
'Tell me about Kafka.' This comes up constantly in data engineering (DE) interviews. The interviewer is not asking you to configure a broker. They want you to explain topics and partitions, what happens when events arrive late, and the difference between micro-batch and true streaming. Here is how to answer.
Event Platforms
Explain Kafka with the right vocabulary
When you hear these in an interview, this is the concept being tested
- ▸"Tell me about Kafka."
- ▸"How does an event streaming platform work?"
- ▸"What is the difference between Kafka and a message queue?"
What They Want to Hear
The Vocabulary to Use
| Term | What It Is | One-Liner for Interviews |
|---|---|---|
| Topic | A named stream of events | Like a table, but append-only |
| Partition | A shard of a topic | Events with the same key go to the same partition |
| Producer | Writes events to a topic | Your application or data source |
| Consumer | Reads events from a topic | Your pipeline or processing job |
| Consumer Group | A set of consumers sharing the work | Each partition is read by exactly one consumer in the group |
| Offset | Position in the partition | Like a bookmark: tracks what you have read |
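The vocabulary above can be made concrete with a small in-memory model. This is a minimal sketch, not Kafka client code: `partition_for`, `append`, and the three-partition topic are hypothetical names chosen for illustration, and CRC32 stands in for the murmur2 hash that Kafka's default partitioner applies to the key bytes.

```python
import zlib

NUM_PARTITIONS = 3  # hypothetical topic with three partitions


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition with a hash that is stable across runs.

    Assumption: CRC32 is only a stand-in for Kafka's murmur2-based default.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def append(topic: dict, key: str, value: str) -> tuple:
    """Append an event to its partition's log and return (partition, offset)."""
    partition = partition_for(key)
    log = topic.setdefault(partition, [])
    log.append((key, value))
    return partition, len(log) - 1  # the offset is the position in the partition


topic = {}  # partition number -> append-only list of events
placements = [
    append(topic, "user-42", "page_view"),
    append(topic, "user-7", "page_view"),
    append(topic, "user-42", "add_to_cart"),
    append(topic, "user-42", "checkout"),
]

for (partition, offset), key in zip(placements, ["user-42", "user-7", "user-42", "user-42"]):
    print(f"{key}: partition={partition} offset={offset}")
```

Every `user-42` event lands in the same partition with increasing offsets, which is why ordering is guaranteed per key rather than across the whole topic.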
After your initial answer, expect these probes
- ▸"What is the difference between Kafka and RabbitMQ?" Kafka retains events for a configured retention period and supports replay. A classic RabbitMQ queue removes a message once a consumer acknowledges it. Use Kafka when multiple consumers need the same data or when you need to reprocess.
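- ▸"How do you choose the number of partitions?" Partitions determine maximum parallelism within a consumer group. If you have 10 consumers in a group, you need at least 10 partitions, or the extra consumers sit idle. But too many partitions increase metadata overhead. Start with the number of consumers you expect plus some headroom: partitions can be added later but not removed, and adding them changes which partition a given key maps to.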
- ▸"What happens if a consumer crashes?" The consumer group rebalances. The crashed consumer's partitions are reassigned to other consumers in the group, which resume each partition from its last committed offset. Events processed after that commit may be processed again, so handlers should be idempotent.
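The partition-count and crash answers can be sketched with a toy assignment function. This is a simplified illustration under stated assumptions: `assign` is a hypothetical round-robin helper, not one of Kafka's real assignors, and the committed offsets live in a plain dictionary.

```python
def assign(partitions: list, consumers: list) -> dict:
    """Round-robin partitions over the live consumers of one group."""
    assignment = {consumer: [] for consumer in consumers}
    for i, partition in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


partitions = [0, 1, 2, 3]
committed = {p: 0 for p in partitions}  # next offset to read, per partition

before = assign(partitions, ["c1", "c2"])

# c1 processes offsets 0..4 of partition 0 but only commits through offset 2,
# so the committed position (next offset to read) is 3.
processed_by_c1 = list(range(0, 5))
committed[0] = 3

# c1 crashes: the group rebalances and c2 takes over every partition.
after = assign(partitions, ["c2"])
resume_from = committed[0]
reprocessed = [offset for offset in processed_by_c1 if offset >= resume_from]

# More consumers than partitions leaves the extra consumer idle.
oversized_group = assign([0, 1], ["a", "b", "c"])

print(before, after, reprocessed, oversized_group, sep="\n")
```

Offsets 3 and 4 are read a second time after the rebalance, which is the at-least-once behaviour that makes idempotent processing worth mentioning in the answer.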
Event-Driven Architecture
Explain event-driven architecture and when to use it
When you hear these in an interview, this is the concept being tested
- ▸"What is event-driven architecture?"
- ▸"How is event streaming different from request-response?"
- ▸"When would you choose an event-driven approach?"
What They Want to Hear
Request-Response
- Service A calls Service B directly
- Synchronous: A waits for B's response
- Tight coupling: A breaks if B is down
- Simple for 2-3 services
Event-Driven
- Service A publishes an event
- Asynchronous: A does not wait
- Loose coupling: A does not know about B
- Scales to many consumers
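The contrast between calling a service and publishing an event can be sketched with a toy in-memory bus. `EventBus` is a hypothetical class written for this illustration, not a real library; an actual deployment would put a broker such as Kafka where the in-memory queue is.

```python
from collections import defaultdict, deque


class EventBus:
    """Toy stand-in for a broker: publishers enqueue, delivery happens later."""

    def __init__(self) -> None:
        self._subscribers = defaultdict(list)  # event type -> handlers
        self._pending = deque()                # events not yet delivered

    def subscribe(self, event_type: str, handler) -> None:
        self._subscribers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        # The publisher only enqueues; it never calls a consumer directly.
        self._pending.append((event_type, payload))

    def deliver_pending(self) -> None:
        # Delivery runs independently of the publisher.
        while self._pending:
            event_type, payload = self._pending.popleft()
            for handler in self._subscribers[event_type]:
                handler(payload)


bus = EventBus()
reserved, notified = [], []
bus.subscribe("order_placed", lambda event: reserved.append(event["order_id"]))
bus.subscribe("order_placed", lambda event: notified.append(event["order_id"]))

bus.publish("order_placed", {"order_id": "A-100"})
seen_before_delivery = list(reserved)  # still empty: the publisher did not wait
bus.deliver_pending()
print(seen_before_delivery, reserved, notified)
```

The order service never references the inventory or notification handlers, so adding a third consumer is one more `subscribe` call with no change to the publisher.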
After your initial answer, expect these probes
- ▸"What is the downside of event-driven?" Debugging is harder. When something goes wrong, you cannot trace a single request through a call stack. Instead you trace events across multiple systems. Distributed tracing tools (Jaeger, Zipkin) help.
- ▸"What is event sourcing?" Storing every state change as an immutable event instead of overwriting the current state. The current state is derived by replaying all events. Example: a bank account is a sequence of deposits and withdrawals, not a single balance number.
- ▸"When would you NOT use event-driven?" When you need a synchronous response: user login, payment processing, real-time API calls. If the caller needs an answer right now, request-response is the right pattern.
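The bank account example for event sourcing can be written out in a few lines. This is a minimal sketch: the `Event` type, the event kinds, and the amounts are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: events are immutable once written
class Event:
    kind: str    # "deposited" or "withdrew"
    amount: int  # amount in cents


def apply(balance: int, event: Event) -> int:
    """Return the new balance after one event."""
    if event.kind == "deposited":
        return balance + event.amount
    if event.kind == "withdrew":
        return balance - event.amount
    raise ValueError(f"unknown event kind: {event.kind}")


def replay(events: list) -> int:
    """Derive the current state by folding over the full event log."""
    balance = 0
    for event in events:
        balance = apply(balance, event)
    return balance


log = [
    Event("deposited", 10_000),
    Event("withdrew", 2_500),
    Event("deposited", 500),
]

balance_now = replay(log)            # state after every event
balance_after_two = replay(log[:2])  # state as of any earlier point in the log
print(balance_now, balance_after_two)
```

Because the log is never overwritten, any historical balance can be recomputed by replaying a prefix of it, which is the property interviewers usually probe for.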
Late-Arriving Data
Explain watermarks and late-data handling
When you hear these in an interview, this is the concept being tested
- ▸"What happens when events arrive late?"
- ▸"How do you handle out-of-order data?"
- ▸"What is a watermark?"
What They Want to Hear
After your initial answer, expect these probes
- ▸"How do you choose the watermark duration?" Based on observed lateness. If 99% of events arrive within 5 minutes, a 10-minute watermark catches nearly everything. Longer watermarks mean more correct results but higher latency and memory usage.
- ▸"What is the difference between event time and processing time?" Event time is when the event actually happened (the click timestamp). Processing time is when your system received it. Always use event time for aggregations, or your hourly counts will be wrong.
- ▸"What if late data is critical and cannot be dropped?" Send it to a dead letter queue and run a separate batch job to backfill the affected windows. This gives you the speed of streaming with the correctness of batch.
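The interaction between event time, watermarks, and late data can be sketched as a small simulation. This is an illustration under stated assumptions, not the API of any streaming engine: the watermark is modelled as the maximum event time seen minus the allowed lateness, windows are one-hour tumbling windows, and `hm`, `process`, and the sample timestamps are hypothetical.

```python
from collections import defaultdict

WINDOW = 60 * 60            # one-hour tumbling windows, in seconds
ALLOWED_LATENESS = 10 * 60  # watermark trails the max event time by 10 minutes


def hm(hours: int, minutes: int) -> int:
    """Event time as seconds since midnight of day one (24:03 = 00:03 next day)."""
    return hours * 3600 + minutes * 60


def window_start(event_time: int) -> int:
    return event_time - event_time % WINDOW


def process(event_times: list):
    """Count events per window by event time; input is in arrival order."""
    open_counts = defaultdict(int)
    finalized, late = {}, []
    max_event_time = float("-inf")
    watermark = float("-inf")
    for t in event_times:
        start = window_start(t)
        if start + WINDOW <= watermark:
            late.append(t)  # window already closed: route to a DLQ or backfill
        else:
            open_counts[start] += 1
        max_event_time = max(max_event_time, t)
        watermark = max_event_time - ALLOWED_LATENESS
        for s in [s for s in open_counts if s + WINDOW <= watermark]:
            finalized[s] = open_counts.pop(s)  # emit the window's final count
    return finalized, dict(open_counts), late


arrivals = [hm(23, 10), hm(23, 58), hm(24, 3), hm(23, 58), hm(24, 12), hm(23, 59)]
finalized, still_open, late = process(arrivals)
print(finalized, still_open, late)
```

The second 23:58 click arrives after a 00:03 event but before the watermark passes midnight, so it is still counted in the 23:00 window. The 23:59 click arrives after the watermark has reached 00:02 and is set aside as late.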
Dead Letter Queues
Explain dead letter queues and when to retry vs send to DLQ
When you hear these in an interview, this is the concept being tested
- ▸"What happens to records that fail processing?"
- ▸"How do you handle poison messages?"
- ▸"What is a dead letter queue?"
What They Want to Hear
After your initial answer, expect these probes
- ▸"What causes events to land in the DLQ?" Schema mismatch (unexpected field types), malformed JSON, business rule violations (negative amounts), or a downstream system that is still unavailable after all retries.
- ▸"How many retries before sending to DLQ?" Typically 3 retries with exponential backoff. If the event fails all 3, it goes to DLQ. Transient errors (timeout, rate limit) are worth retrying. Permanent errors (bad schema) should go to DLQ immediately.
- ▸"What if the DLQ itself fills up?" Alert immediately. A growing DLQ means the root cause is not transient. Pause investigation of individual events and focus on the systemic issue first.
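The retry-versus-DLQ policy described above can be sketched as a single function. This is a minimal illustration: `TransientError`, `PermanentError`, `process_with_retry`, and the 0.5-second base delay are hypothetical, and the DLQ is a plain list where a real pipeline would use a separate topic.

```python
import time


class TransientError(Exception):
    """Worth retrying: timeouts, rate limits."""


class PermanentError(Exception):
    """Not worth retrying: bad schema, malformed payload."""


def process_with_retry(event, handler, dlq, max_retries=3, base_delay=0.5, sleep=time.sleep):
    """Return True if the event was processed, False if it went to the DLQ."""
    for attempt in range(max_retries + 1):  # first try plus max_retries retries
        try:
            handler(event)
            return True
        except PermanentError as exc:
            dlq.append((event, f"permanent: {exc}"))  # skip retries entirely
            return False
        except TransientError as exc:
            if attempt == max_retries:
                dlq.append((event, f"retries exhausted: {exc}"))
                return False
            sleep(base_delay * 2 ** attempt)  # exponential backoff: 0.5s, 1s, 2s
    return False


def always_times_out(event):
    raise TransientError("timeout")


def bad_schema(event):
    raise PermanentError("amount is not a number")


dlq = []
transient_delays, permanent_delays = [], []
# Passing list.append as `sleep` records the delays instead of waiting.
process_with_retry({"id": 1}, always_times_out, dlq, sleep=transient_delays.append)
process_with_retry({"id": 2}, bad_schema, dlq, sleep=permanent_delays.append)
print(transient_delays, permanent_delays, dlq)
```

Injecting `sleep` keeps the sketch fast to run and makes the backoff schedule visible. The length of `dlq` is the depth metric that the monitoring answer refers to.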
Micro-Batch vs True Streaming
Explain micro-batch vs true streaming and pick the right one
When you hear these in an interview, this is the concept being tested
- ▸"What is the difference between Spark Streaming and Flink?"
- ▸"Is Spark Structured Streaming real streaming?"
- ▸"When is micro-batch good enough?"
What They Want to Hear
Micro-Batch (Spark Structured Streaming)
- Processes in small time windows (1-10 seconds)
- Latency floor: ~100 milliseconds
- Uses the same Spark engine as batch
- Easier to operate if your team knows Spark
True Streaming (Flink)
- Processes each event individually
- Latency: single-digit milliseconds
- Built-in exactly-once state guarantees (via checkpointing)
- Steeper learning curve, but more powerful for streaming
After your initial answer, expect these probes
- ▸"When does latency actually matter?" Fraud detection (block the transaction before it completes), live bidding (ad auction in 50ms), driver matching (Uber needs real-time location). For dashboards, alerting, and analytics, micro-batch latency of 100 ms to a few seconds is fine.
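- ▸"Can Spark do true streaming?" Mostly no. Spark Structured Streaming runs as micro-batches by default. The continuous processing trigger, added in Spark 2.3, is the exception: it targets latencies around 1 ms, but it is still experimental, supports only a limited set of operations, and offers at-least-once rather than exactly-once guarantees. If you need production-grade per-event streaming, use Flink.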
- ▸"Why not always use Flink?" Operational complexity. If your team already runs Spark for batch, adding Flink means learning a new framework, new deployment, new monitoring. Use Spark Structured Streaming for most cases and Flink only when latency requirements demand it.
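The latency difference between the two models comes down to simple arithmetic: in micro-batch, an event waits until the next trigger fires before any processing starts. The sketch below computes only that batching delay, assuming a hypothetical 1-second trigger interval and made-up arrival times. It ignores processing time in both models.

```python
def micro_batch_wait(arrival_ms: int, trigger_interval_ms: int) -> int:
    """Milliseconds an event waits until the next trigger boundary."""
    # Ceiling division with integers: the first trigger at or after the arrival.
    next_trigger = -(-arrival_ms // trigger_interval_ms) * trigger_interval_ms
    return next_trigger - arrival_ms


TRIGGER_INTERVAL_MS = 1000          # assumed 1-second micro-batch trigger
arrivals_ms = [120, 480, 950, 1000]  # hypothetical event arrival times

waits = [micro_batch_wait(t, TRIGGER_INTERVAL_MS) for t in arrivals_ms]
average_wait = sum(waits) / len(waits)

# Per-event streaming has no batching delay: each event is handled on arrival.
per_event_waits = [0 for _ in arrivals_ms]

print(waits, average_wait, per_event_waits)
```

With events spread evenly over time, the average batching delay is about half the trigger interval, which is a quick way to estimate whether micro-batch fits a latency budget.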
Answer the Kafka and streaming questions with confidence
- Category: Pipeline Architecture
- Difficulty: beginner
- Duration: 20 minutes
- Challenges: 0 hands-on challenges
Topics covered: Event Platforms, Event-Driven Architecture, Late-Arriving Data, Dead Letter Queues, Micro-Batch vs True Streaming
Lesson Sections
- Event Platforms
What They Want to Hear 'Kafka is a distributed event streaming platform. Producers write events to topics. Each topic is split into partitions for parallel processing. Consumers read from partitions using consumer groups, where each partition is assigned to exactly one consumer in the group. The key difference from a traditional message queue: Kafka retains events after they are read, so multiple consumers can independently replay the same data.' That is the answer. Topics, partitions, consumer
- Event-Driven Architecture
What They Want to Hear 'In event-driven architecture, services communicate by publishing events instead of calling each other directly. When an order is placed, the order service publishes an event. The inventory service, the notification service, and the analytics pipeline each consume that event independently. No service needs to know about the others. This decouples teams and systems.' That is the answer. Publish, not call. Independent consumers. Decoupled teams.
- Late-Arriving Data
What They Want to Hear 'Late data arrives after the window it belongs to has already been processed. A click that happened at 11:58 PM might arrive at 12:03 AM, after the hourly window closed. I handle this with watermarks: a threshold that says how late I am willing to wait. If my watermark is 10 minutes, I keep the window open for 10 extra minutes to accept late events. Events that arrive after the watermark are either dropped or sent to a dead letter queue for reprocessing.' That is the answe
- Dead Letter Queues
What They Want to Hear 'A dead letter queue (DLQ) is where events go when they cannot be processed. Instead of crashing the pipeline or blocking the stream, the bad event is moved to a separate topic for investigation. This keeps the main pipeline flowing. I monitor DLQ depth as a health metric: if it grows, something is systematically wrong. I reprocess DLQ events after fixing the root cause.' That is the answer. DLQ = safety valve. Monitor depth. Fix root cause, then replay.
- Micro-Batch vs True Streaming
What They Want to Hear 'Micro-batch processes events in small time windows, typically every few seconds. Spark Structured Streaming uses this model. True streaming processes each event as it arrives with no batching delay. Flink uses this model. The practical difference is latency: micro-batch has a floor around 100 milliseconds. True streaming can process in single-digit milliseconds. For most use cases, micro-batch is good enough and simpler to operate.' That is the answer. Micro-batch = small