Streaming Systems: Beginner

'Tell me about Kafka.' This comes up constantly in data engineering (DE) interviews. The interviewer is not asking you to configure a broker. They want you to explain topics and partitions, what happens when events arrive late, and the difference between micro-batch and true streaming. Here is exactly how to answer.

What you will be able to do

Explain Kafka's role in one sentence and name the key components

Answer 'What happens when events arrive late?' with a concrete strategy

Describe the difference between Spark Structured Streaming and Flink

Event Platforms

Explain Kafka with the right vocabulary

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

▸"Tell me about Kafka."
▸"How does an event streaming platform work?"
▸"What is the difference between Kafka and a message queue?"

What They Want to Hear

'Kafka is a distributed event streaming platform. Producers write events to topics. Each topic is split into partitions for parallel processing. Consumers read from partitions using consumer groups, where each partition is assigned to exactly one consumer in the group. The key difference from a traditional message queue: Kafka retains events after they are read, so multiple consumers can independently replay the same data.' That is the answer. Topics, partitions, consumer groups, and retention. Four concepts in five sentences.

What to Whiteboard

Producers → Topic: clickstream → Consumer Group 1 and Consumer Group 2 (each group has its own consumers)

The Vocabulary to Use

Term	What It Is	One-Liner for Interviews
Topic	A named stream of events	Like a table, but append-only
Partition	A shard of a topic	Events with the same key go to the same partition
Producer	Writes events to a topic	Your application or data source
Consumer	Reads events from a topic	Your pipeline or processing job
Consumer Group	A set of consumers sharing the work	Each partition is read by exactly one consumer in the group
Offset	Position in the partition	Like a bookmark: tracks what you have read
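
If it helps to see the vocabulary in motion, here is a minimal in-memory sketch of a topic. It is not the Kafka client API: MiniTopic, its methods, and the CRC32 hash are all assumptions made for illustration (real Kafka clients use their own partitioner). It shows three ideas from the table: the same key lands in the same partition, an offset is a position in a partition, and reading does not delete anything, so two consumer groups see the same events.

```python
import zlib
from collections import defaultdict


class MiniTopic:
    """Toy in-memory topic: a fixed number of append-only partitions."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]
        # Offsets are tracked per (consumer group, partition).
        self.committed = defaultdict(int)

    def partition_for(self, key):
        # Stand-in hash for illustration; not Kafka's real partitioner.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def produce(self, key, value):
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def poll(self, group, partition):
        """Return events this group has not read yet, then commit the offset."""
        start = self.committed[(group, partition)]
        events = self.partitions[partition][start:]
        self.committed[(group, partition)] = len(self.partitions[partition])
        return events


topic = MiniTopic(num_partitions=3)
for i in range(4):
    topic.produce("user-42", f"click-{i}")

p = topic.partition_for("user-42")
analytics_view = topic.poll("analytics", p)  # first consumer group
alerts_view = topic.poll("alerts", p)        # second group, same events
```

Both groups receive all four clicks in order, and the events are still in the partition afterwards. That is the retention point in the interview answer.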

The Curveball Follow-ups

After your initial answer, expect these probes

▸"What is the difference between Kafka and RabbitMQ?" Kafka retains events and supports replay. A classic RabbitMQ queue removes a message once a consumer acknowledges it. Use Kafka when multiple consumers need the same data or when you need to reprocess.
▸"How do you choose the number of partitions?" Partitions determine maximum parallelism within a consumer group. If you have 10 consumers, you need at least 10 partitions, or the extra consumers sit idle. But too many partitions increase metadata overhead. Start with the number of consumers and scale up as needed, keeping in mind that adding partitions later changes which partition a given key maps to.
▸"What happens if a consumer crashes?" The consumer group rebalances. The crashed consumer's partitions are reassigned to other consumers in the group. When the consumer comes back, it resumes from the last committed offset.
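
The partition-count and crash answers can be sketched with a toy assignment function. This is a simplified round-robin, not one of Kafka's actual assignors, and the consumer names and committed offsets below are made up for illustration.

```python
def assign(partitions, consumers):
    """Round-robin assignment: each partition gets exactly one consumer."""
    ordered = sorted(consumers)
    assignment = {c: [] for c in ordered}
    for i, p in enumerate(sorted(partitions)):
        assignment[ordered[i % len(ordered)]].append(p)
    return assignment


partitions = [0, 1, 2, 3]
committed = {0: 120, 1: 98, 2: 143, 3: 77}  # hypothetical committed offsets

before = assign(partitions, ["c1", "c2", "c3", "c4"])
after_crash = assign(partitions, ["c1", "c2", "c3"])          # c4 crashed
too_many = assign(partitions, ["c1", "c2", "c3", "c4", "c5"])  # 5 > 4 partitions

# Whoever inherits partition 3 resumes from its last committed offset.
new_owner = next(c for c, ps in after_crash.items() if 3 in ps)
resume_from = committed[3]
```

After the crash, c1 picks up partition 3 and continues from offset 77. With five consumers and four partitions, c5 gets nothing to read, which is why partitions cap parallelism.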

KEY TAKEAWAYS

Say: 'Kafka is a distributed event platform. Producers write to topics, consumers read from partitions via consumer groups.'

The key difference from message queues: Kafka retains events after consumption. Multiple consumers can replay.

Partitions = parallelism. At least one partition per consumer.

Event-Driven Architecture

Explain event-driven architecture and when to use it

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

▸"What is event-driven architecture?"
▸"How is event streaming different from request-response?"
▸"When would you choose an event-driven approach?"

What They Want to Hear

'In event-driven architecture, services communicate by publishing events instead of calling each other directly. When an order is placed, the order service publishes an event. The inventory service, the notification service, and the analytics pipeline each consume that event independently. No service needs to know about the others. This decouples teams and systems.' That is the answer. Publish, not call. Independent consumers. Decoupled teams.

•Request-Response

Service A calls Service B directly
Synchronous: A waits for B's response
Tight coupling: A breaks if B is down
Simple for 2-3 services

•Event-Driven

Service A publishes an event
Asynchronous: A does not wait
Loose coupling: A does not know about B
Scales to many consumers
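
To make "publish, not call" concrete, here is a minimal in-process sketch. EventBus and the handler names are hypothetical, and the handlers run inline for simplicity; with a real broker such as Kafka the publisher would return immediately and consumers would process on their own schedule.

```python
from collections import defaultdict


class EventBus:
    """Toy in-process bus standing in for a real event platform."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self.subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        # The publisher does not know who is listening, or how many.
        for handler in self.subscribers[event_type]:
            handler(payload)


bus = EventBus()
log = []

bus.subscribe("order_placed", lambda e: log.append(("inventory", e["order_id"])))
bus.subscribe("order_placed", lambda e: log.append(("notification", e["order_id"])))
bus.subscribe("order_placed", lambda e: log.append(("analytics", e["order_id"])))

# The order service publishes once; it never calls the other services.
bus.publish("order_placed", {"order_id": 1001, "total": 59.90})
```

Adding a fourth consumer is one more subscribe call. The order service code does not change, which is the decoupling the answer is selling.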

The Curveball Follow-ups

After your initial answer, expect these probes

▸"What is the downside of event-driven?" Debugging is harder. When something goes wrong, you cannot trace a single request through a call stack. Instead you trace events across multiple systems. Distributed tracing tools (Jaeger, Zipkin) help.
▸"What is event sourcing?" Storing every state change as an immutable event instead of overwriting the current state. The current state is derived by replaying all events. Example: a bank account is a sequence of deposits and withdrawals, not a single balance number.
▸"When would you NOT use event-driven?" When you need a synchronous response: user login, payment processing, real-time API calls. If the caller needs an answer right now, request-response is the right pattern.
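
The bank account example from the event sourcing answer fits in a few lines. This is a sketch with made-up amounts (stored in cents); AccountEvent and replay are illustrative names, not part of any framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: events are immutable once written
class AccountEvent:
    kind: str    # "deposit" or "withdrawal"
    amount: int  # cents, to avoid float rounding


def replay(events):
    """Derive the balance by folding over the event history."""
    balance = 0
    for e in events:
        if e.kind == "deposit":
            balance += e.amount
        elif e.kind == "withdrawal":
            balance -= e.amount
        else:
            raise ValueError(f"unknown event kind: {e.kind}")
    return balance


history = [
    AccountEvent("deposit", 10_000),
    AccountEvent("withdrawal", 2_500),
    AccountEvent("deposit", 1_000),
]

current_balance = replay(history)        # 8500
balance_after_two = replay(history[:2])  # 7500: state at an earlier point
```

Because nothing is overwritten, you can rebuild the balance as of any point in the history, not just the latest one.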

KEY TAKEAWAYS

Say: 'Event-driven: services publish events instead of calling each other. Consumers are independent and decoupled.'

The tradeoff: loose coupling and scalability vs harder debugging

Event sourcing stores every change. The current state is derived by replay.

Late-Arriving Data

Explain watermarks and late-data handling

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

▸"What happens when events arrive late?"
▸"How do you handle out-of-order data?"
▸"What is a watermark?"

What They Want to Hear

'Late data arrives after the window it belongs to has already been processed. A click that happened at 11:58 PM might arrive at 12:03 AM, after the 11 PM hourly window closed. I handle this with watermarks: a watermark trails the latest event time seen by a fixed delay, and that delay is how long I am willing to wait. If my watermark delay is 10 minutes, I keep the window open for 10 extra minutes to accept late events. Events that arrive after the watermark has passed their window are either dropped or sent to a dead letter queue for reprocessing.' That is the answer. Late data is inevitable. Watermarks define how long you wait. After that, dead letter queue.
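
Here is a minimal sketch of that logic, assuming one common approach: the watermark is the largest event time seen so far minus the allowed lateness, and a window becomes final once the watermark passes its end. HourlyCounter and the timestamps are hypothetical; a streaming engine does this bookkeeping for you.

```python
from datetime import datetime, timedelta

ALLOWED_LATENESS = timedelta(minutes=10)  # how late we are willing to wait
WINDOW = timedelta(hours=1)


def window_start(ts):
    """Start of the hourly window that contains ts."""
    return ts.replace(minute=0, second=0, microsecond=0)


class HourlyCounter:
    def __init__(self):
        self.open_windows = {}   # window start -> running count
        self.closed = {}         # window start -> final count
        self.dead_letters = []   # events that arrived after their window closed
        self.max_event_time = None

    def on_event(self, event_time):
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        watermark = self.max_event_time - ALLOWED_LATENESS

        start = window_start(event_time)
        if start + WINDOW <= watermark:
            self.dead_letters.append(event_time)  # too late: window is final
        else:
            self.open_windows[start] = self.open_windows.get(start, 0) + 1

        # Finalize every window whose end the watermark has passed.
        for s in [s for s in self.open_windows if s + WINDOW <= watermark]:
            self.closed[s] = self.open_windows.pop(s)


counter = HourlyCounter()
arrival_order = [
    datetime(2024, 1, 1, 23, 50),
    datetime(2024, 1, 2, 0, 2),
    datetime(2024, 1, 1, 23, 58),  # late, but within 10 minutes: accepted
    datetime(2024, 1, 2, 0, 15),   # pushes the watermark to 00:05
    datetime(2024, 1, 1, 23, 59),  # too late: the 11 PM window is closed
]
for ts in arrival_order:
    counter.on_event(ts)
```

The 11:58 PM click still counts because the watermark had only reached 11:52 PM when it arrived. The 11:59 PM click arrives after the watermark reached 12:05 AM, so it goes to the dead letter list instead of changing a finalized count.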

What to Whiteboard

Event: 11:58 PM click → Watermark Check → Consumer (on time) or Dead Letter Queue (too late)

The Curveball Follow-ups

After your initial answer, expect these probes

▸"How do you choose the watermark duration?" Based on observed lateness. If 99% of events arrive within 5 minutes, a 10-minute watermark catches nearly everything. Longer watermarks mean more correct results but higher latency and memory usage.
▸"What is the difference between event time and processing time?" Event time is when the event actually happened (the click timestamp). Processing time is when your system received it. Always use event time for aggregations, or your hourly counts will be wrong.
▸"What if late data is critical and cannot be dropped?" Send it to a dead letter queue and run a separate batch job to backfill the affected windows. This gives you the speed of streaming with the correctness of batch.
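
"Based on observed lateness" can be made concrete: measure processing time minus event time for a sample of events and look at the high percentiles. The sketch below uses a nearest-rank percentile and invented lateness samples, so the numbers are illustrative only.

```python
def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceil(pct * n / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical lateness samples in seconds: processing time minus event time.
lateness_seconds = [2, 3, 3, 4, 5, 5, 6, 8, 10, 12,
                    15, 20, 25, 30, 45, 60, 90, 120, 240, 900]

p95 = percentile(lateness_seconds, 95)  # 240 seconds
p99 = percentile(lateness_seconds, 99)  # 900 seconds

# Share of events a 5-minute watermark delay would catch.
covered_by_5_min = sum(1 for s in lateness_seconds if s <= 300) / len(lateness_seconds)
```

In this made-up sample a 5-minute delay catches 95% of events, and covering the last straggler would mean waiting 15 minutes. That is the tradeoff to say out loud: a longer wait buys correctness at the cost of latency and memory.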

TIP

The two words that impress interviewers on streaming questions: event time and watermark. Always use event time for aggregations, and always mention your watermark strategy.

KEY TAKEAWAYS

Say: 'Watermarks define how late I wait. Events past the watermark go to a dead letter queue.'

Always aggregate on event time, not processing time

The watermark tradeoff: longer = more correct, but higher latency and memory

Dead Letter Queues

Explain dead letter queues and when to retry vs send to DLQ

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

▸"What happens to records that fail processing?"
▸"How do you handle poison messages?"
▸"What is a dead letter queue?"

What They Want to Hear

'A dead letter queue (DLQ) is where events go when they cannot be processed. Instead of crashing the pipeline or blocking the stream, the bad event is moved to a separate topic for investigation. This keeps the main pipeline flowing. I monitor DLQ depth as a health metric: if it grows, something is systematically wrong. I reprocess DLQ events after fixing the root cause.' That is the answer. DLQ = safety valve. Monitor depth. Fix root cause, then replay.

What to Whiteboard

Event Stream → Process Event → Target Table; failed events → Dead Letter Queue → Investigate + Replay

The Curveball Follow-ups

After your initial answer, expect these probes

▸"What causes events to land in the DLQ?" Schema mismatch (unexpected field types), malformed JSON, business rule violations (negative amounts), or a downstream outage that outlasts the retries.
▸"How many retries before sending to DLQ?" Typically 3 retries with exponential backoff. If the event fails all 3, it goes to DLQ. Transient errors (timeout, rate limit) are worth retrying. Permanent errors (bad schema) should go to DLQ immediately.
▸"What if the DLQ itself fills up?" Alert immediately. A growing DLQ means the root cause is not transient. Pause investigation of individual events and focus on the systemic issue first.
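
The retry-versus-DLQ rule can be sketched as a small wrapper. Everything here is illustrative: the exception classes, process_with_dlq, and the handlers are hypothetical names, the DLQ is a plain list rather than a topic, and "3 retries" is read as one initial attempt plus up to three retries.

```python
import time


class TransientError(Exception):
    """Worth retrying: timeouts, rate limits."""


class PermanentError(Exception):
    """Not worth retrying: bad schema, malformed payload."""


def process_with_dlq(event, handler, dlq, max_retries=3, base_delay=0.01,
                     sleep=time.sleep):
    """Run handler(event); retry transient failures, then dead-letter."""
    retries = 0
    while True:
        try:
            return handler(event)
        except PermanentError as exc:
            error = exc
            break  # retrying cannot help
        except TransientError as exc:
            if retries == max_retries:
                error = exc
                break  # retries exhausted
            sleep(base_delay * 2 ** retries)  # 1x, 2x, 4x the base delay
            retries += 1
    dlq.append({"event": event, "error": str(error), "retries": retries})
    return None


dlq, delays, calls = [], [], {"flaky": 0}


def flaky(event):
    calls["flaky"] += 1
    if calls["flaky"] < 3:
        raise TransientError("timeout")
    return event["amount"] * 2


def validate(event):
    if event["amount"] < 0:
        raise PermanentError("negative amount")
    return event["amount"]


def always_down(event):
    raise TransientError("downstream unavailable")


# sleep=delays.append records the backoff instead of actually waiting.
ok = process_with_dlq({"amount": 5}, flaky, dlq, sleep=delays.append)
bad = process_with_dlq({"amount": -1}, validate, dlq, sleep=delays.append)
down = process_with_dlq({"amount": 7}, always_down, dlq, sleep=delays.append)
```

The flaky event succeeds on its third attempt. The negative amount goes straight to the DLQ with zero retries, and the outage lands there only after three backoffs. The main loop never blocks on a bad event.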

KEY TAKEAWAYS

Say: 'Dead letter queue: failed events go to a separate topic instead of blocking the pipeline. Monitor depth, fix root cause, replay.'

3 retries with backoff, then DLQ. Permanent errors skip retries.

DLQ depth is a health metric. Growing depth = systemic problem.

Micro-Batch vs True Streaming

Explain micro-batch vs true streaming and pick the right one

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

▸"What is the difference between Spark Streaming and Flink?"
▸"Is Spark Structured Streaming real streaming?"
▸"When is micro-batch good enough?"

What They Want to Hear

'Micro-batch processes events in small time windows, typically every few seconds. Spark Structured Streaming uses this model. True streaming processes each event as it arrives with no batching delay. Flink uses this model. The practical difference is latency: micro-batch has a floor around 100 milliseconds. True streaming can process in single-digit milliseconds. For most use cases, micro-batch is good enough and simpler to operate.' That is the answer. Micro-batch = small windows, 100ms floor, simpler. True streaming = per-event, sub-10ms, more complex.

•Micro-Batch (Spark)

Processes in small time windows (1-10 seconds)
Latency floor: ~100 milliseconds
Uses the same Spark engine as batch
Easier to operate if your team knows Spark

•True Streaming (Flink)

Processes each event individually
Latency: single-digit milliseconds
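Exactly-once state guarantees via checkpointing (Spark Structured Streaming can also achieve exactly-once with replayable sources and idempotent sinks)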
Steeper learning curve, but more powerful for streaming
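
A toy model shows where the latency gap comes from. It assumes a fixed trigger interval, a made-up per-event processing cost, and hypothetical arrival times; it is not a benchmark of Spark or Flink, only the arithmetic behind "events wait for the next batch".

```python
TRIGGER_MS = 1000       # hypothetical micro-batch trigger interval
PER_EVENT_COST_MS = 5   # hypothetical processing cost per event


def micro_batch_wait(arrival_ms, trigger_ms):
    """Time an event waits until the next trigger boundary fires."""
    next_trigger = (arrival_ms // trigger_ms + 1) * trigger_ms
    return next_trigger - arrival_ms


arrivals_ms = [0, 120, 450, 999, 1000, 1730]

# Micro-batch: wait for the batch boundary, then pay the processing cost.
micro_batch = [micro_batch_wait(t, TRIGGER_MS) + PER_EVENT_COST_MS
               for t in arrivals_ms]

# Per-event: no batching delay, only the processing cost.
per_event = [PER_EVENT_COST_MS for _ in arrivals_ms]

worst_micro_batch = max(micro_batch)  # 1005 ms
worst_per_event = max(per_event)      # 5 ms
```

In this model the micro-batch latency is dominated by the trigger interval, not by the work itself. Shrinking the interval helps until per-batch scheduling overhead becomes the floor.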

The Curveball Follow-ups

After your initial answer, expect these probes

▸"When does latency actually matter?" Fraud detection (block the transaction before it completes), live bidding (ad auction in 50ms), driver matching (Uber needs real-time location). For dashboards, alerting, and analytics, 100ms micro-batch is fine.
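▸"Can Spark do true streaming?" Not by default. Spark Structured Streaming runs as micro-batches unless you opt into continuous processing, for example trigger(continuous='1 second') in PySpark. That mode is experimental: it targets roughly 1 millisecond latency, but it supports only map-like operations (projections and selections) and gives at-least-once rather than exactly-once guarantees. If you need per-event streaming with stateful operations in production, use Flink.
▸"Why not always use Flink?" Operational complexity. If your team already runs Spark for batch, adding Flink means learning a new framework, new deployment, new monitoring. Use Spark Structured Streaming for most cases and Flink only when latency requirements demand it.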

KEY TAKEAWAYS

Say: 'Micro-batch (Spark) for most cases: 100ms latency, simpler ops. True streaming (Flink) when sub-10ms matters.'

Spark Structured Streaming is micro-batch by default. Its continuous mode is experimental and limited, so in practice treat Spark as micro-batch, not per-event streaming.

Choose based on latency needs AND team expertise. Flink is powerful but has a learning curve.

Answer the Kafka and streaming questions with confidence

Category: Pipeline Architecture
Difficulty: beginner
Duration: 20 minutes
Challenges: 0 hands-on challenges

Topics covered: Event Platforms, Event-Driven Architecture, Late-Arriving Data, Dead Letter Queues, Micro-Batch vs True Streaming
