Streaming Systems: Beginner

'Tell me about Kafka.' This comes up constantly in DE interviews. The interviewer is not asking you to configure a broker. They want you to explain topics and partitions, what happens when events arrive late, and the difference between micro-batch and true streaming. Here is exactly how to answer.

Event Platforms

Explain Kafka with the right vocabulary

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "Tell me about Kafka."
  • "How does an event streaming platform work?"
  • "What is the difference between Kafka and a message queue?"

What They Want to Hear

'Kafka is a distributed event streaming platform. Producers write events to topics. Each topic is split into partitions for parallel processing. Consumers read from partitions using consumer groups, where each partition is assigned to exactly one consumer in the group. The key difference from a traditional message queue: Kafka retains events after they are read, for a configurable retention period, so multiple consumer groups can independently read and replay the same data.' That is the answer. Topics, partitions, consumer groups, and retention: four concepts in five sentences.

What to Whiteboard

Producers (web app, mobile, IoT)
  → write events → Topic: clickstream (partitioned by user_id)
      → independent read → Consumer Group 1 (analytics pipeline)
      → independent read → Consumer Group 2 (fraud detection)

The Vocabulary to Use

Term            What It Is                            One-Liner for Interviews
Topic           A named stream of events              Like a table, but append-only
Partition       A shard of a topic                    Events with the same key go to the same partition
Producer        Writes events to a topic              Your application or data source
Consumer        Reads events from a topic             Your pipeline or processing job
Consumer Group  A set of consumers sharing the work   Each partition is read by exactly one consumer in the group
Offset          Position in the partition             Like a bookmark: tracks what you have read
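
If it helps to make the vocabulary concrete, here is a minimal in-memory sketch of a topic, key-based partitioning, and per-group offsets. It is a toy model, not the Kafka client API: the class names are hypothetical, crc32 stands in for Kafka's own partitioner, and a single ConsumerGroup object stands in for all the consumers in a group.

```python
import zlib


class Topic:
    """Toy append-only topic. Illustrative only; not the Kafka client API."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        # A stable hash sends every event with the same key to the same partition.
        # (crc32 is an assumption of this sketch, not Kafka's partitioner.)
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append((key, value))
        return partition


class ConsumerGroup:
    """Tracks one offset per partition, like a bookmark."""

    def __init__(self, topic):
        self.topic = topic
        self.offsets = [0] * len(topic.partitions)

    def poll(self):
        # Read everything past the bookmark, then advance it.
        # The topic itself is never modified: reading does not delete events.
        events = []
        for partition, log in enumerate(self.topic.partitions):
            events.extend(log[self.offsets[partition]:])
            self.offsets[partition] = len(log)
        return events


topic = Topic("clickstream", num_partitions=3)
first = topic.produce("user-1", "click-a")
second = topic.produce("user-1", "click-b")
topic.produce("user-2", "click-c")

analytics = ConsumerGroup(topic)
fraud = ConsumerGroup(topic)
analytics_events = analytics.poll()
fraud_events = fraud.poll()
```

Both groups see all three events because each keeps its own offsets, and a second poll by the same group returns nothing new. That is the retention point in miniature: the log stays put, only the bookmarks move.
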
The Curveball Follow-ups

After your initial answer, expect these probes

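  • "What is the difference between Kafka and RabbitMQ?" Kafka retains events and supports replay. A classic RabbitMQ queue removes a message once a consumer acknowledges it. Use Kafka when multiple consumers need the same data or when you need to reprocess.
  • "How do you choose the number of partitions?" Partitions cap parallelism within a consumer group. Each partition is read by only one consumer in the group, so 10 active consumers need at least 10 partitions, and any extra consumers sit idle. Too many partitions increase metadata and rebalancing overhead. Size for the consumer parallelism you expect to need, and remember that adding partitions later changes which partition a given key maps to.
  • "What happens if a consumer crashes?" The consumer group rebalances. The crashed consumer's partitions are reassigned to the remaining consumers in the group, and the new owner resumes each partition from its last committed offset. Events that were processed but not yet committed may be processed again, so downstream writes should tolerate duplicates.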
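
The partition-count and crash answers can be sketched together. This is a simplified round-robin assignment written for illustration; the function name, consumer names, and offset numbers are made up, and real Kafka assignors are more sophisticated.

```python
def assign(partitions, consumers):
    """Round-robin assignment: every partition gets exactly one owner in the group."""
    if not consumers:
        raise ValueError("a consumer group needs at least one live consumer")
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = [0, 1, 2, 3, 4, 5]
committed_offsets = {0: 120, 1: 98, 2: 143, 3: 77, 4: 101, 5: 64}  # made-up numbers

before = assign(partitions, ["c1", "c2", "c3"])
after = assign(partitions, ["c1", "c3"])  # c2 crashed; the group rebalances

# Whoever now owns c2's old partitions resumes from the committed offsets.
resume_points = {p: committed_offsets[p] for p in before["c2"]}

# More consumers than partitions: the extra consumer gets nothing to read.
oversubscribed = assign([0, 1], ["c1", "c2", "c3"])
```

After the rebalance, partitions 1 and 4 (previously owned by c2) are picked up by the surviving consumers at offsets 98 and 101. In the oversubscribed case, c3 ends up with an empty list, which is why the partition count caps parallelism.
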
KEY TAKEAWAYS
Say: 'Kafka is a distributed event platform. Producers write to topics, consumers read from partitions via consumer groups.'
The key difference from message queues: Kafka retains events after consumption. Multiple consumers can replay.
Partitions = parallelism. At least one partition per consumer.

Event-Driven Architecture

Explain event-driven architecture and when to use it

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "What is event-driven architecture?"
  • "How is event streaming different from request-response?"
  • "When would you choose an event-driven approach?"

What They Want to Hear

'In event-driven architecture, services communicate by publishing events instead of calling each other directly. When an order is placed, the order service publishes an event. The inventory service, the notification service, and the analytics pipeline each consume that event independently. No service needs to know about the others. This decouples teams and systems.' That is the answer. Publish, not call. Independent consumers. Decoupled teams.
Request-Response
  • Service A calls Service B directly
  • Synchronous: A waits for B's response
  • Tight coupling: A breaks if B is down
  • Simple for 2-3 services
Event-Driven
  • Service A publishes an event
  • Asynchronous: A does not wait
  • Loose coupling: A does not know about B
  • Scales to many consumers
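
A small sketch can show what "publish, not call" looks like in code. The EventBus class and the handler names below are hypothetical, and the bus dispatches synchronously inside one process purely for illustration; a real platform such as Kafka delivers events asynchronously through a broker.

```python
from collections import defaultdict


class EventBus:
    """Toy in-process publish/subscribe bus (illustrative only)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        # The publisher does not know who is listening, or how many listeners exist.
        for handler in self._subscribers[event_type]:
            handler(payload)


reserved, notified, counted = [], [], []

bus = EventBus()
bus.subscribe("order_placed", lambda order: reserved.append(order["sku"]))       # inventory
bus.subscribe("order_placed", lambda order: notified.append(order["customer"]))  # notifications
bus.subscribe("order_placed", lambda order: counted.append(1))                   # analytics

# The order service publishes once; it never calls the other services directly.
bus.publish("order_placed", {"sku": "A-100", "customer": "dana"})
```

Adding a fourth consumer means one more subscribe call. The order service's code does not change, which is the decoupling the answer describes.
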
The Curveball Follow-ups

After your initial answer, expect these probes

  • "What is the downside of event-driven?" Debugging is harder. When something goes wrong, you cannot trace a single request through a call stack. Instead you trace events across multiple systems. Distributed tracing tools (Jaeger, Zipkin) help.
  • "What is event sourcing?" Storing every state change as an immutable event instead of overwriting the current state. The current state is derived by replaying all events. Example: a bank account is a sequence of deposits and withdrawals, not a single balance number.
  • "When would you NOT use event-driven?" When you need a synchronous response: user login, payment processing, real-time API calls. If the caller needs an answer right now, request-response is the right pattern.
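
The bank account example for event sourcing can be sketched in a few lines. The AccountEvent type and the amounts are made up for illustration; the point is that the balance is never stored, only derived by replaying the events.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: events are immutable once written
class AccountEvent:
    kind: str    # "deposit" or "withdrawal"
    amount: int  # in cents, to avoid floating-point money


def balance(events):
    """Derive the current state by replaying every event in order."""
    total = 0
    for event in events:
        if event.kind == "deposit":
            total += event.amount
        elif event.kind == "withdrawal":
            total -= event.amount
        else:
            raise ValueError(f"unknown event kind: {event.kind}")
    return total


events = [
    AccountEvent("deposit", 10_000),
    AccountEvent("withdrawal", 2_500),
    AccountEvent("deposit", 500),
]

current = balance(events)           # 8000
as_of_second = balance(events[:2])  # 7500: replaying a prefix gives a past state
```

Replaying only a prefix of the log answers "what was the balance after the second transaction?", something a single overwritten balance column cannot do.
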
KEY TAKEAWAYS
Say: 'Event-driven: services publish events instead of calling each other. Consumers are independent and decoupled.'
The tradeoff: loose coupling and scalability vs harder debugging
Event sourcing stores every change. The current state is derived by replay.

Late-Arriving Data

Explain watermarks and late-data handling

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "What happens when events arrive late?"
  • "How do you handle out-of-order data?"
  • "What is a watermark?"

What They Want to Hear

'Late data arrives after the window it belongs to has already been processed. A click that happened at 11:58 PM might arrive at 12:03 AM, after the 11 PM hourly window closed. I handle this with watermarks: a threshold on how long I am willing to wait for late events. With a 10-minute watermark delay, I keep each window open for 10 extra minutes, so the 11:58 PM click still lands in the 11 PM window. Events that arrive after the watermark has passed are either dropped or sent to a dead letter queue for reprocessing.' That is the answer. Late data is inevitable. Watermarks define how long you wait. After that, the event is dropped or goes to a dead letter queue.

What to Whiteboard

Event: 11:58 PM click, arrives at 12:03 AM
  → Watermark check: is 12:03 within 10 min of window close?
      → within watermark → Accept: include in 11 PM window
      → past watermark → Dead Letter Queue: too late; process separately
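
The whiteboard check can be written as a small function. This sketch follows the simplified model above and compares the arrival time to the window close plus the allowed delay. Engines such as Flink and Spark instead advance the watermark from the event times they have seen, so treat this as an approximation. The function names and the calendar dates are assumptions for illustration.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=1)
ALLOWED_LATENESS = timedelta(minutes=10)


def window_start(event_time):
    """Hourly windows keyed on event time, not arrival time."""
    return event_time.replace(minute=0, second=0, microsecond=0)


def route(event_time, arrival_time):
    """Return ("accept", window) or ("dead_letter", window) for one event."""
    start = window_start(event_time)
    window_close = start + WINDOW
    if arrival_time <= window_close + ALLOWED_LATENESS:
        return ("accept", start)
    return ("dead_letter", start)


click = datetime(2024, 1, 1, 23, 58)  # hypothetical date; 11:58 PM event time

on_time = route(click, arrival_time=datetime(2024, 1, 2, 0, 3))    # 12:03 AM
too_late = route(click, arrival_time=datetime(2024, 1, 2, 0, 15))  # 12:15 AM
```

The 12:03 AM arrival is accepted into the 11 PM window because it is within 10 minutes of the window close. The 12:15 AM arrival is routed to the dead letter path but still carries its window, so a later backfill knows which hour to correct.
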
The Curveball Follow-ups

After your initial answer, expect these probes

  • "How do you choose the watermark duration?" Based on observed lateness. If 99% of events arrive within 5 minutes, a 10-minute watermark catches nearly everything. Longer watermarks mean more correct results but higher latency and memory usage.
  • "What is the difference between event time and processing time?" Event time is when the event actually happened (the click timestamp). Processing time is when your system processes it. Aggregate on event time, or a click from 11:58 PM that arrives at 12:03 AM gets counted in the wrong hour.
  • "What if late data is critical and cannot be dropped?" Send it to a dead letter queue and run a separate batch job to backfill the affected windows. This gives you the speed of streaming with the correctness of batch.
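
Choosing the watermark duration "based on observed lateness" can be sketched as a percentile calculation. The sample below is synthetic and the helper names are hypothetical; in practice the lateness values would come from comparing arrival timestamps with event timestamps in your own data.

```python
import math


def lateness_percentile(lateness_seconds, pct):
    """Nearest-rank percentile of observed lateness (arrival time minus event time)."""
    ordered = sorted(lateness_seconds)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def coverage(lateness_seconds, delay_seconds):
    """Fraction of events that a given watermark delay would have accepted."""
    on_time = sum(1 for value in lateness_seconds if value <= delay_seconds)
    return on_time / len(lateness_seconds)


# Synthetic sample: 90 events 10 s late, 9 events 2 min late, 1 event 15 min late.
sample = [10] * 90 + [120] * 9 + [900]

p99 = lateness_percentile(sample, 99)        # 120 seconds
ten_minute_coverage = coverage(sample, 600)  # 0.99
```

In this sample a 10-minute delay accepts 99% of events. Catching the last straggler would mean waiting 15 minutes, which is the correctness-versus-latency tradeoff in numbers.
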
TIP
The two words that impress interviewers on streaming questions: event time and watermark. Always use event time for aggregations, and always mention your watermark strategy.
KEY TAKEAWAYS
Say: 'Watermarks define how late I wait. Events past the watermark go to a dead letter queue.'
Always aggregate on event time, not processing time
The watermark tradeoff: longer = more correct, but higher latency and memory

Dead Letter Queues

Explain dead letter queues and when to retry vs send to DLQ

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "What happens to records that fail processing?"
  • "How do you handle poison messages?"
  • "What is a dead letter queue?"

What They Want to Hear

'A dead letter queue (DLQ) is where events go when they cannot be processed. Instead of crashing the pipeline or blocking the stream, the bad event is moved to a separate topic for investigation. This keeps the main pipeline flowing. I monitor DLQ depth as a health metric: if it grows, something is systematically wrong. I reprocess DLQ events after fixing the root cause.' That is the answer. DLQ = safety valve. Monitor depth. Fix root cause, then replay.

What to Whiteboard

Event Stream (1000 events/sec)
  → Process Event (parse, validate, transform)
      → success → Target Table (998 events succeed)
      → failure → Dead Letter Queue (2 events failed)
          → after root cause fix → Investigate + Replay (fix cause, reprocess)
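
The flow above might look like this in code. It is a minimal sketch with in-memory lists standing in for the target table and the DLQ topic; the record fields (user_id, amount) and the function names are assumptions for illustration.

```python
import json


def process(raw):
    """Parse, validate, transform. Raises on any record that cannot be handled."""
    record = json.loads(raw)  # malformed JSON raises JSONDecodeError (a ValueError)
    amount = record["amount"]  # a missing field raises KeyError
    if not isinstance(amount, (int, float)) or amount < 0:
        raise ValueError(f"invalid amount: {amount!r}")
    return {"user_id": record["user_id"], "amount": amount}


def run(stream):
    target, dlq = [], []
    for raw in stream:
        try:
            target.append(process(raw))
        except (ValueError, KeyError, TypeError) as exc:
            # Keep the original payload and the reason so the event can be
            # investigated and replayed after the root cause is fixed.
            dlq.append({"raw": raw, "error": repr(exc)})
    return target, dlq


stream = [
    '{"user_id": 1, "amount": 25}',
    'not json at all',
    '{"user_id": 2, "amount": -5}',
    '{"user_id": 3, "amount": 40}',
]
target, dlq = run(stream)
dlq_depth = len(dlq)  # the health metric to monitor
```

The two bad records land in the DLQ with their error attached while the two good records flow through, so one poison message never blocks the stream.
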
The Curveball Follow-ups

After your initial answer, expect these probes

  • "What causes events to land in the DLQ?" Schema mismatch (unexpected field types), malformed JSON, business rule violations (negative amounts), or downstream system unavailability.
  • "How many retries before sending to DLQ?" Typically 3 retries with exponential backoff. If the event fails all 3, it goes to DLQ. Transient errors (timeout, rate limit) are worth retrying. Permanent errors (bad schema) should go to DLQ immediately.
  • "What if the DLQ itself fills up?" Alert immediately. A growing DLQ means the root cause is not transient. Pause investigation of individual events and focus on the systemic issue first.
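
The retry rule above (retry transient errors with exponential backoff, send permanent errors straight to the DLQ) can be sketched as follows. The exception classes, the handle function, and its parameters are hypothetical names for illustration, and the sleep function is injectable so the example can run without actually waiting.

```python
import time


class TransientError(Exception):
    """Worth retrying: timeout, rate limit."""


class PermanentError(Exception):
    """Retrying will not help: bad schema, malformed payload."""


def handle(event, process, dlq, max_retries=3, base_delay=0.1, sleep=time.sleep):
    attempt = 0
    while True:
        try:
            return process(event)
        except PermanentError as exc:
            dlq.append((event, repr(exc)))  # skip retries entirely
            return None
        except TransientError as exc:
            if attempt == max_retries:
                dlq.append((event, repr(exc)))  # retries exhausted
                return None
            sleep(base_delay * 2 ** attempt)  # 1x, 2x, 4x the base delay
            attempt += 1


def always_times_out(event):
    raise TransientError("timeout")


def bad_schema(event):
    raise PermanentError("unexpected field type")


dlq, delays = [], []
handle("evt-1", always_times_out, dlq, base_delay=1, sleep=delays.append)
handle("evt-2", bad_schema, dlq, base_delay=1, sleep=delays.append)
```

Here evt-1 is attempted four times (the first try plus three retries) with waits of 1, 2, and 4 units before it goes to the DLQ, while evt-2 goes there immediately with no waiting.
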
KEY TAKEAWAYS
Say: 'Dead letter queue: failed events go to a separate topic instead of blocking the pipeline. Monitor depth, fix root cause, replay.'
3 retries with backoff, then DLQ. Permanent errors skip retries.
DLQ depth is a health metric. Growing depth = systemic problem.

Micro-Batch vs True Streaming

Explain micro-batch vs true streaming and pick the right one

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "What is the difference between Spark Streaming and Flink?"
  • "Is Spark Structured Streaming real streaming?"
  • "When is micro-batch good enough?"

What They Want to Hear

'Micro-batch processes events in small time windows, typically every few seconds. Spark Structured Streaming uses this model. True streaming processes each event as it arrives with no batching delay. Flink uses this model. The practical difference is latency: micro-batch has a floor around 100 milliseconds. True streaming can process in single-digit milliseconds. For most use cases, micro-batch is good enough and simpler to operate.' That is the answer. Micro-batch = small windows, 100ms floor, simpler. True streaming = per-event, sub-10ms, more complex.
Micro-Batch (Spark)
  • Processes in small time windows (1-10 seconds)
  • Latency floor: ~100 milliseconds
  • Uses the same Spark engine as batch
  • Easier to operate if your team knows Spark
True Streaming (Flink)
  • Processes each event individually
  • Latency: single-digit milliseconds
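  • Exactly-once state consistency through checkpointing (Spark Structured Streaming can also provide exactly-once with replayable sources and idempotent sinks)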
  • Steeper learning curve, but more powerful for streaming
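
One way to see where micro-batch latency comes from is to model only the wait for the next trigger. This is a deliberately simplified sketch: it ignores processing, scheduling, and network time, and the arrival times and function names are made up for illustration.

```python
def micro_batch_wait(arrival_ms, trigger_interval_ms):
    """Milliseconds each event waits for the next trigger boundary."""
    waits = []
    for t in arrival_ms:
        # Ceiling division with integers: the first boundary at or after t.
        next_trigger = -(-t // trigger_interval_ms) * trigger_interval_ms
        waits.append(next_trigger - t)
    return waits


def per_event_wait(arrival_ms):
    """In a per-event engine, nothing waits for a batch boundary."""
    return [0 for _ in arrival_ms]


arrivals = [50, 900, 1000, 1001]  # hypothetical arrival times in ms
batch_waits = micro_batch_wait(arrivals, trigger_interval_ms=1000)
stream_waits = per_event_wait(arrivals)
```

With a 1-second trigger, the event that arrives at 1001 ms waits 999 ms for the next batch, while the one at 1000 ms waits 0 ms. Shrinking the trigger interval shrinks the worst case, but each batch still carries fixed scheduling overhead, which is where the micro-batch latency floor comes from.
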
The Curveball Follow-ups

After your initial answer, expect these probes

  • "When does latency actually matter?" Fraud detection (block the transaction before it completes), live bidding (ad auction in 50ms), driver matching (Uber needs real-time location). For dashboards, alerting, and analytics, 100ms micro-batch is fine.
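  • "Can Spark do true streaming?" Not in its default mode. Spark Structured Streaming runs as micro-batches by default. It also has a continuous processing mode, enabled with trigger(continuous='1 second'), that is not micro-batch and targets roughly 1 ms latency. That mode is experimental, limited to map-like operations such as projections and filters, and offers at-least-once rather than exactly-once guarantees. If you need true per-event streaming with stateful processing, use Flink.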
  • "Why not always use Flink?" Operational complexity. If your team already runs Spark for batch, adding Flink means learning a new framework, new deployment, new monitoring. Use Spark Streaming for most cases and Flink only when latency requirements demand it.
KEY TAKEAWAYS
Say: 'Micro-batch (Spark) for most cases: 100ms latency, simpler ops. True streaming (Flink) when sub-10ms matters.'
Spark Structured Streaming is micro-batch by default. Its continuous mode is experimental and limited, so do not count on it for per-event streaming.
Choose based on latency needs AND team expertise. Flink is powerful but has a learning curve.

Answer the Kafka and streaming questions with confidence

Category: Pipeline Architecture
Difficulty: beginner
Duration: 20 minutes
Challenges: 0 hands-on challenges

Topics covered: Event Platforms, Event-Driven Architecture, Late-Arriving Data, Dead Letter Queues, Micro-Batch vs True Streaming
