Streaming Systems: Intermediate

You explained Kafka. Now: 'Your consumer group is rebalancing every 30 seconds. Why?' Or: 'How do you guarantee exactly-once processing?' Or: 'Design a streaming pipeline that handles 100K events/sec with late data.' These questions test whether you have operated streaming systems, not just read about them.

Consumer Groups and Offsets

Daily Life
Interviews

Explain offset management and exactly-once delivery patterns

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "How do you guarantee exactly-once processing?"
  • "Your consumer group rebalances every 30 seconds. Why?"
  • "What happens when you add a consumer to the group?"

What They Want to Hear

'Each consumer in a group tracks its position in the partition using offsets. After processing a batch of events, the consumer commits its offset. If the consumer crashes, it restarts from the last committed offset. This gives at-least-once delivery by default. For exactly-once, I use the transactional pattern: commit the offset and write the output in the same transaction, so either both happen or neither does.' This is the answer that shows you understand the offset commit lifecycle and why exactly-once requires transactional writes.
Offset StrategyHow It WorksGuarantee
Auto-commitOffset committed on timer (every 5s)At-most-once: may lose events on crash
Manual commit after processingCommit after successful processingAt-least-once: may duplicate on crash
Transactional (commit + write atomically)Offset and output in same transactionExactly-once: no loss, no duplicates
The Curveball Follow-ups

After your initial answer, expect these probes

  • "What causes frequent rebalancing?" Three common causes: (1) Consumer processing takes longer than max.poll.interval.ms, so the broker thinks the consumer is dead. Increase the interval or reduce batch size. (2) Unstable consumers joining and leaving. (3) Partition count changed, triggering reassignment.
  • "What if you need to reprocess from the beginning?" Reset the consumer group offset to the earliest position. Kafka retains events for the configured retention period (default 7 days), so you can replay anything within that window.
  • "How do you handle exactly-once with a non-transactional sink?" Idempotent writes. Use a unique event ID as a dedup key. The sink handles duplicates by ignoring events it has already seen.
KEY TAKEAWAYS
Say: 'At-least-once by default. Exactly-once requires transactional offset + write, or idempotent sinks.'
Frequent rebalancing usually means max.poll.interval.ms is too short for the processing time
Reset offsets to replay. Kafka retention defines the replay window.

Event Sourcing Patterns

Daily Life
Interviews

Explain event sourcing, CQRS, and when each applies

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "What is event sourcing and when do you use it?"
  • "What is CQRS?"
  • "How do you rebuild state from events?"

What They Want to Hear

'Event sourcing stores every state change as an immutable event. The current state is derived by replaying events from the beginning. I pair it with CQRS: Command Query Responsibility Segregation, where writes go to the event log and reads come from a materialized view that is built by processing the event stream. This separates the write model from the read model, allowing each to be optimized independently.' This is the answer that shows you understand event sourcing as an architectural pattern, not just a buzzword.
What to Whiteboard
append eventstream eventsupdate materialized view
Command
Place Order
Event Log
Immutable, append-only
Projection
Materialized view builder
Read Model
Current state for queries
TIP
The Curveball Follow-ups

After your initial answer, expect these probes

  • "When does event sourcing not make sense?" When the event history has no business value. A shopping cart does not need event sourcing: just store the current cart. An audit log for financial transactions does: regulators require the full history.
  • "How do you handle event schema evolution?" Version your events. Event v1 and v2 can coexist in the same log. The projection handles both versions with an upcaster that translates old formats to new.
  • "What about the event log growing forever?" Snapshotting. Periodically write the current state as a snapshot event. On replay, start from the latest snapshot instead of event zero. This bounds replay time.
KEY TAKEAWAYS
Say: 'Event sourcing: immutable event log as the source of truth. CQRS: separate write model from read model.'
Use event sourcing when the history has business value (audit, compliance, analytics)
Snapshots prevent unbounded replay time

Windowing and Watermarks

Daily Life
Interviews

Design windowed aggregations with the right watermark strategy

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "Design a streaming aggregation with late data handling."
  • "What are tumbling vs sliding vs session windows?"
  • "How do you set the right watermark duration?"

What They Want to Hear

'I choose the window type based on the use case. Tumbling windows are fixed, non-overlapping intervals: count clicks per hour. Sliding windows overlap: count clicks in the last hour, updated every 5 minutes. Session windows are activity-based: group events that are close together in time, with a gap timeout. I set the watermark based on observed lateness: if 99th percentile lateness is 5 minutes, I set the watermark to 10 minutes to catch stragglers with margin.' This is the answer that shows you can match window type to use case and set watermarks from data.
Window TypeBehaviorUse Case
TumblingFixed size, no overlap (e.g., every hour)Hourly revenue, daily active users
SlidingFixed size, overlaps (e.g., 1 hour window, 5 min slide)Rolling averages, trending topics
SessionDynamic size, grouped by activity gapsUser sessions, conversation threads
1# Spark Structured Streaming: tumbling window with watermark
2from pyspark.sql.functions import window, col
3
4events \
5 .withWatermark("event_time", "10 minutes") \
6 .groupBy(
7 window(col("event_time"), "1 hour"),
8 col("page_id")
9 ) \
10 .count()
The Curveball Follow-ups

After your initial answer, expect these probes

  • "Your watermark is 10 minutes but some events arrive 2 hours late. What do you do?" Accept that those events will not be in the streaming output. Route them to a DLQ and run a separate batch correction job that updates the affected windows daily. Do not extend the watermark to 2 hours: it would hold too much state in memory.
  • "How do session windows handle very long sessions?" Set a session timeout (e.g., 30 minutes of inactivity). If a user is active for 4 hours, it is one long session. Set a maximum session duration to cap memory usage: force-close sessions after 24 hours.
  • "What is the cost of a longer watermark?" More state in memory. Each open window and its partial aggregation must be held in the state store until the watermark passes. A 10-minute watermark holds 10 minutes of windows; a 2-hour watermark holds 2 hours of windows.
KEY TAKEAWAYS
Say: 'Tumbling for fixed intervals, sliding for rolling averages, session for user activity. Watermark from observed p99 lateness.'
Longer watermarks hold more state in memory. Do not over-extend for rare late events.
Extremely late events: DLQ + batch correction. Do not compromise streaming latency.

DLQ Patterns

Daily Life
Interviews

Design DLQ reprocessing with error classification

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "Design a DLQ reprocessing workflow."
  • "How do you distinguish transient errors from permanent errors?"
  • "Your DLQ has 500K events. Walk me through remediation."

What They Want to Hear

'I structure DLQ events with metadata: the original event, the error message, the retry count, and the timestamp. For reprocessing, I classify errors first. Transient errors (timeout, rate limit) go back to the main topic after a delay. Permanent errors (schema mismatch, invalid data) require a code fix before replay. I never blindly replay the entire DLQ: that just reproduces the same errors and wastes compute.' This is the answer that shows you have actually dealt with a DLQ in production.
Error TypeExamplesAction
TransientTimeout, rate limit, network blipRetry after backoff, then replay to main topic
Permanent (data)Invalid JSON, missing required fieldFix data manually or with a transform, then replay
Permanent (code)Bug in processing logicFix code, deploy, then replay affected events
Poison pillEvent that crashes the consumerIsolate immediately, investigate separately
TIP
The Curveball Follow-ups

After your initial answer, expect these probes

  • "How do you detect a poison pill event?" A poison pill crashes the consumer, which restarts and crashes again on the same event. Detect it by tracking the event offset across restarts. If the same offset fails 3 times, route it to DLQ and skip. Without this pattern, one bad event blocks the entire partition.
  • "How do you prioritize DLQ remediation?" By business impact. Events affecting revenue (orders, payments) get fixed first. Events affecting analytics (page views, clicks) can wait. Track DLQ events by topic and error type to prioritize.
  • "What is the DLQ schema?" Original event payload, original topic, partition, offset, error message, exception stack trace, retry count, first failure timestamp, last failure timestamp. This gives you everything needed to diagnose and replay.
KEY TAKEAWAYS
Say: 'Classify errors first. Transient: retry. Permanent: fix then replay. Never blindly replay the entire DLQ.'
Poison pill detection: if the same offset fails 3 times, skip to DLQ immediately
DLQ events include full metadata: original event, error, retry count, timestamps

Spark Streaming vs Flink

Daily Life
Interviews

Compare Spark Streaming and Flink with concrete tradeoffs

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "Why did you choose Spark Streaming over Flink?"
  • "What is the latency floor for Spark Structured Streaming?"
  • "How does Flink handle state differently than Spark?"

What They Want to Hear

'Spark Structured Streaming is micro-batch: it collects events for a trigger interval (e.g., 10 seconds), processes them as a batch, then starts the next interval. Flink processes each event as it arrives with continuous operator pipelines. The practical differences: Spark has a 100ms latency floor and simpler state management. Flink has sub-10ms latency, more powerful windowing (session windows, event-time processing), and built-in savepoints for state migration. I choose Spark when the team already runs Spark for batch and latency requirements are relaxed. I choose Flink when latency is critical or the streaming logic requires complex stateful processing.' This is the answer that shows you can make a technology choice with concrete tradeoff reasoning.
DimensionSpark Structured StreamingApache Flink
Processing modelMicro-batchTrue per-event streaming
Latency100ms+ (trigger interval)Single-digit milliseconds
State managementState store backed by HDFS/S3RocksDB state backend, local
Exactly-onceCheckpoint + idempotent sinkCheckpoint barriers + 2-phase commit
Session windowsNot natively supportedFirst-class support
Batch + streamingSame API for bothSeparate DataStream and Table APIs
Operational complexityFamiliar if team uses SparkNew deployment, monitoring, tuning
Best Practice
    Avoid
      The Curveball Follow-ups

      After your initial answer, expect these probes

      • "What are Flink savepoints and why do they matter?" Savepoints are user-triggered snapshots of the entire application state. You can stop a Flink job, modify the code, and restart from the savepoint without losing state. Spark does not have an equivalent: restarting from a checkpoint requires the same code.
      • "How does Spark's trigger interval affect throughput?" Shorter intervals (1 second) give lower latency but higher scheduling overhead. Longer intervals (30 seconds) give higher throughput but higher latency. The sweet spot depends on event volume: high-volume streams benefit from longer intervals.
      • "Can you migrate from Spark Streaming to Flink?" Yes, but it requires rewriting the processing logic. The state is not portable between them. Plan a parallel-run period where both systems process the same events and you compare outputs.
      KEY TAKEAWAYS
      Say: 'Spark for micro-batch (100ms+, simpler ops). Flink for true streaming (sub-10ms, complex state).'
      Flink savepoints allow code changes without state loss. Spark checkpoints do not.
      Choose based on latency needs AND team expertise. Flink's power comes with operational complexity.

      Master offset management, consumer groups, and streaming failure modes

      Category
      Pipeline Architecture
      Difficulty
      intermediate
      Duration
      25 minutes
      Challenges
      0 hands-on challenges

      Topics covered: Consumer Groups and Offsets, Event Sourcing Patterns, Windowing and Watermarks, DLQ Patterns, Spark Streaming vs Flink

      Lesson Sections

      1. Consumer Groups and Offsets (concepts: paEventPlatforms)

        What They Want to Hear 'Each consumer in a group tracks its position in the partition using offsets. After processing a batch of events, the consumer commits its offset. If the consumer crashes, it restarts from the last committed offset. This gives at-least-once delivery by default. For exactly-once, I use the transactional pattern: commit the offset and write the output in the same transaction, so either both happen or neither does.' This is the answer that shows you understand the offset comm

      2. Event Sourcing Patterns (concepts: paEventDriven)

        What They Want to Hear 'Event sourcing stores every state change as an immutable event. The current state is derived by replaying events from the beginning. I pair it with CQRS: Command Query Responsibility Segregation, where writes go to the event log and reads come from a materialized view that is built by processing the event stream. This separates the write model from the read model, allowing each to be optimized independently.' This is the answer that shows you understand event sourcing as

      3. Windowing and Watermarks (concepts: paLateData)

        What They Want to Hear 'I choose the window type based on the use case. Tumbling windows are fixed, non-overlapping intervals: count clicks per hour. Sliding windows overlap: count clicks in the last hour, updated every 5 minutes. Session windows are activity-based: group events that are close together in time, with a gap timeout. I set the watermark based on observed lateness: if 99th percentile lateness is 5 minutes, I set the watermark to 10 minutes to catch stragglers with margin.' This is t

      4. DLQ Patterns (concepts: paDeadLetterQueue)

        What They Want to Hear 'I structure DLQ events with metadata: the original event, the error message, the retry count, and the timestamp. For reprocessing, I classify errors first. Transient errors (timeout, rate limit) go back to the main topic after a delay. Permanent errors (schema mismatch, invalid data) require a code fix before replay. I never blindly replay the entire DLQ: that just reproduces the same errors and wastes compute.' This is the answer that shows you have actually dealt with a

      5. Spark Streaming vs Flink (concepts: paMicroBatchVsTrue)

        What They Want to Hear 'Spark Structured Streaming is micro-batch: it collects events for a trigger interval (e.g., 10 seconds), processes them as a batch, then starts the next interval. Flink processes each event as it arrives with continuous operator pipelines. The practical differences: Spark has a 100ms latency floor and simpler state management. Flink has sub-10ms latency, more powerful windowing (session windows, event-time processing), and built-in savepoints for state migration. I choose