Streaming Systems: Intermediate
You explained Kafka. Now: 'Your consumer group is rebalancing every 30 seconds. Why?' Or: 'How do you guarantee exactly-once processing?' Or: 'Design a streaming pipeline that handles 100K events/sec with late data.' These questions test whether you have operated streaming systems, not just read about them.
Consumer Groups and Offsets
Explain offset management and exactly-once delivery patterns
When you hear these in an interview, this is the concept being tested
- ▸"How do you guarantee exactly-once processing?"
- ▸"Your consumer group rebalances every 30 seconds. Why?"
- ▸"What happens when you add a consumer to the group?"
What They Want to Hear
| Offset Strategy | How It Works | Guarantee |
|---|---|---|
| Auto-commit | Offset committed on timer (every 5s) | At-most-once: may lose events on crash |
| Manual commit after processing | Commit after successful processing | At-least-once: may duplicate on crash |
| Transactional (commit + write atomically) | Offset and output in same transaction | Exactly-once: no loss, no duplicates |
After your initial answer, expect these probes
- ▸"What causes frequent rebalancing?" Three common causes: (1) Consumer processing takes longer than max.poll.interval.ms, so the broker thinks the consumer is dead. Increase the interval or reduce batch size. (2) Unstable consumers joining and leaving. (3) Partition count changed, triggering reassignment.
- ▸"What if you need to reprocess from the beginning?" Reset the consumer group offset to the earliest position. Kafka retains events for the configured retention period (default 7 days), so you can replay anything within that window.
- ▸"How do you handle exactly-once with a non-transactional sink?" Idempotent writes. Use a unique event ID as a dedup key. The sink handles duplicates by ignoring events it has already seen.
Event Sourcing Patterns
Explain event sourcing, CQRS, and when each applies
When you hear these in an interview, this is the concept being tested
- ▸"What is event sourcing and when do you use it?"
- ▸"What is CQRS?"
- ▸"How do you rebuild state from events?"
What They Want to Hear
After your initial answer, expect these probes
- ▸"When does event sourcing not make sense?" When the event history has no business value. A shopping cart does not need event sourcing: just store the current cart. An audit log for financial transactions does: regulators require the full history.
- ▸"How do you handle event schema evolution?" Version your events. Event v1 and v2 can coexist in the same log. The projection handles both versions with an upcaster that translates old formats to new.
- ▸"What about the event log growing forever?" Snapshotting. Periodically write the current state as a snapshot event. On replay, start from the latest snapshot instead of event zero. This bounds replay time.
Windowing and Watermarks
Design windowed aggregations with the right watermark strategy
When you hear these in an interview, this is the concept being tested
- ▸"Design a streaming aggregation with late data handling."
- ▸"What are tumbling vs sliding vs session windows?"
- ▸"How do you set the right watermark duration?"
What They Want to Hear
| Window Type | Behavior | Use Case |
|---|---|---|
| Tumbling | Fixed size, no overlap (e.g., every hour) | Hourly revenue, daily active users |
| Sliding | Fixed size, overlaps (e.g., 1 hour window, 5 min slide) | Rolling averages, trending topics |
| Session | Dynamic size, grouped by activity gaps | User sessions, conversation threads |
After your initial answer, expect these probes
- ▸"Your watermark is 10 minutes but some events arrive 2 hours late. What do you do?" Accept that those events will not be in the streaming output. Route them to a DLQ and run a separate batch correction job that updates the affected windows daily. Do not extend the watermark to 2 hours: it would hold too much state in memory.
- ▸"How do session windows handle very long sessions?" Set a session timeout (e.g., 30 minutes of inactivity). If a user is active for 4 hours, it is one long session. Set a maximum session duration to cap memory usage: force-close sessions after 24 hours.
- ▸"What is the cost of a longer watermark?" More state in memory. Each open window and its partial aggregation must be held in the state store until the watermark passes. A 10-minute watermark holds 10 minutes of windows; a 2-hour watermark holds 2 hours of windows.
DLQ Patterns
Design DLQ reprocessing with error classification
When you hear these in an interview, this is the concept being tested
- ▸"Design a DLQ reprocessing workflow."
- ▸"How do you distinguish transient errors from permanent errors?"
- ▸"Your DLQ has 500K events. Walk me through remediation."
What They Want to Hear
| Error Type | Examples | Action |
|---|---|---|
| Transient | Timeout, rate limit, network blip | Retry after backoff, then replay to main topic |
| Permanent (data) | Invalid JSON, missing required field | Fix data manually or with a transform, then replay |
| Permanent (code) | Bug in processing logic | Fix code, deploy, then replay affected events |
| Poison pill | Event that crashes the consumer | Isolate immediately, investigate separately |
After your initial answer, expect these probes
- ▸"How do you detect a poison pill event?" A poison pill crashes the consumer, which restarts and crashes again on the same event. Detect it by tracking the event offset across restarts. If the same offset fails 3 times, route it to DLQ and skip. Without this pattern, one bad event blocks the entire partition.
- ▸"How do you prioritize DLQ remediation?" By business impact. Events affecting revenue (orders, payments) get fixed first. Events affecting analytics (page views, clicks) can wait. Track DLQ events by topic and error type to prioritize.
- ▸"What is the DLQ schema?" Original event payload, original topic, partition, offset, error message, exception stack trace, retry count, first failure timestamp, last failure timestamp. This gives you everything needed to diagnose and replay.
Spark Streaming vs Flink
Compare Spark Streaming and Flink with concrete tradeoffs
When you hear these in an interview, this is the concept being tested
- ▸"Why did you choose Spark Streaming over Flink?"
- ▸"What is the latency floor for Spark Structured Streaming?"
- ▸"How does Flink handle state differently than Spark?"
What They Want to Hear
| Dimension | Spark Structured Streaming | Apache Flink |
|---|---|---|
| Processing model | Micro-batch | True per-event streaming |
| Latency | 100ms+ (trigger interval) | Single-digit milliseconds |
| State management | State store backed by HDFS/S3 | RocksDB state backend, local |
| Exactly-once | Checkpoint + idempotent sink | Checkpoint barriers + 2-phase commit |
| Session windows | Not natively supported | First-class support |
| Batch + streaming | Same API for both | Separate DataStream and Table APIs |
| Operational complexity | Familiar if team uses Spark | New deployment, monitoring, tuning |
After your initial answer, expect these probes
- ▸"What are Flink savepoints and why do they matter?" Savepoints are user-triggered snapshots of the entire application state. You can stop a Flink job, modify the code, and restart from the savepoint without losing state. Spark does not have an equivalent: restarting from a checkpoint requires the same code.
- ▸"How does Spark's trigger interval affect throughput?" Shorter intervals (1 second) give lower latency but higher scheduling overhead. Longer intervals (30 seconds) give higher throughput but higher latency. The sweet spot depends on event volume: high-volume streams benefit from longer intervals.
- ▸"Can you migrate from Spark Streaming to Flink?" Yes, but it requires rewriting the processing logic. The state is not portable between them. Plan a parallel-run period where both systems process the same events and you compare outputs.
Master offset management, consumer groups, and streaming failure modes
- Category
- Pipeline Architecture
- Difficulty
- intermediate
- Duration
- 25 minutes
- Challenges
- 0 hands-on challenges
Topics covered: Consumer Groups and Offsets, Event Sourcing Patterns, Windowing and Watermarks, DLQ Patterns, Spark Streaming vs Flink
Lesson Sections
- Consumer Groups and Offsets (concepts: paEventPlatforms)
What They Want to Hear 'Each consumer in a group tracks its position in the partition using offsets. After processing a batch of events, the consumer commits its offset. If the consumer crashes, it restarts from the last committed offset. This gives at-least-once delivery by default. For exactly-once, I use the transactional pattern: commit the offset and write the output in the same transaction, so either both happen or neither does.' This is the answer that shows you understand the offset comm
- Event Sourcing Patterns (concepts: paEventDriven)
What They Want to Hear 'Event sourcing stores every state change as an immutable event. The current state is derived by replaying events from the beginning. I pair it with CQRS: Command Query Responsibility Segregation, where writes go to the event log and reads come from a materialized view that is built by processing the event stream. This separates the write model from the read model, allowing each to be optimized independently.' This is the answer that shows you understand event sourcing as
- Windowing and Watermarks (concepts: paLateData)
What They Want to Hear 'I choose the window type based on the use case. Tumbling windows are fixed, non-overlapping intervals: count clicks per hour. Sliding windows overlap: count clicks in the last hour, updated every 5 minutes. Session windows are activity-based: group events that are close together in time, with a gap timeout. I set the watermark based on observed lateness: if 99th percentile lateness is 5 minutes, I set the watermark to 10 minutes to catch stragglers with margin.' This is t
- DLQ Patterns (concepts: paDeadLetterQueue)
What They Want to Hear 'I structure DLQ events with metadata: the original event, the error message, the retry count, and the timestamp. For reprocessing, I classify errors first. Transient errors (timeout, rate limit) go back to the main topic after a delay. Permanent errors (schema mismatch, invalid data) require a code fix before replay. I never blindly replay the entire DLQ: that just reproduces the same errors and wastes compute.' This is the answer that shows you have actually dealt with a
- Spark Streaming vs Flink (concepts: paMicroBatchVsTrue)
What They Want to Hear 'Spark Structured Streaming is micro-batch: it collects events for a trigger interval (e.g., 10 seconds), processes them as a batch, then starts the next interval. Flink processes each event as it arrives with continuous operator pipelines. The practical differences: Spark has a 100ms latency floor and simpler state management. Flink has sub-10ms latency, more powerful windowing (session windows, event-time processing), and built-in savepoints for state migration. I choose