How Do You Handle Failures?

Transactional Sinks, Barrier Snapshotting, and Failure as Architecture

The interviewer is testing a specific mindset: failure handling isn't a recovery strategy, it's an architectural input. You design the system assuming failures happen constantly, and the architecture's job is to make failures invisible to downstream consumers. If you say "we handle failures with try-catch and retries," you've capped your score. The right answer is: "Failures are part of the steady-state design."

Barrier snapshotting (a variant of the Chandy-Lamport algorithm) is how Flink achieves exactly-once state consistency. The JobManager's checkpoint coordinator triggers each checkpoint, and the sources inject checkpoint barriers into the data stream. When an operator has received the barrier from all of its input channels, it snapshots its state and forwards the barrier downstream. This guarantees that every event is counted exactly once in th