Failure Modes and Error Handling: Intermediate
Retries are not enough; failed messages need a home and downstream services need protection
- Category: Pipeline Architecture
- Difficulty: Intermediate
- Duration: 32 minutes
- Challenges: 0 hands-on challenges
Topics covered: Dead Letter Queue Basics; Retry Budgets: Max, Delay, Jitter; Circuit Breakers Stop the Hammer; Partial Failure in a Batch; Pipeline Handling All Three
Lesson Sections
- Dead Letter Queue Basics (concepts: paDeadLetterQueue)
A retry exhausts its budget and the message still has not been processed. The pipeline now faces a choice. It can drop the message, which loses data silently. It can crash and stop processing, which blocks every other message behind it. Or it can move the message somewhere else, somewhere a human can look at it later, while the pipeline continues processing the rest. The third option is the dead letter queue. The dead letter queue is the conventional name for the side channel that holds messages
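The third option can be sketched in a few lines. This is a minimal in-memory illustration, not a production design: `Pipeline`, `DeadLetter`, and `handler` are hypothetical names, and a real dead letter queue would usually be a durable topic or table rather than a Python list.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DeadLetter:
    """A failed message plus the context a human needs to triage it."""
    message: dict
    error: str
    attempts: int


@dataclass
class Pipeline:
    handler: Callable[[dict], None]
    max_attempts: int = 3
    dead_letters: list = field(default_factory=list)  # stand-in for a durable DLQ
    processed: int = 0

    def process(self, message: dict) -> bool:
        last_error = None
        for _ in range(self.max_attempts):
            try:
                self.handler(message)
                self.processed += 1
                return True
            except Exception as exc:  # sketch only: real code would filter error types
                last_error = exc
        # Budget exhausted: park the message and keep the pipeline moving.
        self.dead_letters.append(
            DeadLetter(message, repr(last_error), self.max_attempts)
        )
        return False


def handler(msg: dict) -> None:
    # Hypothetical validation that one message can never pass.
    if msg["amount"] < 0:
        raise ValueError("negative amount")


pipeline = Pipeline(handler)
for msg in [{"id": 1, "amount": 10}, {"id": 2, "amount": -5}, {"id": 3, "amount": 7}]:
    pipeline.process(msg)
```

Message 2 ends up in `pipeline.dead_letters` with its error and attempt count, while messages 1 and 3 are processed normally. Nothing is dropped silently and nothing blocks the queue.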
- Retry Budgets: Max, Delay, Jitter (concepts: paRetryBudget)
A retry budget is the explicit set of constraints that govern how a pipeline retries. The beginner tier defined the three numbers: maximum attempts, wait between attempts, and which errors retry. Production pipelines elaborate on those numbers with two more: a maximum cumulative delay across all attempts, and the jitter strategy used to desynchronize retry waves. A complete budget answers the question 'what is the worst case behavior of this retry policy' before the policy ever runs. Without tha
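One way to make the worst case explicit is to encode the budget as data and compute its bound before anything runs. The sketch below assumes exponential backoff with uniform ("full") jitter, which is only one of several possible jitter strategies. `RetryBudget` and its field names are hypothetical, and error classification is omitted.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class RetryBudget:
    max_attempts: int = 5          # total attempts, including the first
    base_delay: float = 0.5        # seconds; cap for the first wait
    max_delay: float = 8.0         # cap for any single wait
    max_total_delay: float = 20.0  # cap on cumulative waiting

    def cap(self, retry_index: int) -> float:
        """Upper bound on the wait before retry number retry_index (0-based)."""
        return min(self.max_delay, self.base_delay * 2 ** retry_index)

    def worst_case_total_delay(self) -> float:
        # max_attempts attempts means at most max_attempts - 1 waits.
        uncapped = sum(self.cap(i) for i in range(self.max_attempts - 1))
        return min(uncapped, self.max_total_delay)

    def delays(self, rng=random):
        """Yield jittered waits until the attempt or delay budget runs out."""
        spent = 0.0
        for i in range(self.max_attempts - 1):
            wait = rng.uniform(0.0, self.cap(i))  # jitter desynchronizes clients
            if spent + wait > self.max_total_delay:
                return
            spent += wait
            yield wait


budget = RetryBudget()
worst_case = budget.worst_case_total_delay()  # caps 0.5 + 1 + 2 + 4
waits = list(budget.delays(random.Random(0)))
```

With the default values the four waits are capped at 0.5, 1, 2, and 4 seconds, so the policy can never sleep more than 7.5 seconds in total, and that number is known before the first request is sent.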
- Circuit Breakers Stop the Hammer (concepts: paCircuitBreaker)
Retries protect against momentary failures of a single request. A circuit breaker protects against sustained failures of an entire downstream service. The motivating problem is the case where every request is failing. A retry budget keeps issuing requests, each one more painful for the downstream than the last. The downstream has been overloaded for fifteen minutes; sending more requests is not helpful. The circuit breaker pattern, popularized by Michael Nygard's book Release It!, says: if the do
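A compact sketch of the pattern follows. It assumes the common three-state reading (closed, open, half-open) with a consecutive-failure threshold and a fixed reset timeout. `CircuitBreaker`, `CircuitOpenError`, and the parameter names are hypothetical, and the clock is injectable so the behavior can be exercised without sleeping.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0      # consecutive failures seen while closed
        self.opened_at = None  # time the breaker last tripped, or None

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"  # allow a trial call through
        return "open"

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("downstream marked unhealthy; call skipped")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the breaker.
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the breaker and resets the count.
        self.failures = 0
        self.opened_at = None
        return result


def failing():
    raise ConnectionError("downstream unavailable")


now = [0.0]  # fake clock, in seconds
breaker = CircuitBreaker(failure_threshold=2, reset_timeout=10.0, clock=lambda: now[0])
observed = []

for _ in range(2):
    try:
        breaker.call(failing)
    except ConnectionError:
        pass
observed.append(breaker.state)  # tripped after two consecutive failures

now[0] = 11.0
observed.append(breaker.state)  # timeout elapsed, one trial call allowed

breaker.call(lambda: "ok")
observed.append(breaker.state)  # trial succeeded
```

While the breaker is open, callers fail fast with `CircuitOpenError` instead of adding load to a service that is already struggling.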
- Partial Failure in a Batch (concepts: paPartialFailure, paQuarantine)
A batch job processes ten thousand rows. One row fails. The question is what happens to the other 9,999. The two extreme answers are both common and both wrong. Failing the entire batch loses progress on every good row. Silently dropping the bad row hides a problem that might be a symptom of a larger issue. The right answer is somewhere in the middle, and choosing the right point on the spectrum is one of the most consequential decisions a pipeline designer makes about a given workload. Three St
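One middle point on that spectrum can be sketched as follows: process every row, quarantine the ones that fail together with their errors, and abort only if the failure ratio crosses a threshold. `process_batch`, `BatchResult`, `BatchAborted`, and the 1% default are illustrative assumptions, not values taken from the lesson.

```python
from dataclasses import dataclass, field


class BatchAborted(Exception):
    """Raised when too many rows fail for the batch to be trusted."""


@dataclass
class BatchResult:
    succeeded: list = field(default_factory=list)
    quarantined: list = field(default_factory=list)  # (row, error) pairs


def process_batch(rows, transform, max_failure_ratio=0.01) -> BatchResult:
    result = BatchResult()
    rows = list(rows)
    for row in rows:
        try:
            result.succeeded.append(transform(row))
        except Exception as exc:  # sketch only: narrow this in real code
            # Keep the bad row and the reason so it can be inspected later.
            result.quarantined.append((row, repr(exc)))
    # A high failure ratio suggests a systemic problem, not one bad row.
    if rows and len(result.quarantined) / len(rows) > max_failure_ratio:
        raise BatchAborted(f"{len(result.quarantined)} of {len(rows)} rows failed")
    return result


# 9,999 good rows and one row whose quantity cannot be parsed.
rows = [{"id": i, "qty": "1"} for i in range(9_999)] + [{"id": 9_999, "qty": "n/a"}]
result = process_batch(rows, lambda r: {**r, "qty": int(r["qty"])})
```

Here the 9,999 good rows land in `result.succeeded` and the one bad row is kept in `result.quarantined` with its error. If every row failed the same way, the ratio check would raise `BatchAborted` instead of quietly quarantining the whole batch.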
- Pipeline Handling All Three (concepts: paFailureComposition, paDeadLetterQueue, paCircuitBreaker)
Each pattern in isolation is straightforward. The hard part is composing them into a single pipeline that handles transient errors with backoff, permanent errors with a DLQ, and ambiguous errors with a bounded retry that escalates correctly. The example below is a streaming pipeline that consumes order events from Kafka, calls a downstream tax-calculation API, and writes the enriched events to Snowflake. It handles all three failure categories. Reading through the design end to end shows how the