
Failure Modes and Error Handling: Beginner

Some failures heal themselves and some never will; the pipeline must tell the difference

Category
Pipeline Architecture
Difficulty
beginner
Duration
25 minutes
Challenges
0 hands-on challenges

Topics covered: Transient vs Permanent Failures, The Retry: Easy to Misuse, Naive Retries and Thundering Herd, Exponential Backoff in One Sentence, When NOT to Retry

Lesson Sections

  1. Transient vs Permanent Failures (concepts: paFailureClassification)

    Every pipeline failure falls into one of two buckets. A transient failure is something that goes wrong because of a temporary condition: a network hiccup, a downstream service rebooting, a momentary rate limit. A permanent failure is something that will never succeed no matter how many times the pipeline tries: a bad credential, a row whose schema does not match, a malformed JSON document. The two buckets demand opposite responses. Treating a transient failure as permanent gives up too early; treating a permanent failure as transient retries forever and never succeeds.
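The two buckets can be made concrete as a small classifier. This is a minimal sketch with hypothetical exception names; a real pipeline would map the error types of its own client libraries into the same two buckets.

```python
class TransientError(Exception):
    """Temporary condition: worth retrying."""

class PermanentError(Exception):
    """Will never succeed: do not retry."""

def classify(exc: Exception) -> str:
    # Hypothetical mapping: timeouts and dropped connections are
    # transient; bad credentials and malformed input are permanent.
    if isinstance(exc, (TransientError, TimeoutError, ConnectionError)):
        return "transient"
    if isinstance(exc, (PermanentError, PermissionError, ValueError)):
        return "permanent"
    return "unknown"
```

Anything that falls into neither bucket is reported as "unknown" rather than guessed at, so an unfamiliar error surfaces instead of being silently retried.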

  2. The Retry: Easy to Misuse (concepts: paRetryHandling)

    The retry is the most basic failure-handling primitive. The mechanism is two lines of code: catch the exception, run the operation again. That simplicity is what makes the retry both the first tool engineers reach for and the most common source of subtle production bugs. A retry done correctly absorbs nearly all transient failures. A retry done carelessly amplifies an outage, runs forever, or quietly produces duplicate writes. The mechanics that distinguish the two are not complicated; they are unforgiving.

  3. Naive Retries and Thundering Herd (concepts: paThunderingHerd, paJitter)

    The thundering herd is the most cited failure mode in distributed systems and the most overlooked by engineers writing their first retry. The shape is straightforward. A downstream service slows down. Many clients fail at roughly the same moment. Each client retries on the same fixed schedule. The retries arrive at the downstream in a synchronized wave that is larger than the original load that caused the slowdown. The downstream goes from slow to dead. The retries then double in size again.
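The fix for the synchronized wave is jitter: randomize each client's wait so the retries spread out in time instead of arriving together. A sketch comparing the two schedules, assuming 100 clients that all failed at the same moment:

```python
import random

def fixed_schedule(clients, base):
    # Naive: every client retries after exactly the same wait,
    # so all retries land on the downstream at the same instant.
    return [base for _ in range(clients)]

def jittered_schedule(clients, base):
    # "Full jitter": each client waits a random amount in [0, base],
    # smearing the retry wave across the whole interval.
    return [random.uniform(0, base) for _ in range(clients)]
```

With the fixed schedule the downstream sees one spike of 100 requests; with jitter it sees roughly the same load spread over the interval, which is what a recovering service can actually absorb.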

  4. Exponential Backoff in One Sentence (concepts: paExponentialBackoff)

    Exponential backoff is the standard way to choose how long a retry should wait. The rule fits in one sentence: each successive attempt waits roughly twice as long as the previous one, capped at a maximum. The mechanism is everywhere because it solves two problems at once. It gives the downstream more time to recover with each failure. It bounds the total number of retries that can fit in a given time window. The cap prevents a runaway exponential from sleeping for days on the seventh retry.
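The one-sentence rule is a one-line function. The `base` and `cap` values below are placeholders; real values depend on the downstream's recovery characteristics:

```python
def backoff_delay(attempt, base=1.0, cap=60.0):
    """Wait for attempt 0, 1, 2, ... : base, 2*base, 4*base, ...
    capped so a late retry never sleeps absurdly long."""
    return min(cap, base * (2 ** attempt))
```

With `base=1.0` and `cap=60.0`, the waits run 1, 2, 4, 8, 16, 32, 60, 60, ... seconds: doubling until the cap takes over.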

  5. When NOT to Retry (concepts: paPermanentFailure, paPoisonPill)

    Retry as a tool is so often correct that engineers begin to apply it reflexively. The reflex causes outages of its own. Some failures will never succeed on a second attempt, and retrying them wastes compute, fills up logs, and hides the underlying problem. Knowing the categories where retrying is wrong is as important as knowing how to retry properly. The pipeline that retries correctly on transient errors and refuses to retry on permanent ones is the pipeline that operates predictably.
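The refusal to retry needs somewhere for the failing record to go; the common pattern is a dead-letter destination. A minimal sketch, assuming the permanent bucket is identified by exception type (here `ValueError` and `PermissionError` stand in for "malformed input" and "bad credential"; a real pipeline would use its own classification):

```python
def process(record, handler, dead_letters):
    """Run handler on record. Assumed-permanent failures route the
    record to dead_letters instead of retrying; everything else
    propagates so the caller's retry logic can handle it."""
    try:
        return handler(record)
    except (ValueError, PermissionError):
        dead_letters.append(record)  # poison pill: park it for inspection
        return None

# A handler that will never succeed, no matter how often it runs:
def malformed(record):
    raise ValueError("not valid JSON")
```

The poison pill is parked where an engineer can inspect it, the rest of the pipeline keeps moving, and transient errors still reach whatever retry wrapper sits above this function.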