Failure Modes and Error Handling: Beginner

A logistics startup ran a nightly job that pulled package events from a partner API and loaded them into Snowflake. One Tuesday at 2:14am the partner API returned a single 503 error on one of three thousand pages. The job crashed. The on-call engineer woke up, restarted the job, and went back to sleep. The next night the same page returned a 503 again, and the same engineer woke up again. After the third night the engineer added one line of code that retried the failed request once. The 503s stopped triggering on-call pages. They had not stopped happening. They had stopped being the engineer's problem. That one line is the smallest possible step from a brittle pipeline to a resilient one. Failure handling is the discipline that turns a pipeline from a script that works most of the time into a system that works through the times when something does not. This lesson covers the kinds of failures every pipeline encounters and the smallest correct response to each.

Transient vs Permanent Failures


Distinguish transient, permanent, and ambiguous failures and choose the response category for each.

Every pipeline failure falls into one of two buckets. A transient failure is something that goes wrong because of a temporary condition: a network hiccup, a downstream service rebooting, a momentary rate limit. A permanent failure is something that will never succeed no matter how many times the pipeline tries: a bad credential, a row whose schema does not match, a malformed JSON document. The two buckets demand opposite responses. Treating a transient as permanent gives up too early; treating a permanent as transient burns compute forever. The first move in any failure-handling design is naming which bucket a given error belongs to.

The Two Buckets

| Failure Type | What Causes It | Correct Response |
| --- | --- | --- |
| Transient | Temporary network blip, downstream restart, brief rate limit, transient resource contention | Wait, then retry; the next attempt usually succeeds |
| Permanent | Wrong credentials, malformed input row, schema mismatch, missing required field | Stop, alert, route the problem to a human or to a quarantine bucket called a dead letter queue (DLQ for short, covered fully in the intermediate tier) |
| Ambiguous | Generic 500 errors, undocumented HTTP responses, unclear exceptions | Retry a small bounded number of times, then escalate to permanent |

The third row is the row that matters most in practice. Ambiguous failures are the common case. A request returned a 500. The engineer does not know whether the downstream is briefly overloaded or permanently broken. The pipeline cannot know either. The pragmatic answer is to treat ambiguous as transient with a small budget: try a handful of times, and if every attempt fails, escalate the error to the permanent bucket. That hedge is the design that lives in production code at every mature data team.
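A minimal sketch of that hedge, assuming a hypothetical AmbiguousError class and a caller-supplied operation. The helper name, the budget of three attempts, and the one-second wait are illustrative choices, not taken from any particular library.

```python
import time


class PermanentError(Exception):
    """The next attempt will not work. Stop and report."""


class AmbiguousError(Exception):
    """Hypothetical class for errors that cannot be classified up front."""


def call_with_small_budget(operation, max_attempts=3, wait_seconds=1.0):
    """Treat ambiguous failures as transient until the budget is spent."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return operation()
        except AmbiguousError as exc:
            last_error = exc
            if attempt + 1 < max_attempts:
                time.sleep(wait_seconds)
    # Every attempt failed: escalate to the permanent bucket.
    raise PermanentError(f"gave up after {max_attempts} attempts") from last_error
```

Chaining the original exception with `from` keeps the last ambiguous error attached to the escalated one, so whoever handles the permanent failure can still see what the downstream actually returned.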

Concrete Examples From Real Pipelines

| Error | Bucket | Reasoning |
| --- | --- | --- |
| HTTP 503 Service Unavailable | Transient | By specification, 503 means the server is temporarily unable to handle the request |
| HTTP 401 Unauthorized | Permanent | The credential is wrong; retrying with the same credential will keep failing |
| HTTP 429 Too Many Requests | Transient (with mandatory backoff) | Rate limit; retry after the Retry-After interval, never sooner |
| JSON parse error on a single row | Permanent for that row | The row will not parse on the next attempt; route it to a quarantine path |
| Connection reset by peer | Transient | The TCP connection dropped; the next attempt opens a new connection |
| Foreign key violation on insert | Permanent | The referenced row does not exist; retrying does not change that |

Why naming the bucket comes first:
  • The retry mechanism is the same for every transient error; only the wait time varies
  • The escape hatch is the same for every permanent error: stop and report
  • Code that mixes the two responses ends up retrying credential errors forever and giving up on network blips after one try
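The 429 row in the table above says to wait for the Retry-After interval. The sketch below shows one way that wait might be computed, assuming the header value has already been read from the response as a string. The function name retry_after_seconds and the one-second default are illustrative; the header itself may carry either a number of seconds or an HTTP date.

```python
import email.utils
from datetime import datetime, timezone


def retry_after_seconds(header_value, default=1.0, now=None):
    """Turn a Retry-After header into a non-negative number of seconds."""
    if header_value is None:
        return default
    value = header_value.strip()
    if value.isdigit():
        return float(value)  # delay-seconds form, e.g. "120"
    try:
        retry_at = email.utils.parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return default  # unparseable header: fall back to the default wait
    if retry_at.tzinfo is None:
        retry_at = retry_at.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (retry_at - now).total_seconds())
```

The `now` parameter exists only so the date branch can be exercised deterministically; production callers would normally leave it unset.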

A First Code Sketch

# Two buckets, two responses, no retry logic yet.

class TransientError(Exception):
    """The next attempt might work. Retry."""

class PermanentError(Exception):
    """The next attempt will not work. Stop and report."""


def classify(http_status: int) -> Exception:
    if http_status in (408, 429, 500, 502, 503, 504):
        return TransientError(f"transient HTTP {http_status}")
    if http_status in (400, 401, 403, 404, 422):
        return PermanentError(f"permanent HTTP {http_status}")
    return TransientError(f"unknown HTTP {http_status}, treating as transient")

The classifier is small but it is the entire foundation of the failure handling that comes later. Every retry policy, every dead letter queue, every alerting rule depends on knowing which bucket a given error is in. A pipeline that does not classify treats every failure the same way, which is the same as not handling failures at all.
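A short usage sketch, repeating the definitions from the code sketch so it runs on its own. The three status codes passed in are arbitrary picks; 200 is included only to exercise the fallback branch.

```python
class TransientError(Exception):
    """The next attempt might work. Retry."""


class PermanentError(Exception):
    """The next attempt will not work. Stop and report."""


def classify(http_status: int) -> Exception:
    if http_status in (408, 429, 500, 502, 503, 504):
        return TransientError(f"transient HTTP {http_status}")
    if http_status in (400, 401, 403, 404, 422):
        return PermanentError(f"permanent HTTP {http_status}")
    return TransientError(f"unknown HTTP {http_status}, treating as transient")


for status in (503, 401, 200):
    error = classify(status)
    print(status, type(error).__name__, "-", error)

# Prints:
# 503 TransientError - transient HTTP 503
# 401 PermanentError - permanent HTTP 401
# 200 TransientError - unknown HTTP 200, treating as transient
```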

The 200 case looks wrong because the classifier was given a status code outside of any error category. Real code calls the classifier only after determining a request failed. The example shows the unknown bucket falling back to transient on purpose.

  • Transient failures want a wait and a retry; permanent failures want a stop and a report.
  • Ambiguous failures get a small bounded retry budget and then escalate.
  • Classification is the foundation; every later failure handling pattern depends on it.

  • Transient (will heal on retry): Network blips, downstream restarts, brief rate limits, momentary resource contention. The next attempt almost always succeeds.
  • Permanent (will never succeed): Bad credentials, malformed payloads, schema mismatches. Retrying changes nothing because nothing changed between attempts.
  • Ambiguous (cannot be told apart): Generic 500s, undocumented errors, opaque exceptions. Bounded retry budget, then escalate to permanent.

Permanent failures often come with diagnostic information that is more useful than a transient error's. A 422 with a detailed validation message tells the producer exactly what is wrong. A connection reset tells nobody anything specific. The asymmetry is intentional: services that return permanent errors should be helpful about what was wrong, because the only way to fix the problem is to understand the cause. Pipelines that surface this diagnostic information into alerts shorten the time-to-resolution dramatically. Pipelines that swallow the message and emit only the status code force the on-call to dig into the original payload to figure out what went wrong.
The HTTP status codes in the table above map to a real specification (RFC 9110) that names the semantics of each code. The specification distinguishes 4xx codes (the request was malformed; the client should not retry without changing something) from 5xx codes (the server failed; the client may retry). The convention is honored unevenly across real APIs. Some services return 500 for things that are actually 4xx because the engineers writing the API did not classify carefully. The pipeline cannot rely on the status code alone; some level of judgment about each specific endpoint is unavoidable. The classifier should be reviewed when integrating with a new API and updated as new failure modes are observed.
TIP
Whenever a new error appears in production logs, name its bucket before writing any retry code. A retry policy attached to an unclassified error is a guess wearing a hard hat.
[Diagram: a pipeline task writes to the warehouse on success; a transient failure is retried with backoff; a permanent failure is routed to the DLQ.]

Classify the failure first: transient errors (timeout, lock) get retried with exponential backoff; permanent errors (bad schema) go straight to a dead-letter queue. Retrying a permanent error just wastes time.

The Retry: Easy to Misuse


Apply a bounded retry that catches specific transient errors, sleeps between attempts, and gives up when the budget is exhausted.

The retry is the most basic failure handling primitive. The mechanism is two lines of code: catch the exception, run the operation again. That simplicity is what makes the retry both the first reach and the most common source of subtle production bugs. A retry done correctly absorbs nearly all transient failures. A retry done carelessly amplifies an outage, runs forever, or quietly produces duplicate writes. The mechanics that distinguish the two are not complicated; they are unforgiving.

A retry only produces the same answer as a single run when the work is idempotent (Lesson 5). Without that property, two attempts double the rows. The retry mechanics described here all assume the underlying write is safe to repeat.

What a Retry Does

A retry calls the same operation a second time after it failed. The expectation is that whatever caused the first failure is no longer present. The downstream service that returned a 503 has finished restarting. The TCP connection that was reset has been replaced by a fresh one. The brief rate limit window has passed. The retry exists because most transient failures are momentary, and waiting a small amount of time before trying again is enough to clear them.
import time

# The smallest correct retry: bounded attempts, fixed sleep, transient errors only.
# TransientError is the class defined in the first code sketch.
MAX_ATTEMPTS = 3
WAIT_SECONDS = 1


def call_with_retry(operation):
    for attempt in range(MAX_ATTEMPTS):
        try:
            return operation()
        except TransientError:
            if attempt + 1 == MAX_ATTEMPTS:
                raise
            time.sleep(WAIT_SECONDS)
        # Permanent errors are not caught and propagate up immediately.

Three attempts, one second between each, and only transient errors are retried. Permanent errors fall through the catch block and propagate. That tiny block handles a surprising fraction of real-world transient failures with no further machinery. The fraction it does not handle is the reason the next sections exist, but every team should be able to recognize the shape above and reach for it as a default.

Three Numbers Every Retry Defines

| Parameter | What It Controls | Typical Value |
| --- | --- | --- |
| Maximum attempts | Total number of tries including the first | 3 to 5 for most pipeline operations |
| Wait between attempts | How long to sleep before retrying | 1 to 5 seconds for fixed delays; longer for backoff |
| Which errors retry | The list of exception classes considered transient | Specific exception types, never a bare except |

The Retry That Looks Right But Is Not

# This is broken. It will retry permanent errors forever and never sleep.
while True:
    try:
        do_the_thing()
        break
    except Exception:
        continue

The block above appears in real production code from time to time. It catches every exception, including permanent errors and bugs in the code itself. It has no upper bound on attempts. It does not sleep, so it hammers the downstream as fast as the loop can go. Each of those three properties is wrong in a different way. The first guarantees that a malformed credential keeps retrying with the same wrong credential. The second guarantees that no one notices the failure until the warehouse credit bill arrives. The third turns a momentary downstream slowdown into a self-amplifying outage.
Naive Retry
  • Catches all exceptions, retries permanent errors forever
  • No upper bound on attempts
  • No sleep between attempts; hammers the downstream
  • No logging; failures and recoveries are invisible
Disciplined Retry
  • Catches only specific transient exception classes
  • Bounded attempts; raises after the budget is spent
  • Sleep between attempts grows with each failure
  • Logs every retry with attempt number and elapsed time

Why Specificity Matters

A retry that catches a specific exception class refuses to retry anything else. That refusal is the entire safety guarantee. If the operation throws a permission error, the retry block does not see it, and the error propagates up to the orchestrator where it can be alerted on. If the retry block catches a broad base class, it swallows that permission error and turns it into a slow loop nobody is watching. The specificity is not a stylistic preference; it is the only line that separates a self-healing retry from an infinite-loop bug.
A retry is correct when these four properties hold:
  • It catches a named transient error type, not Exception
  • It has a maximum attempt count and gives up when the budget is spent
  • It sleeps between attempts and the sleep grows over time
  • It logs each attempt so the recovery is visible
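The four properties can be sketched in one helper. This is an illustration under stated assumptions rather than a reference implementation: the function name call_with_logged_retry, its default values, and the injectable sleep parameter are all hypothetical choices made for the example.

```python
import logging
import time

logger = logging.getLogger("retry")


class TransientError(Exception):
    """The next attempt might work. Retry."""


def call_with_logged_retry(operation, max_attempts=4, base_wait=1.0, cap=30.0,
                           sleep=time.sleep):
    """Retry only TransientError, with a bounded budget, growing sleep, and logs."""
    started = time.monotonic()
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError as exc:  # property 1: named transient type only
            if attempt + 1 == max_attempts:  # property 2: bounded budget
                logger.error("giving up after %d attempts: %s", max_attempts, exc)
                raise
            wait = min(cap, base_wait * 2 ** attempt)  # property 3: growing sleep
            logger.warning(  # property 4: every retry is visible
                "attempt %d/%d failed (%s); retrying in %.1fs, %.1fs elapsed",
                attempt + 1, max_attempts, exc, wait, time.monotonic() - started,
            )
            sleep(wait)
```

Passing the sleep function as a parameter is a testing convenience: a test can substitute a recorder and check the wait schedule without actually sleeping.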
The discipline of writing a retry once and reusing it across the codebase is a force multiplier. A single shared retry helper that every operation can adopt prevents the case where one team's retry block has different semantics than another team's. Consistency at this level matters because failure handling is rarely something engineers think hard about while writing new code; the helper that ships with the right defaults is the helper that gets used. The libraries that have shipped this idea successfully (tenacity in Python, retry in Go, Polly in C#) all encode the same four properties: specific exceptions, bounded attempts, sleep that grows, and visible logging. The libraries are the cheapest way to get the right semantics without rewriting them every time.
Logging is often skipped when retries are written quickly. The omission is a mistake. A retry that succeeds silently after three attempts looks identical in operations to a retry that succeeded on the first attempt. The recovery is invisible. Visibility matters because the rate of recoveries is itself a signal: a downstream that needs three retries to succeed eighty percent of the time is degraded, even if every operation eventually completes. The on-call rotation that monitors this rate catches degradations before they become outages. The on-call rotation that does not is fighting incidents from the moment of full failure rather than from the moment of first warning.
TIP
Read every existing retry block in the codebase and check the four properties above. The bug in production tomorrow is statistically likely to be a retry that fails one of them.

Naive Retries and Thundering Herd


Recognize the thundering herd failure mode and apply backoff plus jitter to prevent retry storms from amplifying an outage.

The thundering herd is one of the most cited failure modes in distributed systems and one of the most overlooked by engineers writing their first retry. The shape is straightforward. A downstream service slows down. Many clients fail at roughly the same moment. Each client retries on the same fixed schedule. The retries arrive at the downstream in a synchronized wave that is larger than the original load that caused the slowdown. The downstream goes from slow to dead. The retries then double in size again. A momentary blip becomes an hour-long outage.

The Anatomy of a Thundering Herd

| Step | What Happens | Effect on Downstream |
| --- | --- | --- |
| 1 | Downstream service hits a brief CPU spike; latency rises | Some requests time out; clients see transient errors |
| 2 | Hundreds of clients see errors at roughly the same second | Hundreds of clients enter their retry path simultaneously |
| 3 | Every client retries after exactly one second | Hundreds of new requests arrive at the downstream in the same second |
| 4 | Downstream is now under double the original load | Latency rises further; more requests time out |
| 5 | Failed retries trigger their next retry one second later | Load doubles again; downstream collapses entirely |

The pattern is self-amplifying. Each round of retries makes the underlying problem worse. Without intervention, the system reaches a steady state where the retries themselves are the load and the original work never completes. Engineers seeing the dashboards during this state see a downstream that appears to be permanently down even though there is no bug, no deployment, and no obvious trigger. The trigger was a one-second latency spike that never repeated. The retry behavior is what kept the outage alive.

A Real Numbers Example

Suppose 500 worker processes each call a downstream API once per minute. A 200-millisecond latency spike at second 30 causes all 500 to time out simultaneously. Each retries one second later. The downstream now sees 500 retry requests at second 31 in addition to whatever organic traffic was scheduled. The request rate jumped from about 8 per second (500 calls spread over 60 seconds) to about 508 in a single second.

The Fix Has Two Parts

| Mechanism | What It Does | Why It Helps |
| --- | --- | --- |
| Backoff | Wait longer between each successive retry | Reduces total request volume during the recovery window |
| Jitter | Add a random offset to each retry's wait time | Spreads the retry wave across time so it does not arrive as a single spike |

Backoff alone spaces the retry waves further apart but does not desynchronize the clients. All five hundred clients still retry at the same second, only later. Jitter is what desynchronizes them. With jitter, one client retries at second 31.2, another at second 32.1, another at second 33.7. The downstream sees a stretched-out trickle instead of a synchronized spike. The combination is what makes retries safe at scale.
import random
import time

# Backoff plus jitter: the load on the downstream stays bounded
# even when many clients retry at once.

def sleep_with_jitter(attempt):
    base = min(60, 2 ** attempt)      # capped exponential backoff
    spread = random.uniform(0, base)  # full jitter
    time.sleep(spread)

Three rules every retry policy obeys at scale:
  • Wait between attempts grows with each failure (exponential is the standard)
  • A random offset is added to the wait so retries do not synchronize
  • There is a hard cap on the wait so a single retry never sleeps for hours
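A small simulation, under stated assumptions, makes the effect of jitter concrete. It assumes 100 clients that all fail at the same instant and whose backoff has reached 4 seconds; the client count, the fixed seed, and the helper name arrivals_per_second are illustrative choices.

```python
import random
from collections import Counter

CLIENTS = 100
BACKOFF_SECONDS = 4  # the backoff value for attempt 2: 2 ** 2


def arrivals_per_second(wait_fn, clients=CLIENTS):
    """Count how many retries land in each one-second bucket."""
    return Counter(int(wait_fn()) for _ in range(clients))


random.seed(7)  # fixed seed so the sketch prints the same spread on every run
without_jitter = arrivals_per_second(lambda: BACKOFF_SECONDS)
with_jitter = arrivals_per_second(lambda: random.uniform(0, BACKOFF_SECONDS))

# Without jitter every retry lands in the same bucket: {4: 100}.
print("without jitter:", dict(sorted(without_jitter.items())))
# With full jitter the same retries are spread over the buckets 0 to 3.
print("with jitter:   ", dict(sorted(with_jitter.items())))
```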
The print output makes the difference visible. Without jitter, one hundred clients land in the same second. With jitter, the same hundred clients spread across four seconds. The downstream sees a quarter of the original load at any one moment, which is the difference between a survivable retry storm and a fatal one.
The thundering herd was named after the behavior of cattle in a panic. The metaphor sticks because the dynamics are the same. A single trigger causes a synchronized response from many independent agents. The synchronized response itself becomes the problem. Engineers who have not experienced a thundering herd outage often underestimate how cleanly the metaphor maps to systems. The fix in livestock is to physically prevent the synchronized motion. The fix in software is the same: prevent the synchronized retry.
Operations teams at companies that run global services have written extensively about thundering herd outages from production. AWS publishes blog posts about retry storms in distributed systems. The pattern is consistent enough across industries that the standard mitigation, full jitter, has become a well-known term in SRE literature. Reading the AWS Architecture Blog post on exponential backoff and jitter is a reasonable evening's investment for any engineer who writes pipeline code. The post predates 2026 by more than a decade and the conclusions have not changed because the underlying mathematics has not changed.
Synchronization is the property that makes the herd thunder. The same property is what makes Sunday morning grocery store rushes worse than Monday afternoon ones: hundreds of independent decisions to shop happen at the same time because the underlying schedule is shared. Software systems are full of shared schedules: every cron job on the hour, every healthcheck on a fifteen-second interval, every retry after a one-second wait. Engineers building distributed systems learn quickly that 'on the hour' is a stronger correlation than expected, and the same applies to retry waits.
The mathematical model for a thundering herd is the M/M/1 queue with a sudden arrival surge. The math says, in plain words, that any service running near its capacity becomes unstable when arrival rate briefly spikes. Most production services are sized to run somewhere between fifty and seventy percent of capacity to leave headroom for these spikes. Retries that are correlated with each other consume that headroom in seconds. The reason exponential backoff and jitter became universal practice is that they restore the headroom by spreading the surge across enough time that the steady-state capacity can absorb it.
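The instability near capacity can be made concrete with the textbook M/M/1 result that the mean time a request spends in the system is 1 / (service rate minus arrival rate). The sketch below assumes a hypothetical service that can complete 100 requests per second; the rates are illustrative only.

```python
def mean_time_in_system(arrival_rate, service_rate):
    """Mean time a request spends queued plus in service, in seconds (M/M/1)."""
    if arrival_rate >= service_rate:
        return float("inf")  # the queue grows without bound
    return 1.0 / (service_rate - arrival_rate)


SERVICE_RATE = 100.0  # assumed capacity in requests per second

for arrival_rate in (60.0, 90.0, 99.0, 120.0):
    utilization = arrival_rate / SERVICE_RATE
    latency_ms = mean_time_in_system(arrival_rate, SERVICE_RATE) * 1000
    print(f"utilization {utilization:.0%}: mean latency {latency_ms:.0f} ms")
```

Under these assumptions, moving from 60 to 99 percent utilization multiplies mean latency by forty, and any arrival rate at or above capacity has no steady state at all. A correlated retry wave is exactly the kind of surge that pushes a service across that line.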
  • A thundering herd is a self-amplifying outage caused by synchronized retries.
  • Backoff reduces total retry volume; jitter spreads it across time.
  • Both are required at scale; either one alone leaves the failure mode open.

Exponential Backoff in One Sentence


Compute exponential backoff wait times by hand and explain why the cap and jitter are required, not optional.

Exponential backoff is the standard way to choose how long a retry should wait. The rule fits in one sentence: each successive attempt waits roughly twice as long as the previous one, capped at a maximum. The mechanism is everywhere because it solves two problems at once. It gives the downstream more time to recover with each failure. It bounds the total number of retries that can fit in a given time window. The cap prevents a runaway exponential from sleeping for hours once the attempt count reaches the teens.

The Formula

# Exponential backoff with a base of 1 second and a cap of 60 seconds.
# attempt is zero-indexed: 0 for the first retry, 1 for the second, and so on.

def wait_seconds(attempt, base=1, cap=60):
    return min(cap, base * (2 ** attempt))


for attempt in range(7):
    print(attempt, wait_seconds(attempt), "seconds")

The Numbers In a Real Example

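| Attempt | Computed Wait | Capped Wait |
| --- | --- | --- |
| 1 | 1 second | 1 second |
| 2 | 2 seconds | 2 seconds |
| 3 | 4 seconds | 4 seconds |
| 4 | 8 seconds | 8 seconds |
| 5 | 16 seconds | 16 seconds |
| 6 | 32 seconds | 32 seconds |
| 7 | 64 seconds | 60 seconds (capped) |
| 8 | 128 seconds | 60 seconds (capped) |

The table counts attempts from 1, while the code counts from 0, so row n in the table corresponds to attempt n - 1 in the code.
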
After eight attempts, the total elapsed wait is 1 + 2 + 4 + 8 + 16 + 32 + 60 + 60 = 183 seconds, just over three minutes. That is the budget every retry policy implicitly defines. Three minutes is enough time for most genuine transient failures to clear. If a downstream is still failing after three minutes of escalating waits, the failure is almost certainly not transient and the pipeline should escalate to alerting rather than continuing to retry.
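The same budget can be computed rather than added by hand. The sketch below reuses the wait_seconds formula and adds a hypothetical helper, total_wait, that sums the capped waits; the second call shows how a lower cap shrinks the budget.

```python
def wait_seconds(attempt, base=1, cap=60):
    return min(cap, base * (2 ** attempt))


def total_wait(retries, base=1, cap=60):
    """Worst-case seconds spent sleeping across the given number of retries."""
    return sum(wait_seconds(attempt, base, cap) for attempt in range(retries))


print(total_wait(8))          # 183, matching the sum worked out by hand
print(total_wait(8, cap=30))  # 121: 1 + 2 + 4 + 8 + 16 + 30 + 30 + 30
```

This counts only time spent sleeping. Time spent inside each failing call is extra and depends on the operation's own timeout.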

Adding Jitter to the Backoff

import random

# Full jitter: pick uniformly between 0 and the computed cap.
# This is the AWS Architecture Blog recommendation and the default in many libraries.

def wait_with_full_jitter(attempt, base=1, cap=60):
    upper = min(cap, base * (2 ** attempt))
    return random.uniform(0, upper)


# Equal jitter: half the cap, plus a random half.
# Slightly less variance, slightly higher mean wait.

def wait_with_equal_jitter(attempt, base=1, cap=60):
    upper = min(cap, base * (2 ** attempt))
    return upper / 2 + random.uniform(0, upper / 2)

Fixed Delay
  • Same wait between every attempt
  • Total wait grows linearly with attempts
  • Synchronizes retries across many clients
  • Acceptable for very small attempt counts and very small fleets
Exponential Backoff
  • Wait doubles with each successive attempt
  • Total wait grows quickly; budget is naturally bounded
  • Pairs with jitter to spread retries across time
  • The standard for any pipeline that runs with more than a single worker

Why the Cap Matters

Without a cap, the seventh retry waits 64 seconds, the tenth waits 512 seconds, the fifteenth waits more than four hours. The exponential climbs faster than human intuition predicts. A cap of 60 seconds, or 5 minutes, or 30 minutes, depending on the workload, prevents a retry policy from silently turning a stuck job into an hours-long stall. The cap also signals an operational truth: by the time the retry has been waiting that long, the failure is no longer transient and the pipeline should be telling someone.
  • Base (the starting wait): Usually 1 second. Smaller for low-latency systems, larger for batch pipelines that can absorb a longer first wait.
  • Multiplier (how fast the wait grows): Almost always 2. Bigger multipliers give up too quickly; smaller ones make the backoff ineffective.
  • Cap (the hard ceiling): 60 seconds for synchronous calls, several minutes for batch retries. Prevents runaway exponential growth.
  • Maximum attempts (when to stop trying): 5 to 8 is typical. Beyond that, the failure is almost certainly not transient and alerting earns its place.

The 'roughly twice as long' rule is approximate on purpose. Some implementations use a multiplier of 1.5 or 3 instead of 2; the choice has subtle effects on how aggressive the backoff is. A multiplier of 1.5 takes more attempts to reach the cap, which means more work hits the downstream during the recovery window. A multiplier of 3 gives up too quickly because the budget is exhausted in fewer total attempts. Two is the convention because it sits in the middle of those tradeoffs and because doubling is computationally trivial. Engineers who pick a non-default multiplier should be able to articulate why the default is wrong for the workload at hand; in nearly every case the answer is that the default is fine.
The interaction between exponential backoff and the underlying operation's own latency is worth pausing on. A retry that fires after a one-second wait looks like it took one second of wall time, but the actual call may have taken thirty seconds to time out before the retry fired. The visible behavior is that an operation 'tried three times in three seconds' when in reality the operation tried three times across ninety-three seconds of wall time. A cap on total elapsed time catches this case, but only if the retry budget tracks elapsed wall time as a separate dimension from attempt count.
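One way to track wall time as its own dimension is sketched below. The function name call_with_deadline, the max_elapsed parameter, and the injectable clock and sleep arguments are assumptions made for the illustration, not the API of any particular library.

```python
import time


class TransientError(Exception):
    """The next attempt might work. Retry."""


def call_with_deadline(operation, max_attempts=5, max_elapsed=120.0,
                       base=1.0, cap=60.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Stop retrying when the attempt count or the wall-clock budget runs out."""
    started = clock()
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            wait = min(cap, base * 2 ** attempt)
            elapsed = clock() - started  # includes time spent inside slow calls
            out_of_attempts = attempt + 1 == max_attempts
            out_of_time = elapsed + wait > max_elapsed
            if out_of_attempts or out_of_time:
                raise
            sleep(wait)
```

With this shape, an operation that takes thirty seconds to time out exhausts a two-minute budget after a few attempts even if the attempt count alone would have allowed more.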
The retry implementation in the standard libraries (tenacity in Python is a good example) compresses many of the lessons in this section into a few lines. Adopting such a library does not absolve a pipeline author of understanding what the library does; the defaults are usually correct, but the explanation in this section is what makes the defaults legible. An engineer who configures tenacity without reading what 'wait_random_exponential' actually does is one bad configuration choice away from a thundering herd.
Do
  • Pick a base of 1 second and a multiplier of 2 unless there is a specific reason to deviate
  • Cap the wait so the seventh retry never sleeps longer than the operational tolerance
  • Pair backoff with jitter so retries from many clients do not synchronize
Don't
  • Use exponential backoff without a cap; the runaway is silent and expensive
  • Use a base shorter than the round-trip time to the downstream; the first retry will hit before the failure has time to clear
  • Treat the maximum-attempts number as an unbounded knob; if the workload needs more than ten attempts, the design needs more than retries
TIP
Compute the worst-case total elapsed time before choosing the cap. If the answer is longer than the on-call tolerance, lower the cap or the attempt count.

When NOT to Retry


Identify the failure categories that should not be retried and route them to quarantine, alerting, or credential rotation instead.

Retry as a tool is so often correct that engineers begin to apply it reflexively. The reflex causes outages of its own. Some failures will never succeed on a second attempt, and retrying them wastes compute, fills up logs, and hides the underlying problem. Knowing the categories where retrying is wrong is as important as knowing how to retry properly. The pipeline that retries correctly on transient errors and refuses to retry on permanent ones is the pipeline that operates predictably.

Three Categories That Should Not Be Retried

| Category | Example | Why Retrying Fails |
| --- | --- | --- |
| Validation failures | A required field is missing from an event | The next attempt sees the same missing field; nothing changes |
| Authentication failures | A 401 from an API because the token expired or is wrong | Retrying with the same credential keeps failing; the credential has to be replaced |
| Poison pills | A specific row that crashes the parser every time it is processed | The row will keep crashing the parser; retrying loses progress on every other row |

Validation Failures

Validation failures occur when input data does not match the expected shape. A field that the contract said was required is missing. A timestamp is in the wrong format. A foreign key points to a row that does not exist. None of these will resolve themselves on a retry. The fix is upstream: either the producer corrects the data, or the pipeline routes the bad record to a quarantine path where a human can inspect it. Retrying a validation failure trades a clear error for a slow grinding one.

Authentication and Authorization Failures

A 401 Unauthorized or a 403 Forbidden response is permanent until the credentials change. Retrying does not change the credentials. Retrying does, in some systems, lock the account out after too many failed attempts, turning a fixable problem into a multi-hour incident. The standard pattern is to fail loudly on the first auth error and let an operator rotate the credential. Some teams pair this with an automatic token refresh on 401, which is a different mechanism than a retry: the refresh changes the inputs before the next attempt.

Poison Pills

A poison pill is a record that causes a worker to fail every time the worker tries to process it. The classic example is a streaming pipeline that consumes from Kafka. The worker pulls a message, the message has a bug-triggering shape, the worker crashes. The orchestrator restarts the worker. The worker pulls the same message, crashes again. The pipeline appears to be processing data but is actually stuck in a loop on the poison message, with every other downstream message piling up behind it. The fix is not retrying the poison record more times. The fix is moving it out of the way so the rest of the queue can flow.
If retrying does not change anything between attempts, retrying is the wrong tool:
  • The data is the same on attempt two as it was on attempt one
  • The credentials are the same on attempt two as on attempt one
  • The schema is the same on attempt two as on attempt one

What To Do Instead

| Failure | Action Instead of Retry | Where the Action Lives |
| --- | --- | --- |
| Validation failure on a single row | Quarantine the row, continue processing the rest | Quarantine table or dead letter queue |
| Authentication failure | Stop the pipeline, page on-call, rotate the credential | Alerting and runbook |
| Poison pill on a streaming consumer | Skip the message after N failed attempts, route it to a DLQ | Consumer DLQ configuration |
| Schema mismatch on a downstream insert | Stop the pipeline, alert the producer team | Schema validation stage and on-call |

# A retry decorator that refuses to retry permanent failures.
# TransientError and PermanentError are the classes from the first code sketch.
import time
from functools import wraps


def retry_transient(max_attempts=3, base_wait=1, cap=60):
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except PermanentError:
                    raise  # never retry permanent failures
                except TransientError:
                    if attempt + 1 == max_attempts:
                        raise
                    # Back off before the next attempt instead of hammering the downstream.
                    time.sleep(min(cap, base_wait * 2 ** attempt))
        return wrapper
    return decorator

The decorator above is short, but the discipline it encodes is the difference between a self-healing pipeline and a noisy one. PermanentError raises immediately, never sleeping, never wasting compute. TransientError gets a bounded retry. Anything outside those two classes is uncaught and propagates as a real bug, not a quietly retried one.
Retry Always
  • Permanent failures retry until the budget runs out
  • Bad credentials trigger account lockouts after lockout policies kick in
  • Poison pills block streaming pipelines while attempting to be reprocessed
  • On-call gets paged after the budget burns instead of at the first sign of trouble
Retry Selectively
  • Transient failures retry; permanent failures stop immediately
  • Bad credentials surface within seconds and trigger rotation
  • Poison pills move to a DLQ after a small bounded retry budget
  • On-call gets paged on the genuine signal, not the retry aftermath
Authentication failures deserve a closer look because they are the most common permanent error pattern in third-party integrations. Suppose a token expires every twelve hours and the pipeline runs every fifteen minutes. Each token covers forty-eight runs, and the first run after each expiry fails with a 401, which is two failures per day in steady state. Without a token-refresh mechanism in front of the retry path, the pipeline pages on-call every twelve hours forever. The refresh-then-retry pattern is the standard fix: on a 401, refresh the token, then retry the original request once with the new token. A second 401 after a refresh is a real auth failure and should escalate.
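The refresh-then-retry pattern can be sketched in a few lines. This is a minimal sketch under stated assumptions: `send_request` and `refresh_token` are hypothetical callables standing in for a real HTTP client and auth flow, and `AuthError` is a name introduced here for illustration.

```python
# Hypothetical sketch of refresh-then-retry on HTTP 401.


class AuthError(Exception):
    """Raised when a 401 persists even after a token refresh."""


def call_with_refresh(send_request, refresh_token, token):
    """Call the API, refreshing the token at most once on a 401.

    send_request(token) -> (status, body) and refresh_token() -> new token
    are assumed stand-ins, not a real library API.
    """
    status, body = send_request(token)
    if status != 401:
        return status, body, token

    # First 401: the token has probably expired. Refresh and retry exactly once.
    token = refresh_token()
    status, body = send_request(token)
    if status == 401:
        # A 401 with a fresh token is a real auth failure, so escalate.
        raise AuthError("401 after token refresh")
    return status, body, token


# Usage with fakes: a stale token gets refreshed and the request succeeds.
def fake_send(token):
    return (200, "ok") if token == "fresh" else (401, "expired")


status, body, token = call_with_refresh(fake_send, lambda: "fresh", "stale")
```

Returning the token alongside the response lets the caller reuse the refreshed token on later requests instead of refreshing again.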
The poison-pill case has a subtle variant worth knowing about. Sometimes the offending message is not malformed in any obvious way; it is only poisonous in combination with a particular state inside the worker. A message that triggers a memory leak in a parser library, for example, is poisonous on workers that have already processed many messages and benign on a fresh worker. The diagnosis is harder because restarting the worker appears to fix the problem, until the leak builds up again. The defense against this class of poison pill is the same as for the simpler kind: bounded retries per message, then route to the DLQ, then continue.
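The bounded-retries-then-DLQ defense can be sketched as below. This is an illustrative sketch, not a real consumer API: `consume`, `handle`, and the list used as a dead letter queue are hypothetical. The broad `except Exception` is deliberate here, since a poison pill may fail in ways nobody anticipated.

```python
# Hypothetical sketch: bounded per-message retries, then route to a DLQ.


def consume(messages, handle, max_attempts=3):
    """Process messages in order and return the dead-lettered entries.

    `handle` is an assumed callable that raises on failure. The returned
    list stands in for a real dead letter queue.
    """
    dlq = []
    for msg in messages:
        for attempt in range(1, max_attempts + 1):
            try:
                handle(msg)
                break  # success: move on to the next message
            except Exception as exc:
                if attempt == max_attempts:
                    # Budget exhausted: park the message and keep consuming.
                    dlq.append(
                        {"message": msg, "error": repr(exc), "attempts": attempt}
                    )
    return dlq


processed = []


def handle(msg):
    if msg == "poison":
        raise ValueError("cannot parse")
    processed.append(msg)


dlq = consume(["a", "poison", "b"], handle)
```

The key property is that the message after the poison pill still gets processed, so one bad message cannot block the stream.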
TIP
When a new error type appears in production, write down what would change between attempt one and attempt two. If the answer is 'nothing,' the error does not belong in the retry path.
PUTTING IT ALL TOGETHER

> A small data team runs a nightly job that pulls invoice events from a payment processor's API. The job has been failing once or twice a week for months. The on-call engineer manually restarts it each time. A new engineer is asked to make the failures self-healing without making the system worse. The current code retries every exception forever with no sleep.

Step one: classify every error the API returns. HTTP 503s and connection resets are transient; HTTP 401s, 403s, and 422s are permanent. Anything ambiguous gets a small bounded retry budget.
Step two: replace the catch-all retry with a bounded retry that catches only transient exception classes. Five attempts is the upper bound. Permanent errors propagate immediately so on-call sees them at the first occurrence.
Step three: add exponential backoff with a base of 1 second, a multiplier of 2, and a cap of 60 seconds. Add full jitter so multiple workers do not synchronize against the API rate limiter.
Step four: route any single invoice event that fails validation to a quarantine table rather than failing the whole job. Permanent errors at the row level become row-level escalations, not job-level outages, and the pipeline stays the durable conduit Lesson 1 described.
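Steps one through three above can be sketched together. This is a minimal sketch under stated assumptions: it covers HTTP status codes only (connection resets would surface as exceptions in a real client), and `classify`, `backoff_delay`, `fetch_with_retry`, the two exception classes, and the ambiguous budget of two attempts are hypothetical choices made for illustration.

```python
import random
import time

# Assumed classification for illustration, following step one.
TRANSIENT_STATUSES = {503}
PERMANENT_STATUSES = {401, 403, 422}


class PermanentFailure(Exception):
    """The request will never succeed; surface it at the first occurrence."""


class RetriesExhausted(Exception):
    """A retryable failure outlived its retry budget."""


def classify(status):
    if status in TRANSIENT_STATUSES:
        return "transient"
    if status in PERMANENT_STATUSES:
        return "permanent"
    return "ambiguous"  # gets a small bounded retry budget


def backoff_delay(attempt, base=1.0, multiplier=2.0, cap=60.0):
    """Full jitter: a uniform random wait between 0 and the capped exponential.

    `attempt` is zero-based, so the ceilings run 1, 2, 4, 8, ... up to `cap`.
    """
    ceiling = min(cap, base * multiplier ** attempt)
    return random.uniform(0, ceiling)


def fetch_with_retry(fetch, max_attempts=5, ambiguous_attempts=2, sleep=time.sleep):
    """fetch() -> (status, body). Retries only non-permanent statuses."""
    status = None
    for attempt in range(max_attempts):
        status, body = fetch()
        if 200 <= status < 300:
            return body
        kind = classify(status)
        if kind == "permanent":
            raise PermanentFailure(f"HTTP {status}")
        budget = max_attempts if kind == "transient" else ambiguous_attempts
        if attempt + 1 >= budget:
            raise RetriesExhausted(f"HTTP {status} after {attempt + 1} attempts")
        sleep(backoff_delay(attempt))
    raise RetriesExhausted(f"HTTP {status} after {max_attempts} attempts")
```

Passing `sleep` as a parameter is a design choice that lets tests record the waits instead of actually sleeping.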
KEY TAKEAWAYS
Failures come in two buckets: transient errors heal on retry; permanent errors will not. Naming the bucket precedes choosing the response.
A retry needs four properties to be safe: specific exception types, bounded attempts, sleep between attempts, and visible logging.
Naive retries amplify outages: synchronized retries across many clients form a thundering herd that turns a momentary blip into a sustained outage.
Exponential backoff with jitter is the standard: wait doubles each attempt, capped at a maximum, with a random offset to desynchronize clients.
Some failures should not be retried at all: validation errors, authentication failures, and poison pills require quarantine, credential rotation, or DLQ routing instead.

Some failures heal themselves and some never will; the pipeline must tell the difference

Category: Pipeline Architecture
Difficulty: beginner
Duration: 25 minutes
Challenges: 0 hands-on challenges

Topics covered: Transient vs Permanent Failures, The Retry: Easy to Misuse, Naive Retries and Thundering Herd, Exponential Backoff in One Sentence, When NOT to Retry
