A logistics startup ran a nightly job that pulled package events from a partner API and loaded them into Snowflake. One Tuesday at 2:14 a.m. the partner API returned a single 503 error on one of three thousand pages. The job crashed. The on-call engineer woke up, restarted the job, and went back to sleep. The next night the same page returned a 503 again, and the same engineer woke up again. After the third night the engineer added one line of code that retried the failed request once. The 503s stopped paging anyone. They had not stopped happening. They had stopped being the engineer's problem. That one line is the smallest possible step from a brittle pipeline to a resilient one. Failure handling is the discipline that turns a pipeline from a script that works most of the time into a system that keeps working when something goes wrong. This lesson maps the kinds of failure every pipeline encounters and the smallest correct response to each.
Transient vs Permanent Failures
Distinguish transient, permanent, and ambiguous failures and choose the response category for each.
Every pipeline failure falls into one of two buckets. A transient failure is something that goes wrong because of a temporary condition: a network hiccup, a downstream service rebooting, a momentary rate limit. A permanent failure is something that will never succeed no matter how many times the pipeline tries: a bad credential, a row whose schema does not match, a malformed JSON document. The two buckets demand opposite responses. Treating a transient as permanent gives up too early; treating a permanent as transient burns compute forever. The first move in any failure-handling design is naming which bucket a given error belongs to.
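| Failure type | Response |
| --- | --- |
| Transient | Wait, then retry |
| Permanent | Stop and report |
| Ambiguous | Retry a small bounded number of times, then escalate to permanent |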
The third row is the row that matters most in practice. Ambiguous failures are the common case. A request returned a 500. The engineer does not know whether the downstream is briefly overloaded or permanently broken. The pipeline cannot know either. The pragmatic answer is to treat ambiguous as transient with a small budget: try a handful of times, and if every attempt fails, escalate the error to the permanent bucket. That hedge is the design that lives in production code at most mature data teams.
Concrete Examples From Real Pipelines
| Error | Bucket | Reasoning |
| --- | --- | --- |
| HTTP 503 Service Unavailable | Transient | By specification, 503 means the server is temporarily unable to handle the request |
| HTTP 401 Unauthorized | Permanent | The credential is wrong; retrying with the same credential will keep failing |
| HTTP 429 Too Many Requests | Transient (with mandatory backoff) | Rate limit; retry after the Retry-After interval, never sooner |
| JSON parse error on a single row | Permanent for that row | The row will not parse on the next attempt; route it to a quarantine path |
| Connection reset by peer | Transient | The TCP connection dropped; the next attempt opens a new connection |
| Foreign key violation on insert | Permanent | The referenced row does not exist; retrying does not change that |
Why naming the bucket comes first:
▸ The retry mechanism is the same for every transient error; only the wait time varies
▸ The escape hatch is the same for every permanent error: stop and report
▸ Code that mixes the two responses ends up retrying credential errors forever and giving up on network blips after one try
A First Code Sketch
# Two buckets, two responses, no retry logic yet.


class TransientError(Exception):
    """The next attempt might work. Retry."""


class PermanentError(Exception):
    """The next attempt will not work. Stop and report."""


def classify(http_status):  # function name assumed; only the final line survives in the source
    # ... branches for recognized status codes ...
    return TransientError(f"unknown HTTP {http_status}, treating as transient")
The classifier is small but it is the entire foundation of the failure handling that comes later. Every retry policy, every dead letter queue, every alerting rule depends on knowing which bucket a given error is in. A pipeline that does not classify treats every failure the same way, which is the same as not handling failures at all.
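A minimal sketch of such a classifier, assuming the bucket assignments from the examples table above. The name classify_status and the two status sets are illustrative choices, not part of the lesson's own code; a real classifier would be tuned per endpoint.

```python
class TransientError(Exception):
    """The next attempt might work. Retry."""


class PermanentError(Exception):
    """The next attempt will not work. Stop and report."""


# Assumed bucket assignments, taken from the examples table.
PERMANENT_STATUSES = {401, 403, 422}
TRANSIENT_STATUSES = {429, 503}


def classify_status(http_status):
    """Return (not raise) the error bucket for a failed HTTP response."""
    if http_status in PERMANENT_STATUSES:
        return PermanentError(f"HTTP {http_status}")
    if http_status in TRANSIENT_STATUSES:
        return TransientError(f"HTTP {http_status}")
    # Ambiguous or unrecognized: hedge toward transient with a small budget.
    return TransientError(f"unknown HTTP {http_status}, treating as transient")


for status in (503, 401, 429, 500, 200):
    error = classify_status(status)
    print(status, type(error).__name__, "-", error)
```

The loop deliberately includes 200 and 500 to show what happens to codes that match neither set.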
The 200 case looks wrong because the classifier was given a status code outside of any error category. Real code calls the classifier only after determining a request failed. The example shows the unknown bucket falling back to transient on purpose.
Transient failures want a wait and a retry; permanent failures want a stop and a report.
Ambiguous failures get a small bounded retry budget and then escalate.
Classification is the foundation; every later failure handling pattern depends on it.
▸ Transient: will heal on retry. Network blips, downstream restarts, brief rate limits, momentary resource contention. The next attempt almost always succeeds.
▸ Permanent: will never succeed. Bad credentials, malformed payloads, schema mismatches. Retrying changes nothing because nothing changed between attempts.
▸ Ambiguous: cannot be told apart. Generic 500s, undocumented errors, opaque exceptions. Bounded retry budget, then escalate to permanent.
Permanent failures often come with diagnostic information that is more useful than a transient error's. A 422 with a detailed validation message tells the producer exactly what is wrong. A connection reset tells nobody anything specific. The asymmetry is intentional: services that return permanent errors should be helpful about what was wrong, because the only way to fix the problem is to understand the cause. Pipelines that surface this diagnostic information into alerts shorten the time-to-resolution dramatically. Pipelines that swallow the message and emit only the status code force the on-call to dig into the original payload to figure out what went wrong.
Most of the HTTP status codes in the table above are defined in RFC 9110, which names the semantics of each code (429 comes from RFC 6585). The specification distinguishes 4xx codes (the client appears to have erred; the client should not retry without changing something) from 5xx codes (the server failed; the client may retry). The convention is honored unevenly across real APIs. Some services return 500 for things that are actually 4xx because the engineers writing the API did not classify carefully. The pipeline cannot rely on the status code alone; some level of judgment about each specific endpoint is unavoidable. The classifier should be reviewed when integrating with a new API and updated as new failure modes are observed.
TIP
Whenever a new error appears in production logs, name its bucket before writing any retry code. A retry policy attached to an unclassified error is a guess wearing a hard hat.
[Figure: a pipeline task either succeeds and writes to the warehouse, retries with backoff on a transient failure, or sends a permanent failure to the DLQ.]
Classify the failure first: transient errors (timeout, lock) get retried with exponential backoff; permanent errors (bad schema) go straight to a dead-letter queue. Retrying a permanent error just wastes time.
The Retry: Easy to Misuse
Apply a bounded retry that catches specific transient errors, sleeps between attempts, and gives up when the budget is exhausted.
The retry is the most basic failure handling primitive. The mechanism is two lines of code: catch the exception, run the operation again. That simplicity is what makes the retry both the first reach and the most common source of subtle production bugs. A retry done correctly absorbs nearly all transient failures. A retry done carelessly amplifies an outage, runs forever, or quietly produces duplicate writes. The mechanics that distinguish the two are not complicated; they are unforgiving.
A retry only produces the same answer as a single run when the work is idempotent (Lesson 5). Without that property, two attempts double the rows. The retry mechanics described here all assume the underlying write is safe to repeat.
What a Retry Does
A retry calls the same operation a second time after it failed. The expectation is that whatever caused the first failure is no longer present. The downstream service that returned a 503 has finished restarting. The TCP connection that was reset has been replaced by a fresh one. The brief rate limit window has passed. The retry exists because most transient failures are momentary, and waiting a small amount of time before trying again is enough to clear them.
# Permanent errors are not caught and propagate up immediately.
Three attempts, one second between each, and only transient errors are retried. Permanent errors fall through the catch block and propagate. That tiny block handles a surprising fraction of real-world transient failures with no further machinery. The fraction it does not handle is the reason the next sections exist, but every team should be able to recognize the shape above and reach for it as a default.
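The shape described above can be sketched roughly as follows. The names run_with_retry and flaky_fetch are hypothetical, and the sleep function is injectable only so the demonstration does not actually wait.

```python
import time


class TransientError(Exception):
    """The next attempt might work. Retry."""


def run_with_retry(operation, max_attempts=3, wait_seconds=1.0, sleep=time.sleep):
    """Call operation(); retry only TransientError, up to max_attempts tries."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # budget spent: let the caller see the failure
            sleep(wait_seconds)  # fixed one-second pause between tries


# Hypothetical operation that fails twice, then succeeds.
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("HTTP 503")
    return "page loaded"


# The demo swaps in a no-op sleep so it finishes instantly.
result = run_with_retry(flaky_fetch, sleep=lambda seconds: None)
print(result, "after", calls["count"], "attempts")  # page loaded after 3 attempts
```

Any exception other than TransientError is never caught by the loop, so it reaches the caller on the first occurrence.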
Three Numbers Every Retry Defines
| Parameter | What It Controls | Typical Value |
| --- | --- | --- |
| Maximum attempts | Total number of tries including the first | 3 to 5 for most pipeline operations |
| Wait between attempts | How long to sleep before retrying | 1 to 5 seconds for fixed delays; longer for backoff |
| Which errors retry | The list of exception classes considered transient | Specific exception types, never a bare except |
The Retry That Looks Right But Is Not
# This is broken. It will retry permanent errors forever and never sleep.
while True:
    try:
        do_the_thing()
        break
    except Exception:
        continue
The block above appears in real production code from time to time. It catches every exception, including permanent errors and bugs in the code itself. It has no upper bound on attempts. It does not sleep, so it hammers the downstream as fast as the loop can go. Each of those three properties is wrong in a different way. The first guarantees that a malformed credential keeps retrying with the same wrong credential. The second guarantees that no one notices the failure until the warehouse credit bill arrives. The third turns a momentary downstream slowdown into a self-amplifying outage.
• Naive Retry
▸ Catches all exceptions, retries permanent errors forever
▸ No upper bound on attempts
▸ No sleep between attempts; hammers the downstream
▸ No logging; failures and recoveries are invisible
✓ Disciplined Retry
▸ Catches only specific transient exception classes
▸ Bounded attempts; raises after the budget is spent
▸ Sleep between attempts grows with each failure
▸ Logs every retry with attempt number and elapsed time
Why Specificity Matters
A retry that catches a specific exception class refuses to retry anything else. That refusal is the entire safety guarantee. If the operation throws a permission error, the retry block does not see it, and the error propagates up to the orchestrator where it can be alerted on. If the retry block catches a broad base class, it swallows that permission error and turns it into a slow loop nobody is watching. The specificity is not a stylistic preference; it is the only line that separates a self-healing retry from an infinite-loop bug.
A retry is correct when these four properties hold:
▸ It catches a named transient error type, not Exception
▸ It has a maximum attempt count and gives up when the budget is spent
▸ It sleeps between attempts and the sleep grows over time
▸ It logs each attempt so the recovery is visible
The discipline of writing a retry once and reusing it across the codebase is a force multiplier. A single shared retry helper that every operation can adopt prevents the case where one team's retry block has different semantics than another team's. Consistency at this level matters because failure handling is rarely something engineers think hard about while writing new code; the helper that ships with the right defaults is the helper that gets used. The libraries that have shipped this idea successfully (tenacity in Python, retry in Go, polly in C#) all encode the same four properties: specific exceptions, bounded attempts, sleep that grows, and visible logging. The libraries are the cheapest way to get the right semantics without rewriting them every time.
Logging is often skipped when retries are written quickly. The omission is a mistake. A retry that succeeds silently after three attempts looks identical in operations to a retry that succeeded on the first attempt. The recovery is invisible. Visibility matters because the rate of recoveries is itself a signal: a downstream that needs three retries to succeed eighty percent of the time is degraded, even if every operation eventually completes. The on-call rotation that monitors this rate catches degradations before they become outages. The on-call rotation that does not is fighting incidents from the moment of full failure rather than from the moment of first warning.
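One way to make recoveries visible is to log every failed attempt and every eventual success. The sketch below is illustrative only: the helper name retry_logged, the logger name, and the message formats are assumptions, not a standard API.

```python
import logging
import time

logger = logging.getLogger("pipeline.retry")  # hypothetical logger name


class TransientError(Exception):
    """The next attempt might work. Retry."""


def retry_logged(operation, max_attempts=4, base=1.0, sleep=time.sleep):
    """Bounded retry of TransientError that logs each failure and each recovery."""
    started = time.monotonic()
    for attempt in range(1, max_attempts + 1):
        try:
            result = operation()
        except TransientError as exc:
            elapsed = time.monotonic() - started
            if attempt == max_attempts:
                logger.error("giving up after %d attempts (%.1fs): %s", attempt, elapsed, exc)
                raise
            wait = base * 2 ** (attempt - 1)  # the sleep grows with each failure
            logger.warning("attempt %d failed after %.1fs, retrying in %.1fs: %s",
                           attempt, elapsed, wait, exc)
            sleep(wait)
        else:
            if attempt > 1:
                # Log recoveries too, so the recovery rate can be monitored.
                logger.info("recovered on attempt %d", attempt)
            return result
```

Counting the "recovered on attempt" lines per hour gives the degradation signal the paragraph above describes, without waiting for a full failure.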
TIP
Read every existing retry block in the codebase and check the four properties above. The bug in production tomorrow is statistically likely to be a retry that fails one of them.
Naive Retries and Thundering Herd
Recognize the thundering herd failure mode and apply backoff plus jitter to prevent retry storms from amplifying an outage.
The thundering herd is one of the most cited failure modes in distributed systems and one of the most overlooked by engineers writing their first retry. The shape is straightforward. A downstream service slows down. Many clients fail at roughly the same moment. Each client retries on the same fixed schedule. The retries arrive at the downstream in a synchronized wave that is larger than the original load that caused the slowdown. The downstream goes from slow to dead. Each failed round then feeds the next, larger one. A momentary blip becomes an hour-long outage.
The Anatomy of a Thundering Herd
| Step | What Happens | Effect on Downstream |
| --- | --- | --- |
| 1 | Downstream service hits a brief CPU spike; latency rises | Some requests time out; clients see transient errors |
| 2 | Hundreds of clients see errors at roughly the same second | Hundreds of clients enter their retry path simultaneously |
| 3 | Every client retries after exactly one second | Hundreds of new requests arrive at the downstream in the same second |
| 4 | Downstream is now under double the original load | Latency rises further; more requests time out |
| 5 | Failed retries trigger their next retry one second later | Load doubles again; downstream collapses entirely |
The pattern is self-amplifying. Each round of retries makes the underlying problem worse. Without intervention, the system reaches a steady state where the retries themselves are the load and the original work never completes. Engineers seeing the dashboards during this state see a downstream that appears to be permanently down even though there is no bug, no deployment, and no obvious trigger. The trigger was a one-second latency spike that never repeated. The retry behavior is what kept the outage alive.
A Real Numbers Example
Suppose 500 worker processes each call a downstream API once per minute, for a steady load of about 8 requests per second (500 / 60). A 200-millisecond latency spike at second 30 causes all 500 to time out simultaneously. Each retries one second later. The downstream now sees 500 retry requests at second 31 in addition to whatever organic traffic was scheduled. The request rate went from about 8 per second to about 508 per second in a single second.
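The arithmetic behind those figures is short enough to check directly. The variable names below are illustrative; the inputs are the ones from the example.

```python
workers = 500
calls_per_worker_per_minute = 1

baseline_rps = workers * calls_per_worker_per_minute / 60  # steady-state load
retry_wave = workers                                       # every worker retries in the same second
peak_rps = baseline_rps + retry_wave

print(f"baseline: {baseline_rps:.1f} requests/second")
print(f"peak during the retry wave: {peak_rps:.1f} requests/second")
print(f"amplification: {peak_rps / baseline_rps:.0f}x")
```

The amplification factor is what matters operationally: a service with headroom for a modest spike has no answer to a load that is dozens of times its baseline.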
The Fix Has Two Parts
| Mechanism | What It Does | Why It Helps |
| --- | --- | --- |
| Backoff | Wait longer between each successive retry | Reduces total request volume during the recovery window |
| Jitter | Add a random offset to each retry's wait time | Spreads the retry wave across time so it does not arrive as a single spike |
Backoff alone reduces the size of each retry wave but does not desynchronize the clients. All five hundred clients still retry at the same second, only later. Jitter is what desynchronizes them. With jitter, one client retries at second 31.2, another at second 32.1, another at second 33.7. The downstream sees a stretched-out trickle instead of a synchronized spike. The combination is what makes retries safe at scale.
import random
import time

# Backoff plus jitter: the load on the downstream stays bounded
▸ Wait between attempts grows with each failure (exponential is the standard)
▸ A random offset is added to the wait so retries do not synchronize
▸ There is a hard cap on the wait so a single retry never sleeps for hours
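The three properties in the list above can be combined in a single helper. This is a sketch under stated assumptions: the name retry_with_backoff, its defaults, and the injectable sleep parameter are illustrative, and the jitter strategy shown is full jitter (a uniform draw between zero and the capped wait).

```python
import random
import time


class TransientError(Exception):
    """The next attempt might work. Retry."""


def retry_with_backoff(operation, max_attempts=5, base=1.0, cap=60.0, sleep=time.sleep):
    """Retry TransientError with exponential backoff, a hard cap, and full jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt + 1 == max_attempts:
                raise  # budget spent
            ceiling = min(cap, base * 2 ** attempt)  # grows 1, 2, 4, ... up to the cap
            sleep(random.uniform(0, ceiling))        # random offset desynchronizes clients


# Demo: record the chosen waits instead of sleeping.
waits = []
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 4:
        raise TransientError("HTTP 503")
    return "ok"


print(retry_with_backoff(flaky, sleep=waits.append), [round(w, 2) for w in waits])
```

Because each wait is drawn at random, two runs of the demo print different waits; only the ceilings are deterministic.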
The print output makes the difference visible. Without jitter, one hundred clients land in the same second. With jitter, the same hundred clients spread across four seconds. The downstream sees roughly a quarter of the retry wave in any one second, which is the difference between a survivable retry storm and a fatal one.
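A small simulation along the lines described above, offered as an illustration only. It assumes one hundred clients that all fail at second 30, a fixed one-second retry delay in the first case, and a uniform jitter window of four seconds in the second; the seed is fixed purely so repeated runs match.

```python
import random
from collections import Counter

random.seed(7)  # fixed seed so the sketch is repeatable
CLIENTS = 100
FAILED_AT = 30.0  # second at which every client saw its error

# Without jitter: every client retries exactly one second after the failure.
fixed = [FAILED_AT + 1.0 for _ in range(CLIENTS)]
# With jitter: each client picks its own delay somewhere in a four-second window.
jittered = [FAILED_AT + random.uniform(0.0, 4.0) for _ in range(CLIENTS)]


def per_second(times):
    """Count how many retries land in each whole second."""
    return dict(sorted(Counter(int(t) for t in times).items()))


print("fixed delay:", per_second(fixed))
print("with jitter:", per_second(jittered))
```

The fixed-delay dictionary has a single key holding all one hundred retries; the jittered one splits the same hundred across several keys.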
The thundering herd was named after the behavior of cattle in a panic. The metaphor sticks because the dynamics are the same. A single trigger causes a synchronized response from many independent agents. The synchronized response itself becomes the problem. Engineers who have not experienced a thundering herd outage often underestimate how cleanly the metaphor maps to systems. The fix in livestock is to physically prevent the synchronized motion. The fix in software is the same: prevent the synchronized retry.
Operations teams at companies that run global services have written extensively about thundering herd outages from production. AWS publishes blog posts about retry storms in distributed systems. The pattern is consistent enough across industries that the standard mitigation, full jitter, has become a well-known term in SRE literature. Reading the AWS Architecture Blog post on exponential backoff and jitter is a reasonable evening's investment for any engineer who writes pipeline code. The post predates 2026 by more than a decade and the conclusions have not changed because the underlying mathematics has not changed.
Synchronization is the property that makes the herd thunder. The same property is what makes Sunday morning grocery store rushes worse than Monday afternoon ones: hundreds of independent decisions to shop happen at the same time because the underlying schedule is shared. Software systems are full of shared schedules: every cron job on the hour, every healthcheck on a fifteen-second interval, every retry after a one-second wait. Engineers building distributed systems learn quickly that 'on the hour' is a stronger correlation than expected, and the same applies to retry waits.
The mathematical model for a thundering herd is the M/M/1 queue with a sudden arrival surge. The math says, in plain words, that any service running near its capacity becomes unstable when arrival rate briefly spikes. Most production services are sized to run somewhere between fifty and seventy percent of capacity to leave headroom for these spikes. Retries that are correlated with each other consume that headroom in seconds. The reason exponential backoff and jitter became universal practice is that they restore the headroom by spreading the surge across enough time that the steady-state capacity can absorb it.
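The instability near capacity can be seen from the standard M/M/1 result that mean time in the system is 1 / (service rate - arrival rate). The sketch below evaluates that formula at a few utilization levels; the service rate of 100 requests per second is an assumed figure for illustration.

```python
def mm1_mean_latency(arrival_rate, service_rate):
    """Mean time in system for an M/M/1 queue: W = 1 / (mu - lambda)."""
    if arrival_rate >= service_rate:
        return float("inf")  # arrivals outpace service: the queue grows without bound
    return 1.0 / (service_rate - arrival_rate)


service_rate = 100.0  # requests per second the service can complete (assumed)
for utilization in (0.5, 0.7, 0.9, 0.99):
    latency = mm1_mean_latency(utilization * service_rate, service_rate)
    print(f"utilization {utilization:.0%}: mean latency {latency * 1000:.0f} ms")
```

Latency climbs slowly through the fifty-to-seventy-percent range and then steeply as utilization approaches one, which is why a correlated retry wave that pushes arrivals past the service rate is so damaging.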
A thundering herd is a self-amplifying outage caused by synchronized retries.
Backoff reduces total retry volume; jitter spreads it across time.
Both are required at scale; either one alone leaves the failure mode open.
Exponential Backoff in One Sentence
Compute exponential backoff wait times by hand and explain why the cap and jitter are required, not optional.
Exponential backoff is the standard way to choose how long a retry should wait. The rule fits in one sentence: each successive attempt waits roughly twice as long as the previous one, capped at a maximum. The mechanism is everywhere because it solves two problems at once. It gives the downstream more time to recover with each failure. It bounds the total number of retries that can fit in a given time window. The cap prevents a runaway exponential from sleeping for hours on the fifteenth retry.
The Formula
# Exponential backoff with a base of 1 second and a cap of 60 seconds.
# attempt is zero-indexed: 0 for the first retry, 1 for the second, and so on.


def wait_seconds(attempt, base=1, cap=60):
    return min(cap, base * (2 ** attempt))


for attempt in range(7):
    print(attempt, wait_seconds(attempt), "seconds")
The Numbers In a Real Example
| Attempt (counting from 1) | Computed Wait | Capped Wait |
| --- | --- | --- |
| 1 | 1 second | 1 second |
| 2 | 2 seconds | 2 seconds |
| 3 | 4 seconds | 4 seconds |
| 4 | 8 seconds | 8 seconds |
| 5 | 16 seconds | 16 seconds |
| 6 | 32 seconds | 32 seconds |
| 7 | 64 seconds | 60 seconds (capped) |
| 8 | 128 seconds | 60 seconds (capped) |
After eight attempts, the total elapsed wait is 1 + 2 + 4 + 8 + 16 + 32 + 60 + 60 = 183 seconds, just over three minutes. That is the budget every retry policy implicitly defines. Three minutes is enough time for most genuine transient failures to clear. If a downstream is still failing after three minutes of escalating waits, the failure is almost certainly not transient and the pipeline should escalate to alerting rather than continuing to retry.
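The budget can be computed rather than added up by hand. The helpers below are a sketch: total_wait and worst_case_elapsed are hypothetical names, and the per-call timeout in the second helper is an assumption about the workload, not something the backoff formula defines.

```python
def wait_seconds(attempt, base=1, cap=60):
    """Capped exponential wait; attempt is zero-indexed."""
    return min(cap, base * (2 ** attempt))


def total_wait(retries, base=1, cap=60):
    """Sum of the sleeps that precede each of `retries` retries."""
    return sum(wait_seconds(attempt, base, cap) for attempt in range(retries))


def worst_case_elapsed(tries, call_timeout, base=1, cap=60):
    """Wall time if every try runs to its timeout; sleeps happen only between tries."""
    return tries * call_timeout + total_wait(tries - 1, base, cap)


print(total_wait(8))              # the eight waits from the table: 183
print(worst_case_elapsed(3, 30))  # three tries, assumed 30-second timeout: 93
```

Running the second helper with the real timeout of the downstream call gives the number to compare against the on-call tolerance.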
Adding Jitter to the Backoff
import random


# Full jitter: pick uniformly between 0 and the computed cap.
# This is the AWS Architecture Blog recommendation and the default in many libraries.
def wait_with_full_jitter(attempt, base=1, cap=60):
    upper = min(cap, base * (2 ** attempt))
    return random.uniform(0, upper)


# Equal jitter: half the cap, plus a random half.
# Slightly less variance, slightly higher mean wait.
def wait_with_equal_jitter(attempt, base=1, cap=60):
    upper = min(cap, base * (2 ** attempt))
    return upper / 2 + random.uniform(0, upper / 2)
• Fixed Delay
▸ Same wait between every attempt
▸ Total wait grows linearly with attempts
▸ Synchronizes retries across many clients
▸ Acceptable for very small attempt counts and very small fleets
✓ Exponential Backoff
▸ Wait doubles with each successive attempt
▸ Total wait grows quickly; budget is naturally bounded
▸ Pairs with jitter to spread retries across time
▸ The standard for any pipeline that runs with more than a single worker
Why the Cap Matters
Without a cap, the seventh retry waits 64 seconds, the tenth waits 512 seconds, the fifteenth waits more than four hours. The exponential climbs faster than human intuition predicts. A cap of 60 seconds, or 5 minutes, or 30 minutes, depending on the workload, prevents a retry policy from silently turning a stuck job into an hours-long stall. The cap also signals an operational truth: by the time the retry has been waiting that long, the failure is no longer transient and the pipeline should be telling someone.
▸ Base: the starting wait. Usually 1 second. Smaller for low-latency systems, larger for batch pipelines that can absorb a longer first wait.
▸ Multiplier: how fast the wait grows. Almost always 2. Bigger multipliers give up too quickly; smaller ones make the backoff ineffective.
▸ Cap: the hard ceiling. 60 seconds for synchronous calls, several minutes for batch retries. Prevents runaway exponential growth.
▸ Maximum attempts: when to stop trying. 5 to 8 is typical. Beyond that, the failure is almost certainly not transient and alerting earns its place.
The 'roughly twice as long' rule is approximate on purpose. Some implementations use a multiplier of 1.5 or 3 instead of 2; the choice has subtle effects on how aggressive the backoff is. A multiplier of 1.5 takes more attempts to reach the cap, which means more work hits the downstream during the recovery window. A multiplier of 3 gives up too quickly because the budget is exhausted in fewer total attempts. Two is the convention because it sits in the middle of those tradeoffs and because doubling is computationally trivial. Engineers who pick a non-default multiplier should be able to articulate why the default is wrong for the workload at hand; in nearly every case the answer is that the default is fine.
The interaction between exponential backoff and the underlying operation's own latency is worth pausing on. A retry that fires after a one-second wait looks like it took one second of wall time, but the actual call may have taken thirty seconds to time out before the retry fired. The visible behavior is that an operation 'tried three times in three seconds' when in reality the operation tried three times across ninety-three seconds of wall time. A limit on total elapsed time in the retry budget catches this case, but only if the budget tracks elapsed wall time as a separate dimension from attempt count.
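A retry budget that tracks wall time separately from attempt count might look like the sketch below. The name retry_until_deadline, the max_elapsed parameter, and the injectable clock are assumptions made for illustration; jitter is left out to keep the elapsed-time logic in focus.

```python
import time


class TransientError(Exception):
    """The next attempt might work. Retry."""


def retry_until_deadline(operation, max_attempts=5, max_elapsed=120.0,
                         base=1.0, cap=60.0, sleep=time.sleep, clock=time.monotonic):
    """Retry TransientError until either the attempt budget or the time budget is spent."""
    started = clock()
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            wait = min(cap, base * 2 ** attempt)
            out_of_attempts = attempt + 1 == max_attempts
            # Time already spent includes slow calls that ran to their timeout.
            out_of_time = (clock() - started) + wait > max_elapsed
            if out_of_attempts or out_of_time:
                raise
            sleep(wait)
```

With a thirty-second call timeout and a sixty-second time budget, this helper gives up after the second failed call even though three attempts remain, because the elapsed wall time, not the attempt count, is the binding limit.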
The retry implementation in the standard libraries (tenacity in Python is a good example) compresses many of the lessons in this section into a few lines. Adopting such a library does not absolve a pipeline author of understanding what the library does; the defaults are usually correct, but the explanation in this section is what makes the defaults legible. An engineer who configures tenacity without reading what 'wait_random_exponential' actually does is one bad configuration choice away from a thundering herd.
✓ Do
▸ Pick a base of 1 second and a multiplier of 2 unless there is a specific reason to deviate
▸ Cap the wait so the seventh retry never sleeps longer than the operational tolerance
▸ Pair backoff with jitter so retries from many clients do not synchronize
✗ Don't
▸ Use exponential backoff without a cap; the runaway is silent and expensive
▸ Use a base shorter than the round-trip time to the downstream; the first retry will hit before the failure has time to clear
▸ Treat the maximum-attempts number as an unbounded knob; if the workload needs more than ten attempts, the design needs more than retries
TIP
Compute the worst-case total elapsed time before choosing the cap. If the answer is longer than the on-call tolerance, lower the cap or the attempt count.
When NOT to Retry
Identify the failure categories that should not be retried and route them to quarantine, alerting, or credential rotation instead.
Retry as a tool is so often correct that engineers begin to apply it reflexively. The reflex causes outages of its own. Some failures will never succeed on a second attempt, and retrying them wastes compute, fills up logs, and hides the underlying problem. Knowing the categories where retrying is wrong is as important as knowing how to retry properly. The pipeline that retries correctly on transient errors and refuses to retry on permanent ones is the pipeline that operates predictably.
Three Categories That Should Not Be Retried
| Category | Example | Why Retrying Fails |
| --- | --- | --- |
| Validation failures | A required field is missing from an event | The next attempt sees the same missing field; nothing changes |
| Authentication failures | A 401 from an API because the token expired or is wrong | Retrying with the same credential keeps failing; the credential has to be replaced |
| Poison pills | A specific row that crashes the parser every time it is processed | The row will keep crashing the parser; retrying loses progress on every other row |
Validation Failures
Validation failures occur when input data does not match the expected shape. A field that the contract said was required is missing. A timestamp is in the wrong format. A foreign key points to a row that does not exist. None of these will resolve themselves on a retry. The fix is upstream: either the producer corrects the data, or the pipeline routes the bad record to a quarantine path where a human can inspect it. Retrying a validation failure trades a clear error for a slow grinding one.
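Routing bad records to a quarantine path can be as simple as splitting the batch. The sketch below is illustrative: the field names in REQUIRED_FIELDS and the helper names are hypothetical, and an in-memory list stands in for a real quarantine table.

```python
REQUIRED_FIELDS = ("package_id", "status", "event_time")  # hypothetical contract


def missing_fields(event):
    """Return the required fields that are absent or null in this event."""
    return [field for field in REQUIRED_FIELDS if event.get(field) is None]


def split_batch(events):
    """Separate loadable events from ones that fail validation; never retry the latter."""
    loadable, quarantined = [], []
    for event in events:
        missing = missing_fields(event)
        if missing:
            # Keep the original record and the reason so a human can inspect it.
            quarantined.append({"event": event, "reason": f"missing fields: {missing}"})
        else:
            loadable.append(event)
    return loadable, quarantined


batch = [
    {"package_id": "p1", "status": "shipped", "event_time": "2024-01-01T02:14:00Z"},
    {"package_id": "p2", "status": None, "event_time": "2024-01-01T02:15:00Z"},
]
good, bad = split_batch(batch)
print(len(good), "loaded;", len(bad), "quarantined:", bad[0]["reason"])
```

One bad row costs one quarantine entry, and the rest of the batch loads as usual.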
Authentication and Authorization Failures
A 401 Unauthorized or a 403 Forbidden response is permanent until the credentials change. Retrying does not change the credentials. Retrying does, in some systems, lock the account out after too many failed attempts, turning a fixable problem into a multi-hour incident. The standard pattern is to fail loudly on the first auth error and let an operator rotate the credential. Some teams pair this with an automatic token refresh on 401, which is a different mechanism than a retry: the refresh changes the inputs before the next attempt.
Poison Pills
A poison pill is a record that causes a worker to fail every time the worker tries to process it. The classic example is a streaming pipeline that consumes from Kafka. The worker pulls a message, the message has a bug-triggering shape, the worker crashes. The orchestrator restarts the worker. The worker pulls the same message, crashes again. The pipeline appears to be processing data but is actually stuck in a loop on the poison message, with every other downstream message piling up behind it. The fix is not retrying the poison record more times. The fix is moving it out of the way so the rest of the queue can flow.
If retrying does not change anything between attempts, retrying is the wrong tool:
▸ The data is the same on attempt two as it was on attempt one
▸ The credentials are the same on attempt two as on attempt one
▸ The schema is the same on attempt two as on attempt one
What To Do Instead
| Failure | Action Instead of Retry | Where the Action Lives |
| --- | --- | --- |
| Validation failure on a single row | Quarantine the row, continue processing the rest | Quarantine table or dead letter queue |
| Authentication failure | Stop the pipeline, page on-call, rotate the credential | Alerting and runbook |
| Poison pill on a streaming consumer | Skip the message after N failed attempts, route it to a DLQ | Consumer DLQ configuration |
| Schema mismatch on a downstream insert | Stop the pipeline, alert the producer team | Schema validation stage and on-call |
# A retry decorator that refuses to retry permanent failures.
# Assumes TransientError and PermanentError from the first code sketch.
import time
from functools import wraps


def retry_transient(max_attempts=3, base_delay=1.0):
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except PermanentError:
                    raise  # never retry permanent failures
                except TransientError:
                    if attempt + 1 == max_attempts:
                        raise  # budget spent: surface the failure
                    # Back off before the next attempt; the wait doubles each time.
                    time.sleep(base_delay * 2 ** attempt)
        return wrapper
    return decorator
The decorator above is short, but the discipline it encodes is the difference between a self-healing pipeline and a noisy one. PermanentError raises immediately, never sleeping, never wasting compute. TransientError gets a bounded retry. Anything outside those two classes is uncaught and propagates as a real bug, not a quietly retried one.
• Retry Always
▸ Permanent failures retry until the budget runs out
▸ Repeated attempts with bad credentials trigger account lockouts
▸ Poison pills block streaming pipelines while being reprocessed
▸ On-call gets paged after the budget burns instead of at the first sign of trouble
✓ Retry Selectively
▸ Bad credentials surface within seconds and trigger rotation
▸ Poison pills move to a DLQ after a small bounded retry budget
▸ On-call gets paged on the genuine signal, not the retry aftermath
Authentication failures deserve a closer look because they are among the most common permanent error patterns in third-party integrations. Suppose a token expires every twelve hours and the pipeline runs every fifteen minutes: the token lapses twice a day, once every forty-eight runs. Without a token-refresh mechanism in front of the retry path, the pipeline pages on-call every twelve hours forever. The refresh-then-retry pattern is the standard fix: on a 401, refresh the token, then retry the original request once with the new token. A second 401 after a refresh is a real auth failure and should escalate.
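The refresh-then-retry pattern can be sketched in a few lines. Everything named here is hypothetical: AuthError stands in for whatever exception the HTTP client raises on a 401, and send_request and refresh_token are placeholders for the real client calls.

```python
class AuthError(Exception):
    """Stands in for an HTTP 401 from the downstream API."""


def call_with_refresh(send_request, refresh_token, token):
    """Try once; on a 401, refresh the token and try exactly once more."""
    try:
        return send_request(token)
    except AuthError:
        new_token = refresh_token()     # changes the input before the next attempt
        return send_request(new_token)  # a second AuthError propagates: real auth failure


# Demo with a fake API that only accepts the token "fresh".
def fake_request(token):
    if token != "fresh":
        raise AuthError("HTTP 401")
    return "200 OK"


print(call_with_refresh(fake_request, lambda: "fresh", "stale"))  # 200 OK
```

The second attempt is not a retry in the sense used earlier in the lesson, because the credential differs between the two calls.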
The poison-pill case has a subtle variant worth knowing about. Sometimes the offending message is not malformed in any obvious way; it is only poisonous in combination with a particular state inside the worker. A message that triggers a memory leak in a parser library, for example, is poisonous on workers that have already processed many messages and benign on a fresh worker. The diagnosis is harder because restarting the worker appears to fix the problem, until the leak builds up again. The defense against this class of poison pill is the same as for the simpler kind: bounded retries per message, then route to the DLQ, then continue.
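The bounded-retries-then-DLQ defense might be sketched as follows. The function name consume, the in-memory dead_letters list, and the handler are all illustrative stand-ins for a real consumer and DLQ. The broad except is deliberate at this one boundary: a poison pill can crash the handler with any exception type, and the attempt bound keeps the catch from becoming an infinite loop.

```python
def consume(messages, handler, max_attempts=3):
    """Process each message; after max_attempts failures, move it aside and continue."""
    processed, dead_letters = [], []
    for message in messages:
        for attempt in range(1, max_attempts + 1):
            try:
                processed.append(handler(message))
                break  # success: move on to the next message
            except Exception as exc:  # broad on purpose; bounded by max_attempts
                if attempt == max_attempts:
                    dead_letters.append(
                        {"message": message, "error": repr(exc), "attempts": attempt}
                    )
    return processed, dead_letters


def parse(message):
    """Hypothetical handler that crashes on one particular message shape."""
    if message == "poison":
        raise ValueError("unparseable payload")
    return message.upper()


done, dlq = consume(["a", "poison", "b"], parse)
print(done, dlq)
```

The message after the poison pill is still processed, which is the property the streaming pipeline loses when it retries the bad message indefinitely.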
TIP
When a new error type appears in production, write down what would change between attempt one and attempt two. If the answer is 'nothing,' the error does not belong in the retry path.
PUTTING IT ALL TOGETHER
A small data team runs a nightly job that pulls invoice events from a payment processor's API. The job has been failing once or twice a week for months. The on-call engineer manually restarts it each time. A new engineer is asked to make the failures self-healing without making the system worse. The current code retries every exception forever with no sleep.
Step one: classify every error the API returns. HTTP 503s and connection resets are transient; HTTP 401s, 403s, and 422s are permanent. Anything ambiguous gets a small bounded retry budget.
Step two: replace the catch-all retry with a bounded retry that catches only transient exception classes. Five attempts is the upper bound. Permanent errors propagate immediately so on-call sees them at the first occurrence.
Step three: add exponential backoff with a base of 1 second, a multiplier of 2, and a cap of 60 seconds. Add full jitter so multiple workers do not synchronize against the API rate limiter.
Step four: route any single invoice event that fails validation to a quarantine table rather than failing the whole job. Permanent errors at the row level become row-level escalations, not job-level outages, and the pipeline stays the durable conduit Lesson 1 described.
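Steps one through three can be sketched together as follows. This is an illustration under stated assumptions, not the team's actual code: `ApiError`, `is_retryable`, `backoff_delay`, and `fetch_with_retry` are hypothetical names, and the sketch assumes the API client raises `ApiError` carrying the HTTP status code. The status sets, the five-attempt bound, and the base, multiplier, and cap come from the steps above.

```python
import logging
import random
import time

logger = logging.getLogger("invoice_job")

PERMANENT_STATUS = {401, 403, 422}  # step one: never retried


class ApiError(Exception):
    """Hypothetical exception carrying the HTTP status code."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def is_retryable(exc):
    """Connection resets, 503s, and ambiguous statuses get the bounded budget."""
    if isinstance(exc, ConnectionResetError):
        return True
    return isinstance(exc, ApiError) and exc.status not in PERMANENT_STATUS


def backoff_delay(attempt, base=1.0, multiplier=2.0, cap=60.0):
    """Full jitter: a uniform draw between 0 and the capped exponential ceiling."""
    ceiling = min(cap, base * multiplier ** (attempt - 1))
    return random.uniform(0, ceiling)


def fetch_with_retry(fetch, max_attempts=5, sleep=time.sleep):
    """Call fetch() up to max_attempts times, backing off between attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except (ApiError, ConnectionResetError) as exc:
            # Permanent errors and an exhausted budget propagate immediately.
            if not is_retryable(exc) or attempt == max_attempts:
                raise
            delay = backoff_delay(attempt)
            logger.warning("attempt %d failed (%r); retrying in %.2fs", attempt, exc, delay)
            sleep(delay)
```

The `sleep` parameter is injectable only so the function can be exercised without real waiting. Step four works at a different level: validation runs per invoice event, and a failing event is written to the quarantine table while the job continues.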
KEY TAKEAWAYS
Failures come in two buckets: transient errors heal on retry; permanent errors will not. Naming the bucket precedes choosing the response.
A retry needs four properties to be safe: specific exception types, bounded attempts, sleep between attempts, and visible logging.
Naive retries amplify outages: synchronized retries across many clients form a thundering herd that turns a momentary blip into a sustained outage.
Exponential backoff with jitter is the standard: wait doubles each attempt, capped at a maximum, with a random offset to desynchronize clients.
Some failures should not be retried at all: validation errors, authentication failures, and poison pills require quarantine, credential rotation, or DLQ routing instead.
Some failures heal themselves and some never will; the pipeline must tell the difference
Category: Pipeline Architecture
Difficulty: beginner
Duration: 25 minutes
Challenges: 0 hands-on challenges
Topics covered: Transient vs Permanent Failures, The Retry: Easy to Misuse, Naive Retries and Thundering Herd, Exponential Backoff in One Sentence, When NOT to Retry