Failure Modes and Error Handling: Advanced

A senior data engineer at a public ride-hailing company inherited a pipeline graph with 211 production jobs. Six of them had no retry policy, four had unbounded retries, and none had circuit breakers. The on-call rotation averaged seven pages per week, most of them caused by retries that turned a small failure into a large one. The fix was not to add retries to every job. The fix was to redesign failure handling as a first-class concern of the architecture rather than an afterthought bolted onto each transform. After the redesign, page volume dropped to three per week and the average incident length dropped from forty-five minutes to nine. The work that produced that result is the subject of this lesson. Failure handling at the advanced tier is not a list of patterns. It is a way of seeing the architecture so that failures have planned paths, signals, and recoveries rather than improvised ones.

Failure Classification by Design


Treat failure classification as the design constraint that drives every retry, breaker, queue, and runbook decision in the pipeline.

The beginner tier introduced classification as the first move when designing a retry. The advanced tier reframes classification as the central design constraint of the entire pipeline, not only of the retry block. Every node in the architecture has a failure surface, and that surface determines what retries, what queues, what alerts, and what runbooks the node needs. A pipeline that has not classified its failures has not been designed. It has been written.

The Failure Surface of a Node

Every node in a pipeline graph reads from somewhere, processes, and writes somewhere. Each of those three operations has its own failure modes. The read can fail because the upstream is unavailable, because the credential expired, because the schema changed, or because the data did not arrive in time. The processing can fail because of a bug, memory exhaustion, a poison pill, or a dependency on a service that is itself failing. The write can fail because the destination is unavailable, the constraint is violated, or the partition is locked. A pipeline designer who lists the failure modes of each node before writing any code arrives at retry policies, DLQ paths, and runbooks that match the actual surface. A designer who writes code first ends up bolting failure handling onto whichever surface broke last week.
Node Operation | Common Failure Modes | Pattern That Owns It
Read from source | Source unavailable; auth expired; schema changed | Retry budget; alerting on schema drift; circuit breaker on the source
Parse and validate | Malformed payload; missing required field; type mismatch | Per-row quarantine; threshold-driven batch failure
Enrich via downstream | Downstream slow or down; rate limited; intermittent 5xx | Retry with backoff; circuit breaker; DLQ on exhaustion
Write to destination | Destination locked; quota exceeded; constraint violated | Idempotent write; constraint-violation routing; DLQ for hard failures
Coordinate with peer node | Peer late; peer wrote bad data; peer schema drifted | SLA-based dependency check; quality gate; halting condition

Classification Drives Architecture

Once each node's failure surface is enumerated, the architecture diagram is annotated with the response to each failure mode. A node that reads from a flaky upstream gets a circuit breaker on its read edge. A node that calls a slow downstream gets a retry budget plus a breaker around the call. A node that writes to a destination with constraints gets idempotency, the property covered in Lesson 5, and a quarantine for hard failures. The annotations are not afterthoughts; they appear on the diagram next to the boxes and arrows themselves. The discipline turns the diagram from a sketch into operable documentation.
# A node descriptor with explicit failure handling. Lives next to the node code.
node:
  id: enrich_orders_with_tax
  reads_from: kafka.order_events
  writes_to: snowflake.enriched_orders
  calls: tax_api.v3
  failure_surface:
    read:
      kafka_unavailable:
        retry: { attempts: 5, backoff: exponential, max_delay_s: 60 }
        breaker: { window_s: 30, threshold: 0.5, cooldown_s: 60 }
      schema_mismatch:
        action: alert_and_halt
    process:
      validation_failure:
        action: quarantine_row
        threshold: 0.01
      tax_api_5xx:
        retry: { attempts: 3, backoff: exponential, max_delay_s: 30 }
        breaker: { window_s: 30, threshold: 0.5, cooldown_s: 120 }
      tax_api_4xx:
        action: route_to_dlq
    write:
      snowflake_unavailable:
        retry: { attempts: 5, backoff: exponential, max_delay_s: 60 }
      constraint_violation:
        action: route_to_dlq
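One way to make the descriptor operable is to resolve the planned response for a failure at runtime. The sketch below is illustrative only: it hand-copies a slice of the descriptor above into a Python dict (a real pipeline would presumably parse the YAML file), and `response_for` and the `UNCLASSIFIED` default are hypothetical names, not part of any library.

```python
# Hypothetical sketch: look up the declared response for a failure in a node descriptor.
# The dict mirrors part of the descriptor above; a real pipeline would parse the YAML file.
FAILURE_SURFACE = {
    "read": {
        "kafka_unavailable": {
            "retry": {"attempts": 5, "backoff": "exponential", "max_delay_s": 60},
            "breaker": {"window_s": 30, "threshold": 0.5, "cooldown_s": 60},
        },
        "schema_mismatch": {"action": "alert_and_halt"},
    },
    "process": {
        "validation_failure": {"action": "quarantine_row", "threshold": 0.01},
        "tax_api_4xx": {"action": "route_to_dlq"},
    },
    "write": {
        "constraint_violation": {"action": "route_to_dlq"},
    },
}

# Assumption: an unclassified failure halts and alerts rather than being retried blindly.
UNCLASSIFIED = {"action": "alert_and_halt"}


def response_for(stage, failure, surface=FAILURE_SURFACE):
    """Return the declared response for a failure, or the unclassified default."""
    return surface.get(stage, {}).get(failure, UNCLASSIFIED)


print(response_for("process", "tax_api_4xx"))
print(response_for("write", "disk_full"))
```

Falling back to a halting default is a design choice in this sketch: a failure that nobody classified is surfaced to a human instead of inheriting a retry policy by accident.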

Why Classification Has To Be Explicit

Implicit Classification
  • Each retry block invents its own failure response
  • New error types become production incidents before they get policies
  • On-call relies on tribal knowledge to interpret error logs
  • Adding a downstream service requires reasoning from scratch
Explicit Classification
  • Failure responses are part of the node's specification
  • New error types map to existing categories with defined responses
  • On-call references the failure surface document, not chat history
  • Adding a downstream service inherits the standard surface template

The Cost of Skipping This Step

Pipelines that skip explicit classification accumulate failure-handling debt. Each new error type that surfaces in production produces an ad-hoc response: a retry tweak in one PR, a try-except block in another, a DLQ added six months later when somebody notices messages disappearing. The accumulated layers contradict each other. One block retries something the next block treats as permanent. The pipeline becomes hard to reason about because the rules of its behavior under failure are scattered across the codebase. Consolidating them into a node descriptor is a one-time investment that pays back every time on-call has to read the system.
Signs that classification is implicit:
  • Retry parameters differ across nodes for no documented reason
  • Some nodes have DLQs, others do not, with no policy distinguishing them
  • Engineers ask 'what does this pipeline do on a 503?' and get different answers
  • New downstream integrations follow whatever pattern the previous integration used, transitively
Every node has a failure surface; enumerating it before writing code prevents bolted-on responses.
Failure responses belong in the node specification, not scattered across try-except blocks.
Classification is the design constraint that determines retry, breaker, DLQ, and runbook placement.
The vocabulary used across nodes matters more than the specific names chosen. Whether a team uses 'transient' and 'permanent' or 'retryable' and 'fatal' is less important than that every node uses the same words. A pipeline graph in which one node speaks of 'retryable errors' and the next speaks of 'transient failures' creates a translation cost on every code review and every incident investigation. Vocabulary discipline is mundane and unglamorous; it is also one of the highest-return investments a senior engineer can champion. Codifying the vocabulary in a one-page reference and pinning it in the team channel takes about an hour and prevents thousands of hours of confused conversation across the next several years.
Reviewing the failure surface as part of pipeline review brings the same benefits that schema review brings to data modeling: surface incompatibilities early, catch ambiguity before it ships, force a conversation that would otherwise happen during the next outage. The review process should be lightweight; ten minutes per node is enough for an experienced reviewer to spot the gaps. The discipline scales because each new pipeline reuses the standard surface template; only the divergences require fresh thought.
The maturity model from the advanced tier of Lesson 1 (script, scheduled job, service, product) maps directly onto failure handling. A pipeline at level 0 has ad-hoc retries scattered across the code. A pipeline at level 1 has consistent retries within a single pipeline but no shared standard. A pipeline at level 2 has a shared retry library across nodes. A pipeline at level 3 has explicit failure surfaces, shared libraries, and DLQs treated as quality signals. The progression is the same one this lesson describes; the levels are a convenient way to talk about where the work stands on any given pipeline.
Do
  • Write the failure surface document before the node code; treat it as a reviewable artifact
  • Use the same vocabulary across nodes (transient, permanent, ambiguous, schema_drift, validation)
  • Inherit a standard surface template when adding a new downstream service
Don't
  • Discover failure modes in production; the surface should be enumerated up front
  • Allow each node to invent its own retry shape; consistency is the foundation of operability
  • Treat the classification document as static; update it as new modes appear
[Diagram: a pipeline task retries transient failures with backoff and writes to the warehouse on success; permanent failures go to the DLQ.]

Classify the failure first: transient errors (timeout, lock) get retried with exponential backoff; permanent errors (bad schema) go straight to a dead-letter queue. Retrying a permanent error just wastes time.
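The classify-first rule can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reference implementation: `TransientError`, `PermanentError`, `run_task`, and `backoff_delay` are hypothetical names, the DLQ is a plain list, and the default of three attempts with a 1-second base and 30-second cap is chosen only for the example.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, lock contention)."""


class PermanentError(Exception):
    """Hypothetical marker for failures that retrying cannot fix (bad schema)."""


def backoff_delay(attempt, base_delay_s=1.0, max_delay_s=30.0):
    """Exponential backoff with full jitter: uniform in [0, min(max, base * 2**attempt)]."""
    return random.uniform(0, min(max_delay_s, base_delay_s * 2 ** attempt))


def run_task(task, payload, dlq, attempts=3, sleep=time.sleep):
    """Run task(payload); retry transient failures, dead-letter everything else."""
    for attempt in range(attempts):
        try:
            return task(payload)
        except PermanentError as exc:
            # Permanent: do not spend the retry budget, route straight to the DLQ.
            dlq.append({"payload": payload, "error": repr(exc), "attempts": attempt + 1})
            return None
        except TransientError as exc:
            if attempt == attempts - 1:
                # Retry budget exhausted: route to the DLQ instead of looping forever.
                dlq.append({"payload": payload, "error": repr(exc), "attempts": attempts})
                return None
            sleep(backoff_delay(attempt))
```

The `sleep` parameter is injectable so the retry loop can be exercised in tests without waiting on real delays.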

The DLQ as a Quality Signal


Read the DLQ as a leading-indicator quality signal and design metrics that surface upstream regressions before consumers notice them.

The intermediate tier introduced the DLQ as durable storage for failed messages. The advanced framing is that the DLQ is also a quality signal. The contents of the DLQ contain information about upstream health, downstream stability, and producer correctness that is not available anywhere else in the system. A growing DLQ is rarely an operational nuisance alone; it is usually a leading indicator of a problem that has not yet manifested in any other dashboard. Reading the DLQ as a signal, not as a queue alone, is the difference between catching an upstream regression on day one and catching it on day fourteen when a consumer notices.

Lesson 7 (data quality) treats DLQ growth rate as the most operationally honest data-quality metric. A growing DLQ is a leading indicator of upstream contract drift, even when no explicit quality gate has fired.

Four Things the DLQ Shape Reveals

DLQ Pattern | What It Reveals | Action
Sudden spike in validation failures | An upstream producer changed schema or started emitting bad data | Page the producer team; check recent deploys upstream
Sudden spike in auth failures | A credential expired or was rotated without coordination | Rotate credential; correlate with secret-management deploys
Steady drip of timeout-exhaustion entries | A downstream service is degraded but not entirely down | Investigate downstream latency before it becomes a full outage
Concentration of failures from one partition | A single key is producing poison events; bad data isolated to a tenant | Quarantine the partition; coordinate with the affected producer
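The last pattern in the table, concentration in one partition, is straightforward to detect mechanically. The sketch below is a rough illustration: it assumes each DLQ entry carries a `partition` field (a hypothetical name) and treats a 50% share as "concentrated", a threshold picked only for the example.

```python
from collections import Counter


def dominant_partition(entries, share_threshold=0.5):
    """Return (partition, share) if one partition holds at least share_threshold of entries."""
    counts = Counter(entry["partition"] for entry in entries)
    if not counts:
        return None
    partition, count = counts.most_common(1)[0]
    share = count / len(entries)
    return (partition, share) if share >= share_threshold else None


entries = [
    {"partition": "tenant-a"},
    {"partition": "tenant-a"},
    {"partition": "tenant-a"},
    {"partition": "tenant-b"},
]
print(dominant_partition(entries))
```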

Metrics That Read the DLQ

DLQ size is a useful metric. DLQ rate-of-change is more useful. DLQ rate-of-change broken down by exception type is the most useful. A pipeline that emits the per-second arrival rate of DLQ entries by exception type produces a dashboard that surfaces upstream regressions within minutes of their first occurrence. The same dashboard exposes the steady drip patterns that precede full outages. Without these metrics, the DLQ is a graveyard checked weekly. With them, it is an early-warning system checked continuously.
/* DLQ shape over the last 24 hours, broken down by exception type. */
/* Run on a continuous schedule; alert on rates above the rolling baseline. */
WITH recent AS (
    SELECT
        exception_type,
        DATE_TRUNC('minute', failed_at) AS minute,
        COUNT(*) AS arrivals
    FROM dlq_envelope_log
    WHERE failed_at >= NOW() - INTERVAL '24 hours'
    GROUP BY 1, 2
),
baseline AS (
    SELECT
        exception_type,
        AVG(arrivals) AS mean_per_minute,
        STDDEV(arrivals) AS sd_per_minute
    FROM recent
    GROUP BY 1
)

SELECT
    r.exception_type,
    r.minute,
    r.arrivals,
    b.mean_per_minute,
    /* NULLIF avoids division by zero when every minute had the same count. */
    (r.arrivals - b.mean_per_minute) / NULLIF(b.sd_per_minute, 0) AS z_score
FROM recent AS r
INNER JOIN baseline AS b USING (exception_type)
/* NULLS LAST keeps rows without a z-score from sorting ahead of real spikes. */
ORDER BY z_score DESC NULLS LAST

Tracking DLQ Trends Over Time

A useful dashboard plots DLQ arrival rate over a thirty-day window with a band showing the rolling baseline plus and minus two standard deviations. Spikes above the band trigger an alert. Drips below the band but above the baseline go on a watch list. Mature data engineering organizations operating at scale treat this dashboard the same way an SRE team treats a service health dashboard. The DLQ is not a curiosity; it is a continuous quality signal.
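The band logic described above reduces to two comparisons. The following is a small sketch rather than a production alerting rule: `classify_minute` is a hypothetical helper, the baseline is a plain mean over whatever history is passed in, and the sample numbers are invented for illustration.

```python
from statistics import mean, stdev


def classify_minute(arrivals, history, band_sds=2.0):
    """Label one minute's arrival count against a rolling baseline.

    history: recent per-minute arrival counts for the same exception type
    (at least two values, since a sample standard deviation is needed).
    Returns "spike" above the upper band, "watch" between the baseline and
    the band, and "normal" at or below the baseline.
    """
    baseline = mean(history)
    upper = baseline + band_sds * stdev(history)
    if arrivals > upper:
        return "spike"
    if arrivals > baseline:
        return "watch"
    return "normal"


history = [4, 5, 6, 5, 4, 6, 5, 5]  # invented sample: mean of 5 arrivals per minute
for count in (5, 6, 12):
    print(count, classify_minute(count, history))
```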
DLQ as Graveyard
  • Checked weekly when somebody remembers
  • No metrics emitted; size is the only number anyone has
  • Investigations begin when a consumer complains
  • Replay tooling is invented during incidents
DLQ as Quality Signal
  • Continuous arrival-rate dashboard with type breakdown
  • Z-score alerts catch regressions within minutes
  • Investigations begin when the dashboard moves, before consumers notice
  • Replay tooling is part of the standard pipeline scaffold

The DLQ as a Producer Contract Test

When the DLQ rate of validation failures spikes, the cause is almost always upstream. The producer changed something. The DLQ becomes the test that flags the change. Treating the DLQ this way reframes the relationship between the data engineering team and producer teams. The DLQ entries are evidence; they show up with full envelopes that the producer team can read. Conversations move from 'is something wrong' to 'this specific field stopped being populated last Wednesday at 3:47pm.' The conversation moves faster because the data is concrete.
What a high-functioning DLQ dashboard shows:
  • Arrival rate over time, broken down by exception type
  • Top correlation IDs in the last hour, useful for spotting bad-actor producers
  • Distribution of attempts at the time of routing (was the budget always exhausted, or did some go in immediately as permanent?)
  • Time since the oldest unreplayed entry; freshness of the DLQ contents itself
  • Arrival rate (how fast messages enter the DLQ): The primary signal. Sudden spikes mean upstream changed; steady drips mean downstream is degraded.
  • Type breakdown (which classification is dominating): Validation versus auth versus exhaustion tells different stories about what is wrong.
  • Replay rate (how fast humans drain the DLQ): If arrivals exceed replay, the DLQ is growing unboundedly. The metric forces a real conversation.

A useful operational rule: if the DLQ arrival rate exceeds the replay rate for more than 24 hours, the DLQ is functionally a drop. The math is simple. Whatever is going in faster than humans drain it accumulates, and accumulated entries past a retention window are lost. The metric to watch is not DLQ size; it is the difference between arrival rate and replay rate, integrated over time.
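The rule can be checked with simple arithmetic over hourly counts. The sketch below assumes hourly arrival and replay counts are already available as two lists; the function names and the sample figures are hypothetical.

```python
def arrivals_vs_replays(arrivals_per_hour, replays_per_hour):
    """Return (longest run of consecutive hours with arrivals > replays, net backlog change)."""
    longest = current = 0
    net_change = 0
    for arrived, replayed in zip(arrivals_per_hour, replays_per_hour):
        net_change += arrived - replayed  # arrival rate minus replay rate, summed over time
        current = current + 1 if arrived > replayed else 0
        longest = max(longest, current)
    return longest, net_change


def is_functionally_a_drop(arrivals_per_hour, replays_per_hour, limit_hours=24):
    """True when arrivals have outpaced replays for more than limit_hours in a row."""
    longest, _ = arrivals_vs_replays(arrivals_per_hour, replays_per_hour)
    return longest > limit_hours


# Invented sample: 10 entries arrive per hour while humans replay 4 per hour, for 30 hours.
print(arrivals_vs_replays([10] * 30, [4] * 30))
print(is_functionally_a_drop([10] * 30, [4] * 30))
```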

The DLQ as quality signal also reframes how producer teams are evaluated. A producer team whose schema changes generate spikes in the DLQ has a measurable quality problem. The measure is concrete and attributable, which is the precondition for change. Without the metric, the producer team has no feedback loop and quality stays constant. With the metric, the feedback loop closes within minutes of a regression, and producer teams begin treating their consumers as customers rather than as downstream complaints.
The same DLQ-as-signal framing applies at the organizational level. Teams whose pipelines have growing DLQs are teams whose upstream relationships need attention. Aggregating DLQ depth across all pipelines and ranking the producers whose data lands there most often produces a list of teams to coordinate with. The list is concrete; the conversation is concrete. This kind of structured visibility is what turns failure handling from a per-team concern into an organizational discipline. In some organizations this aggregation drives a quarterly health review where the producers and consumers sit at the same table and walk through the data together. The conversations are uncomfortable but productive because the data leaves no room for ambiguity about which side has the most leverage to improve quality.
TIP: Surface DLQ depth and arrival rate to the same dashboard as data-quality metrics. The DLQ is a quality signal; treating it as a separate operational concern hides what it is.

Reprocessing From the DLQ


Build replay tooling that turns the DLQ into a recovery surface with inspection, annotation, throttled replay, and idempotent writes.

A DLQ that is hard to drain is functionally a drop with extra storage cost. The advanced framing is that DLQ tooling is a first-class part of the pipeline architecture, not an optional postscript. The tooling has three jobs. It must let a human inspect failed messages without writing custom queries. It must let a human modify or annotate messages before replay. It must let a human replay one message, a hundred messages, or all messages of a particular exception type, with bounded blast radius and observable progress. Without tooling that does these three things, the DLQ becomes a liability rather than an asset.

The Three Capabilities of a Replay Tool

Capability | Why It Matters | Implementation Note
Inspection | An engineer looks at a failed message in seconds, not in a query | A simple web view that decodes the envelope and shows the original payload, exception, and attempts
Annotation | An engineer fixes a malformed field before replay | An edit form that updates the payload and writes a new envelope with a 'modified' flag
Bounded replay | Replays do not flood the main pipeline; failed replays land back in the DLQ | Replay batches with throttling, dry-run mode, and re-routing on second failure

Inspection: The Read Path

The simplest inspection tool is a small internal web app that reads from the DLQ and renders each entry as a structured view. The view shows the original payload as JSON, the exception type and message, the attempts, the timestamp, and the correlation ID. The same view supports filtering by exception type, by time range, and by correlation ID. The implementation is rarely more than a few hundred lines of code, but the productivity gain over reading raw envelopes from a Kafka topic is large enough that the investment is among the most leveraged work an SRE can do on a pipeline.

Annotation: The Edit Path

Some failures are recoverable only after a human modifies the message. A field was missing; the engineer fills it in. A timestamp was malformed; the engineer corrects the format. The annotation path lets an engineer make this fix in the tool, save the corrected message, and replay it through the pipeline. The corrected message is stored separately from the original and is tagged so the audit trail is preserved. The pipeline that processes the replay sees the modified payload, not the original; the DLQ retains both. The audit trail matters because someone, eventually, will ask what was changed and by whom.
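The edit path can be sketched as a function that never mutates the stored entry. This is an illustration under assumptions: the `original_payload` and `modified_payload` field names follow the replay code in this lesson, while `annotate`, the `modified` flag name, and the `audit` fields are hypothetical.

```python
import copy
from datetime import datetime, timezone


def annotate(entry, changes, editor):
    """Return a new DLQ envelope with an edited payload; the original entry is untouched."""
    annotated = copy.deepcopy(entry)
    modified = copy.deepcopy(entry["original_payload"])
    modified.update(changes)
    annotated["modified_payload"] = modified
    annotated["modified"] = True
    # Audit fields (hypothetical names): who changed which fields, and when.
    annotated["audit"] = {
        "edited_by": editor,
        "edited_at": datetime.now(timezone.utc).isoformat(),
        "changed_fields": sorted(changes),
    }
    return annotated


entry = {
    "correlation_id": "c-1",
    "original_payload": {"order_id": 7, "ts": "bad"},
    "attempt_count": 3,
}
fixed = annotate(entry, {"ts": "2024-01-01T00:00:00Z"}, editor="alice")
print(fixed["modified_payload"], fixed["audit"]["changed_fields"])
```

Keeping `original_payload` intact alongside `modified_payload` is what lets the DLQ retain both versions for the audit trail.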
# A bounded replay function with dry-run support.
# NOTE: pipeline_handler and route_to_dlq are assumed to be defined by the
# surrounding pipeline module; sink is accepted but unused in this excerpt.
import time


def replay(dlq_entries, sink, dry_run=True, batch_size=10):
    successes = 0
    failures = 0
    for batch_start in range(0, len(dlq_entries), batch_size):
        batch = dlq_entries[batch_start:batch_start + batch_size]
        for entry in batch:
            # Prefer the annotated payload when an engineer has edited the message.
            payload = entry.get("modified_payload") or entry["original_payload"]
            if dry_run:
                print(f"DRY RUN would replay: {entry['correlation_id']}")
                continue
            try:
                pipeline_handler(payload)
                successes += 1
            except Exception as exc:
                # Failed replays go back to the DLQ with attempt_count incremented.
                route_to_dlq(payload, exc, attempt_count=entry["attempt_count"] + 1, worker_id="replayer")
                failures += 1
        if not dry_run:
            time.sleep(1)  # throttle between batches so the replay does not flood the pipeline
    return successes, failures

Bounded Replay: The Write Path

A replay tool that floods the pipeline with ten thousand messages in a second is indistinguishable in effect from a thundering herd. The replay path needs the same restraints as the original path: throttling, batching, observable progress, and re-routing on second failure. A failed replay should not loop forever; the second failure routes back to the DLQ with an incremented attempt counter, and after a small number of replay failures the message is marked unrecoverable and a human is paged.
Naive Replay
  • Reads everything in the DLQ at once
  • Runs full speed against the main pipeline
  • No dry-run; first run is the real run
  • Failed replays loop forever or are silently dropped
Bounded Replay
  • Reads in bounded batches with throttling
  • Respects the same rate limits as the main pipeline
  • Dry-run mode previews the work before committing
  • Failed replays return to the DLQ with attempt counter; unrecoverable after N replay failures
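The "unrecoverable after N replay failures" rule can be sketched as a small state transition. The field names `replay_failures` and `status` and the limit of three are assumptions for illustration; the text only specifies "a small number".

```python
MAX_REPLAY_FAILURES = 3  # assumption: the lesson only says "a small number"


def after_failed_replay(entry, max_replay_failures=MAX_REPLAY_FAILURES):
    """Decide what happens to an entry whose replay just failed.

    replay_failures is a hypothetical envelope field counting failed replays.
    Returns a copy of the entry with the counter incremented and a status set.
    """
    updated = dict(entry)
    updated["replay_failures"] = entry.get("replay_failures", 0) + 1
    if updated["replay_failures"] >= max_replay_failures:
        updated["status"] = "unrecoverable"  # stop replaying; page a human
    else:
        updated["status"] = "dead_lettered"  # back to the DLQ for another attempt
    return updated


entry = {"correlation_id": "c-9"}
for _ in range(3):
    entry = after_failed_replay(entry)
    print(entry["replay_failures"], entry["status"])
```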

Idempotency Is the Prerequisite

Replays only work safely if the pipeline is idempotent. The mechanics of idempotent writes were the subject of Lesson 5: partition overwrite, MERGE on a business key, delete-then-insert in a transaction. Without those mechanics, a replay produces duplicate rows downstream. The replay tool inherits the idempotency property from the pipeline. A replay tool attached to a non-idempotent pipeline is a duplicate-generation tool wearing recovery clothing.
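A minimal demonstration of why a write keyed on the business key makes replay safe. The sketch uses Python's built-in `sqlite3` and SQLite's `INSERT OR REPLACE` as a stand-in for a warehouse MERGE; the table and column names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE enriched_orders (order_id TEXT PRIMARY KEY, tax_cents INTEGER)"
)


def write_idempotent(conn, order_id, tax_cents):
    """Write keyed on the business key: a replayed row overwrites instead of duplicating."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO enriched_orders (order_id, tax_cents) VALUES (?, ?)",
            (order_id, tax_cents),
        )


write_idempotent(conn, "o-1", 125)
write_idempotent(conn, "o-1", 125)  # replay of the same message
row_count = conn.execute("SELECT COUNT(*) FROM enriched_orders").fetchone()[0]
print(row_count)  # 1
```

Without the primary key on `order_id`, the second call would insert a second row, which is the duplicate-generation failure the paragraph above warns about.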
What a complete DLQ tooling stack provides:
  • Inspection without writing queries
  • Annotation with audit trail
  • Bounded, throttled replay with dry-run preview
  • Re-routing on second failure with incremented attempt count
  • Idempotent downstream writes so replays are safe
A DLQ without tooling is a graveyard; tooling is what makes it a recovery surface.
Replay is bounded, throttled, and observable, the same way the main pipeline is.
Idempotency is the prerequisite; without it, replays generate duplicates.
An often-overlooked aspect of replay tooling is the audit trail. Every replay action should produce a record: who replayed which messages at what time, with what modifications, and what the outcome was. The record matters for compliance reasons in regulated industries and for debugging reasons in every industry. A replay that succeeds but produces unexpected downstream effects is hard to debug without an audit trail. The cost of producing one is small; the cost of operating without one shows up only at the worst possible moment. The audit log itself becomes a useful artifact for retrospectives: which failure modes are most often recovered by replay, which require manual upstream fixes, and which are unrecoverable. The patterns inform the next quarter of platform investment.
TIP: Build the replay tool as part of the same project that ships the DLQ. Adding it later is twice the work because by then production has accumulated DLQ entries that the tool was not designed to handle.

Cascading Failures and Backpressure


Diagnose cascading failures as queue-overflow problems and apply backpressure or load shedding to bound the pipeline's behavior under downstream slowness.

A cascading failure is the failure mode where one slow component brings down everything upstream of it. The mechanism is a queue that fills faster than it drains. The slow downstream cannot keep up with the producer. The producer keeps producing because nothing tells it to stop. The queue fills. The producer's memory fills. The producer crashes. The producer's upstream begins to fill its own queue, and the failure propagates backward through the graph. The original cause was a slow downstream; the visible symptom is the upstream falling over. Cascading failures are the most expensive class of pipeline outage because the recovery requires restarting many components in the right order.

The Anatomy of a Cascade

Stage | What Happens | Where the Pressure Lands
1 | Downstream service slows from 100ms to 2 seconds | Pipeline workers spend more time waiting
2 | Queue between producer and worker fills up | Memory pressure on the queue
3 | Producer cannot enqueue; producer threads block or crash | Producer-side outage
4 | Upstream of the producer starts queueing | Cascade propagates one hop further upstream
5 | Eventually the entire pipeline graph is stalled | Recovery requires restarting in topological order
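A little arithmetic shows how quickly stage 1 turns into stage 2. The sketch below assumes each worker makes one blocking downstream call at a time, so pool throughput is workers divided by latency; the worker count, arrival rate, and queue capacity are invented numbers, and only the 100ms and 2-second latencies come from the table.

```python
def drain_rate(workers, latency_s):
    """Events per second a pool completes when each worker makes one blocking call at a time."""
    return workers / latency_s


def seconds_until_full(capacity, arrival_rate, workers, latency_s):
    """Time for a buffer of the given capacity to fill; None if the pool keeps up."""
    growth = arrival_rate - drain_rate(workers, latency_s)
    if growth <= 0:
        return None
    return capacity / growth


# Illustrative numbers only: 10 workers, 80 events/s arriving, room for 10,000 events.
print(seconds_until_full(10_000, 80, workers=10, latency_s=0.1))  # None: 100/s drain keeps up
print(seconds_until_full(10_000, 80, workers=10, latency_s=2.0))  # about 133 seconds
```

Under these assumptions, a twentyfold latency increase cuts the drain rate from 100 to 5 events per second, and the buffer is full in a little over two minutes.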

Backpressure: The Standard Mitigation

Backpressure is the mechanism by which a slow component tells its producer to slow down. Backpressure is not a software pattern; it is a design property of every queue between two components. The producer asks the queue to enqueue. If the queue is full, the producer waits. The wait propagates back to the producer's caller, and so on, until the original source slows its rate to match what the slow downstream can absorb. The principle is mechanical: every component runs at the rate of the slowest one downstream of it. The pipeline does not crash; it operates more slowly until the downstream recovers.
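# A bounded queue with backpressure: put() blocks when the queue is full.
import queue
import threading
import time

bounded = queue.Queue(maxsize=100)


def producer(items):
    for item in items:
        bounded.put(item)  # blocks while the queue already holds 100 unread items


def consumer():
    while True:
        item = bounded.get()
        time.sleep(0.05)  # slow downstream: roughly 20 items per second
        bounded.task_done()


# Run the consumer in a daemon thread so the script exits once the work is done.
threading.Thread(target=consumer, daemon=True).start()
producer(range(200))
bounded.join()  # wait until every queued item has been processed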

Why Unbounded Queues Are the Bug

An unbounded queue lets the producer pretend the downstream is fast. The queue accepts every put() instantly because it has no maximum size. Memory grows without limit. The illusion of forward progress hides the fact that no work is actually being completed by the downstream. When the queue eventually exhausts memory, the producer crashes. The producer's queue then accepts no more, and the pressure migrates upstream. The first principle of cascading-failure prevention is that every queue has a maximum size and that the maximum size is enforced by blocking the producer.
Unbounded Queue
  • Producer never blocks; queue grows without limit
  • Memory exhausts; producer crashes
  • Crash propagates upstream as the next queue fills
  • Recovery requires restarting in topological order
Bounded Queue with Backpressure
  • Producer blocks when the queue is full
  • No memory exhaustion; rate matches the downstream
  • Slow downstream causes slow upstream, but not crashes
  • Recovery is automatic when the downstream recovers

Load Shedding: When Slowing Down Is Not Enough

Backpressure works when the workload allows the upstream to slow down. For some workloads (a public API, a real-time event stream from millions of devices), the upstream cannot be slowed without losing data. The choice in those cases is load shedding: dropping or rejecting requests when the queue fills, so the system stays operational at reduced fidelity rather than failing entirely. Load shedding requires explicit decisions about which messages to drop, where to record what was shed, and how to alert when the shedding rate exceeds a threshold. The pattern is unfashionable because it admits to dropping data, but it is more honest than the alternative, which is silently dropping data through queue overflow.
Three responses to a slow downstream:
  • Backpressure: producer slows to match downstream throughput
  • Load shedding: producer drops messages above a threshold; sheds are recorded for replay
  • Buffering with a separate path: queue overflow routes to a holding area for later reprocessing
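Load shedding with a record of what was shed can be sketched on top of the standard library's bounded queue. `SheddingQueue`, `offer`, and `shed_rate` are hypothetical names, and the in-memory `shed` list stands in for whatever durable store a real system would use.

```python
import queue


class SheddingQueue:
    """Bounded queue that rejects new items when full and records what was shed."""

    def __init__(self, maxsize):
        self._queue = queue.Queue(maxsize=maxsize)
        self.shed = []  # assumption: a real system would write sheds to durable storage

    def offer(self, item):
        """Enqueue without blocking; return False and record the item if the queue is full."""
        try:
            self._queue.put_nowait(item)
            return True
        except queue.Full:
            self.shed.append(item)
            return False

    def shed_rate(self, offered):
        """Fraction of offered items that were shed; alert when this crosses a threshold."""
        return len(self.shed) / offered if offered else 0.0


q = SheddingQueue(maxsize=3)
accepted = [q.offer(n) for n in range(5)]
print(accepted, q.shed, q.shed_rate(offered=5))
```

The difference from backpressure is the non-blocking `put_nowait`: the producer is never slowed, and the cost is paid in recorded, explicit drops.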

Backpressure in Streaming Systems

Modern streaming systems implement backpressure as a first-class feature. Kafka consumers control their consumption rate via offset management; a slow consumer falls behind, and the broker holds messages until the retention window expires. Flink propagates backpressure through its operator network: if a sink slows, every upstream operator slows in response. Spark Structured Streaming uses rate limits and trigger-based execution to bound the work per micro-batch. The mechanisms differ; the principle is the same: every link in the chain runs at the rate of the slowest link, and the entire chain has a known maximum throughput at any moment.
When the bounded-queue example from earlier in this section runs with a consumer thread, the producer blocks on every put() once the queue fills up. The producer's effective rate then matches the consumer's rate of about 20 items per second (one item per 0.05-second sleep). No memory grows. No crash happens. The pipeline operates at the speed of its slowest component, which is exactly what a healthy backpressured pipeline does.
  • Backpressure (producer slows to match downstream): Bounded queues that block on put() when full. The default mitigation for cascading failures.
  • Load shedding (producer drops above threshold): When the upstream cannot be slowed (real-time streams, public APIs), drop excess work explicitly.
  • Circuit breaker (stop calling the slow downstream): From the intermediate tier; complementary to backpressure when the slowness is sustained.
TIP: Audit every queue in the pipeline for a maxsize. An unbounded queue is the seed of the next cascading failure outage; the audit takes minutes and prevents incidents that take hours.

Postmortem: Six-Hour Outage


Read a real outage as a sequence of missing patterns and apply the standard checklist that prevents the next one.

The patterns become real when read against an actual incident. The postmortem below describes a real outage at a mid-size streaming company, with details lightly altered. The pipeline ingested clickstream events into a ClickHouse cluster for product analytics. A configuration drift in one pipeline node caused a six-hour outage in which dashboards across the company showed empty graphs. The postmortem is structured the way Google's SRE book recommends: facts, timeline, root cause, contributing factors, action items. Reading the section end to end demonstrates how every pattern from the prior lessons either prevented or could have prevented part of the outage.

The Pipeline Before the Incident

Web app: clickstream
    |
    v
Kafka: raw_clicks (retention: 24 hours)
    |
    v
Worker pool: enrich + write
    |
    +-- ...
    +-- ...
    +-- ...

Timeline

Time (UTC) | Event | What the Team Saw
06:14 | Geo-IP service deploys a config change that increases p99 latency from 80ms to 4 seconds | No alert; the deploy went unnoticed
06:31 | Pipeline worker calls to Geo-IP begin timing out at the 30-second client timeout | Worker logs show timeouts; no metric crossed a threshold
06:47 | First on-call page: 'enriched_clicks staleness > 30 minutes' | Engineer paged; investigation begins
07:02 | Engineer identifies Geo-IP as the slow downstream; restarts the pipeline workers | Workers restart cleanly but immediately begin timing out again
07:30 | Kafka topic raw_clicks crosses 90% retention; messages begin to expire | Data loss begins; no alert on retention pressure
09:14 | Geo-IP team rolls back the config change | Geo-IP latency returns to 80ms
09:18 | Pipeline workers begin processing again; Kafka backlog is six hours deep | Catch-up begins
12:41 | Kafka backlog cleared; dashboards return to normal freshness | Total outage duration: 6h 27m

Root Cause

The single root cause of the outage was the Geo-IP config change. Everything downstream of that, including six hours of empty dashboards and three hours of permanent data loss, was a consequence of failure handling that was not in place. The pipeline had no retry policy on the Geo-IP call, so every request waited the full 30-second timeout before failing. The pipeline had no circuit breaker, so the workers continued issuing requests during the entire latency degradation. The pipeline had no DLQ, so messages that could not be enriched piled up in Kafka until retention dropped them. The pipeline had no backpressure on Kafka beyond retention itself, so the producer kept producing into a topic that had no consumers.

Contributing Factors

| Factor | Why It Made the Outage Worse | What Should Have Existed |
| --- | --- | --- |
| No retry on Geo-IP call | Each request waited 30 seconds before failing | Retry budget with 5-second base, max delay 30s, 3 attempts |
| No circuit breaker on Geo-IP | Workers kept calling the degraded service for hours | Breaker opening after 50% failure rate over 60s window |
| No DLQ | Messages that could not be enriched piled up until retention killed them | DLQ topic with envelope; replay tool |
| No alert on Kafka backlog growth | Retention pressure was visible only when retention dropped messages | Alert on backlog > 1 hour worth of work |
| No alert on consumer lag | Workers were timing out; no metric flagged it | Consumer lag dashboard; page on lag > 30 min |
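The two alert rows reduce to simple arithmetic, sketched below. The function names and the inputs (a backlog message count, a normal processing rate, and a consumer lag in minutes) are assumptions for illustration; where those numbers come from depends on the monitoring stack in use. The default limits are the one-hour backlog and 30-minute lag from the table.

```python
def backlog_in_hours(backlog_messages, normal_rate_per_s):
    """Express a backlog as hours of work at the pipeline's normal processing rate."""
    if normal_rate_per_s <= 0:
        raise ValueError("normal_rate_per_s must be positive")
    return backlog_messages / normal_rate_per_s / 3600.0


def evaluate_alerts(backlog_messages, normal_rate_per_s, consumer_lag_min,
                    backlog_limit_h=1.0, lag_limit_min=30.0):
    """Return the names of the alerts that should fire for the given readings."""
    fired = []
    if backlog_in_hours(backlog_messages, normal_rate_per_s) > backlog_limit_h:
        fired.append("backlog")
    if consumer_lag_min > lag_limit_min:
        fired.append("consumer_lag")
    return fired
```

Expressing the backlog in hours of work rather than raw message count keeps the threshold meaningful as traffic grows. With hypothetical readings of 9,000,000 queued messages and a normal rate of 2,000 messages per second, the backlog is 1.25 hours of work and the backlog alert fires.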

Why Each Pattern Would Have Helped

Without Patterns (the actual outage)
  • 6h 27m total outage
  • 3h permanent data loss from Kafka retention
  • Manual diagnosis took 15 minutes after first page
  • Investigation reached the wrong root cause first (workers, not Geo-IP)
With Patterns (the avoided outage)
  • Outage limited to ~30 minutes (until breaker opens and humans investigate)
  • No data loss; DLQ holds events until Geo-IP recovers
  • Breaker-open metric points directly at Geo-IP
  • Investigation begins at the correct root cause
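The "DLQ holds events" path above could look roughly like the following sketch. The envelope fields, the `to_dlq` helper, and the `publish` callable (standing in for a producer writing to a DLQ topic) are all hypothetical; this section does not show the lesson's actual envelope format.

```python
import json
import time
from dataclasses import dataclass, asdict


@dataclass
class DlqEnvelope:
    payload: dict          # the original event, unmodified
    source_topic: str      # where the event was read from
    failure_category: str  # e.g. "dependency_unavailable"
    error: str             # repr of the exception that ended processing
    attempts: int          # how many tries were made before giving up
    failed_at: float       # Unix timestamp of the final failure


def to_dlq(event, error, attempts, publish,
           category="dependency_unavailable", source_topic="raw_clicks"):
    """Wrap a failed event in an envelope and hand it to `publish` as JSON."""
    envelope = DlqEnvelope(event, source_topic, category, repr(error),
                           attempts, time.time())
    publish(json.dumps(asdict(envelope)))
    return envelope
```

Because the envelope carries the untouched payload alongside the failure context, a replay tool can resubmit the event once Geo-IP recovers, and a dashboard can group arrivals by `failure_category`.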

The Action Items

    # Action items from the postmortem, prioritized by blast-radius reduction.
    action_items:
      - id: AI-1
        priority: P0
        description: Add retry budget around Geo-IP calls
        config: { attempts: 3, base_delay_s: 1, max_delay_s: 30, full_jitter: true }
        owner: data-platform
        due: 2 days
      - id: AI-2
        priority: P0
        description: Add circuit breaker around Geo-IP calls
        config: { window_s: 60, threshold: 0.5, cooldown_s: 120 }
        owner: data-platform
        due: 3 days
      - id: AI-3
        priority: P0
        description: Add DLQ topic for un-enrichable events; build replay tool
        owner: data-platform
        due: 1 week
      - id: AI-4
        priority: P1
        description: Alert on Kafka consumer lag > 30 minutes
        owner: observability
        due: 1 week
      - id: AI-5
        priority: P1
        description: Alert on Kafka topic backlog > 1 hour worth of work
        owner: observability
        due: 1 week
      - id: AI-6
        priority: P2
        description: Document the failure surface for every node in this pipeline
        owner: data-platform
        due: 2 weeks

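As an illustration of what the AI-1 config might drive, here is a minimal retry helper with full jitter. The function name `call_with_retry`, the choice of retryable exception types, and the injectable `sleep` and `rng` parameters are assumptions for the sketch; the defaults copy the `attempts`, `base_delay_s`, and `max_delay_s` values from the action item.

```python
import random
import time


def call_with_retry(fn, attempts=3, base_delay_s=1.0, max_delay_s=30.0,
                    retryable=(TimeoutError, ConnectionError),
                    sleep=time.sleep, rng=random.random):
    """Call fn(); on a retryable error, back off with full jitter and try again."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retryable:
            if attempt == attempts:
                raise  # budget exhausted: the caller decides what happens next
            # Full jitter: sleep a uniform random time in [0, capped exponential).
            cap = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            sleep(rng() * cap)
```

Only exception types listed in `retryable` are retried, so a non-transient error such as a schema violation fails on the first attempt. Bounding the attempts and capping the delay is what makes this a budget rather than an open-ended loop.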
The standing checklist for any new pipeline:
  • Every external call has a retry budget with bounded attempts, capped delay, and jitter
  • Every external call has a circuit breaker with a sliding-window threshold
  • Every queue is bounded; producers experience backpressure or explicit shedding
  • Every classified failure category routes to the DLQ with a full envelope
  • Every DLQ has alerts, a dashboard, and a replay tool
  • Every node has a written failure surface document
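The bounded-queue item in the checklist can be illustrated with Python's standard `queue` module. This is a sketch for an in-process queue only; the `offer` helper and its `on_shed` callback are hypothetical names, and a Kafka-backed pipeline would apply the same idea through its own producer and consumer controls.

```python
import queue


def offer(q, item, timeout_s=0.5, on_shed=None):
    """Try to enqueue; block briefly (backpressure), then shed explicitly."""
    try:
        q.put(item, timeout=timeout_s)  # blocks while the queue is full
        return True
    except queue.Full:
        if on_shed is not None:
            on_shed(item)               # e.g. count it, or route it elsewhere
        return False


# A queue without maxsize is unbounded; setting it is what creates backpressure.
work = queue.Queue(maxsize=1000)
```

The producer slows down for up to `timeout_s` when the consumer falls behind, and anything it cannot enqueue is shed through a visible code path instead of accumulating in memory.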
A six-hour outage is rarely the fault of one missing pattern; it is usually the fault of every pattern being absent.
Postmortems in aggregate teach the same lesson: ship the standard set of patterns up front.
The patterns reinforce each other. A pipeline missing one is fragile; a pipeline missing all of them is the next incident.
The postmortem above covers a single incident, but its action items apply to most pipelines that have not yet adopted the standard discipline. Reading postmortems across an organization makes the recurring gaps visible and converts the implicit knowledge of senior engineers into explicit standards. A senior data engineer can usually predict the next outage by reading the last six postmortems; outages repeat because the missing patterns are usually the same. Codifying the patterns into a checklist, and treating the checklist as a gating artifact during pipeline review, is what stops the cycle.
Do
  • Keep a running tally of postmortem root causes across the team; the recurring root causes become the standard checklist
  • Make the standard checklist part of pipeline review at the time of design
  • Build the alerting before the production traffic; an unobservable pipeline cannot be operated
Don't
  • Treat each outage as a unique event; outages cluster on missing patterns
  • Skip the postmortem when 'it was just a deploy'; the second-order causes are the lessons
  • Add patterns reactively after each outage; that is how the outage above happened in the first place
TIP
The shortest precise statement of failure-handling discipline: every external dependency gets a retry, a breaker, and a DLQ; every queue is bounded; every node has a written failure surface. A pipeline that satisfies that statement absorbs the failures that would otherwise become outages.
PUTTING IT ALL TOGETHER

> A platform data team at a public company runs 180 production pipelines with no shared failure-handling standard. Each team has invented its own retry, queue, and alerting patterns. The on-call rotation averages eight pages per week and fifteen percent of incidents involve cascading failures. Leadership asks for a one-quarter plan to bring the failure-handling discipline to a senior level across all pipelines.

Step one: enumerate the failure surface for the top twenty pipelines by traffic. Use the four-role separation from Lesson 1 to identify each node's read, process, and write surfaces. Document each surface in the YAML descriptor format.
Step two: standardize on a shared retry budget config and circuit breaker library across all nodes. The shared library encodes max attempts, max delay, max total elapsed, full jitter, and breaker thresholds. Every node calls the shared library; no node invents its own retry block. The consistency echoes the shared curated layer from Lesson 1 intermediate, applied to operational behavior instead of data shape.
Step three: ship a DLQ for every pipeline that calls an external service. The DLQ envelope is standardized; the replay tool is the same tool across all pipelines. The DLQ surfaces as a quality signal on the same dashboard as data-quality metrics. Idempotency from Lesson 5 is the prerequisite for safe replay.
Step four: audit every queue for a maxsize. Bound them; instrument backpressure metrics. The pipeline that runs at the speed of its slowest component is the pipeline that does not cascade. This depends on the orchestration scheduling from Lesson 4 understanding which DAGs share resources and how to throttle them.
Step five: build the postmortem aggregation pipeline. Every postmortem feeds a structured database; the recurring root causes become the next quarter's standard checklist additions. The discipline of treating failure handling as an architectural property rather than a per-pipeline afterthought is what prevents the same outages from recurring.
Bridge move: a senior engineer summarizing the discipline says, 'every external call gets a retry, a breaker, and a DLQ; every queue is bounded; every node has a written failure surface.' That sentence, when satisfied, absorbs the failures that would otherwise become outages.
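The aggregation in step five can start very small. The sketch below assumes each postmortem is stored as a record with a hypothetical `missing_patterns` list; the field name, the record shape, and the `recurring_gaps` helper are illustrative and not a schema taken from the lesson.

```python
from collections import Counter


def recurring_gaps(postmortems, min_count=2):
    """Return (gap, count) pairs seen in at least min_count postmortems, most common first."""
    tally = Counter()
    for pm in postmortems:
        # Count each gap once per incident, even if it is listed twice.
        tally.update(set(pm["missing_patterns"]))
    return [(gap, n) for gap, n in tally.most_common() if n >= min_count]


postmortems = [
    {"id": "PM-1", "missing_patterns": ["circuit_breaker", "dlq"]},
    {"id": "PM-2", "missing_patterns": ["circuit_breaker", "bounded_queue"]},
    {"id": "PM-3", "missing_patterns": ["circuit_breaker", "dlq", "dlq"]},
]
print(recurring_gaps(postmortems))  # [('circuit_breaker', 3), ('dlq', 2)]
```

Gaps that clear the `min_count` bar are the candidates for next quarter's checklist additions.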
KEY TAKEAWAYS
Failure classification is the design constraint, not an afterthought: every node's failure surface drives retry, breaker, DLQ, and runbook decisions before any code is written.
The DLQ is a quality signal, not a graveyard: arrival rate by exception type catches upstream regressions before consumers notice; the dashboard is as important as the queue.
DLQ tooling is a first-class deliverable: inspection, annotation, and bounded replay turn the DLQ from cost into recovery surface; idempotency is the prerequisite.
Cascading failures are queue-overflow failures: bounded queues with backpressure prevent slow downstreams from crashing upstream producers; load shedding is the alternative when slowing is impossible.
The standing checklist applies to every pipeline: every external call has retry, breaker, and DLQ; every queue is bounded; every node has a written failure surface. Postmortems in aggregate keep returning to these same gaps.

Failure handling is a design property; cascading failures kill systems that bolt it on

Category: Pipeline Architecture
Difficulty: advanced
Duration: 38 minutes
Challenges: 0 hands-on challenges

Topics covered: Failure Classification by Design, The DLQ as a Quality Signal, Reprocessing From the DLQ, Cascading Failures and Backpressure, Postmortem: Six-Hour Outage
