An online marketplace had a feature where placing an order updated the orders table in Postgres and published an order_placed event to Kafka in the same handler. The team thought this was the cleanest way to get downstream services notified of new orders. Three percent of the time, in production, the database write succeeded and the Kafka publish failed silently. Reconciliation jobs caught most of the drift overnight, but enough leaked through that the recommendation system showed users orders that did not exist and the warehouse pickers occasionally picked phantom orders. The bug was the architecture, not any individual code path. Writing to two systems atomically across a network is not a thing that exists. The fix was the transactional outbox pattern, and after the migration the failure rate fell to zero. The advanced tier of ingestion is about this kind of structural problem: the failure modes that emerge from how systems are connected, not from any one connector being buggy.
The Dual-Write Problem
Daily Life
Interviews
Recognize the dual-write antipattern, name its four failure modes, and explain why distributed transactions do not solve it.
Dual write is the most common production bug in event-driven architectures. An application writes to its database and to a message broker in the same handler, expecting both writes to succeed or both to fail. There is no transaction that spans the two systems. One will eventually succeed without the other. The bug is not in the code. The bug is in the architecture. The transactional guarantee the application thinks it has does not exist.
What a Dual Write Looks Like
1
# Simulating the dual-write bug. The broker fails 5% of the time.
2
importrandom
3
random.seed(42)
4
5
db_orders=[]
6
broker_msgs=[]
7
8
defdb_insert_order(order_id):
9
db_orders.append(order_id)
10
11
defbroker_publish(order_id):
12
ifrandom.random()<0.05:
13
raiseIOError("broker unavailable")
14
broker_msgs.append(order_id)
15
16
fororder_idinrange(1,101):
17
db_insert_order(order_id)
18
try:
19
broker_publish(order_id)
20
exceptIOError:
21
pass# the application swallows the error or retries futilely
22
23
ghost_orders=set(db_orders)-set(broker_msgs)
24
print(f"Orders in db: {len(db_orders)}")
25
print(f"Events in broker: {len(broker_msgs)}")
26
print(f"Ghost orders (no event): {len(ghost_orders)}")
>>>Output
Orders in db: 100
Events in broker: 95
Ghost orders (no event): 5
The function looks correct on first read. An order is inserted, an event is published, the order id is returned. Production traffic discovers what is missing. If the database insert succeeds and the Kafka publish fails (broker down, network partition, retry exhaustion), the database has an order with no event. Downstream services that subscribe to order_placed will never know the order exists. If the Kafka publish succeeds and the database commit fails (a write conflict, a constraint violation, a rollback), the broker has an event referring to an order id that does not exist.
The Four Failure Modes
Outcome
Database State
Broker State
What Goes Wrong
Both succeed
Order written
Event published
Happy path
DB succeeds, broker fails
Order written
No event
Order exists, downstream never notified, ghost orders
DB fails, broker succeeds
No order
Event published
Phantom event, downstream acts on a nonexistent order
Both fail
No order
No event
Acceptable; the request retries
Two of the four outcomes break invariants. The aggregate failure rate is small but nonzero, and it grows with traffic. At low scale the bug is invisible because the failure rate is low enough that nobody notices. At medium scale the bug shows up as an occasional reconciliation discrepancy and gets papered over with a nightly job. At high scale the bug becomes the dominant cause of customer-facing inconsistency.
Why dual writes keep getting written:
▸The two-write code looks transactional at a glance
▸Local testing rarely triggers the failure (broker is healthy on a developer laptop)
▸Staging environments do not generate enough load to expose the rate
▸The failure mode is silent: no exception, just an event that never publishes
▸The shortcut feels productive: 'one handler, two effects, ship it'
Why a Distributed Transaction Will Not Save This
A natural reaction is to wrap both writes in a distributed transaction (XA, two-phase commit). The reaction is wrong for two reasons. First, neither Postgres nor Kafka supports distributed transactions in any practical sense; XA support exists for some databases but not for the modern broker stack. Second, even when it exists, two-phase commit blocks on coordinator failure and slows every transaction to the speed of the slowest participant. The pattern is theoretically correct and operationally untenable, which is why it is rare in modern systems.
•Distributed Transaction (2PC)
Theoretically gives atomicity across two systems
Requires both systems to support XA semantics
Blocks on coordinator failure
Slows every transaction to the slowest participant
Effectively dead in modern stacks
✓Transactional Outbox
Reduces the problem to a single transactional write
Works on any database that supports transactions and any broker
Does not block on broker failure
Latency cost is one row insert per event
The standard fix in practice
What Actually Works
The fix shows up in section 1 in detail. The shape of the fix is: write to one durable system in one transaction. That system is the database. The database knows how to commit atomically across rows. Write the business row and an outbox row in the same transaction. A separate process reads the outbox and publishes to the broker. Failures localize to the publisher, where they are recoverable, instead of distributing across two systems where they are not.
The dual-write problem is not solved by better error handling, more retries, or careful ordering. It is solved by reducing two writes to one.
Dual write is silent in development and dominant in production at scale.
There is no distributed transaction available across most modern stacks.
The fix is to write to one system in one transaction, not to coordinate two.
TIP
Audit every event-driven handler in a service for the pattern: db.write followed by broker.publish in the same function. Each match is a dual write and a future incident.
The Transactional Outbox Pattern
Daily Life
Interviews
Apply the transactional outbox pattern, design the table, and choose between polling and CDC-driven publishers.
The transactional outbox is the canonical fix for the dual-write problem. The application writes only to the database. Inside the same transaction that writes the business row, it inserts a row into an outbox table that describes the event to be published. A separate publisher process reads from the outbox table and publishes events to the broker. The atomic boundary is the database transaction; the broker is downstream of that boundary. Either both rows commit or neither does, and the publisher's job becomes the strictly easier problem of draining a queue of pending events.
The table holds enough state to publish each event later. The aggregate_type and aggregate_id identify the entity the event concerns. The event_type names the event. The payload carries the body. published_at starts NULL and is updated by the publisher when the broker has confirmed receipt. The partial index makes the publisher's draining query fast: it scans only the unpublished rows.
The Atomic Write
1
# The fix to place_order. One transaction, two row inserts, one durable medium.
The function now performs one logical operation. Either the order and the outbox row commit together, or neither does. There is no intermediate state where the order exists without a pending event. The transaction's atomicity is the entire guarantee, and it costs one extra row insert per event.
The Publisher
The outbox row exists; the broker still needs to receive it. A small publisher process reads unpublished rows, publishes them to the broker, and marks them published. Two implementations dominate. The polling publisher runs a loop that selects unpublished rows in order, publishes them, and updates the row. The CDC-driven publisher uses Debezium to read the database WAL, treating outbox table inserts as a CDC stream, and publishes the resulting events to a Kafka topic. The CDC version is preferred at scale because it inherits Debezium's reliability and ordering.
1
# Simulating the polling outbox publisher draining a small batch.
# Drain unpublished rows in id order; publish-then-mark.
14
forrowin[rforrinoutboxifr["published_at"]isNone]:
15
publish(
16
topic=row["event_type"],
17
value=row["payload"],
18
headers={"outbox_id":str(row["id"])},
19
)
20
row["published_at"]="now"
21
22
print(f"Published {len(broker)} events")
23
print(f"Topics seen: {sorted({m['topic'] for m in broker})}")
24
print(f"All marked: {all(r['published_at'] for r in outbox)}")
>>>Output
Published 4 events
Topics seen: ['order_placed', 'order_shipped']
All marked: True
The publisher is a simple loop. The order of operations is publish-then-mark, not mark-then-publish. A crash between publish and mark causes the next loop iteration to republish the same event. Downstream consumers see at-least-once delivery and deduplicate using the outbox_id (a stable monotonic key). The pattern composes cleanly with the dedupe-on-write approach from the intermediate tier.
Why CDC-Driven Publishers Win at Scale
Aspect
Polling Publisher
Debezium-Driven Publisher
Source-side cost
Index scan every loop iteration
Read WAL once, no extra scan
Latency
Floor at the polling interval
Sub-second tail
Ordering guarantee
Order by outbox.id; close to commit order
Exact commit order from the WAL
Operational components
One small worker per service
Kafka Connect + Debezium + schema registry
Right at scale
Up to medium throughput; hundreds of events/sec
Thousands to millions of events/sec
What the outbox pattern guarantees and what it does not:
▸Guarantees: every committed business write produces at least one published event
▸Guarantees: no published event refers to an uncommitted business state
▸Does not guarantee: exactly-once publishing (the publisher can crash mid-mark and republish)
▸Does not guarantee: ordering across aggregate types (consumers should not assume cross-stream ordering)
▸Does not guarantee: low latency without a CDC-driven publisher
✓Do
Insert the outbox row inside the same transaction as the business write
Index the unpublished slice of the outbox for fast draining
Publish-then-mark, not mark-then-publish
Run the publisher as a separate process, not as a thread inside the application
Use Debezium to drive publishing once volume justifies it
✗Don't
Trust the application to retry the broker write reliably without an outbox
Skip the cleanup job; the outbox grows without bound otherwise
Mix outbox draining with business request handling in the same process
Assume distributed transactions will solve this; they will not
The outbox reduces two writes to one transactional write.
The publisher's job is the easy half: drain a queue of pending events.
Outbox cleanup is operational policy and must be assigned an owner.
TIP
When introducing the outbox to an existing service, instrument both the old dual-write path and the new outbox path side by side for one week. Compare event counts. The drift is the silent loss the dual write was producing all along.
Backpressure at the Ingestion Layer
Daily Life
Interviews
Design backpressure handling for an ingestion layer using buffering, producer signaling, or selective dropping, matched to source loss tolerance.
Backpressure is what happens when a source produces faster than the pipeline consumes. The mismatch is normal in steady state; it is the spike that exposes the design. Without explicit backpressure handling, the slowest component in the chain becomes a buffer, fills, and then either drops events, crashes, or amplifies the spike upstream. Designing for backpressure is the difference between a pipeline that absorbs a 10x burst and a pipeline that becomes the incident.
Where the Mismatch Shows Up
Source Rate
Pipeline Rate
What Happens Without Backpressure
1k events/sec
10k events/sec
Pipeline idles; no problem
5k events/sec
5k events/sec
Steady state; pipeline matched
10k events/sec spike
5k events/sec capacity
Backlog grows; if buffer is bounded, events drop or stall
20k events/sec sustained
5k events/sec capacity
Backlog grows without bound; system fails
Three Strategies for Backpressure
Slow the producerBuffer the spikeDrop selectively
Slow the producer
Rate-limit upstream
Return 429 with Retry-After. Reactive Streams request signals. Forces the producer to absorb the slowdown.
Buffer the spike
Durable queue absorbs the burst
Kafka, Kinesis, SQS. Consumer drains at steady rate; latency grows during spike, recovers after.
Drop selectively
Sample or shed events
Acceptable for telemetry and analytics. Wrong for transactional events. Loss tolerance is the source-class question.
Three strategies cover almost every backpressure case. Slow the producer down (rate-limit at the source, refuse new requests). Buffer the spike (a durable queue absorbs the burst, the consumer drains at its own rate). Drop selectively (sample, age out, or shed lower-priority events). The right answer depends on whether the spike can be tolerated as latency, what the producer's contract allows, and what the data is worth.
✓Buffer the Spike
Durable queue (Kafka, Kinesis, SQS) absorbs the burst
Consumer drains at its own rate; latency increases for the duration
Correct when the data must not be lost
Cost: queue storage, latency tail during bursts
•Drop the Excess
Rate-limit, sample, or age out events
Throughput is preserved; data is partially lost
Correct when the data is per-event optional (analytics, telemetry)
Cost: gaps in the recorded stream
The Queue as Shock Absorber
The most common pattern is the durable queue between source and consumer. Kafka, Kinesis, SQS, and Pub/Sub all serve this role. The producer writes at its rate; the consumer reads at its rate; the queue absorbs the difference. The pattern works because storage at rest is cheaper than compute at peak. A 30-minute Kafka retention buffer can absorb a 10-minute 10x spike with the consumer running at steady-state capacity. The queue's depth is the system's insurance policy.
The producer spikes; the consumer stays pinned at its capacity; the queue depth grows during the spike and drains after. The latency observed by downstream consumers grows during the spike and recovers as the queue drains. The arrangement is correct as long as the queue depth has headroom and the spike has finite duration. When the consumer cannot keep up and the queue cannot grow, the only correct answer is to slow the producer through some signal. HTTP servers return 429 Too Many Requests. Reactive Streams libraries propagate request(n) signals upstream. Kafka producers can be configured with bounded buffer.memory so publishes faster than the broker can accept produce visible backpressure.
Backpressure signal patterns:
▸HTTP 429 Too Many Requests with Retry-After header (synchronous APIs)
▸Reactive Streams request(n) demand signal (in-process pipelines)
▸Custom token bucket between producer and consumer (when neither library helps)
▸Circuit breaker that returns failure when downstream is unhealthy (last resort)
Selective Dropping: When Some Loss Is Correct
Some sources produce events whose individual loss is acceptable. Telemetry from a fleet of mobile devices, click events for casual analytics, page view counts. For these sources, selective dropping is the right answer when the system is overloaded. The tradeoff is explicit: throughput is preserved, latency stays bounded, and a known fraction of events are lost. The pattern is wrong for transactional events (orders, payments, ledger entries) where individual loss creates user-visible incorrectness.
Source Class
Loss Tolerance
Right Backpressure Strategy
Transactional (orders, payments)
Zero
Buffer and slow producer; never drop
Analytical (clickstream, page views)
Sample-rate-aware
Buffer with cap; sample on overflow
Telemetry (device pings, metrics)
High
Drop oldest first; preserve recent
Audit (security events)
Zero, regulatory
Buffer to disk; never drop, never sample
Without backpressure, the slowest component becomes the failure point and amplifies the spike.
A durable queue is the most common shock absorber; consumer rate stays pinned, queue depth flexes.
Loss tolerance of the source class determines whether dropping is acceptable.
TIP
Design the backpressure path before the pipeline is busy. Adding backpressure during an incident is the wrong time to discover that the producer cannot be slowed and the queue is unbounded.
Multi-Tenant Ingestion Fairness
Daily Life
Interviews
Design multi-tenant ingestion with isolation, fair scheduling, per-tenant observability, and structural tenant-id propagation.
A platform that ingests data on behalf of many producers is multi-tenant. The producers might be customer accounts (Segment ingesting events for thousands of companies), internal teams (an internal data platform serving every product team), or external partners (a marketplace ingesting from every seller). The defining property is that one ingestion infrastructure serves N producers, and the producers do not coordinate with each other. Multi-tenant ingestion has problems single-tenant ingestion does not have, and most of them reduce to one word: noisy.
Separate cluster per high-volume or regulated tenant. Most expensive; right answer for whales and compliance.
Three layers of isolation cover most cases. Logical isolation separates tenants by id within shared infrastructure (one Kafka topic, partitioned by tenant id; one warehouse table, filtered by tenant id). Resource isolation gives each tenant or each tenant tier dedicated resources (a worker pool per tier, a queue per tenant). Physical isolation deploys separate infrastructure per tenant (a dedicated Kafka cluster for the largest customers). The right level depends on the SLO offered to each tenant and the cost the platform can absorb.
•Logical Isolation
Tenant id stamped on every record; shared infrastructure
Cheap; one cluster, one set of pipelines
Noisy neighbors still affect each other through shared queue
Fits when the platform offers a single tier and bursts are bounded
•Resource Isolation
Each tenant tier gets a dedicated queue, worker pool, or cluster
More expensive; bounded blast radius
Noisy neighbors only affect their own tier
Fits when the platform offers tiered SLOs (free, pro, enterprise)
Fair Scheduling and Per-Tenant Quotas
When tenants share a queue, fair scheduling is the discipline of preventing one tenant from monopolizing throughput. The simplest implementation is round-robin consumption across tenant-keyed sub-queues. The most sophisticated is weighted fair queuing where each tenant's weight reflects their tier or their committed throughput. Per-tenant quotas (events per minute, payload bytes per minute, downstream queries per minute) are the enforcement mechanism. A tenant exceeding their quota gets shed (429 to the producer), buffered to a separate slow-lane queue, or paused entirely. The choice depends on the contract with the tenant.
1
# Per-tenant token bucket simulation. Tenant A bursts; tenant B stays in budget.
2
classTokenBucket:
3
def__init__(self,capacity):
4
self.capacity=capacity
5
self.tokens=capacity
6
deftake(self,n=1):
7
ifself.tokens>=n:
8
self.tokens-=n
9
returnTrue
10
returnFalse
11
12
buckets={"A":TokenBucket(5),"B":TokenBucket(5)}
13
results={"A":[],"B":[]}
14
15
fortenantin["A"]*8+["B"]*4:
16
code=202ifbuckets[tenant].take()else429
17
results[tenant].append(code)
18
19
print("Tenant A responses:",results["A"])
20
print("Tenant B responses:",results["B"])
21
print("Tenant A 429 count:",results["A"].count(429))
22
print("Tenant B 429 count:",results["B"].count(429))
The token bucket gives each tenant a budget of events per second and a maximum burst size. A tenant within their budget enqueues normally. A tenant exceeding their budget receives 429 and is signaled to slow down. The bucket sits at the ingestion boundary, before any shared resource is touched. This single component prevents most noisy-neighbor incidents that originate from runaway producers.
Per-tenant signals every multi-tenant platform needs:
▸Per-tenant error rate, with tenant id in every log line
▸Per-tenant cost attribution (which tenants drive the warehouse bill)
▸Per-tenant quota utilization (how close each tenant is to their cap)
Without per-tenant signals, every multi-tenant incident requires forensic work to determine which tenant is responsible. With them, the dashboard shows the answer in seconds. The investment is small and the operational return is dramatic. The most severe multi-tenant failure mode is data leakage: tenant A's data appears in tenant B's output. The cause is almost always a missing tenant id in a join key, a transformation that aggregates without partitioning by tenant, or a cache keyed without tenant id. The fix is structural: every record carries tenant id, every join uses tenant id, every aggregation partitions by tenant id. The discipline is non-negotiable because the failure mode is regulatory.
✓Do
Stamp tenant id on every record at the ingestion boundary
Enforce per-tenant quotas before any shared resource is touched
Use weighted fair queuing or per-tier worker pools when SLOs differ
Tag every metric, log line, and trace with tenant id
Audit every transform for tenant-id presence in join and group-by keys
✗Don't
Trust producers to self-limit; assume some will burst
Share connection pools across tenants without per-tenant caps
Build cross-tenant aggregations in the ingestion layer; defer to consumer-aware transforms
Run the largest tenants on the same physical resources as the smallest
When to Move to Physical Isolation
Trigger
Right Move
Cost
One tenant exceeds 30% of platform throughput
Physical isolation for that tenant
Dedicated cluster operations cost
Tier with regulatory data (HIPAA, PCI)
Physical isolation per tier
Compliance overhead doubled per tier
Customer demands a dedicated SLA
Physical isolation as part of contract
Sales-driven; cost embedded in contract
Region locality (data residency laws)
Physical isolation per region
Multi-region deployment overhead
Noisy neighbors are the dominant operational concern in multi-tenant ingestion.
Per-tenant quotas, fair scheduling, and per-tenant observability prevent most incidents.
Tenant id must propagate through every layer; structural absence is a correctness incident.
TIP
When designing a new multi-tenant ingestion platform, write the per-tenant quota and observability story before the first byte flows. Retrofitting both is harder than building either greenfield.
Dual-Write to Outbox + CDC
Daily Life
Interviews
Lead the migration of a brittle dual-write integration to outbox plus CDC end to end, including the operational tradeoffs.
The synthesis exercise is a real-shaped migration. A logistics company runs an order service backed by Postgres. The service publishes order_placed, order_shipped, and order_delivered events to a downstream warehouse, a recommendation system, and a customer notification service. Three years ago the team wired this up with the dual write inside each handler. The reconciliation job catches roughly 0.4% of events that drift, but enough leak through to drive customer complaints. The new principal engineer is asked to replace the dual write end to end without an outage.
The Symptoms in Production
Operational evidence the dual-write architecture is the bug:
▸0.4% of orders never produce a downstream event (silent loss, caught only by nightly recon)
▸0.05% of events refer to orders that do not exist (phantom events from rolled-back transactions)
▸Customer notifications occasionally trigger for orders that never shipped
▸Recommendation system shows ghost orders to users
The order service writes to one durable medium. The outbox table holds a row per event the service intends to publish. Debezium reads the WAL and surfaces inserts on the outbox table as Kafka events, routed by event_type to per-event topics. The three downstream consumers subscribe to the topics they care about. The atomic boundary lives entirely inside the database transaction; the broker is downstream, where retries and at-least-once delivery are tolerable because consumers deduplicate on the outbox id.
The Four-Step Migration
Step 1 is purely additive: create the outbox table, modify the service to insert an outbox row inside the existing transaction, alongside the existing dual-write call to Kafka. The outbox is shadow infrastructure; the broker remains the production path. Outbox row counts and broker published counts should match within reconciliation tolerance. Any drift exposes the silent loss the dual write has been producing.
1
# Migration phase 1: dual path. The outbox is shadow; the broker is still production.
Step 2 stands up a Debezium connector pointed at the outbox table, with the outbox event router extracting event_type and routing each record to the right Kafka topic. The Debezium-driven topic is the new production path; the legacy direct-publish topic continues to flow in parallel. Both should agree event-for-event; residual drift is the silent loss the dual write produced. Step 3 migrates the warehouse, recommender, and notifier consumers one at a time from the legacy topic to the new topic, with shadow-then-cutover per consumer. Migrate the consumer with the lowest customer-facing impact first. Step 4 removes the direct kafka.publish from the order service after all consumers are stable on the new path and drift has been zero for two to four weeks. The dual write is gone.
Customer-facing inconsistency from ghost and phantom orders
On-call burns weekly on recon discrepancies
✓After the Migration
Single transactional write; outbox row drives publication
Zero silent loss; zero phantom events
No reconciliation job needed
Consistency invariant maintained by architecture, not by recon
On-call attention freed for real incidents
What This Migration Borrows From Earlier Lessons
Concept
From Lesson
How It Shows Up Here
Four roles (source, transform, storage, consumer)
Lesson 1
The order service is the source; the outbox is interstitial storage; Debezium is the transform; consumers are downstream
Idempotency at the sink
Lesson 5
Consumers dedupe on outbox_id, surviving Debezium retries
Schema evolution policy
Lesson 8
Outbox event payloads version forward-compatibly; breaking changes go through expand-contract
CDC mechanics (intermediate tier of this lesson)
Lesson 9 intermediate
Debezium reads the WAL; the outbox is the canonical CDC source for events
Backpressure (this lesson, section 2)
Lesson 9 advanced
If consumers fall behind, Kafka retention absorbs; producers do not block
The migration playbook in five moves:
▸1. Add the outbox table; modify the service to dual-write to outbox alongside the existing broker call
▸2. Bring up Debezium against the outbox; the new topic flows in parallel with the legacy one
▸3. Migrate downstream consumers one at a time from legacy topic to new topic, with shadow-then-cutover
▸4. Remove the direct kafka.publish from the order service after consumers are stable
▸5. Remove the reconciliation job; correctness now lives in the architecture, not in batch recovery
✓Do
Migrate additively, never destructively; the legacy path stays until the new one is proven
Run shadow comparison between paths for at least two weeks before cutover
Migrate consumers in order of customer-facing blast radius, lowest first
Treat the outbox as a tier-1 table with monitoring, alerts, and a cleanup owner
Remove the reconciliation job last, as the proof that the architecture is correct
✗Don't
Cut over all consumers in one step; the rollback cost is too high
Skip the schema registry; future breaking changes will require it anyway
Trust the existing recon job to remain as a safety net; the goal is to retire it
Underestimate Kafka and Debezium operational cost on the first deployment
Five additive steps replace dual write with outbox plus CDC, with no outage.
The migration borrows from idempotency, schema evolution, and CDC: every prior lesson shows up.
The new operational cost is real; the math works because the prior silent-loss cost was higher.
TIP
The shorthand for this entire pattern is: write to one system, read from its log, publish to consumers. The migration is not free; Kafka Connect, Debezium, and the schema registry are real operational cost. The math works because the prior cost (silent loss, customer complaints, recon work) was higher than the new cost, and that calculation is the actual decision.
❯❯❯PUTTING IT ALL TOGETHER
> A B2B SaaS company at 1500 engineers operates a multi-tenant ingestion platform that accepts events from 12000 customer accounts. Three of those customers each generate 5 to 8 percent of total platform throughput on their own. The platform also runs an internal order service whose dual-write to Kafka has been the cause of three production incidents this quarter. The new VP of Data is asked to deliver a single architectural plan that addresses the dual write, the noisy neighbors, the absence of a backpressure story, and a coming product launch that will triple ingestion volume.
The dual write goes first. The order service migrates to outbox plus CDC over four weeks using the additive playbook. The migration produces a side benefit: the same Kafka topics the new architecture creates become the canonical event feed for the analytics warehouse, the recommender, and the notifier, removing three independent polling pipelines. (Section 1, builds on Lesson 5 idempotency.)
The three noisy customers move to physical isolation. Each gets a dedicated Kafka cluster sized for their volume plus margin. The remaining 11997 customers stay on shared infrastructure with weighted fair queuing and per-tenant token-bucket quotas. Per-tenant lag and cost dashboards land in week two; without them, every multi-tenant incident becomes forensics. (Section 3.)
The backpressure story is built before the product launch, not after. Producer-side rate limits with 429 plus Retry-After. Kafka retention tuned to absorb a 4-hour 3x spike at consumer steady-state capacity. Selective dropping is allowed for telemetry events but not for transactional events, with the policy documented per source class. (Section 2.)
Idempotent consumption is the foundation everything else stands on. Every event carries a stable id (outbox_id for internal events, vendor event id for SaaS, content hash for file drops). Sinks enforce uniqueness on the id. Connector restarts and retries are absorbed at the sink. (Builds on Lesson 5; intermediate tier section 3 of this lesson.)
Schema evolution policy is published in the same week. Additive changes flow; destructive changes go through expand-contract. The schema registry becomes a tier-1 component with its own SLO. Without the policy in place, the first breaking change after the launch becomes a multi-tenant incident. (Builds on Lesson 8.)
The four roles from Lesson 1 tie everything together. The order service and the customer producers are sources. The outbox tables and Kafka are interstitial storage. The Debezium connectors and stream processors are transforms. The warehouse, recommender, notifier, and per-customer destinations are consumers. The plan is one diagram with the four roles visible at every layer. (Builds on Lesson 1.)
KEY TAKEAWAYS
Dual write is structurally broken: two writes across two systems cannot be atomic. The fix is one transactional write, not better error handling.
The transactional outbox reduces two writes to one: the database transaction is the atomic boundary; a publisher (polling or CDC-driven) drains the outbox to the broker.
Backpressure must be designed before the spike: buffer the burst, signal the producer, or drop selectively, matched to the source's loss tolerance.
Multi-tenant ingestion is dominated by noisy neighbors: per-tenant quotas, fair scheduling, per-tenant observability, and physical isolation for the largest tenants.
Migrations are additive: shadow the new path alongside the legacy path, migrate consumers one at a time, retire the legacy path last.
Topics covered: The Dual-Write Problem, The Transactional Outbox Pattern, Backpressure at the Ingestion Layer, Multi-Tenant Ingestion Fairness, Dual-Write to Outbox + CDC
Dual write is the most common production bug in event-driven architectures. An application writes to its database and to a message broker in the same handler, expecting both writes to succeed or both to fail. There is no transaction that spans the two systems. One will eventually succeed without the other. The bug is not in the code. The bug is in the architecture. The transactional guarantee the application thinks it has does not exist. What a Dual Write Looks Like The function looks correct on
The transactional outbox is the canonical fix for the dual-write problem. The application writes only to the database. Inside the same transaction that writes the business row, it inserts a row into an outbox table that describes the event to be published. A separate publisher process reads from the outbox table and publishes events to the broker. The atomic boundary is the database transaction; the broker is downstream of that boundary. Either both rows commit or neither does, and the publisher
Backpressure is what happens when a source produces faster than the pipeline consumes. The mismatch is normal in steady state; it is the spike that exposes the design. Without explicit backpressure handling, the slowest component in the chain becomes a buffer, fills, and then either drops events, crashes, or amplifies the spike upstream. Designing for backpressure is the difference between a pipeline that absorbs a 10x burst and a pipeline that becomes the incident. Where the Mismatch Shows Up T
A platform that ingests data on behalf of many producers is multi-tenant. The producers might be customer accounts (Segment ingesting events for thousands of companies), internal teams (an internal data platform serving every product team), or external partners (a marketplace ingesting from every seller). The defining property is that one ingestion infrastructure serves N producers, and the producers do not coordinate with each other. Multi-tenant ingestion has problems single-tenant ingestion d
The synthesis exercise is a real-shaped migration. A logistics company runs an order service backed by Postgres. The service publishes order_placed, order_shipped, and order_delivered events to a downstream warehouse, a recommendation system, and a customer notification service. Three years ago the team wired this up with the dual write inside each handler. The reconciliation job catches roughly 0.4% of events that drift, but enough leak through to drive customer complaints. The new principal en