# Four Teams, One Topic, No Agreement

> Everybody is writing to it. Nobody documented it. Now production is fragile.

Canonical URL: <https://datadriven.io/problems/four_teams_one_topic_no_agreement>

Domain: Pipeline Design · Difficulty: hard · Seniority: L5

## Problem

We run a multi-tenant platform where multiple internal teams publish events to shared topics. Different teams have added new fields to the event schema independently over time, and the stream processing jobs that consume these topics are now fragile and break on unexpected schema changes. Design a schema-governed streaming pipeline.

## Worked solution and explanation

### Why this problem exists in real interviews

Four teams publishing to the same topic without coordination is a guaranteed schema-drift problem. The trap is treating it as a discipline problem (more code reviews, a wiki page) when it's a contract problem. Producers and consumers will deploy independently; the bus has to be where the contract is enforced, not somebody's standup.

The simple answer is one Kafka topic, three consumer jobs, and a 'best practices' wiki page about schema discipline. A producer team adds a required field on Tuesday; one consumer crashes Tuesday night, and the on-call engineer rolls back the producer's change and writes a postmortem. Two weeks later a different producer team renames a field; a different consumer crashes; another postmortem follows. Each outage lasts until somebody intervenes, because the bad event keeps getting redelivered and the consumer keeps failing on it.

> **Trick to Solving**
>
> Schema contract enforced at publish; per-consumer offsets so failures stay local; bad events to a quarantine that doesn't slow anyone down.
> 
> 1. The contract belongs at the bus, not in standups. A producer publishing through the registry can't ship an incompatible change.
> 2. Failure isolation lives in offsets. Each consumer maintains its own progress, so one team's outage doesn't park the other teams behind it.
> 3. The dashboard's hot path can't be where bad events get retried. Validation routes failures to a quarantine and the good events keep moving.

---

### Walk the requirements

#### Step 1: Schema contract enforced at publish, not discovered downstream

The bus's schema-contract layer (a schema registry) holds one contract per topic. Producers publish through the registry; an incompatible change (a removed required field, a new required field, a type change, a field renamed without an alias) is rejected at publish time. The producing team sees the error immediately and either evolves the contract in a backward-compatible way or reverts. Relying on 'we'll catch breaking changes in code review' is the version where four producers ship and three consumers crash at midnight. The contract lives in the registry, not in standups.
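
As a rough illustration of the publish-time check, here is a minimal in-memory sketch. `FieldSpec`, `SchemaRegistry`, `SchemaRejected`, and `find_incompatibilities` are hypothetical names invented for this example, not the API of any real registry, and the compatibility rules are a simplified reading of the ones listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    type: str
    required: bool = False
    aliases: tuple = ()  # earlier names this field was published under


class SchemaRejected(Exception):
    """Raised when a proposed schema would break the topic's current contract."""


def find_incompatibilities(current, proposed):
    """Return the reasons `proposed` breaks `current` (empty list if compatible)."""
    # Map each name a proposed field answers to (its own name or an alias).
    answers_to = {}
    for name, spec in proposed.items():
        for known_name in (name, *spec.aliases):
            answers_to[known_name] = name

    problems = []
    for name, old in current.items():
        new_name = answers_to.get(name)
        if new_name is None:
            if old.required:
                problems.append(f"required field '{name}' removed or renamed without an alias")
        elif proposed[new_name].type != old.type:
            problems.append(
                f"field '{name}' changed type from {old.type} to {proposed[new_name].type}"
            )

    # Assumption: a brand-new field has to be optional so old events still validate.
    for name, spec in proposed.items():
        carried_over = any(n in current for n in (name, *spec.aliases))
        if spec.required and not carried_over:
            problems.append(f"new field '{name}' must be optional")
    return problems


class SchemaRegistry:
    """In-memory stand-in for the bus's schema-contract layer."""

    def __init__(self):
        self._contracts = {}  # topic -> {field name: FieldSpec}

    def register(self, topic, proposed):
        current = self._contracts.get(topic)
        if current is not None:
            problems = find_incompatibilities(current, proposed)
            if problems:
                raise SchemaRejected("; ".join(problems))
        self._contracts[topic] = proposed
```

In this sketch a producer that renames `user_id` to `uid` is rejected unless the new field lists `user_id` in `aliases`, which is the point where the producing team sees the error instead of a consumer's on-call.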

#### Step 2: Per-consumer isolation so one failure doesn't block the others

Each consumer is its own consumer group with its own offsets. When a consumer fails on an unexpected event, the bus keeps the messages for the length of its retention window; the other consumers don't notice. Once the failing consumer is fixed, it resumes from its last committed offset, provided the fix lands before those messages age out of retention. A shared consumer group across teams is the version where one team's restart blocks every other team. Per-group isolation is the property that contains the failure.
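
The isolation property can be shown with a toy model of a log with per-group offsets. This is a sketch under simplifying assumptions (one partition, no retention limit, everything in memory); `Topic` and `drain` are hypothetical names for this example, not a real client API.

```python
class Topic:
    """Append-only log with an independent read position per consumer group."""

    def __init__(self):
        self._log = []
        self._offsets = {}  # group name -> next offset to read

    def publish(self, event):
        self._log.append(event)

    def poll(self, group):
        """Return (offset, event) at the group's position, or None if caught up."""
        position = self._offsets.get(group, 0)
        if position >= len(self._log):
            return None
        return position, self._log[position]

    def commit(self, group, offset):
        self._offsets[group] = offset + 1

    def lag(self, group):
        return len(self._log) - self._offsets.get(group, 0)


def drain(topic, group, handler):
    """Process events until caught up or until the handler raises.

    The offset is committed only after the handler succeeds, so a failing
    group stays parked on the bad event while other groups are unaffected.
    """
    processed = 0
    while (record := topic.poll(group)) is not None:
        offset, event = record
        try:
            handler(event)
        except Exception:
            break  # not committed; this group resumes here once it is fixed
        topic.commit(group, offset)
        processed += 1
    return processed
```

If one group's handler raises on the second of three events, that group's lag stays at two while a second group with a working handler drains all three; after the first group is fixed, calling `drain` again picks up from the uncommitted offset.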

#### Step 3: Quarantine for failed validation; the dashboard keeps its latency

Events that fail validation (in the consumer's adapter or in a stream-side check) are routed to a quarantine in cold storage, along with the rejection reason. The good events keep flowing through the consumer's hot path, so the dashboard team's latency budget isn't burned on retrying bad events. A separate triage consumer reads the quarantine on its own schedule; once the upstream issue is fixed, the quarantined events are replayed. Letting the bad event halt the consumer is the version where the dashboard's contract gets violated by a single malformed message.
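
A minimal sketch of that routing, assuming JSON-encoded messages and a made-up three-field contract. `REQUIRED_FIELDS`, `rejection_reason`, and `route` are hypothetical names, and the `quarantine` list stands in for whatever cold storage the platform uses.

```python
import json
import time

# Hypothetical contract for this example: field name -> accepted Python type(s).
REQUIRED_FIELDS = {"event_id": str, "tenant": str, "amount": (int, float)}


def rejection_reason(event):
    """Return why the event fails validation, or None if it passes."""
    if not isinstance(event, dict):
        return "payload is not a JSON object"
    for name, expected in REQUIRED_FIELDS.items():
        if name not in event:
            return f"missing field '{name}'"
        if not isinstance(event[name], expected):
            return f"field '{name}' has type {type(event[name]).__name__}"
    return None


def route(raw_messages, handle, quarantine):
    """Send valid events to `handle`; append invalid ones to `quarantine`.

    A bad message is never retried on the hot path: it is recorded with
    its rejection reason and the loop moves on to the next message.
    """
    for raw in raw_messages:
        try:
            event = json.loads(raw)
            reason = rejection_reason(event)
        except json.JSONDecodeError as exc:
            reason = f"invalid JSON: {exc.msg}"
        if reason is None:
            handle(event)
        else:
            quarantine.append(
                {"raw": raw, "reason": reason, "quarantined_at": time.time()}
            )
```

Keeping the raw message next to the reason is a deliberate choice in this sketch: the triage consumer can replay the original bytes once the upstream issue is fixed, rather than a partially parsed version.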

---

### The shape that fits

> **What this design gives up**
>
> A schema-contract layer adds a publish-time check producers have to integrate with. Per-consumer groups mean per-consumer monitoring and per-consumer alerts. A quarantine adds a triage workflow somebody has to actually run. 'Just put it on Kafka' is the simpler design; in return for the additional pieces, the platform contains schema breakage at the boundary, isolates consumer failures from each other, and protects the dashboard's latency from a malformed message.

> **What reviewers check**
>
> A reviewer looks at the canvas for these properties:
> - An event bus sits between producers and per-consumer paths, with a schema contract enforced before publish.
> - Validation failures route to a quarantine in cold storage so the dashboard's hot path keeps moving.

> **The mistake that ships**
>
> The team's first cut uses one Kafka topic, ad-hoc schema discipline through code review, and a single shared consumer group. A producer team adds a required field; one consumer crashes that night and a postmortem follows. A different producer renames a field; a different consumer crashes. Each crash takes the consumer offline until somebody manually intervenes, and the bad events keep redelivering, which keeps blocking the consumer. The team rebuilds with a registry, per-group consumers, and a quarantine. The team-by-team workarounds outlast the rebuild and have to be unwound one at a time.

---

## Common follow-up questions

- A producer ships a 'compatible' schema change that adds an optional field; one consumer's deserializer breaks anyway. What does the design require from consumers, and where should the fix go? _(Tests whether the candidate sees the contract as two-sided: the registry's compatibility rule defines what 'compatible' means, and consumers' deserializers have to be aligned with that rule (handling unknown fields gracefully). The fix is in the consumer's deserializer if it fails to handle the registry's contract.)_
- The quarantine is filling up because one upstream system has been emitting malformed events for hours. What does the dashboard see, what does the analytics team see, and what does on-call do? _(Tests whether the candidate has thought about quarantine ops: the dashboard's hot path is unaffected (good events flow through), the analytics warehouse is missing the bad events (which appear in the quarantine for triage), and on-call engages the upstream system's owner from the quarantine alert.)_
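
For the first follow-up, the consumer-side half of the contract can be sketched as a deserializer that ignores fields it does not know about. `OrderPlaced` and `from_payload` are hypothetical names chosen for this example; the assumption is that the registry's compatibility rule permits producers to add optional fields.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class OrderPlaced:
    order_id: str
    amount: float
    coupon: Optional[str] = None  # optional: older events omit it

    @classmethod
    def from_payload(cls, payload: dict) -> "OrderPlaced":
        """Build the event from a decoded payload, dropping unknown fields.

        Passing the payload straight to the constructor would raise
        TypeError as soon as a producer adds a field this consumer has
        not heard of, even though the change is compatible.
        """
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in payload.items() if k in known})
```

A payload missing a required field such as `order_id` still raises `TypeError` here, which is the case that should be routed to the quarantine rather than tolerated.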

---

Source: DataDriven (https://datadriven.io).