# Event System for Multiple Consumers

> One event, many hungry consumers.

Canonical URL: <https://datadriven.io/problems/event_system_for_multiple_consumers>

Domain: Pipeline Design · Difficulty: hard · Seniority: L6

## Problem

We're building a centralized event platform that multiple engineering teams will consume from. Design a data processing pipeline and ingestion layer for an event system to be shared across multiple applications and servers.

## Worked solution and explanation

### Why this problem exists in real interviews

A multi-tenant event platform succeeds or fails on isolation. Six teams reading the same events at six different speeds, deployed independently, with a producer change that can silently break a consumer; the platform either contains those failure modes or amplifies them. The trap is a single shared topic with one shared consumer group: it looks simple and is the thing that lets one team's outage cascade.

The simple answer is one Kafka topic, six consumer groups, and a 'best practices' wiki page about schema discipline. A team's slow consumer falls behind and lag piles up on the shared topic; another team's deploy publishes a backward-incompatible schema change; the fraud team's consumer dies and nobody's monitoring it. Six teams are now blaming each other in a shared Slack channel and the platform team is on call for everyone's bugs.

> **Trick to Solving**
>
> Per-team consumer isolation, schema contract enforced at publish, deletion propagated and confirmed; the platform is what protects each team from every other team.
> 
> 1. The bus has explicit partitioning and consumer groups per team. Each team's offset is its own; no team's lag affects another team's progress.
> 2. Schema contracts are enforced at publish time by the bus's schema-contract layer: a producer that publishes an incompatible change is rejected, not discovered downstream.
> 3. Two freshness tiers off the same bus: a streaming path for fast consumers (fraud, ops) and a batch path for slow consumers (billing, BI).
> 4. Deletion is an event on the bus that fans out the same way the data did. Each consumer applies the deletion to its store and writes a confirmation; the orchestrator collects them.

---

### Walk the requirements

#### Step 1: One bus, six consumer groups, six freshness budgets

All six teams consume from the same bus, but each team has its own consumer group with its own offsets and its own throughput. Fast consumers run streaming consumers (Flink, Kafka Streams) for sub-minute freshness; slower consumers run batch loaders that catch up on their own cadence into a warehouse. One source, six paths, no team forced onto another team's clock. Without the bus there's no fan-out point; without two freshness tiers either fast teams pay batch latency or slow teams pay streaming compute.

#### Step 2: Per-team isolation so an outage doesn't spread

Each team's consumer group is independent. When a team's consumer is down for hours, its offsets sit where they were; the bus retains messages within the retention window; the other five teams don't notice. When the team comes back, they replay from where they left off. The platform's job is to make 'one team is broken' irrelevant to every other team. A shared consumer group across teams is the version where one team's restart blocks everyone.

#### Step 3: Schema contract enforced at publish, not discovered downstream

The bus's schema-contract layer holds a contract per topic. Producers publish through the registry; an incompatible schema change (removed required field, type change, renamed field without alias) is rejected at publish time. The producing team sees the error immediately and either updates the contract with backward-compatible evolution or reverts. A 'we'll catch breaking changes in code review' approach is the version where a producer ships and three downstream consumers fail at midnight; the registry is what stops that at the boundary.

#### Step 4: Deletion travels the same fan-out as the events

GDPR right-to-be-forgotten reaches every store, not just the bus. The deletion request enters as an event on the same bus the source events ride; each consumer applies the deletion to its store and writes a confirmation; an orchestrator collects all confirmations before closing the request. Without the propagation pattern, deletion becomes a manual hunt across stores nobody is sure exists. Putting deletion on the same fan-out the data took going in is what makes the audit answer a query.

---

### The shape that fits

> **What this design gives up**
>
> A schema-contract layer adds a publish-time check that producers have to integrate with; a deletion fan-out is a control plane that has to track confirmations across consumers. Per-team consumer groups mean per-team monitoring and per-team alerts. 'Just put it on Kafka' is the simpler design; the win is a platform that contains failures within a team, prevents producer breakage from cascading, and survives a privacy audit with proof of completion.

> **What reviewers check**
>
> A reviewer looks at the canvas for these properties:
> - An event bus sits between producers and per-team consumer groups so failures stay local to a team.
> - Two freshness tiers off the same bus serve fast and slow consumers without forcing them onto one cadence.

> **The mistake that ships**
>
> The team's first cut uses one Kafka topic with one shared consumer group, ad-hoc schema discipline, and 'we'll handle deletion when it comes up.' A team's consumer falls behind and lag piles up on the topic for everyone. A producer ships a breaking schema change and three downstream consumers crash at midnight; the producing team finds out from a Slack thread the next morning. A GDPR request lands and the team manually hunts customer data across six stores. The eventual rebuild is per-team groups, a schema-contract layer, and a deletion fan-out. By the time the platform team retrofits the registry and the deletion fan-out, three teams have built local workarounds that have to be torn down.

---

## Common follow-up questions

- A new team wants to consume historical events from a year ago, not just current. What in this design supports that, and what doesn't? _(Tests whether the candidate sees the bus retention as the limit: if the bus only retains a week, the new team can't replay from a year ago. The fix is either longer retention on the bus, or a long-retention archive in cold storage that the new team's batch loader can read from.)_
- A producer rolls out a 'compatible' schema change that adds an optional field; one consumer's deserializer breaks anyway. Where did this design fail, and what changes? _(Tests whether the candidate sees that 'compatible' has to be a property of the registry's compatibility rule, and consumers' deserializers have to be aligned with that rule (e.g. handling unknown fields gracefully). The contract is two-sided: producer publishes compatible, consumer parses tolerantly.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/event_system_for_multiple_consumers)
- [System Design Interview Questions](https://datadriven.io/data-engineering-system-design)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.