# Everyone Wants the Same Data, Differently

> How you store it decides how fast you can read it.

Canonical URL: <https://datadriven.io/problems/everyone_wants_the_same_data_differently>

Domain: Pipeline Design · Difficulty: hard · Seniority: L7

## Problem

Several teams at our company need the same event data but query it very differently. Design a data ingestion pipeline and consider the tradeoffs depending on how the data will be accessed.

## Worked solution and explanation

### Why this problem exists in real interviews

Three teams reading the same source with three different query shapes is the L7 multi-consumer problem. The trap is the warehouse-monolith answer (one big table, three different SQL) that fails on cost, on latency for support, and on GDPR's 'forget me everywhere' requirement. The actual answer is one source feeding three layouts, dedup at the boundary, deletion as a propagated event.

The simple answer is one warehouse table that everyone queries. Analytics is happy because the table is partitioned by date. Support waits while the warehouse scans for one customer's events. ML scans the last week and slows down support. Three years of history sits on warehouse storage prices and the bill keeps climbing. A GDPR delete request lands; the team realizes nobody knows which derived tables contain that customer's data, and the audit response becomes a manual hunt.

> **Trick to Solving**
>
> One source of truth in cheap storage, three layouts tuned to three query shapes, deletion travels the same fan-out the events did.
> 
> 1. Three workloads, three stores: a partitioned lake for analytics scans, a key-value store for support point-lookups, an online lookup tier for ML. All fed from one event bus.
> 2. Most history lives in cheap object storage (lake) with date partitioning; older data stays in the same lake, scannable through a serverless query engine billed by bytes.
> 3. Deletion is an event on the same bus the source events ride. Each consumer path applies the deletion to its store and confirms; the orchestrator collects the confirmations.
> 4. Dedup happens once at ingest on a stable event id, before the fan-out. Each downstream store inherits the dedup; consumers don't have to redo it.

---

### Walk the requirements

#### Step 1: Three stores tuned to three query shapes, one source

Analytics scans by date over weeks; support fetches one customer's history; ML scans the last week of features. That's three layouts. The lake is partitioned by date for analytics. The support store is keyed on customer id with recent events first. The online lookup tier holds aggregated last-week features for ML. All three are fed from the same event bus, so the source of truth is one and the layouts can evolve independently. One shared store means at least two of the three workloads suffer.

#### Step 2: Older data stays in the lake, reachable through the same engine

Three years of history can't sit on hot warehouse storage at the cost the business will accept. The lake is the long retention; older data lives in the same partition layout but rarely scanned. A serverless query engine reads older partitions on demand and is billed by bytes scanned, so the cost difference between recent and old is in compute, not storage class. Audits or back-window reports run against the same lake, slower but reachable in minutes. No 'restore from archive' workflow.

#### Step 3: Deletion travels the same fan-out as the events

GDPR right-to-be-forgotten means every store has to drop the customer's data. The deletion request enters as an event on the same bus. Each consumer path applies the deletion to its store and writes a confirmation back; the orchestrator collects all confirmations and only then closes the request. Without a propagation pattern, deletion becomes a manual hunt across stores nobody is sure exists, which is the audit response that fails. Putting deletion on the same fan-out the data took going in is what makes the audit answer a query.

#### Step 4: Dedup once, at the boundary, before the fan-out

Source retries deliver events more than once. Dedup runs at the stream layer on a stable event id before the events reach any of the three stores. The lake's writer is idempotent (upsert / partition-overwrite) on the same key; the support store and online lookup tier inherit the deduped stream. Consumers don't redo it. Three teams each implementing their own dedup is three subtly different answers to 'how many events did this customer have,' and the warehouse stops being trusted.

---

### The shape that fits

> **What this design gives up**
>
> Three stores fed from one stream is more pieces to operate than one warehouse and three different SQL queries. Deletion fan-out adds a control plane that has to track confirmations across stores. Single-store simplicity and ad-hoc dedup are what get sacrificed; in return, query latency fits each workload, the cost shape scales with access pattern, and deletion survives an audit.

> **What reviewers check**
>
> A reviewer looks at the canvas for these properties:
> - Three downstream stores fed from one event bus, each tuned to its workload.
> - Most history lives in cold storage; older data is reachable through the same query engine.
> - A deletion control plane fans deletes through every store and confirms completion.

> **The mistake that ships**
>
> The build that ships writes everything to one warehouse and lets each team write its own SQL. Storage cost grows month after month and the bill is the first thing leadership notices. Support sessions take long enough that customers hang up. ML retraining contends with analyst queries. A GDPR delete request lands and the team spends a week hand-finding the customer's data across derived tables, dashboards, and exports. The rebuild walks back through three stores, a propagated deletion, and dedup at ingest. Each was a property the conversation could have reached with one more pass past 'one warehouse.'

---

## Common follow-up questions

- An ML engineer asks for years of features instead of the last week. What changes in this design, and what stays the same? _(Tests whether the candidate sees the online lookup tier as approximate / recent and the lake as the long retention. The long-window features query reads from the lake through the serverless engine, builds the long-window features once, and decides whether to materialize them in the online lookup tier or rerun on demand based on how often the query runs.)_
- A deletion request arrives but one of the three stores is down. What happens, and what does the customer see in the audit response? _(Tests whether the candidate sees that the orchestrator holds the request open until all confirmations land, retries the failed store, and the audit answer references the open request rather than claiming completion. Closing the request before all stores confirmed is the move that fails an audit.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/everyone_wants_the_same_data_differently)
- [System Design Interview Questions](https://datadriven.io/data-engineering-system-design)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.