# 600 Million Events a Day

> 600 million events a day. Two years of retention.

Canonical URL: <https://datadriven.io/problems/600_million_events_a_day>

Domain: Pipeline Design · Difficulty: hard · Seniority: L7

## Problem

Our e-commerce platform generates 600 million user interaction events per day through Kafka, and we need to store and serve this data for both real-time personalization and offline analytics teams. Current ad-hoc queries are scanning the entire dataset on every run. Design a storage and query architecture that supports both use cases cost-efficiently with two-year data retention.

## Worked solution and explanation

### Why this problem exists in real interviews

L7 multi-consumer at scale: four freshness budgets, an unsustainable query bill from full scans, GDPR right-to-erasure across multiple stores, and quarterly schema evolution against years of history. The trap is one warehouse for everything; it costs more, scans more, and can't reach into every layer for deletion.

The default reach is one warehouse with two-year retention and a streaming consumer for personalization. The query bill keeps climbing because every date-filtered query scans the full table. Personalization queries the warehouse and the latency stretches past the budget. A right-to-erasure request lands and the team finds the user in five places nobody documented. A new product field gets added; the team rewrites the historical table to add the column and the rewrite takes the warehouse offline for hours.

> **Trick to Solving**
>
> Tier the storage by access pattern, route queries by recency, propagate deletion through every layer, evolve the schema additively.
> 
> 1. Most history lives in cheap object storage with date partitioning; recent data also lives there but is mirrored to faster stores for the consumers that need it.
> 2. Personalization reads from a low-latency online store; fraud and analytics read from a streaming-or-batch warehouse path; training reads from the lake. Each consumer's path matches its budget and bill.
> 3. Deletion is an event on the same fan-out the data took going in; each consumer's store applies the deletion and writes a confirmation; an orchestrator collects them.
> 4. New fields are additive: the schema grows columns over time, defaults to null for historical rows, and old queries don't change behavior.

---

### Walk the requirements

#### Step 1: Four consumers, four paths matched to four budgets

Personalization reads features from an online store within seconds; fraud reads from a streaming consumer that writes a fraud-facing store within minutes; analytics queries the warehouse / lake through a serverless engine on a 15-minute view; training reads partitioned files from the lake. All four are fed from the same source events; the budgets diverge after ingest. One shared store means at least three of the four consumers suffer.

#### Step 2: Lay out the lake so date-filtered queries scan a slice and dashboards read pre-aggregates

Most data lives in cold storage with date partitioning (and clustering by the most common BI filter , user segment, country, product family). A 'last week' query scans a few partitions, not two years. Repeated dashboard queries read pre-aggregated rollups built off the lake on a slower cadence so the same query doesn't scan twice. The bill comes down because the access pattern matches the layout, not because the engine got cheaper.

#### Step 3: Deletion propagates through every store with confirmation

GDPR right-to-erasure has to reach the lake, the online store, the fraud-facing store, and the pre-aggregated views. The deletion request rides the same fan-out the events did; each store applies the deletion and writes a confirmation; the orchestrator collects them and proves the regulatory window. Without the propagation, deletion is a manual hunt and the audit answer is a promise rather than a record.

#### Step 4: Schema evolution is additive; old data isn't rewritten

Product teams add new event fields each quarter. The lake's schema grows columns over time; new fields are nullable with default null for historical rows. Queries that don't reference the new field are unaffected; queries that do filter to the rows that have it. A 'rewrite history when a field is added' approach is the version where every quarter's rollout takes the warehouse offline; additive evolution is what keeps the layer stable as the schema accumulates.

---

### The shape that fits

> **What this design gives up**
>
> Four consumer paths is more operational machinery than one warehouse; layout decisions (partition keys, cluster keys) commit the design to a query pattern; deletion propagation is a control plane that has to track confirmations across stores; additive schema means the lake's column count grows over years and consumers have to handle nulls thoughtfully. Implementation cost is the price; the win is four consumers each reading at their budget, a query bill that scales with access pattern instead of retention, deletion that proves itself, and schema changes that don't take the warehouse offline.

> **What reviewers check**
>
> A reviewer looks at the canvas for these properties:
> - Most history lives in cold storage; consumers read from paths matched to their freshness needs.
> - Date-partitioned layout means common date-filtered queries scan a small slice, and pre-aggregated views absorb repeated dashboard queries.
> - A deletion control plane propagates erasure through every store with completion confirmations.
> - Schema evolution is additive: new fields default null on historical rows; old queries keep working.

> **The mistake that ships**
>
> What gets shipped puts everything in one warehouse with two-year retention. The query bill grows month over month because every date-filtered query scans the full table. Personalization can't hit its sub-second budget reading from the warehouse. A right-to-erasure request lands and the team finds the user across five places nobody documented. A new product field gets added and the historical-table rewrite takes the warehouse offline for hours. The eventual rebuild adds tiered storage, four consumer paths, deletion fan-out, and additive schema , each was reachable in the original conversation if 'one warehouse for two years' had been treated as a cost choice that has to be justified.

---

## Common follow-up questions

- Analytics adds a new dashboard that needs sub-second response. What in this design accommodates it without putting the warehouse on the hot path? _(Tests whether the candidate sees the pre-aggregated views as the surface for repeated low-latency dashboards: a new view materialized off the warehouse on a faster cadence, indexed for the dashboard's query pattern. The warehouse stays the slow path; the view is what the dashboard reads. New consumer needs don't always mean a new path from the source.)_
- A deletion confirmation is overdue from the pre-aggregated view because the view's recompute hasn't run yet. What does the orchestrator surface, and how does the audit response read? _(Tests whether the candidate sees the orchestrator hold the request open with a per-store recompute trigger; the audit response references the per-store confirmation status, with the pre-aggregated view's confirmation arriving after its next recompute. The window is preserved per-store; the user-facing audit answer is honest about what's confirmed and what's pending.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/600_million_events_a_day)
- [System Design Interview Questions](https://datadriven.io/data-engineering-system-design)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.