# The Event Pile

> 600 million clicks a day. The budget is not infinite.

Canonical URL: <https://datadriven.io/problems/the_event_pile>

Domain: Pipeline Design · Difficulty: hard · Seniority: L7

## Problem

Our product generates hundreds of millions of user interaction events every day. We stream them through Kafka but right now they just pile up and we have no good way to query them for analytics. Storage costs are already a concern and the data needs to be queryable for at least two years. Design an architecture to store and query this event data efficiently.

## Worked solution and explanation

### Why this problem exists in real interviews

This is the L7 storage-and-query design problem dressed as a clickstream. Three workloads, four requirements, all of them pulling at the layout from different directions. The trap is forcing all three workloads onto one storage shape because it's simpler. Whichever shape you pick, two of the three workloads suffer enough that someone builds a side-store, then someone else builds another, and the bill goes back up.

Most candidates draw the Kafka topic landing into a single warehouse and tell support, analysts, and dashboards to query it. Support waits while the warehouse scans by date for one user; analysts wait because the table isn't optimized for their cross-month scans; dashboards strain when somebody runs a heavy query. The VP asks why the bill is climbing and discovers that 'one warehouse for everything' is the most expensive option for hundreds of millions of events per day at two-year retention.

> **Trick to Solving**
>
> Three query shapes, three storage shapes, one source of truth in cheap storage; routing is the architecture, not the after-thought.
> 
> 1. The lake (cheap object storage with columnar files, partitioned by date) is the source of truth and the long retention. Most data lives there; queries against it pay per byte scanned.
> 2. Support's session lookups go to a key-value or wide-column store keyed on user id, with recent sessions first, so 'pull this customer's events' is a point lookup.
> 3. Funnel analysts scan months by event type from the lake through a serverless query engine; dashboards run from a streaming-fed aggregate store sized for the last hour.
> 4. All three stores are fed from the same Kafka topic; nothing reads from another store, so the layouts can evolve independently and replays come from one place.

---

### Walk the requirements

#### Step 1: Cold lake as the long-retention source of truth

Hundreds of millions of events per day at two-year retention is a cost problem before it's anything else. The lake is cheap object storage with columnar files (Parquet / ORC) partitioned by date and event type, and it holds everything. Older data stays in the same lake but reachable through the same engine; no offline tape, no rehydration step. The lake is what makes the bill come down because most of the data isn't sitting on warehouse compute. The version that pays warehouse storage prices for two years of events is the bill the VP is reacting to.

#### Step 2: A session store for support, keyed by user, not by date

Support pulls one customer's session while the customer is on the phone. That's a point-lookup workload, not a scan workload. Land events into a key-value or wide-column store (DynamoDB, Cassandra, BigTable) keyed on user id with recent events first, so a query is 'pull events for user X in the last N minutes' and it returns in seconds. Forcing this lookup against the lake or a date-partitioned warehouse scans the wrong axis and is slow enough that customers hang up.

#### Step 3: Three stores, three query shapes, all fed from one source

Three workloads, three stores: lake for analysts (long scans, cheap), session store for support (point lookups, fast), aggregate store for dashboards (last-hour rollups, real-time). Each is fed from the same Kafka topic so the source of truth is one and replays seed any of them. The trap is letting one store read from another (the dashboard reading from the lake on the hot path, or the session store reading from the warehouse), which couples the failure modes and erases the layout choice.

#### Step 4: Older data stays on cheap storage, reachable through the same engine

Older data is rare-but-real. Keep it in the same lake under the same partition layout; the serverless query engine that serves analysts can scan older data the same way, just slower because there's more to read. No 'go retrieve from the archive' workflow, no offline tier; the cost difference between recent and old is in compute (more bytes scanned) not in storage class. Audits or back-window trend reports run against the same lake the analysts use, with longer wait times and higher per-query cost being the trade.

---

### The shape that fits

> **What this design gives up**
>
> Three stores fed from one stream is more pieces to operate than one warehouse, and you maintain three sets of write paths and three failure modes. Routing queries by intent (support vs analyst vs dashboard) means consumers have to know which store to hit. Operational simplicity and a unified query surface go; what arrives is a bill that scales with access pattern, support latency that doesn't make customers hang up, and three workloads that don't slow each other down.

> **What reviewers check**
>
> A reviewer looks at the canvas for these properties:
> - Three downstream stores fed from one event bus, each tuned to a workload (long scans, point lookups, streaming aggregates).
> - The cold lake holds the long retention; older data is reachable through the same query engine.

> **The mistake that ships**
>
> The team's first cut writes everything into a single warehouse and trusts query engines to handle three workloads. Storage cost climbs and the VP starts asking. Support waits for the warehouse to scan by date when they need a single session and a customer hangs up. An analyst scan of last quarter slows down the dashboards. Somebody builds a 'recent sessions' materialized view to fix support; somebody else builds an aggregate view to fix dashboards; the warehouse becomes a graveyard of derived tables and the bill is higher than where it started. The team rebuilds with three stores fed from one stream, which is what should have been on the whiteboard.

---

## Common follow-up questions

- Support starts asking for the customer's session from a year ago, not just the last hour. What changes? _(Tests whether the candidate sees that the session store is sized for recent activity, not for the long tail. Older sessions live in the lake and are retrieved through a different (slower) path. Putting a year of sessions in the session store at hot-store prices is exactly the cost shape the VP is reacting to.)_
- An event arrives twice from a misbehaving client. Which of the three stores cares, and what protects you? _(Tests whether the candidate sees that the lake's idempotent writer handles dedup, the session store cares because support reads exact event lists, and the dashboard store cares because aggregates double-count. The fix is dedup at the stream layer before fan-out, on a stable event id.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/the_event_pile)
- [System Design Interview Questions](https://datadriven.io/data-engineering-system-design)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.