# The Clicks We Throw Away

> Every tap, swipe, and scroll. At scale.

Canonical URL: <https://datadriven.io/problems/the_clicks_we_throw_away>

Domain: Pipeline Design · Difficulty: hard · Seniority: L4

## Problem

Our platform team captures user interactions across our apps, but right now the events go into a log file and get discarded. We want to build a proper clickstream pipeline that captures these events, processes them, and makes them available for product analytics and debugging. Design the clickstream data processing pipeline.

## Worked solution and explanation

### Why this problem exists in real interviews

Clickstream pipelines fail in one of two ways: they collapse under fan-out (analytics and debugging compete for the same data and break each other), or they're laid out for one consumer and unusable for the other. The trap is treating analytics and debugging as the same query workload. They're not. One needs to scan billions of rows for funnels; the other needs to pull one device's last hour while a bug ticket is still open.

The simple answer is to land events directly in the analytics warehouse and let the on-call engineer query that same warehouse when a customer reports a bug. The warehouse handles the funnel scan fine. But when the engineer tries to pull one device's last hour, the query scans date partitions laid out for analytics and either runs slowly or hits a row limit. The on-call engineer ends up grepping S3 logs again, which is exactly what the team was trying to leave behind.

> **Trick to Solving**
>
> An event bus in the middle, two consumers downstream, each laid out for its own query: analytics scans, debug pulls.
>
> 1. An event bus / queue in the middle decouples the producer from the consumers, so analytics and debug can fail independently and replay independently.
> 2. Analytics wants partition-by-date for funnel scans. Debug wants partition-by-device-id for fast device lookup. Different layouts, two sinks.
> 3. The clickstream is the source of truth for both; replay from the event bus reseeds either side without touching the apps.
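
The decoupling in the list above can be sketched as an append-only log where each consumer keeps its own read offset. This is a minimal in-memory illustration, not a real broker: `EventBus`, `poll`, `rewind`, and the event fields are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class EventBus:
    """Append-only log; each consumer tracks its own read offset (hypothetical sketch)."""
    log: list = field(default_factory=list)
    offsets: dict = field(default_factory=dict)

    def publish(self, event: dict) -> None:
        self.log.append(event)

    def poll(self, consumer: str, max_events: int = 100) -> list:
        """Return the next unread events for one consumer and advance its offset."""
        start = self.offsets.get(consumer, 0)
        batch = self.log[start:start + max_events]
        self.offsets[consumer] = start + len(batch)
        return batch

    def rewind(self, consumer: str, offset: int = 0) -> None:
        """Reset one consumer's offset so it replays; other consumers are unaffected."""
        self.offsets[consumer] = offset


bus = EventBus()
for i in range(3):
    bus.publish({"event_id": i, "device_id": "dev-1", "name": "tap"})

analytics_batch = bus.poll("analytics")         # analytics reads all three events
debug_batch = bus.poll("debug", max_events=2)   # debug lags behind, independently

bus.rewind("debug")                             # replay the debug side only
debug_replay = bus.poll("debug")
```

Because the offsets are per consumer, rewinding `debug` leaves the `analytics` position where it was, which is the property that lets either side fail or reseed without touching the other.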

---

### Walk the requirements

#### Step 1: Land events in a queryable analytics store, not a log file

The named problem is that events go to a log file and get thrown away. The fix is to land them in a store that product can query for funnels and adoption: a warehouse table partitioned by date so a 'how many users finished signup this week' query scans only the relevant days. Events flow from the apps through an event bus, then to a writer that lands them in the warehouse. The warehouse is the analytics consumer; product reads from there, not from the log file the events used to die in.
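
To make the partition pruning concrete, here is a small sketch of a date-partitioned table in plain Python. It stands in for a warehouse table, assuming events carry a Unix timestamp `ts`, a `user_id`, and an event `name`; those field names and the `DatePartitionedTable` class are illustrative, not part of the original problem.

```python
from collections import defaultdict
from datetime import date, datetime, timedelta, timezone


class DatePartitionedTable:
    """Toy stand-in for a warehouse table partitioned by event date (UTC)."""

    def __init__(self):
        self.partitions = defaultdict(list)  # date -> list of events

    def write(self, event: dict) -> None:
        day = datetime.fromtimestamp(event["ts"], tz=timezone.utc).date()
        self.partitions[day].append(event)

    def scan(self, start: date, end: date):
        """Yield events from partitions in [start, end]; other days are never read."""
        day = start
        while day <= end:
            yield from self.partitions.get(day, [])
            day += timedelta(days=1)


def noon_utc(y: int, m: int, d: int) -> float:
    """Helper for building example timestamps."""
    return datetime(y, m, d, 12, tzinfo=timezone.utc).timestamp()


table = DatePartitionedTable()
table.write({"ts": noon_utc(2024, 1, 1), "user_id": "u1", "name": "signup_complete"})
table.write({"ts": noon_utc(2024, 1, 3), "user_id": "u2", "name": "signup_complete"})
table.write({"ts": noon_utc(2024, 1, 3), "user_id": "u2", "name": "tap"})
table.write({"ts": noon_utc(2024, 2, 1), "user_id": "u3", "name": "signup_complete"})

# "How many users finished signup this week" touches only seven partitions.
week = table.scan(date(2024, 1, 1), date(2024, 1, 7))
finished = {e["user_id"] for e in week if e["name"] == "signup_complete"}
```

The February partition is never opened by the weekly query, which is the point of partitioning on date for scan-heavy analytics.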

#### Step 2: Lay out a debug store keyed by device, not by date

When a customer files a bug, the engineer wants to pull the last hour of that one device's activity while the bug is fresh. That's a point lookup, not a scan. Land the same event stream into a key-value or wide-column store keyed on device id, with recent events first. The engineer's query is 'pull events for device X over the last N minutes' and it returns in seconds. Forcing this lookup against the date-partitioned analytics warehouse scans the wrong axis and is slow enough that the engineer gives up and reads logs instead.
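
A rough sketch of the device-keyed layout, again in plain Python rather than a real key-value or wide-column store. It assumes events carry a `device_id` and a numeric `ts` in seconds; `DeviceEventStore` and `recent` are hypothetical names for illustration.

```python
import bisect
from collections import defaultdict


class DeviceEventStore:
    """Toy device-keyed store: one time-ordered list of events per device."""

    def __init__(self):
        self._ts = defaultdict(list)      # device_id -> sorted timestamps
        self._events = defaultdict(list)  # device_id -> events in the same order

    def write(self, event: dict) -> None:
        ts_list = self._ts[event["device_id"]]
        i = bisect.bisect_right(ts_list, event["ts"])  # keep per-device order by time
        ts_list.insert(i, event["ts"])
        self._events[event["device_id"]].insert(i, event)

    def recent(self, device_id: str, now: float, minutes: int) -> list:
        """Events for one device in the last `minutes`, newest first."""
        ts_list = self._ts.get(device_id, [])
        cutoff = now - minutes * 60
        i = bisect.bisect_left(ts_list, cutoff)
        return self._events.get(device_id, [])[i:][::-1]


store = DeviceEventStore()
store.write({"device_id": "dev-1", "ts": 1000, "name": "tap"})
store.write({"device_id": "dev-1", "ts": 4000, "name": "swipe"})
store.write({"device_id": "dev-2", "ts": 4100, "name": "scroll"})
store.write({"device_id": "dev-1", "ts": 4500, "name": "crash"})

# 'Pull events for device X over the last N minutes'
last_hour = store.recent("dev-1", now=5000, minutes=60)
names = [e["name"] for e in last_hour]
```

The lookup only ever touches one device's list and binary-searches the time cutoff, so its cost does not grow with the total event volume the way a date-partition scan does.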

---

### The shape that fits

> **What this design gives up**
>
> Two sinks means two storage layouts to operate, two writers to monitor, and roughly twice the storage cost of a single sink. Storage and operational duplication are the cost; the win is two genuinely different query patterns served well, instead of one that serves both poorly.

> **What reviewers check**
>
> A reviewer looks at the canvas for these properties:
> - An event bus sits between the producers and the consumers, fanning out to two stores tuned to different query shapes.
> - Analytics queries scan a date-partitioned warehouse; debug fetches go to a key-value store laid out by device.

> **The mistake that ships**
>
> The design the team ships writes events directly to a date-partitioned warehouse and tells the on-call engineer to query the same table when a bug comes in. Funnels work; debug queries scan the wrong axis and time out. The on-call engineer keeps a side script that greps S3 archives, which is functionally what the team was trying to replace. Six months in, somebody adds a 'recent events by device' materialized view, then another, then a third for a different team, all rebuilt every hour off the same warehouse. The eventual fix is a key-value store fed by the same event bus, which is what the design should have started with.

---

## Common follow-up questions

- An app pushes a bad batch of events that all carry the wrong device id. How do you correct the analytics warehouse and the debug store? _(Tests whether the candidate sees the event bus as the source of truth: the bad events sit in the bus's retained log, the team writes a correction (or the producer republishes), and both consumers replay from the bus. Trying to patch the two sinks directly leaves them out of sync with each other.)_
- Product asks if they can also use the device-keyed store to power a 'recent activity' feature inside the app. What changes about the SLA, and what changes about the operational risk? _(Tests whether the candidate notices that turning a debug store into a customer-facing store changes its SLA from 'best-effort' to 'production': capacity planning, latency targets, and replication strategy all tighten when the consumer becomes an end user.)_
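
For the first follow-up, one possible shape of "write a correction, then replay" is sketched below. The correction record format (`type`, `event_id`, corrected `device_id`) is an assumption made for this example; the source only says a correction is written or the producer republishes.

```python
def rebuild_sinks(log: list) -> tuple:
    """Replay the full log, apply correction records, and rebuild both sinks."""
    # Hypothetical correction format: {"type": "correction", "event_id", "device_id"}
    corrections = {
        rec["event_id"]: rec["device_id"]
        for rec in log
        if rec.get("type") == "correction"
    }
    by_date, by_device = {}, {}
    for rec in log:
        if rec.get("type") == "correction":
            continue
        fixed_id = corrections.get(rec["event_id"], rec["device_id"])
        event = {**rec, "device_id": fixed_id}
        by_date.setdefault(event["date"], []).append(event)         # analytics layout
        by_device.setdefault(event["device_id"], []).append(event)  # debug layout
    return by_date, by_device


log = [
    {"event_id": 1, "device_id": "dev-WRONG", "date": "2024-01-01", "name": "tap"},
    {"event_id": 2, "device_id": "dev-2", "date": "2024-01-01", "name": "tap"},
    {"type": "correction", "event_id": 1, "device_id": "dev-1"},
]
by_date, by_device = rebuild_sinks(log)
```

Both layouts are derived from the same replay, so they cannot drift apart the way two hand-patched sinks can.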

---

Source: DataDriven (https://datadriven.io).