# Two Years of Clicks, Cheap

> Two years of clicks. Every query has to be affordable.

Canonical URL: <https://datadriven.io/problems/two_years_of_clicks_cheap>

Domain: Pipeline Design · Difficulty: hard · Seniority: L7

## Problem

Our platform generates 600 million user interaction events per day through Kafka and we need a cost-efficient architecture to store and query this data for analytics with a two-year retention requirement. Query latency and storage cost are both first-class constraints. Design the end-to-end ingestion, storage, and query architecture.

## Worked solution and explanation

### Why this problem exists in real interviews

The L7 hint is in the requirements: cost is first-class, query latency is first-class, two years has to stay reachable, dashboards have to feel current, and duplicates from client retries can't inflate revenue. The trap is buying a single fast warehouse big enough to hold two years and watching the bill match. Storage and query economics force a tiered shape; deduplication forces a stable key.

The simple answer is to land everything in a single columnar warehouse, set retention to two years, and let analysts query whatever they want. Storage cost climbs every month; the VP asks why the bill keeps growing. Dashboards run against the same warehouse and the latency is fine until a backfill or a heavy analyst query lands and dashboards stutter. A duplicate purchase from a client retry slips through and finance gets a revenue number that doesn't match transactions. Three of the four requirements are misbehaving in different ways.

> **Trick to Solving**
>
> Tier the storage to the access pattern, route the query to the engine that's billed by data scanned, dedup at the boundary where the key is stable.
> 
> 1. Most history goes to cheap object storage in a columnar format (Parquet/ORC) with date partitioning; recent data also lives there but optionally cached or replicated to a hotter store for fast dashboard reads.
> 2. Queries route by recency: dashboards hit the hot store, analyst trend reports hit the lake through a serverless query engine billed per scan, so older queries are slower but cheap and still answerable.
> 3. Streaming path serves dashboards directly; the lake is fed by the same stream but writes in batched files for compactness.
> 4. Dedup runs at ingest with a stable event id (and event_time) so a client retry can't double-write to either tier.

---

### Walk the requirements

#### Step 1: Tier the bill: hot for recent, cold for the long tail

Most of the two years lives in cheap object storage as columnar files with date partitioning, where storage cost is dominated by S3 / GCS / ADLS economics, not warehouse compute. Hotter recent data lives in the same lake (or is mirrored to a faster store) for queries that need it now. Queries against older partitions go through a serverless query engine billed by bytes scanned, so the cheaper-but-slower path is also genuinely cheap. The bill comes down because most data isn't sitting on warehouse compute the way it would in a single-tier design.

#### Step 2: Lay out the data so common queries are fast and old queries still complete

Recent data partitioned and clustered by the dimensions analysts query most (date, product, region) keeps the common case fast. Older partitions retain the same layout so a two-year trend query is slower but possible: it scans more, it pays more, but it returns. The trap is rebuilding old partitions in a way that breaks the schema for trend queries. Schema evolution on the lake stays additive so the same query works across the whole window.

#### Step 3: Streaming path for dashboards, separate from the historical store

Product dashboards have to feel current. The streaming path tails the same Kafka topic the lake is fed from and updates a dashboard-specific store within the dashboard's freshness budget. Dashboard reads don't touch the lake at all on the hot path, so a heavy analyst query can't make a dashboard stutter. One shared path serving both means the analyst's six-month scan slows down the page that shows what users did in the last minute.

#### Step 4: Dedup at ingest, on the event id

Client retries are the named cause of duplicate purchases. Every event carries a stable event id and an event timestamp; ingest dedups on that key before either tier sees the event twice. The lake writes are idempotent (upsert or partition overwrite keyed on event id and date); the dashboard store writes are upserts on the same key. A duplicate purchase that slips through to finance is the failure that makes the warehouse stop being trusted, and the fix is at the boundary where the dedup key is still authoritative.

---

### The shape that fits

> **What this design gives up**
>
> Tiered storage and routed queries are more pieces to operate than a single warehouse. The serverless lake-query path is cheaper but slower; some analyst queries that used to return in seconds will return in minutes. The dashboard-specific store costs separate compute that the lake doesn't need. The one-engine architecture goes; what arrives is a bill that scales with access patterns instead of with raw retention.

> **What reviewers check**
>
> A reviewer looks at the canvas for these properties:
> - Most history sits on cold object storage; a serverless query engine reads it on demand.
> - A streaming path feeds a hot dashboard store; the same events also land deduped in the lake.
> - Dedup runs at ingest on a stable event id so neither tier double-counts.

> **The mistake that ships**
>
> The shape that ships writes everything into a single warehouse with two-year retention. Storage cost grows month over month and the VP starts asking pointed questions. A backfill of last quarter saturates the warehouse and dashboards stutter for an hour during business. A client retry produces a duplicate purchase that lands once in the warehouse and once in finance's reconciliation; the revenue number is wrong by the value of the duplicate, finance flags it, and the team rebuilds with a tiered store and dedup at ingest, which is what they should have started with.

---

## Common follow-up questions

- An analyst's two-year trend query is taking too long. What in this design lets you speed it up without rebuilding everything? _(Tests whether the candidate sees the layout knobs available: better partitioning, additional clustering keys, materialized aggregates over the cold tier, or moving the analyst onto a path that pre-aggregates. The question is which of those costs less than throwing the query into the hot tier.)_
- Finance asks for an exact revenue total for last week, reconciled against the payment provider. Which store do you query, and how do duplicates affect the answer? _(Tests whether the candidate sees that the lake is the source of truth post-dedup, and the dashboard store is approximate (sized for fresh, not for exact). Reconciliation queries the lake; duplicates are already gone because dedup ran at ingest.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/two_years_of_clicks_cheap)
- [System Design Interview Questions](https://datadriven.io/data-engineering-system-design)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.