# The User Who Asked to Be Forgotten

> Users want their data erased. Completely.

Canonical URL: <https://datadriven.io/problems/the_user_who_asked_to_be_forgotten>

Domain: Pipeline Design · Difficulty: hard · Seniority: L7

## Problem

Our platform has listeners on mobile, web, and smart speakers - all generating user interaction events. We need to aggregate these into dashboards showing hourly and daily engagement metrics. The challenge is that events arrive late, users can listen across multiple devices in one session, and GDPR requires us to completely delete a user's event history within 30 days of a deletion request. Design the end-to-end pipeline.

## Worked solution and explanation

### Why this problem exists in real interviews

Four properties pull in opposite directions: GDPR deletion that has to reach every store and prove it did, late events from buffered clients, duplicates from retries, and 15x volume spikes during major events. The trap is solving them serially: a deletion control plane bolted on later turns into a forensic exercise; a streaming aggregator without a buffer falls over on the first spike; arrival-time bucketing inflates the wrong hour every time a client buffers.

The default is one streaming aggregator that windows by arrival time, writes hourly counts to a metrics store, and handles deletion by 'we'll figure it out when it comes up.' During a live event, the spike saturates the aggregator and the dashboard latency stretches past its budget. Mobile buffer drains land late and inflate the hour they arrived in, not the hour they happened in. A duplicate retry pads metrics by a few percent. A deletion request lands and the team realizes the aggregates carry no user id; deletion has to scrub the raw archive and rebuild every aggregate that included the user, which the team didn't plan for.

> **Trick to Solving**
>
> Put a buffer in front of the aggregator for the spike, use event-time windowing for the buffered client, dedup on a client-generated event id for the retry, and treat deletion as an event that propagates to every store with confirmation.
>
> 1. A queue or log buffers events between producers and the aggregator so a 15x spike sits in the buffer rather than crashing the aggregator.
> 2. Aggregation runs on event-time windows with a defined lateness allowance; late buffer drains land in the right time bucket.
> 3. Dedup on a client-generated event id before counting; retried events collapse.
> 4. Deletion enters as an event on the same path as the user data; each store applies the deletion and writes a confirmation; an orchestrator collects confirmations and reports the 30-day window.

---

### Walk the requirements

#### Step 1: Deletion propagates through every store with completion records

GDPR deletion has to reach the raw archive and every aggregation that touched the user. The deletion request is itself an event on the same path as the user data; each consumer applies the deletion (raw archive removes the user's events, aggregations recompute affected partitions) and writes a confirmation back; an orchestrator collects confirmations and proves the 30-day window. Without the propagation pattern, the raw archive is clean and the aggregates still include the user, which is the audit failure GDPR reviewers actually find.
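
The propagation pattern can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production design: `RawArchive`, `HourlyAggregate`, `DeletionOrchestrator`, and the event fields `user_id` and `hour` are hypothetical names, the stores are in-memory, and the aggregate rebuilds everything from the cleaned archive where a real system would recompute only the affected partitions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

DELETION_DEADLINE = timedelta(days=30)  # the window stated in the problem


@dataclass
class DeletionRequest:
    request_id: str
    user_id: str
    requested_at: datetime
    # store name -> time that store confirmed the deletion
    confirmations: dict = field(default_factory=dict)


class RawArchive:
    """Hypothetical raw store: a list of event dicts that carry user_id."""
    name = "raw_archive"

    def __init__(self, events):
        self.events = list(events)

    def apply_deletion(self, user_id):
        self.events = [e for e in self.events if e["user_id"] != user_id]


class HourlyAggregate:
    """Hypothetical aggregate with no user id; rebuilt from the raw archive."""
    name = "hourly_aggregate"

    def __init__(self, archive):
        self.archive = archive
        self.counts = {}
        self.rebuild()

    def rebuild(self):
        counts = {}
        for e in self.archive.events:
            counts[e["hour"]] = counts.get(e["hour"], 0) + 1
        self.counts = counts

    def apply_deletion(self, user_id):
        # Simplification: full rebuild instead of affected partitions only.
        self.rebuild()


class DeletionOrchestrator:
    def __init__(self, stores):
        self.stores = stores  # order matters here: raw archive before aggregates

    def propagate(self, request, now):
        for store in self.stores:
            if store.name in request.confirmations:
                continue  # already confirmed, so retrying the request is safe
            store.apply_deletion(request.user_id)
            request.confirmations[store.name] = now

    def status(self, request, now):
        pending = [s.name for s in self.stores
                   if s.name not in request.confirmations]
        if not pending:
            return "closed"
        overdue = now - request.requested_at > DELETION_DEADLINE
        return "overdue" if overdue else "open"


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
archive = RawArchive([
    {"user_id": "u1", "hour": "2023-12-31T10"},
    {"user_id": "u2", "hour": "2023-12-31T10"},
    {"user_id": "u1", "hour": "2023-12-31T11"},
])
aggregate = HourlyAggregate(archive)
orchestrator = DeletionOrchestrator([archive, aggregate])
request = DeletionRequest("req-1", "u1", requested_at=now)
orchestrator.propagate(request, now + timedelta(hours=1))
```

The per-store `confirmations` map is what an audit response would be built from: a request is closed only when every store has written its entry.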

#### Step 2: Event-time windows with a lateness allowance

Mobile clients buffer events for tens of minutes when connectivity is bad. The aggregator runs on event-time windows with a configured lateness allowance; events arriving inside the allowance are slotted into the time bucket they belong to. Events past the allowance route to a late-data path that updates the affected hour through a separate compaction. Bucketing on arrival inflates the hour the buffer drained into and short-changes the actual hour; engagement dashboards that show 'when did people listen' depend on event-time being preserved end-to-end.
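
A rough sketch of event-time bucketing with a lateness allowance follows. The 30-minute allowance is an assumed value, `EventTimeAggregator` is a hypothetical name, and the watermark is simplified to "the latest event time seen so far"; real stream processors derive watermarks more carefully.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

ALLOWED_LATENESS = timedelta(minutes=30)  # assumed value, not from the problem


def hour_bucket(ts):
    """Truncate an event timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


class EventTimeAggregator:
    def __init__(self, allowed_lateness=ALLOWED_LATENESS):
        self.allowed_lateness = allowed_lateness
        self.watermark = None          # simplified: max event time seen
        self.counts = defaultdict(int)  # hour start -> event count
        self.late_events = []           # handed to the late-data path

    def process(self, event_time):
        if self.watermark is None or event_time > self.watermark:
            self.watermark = event_time
        bucket = hour_bucket(event_time)
        window_end = bucket + timedelta(hours=1)
        if self.watermark > window_end + self.allowed_lateness:
            # Window already closed: route to the separate compaction path.
            self.late_events.append(event_time)
            return "late"
        # Counted in the hour it happened, not the hour it arrived.
        self.counts[bucket] += 1
        return "on_time"


def at(hour, minute):
    return datetime(2024, 1, 1, hour, minute, tzinfo=timezone.utc)


agg = EventTimeAggregator()
arrival_order = [at(10, 5), at(10, 50), at(11, 10), at(10, 40),
                 at(12, 45), at(10, 55)]
results = [agg.process(t) for t in arrival_order]
```

In this run the 10:40 event arrives after an 11:10 event but still lands in the 10:00 bucket, while the 10:55 event arrives after the watermark has passed 11:30 and goes to the late path.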

#### Step 3: Dedup on the client-generated id, before counting

Each event carries a client-generated id. The aggregator dedups on that id before incrementing any count, so a retried event collapses to one. Idempotent state in the aggregator keeps the contract across restarts. Without dedup at this boundary, counts inflate by the retry rate, and royalty and engagement numbers don't match what users actually did. Counting first and trying to subtract duplicates later lets the metrics drift between recalculations.
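
As a minimal sketch of dedup-before-count: the field names `event_id` and `hour` and the class `DedupingCounter` are assumptions for illustration, and the seen-id set is held in memory. A real aggregator would persist this state and expire old ids, since the set otherwise grows with the unique-id space.

```python
from collections import defaultdict


class DedupingCounter:
    def __init__(self, seen_ids=None, counts=None):
        # Passing saved state back in stands in for recovery after a restart.
        self.seen_ids = set(seen_ids or ())
        self.counts = defaultdict(int, counts or {})

    def process(self, event):
        """Count the event once; return False if it was a duplicate."""
        event_id = event["event_id"]
        if event_id in self.seen_ids:
            return False  # retry of an event already counted
        self.seen_ids.add(event_id)
        self.counts[event["hour"]] += 1
        return True


counter = DedupingCounter()
events = [
    {"event_id": "e1", "hour": "10"},
    {"event_id": "e2", "hour": "10"},
    {"event_id": "e1", "hour": "10"},  # client retry of e1
]
accepted = [counter.process(e) for e in events]

# Simulated restart: rebuild from saved state, then see the retry again.
restored = DedupingCounter(counter.seen_ids, counter.counts)
accepted_after_restart = restored.process({"event_id": "e2", "hour": "10"})
```

The check and the increment happen in the same step, so there is never a count that later needs a duplicate subtracted from it.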

#### Step 4: Buffer the 15x spike in the queue, not the aggregator

Major live events drive sudden 15x volume spikes. A queue or log between producers and the aggregator absorbs the burst; the aggregator catches up at its own rate, with backpressure visible in the queue depth rather than in dropped events. Sizing the aggregator for peak is expensive and still leaves the risk of a spike that overshoots the provisioned capacity; the queue is what gives the system slack while keeping the latency budget intact for everyone else. Without the buffer, the dashboard freezes during the moments people are watching most.
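
A toy simulation makes the buffering behaviour concrete. The numbers are illustrative assumptions (a baseline of 100 events per tick, one 15x tick of 1500, and an aggregator that handles 400 per tick), and `simulate` is a hypothetical helper; an in-memory `deque` stands in for the queue or log.

```python
from collections import deque


def simulate(arrivals_per_tick, capacity_per_tick):
    """Return (events processed, queue depth after each tick)."""
    buffer = deque()
    processed = 0
    depth_history = []
    next_id = 0
    for arrivals in arrivals_per_tick:
        # Producers only append; they never wait on the aggregator.
        for _ in range(arrivals):
            buffer.append(next_id)
            next_id += 1
        # The aggregator drains at its own fixed rate.
        for _ in range(min(capacity_per_tick, len(buffer))):
            buffer.popleft()
            processed += 1
        depth_history.append(len(buffer))
    return processed, depth_history


arrivals = [100, 1500, 100, 100, 100, 100, 100]  # one 15x spike tick
processed, depths = simulate(arrivals, capacity_per_tick=400)
```

Here the queue depth peaks at 1100 right after the spike and drains to zero within four more ticks, with every event eventually processed. The depth series is the signal to alert on: it shows how far behind the aggregator is without any event being dropped.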

---

### The shape that fits

> **What this design gives up**
>
> Event-time windows with a lateness allowance mean the aggregator has to hold open windows past wall-clock time and accept late updates; dedup state grows with the unique-id space; a deletion control plane is a system that has to be operated. Implementation cost is the price; the win is metrics that count for the right time, no inflation from retries, dashboards that stay live during a spike, and a GDPR audit that can be answered with confirmations rather than promises.

> **What reviewers check**
>
> A reviewer looks at the canvas for these properties:
> - A queue or log absorbs spikes between producers and the streaming aggregator so a 15x burst doesn't break the latency budget.
> - Aggregation uses event-time windows with a configured lateness allowance, so buffered events count in the time bucket they belong to.
> - Dedup on a client-generated event id collapses retried events before counting.
> - Deletion propagates through the raw archive and every aggregate that included the user, with per-store confirmations within the 30-day window.

> **The mistake that ships**
>
> What gets shipped runs an arrival-time aggregator straight off the producers with no buffer in between. The first major live event saturates the aggregator and the dashboard freezes during the moments people are watching the most. Mobile buffer drains inflate the hour they arrived in; reported engagement drops for the hour the events happened in and spikes for the next one, for reasons unrelated to user behavior. Duplicate retries pad metrics by a few percent, and royalty calculations built on those counts are off by the same margin. A deletion request lands and the team scrambles to find which aggregates included the user; the 30-day window passes without a confirmation. The rebuild adds the buffer, event-time windowing, dedup on the client-generated event id, and a deletion control plane in turn.

---

## Common follow-up questions

- An event arrives outside the lateness allowance. What does the design do, and how does the dashboard reflect the correction? _(Tests whether the candidate sees the late-data compactor as the path for events past the allowance: it folds them into the affected hour through a partition-overwrite, the dashboard shows the corrected hour on its next read, and the compaction emits an alert so the team can investigate why the lateness exceeded the allowance.)_
- A deletion is requested but one downstream store has been offline for hours. What does the orchestrator do, and what does the GDPR confirmation look like? _(Tests whether the candidate sees the orchestrator hold the deletion request open until every store confirms, with retries and an alert when a store is past its expected confirmation window. The 30-day window is reported per request and the audit response references the open or closed status of each request, not a global pass/fail.)_
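
For the first follow-up, one possible shape of the late-data compactor is sketched below. It assumes raw events are stored per hour as sets of event ids and that published hourly counts live in a plain dict; `compact_late_events`, `raw_by_hour`, and `published` are hypothetical names. The affected hours are recomputed and overwritten rather than incremented, which keeps a re-run of the same compaction harmless.

```python
def compact_late_events(raw_by_hour, published, late_events):
    """Fold late events into raw partitions and overwrite affected counts.

    Returns the sorted list of hours that changed, which is also what an
    alert about excessive lateness would report.
    """
    affected = set()
    for event in late_events:
        hour = event["hour"]
        ids = raw_by_hour.setdefault(hour, set())
        if event["event_id"] not in ids:
            ids.add(event["event_id"])
            affected.add(hour)
    for hour in affected:
        # Partition overwrite: recompute from raw data, never add a delta.
        published[hour] = len(raw_by_hour[hour])
    return sorted(affected)


raw_by_hour = {"10": {"a", "b"}}
published = {"10": 2}
late = [
    {"event_id": "c", "hour": "10"},  # genuinely new late event
    {"event_id": "a", "hour": "10"},  # duplicate of an event already stored
    {"event_id": "d", "hour": "09"},  # late event for an hour with no data yet
]
first_run = compact_late_events(raw_by_hour, published, late)
second_run = compact_late_events(raw_by_hour, published, late)
```

The dashboard picks up the corrected hours on its next read of `published`; running the compaction a second time with the same input changes nothing.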

---

Source: DataDriven (https://datadriven.io).