# The Consent Stitcher

> Consent was given. Or was it? Stitch the records together.

Canonical URL: <https://datadriven.io/problems/the_consent_stitcher>

Domain: Pipeline Design · Difficulty: medium · Seniority: L6

## Problem

Our platform gets 100 million visitors a month and we monetize through health advertising and a premium membership. The problem is that most visitors start as anonymous users, then some create accounts during the session. Right now our analytics treats the pre-login and post-login parts of the same visit as two separate users, so we undercount engagement and overcount unique visitors. Design a pipeline that stitches sessions and handles the consent propagation required when users change their privacy settings.

## Worked solution and explanation

### Why this problem exists in real interviews

This problem combines identity stitching across pre-login and post-login activity on a high-traffic site with HIPAA isolation for health profiles and a CCPA opt-out that has to physically forget the user. On top of that, medical articles are versioned and advertisers buy readers of the current content. The trap is treating the pre-login and post-login sessions as two users, or treating opt-out as a flag instead of an erasure.

The default reach is to count anonymous and authenticated sessions separately and treat opt-out as a 'do not target' flag on the user record. Each shortcut fails in its own way. Analytics double-counts users who logged in mid-session. Health data lands in the same warehouse advertising reads from, protected only by column-level filters, so the first misconfigured query exposes a profile. Opt-outs persist forever as flags, so a CCPA review produces a finding. Advertisers buying readers of an arthritis article get readers of the superseded version too.

> **Trick to Solving**
>
> Stitch anonymous and authenticated activity into one user on login, isolate health data behind warehouse policies advertising can't reach, physically erase on opt-out, and version articles so 'readers of X' means readers of the current version.
>
> 1. On login, a stream stitches the anonymous session id to the authenticated user id; analytics and ad consumers read one stitched user.
> 2. Health data lives in a separately governed warehouse area advertising roles can't read; role-based access policies enforce the boundary at the engine.
> 3. CCPA opt-out triggers physical deletion through every advertising path; the user's events go away rather than gaining a flag.
> 4. Articles version on update; reader features tag the article version so 'readers of arthritis article v3' means v3, not the earlier v1 readers.

---

### Walk the requirements

#### Step 1: Stitch the pre-login and post-login user into one

When a visitor authenticates mid-session, a streaming consumer links the anonymous session id to the authenticated user id; downstream analytics and advertising read one stitched user. Counting the same physical visitor as two users is the named problem; the stitch is what collapses them into one. Without a streaming tier, the stitch lags by the batch cadence and analytics keeps double-counting in the meantime.
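
A minimal sketch of the stitch, assuming an in-memory mapping stands in for the streaming state. The `SessionStitcher` class, the event fields, and the `anon:` key prefix are hypothetical names chosen for illustration, not part of the problem statement.

```python
from dataclasses import dataclass, field


@dataclass
class SessionStitcher:
    """Holds the anonymous-session -> user mapping a streaming job would keep as state."""

    session_to_user: dict[str, str] = field(default_factory=dict)

    def on_login(self, session_id: str, user_id: str) -> None:
        # A login event links the anonymous session to the authenticated user.
        self.session_to_user[session_id] = user_id

    def resolve(self, event: dict) -> str:
        # Prefer the authenticated id, then the stitched mapping, and only
        # then fall back to a session-scoped anonymous key.
        if event.get("user_id"):
            return event["user_id"]
        session_id = event["session_id"]
        return self.session_to_user.get(session_id, f"anon:{session_id}")


events = [
    {"session_id": "s1", "user_id": None, "page": "/arthritis"},  # pre-login
    {"session_id": "s1", "user_id": "u42", "page": "/account"},   # post-login
    {"session_id": "s2", "user_id": None, "page": "/home"},       # never logs in
]

stitcher = SessionStitcher()
stitcher.on_login("s1", "u42")

# Without the stitch, session s1 contributes two distinct "users".
naive_uniques = {e["user_id"] or f"anon:{e['session_id']}" for e in events}
stitched_uniques = {stitcher.resolve(e) for e in events}

print(len(naive_uniques), len(stitched_uniques))
```

In this toy run the login is registered before the events are resolved. A real stream would see the pre-login events first, so it would either need to buffer them or re-key them once the login arrives.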

#### Step 2: Health data isolated from advertising at the warehouse

Health profiles, symptoms, and medication tracker data live in a separately governed warehouse area. Access policies enforce that advertising roles can't read those tables regardless of how the query is written. A 'mask the columns in BI' design leaves the columns exposed to any direct query; warehouse-level policies hold regardless of the query path. Without a governed warehouse the boundary lives only in the BI layer, which a direct query bypasses.
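
The difference between a BI-layer mask and an engine-level policy can be sketched as a check that runs before any rows are returned. This is a toy model under stated assumptions: the `GRANTS` mapping, the role names, and the `read_table` function are hypothetical and do not correspond to any particular warehouse's API.

```python
from dataclasses import dataclass


class AccessDenied(Exception):
    """Raised when a role reads a table outside its granted areas."""


@dataclass(frozen=True)
class Table:
    name: str
    area: str  # governance area, e.g. "health" or "ads"


# Hypothetical grants: role -> governance areas it may read.
# The advertising role has no grant on the health area at all.
GRANTS = {
    "advertising": {"ads"},
    "clinical_analytics": {"health"},
}


def read_table(role: str, table: Table) -> str:
    # The check sits in the engine's read path, so it applies no matter
    # which tool or query shape issued the read.
    if table.area not in GRANTS.get(role, set()):
        raise AccessDenied(f"role {role!r} cannot read {table.name!r}")
    return f"rows from {table.name}"


print(read_table("advertising", Table("ad_impressions", "ads")))
```

Because the denial is keyed on the table's governance area rather than on individual columns, a new column added to a health table is covered without anyone remembering to mask it.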

#### Step 3: CCPA opt-out physically removes from advertising paths

An opt-out request triggers a deletion that propagates through every advertising-side store: the stitched user is removed, the per-user features are deleted, the audience segments recompute. The user's events go away rather than gaining a 'do not target' flag. A flag-based opt-out is what fails the CCPA review because the data still exists; physical deletion is what makes the audit answerable.
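
One way to picture the propagation is a fan-out that deletes the user from each advertising-side store and collects a confirmation per store. The sketch below assumes simple in-memory stores keyed by user id; the `Store` class, the store names, and `process_opt_out` are hypothetical, and segment recomputation is modeled only as removing the user's membership rows.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    """Stand-in for one advertising-side store keyed by user id."""

    name: str
    rows: dict[str, list] = field(default_factory=dict)

    def delete_user(self, user_id: str) -> bool:
        # Physically remove the rows, then confirm the user is gone.
        # Deleting an absent user is a no-op, so retries are safe.
        self.rows.pop(user_id, None)
        return user_id not in self.rows


def process_opt_out(user_id: str, stores: list[Store]) -> dict[str, bool]:
    # Fan the deletion out to every store and record one confirmation each;
    # the request is complete only when every confirmation is True.
    return {store.name: store.delete_user(user_id) for store in stores}


stores = [
    Store("stitched_users", {"u42": ["s1"], "u7": ["s9"]}),
    Store("user_features", {"u42": ["read:arthritis:v3"]}),
    Store("audience_segments", {"u7": ["segment:sleep"]}),
]

confirmations = process_opt_out("u42", stores)
print(confirmations)
```

The per-store confirmation map is what makes the audit answerable: it records that each store was asked and reported the user gone, rather than that a flag was set somewhere.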

#### Step 4: Articles version; readers of a version are versioned readers

Medical articles update; an advertiser buying readers of the current arthritis article should not get readers of an earlier superseded version. Each article carries a version id; reader features tag the version a reader saw. Audience targeting joins on the article version, so 'readers of arthritis v3' is exactly that. Without versioning, current-content audiences inherit superseded-content readers and the advertiser pays for the wrong audience.
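
The version join can be sketched with two small in-memory tables. The `current_version` mapping, the `reads` records, and the `current_readers` helper are hypothetical names used only to illustrate the join on article version.

```python
# Hypothetical article dimension: the current version id per article.
current_version = {"arthritis": 3}

# Reader features tag the version each reader actually saw.
reads = [
    {"user_id": "u1", "article": "arthritis", "version": 1},
    {"user_id": "u2", "article": "arthritis", "version": 3},
    {"user_id": "u3", "article": "arthritis", "version": 3},
]


def current_readers(article: str) -> set[str]:
    # Match on (article, version) so only readers of the current version qualify.
    version = current_version[article]
    return {
        r["user_id"]
        for r in reads
        if r["article"] == article and r["version"] == version
    }


def all_readers(article: str) -> set[str]:
    # The unversioned audience, shown for contrast.
    return {r["user_id"] for r in reads if r["article"] == article}


print(sorted(current_readers("arthritis")), sorted(all_readers("arthritis")))
```

Here `u1` read only the superseded v1, so an unversioned audience would include that reader while the versioned one does not.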

---

### The shape that fits

> **What this design gives up**
>
> Session stitching adds streaming state per active visitor; warehouse-level isolation requires explicit role grants and view rebuilds when teams change; physical CCPA deletion adds a control plane that has to track confirmations across stores; article versioning grows the article dimension over time. Implementation cost is the price; the win is one count per visitor, HIPAA isolation that holds, CCPA opt-out that's physical, and audience targeting that refers to the article the advertiser meant.

> **What reviewers check**
>
> A reviewer looks at the canvas for these properties:
> - A streaming session-stitcher links pre-login and post-login activity to one user.
> - Health data lives in a governed area with warehouse-level access policies advertising can't reach.
> - CCPA opt-out triggers physical deletion of the user from advertising paths.
> - Article versions tag reader features so audience targeting refers to the current version.

> **The mistake that ships**
>
> What gets shipped counts anonymous and authenticated as two users, treats opt-out as a flag, lets advertising and health share a warehouse with column filters, and ignores article versions. Analytics double-counts; a misconfigured query exposes health data; CCPA review finds opt-outs still present; and advertisers complain that audiences include readers of superseded articles. The eventual rebuild adds session stitching, warehouse isolation, physical deletion, and article versioning.

---

## Common follow-up questions

- A user opts out and then signs up again later. What does this design do, and what does advertising see? _(Tests whether the candidate sees the prior opt-out's deletion as final; a new signup writes a new user with no historical events tied to the prior identity. Advertising treats the new user as new because the old one is physically gone.)_
- An article is renamed but not updated. What in this design lets advertising audiences continue to refer to the same readers? _(Tests whether the candidate sees that the article version doesn't change just because the title did; the version id stays the same and the audience targeting against that version is unaffected. The renaming is a metadata change, not a content version.)_
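
The rename follow-up can be illustrated with a small sketch in which only a body change produces a new version. The `Article` class and its fields are assumptions made for illustration; the source only specifies that a rename is a metadata change and leaves the version id untouched.

```python
from dataclasses import dataclass


@dataclass
class Article:
    article_id: str
    title: str
    body: str
    version: int = 1

    def rename(self, new_title: str) -> None:
        # The title is metadata: the version id stays the same, so audiences
        # keyed on (article_id, version) are unaffected.
        self.title = new_title

    def update_body(self, new_body: str) -> None:
        # Only a real content change produces a new version.
        if new_body != self.body:
            self.body = new_body
            self.version += 1


article = Article("arthritis", "Living with Arthritis", "original text")
article.rename("Arthritis: A Practical Guide")
version_after_rename = article.version

article.update_body("revised text")
version_after_update = article.version

print(version_after_rename, version_after_update)
```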

---

Source: DataDriven (https://datadriven.io).