# Who Saw the Ad Twice

> TV and digital. Same viewer, two measurement worlds.

Canonical URL: <https://datadriven.io/problems/who_saw_the_ad_twice>

Domain: Pipeline Design · Difficulty: hard · Seniority: L6

## Problem

We measure advertising effectiveness across linear TV and digital platforms, but right now our TV and digital measurement pipelines are completely separate. Advertisers want to know the true unduplicated reach of a campaign that ran on both broadcast TV and connected apps. This requires joining second-level TV ad exposure logs from set-top boxes with digital impression logs, all without exposing raw PII. Design a pipeline that produces these cross-platform audience metrics.

## Worked solution and explanation

### Why this problem exists in real interviews

Cross-platform reach is a join across two datasets that legally cannot share raw identifiers. The trap is treating it as a normal warehouse join. The trap also hides a second pipeline (frequency capping) that has to feel instant on every ad request, which kills any 'just batch this' answer. One pipeline serves the morning report; another serves the ad server. Both depend on the same identity work.

The whiteboard answer is to copy the TV exposure logs into the digital warehouse, join on email or device id, count distinct households, and ship. Three things go wrong. Raw PII just crossed a boundary that CCPA cares about. The dedup is wrong because the same household has different identifiers on TV and digital and equality on email won't match them. And the ad server's frequency cap is hitting the warehouse at impression time, which is hundreds of milliseconds when it needs to be tens.

> **Trick to Solving**
>
> Hash identities before they cross, dedup on a privacy-preserving identity graph, serve the morning report from a clean room and the ad server from a fast cap store.
> 
> 1. Each platform hashes its identifiers in place before anything crosses; the join is between hashed identities, not raw ones.
> 2. An identity graph maps multiple hashed identifiers (TV household, digital device) to a stable household key, so dedup is on the household, not on the strongest match.
> 3. The morning report is published into a warehouse-backed clean room: the advertiser queries the audience without ever holding raw identifiers.
> 4. Frequency capping is a separate path, fed by the same impression stream, with state in a low-latency store the ad server reads on every impression.

---

### Walk the requirements

#### Step 1: One unduplicated reach number from a household-keyed join

TV exposures and digital impressions both carry identifiers, but they're different identifiers. An identity graph maps hashed TV household ids and hashed digital device ids to a stable household key. The cross-platform reach metric is a count distinct on the household key after the join, not a sum of the two platform counts. Summing TV reach and digital reach double-counts households that saw the campaign on both, which is exactly the number the advertiser is paying not to see.

#### Step 2: Hash identifiers in source, keep raw PII in place

CCPA cares where raw PII goes. Each platform hashes its identifiers (with a shared salt, so the hashes match for the join) before any data crosses pipeline boundaries. The TV pipeline keeps its raw identifiers inside the TV dataset; the digital pipeline keeps its raw identifiers inside the digital dataset; only hashed identifiers, the identity graph mapping, and aggregated metrics ever leave. A 'we'll just dedup in a single big warehouse' design centralises raw PII in exactly the way the audit will fail.

#### Step 3: Serve frequency caps from a fast store, separate from the batch path

The ad server checks the cap on every impression; if the lookup isn't fast, the cap doesn't apply and the campaign over-delivers. A streaming path tails the impression feed, updates per-household cap counters in a low-latency key-value store, and the ad server reads from that store on every request. The batch reach path runs on its own schedule for the morning report. Trying to serve cap lookups from the same warehouse the morning report is built in is the version that misses the cap window.

#### Step 4: Sequence identity, measurement, and delivery so the report ships by morning

An orchestrator owns the nightly DAG: identity graph refresh runs first (because measurement depends on it), then the cross-platform measurement job, then the clean-room publish. Each stage has its own SLA and its own alert; if identity is late, the alert fires before measurement is queued, and on-call sees the issue with hours to fix it, not the morning of the contract breach. The clean-room publish is the last step; if it's at risk, the alert fires before the contract breach, not after.

---

### The shape that fits

> **What this design gives up**
>
> Hashing in place plus an identity graph is more pieces than one centralised join, and the identity graph is itself a system that has to be operated. A separate frequency-cap path duplicates ingestion of the same stream. Clean-room delivery costs more than emailing a CSV. The simpler pipeline goes; what arrives is a privacy boundary that survives a CCPA review, frequency capping that actually applies, and reach numbers advertisers won't reject.

> **What reviewers check**
>
> A reviewer looks at the canvas for these properties:
> - Identifiers are hashed at the source; raw PII never crosses pipeline boundaries.
> - A streaming serving path delivers frequency-cap state to the ad server in real time, separate from the morning measurement batch.
> - An orchestrator sequences identity refresh, measurement, and clean-room delivery with alerting before the contractual deadline.

> **The mistake that ships**
>
> What goes out the door first copies TV logs into the digital warehouse, joins on email, sums platform counts, and lets the ad server query the warehouse for cap lookups. The reach number is wrong, which the advertiser notices within a week. The CCPA assessor finds raw email and device-id pairs in a single warehouse environment. The ad server's frequency-cap latency causes campaigns to over-deliver, and finance has to refund. The team rebuilds with hashing-in-place, an identity graph, a separate cap-serving store, and a clean-room delivery. The fix takes a quarter to land and the trust takes longer.

---

## Common follow-up questions

- An advertiser asks for unduplicated reach across TV, digital, and a third partner that joins next quarter. What in this design extends, and what doesn't? _(Tests whether the candidate sees the identity graph as the extension point: a new partner means a new hashed-identifier source feeding the same graph, with measurement automatically inheriting the new partner. The privacy boundary stays intact because the new partner hashes in source the same way.)_
- The ad server's cap lookup is fast but starts disagreeing with the morning report on cap counts. Where does the gap come from, and how do you fix it? _(Tests whether the candidate sees that the cap store is approximate (sized for fast, eventually consistent), and the morning report is exact (full reconciliation against the impression log). The fix is in how the discrepancy is reconciled at the day boundary, not in trying to make the cap store perfectly consistent.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/who_saw_the_ad_twice)
- [System Design Interview Questions](https://datadriven.io/data-engineering-system-design)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.