# The Notification That Paid Off

> The message went out to thousands. A smaller number actually bit.

Canonical URL: <https://datadriven.io/problems/campaign_conversion_count>

Domain: SQL · Difficulty: hard · Seniority: L5

## Problem

We ran a push notification campaign starting one day after each user's first transaction. How many users went on to buy new products they hadn't purchased on their first day? Users who only re-bought the same items or never made additional purchases don't count. Return a single count.

## Worked solution and explanation

### Why this problem exists in real interviews

This is a **per-user set difference** dressed up as marketing attribution. The skill being probed: can you express `post_campaign_products(u) \ first_day_products(u)` in SQL without conflating users? Anyone can write the CTEs. The trick is the correlated filter that scopes the exclusion to the same `user_id`. Get that wrong and you delete every product anyone bought on day one.

---

### Break down the requirements

#### Step 1: Anchor each user to their first day

`MIN(transaction_date) GROUP BY user_id` in a `first_txn` CTE. This is the only reference point, every other CTE joins back to it.

#### Step 2: Capture the day-one basket per user

`first_day_products`: distinct `(user_id, product_id)` where `transaction_date = first_date`. This is the exclusion set, scoped per user.

#### Step 3: Capture purchases strictly after the campaign starts

`post_campaign`: distinct `(user_id, product_id)` where `julianday(transaction_date) > julianday(first_date) + 1`. The `+ 1` reflects 'one day after first transaction'.

#### Step 4: Subtract day-one products per user, then count distinct users

`SELECT COUNT(DISTINCT pc.user_id) FROM post_campaign pc WHERE pc.product_id NOT IN (SELECT product_id FROM first_day_products WHERE user_id = pc.user_id)`. The correlation is load-bearing.

---

### The solution

**PER-USER SET DIFFERENCE OVER A TIME WINDOW**

```sql
WITH first_txn AS (
  SELECT user_id, MIN(transaction_date) AS first_date
  FROM transactions
  GROUP BY user_id
),
first_day_products AS (
  SELECT DISTINCT t.user_id, t.product_id
  FROM transactions t
  INNER JOIN first_txn ft
    ON t.user_id = ft.user_id
   AND t.transaction_date = ft.first_date
),
post_campaign AS (
  SELECT DISTINCT t.user_id, t.product_id
  FROM transactions t
  INNER JOIN first_txn ft
    ON t.user_id = ft.user_id
  WHERE julianday(t.transaction_date) > julianday(ft.first_date) + 1
)
SELECT COUNT(DISTINCT pc.user_id) AS user_count
FROM post_campaign pc
WHERE pc.product_id NOT IN (
  SELECT product_id
  FROM first_day_products
  WHERE user_id = pc.user_id
);
```

> **Cost Analysis**
>
> 200M rows partitioned by `transaction_date`. The `MIN(transaction_date) GROUP BY user_id` pass scans every partition (no pruning, you don't know the dates yet) and is the dominant cost. Both downstream CTEs hash-join back on `user_id`; the second one also needs a date predicate that the planner can't push down (`julianday(...) > julianday(...) + 1` references a joined column). On a real warehouse, materialize `first_txn` once and broadcast-join it. The correlated `NOT IN` becomes a hash anti-join with `(user_id, product_id)` as the key, fine if `first_day_products` fits in memory per partition.

> **Interviewers Watch For**
>
> Before writing SQL: ask what 'one day after first transaction' means. Is it 24 hours (`julianday + 1`) or calendar-day next-day (`DATE(t) > DATE(ft.first_date)`)? Those return different users. The senior signal is naming the ambiguity. Second: confirm whether 'new product' is per-user or globally new. The prompt says 'they hadn't purchased on their first day', so per-user. Different read, different query.

> **Common Pitfall**
>
> Writing `WHERE pc.product_id NOT IN (SELECT product_id FROM first_day_products)` without the `WHERE user_id = pc.user_id` correlation. That excludes any product anyone bought on day one, so a user who bought a coffee mug on day three gets dropped because some other user bought a mug on their day one. Wildly under-counts. The correlated subquery is what makes the set difference per-user.

**NOT IN (silent NULL bug)**

`WHERE pc.product_id NOT IN (SELECT product_id FROM first_day_products WHERE user_id = pc.user_id)`. If any returned `product_id` is NULL, SQL three-valued logic makes every comparison evaluate to UNKNOWN, the row is dropped, and `user_count` collapses to zero. No error, no warning.

**NOT EXISTS (NULL-safe)**

`WHERE NOT EXISTS (SELECT 1 FROM first_day_products fdp WHERE fdp.user_id = pc.user_id AND fdp.product_id = pc.product_id)`. NULLs in `product_id` don't satisfy the equality, the row stays. Same shape, NULL-safe semantics. In production, default to this.

> **One-liner**
>
> Three CTEs, one correlated anti-join. The whole problem is just `post_campaign(u) MINUS first_day(u)`, scoped per user.

---

### COMMON FOLLOW-UP QUESTIONS

## Common follow-up questions

- Rewrite this using `EXCEPT` or `LEFT JOIN ... IS NULL` instead of the correlated `NOT IN`. Which does the planner prefer at 200M rows? _(Tests fluency with the three idioms for set difference and the cost trade-offs each one carries on a partitioned fact table.)_
- How would the answer change if 'new product' meant globally new (never sold by anyone before this user's first day) instead of per-user new? _(Probes whether the writer understood the correlation as a deliberate semantic choice, not boilerplate. Globally-new requires a different anchor entirely.)_
- The campaign actually runs for 30 days starting one day after first transaction. Add that window. _(Forces a second `julianday` boundary and exposes whether the writer composes date predicates cleanly or hardcodes a magic number.)_
- Why might `COUNT(DISTINCT user_id)` here over-count compared to a campaign-attribution model in production? _(Opens the door to discussing causality (did the push cause the purchase?) versus the pure set-difference this query computes. Senior signal.)_
- Suppose `transactions` has late-arriving rows that backfill into older partitions. Does this query stay correct? _(Tests awareness that `MIN(transaction_date)` is not stable under backfill, and that `first_txn` should probably be materialized with a cutoff.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/campaign_conversion_count)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.