# User Campaign Overlap Percentage

> How much ad overlap between users?

Canonical URL: <https://datadriven.io/problems/user_campaign_overlap_percentage>

Domain: SQL · Difficulty: hard · Seniority: L5

## Problem

For every pair of users, calculate the overlap ratio of shared ad campaigns relative to the smaller user's campaign set. Include only pairs that share at least one campaign. Show the two user IDs, the shared campaign count, and the overlap ratio.

## Worked solution and explanation

### Why this problem exists in real interviews

This tests self-join on aggregated data with set intersection logic. Interviewers check whether you can compute pairwise similarity metrics (Jaccard-style) efficiently in SQL.

> **Trick to Solving**
>
> The overlap ratio uses the smaller set as the denominator, which is a form of containment similarity. The steps are:
> 
> 1. Get distinct campaigns per user
> 2. Self-join to find shared campaigns between pairs
> 3. Count shared campaigns per pair
> 4. Divide by MIN of the two users' campaign counts

---

### Break down the requirements

#### Step 1: Get distinct campaigns per user

A CTE with `SELECT DISTINCT user_id, ad_campaign FROM ad_impressions` deduplicates the 400M rows to unique (user, campaign) pairs.

#### Step 2: Self-join to find shared campaigns

Join the CTE to itself on `ad_campaign` where `a.user_id < b.user_id` to avoid duplicate pairs and self-pairs.

#### Step 3: Count shared and compute overlap ratio

GROUP BY the user pair, count shared campaigns, and divide by the MIN of each user's total campaign count.

---

### The solution

**Self-join for pairwise campaign overlap**

```sql
WITH user_campaigns AS (
    SELECT DISTINCT user_id, ad_campaign
    FROM ad_impressions
),
user_counts AS (
    SELECT user_id, COUNT(*) AS campaign_count
    FROM user_campaigns
    GROUP BY user_id
)
SELECT
    a.user_id AS user_id_1,
    b.user_id AS user_id_2,
    COUNT(*) AS shared_campaigns,
    COUNT(*) * 1.0 / MIN(uc1.campaign_count, uc2.campaign_count) AS overlap_ratio
FROM user_campaigns a
JOIN user_campaigns b ON a.ad_campaign = b.ad_campaign AND a.user_id < b.user_id
JOIN user_counts uc1 ON a.user_id = uc1.user_id
JOIN user_counts uc2 ON b.user_id = uc2.user_id
GROUP BY a.user_id, b.user_id, uc1.campaign_count, uc2.campaign_count
HAVING COUNT(*) >= 1
```

> **Cost Analysis**
>
> The CTE deduplicates 400M rows to ~3.75M (15M users x 250 campaigns, with overlap). The self-join on `ad_campaign` can be expensive: each campaign could have thousands of users, creating O(n^2) pairs per campaign. In production, limit to active users or sample.

> **Interviewers Watch For**
>
> Using `a.user_id < b.user_id` to avoid counting each pair twice and excluding self-pairs. This is a standard optimization for pairwise comparisons.

> **Common Pitfall**
>
> Dividing by the larger set instead of the smaller set. The prompt says "relative to the smaller user's campaign set," so `MIN(count_a, count_b)` is the correct denominator.

---

## Common follow-up questions

- How would you compute Jaccard similarity instead of containment? _(Denominator becomes count_a + count_b - shared_count (union size).)_
- What if you only needed pairs with overlap above 50%? _(Add HAVING to filter on the ratio after grouping.)_
- How would this scale with 15 million users? _(The self-join is O(n^2) in the worst case; discuss sampling, MinHash, or approximate methods.)_
- What if some users have impressions but no clicks? _(The query uses all impressions regardless of click status; filtering to clicked-only would change the semantics.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/user_campaign_overlap_percentage)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.