# Shared Category Purchasers

> They bought different things from the same aisle.

Canonical URL: <https://datadriven.io/problems/shared_category_purchasers>

Domain: SQL · Difficulty: medium · Seniority: L4

## Problem

The recommendations team is building a collaborative filtering feature and needs to identify users with overlapping purchase interests. For each product category, find pairs of users who both purchased in that category, showing the category, both user IDs, and how many products they share, with the highest overlap first.

## Worked solution and explanation

### Why this problem exists in real interviews

The core skill being tested is self-join, applied to the `transactions` and `products` tables in a purchase behavior context. Getting columns like `user_id`, `product_id`, and `category` right is where most candidates slip.

> **Trick to Solving**
>
> When the prompt asks for pairs or combinations within the same table, a self-join is the pattern.
> 
> 1. Alias the table twice (e.g., `a` and `b`)
> 2. Join on the shared attribute (region, user, etc.)
> 3. Add `a.id < b.id` to avoid duplicate and self-pairs

---

### Break down the requirements

#### Step 1: Self-join the table to pair rows

Join `transactions` to itself to compare rows within the same table. The join condition controls which pairs are valid and prevents duplicate mirrors.

#### Step 2: Select the target columns

The SELECT clause picks exactly the columns the prompt asks for. Returning extra columns or missing a required alias would fail the grading check.

---

### The solution

**Self-join the table to pair rows to find shared category purchasers**

```sql
SELECT p.category, t1.user_id AS user1, t2.user_id AS user2, COUNT(*) AS shared_categories
FROM transactions t1
INNER JOIN transactions t2 ON t1.user_id < t2.user_id
INNER JOIN products p ON t1.product_id = p.product_id
INNER JOIN products p2 ON t2.product_id = p2.product_id AND p.category = p2.category
GROUP BY p.category, t1.user_id, t2.user_id
ORDER BY shared_categories DESC
```

> **Cost Analysis**
>
> With ~100M rows, the join cost depends on the smaller table's cardinality. An index on the filter/join columns would reduce the scan to a seek.

> **Interviewers Watch For**
>
> Interviewers watch for how you prevent duplicate pairs and ensure the join condition is correct.

> **Common Pitfall**
>
> A self-join without an inequality condition (`a.id < b.id`) produces duplicate mirrored pairs and a self-pair row for each entry.

---

## Common follow-up questions

- What result would you get if every value in `products.rating` were NULL? Would your query return an empty set or something unexpected? _(Tests extreme NULL scenarios and whether the candidate guards against edge cases in `rating`.)_
- `transactions.transaction_id` has roughly 100,000,000 distinct values. What index strategy would you use to avoid a full scan on `transactions`? _(Tests indexing knowledge specific to the high-cardinality `transaction_id` column in `transactions`.)_
- `transactions.quantity` only has 18 distinct values. If a new category were added, would your query automatically include it? _(Tests whether the query hard-codes values or dynamically adapts to `quantity` changes.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/shared_category_purchasers)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.