# Diverse Shoppers

> They shop the whole catalog.

Canonical URL: <https://datadriven.io/problems/diverse_shoppers>

Domain: SQL · Difficulty: medium · Seniority: L3

## Problem

The retention team is building a loyalty tier for customers who show broad purchasing behavior rather than repeat-buying a single item. For each user, show their username, how many different products they have purchased, and what they have spent in total across all transactions. Only include users who have bought from at least two different products. Present from the highest total spend to the lowest.

## Worked solution and explanation

### Why this problem exists in real interviews

This tests JOIN with GROUP BY, COUNT(DISTINCT), HAVING, and ORDER BY. Combining multiple aggregates with a post-aggregation filter probes standard analytics query construction.

---

### Break down the requirements

#### Step 1: Join users to transactions

Join on `user_id` to get `username` alongside transaction data.

#### Step 2: Aggregate per user

`COUNT(DISTINCT product_id)` for product diversity, `SUM(total_amount)` for total spend.

#### Step 3: Filter and sort

`HAVING COUNT(DISTINCT product_id) >= 2` keeps diverse shoppers. `ORDER BY total_spend DESC`.

---

### The solution

**Multi-metric aggregation with diversity filter**

```sql
SELECT
    u.username,
    COUNT(DISTINCT t.product_id) AS distinct_products,
    SUM(t.total_amount) AS total_spend
FROM users u
JOIN transactions t ON u.user_id = t.user_id
GROUP BY u.user_id, u.username
HAVING COUNT(DISTINCT t.product_id) >= 2
ORDER BY total_spend DESC
```

> **Cost Analysis**
>
> Hash join of 5M users to 50M transactions. COUNT(DISTINCT product_id) per user requires hash sets per group. With 5M users, memory can be significant.

> **Interviewers Watch For**
>
> Whether the candidate uses COUNT(DISTINCT product_id) (correct) vs COUNT(product_id) (incorrect, counts total purchases not unique products).

> **Common Pitfall**
>
> Grouping by only username without user_id could merge distinct users who share a name. Always include the primary key in GROUP BY when selecting non-key columns.

---

## Common follow-up questions

- How would you also show the most frequently purchased product per user? _(Tests mode-per-group pattern with RANK.)_
- What if you needed the average spend per unique product? _(Divide total_spend by distinct_products.)_
- How would you identify users who are diverse AND high-spending? _(Add a HAVING condition on SUM(total_amount) as well.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/diverse_shoppers)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.