# User Spend Segmentation by Category

> Users segmented by spending behavior.

Canonical URL: <https://datadriven.io/problems/user_spend_segmentation_by_category>

Domain: SQL · Difficulty: hard · Seniority: L5

## Problem

Compute each user's average basket size (total spend / transaction count) per category from transactions combined with products. Label 'High' if over $500, 'Medium' between $200 and $500, 'Low' under $200. Then produce a per-category, per-segment summary: unique users, total transactions, total sales, and average basket size. Return the category, segment, unique users, total transactions, total sales, and average basket size.

## Worked solution and explanation

### Why this problem exists in real interviews

This tests multi-level aggregation with conditional labeling. Interviewers probe whether you can compute a per-user, per-category metric, classify it into buckets, and then re-aggregate at the segment level.

> **Trick to Solving**
>
> This is a two-stage aggregation problem. First aggregate to user-category grain, then classify and re-aggregate to category-segment grain.
> 
> 1. Join transactions to products for category
> 2. Compute basket size per user per category (SUM / COUNT)
> 3. Label each user-category pair as High/Medium/Low
> 4. Aggregate per category per segment

---

### Break down the requirements

#### Step 1: Compute basket size per user per category

Join `transactions` to `products`, GROUP BY `user_id, category`, compute `SUM(total_amount) / COUNT(*)` as basket_size.

#### Step 2: Label segments

Use CASE: `> 500` is 'High', between 200 and 500 is 'Medium', `< 200` is 'Low'.

#### Step 3: Re-aggregate per category per segment

GROUP BY `category, segment` with COUNT(DISTINCT user_id), COUNT(*), SUM(total_amount), AVG(basket_size).

---

### The solution

**Two-stage aggregation with segment labeling**

```sql
WITH user_baskets AS (
    SELECT
        t.user_id,
        p.category,
        SUM(t.total_amount) AS total_sales,
        COUNT(*) AS txn_count,
        SUM(t.total_amount) * 1.0 / COUNT(*) AS basket_size,
        CASE
            WHEN SUM(t.total_amount) * 1.0 / COUNT(*) > 500 THEN 'High'
            WHEN SUM(t.total_amount) * 1.0 / COUNT(*) >= 200 THEN 'Medium'
            ELSE 'Low'
        END AS segment
    FROM transactions t
    JOIN products p ON t.product_id = p.product_id
    GROUP BY t.user_id, p.category
)
SELECT
    category,
    segment,
    COUNT(*) AS unique_users,
    SUM(txn_count) AS total_transactions,
    SUM(total_sales) AS total_sales,
    AVG(basket_size) AS avg_basket_size
FROM user_baskets
GROUP BY category, segment
```

> **Cost Analysis**
>
> The initial join and GROUP BY processes 120M transactions against 30K products, producing ~132M user-category combinations (6M users x 22 categories, with sparse coverage). The second aggregation reduces to ~66 rows (22 categories x 3 segments).

> **Interviewers Watch For**
>
> Correct boundary handling in the CASE expression. The prompt says `> 500` for High and "between $200 and $500" for Medium, so Medium includes 200 and 500.

> **Common Pitfall**
>
> Integer division when computing basket size. `SUM(total_amount) / COUNT(*)` truncates if both are integers. Multiply by `1.0` to force decimal division.

---

## Common follow-up questions

- What if a user has transactions in multiple categories? _(They appear in multiple user_baskets rows, one per category. This is correct per the requirements.)_
- How would you handle users with exactly $200 or $500 basket size? _(Clarify boundary conditions: the CASE uses `>= 200` for Medium and `> 500` for High.)_
- What if you needed to add a time dimension (monthly segmentation)? _(Add date truncation to the first GROUP BY, turning it into a three-level aggregation.)_
- How would you optimize this for a 10x larger dataset? _(Pre-compute the user_baskets CTE as a materialized summary table updated incrementally.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/user_spend_segmentation_by_category)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.