# Average Spending by Account Status

> Average per-user lifetime spending segmented by account status

Canonical URL: <https://datadriven.io/problems/average_spending_by_account_status>

Domain: SQL · Difficulty: medium · Seniority: L3

## Problem

The finance team is building a lifetime value model and wants to know whether premium account tiers actually correlate with higher spend. For each account status, show the average total spend among users who have made at least one purchase.

## Worked solution and explanation

### Why this problem exists in real interviews

LTV questions like this probe one habit specifically: do you name the **unit of observation** before writing SQL? Finance wants average spend per user, not per transaction. A whale with 1000 small purchases otherwise dominates a casual buyer with one big one. The interviewer is watching to see if you collapse to per-user totals in a subquery before averaging.

---

### Break down the requirements

#### Step 1: Collapse transactions to one row per user

Subquery on `transactions` with `SUM(total_amount) AS total_spent GROUP BY user_id`. This is the unit of observation the outer average should run over. Pre-aggregating here is what makes the final `AVG` user-weighted instead of transaction-weighted.

#### Step 2: Join to users for the segment column

`INNER JOIN users u ON u.user_id = ut.user_id`. Inner join is intentional: the prompt says **users who have made at least one purchase**, so zero-spenders are out of the denominator. Pull `u.account_status` only; you do not need any other user column.

#### Step 3: Aggregate up to account_status

Outer `SELECT u.account_status, AVG(ut.total_spent) GROUP BY u.account_status`. Each user contributes exactly one number to the bucket regardless of whether they ran 1 transaction or 1,000. That is the user-weighted lifetime spend finance asked for.

---

### The solution

**USER-WEIGHTED AVG VIA PRE-AGGREGATION**

```sql
SELECT
  u.account_status,
  AVG(user_totals.total_spent) AS avg_total_spending
FROM users u
INNER JOIN (
  SELECT user_id, SUM(total_amount) AS total_spent
  FROM transactions
  GROUP BY user_id
) user_totals
  ON u.user_id = user_totals.user_id
GROUP BY u.account_status;
```

**Wrong, transaction-weighted**

`SELECT u.account_status, AVG(t.total_amount) FROM users u JOIN transactions t ...`

One power user with 1,000 × $5 purchases and one casual user with 1 × $5,000 purchase. Direct `AVG(total_amount)` = ($5,000 + 1,000×$5) / 1,001 = **$9.99**. The power user's behavior dominates the bucket.

**Right, user-weighted**

Subquery collapses to two rows: $5,000 and $5,000. Outer `AVG` = **$5,000**. Both users count once. That is what the finance team means by average spend per user in an LTV model.

> **Cost Analysis**
>
> 150M `transactions` hash-aggregated by `user_id` is the dominant cost: high-cardinality grouping over the full fact table, no partition prune available unless `transaction_date` is filtered. Expect a multi-GB hash table or a spill-to-disk sort-aggregate. The 8M-row `users` join is cheap by comparison, a hash join on the already-grouped 8M-row build side keyed on a unique `user_id`. An index on `transactions(user_id)` does not help a full aggregate; hash-partitioning `transactions` by `user_id` would let workers aggregate slices in parallel.

> **Interviewers Watch For**
>
> Whether you ask **average of what?** before writing SQL. Per-user vs per-transaction is a business question, not a SQL question. Strong candidates name the trap out loud, propose the subquery shape, and confirm with the interviewer that zero-spenders are excluded. Weak candidates write `AVG(t.total_amount)` and move on, producing a number that is mathematically fine and semantically wrong.

> **Common Pitfall**
>
> Writing `AVG(t.total_amount)` directly off the `users JOIN transactions` shape. It reads correctly but answers a different question: it weights by transaction count, so a tier with one whale running thousands of small charges looks identical to a tier of consistent big spenders. Producing the number without flagging the choice is the bug.

> **Zero-Spender Decision**
>
> Swap to `LEFT JOIN` (users on the left, the subquery on the right) and wrap with `COALESCE(user_totals.total_spent, 0)` if finance wants zero-spenders included as $0. That shifts every bucket's average down, often dramatically for `free` or `trial` tiers where most accounts never purchase. The prompt says **at least one purchase**, so `INNER JOIN` is right here, but flag the alternative aloud, it is the first follow-up they will ask.

---

### COMMON FOLLOW-UP QUESTIONS

## Common follow-up questions

- How would the answer change if finance wanted to include users with zero purchases as $0 spenders? _(Tests whether you can articulate `LEFT JOIN` + `COALESCE` and reason about the denominator shifting per account_status tier.)_
- Compute median spend per user by account_status instead of the mean. How? _(Probes window/aggregate functions: `PERCENTILE_CONT(0.5) WITHIN GROUP` over the pre-aggregated per-user totals, or `NTILE(2)` plus `MAX` on the lower half.)_
- What if `total_amount` can be negative for refunds, and finance wants net spend? _(Tests whether you defend `SUM` over filtering, and whether you would clip negatives, expose refund volume as its own column, or report both gross and net.)_
- How would you run this incrementally as transactions stream in, rather than scanning 150M rows nightly? _(Probes materialized per-user rollup tables, CDC into a `user_lifetime_spend` mart, and avoiding full re-aggregation on every refresh.)_
- How does the plan change if `transactions` is partitioned by `transaction_date` and finance only wants the trailing 90 days? _(Tests partition pruning awareness and whether the per-user `SUM` should sit inside a CTE with the date filter pushed in.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/average_spending_by_account_status)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.