# The Weight of Everything Before

> Every purchase carries the ones that came before it. Trace the climb.

Canonical URL: <https://datadriven.io/problems/the-weight-of-everything-before>

Domain: SQL · Difficulty: medium · Seniority: mid

## Problem

The lifecycle analytics team is studying how each customer's spend accumulates over their lifetime on the platform, because lifetime-value models depend on seeing the full trajectory of a buyer rather than a single snapshot. For every purchase a customer has ever made, they want to see that buyer's cumulative spend as it stood at the moment of that purchase, with each customer's history walked forward from their earliest transaction to their most recent. Produce one row per purchase showing the customer, the date of that purchase, and the total amount the customer had spent up to and including that point, laid out customer by customer and earliest to latest within each.

## Worked solution and explanation

### Why this problem exists in real interviews

Running totals are the canonical test of whether a candidate reaches for a window function instead of a self-join. The interviewer wants to see that you can accumulate total_amount within each user_id while preserving every individual transaction row, rather than collapsing them. The decision point is recognizing that the cumulative spend at each purchase is an ordered, partitioned aggregate: partition by the customer, order by transaction_date, and sum forward. Candidates who try a correlated subquery or a self-join produce the right numbers but signal they have not internalized SQL's analytic functions, which is exactly what a Databricks data engineer works with daily over large tables.

---

### Break down the requirements

#### Step 1: Keep every purchase row

The output has one row per transaction, not one row per customer. That rules out a plain GROUP BY collapse; you need an analytic function that annotates each existing row with a computed value while leaving the row count intact.

#### Step 2: Accumulate within each customer

Partition the calculation by user_id so one customer's spend never bleeds into another's. SUM(total_amount) OVER (PARTITION BY user_id ...) restarts the accumulation at each new customer.

#### Step 3: Walk forward in time

Add ORDER BY transaction_date inside the window so the sum includes only purchases up to and including the current one. The default RANGE frame means same-date purchases share one cumulative value, which is the deterministic behavior you want. A final ORDER BY user_id, transaction_date lays the result out for reading.

---

### The solution

**Per-customer running total**

```sql
SELECT user_id, transaction_date, SUM(total_amount) OVER (PARTITION BY user_id ORDER BY transaction_date) AS running_total FROM transactions ORDER BY user_id, transaction_date;
```

> **Cost Analysis**
>
> On a transactions table of 80M rows spanning roughly 25 GB, the window function forces a sort by (user_id, transaction_date), which is the dominant cost. If the table is partitioned or clustered on user_id and pre-sorted by transaction_date, the engine can stream each partition without a global sort, keeping memory bounded to a single customer's window. The query is single-pass after the sort, with no join blowup, so it scales linearly and stays cheap relative to any self-join alternative that would be quadratic per customer.

> **Interviewers Watch For**
>
> A strong candidate names the window function immediately, explains that PARTITION BY resets the accumulation per customer, and proactively raises how ties on transaction_date are handled, noting the default RANGE frame versus ROWS. Mentioning that a self-join solution is O(n^2) per customer and the window is one sorted pass earns senior credit.

> **Common Pitfall**
>
> Forgetting the ORDER BY inside the OVER clause. SUM(total_amount) OVER (PARTITION BY user_id) with no ordering returns the customer's grand total on every row instead of a running total. The ordering is what turns the partition aggregate into a cumulative one.

---

## Common follow-up questions

- How would you change the query to show the running total of only the last three purchases for each customer, a moving window rather than a cumulative one? _(Tests whether the candidate can switch the default RANGE frame to an explicit ROWS BETWEEN 2 PRECEDING AND CURRENT ROW frame.)_
- If two purchases share the exact same transaction_date for a customer, what running_total does each row show, and how would you make the accumulation strictly row-by-row instead? _(Tests understanding of RANGE versus ROWS framing and tie-breaking determinism.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/the-weight-of-everything-before)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.