# The Address That Changed

> Addresses change. History must not be erased.

Canonical URL: <https://datadriven.io/problems/slowly_changing_dimension_type_2>

Domain: SQL · Difficulty: hard · Seniority: L5

## Problem

We maintain a slowly changing dimension (Type 2) for customer addresses. Each customer_id can appear multiple times, once per address change. The current record is the one where last_name is NULL. Find each customer's current country and how many times they have moved (total records minus 1). Show customer_id, first_name, current country, and move count.

## Worked solution and explanation

### Why this problem exists in real interviews

This analytics problem uses the `customers` table to evaluate self-join. Watch how the `customer_id` column interact in the grouping and filtering logic.

---

### Break down the requirements

#### Step 1: Apply the range filter

The WHERE clause restricts rows to the target range. Applying this filter early reduces the volume flowing into downstream operations.

#### Step 2: Aggregate by `customer_id`

`GROUP BY customer_id` collapses rows to one per group. The aggregate functions (`SUM`, `COUNT`, `AVG`, etc.) compute the metric for each group.

#### Step 3: Sort the final output

The `ORDER BY` clause ensures the result appears in the expected sequence. Interviewers check that the sort direction matches the prompt.

#### Step 4: Use a subquery to find the reference value

The scalar subquery computes a single value (like the maximum) that the outer query filters against. This avoids a self-join.

---

### The solution

**Apply the range filter to find slowly changing dimension type 2**

```sql
SELECT c.customer_id, c.first_name, c.country AS current_city, versions.move_count
FROM customers c
JOIN (SELECT customer_id, COUNT(*) - 1 AS move_count FROM customers GROUP BY customer_id) versions ON c.customer_id = versions.customer_id
WHERE versions.move_count > 0
ORDER BY c.customer_id
```

> **Cost Analysis**
>
> With ~5M rows, the GROUP BY reduces the working set before any downstream operations; the join cost depends on the smaller table's cardinality. An index on the filter/join columns would reduce the scan to a seek.

> **Interviewers Watch For**
>
> Interviewers watch for how you handle NULL values and whether you account for them in filters and aggregations; whether you use a subquery or self-join, and can explain the tradeoffs.

> **Common Pitfall**
>
> Forgetting to filter NULLs creates phantom groups or inflated counts. Always check `null_fraction` in the schema before assuming columns are clean.

---

## Common follow-up questions

- The `country` column in `customers` has a 1% null rate. How does your query handle rows where `country` is NULL, and could that silently change the result count? _(Tests whether the candidate accounts for NULLs in `customers.country` and understands how aggregates skip NULL values.)_
- `customers.customer_id` has roughly 3,000,000 distinct values. What index strategy would you use to avoid a full scan on `customers`? _(Tests indexing knowledge specific to the high-cardinality `customer_id` column in `customers`.)_
- How would you modify this query if the business logic required grouping by both `customer_id` and `first_name` instead of just one? _(Tests ability to adapt the query structure to changing requirements.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/slowly_changing_dimension_type_2)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.