Change Tracking
The History Problem
Understand why overwriting loses context
Why Overwriting Destroys Information
| Quarter | City (Overwritten) | Revenue | Accurate? |
|---|---|---|---|
| Q1 2024 | Seattle (was Portland) | $50,000 | No: customer was in Portland during Q1 |
| Q2 2024 | Seattle | $60,000 | Yes: customer is in Seattle now |
| Q3 2024 | Seattle | $45,000 | Yes |
Type 1: Overwrite Old Value
Update dimensions when history is irrelevant
When History Does Not Matter
| Operation | customer_id | city |
|---|---|---|
| Before | C001 | Portland |
| After UPDATE | C001 | Seattle (Portland gone) |
Type 1 is a destructive operation. There is no undo. If you later discover you needed the historical value, it is too late. When in doubt, use Type 2.
Type 2: Preserve Full History
Keep every version of a changing record
Keeping Every Version of a Dimension Row
| customer_sk | customer_id | city | effective_from | effective_to | is_current |
|---|---|---|---|---|---|
| 1001 | C001 | Portland | 2023-01-01 | 2024-06-14 | false |
| 1002 | C001 | Seattle | 2024-06-15 | 9999-12-31 | true |
Key Components
- Unique per version. The fact table joins on this. Each version of the customer gets its own SK.
- The business identifier. Multiple rows share the same customer_id. GROUP BY customer_id to see all versions.
- The validity window. effective_to = '9999-12-31' means 'still active.'
- Convenience flag. Redundant with effective_to = '9999-12-31' but makes WHERE is_current = TRUE queries simpler.
Point-in-Time Joins
- Simple equality join: fact.customer_sk = dim.customer_sk
- Fast: standard hash join
- Requires SK to be assigned at fact load time
- Most common approach
- Range join: fact.order_date BETWEEN effective_from AND effective_to
- Slower: range predicate, no hash join
- Works when fact table stores natural key
- Useful for late-arriving facts
The Storage Tradeoff
Type 3: Keep Previous Value
Store just the last two states of a record
Limited History: Just the Last Two States
| customer_id | city | previous_city |
|---|---|---|
| C001 | Seattle | Portland |
- Only one prior value preserved
- No effective dates
- No row growth
- Cannot answer 'city in Q1 2023?'
- All historical values preserved
- Effective date range per version
- One row per change
- Can answer point-in-time queries for any date
Choosing Your Strategy
Match SCD type to your reporting needs
Per-Attribute Strategy
| Attribute | SCD Type | Why |
|---|---|---|
| Customer name | Type 1 | Name corrections have no analytical impact |
| Customer region | Type 2 | Revenue must be attributed to region at time of purchase |
| Customer email | Type 1 | Nobody reports by historical email |
| Product category | Type 2 | Category-level metrics need point-in-time accuracy |
| Employee title | Type 2 | Headcount reports need to reflect title at each point in time |
| Customer segment (migration) | Type 3 | Need before/after comparison for a one-time migration analysis |
- Apply Type 2 to every attribute
- Table grows 40x because phone numbers change
- Queries slow because dimension is bloated
- Choose type per attribute
- Type 1 for corrections and contact info
- Type 2 only for attributes that affect reporting
> You are designing dim_employee for an HR analytics warehouse. Employees change departments, titles, and phone numbers over time.
When yesterday's truth still matters
- Category
- Data Modeling
- Duration
- 17 minutes
- Challenges
- 12 hands-on challenges
Topics covered: The History Problem, Type 1: Overwrite Old Value, Type 2: Preserve Full History, Type 3: Keep Previous Value, Choosing Your Strategy
Lesson Sections
- The History Problem
Why Overwriting Destroys Information When a dimension attribute changes, the simplest approach is to overwrite the old value with the new one. Customer moved to Seattle? UPDATE customers SET city = 'Seattle'. Done. Simple. But now every fact row that referenced this customer while they lived in Portland shows 'Seattle.' Last quarter's revenue by region report just retroactively changed. Portland lost revenue it actually had. Seattle gained revenue it did not earn. This is the slowly changing dim
- Type 1: Overwrite Old Value (concepts: dmScdType1)
When History Does Not Matter Type 1 simply overwrites the old value with the new one. The customer moved to Seattle? UPDATE the city column. The old value (Portland) is gone forever. This is the right choice for attributes where nobody ever needs the historical value: typo corrections, phone number updates, name changes due to marriage. Type 1 is not lazy modeling. It is the correct choice when historical values have no analytical meaning. Applying Type 2 to a phone number creates dimension rows
- Type 2: Preserve Full History (concepts: dmScdType2)
Keeping Every Version of a Dimension Row Type 2 creates a new row for every change. The old row is marked as expired. The new row is marked as current. Both rows have different surrogate keys. Fact rows from before the change point to the old SK. Fact rows after the change point to the new SK. Revenue is attributed to the correct city at the time of each transaction. Two rows, same customer_id, different surrogate keys. Fact rows from January 2024 join to SK 1001 (Portland). Fact rows from July
- Type 3: Keep Previous Value (concepts: dmScdType3)
Limited History: Just the Last Two States Type 3 adds a 'previous' column alongside the current column. customer has city and previous_city. When the customer moves, the current city becomes the previous city, and the new city becomes current. Only one prior value is preserved. History beyond the immediately-previous state is lost. Type 3 is rarely used in practice. It is useful for exactly one scenario: before-and-after comparisons. 'Show me customers who changed region this quarter.' Beyond th
- Choosing Your Strategy (concepts: dmScdStrategy)
Per-Attribute Strategy The most important concept: SCD type is chosen per attribute, not per table. A single dim_customer can have Type 1 on name (typo corrections do not need history), Type 2 on region (revenue attribution needs point-in-time accuracy), and Type 3 on segment (before/after comparison for a migration). The decision for each attribute: does anyone need the historical value? If the answer is 'yes, for reporting or audit,' use Type 2. If the answer is 'no, and keeping it creates use