Change Tracking

A customer moves from Portland to Seattle. Do you overwrite Portland with Seattle? Or do you keep both, so that last quarter's revenue report still shows Portland? This is the slowly changing dimension (SCD) problem, and the answer depends on whether anyone needs the historical value. This lesson teaches you the three main SCD strategies, when each one is the right choice, and how to implement Type 2 (the most common and most complex) correctly.

The History Problem

Daily Life
Interviews

Understand why overwriting loses context

Why Overwriting Destroys Information

When a dimension attribute changes, the simplest approach is to overwrite the old value with the new one. Customer moved to Seattle? UPDATE customers SET city = 'Seattle'. Done. Simple. But now every fact row that referenced this customer while they lived in Portland shows 'Seattle.' Last quarter's revenue by region report just retroactively changed. Portland lost revenue it actually had. Seattle gained revenue it did not earn.
This is the slowly changing dimension problem. Dimension data changes over time (customer addresses, product prices, employee titles), and how you handle those changes determines whether your historical reports stay accurate.
QuarterCity (Overwritten)RevenueAccurate?
Q1 2024Seattle (was Portland)$50,000No: customer was in Portland during Q1
Q2 2024Seattle$60,000Yes: customer is in Seattle now
Q3 2024Seattle$45,000Yes
The Q1 revenue was earned while the customer lived in Portland. After the overwrite, it appears under Seattle. If your company reports revenue by geography to investors, this is a material misstatement.
schedule
SCD Type 1: Overwrite. History is lost. Simple but destroys point-in-time accuracy.
immutable
SCD Type 2: Version. Keep all historical values. Complex but preserves full history.
copy
SCD Type 3: Previous column. Keep only the immediately prior value. Limited history, simple implementation.
TIP
Not every attribute needs history. A customer's phone number correction does not affect revenue reports. The question is always: does anyone need the historical value? If yes, Type 2. If no, Type 1. Choose per attribute, not per table.

Type 1: Overwrite Old Value

Daily Life
Interviews

Update dimensions when history is irrelevant

When History Does Not Matter

Type 1 simply overwrites the old value with the new one. The customer moved to Seattle? UPDATE the city column. The old value (Portland) is gone forever. This is the right choice for attributes where nobody ever needs the historical value: typo corrections, phone number updates, name changes due to marriage.
Type 1 is not lazy modeling. It is the correct choice when historical values have no analytical meaning. Applying Type 2 to a phone number creates dimension rows that nobody will ever query, bloating the table for no benefit.
dim_customer (Type 1)PKcustomer_idnamecityphone
Operationcustomer_idcity
BeforeC001Portland
After UPDATEC001Seattle (Portland gone)

Type 1 is a destructive operation. There is no undo. If you later discover you needed the historical value, it is too late. When in doubt, use Type 2.

Type 2: Preserve Full History

Daily Life
Interviews

Keep every version of a changing record

Keeping Every Version of a Dimension Row

Type 2 creates a new row for every change. The old row is marked as expired. The new row is marked as current. Both rows have different surrogate keys. Fact rows from before the change point to the old SK. Fact rows after the change point to the new SK. Revenue is attributed to the correct city at the time of each transaction.
dim_customer (Type 2)PKcustomer_skcustomer_idnamecityeffective_fromeffective_tois_current
customer_skcustomer_idcityeffective_fromeffective_tois_current
1001C001Portland2023-01-012024-06-14false
1002C001Seattle2024-06-159999-12-31true
Two rows, same customer_id, different surrogate keys. Fact rows from January 2024 join to SK 1001 (Portland). Fact rows from July 2024 join to SK 1002 (Seattle). Revenue attribution is correct at every point in time.

Key Components

  • Unique per version. The fact table joins on this. Each version of the customer gets its own SK.
  • The business identifier. Multiple rows share the same customer_id. GROUP BY customer_id to see all versions.
  • The validity window. effective_to = '9999-12-31' means 'still active.'
  • Convenience flag. Redundant with effective_to = '9999-12-31' but makes WHERE is_current = TRUE queries simpler.

Point-in-Time Joins

The power of Type 2: any fact row can be joined to the correct version of the dimension. If the fact stores the surrogate key (assigned at load time), the join is a simple equality. If the fact stores the natural key, the join uses a date range: WHERE fact.order_date BETWEEN dim.effective_from AND dim.effective_to.
Join by Surrogate Key
  • Simple equality join: fact.customer_sk = dim.customer_sk
  • Fast: standard hash join
  • Requires SK to be assigned at fact load time
  • Most common approach
Join by Natural Key + Date Range
  • Range join: fact.order_date BETWEEN effective_from AND effective_to
  • Slower: range predicate, no hash join
  • Works when fact table stores natural key
  • Useful for late-arriving facts

The Storage Tradeoff

Type 2 grows the dimension table. A customer with 10 changes has 10 rows. If 1 million customers average 5 changes each, the dimension has 5 million rows. For most dimensions, this is fine. For dimensions with high-frequency changes (daily behavioral scores, real-time status updates), the table can explode. Consider mini-dimensions for high-churn attributes.
TIP
Type 2 is the standard for any attribute that affects reporting. Customer region, product category, employee department, subscription tier. When in doubt, use Type 2. The storage cost of extra rows is negligible compared to the cost of incorrect historical reports.

Type 3: Keep Previous Value

Daily Life
Interviews

Store just the last two states of a record

Limited History: Just the Last Two States

Type 3 adds a 'previous' column alongside the current column. customer has city and previous_city. When the customer moves, the current city becomes the previous city, and the new city becomes current. Only one prior value is preserved. History beyond the immediately-previous state is lost.
customer_idcityprevious_city
C001SeattlePortland
Type 3 is rarely used in practice. It is useful for exactly one scenario: before-and-after comparisons. 'Show me customers who changed region this quarter.' Beyond that, it is too limited. If the customer moves a third time, Portland is lost. For full history, use Type 2.
Type 3 (Limited)
  • Only one prior value preserved
  • No effective dates
  • No row growth
  • Cannot answer 'city in Q1 2023?'
Type 2 (Full History)
  • All historical values preserved
  • Effective date range per version
  • One row per change
  • Can answer point-in-time queries for any date

Choosing Your Strategy

Daily Life
Interviews

Match SCD type to your reporting needs

Per-Attribute Strategy

The most important concept: SCD type is chosen per attribute, not per table. A single dim_customer can have Type 1 on name (typo corrections do not need history), Type 2 on region (revenue attribution needs point-in-time accuracy), and Type 3 on segment (before/after comparison for a migration).
The decision for each attribute: does anyone need the historical value? If the answer is 'yes, for reporting or audit,' use Type 2. If the answer is 'no, and keeping it creates useless rows,' use Type 1. If the answer is 'only the immediately previous value, for a specific comparison,' Type 3.
Type 1Type 2Type 3
Type 1
Corrections and Contact Info
Fix a typo in a name. Update a phone number. Change an email. Nobody reports by historical phone numbers.
Type 2
Anything That Affects Reporting
Customer region, product category, subscription tier, employee department. Revenue and metrics must be attributed to the value at the time of the event.
Type 3
Before/After Comparisons
Customer migrated from segment A to segment B. Need to compare behavior in both segments. Rare use case.
AttributeSCD TypeWhy
Customer nameType 1Name corrections have no analytical impact
Customer regionType 2Revenue must be attributed to region at time of purchase
Customer emailType 1Nobody reports by historical email
Product categoryType 2Category-level metrics need point-in-time accuracy
Employee titleType 2Headcount reports need to reflect title at each point in time
Customer segment (migration)Type 3Need before/after comparison for a one-time migration analysis
Common Mistake
  • Apply Type 2 to every attribute
  • Table grows 40x because phone numbers change
  • Queries slow because dimension is bloated
Best Practice
  • Choose type per attribute
  • Type 1 for corrections and contact info
  • Type 2 only for attributes that affect reporting
TIP
Start every dimension design by listing the attributes and classifying each one: Type 1 or Type 2? This takes 5 minutes and prevents months of either lost history (everything Type 1) or table bloat (everything Type 2).
PUTTING IT ALL TOGETHER

> You are designing dim_employee for an HR analytics warehouse. Employees change departments, titles, and phone numbers over time.

Phone number: Type 1. Nobody reports by historical phone number. Overwrite is correct.
Department and title: Type 2. Headcount reports need to show the employee's department at each point in time. A surrogate key per version, with effective_from/to dates.
You validate: a dbt test checks that every (employee_id, effective_from) pair is unique. Another test checks that is_current = TRUE rows have effective_to = '9999-12-31'.
KEY TAKEAWAYS
Overwriting destroys history: Type 1 is simple but retroactively changes historical reports
Type 2 preserves everything: new row per change, surrogate key per version, point-in-time joins
Type 3 is limited: only the previous value, only for before/after comparisons
Choose per attribute: Type 1 for corrections, Type 2 for reporting attributes, Type 3 for migrations
9999-12-31 sentinel: universal convention for 'this row is still current'

When yesterday's truth still matters

Category
Data Modeling
Duration
17 minutes
Challenges
12 hands-on challenges

Topics covered: The History Problem, Type 1: Overwrite Old Value, Type 2: Preserve Full History, Type 3: Keep Previous Value, Choosing Your Strategy

Lesson Sections

  1. The History Problem

    Why Overwriting Destroys Information When a dimension attribute changes, the simplest approach is to overwrite the old value with the new one. Customer moved to Seattle? UPDATE customers SET city = 'Seattle'. Done. Simple. But now every fact row that referenced this customer while they lived in Portland shows 'Seattle.' Last quarter's revenue by region report just retroactively changed. Portland lost revenue it actually had. Seattle gained revenue it did not earn. This is the slowly changing dim

  2. Type 1: Overwrite Old Value (concepts: dmScdType1)

    When History Does Not Matter Type 1 simply overwrites the old value with the new one. The customer moved to Seattle? UPDATE the city column. The old value (Portland) is gone forever. This is the right choice for attributes where nobody ever needs the historical value: typo corrections, phone number updates, name changes due to marriage. Type 1 is not lazy modeling. It is the correct choice when historical values have no analytical meaning. Applying Type 2 to a phone number creates dimension rows

  3. Type 2: Preserve Full History (concepts: dmScdType2)

    Keeping Every Version of a Dimension Row Type 2 creates a new row for every change. The old row is marked as expired. The new row is marked as current. Both rows have different surrogate keys. Fact rows from before the change point to the old SK. Fact rows after the change point to the new SK. Revenue is attributed to the correct city at the time of each transaction. Two rows, same customer_id, different surrogate keys. Fact rows from January 2024 join to SK 1001 (Portland). Fact rows from July

  4. Type 3: Keep Previous Value (concepts: dmScdType3)

    Limited History: Just the Last Two States Type 3 adds a 'previous' column alongside the current column. customer has city and previous_city. When the customer moves, the current city becomes the previous city, and the new city becomes current. Only one prior value is preserved. History beyond the immediately-previous state is lost. Type 3 is rarely used in practice. It is useful for exactly one scenario: before-and-after comparisons. 'Show me customers who changed region this quarter.' Beyond th

  5. Choosing Your Strategy (concepts: dmScdStrategy)

    Per-Attribute Strategy The most important concept: SCD type is chosen per attribute, not per table. A single dim_customer can have Type 1 on name (typo corrections do not need history), Type 2 on region (revenue attribution needs point-in-time accuracy), and Type 3 on segment (before/after comparison for a migration). The decision for each attribute: does anyone need the historical value? If the answer is 'yes, for reporting or audit,' use Type 2. If the answer is 'no, and keeping it creates use