Data Modeling

Slowly Changing Dimensions: SCD Type 1, 2, 3 Explained

Slowly changing dimensions test whether you understand how data warehouses track historical changes. Interviewers want to see trade-off reasoning: when do you preserve history, when do you overwrite, and why does it matter for downstream analytics?

SCD questions appear in roughly a third of data modeling interview segments. Type 2 is the most frequently tested, but knowing when Type 1 or Type 3 is the better choice separates strong candidates from those who memorized definitions.

What Are Slowly Changing Dimensions?

A slowly changing dimension is a dimension table in a data warehouse where attribute values change over time, but not with every transaction. Customer addresses change when people move. Employee job titles change with promotions. Product categories change when the catalog is reorganized. These changes are "slow" compared to fact data, which changes with every order, click, or event.

The challenge is deciding how to handle these changes in your warehouse. Do you overwrite the old value and lose history? Do you keep every version and accept the storage and query complexity? Do you track just the previous value? Each approach has a name (Type 0 through Type 6), specific implementation mechanics, and distinct trade-offs that interviewers expect you to articulate.

The concept comes from Ralph Kimball's dimensional modeling methodology and remains one of the most practical and frequently tested topics in data engineering interviews. Even teams using modern lakehouse patterns face the same fundamental question: when a dimension attribute changes, what do you do with the old value?

SCD Types Explained

Each SCD type makes a different trade-off between simplicity, storage cost, query complexity, and historical completeness. Interviewers expect you to explain the trade-off, not just describe the mechanism.

Type 0: Retain Original

Never changes

The attribute value is set at insert time and never updated, regardless of what happens in the source system. This is appropriate for attributes that represent a historical fact at a point in time, where the original value has permanent business meaning.

Example

A customer's original_signup_date or original_acquisition_channel. Even if the customer later changes their behavior, the original value represents how they first arrived and should never be overwritten.

Interview note

Rarely the focus of a question, but mentioning Type 0 when comparing SCD types signals that you understand the full spectrum. It shows you have thought about which attributes should be immutable.

Type 1: Overwrite

No history

The old value is replaced with the new value in place. No record of the previous value is kept. The dimension table always reflects the current state of the entity. This is the simplest approach and the cheapest in storage, but it rewrites history: any fact records joined to this dimension will now reflect the new attribute value, even for events that occurred before the change.

Example

A customer corrects a typo in their name. There is no analytical reason to preserve the misspelled version. Type 1 overwrites it, and all historical transactions now show the correct name. Another example: standardizing a phone number format from '(555) 123-4567' to '+15551234567'. The old format was never analytically meaningful.
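The overwrite mechanics are trivial by design. A minimal Python sketch of the idea, with dicts standing in for table rows (all names are illustrative, not from any particular warehouse):

```python
# Minimal SCD Type 1 sketch: overwrite in place, keep no history.
# Rows are plain dicts keyed by the natural key.

def apply_type1(dim, customer_id, **changes):
    """Overwrite attributes for one entity; the old values are gone."""
    dim[customer_id].update(changes)

dim_customer = {42: {"name": "Jon Smiht", "phone": "(555) 123-4567"}}

# Fix the typo and standardize the phone format.
apply_type1(dim_customer, 42, name="Jon Smith", phone="+15551234567")
```

In SQL this is just an `UPDATE ... WHERE customer_id = ...`; there is no versioning machinery to maintain.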

Interview note

Interviewers test whether you understand the trade-off, not just the mechanism. The right answer depends on whether the attribute has analytical significance in its historical state. If it does, Type 1 is wrong. If it does not, Type 2 is over-engineering.

Type 2: Add New Row

Full history

When an attribute changes, the current row is closed (effective_to set to yesterday, is_current set to false) and a new row is inserted with the updated value (effective_from set to today, is_current set to true). Each version of the entity gets its own surrogate key. The natural key remains the same across all versions, linking them together. This is the most common SCD type in production data warehouses.

Example

A customer moves from New York to Chicago on March 15. The existing row gets effective_to = '2026-03-14' and is_current = false. A new row is inserted with city = 'Chicago', effective_from = '2026-03-15', effective_to = '9999-12-31', is_current = true. A new surrogate key is assigned. Fact records from before the move still join to the New York version via the old surrogate key.
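The same March 15 move can be sketched in Python, with rows as dicts standing in for the UPDATE-then-INSERT the warehouse would run (column and function names are illustrative):

```python
from datetime import date, timedelta

FAR_FUTURE = date(9999, 12, 31)  # sentinel effective_to for the current row

def apply_type2(rows, customer_id, changes, change_date):
    """Close the current version and append a new one (SCD Type 2)."""
    current = next(r for r in rows
                   if r["customer_id"] == customer_id and r["is_current"])
    current["effective_to"] = change_date - timedelta(days=1)
    current["is_current"] = False
    rows.append({**current, **changes,
                 "customer_sk": max(r["customer_sk"] for r in rows) + 1,
                 "effective_from": change_date,
                 "effective_to": FAR_FUTURE,
                 "is_current": True})

rows = [{"customer_sk": 1, "customer_id": 42, "city": "New York",
         "effective_from": date(2024, 1, 1),
         "effective_to": FAR_FUTURE, "is_current": True}]

apply_type2(rows, 42, {"city": "Chicago"}, date(2026, 3, 15))
# rows[0] is the closed New York version; rows[1] is the current Chicago one.
```

Note that the natural key (customer_id = 42) appears on both rows, while each version gets its own surrogate key.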

Interview note

This is the SCD type interviewers ask about most. Be ready to explain: (1) why surrogate keys are necessary (natural keys are no longer unique), (2) how fact tables reference the correct version, (3) the effective_to = '9999-12-31' convention for the current row, and (4) how this affects query patterns (always filter on is_current for current-state queries).

Type 3: Add New Column

Limited history

Instead of adding a new row, you add a new column to store the previous value. The table maintains exactly one level of history: current and previous. If the attribute changes again, the oldest value is lost. The table grows wider (more columns) but not taller (more rows).

Example

A customer changes sales regions from 'West' to 'Central'. The table has columns current_region and previous_region. If they later move to 'East', current_region becomes 'East', previous_region becomes 'Central', and 'West' is lost. This works when the business only needs before/after comparison, not full history.
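The column-shift mechanics, as a small Python sketch (dicts standing in for a row; names are illustrative):

```python
def apply_type3(row, new_region):
    """Shift current into previous; only one level of history survives."""
    row["previous_region"] = row["current_region"]
    row["current_region"] = new_region

row = {"customer_id": 42, "current_region": "West", "previous_region": None}
apply_type3(row, "Central")   # West -> Central
apply_type3(row, "East")      # Central -> East; 'West' is now lost
```

The table never gains rows; each change simply shuffles values across the fixed pair of columns.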

Interview note

Less common in practice than Types 1 and 2. But knowing when Type 3 makes sense (simple before/after analysis, limited storage budget, stable attributes that change at most once) demonstrates that you select the right tool rather than always defaulting to Type 2.

Type 4: Mini-Dimension (History Table)

Separate history

The main dimension table always contains only the current state. A separate history table stores all previous versions with timestamps. This keeps the primary dimension table compact and fast for current-state queries while preserving full history for audit or analysis. Queries that need historical context join to the history table explicitly.

Example

The customer table has one row per customer with current attributes. A customer_history table has one row per change, with change_timestamp, changed_attribute, old_value, and new_value. Current-state dashboards query only the customer table. Compliance audits query the history table.
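A Python sketch of the two-table split described above (dicts and lists standing in for the current and history tables; names are illustrative):

```python
# SCD Type 4 sketch: the main table holds only current state; every
# change is appended to a separate history table.

def apply_type4(current, history, customer_id, attribute, new_value, ts):
    """Overwrite the main dimension and log the change to history."""
    old_value = current[customer_id][attribute]
    current[customer_id][attribute] = new_value
    history.append({
        "customer_id": customer_id,
        "change_timestamp": ts,
        "changed_attribute": attribute,
        "old_value": old_value,
        "new_value": new_value,
    })

customer = {42: {"city": "New York"}}
customer_history = []

apply_type4(customer, customer_history, 42, "city", "Chicago",
            "2026-03-15T00:00:00")
```

Current-state queries touch only `customer`; audits replay `customer_history`.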

Interview note

Type 4 is the most practical approach in many production systems, especially when the dimension is large and only a small percentage of queries need historical data. Mentioning it distinguishes you from candidates who only know Types 1, 2, and 3.

Type 6: Hybrid (1 + 2 + 3)

Combined approach

Combines Types 1, 2, and 3 in a single table. A new row is added for each change (Type 2). The new row includes a previous_value column (Type 3). A current_value column on ALL rows for that entity is overwritten to the latest value (Type 1). This gives consumers three options: query the current value from any row (Type 1 column), compare current to previous (Type 3 columns), or traverse the full history (Type 2 row versions).

Example

A customer moves from New York to Chicago. A new row is inserted (Type 2) with city = 'Chicago' and previous_city = 'New York' (Type 3). The current_city column on ALL rows for this customer, including the old New York row, is updated to 'Chicago' (Type 1). Any row for this customer now tells you both the historical value for that version and the current value.
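The three mechanisms acting together can be sketched in Python (rows as dicts; all names are illustrative):

```python
from datetime import date, timedelta

FAR_FUTURE = date(9999, 12, 31)

def apply_type6(rows, customer_id, new_city, change_date):
    entity_rows = [r for r in rows if r["customer_id"] == customer_id]
    current = next(r for r in entity_rows if r["is_current"])
    current["effective_to"] = change_date - timedelta(days=1)  # Type 2: close
    current["is_current"] = False
    rows.append({"customer_id": customer_id,
                 "city": new_city,                  # Type 2: new version row
                 "previous_city": current["city"],  # Type 3: one step back
                 "current_city": new_city,
                 "effective_from": change_date,
                 "effective_to": FAR_FUTURE,
                 "is_current": True})
    for r in entity_rows:       # Type 1: overwrite on every old version
        r["current_city"] = new_city

rows = [{"customer_id": 42, "city": "New York", "previous_city": None,
         "current_city": "New York", "effective_from": date(2024, 1, 1),
         "effective_to": FAR_FUTURE, "is_current": True}]

apply_type6(rows, 42, "Chicago", date(2026, 3, 15))
```

After the change, even the closed New York row carries current_city = 'Chicago', which is exactly what makes Type 6 queries flexible.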

Interview note

Advanced topic. Most interviewers will be impressed if you can explain the rationale: Type 6 lets different queries use the same table without trade-offs. Do not lead with it. Bring it up if the interviewer asks about hybrid approaches or if you are explaining trade-offs between types.

How Interviewers Test SCDs

Interviewers do not want you to recite definitions. They want to see that you can select the right SCD type for a given scenario and defend your choice. The strongest signal is trade-off reasoning: "I would use Type 1 here because the business does not need historical phone number data, and Type 2 would add row proliferation with no analytical value."

They also test implementation depth. Can you write the MERGE statement? Do you understand surrogate keys vs natural keys? Can you handle edge cases like late-arriving dimension changes or the initial load? Can you diagnose performance problems in a large Type 2 table?

The common pitfall: candidates who default to Type 2 for every attribute. In a real warehouse, a customer dimension might have 30 attributes. Tracking full history for all 30 creates enormous tables with minimal analytical value. Experienced engineers apply Type 2 selectively to the 3 or 4 attributes that matter for historical analysis and use Type 1 for the rest. This pragmatism is exactly what interviewers want to hear.

Interview Questions with Guidance

These are the SCD questions that come up most frequently. For each one, the guidance shows how a strong candidate structures the answer.

Implement SCD Type 2 for a customer dimension table.

Define the table with: customer_sk (surrogate key, auto-increment), customer_id (natural key), name, city, state, effective_from (date), effective_to (date, default '9999-12-31'), is_current (boolean, default true). The load process: (1) Compare incoming data against current dimension rows (WHERE is_current = true) on the natural key. (2) For changed rows, UPDATE the existing current row to set effective_to = yesterday and is_current = false. (3) INSERT a new row with the updated attributes, effective_from = today, effective_to = '9999-12-31', is_current = true. (4) For new customers, INSERT directly with is_current = true.

When would you choose Type 1 over Type 2?

When the attribute has no analytical significance in its historical state. Three categories: (1) Data corrections, like fixing a misspelled name or standardizing a phone format. The old value was wrong, so preserving it adds no value. (2) Attributes that are operationally useful but not analytically tracked, like a customer's login email. (3) Attributes where the business explicitly does not want historical reporting, like GDPR-driven updates where the old value must be removed. The key signal to the interviewer: you make this decision based on business requirements, not engineering convenience.

How do you handle late-arriving dimension changes?

A late-arriving dimension change means you learn about an attribute change after fact records for that period have already been loaded. For Type 2: you need to insert a backdated version of the dimension row with the correct effective dates, then potentially re-point fact records that were loaded during the affected period. For Type 1: the late arrival is handled by the normal overwrite. The complexity is in Type 2, because you must split an existing version into two periods and update fact table foreign keys. This is operationally expensive, which is why some teams choose Type 1 for attributes that frequently arrive late.
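The version-splitting step can be sketched in Python (rows as dicts; function and column names are illustrative, and the fact-table re-pointing is deliberately left out):

```python
from datetime import date, timedelta

FAR_FUTURE = date(9999, 12, 31)

def split_version(rows, customer_id, change_date, changes, next_sk):
    """Backdate a Type 2 change by splitting the version covering change_date.

    Fact rows loaded for the affected period still reference the old
    surrogate key and must be re-pointed separately (not shown).
    """
    hit = next(r for r in rows
               if r["customer_id"] == customer_id
               and r["effective_from"] <= change_date <= r["effective_to"])
    # The new version inherits the tail of the split period, including
    # is_current if the split row was the current one.
    new_row = {**hit, **changes, "customer_sk": next_sk,
               "effective_from": change_date}
    hit["effective_to"] = change_date - timedelta(days=1)
    hit["is_current"] = False
    rows.append(new_row)
    return new_row

rows = [{"customer_sk": 1, "customer_id": 42, "tier": "basic",
         "effective_from": date(2024, 1, 1),
         "effective_to": FAR_FUTURE, "is_current": True}]

# We learn in July that the customer upgraded back on June 1.
split_version(rows, 42, date(2025, 6, 1), {"tier": "premium"}, next_sk=2)
```

The re-pointing of fact foreign keys is what makes this operationally expensive in practice.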

Your Type 2 dimension has 500M rows and queries are slow. What do you do?

Five approaches, in order of impact: (1) Partition or cluster by is_current so current-state queries skip historical rows. This alone often solves the problem. (2) Create a view or materialized view filtered to is_current = true for downstream consumers. (3) Move to Type 4: separate current and historical data into two physical tables. (4) Archive very old versions beyond a retention window if the business does not need them. (5) Index the surrogate key and natural key combination for join performance. The interviewer wants to see that you diagnose the root cause (most queries only need current data) before jumping to solutions.

Design the ETL process that maintains a Type 2 dimension.

Step by step: (1) Stage incoming data in a temporary table. (2) LEFT JOIN staged data to the current dimension (WHERE is_current = true) on natural key. (3) Identify three categories: new rows (no match in dimension), unchanged rows (match with identical attributes), and changed rows (match with different attributes). (4) For changed rows, UPDATE the dimension to close the current version. (5) INSERT new versions for changed rows and entirely new rows. (6) Log metrics: rows inserted, rows updated, rows unchanged. The MERGE statement combines steps 3 through 5 into a single atomic operation in most warehouses.
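Step 3, the categorization, is where most of the logic lives. A Python sketch of it (lists of dicts standing in for the staged and current tables; names are illustrative):

```python
def categorize(staged, current, attrs=("name", "city", "state")):
    """Split staged rows into new / unchanged / changed vs. current rows."""
    current_by_key = {r["customer_id"]: r for r in current}
    new, unchanged, changed = [], [], []
    for s in staged:
        c = current_by_key.get(s["customer_id"])
        if c is None:
            new.append(s)
        elif all(s[a] == c[a] for a in attrs):
            unchanged.append(s)
        else:
            changed.append(s)
    return new, unchanged, changed

staged = [{"customer_id": 1, "name": "Ana", "city": "NYC", "state": "NY"},
          {"customer_id": 2, "name": "Bo", "city": "Austin", "state": "TX"},
          {"customer_id": 3, "name": "Cy", "city": "Reno", "state": "NV"}]
current = [{"customer_id": 1, "name": "Ana", "city": "NYC", "state": "NY"},
           {"customer_id": 2, "name": "Bo", "city": "Dallas", "state": "TX"}]

new, unchanged, changed = categorize(staged, current)
```

Customer 3 is new, customer 1 is unchanged, and customer 2 changed city, so it gets a closed old version plus a new row.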

What is the difference between a surrogate key and a natural key in the context of SCDs?

A natural key identifies the real-world entity (customer_id = 12345). A surrogate key identifies a specific version of that entity in the dimension table (customer_sk = 98765). In Type 2, one natural key maps to many surrogate keys (one per version). Fact tables must reference the surrogate key to join to the correct historical version. If fact tables reference the natural key, every fact row joins to every version of the dimension, producing incorrect results. This distinction is the most common source of bugs in Type 2 implementations.
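The fan-out bug is easy to demonstrate in a few lines of Python (lists of dicts standing in for the fact and dimension tables; values are illustrative):

```python
# Two versions of customer 42 in a Type 2 dimension.
dim = [{"customer_sk": 1, "customer_id": 42, "city": "New York"},
       {"customer_sk": 2, "customer_id": 42, "city": "Chicago"}]
# One $100 order, placed while the customer lived in New York.
facts = [{"customer_sk": 1, "customer_id": 42, "revenue": 100.0}]

# Wrong: natural-key join matches every version of the customer.
wrong = [(f, d) for f in facts for d in dim
         if f["customer_id"] == d["customer_id"]]
# Right: surrogate-key join matches exactly one version.
right = [(f, d) for f in facts for d in dim
         if f["customer_sk"] == d["customer_sk"]]

wrong_revenue = sum(f["revenue"] for f, d in wrong)  # 200.0 -- doubled
right_revenue = sum(f["revenue"] for f, d in right)  # 100.0 -- correct
```

One address change has silently doubled reported revenue, which is why this bug tends to surface as an inflated metric rather than an error.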

How do SCDs interact with dbt snapshots?

dbt snapshots implement SCD Type 2 declaratively. You configure the source table, unique key, and change detection strategy (timestamp-based or check-based). dbt automatically manages surrogate keys, effective dates (dbt_valid_from, dbt_valid_to), and current flags. Mentioning dbt snapshots shows practical experience. But also note the limitations: dbt snapshots run at scheduled intervals, so they capture changes at snapshot time, not at the exact moment the source changed. For real-time SCD tracking, you need CDC-based approaches outside of dbt.

SQL: SCD Type 2 MERGE Pattern

This is the standard MERGE pattern for maintaining an SCD Type 2 dimension. Most warehouses (Snowflake, BigQuery, Databricks) support MERGE syntax. As written, the pattern uses two passes: one to close existing versions and insert new customers, and a second to insert new versions for changed rows.

-- SCD Type 2 MERGE pattern (Snowflake syntax; in BigQuery, write
-- DATE_SUB(CURRENT_DATE, INTERVAL 1 DAY) instead of CURRENT_DATE - 1)
MERGE INTO dim_customer AS target
USING staging_customer AS source
ON target.customer_id = source.customer_id
   AND target.is_current = TRUE

-- Close the existing version when attributes have changed
WHEN MATCHED
  AND (target.city <> source.city OR target.state <> source.state)
THEN UPDATE SET
  effective_to = CURRENT_DATE - 1,
  is_current = FALSE

-- Insert new customers that do not exist yet
WHEN NOT MATCHED
THEN INSERT (customer_id, name, city, state,
             effective_from, effective_to, is_current)
VALUES (source.customer_id, source.name, source.city,
        source.state, CURRENT_DATE, '9999-12-31', TRUE);

-- Second pass: insert new versions for changed rows
INSERT INTO dim_customer
  (customer_id, name, city, state,
   effective_from, effective_to, is_current)
SELECT s.customer_id, s.name, s.city, s.state,
       CURRENT_DATE, '9999-12-31', TRUE
FROM staging_customer s
JOIN dim_customer d
  ON s.customer_id = d.customer_id
  AND d.effective_to = CURRENT_DATE - 1
  AND d.is_current = FALSE;

Note: Some warehouses support multi-action MERGE (matching the same row for both UPDATE and INSERT). In those cases, you can combine this into a single statement. The two-pass approach shown here works everywhere.

Common Mistakes Interviewers Catch

Joining fact tables to Type 2 dimensions on the natural key

If your fact table joins to dim_customer on customer_id, every fact row matches every version of that customer. Revenue gets multiplied by the number of versions the customer has accumulated. You must join on the surrogate key (customer_sk) to get the correct version. This is the single most common SCD implementation bug.

Forgetting to set effective dates on the initial load

The first time you populate a Type 2 dimension, every row needs effective_from set to the earliest relevant date (often the warehouse start date), effective_to set to '9999-12-31', and is_current set to true. If you skip this, you have NULL date ranges that break every downstream query using BETWEEN logic on effective dates.

Applying Type 2 to every attribute indiscriminately

Not every attribute change deserves a new version. If a customer updates their phone number, do you need a new row? Probably not. Track only the attributes that have analytical significance for historical reporting. Use Type 1 for the rest. Otherwise, your dimension table grows explosively and most of the versions add no analytical value.

Not indexing is_current or effective date columns

Without an index or partition on is_current, every current-state query scans the full table, including millions of historical rows. In a table with 500M total rows and 10M current rows, this turns a sub-second query into a 30-second scan. Partition by is_current or create a filtered index.

Frequently Asked Questions

What is a slowly changing dimension?
A slowly changing dimension (SCD) is a dimension table in a data warehouse where attribute values change over time, but infrequently. The classic example is a customer's address: it might change once every few years, compared to transactional data that changes with every order. The 'slowly' part distinguishes these from rapidly changing attributes (like stock prices or sensor readings) that require different handling patterns such as mini-dimensions or real-time streaming.
Which SCD type is most common in production?
SCD Type 2 is the most widely implemented in production data warehouses. It preserves full history by adding new rows with effective date ranges. Many production systems use a hybrid approach: Type 2 for analytically significant attributes (address, job title, account tier) and Type 1 for attributes where history is irrelevant (phone format, typo corrections). This selective application of SCD types is what experienced engineers do in practice.
Do data engineering interviews always ask about SCDs?
SCDs appear in roughly a third of data modeling interview segments. They are a staple at companies that use data warehouses, especially those with Kimball-style dimensional models. If the job description mentions data warehousing, dimensional modeling, or analytics engineering, expect SCD questions. Even at companies using modern lakehouse patterns, the concept of tracking historical changes in dimension tables remains relevant.
How do SCDs work with dbt?
dbt has a built-in snapshot feature that implements SCD Type 2 automatically. You define the source table, the unique key, and the change detection strategy (timestamp or check). dbt manages the effective date columns (dbt_valid_from, dbt_valid_to) and tracks whether each row is current. Mentioning dbt snapshots in an interview shows practical experience. But also understand the limitation: dbt snapshots capture state at scheduled run times, not at the exact moment of change.
What is the '9999-12-31' convention in SCD Type 2?
The far-future date '9999-12-31' is used as the effective_to value for the current version of a dimension row. This convention makes date range queries simpler: you can write WHERE event_date BETWEEN effective_from AND effective_to without special-casing NULL values for the current row. Some teams use NULL instead, but the far-future date is more common because it avoids the need for COALESCE logic in every join.

Practice SCD Questions with Real SQL

DataDriven covers data modeling, SQL, and Python with hands-on challenges at interview difficulty. Build fluency in the concepts interviewers actually test.
