Data Modeling · Interview concept

SCD Type 2: Implementation and Point-in-Time Queries

SCD Type 2 keeps a new row for every version of a dimension attribute, enabling point-in-time accuracy. It is the default senior-DE expectation and the most-asked data modeling interview topic.

Try itDesign the Type 2 dimension: drag columns, draw the FK
drag • connect • edit

fact_order should reference dim_customer on customer_key (surrogate), not customer_id (natural). Draw the FK connection on the correct column.

Loading canvas...

Overview

What SCD Type 2 is and why it matters

Type 2 inserts a new row every time a tracked attribute changes. The old row is closed by setting valid_to to the change timestamp and is_current to 0. The new row carries a fresh surrogate key, the updated values, valid_from set to the change timestamp, valid_to NULL, and is_current 1.

The payoff: every historical state of every dimension member is queryable. Facts join to the version that was current when the fact occurred, not the version current today.

Syntax

SCD Type 2 reference

TokenMeaning
customer_keySurrogate key. Warehouse-generated, unique per version. Fact tables reference this.
customer_idNatural key. Stable across versions. Used for lookups and upstream identification.
valid_from / valid_toHalf-open interval [from, to). valid_to NULL means currently valid.
is_currentConvenience flag. 1 on the current row, 0 on historical rows. Redundant with valid_to IS NULL but speeds up current-state queries.

Patterns

How SCD Type 2 shows up in practice

Close-and-insert pattern

The two-statement Type 2 update. Run in a transaction so the dimension never has two current rows for the same natural key.

1BEGIN ; UPDATE dim_customer SET valid_to = '2025-06-01', is_current = 0
2
3
4WHERE customer_id = 101 AND is_current = 1 ; INSERT INTO dim_customer(customer_key, customer_id, name, plan_tier, country, valid_from, valid_to, is_current) VALUES(2, 101, 'Alice', 'pro', 'US', '2025-06-01', NULL, 1) ; COMMIT ;

Interview noteInterviewers watch for the transaction. Without it, a crash between statements leaves the dimension in an inconsistent state with zero current rows.

MERGE-based Type 2 (Snowflake/BigQuery/Databricks)

Production Type 2 is usually implemented via MERGE. Snowflake supports a single-statement form; other engines need a close + insert split.

1MERGE INTO dim_customer tgt
2 USING stg_customer src
3 ON tgt.customer_id = src.customer_id AND tgt.is_current = 1 WHEN MATCHED AND(tgt.name <> src.name OR tgt.plan_tier <> src.plan_tier OR tgt.country <> src.country) THEN UPDATE SET valid_to = CURRENT_TIMESTAMP, is_current = 0 ; INSERT INTO dim_customer(...)
4
5SELECT
6 ...
7FROM stg_customer s
8LEFT JOIN dim_customer d
9 ON d.customer_id = s.customer_id AND d.is_current = 1
10WHERE d.customer_id IS NULL OR d.name <> s.name OR d.plan_tier <> s.plan_tier OR d.country <> s.country ;

Point-in-time fact-to-dim join

The reason Type 2 exists. Join each fact to the dimension version that was valid when the fact happened.

1SELECT
2 d.plan_tier,
3 SUM(f.amount) AS revenue
4FROM fact_order AS f
5INNER JOIN dim_customer AS d
6 ON f.customer_id = d.customer_id
7 AND f.order_ts >= d.valid_from
8 AND (
9 f.order_ts < d.valid_to
10 OR d.valid_to IS NULL
11)
12GROUP BY d.plan_tier

Interview noteBest practice: capture the surrogate key at fact-load time so the fact carries customer_key directly. This avoids the range predicate and is far faster at scale.

Current-state query (the common case)

Most analytics queries just want the current state. The is_current flag makes this a simple equality predicate.

1SELECT
2 customer_id,
3 name,
4 plan_tier,
5 country
6FROM dim_customer
7WHERE is_current = 1

dbt snapshot implementation

dbt's snapshot feature is the standard Type 2 implementation in modern warehouses. dbt manages the close-and-insert logic automatically.

1-- snapshots/dim_customer.sql
2{% snapshot dim_customer %}
3{{
4 config(
5 target_schema='snapshots',
6 unique_key='customer_id',
7 strategy='check',
8 check_cols=['name', 'plan_tier', 'country']
9 )
10}}
11SELECT * FROM {{ source('crm', 'customers') }}
12{% endsnapshot %}

Interview notedbt uses dbt_valid_from / dbt_valid_to columns. Know the naming convention, interviewers will test whether you actually shipped dbt or just read about it.

Interview questions

SCD Type 2 interview questions

Q1.Walk me through how you would implement SCD Type 2 for a customer dimension in SQL.

What they testLive implementation. Watch for the transaction, the close-old-row step, and surrogate key generation.

ApproachStart with the schema: customer_key (surrogate, PK), customer_id (natural), business attributes, valid_from, valid_to, is_current. On a change: BEGIN transaction, UPDATE current row to set valid_to = now and is_current = 0, INSERT new row with next surrogate key and valid_to NULL. COMMIT.

Q2.Given a fact table and a Type 2 dimension, write the join so each fact picks up the dimension version valid when the fact happened.

What they testThe point-in-time join is the single most common SCD question. Candidates who stumble here have not shipped a warehouse.

ApproachJOIN ON natural_key AND fact.ts >= dim.valid_from AND (fact.ts < dim.valid_to OR dim.valid_to IS NULL). Alternatively, if the fact-load process captured the surrogate key at write time, JOIN ON customer_key directly, faster and cleaner.

Q3.What's the trade-off between tracking an attribute as Type 2 vs Type 1?

What they testDesign judgment. Type 2 is not free, it multiplies row count. Know when history matters and when it doesn't.

ApproachType 2 preserves history at the cost of row count and query complexity. Use it for business-meaningful attributes (plan tier, pricing, territory). Type 1 is fine for data-quality fixes (typo in a name, corrected address). Some dimensions use Type 6. Type 2 rows plus a Type 1 current-value column.

Q4.Your Type 2 dimension keeps growing to 500M rows for 10M customers. What went wrong?

What they testProduction debugging. Usually the cause is a high-churn attribute tracked as Type 2 when it should be on the fact.

ApproachAudit which attributes are changing most often. If last_login_at is tracked as Type 2, every login creates a new dim row. Move it to fact_session or drop it from the dimension entirely. Reserve Type 2 for attributes that change rarely and where history has business value.

Q5.How do you detect a data bug where a Type 2 dimension has two current rows for the same natural key?

What they testData quality thinking. The invariant is exactly one is_current=1 per natural key, always.

ApproachQuery: SELECT natural_key, COUNT(*) FROM dim WHERE is_current = 1 GROUP BY natural_key HAVING COUNT(*) > 1. Enforce it with a unique constraint on (natural_key) WHERE is_current = 1 (partial index in Postgres). Run as a dbt test on every dbt run.

Q6.Why does the fact table reference customer_key (surrogate) instead of customer_id (natural)?

What they testThe single most important Type 2 invariant. Fact-to-surrogate references lock facts to specific dimension versions.

ApproachA fact created on 2025-03-01 should reference the dimension version valid on 2025-03-01 forever, even after the dimension updates. Storing customer_key captures that binding at write time. Storing customer_id would cause the fact's joined attributes to silently change whenever the dimension updates.

Common mistakes

What breaks in practice

Two current rows for the same natural key

Non-transactional updates or a bug in the close-old-row step can leave two is_current=1 rows. Every downstream current-state query returns duplicates.

Wrong

1INSERT INTO dim_customer(...) VALUES(..., valid_from = now, valid_to = NULL, is_current = 1) ;

Right

1BEGIN ; UPDATE dim_customer SET valid_to = now, is_current = 0
2WHERE customer_id = ? AND is_current = 1 ; INSERT INTO dim_customer(...) VALUES(...) ; COMMIT ;

Joining facts to the dimension on natural key without a time predicate

Without the valid_from/valid_to predicate, each fact joins to every historical version of the dimension, row explosion.

Wrong

1SELECT
2 *
3FROM fact_order AS f
4INNER JOIN dim_customer AS d
5 ON f.customer_id = d.customer_id

Right

1SELECT
2 *
3FROM fact_order AS f
4INNER JOIN dim_customer AS d
5 ON f.customer_id = d.customer_id
6 AND f.order_ts >= d.valid_from
7 AND (
8 f.order_ts < d.valid_to
9 OR d.valid_to IS NULL
10)

Tracking high-churn attributes as Type 2

Attributes like last_login_at, session_count, or updated_at change on every event. Tracking them as Type 2 multiplies dimension row count into the billions.

Wrong

1'CREATE TABLE dim_customer ( . . ., last_login_at TIMESTAMP, . . . )'

Right

1'CREATE TABLE fact_session ( customer_key INT, session_start TIMESTAMP, . . . )'

FAQ

Common questions

What is SCD Type 2?
SCD Type 2 is a dimensional modeling strategy that inserts a new row every time a tracked attribute changes, preserving full history. Each row carries a unique surrogate key plus valid_from, valid_to, and is_current columns.
How do you query a Type 2 dimension as of a specific date?
Use the temporal range predicate: WHERE <date> >= valid_from AND (<date> < valid_to OR valid_to IS NULL). This returns the version that was current on the given date.
Should fact tables reference the natural key or surrogate key?
The surrogate key, captured at fact-load time. This locks the fact to a specific dimension version and makes joins a simple equality rather than a range predicate.
How does dbt implement SCD Type 2?
dbt snapshots. Define a snapshot with unique_key and a strategy (timestamp or check). dbt runs the snapshot periodically and maintains dbt_valid_from and dbt_valid_to automatically.
When should I use Type 2 vs Type 1?
Type 2 when business cares about history, plan tier changes, pricing updates, territory reassignments. Type 1 when the change is a data-quality fix and history is noise. Default to Type 2 for business attributes.

Continue your prep

Data Engineer Interview Prep, explore the full guide

50+ guides covering every round, company, role, and technology in the data engineer interview loop. Grounded in 2,817 verified interview reports across 929 companies, collected from real candidates.