Data Modeling · Interview concept

SCD Type 2: Implementation and Point-in-Time Queries

SCD Type 2 keeps a new row for every version of a dimension attribute, enabling point-in-time accuracy. It is the default senior-DE expectation and the most-asked data modeling interview topic.

Try itDesign the Type 2 dimension: drag columns, draw the FK
drag • connect • edit

fact_order should reference dim_customer on customer_key (surrogate), not customer_id (natural). Draw the FK connection on the correct column.

Loading canvas...

Overview

What SCD Type 2 is and why it matters

Type 2 inserts a new row every time a tracked attribute changes. The old row is closed by setting valid_to to the change timestamp and is_current to 0. The new row carries a fresh surrogate key, the updated values, valid_from set to the change timestamp, valid_to NULL, and is_current 1.

The payoff: every historical state of every dimension member is queryable. Facts join to the version that was current when the fact occurred, not the version current today.

Syntax

SCD Type 2 reference

TokenMeaning
customer_keySurrogate key. Warehouse-generated, unique per version. Fact tables reference this.
customer_idNatural key. Stable across versions. Used for lookups and upstream identification.
valid_from / valid_toHalf-open interval [from, to). valid_to NULL means currently valid.
is_currentConvenience flag. 1 on the current row, 0 on historical rows. Redundant with valid_to IS NULL but speeds up current-state queries.
Prepare for the interview
01 / Open invite
02min.

Know SCD Type 2 the way the interviewer who asks it knows it.

a SCD Type 2 query, the same shape a screen would give you.
The diff against expected. Where ties broke. What you missed.
sandbox
1fact_orders
2 order_id bigint PK
3 customer_sk bigint FK
4 order_date date SCD2
5
Execute your solution0.4s avg.
DatabricksInterview question
Solve a SCD Type 2 problem

Patterns

How SCD Type 2 shows up in practice

Close-and-insert pattern

The two-statement Type 2 update. Run in a transaction so the dimension never has two current rows for the same natural key.

1BEGIN ; UPDATE dim_customer SET valid_to = '2025-06-01', is_current = 0
2
3
4WHERE customer_id = 101 AND is_current = 1 ; INSERT INTO dim_customer(customer_key, customer_id, name, plan_tier, country, valid_from, valid_to, is_current) VALUES(2, 101, 'Alice', 'pro', 'US', '2025-06-01', NULL, 1) ; COMMIT ;

Interview noteInterviewers watch for the transaction. Without it, a crash between statements leaves the dimension in an inconsistent state with zero current rows.

MERGE-based Type 2 (Snowflake/BigQuery/Databricks)

Production Type 2 is usually implemented via MERGE. Snowflake supports a single-statement form; other engines need a close + insert split.

1MERGE INTO dim_customer tgt
2 USING stg_customer src
3 ON tgt.customer_id = src.customer_id AND tgt.is_current = 1 WHEN MATCHED AND(tgt.name <> src.name OR tgt.plan_tier <> src.plan_tier OR tgt.country <> src.country) THEN UPDATE SET valid_to = CURRENT_TIMESTAMP, is_current = 0 ; INSERT INTO dim_customer(...)
4
5SELECT
6 ...
7FROM stg_customer s
8LEFT JOIN dim_customer d
9 ON d.customer_id = s.customer_id AND d.is_current = 1
10WHERE d.customer_id IS NULL OR d.name <> s.name OR d.plan_tier <> s.plan_tier OR d.country <> s.country ;

Point-in-time fact-to-dim join

The reason Type 2 exists. Join each fact to the dimension version that was valid when the fact happened.

1/* fact_order modeled by transactions (the fact), dim_customer by a */
2/* Type 2 view over users (account_status as the tracked attribute). */
3WITH dim_customer AS (
4 SELECT
5 user_id AS customer_id,
6 account_status AS plan_tier,
7 signup_date AS valid_from,
8 NULL AS valid_to
9 FROM users
10)
11
12SELECT
13 d.plan_tier,
14 SUM(f.total_amount) AS revenue
15FROM transactions AS f
16INNER JOIN dim_customer AS d
17 ON f.user_id = d.customer_id
18 AND f.transaction_date >= d.valid_from
19 AND (
20 f.transaction_date < d.valid_to
21 OR d.valid_to IS NULL
22)
23GROUP BY d.plan_tier

Interview noteBest practice: capture the surrogate key at fact-load time so the fact carries customer_key directly. This avoids the range predicate and is far faster at scale.

Current-state query (the common case)

Most analytics queries just want the current state. The is_current flag makes this a simple equality predicate.

1/* dim_customer modeled as a Type 2 view over users; the current */
2/* version is the row whose valid_to is still open (is_current = 1). */
3WITH dim_customer AS (
4 SELECT
5 user_id AS customer_id,
6 username AS name,
7 account_status AS plan_tier,
8 age_bucket AS country,
9 1 AS is_current
10 FROM users
11)
12
13SELECT
14 customer_id,
15 name,
16 plan_tier,
17 country
18FROM dim_customer
19WHERE is_current = 1

dbt snapshot implementation

dbt's snapshot feature is the standard Type 2 implementation in modern warehouses. dbt manages the close-and-insert logic automatically.

1-- snapshots/dim_customer.sql
2{% snapshot dim_customer %}
3{{
4 config(
5 target_schema='snapshots',
6 unique_key='customer_id',
7 strategy='check',
8 check_cols=['name', 'plan_tier', 'country']
9 )
10}}
11SELECT * FROM {{ source('crm', 'customers') }}
12{% endsnapshot %}

Interview notedbt uses dbt_valid_from / dbt_valid_to columns. Know the naming convention; interviewers will test whether you shipped dbt in production or only read about it.

Interview questions

SCD Type 2 interview questions

Q1.Walk me through how you would implement SCD Type 2 for a customer dimension in SQL.

What they testLive implementation. Watch for the transaction, the close-old-row step, and surrogate key generation.

ApproachStart with the schema: customer_key (surrogate, PK), customer_id (natural), business attributes, valid_from, valid_to, is_current. On a change: BEGIN transaction, UPDATE current row to set valid_to = now and is_current = 0, INSERT new row with next surrogate key and valid_to NULL. COMMIT.

Q2.Given a fact table and a Type 2 dimension, write the join so each fact picks up the dimension version valid when the fact happened.

What they testThe point-in-time join is the single most common SCD question. Candidates who stumble here have not shipped a warehouse.

ApproachJOIN ON natural_key AND fact.ts >= dim.valid_from AND (fact.ts < dim.valid_to OR dim.valid_to IS NULL). Alternatively, if the fact-load process captured the surrogate key at write time, JOIN ON customer_key directly, faster and cleaner.

Q3.What's the trade-off between tracking an attribute as Type 2 vs Type 1?

What they testDesign judgment. Type 2 is not free, it multiplies row count. Know when history matters and when it doesn't.

ApproachType 2 preserves history at the cost of row count and query complexity. Use it for business-meaningful attributes (plan tier, pricing, territory). Type 1 is fine for data-quality fixes (typo in a name, corrected address). Some dimensions use Type 6. Type 2 rows plus a Type 1 current-value column.

Q4.Your Type 2 dimension keeps growing to 500M rows for 10M customers. What went wrong?

What they testProduction debugging. Usually the cause is a high-churn attribute tracked as Type 2 when it should be on the fact.

ApproachAudit which attributes are changing most often. If last_login_at is tracked as Type 2, every login creates a new dim row. Move it to fact_session or drop it from the dimension entirely. Reserve Type 2 for attributes that change rarely and where history has business value.

Q5.How do you detect a data bug where a Type 2 dimension has two current rows for the same natural key?

What they testData quality thinking. The invariant is exactly one is_current=1 per natural key, always.

ApproachQuery: SELECT natural_key, COUNT(*) FROM dim WHERE is_current = 1 GROUP BY natural_key HAVING COUNT(*) > 1. Enforce it with a unique constraint on (natural_key) WHERE is_current = 1 (partial index in Postgres). Run as a dbt test on every dbt run.

Q6.Why does the fact table reference customer_key (surrogate) instead of customer_id (natural)?

What they testThe single most important Type 2 invariant. Fact-to-surrogate references lock facts to specific dimension versions.

ApproachA fact created on 2025-03-01 should reference the dimension version valid on 2025-03-01 forever, even after the dimension updates. Storing customer_key captures that binding at write time. Storing customer_id would cause the fact's joined attributes to silently change whenever the dimension updates.

Common mistakes

What breaks in practice

Two current rows for the same natural key

Non-transactional updates or a bug in the close-old-row step can leave two is_current=1 rows. Every downstream current-state query returns duplicates.

Wrong

1INSERT INTO dim_customer(...) VALUES(..., valid_from = now, valid_to = NULL, is_current = 1) ;

Right

1BEGIN ; UPDATE dim_customer SET valid_to = now, is_current = 0
2WHERE customer_id = ? AND is_current = 1 ; INSERT INTO dim_customer(...) VALUES(...) ; COMMIT ;

Joining facts to the dimension on natural key without a time predicate

Without the valid_from/valid_to predicate, each fact joins to every historical version of the dimension, row explosion.

Wrong

1/* Fan-out: 1 fact row × 3 dim versions = 3 rows */
2SELECT
3 *
4FROM fact_order AS f
5INNER JOIN dim_customer AS d
6 ON f.customer_id = d.customer_id

Right

1SELECT
2 *
3FROM fact_order AS f
4INNER JOIN dim_customer AS d
5 ON f.customer_id = d.customer_id
6 AND f.order_ts >= d.valid_from
7 AND (
8 f.order_ts < d.valid_to
9 OR d.valid_to IS NULL
10)

Tracking high-churn attributes as Type 2

Attributes like last_login_at, session_count, or updated_at change on every event. Tracking them as Type 2 multiplies dimension row count into the billions.

Wrong

1'CREATE TABLE dim_customer ( . . ., last_login_at TIMESTAMP, . . . )'
2/* Every login creates a new dim row */
3/* tracked as Type 2 */

Right

1'CREATE TABLE fact_session ( customer_key INT, session_start TIMESTAMP, . . . )'
2/* Move to a fact table */
3/* Dimension stays lean */
Prepare for the interview
03 / From the bank03 of many
03hand-picked.

The Plan That Changed Twice This Month

Medium30 min

Subscribers come, go, downgrade, and share. The schema has to keep up.

Pulled from debriefs where modeling rounds went sideways.

FAQ

Common questions

What is SCD Type 2?
SCD Type 2 is a dimensional modeling strategy that inserts a new row every time a tracked attribute changes, preserving full history. Each row carries a unique surrogate key plus valid_from, valid_to, and is_current columns.
How do you query a Type 2 dimension as of a specific date?
Use the temporal range predicate: WHERE <date> >= valid_from AND (<date> < valid_to OR valid_to IS NULL). This returns the version that was current on the given date.
Should fact tables reference the natural key or surrogate key?
The surrogate key, captured at fact-load time. This locks the fact to a specific dimension version and makes joins a simple equality rather than a range predicate.
How does dbt implement SCD Type 2?
dbt snapshots. Define a snapshot with unique_key and a strategy (timestamp or check). dbt runs the snapshot periodically and maintains dbt_valid_from and dbt_valid_to automatically.
When should I use Type 2 vs Type 1?
Type 2 when business cares about history, plan tier changes, pricing updates, territory reassignments. Type 1 when the change is a data-quality fix and history is noise. Default to Type 2 for business attributes.