SCD interview questions isolated from the data modeling catalog. Type 1 overwrite for typos, Type 2 row-per-version for history retention, Type 3 current-and-previous columns for two-state attributes. Merge logic in SQL and pandas, half-open joins to prevent boundary doubling, late-arriving dimensions with placeholder rows.
Slowly changing dimensions (SCD) handle the problem that dimension attributes change over time. A customer's address changes when they move. A merchant's payout schedule changes when they upgrade plans. A product's category changes when the catalog is reorganized. SCD types specify how the dimension table stores those changes, which determines what kinds of analytical queries downstream facts can answer.
Type 1 overwrite: the dim row is updated in place; old value is lost. Useful for typo corrections and for attributes whose history is irrelevant (a customer's preferred language, a product's UPC code). UPDATE dim_customer SET preferred_language = 'fr' WHERE customer_id = 42. Simple. Cheap. History gone.
Type 2 row-per-version: a new dim row is inserted for each change; old row is expired but retained. Columns: surrogate_key (unique per version), natural_key (customer_id, stable across versions), effective_from, effective_to, is_current. The merge pattern is: identify changed rows by comparing staging to current dim; expire the matched current rows (set effective_to = now, is_current = false); insert the new rows (effective_from = now, effective_to = null, is_current = true). Fact tables FK to the surrogate_key, which means the fact links to the dim row that was current at the time of the fact event. Point-in-time queries (what address was customer 42's at the time of the order on 2026-05-27) join fact to dim_scd2 on the half-open interval: ON dim.natural_key = fact.entity_id AND dim.effective_from <= fact.event_time AND (dim.effective_to IS NULL OR fact.event_time < dim.effective_to). The half-open (less-than-or-equal on the left, strict less-than on the right) prevents two dim rows from matching at the changeover microsecond and doubling the fact.
Type 3 current-and-previous columns: the dim row has two columns for the attribute (e.g., current_address and previous_address). Only the most recent change is preserved. Useful for two-state transitions where downstream queries always compare current to previous (a sales territory reorganization where you need to see "this region used to be in territory X, now in territory Y"). Rarely the right answer in 2026 data engineer interviews; mention Type 3 for completeness, defend Type 2 in most cases.
The Type 2 merge in SQL: MERGE INTO dim_customer d USING staging s ON d.customer_id = s.customer_id AND d.is_current = true WHEN MATCHED AND d.address != s.address THEN UPDATE SET effective_to = current_timestamp, is_current = false. Then INSERT INTO dim_customer (customer_id, address, effective_from, effective_to, is_current) SELECT s.customer_id, s.address, current_timestamp, null, true FROM staging s LEFT JOIN dim_customer d ON s.customer_id = d.customer_id AND d.is_current = true WHERE d.address IS NULL OR d.address != s.address. The Type 2 merge in pandas: identify changed rows with a merge on the natural key plus an attribute compare; expire matched current rows by setting effective_to; concat new rows with effective_from = now. Common bugs: forgetting the natural_key plus is_current condition in the MATCHED clause (causes infinite re-expiration); using = instead of != on the attribute compare (no rows change, no expiration happens).
Late-arriving dimensions are the senior data engineer SCD question. A fact event arrives before its dimension row exists. Insert a placeholder dim row with surrogate_key, natural_key = the fact's entity_id, all attributes NULL or "unknown", is_late = true. When the real dim row arrives, either replace the placeholder (if is_late tracking is sufficient) or insert a new SCD Type 2 row with the real attributes. Alternative: hold the fact in a quarantine until the dim arrives, then backfill. The placeholder approach is more common at scale because it does not block the fact pipeline.
Slowly Changing Dimension Interview Questions
SCD merge logic and join patterns for data engineer interview prep.
57 practice problems matching this filter. Difficulty: medium (32), easy (8), hard (17).
Data Modeling (57)
- A/B Experiment Assignment Schema - medium - One user, one experiment, one variant. No exceptions.
- Where They Used to Live - medium - They moved. The data stayed behind.
- Airline Flight Operations Schema - medium - Flights, passengers, and routes. Before you draw a single table, tell me the grain.
- A Number for the Seller - easy - They want a total. Give them the right schema first.
- B2B Invoicing Data Model - easy - Invoices go out, partial payments trickle in, and some customers are three months overdue.
- Clickstream and Session Schema - medium - Millions of clicks, mostly anonymous.
- Cloud File Storage Metadata Schema - hard - A file is also a folder. A folder is also a file.
- Content Engagement Data Model - hard - Post published. Now measure everything that happens next.
- Content Search and Discovery Schema - hard - Searchable from every angle. Design it so nothing gets lost.
- Customer Address History - easy - People move. Sometimes twice in a month. How do you remember where everyone was, and when?
Common questions
- When does a data engineer use Type 1 versus Type 2 SCD?
- Type 1 (overwrite) when history does not matter and you only need the current value: typo corrections, immutable identifiers, preferred language. Type 2 (row-per-version) when downstream queries need point-in-time correctness: customer address (ship-to history matters), employee department (compensation history), product category (revenue-by-category over time). Default to Type 2 for analytical dimensions; the storage cost is usually worth the history.
- What columns does an SCD Type 2 dimension table have?
- Surrogate key (unique per version, typically an int), natural key (customer_id, stable across versions), the descriptive attributes themselves (address, name, etc.), effective_from (timestamp the version became current), effective_to (timestamp the version was superseded; NULL for current), is_current (boolean for the current row, true on exactly one row per natural_key).
- What is the SCD Type 2 merge pattern?
- Three steps. Identify changed rows by comparing staging to current dim (anti-join on natural key plus an attribute compare). Expire the matched current rows: UPDATE dim SET effective_to = now, is_current = false WHERE natural_key IN (changed) AND is_current = true. Insert the new rows: INSERT (natural_key, attributes, effective_from = now, effective_to = NULL, is_current = true) for each changed row. SQL MERGE INTO combines the matched-update and the insert in one statement.
- What is the SCD2 half-open join and why does it matter?
- Joining a fact at event_time to a SCD Type 2 dimension uses ON dim.natural_key = fact.entity_id AND dim.effective_from <= fact.event_time AND (dim.effective_to IS NULL OR fact.event_time < dim.effective_to). The half-open (less-than-or-equal on the left, strict less-than on the right) prevents two dim rows from matching at the exact changeover microsecond. Closed-interval mistake doubles facts. Open-interval mistake drops facts at the boundary.
- How does a data engineer handle late-arriving dimensions?
- Insert a placeholder dim row with surrogate key, natural_key = the fact's entity_id, attributes NULL, is_late = true. When the real dim row arrives, replace the placeholder or insert a new SCD Type 2 row. Alternative: quarantine the fact until the dim arrives, then backfill. The placeholder approach is more common at scale because it does not block the fact pipeline.
- What is SCD Type 3 and when is it useful?
- Type 3 stores the current value and the previous value in two columns on the same dim row (current_address, previous_address). Only the most recent change is preserved. Useful for two-state transitions where downstream queries always compare current to previous: a sales territory reorganization where you need 'this region used to be in territory X, now in territory Y'. Rarely the right answer in 2026 data engineer interviews; mention for completeness, defend Type 2 in most cases.
- What is the most common SCD merge bug?
- Forgetting the is_current = true condition in the MATCHED clause. Without it, the merge expires all historical rows (every row with that natural_key), not just the current one. Result is that every version's effective_to gets reset to now, breaking point-in-time queries. Always include AND d.is_current = true in the MATCHED predicate.
- How does Type 2 SCD interact with the fact table surrogate key?
- Facts FK to the surrogate key of the dim version that was current at the time of the fact event. When customer 42 has surrogate_key 1001 from 2025-01-01 to 2025-06-30, and surrogate_key 1002 from 2025-06-30 onward, an order placed on 2025-05-15 links to surrogate_key 1001 (not 1002, even though 1002 is the current row when the analyst runs the query). This is what enables point-in-time correctness.