Question 1

When does a data engineer use Type 1 versus Type 2 SCD?

Accepted Answer

Type 1 (overwrite) when history does not matter and you only need the current value: typo corrections, immutable identifiers, preferred language. Type 2 (row-per-version) when downstream queries need point-in-time correctness: customer address (ship-to history matters), employee department (compensation history), product category (revenue-by-category over time). Default to Type 2 for analytical dimensions; the storage cost is usually worth the history.

Question 2

What columns does an SCD Type 2 dimension table have?

Accepted Answer

Surrogate key (unique per version, typically an int), natural key (customer_id, stable across versions), the descriptive attributes themselves (address, name, etc.), effective_from (timestamp the version became current), effective_to (timestamp the version was superseded; NULL for current), is_current (boolean for the current row, true on exactly one row per natural_key).
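As a rough illustration of that column list, here is a minimal sketch using Python's built-in sqlite3 module. The table name dim_customer, the column name customer_sk, and the sample addresses are assumptions made for the example, and the partial unique index is one possible way to guard the is_current rule, not something the answer prescribes. Note that the index only guarantees at most one current row per natural key; the load process still has to ensure there is one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_sk    INTEGER PRIMARY KEY,  -- surrogate key, unique per version
    customer_id    INTEGER NOT NULL,     -- natural key, stable across versions
    address        TEXT,                 -- descriptive attribute
    effective_from TEXT NOT NULL,        -- when this version became current
    effective_to   TEXT,                 -- NULL while the version is current
    is_current     INTEGER NOT NULL CHECK (is_current IN (0, 1))
);
-- Partial unique index: at most one current row per natural key.
CREATE UNIQUE INDEX ux_dim_customer_current
    ON dim_customer (customer_id) WHERE is_current = 1;
""")

INSERT_SQL = (
    "INSERT INTO dim_customer "
    "(customer_id, address, effective_from, effective_to, is_current) "
    "VALUES (?, ?, ?, NULL, 1)"
)
conn.execute(INSERT_SQL, (42, "1 Old Rd", "2025-01-01"))


def insert_second_current_row():
    """Try to add a second current row for customer 42; the index should reject it."""
    try:
        conn.execute(INSERT_SQL, (42, "2 New Rd", "2025-06-30"))
        return True
    except sqlite3.IntegrityError:
        return False


second_insert_succeeded = insert_second_current_row()
```

With the index in place, a load that forgets to expire the old version fails loudly instead of silently creating two current rows.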

Question 3

What is the SCD Type 2 merge pattern?

Accepted Answer

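Three steps. Identify changed rows by joining staging to the current dim rows on the natural key and comparing attributes; staging keys with no current dim row are new and also need an insert. Expire the matched current rows: UPDATE dim SET effective_to = now, is_current = false WHERE natural_key IN (changed) AND is_current = true. Insert the new versions: INSERT (natural_key, attributes, effective_from = now, effective_to = NULL, is_current = true) for each changed or new row. SQL MERGE INTO can combine the matched-update and the insert in one statement, but a source row that matches fires only the MATCHED branch, so the usual trick is to union each changed row into the source a second time with a NULL join key so that it also falls through to NOT MATCHED and gets inserted.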
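The three steps can be sketched as separate statements. This is a minimal sketch on SQLite via Python's sqlite3 module, which has no MERGE statement, so the expire and insert steps are plain UPDATE and INSERT. The table names dim_customer and stg_customer, the helper apply_scd2, and the sample rows are all assumptions for illustration; a single address column stands in for the full attribute compare.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_sk INTEGER PRIMARY KEY, customer_id INTEGER, address TEXT,
    effective_from TEXT, effective_to TEXT, is_current INTEGER);
CREATE TABLE stg_customer (customer_id INTEGER, address TEXT);
INSERT INTO dim_customer VALUES
    (1001, 42, '1 Old Rd',  '2025-01-01', NULL, 1),
    (1002, 43, '9 Same St', '2025-01-01', NULL, 1);
-- 42 changed, 43 is unchanged, 44 is brand new.
INSERT INTO stg_customer VALUES
    (42, '2 New Rd'), (43, '9 Same St'), (44, '5 First Ave');
""")


def apply_scd2(conn, now):
    # Step 1: collect staging rows that are new (no current dim row) or changed.
    # IS NOT is SQLite's NULL-safe inequality.
    conn.execute("""
        CREATE TEMP TABLE to_insert AS
        SELECT s.customer_id, s.address
        FROM stg_customer s
        LEFT JOIN dim_customer d
          ON d.customer_id = s.customer_id AND d.is_current = 1
        WHERE d.customer_id IS NULL OR d.address IS NOT s.address
    """)
    # Step 2: expire only the current version of each affected key.
    conn.execute("""
        UPDATE dim_customer
        SET effective_to = ?, is_current = 0
        WHERE is_current = 1
          AND customer_id IN (SELECT customer_id FROM to_insert)
    """, (now,))
    # Step 3: insert the new current versions.
    conn.execute("""
        INSERT INTO dim_customer
            (customer_id, address, effective_from, effective_to, is_current)
        SELECT customer_id, address, ?, NULL, 1 FROM to_insert
    """, (now,))
    conn.execute("DROP TABLE to_insert")
    conn.commit()


apply_scd2(conn, "2025-06-30")
```

After the call, customer 42 has an expired row and a new current row, customer 43 is untouched, and customer 44 has a single current row.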

Question 4

What is the SCD2 half-open join and why does it matter?

Accepted Answer

Joining a fact at event_time to an SCD Type 2 dimension uses ON dim.natural_key = fact.entity_id AND dim.effective_from <= fact.event_time AND (dim.effective_to IS NULL OR fact.event_time < dim.effective_to). The half-open interval (less-than-or-equal on the left, strict less-than on the right) prevents two dim rows from matching at the exact changeover microsecond. The closed-interval mistake (<= on both sides) doubles facts that land on the boundary. The open-interval mistake (< on both sides) drops them.
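To make the boundary behavior concrete, the sketch below runs the same join three ways against a fact that lands exactly on the changeover. It uses Python's sqlite3 module; the table names, the helper join, and the sample dates are assumptions for illustration, and ISO date strings stand in for timestamps.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_sk INTEGER, customer_id INTEGER,
    effective_from TEXT, effective_to TEXT);
CREATE TABLE fact_order (order_id INTEGER, customer_id INTEGER, event_time TEXT);
INSERT INTO dim_customer VALUES
    (1001, 42, '2025-01-01', '2025-06-30'),
    (1002, 42, '2025-06-30', NULL);
-- Order 2 lands exactly on the changeover boundary.
INSERT INTO fact_order VALUES (1, 42, '2025-05-15'), (2, 42, '2025-06-30');
""")

JOIN_SQL = """
SELECT f.order_id, d.customer_sk
FROM fact_order f
JOIN dim_customer d
  ON d.customer_id = f.customer_id
 AND d.effective_from {lo} f.event_time
 AND (d.effective_to IS NULL OR f.event_time {hi} d.effective_to)
ORDER BY f.order_id, d.customer_sk
"""


def join(lo, hi):
    """Run the point-in-time join with the given comparison operators."""
    return conn.execute(JOIN_SQL.format(lo=lo, hi=hi)).fetchall()


half_open = join("<=", "<")   # [(1, 1001), (2, 1002)]: one match per fact
closed = join("<=", "<=")     # [(1, 1001), (2, 1001), (2, 1002)]: order 2 doubled
open_ = join("<", "<")        # [(1, 1001)]: order 2 dropped
```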

Question 5

How does a data engineer handle late-arriving dimensions?

Accepted Answer

Insert a placeholder dim row with a new surrogate key, natural_key = the fact's entity_id, attributes NULL, is_late = true. When the real dim row arrives, overwrite the placeholder in place or insert a new SCD Type 2 row. Alternative: quarantine the fact until the dim arrives, then backfill. The placeholder approach is more common at scale because it does not block the fact pipeline.
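A minimal sketch of the placeholder approach, using Python's sqlite3 module. It shows only the "overwrite the placeholder in place" variant and leaves out the effective_from, effective_to, and is_current columns to stay short. The helper names surrogate_key_for and dim_arrived, the table names, and the sample values are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_sk INTEGER PRIMARY KEY, customer_id INTEGER,
    address TEXT, is_late INTEGER NOT NULL DEFAULT 0);
CREATE TABLE fact_order (order_id INTEGER, customer_sk INTEGER);
""")


def surrogate_key_for(customer_id):
    """Return the dim surrogate key, inserting a placeholder if the dim row is missing."""
    row = conn.execute(
        "SELECT customer_sk FROM dim_customer WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    if row:
        return row[0]
    cur = conn.execute(
        "INSERT INTO dim_customer (customer_id, address, is_late) VALUES (?, NULL, 1)",
        (customer_id,),
    )
    return cur.lastrowid


def dim_arrived(customer_id, address):
    """Fill in the placeholder in place so existing facts keep their surrogate key."""
    conn.execute(
        "UPDATE dim_customer SET address = ?, is_late = 0 "
        "WHERE customer_id = ? AND is_late = 1",
        (address, customer_id),
    )


# The fact arrives before its dimension row: it loads against a placeholder.
sk = surrogate_key_for(42)
conn.execute("INSERT INTO fact_order VALUES (1, ?)", (sk,))

# The real dimension data shows up later and replaces the placeholder attributes.
dim_arrived(42, "1 Old Rd")
```

Because the placeholder is updated rather than replaced, the fact row never needs to be rewritten.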

Question 6

What is SCD Type 3 and when is it useful?

Accepted Answer

Type 3 stores the current value and the previous value in two columns on the same dim row (current_address, previous_address). Only the most recent change is preserved. Useful for two-state transitions where downstream queries always compare current to previous: a sales territory reorganization where you need 'this region used to be in territory X, now in territory Y'. Rarely the right answer in 2026 data engineer interviews; mention for completeness, defend Type 2 in most cases.
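The territory example might look like the sketch below, again on Python's sqlite3 module. The table dim_region, its column names, and the helper reassign are assumptions for illustration. The second reassignment shows the limitation: the original value is gone once a second change arrives.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_region (region_id INTEGER PRIMARY KEY,
    current_territory TEXT, previous_territory TEXT);
INSERT INTO dim_region VALUES (1, 'X', NULL);
""")


def reassign(region_id, new_territory):
    """Type 3 update: shift current into previous, then overwrite current.

    SET expressions read the pre-update row, so previous_territory
    receives the old current_territory.
    """
    conn.execute(
        "UPDATE dim_region "
        "SET previous_territory = current_territory, current_territory = ? "
        "WHERE region_id = ? AND current_territory IS NOT ?",
        (new_territory, region_id, new_territory),
    )


def territories(region_id):
    return conn.execute(
        "SELECT current_territory, previous_territory "
        "FROM dim_region WHERE region_id = ?",
        (region_id,),
    ).fetchone()


reassign(1, "Y")
after_first = territories(1)    # ('Y', 'X')
reassign(1, "Z")
after_second = territories(1)   # ('Z', 'Y'): territory X is no longer recorded
```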

Question 7

What is the most common SCD merge bug?

Accepted Answer

Forgetting the is_current = true condition in the MATCHED clause. Without it, the merge expires every row with that natural_key, including versions that were already closed, not just the current one. The result is that every version's effective_to gets reset to now, which breaks point-in-time queries. Always include AND d.is_current = true in the MATCHED predicate.
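The effect is easy to reproduce. The sketch below uses Python's sqlite3 module; SQLite has no MERGE statement, so a plain UPDATE stands in for the MATCHED branch, which is an assumption of this example along with the table name, the helper names, and the sample dates.

```python
import sqlite3


def make_dim():
    """Build a fresh dim with one expired and one current version of customer 42."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE dim_customer (customer_sk INTEGER PRIMARY KEY, customer_id INTEGER,
        address TEXT, effective_from TEXT, effective_to TEXT, is_current INTEGER);
    INSERT INTO dim_customer VALUES
        (1001, 42, '1 Old Rd', '2025-01-01', '2025-06-30', 0),
        (1002, 42, '2 New Rd', '2025-06-30', NULL, 1);
    """)
    return conn


# Stand-in for the MATCHED branch: the buggy form matches on natural key only.
BUGGY = "UPDATE dim_customer SET effective_to = ?, is_current = 0 WHERE customer_id = ?"
FIXED = BUGGY + " AND is_current = 1"


def expire(sql, now="2025-09-01"):
    """Apply the expire statement to a fresh dim and return (sk, effective_to) pairs."""
    conn = make_dim()
    conn.execute(sql, (now, 42))
    return conn.execute(
        "SELECT customer_sk, effective_to FROM dim_customer ORDER BY customer_sk"
    ).fetchall()


buggy_result = expire(BUGGY)  # row 1001 loses its original 2025-06-30 end date
fixed_result = expire(FIXED)  # row 1001 keeps 2025-06-30; only 1002 is expired
```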

Question 8

How does Type 2 SCD interact with the fact table surrogate key?

Accepted Answer

Each fact row stores a foreign key to the surrogate key of the dim version that was current at the time of the fact event. When customer 42 has surrogate_key 1001 from 2025-01-01 to 2025-06-30, and surrogate_key 1002 from 2025-06-30 onward, an order placed on 2025-05-15 links to surrogate_key 1001 (not 1002, even though 1002 is the current row when the analyst runs the query). This is what enables point-in-time correctness.
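The customer 42 example can be sketched as a load-time lookup followed by a plain equi-join. This uses Python's sqlite3 module; the helper load_order, the table names, the addresses, and the second order dated 2025-07-01 are assumptions added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_sk INTEGER PRIMARY KEY, customer_id INTEGER,
    address TEXT, effective_from TEXT, effective_to TEXT);
CREATE TABLE fact_order (order_id INTEGER, customer_sk INTEGER, order_date TEXT);
INSERT INTO dim_customer VALUES
    (1001, 42, '1 Old Rd', '2025-01-01', '2025-06-30'),
    (1002, 42, '2 New Rd', '2025-06-30', NULL);
""")


def load_order(order_id, customer_id, order_date):
    """Resolve the dim version valid on order_date and store its surrogate key."""
    # Assumes exactly one version covers order_date; unpacking fails otherwise.
    (sk,) = conn.execute(
        """
        SELECT customer_sk FROM dim_customer
        WHERE customer_id = ?
          AND effective_from <= ?
          AND (effective_to IS NULL OR ? < effective_to)
        """,
        (customer_id, order_date, order_date),
    ).fetchone()
    conn.execute("INSERT INTO fact_order VALUES (?, ?, ?)", (order_id, sk, order_date))


load_order(1, 42, "2025-05-15")
load_order(2, 42, "2025-07-01")

# At query time a plain equi-join on the surrogate key returns the historical version.
report = conn.execute("""
    SELECT f.order_id, d.customer_sk, d.address
    FROM fact_order f
    JOIN dim_customer d ON d.customer_sk = f.customer_sk
    ORDER BY f.order_id
""").fetchall()
```

Resolving the version once at load time means analysts do not have to repeat the date-range predicate in every query.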

Slowly Changing Dimension (SCD) Interview Questions
