Data Modeling Round Prep

Data Modeling Interview Questions (2026)

Roughly a third of data engineering interviews include a dedicated data modeling round. Below: questions by topic, plus three fully worked examples that cover the topics interviewers reach for most often: star schema design, normalization, and SCD Type 2.
Updated April 2026 · By The DataDriven Team
What this guide actually says
  1. The data modeling round is the only round that cannot be passed by memorizing patterns. The interviewer is grading reasoning, not recall.
  2. Interviewers grade you on grain, not on Kimball trivia. If your one-sentence grain statement is sloppy, the rest of the schema does not matter.
  3. If you cannot defend why a column lives on the fact table and not the dimension, you fail. Pricing and quantities are the most common offenders.
  4. SCD Type 2 is the median question. The way you handle late-arriving facts is the differentiator that separates mid-level from senior.
  5. Star schema is the default in 2026, OBT is increasingly defensible on columnar warehouses, snowflake is rare. Have a one-line reason for each.
  6. Bridge tables, closure tables, and many-to-many time-bounded relationships are where senior candidates separate from the pack.

TL;DR: Data Modeling Interview Questions

The six topics interviewers reach for most often, and what you need to actually do under time pressure.

Six areas

What gets tested

Data modeling interview questions for data engineers in 2026 focus on six areas: normalization (1NF-3NF) and denormalization, dimensional modeling (star vs snowflake), fact and dimension table design, SCD types (1, 2, 3, 6), keys (primary, foreign, surrogate, natural), and ER diagrams and constraints.
Scope

What you need to do

For most roles you need to design a star schema for a given business case, choose the right SCD strategy, justify primary key choices (surrogate vs natural), and identify normalization violations in a denormalized table. Senior roles add modeling for slowly-changing dimensions in cloud warehouses, hybrid normalization (Data Vault, OBT), and migration patterns.

What grain actually means and why interviewers care

Most candidates flunk this round by being imprecise about grain. Grain is the one sentence that says what one row in the fact table represents. If you cannot say it cleanly, every downstream answer is shaky.

The interview-ready grain statement

Three required pieces: subject, qualifier, temporal scope.

  • Subject: the noun being measured (order line item, sensor ping, payment, login). Concrete, not abstract.
  • Qualifier: the modifier that disambiguates. "Per fulfillment event," "per attempted vs successful," "per warehouse location."
  • Temporal scope: the time grain. Per day. Per event timestamp. Per hour bucket. Without this, summing measures will silently double-count or under-count.

Five grain statements ranked from bad to senior

The same scenario, articulated five ways. The last three are interview-passing. Only the final one is what senior candidates say.

Bad

"One row per customer."

Useless. Customers do things over time. This grain answers no useful question. Most candidates start here, get pushed, and never recover.
Vague

"One row per order."

Better, but still incomplete. Is a partial shipment one row or two? Is a returned line item subtracted from this row or written as a separate row? The interviewer will ask.
Acceptable

"One row per order line item."

Now the table answers product-level questions. This is the median answer that gets you through most interviews if you can defend it.
Strong

"One row per order line item per fulfillment event."

Now you have an event grain. Splits, returns, partial cancellations all become first-class rows. Measures stay additive across the lifecycle. This is the answer that signals you have shipped this before.
Senior

"One row per order line item per fulfillment event, with the dim version effective at the event timestamp captured as a foreign key."

You have grain plus you have explicitly tied the row to the right SCD Type 2 dim version, so late-arriving facts and dimension drift do not corrupt historical reporting. This is the senior answer.
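A minimal DDL sketch of what that senior statement implies. The table and column names here (fact_fulfillment_event, dim_customer, customer_sk) are illustrative, not prescribed by the scenario; the point is that the fact row carries the surrogate key of the dim version that was effective at the event timestamp.

-- Type 2 dimension: one row per version, bounded by effective dates.
CREATE TABLE dim_customer (
    customer_sk   BIGINT,        -- surrogate key, unique per version
    customer_id   VARCHAR(20),   -- natural/business key
    segment       VARCHAR(50),
    region        VARCHAR(50),
    eff_from      DATE,
    eff_to        DATE           -- 9999-12-31 on the current version
);

-- Fact at the senior grain: one row per order line item per fulfillment event.
CREATE TABLE fact_fulfillment_event (
    order_id             VARCHAR(20),   -- degenerate dimension
    order_line_id        VARCHAR(20),
    fulfillment_event_id VARCHAR(36),
    event_type           VARCHAR(20),   -- ship, return, partial_cancel
    event_ts             TIMESTAMP,
    customer_sk          BIGINT,        -- resolved with an as-of lookup:
                                        -- event_ts BETWEEN dim.eff_from AND dim.eff_to
    quantity             INTEGER,
    unit_price           DECIMAL(12,2)  -- price at the moment of the event
);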
Interview tip
Before you draw a single table, say the grain out loud. Watch the interviewer's face. If they nod, build on it. If they squint, the grain is wrong and the next 25 minutes will be defensive instead of constructive. State, do not draw, until grain is agreed.
Data Modeling · Practice grain on this problem
Customer Address History

People move. Sometimes twice in a month. How do you remember where everyone was, and when?

The interviewer doesn't want to see your favorite schema. They want to see whether you can defend its weakest point under pressure.
What every senior data engineer figures out by their second offer cycle

SCD Types: when each one is wrong

Most candidates know 'Type 2 = history.' Push past that. The question is when each type is the wrong choice and what you reach for instead.

Type 1

Overwrite. When it is right.

Use Type 1 when downstream queries genuinely do not care about history. Typo corrections. Email fixes after a bounce. ML feature stores where the feature is computed at request time and you want the latest value. Type 1 is wrong when any downstream user might say 'show me what this customer looked like at the time of the order.'
Type 2

Default for analytics.

Add a new row, close the old one with eff_to. Surrogate key on the dim, fact tables join on the surrogate. This is the median right answer in 80% of interviews. The differentiator is whether you can talk about late-arriving facts, change-data-capture sources, and how you avoid duplicate effective windows.
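A minimal sketch of that close-and-insert step, assuming a hypothetical staging table stg_customer_changes holding the changed rows and a next_sk() placeholder for however your warehouse generates surrogate keys; UPDATE ... FROM and date-arithmetic syntax vary by engine.

-- Close the currently open version for customers that changed.
UPDATE dim_customer d
SET eff_to     = s.change_date - INTERVAL '1' DAY,
    is_current = FALSE
FROM stg_customer_changes s
WHERE d.customer_id = s.customer_id
  AND d.is_current = TRUE;

-- Insert the new version, open-ended until the next change arrives.
INSERT INTO dim_customer (customer_sk, customer_id, name, city, state,
                          eff_from, eff_to, is_current)
SELECT next_sk(), s.customer_id, s.name, s.city, s.state,
       s.change_date, DATE '9999-12-31', TRUE
FROM stg_customer_changes s;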
Type 3

Rarely the right call. When it is NOT.

Type 3 stores only the previous value (previous_city, previous_tier). It is rarely the right answer because most analytics needs more than one version back. Pick Type 3 only when the business explicitly says 'I only ever compare current vs prior' (price tier before/after a promotion). Default to Type 2; if the interviewer asks why not Type 3, the answer is 'because the business will eventually want three versions back, not one.'
Type 4

Dim plus history table.

Keep the current state in a narrow, fast dim and push history to a separate table. Useful when the dim is wide (50+ attributes) but only 2 or 3 of them change. Saves storage vs Type 2 because you do not duplicate the unchanging columns. Cost: queries that need history have to know about the second table.
Type 6

1 + 2 + 3, the auditor's pick.

Combines overwriting current attributes (Type 1), historical rows with effective dating (Type 2), and a current_value column on every historical row (Type 3-style). When auditors or compliance need 'what was the value when this happened, AND what is the current value of the same attribute, in one row.' Heavy, but unambiguous.
Bridge

Many-to-many over time.

When a customer can be in two regions at once, or held by two relationship managers simultaneously, no single SCD type fits. A bridge table with eff_from / eff_to per assignment, optionally with a weighting factor, expresses the relationship cleanly. Often missed by candidates who default to 'just use Type 2' when the relationship is not 1:1.

Late-arriving facts and how candidates lose the offer

The single biggest differentiator between mid-level and senior in the data modeling round. Walk through the scenario, the naive fix, and the right fix.

Scenario

The setup

Your fact table loads sensor pings every 15 minutes. A backlog clears, and 50,000 events from 11 days ago arrive at once. Your dim_sensor has had two SCD Type 2 changes in those 11 days (recalibration, gantry reassignment). The candidate's job: load the late events without corrupting historical reporting.
Naive

What most candidates say

'Just look up the current sensor surrogate key.' This is wrong. The car passed under the old calibration. Joining to the current sk attributes the event to a sensor configuration that did not exist when the event happened. Every report aggregated by calibration_date breaks.
Right answer (eager)

If your dim materializes effective ranges

Use an as_of_timestamp join: fact.event_ts BETWEEN dim.eff_from AND dim.eff_to. The query naturally pulls the version that was current at the event timestamp. This is cheap on a columnar warehouse with clustering on (sensor_id, eff_from).
Right answer (lazy)

If your dim is a snapshot table

If you only persist the current version, you reconstruct historical state from a change log (CDC). Walk back from the current row using changes ordered by event_ts to get the as-of value. More expensive per query but no storage cost in the dim.
The trade-off

What you give up

Eager materialization (Type 2 with explicit ranges) costs storage and write complexity. Lazy reconstruction (CDC log replay) costs query time and infra complexity. The senior signal is naming both options, picking one, and stating which downstream use-case made you pick it.
What breaks

When candidates lose the offer

If the candidate cannot articulate why joining on the natural key is wrong, they fail. If they propose 'just reload everything every night,' they fail (does not scale, breaks late-arriving fact contracts upstream). If they wave their hands and say 'we use Snowflake's time travel,' they fail (time travel is bounded retention, not a modeling pattern).
Scenario: a sensor event arrives 11 days late.

Event timestamp: 2025-06-15  (the day the car passed sensor 17)
Arrived at warehouse: 2025-06-26
Current dim_sensor row for sensor 17:
  +--------+-----------+---------+------------+------------+------------+
  | sk     | sensor_id | gantry  | calibrated | eff_from   | eff_to     |
  +--------+-----------+---------+------------+------------+------------+
  | 4471   | S-17      | G-W04   | 2024-09-01 | 2024-09-01 | 2025-06-21 |
  | 4719   | S-17      | G-W04   | 2025-06-22 | 2025-06-22 | 9999-12-31 |
  +--------+-----------+---------+------------+------------+------------+

The naive answer: join on the latest sensor_sk for S-17 (4719).
Wrong. The car passed under the OLD calibration (sk 4471).

The right answer: as_of_timestamp join.
  fact.event_timestamp BETWEEN dim.eff_from AND dim.eff_to
This pulls sk 4471 for the 2025-06-15 event and sk 4719 for events after 2025-06-22.
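The same join written out as a query. The staging table and its columns (stg_late_sensor_events, event_id) are illustrative names; the join pattern is the part that matters.

-- Resolve surrogate keys for late-arriving events against the dim version
-- that was effective at the event timestamp, not the version current today.
SELECT
    e.event_id,
    e.event_timestamp,
    d.sk AS sensor_sk      -- 4471 for the 2025-06-15 event, 4719 after 2025-06-22
FROM stg_late_sensor_events e
JOIN dim_sensor d
  ON d.sensor_id = e.sensor_id
 AND e.event_timestamp BETWEEN d.eff_from AND d.eff_to;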
What strong candidates say
"The naive answer is to join on the latest sensor surrogate key, but that attributes the event to a calibration that did not exist when the car passed under the gantry. The right answer is an as-of-timestamp join: fact.event_ts BETWEEN dim.eff_from AND dim.eff_to. The trade-off is between materializing effective ranges eagerly (storage cost, simple queries) and reconstructing them lazily from a CDC log (storage savings, query cost). For high-volume reporting I pick eager; for an operational system with mostly point lookups I pick lazy."
Data Modeling · Late-arriving facts: practice
Toll Road Sensor Analytics

Cars enter, cars exit. Except when they don't.

Star vs Snowflake vs One Big Table

Opinionated take. In 2026, with cheap storage and cheap columnar compute, OBT is more defensible than it was when Kimball wrote the books. Star is still the right default. Snowflake is rare.

Property | Star Schema | Snowflake Schema | One Big Table (OBT)
Query simplicity | High | Medium | High (single table)
Query perf on columnar | Strong | Weaker (deeper joins) | Strong (no joins)
ETL complexity | Moderate | Highest | Highest (one wide table to maintain)
SCD handling | Clean (Type 2 on dim) | Same as star, more places to coordinate | Awkward (rebuilds expensive)
Storage cost | Moderate | Lowest | Highest (dim attrs duplicated)
Self-service analytics | Best | Worse (more tables to model) | Variable (works for one cut, fails for many)
When it goes wrong | Cross-fact joins, conformed-dim drift | Anywhere the join chain exceeds 3 levels | Schema drift, multi-audience filtering
The 2026 take, comparison-by-comparison
  • Query simplicity: OBT wins for a single dominant audience cut. Star wins when multiple audiences need different filter sets. Snowflake loses on simplicity for everyone.
  • Query performance on a columnar store: OBT is competitive because columnar pruning ignores unused columns. Star pays one or two cheap dimension joins. Snowflake pays many deeper joins, often with worse selectivity.
  • ETL complexity: OBT is the most complex to build (many sources land in one wide table; one schema change touches everything). Star is moderate. Snowflake is the most complex to keep consistent across many normalized dim tables.
  • Slowly-changing handling: Star handles SCD Type 2 cleanly via dim surrogate keys. OBT handles SCD by writing the dim attributes onto each fact row (eager); rebuilds become expensive. Snowflake inherits star's behavior, more places to coordinate.
  • When OBT goes wrong: when multiple downstream users need to filter by dimension attributes that were not duplicated onto the OBT, when SCD changes require expensive backfills, and when self-service tools need a clean dim to power their dropdowns.
Interview-ready stance
"I default to star for BI workloads on a columnar warehouse. OBT is increasingly defensible when there is one dominant audience and one dominant query pattern, especially for ML features. Snowflake I reach for only when a dimension hierarchy has attributes that change at very different rates and the storage savings actually matter at the scale we operate at."

Bridge tables and many-to-many

Skipped in too many study guides. Bridge tables are how you model relationships that a single SCD type cannot express. If the relationship is many-to-many, especially over time, you need a bridge.

Customer-Account

Customer-to-Account in a household model

One customer can hold many accounts. One account can be held by multiple customers (joint, household). bridge_account_customer with role + weighting_factor cleanly expresses the relationship and lets you allocate balances when reporting needs per-customer measures.
Order-Promo

Order-to-Promotion stacking

One order can carry multiple promotions (loyalty + first-time + free-shipping). bridge_order_promo with allocated_discount per promo lets reporting attribute the discount cost to the right promo without double-counting.
Patient-Clinician

Patient-to-Clinician care teams

Modern care models assign multiple clinicians per patient (PCP, specialist, nurse practitioner). bridge_patient_clinician with role + eff_from / eff_to over time lets you answer 'who was on this patient's care team during this admission' even when the team rotated mid-stay.
Customer-to-Account (household model)

  dim_customer            bridge_account_customer        dim_account
  +-------------+         +------------------------+      +-------------+
  | customer_sk |--+      | account_sk         FK  |  +---| account_sk  |
  | name        |  +-----<| customer_sk        FK  |>-+   | account_no  |
  | email       |         | role                   |      | open_date   |
  +-------------+         | weighting_factor       |      | tier        |
                          +------------------------+      +-------------+

Why the bridge:
  - One account can be held by multiple customers (joint, household).
  - One customer can hold multiple accounts (checking + savings).
  - role is on the bridge: "primary", "joint", "authorized_user".
  - weighting_factor allocates account-level measures (balance, fees) when
    the business needs to attribute them to individual customers.
    Three joint holders with weighting 0.33 each prevents triple-counting.

Without the bridge:
  - You either pick a single primary customer (lose other holders),
    cartesian-join (triple-count balances), or denormalize into the fact
    (still cannot represent the many-to-many cleanly).
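A sketch of how the weighting_factor pays off downstream. fact_account_balance is a hypothetical daily balance snapshot; the bridge and dims are the ones drawn above.

-- Allocate each account's balance across its holders.
-- Three joint holders at 0.33 each sum to roughly 1.0, so a household
-- report does not triple-count the same balance.
SELECT
    c.customer_sk,
    c.name,
    SUM(f.balance * b.weighting_factor) AS allocated_balance
FROM fact_account_balance f
JOIN bridge_account_customer b ON b.account_sk  = f.account_sk
JOIN dim_customer c            ON c.customer_sk = b.customer_sk
WHERE f.snapshot_date = DATE '2025-06-30'
GROUP BY c.customer_sk, c.name;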
Why candidates miss this
Most candidates default to "just put it on the dim" or "just put it on the fact." Both fail when the relationship is genuinely many-to-many. The senior signal is to recognize the m:n shape early, name the bridge, and explain how the weighting_factor prevents double-counting downstream.
Data Modeling · Bridge a real m:n relationship
B2B Invoicing Data Model

Invoices go out, partial payments trickle in, and some customers are three months overdue.

Star vs Snowflake Schema: Decision Matrix

The most common dimensional modeling question. Memorize this table and you can answer 80% of star/snowflake comparisons.

Property | Star Schema | Snowflake Schema
Dimensions | Denormalized (single table per dim) | Normalized (broken into multiple tables)
Query joins | Few (1 fact + N dims) | Many (deeper join chains)
Storage | Higher (redundant attributes) | Lower
BI tool friendliness | Excellent | Worse (more tables to model)
Update consistency | Lower (denormalized data drifts) | Higher (single source of truth)
Best for | Analytics, BI dashboards, modern warehouses | Storage-constrained OLAP, hierarchical attributes

Primary, Foreign, Surrogate Keys: When to Use Each

Key choice questions appear in nearly every data modeling interview. Know the tradeoffs.

Key type | What it is | Use when
Natural key | A real-world identifier (email, SSN, ISBN) | Source-of-truth tables where the identifier won't change
Surrogate key | System-generated meaningless integer (auto-increment, UUID, hash) | Dimensional warehouses, especially SCD Type 2 where the natural key alone is not unique
Primary key | Constraint that uniquely identifies a row (NOT NULL + UNIQUE) | Every persisted table; choose surrogate or natural based on stability
Foreign key | Constraint that enforces a row references a primary key in another table | OLTP systems where referential integrity matters; often omitted in cloud warehouses for performance

What interviewers actually grade on

The six questions you will hear, with answer outlines that separate strong from senior. The questions are not new. The answers that earn offers are not the obvious ones.

Q01

Design a schema for an Uber-like ride-sharing platform.

Strong answers ask clarifying questions first: which audience (operations, finance, ML)? What grain (one ride, one ride-leg, one driver-shift)? Then they sketch fact_ride at one-row-per-ride grain plus dim_rider, dim_driver, dim_vehicle, dim_geography (pickup AND dropoff as separate FKs), dim_date, dim_promo. Senior answers explicitly call out SCD Type 2 on dim_driver (rating, vehicle, status all change) and a separate fact_ride_lifecycle event-grain table for ETA, accept, cancel events.
Q02

How would you handle a customer with three addresses, two emails, and a merge into another customer?

Stop and clarify what 'three addresses' means. Three current addresses (billing, shipping, residential)? That is bridge_customer_address with role + eff_from/eff_to. Three historical addresses? That is SCD Type 2 on dim_customer. Then the merge: a customer-survivor table mapping merged_customer_id -> surviving_customer_sk so historical facts still resolve. Most candidates miss the merge survivor table and lose history.
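A minimal sketch of the survivor mapping. customer_merge_map is an illustrative name; the idea is that historical facts keep their original surrogate keys and resolve to the surviving customer at query time.

-- Maps every merged-away customer to the surviving customer.
CREATE TABLE customer_merge_map (
    merged_customer_id     VARCHAR(20),   -- the business key that went away
    surviving_customer_sk  BIGINT,        -- dim row of the surviving customer
    merged_at              TIMESTAMP
);

-- Roll historical orders up to the survivor without rewriting fact rows.
SELECT
    COALESCE(m.surviving_customer_sk, d.customer_sk) AS effective_customer_sk,
    SUM(f.total_amount)                              AS lifetime_revenue
FROM fact_orders f
JOIN dim_customer d            ON d.customer_sk = f.customer_sk
LEFT JOIN customer_merge_map m ON m.merged_customer_id = d.customer_id
GROUP BY COALESCE(m.surviving_customer_sk, d.customer_sk);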
Q03

What's the grain of your fact table for [scenario]? Defend it.

Subject + qualifier + temporal scope. 'One row per shipment line per warehouse-event per ship-date.' Defend by enumerating the questions the grain answers and the questions a coarser grain would not. If you cannot list two questions a coarser grain would fail, your grain is over-granular.
Q04

Walk through a Type 2 dimension with effective dating. What goes wrong with late-arriving facts?

Walk through the as_of_timestamp join (fact.event_ts BETWEEN dim.eff_from AND dim.eff_to). Then walk through the failure mode: a fact arrives with an event_ts older than every active eff_to row in the dim because the dim's history was not backfilled deeply enough. Fix: extend the dim's effective windows backwards (UPDATE the earliest row's eff_from to the global minimum), or insert a synthetic 'unknown' dim row for events before recorded history.
Q05

Star or snowflake or one big table? When would you pick each?

Star is the default for BI dashboards on a columnar store. Snowflake when dimension hierarchies have rich attributes that change at different rates (geography: country attrs change every decade, city attrs change yearly, neighborhood attrs change monthly). OBT when one dominant audience uses one denormalized surface and storage is cheap. State the tie-breaker: 'I default to star unless someone tells me a specific reason to snowflake.'
Q06

How do you model a hierarchy where the depth is unknown?

Two options. Adjacency list (parent_id self-reference) is simple but recursive queries are expensive. Closure table (one row per ancestor-descendant pair with depth) makes hierarchical queries O(1) at the cost of storage and write complexity. Bridge tables can express hierarchies that are not strict trees (employees with two managers). State which option you would pick for the asked use case and why.
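A sketch of the closure-table option against a generic employee hierarchy; the table and column names are illustrative.

-- One row per ancestor-descendant pair, including depth 0 (self).
CREATE TABLE employee_closure (
    ancestor_id    BIGINT,
    descendant_id  BIGINT,
    depth          INTEGER,    -- 0 = self, 1 = direct report, 2 = skip-level, ...
    PRIMARY KEY (ancestor_id, descendant_id)
);

-- "Everyone under manager 42, at any depth" without a recursive CTE.
SELECT descendant_id
FROM employee_closure
WHERE ancestor_id = 42
  AND depth > 0;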

Why Data Modeling Is the Hardest Interview Round

Data modeling appears in roughly a third of all data engineering interviews, yet it is the round candidates prepare for the least. Unlike SQL or Python, there is no single correct answer. Every schema design is a series of decisions with consequences, and the interviewer's job is to probe whether you understand those consequences.

No single answer

There is no single right schema

Two candidates can produce different schemas for the same problem, and both can pass. What matters is whether you can explain why you chose one approach over another. If the interviewer asks "why not snowflake?" and you cannot articulate the trade-off, that is a fail.
Moving target

Requirements change mid-interview

Interviewers deliberately add new requirements after you finish your initial design. "Now the product team wants real-time updates." "Now we need to track 3 years of history." Your schema needs to bend without breaking.
Conversation

You must think out loud

Unlike SQL where you can silently write and submit, data modeling is a conversation. Interviewers are scoring your reasoning process. Verbalize your assumptions, state your trade-offs, and ask clarifying questions before you draw a single table.

Worked Example: E-Commerce Star Schema

"Design a star schema for an e-commerce platform. Define the grain of your fact table and list your dimension tables." This question appears in nearly every data modeling interview.

                    dim_customer                 dim_product
                +-----------------+         +------------------+
                | customer_id PK  |         | product_id  PK   |
                | name            |         | name             |
                | email           |         | category         |
                | segment         |         | subcategory      |
                | region          |         | brand            |
                | signup_date     |         | unit_cost        |
                +-----------------+         +------------------+
                         |                           |
                         |    fact_order_items        |
                         |  +--------------------+   |
                         +--| customer_id   FK   |---+
                            | product_id    FK   |
                  +---------| date_id       FK   |---------+
                  |         | order_id      DD   |         |
                  |         | order_item_id PK   |         |
                  |         |--------------------|         |
                  |         | quantity           |         |
                  |         | unit_price         |         |
                  |         | discount_amount    |         |
                  |         | total_amount       |         |
                  |         +--------------------+         |
                  |                                        |
          +--------------+                     +-----------------+
          | date_id  PK  |                     | promo_id   PK   |
          | full_date    |                     | promo_name      |
          | day_of_week  |                     | discount_pct    |
          | month        |                     | start_date      |
          | quarter      |                     | end_date        |
          | year         |                     | promo_type      |
          | is_holiday   |                     +-----------------+
          +--------------+                        dim_promotion
             dim_date
Grain

Grain: One row per order line item

An order with 3 products creates 3 fact rows. This is the lowest useful grain: it supports product-level analysis (revenue per SKU, discount effectiveness per category) without requiring re-aggregation. If you chose order-level grain, product mix could not be recovered from this table at all; you would need a second, finer-grained table.
Degenerate dim

Why order_id is a degenerate dimension

order_id has no attributes of its own (no order_name, order_type). It exists only as a grouping key to reassemble complete orders from line items. Putting it in a separate dimension table would add a JOIN with no analytical value.
Pricing

Why unit_price is on the fact table, not dim_product

Product prices change over time. The fact table captures the price at the moment of sale. If you stored price only on dim_product, historical revenue calculations would use today's price, not the price when the order was placed. This is one of the most common modeling mistakes.
Trade-offs to state in the interview
  • Star vs snowflake: dim_product.category could be normalized into a separate dim_category table (snowflake). This saves storage if categories have many attributes, but adds a JOIN to every product query. At this scale, star is simpler and query performance is better.
  • dim_customer SCD strategy: Currently Type 1 (overwrite). If the business needs to analyze orders by the customer's region at the time of purchase, upgrade to SCD Type 2 with effective_from and effective_to columns. Cost: table grows 3-5x over 3 years.
  • Aggregation tables: If the dashboard queries daily revenue by category, a pre-aggregated table (daily_category_revenue) avoids scanning 500M fact rows. Cost: refresh latency and storage. Benefit: sub-second dashboard queries.
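A sketch of the pre-aggregated table from the last bullet, built from the star above. daily_category_revenue is an illustrative name, and the refresh schedule is a separate decision not shown here.

-- Pre-aggregate once so dashboards never scan the 500M-row fact table.
CREATE TABLE daily_category_revenue AS
SELECT
    dd.full_date,
    dp.category,
    SUM(f.total_amount)    AS revenue,
    SUM(f.discount_amount) AS total_discount,
    COUNT(*)               AS line_items
FROM fact_order_items f
JOIN dim_date    dd ON dd.date_id    = f.date_id
JOIN dim_product dp ON dp.product_id = f.product_id
GROUP BY dd.full_date, dp.category;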

Worked Example: Normalization from 1NF to 3NF

"This table has anomalies. Normalize it." The interviewer hands you a denormalized table and expects you to identify the violations at each normal form and fix them step by step.

Original Table (Unnormalized)

orders_raw
+----------+------------+---------+-----------+--------------------+
| order_id | order_date | cust_id | cust_name | products           |
+----------+------------+---------+-----------+--------------------+
| 1001     | 2025-03-15 | C1      | Alice     | Widget, Gadget     |
| 1002     | 2025-03-16 | C2      | Bob       | Widget             |
| 1003     | 2025-03-16 | C1      | Alice     | Gadget, Thingamajig|
+----------+------------+---------+-----------+--------------------+

Problem: "products" column contains comma-separated values.
This violates 1NF (each cell must hold a single atomic value).

Step 1: First Normal Form (1NF)

Eliminate repeating groups. Each row stores one product per order.

order_items_1nf
+----------+------------+---------+-----------+-------------+
| order_id | order_date | cust_id | cust_name | product     |
+----------+------------+---------+-----------+-------------+
| 1001     | 2025-03-15 | C1      | Alice     | Widget      |
| 1001     | 2025-03-15 | C1      | Alice     | Gadget      |
| 1002     | 2025-03-16 | C2      | Bob       | Widget      |
| 1003     | 2025-03-16 | C1      | Alice     | Gadget      |
| 1003     | 2025-03-16 | C1      | Alice     | Thingamajig |
+----------+------------+---------+-----------+-------------+

PK: (order_id, product)  -- composite key
Now 1NF: every cell is atomic, every row is unique.

Step 2: Second Normal Form (2NF)

Eliminate partial dependencies. order_date and cust_name depend only on order_id, not on the full composite key (order_id, product).

orders_2nf                          order_items_2nf
+----------+------------+---------+-----------+   +----------+-------------+
| order_id | order_date | cust_id | cust_name |   | order_id | product     |
+----------+------------+---------+-----------+   +----------+-------------+
| 1001     | 2025-03-15 | C1      | Alice     |   | 1001     | Widget      |
| 1002     | 2025-03-16 | C2      | Bob       |   | 1001     | Gadget      |
| 1003     | 2025-03-16 | C1      | Alice     |   | 1002     | Widget      |
+----------+------------+---------+-----------+   | 1003     | Gadget      |
  PK: order_id                                     | 1003     | Thingamajig |
                                                   +----------+-------------+
Now 2NF: no non-key column depends on part of the PK.

Step 3: Third Normal Form (3NF)

Eliminate transitive dependencies. cust_name depends on cust_id, not directly on order_id. Extract it into a separate table.

customers_3nf         orders_3nf                   order_items_3nf
+---------+-----------+  +----------+------------+---------+  +----------+-------------+
| cust_id | cust_name |  | order_id | order_date | cust_id |  | order_id | product     |
+---------+-----------+  +----------+------------+---------+  +----------+-------------+
| C1      | Alice     |  | 1001     | 2025-03-15 | C1      |  | 1001     | Widget      |
| C2      | Bob       |  | 1002     | 2025-03-16 | C2      |  | 1001     | Gadget      |
+---------+-----------+  | 1003     | 2025-03-16 | C1      |  | 1002     | Widget      |
  PK: cust_id            +----------+------------+---------+  | 1003     | Gadget      |
                           PK: order_id  FK: cust_id          | 1003     | Thingamajig |
                                                              +----------+-------------+

Now 3NF: every non-key column depends on the key, the whole key,
and nothing but the key.
When to denormalize back
If your warehouse dashboard queries join orders_3nf to customers_3nf on every single query, and the customer table is small (under 1M rows), denormalizing cust_name back onto the orders table saves a JOIN and speeds up every dashboard query. The trade-off: if Alice changes her name, you need to update every order row. For analytical workloads where name changes are rare and query speed matters, this is a reasonable trade-off. State it explicitly.
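A sketch of that denormalization against the 3NF tables above; the UPDATE shows the maintenance cost the trade-off buys you (the new name value is illustrative).

-- Carry cust_name onto the orders table so dashboards skip the join.
CREATE TABLE orders_denorm AS
SELECT o.order_id, o.order_date, o.cust_id, c.cust_name
FROM orders_3nf o
JOIN customers_3nf c ON c.cust_id = o.cust_id;

-- The cost: a name change now touches every one of that customer's order rows.
UPDATE orders_denorm
SET cust_name = 'Alicia'
WHERE cust_id = 'C1';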

Worked Example: SCD Type 2 Implementation

"A customer changes their address. Show how you track the history." SCD Type 2 is the most commonly tested slowly changing dimension pattern.

Three SCD types compared

Type | How it works | Keeps history? | Table growth | Best for
Type 1 | Overwrite the old value | No | None | Typo corrections, non-analytical fields
Type 2 | Add new row with date range | Full history | 3-5x over 3 years | Address, segment, pricing tier
Type 3 | Add column for previous value | One version back | Minimal | Before/after comparisons only

SCD Type 2: Before and After a Customer Moves

dim_customer (SCD Type 2)
+----------------+-------------+-------+---------+-------+------------+------------+------------+
| customer_sk PK | customer_id | name  | city    | state | eff_from   | eff_to     | is_current |
+----------------+-------------+-------+---------+-------+------------+------------+------------+
| 1001           | C-42        | Alice | Chicago | IL    | 2023-01-15 | 2025-06-30 | false      |
| 1002           | C-42        | Alice | Austin  | TX    | 2025-07-01 | 9999-12-31 | true       |
| 1003           | C-88        | Bob   | Seattle | WA    | 2024-03-01 | 9999-12-31 | true       |
+----------------+-------------+-------+---------+-------+------------+------------+------------+

Key design choices:
  1. customer_sk (surrogate key) is the PK. customer_id is the natural/business key.
  2. eff_to = 9999-12-31 marks the current row. is_current flag is redundant but speeds up queries.
  3. When Alice moves, the old row gets eff_to = 2025-06-30 and is_current = false.
     A new row is inserted with eff_from = 2025-07-01.
  4. Fact tables FK to customer_sk, NOT customer_id. This preserves point-in-time accuracy:
     an order placed when Alice lived in Chicago joins to the Chicago row.

Querying with SCD Type 2

-- Current state of all customers
SELECT * FROM dim_customer WHERE is_current = true;

-- Point-in-time: customer state on a specific date
SELECT * FROM dim_customer
WHERE '2024-06-15' BETWEEN eff_from AND eff_to;

-- Join fact to dimension preserving historical accuracy
SELECT f.order_date, f.total_amount, d.city, d.state
FROM fact_orders f
JOIN dim_customer d ON f.customer_sk = d.customer_sk;

The fact table joins on the surrogate key, not the business key. This means each fact row is permanently linked to the correct version of the customer dimension. If you join on customer_id instead, you get the current address for all orders, which destroys historical accuracy.

Common interview follow-ups
  • "The table is growing too fast." Partition by is_current, archive old rows to cold storage, or increase the granularity of change tracking (only track meaningful changes, not every field update).
  • "How do you handle a late-arriving dimension update?" Backdate eff_from and adjust the overlapping row's eff_to. This requires an UPDATE + INSERT, not just an INSERT.
  • "When would you use Type 3 instead?" When you only need before/after comparison (e.g., pricing tier before and after a promotion) and do not need full history. Type 3 adds a previous_city column instead of new rows.

Myth vs Reality

Five things candidates have been told that are wrong, or wrong in important ways. Each one has cost someone an offer.

The Myth
3NF is the right answer for a warehouse.
The Reality
3NF is a transactional concept. Warehouses denormalize on purpose so analytics queries do not pay for joins on every dashboard load. Knowing 3NF rules is interview table-stakes; applying them blindly to a warehouse design is a fail.
The Myth
Surrogate keys are always better than natural keys.
The Reality
Surrogate is the right default for dim tables (Type 2 history requires it). But natural keys can be the right call for cross-system reconciliation, idempotent loads, and source-of-truth tables where the natural key never changes (ISBN, transaction_id from upstream). The best answer states the rule plus the carve-outs.
The Myth
SCD Type 2 = always the right answer for changing dimensions.
The Reality
Only when downstream queries actually need history. For ML feature stores serving the freshest value at request time, Type 1 is better. For audit-heavy domains, Type 6. For attributes that are many-to-many over time (customer-region), no single SCD type fits and you need a bridge.
The Myth
Star schemas are obsolete.
The Reality
Still the default for BI dashboards on columnar stores. The cheap-storage / cheap-compute argument for OBT does not eliminate the need for clean, denormalized dimensions when self-service users need to filter and group. Star plus aggregation tables remains the boring, reliable answer.
The Myth
Just denormalize everything into one table.
The Reality
Works until it does not. Joins are cheap on columnar stores; rebuilds when the schema drifts are not. Every new column you want to add to a dim attribute touches every historical fact row in the OBT, which is a multi-hour backfill at scale.

Decision matrix: which model for which situation

Eight situations, the schema pattern that fits, and a one-line reason. Internalize this and you have an answer for the most common 'so which would you pick' question.

If your situation is | Pick | Why
BI dashboards, < 10 facts, slowly changing dims | Star + SCD Type 2 | Cleanest grammar for self-service analytics; dim history preserved.
ML feature store, point-in-time correctness needed | OBT or feature-table per entity | Features computed and frozen at training-time row; no SCD complexity.
Operational reporting (real-time), low-latency dashboards | Materialized view + denormalized cache | Pre-aggregated by the dimensions you filter on; refresh on a streaming cadence.
Compliance audit trail required | SCD Type 6 or full event log | Type 6 keeps current + history + previous in one row; event log is unambiguous and append-only.
Self-service analytics, lots of ad-hoc joins | Star schema, defensive | Predictable join paths; analysts know which dim to filter without consulting the model.
Wide dim that changes in 2 of 50 columns | SCD Type 4 | Avoid duplicating 48 unchanging columns on every history row.
Customer can be in two regions at once | Bridge table with eff dates + weighting | Many-to-many relationships over time cannot be modeled by a single SCD type.
Hierarchy depth is unknown and queries traverse depth | Closure table | O(1) ancestor-descendant lookups beat recursive CTEs at scale; pay storage for query speed.

All Data Modeling Topics by Frequency

Six topic clusters, ordered by how often interviewers reach for them. Each cluster lists the questions you should be ready to answer cold.

Topic 1

Normalization (1NF through 3NF)

Difficulty: Medium

Frequency: Very High

Normalization is the foundation of relational design. Interviewers test whether you can identify violations, explain why they matter, and know when denormalization is the right call.
Q1
Given a table with repeating groups in a single column (comma-separated tags), normalize it to 1NF. Explain what problem the original design creates for queries.
Q2
Identify the partial dependency in a table with a composite primary key and decompose it into 2NF.
Q3
A table stores employee_name, department_name, and department_location. Identify the transitive dependency and normalize to 3NF.
Q4
When would you intentionally denormalize a 3NF schema? Give a specific scenario with the trade-offs.
Topic 2

Star vs Snowflake Schemas

Difficulty: Medium

Frequency: Star schema: 4.7% of modeling questions

Star schema appears in 4.7% of data modeling questions, with fact tables (4.7%) and dimension tables (4.2%) closely behind. Interviewers want you to design fact and dimension tables, choose the right grain, and justify star vs snowflake trade-offs.
Q1
Design a star schema for an e-commerce platform. Define the fact table grain and list 4 dimension tables with their key attributes.
Q2
Convert a star schema into a snowflake schema by normalizing the product dimension. What queries become harder? What storage do you save?
Q3
Your fact table has 500M rows. A dimension table has 50M rows. Should you use a star or snowflake schema? Explain your reasoning based on query patterns.
Q4
What is a degenerate dimension? Give an example and explain why it lives in the fact table instead of a separate dimension.
Topic 3

Slowly Changing Dimensions (SCD Types 1-3)

Difficulty: Medium-Hard

Frequency: High

Tracking historical changes in dimension data is a classic interview topic. Interviewers test whether you understand the trade-offs between SCD types and can implement them correctly.
Q1
A customer changes their address. Show how SCD Type 1, Type 2, and Type 3 each handle this change. Which columns does each approach add or modify?
Q2
Design an SCD Type 2 implementation for a product dimension where price changes frequently. Include the surrogate key, natural key, effective dates, and current flag.
Q3
Your team uses SCD Type 2 for a dimension with 10M rows. After 3 years, the table has grown to 80M rows. What strategies can reduce the size without losing history?
Q4
When would you choose SCD Type 3 over Type 2? Describe a real scenario where storing only the previous value is sufficient.
Topic 4

Entity-Relationship Modeling

Difficulty: Medium

Frequency: Entity ID: 6.6% of modeling questions

Entity identification is the most common data modeling topic at 6.6%. ER diagrams are the starting point for most schema design interviews. You draw entities, relationships, and cardinalities. Then the interviewer asks follow-up questions about edge cases.
Q1
Draw an ER model for a university system: students, courses, professors, and enrollments. Define cardinalities for each relationship.
Q2
Model a social network where users can follow other users, create posts, and like posts. How do you handle the self-referencing follow relationship?
Q3
Design an ER model for a hotel booking system. Handle the case where a single booking can span multiple rooms and multiple nights.
Q4
Your ER model has a many-to-many relationship between orders and products. The intersection table needs to store quantity and unit_price. Why is unit_price stored here instead of on the product table?
Topic 5

Keys and Constraints

Difficulty: Easy-Medium

Frequency: Primary keys 5.9%, Foreign keys 4.7%

Primary keys (5.9%) and foreign keys (4.7%) are core modeling topics. Interviewers probe whether you understand why each constraint exists and what breaks when you skip one.
Q1
When should you use a surrogate key (auto-increment ID) vs a natural key (email, SSN)? Give one scenario for each.
Q2
A table has a composite primary key of (user_id, event_date). Can you add a foreign key from another table to this composite key? What are the implications?
Q3
Your schema has no foreign key constraints for performance reasons. How do you ensure referential integrity? What is the cost of skipping enforcement?
Topic 6

Trade-off Reasoning

Difficulty: Hard

Frequency: Very High

The hardest part of a data modeling interview is not the design itself. It is defending your choices when the interviewer pushes back. Every design has trade-offs, and you need to articulate them clearly.
Q1
You denormalized a dimension to speed up a critical dashboard query. The interviewer asks: what happens when the source data changes? How do you keep the denormalized copy in sync?
Q2
Your schema uses a single wide fact table with 200 columns. The interviewer suggests splitting it into multiple fact tables. Argue for and against each approach.
Q3
A stakeholder wants real-time data in the warehouse. Your current design uses daily batch loads. What schema changes would you make to support streaming ingestion without breaking existing queries?
Q4
You chose a star schema, but a new requirement needs a query that joins 5 dimension tables. The query is slow. What are your options? Which one do you recommend and why?

How to Defend Your Schema Design Decisions

When the interviewer pushes back, your job is to explain why you chose what you chose. Four moves that turn a defensive answer into a confident one.

Step 1

Name the trade-off explicitly

"I chose to denormalize the product category onto the fact table. This costs storage and creates update complexity, but it eliminates a JOIN on every dashboard query, which matters at our query volume."
Step 2

Reference the query patterns

Every design decision should tie back to how the data will be queried. "Analysts will filter by date range and group by region, so I put both on the fact table as foreign keys to keep those queries single-scan."
Step 3

Acknowledge what you are giving up

Interviewers respect honesty. "This design does not handle the case where a product belongs to multiple categories. If that requirement appears later, I would add a bridge table. But for now, the simpler design serves the stated requirements."
Step 4

Have a fallback ready

When the interviewer pushes back, do not defend your original design to the death. Say: "If the requirement changes to X, I would restructure this table by Y. Here is what that migration looks like."

Practice the data modeling patterns interviewers actually test

Six real practice problems on DataDriven, free. Each one is a scenario interviewers reach for: history tracking, SCD changes, m:n invoicing, balance accuracy, network ripple, late-arriving sensor events.

Data Modeling Interview FAQ

Why is data modeling the hardest interview round?
Because there is no single correct answer. SQL and Python have objectively right or wrong output. Data modeling has trade-offs. The interviewer is evaluating your reasoning process, not just the final schema. You need to articulate why you made each choice and what you would change if requirements shifted.
How do I practice data modeling without a study partner?
Design a schema for a business you know well: your favorite app, your company, an e-commerce site. Write down 3 trade-offs for each design decision. Then change a requirement (add real-time, add a new entity, increase scale 100x) and see how your schema holds up. DataDriven offers interactive schema building where you create tables, define relationships, and validate your design.
Do I need to memorize all SCD types for interviews?
Know Types 1, 2, and 3 cold. Type 1 overwrites, Type 2 adds rows with date ranges, Type 3 adds a column for the previous value. Types 4-7 exist but are rarely asked. If you can explain the trade-offs between Types 1-3 with a concrete example, you are well prepared.
Star schema or snowflake schema: which should I default to in interviews?
Start with star schema unless you have a specific reason to snowflake. Star schemas are simpler, faster for queries, and easier to explain. If the interviewer asks about storage efficiency or dimension table normalization, that is your cue to discuss snowflake. Always frame it as a trade-off, not a rule.
How long should a data modeling interview answer take?
Spend 5 minutes asking clarifying questions about the business requirements. Spend 15 minutes drawing the schema. Spend the remaining 20-25 minutes defending your design and discussing alternatives. Interviewers value the discussion more than the diagram.
What is the difference between a fact table and a dimension table?
A fact table stores measurable events (orders, clicks, transactions) at a specific grain. Each row is one event. Columns are numeric measures (quantity, amount, duration) and foreign keys to dimension tables. A dimension table stores descriptive attributes (customer name, product category, date components). Dimensions change slowly; facts accumulate quickly.
What is a surrogate key and when should I use one?
A surrogate key is a system-generated, meaningless integer (or UUID/hash) used as the primary key of a table, independent of business values. Use surrogate keys in dimensional warehouses (SCD type 2 requires them because the natural key is no longer unique once you add date ranges), when the natural key is composite or string-heavy, or when the natural key may change over time.
What is the difference between a primary key and a foreign key?
A primary key uniquely identifies each row in a table (NOT NULL + UNIQUE constraint). A foreign key is a column that references the primary key of another table to enforce relationships. In OLTP systems both are heavily used; in modern cloud warehouses, primary keys are still defined for documentation and modeling tools, but foreign key constraints are often disabled or skipped for write performance.
What is a data model and what types are tested in interviews?
A data model is a structured definition of how data is organized, related, and constrained. Three layers are commonly tested: conceptual (entities and relationships, no implementation detail), logical (tables, columns, keys, normalization), and physical (engine-specific: indexes, partitioning, distribution keys). Most interviews focus on logical modeling with occasional physical-modeling questions for senior roles.
What are the most common data modeling tools used in industry?
ER/Studio, ERwin, and Lucidchart are common for enterprise and on-premise. dbdiagram.io and DrawSQL are popular for cloud and modern stack work. Within data warehouses, Looker (LookML), dbt models, and Cube.dev are de-facto modeling layers. For interviews, you mostly draw on a whiteboard or in a shared doc; tool fluency rarely matters as much as model reasoning.
What is an entity-relationship diagram (ERD) and why does it matter?
An ERD visualizes entities (tables) and the relationships between them (one-to-many, many-to-many, etc.) along with key constraints. Most data modeling interviews require you to sketch one. Mastering the notation (crow's foot, Chen) and being able to translate business requirements into entities and relationships in real time is the central skill being tested.
How do you handle late-arriving facts in an SCD Type 2 model?
Look up the dimension version that was current at the event timestamp, not the row that is current today. If the dim is materialized eagerly with effective_from and effective_to, an as_of_timestamp join (BETWEEN eff_from AND eff_to) gets you the right surrogate key. If the change came in late and overlaps a currently-active row, you split the existing row by closing it at the new event date and inserting a new version starting from that date. The naive approach (re-applying the latest dim row to all backfilled facts) is wrong and is the classic interview-failing answer.
What is the grain of a fact table and why does it matter so much?
The grain is one sentence that tells you what one row in the fact represents. 'One order line per product per customer per day.' If you cannot say that sentence, you do not have a model. Grain determines what every measure means, what dimensions can sit on the table, and which questions the table can answer. Mid-level and senior interviews almost always probe grain before they probe anything else.
When is one-big-table (OBT) the right answer over a star schema?
OBT works when downstream users mostly query a single denormalized surface (ML feature stores, ad-hoc Looker explores on a columnar warehouse, BI tools that flatten joins anyway), and when the storage cost of duplicating dimension attributes is cheap compared to the join cost. It breaks when the same fact rows must support many different audience cuts, when SCD history must be enforced, or when downstream tools need a clean dimension to filter by. The 2026 take: in a columnar warehouse with cheap storage and pre-computed micro-partitions, OBT is more defensible than it was in 2010, but it is still the wrong default for self-service analytics.
What is a bridge table and when do you reach for one?
A bridge table sits between a fact and a dimension, or between two dimensions, when the relationship is many-to-many. Customer-to-account in a household model. Patient-to-clinician in healthcare. Order-to-promotion when one order can have multiple promos. The bridge stores the pair plus an optional weighting factor when measures need to be allocated. Without the bridge you either lose rows (inner join collapses), double-count (cartesian product), or invent fake parent-child hierarchies that mislead the business.

Practice Data Modeling Interactively

Build schemas, define relationships, and validate your designs. The only platform with hands-on data modeling practice.

Continue your prep

Data Engineer Interview Prep, explore the full guide

50+ guides covering every round, company, role, and technology in the data engineer interview loop. Grounded in 2,817 verified interview reports across 929 companies, collected from real candidates.

Question Formats