Data Modeling Interview Questions (2026)
1. The data modeling round is the only round that cannot be passed by memorizing patterns. The interviewer is grading reasoning, not recall.
2. Interviewers grade you on grain, not on Kimball trivia. If your one-sentence grain statement is sloppy, the rest of the schema does not matter.
3. If you cannot defend why a column lives on the fact table and not the dimension, you fail. Pricing and quantities are the most common offenders.
4. SCD Type 2 is the median question. The way you handle late-arriving facts is the differentiator that separates mid-level from senior.
5. Star schema is the default in 2026, OBT is increasingly defensible on columnar warehouses, snowflake is rare. Have a one-line reason for each.
6. Bridge tables, closure tables, and many-to-many time-bounded relationships are where senior candidates separate from the pack.
TL;DR: Data Modeling Interview Questions
The six topics interviewers reach for most often, and what you need to actually do under time pressure.
What grain actually means and why interviewers care
Most candidates flunk this round by being imprecise about grain. Grain is the one sentence that says what one row in the fact table represents. If you cannot say it cleanly, every downstream answer is shaky.
Three required pieces: subject, qualifier, temporal scope.
- Subject: the noun being measured (order line item, sensor ping, payment, login). Concrete, not abstract.
- Qualifier: the modifier that disambiguates. "Per fulfillment event," "per attempted vs successful," "per warehouse location."
- Temporal scope: the time grain. Per day. Per event timestamp. Per hour bucket. Without this, summing measures will silently double-count or under-count.
Five grain statements ranked from bad to senior
The same scenario, articulated five ways. The last three are interview-passing. Only the final one is what senior candidates say.
"One row per customer."
"One row per order."
"One row per order line item."
"One row per order line item per fulfillment event."
"One row per order line item per fulfillment event, with the dim version effective at the event timestamp captured as a foreign key."
“The interviewer doesn't want to see your favorite schema. They want to see whether you can defend its weakest point under pressure.”
SCD Types: when each one is wrong
Most candidates know 'Type 2 = history.' Push past that. The question is when each type is the wrong choice and what you reach for instead.
- Type 1: overwrite. Know when it is right.
- Type 2: the default for analytics.
- Type 3: rarely the right call. Know when it is NOT.
- Type 4: dim plus history table.
- Type 6: 1 + 2 + 3, the auditor's pick.
- Beyond the numbered types: many-to-many over time, which no single SCD type expresses (see bridge tables below).
Late-arriving facts and how candidates lose the offer
The single biggest differentiator between mid-level and senior in the data modeling round. Walk through the scenario, the naive fix, and the right fix.
The setup
What most candidates say
If your dim materializes effective ranges
If your dim is a snapshot table
What you give up
When candidates lose the offer
Scenario: a sensor event arrives 11 days late.

Event timestamp: 2025-06-15 (the day the car passed sensor 17)
Arrived at warehouse: 2025-06-26

Current dim_sensor rows for sensor 17:

+------+-----------+--------+------------+------------+------------+
| sk   | sensor_id | gantry | calibrated | eff_from   | eff_to     |
+------+-----------+--------+------------+------------+------------+
| 4471 | S-17      | G-W04  | 2024-09-01 | 2024-09-01 | 2025-06-21 |
| 4719 | S-17      | G-W04  | 2025-06-22 | 2025-06-22 | 9999-12-31 |
+------+-----------+--------+------------+------------+------------+

The naive answer: join on the latest sensor_sk for S-17 (4719). Wrong. The car passed under the OLD calibration (sk 4471).

The right answer: an as-of-timestamp join:

fact.event_timestamp BETWEEN dim.eff_from AND dim.eff_to

This pulls sk 4471 for the 2025-06-15 event and sk 4719 for events on or after 2025-06-22.
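As a sketch, the as-of join looks like this; dim_sensor matches the table above, while fact_sensor_events and its columns are assumed for illustration:

```sql
-- As-of join: pick the dim version effective at the event timestamp,
-- regardless of when the row arrived at the warehouse.
SELECT f.event_timestamp,
       d.sk AS sensor_sk,
       d.calibrated
FROM fact_sensor_events f
JOIN dim_sensor d
  ON  d.sensor_id = f.sensor_id
  AND f.event_timestamp BETWEEN d.eff_from AND d.eff_to;
```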
Star vs Snowflake vs One Big Table
Opinionated take. In 2026, with cheap storage and cheap columnar compute, OBT is more defensible than it was when Kimball wrote the books. Star is still the right default. Snowflake is rare.
| Property | Star Schema | Snowflake Schema | One Big Table (OBT) |
|---|---|---|---|
| Query simplicity | High | Medium | High (single table) |
| Query perf on columnar | Strong | Weaker (deeper joins) | Strong (no joins) |
| ETL complexity | Moderate | Highest to keep consistent (many normalized dims) | Highest to build (one wide table to maintain) |
| SCD handling | Clean (Type 2 on dim) | Same as star, more places to coordinate | Awkward (rebuilds expensive) |
| Storage cost | Moderate | Lowest | Highest (dim attrs duplicated) |
| Self-service analytics | Best | Worse (more tables to model) | Variable (works for one cut, fails for many) |
| When it goes wrong | Cross-fact joins, conformed-dim drift | Anywhere the join chain exceeds 3 levels | Schema drift, multi-audience filtering |
- Query simplicity: OBT wins for a single dominant audience cut. Star wins when multiple audiences need different filter sets. Snowflake loses on simplicity for everyone.
- Query performance on a columnar store: OBT is competitive because columnar pruning ignores unused columns. Star pays one or two cheap dimension joins. Snowflake pays many deeper joins, often with worse selectivity.
- ETL complexity: OBT is the most complex to build (many sources land in one wide table; one schema change touches everything). Star is moderate. Snowflake is the most complex to keep consistent across many normalized dim tables.
- Slowly-changing handling: Star handles SCD Type 2 cleanly via dim surrogate keys. OBT handles SCD by writing the dim attributes onto each fact row (eager); rebuilds become expensive. Snowflake inherits star's behavior, more places to coordinate.
- When OBT goes wrong: when multiple downstream users need to filter by dimension attributes that were not duplicated onto the OBT, when SCD changes require expensive backfills, and when self-service tools need a clean dim to power their dropdowns.
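To make the trade-off concrete, here is the same revenue question in both shapes (illustrative table and column names, not a prescribed design):

```sql
-- Star: the category attribute costs one cheap dimension join.
SELECT p.category, SUM(f.total_amount) AS revenue
FROM fact_order_items f
JOIN dim_product p ON f.product_id = p.product_id
GROUP BY p.category;

-- OBT: category was written onto each row at load time, so no join.
-- The catch: an attribute that was never denormalized in is simply
-- unavailable until a backfill.
SELECT category, SUM(total_amount) AS revenue
FROM obt_order_items
GROUP BY category;
```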
Bridge tables and many-to-many
Skipped in too many study guides. Bridge tables are how you model relationships that a single SCD type cannot express. If the relationship is many-to-many, especially over time, you need a bridge.
Customer-to-Account in a household model
Order-to-Promotion stacking
Patient-to-Clinician care teams
Customer-to-Account (household model)
dim_customer bridge_account_customer dim_account
+-------------+ +------------------------+ +-------------+
| customer_sk |--+ | account_sk FK | +---| account_sk |
| name | +-----<| customer_sk FK |>-+ | account_no |
| email | | role | | open_date |
+-------------+ | weighting_factor | | tier |
+------------------------+ +-------------+
Why the bridge:
- One account can be held by multiple customers (joint, household).
- One customer can hold multiple accounts (checking + savings).
- role is on the bridge: "primary", "joint", "authorized_user".
- weighting_factor allocates account-level measures (balance, fees) when
the business needs to attribute them to individual customers.
Three joint holders with weighting 0.33 each prevents triple-counting.
Without the bridge:
- You either pick a single primary customer (lose other holders),
cartesian-join (triple-count balances), or denormalize into the fact
(still cannot represent the many-to-many cleanly).
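A sketch of the weighted allocation, assuming a hypothetical fact_account_balance at one row per account per day:

```sql
-- Allocate account-level balances to individual customers through
-- the bridge. Without weighting_factor, a three-holder account is
-- counted three times; with it, the shares sum back to the total.
SELECT b.customer_sk,
       SUM(f.balance * b.weighting_factor) AS allocated_balance
FROM fact_account_balance f
JOIN bridge_account_customer b
  ON f.account_sk = b.account_sk
GROUP BY b.customer_sk;
```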
Star vs Snowflake Schema: Decision Matrix
The most common dimensional modeling question. Memorize this table and you can answer 80% of star/snowflake comparisons.
| Property | Star Schema | Snowflake Schema |
|---|---|---|
| Dimensions | Denormalized (single table per dim) | Normalized (broken into multiple tables) |
| Query joins | Few (1 fact + N dims) | Many (deeper join chains) |
| Storage | Higher (redundant attributes) | Lower |
| BI tool friendliness | Excellent | Worse (more tables to model) |
| Update consistency | Lower (denormalized data drifts) | Higher (single source of truth) |
| Best for | Analytics, BI dashboards, modern warehouses | Storage-constrained OLAP, hierarchical attributes |
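The "many joins" row is easiest to defend with an example. In a snowflaked product dimension, category sits two lookups away from the fact (table names illustrative):

```sql
-- Star: category would live on dim_product, one join away.
-- Snowflake: the same question walks a chain of normalized lookups.
SELECT c.category_name, SUM(f.total_amount) AS revenue
FROM fact_order_items f
JOIN dim_product p     ON f.product_id     = p.product_id
JOIN dim_subcategory s ON p.subcategory_id = s.subcategory_id
JOIN dim_category c    ON s.category_id    = c.category_id
GROUP BY c.category_name;
```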
Primary, Foreign, Surrogate Keys: When to Use Each
Key choice questions appear in nearly every data modeling interview. Know the tradeoffs.
| Key type | What it is | Use when |
|---|---|---|
| Natural key | A real-world identifier (email, SSN, ISBN) | Source-of-truth tables where the identifier won't change |
| Surrogate key | System-generated identifier with no business meaning (auto-increment integer, UUID, hash) | Dimensional warehouses, especially SCD Type 2, where the natural key alone is not unique |
| Primary key | Constraint that uniquely identifies a row (NOT NULL + UNIQUE) | Every persisted table; choose the surrogate or natural based on stability |
| Foreign key | Constraint that enforces a row references a primary key in another table | OLTP systems where referential integrity matters; often omitted in cloud warehouses for performance |
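A minimal DDL sketch tying the four rows together (dialect-neutral, names illustrative): the surrogate key is the primary key, the natural key is kept but deliberately not unique under SCD Type 2, and the fact holds a foreign key to the surrogate:

```sql
CREATE TABLE dim_customer (
    customer_sk BIGINT      NOT NULL PRIMARY KEY, -- surrogate key
    customer_id VARCHAR(32) NOT NULL,             -- natural key; repeats across versions
    city        VARCHAR(64),
    eff_from    DATE        NOT NULL,
    eff_to      DATE        NOT NULL
);

CREATE TABLE fact_orders (
    order_id     BIGINT NOT NULL PRIMARY KEY,
    -- the REFERENCES constraint is often omitted in cloud warehouses (see above)
    customer_sk  BIGINT NOT NULL REFERENCES dim_customer (customer_sk),
    total_amount DECIMAL(12, 2)
);
```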
What interviewers actually grade on
The six questions you will hear, with answer outlines that separate strong from senior. The questions are not new. The answers that earn offers are not the obvious ones.
Design a schema for an Uber-like ride-sharing platform.
How would you handle a customer with three addresses, two emails, and a merge into another customer?
What's the grain of your fact table for [scenario]? Defend it.
Walk through a Type 2 dimension with effective dating. What goes wrong with late-arriving facts?
Star or snowflake or one big table? When would you pick each?
How do you model a hierarchy where the depth is unknown?
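For the last question, the two standard answers are an adjacency list flattened with a recursive CTE, or a closure table that pre-materializes every ancestor-descendant pair. An ANSI-style sketch, with dim_employee and its columns assumed:

```sql
-- Flatten a reporting hierarchy of unknown depth.
WITH RECURSIVE org AS (
    SELECT employee_id, manager_id, 1 AS depth
    FROM dim_employee
    WHERE manager_id IS NULL              -- roots of the tree
    UNION ALL
    SELECT e.employee_id, e.manager_id, o.depth + 1
    FROM dim_employee e
    JOIN org o ON e.manager_id = o.employee_id
)
SELECT employee_id, depth FROM org;
```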
Why Data Modeling Is the Hardest Interview Round
Data modeling appears in roughly a third of all data engineering interviews, yet it is the round candidates prepare for the least. Unlike SQL or Python, there is no single correct answer. Every schema design is a series of decisions with consequences, and the interviewer's job is to probe whether you understand those consequences.
There is no single right schema
Requirements change mid-interview
You must think out loud
Worked Example: E-Commerce Star Schema
"Design a star schema for an e-commerce platform. Define the grain of your fact table and list your dimension tables." This question appears in nearly every data modeling interview.
dim_customer dim_product
+-----------------+ +------------------+
| customer_id PK | | product_id PK |
| name | | name |
| email | | category |
| segment | | subcategory |
| region | | brand |
| signup_date | | unit_cost |
+-----------------+ +------------------+
| |
| fact_order_items |
| +--------------------+ |
+--| customer_id FK |---+
| product_id FK |
+---------| date_id FK |---------+
| | order_id DD | |
| | order_item_id PK | |
| |--------------------| |
| | quantity | |
| | unit_price | |
| | discount_amount | |
| | total_amount | |
| +--------------------+ |
| |
+--------------+ +-----------------+
| date_id PK | | promo_id PK |
| full_date | | promo_name |
| day_of_week | | discount_pct |
| month | | start_date |
| quarter | | end_date |
| year | | promo_type |
| is_holiday | +-----------------+
+--------------+ dim_promotion
dim_date

Grain: One row per order line item
Why order_id is a degenerate dimension: it identifies the business transaction but has no descriptive attributes of its own, so it stays on the fact table with no dimension table behind it.
Why unit_price is on the fact table, not dim_product: price is set per transaction (promotions, repricing, negotiated discounts). If it lived on dim_product, every catalog price change would silently rewrite the history of past orders.
- Star vs snowflake: dim_product.category could be normalized into a separate dim_category table (snowflake). This saves storage if categories have many attributes, but adds a JOIN to every product query. At this scale, star is simpler and query performance is better.
- dim_customer SCD strategy: Currently Type 1 (overwrite). If the business needs to analyze orders by the customer's region at the time of purchase, upgrade to SCD Type 2 with effective_from and effective_to columns. Cost: table grows 3-5x over 3 years.
- Aggregation tables: If the dashboard queries daily revenue by category, a pre-aggregated table (daily_category_revenue) avoids scanning 500M fact rows. Cost: refresh latency and storage. Benefit: sub-second dashboard queries.
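A sketch of that pre-aggregate, reusing the worked schema above; the refresh cadence is the cost you would name out loud:

```sql
-- Rollup the dashboard scans instead of the 500M-row fact table.
CREATE TABLE daily_category_revenue AS
SELECT d.full_date,
       p.category,
       SUM(f.total_amount) AS revenue
FROM fact_order_items f
JOIN dim_date d    ON f.date_id    = d.date_id
JOIN dim_product p ON f.product_id = p.product_id
GROUP BY d.full_date, p.category;
```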
Worked Example: Normalization from 1NF to 3NF
"This table has anomalies. Normalize it." The interviewer hands you a denormalized table and expects you to identify the violations at each normal form and fix them step by step.
Original Table (Unnormalized)
orders_raw
+----------+------------+---------+-----------+---------------------+
| order_id | order_date | cust_id | cust_name | products            |
+----------+------------+---------+-----------+---------------------+
| 1001     | 2025-03-15 | C1      | Alice     | Widget, Gadget      |
| 1002     | 2025-03-16 | C2      | Bob       | Widget              |
| 1003     | 2025-03-16 | C1      | Alice     | Gadget, Thingamajig |
+----------+------------+---------+-----------+---------------------+

Problem: the products column contains comma-separated values. This violates 1NF (each cell must hold a single atomic value).
Step 1: First Normal Form (1NF)
Eliminate repeating groups. Each row stores one product per order.
order_items_1nf
+----------+------------+---------+-----------+-------------+
| order_id | order_date | cust_id | cust_name | product     |
+----------+------------+---------+-----------+-------------+
| 1001     | 2025-03-15 | C1      | Alice     | Widget      |
| 1001     | 2025-03-15 | C1      | Alice     | Gadget      |
| 1002     | 2025-03-16 | C2      | Bob       | Widget      |
| 1003     | 2025-03-16 | C1      | Alice     | Gadget      |
| 1003     | 2025-03-16 | C1      | Alice     | Thingamajig |
+----------+------------+---------+-----------+-------------+

PK: (order_id, product) -- composite key

Now 1NF: every cell is atomic, every row is unique.
Step 2: Second Normal Form (2NF)
Eliminate partial dependencies. order_date and cust_name depend only on order_id, not on the full composite key (order_id, product).
orders_2nf order_items_2nf
+----------+------------+---------+-----------+ +----------+-------------+
| order_id | order_date | cust_id | cust_name | | order_id | product |
+----------+------------+---------+-----------+ +----------+-------------+
| 1001 | 2025-03-15 | C1 | Alice | | 1001 | Widget |
| 1002 | 2025-03-16 | C2 | Bob | | 1001 | Gadget |
| 1003 | 2025-03-16 | C1 | Alice | | 1002 | Widget |
+----------+------------+---------+-----------+ | 1003 | Gadget |
PK: order_id | 1003 | Thingamajig |
+----------+-------------+
Now 2NF: no non-key column depends on part of the PK.

Step 3: Third Normal Form (3NF)
Eliminate transitive dependencies. cust_name depends on cust_id, not directly on order_id. Extract it into a separate table.
customers_3nf orders_3nf order_items_3nf
+---------+-----------+ +----------+------------+---------+ +----------+-------------+
| cust_id | cust_name | | order_id | order_date | cust_id | | order_id | product |
+---------+-----------+ +----------+------------+---------+ +----------+-------------+
| C1 | Alice | | 1001 | 2025-03-15 | C1 | | 1001 | Widget |
| C2 | Bob | | 1002 | 2025-03-16 | C2 | | 1001 | Gadget |
+---------+-----------+ | 1003 | 2025-03-16 | C1 | | 1002 | Widget |
PK: cust_id +----------+------------+---------+ | 1003 | Gadget |
PK: order_id FK: cust_id | 1003 | Thingamajig |
+----------+-------------+
Now 3NF: every non-key column depends on the key, the whole key, and nothing but the key.

Worked Example: SCD Type 2 Implementation
"A customer changes their address. Show how you track the history." SCD Type 2 is the most commonly tested slowly changing dimension pattern.
Three SCD types compared
| Type | How it works | Keeps history? | Table growth | Best for |
|---|---|---|---|---|
| Type 1 | Overwrite the old value | No | None | Typo corrections, non-analytical fields |
| Type 2 | Add new row with date range | Full history | 3-5x over 3 years | Address, segment, pricing tier |
| Type 3 | Add column for previous value | One version back | Minimal | Before/after comparisons only |
SCD Type 2: Before and After a Customer Moves
dim_customer (SCD Type 2)
+----------------+-------------+-------+---------+-------+------------+------------+------------+
| customer_sk PK | customer_id | name  | city    | state | eff_from   | eff_to     | is_current |
+----------------+-------------+-------+---------+-------+------------+------------+------------+
| 1001           | C-42        | Alice | Chicago | IL    | 2023-01-15 | 2025-06-30 | false      |
| 1002           | C-42        | Alice | Austin  | TX    | 2025-07-01 | 9999-12-31 | true       |
| 1003           | C-88        | Bob   | Seattle | WA    | 2024-03-01 | 9999-12-31 | true       |
+----------------+-------------+-------+---------+-------+------------+------------+------------+
Key design choices:
1. customer_sk (surrogate key) is the PK. customer_id is the natural/business key.
2. eff_to = 9999-12-31 marks the current row. is_current flag is redundant but speeds up queries.
3. When Alice moves, the old row gets eff_to = 2025-06-30 and is_current = false.
A new row is inserted with eff_from = 2025-07-01.
4. Fact tables FK to customer_sk, NOT customer_id. This preserves point-in-time accuracy:
an order placed when Alice lived in Chicago joins to the Chicago row.

Querying with SCD Type 2
-- Current state of all customers
SELECT * FROM dim_customer WHERE is_current = true;

-- Point-in-time: customer state on a specific date
SELECT * FROM dim_customer
WHERE '2024-06-15' BETWEEN eff_from AND eff_to;

-- Join fact to dimension preserving historical accuracy
SELECT f.order_date, f.total_amount, d.city, d.state
FROM fact_orders f
JOIN dim_customer d ON f.customer_sk = d.customer_sk;
The fact table joins on the surrogate key, not the business key. This means each fact row is permanently linked to the correct version of the customer dimension. If you join on customer_id instead, you get the current address for all orders, which destroys historical accuracy.
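For contrast, the mistake looks like this (assuming the fact row also carries the natural key, which some teams keep for lineage):

```sql
-- WRONG for history: joining on the business key plus is_current
-- stamps today's city onto every historical order.
SELECT f.order_date, d.city
FROM fact_orders f
JOIN dim_customer d
  ON  d.customer_id = f.customer_id
  AND d.is_current = true;
```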
- "The table is growing too fast." Partition by is_current, archive old rows to cold storage, or increase the granularity of change tracking (only track meaningful changes, not every field update).
- "How do you handle a late-arriving dimension update?" Backdate eff_from and adjust the overlapping row's eff_to. This requires an UPDATE + INSERT, not just an INSERT.
- "When would you use Type 3 instead?" When you only need before/after comparison (e.g., pricing tier before and after a promotion) and do not need full history. Type 3 adds a previous_city column instead of new rows.
Myth vs Reality
Five things candidates have been told that are wrong, or wrong in important ways. Each one has cost someone an offer.
Decision matrix: which model for which situation
Eight situations, the schema pattern that fits, and a one-line reason. Internalize this and you have an answer for the most common 'so which would you pick' question.
All Data Modeling Topics by Frequency
Six topic clusters, ordered by how often interviewers reach for them. Each cluster lists the questions you should be ready to answer cold.
Normalization (1NF through 3NF)
Difficulty: Medium
Frequency: Very High
Practice Normalization (1NF through 3NF) questions
Star vs Snowflake Schemas
Difficulty: Medium
Frequency: star schema appears in 4.7% of modeling questions
Practice Star vs Snowflake Schemas questions
Slowly Changing Dimensions (SCD Types 1-3)
Difficulty: Medium-Hard
Frequency: High
Practice Slowly Changing Dimensions (SCD Types 1-3) questions
Entity-Relationship Modeling
Difficulty: Medium
Frequency: entity identification appears in 6.6% of modeling questions
Practice Entity-Relationship Modeling questions
Keys and Constraints
Difficulty: Easy-Medium
Frequency: Primary keys 5.9%, Foreign keys 4.7%
Practice Keys and Constraints questions
Trade-off Reasoning
Difficulty: Hard
Frequency: Very High
Practice Trade-off Reasoning questions
How to Defend Your Schema Design Decisions
When the interviewer pushes back, your job is to explain why you chose what you chose. Four moves that turn a defensive answer into a confident one.
Name the trade-off explicitly
Reference the query patterns
Acknowledge what you are giving up
Have a fallback ready
Practice the data modeling patterns interviewers actually test
Six real practice problems on DataDriven, free. Each one is a scenario interviewers reach for: history tracking, SCD changes, m:n invoicing, balance accuracy, network ripple, late-arriving sensor events.
People move. Sometimes twice in a month. How do you remember where everyone was, and when?
She moved. She upgraded. She became someone new. The record has to keep up.
Invoices go out, partial payments trickle in, and some customers are three months overdue.
Money out, payments back. The balance has to be exact.
One device goes down. The ripple keeps going.
Cars enter, cars exit. Except when they don't.
Data Modeling Interview FAQ
Why is data modeling the hardest interview round?
How do I practice data modeling without a study partner?
Do I need to memorize all SCD types for interviews?
Star schema or snowflake schema: which should I default to in interviews?
How long should a data modeling interview answer take?
What is the difference between a fact table and a dimension table?
What is a surrogate key and when should I use one?
What is the difference between a primary key and a foreign key?
What is a data model and what types are tested in interviews?
What are the most common data modeling tools used in industry?
What is an entity-relationship diagram (ERD) and why does it matter?
How do you handle late-arriving facts in an SCD Type 2 model?
What is the grain of a fact table and why does it matter so much?
When is one-big-table (OBT) the right answer over a star schema?
What is a bridge table and when do you reach for one?
Practice Data Modeling Interactively
Build schemas, define relationships, and validate your designs. The only platform with hands-on data modeling practice.