Data Modeling Interview Questions for Data Engineers
About a third of DE loops include a dedicated modeling round, and at senior levels almost every loop does. The round is graded on reasoning, not recall. Below: the six topic clusters interviewers actually pull from, plus three worked walk-throughs (star, normalization, SCD Type 2).
What this guide actually says
The modeling round is the only round you can't pass by memorizing patterns, because interviewers are assessing reasoning, not recall. They score you on grain, not Kimball trivia: if your one-sentence grain statement is sloppy, the rest of the schema doesn't matter. If you can't defend why a column lives on the fact and not the dimension, you fail. SCD Type 2 is the median question; how you handle late-arriving facts separates mid-level from senior. Star schema is the 2026 default; OBT is increasingly defensible on columnar warehouses; snowflake is rare.
What grain actually means, and why interviewers care
Most candidates flunk this round by being imprecise about grain. Grain is the one sentence that says what one row in the fact table represents.
"One row per customer."
Useless. Customers do things over time. This grain answers no useful question. Most candidates start here, get pushed, and never recover.
"One row per order."
Better, but incomplete. Is a partial shipment one row or two? Is a returned line item subtracted from this row or written as a separate row? The interviewer will ask.
"One row per order line item."
Now the table answers product-level questions. Median answer that gets you through most interviews if you can defend it.
"One row per order line item per fulfillment event."
Event grain. Splits, returns, partial cancellations all become first-class rows. Measures stay additive across the lifecycle. The answer that signals you've shipped this before.
"One row per order line item per fulfillment event, with the dim version effective at the event timestamp captured as a foreign key."
Grain plus explicit tie to the right SCD Type 2 dim version, so late-arriving facts and dimension drift don't corrupt historical reporting.
SCD types: when each one is wrong
Most candidates know 'Type 2 = history.' Push past that. The question is when each type is the wrong choice and what you reach for instead.
Overwrite. When it's right.
Use when downstream queries genuinely don't care about history. Typo corrections. Email fixes after a bounce. ML feature stores where the feature is computed at request time and you want the latest value. Wrong when any downstream user might say 'show me what this customer looked like at the time of the order.'
Default for analytics.
Add a new row, close the old one with eff_to. Surrogate key on the dim, fact tables join on the surrogate. Median right answer in 80% of interviews. The differentiator: late-arriving facts, CDC sources, and how you avoid duplicate effective windows.
Rarely the right call.
Stores only the previous value (previous_city, previous_tier). Rarely right because most analytics needs more than one version back. Pick only when the business explicitly says 'I only ever compare current vs prior' (price tier before/after a promotion). Default to Type 2.
Dim plus history table.
Current state in a narrow, fast dim; history in a separate table. Useful when the dim is wide (50+ attributes) but only 2-3 change. Saves storage vs Type 2. Cost: queries needing history must know about the second table.
1 + 2 + 3, the auditor's pick.
Overwriting current attributes (Type 1), historical rows with effective dating (Type 2), and a current_value column on every historical row (Type 3-style). When auditors need 'what was the value when this happened, AND what is the current value of the same attribute, in one row.'
Many-to-many over time.
When a customer can be in two regions at once or held by two relationship managers simultaneously, no single SCD type fits. A bridge table with eff_from/eff_to per assignment, optionally with a weighting factor, expresses the relationship cleanly. Often missed by candidates who default to 'just use Type 2.'
Late-arriving facts: where candidates lose the offer
The single biggest differentiator between mid-level and senior. Walk through the scenario, the naive fix, and the right fix.
The setup
Your fact table loads sensor pings every 15 minutes. A backlog clears, and 50,000 events from 11 days ago arrive at once. Your dim_sensor has had two SCD Type 2 changes in those 11 days (recalibration, gantry reassignment). Load the late events without corrupting historical reporting.
What most candidates say
'Just look up the current sensor surrogate key.' Wrong. The car passed under the old calibration. Joining to the current sk attributes the event to a sensor configuration that didn't exist when the event happened. Every report aggregated by calibration_date breaks.
If your dim materializes effective ranges
Use an as_of_timestamp join: fact.event_ts BETWEEN dim.eff_from AND dim.eff_to. The query naturally pulls the version current at the event timestamp. Cheap on a columnar warehouse with clustering on (sensor_id, eff_from).
If your dim is a snapshot table
If you only persist the current version, reconstruct historical state from a change log (CDC). Walk back from the current row using changes ordered by event_ts to get the as-of value. More expensive per query but no storage cost in the dim.
What you give up
Eager (Type 2 with explicit ranges) costs storage and write complexity. Lazy (CDC log replay) costs query time and infra complexity. The senior signal is naming both options, picking one, and stating which downstream use-case made you pick it.
When candidates lose the offer
Can't articulate why joining on the natural key is wrong: fail. Proposes 'just reload everything every night': fail (doesn't scale, breaks late-arriving fact contracts upstream). Waves hands and says 'we use Snowflake's time travel': fail (time travel is bounded retention, not a modeling pattern).
Star vs snowflake vs OBT
On a 2026 columnar warehouse with cheap storage and compute, OBT is more defensible than it was when Kimball wrote the books. Star is still the right default. Snowflake is rare.
| Property | Star Schema | Snowflake Schema | One Big Table |
|---|---|---|---|
| Query simplicity | High | Medium | High (single table) |
| Query perf on columnar | Strong | Weaker (deeper joins) | Strong (no joins) |
| ETL complexity | Moderate | Highest | Highest (one wide table to maintain) |
| SCD handling | Clean (Type 2 on dim) | Same as star, more places to coordinate | Awkward (rebuilds expensive) |
| Storage cost | Moderate | Lowest | Highest (dim attrs duplicated) |
| Self-service analytics | Best | Worse (more tables to model) | Variable (works for one cut, fails for many) |
| When it goes wrong | Cross-fact joins, conformed-dim drift | Anywhere join chain exceeds 3 levels | Schema drift, multi-audience filtering |
Primary, foreign, surrogate, natural keys
When to use each. Key choice questions appear in nearly every modeling interview.
| Key type | What it is | Use when |
|---|---|---|
| Natural key | A real-world identifier (email, SSN, ISBN) | Source-of-truth tables where the identifier won't change |
| Surrogate key | System-generated meaningless integer (auto-increment, UUID, hash) | Dimensional warehouses, especially SCD Type 2 where natural key alone is not unique |
| Primary key | Constraint that uniquely identifies a row (NOT NULL + UNIQUE) | Every persisted table; choose surrogate or natural based on stability |
| Foreign key | Constraint enforcing a row references a primary key in another table | OLTP systems where referential integrity matters; often omitted in cloud warehouses for performance |
What interviewers actually score on
Six questions you'll hear, with answer outlines that separate strong from senior.
Design a schema for an Uber-like ride-sharing platform.
Strong answers ask clarifying questions first: which audience (operations, finance, ML)? What grain (one ride, one ride-leg, one driver-shift)? Then sketch fact_ride at one-row-per-ride grain plus dim_rider, dim_driver, dim_vehicle, dim_geography (pickup AND dropoff as separate FKs), dim_date, dim_promo. Senior answers explicitly call out SCD Type 2 on dim_driver (rating, vehicle, status all change) and a separate fact_ride_lifecycle event-grain table for ETA, accept, cancel events.
Customer with three addresses, two emails, merged into another customer.
Clarify 'three addresses.' Three current (billing, shipping, residential)? bridge_customer_address with role + eff_from/eff_to. Three historical? SCD Type 2 on dim_customer. The merge: a customer-survivor table mapping merged_customer_id → surviving_customer_sk so historical facts still resolve. Most candidates miss the merge survivor table and lose history.
What's the grain of your fact table? Defend it.
Subject + qualifier + temporal scope. 'One row per shipment line per warehouse-event per ship-date.' Defend by enumerating questions the grain answers and questions a coarser grain would not. If you can't list two questions a coarser grain would fail, your grain is over-granular.
Walk through a Type 2 dimension. What goes wrong with late-arriving facts?
Walk through as_of_timestamp join (fact.event_ts BETWEEN dim.eff_from AND dim.eff_to). Then the failure mode: a fact arrives with event_ts older than every active eff_to in the dim because the dim's history wasn't backfilled deeply enough. Fix: extend the dim's effective windows backwards (UPDATE the earliest row's eff_from to the global minimum), or insert a synthetic 'unknown' dim row for events before recorded history.
Star or snowflake or one big table? When each?
Star is the default for BI dashboards on a columnar store. Snowflake when dim hierarchies have rich attributes that change at different rates (geography: country attrs change every decade, city yearly, neighborhood monthly). OBT when one dominant audience uses one denormalized surface and storage is cheap. Tiebreaker: 'I default to star unless someone tells me a specific reason to snowflake.'
How do you model a hierarchy where the depth is unknown?
Two options. Adjacency list (parent_id self-reference) is simple but recursive queries are expensive. Closure table (one row per ancestor-descendant pair with depth) makes hierarchical queries O(1) at the cost of storage and write complexity. Bridge tables can express hierarchies that aren't strict trees (employees with two managers). State which option you'd pick for the asked use case and why.
Worked example: e-commerce star schema
dim_customer dim_product
+-----------------+ +------------------+
| customer_id PK | | product_id PK |
| name | | name |
| email | | category |
| segment | | subcategory |
| region | | brand |
| signup_date | | unit_cost |
+-----------------+ +------------------+
| |
| fact_order_items |
| +--------------------+ |
+--| customer_id FK |---+
| product_id FK |
+---------| date_id FK |---------+
| | order_id DD | |
| | order_item_id PK | |
| |--------------------| |
| | quantity | |
| | unit_price | |
| | discount_amount | |
| | total_amount | |
| +--------------------+ |
| |
+--------------+ +-----------------+
| date_id PK | | promo_id PK |
| full_date | | promo_name |
| day_of_week | | discount_pct |
| month | | start_date |
| quarter | | end_date |
| year | | promo_type |
| is_holiday | +-----------------+
+--------------+ dim_promotion
dim_dateGrain: one row per order line item. Trade-offs to state in the interview: SCD strategy on dim_customer (Type 1 vs Type 2), aggregation tables for dashboard queries, and why unit_price lives on the fact (prices change over time; historical revenue must use the price at sale).
Worked example: normalization from 1NF to 3NF
Original (unnormalized) - comma-separated products violates 1NF:
orders_raw
+----------+------------+---------+-----------+--------------------+
| order_id | order_date | cust_id | cust_name | products |
+----------+------------+---------+-----------+--------------------+
| 1001 | 2025-03-15 | C1 | Alice | Widget, Gadget |
| 1003 | 2025-03-16 | C1 | Alice | Gadget, Thingamajig|
+----------+------------+---------+-----------+--------------------+
Step 1 - 1NF: split repeating groups so every cell is atomic.
order_items_1nf PK: (order_id, product)
+----------+------------+---------+-----------+-------------+
| order_id | order_date | cust_id | cust_name | product |
+----------+------------+---------+-----------+-------------+
| 1001 | 2025-03-15 | C1 | Alice | Widget |
| 1001 | 2025-03-15 | C1 | Alice | Gadget |
+----------+------------+---------+-----------+-------------+
Step 2 - 2NF: split off attrs that depend on only part of the composite PK.
orders_2nf (PK: order_id) order_items_2nf
+----------+------------+---------+ +----------+-------------+
| order_id | order_date | cust_id | | order_id | product |
| 1001 | 2025-03-15 | C1 | | 1001 | Widget |
+----------+------------+---------+ +----------+-------------+
Step 3 - 3NF: split off transitive deps (cust_name depends on cust_id, not order_id).
customers_3nf orders_3nf order_items_3nf
+---------+--------+ +----------+---------+ +----------+----------+
| cust_id | name | | order_id | cust_id | | order_id | product |
| C1 | Alice | | 1001 | C1 | | 1001 | Widget |
+---------+--------+ +----------+---------+ +----------+----------+The interviewer hands you a denormalized table and expects you to identify violations at each normal form and fix them step by step.
Worked example: SCD Type 2 dimension
dim_customer (SCD Type 2)
+----------------+-------------+---------+----------+-------+------------+------------+------------+
| customer_sk PK | customer_id | name | city | state | eff_from | eff_to | is_current |
+----------------+-------------+---------+----------+-------+------------+------------+------------+
| 1001 | C-42 | Alice | Chicago | IL | 2023-01-15 | 2025-06-30 | false |
| 1002 | C-42 | Alice | Austin | TX | 2025-07-01 | 9999-12-31 | true |
| 1003 | C-88 | Bob | Seattle | WA | 2024-03-01 | 9999-12-31 | true |
+----------------+-------------+---------+----------+-------+------------+------------+------------+
Key design choices:
1. customer_sk (surrogate) is PK. customer_id is the natural/business key.
2. eff_to = 9999-12-31 marks the current row. is_current is redundant but speeds up queries.
3. When Alice moves: old row gets eff_to = 2025-06-30 and is_current = false; new row inserted.
4. Fact tables FK to customer_sk, NOT customer_id, preserving point-in-time accuracy.When Alice moves: close the old row with eff_to and is_current=false; insert a new row. Fact tables join on customer_sk, not customer_id, which preserves point-in-time accuracy. Joining on customer_id instead destroys historical accuracy, the classic interview-failing mistake.
Querying an SCD Type 2 dimension
-- Current state of all customers
SELECT * FROM dim_customer WHERE is_current = true;
-- Point-in-time: customer state on a specific date
SELECT * FROM dim_customer
WHERE '2024-06-15' BETWEEN eff_from AND eff_to;
-- Join fact to dim preserving historical accuracy
SELECT f.order_date, f.total_amount, d.city, d.state
FROM fact_orders f
JOIN dim_customer d ON f.customer_sk = d.customer_sk;Myth vs reality
Myth: 3NF is the right answer for a warehouse
Reality: 3NF is a transactional concept. Warehouses denormalize on purpose so analytics queries don't pay for joins on every dashboard load. Knowing 3NF rules is table-stakes; applying them blindly to warehouse design is a fail.
Myth: Surrogate keys are always better than natural keys
Reality: surrogate is the right default for dim tables (Type 2 history requires it). Natural keys can be right for cross-system reconciliation, idempotent loads, and source-of-truth tables where the natural key never changes (ISBN, upstream transaction_id). The best answer states the rule plus the carve-outs.
Myth: SCD Type 2 is always right for changing dimensions
Reality: only when downstream queries actually need history. For ML feature stores serving the freshest value at request time, Type 1. For audit-heavy domains, Type 6. For attributes many-to-many over time (customer-region), no single SCD type fits, and a bridge table is the right answer.
Myth: Star schemas are obsolete
Reality: still the default for BI dashboards on columnar stores. Cheap storage and compute don't eliminate the need for clean denormalized dimensions when self-service users filter and group. Star plus aggregation tables remains the boring, reliable answer.
Myth: Just denormalize everything into one table
Reality: works until it doesn't. Joins are cheap on columnar stores; rebuilds when the schema drifts are not. Every new column on a dim attribute touches every historical fact row in the OBT, which means a multi-hour backfill at scale.
Decision matrix
Eight situations, the schema pattern that fits, and a one-line reason.
| Situation | Pick | Reason |
|---|---|---|
| BI dashboards, < 10 facts, slowly changing dims | Star + SCD Type 2 | Cleanest grammar for self-service; dim history preserved. |
| ML feature store, point-in-time correctness needed | OBT or feature-table per entity | Features computed and frozen at training-time row; no SCD complexity. |
| Operational reporting, low-latency dashboards | Materialized view + denormalized cache | Pre-aggregated by the dimensions you filter on; refresh on a streaming cadence. |
| Compliance audit trail required | SCD Type 6 or full event log | Type 6 keeps current + history + previous in one row; event log is unambiguous and append-only. |
| Self-service analytics, lots of ad-hoc joins | Star schema, defensive | Predictable join paths; analysts know which dim to filter without consulting the model. |
| Wide dim that changes in 2 of 50 columns | SCD Type 4 | Avoid duplicating 48 unchanging columns on every history row. |
| Customer can be in two regions at once | Bridge table with eff dates + weighting | Many-to-many over time can't be modeled by a single SCD type. |
| Hierarchy depth unknown, queries traverse depth | Closure table | O(1) ancestor-descendant lookups beat recursive CTEs at scale; pay storage for query speed. |
All modeling topics by frequency
Six topic clusters, ordered by how often interviewers reach for them. Each cluster lists the questions you should answer cold.
Normalization (1NF through 3NF)
Frequency: Almost every junior or mid loop The textbook layer. Spot a 1NF violation in a denormalized table, walk it to 3NF, then defend why a warehouse usually denormalizes back. Knowing the rules is the floor; knowing when to break them is the answer. Questions: 1. Given a table with repeating groups in a single column (comma-separated tags), normalize to 1NF. Explain what problem the original design creates for queries. 2. Identify the partial dependency in a table with a composite primary key and decompose it into 2NF. 3. A table stores employee_name, department_name, and department_location. Identify the transitive dependency and normalize to 3NF. 4. When would you intentionally denormalize a 3NF schema? Give a specific scenario with the trade-offs.
Star vs Snowflake Schemas
Frequency: Star schema explicit in ~5% of questions; fact and dim talk in nearly all Almost every modeling round assumes the star vocabulary: pick a grain, sketch a fact, hang dimensions off it. The part interviewers score is whether you can defend why a column lives on the fact and not the dim, and when snowflake or OBT becomes the right call. Questions: 1. Design a star schema for an e-commerce platform. Define the fact table grain and list 4 dimension tables with their key attributes. 2. Convert a star schema into a snowflake schema by normalizing the product dimension. What queries become harder? What storage do you save? 3. Your fact table has 500M rows. A dimension table has 50M rows. Star or snowflake? Explain your reasoning based on query patterns. 4. What is a degenerate dimension? Give an example and explain why it lives in the fact table.
Slowly Changing Dimensions (SCD Types 1-3)
Frequency: The single most common modeling question The address-change-on-a-customer question, in some form, in almost every interview. Type 2 with effective dating is the median right answer. Late-arriving facts, dim growth, and when Type 4 or Type 6 fits better are where mid and senior part ways. Questions: 1. A customer changes their address. Show how SCD Type 1, Type 2, and Type 3 each handle this change. 2. Design an SCD Type 2 implementation for a product dimension where price changes frequently. 3. Your SCD Type 2 dimension grew from 10M to 80M rows over 3 years. What strategies reduce size without losing history? 4. When would you choose SCD Type 3 over Type 2? Describe a real scenario where storing only the previous value is sufficient.
Entity-Relationship Modeling
Frequency: About 7% of modeling questions Draw boxes and lines. Questions look easy until the interviewer adds a self-referencing relationship or a many-to-many with an attribute on the join. Candidates who pass slow down and name the cardinality before they draw. Questions: 1. Draw an ER model for a university system: students, courses, professors, enrollments. Define cardinalities for each relationship. 2. Model a social network where users can follow other users, create posts, and like posts. How do you handle the self-referencing follow relationship? 3. Design an ER model for a hotel booking system. Handle the case where one booking spans multiple rooms and multiple nights. 4. Your many-to-many between orders and products has an intersection table with quantity and unit_price. Why is unit_price stored here instead of on the product table?
Keys and Constraints
Frequency: PKs in ~6%, FKs in ~5% of modeling questions Surrogate vs natural is the most-asked key question. Honest answer: most warehouses surrogate everything by default. The carve-outs are where it gets interesting: source-of-truth tables, idempotent loads, cross-system reconciliation. Questions: 1. When should you use a surrogate key vs a natural key? Give one scenario for each. 2. A composite primary key of (user_id, event_date). Can you add a foreign key from another table to this composite? Implications? 3. Your schema has no foreign key constraints for performance reasons. How do you ensure referential integrity? Cost of skipping enforcement?
Trade-off Reasoning
Frequency: The back half of every senior loop Drawing the schema is the easy part. The interviewer rarely cares about your initial answer. They care what happens when they add a requirement that breaks it. Questions: 1. You denormalized a dimension to speed up a critical dashboard. What happens when source data changes? How do you keep the denormalized copy in sync? 2. Your schema uses a single wide fact table with 200 columns. The interviewer suggests splitting it. Argue for and against each approach. 3. A stakeholder wants real-time data in the warehouse. Your current design uses daily batch. What schema changes support streaming without breaking existing queries? 4. You chose a star schema, but a new requirement needs a query joining 5 dimensions. The query is slow. Options? Recommendation?
Data modeling interview FAQ
Why is the modeling round so much harder than SQL or Python?+
Star or snowflake by default?+
Do I need to memorize all the SCD types?+
What's the grain of a fact table and why does it matter so much?+
How do you handle late-arriving facts in an SCD Type 2 model?+
What is a bridge table and when do you reach for one?+
When is one-big-table (OBT) the right answer over a star schema?+
Primary key versus foreign key in the warehouse era?+
Practice a few of these on the canvas
- 01
Active recall beats re-reading by 50%
Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom
- 02
76% of hiring managers reject on the coding task, not the resume
From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice
- 03
Five problem shapes cover 80% of data engineer loops
Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition