Data Modeling Interview Questions for Data Engineers

About a third of DE loops include a dedicated modeling round, and at senior levels almost every loop does. The round is graded on reasoning, not recall. Below: the six topic clusters interviewers actually pull from, plus three worked walk-throughs (star, normalization, SCD Type 2).

What this guide actually says

The modeling round is the only round you can't pass by memorizing patterns, because interviewers are assessing reasoning, not recall. They score you on grain, not Kimball trivia: if your one-sentence grain statement is sloppy, the rest of the schema doesn't matter. If you can't defend why a column lives on the fact and not the dimension, you fail. SCD Type 2 is the median question; how you handle late-arriving facts separates mid-level from senior. Star schema is the 2026 default; OBT is increasingly defensible on columnar warehouses; snowflake is rare.

What grain actually means, and why interviewers care

Most candidates flunk this round by being imprecise about grain. Grain is the one sentence that says what one row in the fact table represents.

Bad

"One row per customer."

Useless. Customers do things over time. This grain answers no useful question. Most candidates start here, get pushed, and never recover.

Vague

"One row per order."

Better, but incomplete. Is a partial shipment one row or two? Is a returned line item subtracted from this row or written as a separate row? The interviewer will ask.

Acceptable

"One row per order line item."

Now the table answers product-level questions. Median answer that gets you through most interviews if you can defend it.

Strong

"One row per order line item per fulfillment event."

Event grain. Splits, returns, partial cancellations all become first-class rows. Measures stay additive across the lifecycle. The answer that signals you've shipped this before.

Senior

"One row per order line item per fulfillment event, with the dim version effective at the event timestamp captured as a foreign key."

Grain plus explicit tie to the right SCD Type 2 dim version, so late-arriving facts and dimension drift don't corrupt historical reporting.

SCD types: when each one is wrong

Most candidates know 'Type 2 = history.' Push past that. The question is when each type is the wrong choice and what you reach for instead.

Type 1

Overwrite. When it's right.

Use when downstream queries genuinely don't care about history. Typo corrections. Email fixes after a bounce. ML feature stores where the feature is computed at request time and you want the latest value. Wrong when any downstream user might say 'show me what this customer looked like at the time of the order.'

Type 2

Default for analytics.

Add a new row, close the old one with eff_to. Surrogate key on the dim, fact tables join on the surrogate. Median right answer in 80% of interviews. The differentiator: late-arriving facts, CDC sources, and how you avoid duplicate effective windows.

Type 3

Rarely the right call.

Stores only the previous value (previous_city, previous_tier). Rarely right because most analytics needs more than one version back. Pick only when the business explicitly says 'I only ever compare current vs prior' (price tier before/after a promotion). Default to Type 2.

Type 4

Dim plus history table.

Current state in a narrow, fast dim; history in a separate table. Useful when the dim is wide (50+ attributes) but only 2-3 change. Saves storage vs Type 2. Cost: queries needing history must know about the second table.

Type 6

1 + 2 + 3, the auditor's pick.

Overwriting current attributes (Type 1), historical rows with effective dating (Type 2), and a current_value column on every historical row (Type 3-style). When auditors need 'what was the value when this happened, AND what is the current value of the same attribute, in one row.'

Bridge

Many-to-many over time.

When a customer can be in two regions at once or held by two relationship managers simultaneously, no single SCD type fits. A bridge table with eff_from/eff_to per assignment, optionally with a weighting factor, expresses the relationship cleanly. Often missed by candidates who default to 'just use Type 2.'

Late-arriving facts: where candidates lose the offer

The single biggest differentiator between mid-level and senior. Walk through the scenario, the naive fix, and the right fix.

Scenario

The setup

Your fact table loads sensor pings every 15 minutes. A backlog clears, and 50,000 events from 11 days ago arrive at once. Your dim_sensor has had two SCD Type 2 changes in those 11 days (recalibration, gantry reassignment). Load the late events without corrupting historical reporting.

Naive

What most candidates say

'Just look up the current sensor surrogate key.' Wrong. The car passed under the old calibration. Joining to the current sk attributes the event to a sensor configuration that didn't exist when the event happened. Every report aggregated by calibration_date breaks.

Right (eager)

If your dim materializes effective ranges

Use an as_of_timestamp join: fact.event_ts BETWEEN dim.eff_from AND dim.eff_to. The query naturally pulls the version current at the event timestamp. Cheap on a columnar warehouse with clustering on (sensor_id, eff_from).

Right (lazy)

If your dim is a snapshot table

If you only persist the current version, reconstruct historical state from a change log (CDC). Walk back from the current row using changes ordered by event_ts to get the as-of value. More expensive per query but no storage cost in the dim.

Trade-off

What you give up

Eager (Type 2 with explicit ranges) costs storage and write complexity. Lazy (CDC log replay) costs query time and infra complexity. The senior signal is naming both options, picking one, and stating which downstream use-case made you pick it.

What breaks

When candidates lose the offer

Can't articulate why joining on the natural key is wrong: fail. Proposes 'just reload everything every night': fail (doesn't scale, breaks late-arriving fact contracts upstream). Waves hands and says 'we use Snowflake's time travel': fail (time travel is bounded retention, not a modeling pattern).

Star vs snowflake vs OBT

On a 2026 columnar warehouse with cheap storage and compute, OBT is more defensible than it was when Kimball wrote the books. Star is still the right default. Snowflake is rare.

PropertyStar SchemaSnowflake SchemaOne Big Table
Query simplicityHighMediumHigh (single table)
Query perf on columnarStrongWeaker (deeper joins)Strong (no joins)
ETL complexityModerateHighestHighest (one wide table to maintain)
SCD handlingClean (Type 2 on dim)Same as star, more places to coordinateAwkward (rebuilds expensive)
Storage costModerateLowestHighest (dim attrs duplicated)
Self-service analyticsBestWorse (more tables to model)Variable (works for one cut, fails for many)
When it goes wrongCross-fact joins, conformed-dim driftAnywhere join chain exceeds 3 levelsSchema drift, multi-audience filtering

Primary, foreign, surrogate, natural keys

When to use each. Key choice questions appear in nearly every modeling interview.

Key typeWhat it isUse when
Natural keyA real-world identifier (email, SSN, ISBN)Source-of-truth tables where the identifier won't change
Surrogate keySystem-generated meaningless integer (auto-increment, UUID, hash)Dimensional warehouses, especially SCD Type 2 where natural key alone is not unique
Primary keyConstraint that uniquely identifies a row (NOT NULL + UNIQUE)Every persisted table; choose surrogate or natural based on stability
Foreign keyConstraint enforcing a row references a primary key in another tableOLTP systems where referential integrity matters; often omitted in cloud warehouses for performance

What interviewers actually score on

Six questions you'll hear, with answer outlines that separate strong from senior.

Q01

Design a schema for an Uber-like ride-sharing platform.

Strong answers ask clarifying questions first: which audience (operations, finance, ML)? What grain (one ride, one ride-leg, one driver-shift)? Then sketch fact_ride at one-row-per-ride grain plus dim_rider, dim_driver, dim_vehicle, dim_geography (pickup AND dropoff as separate FKs), dim_date, dim_promo. Senior answers explicitly call out SCD Type 2 on dim_driver (rating, vehicle, status all change) and a separate fact_ride_lifecycle event-grain table for ETA, accept, cancel events.

Q02

Customer with three addresses, two emails, merged into another customer.

Clarify 'three addresses.' Three current (billing, shipping, residential)? bridge_customer_address with role + eff_from/eff_to. Three historical? SCD Type 2 on dim_customer. The merge: a customer-survivor table mapping merged_customer_id → surviving_customer_sk so historical facts still resolve. Most candidates miss the merge survivor table and lose history.

Q03

What's the grain of your fact table? Defend it.

Subject + qualifier + temporal scope. 'One row per shipment line per warehouse-event per ship-date.' Defend by enumerating questions the grain answers and questions a coarser grain would not. If you can't list two questions a coarser grain would fail, your grain is over-granular.

Q04

Walk through a Type 2 dimension. What goes wrong with late-arriving facts?

Walk through as_of_timestamp join (fact.event_ts BETWEEN dim.eff_from AND dim.eff_to). Then the failure mode: a fact arrives with event_ts older than every active eff_to in the dim because the dim's history wasn't backfilled deeply enough. Fix: extend the dim's effective windows backwards (UPDATE the earliest row's eff_from to the global minimum), or insert a synthetic 'unknown' dim row for events before recorded history.

Q05

Star or snowflake or one big table? When each?

Star is the default for BI dashboards on a columnar store. Snowflake when dim hierarchies have rich attributes that change at different rates (geography: country attrs change every decade, city yearly, neighborhood monthly). OBT when one dominant audience uses one denormalized surface and storage is cheap. Tiebreaker: 'I default to star unless someone tells me a specific reason to snowflake.'

Q06

How do you model a hierarchy where the depth is unknown?

Two options. Adjacency list (parent_id self-reference) is simple but recursive queries are expensive. Closure table (one row per ancestor-descendant pair with depth) makes hierarchical queries O(1) at the cost of storage and write complexity. Bridge tables can express hierarchies that aren't strict trees (employees with two managers). State which option you'd pick for the asked use case and why.

Worked example: e-commerce star schema

                    dim_customer                 dim_product
                +-----------------+         +------------------+
                | customer_id PK  |         | product_id  PK   |
                | name            |         | name             |
                | email           |         | category         |
                | segment         |         | subcategory      |
                | region          |         | brand            |
                | signup_date     |         | unit_cost        |
                +-----------------+         +------------------+
                         |                           |
                         |    fact_order_items        |
                         |  +--------------------+   |
                         +--| customer_id   FK   |---+
                            | product_id    FK   |
                  +---------| date_id       FK   |---------+
                  |         | order_id      DD   |         |
                  |         | order_item_id PK   |         |
                  |         |--------------------|         |
                  |         | quantity           |         |
                  |         | unit_price         |         |
                  |         | discount_amount    |         |
                  |         | total_amount       |         |
                  |         +--------------------+         |
                  |                                        |
          +--------------+                     +-----------------+
          | date_id  PK  |                     | promo_id   PK   |
          | full_date    |                     | promo_name      |
          | day_of_week  |                     | discount_pct    |
          | month        |                     | start_date      |
          | quarter      |                     | end_date        |
          | year         |                     | promo_type      |
          | is_holiday   |                     +-----------------+
          +--------------+                        dim_promotion
             dim_date

Grain: one row per order line item. Trade-offs to state in the interview: SCD strategy on dim_customer (Type 1 vs Type 2), aggregation tables for dashboard queries, and why unit_price lives on the fact (prices change over time; historical revenue must use the price at sale).

Worked example: normalization from 1NF to 3NF

Original (unnormalized) - comma-separated products violates 1NF:

orders_raw
+----------+------------+---------+-----------+--------------------+
| order_id | order_date | cust_id | cust_name | products           |
+----------+------------+---------+-----------+--------------------+
| 1001     | 2025-03-15 | C1      | Alice     | Widget, Gadget     |
| 1003     | 2025-03-16 | C1      | Alice     | Gadget, Thingamajig|
+----------+------------+---------+-----------+--------------------+


Step 1 - 1NF: split repeating groups so every cell is atomic.

order_items_1nf  PK: (order_id, product)
+----------+------------+---------+-----------+-------------+
| order_id | order_date | cust_id | cust_name | product     |
+----------+------------+---------+-----------+-------------+
| 1001     | 2025-03-15 | C1      | Alice     | Widget      |
| 1001     | 2025-03-15 | C1      | Alice     | Gadget      |
+----------+------------+---------+-----------+-------------+


Step 2 - 2NF: split off attrs that depend on only part of the composite PK.

orders_2nf  (PK: order_id)        order_items_2nf
+----------+------------+---------+ +----------+-------------+
| order_id | order_date | cust_id | | order_id | product     |
| 1001     | 2025-03-15 | C1      | | 1001     | Widget      |
+----------+------------+---------+ +----------+-------------+


Step 3 - 3NF: split off transitive deps (cust_name depends on cust_id, not order_id).

customers_3nf       orders_3nf            order_items_3nf
+---------+--------+ +----------+---------+ +----------+----------+
| cust_id | name   | | order_id | cust_id | | order_id | product  |
| C1      | Alice  | | 1001     | C1      | | 1001     | Widget   |
+---------+--------+ +----------+---------+ +----------+----------+

The interviewer hands you a denormalized table and expects you to identify violations at each normal form and fix them step by step.

Worked example: SCD Type 2 dimension

dim_customer (SCD Type 2)
+----------------+-------------+---------+----------+-------+------------+------------+------------+
| customer_sk PK | customer_id | name    | city     | state | eff_from   | eff_to     | is_current |
+----------------+-------------+---------+----------+-------+------------+------------+------------+
| 1001           | C-42        | Alice   | Chicago  | IL    | 2023-01-15 | 2025-06-30 | false      |
| 1002           | C-42        | Alice   | Austin   | TX    | 2025-07-01 | 9999-12-31 | true       |
| 1003           | C-88        | Bob     | Seattle  | WA    | 2024-03-01 | 9999-12-31 | true       |
+----------------+-------------+---------+----------+-------+------------+------------+------------+

Key design choices:
  1. customer_sk (surrogate) is PK. customer_id is the natural/business key.
  2. eff_to = 9999-12-31 marks the current row. is_current is redundant but speeds up queries.
  3. When Alice moves: old row gets eff_to = 2025-06-30 and is_current = false; new row inserted.
  4. Fact tables FK to customer_sk, NOT customer_id, preserving point-in-time accuracy.

When Alice moves: close the old row with eff_to and is_current=false; insert a new row. Fact tables join on customer_sk, not customer_id, which preserves point-in-time accuracy. Joining on customer_id instead destroys historical accuracy, the classic interview-failing mistake.

Querying an SCD Type 2 dimension

-- Current state of all customers
SELECT * FROM dim_customer WHERE is_current = true;

-- Point-in-time: customer state on a specific date
SELECT * FROM dim_customer
WHERE '2024-06-15' BETWEEN eff_from AND eff_to;

-- Join fact to dim preserving historical accuracy
SELECT f.order_date, f.total_amount, d.city, d.state
FROM fact_orders f
JOIN dim_customer d ON f.customer_sk = d.customer_sk;

Myth vs reality

Myth: 3NF is the right answer for a warehouse

Reality: 3NF is a transactional concept. Warehouses denormalize on purpose so analytics queries don't pay for joins on every dashboard load. Knowing 3NF rules is table-stakes; applying them blindly to warehouse design is a fail.

Myth: Surrogate keys are always better than natural keys

Reality: surrogate is the right default for dim tables (Type 2 history requires it). Natural keys can be right for cross-system reconciliation, idempotent loads, and source-of-truth tables where the natural key never changes (ISBN, upstream transaction_id). The best answer states the rule plus the carve-outs.

Myth: SCD Type 2 is always right for changing dimensions

Reality: only when downstream queries actually need history. For ML feature stores serving the freshest value at request time, Type 1. For audit-heavy domains, Type 6. For attributes many-to-many over time (customer-region), no single SCD type fits, and a bridge table is the right answer.

Myth: Star schemas are obsolete

Reality: still the default for BI dashboards on columnar stores. Cheap storage and compute don't eliminate the need for clean denormalized dimensions when self-service users filter and group. Star plus aggregation tables remains the boring, reliable answer.

Myth: Just denormalize everything into one table

Reality: works until it doesn't. Joins are cheap on columnar stores; rebuilds when the schema drifts are not. Every new column on a dim attribute touches every historical fact row in the OBT, which means a multi-hour backfill at scale.

Decision matrix

Eight situations, the schema pattern that fits, and a one-line reason.

SituationPickReason
BI dashboards, < 10 facts, slowly changing dimsStar + SCD Type 2Cleanest grammar for self-service; dim history preserved.
ML feature store, point-in-time correctness neededOBT or feature-table per entityFeatures computed and frozen at training-time row; no SCD complexity.
Operational reporting, low-latency dashboardsMaterialized view + denormalized cachePre-aggregated by the dimensions you filter on; refresh on a streaming cadence.
Compliance audit trail requiredSCD Type 6 or full event logType 6 keeps current + history + previous in one row; event log is unambiguous and append-only.
Self-service analytics, lots of ad-hoc joinsStar schema, defensivePredictable join paths; analysts know which dim to filter without consulting the model.
Wide dim that changes in 2 of 50 columnsSCD Type 4Avoid duplicating 48 unchanging columns on every history row.
Customer can be in two regions at onceBridge table with eff dates + weightingMany-to-many over time can't be modeled by a single SCD type.
Hierarchy depth unknown, queries traverse depthClosure tableO(1) ancestor-descendant lookups beat recursive CTEs at scale; pay storage for query speed.

All modeling topics by frequency

Six topic clusters, ordered by how often interviewers reach for them. Each cluster lists the questions you should answer cold.

Medium

Normalization (1NF through 3NF)

Frequency: Almost every junior or mid loop The textbook layer. Spot a 1NF violation in a denormalized table, walk it to 3NF, then defend why a warehouse usually denormalizes back. Knowing the rules is the floor; knowing when to break them is the answer. Questions: 1. Given a table with repeating groups in a single column (comma-separated tags), normalize to 1NF. Explain what problem the original design creates for queries. 2. Identify the partial dependency in a table with a composite primary key and decompose it into 2NF. 3. A table stores employee_name, department_name, and department_location. Identify the transitive dependency and normalize to 3NF. 4. When would you intentionally denormalize a 3NF schema? Give a specific scenario with the trade-offs.

Medium

Star vs Snowflake Schemas

Frequency: Star schema explicit in ~5% of questions; fact and dim talk in nearly all Almost every modeling round assumes the star vocabulary: pick a grain, sketch a fact, hang dimensions off it. The part interviewers score is whether you can defend why a column lives on the fact and not the dim, and when snowflake or OBT becomes the right call. Questions: 1. Design a star schema for an e-commerce platform. Define the fact table grain and list 4 dimension tables with their key attributes. 2. Convert a star schema into a snowflake schema by normalizing the product dimension. What queries become harder? What storage do you save? 3. Your fact table has 500M rows. A dimension table has 50M rows. Star or snowflake? Explain your reasoning based on query patterns. 4. What is a degenerate dimension? Give an example and explain why it lives in the fact table.

Medium-Hard

Slowly Changing Dimensions (SCD Types 1-3)

Frequency: The single most common modeling question The address-change-on-a-customer question, in some form, in almost every interview. Type 2 with effective dating is the median right answer. Late-arriving facts, dim growth, and when Type 4 or Type 6 fits better are where mid and senior part ways. Questions: 1. A customer changes their address. Show how SCD Type 1, Type 2, and Type 3 each handle this change. 2. Design an SCD Type 2 implementation for a product dimension where price changes frequently. 3. Your SCD Type 2 dimension grew from 10M to 80M rows over 3 years. What strategies reduce size without losing history? 4. When would you choose SCD Type 3 over Type 2? Describe a real scenario where storing only the previous value is sufficient.

Medium

Entity-Relationship Modeling

Frequency: About 7% of modeling questions Draw boxes and lines. Questions look easy until the interviewer adds a self-referencing relationship or a many-to-many with an attribute on the join. Candidates who pass slow down and name the cardinality before they draw. Questions: 1. Draw an ER model for a university system: students, courses, professors, enrollments. Define cardinalities for each relationship. 2. Model a social network where users can follow other users, create posts, and like posts. How do you handle the self-referencing follow relationship? 3. Design an ER model for a hotel booking system. Handle the case where one booking spans multiple rooms and multiple nights. 4. Your many-to-many between orders and products has an intersection table with quantity and unit_price. Why is unit_price stored here instead of on the product table?

Easy-Medium

Keys and Constraints

Frequency: PKs in ~6%, FKs in ~5% of modeling questions Surrogate vs natural is the most-asked key question. Honest answer: most warehouses surrogate everything by default. The carve-outs are where it gets interesting: source-of-truth tables, idempotent loads, cross-system reconciliation. Questions: 1. When should you use a surrogate key vs a natural key? Give one scenario for each. 2. A composite primary key of (user_id, event_date). Can you add a foreign key from another table to this composite? Implications? 3. Your schema has no foreign key constraints for performance reasons. How do you ensure referential integrity? Cost of skipping enforcement?

Hard

Trade-off Reasoning

Frequency: The back half of every senior loop Drawing the schema is the easy part. The interviewer rarely cares about your initial answer. They care what happens when they add a requirement that breaks it. Questions: 1. You denormalized a dimension to speed up a critical dashboard. What happens when source data changes? How do you keep the denormalized copy in sync? 2. Your schema uses a single wide fact table with 200 columns. The interviewer suggests splitting it. Argue for and against each approach. 3. A stakeholder wants real-time data in the warehouse. Your current design uses daily batch. What schema changes support streaming without breaking existing queries? 4. You chose a star schema, but a new requirement needs a query joining 5 dimensions. The query is slow. Options? Recommendation?

Data modeling interview FAQ

Why is the modeling round so much harder than SQL or Python?+
No automated scorer. SQL is right or wrong; Python passes or fails the tests. Modeling is a conversation where the interviewer rolls a fresh requirement onto the table every five minutes and watches what you do. The assessment is on whether your reasoning survives the new constraint, not on what you put on the whiteboard the first time.
Star or snowflake by default?+
Star. Always. Snowflake is a specific answer to a specific problem (a dimension hierarchy whose attributes change at very different rates, with storage actually expensive at the relevant scale). On columnar warehouses, storage savings are tiny and the additional joins hurt query simplicity for self-service users.
Do I need to memorize all the SCD types?+
Types 1, 2, and 3 cold. Draw the table state before and after a change, name the columns each type adds, and have a one-sentence reason for picking one. Type 4 and Type 6 come up rarely and only at senior. Types 5 and 7 are trivia.
What's the grain of a fact table and why does it matter so much?+
One sentence that tells you what one row in the fact represents. 'One order line per product per customer per day.' If you cannot say that sentence, you do not have a model. Grain determines what every measure means, what dimensions can sit on the table, and which questions the table can answer. Mid-level and senior interviews probe grain before anything else.
How do you handle late-arriving facts in an SCD Type 2 model?+
Look up the dim version current at the event timestamp, not the row current today. Eager: as_of_timestamp join (fact.event_ts BETWEEN dim.eff_from AND dim.eff_to) on a materialized Type 2 dim. Lazy: reconstruct from a CDC log walking back from current. The naive answer (re-apply the latest dim row to backfilled facts) is wrong and is the classic interview-failing answer.
What is a bridge table and when do you reach for one?+
Between a fact and a dim (or two dims) when the relationship is many-to-many. Customer-to-account in households. Patient-to-clinician in care teams. Order-to-promotion when one order has multiple promos. The bridge stores the pair plus an optional weighting factor when measures must be allocated. Without it: lost rows (inner join collapses), double-counting (cartesian), or fake hierarchies that mislead the business.
When is one-big-table (OBT) the right answer over a star schema?+
OBT works when downstream users mostly query a single denormalized surface (ML feature stores, ad-hoc Looker explores, BI tools that flatten joins anyway), and when storage of duplicating dim attributes is cheap vs the join cost. It breaks when the same fact rows must support many audience cuts, when SCD history must be enforced, or when downstream tools need a clean dim to filter by.
Primary key versus foreign key in the warehouse era?+
PK says a row is unique. FK says a column references a PK elsewhere. In OLTP both enforced. In Snowflake or BigQuery, PKs are usually declared for documentation but not enforced; FKs often skipped entirely for write performance. The trade-off interviewers want named: integrity guaranteed at write time vs throughput.
02 / Why practice

Practice a few of these on the canvas

  1. 01

    Active recall beats re-reading by 50%

    Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom

  2. 02

    76% of hiring managers reject on the coding task, not the resume

    From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice

  3. 03

    Five problem shapes cover 80% of data engineer loops

    Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition

Data modeling deep dives