Data Modeling Interview Practice

25 data modeling interview questions with worked answers, ranked by frequency in real data engineering loops and by seniority level. Star, snowflake, SCD types, conformed dimensions, late-arriving data, medallion, data vault. Each answer is the version that earns the senior signal.

25 data modeling interview questions with worked answers, ranked roughly by frequency in real data engineering loops and by seniority level. Star schema, snowflake, SCD types, conformed dimensions, late-arriving data, medallion, data vault. Each answer is the version that earns the senior signal, not the textbook recap.

The shape interviewers expect for a star schema

The e-commerce version is worked end to end in question 3. This is what your whiteboard should look like before you start defending choices. Grain stated first, fact at the center, dimensions hanging off the foreign keys, SCD type called out on each one.

dim_productPKproduct_skproduct_idcategorybranddim_customer (T1)PKcustomer_skcustomer_idemailnamefact_order_itemsPKorder_item_skorder_idFKproduct_skFKcustomer_skFKdate_skquantityunit_pricetotal_amountdim_customer_history (T2)PKcustomer_history_skFKcustomer_skregionsegmentvalid_fromvalid_tois_currentdim_datePKdate_skdatefiscal_periodis_holiday
GrainOne row per order line item. Lets you analyze by product without re-aggregating; switching to order-level loses that.
SCD splitCustomer splits into Type 1 (name, email) and Type 2 (region, segment). Email changes aren't analytically interesting; region changes are.
Conformeddim_date is conformed across every fact in the warehouse. That's what lets cross-fact reporting produce consistent calendars.

How the 25 questions map to seniority

A modeling round at your target level pulls from the band that matches, plus one warm-up below and one stretch above.

  1. L3 (Junior)
    3 questionsQ1 to Q3

    Fact vs dimension, grain, basic star schema.

  2. L4 (Mid)
    7 questionsQ4 to Q10

    Star vs snowflake, SCDs, surrogate keys, conformed dimensions, OBT, ride-share and payments modeling.

  3. L5 (Senior)
    11 questionsQ11 to Q21

    SCD2 merge, late-arriving data, data vault, medallion, slowly-changing facts, bridge tables, partitioning, normalization, schema evolution, time dimension.

  4. L6 (Staff)
    4 questionsQ22 to Q25

    Multi-region, graph databases, billion-row scale, real production tradeoffs.

25 data modeling interview questions

Each links to its full worked answer. The level tag is the seniority at which the question typically lands.

  1. Q01L3 · Fundamentals

    What is the difference between a fact table and a dimension table?

    Fact tables hold events (orders, page views, payments) and the numeric measures associated with them. Dimension tables hold descriptive context (customer, product, date, location). Facts are tall and skinny: billions of rows, few columns. Dimensions are short and wide: thousands to millions of rows, many descriptive columns. Facts grow forever; dimensions grow with the business.

  2. Q02L3 · Grain

    What is the grain of a fact table and why does it matter?

    Grain is the level of detail one row represents, stated as 'one row per X'. State it before drawing tables. 'One row per order line item' is different from 'one row per order' and changes every column that follows. The interviewer almost always asks this first; not having an answer is the most common reason modeling rounds wobble. Pick the finest grain you can without making the table prohibitively large.

  3. Q03L3 · Star schema

    Design a star schema for an e-commerce analytics warehouse.

    Grain: one row per order line item. Fact columns: order_item_id, order_id, customer_sk, product_sk, date_sk, quantity, unit_price, discount, total_amount. Dimensions: dim_customer (Type 1 plus Type 2 split), dim_product, dim_date, dim_store. The star over snowflake call: analytical queries on a columnar warehouse win from fewer joins and dimension tables small enough to broadcast.

  4. Q04L4 · Star vs snowflake

    When would you choose a snowflake schema over a star?

    Almost never in 2026. Cloud warehouse compute is cheap and storage is cheap, so the cases that historically justified snowflake (normalize a huge dimension to save space) mostly don't apply. The honest case: when a dimension hierarchy genuinely changes at different cadences and you want to version each level separately. The interview test is whether you default to star and reserve snowflake for the carve-out. Picking snowflake first reads as junior.

  5. Q05L4 · SCDs

    Walk through SCD Type 1 vs Type 2 vs Type 3 with a concrete example.

    Customer moves from California to New York. Type 1 overwrites: the customer row now says New York and history is lost. Type 2 inserts a new row with effective_from = move date, effective_to = NULL, is_current = true; the previous row's effective_to becomes the move date and is_current becomes false. Type 3 adds a current_state and previous_state column on the same row. Type 2 is the default for analytics because the natural question is 'what state were they in at the time of the order'. Type 1 is right when you don't care about history. Type 3 is right when only the immediately previous value matters.

  6. Q06L4 · Surrogate keys

    When should you use a surrogate key instead of a natural key?

    Use a surrogate (a system-generated integer or UUID) when the natural key is unstable (email changes), when you need SCD Type 2 (the surrogate identifies the version, the natural key identifies the entity), or when the natural key is composite and clumsy. Use the natural key when it's stable, simple, and human-readable; analyst joins are easier with a natural key when readability matters more than versioning.

  7. Q07L4 · Modeling

    Model a ride-sharing platform with riders, drivers, trips, and surge pricing.

    Grain: one row per trip. Fact: trip_id, rider_sk, driver_sk, pickup_location_sk, dropoff_location_sk, request_ts, accept_ts, pickup_ts, dropoff_ts, distance_miles, base_fare, surge_multiplier, total_fare, tip. Dimensions: dim_rider (Type 1), dim_driver (Type 2 for vehicle and rating), dim_location (probably H3 cells or zones, not raw lat/long), dim_date, dim_time_of_day. Surge goes on the ride fact, not its own table, because surge state at trip start is what reporting needs. The interesting design call is modeling pickup_location and dropoff_location as two FKs to the same dim, not two separate dims.

  8. Q08L4 · Modeling

    Design the schema for a payments ledger.

    Grain: one row per payment event. Fact: payment_id, account_sk, transaction_type, amount, currency, status, ts. Status is event-time; payments move through pending → captured → settled → reconciled, and each transition is a new row in an append-only event log. Build a payment-state view on top to compute current status. The interview signal is realizing payments need event-sourcing, not row-update semantics, because finance and audit need the trail.

  9. Q09L4 · Conformed dimensions

    What is a conformed dimension and why does it matter?

    A dimension that's defined identically across multiple fact tables. dim_customer is conformed when sales, support, marketing, and product analytics all join to the exact same customer dimension. This is what lets cross-fact queries work: count customers who placed an order and filed a support ticket in the same month. Without conformed dimensions, every analytics team builds its own version of 'customer' and the numbers stop matching.

  10. Q10L4 · OBT

    When is one big table (OBT) the right pattern?

    OBT pre-joins facts and dimensions into a single wide denormalized table. Wins for dashboards with predictable access patterns (no join cost at query time) and ML training (one row per record, no point-in-time leakage from late-joined dims). Loses when the same data needs to be sliced multiple ways or when dimensions are large enough that storage 3x for the denormalization is real money. In a Kimball model OBT is a gold-layer output shape, not a replacement.

  11. Q11L5 · SCD Type 2 implementation

    Walk through the merge logic for SCD Type 2.

    On update, find the current row by natural key. Set its effective_to = now and is_current = false. Insert a new row with surrogate_key = generate_new(), natural_key = same, attributes = new values, effective_from = now, effective_to = NULL, is_current = true. The surrogate key changes; the natural key stays. Every fact table joining this dimension must use the surrogate key, not the natural key, or point-in-time queries break. In a MERGE statement: WHEN MATCHED AND attributes differ THEN UPDATE the current row, WHEN MATCHED OR NOT MATCHED handle the insert.

  12. Q12L5 · Late-arriving dimensions

    How do you handle a dimension row that arrives after the fact that references it?

    Insert a placeholder dimension row keyed on the natural key with is_late = true and the descriptive columns populated as Unknown. Facts join on the surrogate when it eventually gets created, or on the natural key in the meantime. When the real dimension row arrives, update the placeholder in place (still Type 1 for that fix) or insert a new Type 2 version, depending on whether you want history. Volunteering this pattern is a senior signal because most candidates don't think about it.

  13. Q13L5 · Late-arriving facts

    What happens when facts arrive after a downstream rollup has been computed?

    Two strategies. Reprocess the affected rollup partitions: clean but expensive. Or land the late fact and write a separate adjustment that downstream rollups apply on read: cheaper, more complex. The decision depends on how the consumer queries. Dashboards that scan a single partition each query tolerate adjustments; ML training jobs that read full partitions don't. Volunteering the trade-off is the senior version of the answer.

  14. Q14L5 · Data vault

    When would you use Data Vault 2.0 instead of Kimball?

    Data Vault is hubs (business keys), links (relationships), and satellites (descriptive attributes versioned over time). The case for it is governance: regulated industries (finance, insurance, healthcare), full audit lineage, sources that change schema frequently, multiple parallel ingestion teams. The trade-off is that nobody queries it directly because every business question becomes a six-table join, so teams build a Kimball-shaped business vault on top. In an interview, the right call is usually Kimball unless the company explicitly has the EDW org and the compliance requirement.

  15. Q15L5 · Medallion

    What is medallion architecture and how does it relate to star schema?

    Three layers. Bronze: raw, immutable, schema-on-read. Silver: cleaned, typed, deduplicated, conformed across sources. Gold: business-ready, usually as star schemas. Each layer has its own ownership and quality contract. A bug in silver doesn't require re-ingesting from source; you replay from bronze. Medallion is a data organization pattern; star schema is a modeling pattern that lives in gold. Saying 'we use medallion instead of star' is a category error.

  16. Q16L5 · Slowly changing facts

    How do you handle a fact that needs to be corrected after the fact?

    Append-only with a correction event is cleanest: the original row stays, a new row records the delta with a reference to the original. Or in-place update with an audit log table tracking what changed. The append-only approach makes reproducibility easy (you can compute any historical state by filtering on event date) but adds query complexity. In-place is simpler to query but harder to audit. Pick by whether the consumer needs point-in-time correctness more than query simplicity.

  17. Q17L5 · Bridge tables

    How do you model a many-to-many relationship in a dimensional schema?

    A bridge table with two foreign keys, one to each dimension. Add a weighting factor if the relationship is fractional (a sale attributed to two sales reps gets 0.5 each). Common in healthcare (patient-to-diagnosis), retail (product-to-category), ad tech (impression-to-conversion). The trade-off is that bridge tables explode row counts on the fact side; the alternative is denormalizing the categories into the fact, which works for small fixed sets.

  18. Q18L5 · Partitioning

    How do you choose a partition key for a fact table?

    Date is the default because almost all analytical queries filter by time. Use the column that's in the WHERE clause of most queries. Composite partitions (date plus tenant_id for multi-tenant) work when access patterns are uneven. Avoid high-cardinality partitions (more than 10K partitions) because metadata operations slow down. The senior signal is naming partition pruning explicitly and how it interacts with the query optimizer's predicate pushdown.

  19. Q19L5 · Normalization

    When does 3NF still make sense for OLTP?

    Transactional systems where write integrity matters more than read latency. E-commerce checkout, banking ledgers, inventory management. These need referential integrity, atomic updates, and small row footprints because they're writing constantly. 3NF gives you the constraints to enforce that. Analytics needs the opposite: read-optimized, denormalized for query speed, eventual consistency is fine. OLTP modeling and OLAP modeling are different jobs with different rubrics.

  20. Q20L5 · Schema evolution

    How do you handle adding a new column to a fact table consumed by multiple downstream pipelines?

    Additive nullable columns are safe; every downstream tolerates them by ignoring. Breaking changes (rename, type narrow, drop) require coordination: deprecate, dual-write for a window, migrate consumers, then remove. The schema registry pattern (Avro or Protobuf in a registry like Confluent's) enforces compatibility rules at the producer side, which prevents the most common failure: someone changed the column type and nobody told the downstream team.

  21. Q21L5 · Time dimension

    Why use a dim_date table instead of relying on the warehouse's date functions?

    Three reasons. One: business calendars that don't match Gregorian (fiscal year ending in March, retail 4-5-4 weeks). Two: pre-computed attributes (is_weekend, is_holiday, quarter, fiscal_period) that would otherwise be derived in every query. Three: consistent joins. Every fact joins on date_sk to dim_date, so changing the holiday list updates everywhere at once. The interview test is being able to defend a dim_date when the interviewer pushes back with 'why not just use DATE_TRUNC'.

  22. Q22L6 · Multi-region

    How do you model data when reads happen in multiple regions?

    Region-local reads from region-local replicas. Cross-region writes go through a single primary or use CRDTs for monotonic counters. The model itself usually doesn't change region-to-region; what changes is the replication strategy and the consistency level. Volunteer the trade-off explicitly: stronger consistency (eventual versus strong) means higher latency for cross-region writes.

  23. Q23L6 · Graph

    When would you choose a graph database over a relational model?

    When the queries are fundamentally graph traversals (friends-of-friends, shortest path, connected components) and the relational version requires multiple recursive CTEs or self-joins per query. Fraud detection rings, recommendation networks, supply chain dependencies. Neo4j, Neptune. The trade-off is that graph databases don't replace analytics warehouses; you have both, with the graph fed from the warehouse for traversal-heavy queries.

  24. Q24L6 · Scale

    Your fact table is approaching a billion rows. What do you change?

    Partition by date if not already. Cluster (Snowflake) or Z-order (Delta) on the most common filter columns after date. Move historical partitions to cheaper storage tiers if access is bimodal (recent partitions hot, historical cold). Consider pre-aggregating into a daily or hourly rollup table with the fine-grained fact as the source of truth. The wrong move is sharding the fact across multiple tables; it solves storage and breaks every query.

  25. Q25L6 · Tradeoffs

    Walk through a modeling decision you made in production that turned out wrong.

    Behavioral but technical. The senior signal is naming a real call you made, what the constraint was at the time, what the failure mode was, and what you'd do differently with the information you have now. Avoid retrospective certainty. The answer that lands sounds like: given what we knew, the call was reasonable, but here's what we missed and how I'd catch it earlier next time. Faux failures or 'I never made a wrong call' read as inexperienced.

What the modeling round actually tests

The modeling round is the loop-decider at senior level. It shows up in roughly a third of all data engineering loops and more than half of L5-and-above loops. The format is consistent: a vague business prompt (e-commerce, ride-sharing, payments, marketplace), you draw a schema on a whiteboard or canvas, and the interviewer pushes back on every choice.

Pass rate correlates more with how you defend tradeoffs than with the schema you draw. Most candidates can name the components: facts, dimensions, SCD types. Fewer can defend the grain when the interviewer adds a requirement five minutes in. The strongest candidates volunteer the failure mode (what happens to this schema when the business changes the question) before being asked.

Grain is the first trap. State it before drawing anything. The questions in this list all assume you can answer the grain question without hesitation; if you can't, that's the gap to close first.

Prepare for the interview
01 / Open invite
02min.

Know Data Modeling the way the interviewer who asks it knows it.

a Data Modeling query, the same shape a screen would give you.
The diff against expected. Where ties broke. What you missed.
sandbox
1fact_orders
2 order_id bigint PK
3 customer_sk bigint FK
4 order_date date SCD2
5
Execute your solution0.4s avg.
SpotifyInterview question
Solve a Data Modeling problem

Run a live mock interview on a real canvas

Drawing a schema is easy. Defending it under pushback while a timer runs is what wrecks most candidates' first modeling round. The mock interview tool gives you a deliberately vague business prompt, a real schema canvas, and an AI interviewer that pushes on grain, SCD type, denormalization choices, and the consequence of each call. Hire-or-no-hire at the end with the specific exchanges that decided it. Worth running two or three times in the week before a real loop.

Retailer Data Warehouse Design

> We're a mid-size e-commerce company processing about 10,000 orders per day. The analytics team needs to build dashboards for sales performance, customer segmentation, and product trends. Right now all the data lives in a normalized Postgres OLTP database and queries are slow. Can you design a dimensional model for the analytics warehouse?

+ Table
+ Column
PK
FK
Partition
Cluster
Dedupe
Index
Soft Delete
Data Modeling
Model the schema.

Click + Table in the toolbar, or right-click the canvas to add one.

Drag from a key column's edge dot to another column to draw a foreign key.

Common questions about data modeling interview practice

What does a data modeling interview question actually look like?+
A vague business scenario: 'model the data for an e-commerce analytics team', 'design a schema for a payments ledger', 'model a ride-sharing platform'. Your job is to pin the grain, sketch a star schema, defend the SCD type, and survive the follow-ups when the interviewer changes the requirement mid-round.
Kimball or Inmon?+
Kimball (star schema, dimensional modeling) is the default for analytics in 2026. Most companies hiring data engineers use a Kimball-style warehouse, even when they call it a lakehouse. Inmon (3NF, atomic data, then build marts) is rare in interviews and almost never the right answer unless the interviewer specifically asks for 3NF.
Do I need to know Data Vault?+
Only for senior roles at financial services, insurance, healthcare, or government. If asked, name the three components (hubs, links, satellites) and move on. Don't propose Data Vault for a generic e-commerce problem; it reads as showing off.
How detailed should my schema diagram be?+
Tables, columns, types (string, int, decimal, timestamp), primary keys, foreign keys, SCD type on each dimension. Don't write CREATE TABLE statements unless asked. Boxes and lines are fine. The diagram exists to support your tradeoff discussion, not to be a production deliverable.
What if I don't finish drawing in time?+
Get the fact table grain and the main dimensions on the board. Verbalize the rest. The interviewer is assessing your reasoning, not your drawing speed. A clear partial diagram with a clear verbal completion beats a rushed full one.
Is medallion architecture the same as star schema?+
No, and confusing them is a category error. Medallion is a layering strategy (bronze raw, silver cleaned, gold business-ready). Star schema is a modeling pattern that usually lives in the gold layer. Use both terms correctly to signal current-decade fluency.
02 / Why practice

Run a modeling mock before the real one

  1. 01

    Active recall beats re-reading by 50%

    Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom

  2. 02

    76% of hiring managers reject on the coding task, not the resume

    From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice

  3. 03

    Five problem shapes cover 80% of data engineer loops

    Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition

More on data modeling