Data Modeling Interview Questions

25 data modeling interview questions, each worked to an actual schema and the one tradeoff that decides the round. Ordered roughly by how often they come up in real data engineering loops and by the level they land at. Star and snowflake, SCD Type 1/2/3, conformed dimensions, late-arriving data, medallion, data vault. Every answer is written the way you would say it to an interviewer who is about to push back, not the way a glossary defines it.

The spine of a passing modeling round

What this page actually argues

01 State the grain out loud before you draw anything. “One row per ___” is the sentence that opens the round, and skipping it is the most common way a strong candidate stumbles.
02 Default to a star. Snowflake is a carve-out you justify, not a starting point. Leading with snowflake reads as textbook, not production.
03 SCD Type 2 is the analytics default because the real question is always “what was true at the time of the event,” and that means facts join on the version key, never the natural key.
04 The verdict tracks how you defend tradeoffs under pushback far more than the boxes you drew. Volunteer the failure mode before you are asked.
05 Senior signals cluster in the unglamorous corners: conformed dimensions, late-arriving data, role-playing dimensions, partition pruning. Most candidates only model the happy path.

What an e-commerce star should look like on the whiteboard

Question 3 worked as a diagram. Grain stated first, the fact in the middle, dimensions hanging off the foreign keys, the SCD type called out on each one. This is the picture to have ready before the interviewer starts pushing.

Grain: One row per order line. You can analyze by product without re-aggregating; drop to order-level and that question is gone.

SCD split: Customer splits into Type 1 (name, email) and Type 2 (region, segment). Email churn is not analytically interesting; a region move is.

Conformed: dim_date is the same table every fact in the warehouse joins. That shared calendar is what makes cross-fact reporting reconcile.
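
As a rough sketch of the star at order-line grain, the snippet below builds a trimmed version in an in-memory SQLite database through Python's standard sqlite3 module. The table name fact_order_line, the category column, and the sample rows are illustrative assumptions, and dim_customer and dim_store are left out to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_sk INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE dim_date (date_sk INTEGER PRIMARY KEY, full_date TEXT, is_weekend INTEGER);

-- Grain: one row per order line (assumed table name).
CREATE TABLE fact_order_line (
    order_item_id INTEGER PRIMARY KEY,
    order_id      INTEGER,
    product_sk    INTEGER REFERENCES dim_product(product_sk),
    date_sk       INTEGER REFERENCES dim_date(date_sk),
    quantity      INTEGER,
    line_total    REAL
);

-- Made-up sample data: order 100 has two lines, order 101 has one.
INSERT INTO dim_product VALUES (1, 'Kettle', 'Kitchen'), (2, 'Lamp', 'Home');
INSERT INTO dim_date VALUES (20240106, '2024-01-06', 1);
INSERT INTO fact_order_line VALUES
    (1, 100, 1, 20240106, 2, 60.0),
    (2, 100, 2, 20240106, 1, 25.0),
    (3, 101, 1, 20240106, 1, 30.0);
""")

# Product-level analysis is a plain join and GROUP BY at this grain.
revenue_by_product = conn.execute("""
    SELECT p.product_name, SUM(f.quantity), SUM(f.line_total)
    FROM fact_order_line AS f
    JOIN dim_product AS p ON p.product_sk = f.product_sk
    GROUP BY p.product_name
    ORDER BY p.product_name
""").fetchall()
print(revenue_by_product)  # [('Kettle', 3, 90.0), ('Lamp', 1, 25.0)]
```

Had the fact been stored at one row per order, order 100 would be a single row and the Kettle versus Lamp split would not be recoverable.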

What the modeling round is really scoring

By senior level, the modeling round is often the round that decides the loop. It turns up in roughly a third of data engineering interviews and in more than half of the loops at L5 and above. The format barely varies: a deliberately underspecified business prompt (e-commerce, ride-share, payments, a marketplace), a whiteboard or shared canvas, and an interviewer whose job for the next 45 minutes is to poke holes in every choice you make.

Here is the part that surprises people the first time. Whether you pass correlates more with how you defend a tradeoff than with the diagram you produce. Almost everyone can name facts, dimensions, and the three SCD types. Far fewer can hold their grain steady when the interviewer adds a requirement eight minutes in, or can say what breaks about their schema when the business changes the question next quarter. Saying that out loud, unprompted, is most of the signal.

And grain is the first place rounds go sideways. Pin it before you draw. Every question below assumes you can answer 'one row per what' without a pause; if that pause is real for you, close it before anything else on this list.

How a Type 2 dimension actually stores history

The answer to question 11, drawn. One natural key (customer_id), many surrogate-keyed versions, and the validity window that lets a backdated fact join to the version that was current at the time.

Two versions, one entity: customer_id 4471 has two rows: region=CA (valid_from 2023-01-01, valid_to 2024-08-12, is_current false) and region=NY (valid_from 2024-08-12, valid_to null, is_current true).

The join that matters: The fact carries customer_version_sk, not customer_id. A 2023 order joins to the CA row; a 2025 order joins to the NY row. Point-in-time correctness for free.

Current-state view: Filter is_current = true for 'where do they live now'. Join on the version key for 'where did they live then'. Both questions, one table.

Type 2 is the one the follow-ups circle back to, so the diagram above is worth burning into memory: one entity, many surrogate-keyed versions, a validity window, and facts that join to the version rather than the entity. Get that join wrong and a backdated report snaps every old order to today's customer attributes without warning anyone.
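
A minimal sketch of that join, using the two versions of customer_id 4471 described above. It runs on in-memory SQLite via Python's sqlite3 module; the fact_orders table, its order dates, and its amounts are assumptions made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_version_sk INTEGER PRIMARY KEY,  -- surrogate: names the version
    customer_id         INTEGER,              -- natural key: names the entity
    region              TEXT,
    valid_from          TEXT,
    valid_to            TEXT,
    is_current          INTEGER
);
CREATE TABLE fact_orders (
    order_id            INTEGER PRIMARY KEY,
    customer_version_sk INTEGER REFERENCES dim_customer(customer_version_sk),
    order_date          TEXT,
    amount              REAL
);
INSERT INTO dim_customer VALUES
    (1, 4471, 'CA', '2023-01-01', '2024-08-12', 0),
    (2, 4471, 'NY', '2024-08-12', NULL, 1);
-- Hypothetical orders: one placed before the move, one after.
INSERT INTO fact_orders VALUES
    (10, 1, '2023-06-01', 40.0),
    (11, 2, '2025-02-01', 55.0);
""")

# "Where did they live then": join on the version key the fact carries.
region_then = conn.execute("""
    SELECT f.order_id, d.region
    FROM fact_orders AS f
    JOIN dim_customer AS d ON d.customer_version_sk = f.customer_version_sk
    ORDER BY f.order_id
""").fetchall()

# "Where do they live now": hop through the natural key to the current row.
region_now = conn.execute("""
    SELECT f.order_id, cur.region
    FROM fact_orders AS f
    JOIN dim_customer AS d ON d.customer_version_sk = f.customer_version_sk
    JOIN dim_customer AS cur
      ON cur.customer_id = d.customer_id AND cur.is_current = 1
    ORDER BY f.order_id
""").fetchall()

print(region_then)  # [(10, 'CA'), (11, 'NY')]
print(region_now)   # [(10, 'NY'), (11, 'NY')]
```

The second query is the right answer to a current-state question and the wrong answer to a historical one, which is exactly the mistake a fact keyed on customer_id makes by default.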

Three corrections that move a modeling verdict

The Myth

Snowflake schemas are more advanced, so reaching for one shows depth.

The Reality

On a columnar cloud warehouse the snowflake mostly adds joins for storage you no longer need to save. Defaulting to star and carving out snowflake only for a genuinely independent hierarchy is the move that reads as experience.

The Myth

Medallion architecture is an alternative to a star schema.

The Reality

They answer different questions. Medallion organizes layers (bronze, silver, gold); star models the gold layer. 'We use medallion instead of star' confuses a filing system for a blueprint, and interviewers notice.

The Myth

A finished, detailed diagram is what gets you the hire.

The Reality

A clear partial diagram with the grain and main dimensions on the board, finished verbally, beats a rushed full one. They are scoring your reasoning, not your drawing speed.

A payments ledger is an event log, not a status column

Question 8 as a schema. The fact is append-only, one row per state transition. Current status is a view over the latest event, never an UPDATE that erases the trail finance needs.

Append, never update: One payment_id walks through authorized, captured, settled as three rows. Nothing is overwritten, so the audit trail is the table itself.

Status is derived: v_payment_current takes the latest event per payment_id. The 'current status' nobody can corrupt, because it is computed, not stored.

Why interviewers like it: Reaching for event sourcing unprompted on a payments prompt is the senior tell. A mutable status column is the answer that gets a follow-up about audit you cannot satisfy.

Notice what the payments diagram refuses to do: it never stores a single mutable 'current status' that an UPDATE could overwrite. Status is a view computed from the latest event. That instinct, preferring an append-only log over an in-place update whenever audit or history is on the line, generalizes well past payments. It is the same reasoning behind SCD Type 2, behind correction events on slowly-changing facts, and behind why nobody who has been burned trusts a column they can silently change.
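
One way the append-only ledger and its derived status view could look, sketched on in-memory SQLite with Python's sqlite3 module. The table name fact_payment_event, the event_id tiebreaker column, and the sample events are assumptions; the columns are a trimmed subset of a real ledger.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Grain: one row per payment state transition. Rows are only ever inserted.
CREATE TABLE fact_payment_event (
    event_id   INTEGER PRIMARY KEY AUTOINCREMENT,  -- assumed tiebreaker
    payment_id TEXT NOT NULL,
    event_type TEXT NOT NULL,
    amount     REAL NOT NULL,
    event_ts   TEXT NOT NULL
);

-- Current status is computed from the latest event, never stored.
CREATE VIEW v_payment_current AS
SELECT e.payment_id, e.event_type AS current_status, e.event_ts
FROM fact_payment_event AS e
WHERE e.event_id = (
    SELECT l.event_id
    FROM fact_payment_event AS l
    WHERE l.payment_id = e.payment_id
    ORDER BY l.event_ts DESC, l.event_id DESC
    LIMIT 1
);
""")

# Made-up events: p1 completes its life cycle, p2 is only authorized.
events = [
    ("p1", "authorized", 50.0, "2024-03-01T10:00:00"),
    ("p1", "captured",   50.0, "2024-03-01T10:05:00"),
    ("p1", "settled",    50.0, "2024-03-02T02:00:00"),
    ("p2", "authorized", 20.0, "2024-03-01T11:00:00"),
]
conn.executemany(
    "INSERT INTO fact_payment_event (payment_id, event_type, amount, event_ts) "
    "VALUES (?, ?, ?, ?)",
    events,
)
conn.commit()

current = dict(
    conn.execute("SELECT payment_id, current_status FROM v_payment_current").fetchall()
)
trail = [
    row[0]
    for row in conn.execute(
        "SELECT event_type FROM fact_payment_event "
        "WHERE payment_id = 'p1' ORDER BY event_ts"
    )
]
print(current)  # {'p1': 'settled', 'p2': 'authorized'}
print(trail)    # ['authorized', 'captured', 'settled']
```

The view answers the status question while the underlying rows keep the full trail, so neither needs an UPDATE.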

Most of the design calls on this page come down to a handful of recurring tradeoffs. The table below collapses the biggest one, star versus the alternatives, into a single comparison you can hold in your head walking into the room.

Star or snowflake, decided in one table

If you only memorize one comparison before a modeling round, make it this one.

| If your situation is | Pick | Why |
| --- | --- | --- |
| Analytical queries on a columnar cloud warehouse, dimensions that fit in memory | Star | Fewer joins, dimensions broadcast into the join, analysts read it without reassembling a hierarchy. |
| A dimension hierarchy whose levels genuinely change on different clocks | Snowflake (one carve-out) | Version each level on its own. The one case where normalizing a dimension still earns its keep. |
| Fixed dashboard access pattern, or an ML feature table needing point-in-time values | OBT | Pre-join into one wide table: no read-time joins, dimension values frozen as of the event. |
| Regulated industry, full audit lineage, many parallel ingestion teams | Data Vault, Kimball on top | Hubs, links, satellites for governance; a Kimball layer above it because nobody queries the Vault directly. |

The dimension that two foreign keys share

One pattern trips up more mid-level candidates than any other on this list: a fact that references the same dimension twice. A trip has a pickup and a dropoff; an order has a bill-to and a ship-to address; a transfer has a source and a destination account. The instinct is to build two tables. The right answer is one dimension, joined twice under different aliases.

It matters because the alternative rots quietly. Split a location into dim_pickup and dim_dropoff and the two copies drift the first time someone fixes a zone name in one and forgets the other, and now the same physical place reports two different cities depending on which end of the trip you joined. The diagram below is question 7 drawn with that single shared dimension.

Two foreign keys into one dimension (the role-playing trap)

The design call from question 7. Pickup and dropoff are the same kind of thing, so they are two foreign keys into one dim_location, not two near-identical tables. Get this wrong and your location attributes drift apart.

One table, two roles: pickup_location_sk and dropoff_location_sk both point at dim_location. In SQL you alias it twice (dim_location pickup, dim_location dropoff) in the join.

Why not two dims: Split it into dim_pickup and dim_dropoff and the zone names, H3 cells, and city spellings drift out of sync the first time someone backfills one and forgets the other.

Surge lives here: surge_multiplier sits on the fact, frozen at trip start. It is a measure of that trip, not an attribute of a location or a rider.
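
The double alias is easier to see in a query. The sketch below runs on in-memory SQLite through Python's sqlite3 module; the fact_trip table name, the zone names, and the trips themselves are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_location (location_sk INTEGER PRIMARY KEY, zone_name TEXT, city TEXT);
CREATE TABLE fact_trip (
    trip_id             INTEGER PRIMARY KEY,
    pickup_location_sk  INTEGER REFERENCES dim_location(location_sk),
    dropoff_location_sk INTEGER REFERENCES dim_location(location_sk),
    surge_multiplier    REAL,  -- frozen at trip start, so it lives on the fact
    total_fare          REAL
);
-- Made-up sample data.
INSERT INTO dim_location VALUES (1, 'Airport', 'Oakland'), (2, 'Downtown', 'San Francisco');
INSERT INTO fact_trip VALUES (500, 1, 2, 1.5, 42.0), (501, 2, 2, 1.0, 9.0);
""")

# One dimension, joined twice under two role aliases.
ROLE_QUERY = """
    SELECT t.trip_id, pickup.zone_name, dropoff.zone_name
    FROM fact_trip AS t
    JOIN dim_location AS pickup  ON pickup.location_sk  = t.pickup_location_sk
    JOIN dim_location AS dropoff ON dropoff.location_sk = t.dropoff_location_sk
    ORDER BY t.trip_id
"""
trips_before = conn.execute(ROLE_QUERY).fetchall()

# A zone rename is one UPDATE, and both roles pick it up together.
conn.execute("UPDATE dim_location SET zone_name = 'Downtown Core' WHERE location_sk = 2")
conn.commit()
trips_after = conn.execute(ROLE_QUERY).fetchall()

print(trips_before)  # [(500, 'Airport', 'Downtown'), (501, 'Downtown', 'Downtown')]
print(trips_after)   # [(500, 'Airport', 'Downtown Core'), (501, 'Downtown Core', 'Downtown Core')]
```

With separate dim_pickup and dim_dropoff tables the same rename would need two updates, and trip 501 would disagree with itself if one were missed.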

How the 25 questions map to seniority

A modeling round at your target level pulls mostly from the matching band, plus one warm-up below it and one stretch above.

| Level | Questions | Topics |
| --- | --- | --- |
| L3 (Junior) | 3 questions, Q1 to Q3 | Fact vs dimension, grain, a first star schema. |
| L4 (Mid) | 7 questions, Q4 to Q10 | Star vs snowflake, the SCD types, surrogate keys, conformed dimensions, OBT, the ride-share and payments prompts. |
| L5 (Senior) | 11 questions, Q11 to Q21 | Type 2 merge logic, late-arriving data, data vault, medallion, slowly-changing facts, bridge tables, partitioning, 3NF, schema evolution, the time dimension. |
| L6 (Staff) | 4 questions, Q22 to Q25 | Multi-region topology, graph vs relational, billion-row scale, a real production tradeoff that went wrong. |

The list below is the full set, ordered by the band each question tends to land at. You will not get all 25 in one loop, so the useful way to read this is to go deep on your band and stay conversant with the neighbors.

The 25 questions, each with the answer that earns the signal

Tagged with the seniority band the question usually lands at and the topic it tests. Read it as the version you would actually say, not the version you would write on a flashcard.

Q01 · L3 · Fundamentals
What is the difference between a fact table and a dimension table?
A fact table records things that happened: an order placed, a page viewed, a payment captured, plus the numbers attached to that event. A dimension table records the context you slice those events by: who the customer is, what the product is, when it happened, where it shipped. Facts run tall and narrow (billions of rows, a handful of columns); dimensions run short and wide (thousands to a few million rows, lots of descriptive attributes). The shape difference is the whole point of dimensional modeling.
Q02 · L3 · Grain
What is the grain of a fact table and why does it matter?
Grain is what one row means, said out loud as 'one row per X'. Declare it before you draw a single box. 'One row per order line' and 'one row per order' are different tables with different columns, and picking the wrong one quietly poisons every metric downstream. Most modeling rounds open here, and a candidate who reaches for tables before pinning grain has already given away that they model by reflex, not by reasoning. Go as fine as the questions you need to answer require, no finer than the storage will bear.
Q03 · L3 · Star schema
Design a star schema for an e-commerce analytics warehouse.
Grain: one row per order line. The fact carries order_item_id, order_id, the four surrogate keys (customer_sk, product_sk, date_sk, store_sk), and the measures: quantity, discount, and line_total are additive, while unit_price is a rate you keep for reference and never sum. Dimensions are dim_customer, dim_product, dim_date, dim_store. Reach for a star, not a snowflake: on a columnar warehouse the dimensions are small enough to broadcast into the join, and analysts read a flat star far faster than a normalized hierarchy they have to reassemble in their head.
Q04 · L4 · Star vs snowflake
When would you choose a snowflake schema over a star?
Rarely, and you should say so. The classic reason to snowflake (normalize a fat dimension to save disk) stopped paying rent once cloud storage got cheap. The one honest case left: a dimension hierarchy whose levels change on genuinely different clocks, where versioning each level on its own buys you something. The thing being tested is your default. Lead with the star and carve out the snowflake when a level demands it; lead with the snowflake and you sound like you learned modeling from a textbook diagram instead of a query plan.
Q05 · L4 · SCDs
Walk through SCD Type 1 vs Type 2 vs Type 3 with a concrete example.
Take a customer who moves from California to New York. Type 1 overwrites the row: it now reads New York, and the fact that they were ever in California is gone. Type 2 closes the old row (set valid_to to the move date, is_current to false) and opens a new one (valid_from = move date, valid_to = null, is_current = true), so a year-old order still joins to the California version. Type 3 keeps both values side by side in current_state and prior_state columns on one row. Type 2 is the analytics default, because the question you actually get asked is 'what did this look like at the time of the event'. Type 1 fits when history is noise. Type 3 fits the narrow case where only the single previous value matters.
Q06 · L4 · Surrogate keys
When should you use a surrogate key instead of a natural key?
Mint a surrogate (a meaningless integer or UUID the warehouse owns) when the natural key drifts (emails change, SKUs get re-issued), when you need Type 2 history (the surrogate names the version, the natural key names the entity), or when the natural key is an ugly composite. Keep the natural key as the join when it is stable, short, and something an analyst actually recognizes; a surrogate buys you nothing there except an extra lookup. Most production dimensions end up with both: surrogate as the primary key, natural key indexed alongside it for lineage.
Q07 · L4 · Modeling
Model a ride-sharing platform with riders, drivers, trips, and surge pricing.
Grain: one row per completed trip. The fact holds trip_id, rider_sk, driver_sk, pickup_location_sk, dropoff_location_sk, the timestamp set (requested, accepted, picked_up, dropped_off), distance_miles, base_fare, surge_multiplier, total_fare, tip. Dimensions: dim_rider (Type 1 is usually enough), dim_driver (Type 2 to version vehicle and rating), dim_location (H3 cells or named zones, never raw lat/long), dim_date, dim_time_of_day. The design call worth making out loud: pickup and dropoff are two foreign keys into the same dim_location, a role-playing dimension, not two separate tables. Surge belongs on the fact because what reporting wants is the multiplier at trip start, frozen.
Q08 · L4 · Modeling
Design the schema for a payments ledger.
Grain: one row per payment event, append only. The fact carries payment_id, account_sk, event_type, amount, currency, status, event_ts. A payment is not a row you update; it is a life cycle (authorized, captured, settled, refunded), and each transition lands as a new immutable row. Current status comes from a view that takes the latest event per payment_id. The signal the interviewer is listening for is whether you reach for event sourcing on your own, because finance and audit need the trail, and a mutable status column erases exactly the history they will eventually subpoena.
Q09 · L4 · Conformed dimensions
What is a conformed dimension and why does it matter?
A conformed dimension is one shared, identical definition that every fact joins to. When sales, support, and marketing all point at the same dim_customer, you can ask a question that spans them: how many customers both placed an order and opened a ticket last month. Let each team build its own 'customer' and that question stops being answerable, because the row counts never reconcile and every meeting turns into a fight about whose number is right. Conformed dimensions are the boring infrastructure that makes the warehouse a single source of truth instead of three.
Q10 · L4 · OBT
When is one big table (OBT) the right pattern?
OBT collapses the star into one wide pre-joined table. It wins for two jobs: dashboards with a fixed access pattern (no joins at read time, so it is fast and cheap) and ML feature tables (one row per training example, with the dimension values frozen as of the event, which kills point-in-time leakage). It loses when the same data has to be sliced from many angles or when the denormalization triples your storage on a dimension you change weekly. In a Kimball shop, OBT is a gold-layer output you build from the star, not a thing you model instead of one.
Q11 · L5 · SCD Type 2 implementation
Walk through the merge logic for SCD Type 2.
Match the incoming row to the current version by natural key. If the tracked attributes are unchanged, do nothing. If they changed, stamp valid_to = now and is_current = false on the existing current row, then insert a fresh row: new surrogate key, same natural key, new attribute values, valid_from = now, valid_to = null, is_current = true. The surrogate rotates, the natural key holds. Every fact must join on the surrogate, never the natural key, or a backdated query silently snaps to today's version. In a MERGE, the expire step is WHEN MATCHED AND a hash of the tracked columns differs THEN UPDATE. Opening the new version is the awkward part: each source row fires at most one MERGE action, so a changed row cannot both close the old version and insert the new one. The usual fixes are a second INSERT statement, or staging each changed row twice, once with a null merge key so that copy falls through to WHEN NOT MATCHED and inserts.
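
The expire-then-insert logic can be sketched without MERGE at all. The version below uses two statements inside one transaction on in-memory SQLite (Python's sqlite3 module). The function name apply_scd2 is hypothetical, region is the only tracked attribute, and an explicit effective_date parameter stands in for 'now'; all of these are assumptions for the illustration, not a production loader.

```python
import sqlite3


def apply_scd2(conn, customer_id, region, effective_date):
    """Apply one incoming row; return the surrogate key of the current version."""
    current = conn.execute(
        "SELECT customer_version_sk, region FROM dim_customer "
        "WHERE customer_id = ? AND is_current = 1",
        (customer_id,),
    ).fetchone()

    # Tracked attribute unchanged: do nothing.
    if current is not None and current[1] == region:
        return current[0]

    # Expire and insert commit together, or roll back together.
    with conn:
        if current is not None:
            conn.execute(
                "UPDATE dim_customer SET valid_to = ?, is_current = 0 "
                "WHERE customer_version_sk = ?",
                (effective_date, current[0]),
            )
        cursor = conn.execute(
            "INSERT INTO dim_customer "
            "(customer_id, region, valid_from, valid_to, is_current) "
            "VALUES (?, ?, ?, NULL, 1)",
            (customer_id, region, effective_date),
        )
        return cursor.lastrowid  # new surrogate; the natural key is unchanged


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_customer ("
    "customer_version_sk INTEGER PRIMARY KEY, customer_id INTEGER, region TEXT, "
    "valid_from TEXT, valid_to TEXT, is_current INTEGER)"
)

sk_first = apply_scd2(conn, 4471, "CA", "2023-01-01")   # first version
sk_repeat = apply_scd2(conn, 4471, "CA", "2023-06-01")  # no change, no new row
sk_moved = apply_scd2(conn, 4471, "NY", "2024-08-12")   # closes CA, opens NY

versions = conn.execute(
    "SELECT customer_version_sk, region, valid_from, valid_to, is_current "
    "FROM dim_customer ORDER BY customer_version_sk"
).fetchall()
print(versions)
# [(1, 'CA', '2023-01-01', '2024-08-12', 0), (2, 'NY', '2024-08-12', None, 1)]
```

Passing the effective date in, rather than reading the clock, is a design choice here: it keeps a replayed or backfilled load from stamping every version with the time the job happened to run.
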
Q12 · L5 · Late-arriving dimensions
How do you handle a dimension row that arrives after the fact that references it?
Land a placeholder. Insert a dimension row keyed on the natural key with the descriptive columns set to 'Unknown' and a flag like is_inferred = true, so the fact has a real surrogate to point at on day one. When the genuine dimension data shows up, you either overwrite the placeholder in place (a Type 1 correction) or open a Type 2 version, depending on whether the gap is worth keeping as history. Bringing this up before you are asked reads as senior, because most candidates model the happy path where every dimension is conveniently present and never think about the order events actually arrive in.
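
A small sketch of the placeholder pattern, again on in-memory SQLite via Python's sqlite3 module. The helper name product_sk_for, the dim_product and fact_order_line tables, and the SKU are hypothetical; the correction shown is the Type 1 overwrite, the simpler of the two options.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (
    product_sk   INTEGER PRIMARY KEY,
    product_id   TEXT UNIQUE,   -- natural key
    product_name TEXT,
    is_inferred  INTEGER
);
CREATE TABLE fact_order_line (
    order_item_id INTEGER PRIMARY KEY,
    product_sk    INTEGER REFERENCES dim_product(product_sk),
    quantity      INTEGER
);
""")


def product_sk_for(conn, product_id):
    """Return the surrogate key, landing an 'Unknown' placeholder if none exists."""
    row = conn.execute(
        "SELECT product_sk FROM dim_product WHERE product_id = ?", (product_id,)
    ).fetchone()
    if row is not None:
        return row[0]
    cursor = conn.execute(
        "INSERT INTO dim_product (product_id, product_name, is_inferred) "
        "VALUES (?, 'Unknown', 1)",
        (product_id,),
    )
    return cursor.lastrowid


# The fact arrives before its dimension row, and still gets a real surrogate.
sk = product_sk_for(conn, "SKU-9")
conn.execute("INSERT INTO fact_order_line VALUES (1, ?, 3)", (sk,))
placeholder = conn.execute(
    "SELECT product_name, is_inferred FROM dim_product WHERE product_sk = ?", (sk,)
).fetchone()

# The genuine dimension data shows up later: overwrite the placeholder in place.
conn.execute(
    "UPDATE dim_product SET product_name = ?, is_inferred = 0 "
    "WHERE product_id = ? AND is_inferred = 1",
    ("Kettle", "SKU-9"),
)
conn.commit()

joined = conn.execute(
    "SELECT f.order_item_id, p.product_name, p.is_inferred "
    "FROM fact_order_line AS f JOIN dim_product AS p ON p.product_sk = f.product_sk"
).fetchall()
print(placeholder)  # ('Unknown', 1)
print(joined)       # [(1, 'Kettle', 0)]
```

Because the fact already points at the placeholder's surrogate key, the correction needs no change to the fact table.
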
Q13 · L5 · Late-arriving facts
What happens when facts arrive after a downstream rollup has been computed?
You pick your poison. Reprocess the affected partitions of the rollup: correct and clean, but you pay for the recompute. Or land the late fact and emit a separate adjustment record that the rollup folds in on read: cheaper to write, more complex to read, and you now owe everyone documentation. The choice follows the consumer. A dashboard that scans one partition per query barely notices an adjustment; an ML job that reads whole partitions wants the partition simply rebuilt so it never sees a split. Naming that the right answer depends on the read pattern, rather than declaring one approach 'correct', is the senior version.
Q14 · L5 · Data vault
When would you use Data Vault 2.0 instead of Kimball?
Data Vault splits the world into hubs (business keys), links (the relationships between them), and satellites (descriptive attributes versioned over time). Its real customer is governance: regulated industries (finance, insurance, healthcare), end-to-end audit lineage, source systems that change schema constantly, and several ingestion teams loading in parallel without stepping on each other. The cost is that nobody queries it directly, because every business question becomes a six-table join, so teams build a Kimball-shaped layer on top anyway. In an interview, propose Kimball unless the company actually has the regulated mandate and the EDW org to carry the Vault, and say that out loud so it reads as judgment rather than ignorance.
Q15 · L5 · Medallion
What is medallion architecture and how does it relate to star schema?
Three layers, each with its own owner and quality bar. Bronze is raw and immutable, schema-on-read, the thing you can always replay from. Silver is cleaned, typed, deduplicated, and conformed across sources. Gold is business-ready, and gold is usually where your star schemas live. The payoff is that a bug in silver does not send you back to the source systems; you replay silver from bronze. Medallion answers 'how do I organize the layers'; star schema answers 'how do I model the gold layer'. They are not alternatives, and a candidate who says 'we use medallion instead of star' has confused a filing system with a blueprint.
Q16 · L5 · Slowly changing facts
How do you handle a fact that needs to be corrected after the fact?
Two clean options. Append a correction event: leave the original row untouched and write a new row that records the delta with a pointer back to the original. Or update in place and write the before-and-after into an audit table. Append-only makes any historical state reproducible (filter to a cutoff date and replay) at the cost of query complexity. In-place is trivial to read and a pain to audit. Pick by which the consumer needs more: point-in-time correctness, or a query they can write without a CTE explaining the corrections.
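
To make the append-only option concrete, here is a minimal sketch on in-memory SQLite (Python's sqlite3 module). The fact_invoice table, the corrects_row_id pointer, the recorded_at column, and the amounts are all assumptions chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_invoice (
    row_id          INTEGER PRIMARY KEY,
    invoice_id      TEXT,
    amount          REAL,     -- original amount, or a signed delta
    corrects_row_id INTEGER REFERENCES fact_invoice(row_id),
    recorded_at     TEXT      -- when this row landed in the warehouse
);
-- Original fact, then a later correction stored as a delta pointing back at it.
INSERT INTO fact_invoice VALUES (1, 'INV-7', 100.0, NULL, '2024-01-05');
INSERT INTO fact_invoice VALUES (2, 'INV-7', -20.0, 1,    '2024-02-10');
""")


def total_as_of(conn, invoice_id, cutoff):
    """Replay the invoice using only rows recorded on or before the cutoff."""
    return conn.execute(
        "SELECT SUM(amount) FROM fact_invoice "
        "WHERE invoice_id = ? AND recorded_at <= ?",
        (invoice_id, cutoff),
    ).fetchone()[0]


as_reported_in_january = total_as_of(conn, "INV-7", "2024-01-31")
as_known_in_march = total_as_of(conn, "INV-7", "2024-03-01")
print(as_reported_in_january)  # 100.0
print(as_known_in_march)       # 80.0
```

Both numbers stay reproducible because the original row was never touched; the price is that every reader has to sum deltas instead of reading one value.
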
Q17 · L5 · Bridge tables
How do you model a many-to-many relationship in a dimensional schema?
Put a bridge table between the fact and the dimension, with a foreign key to each side. If the relationship is fractional, add a weight (a sale credited to two reps gets 0.5 each, so the totals still foot). You see this in healthcare (one encounter, many diagnoses), retail (one product, many categories), ad tech (one impression, many touchpoints). The catch is that the bridge multiplies rows on the fact side and any analyst who forgets the weight double-counts. When the set is small and fixed, denormalizing the categories straight onto the fact is often the saner call.
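
The double-counting risk is easiest to see with numbers. This sketch uses in-memory SQLite through Python's sqlite3 module; the table names (fact_sale, dim_rep, bridge_sale_rep), the reps, and the amounts are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_sale (sale_id INTEGER PRIMARY KEY, amount REAL);
CREATE TABLE dim_rep (rep_sk INTEGER PRIMARY KEY, rep_name TEXT);
CREATE TABLE bridge_sale_rep (
    sale_id INTEGER REFERENCES fact_sale(sale_id),
    rep_sk  INTEGER REFERENCES dim_rep(rep_sk),
    weight  REAL,              -- weights for one sale sum to 1.0
    PRIMARY KEY (sale_id, rep_sk)
);
-- Sale 1 is split evenly between two reps; sale 2 belongs to one rep.
INSERT INTO fact_sale VALUES (1, 100.0), (2, 40.0);
INSERT INTO dim_rep VALUES (1, 'Ana'), (2, 'Ben');
INSERT INTO bridge_sale_rep VALUES (1, 1, 0.5), (1, 2, 0.5), (2, 1, 1.0);
""")

# Correct: apply the weight, so per-rep credit still adds up to total revenue.
credit_by_rep = conn.execute("""
    SELECT r.rep_name, SUM(f.amount * b.weight)
    FROM fact_sale AS f
    JOIN bridge_sale_rep AS b ON b.sale_id = f.sale_id
    JOIN dim_rep AS r ON r.rep_sk = b.rep_sk
    GROUP BY r.rep_name
    ORDER BY r.rep_name
""").fetchall()

# The trap: summing the raw amount through the bridge counts sale 1 twice.
double_counted = conn.execute("""
    SELECT SUM(f.amount)
    FROM fact_sale AS f
    JOIN bridge_sale_rep AS b ON b.sale_id = f.sale_id
""").fetchone()[0]

true_total = conn.execute("SELECT SUM(amount) FROM fact_sale").fetchone()[0]

print(credit_by_rep)   # [('Ana', 90.0), ('Ben', 50.0)]
print(double_counted)  # 240.0
print(true_total)      # 140.0
```

The weighted credits sum to 140.0, matching the fact table, while the unweighted join reports 240.0.
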
Q18 · L5 · Partitioning
How do you choose a partition key for a fact table?
Date wins by default, because nearly every analytical query filters on time, and date partitioning lets the engine prune to the few partitions a query touches. Generalize that: partition on whatever sits in the WHERE clause of most queries. For a multi-tenant fact, a composite of date plus tenant_id pays off when one tenant dominates scans. Avoid blowing past roughly ten thousand partitions; the catalog metadata operations start to drag. The detail that signals seniority is talking about partition pruning and predicate pushdown by name, and knowing that an over-partitioned table can be slower than an unpartitioned one.
Q19 · L5 · Normalization
When does 3NF still make sense for OLTP?
When writes have to be correct more than reads have to be fast. Checkout, banking ledgers, inventory: these write constantly, need referential integrity and atomic updates, and benefit from small row footprints. 3NF gives you the constraints that enforce all of that. Analytics wants the mirror image: denormalized, read-optimized, eventually consistent is fine. The trap in an interview is treating modeling as one skill. OLTP and OLAP are scored on opposite rubrics, and the person who normalizes a warehouse is making the same mistake as the person who denormalizes a checkout flow.
Q20 · L5 · Schema evolution
How do you add a column to a fact table consumed by many downstream pipelines?
Additive nullable columns are close to free: every consumer that names its columns ignores what it does not select, and only SELECT * readers and positional inserts notice. The danger is the breaking change (a rename, a type narrowing, a drop), which needs a real migration: deprecate, dual-write through a window, move the consumers, then remove. The pattern that prevents the classic 3am incident is a schema registry (Avro or Protobuf, enforced at the producer) so the compatibility check fails the bad change at the source, instead of failing a downstream job after someone changed a type and told nobody.
Q21 · L5 · Time dimension
Why use a dim_date table instead of the warehouse's date functions?
Three reasons that compound. First, business calendars that do not match the Gregorian one: a fiscal year ending in March, retail's 4-5-4 week structure, region-specific holidays. Second, attributes you would otherwise recompute in every query (is_weekend, is_holiday, fiscal_period, quarter) sit precomputed in the table. Third, one place to change them; update the holiday list in dim_date and every fact that joins date_sk inherits it. When the interviewer pushes back with 'why not just DATE_TRUNC', the move is to grant that DATE_TRUNC handles the trivial cases and then name the fiscal-calendar case it cannot.
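
A sketch of how such a table might be generated for a fiscal year ending in March, using Python's datetime and sqlite3 modules. The function name build_dim_date and the column set are hypothetical, and two conventions are assumed: the fiscal year is labeled by the calendar year in which it ends, and the holiday set is supplied by the caller.

```python
import sqlite3
from datetime import date, timedelta


def build_dim_date(start, end, holidays=frozenset()):
    """Yield one dim_date row per calendar day from start to end inclusive."""
    day = start
    while day <= end:
        # Assumed convention: fiscal year starts 1 April, named for its end year.
        fiscal_year = day.year + 1 if day.month >= 4 else day.year
        fiscal_quarter = ((day.month - 4) % 12) // 3 + 1
        yield (
            int(day.strftime("%Y%m%d")),  # date_sk, e.g. 20240331
            day.isoformat(),
            int(day.weekday() >= 5),      # Saturday or Sunday
            int(day in holidays),
            fiscal_year,
            fiscal_quarter,
        )
        day += timedelta(days=1)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_date (date_sk INTEGER PRIMARY KEY, full_date TEXT, "
    "is_weekend INTEGER, is_holiday INTEGER, fiscal_year INTEGER, fiscal_quarter INTEGER)"
)
conn.executemany(
    "INSERT INTO dim_date VALUES (?, ?, ?, ?, ?, ?)",
    build_dim_date(date(2024, 1, 1), date(2024, 12, 31), holidays={date(2024, 1, 1)}),
)
conn.commit()

fiscal_lookup = "SELECT fiscal_year, fiscal_quarter FROM dim_date WHERE date_sk = ?"
last_day_of_fy = conn.execute(fiscal_lookup, (20240331,)).fetchone()
first_day_of_fy = conn.execute(fiscal_lookup, (20240401,)).fetchone()
print(last_day_of_fy)   # (2024, 4)
print(first_day_of_fy)  # (2025, 1)
```

March 31 and April 1 fall in the same calendar year but in different fiscal years, which is the case a plain DATE_TRUNC to year or quarter gets wrong.
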
Q22 · L6 · Multi-region
How do you model data when reads happen in multiple regions?
The model usually stays put; what changes is replication and consistency. Serve reads from region-local replicas. Route writes either through a single primary or, for monotonic things like counters, through a CRDT that merges without coordination. Then say the quiet part: stronger consistency for cross-region writes costs you latency, and the right point on that curve is a product decision, not a database default. The interviewer is checking whether you treat 'multi-region' as a schema change (it usually is not) or as a topology and consistency problem (it is).
Q23 · L6 · Graph
When would you choose a graph database over a relational model?
When the query is a traversal, not a join. Friends-of-friends, shortest path, connected components, ring detection: in SQL these become recursive CTEs or a stack of self-joins that get slower with every hop. Fraud rings, recommendation networks, and supply-chain dependency graphs are the canonical fits (Neo4j, Neptune). The honest framing is that a graph database does not replace your warehouse; you run both, and feed the graph from the warehouse for the traversal-heavy slice of the workload.
Q24 · L6 · Scale
Your fact table is approaching a billion rows. What do you change?
Partition by date if you somehow have not. Then cluster (Snowflake) or Z-order (Delta) on the columns that follow date in your filters, so the engine skips files it does not need. If access is bimodal (recent partitions hot, ancient ones cold), tier the cold partitions to cheaper storage. If the dashboards only ever read daily totals, build a daily rollup and keep the fine grain as the source of truth behind it. The move to avoid is sharding the fact across many tables by hand; it relieves storage pressure and breaks every query that used to be a single scan.
Q25 · L6 · Tradeoffs
Walk through a modeling decision you made in production that turned out wrong.
Behavioral, but they are grading the engineering. Name a real call: the constraint you were under, the choice you made, the way it broke, and the thing you would do differently now. The version that lands sounds like 'given what we knew, this was reasonable, and here is the signal we missed and how I would catch it a quarter earlier next time.' Retrospective certainty ('obviously I should have') reads as someone rewriting history, and 'I have never made a wrong call' reads as someone who has not shipped enough to have one.

Reading the schema versus surviving the schema

The two halves of a modeling round are scored separately. Most prep covers only the first one.

Drawing the schema

Pin the grain, place the fact, hang the dimensions, label the SCD types. This is the part every guide teaches and most candidates can do.

It gets you to competent. On its own it gets you a 'leaning no hire' from a sharp interviewer, because everyone in the pool can draw boxes.

Defending the schema

The interviewer adds a requirement, narrows the budget, or changes the question, and watches whether your grain holds and whether you name the failure mode yourself.

This is the half that separates the band. The mock interview drills exactly this: pushback on a live canvas, with the specific exchanges that decided the verdict.

Turning a vague prompt into a schema in the first five minutes

“Pin the grain, name the measures, then let the dimensions fall out of the questions you have to answer.”

The five-minute opening that keeps a modeling round on rails

The prompts are vague on purpose, and the vagueness is part of the test. 'Model the data for a ride-sharing company' is not missing information by accident; the interviewer wants to see you carve a concrete, defensible scope out of an open one. So narrow it out loud: which business process, which grain, what does one row represent.

From there the schema almost builds itself. Once the grain is 'one row per completed trip', the measures are whatever is numeric and additive at that grain (fare, distance, tip), and the dimensions are whatever you need to slice those measures by (rider, driver, location, time). Name the measures, name the slice-by axes, and you have a star before you have drawn a single box.

The candidates who freeze are the ones who try to model everything the business does. You are not designing their warehouse; you are designing the one fact table that answers the question in front of you, and saying which other facts would live next to it later.

Practice the half that actually breaks people

Drawing a schema in the quiet of your own notes is the easy version. Defending it while a timer runs and an interviewer keeps changing the requirement is the version that wrecks most first modeling rounds. That gap is the whole reason to run a mock.

The mock interview hands you a deliberately underspecified prompt, a real schema canvas, and an interviewer that pushes on grain, SCD type, denormalization, and the cost of each call you make. At the end you get a hire / no-hire read and the specific exchanges that decided it, so you know which pushback you folded on. Two or three runs in the week before an onsite change how the live round feels: the pushback stops being a surprise and starts being a rhythm you have a move for.

Livestream Analytics Schema

> We're building the analytics backend for a livestream platform. Creators go live, viewers watch and interact through chat and gifts. We need to track everything for creator payouts, content recommendations, and engagement analytics. Can you design the data model?

Data modeling interview questions: common follow-ups

What does a data modeling interview question actually look like?

A vague business scenario: 'model the data for an e-commerce analytics team', 'design a schema for a payments ledger', 'model a ride-sharing platform'. You pin the grain, sketch a star, defend the SCD type on each dimension, and then survive the follow-ups when the interviewer changes the requirement mid-round.

Are these scenario-based, or definition questions?

Both, weighted toward scenarios at senior level. L3 leans on definitions (fact vs dimension, what grain means). From L4 up, most of the round is a scenario you model live: ride-share, payments, a marketplace. The scenario-based questions on this page (Q3, Q7, Q8) are the ones to rehearse out loud, because reasoning through them under pushback is the thing being graded.

Kimball or Inmon?

Kimball (star schema, dimensional modeling) is the analytics default in 2026, and it is what most companies hiring data engineers actually run, even when the deck says 'lakehouse'. Inmon (3NF atomic core, then marts) almost never wins an interview unless the interviewer specifically asks you to normalize.

Do I need to know Data Vault for a data modeler interview?

Only for senior roles in finance, insurance, healthcare, or government. If it comes up, name the three pieces (hubs, links, satellites), say why governance and parallel ingestion drive it, and add that you would still put a Kimball layer on top. Proposing Data Vault for a generic e-commerce prompt reads as showing off.
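If it helps to see the three pieces side by side, the following is a small sketch of a hub, a link, and a satellite for an insurance-style example, again using `sqlite3`. Everything here is an assumption for illustration: the `hash_key` helper, the table and column names, and the choice of SHA-256 for hash keys. It is not a complete Data Vault 2.0 implementation.

```python
import hashlib
import sqlite3


def hash_key(*business_keys: str) -> str:
    """Deterministic hash of one or more business keys (hypothetical helper)."""
    return hashlib.sha256("|".join(business_keys).encode("utf-8")).hexdigest()


conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hub: one row per business key, no descriptive attributes.
CREATE TABLE hub_customer (
    customer_hk   TEXT PRIMARY KEY,
    customer_id   TEXT NOT NULL UNIQUE,
    load_ts       TEXT NOT NULL,
    record_source TEXT NOT NULL
);
CREATE TABLE hub_policy (
    policy_hk     TEXT PRIMARY KEY,
    policy_id     TEXT NOT NULL UNIQUE,
    load_ts       TEXT NOT NULL,
    record_source TEXT NOT NULL
);
-- Link: one row per relationship between hubs.
CREATE TABLE link_customer_policy (
    customer_policy_hk TEXT PRIMARY KEY,
    customer_hk   TEXT NOT NULL REFERENCES hub_customer (customer_hk),
    policy_hk     TEXT NOT NULL REFERENCES hub_policy (policy_hk),
    load_ts       TEXT NOT NULL,
    record_source TEXT NOT NULL
);
-- Satellite: descriptive attributes, insert-only, keyed by hub key plus load time.
CREATE TABLE sat_customer_details (
    customer_hk   TEXT NOT NULL REFERENCES hub_customer (customer_hk),
    load_ts       TEXT NOT NULL,
    city          TEXT NOT NULL,
    record_source TEXT NOT NULL,
    PRIMARY KEY (customer_hk, load_ts)
);
""")

customer_hk = hash_key("C-100")
policy_hk = hash_key("P-900")
conn.execute("INSERT INTO hub_customer VALUES (?, ?, ?, ?)",
             (customer_hk, "C-100", "2024-01-01", "crm"))
conn.execute("INSERT INTO hub_policy VALUES (?, ?, ?, ?)",
             (policy_hk, "P-900", "2024-01-01", "policy_admin"))
conn.execute("INSERT INTO link_customer_policy VALUES (?, ?, ?, ?, ?)",
             (hash_key("C-100", "P-900"), customer_hk, policy_hk,
              "2024-01-01", "policy_admin"))
# A change in the source becomes a new satellite row, never an update.
conn.executemany("INSERT INTO sat_customer_details VALUES (?, ?, ?, ?)", [
    (customer_hk, "2024-01-01", "Austin", "crm"),
    (customer_hk, "2024-07-01", "Denver", "crm"),
])

# Latest satellite row per hub key: the kind of view a Kimball layer on top would read.
current_customers = conn.execute("""
    SELECT h.customer_id, s.city
    FROM hub_customer AS h
    JOIN sat_customer_details AS s ON s.customer_hk = h.customer_hk
    WHERE s.load_ts = (
        SELECT MAX(load_ts)
        FROM sat_customer_details
        WHERE customer_hk = h.customer_hk
    )
""").fetchall()

print(current_customers)  # [('C-100', 'Denver')]
```

Because hubs, links, and satellites only ever receive inserts keyed by a hash of the business key, each table can be loaded without waiting on the others, which is the parallel-ingestion argument in miniature.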

How detailed should my schema diagram be?

Tables, columns, types (string, int, decimal, timestamp), primary keys, foreign keys, and the SCD type on each dimension. Skip CREATE TABLE statements unless asked. Boxes and lines are fine; the diagram exists to anchor your tradeoff discussion, not to be a production artifact.

What if I don't finish drawing in time?

Get the fact grain and the main dimensions on the board and verbalize the rest. The interviewer is grading reasoning, not drawing speed. A clear partial diagram finished out loud beats a rushed complete one every time.

Is medallion architecture the same thing as a star schema?

No, and conflating them is a tell. Medallion is a layering strategy (bronze raw, silver cleaned, gold business-ready); star schema is a modeling pattern that usually lives in gold. Using both terms correctly signals you have worked in a current-decade stack.
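To make the distinction concrete, here is a toy sketch in plain Python where the three medallion layers are just successive in-memory collections and the star only appears in the last one. The record shapes and names are invented for the example. A real stack would use tables in a lakehouse or warehouse, not lists and dicts.

```python
from decimal import Decimal

# Bronze: raw records as they landed, untyped and with a duplicate delivery.
bronze = [
    {"order_id": "5001", "customer": " Ada ", "amount": "40.00"},
    {"order_id": "5001", "customer": " Ada ", "amount": "40.00"},
    {"order_id": "5002", "customer": "Grace", "amount": "25.50"},
]

# Silver: cleaned, typed, and deduplicated on the business key.
silver_by_order = {}
for row in bronze:
    order_id = int(row["order_id"])
    silver_by_order[order_id] = {
        "order_id": order_id,
        "customer": row["customer"].strip(),
        "amount": Decimal(row["amount"]),
    }
silver = list(silver_by_order.values())

# Gold: business-ready, and this is where the star schema lives.
dim_customer = {}  # customer name -> surrogate key
for row in silver:
    dim_customer.setdefault(row["customer"], len(dim_customer) + 1)

# Grain: one row per order.
fact_orders = [
    {
        "order_id": row["order_id"],
        "customer_sk": dim_customer[row["customer"]],
        "amount": row["amount"],
    }
    for row in silver
]

print(len(bronze), len(silver), len(fact_orders))  # 3 2 2
print(dim_customer)  # {'Ada': 1, 'Grace': 2}
```

The layering decides how clean the data is at each stop; the modeling pattern decides what shape the final stop takes. Gold could just as well hold one big table, which is why the two terms are not interchangeable.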

02 / Why practice

Run a modeling mock before the real one

01
Active recall beats re-reading
A cognitive-science review (Dunlosky et al., 2013) rated practice testing as a high-utility study technique and rated re-reading and highlighting as low-utility.
02
76% of hiring managers reject on the coding task, not the resume
From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice.
03
The round is won on tradeoffs, not on the diagram
Grain, star vs snowflake, SCD type, conformed dimensions, late-arriving data. Modeling under live pushback is what separates the bands, and it is the half almost nobody rehearses.

Start a modeling mock interview

More on data modeling

Data modeling interview questions→

Topic-by-topic breakdown with more worked examples.

Star schema→

Fact tables, dimensions, grain, the full Kimball pattern.

SCD Type 2→

Versioned dimensions and the merge logic, step by step.

Data Vault→

Hubs, links, satellites, and when the governance case is real.

Practice problems with solutions→

Modeling exercises worked end to end with the answer.

Live modeling mock→

A canvas, a vague prompt, and an interviewer that pushes back.
