Data Modeling Round Prep

Data Modeling Interview Questions (2026)

Q: Why is data modeling the hardest interview round?

Because there is no single correct answer. SQL and Python have objectively right or wrong output. Data modeling has trade-offs. The interviewer is evaluating your reasoning process, not just the final schema. You need to articulate why you made each choice and what you would change if requirements shifted.

Q: How do I practice data modeling without a study partner?

Design a schema for a business you know well: your favorite app, your company, an e-commerce site. Write down 3 trade-offs for each design decision. Then change a requirement (add real-time, add a new entity, increase scale 100x) and see how your schema holds up. DataDriven offers interactive schema building where you create tables, define relationships, and validate your design.

Q: Do I need to memorize all SCD types for interviews?

Know Types 1, 2, and 3 cold. Type 1 overwrites, Type 2 adds rows with date ranges, Type 3 adds a column for the previous value. Types 4-7 exist but are rarely asked. If you can explain the trade-offs between Types 1-3 with a concrete example, you are well prepared.

Q: Star schema or snowflake schema: which should I default to in interviews?

Start with star schema unless you have a specific reason to snowflake. Star schemas are simpler, faster for queries, and easier to explain. If the interviewer asks about storage efficiency or dimension table normalization, that is your cue to discuss snowflake. Always frame it as a trade-off, not a rule.

Q: How long should a data modeling interview answer take?

Spend 5 minutes asking clarifying questions about the business requirements. Spend 15 minutes drawing the schema. Spend the remaining 20-25 minutes defending your design and discussing alternatives. Interviewers value the discussion more than the diagram.

Q: What is the difference between a fact table and a dimension table?

A fact table stores measurable events (orders, clicks, transactions) at a specific grain. Each row is one event. Columns are numeric measures (quantity, amount, duration) and foreign keys to dimension tables. A dimension table stores descriptive attributes (customer name, product category, date components). Dimensions change slowly; facts accumulate quickly.

Q: What is a surrogate key and when should I use one?

A surrogate key is a system-generated, meaningless integer (or UUID/hash) used as the primary key of a table, independent of business values. Use surrogate keys in dimensional warehouses (SCD type 2 requires them because the natural key is no longer unique once you add date ranges), when the natural key is composite or string-heavy, or when the natural key may change over time.

Q: What is the difference between a primary key and a foreign key?

A primary key uniquely identifies each row in a table (NOT NULL + UNIQUE constraint). A foreign key is a column that references the primary key of another table to enforce relationships. In OLTP systems both are heavily used; in modern cloud warehouses, primary keys are still defined for documentation and modeling tools, but foreign key constraints are often disabled or skipped for write performance.

Q: What is a data model and what types are tested in interviews?

A data model is a structured definition of how data is organized, related, and constrained. Three layers are commonly tested: conceptual (entities and relationships, no implementation detail), logical (tables, columns, keys, normalization), and physical (engine-specific: indexes, partitioning, distribution keys). Most interviews focus on logical modeling with occasional physical-modeling questions for senior roles.

Q: What are the most common data modeling tools used in industry?

ER/Studio, ERwin, and Lucidchart are common for enterprise and on-premise. dbdiagram.io and DrawSQL are popular for cloud and modern stack work. Within data warehouses, Looker (LookML), dbt models, and Cube.dev are de-facto modeling layers. For interviews, you mostly draw on a whiteboard or in a shared doc; tool fluency rarely matters as much as model reasoning.

Roughly a third of data engineering interviews include a dedicated data modeling round. Below: questions by topic, plus three fully worked examples that cover the topics interviewers reach for most often: star schema design, normalization, and SCD Type 2.

Practice Now Full Interview Guide

Updated April 2026·By The DataDriven Team

What this guide actually says

01The data modeling round is the only round that cannot be passed by memorizing patterns. The interviewer is grading reasoning, not recall.
02Interviewers grade you on grain, not on Kimball trivia. If your one-sentence grain statement is sloppy, the rest of the schema does not matter.
03If you cannot defend why a column lives on the fact table and not the dimension, you fail. Pricing and quantities are the most common offenders.
04SCD Type 2 is the median question. The way you handle late-arriving facts is the differentiator that separates mid-level from senior.
05Star schema is the default in 2026, OBT is increasingly defensible on columnar warehouses, snowflake is rare. Have a one-line reason for each.
06Bridge tables, closure tables, and many-to-many time-bounded relationships are where senior candidates separate from the pack.

TL;DR: Data Modeling Interview Questions

The six topics interviewers reach for most often, and what you need to actually do under time pressure.

Six areas

What gets tested

Data modeling interview questions for data engineers in 2026 focus on six areas: normalization (1NF-3NF) and denormalization, dimensional modeling (star vs snowflake), fact and dimension table design, SCD types (1, 2, 3, 6), keys (primary, foreign, surrogate, natural), and ER diagrams and constraints.

Scope

What you need to do

For most roles you need to design a star schema for a given business case, choose the right SCD strategy, justify primary key choices (surrogate vs natural), and identify normalization violations in a denormalized table. Senior roles add modeling for slowly-changing dimensions in cloud warehouses, hybrid normalization (Data Vault, OBT), and migration patterns.

What grain actually means and why interviewers care

Most candidates flunk this round by being imprecise about grain. Grain is the one sentence that says what one row in the fact table represents. If you cannot say it cleanly, every downstream answer is shaky.

The interview-ready grain statement

Three required pieces: subject, qualifier, temporal scope.

Subject: the noun being measured (order line item, sensor ping, payment, login). Concrete, not abstract.
Qualifier: the modifier that disambiguates. "Per fulfillment event," "per attempted vs successful," "per warehouse location."
Temporal scope: the time grain. Per day. Per event timestamp. Per hour bucket. Without this, summing measures will silently double-count or under-count.

Five grain statements ranked from bad to senior

The same scenario, articulated five ways. The bottom three are interview-passing. Only the last one is what senior candidates say.

Bad

"One row per customer."

Useless. Customers do things over time. This grain answers no useful question. Most candidates start here, get pushed, and never recover.

Vague

"One row per order."

Better, but still incomplete. Is a partial shipment one row or two? Is a returned line item subtracted from this row or written as a separate row? The interviewer will ask.

Acceptable

"One row per order line item."

Now the table answers product-level questions. This is the median answer that gets you through most interviews if you can defend it.

Strong

"One row per order line item per fulfillment event."

Now you have an event grain. Splits, returns, partial cancellations all become first-class rows. Measures stay additive across the lifecycle. This is the answer that signals you have shipped this before.

Senior

"One row per order line item per fulfillment event, with the dim version effective at the event timestamp captured as a foreign key."

You have grain plus you have explicitly tied the row to the right SCD Type 2 dim version, so late-arriving facts and dimension drift do not corrupt historical reporting. This is the senior answer.

Interview tip

Before you draw a single table, say the grain out loud. Watch the interviewer's face. If they nod, build on it. If they squint, the grain is wrong and the next 25 minutes will be defensive instead of constructive. State, do not draw, until grain is agreed.

Data ModelingPractice grain on this problem

Customer Address History

People move. Sometimes twice in a month. How do you remember where everyone was, and when?

“The interviewer doesn't want to see your favorite schema. They want to see whether you can defend its weakest point under pressure.”

What every senior data engineer figures out by their second offer cycle

SCD Types: when each one is wrong

Most candidates know 'Type 2 = history.' Push past that. The question is when each type is the wrong choice and what you reach for instead.

Type 1

Overwrite. When it is right.

Use Type 1 when downstream queries genuinely do not care about history. Typo corrections. Email fixes after a bounce. ML feature stores where the feature is computed at request time and you want the latest value. Type 1 is wrong when any downstream user might say 'show me what this customer looked like at the time of the order.'

Type 2

Default for analytics.

Add a new row, close the old one with eff_to. Surrogate key on the dim, fact tables join on the surrogate. This is the median right answer in 80% of interviews. The differentiator is whether you can talk about late-arriving facts, change-data-capture sources, and how you avoid duplicate effective windows.

Type 3

Rarely the right call. When it is NOT.

Type 3 stores only the previous value (previous_city, previous_tier). It is rarely the right answer because most analytics needs more than one version back. Pick Type 3 only when the business explicitly says 'I only ever compare current vs prior' (price tier before/after a promotion). Default to Type 2; if the interviewer asks why not Type 3, the answer is 'because the business will eventually want three versions back, not one.'

Type 4

Dim plus history table.

Keep the current state in a narrow, fast dim and push history to a separate table. Useful when the dim is wide (50+ attributes) but only 2 or 3 of them change. Saves storage vs Type 2 because you do not duplicate the unchanging columns. Cost: queries that need history have to know about the second table.

Type 6

1 + 2 + 3, the auditor's pick.

Combines overwriting current attributes (Type 1), historical rows with effective dating (Type 2), and a current_value column on every historical row (Type 3-style). When auditors or compliance need 'what was the value when this happened, AND what is the current value of the same attribute, in one row.' Heavy, but unambiguous.

Bridge

Many-to-many over time.

When a customer can be in two regions at once, or held by two relationship managers simultaneously, no single SCD type fits. A bridge table with eff_from / eff_to per assignment, optionally with a weighting factor, expresses the relationship cleanly. Often missed by candidates who default to 'just use Type 2' when the relationship is not 1:1.

Data ModelingPractice SCD Type 2 here

The Customer Who Changed

She moved. She upgraded. She became someone new. The record has to keep up.

Late-arriving facts and how candidates lose the offer

The single biggest differentiator between mid-level and senior in the data modeling round. Walk through the scenario, the naive fix, and the right fix.

Scenario

The setup

Your fact table loads sensor pings every 15 minutes. A backlog clears, and 50,000 events from 11 days ago arrive at once. Your dim_sensor has had two SCD Type 2 changes in those 11 days (recalibration, gantry reassignment). The candidate's job: load the late events without corrupting historical reporting.

Naive

What most candidates say

'Just look up the current sensor surrogate key.' This is wrong. The car passed under the old calibration. Joining to the current sk attributes the event to a sensor configuration that did not exist when the event happened. Every report aggregated by calibration_date breaks.

Right answer (eager)

If your dim materializes effective ranges

Use an as_of_timestamp join: fact.event_ts BETWEEN dim.eff_from AND dim.eff_to. The query naturally pulls the version that was current at the event timestamp. This is cheap on a columnar warehouse with clustering on (sensor_id, eff_from).

Right answer (lazy)

If your dim is a snapshot table

If you only persist the current version, you reconstruct historical state from a change log (CDC). Walk back from the current row using changes ordered by event_ts to get the as-of value. More expensive per query but no storage cost in the dim.

The trade-off

What you give up

Eager materialization (Type 2 with explicit ranges) costs storage and write complexity. Lazy reconstruction (CDC log replay) costs query time and infra complexity. The senior signal is naming both options, picking one, and stating which downstream use-case made you pick it.

What breaks

When candidates lose the offer

If the candidate cannot articulate why joining on the natural key is wrong, they fail. If they propose 'just reload everything every night,' they fail (does not scale, breaks late-arriving fact contracts upstream). If they wave their hands and say 'we use Snowflake's time travel,' they fail (time travel is bounded retention, not a modeling pattern).

Scenario: a sensor event arrives 11 days late.

Event timestamp: 2025-06-15  (the day the car passed sensor 17)
Arrived at warehouse: 2025-06-26
Current dim_sensor row for sensor 17:
  +--------+-----------+---------+------------+------------+------------+
  | sk     | sensor_id | gantry  | calibrated | eff_from   | eff_to     |
  +--------+-----------+---------+------------+------------+------------+
  | 4471   | S-17      | G-W04   | 2024-09-01 | 2024-09-01 | 2025-06-21 |
  | 4719   | S-17      | G-W04   | 2025-06-22 | 2025-06-22 | 9999-12-31 |
  +--------+-----------+---------+------------+------------+------------+

The naive answer: join on the latest sensor_sk for S-17 (4719).
Wrong. The car passed under the OLD calibration (sk 4471).

The right answer: as_of_timestamp join.
  fact.event_timestamp BETWEEN dim.eff_from AND dim.eff_to
This pulls sk 4471 for the 2025-06-15 event and sk 4719 for events after 2025-06-22.

What strong candidates say

"The naive answer is to join on the latest sensor surrogate key, but that attributes the event to a calibration that did not exist when the car passed under the gantry. The right answer is an as-of-timestamp join: fact.event_ts BETWEEN dim.eff_from AND dim.eff_to. The trade-off is between materializing effective ranges eagerly (storage cost, simple queries) and reconstructing them lazily from a CDC log (storage savings, query cost). For high-volume reporting I pick eager; for an operational system with mostly point lookups I pick lazy."

Data ModelingLate-arriving facts: practice

Toll Road Sensor Analytics

Cars enter, cars exit. Except when they don't.

Star vs Snowflake vs One Big Table

Opinionated take. In 2026, with cheap storage and cheap columnar compute, OBT is more defensible than it was when Kimball wrote the books. Star is still the right default. Snowflake is rare.

Property	Star Schema	Snowflake Schema	One Big Table (OBT)
Query simplicity	High	Medium	High (single table)
Query perf on columnar	Strong	Weaker (deeper joins)	Strong (no joins)
ETL complexity	Moderate	Highest	Highest (one wide table to maintain)
SCD handling	Clean (Type 2 on dim)	Same as star, more places to coordinate	Awkward (rebuilds expensive)
Storage cost	Moderate	Lowest	Highest (dim attrs duplicated)
Self-service analytics	Best	Worse (more tables to model)	Variable (works for one cut, fails for many)
When it goes wrong	Cross-fact joins, conformed-dim drift	Anywhere the join chain exceeds 3 levels	Schema drift, multi-audience filtering

The 2026 take, comparison-by-comparison

Query simplicity: OBT wins for a single dominant audience cut. Star wins when multiple audiences need different filter sets. Snowflake loses on simplicity for everyone.
Query performance on a columnar store: OBT is competitive because columnar pruning ignores unused columns. Star pays one or two cheap dimension joins. Snowflake pays many deeper joins, often with worse selectivity.
ETL complexity: OBT is the most complex to build (many sources land in one wide table; one schema change touches everything). Star is moderate. Snowflake is the most complex to keep consistent across many normalized dim tables.
Slowly-changing handling: Star handles SCD Type 2 cleanly via dim surrogate keys. OBT handles SCD by writing the dim attributes onto each fact row (eager); rebuilds become expensive. Snowflake inherits star's behavior, more places to coordinate.
When OBT goes wrong: when multiple downstream users need to filter by dimension attributes that were not duplicated onto the OBT, when SCD changes require expensive backfills, and when self-service tools need a clean dim to power their dropdowns.

Interview-ready stance

"I default to star for BI workloads on a columnar warehouse. OBT is increasingly defensible when there is one dominant audience and one dominant query pattern, especially for ML features. Snowflake I reach for only when a dimension hierarchy has attributes that change at very different rates and the storage savings actually matter at the scale we operate at."

Bridge tables and many-to-many

Skipped in too many study guides. Bridge tables are how you model relationships that a single SCD type cannot express. If the relationship is many-to-many, especially over time, you need a bridge.

Customer-Account

Customer-to-Account in a household model

One customer can hold many accounts. One account can be held by multiple customers (joint, household). bridge_account_customer with role + weighting_factor cleanly expresses the relationship and lets you allocate balances when reporting needs per-customer measures.

Order-Promo

Order-to-Promotion stacking

One order can carry multiple promotions (loyalty + first-time + free-shipping). bridge_order_promo with allocated_discount per promo lets reporting attribute the discount cost to the right promo without double-counting.

Patient-Clinician

Patient-to-Clinician care teams

Modern care models assign multiple clinicians per patient (PCP, specialist, nurse practitioner). bridge_patient_clinician with role + eff_from / eff_to over time lets you answer 'who was on this patient's care team during this admission' even when the team rotated mid-stay.

Customer-to-Account (household model)

  dim_customer            bridge_account_customer        dim_account
  +-------------+         +------------------------+      +-------------+
  | customer_sk |--+      | account_sk         FK  |  +---| account_sk  |
  | name        |  +-----<| customer_sk        FK  |>-+   | account_no  |
  | email       |         | role                   |      | open_date   |
  +-------------+         | weighting_factor       |      | tier        |
                          +------------------------+      +-------------+

Why the bridge:
  - One account can be held by multiple customers (joint, household).
  - One customer can hold multiple accounts (checking + savings).
  - role is on the bridge: "primary", "joint", "authorized_user".
  - weighting_factor allocates account-level measures (balance, fees) when
    the business needs to attribute them to individual customers.
    Three joint holders with weighting 0.33 each prevents triple-counting.

Without the bridge:
  - You either pick a single primary customer (lose other holders),
    cartesian-join (triple-count balances), or denormalize into the fact
    (still cannot represent the many-to-many cleanly).

Why candidates miss this

Most candidates default to "just put it on the dim" or "just put it on the fact." Both fail when the relationship is genuinely many-to-many. The senior signal is to recognize the m:n shape early, name the bridge, and explain how the weighting_factor prevents double-counting downstream.

Data ModelingBridge a real m:n relationship

B2B Invoicing Data Model

Invoices go out, partial payments trickle in, and some customers are three months overdue.

Star vs Snowflake Schema: Decision Matrix

The most common dimensional modeling question. Memorize this table and you can answer 80% of star/snowflake comparisons.

Property	Star Schema	Snowflake Schema
Dimensions	Denormalized (single table per dim)	Normalized (broken into multiple tables)
Query joins	Few (1 fact + N dims)	Many (deeper join chains)
Storage	Higher (redundant attributes)	Lower
BI tool friendliness	Excellent	Worse (more tables to model)
Update consistency	Lower (denormalized data drifts)	Higher (single source of truth)
Best for	Analytics, BI dashboards, modern warehouses	Storage-constrained OLAP, hierarchical attributes

Primary, Foreign, Surrogate Keys: When to Use Each

Key choice questions appear in nearly every data modeling interview. Know the tradeoffs.

Key type	What it is	Use when
Natural key	A real-world identifier (email, SSN, ISBN)	Source-of-truth tables where the identifier won't change
Surrogate key	System-generated meaningless integer (auto-increment, UUID, hash)	Dimensional warehouses, especially SCD type 2 where natural key alone is not unique
Primary key	Constraint that uniquely identifies a row (NOT NULL + UNIQUE)	Every persisted table; choose the surrogate or natural based on stability
Foreign key	Constraint that enforces a row references a primary key in another table	OLTP systems where referential integrity matters; often omitted in cloud warehouses for performance

What interviewers actually grade on

The six questions you will hear, with answer outlines that separate strong from senior. The questions are not new. The answers that earn offers are not the obvious ones.

Q01

Design a schema for an Uber-like ride-sharing platform.

Strong answers ask clarifying questions first: which audience (operations, finance, ML)? What grain (one ride, one ride-leg, one driver-shift)? Then they sketch fact_ride at one-row-per-ride grain plus dim_rider, dim_driver, dim_vehicle, dim_geography (pickup AND dropoff as separate FKs), dim_date, dim_promo. Senior answers explicitly call out SCD Type 2 on dim_driver (rating, vehicle, status all change) and a separate fact_ride_lifecycle event-grain table for ETA, accept, cancel events.

Q02

How would you handle a customer with three addresses, two emails, and a merge into another customer?

Stop and clarify what 'three addresses' means. Three current addresses (billing, shipping, residential)? That is bridge_customer_address with role + eff_from/eff_to. Three historical addresses? That is SCD Type 2 on dim_customer. Then the merge: a customer-survivor table mapping merged_customer_id -> surviving_customer_sk so historical facts still resolve. Most candidates miss the merge survivor table and lose history.

Q03

What's the grain of your fact table for [scenario]? Defend it.

Subject + qualifier + temporal scope. 'One row per shipment line per warehouse-event per ship-date.' Defend by enumerating the questions the grain answers and the questions a coarser grain would not. If you cannot list two questions a coarser grain would fail, your grain is over-granular.

Q04

Walk through a Type 2 dimension with effective dating. What goes wrong with late-arriving facts?

Walk through the as_of_timestamp join (fact.event_ts BETWEEN dim.eff_from AND dim.eff_to). Then walk through the failure mode: a fact arrives with an event_ts older than every active eff_to row in the dim because the dim's history was not backfilled deeply enough. Fix: extend the dim's effective windows backwards (UPDATE the earliest row's eff_from to the global minimum), or insert a synthetic 'unknown' dim row for events before recorded history.

Q05

Star or snowflake or one big table? When would you pick each?

Star is the default for BI dashboards on a columnar store. Snowflake when dimension hierarchies have rich attributes that change at different rates (geography: country attrs change every decade, city attrs change yearly, neighborhood attrs change monthly). OBT when one dominant audience uses one denormalized surface and storage is cheap. State the tie-breaker: 'I default to star unless someone tells me a specific reason to snowflake.'

Q06

How do you model a hierarchy where the depth is unknown?

Two options. Adjacency list (parent_id self-reference) is simple but recursive queries are expensive. Closure table (one row per ancestor-descendant pair with depth) makes hierarchical queries O(1) at the cost of storage and write complexity. Bridge tables can express hierarchies that are not strict trees (employees with two managers). State which option you would pick for the asked use case and why.

Why Data Modeling Is the Hardest Interview Round

Data modeling appears in roughly a third of all data engineering interviews, yet it is the round candidates prepare for the least. Unlike SQL or Python, there is no single correct answer. Every schema design is a series of decisions with consequences, and the interviewer's job is to probe whether you understand those consequences.

No single answer

There is no single right schema

Two candidates can produce different schemas for the same problem, and both can pass. What matters is whether you can explain why you chose one approach over another. If the interviewer asks "why not snowflake?" and you cannot articulate the trade-off, that is a fail.

Moving target

Requirements change mid-interview

Interviewers deliberately add new requirements after you finish your initial design. "Now the product team wants real-time updates." "Now we need to track 3 years of history." Your schema needs to bend without breaking.

Conversation

You must think out loud

Unlike SQL where you can silently write and submit, data modeling is a conversation. Interviewers are scoring your reasoning process. Verbalize your assumptions, state your trade-offs, and ask clarifying questions before you draw a single table.

Worked Example: E-Commerce Star Schema

"Design a star schema for an e-commerce platform. Define the grain of your fact table and list your dimension tables." This question appears in nearly every data modeling interview.

                    dim_customer                 dim_product
                +-----------------+         +------------------+
                | customer_id PK  |         | product_id  PK   |
                | name            |         | name             |
                | email           |         | category         |
                | segment         |         | subcategory      |
                | region          |         | brand            |
                | signup_date     |         | unit_cost        |
                +-----------------+         +------------------+
                         |                           |
                         |    fact_order_items        |
                         |  +--------------------+   |
                         +--| customer_id   FK   |---+
                            | product_id    FK   |
                  +---------| date_id       FK   |---------+
                  |         | order_id      DD   |         |
                  |         | order_item_id PK   |         |
                  |         |--------------------|         |
                  |         | quantity           |         |
                  |         | unit_price         |         |
                  |         | discount_amount    |         |
                  |         | total_amount       |         |
                  |         +--------------------+         |
                  |                                        |
          +--------------+                     +-----------------+
          | date_id  PK  |                     | promo_id   PK   |
          | full_date    |                     | promo_name      |
          | day_of_week  |                     | discount_pct    |
          | month        |                     | start_date      |
          | quarter      |                     | end_date        |
          | year         |                     | promo_type      |
          | is_holiday   |                     +-----------------+
          +--------------+                        dim_promotion
             dim_date

Grain

Grain: One row per order line item

An order with 3 products creates 3 fact rows. This is the lowest useful grain: it supports product-level analysis (revenue per SKU, discount effectiveness per category) without requiring re-aggregation. If you chose order-level grain, you would need a separate query to break down product mix.

Degenerate dim

Why order_id is a degenerate dimension

order_id has no attributes of its own (no order_name, order_type). It exists only as a grouping key to reassemble complete orders from line items. Putting it in a separate dimension table would add a JOIN with no analytical value.

Pricing

Why unit_price is on the fact table, not dim_product

Product prices change over time. The fact table captures the price at the moment of sale. If you stored price only on dim_product, historical revenue calculations would use today's price, not the price when the order was placed. This is one of the most common modeling mistakes.

Trade-offs to state in the interview

Star vs snowflake: dim_product.category could be normalized into a separate dim_category table (snowflake). This saves storage if categories have many attributes, but adds a JOIN to every product query. At this scale, star is simpler and query performance is better.
dim_customer SCD strategy: Currently Type 1 (overwrite). If the business needs to analyze orders by the customer's region at the time of purchase, upgrade to SCD Type 2 with effective_from and effective_to columns. Cost: table grows 3-5x over 3 years.
Aggregation tables: If the dashboard queries daily revenue by category, a pre-aggregated table (daily_category_revenue) avoids scanning 500M fact rows. Cost: refresh latency and storage. Benefit: sub-second dashboard queries.

Worked Example: Normalization from 1NF to 3NF

"This table has anomalies. Normalize it." The interviewer hands you a denormalized table and expects you to identify the violations at each normal form and fix them step by step.

Original Table (Unnormalized)

orders_raw
+----------+------------+---------+--------+-----------+----------+
| order_id | order_date | cust_id | cust_name | products           |
+----------+------------+---------+-----------+--------------------+
| 1001     | 2025-03-15 | C1      | Alice     | Widget, Gadget     |
| 1002     | 2025-03-16 | C2      | Bob       | Widget             |
| 1003     | 2025-03-16 | C1      | Alice     | Gadget, Thingamajig|
+----------+------------+---------+-----------+--------------------+

Problem: "products" column contains comma-separated values.
This violates 1NF (each cell must hold a single atomic value).

Step 1: First Normal Form (1NF)

Eliminate repeating groups. Each row stores one product per order.

order_items_1nf
+----------+------------+---------+-----------+-------------+
| order_id | order_date | cust_id | cust_name | product     |
+----------+------------+---------+-----------+-------------+
| 1001     | 2025-03-15 | C1      | Alice     | Widget      |
| 1001     | 2025-03-15 | C1      | Alice     | Gadget      |
| 1002     | 2025-03-16 | C2      | Bob       | Widget      |
| 1003     | 2025-03-16 | C1      | Alice     | Gadget      |
| 1003     | 2025-03-16 | C1      | Alice     | Thingamajig |
+----------+------------+---------+-----------+-------------+

PK: (order_id, product)  -- composite key
Now 1NF: every cell is atomic, every row is unique.

Step 2: Second Normal Form (2NF)

Eliminate partial dependencies. order_date and cust_name depend only on order_id, not on the full composite key (order_id, product).

orders_2nf                          order_items_2nf
+----------+------------+---------+-----------+   +----------+-------------+
| order_id | order_date | cust_id | cust_name |   | order_id | product     |
+----------+------------+---------+-----------+   +----------+-------------+
| 1001     | 2025-03-15 | C1      | Alice     |   | 1001     | Widget      |
| 1002     | 2025-03-16 | C2      | Bob       |   | 1001     | Gadget      |
| 1003     | 2025-03-16 | C1      | Alice     |   | 1002     | Widget      |
+----------+------------+---------+-----------+   | 1003     | Gadget      |
  PK: order_id                                     | 1003     | Thingamajig |
                                                   +----------+-------------+
Now 2NF: no non-key column depends on part of the PK.

Step 3: Third Normal Form (3NF)

Eliminate transitive dependencies. cust_name depends on cust_id, not directly on order_id. Extract it into a separate table.

customers_3nf         orders_3nf                   order_items_3nf
+---------+-----------+  +----------+------------+---------+  +----------+-------------+
| cust_id | cust_name |  | order_id | order_date | cust_id |  | order_id | product     |
+---------+-----------+  +----------+------------+---------+  +----------+-------------+
| C1      | Alice     |  | 1001     | 2025-03-15 | C1      |  | 1001     | Widget      |
| C2      | Bob       |  | 1002     | 2025-03-16 | C2      |  | 1001     | Gadget      |
+---------+-----------+  | 1003     | 2025-03-16 | C1      |  | 1002     | Widget      |
  PK: cust_id            +----------+------------+---------+  | 1003     | Gadget      |
                           PK: order_id  FK: cust_id          | 1003     | Thingamajig |
                                                              +----------+-------------+

Now 3NF: every non-key column depends on the key, the whole key,
and nothing but the key.

When to denormalize back

If your warehouse dashboard queries join orders_3nf to customers_3nf on every single query, and the customer table is small (under 1M rows), denormalizing cust_name back onto the orders table saves a JOIN and speeds up every dashboard query. The trade-off: if Alice changes her name, you need to update every order row. For analytical workloads where name changes are rare and query speed matters, this is a reasonable trade-off. State it explicitly.

Worked Example: SCD Type 2 Implementation

"A customer changes their address. Show how you track the history." SCD Type 2 is the most commonly tested slowly changing dimension pattern.

Three SCD types compared

Type	How it works	Keeps history?	Table growth	Best for
Type 1	Overwrite the old value	No	None	Typo corrections, non-analytical fields
Type 2	Add new row with date range	Full history	3-5x over 3 years	Address, segment, pricing tier
Type 3	Add column for previous value	One version back	Minimal	Before/after comparisons only

SCD Type 2: Before and After a Customer Moves

dim_customer (SCD Type 2)
+----------------+-------------+---------+----------+------------+------------+------------+
| customer_sk PK | customer_id | name    | city     | state      | eff_from   | eff_to     | is_current |
+----------------+-------------+---------+----------+------------+------------+------------+
| 1001           | C-42        | Alice   | Chicago  | IL         | 2023-01-15 | 2025-06-30 | false      |
| 1002           | C-42        | Alice   | Austin   | TX         | 2025-07-01 | 9999-12-31 | true       |
| 1003           | C-88        | Bob     | Seattle  | WA         | 2024-03-01 | 9999-12-31 | true       |
+----------------+-------------+---------+----------+------------+------------+------------+

Key design choices:
  1. customer_sk (surrogate key) is the PK. customer_id is the natural/business key.
  2. eff_to = 9999-12-31 marks the current row. is_current flag is redundant but speeds up queries.
  3. When Alice moves, the old row gets eff_to = 2025-06-30 and is_current = false.
     A new row is inserted with eff_from = 2025-07-01.
  4. Fact tables FK to customer_sk, NOT customer_id. This preserves point-in-time accuracy:
     an order placed when Alice lived in Chicago joins to the Chicago row.

Querying with SCD Type 2

-- Current state of all customers
SELECT * FROM dim_customer WHERE is_current = true;

-- Point-in-time: customer state on a specific date
SELECT * FROM dim_customer
WHERE '2024-06-15' BETWEEN eff_from AND eff_to;

-- Join fact to dimension preserving historical accuracy
SELECT f.order_date, f.total_amount, d.city, d.state
FROM fact_orders f
JOIN dim_customer d ON f.customer_sk = d.customer_sk;

The fact table joins on the surrogate key, not the business key. This means each fact row is permanently linked to the correct version of the customer dimension. If you join on customer_id instead, you get the current address for all orders, which destroys historical accuracy.

Common interview follow-ups

"The table is growing too fast." Partition by is_current, archive old rows to cold storage, or increase the granularity of change tracking (only track meaningful changes, not every field update).
"How do you handle a late-arriving dimension update?" Backdate eff_from and adjust the overlapping row's eff_to. This requires an UPDATE + INSERT, not just an INSERT.
"When would you use Type 3 instead?" When you only need before/after comparison (e.g., pricing tier before and after a promotion) and do not need full history. Type 3 adds a previous_city column instead of new rows.

Myth vs Reality

Five things candidates have been told that are wrong, or wrong in important ways. Each one has cost someone an offer.

The Myth

3NF is the right answer for a warehouse.

The Reality

3NF is a transactional concept. Warehouses denormalize on purpose so analytics queries do not pay for joins on every dashboard load. Knowing 3NF rules is interview table-stakes; applying them blindly to a warehouse design is a fail.

The Myth

Surrogate keys are always better than natural keys.

The Reality

Surrogate is the right default for dim tables (Type 2 history requires it). But natural keys can be the right call for cross-system reconciliation, idempotent loads, and source-of-truth tables where the natural key never changes (ISBN, transaction_id from upstream). The best answer states the rule plus the carve-outs.

The Myth

SCD Type 2 = always the right answer for changing dimensions.

The Reality

Only when downstream queries actually need history. For ML feature stores serving the freshest value at request time, Type 1 is better. For audit-heavy domains, Type 6. For attributes that are many-to-many over time (customer-region), no single SCD type fits and you need a bridge.

The Myth

Star schemas are obsolete.

The Reality

Still the default for BI dashboards on columnar stores. The cheap-storage / cheap-compute argument for OBT does not eliminate the need for clean, denormalized dimensions when self-service users need to filter and group. Star plus aggregation tables remains the boring, reliable answer.

The Myth

Just denormalize everything into one table.

The Reality

Works until it does not. Joins are cheap on columnar stores; rebuilds when the schema drifts are not. Every new column you want to add to a dim attribute touches every historical fact row in the OBT, which is a multi-hour backfill at scale.

Decision matrix: which model for which situation

Eight situations, the schema pattern that fits, and a one-line reason. Internalize this and you have an answer for the most common 'so which would you pick' question.

If your situation is

Pick

Why

BI dashboards, < 10 facts, slowly changing dims

Star + SCD Type 2

Cleanest grammar for self-service analytics; dim history preserved.

ML feature store, point-in-time correctness needed

OBT or feature-table per entity

Features computed and frozen at training-time row; no SCD complexity.

Operational reporting (real-time), low-latency dashboards

Materialized view + denormalized cache

Pre-aggregated by the dimensions you filter on; refresh on a streaming cadence.

Compliance audit trail required

SCD Type 6 or full event log

Type 6 keeps current + history + previous in one row; event log is unambiguous and append-only.

Self-service analytics, lots of ad-hoc joins

Star schema, defensive

Predictable join paths; analysts know which dim to filter without consulting the model.

Wide dim that changes in 2 of 50 columns

SCD Type 4

Avoid duplicating 48 unchanging columns on every history row.

Customer can be in two regions at once

Bridge table with eff dates + weighting

Many-to-many relationships over time cannot be modeled by a single SCD type.

Hierarchy depth is unknown and queries traverse depth

Closure table

O(1) ancestor-descendant lookups beat recursive CTEs at scale; pay storage for query speed.

All Data Modeling Topics by Frequency

Six topic clusters, ordered by how often interviewers reach for them. Each cluster lists the questions you should be ready to answer cold.

Topic 1

Normalization (1NF through 3NF)

Difficulty: Medium

Frequency: Very High

Normalization is the foundation of relational design. Interviewers test whether you can identify violations, explain why they matter, and know when denormalization is the right call.

Given a table with repeating groups in a single column (comma-separated tags), normalize it to 1NF. Explain what problem the original design creates for queries.

Identify the partial dependency in a table with a composite primary key and decompose it into 2NF.

A table stores employee_name, department_name, and department_location. Identify the transitive dependency and normalize to 3NF.

When would you intentionally denormalize a 3NF schema? Give a specific scenario with the trade-offs.

Practice Normalization (1NF through 3NF) questions

Practice Normalization (1NF through 3NF)→

Run real schema and SQL exercises in the browser.

Topic 2

Star vs Snowflake Schemas

Difficulty: Medium

Frequency: Star schema: 4.7% of modeling questions

Star schema appears in 4.7% of data modeling questions, with fact tables (4.7%) and dimension tables (4.2%) closely behind. Interviewers want you to design fact and dimension tables, choose the right grain, and justify star vs snowflake trade-offs.

Design a star schema for an e-commerce platform. Define the fact table grain and list 4 dimension tables with their key attributes.

Convert a star schema into a snowflake schema by normalizing the product dimension. What queries become harder? What storage do you save?

Your fact table has 500M rows. A dimension table has 50M rows. Should you use a star or snowflake schema? Explain your reasoning based on query patterns.

What is a degenerate dimension? Give an example and explain why it lives in the fact table instead of a separate dimension.

Practice Star vs Snowflake Schemas questions

Practice Star vs Snowflake Schemas→

Run real schema and SQL exercises in the browser.

Topic 3

Slowly Changing Dimensions (SCD Types 1-3)

Difficulty: Medium-Hard

Frequency: High

Tracking historical changes in dimension data is a classic interview topic. Interviewers test whether you understand the trade-offs between SCD types and can implement them correctly.

A customer changes their address. Show how SCD Type 1, Type 2, and Type 3 each handle this change. Which columns does each approach add or modify?

Design an SCD Type 2 implementation for a product dimension where price changes frequently. Include the surrogate key, natural key, effective dates, and current flag.

Your team uses SCD Type 2 for a dimension with 10M rows. After 3 years, the table has grown to 80M rows. What strategies can reduce the size without losing history?

When would you choose SCD Type 3 over Type 2? Describe a real scenario where storing only the previous value is sufficient.

Practice Slowly Changing Dimensions (SCD Types 1-3) questions

Practice Slowly Changing Dimensions (SCD Types 1-3)→

Run real schema and SQL exercises in the browser.

Topic 4

Entity-Relationship Modeling

Difficulty: Medium

Frequency: Entity ID: 6.6% of modeling questions

Entity identification is the most common data modeling topic at 6.6%. ER diagrams are the starting point for most schema design interviews. You draw entities, relationships, and cardinalities. Then the interviewer asks follow-up questions about edge cases.

Draw an ER model for a university system: students, courses, professors, and enrollments. Define cardinalities for each relationship.

Model a social network where users can follow other users, create posts, and like posts. How do you handle the self-referencing follow relationship?

Design an ER model for a hotel booking system. Handle the case where a single booking can span multiple rooms and multiple nights.

Your ER model has a many-to-many relationship between orders and products. The intersection table needs to store quantity and unit_price. Why is unit_price stored here instead of on the product table?

Practice Entity-Relationship Modeling questions

Practice Entity-Relationship Modeling→

Run real schema and SQL exercises in the browser.

Topic 5

Keys and Constraints

Difficulty: Easy-Medium

Frequency: Primary keys 5.9%, Foreign keys 4.7%

Primary keys (5.9%) and foreign keys (4.7%) are core modeling topics. Interviewers probe whether you understand why each constraint exists and what breaks when you skip one.

When should you use a surrogate key (auto-increment ID) vs a natural key (email, SSN)? Give one scenario for each.

A table has a composite primary key of (user_id, event_date). Can you add a foreign key from another table to this composite key? What are the implications?

Your schema has no foreign key constraints for performance reasons. How do you ensure referential integrity? What is the cost of skipping enforcement?

Practice Keys and Constraints questions

Practice Keys and Constraints→

Run real schema and SQL exercises in the browser.

Topic 6

Trade-off Reasoning

Difficulty: Hard

Frequency: Very High

The hardest part of a data modeling interview is not the design itself. It is defending your choices when the interviewer pushes back. Every design has trade-offs, and you need to articulate them clearly.

You denormalized a dimension to speed up a critical dashboard query. The interviewer asks: what happens when the source data changes? How do you keep the denormalized copy in sync?

Your schema uses a single wide fact table with 200 columns. The interviewer suggests splitting it into multiple fact tables. Argue for and against each approach.

A stakeholder wants real-time data in the warehouse. Your current design uses daily batch loads. What schema changes would you make to support streaming ingestion without breaking existing queries?

You chose a star schema, but a new requirement needs a query that joins 5 dimension tables. The query is slow. What are your options? Which one do you recommend and why?

Practice Trade-off Reasoning questions

Practice Trade-off Reasoning→

Run real schema and SQL exercises in the browser.

How to Defend Your Schema Design Decisions

When the interviewer pushes back, your job is to explain why you chose what you chose. Four moves that turn a defensive answer into a confident one.

Step 1

Name the trade-off explicitly

"I chose to denormalize the product category onto the fact table. This costs storage and creates update complexity, but it eliminates a JOIN on every dashboard query, which matters at our query volume."

Step 2

Reference the query patterns

Every design decision should tie back to how the data will be queried. "Analysts will filter by date range and group by region, so I put both on the fact table as foreign keys to keep those queries single-scan."

Step 3

Acknowledge what you are giving up

Interviewers respect honesty. "This design does not handle the case where a product belongs to multiple categories. If that requirement appears later, I would add a bridge table. But for now, the simpler design serves the stated requirements."

Step 4

Have a fallback ready

When the interviewer pushes back, do not defend your original design to the death. Say: "If the requirement changes to X, I would restructure this table by Y. Here is what that migration looks like."

Practice the data modeling patterns interviewers actually test

Six real practice problems on DataDriven, free. Each one is a scenario interviewers reach for: history tracking, SCD changes, m:n invoicing, balance accuracy, network ripple, late-arriving sensor events.

Data ModelingOpen

Customer Address History

People move. Sometimes twice in a month. How do you remember where everyone was, and when?

Data ModelingOpen

The Customer Who Changed

She moved. She upgraded. She became someone new. The record has to keep up.

Data ModelingOpen

B2B Invoicing Data Model

Invoices go out, partial payments trickle in, and some customers are three months overdue.

Data ModelingOpen

Loan Management Schema

Money out, payments back. The balance has to be exact.

Data ModelingOpen

Telecom Network Connectivity Warehouse

One device goes down. The ripple keeps going.

Data ModelingOpen

Toll Road Sensor Analytics

Cars enter, cars exit. Except when they don't.

Data Modeling Interview FAQ

Why is data modeling the hardest interview round?+

Because there is no single correct answer. SQL and Python have objectively right or wrong output. Data modeling has trade-offs. The interviewer is evaluating your reasoning process, not just the final schema. You need to articulate why you made each choice and what you would change if requirements shifted.

How do I practice data modeling without a study partner?+

Design a schema for a business you know well: your favorite app, your company, an e-commerce site. Write down 3 trade-offs for each design decision. Then change a requirement (add real-time, add a new entity, increase scale 100x) and see how your schema holds up. DataDriven offers interactive schema building where you create tables, define relationships, and validate your design.

Do I need to memorize all SCD types for interviews?+

Know Types 1, 2, and 3 cold. Type 1 overwrites, Type 2 adds rows with date ranges, Type 3 adds a column for the previous value. Types 4-7 exist but are rarely asked. If you can explain the trade-offs between Types 1-3 with a concrete example, you are well prepared.

Star schema or snowflake schema: which should I default to in interviews?+

Start with star schema unless you have a specific reason to snowflake. Star schemas are simpler, faster for queries, and easier to explain. If the interviewer asks about storage efficiency or dimension table normalization, that is your cue to discuss snowflake. Always frame it as a trade-off, not a rule.

How long should a data modeling interview answer take?+

Spend 5 minutes asking clarifying questions about the business requirements. Spend 15 minutes drawing the schema. Spend the remaining 20-25 minutes defending your design and discussing alternatives. Interviewers value the discussion more than the diagram.

What is the difference between a fact table and a dimension table?+

A fact table stores measurable events (orders, clicks, transactions) at a specific grain. Each row is one event. Columns are numeric measures (quantity, amount, duration) and foreign keys to dimension tables. A dimension table stores descriptive attributes (customer name, product category, date components). Dimensions change slowly; facts accumulate quickly.

What is a surrogate key and when should I use one?+

A surrogate key is a system-generated, meaningless integer (or UUID/hash) used as the primary key of a table, independent of business values. Use surrogate keys in dimensional warehouses (SCD type 2 requires them because the natural key is no longer unique once you add date ranges), when the natural key is composite or string-heavy, or when the natural key may change over time.

What is the difference between a primary key and a foreign key?+

A primary key uniquely identifies each row in a table (NOT NULL + UNIQUE constraint). A foreign key is a column that references the primary key of another table to enforce relationships. In OLTP systems both are heavily used; in modern cloud warehouses, primary keys are still defined for documentation and modeling tools, but foreign key constraints are often disabled or skipped for write performance.

What is a data model and what types are tested in interviews?+

A data model is a structured definition of how data is organized, related, and constrained. Three layers are commonly tested: conceptual (entities and relationships, no implementation detail), logical (tables, columns, keys, normalization), and physical (engine-specific: indexes, partitioning, distribution keys). Most interviews focus on logical modeling with occasional physical-modeling questions for senior roles.

What are the most common data modeling tools used in industry?+

ER/Studio, ERwin, and Lucidchart are common for enterprise and on-premise. dbdiagram.io and DrawSQL are popular for cloud and modern stack work. Within data warehouses, Looker (LookML), dbt models, and Cube.dev are de-facto modeling layers. For interviews, you mostly draw on a whiteboard or in a shared doc; tool fluency rarely matters as much as model reasoning.

What is an entity-relationship diagram (ERD) and why does it matter?+

An ERD visualizes entities (tables) and the relationships between them (one-to-many, many-to-many, etc.) along with key constraints. Most data modeling interviews require you to sketch one. Mastering the notation (crow's foot, Chen) and being able to translate business requirements into entities and relationships in real time is the central skill being tested.

How do you handle late-arriving facts in an SCD Type 2 model?+

Look up the dimension version that was current at the event timestamp, not the row that is current today. If the dim is materialized eagerly with effective_from and effective_to, an as_of_timestamp join (BETWEEN eff_from AND eff_to) gets you the right surrogate key. If the change came in late and overlaps a currently-active row, you split the existing row by closing it at the new event date and inserting a new version starting from that date. The naive approach (re-applying the latest dim row to all backfilled facts) is wrong and is the classic interview-failing answer.

What is the grain of a fact table and why does it matter so much?+

The grain is one sentence that tells you what one row in the fact represents. 'One order line per product per customer per day.' If you cannot say that sentence, you do not have a model. Grain determines what every measure means, what dimensions can sit on the table, and which questions the table can answer. Mid-level and senior interviews almost always probe grain before they probe anything else.

When is one-big-table (OBT) the right answer over a star schema?+

OBT works when downstream users mostly query a single denormalized surface (ML feature stores, ad-hoc Looker explores on a columnar warehouse, BI tools that flatten joins anyway), and when the storage cost of duplicating dimension attributes is cheap compared to the join cost. It breaks when the same fact rows must support many different audience cuts, when SCD history must be enforced, or when downstream tools need a clean dimension to filter by. The 2026 take: in a columnar warehouse with cheap storage and pre-computed micro-partitions, OBT is more defensible than it was in 2010, but it is still the wrong default for self-service analytics.

What is a bridge table and when do you reach for one?+

A bridge table sits between a fact and a dimension, or between two dimensions, when the relationship is many-to-many. Customer-to-account in a household model. Patient-to-clinician in healthcare. Order-to-promotion when one order can have multiple promos. The bridge stores the pair plus an optional weighting factor when measures need to be allocated. Without the bridge you either lose rows (inner join collapses), double-count (cartesian product), or invent fake parent-child hierarchies that mislead the business.

Practice Data Modeling Interactively

Build schemas, define relationships, and validate your designs. The only platform with hands-on data modeling practice.

Build a Star Schema Take the daily challenge

Data Modeling Deep Dives

Dimensional Modeling→

Kimball methodology: facts, dimensions, grain, bus matrix

Slowly Changing Dimensions→

SCD Type 1, 2, 3 and when to use each

Medallion Architecture→

Bronze, silver, gold layers for progressive refinement

Data Vault Modeling→

Hubs, links, satellites for enterprise-scale integration

Related Guides

Star Schema vs Snowflake→

When to use each dimensional modeling approach

OLAP vs OLTP→

How storage layout shapes schema design decisions

SQL Interview Questions→

50+ SQL questions with worked solutions for DE interviews

Data Modeling Interview Questions (2026)

TL;DR: Data Modeling Interview Questions

What gets tested

What you need to do

What grain actually means and why interviewers care

Five grain statements ranked from bad to senior

"One row per customer."

"One row per order."

"One row per order line item."

"One row per order line item per fulfillment event."

"One row per order line item per fulfillment event, with the dim version effective at the event timestamp captured as a foreign key."

SCD Types: when each one is wrong

Overwrite. When it is right.

Default for analytics.

Rarely the right call. When it is NOT.

Dim plus history table.

1 + 2 + 3, the auditor's pick.

Many-to-many over time.

Late-arriving facts and how candidates lose the offer

The setup

What most candidates say

If your dim materializes effective ranges

If your dim is a snapshot table

What you give up

When candidates lose the offer

Star vs Snowflake vs One Big Table

Bridge tables and many-to-many

Customer-to-Account in a household model

Order-to-Promotion stacking

Patient-to-Clinician care teams

Star vs Snowflake Schema: Decision Matrix

Primary, Foreign, Surrogate Keys: When to Use Each

What interviewers actually grade on

Design a schema for an Uber-like ride-sharing platform.

How would you handle a customer with three addresses, two emails, and a merge into another customer?

What's the grain of your fact table for [scenario]? Defend it.

Walk through a Type 2 dimension with effective dating. What goes wrong with late-arriving facts?

Star or snowflake or one big table? When would you pick each?

How do you model a hierarchy where the depth is unknown?

Why Data Modeling Is the Hardest Interview Round

There is no single right schema

Requirements change mid-interview

You must think out loud

Worked Example: E-Commerce Star Schema

Grain: One row per order line item

Why order_id is a degenerate dimension

Why unit_price is on the fact table, not dim_product

Worked Example: Normalization from 1NF to 3NF

Worked Example: SCD Type 2 Implementation

Myth vs Reality

Decision matrix: which model for which situation

All Data Modeling Topics by Frequency

Normalization (1NF through 3NF)

Practice Normalization (1NF through 3NF) questions

Star vs Snowflake Schemas

Practice Star vs Snowflake Schemas questions

Slowly Changing Dimensions (SCD Types 1-3)

Practice Slowly Changing Dimensions (SCD Types 1-3) questions

Entity-Relationship Modeling

Practice Entity-Relationship Modeling questions

Keys and Constraints

Practice Keys and Constraints questions

Trade-off Reasoning

Practice Trade-off Reasoning questions

How to Defend Your Schema Design Decisions

Name the trade-off explicitly

Reference the query patterns

Acknowledge what you are giving up

Have a fallback ready

Practice the data modeling patterns interviewers actually test

Data Modeling Interview FAQ

Practice Data Modeling Interactively

Data Modeling Deep Dives

Related Guides

Data Engineer Interview Prep, explore the full guide

Interview Rounds

By Company

By Role

By Technology

Decisions