The Data Modeling Round

The modeling round is the loop-decider. It shows up in roughly a third of loops and more than half of senior ones, and the rejection pattern is consistent: the candidate drew a schema and couldn't defend the choices when the interviewer pushed back. The round is a defense, not a design. Worked schemas, the tradeoffs to volunteer, and the questions the interviewer will ask next. Part of the complete data engineer interview preparation framework.

What the modeling round actually tests

Concepts ranked by how often they come up in real modeling rounds. The first three are non-negotiable; the rest scale with seniority and company type.

ConceptHow oftenWhere it lands hardest
Star schema designEvery loopAll levels
Grain definitionEvery loopState it before drawing
Fact vs dimension classificationEvery loopAll levels
SCD Type 1 vs Type 2 vs Type 3HighL4 and up, BI-heavy roles
Surrogate vs natural keysHighEvery loop
Medallion (bronze/silver/gold)CommonDatabricks and lakehouse roles
Snowflake schema tradeoffsCommonL4 and up
Conformed dimensionsSenior signalMulti-team, enterprise roles
Normalization to 3NFOccasionalOLTP-heavy roles
Slowly changing factsSenior signalAudit-heavy roles
Bridge tables for many-to-manyOccasionalL4 and up
Late-arriving dimensionsSenior signalStreaming-aware roles
Data Vault 2.0RareRegulated industries, large EDWs

Three Worked Schemas From Real Loops

Each schema was the answer expected in a real interview. The trade-off discussion is the part interviewers score, not the diagram itself.

E-commerce

Star schema for orders, products, customers

Grain stated first: one row per order line item. This grain lets you analyze by product without re-aggregation, which is the most common analytics pattern.

dim_product (T1)PKproduct_skproduct_idcategorybranddim_customer (T1)PKcustomer_skcustomer_idemailnamefact_order_itemsPKorder_item_skorder_idFKproduct_skFKcustomer_skFKdate_skquantityunit_price_usdline_total_usddim_customer_history (T2)PKcustomer_history_skFKcustomer_skregionsegmentvalid_fromvalid_tois_currentdim_datePKdate_skdatefiscal_periodis_holiday
GrainOrder-line grain, not order-level. Lets you analyze product mix without re-aggregating.
SCD splitCustomer splits into Type 1 (name, email) and Type 2 (region, segment). The Type 2 dimension carries the history facts join against.
Conformeddim_date conformed across every fact in the warehouse. Same dimension, same calendar, every report.

Other tradeoffs worth volunteering: discount as dollars not percent (additive across products), order_id kept on the fact as a degenerate dimension so order-level rollups stay cheap, currency held alongside original_currency to support FX restatement after the fact.

Ride-sharing

Trips, drivers, riders, surge events

The hard part is modeling surge as either a slowly changing fact (one row per surge event) or a snapshot dimension (per minute, per zone). The right answer is to ask about query patterns first, then choose.
fact_trip
  (trip_id PK, rider_sk FK, driver_sk FK, pickup_geo_sk FK,
   dropoff_geo_sk FK, request_ts, accept_ts, pickup_ts, dropoff_ts,
   distance_miles, duration_seconds, fare_usd, tip_usd, surge_multiplier,
   trip_type, status)

fact_surge_event (slowly changing fact, immutable)
  (surge_id PK, geo_sk FK, multiplier, start_ts, end_ts, reason_code)

dim_driver (Type 2 for vehicle)
  (driver_sk PK, driver_id NK, name, signup_date, vehicle_make,
   vehicle_model, vehicle_year, valid_from, valid_to, is_current)

dim_rider (Type 1)
  (rider_sk PK, rider_id NK, name, signup_date, app_version)

dim_geo (conformed; H3 hex level 9)
  (geo_sk PK, h3_index, city, state, country)

dim_date and dim_time (conformed, minute grain)

Trade-off: surge stored as both a column on fact_trip and as a
separate fact_surge_event. The column captures "what was the
multiplier on this trip", the table captures "when did surges happen".
Two query patterns, two structures.
Ad analytics

Impressions, clicks, conversions at billions of rows per day

Scale forces decisions: pre-aggregated fact tables for dashboards, raw fact tables for fraud and ML. Document retention windows in the model, not as an afterthought.
fact_impression_raw (kept 30 days, partitioned by date)
  (impression_id PK, ad_sk FK, user_sk FK, placement_sk FK,
   date_id FK, hour_id FK, device_id, ip_hash,
   page_url_hash, served_ts)

fact_impression_hourly (kept 13 months, agg from raw)
  (date_id, hour_id, ad_sk, placement_sk, geo_sk,
   impressions, unique_users_hll, click_count, viewable_count)

fact_click (kept 90 days, partitioned by date)
  (click_id PK, impression_id FK, user_sk FK, ad_sk FK,
   click_ts, click_type, revenue_usd)

fact_conversion (kept 13 months)
  (conversion_id PK, click_id FK, user_sk FK, conversion_ts,
   conversion_type, revenue_usd, attributed_revenue_usd)

dim_ad (Type 2 for creative changes; Type 1 for advertiser)
dim_user (anonymized, no PII)

Decisions to volunteer:
  1. HyperLogLog for unique users in hourly agg. Approximate, but
     the only way to roll up unique counts at billion-row scale.
  2. Impression IDs in raw and click in hourly because attribution
     joins back. State the join window (typically 7-day post-click).
  3. Retention as a model property, not a TTL afterthought.

What Interviewers Watch For

  1. 01

    Grain stated before drawing

    One sentence: 'one row per order line item per day'. This single statement determines whether the rest of your model is right or wrong. Many candidates draw first, then realize halfway through that the grain was wrong.
  2. 02

    Star schema as the default

    Star schema is right for analytics 90% of the time. Snowflake schema only when dimensions are huge and rarely joined together. 3NF only when the workload is OLTP. Naming the default and the exceptions shows judgment.
  3. 03

    Surrogate keys for type 2 dimensions

    If a dimension changes over time and you keep history, the natural key is no longer unique. Surrogate keys are required for Type 2. Most candidates know this. Few state it without prompting.
  4. 04

    Trade-offs voiced for every choice

    Every denormalization is a trade-off (faster read, slower write, harder updates). Every Type 2 is a trade-off (history vs storage). Naming the trade-off before the interviewer asks is the L5 signal.
  5. 05

    Late-arriving and out-of-order data

    Almost no candidate volunteers this. Saying 'if a click arrives 2 days after the impression, here is how the model handles it' is a near-instant promotion to top of stack.
  6. 06

    Conformed dimensions across teams

    If the company has multiple data marts, dim_customer should be the same schema across all of them. Stating this without prompting tells the interviewer you have worked at scale.

How the Modeling Round Connects to the Rest of the Loop

Every modeling decision re-surfaces in another round. The SCD Type 2 pattern is the schema defense of the same instinct you use in how to pass the SQL round when you reach for ROW_NUMBER deduplication. The grain-of-the-fact-table question is the same partitioning question you answer in how to pass the system design round. The conformed-dimension argument is the same deduplication logic you write in how to pass the Python round.

Companies with warehouse or lakehouse focus push modeling harder. The Snowflake Data Engineer loop spends real time on micro-partition design, and the Databricks Data Engineer loop tests Delta Lake modeling. If you are targeting an analytics-engineer role, the how to pass the analytics engineer interview leans 70% on modeling, so the depth bar is higher than on a pure Data Engineer loop.

Prepare for the interview
01 / Open invite
02min.

Know the patterns before the interviewer asks them.

a data modeling query, the same shape a screen would give you.
The diff against expected. Where ties broke. What you missed.
sandbox
1fact_orders
2 order_id bigint PK
3 customer_sk bigint FK
4 order_date date SCD2
5
Execute your solution0.4s avg.
PinterestInterview question
Solve a problem

SCD Type 2 Without the Boilerplate

Slowly Changing Dimension Type 2 is the single most asked follow-up in modeling rounds. Most candidates can name it. Fewer can write the merge logic from memory. Practice this pattern until you can produce it without thinking:

When a row in the source changes, expire the current row by setting valid_to and is_current = false, then insert a new row with valid_from = now, valid_to = NULL, is_current = true. The surrogate key changes. The natural key stays. Every fact table joining this dimension must use the surrogate key, not the natural key, or history queries break.

The follow-up after SCD Type 2 is always: how do you query "what did this customer look like on March 15"? Answer: join the fact to the dim using surrogate key (or join on natural key WHERE date BETWEEN valid_from AND valid_to). Be ready for both phrasings.

How to Prepare in Four Weeks

  1. 01

    Week 1: Star schema fluency

    Design 10 schemas from scratch: e-commerce, ride-sharing, ad tech, video streaming, food delivery, social network, B2B SaaS, IoT sensors, banking, healthcare. State grain first. Write trade-offs.
  2. 02

    Week 2: SCD and history

    Implement Type 1, Type 2, and Type 3 logic in SQL or Python. Practice the upsert pattern. Practice queries that ask 'what did this look like on date X'.
  3. 03

    Week 3: Scale and lakehouse

    Medallion architecture (bronze/silver/gold). Partitioning strategies. When to denormalize for read performance. When to use Delta or Iceberg time travel instead of SCD Type 2.
  4. 04

    Week 4: Defense rounds

    Mock modeling rounds where someone pushes back on every choice. Practice answering 'why not snowflake?', 'why surrogate keys?', 'why this grain?', 'how does this handle late data?'. Defense is the round, not the diagram.

Data Modeling Round FAQ

Should I use Kimball or Inmon in a data engineer interview?+
Kimball (star schema) is the default for analytics. Most companies hiring data engineers in 2026 use a Kimball-style warehouse, even when they call it a lakehouse. Inmon (3NF, atomic data, then build marts) is rare in interviews and almost never the right answer unless the interviewer specifically asks for 3NF.
Do I need to know Data Vault 2.0?+
Only for senior roles at financial services, insurance, healthcare, or government. Mention it as 'I have read the Linstedt material; the model is hubs, links, and satellites' if asked. Do not propose Data Vault for a generic e-commerce problem; it is overkill.
How detailed should my schema diagram be?+
Tables, columns, types (string, int, decimal, timestamp, geography), primary keys, foreign keys. Do not write CREATE TABLE statements unless asked. Boxes and lines on a whiteboard are fine. The diagram exists to support your trade-off discussion, not to be a deliverable.
What is the difference between a fact table and a dimension table?+
Facts are events (what happened, when, with what measures). Dimensions are descriptive context (who, what, where). Facts are tall and skinny (billions of rows, few columns). Dimensions are short and wide (thousands to millions of rows, many descriptive columns). Facts grow forever; dimensions grow with the business.
Should I always use surrogate keys?+
Use surrogate keys when the natural key is unstable (changes over time, like email), when you need SCD Type 2 (the surrogate identifies the version), or when the natural key is composite and clumsy. Use natural keys when they are stable, simple, and human-readable, especially for join readability in ad-hoc queries.
How do I handle many-to-many relationships?+
A bridge table with two foreign keys, one to each dimension. Add a weighting factor if the relationship is fractional (e.g., a sale attributed to multiple sales reps). This shows up in healthcare (patient-to-diagnosis), retail (product-to-category), and ad tech (impression-to-conversion).
Is medallion architecture the same as star schema?+
No. Medallion is a layering strategy (bronze = raw ingest, silver = cleaned and conformed, gold = business-ready aggregates). Star schema is a modeling pattern. Gold layer in a medallion is usually a star schema. Bronze is rarely modeled. Silver is sometimes 3NF or wide tables. Use both terms correctly to signal current-decade fluency.
What if I do not finish drawing the schema?+
Get the fact table grain and the main dimensions on the board. Verbalize the rest. The interviewer is assessing your reasoning, not your drawing speed. A clear partial diagram with a clear verbal completion beats a rushed full diagram.
02 / Why practice

Practice schema design with a real interviewer

  1. 01

    Active recall beats re-reading by 50%

    Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom

  2. 02

    76% of hiring managers reject on the coding task, not the resume

    From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice

  3. 03

    Five problem shapes cover 80% of data engineer loops

    Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition

More data engineer interview prep reading

More data engineer interview prep guides