The Data Modeling Round

The modeling round is the loop-decider. It shows up in roughly a third of loops and more than half of senior ones, and the rejection pattern is consistent: the candidate drew a schema and couldn't defend the choices when the interviewer pushed back. The round is a defense, not a design. Worked schemas, the tradeoffs to volunteer, and the questions the interviewer will ask next. Part of the complete data engineer interview preparation framework.

What the modeling round actually tests

Concepts ranked by how often they come up in real modeling rounds. The first three are non-negotiable; the rest scale with seniority and company type.

Concept	How often	Where it lands hardest
Star schema design	Every loop	All levels
Grain definition	Every loop	State it before drawing
Fact vs dimension classification	Every loop	All levels
SCD Type 1 vs Type 2 vs Type 3	High	L4 and up, BI-heavy roles
Surrogate vs natural keys	High	Every loop
Medallion (bronze/silver/gold)	Common	Databricks and lakehouse roles
Snowflake schema tradeoffs	Common	L4 and up
Conformed dimensions	Senior signal	Multi-team, enterprise roles
Normalization to 3NF	Occasional	OLTP-heavy roles
Slowly changing facts	Senior signal	Audit-heavy roles
Bridge tables for many-to-many	Occasional	L4 and up
Late-arriving dimensions	Senior signal	Streaming-aware roles
Data Vault 2.0	Rare	Regulated industries, large EDWs

Three Worked Schemas From Real Loops

Each schema was the answer expected in a real interview. The trade-off discussion is the part interviewers score, not the diagram itself.

E-commerce

Star schema for orders, products, customers

Grain stated first: one row per order line item. This grain lets you analyze by product without re-aggregation, which is the most common analytics pattern.

GrainOrder-line grain, not order-level. Lets you analyze product mix without re-aggregating.

SCD splitCustomer splits into Type 1 (name, email) and Type 2 (region, segment). The Type 2 dimension carries the history facts join against.

Conformeddim_date conformed across every fact in the warehouse. Same dimension, same calendar, every report.

Other tradeoffs worth volunteering: discount as dollars not percent (additive across products), order_id kept on the fact as a degenerate dimension so order-level rollups stay cheap, currency held alongside original_currency to support FX restatement after the fact.

Ride-sharing

Trips, drivers, riders, surge events

The hard part is modeling surge as either a slowly changing fact (one row per surge event) or a snapshot dimension (per minute, per zone). The right answer is to ask about query patterns first, then choose.

fact_trip
  (trip_id PK, rider_sk FK, driver_sk FK, pickup_geo_sk FK,
   dropoff_geo_sk FK, request_ts, accept_ts, pickup_ts, dropoff_ts,
   distance_miles, duration_seconds, fare_usd, tip_usd, surge_multiplier,
   trip_type, status)

fact_surge_event (slowly changing fact, immutable)
  (surge_id PK, geo_sk FK, multiplier, start_ts, end_ts, reason_code)

dim_driver (Type 2 for vehicle)
  (driver_sk PK, driver_id NK, name, signup_date, vehicle_make,
   vehicle_model, vehicle_year, valid_from, valid_to, is_current)

dim_rider (Type 1)
  (rider_sk PK, rider_id NK, name, signup_date, app_version)

dim_geo (conformed; H3 hex level 9)
  (geo_sk PK, h3_index, city, state, country)

dim_date and dim_time (conformed, minute grain)

Trade-off: surge stored as both a column on fact_trip and as a
separate fact_surge_event. The column captures "what was the
multiplier on this trip", the table captures "when did surges happen".
Two query patterns, two structures.

Ad analytics

Impressions, clicks, conversions at billions of rows per day

Scale forces decisions: pre-aggregated fact tables for dashboards, raw fact tables for fraud and ML. Document retention windows in the model, not as an afterthought.

fact_impression_raw (kept 30 days, partitioned by date)
  (impression_id PK, ad_sk FK, user_sk FK, placement_sk FK,
   date_id FK, hour_id FK, device_id, ip_hash,
   page_url_hash, served_ts)

fact_impression_hourly (kept 13 months, agg from raw)
  (date_id, hour_id, ad_sk, placement_sk, geo_sk,
   impressions, unique_users_hll, click_count, viewable_count)

fact_click (kept 90 days, partitioned by date)
  (click_id PK, impression_id FK, user_sk FK, ad_sk FK,
   click_ts, click_type, revenue_usd)

fact_conversion (kept 13 months)
  (conversion_id PK, click_id FK, user_sk FK, conversion_ts,
   conversion_type, revenue_usd, attributed_revenue_usd)

dim_ad (Type 2 for creative changes; Type 1 for advertiser)
dim_user (anonymized, no PII)

Decisions to volunteer:
  1. HyperLogLog for unique users in hourly agg. Approximate, but
     the only way to roll up unique counts at billion-row scale.
  2. Impression IDs in raw and click in hourly because attribution
     joins back. State the join window (typically 7-day post-click).
  3. Retention as a model property, not a TTL afterthought.

What Interviewers Watch For

01
Grain stated before drawing
One sentence: 'one row per order line item per day'. This single statement determines whether the rest of your model is right or wrong. Many candidates draw first, then realize halfway through that the grain was wrong.
02
Star schema as the default
Star schema is right for analytics 90% of the time. Snowflake schema only when dimensions are huge and rarely joined together. 3NF only when the workload is OLTP. Naming the default and the exceptions shows judgment.
03
Surrogate keys for type 2 dimensions
If a dimension changes over time and you keep history, the natural key is no longer unique. Surrogate keys are required for Type 2. Most candidates know this. Few state it without prompting.
04
Trade-offs voiced for every choice
Every denormalization is a trade-off (faster read, slower write, harder updates). Every Type 2 is a trade-off (history vs storage). Naming the trade-off before the interviewer asks is the L5 signal.
05
Late-arriving and out-of-order data
Almost no candidate volunteers this. Saying 'if a click arrives 2 days after the impression, here is how the model handles it' is a near-instant promotion to top of stack.
06
Conformed dimensions across teams
If the company has multiple data marts, dim_customer should be the same schema across all of them. Stating this without prompting tells the interviewer you have worked at scale.

How the Modeling Round Connects to the Rest of the Loop

Every modeling decision re-surfaces in another round. The SCD Type 2 pattern is the schema defense of the same instinct you use in how to pass the SQL round when you reach for ROW_NUMBER deduplication. The grain-of-the-fact-table question is the same partitioning question you answer in how to pass the system design round. The conformed-dimension argument is the same deduplication logic you write in how to pass the Python round.

Companies with warehouse or lakehouse focus push modeling harder. The Snowflake Data Engineer loop spends real time on micro-partition design, and the Databricks Data Engineer loop tests Delta Lake modeling. If you are targeting an analytics-engineer role, the how to pass the analytics engineer interview leans 70% on modeling, so the depth bar is higher than on a pure Data Engineer loop.

Prepare for the interview

01 / Open invite

02min.

Know the patterns before the interviewer asks them.

a data modeling query, the same shape a screen would give you.

The diff against expected. Where ties broke. What you missed.

sandbox

1fact_orders

2 order_id bigint PK

3 customer_sk bigint FK

4 order_date date SCD2

Execute your solution0.4s avg.

PinterestInterview question

Solve a problem

SCD Type 2 Without the Boilerplate

Slowly Changing Dimension Type 2 is the single most asked follow-up in modeling rounds. Most candidates can name it. Fewer can write the merge logic from memory. Practice this pattern until you can produce it without thinking:

When a row in the source changes, expire the current row by setting valid_to and is_current = false, then insert a new row with valid_from = now, valid_to = NULL, is_current = true. The surrogate key changes. The natural key stays. Every fact table joining this dimension must use the surrogate key, not the natural key, or history queries break.

The follow-up after SCD Type 2 is always: how do you query "what did this customer look like on March 15"? Answer: join the fact to the dim using surrogate key (or join on natural key WHERE date BETWEEN valid_from AND valid_to). Be ready for both phrasings.

How to Prepare in Four Weeks

01
Week 1: Star schema fluency
Design 10 schemas from scratch: e-commerce, ride-sharing, ad tech, video streaming, food delivery, social network, B2B SaaS, IoT sensors, banking, healthcare. State grain first. Write trade-offs.
02
Week 2: SCD and history
Implement Type 1, Type 2, and Type 3 logic in SQL or Python. Practice the upsert pattern. Practice queries that ask 'what did this look like on date X'.
03
Week 3: Scale and lakehouse
Medallion architecture (bronze/silver/gold). Partitioning strategies. When to denormalize for read performance. When to use Delta or Iceberg time travel instead of SCD Type 2.
04
Week 4: Defense rounds
Mock modeling rounds where someone pushes back on every choice. Practice answering 'why not snowflake?', 'why surrogate keys?', 'why this grain?', 'how does this handle late data?'. Defense is the round, not the diagram.

Data Modeling Round FAQ

Should I use Kimball or Inmon in a data engineer interview?+

Kimball (star schema) is the default for analytics. Most companies hiring data engineers in 2026 use a Kimball-style warehouse, even when they call it a lakehouse. Inmon (3NF, atomic data, then build marts) is rare in interviews and almost never the right answer unless the interviewer specifically asks for 3NF.

Do I need to know Data Vault 2.0?+

Only for senior roles at financial services, insurance, healthcare, or government. Mention it as 'I have read the Linstedt material; the model is hubs, links, and satellites' if asked. Do not propose Data Vault for a generic e-commerce problem; it is overkill.

How detailed should my schema diagram be?+

Tables, columns, types (string, int, decimal, timestamp, geography), primary keys, foreign keys. Do not write CREATE TABLE statements unless asked. Boxes and lines on a whiteboard are fine. The diagram exists to support your trade-off discussion, not to be a deliverable.

What is the difference between a fact table and a dimension table?+

Facts are events (what happened, when, with what measures). Dimensions are descriptive context (who, what, where). Facts are tall and skinny (billions of rows, few columns). Dimensions are short and wide (thousands to millions of rows, many descriptive columns). Facts grow forever; dimensions grow with the business.

Should I always use surrogate keys?+

Use surrogate keys when the natural key is unstable (changes over time, like email), when you need SCD Type 2 (the surrogate identifies the version), or when the natural key is composite and clumsy. Use natural keys when they are stable, simple, and human-readable, especially for join readability in ad-hoc queries.

How do I handle many-to-many relationships?+

A bridge table with two foreign keys, one to each dimension. Add a weighting factor if the relationship is fractional (e.g., a sale attributed to multiple sales reps). This shows up in healthcare (patient-to-diagnosis), retail (product-to-category), and ad tech (impression-to-conversion).

Is medallion architecture the same as star schema?+

No. Medallion is a layering strategy (bronze = raw ingest, silver = cleaned and conformed, gold = business-ready aggregates). Star schema is a modeling pattern. Gold layer in a medallion is usually a star schema. Bronze is rarely modeled. Silver is sometimes 3NF or wide tables. Use both terms correctly to signal current-decade fluency.

What if I do not finish drawing the schema?+

Get the fact table grain and the main dimensions on the board. Verbalize the rest. The interviewer is assessing your reasoning, not your drawing speed. A clear partial diagram with a clear verbal completion beats a rushed full diagram.

02 / Why practice

Practice schema design with a real interviewer

01
Active recall beats re-reading by 50%
Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom
02
76% of hiring managers reject on the coding task, not the resume
From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice
03
The round is won on tradeoffs, not on the diagram
Grain, star vs snowflake, SCD type, conformed dimensions, late-arriving data. Modeling under live pushback is what separates the bands, and it is the half almost nobody rehearses

Start a modeling mock

More data engineer interview prep guides

how to pass the SQL round→

Window functions, gap-and-island, and the patterns interviewers test in 95% of Data Engineer loops.

how to pass the Python round→

JSON flattening, sessionization, and vanilla-Python data wrangling in the Data Engineer coding round.

how to pass the system design round→

Pipeline architecture, exactly-once semantics, and the framing that gets you to L5.

how to pass the behavioral round→

STAR-D answers tailored to data engineering, with example responses for impact and conflict.

how to pass the Data Engineer take-home→

What graders look for in a 4 to 8 hour Data Engineer take-home, with a rubric breakdown.

how to pass the live coding round→

How to think out loud, handle silence, and avoid the traps that sink fluent coders.

The Data Modeling Round

What the modeling round actually tests

Three Worked Schemas From Real Loops

Star schema for orders, products, customers

Trips, drivers, riders, surge events

Impressions, clicks, conversions at billions of rows per day

What Interviewers Watch For

Grain stated before drawing

Star schema as the default

Surrogate keys for type 2 dimensions

Trade-offs voiced for every choice

Late-arriving and out-of-order data

Conformed dimensions across teams

How the Modeling Round Connects to the Rest of the Loop

Know the patterns before the interviewer asks them.

SCD Type 2 Without the Boilerplate

How to Prepare in Four Weeks

Week 1: Star schema fluency

Week 2: SCD and history

Week 3: Scale and lakehouse

Week 4: Defense rounds

Data Modeling Round FAQ

Practice schema design with a real interviewer

More data engineer interview prep reading

More data engineer interview prep guides