The Data Modeling Round
The modeling round is the loop-decider. It shows up in roughly a third of loops and more than half of senior ones, and the rejection pattern is consistent: the candidate drew a schema and couldn't defend the choices when the interviewer pushed back. The round is a defense, not a design. Worked schemas, the tradeoffs to volunteer, and the questions the interviewer will ask next. Part of the complete data engineer interview preparation framework.
What the modeling round actually tests
Concepts ranked by how often they come up in real modeling rounds. The first three are non-negotiable; the rest scale with seniority and company type.
| Concept | How often | Where it lands hardest |
|---|---|---|
| Star schema design | Every loop | All levels |
| Grain definition | Every loop | State it before drawing |
| Fact vs dimension classification | Every loop | All levels |
| SCD Type 1 vs Type 2 vs Type 3 | High | L4 and up, BI-heavy roles |
| Surrogate vs natural keys | High | Every loop |
| Medallion (bronze/silver/gold) | Common | Databricks and lakehouse roles |
| Snowflake schema tradeoffs | Common | L4 and up |
| Conformed dimensions | Senior signal | Multi-team, enterprise roles |
| Normalization to 3NF | Occasional | OLTP-heavy roles |
| Slowly changing facts | Senior signal | Audit-heavy roles |
| Bridge tables for many-to-many | Occasional | L4 and up |
| Late-arriving dimensions | Senior signal | Streaming-aware roles |
| Data Vault 2.0 | Rare | Regulated industries, large EDWs |
Three Worked Schemas From Real Loops
Each schema was the answer expected in a real interview. The trade-off discussion is the part interviewers score, not the diagram itself.
Star schema for orders, products, customers
Grain stated first: one row per order line item. This grain lets you analyze by product without re-aggregation, which is the most common analytics pattern.
Other tradeoffs worth volunteering: discount as dollars not percent (additive across products), order_id kept on the fact as a degenerate dimension so order-level rollups stay cheap, currency held alongside original_currency to support FX restatement after the fact.
Trips, drivers, riders, surge events
fact_trip (trip_id PK, rider_sk FK, driver_sk FK, pickup_geo_sk FK, dropoff_geo_sk FK, request_ts, accept_ts, pickup_ts, dropoff_ts, distance_miles, duration_seconds, fare_usd, tip_usd, surge_multiplier, trip_type, status) fact_surge_event (slowly changing fact, immutable) (surge_id PK, geo_sk FK, multiplier, start_ts, end_ts, reason_code) dim_driver (Type 2 for vehicle) (driver_sk PK, driver_id NK, name, signup_date, vehicle_make, vehicle_model, vehicle_year, valid_from, valid_to, is_current) dim_rider (Type 1) (rider_sk PK, rider_id NK, name, signup_date, app_version) dim_geo (conformed; H3 hex level 9) (geo_sk PK, h3_index, city, state, country) dim_date and dim_time (conformed, minute grain) Trade-off: surge stored as both a column on fact_trip and as a separate fact_surge_event. The column captures "what was the multiplier on this trip", the table captures "when did surges happen". Two query patterns, two structures.
Impressions, clicks, conversions at billions of rows per day
fact_impression_raw (kept 30 days, partitioned by date)
(impression_id PK, ad_sk FK, user_sk FK, placement_sk FK,
date_id FK, hour_id FK, device_id, ip_hash,
page_url_hash, served_ts)
fact_impression_hourly (kept 13 months, agg from raw)
(date_id, hour_id, ad_sk, placement_sk, geo_sk,
impressions, unique_users_hll, click_count, viewable_count)
fact_click (kept 90 days, partitioned by date)
(click_id PK, impression_id FK, user_sk FK, ad_sk FK,
click_ts, click_type, revenue_usd)
fact_conversion (kept 13 months)
(conversion_id PK, click_id FK, user_sk FK, conversion_ts,
conversion_type, revenue_usd, attributed_revenue_usd)
dim_ad (Type 2 for creative changes; Type 1 for advertiser)
dim_user (anonymized, no PII)
Decisions to volunteer:
1. HyperLogLog for unique users in hourly agg. Approximate, but
the only way to roll up unique counts at billion-row scale.
2. Impression IDs in raw and click in hourly because attribution
joins back. State the join window (typically 7-day post-click).
3. Retention as a model property, not a TTL afterthought.What Interviewers Watch For
- 01
Grain stated before drawing
One sentence: 'one row per order line item per day'. This single statement determines whether the rest of your model is right or wrong. Many candidates draw first, then realize halfway through that the grain was wrong. - 02
Star schema as the default
Star schema is right for analytics 90% of the time. Snowflake schema only when dimensions are huge and rarely joined together. 3NF only when the workload is OLTP. Naming the default and the exceptions shows judgment. - 03
Surrogate keys for type 2 dimensions
If a dimension changes over time and you keep history, the natural key is no longer unique. Surrogate keys are required for Type 2. Most candidates know this. Few state it without prompting. - 04
Trade-offs voiced for every choice
Every denormalization is a trade-off (faster read, slower write, harder updates). Every Type 2 is a trade-off (history vs storage). Naming the trade-off before the interviewer asks is the L5 signal. - 05
Late-arriving and out-of-order data
Almost no candidate volunteers this. Saying 'if a click arrives 2 days after the impression, here is how the model handles it' is a near-instant promotion to top of stack. - 06
Conformed dimensions across teams
If the company has multiple data marts, dim_customer should be the same schema across all of them. Stating this without prompting tells the interviewer you have worked at scale.
How the Modeling Round Connects to the Rest of the Loop
Every modeling decision re-surfaces in another round. The SCD Type 2 pattern is the schema defense of the same instinct you use in how to pass the SQL round when you reach for ROW_NUMBER deduplication. The grain-of-the-fact-table question is the same partitioning question you answer in how to pass the system design round. The conformed-dimension argument is the same deduplication logic you write in how to pass the Python round.
Companies with warehouse or lakehouse focus push modeling harder. The Snowflake Data Engineer loop spends real time on micro-partition design, and the Databricks Data Engineer loop tests Delta Lake modeling. If you are targeting an analytics-engineer role, the how to pass the analytics engineer interview leans 70% on modeling, so the depth bar is higher than on a pure Data Engineer loop.
Know the patterns before the interviewer asks them.
SCD Type 2 Without the Boilerplate
Slowly Changing Dimension Type 2 is the single most asked follow-up in modeling rounds. Most candidates can name it. Fewer can write the merge logic from memory. Practice this pattern until you can produce it without thinking:
When a row in the source changes, expire the current row by setting valid_to and is_current = false, then insert a new row with valid_from = now, valid_to = NULL, is_current = true. The surrogate key changes. The natural key stays. Every fact table joining this dimension must use the surrogate key, not the natural key, or history queries break.
The follow-up after SCD Type 2 is always: how do you query "what did this customer look like on March 15"? Answer: join the fact to the dim using surrogate key (or join on natural key WHERE date BETWEEN valid_from AND valid_to). Be ready for both phrasings.
How to Prepare in Four Weeks
- 01
Week 1: Star schema fluency
Design 10 schemas from scratch: e-commerce, ride-sharing, ad tech, video streaming, food delivery, social network, B2B SaaS, IoT sensors, banking, healthcare. State grain first. Write trade-offs. - 02
Week 2: SCD and history
Implement Type 1, Type 2, and Type 3 logic in SQL or Python. Practice the upsert pattern. Practice queries that ask 'what did this look like on date X'. - 03
Week 3: Scale and lakehouse
Medallion architecture (bronze/silver/gold). Partitioning strategies. When to denormalize for read performance. When to use Delta or Iceberg time travel instead of SCD Type 2. - 04
Week 4: Defense rounds
Mock modeling rounds where someone pushes back on every choice. Practice answering 'why not snowflake?', 'why surrogate keys?', 'why this grain?', 'how does this handle late data?'. Defense is the round, not the diagram.
Data Modeling Round FAQ
Should I use Kimball or Inmon in a data engineer interview?+
Do I need to know Data Vault 2.0?+
How detailed should my schema diagram be?+
What is the difference between a fact table and a dimension table?+
Should I always use surrogate keys?+
How do I handle many-to-many relationships?+
Is medallion architecture the same as star schema?+
What if I do not finish drawing the schema?+
Practice schema design with a real interviewer
- 01
Active recall beats re-reading by 50%
Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom
- 02
76% of hiring managers reject on the coding task, not the resume
From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice
- 03
Five problem shapes cover 80% of data engineer loops
Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition
More data engineer interview prep reading
More data engineer interview prep guides
Window functions, gap-and-island, and the patterns interviewers test in 95% of Data Engineer loops.
JSON flattening, sessionization, and vanilla-Python data wrangling in the Data Engineer coding round.
Pipeline architecture, exactly-once semantics, and the framing that gets you to L5.
STAR-D answers tailored to data engineering, with example responses for impact and conflict.
What graders look for in a 4 to 8 hour Data Engineer take-home, with a rubric breakdown.
How to think out loud, handle silence, and avoid the traps that sink fluent coders.