The data modeling round appears in 30.7% of data engineer interview loops, and 4.7% of all rounds in our dataset are dedicated onsite modeling sessions. The round is a defense, not a design exercise: interviewers ask you to draw a schema for a real business scenario, then push back on every choice. Pass rate correlates more with how you justify trade-offs than with the schema you draw. This page covers one of the eight rounds in the complete data engineer interview preparation framework.
Frequency of concepts that come up in modeling rounds, from 1,042 interview reports.
| Concept | Share of Modeling Questions | Common In |
|---|---|---|
| Star schema design | 18.4% | Every loop |
| Grain definition | 16.2% | Every loop, often before drawing |
| Fact vs dimension classification | 12.1% | Every loop |
| SCD Type 1 vs Type 2 vs Type 3 | 11.7% | L4+, BI-heavy roles |
| Surrogate vs natural keys | 8.9% | Every loop |
| Snowflake schema trade-offs | 5.4% | L4+ |
| Medallion (bronze/silver/gold) | 6.8% | Databricks, lakehouse roles |
| Data Vault 2.0 | 2.1% | Senior, regulated industries |
| Conformed dimensions | 4.3% | Multi-team, enterprise roles |
| Slowly changing facts | 3.6% | Senior |
| Bridge tables for many-to-many | 3.2% | L4+ |
| Late-arriving dimensions | 2.4% | Senior, streaming-aware |
| Normalization to 3NF | 4.9% | OLTP-heavy roles |
Each schema below was the answer expected in a real interview. The trade-off discussion is the part graders score, not the diagram itself.
fact_order_items
(order_item_id PK, order_id FK, product_id FK, customer_id FK,
date_id FK, quantity, unit_price_usd, discount_amount_usd,
line_total_usd, currency, original_currency)
dim_customer (Type 1 for name/email; Type 2 for region)
(customer_sk PK, customer_id NK, name, email,
region_current, signup_date)
dim_customer_history (SCD Type 2 for region/segment)
(customer_sk PK, customer_id NK, region, segment,
valid_from, valid_to, is_current)
dim_product (Type 1)
(product_sk PK, product_id NK, name, category, subcategory,
brand, list_price_usd)
dim_date (conformed)
(date_id PK, date, day_of_week, month, quarter, year, is_weekend)
Trade-offs to volunteer:
1. Order-line grain, not order-level. Lets us analyze product mix.
2. Discount as $ amount, not %, because $ aggregations are
additive across products.
3. Customer split into Type 1 + Type 2 dimensions because email
changes are not analytically interesting; region changes are.

fact_trip
(trip_id PK, rider_sk FK, driver_sk FK, pickup_geo_sk FK,
dropoff_geo_sk FK, request_ts, accept_ts, pickup_ts, dropoff_ts,
distance_miles, duration_seconds, fare_usd, tip_usd,
surge_multiplier, trip_type, status)
fact_surge_event (slowly changing fact, immutable)
(surge_id PK, geo_sk FK, multiplier, start_ts, end_ts, reason_code)
dim_driver (Type 2 for vehicle)
(driver_sk PK, driver_id NK, name, signup_date, vehicle_make,
vehicle_model, vehicle_year, valid_from, valid_to, is_current)
dim_rider (Type 1)
(rider_sk PK, rider_id NK, name, signup_date, app_version)
dim_geo (conformed; H3 hex level 9)
(geo_sk PK, h3_index, city, state, country)
dim_date and dim_time (conformed, minute grain)

Trade-off to volunteer: surge is stored both as a column on fact_trip and as a separate fact_surge_event. The column captures "what was the multiplier on this trip"; the table captures "when did surges happen". Two query patterns, two structures.
fact_impression_raw (kept 30 days, partitioned by date)
(impression_id PK, ad_sk FK, user_sk FK, placement_sk FK,
date_id FK, hour_id FK, device_id, ip_hash,
page_url_hash, served_ts)
fact_impression_hourly (kept 13 months, agg from raw)
(date_id, hour_id, ad_sk, placement_sk, geo_sk,
impressions, unique_users_hll, click_count, viewable_count)
fact_click (kept 90 days, partitioned by date)
(click_id PK, impression_id FK, user_sk FK, ad_sk FK,
click_ts, click_type, revenue_usd)
fact_conversion (kept 13 months)
(conversion_id PK, click_id FK, user_sk FK, conversion_ts,
conversion_type, revenue_usd, attributed_revenue_usd)
dim_ad (Type 2 for creative changes; Type 1 for advertiser)
dim_user (anonymized, no PII)
Decisions to volunteer:
1. HyperLogLog for unique users in the hourly agg. Approximate, but
the only practical way to roll up unique counts at billion-row scale.
2. Impression IDs kept in the raw and click facts, but not in the
hourly agg, because attribution joins back at impression grain.
State the join window up front (typically 7 days post-click).
3. Retention as a model property, not a TTL afterthought.

Every modeling decision re-surfaces in another round. The SCD Type 2 pattern is the schema defense of the same instinct you use in how to pass the SQL round when you reach for ROW_NUMBER deduplication. The grain-of-the-fact-table question is the same partitioning question you answer in how to pass the system design round. The conformed-dimension argument is the same deduplication logic you write in how to pass the Python round.
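Decision 1 in the list above deserves mechanics, not just the name: exact distinct counts cannot be summed across hours, but HyperLogLog registers merge losslessly with a register-wise max. A toy sketch in plain Python; real warehouses use built-ins such as Spark's approx_count_distinct, and the register count and hashing here are purely illustrative:

```python
import hashlib
import math

def _hash64(value):
    # Deterministic 64-bit hash (Python's built-in hash() is salted per run)
    return int.from_bytes(hashlib.sha1(str(value).encode()).digest()[:8], "big")

class HLL:
    def __init__(self, p=8):
        self.p, self.m = p, 1 << p          # m = 2**p registers
        self.reg = [0] * self.m

    def add(self, value):
        x = _hash64(value)
        j = x >> (64 - self.p)              # first p bits pick a register
        w = x & ((1 << (64 - self.p)) - 1)  # remaining bits
        rank = (64 - self.p) - w.bit_length() + 1  # position of leftmost 1-bit
        self.reg[j] = max(self.reg[j], rank)

    def merge(self, other):
        # Union of two sketches: register-wise max. This is why hourly
        # sketches can roll up to daily without re-reading raw rows.
        out = HLL(self.p)
        out.reg = [max(a, b) for a, b in zip(self.reg, other.reg)]
        return out

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        est = alpha * self.m * self.m / sum(2.0 ** -r for r in self.reg)
        zeros = self.reg.count(0)
        if est <= 2.5 * self.m and zeros:   # small-range (linear counting) fix
            est = self.m * math.log(self.m / zeros)
        return int(est)

# 24 hourly sketches over 9,000 distinct users, merged into one daily count
hours = [HLL() for _ in range(24)]
for i in range(30000):
    hours[i % 24].add("user%d" % (i % 9000))
day = hours[0]
for h in hours[1:]:
    day = day.merge(h)
```

With p=8 the standard error is roughly 1.04/sqrt(256), about 6.5%, which is the number to quote when the interviewer asks what "approximate" costs you.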
Companies with a warehouse or lakehouse focus push modeling harder. The Snowflake Data Engineer loop spends real time on micro-partition design, and the Databricks Data Engineer loop tests Delta Lake modeling. If you are targeting an analytics-engineer role, the how to pass the analytics engineer interview guide leans 70% on modeling, so the depth bar is higher than in a pure Data Engineer loop.
Slowly Changing Dimension Type 2 is the single most asked follow-up in modeling rounds. Most candidates can name it. Fewer can write the merge logic from memory. Practice this pattern until you can produce it without thinking:
When a row in the source changes, expire the current row by setting valid_to and is_current = false, then insert a new row with valid_from = now, valid_to = NULL, is_current = true. The surrogate key changes. The natural key stays. Every fact table joining this dimension must use the surrogate key, not the natural key, or history queries break.
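The expire-then-insert step can be sketched end to end with Python's sqlite3. Table and column names mirror the dim_customer_history schema earlier on this page; the upsert helper and sample values are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer_history (
    customer_sk INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key, changes
    customer_id TEXT,                               -- natural key, stable
    region TEXT,
    valid_from TEXT, valid_to TEXT, is_current INTEGER);
INSERT INTO dim_customer_history
    (customer_id, region, valid_from, valid_to, is_current)
VALUES ('C42', 'EMEA', '2023-01-01', NULL, 1);
""")

def scd2_upsert(conn, customer_id, region, as_of):
    """Expire the current row if the tracked attribute changed, then insert."""
    row = conn.execute(
        "SELECT customer_sk, region FROM dim_customer_history "
        "WHERE customer_id = ? AND is_current = 1", (customer_id,)
    ).fetchone()
    if row and row[1] == region:
        return                       # no change: leave history untouched
    if row:
        conn.execute(                # step 1: expire the current version
            "UPDATE dim_customer_history "
            "SET valid_to = ?, is_current = 0 WHERE customer_sk = ?",
            (as_of, row[0]))
    conn.execute(                    # step 2: insert the new open-ended version
        "INSERT INTO dim_customer_history "
        "(customer_id, region, valid_from, valid_to, is_current) "
        "VALUES (?, ?, ?, NULL, 1)", (customer_id, region, as_of))

scd2_upsert(conn, 'C42', 'APAC', '2024-03-01')  # region change: new version
scd2_upsert(conn, 'C42', 'APAC', '2024-06-01')  # same region: no-op
rows = conn.execute(
    "SELECT region, valid_from, valid_to, is_current "
    "FROM dim_customer_history WHERE customer_id = 'C42' ORDER BY customer_sk"
).fetchall()
```

After the change, the same natural key owns two rows with two different surrogate keys, which is exactly why facts must carry the surrogate key.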
The follow-up after SCD Type 2 is always: how do you query "what did this customer look like on March 15"? Answer: join the fact to the dimension on the surrogate key (or join on the natural key WHERE the date falls BETWEEN valid_from AND valid_to). Be ready for both phrasings.
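The second phrasing, the natural-key range join, looks like this as a sqlite3 sketch. The fact_order table and sample rows are invented; the COALESCE turns the NULL valid_to of the current row into a far-future date, and the half-open interval (>= valid_from, < valid_to) is a common variant of BETWEEN that avoids double-matching a row on the boundary date:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer_history (
    customer_sk INTEGER, customer_id TEXT, region TEXT,
    valid_from TEXT, valid_to TEXT);
CREATE TABLE fact_order (order_id INTEGER, customer_id TEXT, order_date TEXT);
INSERT INTO dim_customer_history VALUES
    (1, 'C42', 'EMEA', '2023-01-01', '2024-03-01'),
    (2, 'C42', 'APAC', '2024-03-01', NULL);
INSERT INTO fact_order VALUES (900, 'C42', '2024-03-15');
""")

# "What region was this customer in when the order happened?"
region = conn.execute("""
    SELECT d.region
    FROM fact_order f
    JOIN dim_customer_history d
      ON d.customer_id = f.customer_id
     AND f.order_date >= d.valid_from
     AND f.order_date <  COALESCE(d.valid_to, '9999-12-31')
    WHERE f.order_id = 900
""").fetchone()[0]
```

ISO-formatted date strings compare correctly as text, which is what makes the range predicates work without date casts here.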
Practice schema design against real business problems. Defend your trade-offs. Build the instinct that gets you to L5.
Start the Modeling Mock Interview

Continue your prep

- Window functions, gap-and-island, and the patterns interviewers test in 95% of Data Engineer loops.
- JSON flattening, sessionization, and vanilla-Python data wrangling in the Data Engineer coding round.
- Pipeline architecture, exactly-once semantics, and the framing that gets you to L5.
- STAR-D answers tailored to data engineering, with example responses for impact and conflict.
- What graders look for in a 4 to 8 hour Data Engineer take-home, with a rubric breakdown.
- How to think out loud, handle silence, and avoid the traps that sink fluent coders.

50+ guides covering every round, company, role, and technology in the data engineer interview loop. Grounded in 2,817 verified interview reports across 929 companies, collected from real candidates.