Interview Round Guide

The Data Modeling Round

The data modeling round appears in 30.7% of data engineer interview loops, and 4.7% of all rounds in our dataset are dedicated onsite modeling sessions. The round is a defense, not a design. Interviewers ask you to draw a schema for a real business scenario, then push back on every choice. Pass rate correlates more with how you justify trade-offs than with the schema you draw. This page is one of eight rounds in the complete data engineer interview preparation framework.

The Short Answer
Expect a 45 to 60 minute round. You will draw a schema for a domain (e-commerce, ride-sharing, ad analytics, content platform) on a shared whiteboard or doc. State the grain of every fact table before drawing. Use star schema by default. Justify every denormalization with a query pattern. Be ready to convert dim_customer to SCD Type 2 when the interviewer says "we need to track changes over time". Strong candidates draw 4 tables, talk for 30 minutes about why, and never need to redraw.
Updated April 2026 · By The DataDriven Team

What the Modeling Round Actually Tests

Frequency of concepts that come up in modeling rounds, from 1,042 interview reports.

Concept | Share of Modeling Questions | Common In
Star schema design | 18.4% | Every loop
Grain definition | 16.2% | Every loop, often before drawing
Fact vs dimension classification | 12.1% | Every loop
SCD Type 1 vs Type 2 vs Type 3 | 11.7% | L4+, BI-heavy roles
Surrogate vs natural keys | 8.9% | Every loop
Medallion (bronze/silver/gold) | 6.8% | Databricks, lakehouse roles
Snowflake schema trade-offs | 5.4% | L4+
Normalization to 3NF | 4.9% | OLTP-heavy roles
Conformed dimensions | 4.3% | Multi-team, enterprise roles
Slowly changing facts | 3.6% | Senior
Bridge tables for many-to-many | 3.2% | L4+
Late-arriving dimensions | 2.4% | Senior, streaming-aware
Data Vault 2.0 | 2.1% | Senior, regulated industries

Three Worked Schemas From Real Loops

Each schema below was the answer expected in a real interview. The trade-off discussion is the part graders score, not the diagram itself.

E-commerce

Star schema for orders, products, customers

Grain stated first: one row per order line item. This grain lets you analyze by product without re-aggregation, which is the most common analytics pattern.
fact_order_items
  (order_item_id PK, order_id FK, product_id FK, customer_id FK,
   date_id FK, quantity, unit_price_usd, discount_amount_usd,
   line_total_usd, currency, original_currency)

dim_customer (Type 1 for name/email; Type 2 for region)
  (customer_sk PK, customer_id NK, name, email,
   region_current, signup_date)

dim_customer_history (SCD Type 2 for region/segment)
  (customer_sk PK, customer_id NK, region, segment,
   valid_from, valid_to, is_current)

dim_product (Type 1)
  (product_sk PK, product_id NK, name, category, subcategory,
   brand, list_price_usd)

dim_date (conformed)
  (date_id PK, date, day_of_week, month, quarter, year, is_weekend)

Trade-offs to volunteer:
  1. Order-line grain, not order-level. Lets us analyze product mix.
  2. Discount as $ amount, not %, because $ aggregations are
     additive across products.
  3. Customer split into Type 1 + Type 2 dimensions because email
     changes are not analytically interesting; region changes are.
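
The payoff of the order-line grain and dollar-denominated discounts can be sketched in a few lines. This is a toy illustration with invented data, using sqlite3 and the fact_order_items shape above: product-level revenue and discount roll up with a plain GROUP BY, no re-aggregation.

```python
import sqlite3

# Toy sketch: order-line grain makes product rollups a plain GROUP BY,
# and dollar discounts stay additive across products. Data is invented.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE fact_order_items (
        order_item_id INTEGER PRIMARY KEY,
        order_id INTEGER, product_id INTEGER,
        quantity INTEGER, unit_price_usd REAL,
        discount_amount_usd REAL, line_total_usd REAL)
""")
conn.executemany(
    "INSERT INTO fact_order_items VALUES (?,?,?,?,?,?,?)",
    [
        (1, 100, 1, 2, 10.0, 1.0, 19.0),  # order 100, product 1
        (2, 100, 2, 1, 25.0, 0.0, 25.0),  # order 100, product 2
        (3, 101, 1, 1, 10.0, 2.0, 8.0),   # order 101, product 1
    ],
)
# Product mix: no re-aggregation needed at this grain.
rows = conn.execute("""
    SELECT product_id,
           SUM(line_total_usd)      AS revenue_usd,
           SUM(discount_amount_usd) AS discount_usd
    FROM fact_order_items
    GROUP BY product_id
    ORDER BY product_id
""").fetchall()
print(rows)  # [(1, 27.0, 3.0), (2, 25.0, 0.0)]
```

Had the discount been stored as a percentage, the SUM in the second column would be meaningless across products; that is the trade-off to voice out loud.
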

Ride-sharing

Trips, drivers, riders, surge events

The hard part is modeling surge as either a slowly changing fact (one row per surge event) or a snapshot dimension (per minute, per zone). The right answer is to ask about query patterns first, then choose.
fact_trip
  (trip_id PK, rider_sk FK, driver_sk FK, pickup_geo_sk FK,
   dropoff_geo_sk FK, request_ts, accept_ts, pickup_ts, dropoff_ts,
   distance_miles, duration_seconds, fare_usd, tip_usd, surge_multiplier,
   trip_type, status)

fact_surge_event (slowly changing fact, immutable)
  (surge_id PK, geo_sk FK, multiplier, start_ts, end_ts, reason_code)

dim_driver (Type 2 for vehicle)
  (driver_sk PK, driver_id NK, name, signup_date, vehicle_make,
   vehicle_model, vehicle_year, valid_from, valid_to, is_current)

dim_rider (Type 1)
  (rider_sk PK, rider_id NK, name, signup_date, app_version)

dim_geo (conformed; H3 hex level 9)
  (geo_sk PK, h3_index, city, state, country)

dim_date and dim_time (conformed, minute grain)

Trade-off: surge stored as both a column on fact_trip and as a
separate fact_surge_event. The column captures "what was the
multiplier on this trip", the table captures "when did surges happen".
Two query patterns, two structures.
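
The two-structures argument can be made concrete. A minimal sketch with sqlite3, using the table shapes above with invented rows: one query runs off the fact column, the other off the surge event table, and neither structure can serve both cheaply.

```python
import sqlite3

# Sketch of the two surge query patterns. Table shapes follow the
# schema above (columns trimmed); all rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_trip (
        trip_id INTEGER PRIMARY KEY, pickup_geo_sk INTEGER,
        fare_usd REAL, surge_multiplier REAL);
    CREATE TABLE fact_surge_event (
        surge_id INTEGER PRIMARY KEY, geo_sk INTEGER,
        multiplier REAL, start_ts TEXT, end_ts TEXT);
    INSERT INTO fact_trip VALUES
        (1, 9, 20.0, 1.0), (2, 9, 30.0, 1.5), (3, 9, 45.0, 2.0);
    INSERT INTO fact_surge_event VALUES
        (1, 9, 1.5, '2026-04-01 17:00', '2026-04-01 17:20'),
        (2, 9, 2.0, '2026-04-01 17:20', '2026-04-01 18:00');
""")
# Pattern 1, column on the fact: "how much fare revenue was surged?"
surged = conn.execute(
    "SELECT SUM(fare_usd) FROM fact_trip WHERE surge_multiplier > 1"
).fetchone()[0]
# Pattern 2, separate fact: "how many surge events hit zone 9?"
events = conn.execute(
    "SELECT COUNT(*) FROM fact_surge_event WHERE geo_sk = 9"
).fetchone()[0]
print(surged, events)  # 75.0 2
```
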

Ad analytics

Impressions, clicks, conversions at billions of rows per day

Scale forces decisions: pre-aggregated fact tables for dashboards, raw fact tables for fraud and ML. Document retention windows in the model, not as an afterthought.
fact_impression_raw (kept 30 days, partitioned by date)
  (impression_id PK, ad_sk FK, user_sk FK, placement_sk FK,
   date_id FK, hour_id FK, device_id, ip_hash,
   page_url_hash, served_ts)

fact_impression_hourly (kept 13 months, agg from raw)
  (date_id, hour_id, ad_sk, placement_sk, geo_sk,
   impressions, unique_users_hll, click_count, viewable_count)

fact_click (kept 90 days, partitioned by date)
  (click_id PK, impression_id FK, user_sk FK, ad_sk FK,
   click_ts, click_type, revenue_usd)

fact_conversion (kept 13 months)
  (conversion_id PK, click_id FK, user_sk FK, conversion_ts,
   conversion_type, revenue_usd, attributed_revenue_usd)

dim_ad (Type 2 for creative changes; Type 1 for advertiser)
dim_user (anonymized, no PII)

Decisions to volunteer:
  1. HyperLogLog for unique users in the hourly aggregate. Approximate,
     but the only practical way to roll up unique counts at
     billion-row scale.
  2. Impression IDs kept in both fact_impression_raw and fact_click,
     because attribution joins clicks back to impressions. State the
     join window (typically a 7-day attribution window).
  3. Retention as a model property, not a TTL afterthought.
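
The HyperLogLog decision comes from a property worth demonstrating: exact distinct counts are not additive, so hourly uniques cannot simply be summed into daily uniques. A toy illustration with Python sets and invented user IDs (a real pipeline would use the warehouse engine's HLL functions, not sets):

```python
# Exact distincts do not roll up: summing hourly unique counts
# overcounts users who appear in more than one hour. Data invented.
hourly_users = {
    "17:00": {"u1", "u2", "u3"},
    "18:00": {"u2", "u3", "u4"},
}
sum_of_hourly = sum(len(users) for users in hourly_users.values())
true_daily = len(set().union(*hourly_users.values()))
print(sum_of_hourly, true_daily)  # 6 4 -- the sum overcounts u2 and u3
# An HLL sketch, unlike a count, merges losslessly: union the hourly
# sketches, then estimate, and the daily number is approximately right.
```
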

What Interviewers Watch For

1. Grain stated before drawing

One sentence: 'one row per order line item per day'. This single statement determines whether the rest of your model is right or wrong. Many candidates draw first, then realize halfway through that the grain was wrong.
2. Star schema as the default

Star schema is right for analytics 90% of the time. Snowflake schema only when dimensions are huge and rarely joined together. 3NF only when the workload is OLTP. Naming the default and the exceptions shows judgment.
3. Surrogate keys for Type 2 dimensions

If a dimension changes over time and you keep history, the natural key is no longer unique. Surrogate keys are required for Type 2. Most candidates know this. Few state it without prompting.
4. Trade-offs voiced for every choice

Every denormalization is a trade-off (faster read, slower write, harder updates). Every Type 2 is a trade-off (history vs storage). Naming the trade-off before the interviewer asks is the L5 signal.
5. Late-arriving and out-of-order data

Almost no candidate volunteers this. Saying 'if a click arrives 2 days after the impression, here is how the model handles it' is a near-instant promotion to top of stack.
6. Conformed dimensions across teams

If the company has multiple data marts, dim_customer should be the same schema across all of them. Stating this without prompting tells the interviewer you have worked at scale.

How the Modeling Round Connects to the Rest of the Loop

Every modeling decision re-surfaces in another round. The SCD Type 2 pattern is the schema-side defense of the same instinct you use in the SQL round when you reach for ROW_NUMBER deduplication. The grain-of-the-fact-table question is the same partitioning question you answer in the system design round. The conformed-dimension argument is the same deduplication logic you write in the Python round.

Companies with a warehouse or lakehouse focus push modeling harder. The Snowflake Data Engineer loop spends real time on micro-partition design, and the Databricks Data Engineer loop tests Delta Lake modeling. If you are targeting an analytics-engineer role, expect the analytics engineer interview to lean roughly 70% on modeling, so the depth bar is higher than in a pure Data Engineer loop.

SCD Type 2 Without the Boilerplate

Slowly Changing Dimension Type 2 is the single most asked follow-up in modeling rounds. Most candidates can name it. Fewer can write the merge logic from memory. Practice this pattern until you can produce it without thinking:

When a row in the source changes, expire the current row by setting valid_to and is_current = false, then insert a new row with valid_from = now, valid_to = NULL, is_current = true. The surrogate key changes. The natural key stays. Every fact table joining this dimension must use the surrogate key, not the natural key, or history queries break.

The follow-up after SCD Type 2 is always: how do you query "what did this customer look like on March 15"? Answer: join the fact to the dimension on the surrogate key (or join on the natural key and filter WHERE the date of interest falls BETWEEN valid_from AND valid_to). Be ready for both phrasings.
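
The expire-then-insert step and the point-in-time lookup can be sketched end to end with sqlite3. This is a minimal illustration using the dim_customer_history shape from the e-commerce schema above; the customer, regions, and dates are invented.

```python
import sqlite3

# Minimal SCD Type 2: expire the current row, insert a new version
# with a NEW surrogate key, then answer a point-in-time question.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer_history (
        customer_sk INTEGER PRIMARY KEY,  -- surrogate key, per version
        customer_id INTEGER,              -- natural key, stable
        region TEXT,
        valid_from TEXT, valid_to TEXT, is_current INTEGER)
""")
conn.execute("""INSERT INTO dim_customer_history VALUES
    (1, 42, 'EMEA', '2026-01-01', NULL, 1)""")

# Customer 42 moves to APAC on 2026-04-01: expire the current row...
conn.execute("""
    UPDATE dim_customer_history
    SET valid_to = '2026-04-01', is_current = 0
    WHERE customer_id = 42 AND is_current = 1
""")
# ...then insert the new version under a new surrogate key.
conn.execute("""INSERT INTO dim_customer_history VALUES
    (2, 42, 'APAC', '2026-04-01', NULL, 1)""")

# Point-in-time: what region was customer 42 in on 2026-03-15?
row = conn.execute("""
    SELECT region FROM dim_customer_history
    WHERE customer_id = 42
      AND valid_from <= '2026-03-15'
      AND (valid_to IS NULL OR valid_to > '2026-03-15')
""").fetchone()
print(row[0])  # EMEA
```

Note that the two versions share customer_id but not customer_sk, which is exactly why facts must carry the surrogate key: a fact row loaded in February points at version 1 forever.
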

How to Prepare in Four Weeks

Week 1: Star schema fluency

Design 10 schemas from scratch: e-commerce, ride-sharing, ad tech, video streaming, food delivery, social network, B2B SaaS, IoT sensors, banking, healthcare. State grain first. Write trade-offs.
Week 2: SCD and history

Implement Type 1, Type 2, and Type 3 logic in SQL or Python. Practice the upsert pattern. Practice queries that ask 'what did this look like on date X'.
Week 3: Scale and lakehouse

Medallion architecture (bronze/silver/gold). Partitioning strategies. When to denormalize for read performance. When to use Delta or Iceberg time travel instead of SCD Type 2.
Week 4: Defense rounds

Mock modeling rounds where someone pushes back on every choice. Practice answering 'why not snowflake?', 'why surrogate keys?', 'why this grain?', 'how does this handle late data?'. Defense is the round, not the diagram.

Data Engineer Interview Prep FAQ

Should I use Kimball or Inmon in a data engineer interview?
Kimball (star schema) is the default for analytics. Most companies hiring data engineers in 2026 use a Kimball-style warehouse, even when they call it a lakehouse. Inmon (3NF, atomic data, then build marts) is rare in interviews and almost never the right answer unless the interviewer specifically asks for 3NF.
Do I need to know Data Vault 2.0?
Only for senior roles at financial services, insurance, healthcare, or government. Mention it as 'I have read the Linstedt material; the model is hubs, links, and satellites' if asked. Do not propose Data Vault for a generic e-commerce problem; it is overkill.
How detailed should my schema diagram be?
Tables, columns, types (string, int, decimal, timestamp, geography), primary keys, foreign keys. Do not write CREATE TABLE statements unless asked. Boxes and lines on a whiteboard are fine. The diagram exists to support your trade-off discussion, not to be a deliverable.
What is the difference between a fact table and a dimension table?
Facts are events (what happened, when, with what measures). Dimensions are descriptive context (who, what, where). Facts are tall and skinny (billions of rows, few columns). Dimensions are short and wide (thousands to millions of rows, many descriptive columns). Facts grow forever; dimensions grow with the business.
Should I always use surrogate keys?
Use surrogate keys when the natural key is unstable (changes over time, like email), when you need SCD Type 2 (the surrogate identifies the version), or when the natural key is composite and clumsy. Use natural keys when they are stable, simple, and human-readable, especially for join readability in ad-hoc queries.
How do I handle many-to-many relationships?
A bridge table with two foreign keys, one to each dimension. Add a weighting factor if the relationship is fractional (e.g., a sale attributed to multiple sales reps). This shows up in healthcare (patient-to-diagnosis), retail (product-to-category), and ad tech (impression-to-conversion).
Is medallion architecture the same as star schema?
No. Medallion is a layering strategy (bronze = raw ingest, silver = cleaned and conformed, gold = business-ready aggregates). Star schema is a modeling pattern. Gold layer in a medallion is usually a star schema. Bronze is rarely modeled. Silver is sometimes 3NF or wide tables. Use both terms correctly to signal current-decade fluency.
What if I do not finish drawing the schema?
Get the fact table grain and the main dimensions on the board. Verbalize the rest. The interviewer is grading your reasoning, not your drawing speed. A clear partial diagram with a clear verbal completion beats a rushed full diagram.
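
The bridge-table answer above is easy to demonstrate. A hypothetical sketch with sqlite3: a sale credited fractionally to two sales reps, with weights that sum to 1 so the fact measure stays additive after the join. Table and column names are invented for illustration.

```python
import sqlite3

# Bridge table for a many-to-many with a weighting factor: one sale,
# two reps, weights sum to 1. Names and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_sale (sale_id INTEGER PRIMARY KEY, amount_usd REAL);
    CREATE TABLE bridge_sale_rep (
        sale_id INTEGER, rep_id INTEGER, weight REAL);
    INSERT INTO fact_sale VALUES (1, 1000.0);
    INSERT INTO bridge_sale_rep VALUES (1, 7, 0.6), (1, 8, 0.4);
""")
# Weighted credit per rep; total credited equals total sales because
# the weights for each sale sum to 1.
rows = conn.execute("""
    SELECT b.rep_id, SUM(f.amount_usd * b.weight) AS credited_usd
    FROM fact_sale f JOIN bridge_sale_rep b USING (sale_id)
    GROUP BY b.rep_id ORDER BY b.rep_id
""").fetchall()
print(rows)  # [(7, 600.0), (8, 400.0)]
```
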

Pass the Modeling Round in 4 Weeks

Practice schema design against real business problems. Defend your trade-offs. Build the instinct that gets you to L5.

Start the Modeling Mock Interview

More Data Engineer Interview Prep Guides

Data Engineer Interview Prep: explore the full guide

50+ guides covering every round, company, role, and technology in the data engineer interview loop. Grounded in 2,817 verified interview reports across 929 companies, collected from real candidates.
