Data Modeling Interview Questions

Data Modeling Interview Questions

Whiteboard data modeling interview problems with rubric-scored verdicts for data engineer prep.

Schema design problems pulled from data engineer and analytics engineer interview loops. Each problem ships with a worked rubric covering grain choice, SCD type, dimension shape, and the trade-off questions interviewers ask next. Star, snowflake, data vault, medallion, conformed dimensions, bridge tables, late-arriving dimensions.

The data modeling round shows up on 55 percent of data engineer interview loops and 80 percent of analytics engineer loops. Format is a 45-minute whiteboard or canvas exercise. The interviewer hands you a domain (a marketplace, a rideshare app, a payments ledger, a content platform, an ad-tech impression-conversion pipeline) and one or two analytical questions to support. You are expected to pick the grain in one sentence, choose dimension shapes including SCD type with justification, defend why star schema instead of snowflake or one-big-table, and articulate the trade-offs the interviewer asks about next.

Eight design patterns appear in the catalog. Star schema with conformed dimensions: one row per order-line-item or per-trip or per-impression as the fact grain, with dim_customer, dim_product, dim_date as conformed dimensions shared across multiple facts. Snowflake schema: normalized dimensions with their own dimensional tables, rarely the right answer in modern columnar warehouses because the join cost outweighs the storage savings. Data vault 2.0: hubs for business keys, links for relationships, satellites for descriptive attributes with full versioning, the case for it is mostly governance in regulated industries. Medallion architecture: bronze raw immutable, silver cleaned and conformed, gold business-ready aggregates with 3x storage as the trade-off for replay safety. SCD types: Type 1 overwrite, Type 2 row-per-version with effective_from and effective_to and is_current, Type 3 current-and-previous columns side by side. Bridge tables for many-to-many: patient-diagnosis in healthcare, product-category in retail, impression-conversion in ad tech. Late-arriving dimensions: placeholder row with is_late=true, update when real row arrives. Slowly-changing fact tables for corrections: append-only with version column versus in-place update with audit log.

The grain question is the most-failed part of this round. The data engineer candidate who starts drawing before stating the grain (even a strong candidate) almost always builds a fact table that mixes grains and then has to throw it out. Fix is mechanical: say "the fact table is one row per X" out loud, write it on the whiteboard, then design the rest from there. Senior data engineer rubrics also weight conformed dimensions (the same dim_customer schema across multiple facts so analysts can join across them) and the trade-off articulation between star, snowflake, OBT, and vault for the specific domain the interviewer chose.

Companies represented in this catalog through tagged interview reports: Meta (ads attribution model with SCD2 advertiser dimension), Amazon (Redshift-flavored star schemas with explicit DISTKEY and SORTKEY decisions), Stripe (financial-data SCD2 merchant dimension with time-varying attributes), Netflix (Iceberg-flavored medallion architecture with structured streaming feeding gold), Airbnb (search and booking funnel star schema), Uber (rideshare trip-grain fact with H3 location dim), Snowflake and Databricks (warehouse-vendor neutrality testing across patterns). The data engineer who can defend grain choice and SCD type across three of these domains usually clears the modeling round at any of them.

What is the first thing a data engineer should say in a data modeling whiteboard round?
The grain of the fact table, in one sentence: 'One row per order line item' or 'one row per trip' or 'one row per impression'. Say it out loud before drawing anything. The grain determines which dimensions you need, which measures are additive, and where SCDs come into play. Starting to draw without stating the grain is the most common L4-to-L5 failure mode for data engineer candidates.
When does a data engineer use Type 1 versus Type 2 versus Type 3 SCD?
Type 1 (overwrite) when history does not matter and you only need the current value: useful for typos and corrections. Type 2 (new row per version with effective_from, effective_to, is_current) when downstream queries need point-in-time correctness: the default for customer address, employee department, product category. Type 3 (current and previous columns) when there are only two states that matter and joins should stay simple: rarely the right answer in 2026 data engineer interviews.
Star schema or snowflake schema?
Star, almost always. Snowflake normalizes dimensions into their own dimensional tables to reduce storage; the trade-off is more joins per query. On columnar storage (Snowflake, BigQuery, Redshift), dimension tables are usually small enough to broadcast, and the join cost dominates the storage saving. Star wins unless the dimension is genuinely large enough to matter (millions of rows with high-cardinality attributes), in which case snowflake the specific dimension that is too big, not the whole model.
When does data vault beat Kimball star schema?
Data vault splits the model into hubs (business keys), links (relationships), and satellites (descriptive attributes versioned over time). The case for it is mostly governance: multiple ingestion teams loading independently, full audit lineage, sources that change schema frequently, regulated industries like banking and pharma. The tradeoff is that nobody wants to query it directly: every business question becomes a 6-table join, so teams build a star-shaped business vault on top. In a data engineer interview, the right answer is usually 'star, unless the company has a real EDW org and a compliance reason to track every change'.
How does a data engineer model a many-to-many relationship?
Bridge table with two FKs, one to each dimension. Optional weighting factor for fractional attribution (the impression-to-conversion case where one conversion is attributed across multiple touchpoints). Common in healthcare (patient-diagnosis), retail (product-category), ad tech (impression-conversion). Without a bridge, a JOIN inflates row counts and silently corrupts SUM and COUNT: the many-to-many duplication trap.
What is the medallion architecture?
Three layers. Bronze: raw, append-only, schema-on-read, ideally still in source format. Silver: cleaned, typed, deduplicated, with bad rows quarantined. Gold: business-ready and modeled, usually as star schemas. Each layer has its own ownership and quality contract. The point: a bug in silver does not require re-ingesting from the source. Default on lakehouses like Databricks Delta and Apache Iceberg. Not an alternative to star schemas; it is where star schemas live.
How does a data engineer handle late-arriving dimensions?
Insert a placeholder dim row with surrogate key and is_late=true; mark known fields as NULL and unknown_flag=true. Update when the real row arrives (replace the placeholder, or insert a new SCD Type 2 row depending on whether is_late changes the version). Alternative is to backfill facts after dim arrives, which works for small fact tables and breaks at scale. Most candidates skip this question; raising it unprompted is a senior data engineer signal.
How does a data engineer model a slowly changing fact (corrections)?
Two patterns. Append-only with version column: every correction is a new row with version+1; the latest version is the truth. Audit-friendly but query needs filter on max version per natural key. In-place update with audit log table: query is simple but you need a separate audit table for compliance. Trade-off: query simplicity vs storage cost vs auditability. Mention both, pick based on the domain. Financial corrections almost always need audit; product catalog corrections usually do not.

62 practice problems matching this filter. Difficulty: medium (35), easy (8), hard (19).

Data Modeling (62)