Data Modeling Interview Questions
Data Modeling Interview Questions
Whiteboard data modeling interview problems with rubric-scored verdicts for data engineer prep.
Schema design problems pulled from data engineer and analytics engineer interview loops. Each problem ships with a worked rubric covering grain choice, SCD type, dimension shape, and the trade-off questions interviewers ask next. Star, snowflake, data vault, medallion, conformed dimensions, bridge tables, late-arriving dimensions.
The data modeling round shows up on 55 percent of data engineer interview loops and 80 percent of analytics engineer loops. Format is a 45-minute whiteboard or canvas exercise. The interviewer hands you a domain (a marketplace, a rideshare app, a payments ledger, a content platform, an ad-tech impression-conversion pipeline) and one or two analytical questions to support. You are expected to pick the grain in one sentence, choose dimension shapes including SCD type with justification, defend why star schema instead of snowflake or one-big-table, and articulate the trade-offs the interviewer asks about next.
Eight design patterns appear in the catalog. Star schema with conformed dimensions: one row per order-line-item or per-trip or per-impression as the fact grain, with dim_customer, dim_product, dim_date as conformed dimensions shared across multiple facts. Snowflake schema: normalized dimensions with their own dimensional tables, rarely the right answer in modern columnar warehouses because the join cost outweighs the storage savings. Data vault 2.0: hubs for business keys, links for relationships, satellites for descriptive attributes with full versioning, the case for it is mostly governance in regulated industries. Medallion architecture: bronze raw immutable, silver cleaned and conformed, gold business-ready aggregates with 3x storage as the trade-off for replay safety. SCD types: Type 1 overwrite, Type 2 row-per-version with effective_from and effective_to and is_current, Type 3 current-and-previous columns side by side. Bridge tables for many-to-many: patient-diagnosis in healthcare, product-category in retail, impression-conversion in ad tech. Late-arriving dimensions: placeholder row with is_late=true, update when real row arrives. Slowly-changing fact tables for corrections: append-only with version column versus in-place update with audit log.
The grain question is the most-failed part of this round. The data engineer candidate who starts drawing before stating the grain (even a strong candidate) almost always builds a fact table that mixes grains and then has to throw it out. Fix is mechanical: say "the fact table is one row per X" out loud, write it on the whiteboard, then design the rest from there. Senior data engineer rubrics also weight conformed dimensions (the same dim_customer schema across multiple facts so analysts can join across them) and the trade-off articulation between star, snowflake, OBT, and vault for the specific domain the interviewer chose.
Companies represented in this catalog through tagged interview reports: Meta (ads attribution model with SCD2 advertiser dimension), Amazon (Redshift-flavored star schemas with explicit DISTKEY and SORTKEY decisions), Stripe (financial-data SCD2 merchant dimension with time-varying attributes), Netflix (Iceberg-flavored medallion architecture with structured streaming feeding gold), Airbnb (search and booking funnel star schema), Uber (rideshare trip-grain fact with H3 location dim), Snowflake and Databricks (warehouse-vendor neutrality testing across patterns). The data engineer who can defend grain choice and SCD type across three of these domains usually clears the modeling round at any of them.
- What is the first thing a data engineer should say in a data modeling whiteboard round?
- The grain of the fact table, in one sentence: 'One row per order line item' or 'one row per trip' or 'one row per impression'. Say it out loud before drawing anything. The grain determines which dimensions you need, which measures are additive, and where SCDs come into play. Starting to draw without stating the grain is the most common L4-to-L5 failure mode for data engineer candidates.
- When does a data engineer use Type 1 versus Type 2 versus Type 3 SCD?
- Type 1 (overwrite) when history does not matter and you only need the current value: useful for typos and corrections. Type 2 (new row per version with effective_from, effective_to, is_current) when downstream queries need point-in-time correctness: the default for customer address, employee department, product category. Type 3 (current and previous columns) when there are only two states that matter and joins should stay simple: rarely the right answer in 2026 data engineer interviews.
- Star schema or snowflake schema?
- Star, almost always. Snowflake normalizes dimensions into their own dimensional tables to reduce storage; the trade-off is more joins per query. On columnar storage (Snowflake, BigQuery, Redshift), dimension tables are usually small enough to broadcast, and the join cost dominates the storage saving. Star wins unless the dimension is genuinely large enough to matter (millions of rows with high-cardinality attributes), in which case snowflake the specific dimension that is too big, not the whole model.
- When does data vault beat Kimball star schema?
- Data vault splits the model into hubs (business keys), links (relationships), and satellites (descriptive attributes versioned over time). The case for it is mostly governance: multiple ingestion teams loading independently, full audit lineage, sources that change schema frequently, regulated industries like banking and pharma. The tradeoff is that nobody wants to query it directly: every business question becomes a 6-table join, so teams build a star-shaped business vault on top. In a data engineer interview, the right answer is usually 'star, unless the company has a real EDW org and a compliance reason to track every change'.
- How does a data engineer model a many-to-many relationship?
- Bridge table with two FKs, one to each dimension. Optional weighting factor for fractional attribution (the impression-to-conversion case where one conversion is attributed across multiple touchpoints). Common in healthcare (patient-diagnosis), retail (product-category), ad tech (impression-conversion). Without a bridge, a JOIN inflates row counts and silently corrupts SUM and COUNT: the many-to-many duplication trap.
- What is the medallion architecture?
- Three layers. Bronze: raw, append-only, schema-on-read, ideally still in source format. Silver: cleaned, typed, deduplicated, with bad rows quarantined. Gold: business-ready and modeled, usually as star schemas. Each layer has its own ownership and quality contract. The point: a bug in silver does not require re-ingesting from the source. Default on lakehouses like Databricks Delta and Apache Iceberg. Not an alternative to star schemas; it is where star schemas live.
- How does a data engineer handle late-arriving dimensions?
- Insert a placeholder dim row with surrogate key and is_late=true; mark known fields as NULL and unknown_flag=true. Update when the real row arrives (replace the placeholder, or insert a new SCD Type 2 row depending on whether is_late changes the version). Alternative is to backfill facts after dim arrives, which works for small fact tables and breaks at scale. Most candidates skip this question; raising it unprompted is a senior data engineer signal.
- How does a data engineer model a slowly changing fact (corrections)?
- Two patterns. Append-only with version column: every correction is a new row with version+1; the latest version is the truth. Audit-friendly but query needs filter on max version per natural key. In-place update with audit log table: query is simple but you need a separate audit table for compliance. Trade-off: query simplicity vs storage cost vs auditability. Mention both, pick based on the domain. Financial corrections almost always need audit; product catalog corrections usually do not.
62 practice problems matching this filter. Difficulty: medium (35), easy (8), hard (19).
Data Modeling (62)
- Split Decision - medium - One user, one experiment, one variant. No exceptions.
- Where They Used to Live - medium - They moved. The data stayed behind.
- Airline Flight Operations Schema - medium - Flights, passengers, and routes. Before you draw a single table, tell me the grain.
- A Number for the Seller - easy - They want a total. Give them the right schema first.
- B2B Invoicing Data Model - easy - Invoices go out, partial payments trickle in, and some customers are three months overdue.
- Clickstream and Session Schema - medium - Millions of clicks, mostly anonymous.
- Cloud File Storage Metadata Schema - hard - A file is also a folder. A folder is also a file.
- Content Engagement Data Model - hard - Post published. Now measure everything that happens next.
- Content Search and Discovery Schema - hard - Searchable from every angle. Design it so nothing gets lost.
- Customer Address History - easy - People move. Sometimes twice in a month. How do you remember where everyone was, and when?
- E-Commerce Supply Chain Tracking - hard - A package splits, reroutes, and (maybe) arrives.
- EdTech Classroom Engagement Schema - medium - They opened the assignment. Did they actually read it?
- Employee Application Time Tracking - medium - Every minute tracked. Every app accounted for.
- Employee Transfer Tracking System - medium - People switch teams. HR loses track.
- Event Ticketing System Data Model - easy - JSON in. Reporting warehouse out. Design both ends.
- Held to Account - medium
- Financial Trading Warehouse - hard - Every trade, every tick, every fraction of a share. The regulators want receipts.
- Fitness App Data Model - easy - Reps, sets, streaks, and personal bests. Gym rats love their stats.
- The No-Show - easy - Every reserved seat ends one of five ways. Build the model that can tell them apart.
- Food Truck Operations Data Model - medium - Mobile vendor, fixed menu, unpredictable locations.
- Housing Marketplace Analytics - medium - Sellers want buyers. Buyers want deals.
- Insurance Claims Lifecycle - hard - A claim gets filed. Then it gets complicated. Then it gets reassigned. Then it loops back.
- Livestream Analytics Schema - medium - Someone goes live, thousands tune in, chat explodes, and virtual gifts start flying.
- Loan Application Reporting Schema - medium - Approved, declined, or pending. Design the tables that say so.
- The Balance Always Reconciles - easy - Money out, payments back. The balance has to be exact.
- Log Parsing Pipeline Schema - medium - Raw text files, terabytes of them, full of buried signals and cryptic error codes.
- The Shape of a Run - medium - Two log lines bracket every process. Pair them and the fleet's rhythm appears.
- Marketplace Sales Warehouse - hard - No schema given. The interviewer is watching.
- Metric Definition Reverse Engineering - hard - Five numbers on a dashboard. Your job: figure out where they come from.
- Movie Streaming Analytics Schema - medium - They pressed play. What happened next is the whole question.
- Multiplayer Game Match History - medium - Millions of matches. The leaderboard refreshes in fifteen minutes.
- Online Marketplace - Seller Payouts - hard - The buyer paid one number. The seller got a different one.
- The Retail Blueprint - medium - One business. A thousand transactions. Only one layout survives the analytics layer.
- The Last Mile - medium - Order placed. Now track it to the door.
- POS Sales Data Warehouse - medium - Every beep at the register. Coupons, returns, all of it.
- Property Booking Platform - hard - Five-star listing. Three-star reality.
- Retailer Data Warehouse Design - medium - Queries are crawling. The analysts are not happy.
- Ride-Sharing Platform Schema - medium - Riders, drivers, and fares. Everyone takes a cut.
- The Sales Architecture - medium - Numbers are easy. Making them queryable at scale is the real job.
- The Customer Who Changed - hard - She moved. She upgraded. She became someone new. The record has to keep up.
- Social Platform Data Model - medium - Follows, likes, replies to replies. It never stops.
- Two Wallets - medium - Two user types. Multiple payment methods. One messy billing table.
- Subscription Churn Analysis Model - medium - Subscribers are leaving. The data knows why.
- The Heat of the Map - hard
- Telecom Network Connectivity Warehouse - hard - One device goes down. The ripple keeps going.
- The Celebrity Problem - medium - One post. A million notifications. Something has to give.
- The Churner Who Came Back - hard - They cancelled. They came back. The report has to tell both stories correctly.
- The Handoff - hard
- The JSON Files That Became a Data Mart - medium - Three semi-structured inputs. One queryable warehouse.
- The League With Too Many Loyalties - hard - A player can belong to many teams. The schema must agree.
- The Plan That Changed Twice This Month - medium - Subscribers come, go, downgrade, and share. The schema has to keep up.
- The Retail Tables That Need a New Home - medium - A working system. Now redesign it so the analysts can actually use it.
- The Schema That Could Not Answer Back - hard - Forty columns in. Zero useful answers out.
- The Table That Lies - medium - Every query comes out wrong. The data is all there.
- The Talent Funnel - medium - Thousands applied. One accepted. Where did the rest go?
- The Territory That Keeps Moving - hard - Reps get reassigned. The receipts have to survive.
- The Transfer Request - medium - Apply, wait, get approved or denied. Track all of it.
- Three-Sided Marketplace Delivery Schema - hard - One order. Two deliveries. Revenue counted twice. Where is the bug in your schema?
- Toll Road Sensor Analytics - easy - Cars enter, cars exit. Except when they don't.
- Trending Dishes Dashboard - medium - What's everyone eating? The answer changes hourly.
- What the Clock Remembers - medium
- What the Script Remembers - medium