Data Modeling Interview Prep
Data Modeling Interview Prep
Prep guide for the data modeling round of a data engineer interview loop.
Data modeling interview prep for data engineer roles. Start with grain selection (one row per X), then dim_customer with Type 2 SCD, then conformed dimensions across multiple facts, then defend star over snowflake. Whiteboard practice problems with rubric-scored verdicts for the 55 percent of data engineer loops that include a modeling round.
Data modeling interview prep for data engineer roles centers on five skills. First, grain identification: pick the right unit of analysis (one row per order line item, one row per trip, one row per impression) and state it before drawing. Second, dimension design: choose Type 1 versus Type 2 versus Type 3 SCD per attribute, justify in one sentence per dimension. Third, conformed dimensions: build dim_customer with one schema shared across the orders fact, the returns fact, the support ticket fact. Fourth, fact additivity: identify additive (revenue, quantity), semi-additive (account balance, inventory), and non-additive (ratio, percentage) measures because they affect what aggregations are valid. Fifth, the bridge table for many-to-many.
Drill order for a data engineer with four weeks until an onsite. Weeks 1-2: design 6 star schemas on real domains (e-commerce, marketplace, rideshare, payments ledger, ad tech, content platform). For each, state the grain in one sentence, draw the fact table with FKs and measures, add 3-4 dimensions with SCD type per attribute, defend star versus snowflake versus OBT. Week 3: drill SCD merge logic. The Type 2 merge requires identifying changed rows (anti-join from staging to current dim), expiring the matched current rows (set valid_to to now, is_current to false), and inserting the new rows (valid_from now, is_current true). Practice this in SQL and in pandas. Week 4: drill the failure modes interviewers test. Mixed-grain fact tables, missing conformed dimension, the bridge table forgotten on a many-to-many, late-arriving dimension without a placeholder strategy.
Modeling rubrics in 2026 data engineer interviews score on five dimensions. SLA match: does the proposed model support the analytical workload's freshness requirement (real-time, micro-batch, daily, weekly). Grain correctness: is the fact grain one row per X stated explicitly and consistent across measures. Dimension design: are the SCD types correct, are conformed dimensions used where appropriate. Trade-off articulation: does the candidate defend star versus snowflake versus OBT in the specific domain. Adapt-on-fly: when the interviewer changes a requirement (real-time freshness instead of daily, multi-region instead of single-region), does the candidate modify the existing model in place or throw it out and restart.
Companies whose data modeling rounds appear most often in interview reports: Stripe (financial-data SCD2 merchant dimension, idempotent aggregations), Netflix (Iceberg medallion with streaming feeding gold), Meta (ads attribution with multi-touch SCD2 advertiser), Amazon (Redshift DISTKEY/SORTKEY-aware star schemas), Airbnb (search-funnel star with SCD2 listing dimension), Uber (trip-grain fact with H3 location). The data engineer who has practiced the modeling round across three of these domains usually clears the modeling round at any of them.
- What percentage of data engineer interviews include a data modeling round?
- Roughly 55 percent of data engineer loops include an explicit data modeling whiteboard round. The share rises with seniority: nearly all L5+ data engineer loops include modeling, and analytics engineer loops include it at 80 percent. The format is usually 45 minutes on a whiteboard or canvas with a real-world domain (marketplace, rideshare, payments, ad tech).
- How should a data engineer prep for the modeling round in 4 weeks?
- Weeks 1-2: design 6 star schemas on real domains, state grain first, defend star vs snowflake vs OBT. Week 3: drill SCD Type 2 merge logic in SQL and pandas. Week 4: drill failure modes (mixed grain, missing conformed dimension, forgotten bridge table, late-arriving dim). Two timed mocks with someone in the final week.
- What is the most common failure mode in the modeling round?
- Starting to draw before stating the grain. The candidate who jumps to dim_customer and dim_product before saying 'one row per order line item' almost always builds a fact table that mixes grains and then has to throw it out. Fix is mechanical: say the grain out loud before drawing.
- Do interviewers expect specific warehouse vendor knowledge?
- Depends on the company. Amazon expects Redshift DISTKEY and SORTKEY decisions. Snowflake-and-Databricks expect QUALIFY and MERGE INTO patterns. Google expects BigQuery partitioning and clustering. Most other companies stay vendor-neutral in the modeling round and test pure Kimball-style design. Mention vendor-specific optimizations when relevant; do not force them where the question is neutral.
- What trade-offs does a senior data engineer modeling round expect?
- Star versus snowflake versus OBT in the specific domain. SCD Type 2 versus Type 1 per attribute. Conformed dimensions across facts versus per-fact denormalization. Bridge table versus pre-aggregation. Real-time freshness versus daily batch trade-off in the design that surrounds the model. The L5 signal is naming two alternatives and defending the choice; the L4 signal is producing a working model.
- How does grain selection affect the rest of the model?
- The grain determines which dimensions are needed (only the dimensions that vary at the chosen grain), which measures are additive (sum to higher levels) versus semi-additive (account balance: sums over entities, averages over time) versus non-additive (ratios, percentages, distinct counts), and where SCDs come into play (only dimensions where the grain's attributes change over time). Wrong grain at the start propagates errors through every subsequent decision.
- What is a conformed dimension and why does it matter?
- A conformed dimension is a dimension table with one schema and identity used across multiple fact tables. dim_customer with the same columns, surrogate keys, and SCD semantics shared by the orders fact, the returns fact, and the support ticket fact. Without conformed dimensions, analysts cannot join across facts without explicit translation. Senior data engineer rubrics weight this; junior rubrics rarely test it.
62 practice problems matching this filter. Difficulty: medium (35), easy (8), hard (19).
Data Modeling (62)
- Split Decision - medium - One user, one experiment, one variant. No exceptions.
- Where They Used to Live - medium - They moved. The data stayed behind.
- Airline Flight Operations Schema - medium - Flights, passengers, and routes. Before you draw a single table, tell me the grain.
- A Number for the Seller - easy - They want a total. Give them the right schema first.
- B2B Invoicing Data Model - easy - Invoices go out, partial payments trickle in, and some customers are three months overdue.
- Clickstream and Session Schema - medium - Millions of clicks, mostly anonymous.
- Cloud File Storage Metadata Schema - hard - A file is also a folder. A folder is also a file.
- Content Engagement Data Model - hard - Post published. Now measure everything that happens next.
- Content Search and Discovery Schema - hard - Searchable from every angle. Design it so nothing gets lost.
- Customer Address History - easy - People move. Sometimes twice in a month. How do you remember where everyone was, and when?
- E-Commerce Supply Chain Tracking - hard - A package splits, reroutes, and (maybe) arrives.
- EdTech Classroom Engagement Schema - medium - They opened the assignment. Did they actually read it?
- Employee Application Time Tracking - medium - Every minute tracked. Every app accounted for.
- Employee Transfer Tracking System - medium - People switch teams. HR loses track.
- Event Ticketing System Data Model - easy - JSON in. Reporting warehouse out. Design both ends.
- Held to Account - medium
- Financial Trading Warehouse - hard - Every trade, every tick, every fraction of a share. The regulators want receipts.
- Fitness App Data Model - easy - Reps, sets, streaks, and personal bests. Gym rats love their stats.
- The No-Show - easy - Every reserved seat ends one of five ways. Build the model that can tell them apart.
- Food Truck Operations Data Model - medium - Mobile vendor, fixed menu, unpredictable locations.
- Housing Marketplace Analytics - medium - Sellers want buyers. Buyers want deals.
- Insurance Claims Lifecycle - hard - A claim gets filed. Then it gets complicated. Then it gets reassigned. Then it loops back.
- Livestream Analytics Schema - medium - Someone goes live, thousands tune in, chat explodes, and virtual gifts start flying.
- Loan Application Reporting Schema - medium - Approved, declined, or pending. Design the tables that say so.
- The Balance Always Reconciles - easy - Money out, payments back. The balance has to be exact.
- Log Parsing Pipeline Schema - medium - Raw text files, terabytes of them, full of buried signals and cryptic error codes.
- The Shape of a Run - medium - Two log lines bracket every process. Pair them and the fleet's rhythm appears.
- Marketplace Sales Warehouse - hard - No schema given. The interviewer is watching.
- Metric Definition Reverse Engineering - hard - Five numbers on a dashboard. Your job: figure out where they come from.
- Movie Streaming Analytics Schema - medium - They pressed play. What happened next is the whole question.
- Multiplayer Game Match History - medium - Millions of matches. The leaderboard refreshes in fifteen minutes.
- Online Marketplace - Seller Payouts - hard - The buyer paid one number. The seller got a different one.
- The Retail Blueprint - medium - One business. A thousand transactions. Only one layout survives the analytics layer.
- The Last Mile - medium - Order placed. Now track it to the door.
- POS Sales Data Warehouse - medium - Every beep at the register. Coupons, returns, all of it.
- Property Booking Platform - hard - Five-star listing. Three-star reality.
- Retailer Data Warehouse Design - medium - Queries are crawling. The analysts are not happy.
- Ride-Sharing Platform Schema - medium - Riders, drivers, and fares. Everyone takes a cut.
- The Sales Architecture - medium - Numbers are easy. Making them queryable at scale is the real job.
- The Customer Who Changed - hard - She moved. She upgraded. She became someone new. The record has to keep up.
- Social Platform Data Model - medium - Follows, likes, replies to replies. It never stops.
- Two Wallets - medium - Two user types. Multiple payment methods. One messy billing table.
- Subscription Churn Analysis Model - medium - Subscribers are leaving. The data knows why.
- The Heat of the Map - hard
- Telecom Network Connectivity Warehouse - hard - One device goes down. The ripple keeps going.
- The Celebrity Problem - medium - One post. A million notifications. Something has to give.
- The Churner Who Came Back - hard - They cancelled. They came back. The report has to tell both stories correctly.
- The Handoff - hard
- The JSON Files That Became a Data Mart - medium - Three semi-structured inputs. One queryable warehouse.
- The League With Too Many Loyalties - hard - A player can belong to many teams. The schema must agree.
- The Plan That Changed Twice This Month - medium - Subscribers come, go, downgrade, and share. The schema has to keep up.
- The Retail Tables That Need a New Home - medium - A working system. Now redesign it so the analysts can actually use it.
- The Schema That Could Not Answer Back - hard - Forty columns in. Zero useful answers out.
- The Table That Lies - medium - Every query comes out wrong. The data is all there.
- The Talent Funnel - medium - Thousands applied. One accepted. Where did the rest go?
- The Territory That Keeps Moving - hard - Reps get reassigned. The receipts have to survive.
- The Transfer Request - medium - Apply, wait, get approved or denied. Track all of it.
- Three-Sided Marketplace Delivery Schema - hard - One order. Two deliveries. Revenue counted twice. Where is the bug in your schema?
- Toll Road Sensor Analytics - easy - Cars enter, cars exit. Except when they don't.
- Trending Dishes Dashboard - medium - What's everyone eating? The answer changes hourly.
- What the Clock Remembers - medium
- What the Script Remembers - medium