Star Schema Interview Questions
Star Schema Interview Questions
Star schema interview problems for data engineer interview prep.
Star schema interview questions isolated from the data modeling catalog. One row per X grain selection, conformed dimensions shared across multiple facts, additive versus semi-additive measures, and the trade-off arguments for star over snowflake or one-big-table in modern columnar warehouses.
Star schema is the default answer in most data engineer modeling rounds. The structure is one fact table surrounded by dimension tables, each dimension joined to the fact via a foreign key. The fact holds numeric measures (quantity, unit_price, total_amount, duration_seconds) plus FKs; the dimensions hold descriptive attributes (customer name, product category, store address, date attributes). In a columnar warehouse like Snowflake, BigQuery, Redshift, or Databricks Delta, the optimizer broadcasts the small dimensions and the join cost is negligible compared to the storage and clarity benefits.
Star schema interview questions for data engineer roles cluster into six sub-patterns. Grain selection: one row per order line item versus one row per order versus one row per customer-day. The grain choice determines what measures are additive and what dimensions are needed. Conformed dimensions: dim_customer used across the orders fact, the returns fact, the support ticket fact with one schema and one set of surrogate keys. The benefit is that analysts can join across facts without explicit translation; the cost is upfront design effort. Additive versus semi-additive versus non-additive measures: revenue is additive (sums across customer, product, date), account balance is semi-additive (sums across customer but averages across date), ratios and percentages are non-additive (must be computed from raw counts). Surrogate keys versus natural keys: surrogate when the natural key is unstable, when SCD Type 2 is needed, or when the natural key is composite. SCD type per dimension attribute: Type 1 for typo correction, Type 2 for history retention. The bridge table for many-to-many: a product-category bridge avoids the Cartesian explosion when joining the orders fact to a category dim.
Star schema versus snowflake schema is a recurring interview question. Snowflake normalizes dimensions into their own dimensional tables to reduce storage. The trade-off is more joins per query and harder analyst SQL. In a 2026 columnar warehouse, dimensions are usually small enough to broadcast in 5-50ms, so the join cost is negligible. Star wins unless one specific dimension is genuinely too large to broadcast (millions of rows with high-cardinality attributes), in which case the data engineer snowflakes that one dimension, not the whole model.
Star schema versus one-big-table (OBT) is the newer trade-off question in 2026 data engineer interviews. OBT flattens all dimensions into the fact table, producing a wide denormalized table. The benefit is zero joins at query time and simple SQL. The cost is data redundancy (one customer's name repeated on every order row, increasing storage 10x or more), no conformed dimensions across facts (each fact-OBT has its own embedded customer attributes), and slow ingestion (changing a customer name requires updating every fact row). OBT can win for a single analytical workload at a small company where simplicity outweighs cost; star wins everywhere else.
- Why does a data engineer pick star schema over snowflake schema?
- On a 2026 columnar warehouse (Snowflake, BigQuery, Redshift, Databricks Delta), the optimizer broadcasts small dimension tables and the join cost is negligible. Snowflake's storage savings from normalization rarely outweigh the query complexity and join cost. Star wins unless a dimension is genuinely too large to broadcast (millions of rows with high-cardinality attributes), in which case snowflake that one dim, not the whole model.
- What is a conformed dimension?
- A conformed dimension is a dimension table with one schema and one set of surrogate keys, used by multiple fact tables. dim_customer with the same columns and identity in the orders fact, the returns fact, and the support ticket fact. The benefit is that analysts can join across facts without explicit translation. Senior data engineer modeling rubrics explicitly weight conformed dimensions; junior rubrics often skip the question.
- When does one-big-table (OBT) beat a star schema?
- Rarely, in 2026. OBT can win for a single analytical workload at a small company where query simplicity outweighs storage and update cost. Star wins everywhere there are multiple facts, multiple analytical workloads, or any need for conformed dimensions. OBT also breaks when a dimension attribute changes (changing a customer name requires updating every fact row). Mention OBT, defend why star is the better choice in the specific domain.
- What is an additive measure?
- A measure that can be summed across all dimensions. Revenue is additive (sums across customer, product, date, region). Quantity is additive. Cost is additive. Semi-additive measures (account balance, inventory level) can be summed across some dimensions but not others (you sum balances across customers but you do not sum a single customer's balance across dates; you take the latest). Non-additive measures (ratios, percentages, distinct counts) must be computed from raw counts at the desired aggregation level.
- Should I use a surrogate key or natural key on dimensions?
- Surrogate key when the natural key is unstable (customer emails change), when SCD Type 2 is needed (the natural key recurs across versions), or when the natural key is composite. Natural key when stable, simple, human-readable for joins, and SCD Type 1 is sufficient. dim_date almost always uses an int surrogate (20260527) that is both surrogate and natural; dim_customer almost always uses a separate surrogate from customer_id.
- How does star schema handle a many-to-many relationship?
- Through a bridge table. If a product can belong to multiple categories, do not join orders directly to a category dim (causes Cartesian explosion). Instead, build a product_category bridge with two FKs (product_id, category_id) and optionally a weighting factor. Join orders to dim_product, then to product_category bridge, then to dim_category. SUM across the bridge requires explicit deduplication or pre-aggregation.
- What is the grain of a star schema fact table?
- The grain is the unit of analysis: one row per X. For an orders star, the grain is usually one row per order line item (not one row per order, which loses line-item detail; not one row per customer, which loses individual orders). State the grain in one sentence before drawing the fact table. Mixed-grain fact tables (some rows at order-level, others at line-item-level) are the failure mode interviewers fish for.
62 practice problems matching this filter. Difficulty: medium (35), easy (8), hard (19).
Data Modeling (62)
- Split Decision - medium - One user, one experiment, one variant. No exceptions.
- Where They Used to Live - medium - They moved. The data stayed behind.
- Airline Flight Operations Schema - medium - Flights, passengers, and routes. Before you draw a single table, tell me the grain.
- A Number for the Seller - easy - They want a total. Give them the right schema first.
- B2B Invoicing Data Model - easy - Invoices go out, partial payments trickle in, and some customers are three months overdue.
- Clickstream and Session Schema - medium - Millions of clicks, mostly anonymous.
- Cloud File Storage Metadata Schema - hard - A file is also a folder. A folder is also a file.
- Content Engagement Data Model - hard - Post published. Now measure everything that happens next.
- Content Search and Discovery Schema - hard - Searchable from every angle. Design it so nothing gets lost.
- Customer Address History - easy - People move. Sometimes twice in a month. How do you remember where everyone was, and when?
- E-Commerce Supply Chain Tracking - hard - A package splits, reroutes, and (maybe) arrives.
- EdTech Classroom Engagement Schema - medium - They opened the assignment. Did they actually read it?
- Employee Application Time Tracking - medium - Every minute tracked. Every app accounted for.
- Employee Transfer Tracking System - medium - People switch teams. HR loses track.
- Event Ticketing System Data Model - easy - JSON in. Reporting warehouse out. Design both ends.
- Held to Account - medium
- Financial Trading Warehouse - hard - Every trade, every tick, every fraction of a share. The regulators want receipts.
- Fitness App Data Model - easy - Reps, sets, streaks, and personal bests. Gym rats love their stats.
- The No-Show - easy - Every reserved seat ends one of five ways. Build the model that can tell them apart.
- Food Truck Operations Data Model - medium - Mobile vendor, fixed menu, unpredictable locations.
- Housing Marketplace Analytics - medium - Sellers want buyers. Buyers want deals.
- Insurance Claims Lifecycle - hard - A claim gets filed. Then it gets complicated. Then it gets reassigned. Then it loops back.
- Livestream Analytics Schema - medium - Someone goes live, thousands tune in, chat explodes, and virtual gifts start flying.
- Loan Application Reporting Schema - medium - Approved, declined, or pending. Design the tables that say so.
- The Balance Always Reconciles - easy - Money out, payments back. The balance has to be exact.
- Log Parsing Pipeline Schema - medium - Raw text files, terabytes of them, full of buried signals and cryptic error codes.
- The Shape of a Run - medium - Two log lines bracket every process. Pair them and the fleet's rhythm appears.
- Marketplace Sales Warehouse - hard - No schema given. The interviewer is watching.
- Metric Definition Reverse Engineering - hard - Five numbers on a dashboard. Your job: figure out where they come from.
- Movie Streaming Analytics Schema - medium - They pressed play. What happened next is the whole question.
- Multiplayer Game Match History - medium - Millions of matches. The leaderboard refreshes in fifteen minutes.
- Online Marketplace - Seller Payouts - hard - The buyer paid one number. The seller got a different one.
- The Retail Blueprint - medium - One business. A thousand transactions. Only one layout survives the analytics layer.
- The Last Mile - medium - Order placed. Now track it to the door.
- POS Sales Data Warehouse - medium - Every beep at the register. Coupons, returns, all of it.
- Property Booking Platform - hard - Five-star listing. Three-star reality.
- Retailer Data Warehouse Design - medium - Queries are crawling. The analysts are not happy.
- Ride-Sharing Platform Schema - medium - Riders, drivers, and fares. Everyone takes a cut.
- The Sales Architecture - medium - Numbers are easy. Making them queryable at scale is the real job.
- The Customer Who Changed - hard - She moved. She upgraded. She became someone new. The record has to keep up.
- Social Platform Data Model - medium - Follows, likes, replies to replies. It never stops.
- Two Wallets - medium - Two user types. Multiple payment methods. One messy billing table.
- Subscription Churn Analysis Model - medium - Subscribers are leaving. The data knows why.
- The Heat of the Map - hard
- Telecom Network Connectivity Warehouse - hard - One device goes down. The ripple keeps going.
- The Celebrity Problem - medium - One post. A million notifications. Something has to give.
- The Churner Who Came Back - hard - They cancelled. They came back. The report has to tell both stories correctly.
- The Handoff - hard
- The JSON Files That Became a Data Mart - medium - Three semi-structured inputs. One queryable warehouse.
- The League With Too Many Loyalties - hard - A player can belong to many teams. The schema must agree.
- The Plan That Changed Twice This Month - medium - Subscribers come, go, downgrade, and share. The schema has to keep up.
- The Retail Tables That Need a New Home - medium - A working system. Now redesign it so the analysts can actually use it.
- The Schema That Could Not Answer Back - hard - Forty columns in. Zero useful answers out.
- The Table That Lies - medium - Every query comes out wrong. The data is all there.
- The Talent Funnel - medium - Thousands applied. One accepted. Where did the rest go?
- The Territory That Keeps Moving - hard - Reps get reassigned. The receipts have to survive.
- The Transfer Request - medium - Apply, wait, get approved or denied. Track all of it.
- Three-Sided Marketplace Delivery Schema - hard - One order. Two deliveries. Revenue counted twice. Where is the bug in your schema?
- Toll Road Sensor Analytics - easy - Cars enter, cars exit. Except when they don't.
- Trending Dishes Dashboard - medium - What's everyone eating? The answer changes hourly.
- What the Clock Remembers - medium
- What the Script Remembers - medium