Star schemas show up in 67% of the dimensional modeling questions in our 1,042-question corpus. Grain declaration errors appear in the first 60 seconds of roughly half of failed whiteboard attempts. Those two numbers explain why this page exists. The pattern is simple. The execution is where candidates get marked down.
Data modeling is 18% of the DE interview corpus (191 questions), and within that slice, star schema design is the most common live whiteboard exercise. Pair this page with the star schema reference page and you've covered the component list and the process.
[Stat cards: share of dimensional-modeling questions that use a star schema, that test the fact/dimension split, and that include SCD Type 2; total modeling questions analyzed.]
Source: DataDriven analysis of 1,042 verified data engineering interview rounds.
Kimball's four-step process shows up in about 67% of star schema walkthroughs, and candidates who declare the grain in the first sentence pass the whiteboard at roughly 2x the rate of those who start listing columns. Do the steps in order. Skipping ahead to column names before declaring the grain is the single most common failure mode in the corpus.
Every star schema models one business process. Not a department, not a report, not a dashboard. A business process is a measurable activity that generates data: sales transactions, website sessions, shipments, support tickets, ad impressions. The business process determines what your fact table records. If you skip this step, you end up with a vague 'everything table' that serves nobody well.
Start by asking: what is the event we want to analyze? An e-commerce company might have several business processes: orders, returns, page views, and inventory movements. Each gets its own star schema. Do not try to combine orders and returns into one fact table. They have different grains, different measures, and different dimension relationships. One business process, one star schema.
Example: Business process: Online order placement. This generates a measurable event every time a customer completes a purchase. The fact table will record one row per event.
The grain defines what one row in the fact table represents. This is the single most important decision in the entire design. Get it wrong and every downstream query produces incorrect aggregations. The grain must be stated in plain language before you write a single column name.
Common grains include: one row per order line item, one row per daily session per user, one row per ad impression, one row per shipment. The grain determines the level of detail. A grain of 'one row per order' means you cannot answer 'which products were in each order' without a separate line-item fact table. A grain of 'one row per order line item' answers both order-level and product-level questions through aggregation.
Example: Grain: one row per order line item. Each row represents a single product within a single order. If order #1234 contains 3 products, it produces 3 rows in the fact table.
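A minimal sketch of why the line-item grain pays off, using an in-memory SQLite table with hypothetical rows for order #1234 (the simplified column set here is an illustration, not the full schema): the same fact table answers both the product-level and the order-level question through aggregation.

```python
# Sketch (hypothetical data): with a line-item grain, one fact table
# answers both product-level and order-level questions by aggregation.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE fact_order_line (
        order_id INT, product_id INT, quantity INT, line_total REAL
    )
""")
# Order #1234 contains 3 products -> 3 rows at line-item grain
conn.executemany(
    "INSERT INTO fact_order_line VALUES (?, ?, ?, ?)",
    [(1234, 1, 2, 40.0), (1234, 2, 1, 15.0), (1234, 3, 1, 25.0)],
)

# Product-level question: revenue per product
per_product = conn.execute(
    "SELECT product_id, SUM(line_total) FROM fact_order_line "
    "GROUP BY product_id ORDER BY product_id"
).fetchall()

# Order-level question: roll the same rows up to one row per order
per_order = conn.execute(
    "SELECT order_id, SUM(line_total) FROM fact_order_line GROUP BY order_id"
).fetchall()
print(per_order)  # [(1234, 80.0)]
```

A grain of one row per order could answer the second query but not the first; the finer grain answers both.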
Dimensions provide the descriptive context for each fact row: the who, what, where, when, and why. They are the axes along which analysts slice and filter data. For every fact row, ask: what descriptive information would an analyst need to filter or group by?
Start with the obvious dimensions: date (almost always present), customer, product. Then think about the business process more carefully. Does the order have a promotion applied? Add dim_promotion. Does it have a payment method? Add dim_payment_method. Was it fulfilled from a specific warehouse? Add dim_warehouse. Each dimension becomes a foreign key in the fact table and a separate dimension table with rich attributes.
Example: Dimensions for online orders: dim_date (order date with day, week, month, quarter, year attributes), dim_customer (name, email, city, state, signup_date, customer_tier), dim_product (name, category, brand, price_tier, supplier), dim_store (for multi-channel retailers: online vs physical, region), dim_promotion (promo_code, discount_type, campaign_name).
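The date dimension is usually pre-generated rather than loaded from a source system: one row per calendar day, with every attribute analysts might filter on. A sketch of that generation step follows; the yyyymmdd integer key and the 'YYYY-Qn' quarter format are assumptions chosen to match the examples on this page, not a fixed convention.

```python
# Sketch: pre-generate a date dimension, one row per calendar day.
# date_key as a yyyymmdd integer and quarter as 'YYYY-Qn' are assumed
# conventions, chosen to match the query examples on this page.
from datetime import date, timedelta

def build_dim_date(start: date, end: date):
    rows = []
    d = start
    while d <= end:
        rows.append({
            "date_key": int(d.strftime("%Y%m%d")),   # e.g. 20260101
            "full_date": d.isoformat(),
            "day_of_week": d.strftime("%A"),
            "month": d.month,
            "quarter": f"{d.year}-Q{(d.month - 1) // 3 + 1}",
            "year": d.year,
        })
        d += timedelta(days=1)
    return rows

dim_date = build_dim_date(date(2026, 1, 1), date(2026, 12, 31))
print(len(dim_date))  # 365 rows (2026 is not a leap year)
```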
Facts are the numeric, measurable values in each row. They represent the quantitative aspect of the business event. Revenue, quantity, duration, cost, count. Every fact should be classifiable as additive, semi-additive, or non-additive. This classification determines how the measure can be aggregated in reports.
Additive facts can be summed across every dimension: revenue, quantity, discount_amount. These are the most useful because analysts can aggregate freely. Semi-additive facts can be summed across some dimensions but not others: account_balance makes sense summed across customers but not across dates (you would not add Monday's balance to Tuesday's). Non-additive facts cannot be meaningfully summed: unit_price, conversion_rate, temperature. They must be averaged, weighted, or used in ratios.
Example: Facts for order line items: quantity (additive), unit_price (non-additive), line_total (additive), discount_amount (additive), shipping_cost (additive at order level, semi-additive at line-item level if allocated).
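The additivity classification can be shown concretely with hypothetical numbers in SQLite: summing an additive fact gives a meaningful total, summing a non-additive fact gives a meaningless number, and the correct "average price" comes from a ratio of two additive facts.

```python
# Sketch (hypothetical numbers): additive facts sum freely;
# non-additive facts like unit_price must be averaged or used in ratios.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_order_line (quantity INT, unit_price REAL, line_total REAL)"
)
conn.executemany(
    "INSERT INTO fact_order_line VALUES (?, ?, ?)",
    [(2, 20.0, 40.0), (1, 15.0, 15.0), (1, 25.0, 25.0)],
)

row = conn.execute("""
    SELECT SUM(line_total),                 -- additive: safe to sum
           SUM(unit_price),                 -- non-additive: this sum is meaningless
           SUM(line_total) / SUM(quantity)  -- correct: quantity-weighted avg price
    FROM fact_order_line
""").fetchone()
total, bogus_sum, avg_price = row
print(total, bogus_sum, avg_price)  # 80.0 60.0 20.0
```

Note that the naive SUM(unit_price) of 60.0 answers no business question, while the weighted average of 20.0 does.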
Each dimension table is a denormalized, flat table with a surrogate key as its primary key and many descriptive attributes. Do not normalize dimensions into sub-tables. In a star schema, the product dimension includes category_name directly, not a foreign key to a separate categories table. This denormalization is intentional and is what makes the star schema efficient for analytics.
Rich dimension tables are the key to flexible analytics. A thin product dimension with only product_id and product_name forces analysts to join to additional tables for category, brand, and supplier information. A rich product dimension with 20+ attributes lets analysts slice by any attribute with a single join. Err toward more columns in dimensions, not fewer. Storage is cheap. Analytical flexibility is expensive to add later.
CREATE TABLE dim_product (
product_key INT PRIMARY KEY, -- surrogate key
product_id VARCHAR(50), -- natural key from source
product_name VARCHAR(200),
category VARCHAR(100),
subcategory VARCHAR(100),
brand VARCHAR(100),
supplier VARCHAR(200),
price_tier VARCHAR(20),
is_active BOOLEAN,
effective_date DATE,
expiry_date DATE
);
The fact table contains foreign keys to every dimension table plus the numeric measures. The primary key is typically the combination of all foreign keys (a composite key) or a dedicated surrogate fact key. Every column in the fact table is either a foreign key or a measure. No descriptive text belongs in the fact table.
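The effective_date and expiry_date columns in dim_product above are the plumbing for SCD Type 2 history. A minimal sketch of that load logic in SQLite, with hypothetical rows (the upsert_scd2 helper and the '9999-12-31' open-ended expiry are illustrative conventions, not a fixed API): a changed attribute closes out the current row and inserts a new one with a fresh surrogate key, while the natural key stays the same.

```python
# Sketch (hypothetical rows): surrogate keys let a dimension keep
# history (SCD Type 2). A tier change closes the current row and
# inserts a new row with a new surrogate key; the natural key
# (customer_id) is identical on both rows.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY,   -- surrogate key
        customer_id TEXT,                   -- natural key from source
        tier TEXT,
        effective_date TEXT,
        expiry_date TEXT                    -- '9999-12-31' marks the current row
    )
""")

def upsert_scd2(conn, customer_id, tier, as_of):
    cur = conn.execute(
        "SELECT customer_key, tier FROM dim_customer "
        "WHERE customer_id = ? AND expiry_date = '9999-12-31'",
        (customer_id,),
    ).fetchone()
    if cur and cur[1] == tier:
        return  # attribute unchanged: nothing to do
    if cur:
        # close out the current row as of the change date
        conn.execute(
            "UPDATE dim_customer SET expiry_date = ? WHERE customer_key = ?",
            (as_of, cur[0]),
        )
    conn.execute(
        "INSERT INTO dim_customer (customer_id, tier, effective_date, expiry_date) "
        "VALUES (?, ?, ?, '9999-12-31')",
        (customer_id, tier, as_of),
    )

upsert_scd2(conn, "C-001", "bronze", "2026-01-01")
upsert_scd2(conn, "C-001", "gold", "2026-06-01")  # tier change -> new row
upsert_scd2(conn, "C-001", "gold", "2026-07-01")  # no change -> no new row
rows = conn.execute(
    "SELECT customer_key, tier, expiry_date FROM dim_customer ORDER BY customer_key"
).fetchall()
print(rows)  # [(1, 'bronze', '2026-06-01'), (2, 'gold', '9999-12-31')]
```

Fact rows loaded while the customer was bronze keep pointing at surrogate key 1, so historical reports are unaffected by the change.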
Keep the fact table narrow. Dimension keys are integers (4 bytes each). Measures are numeric types. A fact table with 6 dimension keys and 4 measures has only 10 columns. This narrow, tall structure is what columnar databases like Snowflake and BigQuery optimize for. Compression works best on columns with low cardinality (like dimension keys) and numeric columns with patterns (like prices).
CREATE TABLE fact_order_line (
order_line_key INT PRIMARY KEY,
date_key INT REFERENCES dim_date(date_key),
customer_key INT REFERENCES dim_customer(customer_key),
product_key INT REFERENCES dim_product(product_key),
store_key INT REFERENCES dim_store(store_key),
promotion_key INT REFERENCES dim_promotion(promotion_key),
quantity INT,
unit_price DECIMAL(10,2),
discount_amount DECIMAL(10,2),
line_total DECIMAL(10,2)
);
Putting the full process together for an online retail company: the business process is order placement, the grain is one row per order line item, and five dimensions surround the fact table.
-- Fact table: one row per order line item
-- 5 dimension foreign keys + 4 measures
-- No descriptive text in the fact table
fact_order_line
order_line_key (PK, surrogate)
date_key (FK -> dim_date)
customer_key (FK -> dim_customer)
product_key (FK -> dim_product)
store_key (FK -> dim_store)
promotion_key (FK -> dim_promotion)
quantity (additive)
unit_price (non-additive)
discount_amount (additive)
line_total (additive)
dim_date: date_key, full_date, day_of_week, month, quarter, year, is_holiday
dim_customer: customer_key, customer_id, name, email, city, state, signup_date, tier
dim_product: product_key, product_id, name, category, brand, supplier, price_tier
dim_store: store_key, store_id, store_name, channel, region, country
dim_promotion: promotion_key, promo_code, discount_type, campaign, start_date, end_date
An analyst asking "total revenue by product category last quarter" joins fact_order_line to dim_product (for category) and dim_date (for quarter), filters on the quarter, groups by category, and sums line_total. Two joins, one aggregation. That simplicity is the whole point of the star schema.
SELECT
p.category,
SUM(f.line_total) AS total_revenue
FROM fact_order_line f
JOIN dim_product p ON f.product_key = p.product_key
JOIN dim_date d ON f.date_key = d.date_key
WHERE d.quarter = '2026-Q1'
GROUP BY p.category
ORDER BY total_revenue DESC;
These four mistakes come up repeatedly in interviews and real projects. Each one seems minor but produces schemas that are harder to query and maintain.
Mistake 1: declaring columns before the grain. Starting to list columns without first stating what one row represents leads to ambiguous fact tables where the same event might be counted multiple times. Always state the grain in one sentence before naming any columns.
Mistake 2: descriptive text in the fact table. Adding product_name or customer_email directly to the fact table instead of referencing dimension tables bloats the fact table, creates redundancy, and defeats the purpose of dimensional modeling. Only foreign keys and numeric measures belong in fact tables.
Mistake 3: joining on natural keys. Referencing dimensions by business keys (customer_email, product_sku) instead of integer surrogate keys is fragile: natural keys change (email addresses get updated, SKU formats change), while surrogate keys are stable. Surrogate keys also enable slowly changing dimensions and handle the 'unknown member' case.
Mistake 4: snowflaking the dimensions. Creating a separate categories table referenced by the product dimension turns the star schema into a snowflake schema, adding joins and complexity. In a star schema, category_name belongs directly in the product dimension table. Denormalization is the point.
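The snowflaking cost is easy to demonstrate with a tiny SQLite sketch (table and column names here are hypothetical): the same category-revenue question takes one join against a denormalized star dimension and two against a snowflaked one, for an identical answer.

```python
# Sketch (hypothetical tables): snowflaking the product dimension adds
# a join to every category-level query without changing the answer.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- star: category denormalized into the product dimension
    CREATE TABLE dim_product_star (product_key INT, category TEXT);
    -- snowflake: category split into its own table
    CREATE TABLE dim_product_snow (product_key INT, category_key INT);
    CREATE TABLE dim_category (category_key INT, category TEXT);
    CREATE TABLE fact_order_line (product_key INT, line_total REAL);

    INSERT INTO dim_product_star VALUES (1, 'Electronics');
    INSERT INTO dim_product_snow VALUES (1, 10);
    INSERT INTO dim_category VALUES (10, 'Electronics');
    INSERT INTO fact_order_line VALUES (1, 99.0);
""")

# Star schema: one join to reach category
star = conn.execute("""
    SELECT p.category, SUM(f.line_total)
    FROM fact_order_line f
    JOIN dim_product_star p ON f.product_key = p.product_key
    GROUP BY p.category
""").fetchall()

# Snowflake: the same question needs two joins
snow = conn.execute("""
    SELECT c.category, SUM(f.line_total)
    FROM fact_order_line f
    JOIN dim_product_snow p ON f.product_key = p.product_key
    JOIN dim_category c ON p.category_key = c.category_key
    GROUP BY c.category
""").fetchall()
print(star == snow)  # True: same result, one extra join
```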
Grain errors appear in about half of failed whiteboard attempts in the corpus. Practice the four-step process on DataDriven until you can state the grain in a single sentence before the interviewer finishes reading the prompt.