Data Modeling

Star Schema Data Model: How to Design One from Scratch

Star schemas show up in 67% of the dimensional modeling questions in our 1,042-question corpus. Grain declaration errors appear in the first 60 seconds of roughly half of failed whiteboard attempts. Those two numbers explain why this page exists. The pattern is simple. The execution is where candidates get marked down.

Data modeling is 18% of the DE interview corpus (191 questions), and within that slice, star schema design is the most common live whiteboard exercise. Pair this page with the star schema reference page and you've covered the component list and the process.

67%

Dim Qs Use Star Schema

71%

Test Fact/Dim Split

23%

Include SCD Type 2

191

Modeling Qs Analyzed

Source: DataDriven analysis of 1,042 verified data engineering interview rounds.

6 Steps to Design a Star Schema

Kimball's four-step process shows up in about 67% of star schema walkthroughs, and candidates who declare the grain in the first sentence pass the whiteboard at roughly 2x the rate of those who start listing columns. Do the steps in order. Skipping ahead to column names before declaring the grain is the single most common failure mode in the corpus.

1

Identify the Business Process

Every star schema models one business process. Not a department, not a report, not a dashboard. A business process is a measurable activity that generates data: sales transactions, website sessions, shipments, support tickets, ad impressions. The business process determines what your fact table records. If you skip this step, you end up with a vague 'everything table' that serves nobody well.

Start by asking: what is the event we want to analyze? An e-commerce company might have several business processes: orders, returns, page views, and inventory movements. Each gets its own star schema. Do not try to combine orders and returns into one fact table. They have different grains, different measures, and different dimension relationships. One business process, one star schema.

Example: Business process: Online order placement. This generates a measurable event every time a customer completes a purchase. The fact table will record one row per event.

2

Declare the Grain

The grain defines what one row in the fact table represents. This is the single most important decision in the entire design. Get it wrong and every downstream query produces incorrect aggregations. The grain must be stated in plain language before you write a single column name.

Common grains include: one row per order line item, one row per daily session per user, one row per ad impression, one row per shipment. The grain determines the level of detail. A grain of 'one row per order' means you cannot answer 'which products were in each order' without a separate line-item fact table. A grain of 'one row per order line item' answers both order-level and product-level questions through aggregation.

Example: Grain: one row per order line item. Each row represents a single product within a single order. If order #1234 contains 3 products, it produces 3 rows in the fact table.

3

Choose the Dimensions

Dimensions provide the descriptive context for each fact row: the who, what, where, when, and why. They are the axes along which analysts slice and filter data. For every fact row, ask: what descriptive information would an analyst need to filter or group by?

Start with the obvious dimensions: date (almost always present), customer, product. Then think about the business process more carefully. Does the order have a promotion applied? Add dim_promotion. Does it have a payment method? Add dim_payment_method. Was it fulfilled from a specific warehouse? Add dim_warehouse. Each dimension becomes a foreign key in the fact table and a separate dimension table with rich attributes.

Example: Dimensions for online orders: dim_date (order date with day, week, month, quarter, year attributes), dim_customer (name, email, city, state, signup_date, customer_tier), dim_product (name, category, brand, price_tier, supplier), dim_store (for multi-channel retailers: online vs physical, region), dim_promotion (promo_code, discount_type, campaign_name).

4

Define the Facts (Measures)

Facts are the numeric, measurable values in each row. They represent the quantitative aspect of the business event. Revenue, quantity, duration, cost, count. Every fact should be classifiable as additive, semi-additive, or non-additive. This classification determines how the measure can be aggregated in reports.

Additive facts can be summed across every dimension: revenue, quantity, discount_amount. These are the most useful because analysts can aggregate freely. Semi-additive facts can be summed across some dimensions but not others: account_balance makes sense summed across customers but not across dates (you would not add Monday's balance to Tuesday's). Non-additive facts cannot be meaningfully summed: unit_price, conversion_rate, temperature. They must be averaged, weighted, or used in ratios.

Example: Facts for order line items: quantity (additive), unit_price (non-additive), line_total (additive), discount_amount (additive), shipping_cost (additive at order level, semi-additive at line-item level if allocated).

5

Build the Dimension Tables

Each dimension table is a denormalized, flat table with a surrogate key primary key and many descriptive attributes. Do not normalize dimensions into sub-tables. In a star schema, the product dimension includes category_name directly, not a foreign key to a separate categories table. This denormalization is intentional and is what makes the star schema efficient for analytics.

Rich dimension tables are the key to flexible analytics. A thin product dimension with only product_id and product_name forces analysts to join to additional tables for category, brand, and supplier information. A rich product dimension with 20+ attributes lets analysts slice by any attribute with a single join. Err toward more columns in dimensions, not fewer. Storage is cheap. Analytical flexibility is expensive to add later.

CREATE TABLE dim_product (
  product_key    INT PRIMARY KEY,  -- surrogate key
  product_id     VARCHAR(50),      -- natural key from source
  product_name   VARCHAR(200),
  category       VARCHAR(100),
  subcategory    VARCHAR(100),
  brand          VARCHAR(100),
  supplier       VARCHAR(200),
  price_tier     VARCHAR(20),
  is_active      BOOLEAN,
  effective_date DATE,
  expiry_date    DATE
);
6

Assemble the Fact Table

The fact table contains foreign keys to every dimension table plus the numeric measures. The primary key is typically the combination of all foreign keys (the composite key) or a dedicated surrogate fact key. Every column in the fact table is either a foreign key or a measure. No descriptive text belongs in the fact table.

Keep the fact table narrow. Dimension keys are integers (4 bytes each). Measures are numeric types. A fact table with 6 dimension keys and 4 measures has only 10 columns. This narrow, tall structure is what columnar databases like Snowflake and BigQuery optimize for. Compression works best on columns with low cardinality (like dimension keys) and numeric columns with patterns (like prices).

CREATE TABLE fact_order_line (
  order_line_key   INT PRIMARY KEY,
  date_key         INT REFERENCES dim_date(date_key),
  customer_key     INT REFERENCES dim_customer(customer_key),
  product_key      INT REFERENCES dim_product(product_key),
  store_key        INT REFERENCES dim_store(store_key),
  promotion_key    INT REFERENCES dim_promotion(promotion_key),
  quantity         INT,
  unit_price       DECIMAL(10,2),
  discount_amount  DECIMAL(10,2),
  line_total       DECIMAL(10,2)
);

Complete Example: E-Commerce Orders

Putting all six steps together for an online retail company. The business process is order placement. The grain is one row per order line item. Five dimensions surround the fact table.

-- Fact table: one row per order line item
-- 5 dimension foreign keys + 4 measures
-- No descriptive text in the fact table

fact_order_line
  order_line_key  (PK, surrogate)
  date_key        (FK -> dim_date)
  customer_key    (FK -> dim_customer)
  product_key     (FK -> dim_product)
  store_key       (FK -> dim_store)
  promotion_key   (FK -> dim_promotion)
  quantity        (additive)
  unit_price      (non-additive)
  discount_amount (additive)
  line_total      (additive)

dim_date: date_key, full_date, day_of_week, month, quarter, year, is_holiday
dim_customer: customer_key, customer_id, name, email, city, state, signup_date, tier
dim_product: product_key, product_id, name, category, brand, supplier, price_tier
dim_store: store_key, store_id, store_name, channel, region, country
dim_promotion: promotion_key, promo_code, discount_type, campaign, start_date, end_date

An analyst asking "total revenue by product category last quarter" joins fact_order_line to dim_product (for category) and dim_date (for quarter), filters on quarter, groups by category, and sums line_total. Two joins, one aggregation. That simplicity is the whole point of the star schema data model.

SELECT
  p.category,
  SUM(f.line_total) AS total_revenue
FROM fact_order_line f
JOIN dim_product p ON f.product_key = p.product_key
JOIN dim_date d ON f.date_key = d.date_key
WHERE d.quarter = '2026-Q1'
GROUP BY p.category
ORDER BY total_revenue DESC;

Common Design Mistakes

These four mistakes come up repeatedly in interviews and real projects. Each one seems minor but produces schemas that are harder to query and maintain.

Skipping the grain declaration

Starting to list columns without first stating what one row represents. This leads to ambiguous fact tables where the same event might be counted multiple times. Always state the grain in one sentence before naming any columns.

Putting descriptive text in the fact table

Adding product_name or customer_email directly to the fact table instead of referencing dimension tables. This bloats the fact table, creates redundancy, and defeats the purpose of dimensional modeling. Only foreign keys and numeric measures belong in fact tables.

Using natural keys instead of surrogate keys

Referencing dimensions by business keys (customer_email, product_sku) instead of integer surrogate keys. Natural keys change (email addresses get updated, SKU formats change). Surrogate keys are stable. They also enable slowly changing dimensions and handle the 'unknown member' case.

Normalizing dimension tables

Creating a separate categories table referenced by the product dimension. This turns the star schema into a snowflake schema, adding joins and complexity. In a star schema, category_name belongs directly in the product dimension table. Denormalization is the point.

Frequently Asked Questions

What is a star schema data model?+
A star schema data model is a database design pattern where a central fact table stores measurable events (like sales or clicks) and connects through foreign keys to surrounding dimension tables that provide descriptive context (like customer, product, and date). The visual shape resembles a star: fact table at center, dimensions radiating outward. It is the most widely used design for analytics data warehouses and was formalized by Ralph Kimball as part of dimensional modeling.
How do you design a star schema from business requirements?+
Follow six steps. First, identify the business process (what event are you tracking). Second, declare the grain (what does one row represent). Third, choose dimensions (who, what, where, when). Fourth, define fact measures (quantity, revenue, duration). Fifth, build denormalized dimension tables with surrogate keys. Sixth, assemble the fact table with foreign keys and measures. The grain declaration is the most important step and should be done before naming any columns.
What is the difference between a star schema and a data model?+
A data model is a broad term for any structured representation of data and its relationships. It can be conceptual, logical, or physical. A star schema is a specific type of physical data model designed for analytical workloads. Not all data models are star schemas. An OLTP database uses a normalized (3NF) data model. A data lake might use a schema-on-read model. The star schema is one particular pattern, optimized for read-heavy analytics with denormalized dimensions and a central fact table.
How many dimensions should a star schema have?+
There is no fixed number. Most star schemas have 4 to 12 dimensions. A retail fact table typically has 5 to 7: date, customer, product, store, promotion, and payment method. An advertising fact table might have 10 or more. The number depends on how many independent descriptive contexts apply to the business event. If you find yourself with 20+ dimensions, the grain might be too fine or some dimensions could be combined into a single junk dimension.

Beat the 60-Second Grain Deadline

Grain errors appear in about half of failed whiteboard attempts in the corpus. Practice the four-step process on DataDriven until you can state the grain in a single sentence before the interviewer finishes reading the prompt.

Start Practicing