Ralph Kimball published "The Data Warehouse Toolkit" in 1996, and the star schema he described in that book still runs most of the analytics you touch today. Thirty years later, interviewers still ask you to draw one on a whiteboard, state the grain of the fact table, and defend why the customer dimension shouldn't link to a separate city table.
The reason it survived four generations of warehouse technology is boring and important: it matches how analysts actually ask questions. Kimball wrote about it in 1996 for the same reason Snowflake benchmarks it in 2026.
[Chart: interview statistics covering the Kimball Toolkit publication, the share of dimensional-modeling questions involving star schemas, modeling questions analyzed, and onsite modeling rounds. Source: DataDriven analysis of 1,042 verified data engineering interview rounds.]
A star schema is a data warehouse design pattern where a single fact table sits at the center, connected by foreign keys to multiple dimension tables around it. The name comes from the visual shape: the fact table is the body of the star, and each dimension table is a point radiating outward.
Kimball was reacting to a specific problem. Bill Inmon's 3NF warehouse designs from the late 1980s were technically correct but hostile to analysts, who had to write seven-join SQL to answer "revenue by product category last quarter." Kimball's 1996 book flipped the constraint: optimize for how humans ask questions, not for insert speed. That choice aged well. Snowflake, BigQuery, and Redshift all perform best against star schemas, and BI tools like Tableau, Power BI, and Looker were built assuming one.
The structure is straightforward. Consider an e-commerce company tracking orders. The fact table (fact_orders) contains one row per order line item. Each row has foreign keys to dim_date, dim_customer, dim_product, dim_store, and dim_promotion. Each row also has measures: quantity, unit_price, discount_amount, and line_total. To answer "total revenue by product category last quarter," an analyst joins fact_orders to dim_product and dim_date, filters on date range, groups by category, and sums line_total. Two joins, one aggregation, done.
                 dim_date       dim_promotion
                     \              /
dim_customer ---- fact_orders ---- dim_product
                     |
                 dim_store
The fact table at the center connects to each dimension through a foreign key. No dimension connects to another dimension.
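The two-join query described above can be sketched end to end. This is a minimal illustration using SQLite as a stand-in for a warehouse; the table and column names follow the e-commerce example, and the sample rows are invented for demonstration.

```python
import sqlite3

# In-memory SQLite database standing in for a columnar warehouse.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,   -- surrogate key
    name TEXT, category TEXT, brand TEXT
);
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY,      -- surrogate key, e.g. 20251001
    full_date TEXT, year INTEGER, quarter INTEGER
);
CREATE TABLE fact_orders (             -- grain: one row per order line item
    order_id INTEGER, line_number INTEGER,
    product_key INTEGER REFERENCES dim_product(product_key),
    date_key INTEGER REFERENCES dim_date(date_key),
    quantity INTEGER, unit_price REAL, line_total REAL
);
""")

con.executemany("INSERT INTO dim_product VALUES (?,?,?,?)", [
    (1, "USB-C Cable", "Electronics", "Acme"),
    (2, "Desk Lamp", "Home", "Lumen"),
])
con.executemany("INSERT INTO dim_date VALUES (?,?,?,?)", [
    (20251001, "2025-10-01", 2025, 4),
    (20251102, "2025-11-02", 2025, 4),
])
con.executemany("INSERT INTO fact_orders VALUES (?,?,?,?,?,?,?)", [
    (100, 1, 1, 20251001, 2, 9.99, 19.98),
    (100, 2, 2, 20251001, 1, 39.00, 39.00),
    (101, 1, 1, 20251102, 3, 9.99, 29.97),
])

# "Total revenue by product category last quarter": two joins, one aggregation.
rows = con.execute("""
    SELECT p.category, SUM(f.line_total) AS revenue
    FROM fact_orders f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_date    d ON d.date_key    = f.date_key
    WHERE d.year = 2025 AND d.quarter = 4
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
print(rows)
```

Note how neither join fans out: each fact row matches exactly one row per dimension, which is what keeps the aggregation correct.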
Three building blocks make up every star schema. Interviewers expect you to explain each one with concrete examples and justify design decisions around them.
The fact table sits at the center of the star. Each row records a single measurable event: a sale, a page view, a shipment. Columns are either foreign keys pointing to dimension tables or numeric measures like revenue, quantity, or duration. Fact tables are tall and narrow. A mid-size e-commerce company might generate 50 million rows per month in its orders fact table. The grain of the fact table (what one row represents) is the single most important design decision in the entire schema.
Measures in a fact table fall into three categories. Additive measures like revenue and quantity can be summed across every dimension. Semi-additive measures like account_balance can be summed across some dimensions but not time (you would not sum Monday's balance with Tuesday's balance). Non-additive measures like unit_price or conversion_rate cannot be meaningfully summed at all and must be averaged or weighted. Interviewers ask you to classify measures because the classification determines how downstream reports aggregate data. Getting this wrong leads to dashboards that show nonsensical numbers.
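The semi-additive pitfall is easy to demonstrate with the account-balance example. The sketch below uses hypothetical data (two accounts over two days) to show which sums are meaningful and which are nonsense.

```python
# Daily account balances: a semi-additive measure.
# Hypothetical sample data: two accounts, two days.
balances = {
    ("mon", "acct_a"): 100, ("mon", "acct_b"): 50,
    ("tue", "acct_a"): 120, ("tue", "acct_b"): 50,
}

# Additive across the account dimension: total money held on Monday.
mon_total = sum(v for (day, _), v in balances.items() if day == "mon")
print(mon_total)  # 150

# NOT additive across time: summing Monday and Tuesday double-counts money.
naive_sum = sum(balances.values())
print(naive_sum)  # 320 -- nonsense as a "balance"

# Correct across time: take the period-end balance (or an average) instead.
tue_total = sum(v for (day, _), v in balances.items() if day == "tue")
print(tue_total)  # 170
```

A BI tool that treats balance as fully additive will happily report the 320 figure, which is exactly the kind of nonsensical dashboard number the classification exists to prevent.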
When asked to design a fact table, state the grain before listing any columns. 'One row per order line item' or 'one row per daily session per user.' If you skip the grain, the interviewer will stop you and ask for it. Having a clear grain statement is the strongest signal of modeling maturity.
Dimension tables surround the fact table and provide descriptive context: the who, what, where, and when of each event. A customer dimension includes name, city, signup_date, and acquisition_channel. A product dimension includes name, category, brand, and price_tier. Dimension tables are short and wide, typically with 10 to 50 columns and thousands to millions of rows. They change slowly compared to fact tables.
A well-designed dimension table is richly attributed. Thin dimensions with only an ID and a name force analysts to write complex queries to filter and group data. Rich dimensions with 20 or more descriptive columns let analysts slice data by any attribute without modifying the fact query. In interviews, err toward putting more attributes into the dimension rather than fewer. The cost of extra columns in a dimension table is trivial compared to the analytical flexibility they provide.
If an interviewer asks whether a specific column belongs in the fact table or a dimension table, apply this rule: if you would aggregate it (sum, average, count), it is a fact. If you would filter or group by it, it is a dimension attribute. Price is a classic trick question: unit_price at the time of a sale is a fact measure, while standard_list_price is a dimension attribute on the product table.
Every dimension table in a star schema uses a surrogate key as its primary key: a system-generated integer with no business meaning. The fact table references dimensions through these surrogate keys, not through natural business keys like customer_email or product_sku. This decoupling is fundamental to the star schema pattern.
Surrogate keys solve three problems that natural keys cannot. First, they handle slowly changing dimensions: when a customer changes their address, you insert a new row with a new surrogate key, preserving the old row for historical accuracy. Second, they absorb source system changes: if the ERP system changes its product ID format, only the ETL pipeline needs to know. The star schema is unaffected. Third, they handle the 'unknown' case: surrogate key 0 or -1 can represent missing or not-yet-loaded dimension members, letting fact rows load even when dimension data is incomplete.
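The third case, the unknown member, is worth seeing concretely. Below is a minimal sketch (SQLite again, with invented names such as lookup_key and C-style sample emails) showing a fact row loading cleanly even though its customer has not yet arrived in the dimension.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,  -- surrogate key
    customer_email TEXT,               -- natural key from the source system
    name TEXT, city TEXT
);
CREATE TABLE fact_orders (
    order_id INTEGER,
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    line_total REAL
);
""")
# Row -1 is the 'unknown member': facts can load before the dimension does.
con.execute("INSERT INTO dim_customer VALUES (-1, NULL, 'Unknown', 'Unknown')")
con.execute("INSERT INTO dim_customer VALUES (1, 'ada@example.com', 'Ada', 'London')")

def lookup_key(email):
    """Resolve a natural key to a surrogate key, defaulting to unknown (-1)."""
    row = con.execute(
        "SELECT customer_key FROM dim_customer WHERE customer_email = ?",
        (email,)).fetchone()
    return row[0] if row else -1

# One order from a known customer, one whose dimension row hasn't landed yet.
for order_id, email, total in [(1, "ada@example.com", 40.0),
                               (2, "new@example.com", 15.0)]:
    con.execute("INSERT INTO fact_orders VALUES (?,?,?)",
                (order_id, lookup_key(email), total))

# Both fact rows loaded; the second points at the unknown member.
keys = [k for (k,) in con.execute(
    "SELECT customer_key FROM fact_orders ORDER BY order_id")]
print(keys)  # [1, -1]
```

When the late-arriving customer row eventually lands, the ETL pipeline can re-point the fact row at the real surrogate key without touching any other data.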
Interviewers test surrogate keys to see if you understand why they exist, not just that they exist. Be ready to explain all three use cases: SCD support, source system independence, and unknown member handling. Candidates who say 'we use surrogate keys because that is the standard' are weaker than those who explain the specific problems surrogate keys solve.
This comparison is one of the most frequently asked questions in data modeling interviews. The difference comes down to how dimension tables are structured: flat (star) or normalized into sub-tables (snowflake).
| Property | Star Schema | Snowflake Schema |
|---|---|---|
| Dimension Structure | Denormalized. Each dimension is a single flat table. | Normalized. Dimensions are split into sub-tables linked by foreign keys. |
| Number of Joins | Fewer joins. One hop from fact to any dimension. | More joins. Two or three hops to reach normalized sub-dimensions. |
| Query Complexity | Simpler SQL. Analysts join fact to dimension directly. | More complex SQL. Extra joins to reach normalized attributes. |
| Storage | Higher. Dimension attributes are repeated across rows. | Lower. Normalization eliminates redundant text values. |
| BI Tool Compatibility | Excellent. Looker, Tableau, Power BI are optimized for star schemas. | Adequate, but BI tools often auto-generate suboptimal SQL for normalized dimensions. |
| ETL Complexity | Simpler. Fewer tables to load and manage. | More complex. More tables, more dependency chains in the load order. |
| When Recommended | Default for analytics warehouses. Almost always the right starting point. | When dimension tables are very large (millions of rows) and normalization produces significant storage savings. |
The correct default answer in 2026 interviews is star schema. Modern columnar warehouses compress denormalized dimensions so efficiently that snowflaking rarely saves meaningful storage. State your preference for star, then acknowledge the edge case where snowflaking might help (very large dimensions with deeply hierarchical sub-entities).
Star schemas are not universal. They are purpose-built for analytical workloads. Here are the four scenarios where they provide the most value.
Star schemas are the gold standard for OLAP workloads. Every major BI tool generates more efficient queries against star schemas than against normalized models. Tableau, Power BI, and Looker all assume a central fact table surrounded by dimension tables. When you build a star schema, you are designing for the tools your analysts already use. A warehouse serving 200 analysts at a Series C startup will almost always use star schemas for its presentation layer.
Power BI's internal engine (VertiPaq) is explicitly designed for star schemas. Microsoft's documentation recommends star schema as the default data model. Relationships in Power BI map directly to the fact-to-dimension foreign key pattern. DAX measures perform best when the model follows star schema conventions. If you are building a Power BI semantic model, a star schema is not just recommended; it is the expected architecture.
Snowflake, BigQuery, and Redshift all use columnar storage with aggressive compression. Denormalized dimension tables compress efficiently because repeated values in a column (like 'Electronics' appearing 50,000 times in a product category column) compress down to nearly nothing. The storage savings from snowflaking are negligible in columnar engines, which eliminates the primary reason to normalize dimensions.
When multiple teams query the same warehouse, the star schema's simplicity is a major advantage. A marketing analyst and a finance analyst can both write queries against the same fact table, joining to the same conformed dimensions. The structure is self-documenting: the fact table name tells you the business process, and the dimension names tell you the available filters. Normalized schemas require more institutional knowledge to query correctly.
These questions come up repeatedly in data engineering and analytics engineering interviews. The guidance below shows how strong candidates structure their answers.
A star schema is a data warehouse design pattern where a central fact table connects to multiple dimension tables through foreign keys. The visual shape resembles a star: the fact table at the center with dimension tables radiating outward. It was formalized by Ralph Kimball as the foundation of dimensional modeling. The design optimizes for read-heavy analytical queries by denormalizing dimension tables into single flat tables, reducing the number of joins needed. Close by naming why it matters: BI tools, query simplicity, and columnar compression all favor this layout.

Start with the grain: one row per completed ride. The fact table (fact_rides) contains measures: fare_amount, tip_amount, ride_duration_minutes, distance_miles, surge_multiplier. Foreign keys point to five dimensions: dim_date (pickup date and time attributes), dim_rider (name, signup_date, rider_tier, city), dim_driver (name, signup_date, vehicle_type, rating_bucket), dim_pickup_location (zone, neighborhood, city, state), dim_dropoff_location (same structure). Classify the measures: fare_amount and tip_amount are additive. surge_multiplier is non-additive (averaging it across rides is meaningful; summing it is not). If the interviewer asks about ride cancellations, that is a separate fact table with a different grain.
In practice, almost never in a modern cloud warehouse. Columnar compression handles dimension redundancy efficiently, so the storage savings from snowflaking are minimal. The added join complexity slows queries and confuses analysts. The main exception is when a dimension table has millions of rows and contains a sub-entity with many attributes (like a product dimension with a deeply attributed category hierarchy). Even then, many teams prefer to keep the star schema and accept the redundancy. State this preference clearly, then acknowledge the exception.
SCD Type 1: overwrite the old value. Simple but loses history. SCD Type 2: insert a new row with a new surrogate key, mark the old row as expired. Preserves full history. SCD Type 3: add a column (e.g., previous_city, current_city) to track the last change. Tracks one change only. In a star schema, Type 2 is the most common because it preserves history without changing the fact table. The fact table continues to reference surrogate keys; old facts point to the old dimension row, new facts point to the new row. Explain why surrogate keys make this possible: natural keys cannot distinguish between the 'before' and 'after' versions of the same entity.
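The Type 2 mechanics can be sketched in a few lines. This is an illustrative implementation under simplifying assumptions (SQLite, a single tracked attribute, an invented scd2_upsert helper and '9999-12-31' end-date convention), not a production merge.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    customer_id TEXT,          -- natural/business key
    city TEXT,
    valid_from TEXT, valid_to TEXT,
    is_current INTEGER
)""")

def scd2_upsert(customer_id, city, as_of):
    """SCD Type 2: expire the current row (if changed) and insert a new one."""
    cur = con.execute(
        "SELECT customer_key, city FROM dim_customer "
        "WHERE customer_id = ? AND is_current = 1", (customer_id,)).fetchone()
    if cur and cur[1] == city:
        return  # nothing changed: no new version
    if cur:
        # Expire the old version; existing facts keep pointing at its key.
        con.execute("UPDATE dim_customer SET valid_to = ?, is_current = 0 "
                    "WHERE customer_key = ?", (as_of, cur[0]))
    con.execute("INSERT INTO dim_customer "
                "(customer_id, city, valid_from, valid_to, is_current) "
                "VALUES (?,?,?,'9999-12-31',1)", (customer_id, city, as_of))

scd2_upsert("C-42", "Austin", "2024-01-01")
scd2_upsert("C-42", "Austin", "2024-06-01")   # no-op: city unchanged
scd2_upsert("C-42", "Denver", "2025-03-15")   # expires row 1, inserts row 2

rows = con.execute("SELECT customer_key, city, is_current FROM dim_customer "
                   "ORDER BY customer_key").fetchall()
print(rows)  # [(1, 'Austin', 0), (2, 'Denver', 1)]
```

Facts recorded before the move still join to surrogate key 1 (Austin) and facts after it join to key 2 (Denver), which is exactly the history preservation the interviewer is probing for.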
Check five things in order. First, the query plan: is the engine doing a full table scan on the fact table? Add a partition on the date column if not already present. Second, join keys: are fact-to-dimension joins on integer surrogate keys or on string natural keys? Strings are slower. Third, the fact table grain: is the table at a finer grain than needed? An hourly grain table queried for monthly aggregates does unnecessary work; consider a pre-aggregated monthly fact table. Fourth, dimension table size: a dimension with millions of rows (a 'monster dimension') may need to be split or filtered. Fifth, the warehouse configuration: in Snowflake or BigQuery, is the warehouse/slot allocation sized for the query volume? These five checks cover 90% of star schema performance issues.
Practice the same four-step design process Kimball taught in 1996: pick the business process, declare the grain, identify the dimensions, identify the facts. Do it on real interview scenarios until it's muscle memory.