Star Schema: The Complete Guide for Data Engineering Interviews
A star schema puts a fact table at the center and denormalized dimension tables around it. It is the most common design pattern in analytical warehousing and the default answer in every data modeling interview. Understanding grain, surrogate keys, and degenerate dimensions is the difference between a passing answer and an exceptional one.
What Is a Star Schema
A star schema organizes a data warehouse around a central fact table joined to a set of denormalized dimension tables. The name comes from the shape: the fact table at the center, dimension tables radiating outward like points of a star.
The fact table stores measurable events at a stated grain: one row per order, one row per click, one row per sensor reading. The dimension tables store descriptive context: which customer, which product, which date, which store.
Star schemas dominate analytical warehousing because they minimize join depth. A query that needs revenue by product category and customer segment joins two dimension tables to one fact table. That is far cheaper than traversing a normalized hierarchy of five tables to answer the same question.
Know Star Schema the way the interviewer who asks it knows it.
Star Schema Structure
-- Fact table: one row per order line
fact_order_lines (
order_line_sk BIGINT PRIMARY KEY, -- surrogate key
order_date_fk INT REFERENCES dim_date(date_sk),
customer_fk INT REFERENCES dim_customer(customer_sk),
product_fk INT REFERENCES dim_product(product_sk),
store_fk INT REFERENCES dim_store(store_sk),
order_number VARCHAR, -- degenerate dimension
quantity INT,
unit_price DECIMAL(10,2),
discount_amount DECIMAL(10,2),
net_revenue DECIMAL(10,2)
)
-- One dimension table (flat, denormalized)
dim_customer (
customer_sk INT PRIMARY KEY, -- surrogate key
customer_id VARCHAR, -- natural key
full_name VARCHAR,
email VARCHAR,
segment VARCHAR,
city VARCHAR,
state VARCHAR,
country VARCHAR,
acquisition_channel VARCHAR
)The fact table is narrow (measures + foreign keys). Dimensions are wide and flat.
The Three Core Components
Fact Table
Stores measurable events: orders, clicks, page views. The fact table sits at the center of the star. Every row is one event at the stated grain. Columns are either foreign keys to dimensions or numeric measures (revenue, quantity, duration). Fact tables are narrow and tall: few columns, many rows. Interview note: Interviewers will ask you to state the grain before drawing any columns. 'One row per order' or 'one row per click event' must be explicit. The grain determines which measures belong and which belong in a dimension.
Dimension Table
Describes the context around events: customer, product, date, store. Dimensions are denormalized by design. A single customer dimension might contain customer name, email, segment, city, state, and country all in one table. This denormalization is intentional: it eliminates joins at query time and makes the model readable to non-engineers. Interview note: When interviewers see a normalized dimension with a separate geography table, they'll ask whether the extra join is worth the storage savings. In most warehouses, it is not.
Surrogate Key
An artificial primary key generated by the warehouse, independent of the source system. Source systems reuse or recycle natural keys over time. A surrogate key (typically an integer or UUID generated at load time) insulates the warehouse from upstream changes. Every dimension row gets a surrogate key. Fact table foreign keys point to surrogate keys, not natural keys. Interview note: If you use natural keys in a fact table and the source system resets IDs, historical facts point to the wrong dimension row. Surrogate keys prevent this.
Star Schema vs Snowflake Schema
| Dimension | Star Schema | Snowflake Schema |
|---|---|---|
| Dimension structure | Denormalized (flat) | Normalized (hierarchy of tables) |
| Join count per query | Low (1 join per dimension) | High (multiple joins per dimension) |
| Storage | More (repeated values in dims) | Less (values stored once) |
| Query performance | Faster on most engines | Slower due to join cost |
| Ease of understanding | High | Lower for non-engineers |
| Preferred when | Analytics workloads, large query volume | Frequent updates to dimension hierarchies |
When to Use Star Schema
Use star schema when
You have a stable analytical workload with repeated query patterns, your warehouse engine is optimized for columnar scans over joins (Snowflake, BigQuery, Redshift), and your dimension data does not change frequently enough to warrant normalization.
Avoid star schema when
Dimension attributes change frequently and you need exact update semantics (use SCD Type 2 instead), or when storage cost matters more than query performance and you have many repeated dimension values (consider snowflake schema).
Star Schema Interview Questions
What is the grain of a fact table and why do you state it first?
The grain is the level of detail represented by one row. You state it first because every subsequent design decision depends on it: which measures belong in the fact, which attributes belong in dimensions, and what foreign keys are required. A misstated grain produces a fact table with mixed granularity, which corrupts aggregations.
What is a degenerate dimension?
A degenerate dimension is a dimension attribute stored directly in the fact table rather than in a separate dimension table. Order number is a classic example: it is a textual identifier with no additional attributes, so adding a separate dimension table buys nothing. It lives as a column in the fact table.
When would you add a junk dimension?
A junk dimension consolidates low-cardinality flags and indicators that do not belong to any natural dimension. If your fact table has a dozen boolean columns (is_return, is_gift, is_promo, is_fraud), you can move them into a single junk dimension keyed by the combination of flag values. This cleans up the fact table and reduces its column count.
How does a star schema handle slowly changing dimension data?
With SCD Type 2: when a dimension attribute changes, a new row is added with updated values, an effective_from date, an effective_to date, and an is_current flag. Historical fact rows continue to point to the old surrogate key, preserving the dimension state at the time of the event. The star schema structure itself does not change; only the dimension management process changes.
Star Schema FAQ
What is a star schema?+
What is the difference between a fact table and a dimension table?+
Why are dimension tables denormalized in a star schema?+
How many dimensions does a typical star schema have?+
Practice Star Schema Design
- 01
Active recall beats re-reading by 50%
Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom
- 02
76% of hiring managers reject on the coding task, not the resume
From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice
- 03
Five problem shapes cover 80% of data engineer loops
Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition
Related Data Modeling Guides
Normalized dimensions and when to use them.
Kimball methodology from grain to conformed dimensions.
SCD Types 1, 2, and 3 with merge logic.
How to split attributes between facts and dimensions.
The full guide to modeling rounds.
Canvas-based practice with a validator.