Star Schemas

Operational databases are optimized for writing one row at a time. Analytical databases are optimized for reading millions of rows at once. They need different schemas. The star schema is the foundational pattern for analytical data modeling: a central fact table surrounded by dimension tables. This lesson teaches you how to design star schemas, choose the right type of fact table, build effective dimensions, define the grain, and use surrogate keys to keep your model stable as source data changes.

The Star Schema

Daily Life

Interviews

Structure data for fast analytical queries

Facts in the Center, Dimensions Around Them

A star schema has one table in the middle (the fact table) surrounded by several tables around it (dimension tables). The fact table stores measurable events: a sale happened, a click occurred, a payment was processed. Dimension tables store the descriptive context: who (customer), what (product), when (date), where (store).

The fact table is tall and narrow: billions of rows, each with a few FK columns pointing to dimensions plus a few numeric measure columns. Dimension tables are short and wide: thousands to millions of rows, each with many descriptive columns.

dim_customer

PKcustomer_sk

name

region

segment

dim_product

PKproduct_sk

name

Star Schema vs Snowflake Schema

✓Star Schema

Dimensions are fully denormalized (flat)
Fewer JOINs per query
Simpler for analysts to understand
Standard in most data warehouses

✓Snowflake Schema

Dimensions are normalized (sub-tables)
More JOINs per query
Less storage duplication
Useful when dimension hierarchies change independently

In a snowflake schema, the product dimension is split: dim_product has a category_id FK pointing to dim_category. In a star schema, category_name is denormalized directly onto dim_product. Star schemas are the standard for data warehouses because the storage cost of denormalization is negligible compared to the query performance gain of eliminating JOINs.

Why Star Schemas Are Fast

Columnar storage: fact tables are scanned column-by-column, not row-by-row. Only the columns in your query are read.

Partition pruning: fact tables are partitioned by date. Queries that filter by date skip entire partitions.

Dimension tables are small: they fit in memory. Joining a billion-row fact to a 10,000-row dimension is fast because the dimension is broadcast.

Predicate pushdown: filter conditions on dimensions push into the fact scan. The engine reads only matching fact rows.

TIP

When in doubt, use a star schema. It is the default that works for 90% of analytical use cases. Only reach for a snowflake when you have a specific reason: dimension hierarchies that change independently, or extreme dimension size where denormalization wastes significant storage.

Types of Fact Tables

Daily Life

Interviews

Choose transaction, snapshot, or accumulating

There are three fundamental types of fact tables. Each models a different kind of business process. Choosing the wrong type means your table cannot answer the questions it was built for.

Transaction Facts

One row per discrete business event: a sale, a click, a payment. Transaction facts are insert-only. Rows are never updated. They are the most common fact type and the right choice for any event-level data.

GrainMeasuresUpdates

Grain

One row per event

Each row represents one thing that happened: one order line item, one click, one payment.

Measures

Additive numbers about the event

quantity, amount, discount. These can be SUMmed across any dimension.

Updates

Never

Transaction facts are append-only. A new event = a new row. Corrections use reversal rows, not UPDATEs.

Transaction facts cannot answer 'what is the current state?' questions. They record events, not states. 'What is the current inventory level?' requires summing all inventory transactions from the beginning of time. For state queries, use a periodic snapshot.

Periodic Snapshots

One row per entity per time period: daily account balance, monthly inventory level, quarterly headcount. Periodic snapshots capture the state at regular intervals. Unlike transaction facts, they answer 'what was the value on date X?' with a simple lookup instead of a full-history aggregation.

The key characteristic: snapshot measures are semi-additive. You can SUM balances across accounts (total balance for all accounts on March 15). You CANNOT SUM balances across time (total balance for Q1 is meaningless: it would add up 90 daily snapshots). Across time, use LAST_VALUE, AVG, or point-in-time filtering.

Accumulating Snapshots

One row per lifecycle instance: one order, one loan application, one support ticket. The row is created at the first milestone and UPDATED as subsequent milestones are reached. This is the only fact table type where rows are updated.

Each milestone gets its own date column: placed_date, approved_date, shipped_date, delivered_date. NULL means the milestone has not been reached yet. Pre-computed lag columns (days_to_ship = shipped_date minus approved_date) make lifecycle analysis fast.

fact_order_lifecycle

PKorder_sk

FKcustomer_sk

placed_date_sk

approved_date_sk

shipped_date_sk

delivered_date_sk

days_to_ship

current_status

amount

Factless Facts

A fact table with no measures. It records that a relationship existed at a point in time: which students attended which classes on which dates. The row itself is the fact. COUNT(*) is the metric. Factless facts enable 'what did NOT happen?' queries: students enrolled but absent, products on promotion but not sold.

✓Transaction Fact

One row per event
Never updated
Additive measures (quantity, amount)
Cannot answer 'current state' queries

✓Periodic Snapshot

One row per entity per period
Never updated (new snapshot each period)
Semi-additive measures (balance, level)
Answers 'what was the state on date X?'

TIP

Production warehouses often have all three for the same business process. Transaction fact for event analysis. Periodic snapshot for balance queries. Accumulating snapshot for lifecycle dashboards. The three types are complementary, not competing.

Types of Dimensions

Daily Life

Interviews

Build descriptive lookup tables

Dimensions are the descriptive lookup tables that give meaning to fact rows. A fact row says 'customer_sk = 42, amount = $100.' The dimension tells you customer 42 is 'Alice Zhang from Seattle in the Enterprise segment.' Without dimensions, facts are just numbers.

Conformed Dimensions

A conformed dimension is shared across multiple fact tables. dim_customer is referenced by fact_sales, fact_returns, fact_support_tickets. The same customer_sk and the same attribute definitions. This is what makes cross-domain analysis possible: 'show me customers who bought product X but also filed a support ticket.'

Without conformed dimensions, each fact table defines 'customer' differently. Sales has one customer table, support has another. Their definitions of 'active customer' differ. Cross-domain queries produce inconsistent results.

Role-Playing Dimensions

The same physical dimension used multiple times in one fact table, each time in a different role. The classic example: dim_date appears as order_date, ship_date, and deliver_date. Three foreign keys in the fact table, all pointing to the same dim_date.

To query revenue by order month, join on order_date_sk. To query revenue by ship month, join on ship_date_sk. Same table, different role, different analytical perspective.

Junk Dimensions

Low-cardinality flags (is_gift, is_prime, payment_method) that do not belong in any real dimension get consolidated into a junk dimension. Instead of five boolean columns on the fact table, one FK to a 32-row dim_order_flags table. The fact table stays narrow, and adding new flags does not require altering the fact table.

Degenerate Dimensions

Transaction identifiers (invoice_number, receipt_id) that live directly on the fact table with no separate dimension. They are dimensions in the conceptual sense (they describe the transaction) but do not warrant their own table because they have no additional attributes. Creating dim_invoice with one column would be a one-to-one dimension anti-pattern.

The Date Dimension

Every data warehouse has a dim_date table: one row per calendar date, pre-populated for 100+ years. It contains every attribute you might filter or group by: day_of_week, month_name, quarter, fiscal_year, is_weekend, is_holiday. Without it, every query that groups by quarter must compute the quarter from the date, and every team computes it differently.

dim_date

PKdate_sk

full_date

day_of_week

month_name

quarter

year

is_weekend

is_holiday

fiscal_quarter

TIP

Never skip the date dimension. Without it, fiscal quarter logic, holiday flags, and day-of-week groupings are inconsistent across queries. dim_date is the cheapest dimension to build and the most expensive to not have.

Defining the Grain

Daily Life

Interviews

Set the exact level of detail per row

The grain is the most important decision in dimensional modeling. It defines what one row in the fact table represents. 'One row per order line item.' 'One row per click event.' 'One row per daily account balance.' Everything else follows from this: which dimensions make sense, which measures are valid, which queries produce correct results.

How to State the Grain

A grain statement is precise enough that another engineer can write a uniqueness test against it. 'One row per order' is vague. 'One row per order line item, unique on (order_id, line_item_seq)' is precise. The grain statement names the unit of analysis AND the columns that enforce uniqueness.

Precision	Grain Statement	Quality
Vague	"It's an orders table"	Unclear what 'order' means
Better	"One row per order"	What if an order has multiple items?
Good	"One row per order line item"	Clear unit of analysis
Best	"One row per order line item, unique on (order_id, line_seq)"	Enforceable with a constraint

Why Grain Matters

Getting the grain wrong does not produce an error message. It produces wrong numbers that look right. If you define the grain as 'one row per order' but some orders have multiple line items, your row count will be higher than expected, and SUM(amount) will be correct at the order level but wrong at the line-item level. Nobody notices until someone tries to reconcile with the source system.

If the grain is order-line-item, product_sk makes sense. If the grain is order, product_sk does not (which product? there are multiple).
quantity and line_amount are additive at the line-item grain. They must be pre-aggregated at the order grain.
Joining a line-item grain fact to a customer dimension is safe (one customer per order). Joining to a product dimension may fan out if the grain is order, not line item.

Grain Validation

After defining the grain, validate it. Run a uniqueness check on the grain columns. If any group has COUNT(*) > 1, the grain is violated: either the data has duplicates or the grain statement is wrong.

TIP

Always define the grain before adding any columns to a fact table. Not sometimes. Not after you think about measures. First. Everything else flows from the grain.

✗Wrong Grain (Order Level)

Multi-item orders have one row
product_sk is ambiguous (which product?)
SUM(amount) correct per order but wrong per item
Cannot do product-level analysis

✓Right Grain (Line Item Level)

Each line item is its own row
product_sk is unambiguous
SUM(amount) correct at every level
Supports order-level AND product-level analysis

Surrogate Keys

Daily Life

Interviews

Generate stable IDs independent of source data

Surrogate keys are system-generated identifiers that replace natural keys in dimensional models. Every dimension row gets its own surrogate key (customer_sk, product_sk, date_sk). Fact tables reference dimensions using these surrogate keys, not the natural business keys.

Why Not Just Use Natural Keys?

Natural keys (email, SKU, employee_id) come from source systems. They have three problems in analytical models: they change (email updates), they get reused (SKU recycled for a new product), and they complicate SCD Type 2 (a customer with 5 historical versions would have 5 rows with the same email, and fact rows need to reference the specific version).

Surrogate keys are immutable: once assigned, they never change.

Surrogate keys are unique per version: each SCD2 historical row gets its own SK.

Surrogate keys are compact: a BIGINT SK is 8 bytes vs a 36-byte UUID natural key.

Natural keys from the source system can change, get reused, or have different formats across systems.

How Surrogate Keys Enable SCD Type 2

When a customer moves from Portland to Seattle, you create a new dimension row with a new surrogate key. The old row (SK 1001, Portland) stays. The new row (SK 1002, Seattle) is added. Fact rows from before the move reference SK 1001 (Portland). Fact rows after the move reference SK 1002 (Seattle). Revenue is attributed to the correct city at the time of each transaction.

dim_customer (SCD2)

PKcustomer_sk

customer_id

name

city

effective_from

effective_to

is_current

fact_orders

PKorder_sk

FKcustomer_sk

date_sk

amount

The fact table stores customer_sk, not customer_id. This is the critical detail. If the fact stored the natural key (customer_id), a point-in-time join would require a date-range filter on the dimension. With the surrogate key, the join is a simple equality: fact.customer_sk = dim.customer_sk. The SK already encodes which version of the customer this fact row relates to.

Surrogate Key Assignment

Surrogate keys are assigned during the ETL process, not by the source system. When a new customer appears, the ETL assigns the next SK. When an existing customer changes (SCD2), the ETL assigns a new SK for the new version. The source system's customer_id is preserved as a natural key column on the dimension for traceability.

TIP

Use BIGINT surrogate keys for all dimensions. Assign them in ETL, not in the source system. Store the natural key alongside the SK for traceability. This is the standard pattern in every production data warehouse.

❯❯❯PUTTING IT ALL TOGETHER

> You are building an analytical model for an e-commerce platform. The business needs daily revenue dashboards and lifecycle tracking for order fulfillment.

You create a transaction fact (fact_order_line_items) at the line-item grain for revenue analysis, and an accumulating snapshot (fact_order_lifecycle) for fulfillment tracking.

Conformed dim_customer, dim_product, and dim_date are shared between both facts. Surrogate keys assigned in ETL. Natural keys preserved for traceability.

dim_date has is_holiday and fiscal_quarter columns. Every query that groups by quarter uses the same definition. No inconsistency across teams.

KEY TAKEAWAYS

Star schema: central fact table surrounded by dimension tables; the standard for analytical modeling

Three fact types: transaction (events), periodic snapshot (states), accumulating snapshot (lifecycles)

Conformed dimensions: shared across fact tables for consistent cross-domain analysis

Grain first: define what one row represents before adding any columns

Surrogate keys: immutable, version-specific, assigned by ETL; enable SCD2 history

Stars, snowflakes, and facts between

Category: Data Modeling
Duration: 30 minutes
Challenges: 12 hands-on challenges

Topics covered: The Star Schema, Types of Fact Tables, Types of Dimensions, Defining the Grain, Surrogate Keys

Lesson Sections

The Star Schema (concepts: dmStarSchema)
Facts in the Center, Dimensions Around Them A star schema has one table in the middle (the fact table) surrounded by several tables around it (dimension tables). The fact table stores measurable events: a sale happened, a click occurred, a payment was processed. Dimension tables store the descriptive context: who (customer), what (product), when (date), where (store). The fact table is tall and narrow: billions of rows, each with a few FK columns pointing to dimensions plus a few numeric measure
Types of Fact Tables (concepts: dmFactTables)
There are three fundamental types of fact tables. Each models a different kind of business process. Choosing the wrong type means your table cannot answer the questions it was built for. Transaction Facts One row per discrete business event: a sale, a click, a payment. Transaction facts are insert-only. Rows are never updated. They are the most common fact type and the right choice for any event-level data. Transaction facts cannot answer 'what is the current state?' questions. They record event
Types of Dimensions (concepts: dmDimensionTables)
Dimensions are the descriptive lookup tables that give meaning to fact rows. A fact row says 'customer_sk = 42, amount = $100.' The dimension tells you customer 42 is 'Alice Zhang from Seattle in the Enterprise segment.' Without dimensions, facts are just numbers. Conformed Dimensions A conformed dimension is shared across multiple fact tables. dim_customer is referenced by fact_sales, fact_returns, fact_support_tickets. The same customer_sk and the same attribute definitions. This is what makes
Defining the Grain (concepts: dmGrainDefinition)
The grain is the most important decision in dimensional modeling. It defines what one row in the fact table represents. 'One row per order line item.' 'One row per click event.' 'One row per daily account balance.' Everything else follows from this: which dimensions make sense, which measures are valid, which queries produce correct results. How to State the Grain A grain statement is precise enough that another engineer can write a uniqueness test against it. 'One row per order' is vague. 'One
Surrogate Keys (concepts: dmSurrogateKeys)
Surrogate keys are system-generated identifiers that replace natural keys in dimensional models. Every dimension row gets its own surrogate key (customer_sk, product_sk, date_sk). Fact tables reference dimensions using these surrogate keys, not the natural business keys. Why Not Just Use Natural Keys? Natural keys (email, SKU, employee_id) come from source systems. They have three problems in analytical models: they change (email updates), they get reused (SKU recycled for a new product), and th