Loading lesson...
Stars, snowflakes, and facts between
Stars, snowflakes, and facts between
Topics covered: The Star Schema, Types of Fact Tables, Types of Dimensions, Defining the Grain, Surrogate Keys
Facts in the Center, Dimensions Around Them A star schema has one table in the middle (the fact table) surrounded by several tables around it (dimension tables). The fact table stores measurable events: a sale happened, a click occurred, a payment was processed. Dimension tables store the descriptive context: who (customer), what (product), when (date), where (store). The fact table is tall and narrow: billions of rows, each with a few FK columns pointing to dimensions plus a few numeric measure
There are three fundamental types of fact tables. Each models a different kind of business process. Choosing the wrong type means your table cannot answer the questions it was built for. Transaction Facts One row per discrete business event: a sale, a click, a payment. Transaction facts are insert-only. Rows are never updated. They are the most common fact type and the right choice for any event-level data. Transaction facts cannot answer 'what is the current state?' questions. They record event
Dimensions are the descriptive lookup tables that give meaning to fact rows. A fact row says 'customer_sk = 42, amount = $100.' The dimension tells you customer 42 is 'Alice Zhang from Seattle in the Enterprise segment.' Without dimensions, facts are just numbers. Conformed Dimensions A conformed dimension is shared across multiple fact tables. dim_customer is referenced by fact_sales, fact_returns, fact_support_tickets. The same customer_sk and the same attribute definitions. This is what makes
The grain is the most important decision in dimensional modeling. It defines what one row in the fact table represents. 'One row per order line item.' 'One row per click event.' 'One row per daily account balance.' Everything else follows from this: which dimensions make sense, which measures are valid, which queries produce correct results. How to State the Grain A grain statement is precise enough that another engineer can write a uniqueness test against it. 'One row per order' is vague. 'One
Surrogate keys are system-generated identifiers that replace natural keys in dimensional models. Every dimension row gets its own surrogate key (customer_sk, product_sk, date_sk). Fact tables reference dimensions using these surrogate keys, not the natural business keys. Why Not Just Use Natural Keys? Natural keys (email, SKU, employee_id) come from source systems. They have three problems in analytical models: they change (email updates), they get reused (SKU recycled for a new product), and th