# The Retail Blueprint

> One business. A thousand transactions. Only one layout survives the analytics layer.

Canonical URL: <https://datadriven.io/problems/online_retail_star_schema>

Domain: Data Modeling · Difficulty: medium · Seniority: L4

## Problem

We're a mid-size online retailer selling about 50,000 SKUs. The BI team needs self-service dashboards to analyze sales by product category, customer segment, geography, and time period. We also need to track inventory levels by warehouse. Design the dimensional model.

## Worked solution and explanation

### Why this problem exists in real interviews

This is the canonical star schema probe. Interviewers use it to test whether a candidate can declare a fact grain, attach conformed dimensions, and resist the temptation to denormalize attributes onto the fact. The signal is whether the candidate treats the order line as an atomic, additive measurement event rather than an operational record.

> **Trick to Solving**
>
> The phrase "analyze sales" is a tell for the star pattern. Before drawing any tables, a strong candidate asks: what is one row in the fact? If the answer is "one product within one order", the grain is fixed and every other decision follows.
> 
> 1. Declare the grain as one row per order line
> 2. Keep only additive measures on the fact
> 3. Push all descriptive attributes into dimensions
> 4. Use surrogate keys so dimensions can evolve independently

---

### Break down the requirements

#### Step 1: Lock the grain at the order line

`fact_order_lines` stores one row per product within an order. This grain supports basket analysis, product-level discounting, and category roll-ups without pre-aggregation.

#### Step 2: Pull every descriptor into a dimension

Category, brand, and loyalty tier belong in `dim_products` and `dim_customers`. Keeping them off the fact is what lets analysts slice without schema changes when a new attribute is added.

#### Step 3: Introduce a conformed date dimension

`dim_dates` gives fiscal calendars, holidays, and week attributes one home. Every fact that reports "by week" joins the same dimension and returns the same answer.

#### Step 4: Use surrogate keys on dimensions

`customer_sk` and `product_sk` decouple the warehouse from source system keys. When a product is relaunched with the same SKU, the surrogate absorbs the change; the fact is untouched.

#### Step 5: Split geography from customer

A customer can ship to multiple regions, and geography attributes (region, country) evolve independently. A dedicated `dim_geography` keeps the fact from duplicating locale fields.

---

### The solution

Below is one conceptually sound star. The grain declaration is the anchor; the surrogate keys are what keep the model evolvable.

> **Why This Design Works**
>
> Atomic grain plus surrogate keys buys schema evolvability. New product attributes are added to `dim_products` without rewriting facts; loyalty changes are absorbed by `dim_customers` with no historical impact. The cost is join work at query time, which columnar warehouses handle cheaply.

> **Interviewers Watch For**
>
> A strong candidate says "one row per product per order" in the first 60 seconds. They also mention that `net_amount` is fully additive, which is what makes the star work for roll-ups. Weaker candidates embed category or brand on the fact and discover the cost only when a category hierarchy changes.

> **Common Pitfall**
>
> Storing `category_name` and `brand_name` on the fact "to avoid joins". Every dimension attribute change then requires a fact backfill, and dashboards drift silently when the product team relabels a category.

---

### The analysis pattern

**Weekly category revenue with holiday flag**

```sql
SELECT
    d.fiscal_week,
    p.category,
    SUM(f.net_amount) AS revenue,
    COUNT(DISTINCT f.order_id) AS orders,
    SUM(CASE WHEN d.is_holiday THEN f.net_amount ELSE 0 END) AS holiday_revenue
FROM fact_order_lines f
JOIN dim_products p ON p.product_sk = f.product_sk
JOIN dim_dates d ON d.date_sk = f.date_sk
WHERE d.calendar_date >= '2025-01-01'
GROUP BY d.fiscal_week, p.category
ORDER BY d.fiscal_week, revenue DESC
```

---

### Trade-offs and alternatives

**Classic Kimball star**

Clean conformed dimensions, surrogate-key evolvability, additive facts. Cost: joins at query time, dimension maintenance pipelines, separate SCD handling per dimension.

**One Big Table (OBT)**

Flat, denormalized, single-table reads on columnar engines. Cost: attribute changes require fact rewrites, no conformity across facts, and storage grows with every dimension attribute.

---

## Common follow-up questions

- How would you handle a product that changes categories mid-year without breaking historical reports? _(Tests SCD awareness and surrogate-key versioning on dim_products.)_
- Finance wants gross, discount, and net on the fact. Are these additive, semi-additive, or non-additive? _(Tests additivity reasoning and whether the candidate can classify measures correctly.)_
- Volume jumps from 10M to 500M order lines per month. What changes? _(Tests partitioning by date_sk and clustering by product_sk on columnar engines.)_
- Marketing wants to attribute each order line to a touchpoint. Where does that live? _(Tests whether the candidate adds a new conformed dimension or a junction fact rather than widening fact_order_lines.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/online_retail_star_schema)
- [Data Modeling Interview Questions](https://datadriven.io/data-modeling-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.