# The Retail Tables That Need a New Home

> A working system. Now redesign it so the analysts can actually use it.

Canonical URL: <https://datadriven.io/problems/the_retail_tables_that_need_a_new_home>

Domain: Data Modeling · Difficulty: medium · Seniority: L5

## Problem

You are given an existing transactional database from a retail operation covering orders, customers, products, stores, and employees. The analytics team cannot write performant queries against this structure. Redesign it as a dimensional warehouse that supports reporting on sales performance, product mix, and customer behavior.

## Worked solution and explanation

### Why this problem exists in real interviews

This probes whether you can translate an OLTP order model into an analytics-ready star. The signal is picking the fact grain deliberately (one row per order line, not per order) and using surrogate keys so the warehouse is not coupled to the transactional primary keys.

> **Trick to Solving**
>
> Before drawing any tables, a strong candidate asks: "what is the smallest additive unit the business reports on, and which dimensions share across facts?" The answer is the grain. Everything after that is conformed dimensions and surrogate key discipline.
> 
> 1. Declare `fact_sales` grain = one row per order line
> 2. Pick conformed dimensions (customer, product, store, employee, date)
> 3. Decide SCD strategy per dimension
> 4. Introduce surrogate keys on every dimension

---

### Break down the requirements

#### Step 1: Declare the fact grain as the order line

One row per order line lets analysts compute units, gross sales, and product mix by simple aggregation. Order-grain would force fan-out joins back to line detail for any product question.

#### Step 2: Introduce surrogate keys on every dimension

Natural keys stay as `*_nk` columns but are not the PK. This lets a customer_id renumber in the source without cascading to fact rows.

#### Step 3: Apply SCD Type 2 on dim_customer address

Customer city and state move. Tracking the as-of address via `effective_from`, `effective_to`, and `is_current` lets a past sale attribute to the address it shipped to.

#### Step 4: Conform dim_date across any time-based lookup

One calendar dimension is shared by every fact and by date-typed columns on dimensions via role-playing keys.

#### Step 5: Keep dim_store flexible enough for digital

Digital storefronts have no physical address. A nullable address is one defensible choice; a `store_type` flag lets reporting partition physical vs digital cleanly.

---

### The solution

Below is one defensible model. The grain anchors the rest of the design, and Type 2 on `dim_customer` is the one SCD lever worth defending out loud.

> **Why this design holds up**
>
> The line-grain fact makes every product and category question additive. Surrogate keys decouple the warehouse from OLTP churn. Type 2 on `dim_customer` preserves historical attribution when a customer moves, which matters for any geo report.

> **What strong candidates do**
>
> They name the grain before drawing a single box. They call out which dimensions are Type 1 vs Type 2 and justify each. They acknowledge digital storefronts as a dimension nuance, not a schema exception.

> **Red flags to avoid**
>
> Using OLTP primary keys as warehouse PKs couples reporting to source churn. Keeping order-grain only forces fan-out joins on product mix questions. Modeling customer address as Type 1 silently rewrites historical geo reports when a customer moves.

---

### The analysis pattern

**Weekly gross sales by region and category**

```sql
SELECT
    d.fiscal_week,
    s.region,
    p.category,
    SUM(f.quantity) AS units,
    SUM(f.extended_price) AS gross_sales
FROM fact_sales f
JOIN dim_date d ON d.date_key = f.date_key
JOIN dim_store s ON s.store_sk = f.store_sk
JOIN dim_product p ON p.product_sk = f.product_sk
GROUP BY d.fiscal_week, s.region, p.category
```

---

### Trade-offs and alternatives

**Kimball star with line-grain fact**

Predictable joins, conformed dimensions, good BI tool support. Writes are heavier because dimensions require SCD processing. Storage grows with dimension history.

**One Big Table on columnar storage**

Single wide fact with dimension attributes flattened onto every row. No joins at query time. Schema evolution is trickier and reprocessing history on a dimension change rewrites millions of rows.

---

## Common follow-up questions

- How would you extend the model to track promotion attribution per order line? _(Tests adding a `dim_promotion` and whether it lives as a factless bridge.)_
- Which SCD type would you use on `dim_employee` role and why? _(Tests role tracking trade-offs; promotions usually need Type 2 for commission history.)_
- A customer address corrects a typo. Does that trigger a new `dim_customer` row? _(Tests the difference between a real change and a data quality fix.)_
- How would you handle returns without breaking the additive grain? _(Tests whether returns become negative-quantity rows or a separate `fact_returns`.)_
- How would you partition `fact_sales` to keep yearly comparisons bounded? _(Tests partitioning by `date_key` and its impact on fiscal reporting.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/the_retail_tables_that_need_a_new_home)
- [Data Modeling Interview Questions](https://datadriven.io/data-modeling-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.