# What the Script Remembers

Canonical URL: <https://datadriven.io/problems/what-the-script-remembers>

Domain: Data Modeling · Difficulty: medium · Seniority: mid

## Problem

Sanofi's commercial analytics team wants a warehouse for prescription fills across the US, where brand managers slice volume and revenue by therapeutic class and manufacturer while field teams roll the same numbers up through prescriber territory and region. Drug attributes like formulary status and packaging change over time, and an analyst pulling last quarter's numbers must see each fill against the attributes that were in effect on the day it was dispensed. Design the model that serves both the therapeutic-class rollups and that point-in-time accuracy.

## Worked solution and explanation

### Why this problem exists in real interviews

This is the star-versus-snowflake question wearing a lab coat. The interviewer is not asking you to recite two textbook diagrams; they are watching whether you can decide WHERE to normalize a hierarchy and WHERE to leave it flat, and whether you can keep a fill tied to history that has since been rewritten. The trap is treating the choice as all-or-nothing: fully snowflake every dimension and your two headline reports drag a six-table join through 4 billion rows, or flatten everything and you lose the therapeutic-class handle and the point-in-time guarantee the moment finance restates a quarter.

> **Trick to Solving**
>
> Snowflake a branch only when the sub-level is a real reporting handle with its own attributes and its own change history. Therapeutic class qualifies, so it gets an outrigger off dim_product. Strength, form, and packaging do not roll up on their own, so they stay flat on the product row. The point-in-time requirement is a separate axis: solve it with Type 2 on the product dimension, not by normalizing harder.

---

### Break down the requirements

#### Step 1: Declare the grain out loud

One row in fact_prescriptions equals one fill event: this prescriber, this product, this pharmacy, this payer, on this date. It is the most atomic grain, so quantity, days supply, and revenue stay additive and roll up to any therapeutic class, territory, or month. A monthly pre-aggregate here would amputate the territory drill-down before you even start.

#### Step 2: Find the hierarchy that earns an outrigger

Brand managers slice by therapeutic class and manufacturer, and those classes get reassigned and restated. That independent reporting need plus its own change history is exactly what justifies a snowflake branch: dim_therapeutic_class hangs off dim_product. A flat class_name column on the product row would still report, but it could not carry class-level attributes or a clean restatement trail.

#### Step 3: Pin history with Type 2, not a join to now

Formulary status and packaging change a few times a year and finance restates prior periods. dim_product gets valid_from, valid_to, and is_current, and fact_prescriptions stores the product surrogate key that was current on the dispense date. The fill is welded to the version that was true then; changing today's attributes cannot re-attribute last quarter.

#### Step 4: Conform the prescriber and territory spine

A sales-call mart already reports on the same prescribers and territories. Build dim_prescriber and dim_territory as conformed dimensions with shared surrogate keys, so the prescription mart and the activity mart reconcile without a translation layer. Territory rolls up to region inside that branch for the field-team report.

---

### The reference model

Below is one defensible design: a mostly-star core with a deliberate snowflake on the therapeutic-class branch and Type 2 history on the product dimension. The fact is at the fill grain, dimensions carry surrogate keys, and the only normalized branch is the one that pays for its joins with real reporting and restatement value.

> **How It Scales**
>
> fact_prescriptions is the only big table (about 4 billion rows a year); every dimension is small (600 products, 80 classes, 900k prescribers). Partition the fact by date_key and cluster by product_sk. The therapeutic-class report touches one outrigger join on an 80-row table, which the optimizer broadcasts for free, so the snowflake branch costs effectively nothing while keeping class attributes maintainable in one place.

> **Interviewers Watch For**
>
> A strong candidate states the grain first, then justifies each normalization decision per-branch instead of declaring 'I'll use a star' or 'I'll use a snowflake' globally. They name Type 2 unprompted the moment they hear 'restated' and 'as of the dispense date', and they catch that joining the fact to the current product row would silently re-attribute history.

> **Common Pitfall**
>
> Storing only product_code on the fact and resolving attributes by joining to is_current = TRUE. The first time formulary status or a class assignment changes, every historical report shifts and finance can no longer reconcile last quarter. The fix is to bind the fact to the surrogate key that was valid at dispense time, which is the whole point of Type 2.

---

### Star versus snowflake, decided per branch

**Snowflake the class branch**

dim_therapeutic_class is its own table off dim_product.

* Class attributes and ATC codes live in one place
* Reassignments and restatements have a natural home
* One tiny broadcast join, negligible at this scale

**Flatten everything into a pure star**

Flatten therapeutic class and manufacturer onto every product row.

* No join for class rollups
* Class attributes duplicated across products
* A class reassignment must rewrite many product rows and has no clean home for its own history

**Therapeutic-class revenue trend, point-in-time correct**

```sql
SELECT
    tc.class_name,
    d.year,
    d.month,
    SUM(f.gross_revenue) AS revenue,
    SUM(f.quantity_dispensed) AS units
FROM fact_prescriptions f
JOIN dim_product p ON p.product_sk = f.product_sk
JOIN dim_therapeutic_class tc ON tc.therapeutic_class_key = p.therapeutic_class_key
JOIN dim_date d ON d.date_key = f.date_key
WHERE d.year = 2024
GROUP BY tc.class_name, d.year, d.month
ORDER BY tc.class_name, d.month
```

*f.product_sk already points at the version valid on the dispense date, so no effective-date filter is needed at query time.*

---

## Common follow-up questions

- A prescriber moves to a new territory mid-year and the field team wants both the old and new attribution to reconcile. How does your model handle it? _(Tests whether the candidate extends Type 2 to dim_prescriber or reaches for a separate assignment fact.)_
- Finance wants net revenue after rebates, which arrive 60 days late and adjust prior fills. Where does that land? _(Tests late-arriving facts and whether rebates are signed adjustment rows or a separate fact at a different grain.)_
- At 4 billion fills a year, how would you partition and cluster the fact so both the class trend and the territory rollup stay fast? _(Tests partitioning by date and clustering or secondary structures aligned to the two headline access patterns.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/what-the-script-remembers)
- [Data Modeling Interview Questions](https://datadriven.io/data-modeling-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.