# The JSON Files That Became a Data Mart

> Three semi-structured inputs. One queryable warehouse.

Canonical URL: <https://datadriven.io/problems/the_json_files_that_became_a_data_mart>

Domain: Data Modeling · Difficulty: medium · Seniority: L5

## Problem

You are given three sample JSON data files from a client's source system: one for orders with embedded customer details and line items, one for products with nested category hierarchies, and one for shipments with address objects and status history arrays. Design a relational data mart from these inputs.

## Worked solution and explanation

### Why this problem exists in real interviews

This probes whether you can translate nested document structure into relational grain. The signal is whether you recognize that a JSON array inside a parent object is a child entity at a finer grain, and that naively mirroring the document produces repeating groups that destroy join performance.

> **Trick to Solving**
>
> Before drawing any tables, a strong candidate asks: "what is the unique row in each output table and which JSON array produced it?" Every array inside a JSON document is a candidate child fact. Name the grain out loud for each output, then back-solve columns from the source payload.
> 
> 1. Inventory every array in the payloads
> 2. Declare a grain for each array (one row per element)
> 3. Keep parent attributes on the parent table only
> 4. Assign surrogate keys; keep source IDs as natural keys

---

### Break down the requirements

#### Step 1: Unfold line_items into fact_order_lines

The line_items array is one row per item per order. Embedding line fields as columns on `fact_orders` would create a repeating group and break any per-product analysis.

#### Step 2: Unfold status_history into fact_shipment_events

Status history is a sequence of status transitions. Modeling it as `fact_shipment_events` at one-row-per-transition grain lets analysts measure dwell time at each step.

#### Step 3: Normalize customer and product as conformed dimensions

`dim_customer` and `dim_product` get surrogate keys; source IDs stay as natural keys. This lets the mart evolve without coupling to the source document shape.

#### Step 4: Model categories as a closure-style dimension

Nested category hierarchies flatten into `dim_category` with `path` and `depth`. A bridge table is a defensible alternative, but a denormalized path is cheaper for BI.

#### Step 5: Separate order header from order lines

`fact_orders` stores order-total and status, `fact_order_lines` stores per-product measures. The split prevents fan-out on queries that only need header counts.

---

### The solution

Below is one conceptually sound approach. The grain of each child fact is derived directly from the corresponding JSON array.

> **Why this design holds up**
>
> Splitting the document along its natural arrays turns analytic questions into simple joins. Any new product attribute lands on `dim_product` without touching the fact, and any new shipment status transition appends one row to `fact_shipment_events` without a migration.

> **What strong candidates do**
>
> They never model a JSON array as repeating columns. They explicitly declare grain per output table. They distinguish the natural keys that came from the source (order_nk, customer_nk) from surrogate keys that the mart owns.

> **Red flags to avoid**
>
> Storing line_items as a JSON column on `fact_orders` defers the problem to query time. Deleting and re-inserting customer rows on every order drops history. Letting the category hierarchy live as a recursive self-join without a denormalized path makes BI tools struggle.

---

### The analysis pattern

**Top products by category path in the last 30 days**

```sql
SELECT
    p.name AS product,
    c.path AS category_path,
    SUM(ol.quantity) AS units_sold,
    SUM(ol.line_total) AS gross_sales
FROM fact_order_lines ol
JOIN fact_orders o ON o.order_sk = ol.order_sk
JOIN dim_product p ON p.product_sk = ol.product_sk
JOIN dim_category c ON c.product_sk = p.product_sk
WHERE o.order_ts >= NOW() - INTERVAL '30 days'
  AND o.status = 'fulfilled'
GROUP BY p.name, c.path
```

---

### Trade-offs and alternatives

**Fully normalized star**

Child arrays unfold into their own facts. Writes are heavier (multiple inserts per document). Reads are cheap and each dimension evolves independently. Best when downstream BI drives the workload.

**Document-preserving OBT**

One wide fact stores JSON payloads as semi-structured columns. Ingestion is a single insert. Reads pay the cost of JSON parsing every query. Best when query patterns are exploratory and storage is cheap.

---

## Common follow-up questions

- What if a product moves between categories over time? How does `dim_category` change? _(Tests whether the candidate considers Type 2 SCD on category assignment.)_
- How would you backfill `fact_shipment_events` from a source that now emits status_history in a different shape? _(Tests schema evolution and ingestion re-processability.)_
- What is the grain of `dim_category` and why is `depth` stored denormalized? _(Tests understanding of closure tables vs materialized path trade-offs.)_
- How does the mart handle a JSON payload missing `line_items`? _(Tests late-arriving or malformed data handling.)_
- If the source publishes an order update two days late, which tables are touched? _(Tests idempotent upsert strategy on `fact_orders` and `fact_order_lines`.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/the_json_files_that_became_a_data_mart)
- [Data Modeling Interview Questions](https://datadriven.io/data-modeling-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.