# Marketplace Sales Warehouse

> No schema given. The interviewer is watching.

Canonical URL: <https://datadriven.io/problems/marketplace_sales_warehouse>

Domain: Data Modeling · Difficulty: hard · Seniority: L4

## Problem

We run a two-sided marketplace where buyers and sellers transact. The analytics team needs a self-service warehouse to analyze GMV, conversion rates, and seller performance. There is no provided schema. You are expected to establish the entities, their relationships, and the dimensional model from scratch. Start by asking clarifying questions before designing anything.

## Worked solution and explanation

### Why this problem exists in real interviews

This probes whether a candidate can **build a star schema from a blank canvas** and articulate which measures belong on the fact versus the dimension. The “no provided schema” framing is the real test: strong candidates ask clarifying questions before drawing anything.

> **Trick to Solving**
>
> When a prompt gives you no schema and asks for self-service analytics, the trick is to open with clarifying questions: what is the transaction grain, who are the consumers, how does price change over time? Only then does a star schema emerge.
> 
> 1. Ask for the transaction grain and currencies
> 2. Declare fact_transactions at one row per line item
> 3. Build dim_user, dim_product, dim_date, dim_category
> 4. Snapshot price_at_sale on the fact, not the product

---

### Break down the requirements

#### Step 1: Open with clarifying questions

Before drawing any tables, surface consumers, grain, currency, and late-arriving data. Interviewers explicitly score this step on hard prompts.

#### Step 2: Declare the fact grain

`fact_transactions` is one row per line item in one order. GMV, quantity, and commission are additive at this grain.

#### Step 3: Snapshot price and commission

`price_at_sale` and `commission_amount` live on the fact. Pricing drifts; historical GMV cannot silently change.

#### Step 4: Build the dimensional spine

`dim_user`, `dim_product`, `dim_date`, `dim_category`. User holds region, product holds brand and category FK, date is a conformed calendar dimension.

---

### The solution

Below is one conceptually sound star layout. Price_at_sale on the fact and category hierarchy on dim_product are the decisions that define this design.

> **Why this works**
>
> A narrow fact with conformed dimensions lets a BI layer slice GMV by region, category, and date without any joins the analyst has to reason about. Snapshotting price on the fact preserves historical totals under price churn, which is the canonical Kimball move.

> **Interviewers watch for**
>
> A strong candidate opens with clarifying questions before a single table is drawn. They name “conformed dimensions” and “snapshot measures” out loud, and they defend why category is on the product dimension rather than the fact.

> **Common pitfall**
>
> Putting `price` on `dim_product` and joining at query time. The first price change silently rewrites last quarter’s GMV. A second red flag: a combined fact that mixes order-level and line-level grain.

---

### The analysis pattern

**GMV and commission by category and region**

```sql
SELECT
    c.category_name,
    u.region,
    SUM(f.price_at_sale * f.quantity) AS gmv,
    SUM(f.commission_amount) AS commission
FROM fact_transactions f
JOIN dim_user u ON u.user_sk = f.user_sk
JOIN dim_product p ON p.product_sk = f.product_sk
JOIN dim_category c ON c.category_sk = p.category_sk
JOIN dim_date d ON d.date_sk = f.date_sk
WHERE d.full_date >= DATE '2026-01-01'
GROUP BY c.category_name, u.region
ORDER BY gmv DESC
```

---

### Trade-offs and alternatives

**Star schema**

fact_transactions with conformed dimensions.

* Clean grain and additive measures
* Dimension updates are cheap
* Queries pay join cost

**One Big Table (OBT)**

One denormalized table with all dim attributes inline.

* Fastest scan-only queries on columnar engines
* Dimension updates mean rewriting the fact
* Storage blows up with attribute churn

---

## Common follow-up questions

- How do you handle multi-currency GMV across 30 marketplaces? _(Tests whether the candidate adds an FX rate snapshot column or joins a currency dimension by date.)_
- Product category is reorganized. How does the schema evolve? _(Tests whether dim_category can absorb a reorg without rewriting fact rows and whether Type 2 is considered.)_
- How would you support fraud-flagged transactions without deleting them? _(Tests whether the candidate reaches for a soft-flag column on the fact and filter defaults in the semantic layer.)_
- At 100M transactions per day, how do you partition fact_transactions? _(Tests scale thinking: date partitioning and clustering by user or product.)_
- How would you compute seller cohort retention from this schema? _(Tests whether the schema supports repeat-seller detection without schema changes.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/marketplace_sales_warehouse)
- [Data Modeling Interview Questions](https://datadriven.io/data-modeling-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.