# Three-Sided Marketplace Delivery Schema

> One order. Two deliveries. Revenue counted twice. Where is the bug in your schema?

Canonical URL: <https://datadriven.io/problems/three_sided_marketplace_delivery_schema>

Domain: Data Modeling · Difficulty: hard · Seniority: L6

## Problem

We run a three-sided marketplace: consumers place orders from merchants, and couriers deliver them. Finance needs accurate revenue reporting. Operations needs delivery latency metrics. One order can be reassigned to a different courier if the first one cancels. Design the data model.

## Worked solution and explanation

### Why this problem exists in real interviews

This probes whether you can keep two related but differently-grained processes from corrupting each other. The signal is separating `fact_orders` (one row per order, owns revenue) from `fact_deliveries` (one row per delivery attempt, owns latency) so a reassigned courier cannot double-count GMV and a cancelled attempt cannot null out revenue.

> **Trick to Solving**
>
> Before drawing any tables, a strong candidate asks: "can one order have more than one delivery attempt, and which measures belong to which entity?" The moment the answer is "yes, couriers get reassigned," revenue and latency must live in different facts at different grains.
> 
> 1. Spot the one-to-many between order and delivery
> 2. Declare two fact tables at two grains
> 3. Put revenue on orders only, latency on deliveries only
> 4. Filter cancelled delivery attempts out of latency averages

---

### Break down the requirements

#### Step 1: Declare two grains

`fact_orders` sits at one row per order. `fact_deliveries` sits at one row per delivery attempt per order. Revenue questions join neither; latency questions filter to completed deliveries.

#### Step 2: Revenue lives only on fact_orders

`subtotal`, `tip`, `delivery_fee`, and `platform_commission` all live here. Never on deliveries. A reassignment duplicates a delivery row, not a revenue row.

#### Step 3: Latency lives only on fact_deliveries

`assigned_ts`, `pickup_ts`, and `dropoff_ts` belong to the attempt. Orders hold only `order_placed_ts`.

#### Step 4: Filter cancelled attempts out of averages

A cancelled attempt has a null `dropoff_ts`. Any latency query filters `WHERE delivery_status = 'completed'` to avoid contaminating averages.

---

### The solution

Below is one defensible model. The conceptual anchor is grain separation: revenue is order-level, latency is delivery-level, and the two facts never trade places.

> **Why this design holds up**
>
> Putting revenue and latency on separate facts eliminates the whole class of double-count bugs that reassigned couriers cause. Finance joins orders alone. Operations joins deliveries alone. The two meet only when a dashboard needs both, and they meet via a filtered join.

> **What strong candidates do**
>
> They name the grain for both facts in the first two minutes. They explicitly refuse to put revenue columns on deliveries and justify it with the reassignment case. They point out that cancelled attempts must be filtered from latency aggregates.

> **Red flags to avoid**
>
> One fact with both revenue and latency columns guarantees double-counted revenue on reassigned orders and null revenue on cancelled attempts. A naive JOIN on order_id between the two facts produces fan-out whenever a reassignment happened. Including cancelled attempts in latency averages understates pickup and dropoff speed.

---

### The analysis pattern

**GMV and completed-delivery latency by merchant category**

```sql
SELECT
    m.category,
    SUM(o.subtotal + o.delivery_fee + o.tip) AS gmv,
    AVG(EXTRACT(EPOCH FROM (d.dropoff_ts - o.order_placed_ts)) / 60.0)
      FILTER (WHERE d.delivery_status = 'completed') AS avg_minutes
FROM fact_orders o
JOIN dim_merchants m ON m.merchant_sk = o.merchant_sk
LEFT JOIN fact_deliveries d
  ON d.order_sk = o.order_sk
 AND d.delivery_status = 'completed'
GROUP BY m.category
```

---

### Trade-offs and alternatives

**Two facts at different grains**

Revenue and latency are independently queryable. No double-count risk. Cross-fact queries need an explicit filter to completed deliveries.

**Accumulating snapshot on order**

One row per order with the latest delivery timestamps upserted in place. Reporting is a single table scan. Reassignment history is lost unless stored on a child audit table.

---

## Common follow-up questions

- How would you report courier utilization without double-counting reassigned orders? _(Tests aggregation on `fact_deliveries` alone, filtered by status.)_
- Finance wants GMV by merchant. Which fact do they query and why? _(Tests whether revenue ownership is clear on `fact_orders` only.)_
- A cancelled delivery attempt still triggered a small courier fee. Where does that fee live? _(Tests whether courier-level fees need a third fact or a column on `fact_deliveries`.)_
- How would you partition both facts at 20M orders per day across 40 cities? _(Tests partition key choice (order_placed_ts and city) and its impact on late events.)_
- How would you handle an order that is refunded a week after delivery? _(Tests whether refunds are an update on `fact_orders` or an append-only `fact_refunds`.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/three_sided_marketplace_delivery_schema)
- [Data Modeling Interview Questions](https://datadriven.io/data-modeling-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.