# E-Commerce Supply Chain Tracking

> A package splits, reroutes, and (maybe) arrives.

Canonical URL: <https://datadriven.io/problems/e_commerce_supply_chain_tracking>

Domain: Data Modeling · Difficulty: hard · Seniority: L5

## Problem

We're an e-commerce company like Amazon. When a customer orders a product, it moves through several stages: we purchase it from a vendor, it ships to one of our fulfillment centers, gets picked and packed, then delivered to the customer. Operations needs to see where every unit is at any point. Can you design the data model?

## Worked solution and explanation

### Why this problem exists in real interviews

This probes whether a candidate can model a **lifecycle with many states and multiple one-to-many splits**. Orders split into shipments, shipments split from inbound receipts, and a status column can never capture the full trajectory of a unit. Strong candidates reach for state transition events immediately.

> **Trick to Solving**
>
> Before drawing tables, a strong candidate asks: do we need to know where every unit is right now, or the full trajectory of every unit? The signal is 'see where every unit is at any point,' which rules out a status column and demands an event log.
> 
> 1. Dimension vendors, warehouses, and products
> 2. Split inbound (vendor) and outbound (customer) lifecycles
> 3. Model status changes as rows, not columns
> 4. Allow orders to split into multiple shipments

---

### Break down the requirements

#### Step 1: Identify the entities and their roles

`vendors`, `warehouses`, and `products` are stable dimensions. `purchase_orders` and `customer_orders` are headers. `shipments` and `shipment_events` are the facts that move units through the system.

#### Step 2: Split inbound from outbound

`purchase_orders` with `purchase_order_lines` track vendor-to-warehouse movement. `customer_orders` with `order_lines` track warehouse-to-customer movement. Two lifecycles, one shared product dimension.

#### Step 3: One order can have many shipments

A single order often splits across fulfillment centers. `shipments` is 1:M from either a purchase order or a customer order, and its grain is one row per physical box.

#### Step 4: Model status transitions as events

`shipment_events` is one row per state change (picked, packed, in-transit, delivered). A status column on `shipments` would erase the history; the event log preserves every transition with a timestamp.

#### Step 5: Keep order lines and shipment lines aligned

An order line can be split across shipments. Either model a `shipment_line` junction or let `shipment_events` carry `(order_line_id, quantity)` for precise tracking.

---

### The solution

Below is one defensible design: separate inbound and outbound lifecycles sharing a product dimension, with `shipment_events` as an append-only state log.

> **Why this works**
>
> The event log is the source of truth for 'where is every unit right now.' Status can be derived as the latest event per `shipment_id`. The canonical trade-off is write volume (one row per transition) for full lifecycle auditability.

> **Interviewers watch for**
>
> A strong candidate argues for immutable events within the first two minutes and names the derived 'current status' query. They also catch that an order line may split into multiple shipments. Weak candidates put a `status` column on `shipments` and then cannot answer how to compute dwell time per state.

> **Common pitfall**
>
> Mutating `shipments.status` in place. Once the history is overwritten, SLA diagnostics (why was this box delayed in picking) become impossible. An event log costs more storage but preserves the trajectory.

---

### The analysis pattern

**Dwell time per state per warehouse**

```sql
WITH ranked AS (
    SELECT
        se.shipment_id,
        se.event_type,
        se.event_ts,
        LEAD(se.event_ts) OVER (PARTITION BY se.shipment_id ORDER BY se.event_ts) AS next_event_ts
    FROM shipment_events se
)
SELECT
    s.warehouse_id,
    r.event_type,
    AVG(EXTRACT(EPOCH FROM (r.next_event_ts - r.event_ts)) / 3600) AS avg_hours_in_state
FROM ranked r
JOIN shipments s ON s.shipment_id = r.shipment_id
WHERE r.next_event_ts IS NOT NULL
GROUP BY s.warehouse_id, r.event_type
```

---

### Trade-offs and alternatives

**Append-only event log**

Perfect trajectory, easy dwell-time queries, idempotent inserts. Cost: deriving current status requires a window function and the event table grows fastest of all.

**Mutable status columns plus audit table**

Simple current-status queries. Cost: history lives in a second table, state machine violations are possible, and reconciliation between the two tables is manual.

---

## Common follow-up questions

- How do you detect out-of-order events arriving from multiple warehouses with clock skew? _(Tests event time versus processing time and whether the candidate uses monotonic sequence numbers.)_
- How would you back out a canceled order while keeping the original shipment events? _(Tests whether cancellation is a new event type or a mutation.)_
- What if a unit is returned from the customer and re-shelved at a warehouse? _(Tests whether the return is a new order in the opposite direction or an extension of the original.)_
- At 10M shipments per day, how do you partition `shipment_events`? _(Tests partition on event_ts and possibly subpartition by warehouse_id.)_
- How do you enforce that a shipment cannot move from delivered back to in-transit? _(Tests whether the state machine lives in the database, the application, or a separate service.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/e_commerce_supply_chain_tracking)
- [Data Modeling Interview Questions](https://datadriven.io/data-modeling-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.