# Ride-Sharing Platform Schema

> Riders, drivers, and fares. Everyone takes a cut.

Canonical URL: <https://datadriven.io/problems/ride_sharing_platform_schema>

Domain: Data Modeling · Difficulty: medium · Seniority: L5

## Problem

We run a ride-sharing service. We need to track every ride from request to completion and power dashboards for driver utilization, rider retention, and revenue. Can you design the data model?

## Worked solution and explanation

### Why this problem exists in real interviews

Ride-sharing is a clean probe for entity identification and temporal fact modeling. The interviewer wants to see three entities (rider, driver, vehicle), a driver-to-vehicle one-to-many, and a trip fact with multiple temporal markers for the request, pickup, and drop-off. The signal is whether the candidate models time as first-class data rather than cramming status into one column.

> **Trick to Solving**
>
> The tell is "requested, accepted, started, ended". These are four timestamps on one entity. Before drawing any tables, a strong candidate asks: does the business need state-transition analytics (wait time, time to accept) or just completed-trip reporting? If the former, all four timestamps are first-class columns on the trip.
> 
> 1. Separate the driver entity from the vehicle entity
> 2. Keep drivers one-to-many with vehicles
> 3. Put every state-transition timestamp on the trip row
> 4. Treat the trip as an accumulating snapshot fact

---

### Break down the requirements

#### Step 1: Identify four entities

Rider, driver, vehicle, and trip. Driver and vehicle are distinct because one driver can switch vehicles and one vehicle can be shared across drivers in a fleet model.

#### Step 2: Model the trip as an accumulating snapshot

A single `trips` row captures the whole lifecycle: `requested_at`, `started_at`, `ended_at`. Updates are fine here because the grain is one row per trip and the row is mutable until the trip completes.

#### Step 3: Keep vehicle under driver

`vehicles.driver_id` captures the current assignment. If vehicles can change drivers, a separate `driver_vehicle_assignments` table is a natural extension, but the base model starts simple.

#### Step 4: Declare constraints explicitly

`ended_at >= started_at`, `started_at >= requested_at`, `fare >= 0`. CHECK constraints are cheap correctness guarantees that catch upstream bugs at insert time.

---

### The solution

Below is one defensible model. Treating `trips` as an accumulating snapshot is the anchor: all lifecycle timestamps live in one row so wait-time and duration analytics are trivial joins.

> **Why This Design Works**
>
> Accumulating snapshot grain is the right Kimball fact type when an entity has a bounded, well-known lifecycle. It trades write-once immutability for a single queryable row per trip, which makes wait-time, ETA-accuracy, and fare analytics single-pass queries. The cost is that trips are mutable until the final state lands.

> **Interviewers Watch For**
>
> Strong candidates distinguish accumulating snapshot from transaction fact. They also think about partial trips (cancelled, rider no-show) and propose NULL-allowed lifecycle timestamps. Weaker candidates put every status change in a separate table and then struggle to answer "average wait time this week".

> **Common Pitfall**
>
> Storing `status` as a single TEXT column with values like `requested`, `started`, `completed`. You lose the ability to answer any time-between-states question without reconstructing history from an event log you did not build.

---

### The analysis pattern

**Median wait time by city and hour**

```sql
SELECT
    r.home_city,
    DATE_TRUNC('hour', t.requested_at) AS hour_bucket,
    COUNT(*) AS trips,
    PERCENTILE_CONT(0.5) WITHIN GROUP (
        ORDER BY EXTRACT(EPOCH FROM (t.started_at - t.requested_at))
    ) AS median_wait_seconds,
    SUM(t.fare) AS gross_fare
FROM trips t
JOIN riders r ON r.rider_id = t.rider_id
WHERE t.started_at IS NOT NULL
  AND t.requested_at >= NOW() - INTERVAL '7 days'
GROUP BY r.home_city, hour_bucket
ORDER BY r.home_city, hour_bucket
```

---

### Trade-offs and alternatives

**Accumulating snapshot trip fact**

Single row per trip, lifecycle timestamps on one record, fast wait-time queries. Cost: rows are mutable until completion, requiring careful write semantics and idempotent updates.

**Immutable trip event log**

Append-only, perfect audit, easy replay. Cost: every analytical query reconstructs trip state via window functions or a separate projection, and simple questions become harder.

---

## Common follow-up questions

- Surge pricing changes fare rules mid-trip. Where does the surge multiplier live? _(Tests whether the candidate adds a fare component column or a separate fare_components table.)_
- One driver logs in on two phones and requests overlap. How does the constraint set protect trip integrity? _(Tests uniqueness constraints on (driver_id, active_trip) and idempotent trip creation.)_
- A rider cancels after the driver is en route. How does the accumulating snapshot represent cancellation? _(Tests NULL semantics on started_at and whether a cancellation_reason column is warranted.)_
- Background checks expire and drivers must be suspended. Where does that state live? _(Tests whether the candidate attaches compliance state to drivers or to a separate driver_status table.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/ride_sharing_platform_schema)
- [Data Modeling Interview Questions](https://datadriven.io/data-modeling-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.