# The Fleet That Never Stops

> Every truck is talking. Not everyone can hear them yet.

Canonical URL: <https://datadriven.io/problems/the_fleet_that_never_stops>

Domain: Pipeline Design · Difficulty: hard · Seniority: L5

## Problem

We operate a large fleet of delivery vehicles. Operations needs a live dashboard showing where every vehicle is and alerting on anomalies in near real-time. The data science team needs a clean historical archive for route optimization models. Design the pipeline.

## Worked solution and explanation

### Why this problem exists in real interviews

The prompt bundles three requirements: a live ops dashboard, an archive for route-optimization training, and a privacy boundary on driver location. Trucks also lose connectivity and replay buffered events when they reconnect. The trap is one stream that updates the map and lets data science read raw GPS.

The default reach is one streaming pipeline that writes positions to a map store, with a side write to a warehouse for the data science team. Buffered events replayed after a disconnect are recorded as 'now', so the historical route teleports. Data science reads raw lat/long because the masking lives in a downstream view nobody enforces. Privacy review flags it as a finding.

> **Trick to Solving**
>
> Streaming for the live map, event-time replay for the archive, masked location for data science.
>
> 1. The streaming path serves the live map within seconds; the same events also land in cold storage partitioned by event-time.
> 2. Replayed buffered events sort into the right historical hour; the route history is built off the event-time partitions, not arrival time.
> 3. Data science reads from a masked-location view that exposes coarse cells (or distance-from-stop), not raw GPS.

---

### Walk the requirements

#### Step 1: Live map within seconds, archive for the historical truth

GPS pings flow through a streaming consumer that updates the live map within seconds. The same events also land in cold storage partitioned by event-time. Operations sees vehicles live; data science reads the archive for training. Without two cadences, either ops is stuck on a slow path or training is paying for streaming compute it doesn't need.
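
A minimal sketch of the two paths, assuming each ping carries a timezone-aware device timestamp. The names `Ping`, `FanOut`, and `archive_partition`, the partition layout, and the in-memory dicts standing in for the map store and cold storage are all hypothetical. The newest-by-event-time rule on the live path is also an assumption, not something the problem statement specifies.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Ping:
    vehicle_id: str
    event_time: datetime  # stamped on the device when the reading was taken
    lat: float
    lon: float


def archive_partition(ping: Ping) -> str:
    """Cold-storage partition derived from event-time, never from arrival time."""
    utc = ping.event_time.astimezone(timezone.utc)
    return utc.strftime("dt=%Y-%m-%d/hour=%H")


class FanOut:
    """Routes every ping to both paths (hypothetical in-memory stand-ins)."""

    def __init__(self) -> None:
        self.live_map: dict[str, Ping] = {}       # stand-in for the map store
        self.archive: dict[str, list[Ping]] = {}  # stand-in for cold storage

    def handle(self, ping: Ping) -> None:
        # Archive path: keep every event, filed under its event-time hour.
        self.archive.setdefault(archive_partition(ping), []).append(ping)
        # Live path: keep only the newest position per vehicle, so a replayed
        # backlog cannot drag the marker backwards (assumed rule).
        current = self.live_map.get(ping.vehicle_id)
        if current is None or ping.event_time > current.event_time:
            self.live_map[ping.vehicle_id] = ping
```

In this sketch a ping that arrives late still lands in the archive under its original hour, while the live map ignores it if a newer position is already showing.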

#### Step 2: Replayed events sort into the right route by event-time

When a truck reconnects after a tunnel and dumps buffered events, each event carries the device's event-time. The archive partitions on event-time; the route-builder for analytics sorts events by event-time before stitching. A truck that buffered an hour of pings replays them and the historical route comes out smooth, not teleporting. Arrival-time-keyed storage is the version where the route looks wrong; event-time partitioning is the contract.
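
The stitching rule can be sketched as follows, assuming the events for one vehicle have already been read from the archive. `Ping` and `build_route` are hypothetical names, and dropping readings with a duplicate event-time is an added assumption about how exact replays would be handled.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class Ping(NamedTuple):
    vehicle_id: str
    event_time: datetime  # device clock, not arrival time
    lat: float
    lon: float


def build_route(pings: list[Ping]) -> list[tuple[float, float]]:
    """Stitch one vehicle's route in event-time order, dropping exact replays."""
    seen: set[datetime] = set()
    route: list[tuple[float, float]] = []
    for ping in sorted(pings, key=lambda p: p.event_time):
        if ping.event_time in seen:  # same reading delivered twice
            continue
        seen.add(ping.event_time)
        route.append((ping.lat, ping.lon))
    return route
```

Fed a list in arrival order, where the buffered pings come after a newer live ping, the function returns the points in the order the truck actually drove them.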

#### Step 3: Data science reads masked location, not raw GPS

Driver location is personal data under company policy. The data science view exposes a coarse spatial cell (or distance-from-stop, or another bucketed feature) rather than raw lat/long. The masking lives in a warehouse view tied to the data science role; raw GPS is restricted to operations. A 'we'll trust people not to query the raw column' approach is what fails the privacy review; the masked view is the contract.
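
The coarsening step behind such a view might look like the sketch below. It only illustrates the transformation; enforcement still belongs in the warehouse view and role grants. The function names, the 0.01-degree cell size, and the 100 m distance bucket are assumptions chosen for illustration, and the right granularity is a policy decision.

```python
import math


def coarse_cell(lat: float, lon: float, cell_deg: float = 0.01) -> tuple[int, int]:
    """Snap a coordinate to a grid-cell index (0.01 degrees of latitude is about 1.1 km)."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def distance_from_stop_m(lat: float, lon: float, stop_lat: float, stop_lon: float,
                         bucket_m: int = 100) -> int:
    """Haversine distance to a stop, rounded down to a bucket so it cannot pinpoint a position."""
    earth_radius_m = 6_371_000.0
    phi1, phi2 = math.radians(lat), math.radians(stop_lat)
    dphi = phi2 - phi1
    dlmb = math.radians(stop_lon - lon)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    metres = 2 * earth_radius_m * math.asin(math.sqrt(a))
    return int(metres // bucket_m) * bucket_m


def masked_row(row: dict) -> dict:
    """Build the row the data science view would expose: no raw lat/lon keys."""
    return {
        "vehicle_id": row["vehicle_id"],
        "event_time": row["event_time"],
        "cell": coarse_cell(row["lat"], row["lon"]),
    }
```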

---

### The shape that fits

> **What this design gives up**
>
> Two paths are more pieces than one shared consumer; event-time partitioning means the route-builder waits for a watermark before finalizing routes; the masked view requires a coarsening step and access policies. Implementation cost is the price; the win is a live map that feels live, route history that survives connectivity gaps, and data science training that doesn't expose raw GPS.
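
The watermark wait can be sketched as below. This is one simple heuristic, assuming the watermark trails the newest event-time seen by a fixed allowed lateness and that partitions are hourly. The function names and the 30-minute lateness in the usage are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def watermark(max_event_time_seen: datetime, allowed_lateness: timedelta) -> datetime:
    """Event-time before which the pipeline assumes it has seen (almost) everything."""
    return max_event_time_seen - allowed_lateness


def partition_is_final(partition_start: datetime, current_watermark: datetime,
                       partition_len: timedelta = timedelta(hours=1)) -> bool:
    """A partition can be finalized once the watermark has passed its end."""
    return current_watermark >= partition_start + partition_len


# Usage: newest event seen at 12:10 UTC with 30 minutes of allowed lateness.
wm = watermark(datetime(2024, 5, 1, 12, 10, tzinfo=timezone.utc), timedelta(minutes=30))
ten_oclock_done = partition_is_final(datetime(2024, 5, 1, 10, 0, tzinfo=timezone.utc), wm)
eleven_oclock_done = partition_is_final(datetime(2024, 5, 1, 11, 0, tzinfo=timezone.utc), wm)
```

A longer allowed lateness catches more buffered replays before routes are finalized, at the cost of fresher partitions staying open longer.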

> **What reviewers check**
>
> A reviewer looks at the canvas for these properties:
> - A streaming path serves the live map within seconds.
> - A cold-storage archive holds events partitioned by event-time so replayed buffered events land in the right route.
> - Data science reads location through a masked view that doesn't expose raw GPS coordinates.

> **The mistake that ships**
>
> The version that ships runs one stream into a map store with a side write keyed on arrival time. Events replayed after a disconnect make the historical route teleport. Data science reads raw GPS because masking lived in a downstream view that nobody enforced. Privacy review flags it as a finding. The eventual rebuild adds event-time partitioning and the warehouse-enforced masked view.

---

## Common follow-up questions

- A truck buffers events for hours and replays them after the route_builder has already produced today's routes. What in this design picks them up? _(Tests whether the candidate sees the route_builder's idempotent rebuild on the affected event-time partition: late events land in the partition for the time they were recorded, which may be yesterday's, and the rebuild for that day produces the corrected route. The route output in the warehouse for that day is replaced.)_
- Data science wants to study driver behavior at specific stop locations. What in this design lets them, and what does it not let them see? _(Tests whether the candidate's masked view exposes distance-from-stop or stop-level activity without raw lat/long; the data science role can study behavior without seeing exact positions. Raw GPS stays in the operations-only view.)_
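
The rebuild in the first follow-up can be sketched as below, assuming daily event-time partitions. `RouteStore`, its in-memory dicts, and the tuple layout of `Event` are hypothetical stand-ins for the archive and the route warehouse.

```python
from collections import defaultdict
from datetime import date, datetime, timezone

Event = tuple[str, datetime, float, float]  # vehicle_id, event_time, lat, lon


class RouteStore:
    def __init__(self) -> None:
        self.archive: dict[date, list[Event]] = defaultdict(list)
        self.routes: dict[date, dict[str, list[tuple[float, float]]]] = {}

    def ingest(self, events: list[Event]) -> set[date]:
        """Append events to their event-time partitions; return the days that changed."""
        touched: set[date] = set()
        for event in events:
            day = event[1].astimezone(timezone.utc).date()
            self.archive[day].append(event)
            touched.add(day)
        return touched

    def rebuild(self, day: date) -> None:
        """Recompute every route for one day from the archive and replace the old output."""
        per_vehicle: dict[str, list[tuple[float, float]]] = defaultdict(list)
        # set() drops exact replays; sorting by event-time restores driving order.
        for vehicle_id, _, lat, lon in sorted(set(self.archive[day]), key=lambda e: e[1]):
            per_vehicle[vehicle_id].append((lat, lon))
        self.routes[day] = dict(per_vehicle)
```

Because `rebuild` reads the whole partition and overwrites the day's output, running it once or several times after a late replay gives the same result.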

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/the_fleet_that_never_stops)
- [System Design Interview Questions](https://datadriven.io/data-engineering-system-design)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io).