# A Million Cars Phoning Home

> Every vehicle is a sensor. Deploy the pipeline to catch it all.

Canonical URL: <https://datadriven.io/problems/a_million_cars_phoning_home>

Domain: Pipeline Design · Difficulty: hard · Seniority: L6

## Problem

We collect telemetry from millions of connected vehicles (speed, braking events, OBD sensor readings, and ADAS alerts) and need to make this data available to our safety analytics and predictive maintenance teams. Our current setup has no infrastructure as code (IaC), and promoting a pipeline change from dev to prod involves a manual Confluence checklist. Design the pipeline architecture and the operational model for deploying and promoting changes safely.

## Worked solution and explanation

### Why this problem exists in real interviews

This challenge looks like a streaming pipeline question but is actually four questions stacked: streaming for safety, a promotion path that doesn't run through a Confluence checklist, schema evolution that doesn't break older cars, and location data that survives a GDPR audit. The trap is treating it as one big stream and forgetting that the operational and governance work is half the design.

The whiteboard answer is one Kafka topic, one stream processor, and one warehouse table, with code changes shipped by editing the cluster config and rolling. It works on day one. Then a new sensor is added in firmware, the schema changes, and the stream processor crashes on events from cars still on the old firmware. Somebody pushes a hotfix by hand because the promotion path is a Confluence page nobody read. Vehicle location is sitting in raw form in the warehouse when a GDPR request lands in legal's inbox. Three of the four requirements are actively failing.

> **Trick to Solving**
>
> A streaming pipeline is half the design; the other half is environments, schema tolerance, and a privacy boundary in the layout.
> 
> 1. Two paths from one ingest: a streaming path for safety events that fans out in seconds, and a batch path for predictive maintenance on a slower cadence. Same source, different latency budgets.
> 2. Promotion is dev → staging → small prod cohort → full prod, owned by IaC. The 'small cohort' is the safety net; if a change misbehaves, only a slice of the fleet is affected.
> 3. Schema evolution is forward- and backward-compatible by contract: new fields are optional, old fields stick around. The downstream tables ingest both firmware versions without splitting.
> 4. Precise location lives in cold storage with a retention rule and a controlled access path; analytics gets a coarse location.

---

### Walk the requirements

#### Step 1: Two paths from one ingest, sized for the two consumers

Telemetry comes in once and fans out twice: a streaming path for safety analytics with sub-minute end-to-end latency, and a batch path for predictive maintenance on a slower cadence. The streaming path is the expensive one; keep it narrow to safety-critical events. The batch path lands the broader telemetry in cold storage and aggregates for maintenance. One shared tier sized for safety is too expensive; one shared tier sized for maintenance leaves safety blind.
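The fan-out described above can be sketched in a few lines. This is a minimal illustration, assuming each event carries an `event_type` field and that every event also lands on the batch path; the names `SAFETY_EVENT_TYPES` and `FanOut` are hypothetical, and the in-memory lists stand in for whatever stream and cold-storage sinks are actually used.

```python
from dataclasses import dataclass, field

# Hypothetical set of safety-critical event types; the real list is a product decision.
SAFETY_EVENT_TYPES = frozenset({"hard_braking", "adas_alert"})


@dataclass
class FanOut:
    """Routes one ingest stream onto a narrow streaming path and a broad batch path."""

    streaming_sink: list = field(default_factory=list)  # stand-in for the low-latency path
    batch_sink: list = field(default_factory=list)      # stand-in for cold storage

    def route(self, event: dict) -> None:
        # Assumption: all telemetry lands on the batch path for predictive maintenance.
        self.batch_sink.append(event)
        # Only safety-critical events also take the expensive streaming path.
        if event.get("event_type") in SAFETY_EVENT_TYPES:
            self.streaming_sink.append(event)


fan_out = FanOut()
for evt in [
    {"vehicle_id": "v1", "event_type": "hard_braking"},
    {"vehicle_id": "v1", "event_type": "obd_reading"},
    {"vehicle_id": "v2", "event_type": "adas_alert"},
]:
    fan_out.route(evt)

print(len(fan_out.streaming_sink), len(fan_out.batch_sink))
```

Keeping the safety filter as a small, explicit allowlist is one way to stop the streaming path from quietly widening as new event types appear.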

#### Step 2: Promote through environments owned by IaC, not Confluence

Today, shipping a change is a manual checklist nobody trusts. The fix is defined environments (dev, staging, prod) with a staged promotion path: a change runs in dev against synthetic traffic, then in staging against a mirrored cohort, then in a small prod cohort (a small percentage of the fleet) before full rollout. The whole promotion is owned by infrastructure-as-code, so the path is repeatable and reviewable, and rollback is one config flip, not a runbook. Confluence isn't part of the deploy.
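A rough sketch of the promotion gate and the small-cohort selection, under stated assumptions: the stage names, the `config` keys, the version labels, and the hash-based bucketing are all illustrative choices, not part of the original design. The point is that cohort membership is deterministic and that rollback is a single value in a reviewed config.

```python
import hashlib

# Promotion order from the text; the stage names themselves are illustrative.
STAGES = ("dev", "staging", "prod_cohort", "prod_full")


def next_stage(current: str, checks_passed: bool) -> str:
    """Advance one stage only when the current stage's checks pass."""
    idx = STAGES.index(current)
    if not checks_passed or idx == len(STAGES) - 1:
        return current  # hold in place; a change never skips a stage
    return STAGES[idx + 1]


def in_cohort(vehicle_id: str, percent: float, salt: str = "rollout-1") -> bool:
    """Deterministically place a vehicle in roughly `percent` of the fleet."""
    digest = hashlib.sha256(f"{salt}:{vehicle_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # buckets 0..9999
    return bucket < percent * 100  # percent=1.0 selects buckets 0..99


def pipeline_version(vehicle_id: str, config: dict) -> str:
    """Pick the pipeline version for a vehicle from a version-controlled config."""
    if in_cohort(vehicle_id, config["cohort_percent"]):
        return config["candidate_version"]
    return config["stable_version"]


# Hypothetical config as it might live in the IaC repository.
config = {"stable_version": "v41", "candidate_version": "v42", "cohort_percent": 1.0}
# Rollback is one value change, reviewed and applied like any other change.
rolled_back = {**config, "cohort_percent": 0.0}
```

Hashing the vehicle ID with a salt keeps a vehicle in the same cohort across restarts, which makes a misbehaving change easier to trace to a fixed slice of the fleet.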

#### Step 3: Make the schema tolerate both firmware versions

Most of the fleet is on older firmware; new sensors can't break ingestion for the cars still on it. The contract is that schema changes are additive: new fields are optional and default to null at ingest, and old fields stay in the schema even after newer firmware stops emitting them. Downstream tables ingest both versions into the same table with one row shape: new fields are populated for new firmware and null for old. Splitting old-firmware data into a separate table doubles the ETL surface every time the firmware changes.
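The additive contract can be checked mechanically. Below is a minimal sketch, assuming a schema is represented as a mapping from field name to a required flag; the field names (`speed_kph`, `tire_pressure_kpa`) and both helper functions are hypothetical. A real deployment would more likely lean on a schema registry's compatibility checks than on hand-rolled code.

```python
# Field name -> required flag. Field names are illustrative, not from a real firmware spec.
SCHEMA_V1 = {"vehicle_id": True, "ts": True, "speed_kph": True}
SCHEMA_V2 = {**SCHEMA_V1, "tire_pressure_kpa": False}  # new sensor, optional


def is_additive(old: dict, new: dict) -> bool:
    """True when `new` keeps every old field and only adds optional ones."""
    keeps_old = all(name in new for name in old)
    new_are_optional = all(
        not required for name, required in new.items() if name not in old
    )
    return keeps_old and new_are_optional


def to_row(event: dict, schema: dict) -> dict:
    """Project an event from any firmware version onto one row shape."""
    missing = [n for n, required in schema.items() if required and n not in event]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    # Optional fields that older firmware never sends default to None (null).
    return {name: event.get(name) for name in schema}


old_fw = {"vehicle_id": "v1", "ts": 1700000000, "speed_kph": 62.0}
new_fw = {**old_fw, "vehicle_id": "v2", "tire_pressure_kpa": 230.0}
rows = [to_row(old_fw, SCHEMA_V2), to_row(new_fw, SCHEMA_V2)]
```

Running `is_additive` in CI against the previously deployed schema is one way to turn the contract into a gate instead of a convention.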

#### Step 4: Precise location restricted, coarse location exposed

Vehicle location is regulated personal data. Land precise coordinates in cold storage with a retention rule that drops them at the edge of the regulatory window, and make them reachable only through a controlled, audited query path. Expose only a coarse location (a spatial cell rather than exact coordinates) on the warehouse table that analytics and predictive maintenance read. The platform enforces who can read which representation; a 'we trust the analytics team to filter precise lat/long' posture is what fails the audit.
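One way to picture the split is a function that emits two rows per event: a warehouse row with only a grid cell, and an archive row with the precise coordinates. This is a sketch under assumptions: the 0.1-degree grid and the 90-day retention are placeholders rather than regulatory figures, and `coarse_cell`, `split_location`, and `is_expired` are hypothetical helpers.

```python
import math
from datetime import datetime, timedelta, timezone

CELL_DEG = 0.1                  # placeholder grid size, in degrees
RETENTION = timedelta(days=90)  # placeholder window, not a legal figure


def coarse_cell(lat: float, lon: float, cell_deg: float = CELL_DEG) -> tuple:
    """Snap a coordinate to integer grid-cell indices."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def split_location(event: dict) -> tuple:
    """Return (warehouse_row, archive_row) for one event."""
    lat, lon = event["lat"], event["lon"]
    base = {k: v for k, v in event.items() if k not in ("lat", "lon")}
    warehouse_row = {**base, "cell": coarse_cell(lat, lon)}  # no precise coordinates
    archive_row = {**base, "lat": lat, "lon": lon}           # restricted store only
    return warehouse_row, archive_row


def is_expired(recorded_at: datetime, now: datetime,
               retention: timedelta = RETENTION) -> bool:
    """True when an archive row has aged past the retention window."""
    return now - recorded_at > retention


event = {"vehicle_id": "v1", "lat": 37.7749, "lon": -122.4194}
warehouse_row, archive_row = split_location(event)
```

Because the warehouse row never contains `lat` or `lon`, the privacy boundary is a property of the layout and does not depend on downstream queries filtering correctly.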

---

### The shape that fits

> **What this design gives up**
>
> Two paths from one ingest is two systems to operate. IaC promotion infrastructure is upfront work whose ROI the team won't see until the first change goes wrong. Schema tolerance discipline slows feature work because every change has to consider compatibility. A restricted precise-location path adds an access-control layer to what would otherwise be a normal table. Simplicity is given up in every direction; what comes back is an operational model that survives the next firmware rollout, audit, or production incident.

> **What reviewers check**
>
> A reviewer looks at the canvas for these properties:
> - A streaming path serves safety events within seconds; a batch path serves predictive maintenance on a slower cadence.
> - Precise vehicle location lives in cold storage with retention rules; analytics reads only a coarsened location.

> **The mistake that ships**
>
> The first version out the door uses one Flink job, one warehouse table, and a deploy script that lives in someone's home directory. A new firmware rollout changes the schema and the Flink job throws on old-firmware events; the on-call engineer pushes a hotfix straight to prod and breaks safety alerts on every car for an hour. Six months later, a regulator asks for a list of vehicles that had a precise location recorded outside the retention window; the answer is 'we'd have to look,' and legal escalates. The remediation arrives as three separate projects: an environments-and-IaC promotion path, an additive schema contract enforced in code, and a precise-location archive with a retention rule. Each was avoidable had it been a property of the original design.

---

## Common follow-up questions

- A safety bug is found in production after a small-cohort rollout. What does rollback look like in this design, and what doesn't it cover? _(Tests whether the candidate sees that rollback in IaC handles the code path, but not the data path: events already written to the safety store and the event lake during the bad rollout still need a remediation step (replay, soft-delete, or a dedup correction).)_
- The maintenance team asks for vehicle-level location traces over a recent window for predictive failure analysis. What do you change in the design? _(Tests whether the candidate keeps the privacy boundary intact: the traces come from the precise-location archive through the controlled access path, not by exposing precise location in the maintenance warehouse. The maintenance team's access has to go through the same audited path everyone else does.)_
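
The data-path remediation raised in the rollback question can be sketched as follows. This assumes every written event is stamped with the `pipeline_version` that produced it and carries an `event_id`; both fields and the `remediate` helper are hypothetical, and the list stands in for the safety store or event lake.

```python
def remediate(events: list, bad_version: str) -> list:
    """Soft-delete events written by `bad_version`; return the IDs to replay."""
    to_replay = []
    for event in events:
        if event["pipeline_version"] == bad_version and not event.get("deleted", False):
            event["deleted"] = True  # soft-delete: keep the record for audit
            to_replay.append(event["event_id"])
    return to_replay


store = [
    {"event_id": "e1", "pipeline_version": "v41"},
    {"event_id": "e2", "pipeline_version": "v42"},
    {"event_id": "e3", "pipeline_version": "v42"},
]
replay_ids = remediate(store, bad_version="v42")
```

Stamping the version at write time is what makes the affected rows findable after the code path has already been rolled back; a second call with the same arguments finds nothing left to do.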

---

Source: DataDriven (https://datadriven.io).