# The Fare Aggregator

> Airfares shift every minute. Catch the best ones.

Canonical URL: <https://datadriven.io/problems/the_fare_aggregator>

Domain: Pipeline Design · Difficulty: medium · Seniority: L5

## Problem

We aggregate airfares from dozens of airline APIs and global distribution systems, and we check 80 billion prices per day to power flight search for 100 million users. The problem is that every GDS and direct airline API has a completely different schema: one calls it departure_time, another calls it dep_utc, another sends it as a Unix timestamp. Our prices also go stale within seconds during booking surges. Design a pipeline that keeps prices fresh and routes users away from sold-out flights before they see a failed booking.

## Worked solution and explanation

### Why this problem exists in real interviews

The task is to aggregate airfares from many GDSs and airline APIs into one search experience that responds within a couple of seconds, and to deliver disruption alerts to users who are actively on the booking page. The trap is twofold: letting search and booking branch per source, and treating disruption as a separate batch system that learns about cancellations after users do.

The default design lets search and booking branch on source schema. Every source change then ripples through both, and the first format change takes a quarter to land. Disruption notifications run in a separate nightly system, so users who are on the booking page when a flight is cancelled find out at the airport.

> **Trick to Solving**
>
> Canonical fare shape on the bus, streaming search reads canonical only, disruption events stream to active sessions.
> 
> 1. Each source's events normalize to a canonical fare shape on the bus; downstream search and booking read one schema regardless of the source.
> 2. Search reads from a low-latency store fed by the streaming canonicalizer; the price the user sees is fresh, not stale-cached.
> 3. Disruption events ride the same bus and route to a session-aware alert path; users on the booking page for the affected route get notified before they pay.

---

### Walk the requirements

#### Step 1: Canonical fare shape on the bus; downstream reads one schema

Each source's events normalize to a canonical (origin, destination, departure_time, fare, taxes, source) shape at ingest; the bus carries canonical events. Search and booking read the canonical shape; a new source adds a normalizer mapping, not a search-or-booking change. A 'branch on source' design is the version where every source change ripples through every consumer; canonical-up-front is what keeps the consumers stable.
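The normalizer mapping described above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the source names (`gds_a`, `airline_b`, `airline_c`), every raw field name other than `departure_time` and `dep_utc`, and the cents-based price encoding are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from decimal import Decimal


@dataclass(frozen=True)
class CanonicalFare:
    origin: str
    destination: str
    departure_time: datetime  # timezone-aware, always UTC
    fare: Decimal
    taxes: Decimal
    source: str


def _iso_to_utc(value: str) -> datetime:
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption: a naive timestamp from a source is already in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def _normalize_gds_a(raw: dict) -> CanonicalFare:
    # Hypothetical source that happens to use the canonical field names.
    return CanonicalFare(
        origin=raw["origin"], destination=raw["destination"],
        departure_time=_iso_to_utc(raw["departure_time"]),
        fare=Decimal(str(raw["fare"])), taxes=Decimal(str(raw["taxes"])),
        source="gds_a",
    )


def _normalize_airline_b(raw: dict) -> CanonicalFare:
    # Hypothetical source that sends the departure as `dep_utc`.
    return CanonicalFare(
        origin=raw["from"], destination=raw["to"],
        departure_time=_iso_to_utc(raw["dep_utc"]),
        fare=Decimal(str(raw["base"])), taxes=Decimal(str(raw["tax"])),
        source="airline_b",
    )


def _normalize_airline_c(raw: dict) -> CanonicalFare:
    # Hypothetical source that sends a Unix timestamp and integer cents.
    return CanonicalFare(
        origin=raw["orig"], destination=raw["dest"],
        departure_time=datetime.fromtimestamp(raw["dep_ts"], tz=timezone.utc),
        fare=Decimal(raw["price_cents"]) / 100,
        taxes=Decimal(raw["tax_cents"]) / 100,
        source="airline_c",
    )


# Adding a source means adding one entry here; consumers never see raw shapes.
NORMALIZERS = {
    "gds_a": _normalize_gds_a,
    "airline_b": _normalize_airline_b,
    "airline_c": _normalize_airline_c,
}


def normalize(source: str, raw: dict) -> CanonicalFare:
    """Map one raw source event to the canonical shape (KeyError if unknown)."""
    return NORMALIZERS[source](raw)
```

Keeping the per-source logic in a registry like this means a new source is one new function plus one dictionary entry, which is the "normalizer mapping, not a search-or-booking change" property the step describes.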

#### Step 2: Search results within the page's latency budget against fresh prices

Search queries hit a low-latency store fed by the streaming canonicalizer. The price the user sees is the latest canonical price, not a stale cache; booking confirms against the same store so the price doesn't change between search and book. A request-time aggregation across sources is the version where the page hangs at peak; pre-canonicalized fresh prices in a search-sized store is what makes search feel instant.
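A rough sketch of the "latest canonical price, confirmed against the same store" idea follows. The in-memory dictionary stands in for whatever low-latency store is actually used; the class name `FreshPriceStore`, the `fare_key` format, and the `max_age_seconds` freshness cutoff are all assumptions introduced for illustration.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class PricePoint:
    fare_key: str        # hypothetical key, e.g. "JFK-LHR-2025-03-01T08:30Z"
    total: Decimal       # fare plus taxes
    observed_at: float   # epoch seconds when the source quoted this price


class FreshPriceStore:
    """In-memory stand-in for a low-latency store fed by the canonicalizer."""

    def __init__(self, max_age_seconds: float):
        self._max_age = max_age_seconds
        self._latest = {}  # fare_key -> PricePoint

    def upsert(self, point: PricePoint) -> bool:
        """Keep only the newest observation; ignore late, out-of-order events."""
        current = self._latest.get(point.fare_key)
        if current is not None and current.observed_at >= point.observed_at:
            return False
        self._latest[point.fare_key] = point
        return True

    def get(self, fare_key: str, now: float) -> Optional[PricePoint]:
        """Return the latest price, or None if it is missing or too old to show."""
        point = self._latest.get(fare_key)
        if point is None or now - point.observed_at > self._max_age:
            return None
        return point

    def confirm(self, fare_key: str, quoted_total: Decimal, now: float) -> bool:
        """Booking checks the quoted price against the same store search read."""
        point = self.get(fare_key, now)
        return point is not None and point.total == quoted_total
```

Because search and booking both go through `get`, a price that has changed or aged out since the user saw it fails `confirm` instead of surfacing later as a failed booking.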

#### Step 3: Disruption events route to active booking sessions

When a flight is delayed or cancelled, the airline emits a disruption event onto the bus. A session-aware consumer matches it to users currently on the booking page for the affected route and pushes an alert before they pay. A 'nightly disruption batch' is the version where users find out at the airport; the streaming path matched to active sessions is what makes the alert actionable.
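The session-aware matching described above might look roughly like this. It is a sketch under stated assumptions: the `ActiveSessionIndex` class, the event dictionary fields (`flight`, `origin`, `destination`, `event_type`), and the `push` callback are hypothetical names, and a real deployment would keep this state in a shared store rather than process memory.

```python
from collections import defaultdict


class ActiveSessionIndex:
    """Tracks which sessions are on the booking page, indexed by route."""

    def __init__(self):
        self._by_route = defaultdict(set)  # (origin, destination) -> session ids
        self._route_of = {}                # session id -> (origin, destination)

    def enter_booking(self, session_id: str, origin: str, destination: str) -> None:
        self.leave_booking(session_id)  # a session is on one route at a time
        route = (origin, destination)
        self._by_route[route].add(session_id)
        self._route_of[session_id] = route

    def leave_booking(self, session_id: str) -> None:
        route = self._route_of.pop(session_id, None)
        if route is None:
            return
        self._by_route[route].discard(session_id)
        if not self._by_route[route]:
            del self._by_route[route]

    def sessions_on(self, origin: str, destination: str) -> set:
        # .get avoids creating empty entries for routes nobody is booking.
        return set(self._by_route.get((origin, destination), ()))


def route_disruption(event: dict, index: ActiveSessionIndex, push) -> int:
    """Push an alert to every active session on the affected route."""
    targets = index.sessions_on(event["origin"], event["destination"])
    for session_id in sorted(targets):
        push(session_id, f"Flight {event['flight']} is {event['event_type']}")
    return len(targets)
```

The index is keyed by route so that matching a disruption event is a single lookup rather than a scan over every open session.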

---

### The shape that fits

> **What this design gives up**
>
> The costs are threefold: every source must be mapped to the canonical shape at ingest; the search store is sized for query latency at peak, which is more expensive than a warehouse; and session-aware disruption matching needs active-session state. Implementation cost is the price; the win is search that doesn't branch on source, prices that aren't stale, and disruption alerts that reach users before they pay.

> **What reviewers check**
>
> A reviewer looks at the canvas for these properties:
> - An event bus carries canonical fare events from all sources; search and booking read one schema.
> - A streaming path serves search results within the page's latency budget against fresh prices.
> - Disruption events route to active booking sessions for the affected route within seconds.

> **The mistake that ships**
>
> What gets shipped lets search and booking branch on source schema and runs disruption notifications as a nightly batch. Every source change ripples through every consumer; the first format update costs a quarter. Users on the booking page when a flight cancels find out at the airport. The eventual rebuild adds canonicalization, the search-sized store, and session-aware disruption.

---

## Common follow-up questions

- A new GDS source signs on with a more complex tax breakdown than the canonical shape. What in this design lets it land without changing search or booking? _(Tests whether the candidate sees the canonical shape's tax breakdown either accommodating the new source's structure (additive evolution) or the canonicalizer mapping the new source down to the canonical shape with the extra detail dropped or summarized. Search and booking don't change; the canonicalizer absorbs the difference.)_
- Disruption events for a route burst as a major weather event hits. What does this design do, and how does it avoid notifying users twice? _(Tests whether the candidate sees the disruption consumer dedup on (flight, event_type) and notify each session once per route per event. Active-session state tracks notified status. Without dedup, users would get repeated notifications and tune them out.)_
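For the second follow-up, the dedup on (flight, event_type) with per-session notified status could be sketched like this. The `DisruptionNotifier` class and the `push` callback signature are hypothetical, and the notified set here is unbounded in memory; a real system would likely expire entries when a session ends or after a time limit.

```python
class DisruptionNotifier:
    """Notifies each active session at most once per (flight, event_type)."""

    def __init__(self):
        self._notified = set()  # (session_id, flight, event_type)

    def notify_once(self, session_ids, flight: str, event_type: str, push) -> int:
        """Push to sessions not yet told about this event; return how many."""
        sent = 0
        for session_id in session_ids:
            key = (session_id, flight, event_type)
            if key in self._notified:
                continue  # duplicate event in the burst; stay quiet
            self._notified.add(key)
            push(session_id, flight, event_type)
            sent += 1
        return sent

    def forget_session(self, session_id: str) -> None:
        """Drop notified status when a session leaves the booking page."""
        self._notified = {k for k in self._notified if k[0] != session_id}
```

A burst of identical cancellation events for one flight then produces one alert per session, while a later event of a different type for the same flight (for example, a delay followed by a cancellation) still gets through.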

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.