# Airline Flight Operations Schema

> Flights, passengers, and routes. Before you draw a single table, tell me the grain.

Canonical URL: <https://datadriven.io/problems/airline_flight_operations_schema>

Domain: Data Modeling · Difficulty: medium · Seniority: L5

## Problem

We're an airline and we want to build a data warehouse to analyze flight operations. We need to understand on-time performance, load factor, and revenue per seat mile. Finance, operations, and product all consume this data. Design the data model.

## Worked solution and explanation

### Why this problem exists in real interviews

This probes whether a candidate can **declare the correct grain under conflicting stakeholder asks**. Finance wants revenue per seat mile, operations wants on-time performance, and product wants passenger journeys. Each implies a different grain, and a strong candidate notices the collision before they sketch tables.

> **Trick to Solving**
>
> Before drawing any tables, a strong candidate asks: is the unit of analysis a flight leg, a passenger segment, or a booking? The signal here is that operations metrics (load factor, departure delay) live at the flight leg grain, while passenger journeys live at the segment grain. One defensible approach is two fact tables that share conformed dimensions.
> 
> 1. Spot the two distinct grains inside one prompt
> 2. Declare `fact_flights` at one row per scheduled flight leg
> 3. Declare `fact_ticket_segments` at one row per passenger per leg
> 4. Conform `dim_routes`, `dim_aircraft`, and `dim_date` across both

---

### Break down the requirements

#### Step 1: Declare the grain first

State it out loud: `fact_flights` is one row per flight leg, identified by flight number, origin, destination, and scheduled departure date. Every measure on that table must be additive at that grain.

#### Step 2: Separate passenger from flight facts

Passenger to flight is many-to-many. A single leg carries hundreds of passengers, and a single itinerary spans multiple legs. Modeling them in one fact forces a double count on either load factor or revenue.

#### Step 3: Store the ingredients, not the ratio

Load factor is `passengers_boarded / seats_available`. Store both columns. A pre-computed percentage cannot be re-aggregated across routes or days without losing correctness.

#### Step 4: Model delay cause as a dimension attribute

Carrier, weather, and NAS delays are categorical reasons attached to the leg. They belong as attributes on the fact row (or as a junk dimension) so analysts can slice performance without joining a bridge table.

#### Step 5: Conform dimensions across both facts

`dim_aircraft`, `dim_routes`, `dim_airports`, and `dim_date` are shared. Conforming them lets finance and operations agree on the same airport and the same calendar without reconciliation meetings.

---

### The solution

Below is one conceptually sound approach: a classic Kimball star with two fact tables at two clean grains. The grain declaration anchors the rest of the design.

> **Why this works**
>
> Two facts, one shared dimension set. Load factor rolls up cleanly from `fact_flights`, revenue per seat mile rolls up from `fact_ticket_segments` joined to `dim_routes.distance_miles`, and no query double counts because each fact owns a distinct grain. The canonical trade-off is storage for clarity: duplicating the flight key on every segment costs bytes but keeps the two grains composable.

> **Interviewers watch for**
>
> A strong candidate states the grain before touching the whiteboard. They also volunteer that storing `load_factor` as a percentage is a silent correctness bug when rolled up across routes. Weak candidates collapse passengers and flights into a single fact and then struggle to explain why on-time performance suddenly depends on revenue rows.

> **Common pitfall**
>
> Treating every operational metric as a column on one wide flight table. Load factor stays correct, but revenue per seat mile becomes impossible to compute without exploding the flight row by passenger count, which breaks every delay query downstream.

---

### The analysis pattern

**Load factor and RASM by route and month**

```sql
SELECT
    r.route_key,
    DATE_TRUNC('month', d.full_date) AS month,
    SUM(f.passengers_boarded)::NUMERIC / NULLIF(SUM(f.seats_available), 0) AS load_factor,
    SUM(ts.revenue_usd) / NULLIF(SUM(f.seats_available * r.distance_miles), 0) AS rasm
FROM fact_flights f
JOIN dim_routes r ON r.route_key = f.route_key
JOIN dim_date d ON d.date_key = f.scheduled_dep_date_key
JOIN fact_ticket_segments ts ON ts.flight_key = f.flight_key
GROUP BY r.route_key, DATE_TRUNC('month', d.full_date)
```

---

### Trade-offs and alternatives

**Two-fact star**

Clean grain per fact, conformed dimensions, additive measures. Cost: revenue queries require a join, and ETL has to maintain two load paths. Evolves well when new passenger attributes arrive.

**Single wide fact**

One row per passenger per leg with flight attributes denormalized on. Lower join cost for passenger-level queries. Cost: operational metrics must be computed with `COUNT(DISTINCT flight_key)`, double counting is one `GROUP BY` slip away, and schema churn on flight-level attributes rewrites billions of rows.

---

## Common follow-up questions

- How would you handle a codeshare flight where the marketing carrier differs from the operating carrier? _(Tests whether the candidate separates the leg (operating) from the ticket segment (marketing) grain.)_
- What if `dim_aircraft` is repainted and re-registered mid-year, changing tail number and seat count? _(Tests SCD Type 2 reasoning on equipment dimensions that drive load factor.)_
- How would you backfill `fact_flights` when delay codes arrive 48 hours late from the FAA feed? _(Tests late-arriving dimension handling and whether the candidate considers a reprocessing window.)_
- At 30k flight legs per day, how would you partition `fact_flights` for a year-over-year comparison dashboard? _(Tests partitioning strategy: by scheduled_dep_date_key for pruning on time-series queries.)_
- How does your model change when the airline introduces a standby passenger status that books after boarding begins? _(Tests schema evolvability and whether the grain still holds under new booking semantics.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/airline_flight_operations_schema)
- [Data Modeling Interview Questions](https://datadriven.io/data-modeling-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.