# Bikes Before Rush Hour

> Bikes in, bikes out. The city needs to predict demand.

Canonical URL: <https://datadriven.io/problems/bikes_before_rush_hour>

Domain: Pipeline Design · Difficulty: hard · Seniority: L5

## Problem

We run a bike-share network across dozens of cities and need to predict hourly bike demand at each station so we can pre-position bikes before rush hour. The raw data comes from multiple city operators and external sources in different schemas and formats. Design the end-to-end pipeline from raw ingestion through to model-ready features.

## Worked solution and explanation

### Why this problem exists in real interviews

Hourly demand prediction across many cities is a feature pipeline, not a one-off training job. The trap is treating it like one big batch run that pulls every city, transforms them all together, and writes features at the end. Cities arrive on different clocks, a new city's backfill is huge, and the model needs features ready every hour, not whenever the slowest city finishes.

The default shape is one DAG that fans in across cities, runs ingest then quality then feature computation in a chain, and writes one feature table at the end. It works for the first few cities. Then one city's file is late, the whole DAG waits, the hourly window is missed, and the rebalancing team doesn't have predictions in time. Onboard a new city with a year of history and the backfill bogs down the same compute the live hourly run depends on.

> **Trick to Solving**
>
> Per-city, per-hour: each city is its own ingest task, and live runs and backfills don't share a compute pool.
>
> 1. Each city is its own ingest task with its own sensor; the orchestrator decides what's ready to run, not a top-level wait-for-all.
> 2. Backfill and live run on the same DAG but on different worker pools so a backfill can't starve the next hourly run.
> 3. Features land in an online lookup tier / warehouse table the model reads on the hour, not in a file the model has to load and recompute from at prediction time.

---

### Walk the requirements

#### Step 1: Run hourly with the orchestrator owning the cadence

Operations needs the forecast before the next rush hour, which means the pipeline runs hourly and lands features before the next prediction run. The orchestrator owns the schedule: every hour, sensors fire for each city's available data, ingest runs per city as the data is ready, feature computation runs immediately after ingest for that city, and the model reads from the online lookup tier on the hour. Without an orchestration layer there's nowhere for the hourly cadence, the per-city sensors, or the alerting to live.
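
The hourly cadence can be sketched as two small functions: one that derives the window a run is responsible for, and one that stands in for the per-city sensors. This is a minimal sketch, not a specific orchestrator's API. The names `hourly_window` and `ready_cities` are hypothetical, and the readiness rule (a city is ready once its newest record is at or past the end of the window) is an assumption made for illustration.

```python
from datetime import datetime, timedelta, timezone


def hourly_window(run_time: datetime) -> tuple[datetime, datetime]:
    """Return the [start, end) bounds of the last fully completed hour."""
    end = run_time.replace(minute=0, second=0, microsecond=0)
    return end - timedelta(hours=1), end


def ready_cities(arrivals: dict[str, datetime], window_end: datetime) -> list[str]:
    """Hypothetical sensor check, evaluated per city.

    `arrivals` maps city -> timestamp of the newest record received so far.
    Assumption: a city is ready once that timestamp reaches the window end.
    """
    return sorted(city for city, newest in arrivals.items() if newest >= window_end)


if __name__ == "__main__":
    run_time = datetime(2024, 5, 6, 7, 12, tzinfo=timezone.utc)
    start, end = hourly_window(run_time)
    arrivals = {
        "amsterdam": datetime(2024, 5, 6, 7, 3, tzinfo=timezone.utc),  # caught up
        "berlin": datetime(2024, 5, 6, 6, 40, tzinfo=timezone.utc),    # still behind
    }
    print(start.isoformat(), end.isoformat(), ready_cities(arrivals, end))
```

In a real orchestrator the schedule and sensors would be declared in its own configuration; the point here is only that readiness is decided per city against a fixed hourly window.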

#### Step 2: One ingest task per city, not one DAG that fans in

Each city has its own ingest task and its own sensor; none of the city tasks know about each other. Feature computation downstream takes whichever cities finished this hour and skips the ones that didn't, with an alert on the missing city. When one operator's file is late, the other cities have already finished ingest, their features are in the online lookup tier, and the model has predictions for them on time. A single fan-in step at the top is what lets one late city block all the others.
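
The skip-and-alert behaviour can be sketched in a few lines. This is an illustrative sketch under stated assumptions: `DataNotReady`, `HourResult`, `run_hour`, and `trips_per_station` are hypothetical names, rows are assumed to be `(station_id, trips)` pairs, and the loop stands in for what an orchestrator would run as independent tasks.

```python
from dataclasses import dataclass, field


class DataNotReady(Exception):
    """Raised by a (hypothetical) ingest task when a city's data has not landed."""


@dataclass
class HourResult:
    features: dict = field(default_factory=dict)  # city -> {station_id: trips}
    alerts: list = field(default_factory=list)    # one message per missing city


def trips_per_station(rows):
    """Toy feature: total trips per station for the hour."""
    totals = {}
    for station_id, trips in rows:
        totals[station_id] = totals.get(station_id, 0) + trips
    return totals


def run_hour(cities, ingest, compute_features) -> HourResult:
    """Process each city independently; a late city is skipped and alerted on."""
    result = HourResult()
    for city in cities:
        try:
            rows = ingest(city)
        except DataNotReady:
            result.alerts.append(f"{city}: no data for this hour")
            continue  # other cities are unaffected
        result.features[city] = compute_features(rows)
    return result
```

Calling `run_hour` with an `ingest` callable that raises `DataNotReady` for one city yields features for every other city plus a single alert, which is the isolation property the per-city design is after.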

#### Step 3: Backfill on a separate worker pool

Onboarding a new city with historical data is a backfill: a long-running job that processes many hours of data. The orchestrator routes backfill tasks to a separate worker pool (or queue, or executor) from the live hourly run. The live run gets its compute on the hour; the backfill runs in parallel on its own pool and finishes when it finishes. Sharing a pool means a backfill drains the live workers and the hourly forecast misses its window.
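
The routing rule can be illustrated with Python's standard `concurrent.futures.ThreadPoolExecutor` standing in for the orchestrator's worker pools. This is a sketch only: `make_pools`, `submit`, and `process_hour` are hypothetical names, the worker counts are arbitrary, and a production setup would use the orchestrator's own pool or queue mechanism rather than threads in one process.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor


def make_pools(live_workers: int = 4, backfill_workers: int = 2) -> dict:
    """Create two independent pools so backfill load cannot consume live capacity."""
    return {
        "live": ThreadPoolExecutor(max_workers=live_workers, thread_name_prefix="live"),
        "backfill": ThreadPoolExecutor(max_workers=backfill_workers, thread_name_prefix="backfill"),
    }


def submit(pools: dict, is_backfill: bool, fn, *args) -> Future:
    """Route a task to the backfill pool or the live pool, never a shared one."""
    pool = pools["backfill" if is_backfill else "live"]
    return pool.submit(fn, *args)


def process_hour(city: str, hour: str) -> str:
    """Placeholder for ingest plus feature computation; reports which pool ran it."""
    return f"{city}@{hour} on {threading.current_thread().name}"


if __name__ == "__main__":
    pools = make_pools()
    live = submit(pools, False, process_hour, "amsterdam", "2024-05-06T07")
    old = submit(pools, True, process_hour, "lisbon", "2023-05-06T07")
    print(live.result())
    print(old.result())
    for pool in pools.values():
        pool.shutdown()
```

However many backfill hours are queued, they wait on the backfill pool's workers; the live pool's capacity is untouched when the next hourly run starts.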

---

### The shape that fits

> **What this design gives up**
>
> Per-city ingest tasks mean the DAG is wider and the orchestration config is heavier than a single fan-in job. Two worker pools (live vs backfill) cost more idle capacity than one shared pool. What that idle capacity buys is predictions that ship every hour even when one city is late, and new-city onboarding that doesn't break the live cities.

> **What reviewers check**
>
> A reviewer looks at the canvas for these properties:
> - An orchestration layer schedules per-city ingest, with per-city sensors and alerting before each hourly window.
> - Backfill compute runs on a separate worker pool from the live hourly run.
> - Features land in a queryable warehouse / online lookup tier the model reads on the hour.

> **The mistake that ships**
>
> What ends up in production uses one DAG with a top-level wait-for-all-cities, runs everything on a single Spark cluster, and writes features to a daily-refreshed CSV in S3 that the model reads at prediction time. The first time one operator's file is delayed, the entire hourly run is stuck and the rebalancing team operates from yesterday's forecast. The first new-city backfill saturates the cluster, the live hourly forecast misses its window, and operations starts moving bikes manually on intuition. The team rebuilds with per-city tasks and split pools, which is what the orchestration was supposed to give them.

---

## Common follow-up questions

- Two cities use the same operator's API but with different schemas. How does the per-city ingest design handle that? _(Tests whether the candidate sees that 'per-city task' is a unit of failure isolation, not a unit of code. The two cities can share the ingest code with different config; the isolation that matters is in the task instance and its sensor, not in copying the script.)_
- The model team asks to retrain on the last 90 days of features. Where does that data come from, and what does the retrain need from the orchestrator? _(Tests whether the candidate treats the online lookup tier as the source for both serving and training. The retrain is just another consumer of the online lookup tier, scheduled on a slower cadence and routed (like backfills) to its own worker pool so it doesn't compete with the hourly live run.)_
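
For the first follow-up, the idea of shared ingest code with per-city config can be sketched as a field map applied by one function. The city names, source field names, and the canonical schema (`station_id`, `started_at`) below are all hypothetical; the source does not specify any of them.

```python
# Hypothetical per-city config: canonical field name -> field name in that city's payload.
CITY_FIELD_MAPS = {
    "amsterdam": {"station_id": "stationId", "started_at": "startTime"},
    "berlin": {"station_id": "station", "started_at": "begin"},
}


def normalize(city: str, record: dict) -> dict:
    """Map one raw record into the canonical schema using the city's field map."""
    field_map = CITY_FIELD_MAPS[city]
    return {canonical: record[source] for canonical, source in field_map.items()}


if __name__ == "__main__":
    print(normalize("amsterdam", {"stationId": "A1", "startTime": "2024-05-06T07:05:00Z"}))
    print(normalize("berlin", {"station": "B9", "begin": "2024-05-06T07:06:00Z"}))
```

Both cities run the same `normalize` code, yet each still gets its own task instance and sensor, so a schema problem in one city's feed fails only that city's task.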

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.