# The Models Going Stale

> The model is only as good as what you feed it.

Canonical URL: <https://datadriven.io/problems/the_models_going_stale>

Domain: Pipeline Design · Difficulty: hard · Seniority: L6

## Problem

Our data science team has trained several risk models in SageMaker but they go stale quickly because features aren't refreshed fast enough. We need a proper feature pipeline that keeps the feature store current so models can serve accurate predictions at low latency. Design the feature pipeline and feature store architecture.

## Worked solution and explanation

### Why this problem exists in real interviews

A feature pipeline succeeds or fails on four properties together: per-model freshness, training-serving consistency, point-in-time correctness for retraining, and audit traceability for every decision. The trap is treating the online lookup tier as a key-value cache; what's actually needed is a shared definition layer feeding both an online store sized for inference and an offline store sized for training and audit.

The default reach is one batch job that computes features nightly into a key-value store the model reads at inference. Credit risk under-performs because activity from the last hour isn't there. The data science team retrains by joining the warehouse's current feature values onto historical events; the model's offline metrics look great and production performance is worse, because the training features used today's values for old events. An audit asks 'what features produced this decision' and the answer is a reconstruction.

> **Trick to Solving**
>
> One feature definition, two stores sized for two consumers, point-in-time joins for training, decision logging for audit.
> 
> 1. A shared feature definition (code, not a wiki) computes the value once. The online store is fed by streaming or near-real-time updates; the offline store is fed by the same definition on a slower path.
> 2. Training joins event-time features against the offline store's history with point-in-time semantics: for each historical event, the feature values that were available then.
> 3. Each model decision logs the features and the feature version used at scoring time, so an audit query returns the answer rather than reconstructs it.
> 4. Per-model freshness is set on the feature, not the pipeline. A feature used by credit risk refreshes within minutes; the same feature for a slower model can read the same store at a slower rate.

---

### Walk the requirements

#### Step 1: Refresh each model's features on its own cadence

Credit risk wants features within minutes; recommendations tolerate an hour; account closure refreshes nightly. The pipeline runs each feature on a cadence set by the most freshness-sensitive model that uses it: the activity feature credit risk needs runs on a streaming path, the recommendation feature runs hourly, and account-closure features refresh once a day. Forcing every feature onto the streaming path is over-engineering; forcing every feature onto the nightly batch is exactly what makes credit risk go stale.
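
A minimal sketch of per-feature cadence, reading "matched to the model" as the tightest freshness need among the models that consume a feature. The feature names, the consumer mapping, the five-minute figure for "within minutes", and the path labels are all assumptions made for illustration.

```python
from datetime import timedelta

# Hypothetical freshness needs per model; "within minutes" is assumed to be 5.
MODEL_FRESHNESS = {
    "credit_risk": timedelta(minutes=5),
    "recommendations": timedelta(hours=1),
    "account_closure": timedelta(days=1),
}

# Hypothetical mapping from each feature to the models that read it.
FEATURE_CONSUMERS = {
    "recent_activity_count": ["credit_risk", "recommendations"],
    "category_affinity": ["recommendations"],
    "days_since_last_login": ["account_closure"],
}


def refresh_interval(feature: str) -> timedelta:
    """A feature must be at least as fresh as its most demanding consumer."""
    return min(MODEL_FRESHNESS[model] for model in FEATURE_CONSUMERS[feature])


def refresh_path(feature: str) -> str:
    """Pick the cheapest path that still meets the feature's interval."""
    interval = refresh_interval(feature)
    if interval < timedelta(hours=1):
        return "streaming"
    if interval < timedelta(days=1):
        return "hourly_batch"
    return "nightly_batch"


for name in FEATURE_CONSUMERS:
    print(name, refresh_interval(name), refresh_path(name))
```

Only the feature shared with credit risk lands on the streaming path; the other two stay on cheaper batch schedules.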

#### Step 2: One feature definition feeding training and serving

Production performance suffers when training features compute differently than serving features. The fix is one feature definition (code in a shared library or feature-store framework) that both paths invoke: the streaming path writes to the online store; the same definition runs on historical data into the offline store. A feature's value at scoring time and at training time agree because the same code produced both. A wiki page describing the feature is what produces the drift; the definition has to be executable.
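
One way to picture an executable definition is a single function that both paths call. The sketch below is an assumption-heavy illustration: `Txn`, `txn_sum_1h`, the in-memory `online` dict, and the `offline` list are hypothetical stand-ins for a real event stream and real stores.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Txn:
    entity_id: str
    ts: datetime
    amount: float


def txn_sum_1h(events, entity_id, as_of):
    """The single definition: an entity's total amount in the hour up to as_of."""
    start = as_of - timedelta(hours=1)
    return sum(
        (e.amount for e in events
         if e.entity_id == entity_id and start < e.ts <= as_of),
        0.0,
    )


def update_online(online, events, entity_id, now):
    """Serving path: overwrite the current value for low-latency lookup."""
    online[entity_id] = txn_sum_1h(events, entity_id, now)


def backfill_offline(offline, events, entity_id, timestamps):
    """Training path: append one timestamped row per as_of, same definition."""
    for ts in timestamps:
        offline.append((entity_id, ts, txn_sum_1h(events, entity_id, ts)))


t0 = datetime(2024, 1, 1, 12, 0)
events = [
    Txn("u1", t0, 10.0),
    Txn("u1", t0 + timedelta(minutes=30), 5.0),
    Txn("u2", t0 + timedelta(minutes=10), 99.0),
]
online, offline = {}, []
update_online(online, events, "u1", t0 + timedelta(minutes=45))
backfill_offline(
    offline, events, "u1",
    [t0 + timedelta(minutes=15), t0 + timedelta(minutes=45)],
)
```

Because both writers go through `txn_sum_1h`, the offline row for a given timestamp and the online value computed at that same timestamp cannot disagree.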

#### Step 3: Point-in-time joins so training uses what we knew then

When data science retrains on a historical event, the training row needs the feature values that were available at the event's time, not today's. The offline store keeps each feature's history with effective timestamps; training does an as-of join on (entity_id, event_time), taking for each event the latest feature value whose effective timestamp is at or before the event time. A model trained on today's values for old events looks great offline (it has seen the future) and disappoints in production. The point-in-time join is what keeps the offline metric an honest estimate of the production metric.
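
The as-of lookup can be sketched with the standard library alone. The `HISTORY` layout (per-entity rows sorted by effective timestamp) and the sample values are hypothetical; a real offline store would do this join at scale, but the semantics are the same.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical offline history: per entity, (effective_ts, value), sorted by time.
HISTORY = {
    "u1": [
        (datetime(2024, 1, 1), 100.0),
        (datetime(2024, 1, 10), 250.0),
        (datetime(2024, 2, 1), 900.0),
    ],
}


def as_of_value(history, entity_id, event_time):
    """Latest value with effective_ts <= event_time, or None if nothing was known."""
    rows = history.get(entity_id, [])
    times = [ts for ts, _ in rows]
    idx = bisect_right(times, event_time)
    return rows[idx - 1][1] if idx > 0 else None


# Historical events to label: (entity_id, event_time).
historical_events = [
    ("u1", datetime(2023, 12, 31)),
    ("u1", datetime(2024, 1, 5)),
    ("u1", datetime(2024, 1, 10)),
    ("u1", datetime(2024, 3, 1)),
]
training_rows = [
    (entity_id, ts, as_of_value(HISTORY, entity_id, ts))
    for entity_id, ts in historical_events
]
```

A naive join against current state would attach 900.0 to every row, including the January events that happened before that value existed.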

#### Step 4: Decision log with feature values and version, queryable

Each model decision writes the feature values and the feature version used at scoring to a decision log, along with the version of the model that scored them. When risk or compliance asks 'why did this loan get the score it did,' the answer is a query against the log: these features had these values, this version of the model scored them. Without the log, the audit answer is a reconstruction that may not match what actually happened. With it, the answer is a SQL query.
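
A small sketch of a queryable decision log, using an in-memory SQLite database as a stand-in for whatever store holds the log. The table name, columns, and sample values are hypothetical; the `model_version` column is an assumption drawn from the text's "this version of the model scored them".

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE decision_log (
        decision_id     TEXT PRIMARY KEY,
        entity_id       TEXT NOT NULL,
        scored_at       TEXT NOT NULL,
        model_version   TEXT NOT NULL,
        feature_version TEXT NOT NULL,
        features_json   TEXT NOT NULL,
        score           REAL NOT NULL
    )
    """
)


def log_decision(conn, decision_id, entity_id, scored_at,
                 model_version, feature_version, features, score):
    """Write exactly what the model saw at scoring time."""
    conn.execute(
        "INSERT INTO decision_log VALUES (?, ?, ?, ?, ?, ?, ?)",
        (decision_id, entity_id, scored_at, model_version,
         feature_version, json.dumps(features, sort_keys=True), score),
    )
    conn.commit()


def explain_decision(conn, decision_id):
    """The audit answer: a lookup, not a reconstruction."""
    row = conn.execute(
        "SELECT model_version, feature_version, features_json, score "
        "FROM decision_log WHERE decision_id = ?",
        (decision_id,),
    ).fetchone()
    if row is None:
        return None
    model_version, feature_version, features_json, score = row
    return {
        "model_version": model_version,
        "feature_version": feature_version,
        "features": json.loads(features_json),
        "score": score,
    }


log_decision(
    conn, "d-001", "u1", "2024-01-01T12:45:00",
    "credit_risk:v7", "txn_sum_1h:v1", {"txn_sum_1h": 15.0}, 0.82,
)
answer = explain_decision(conn, "d-001")
```

The values are stored as they were at scoring time, so later changes to the feature store cannot alter the audit answer.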

---

### The shape that fits

> **What this design gives up**
>
> A shared definition that computes both online and offline values is more code surface than a one-off batch job; the offline history of every feature is more storage than current state alone; the decision log grows with every inference. Implementation cost is the price; the win is models that don't go stale, training that matches production, and an audit answer that's a query.

> **What reviewers check**
>
> A reviewer looks at the canvas for these properties:
> - A feature definition layer computes each feature once and writes to both an online store (low-latency inference) and an offline store (training and audit).
> - Per-feature refresh cadences match the consuming model's freshness need; not every feature runs on the same schedule.
> - Training joins historical events to the offline store with point-in-time semantics so retraining uses the values available then.
> - Each model decision logs the feature values and version used at scoring time.

> **The mistake that ships**
>
> What gets shipped runs nightly batch features into a key-value store and tells data science to query the warehouse for training. Credit risk under-performs because activity from the last hour isn't in the store. The training join uses today's feature values against historical events and the offline metrics overstate production performance. An audit asks for the lineage of a denied loan and the team reconstructs it from logs; the reconstruction differs from the actual decision by enough to matter. The rebuild centres on a shared definition, point-in-time joins, and a decision log; each was reachable in the original conversation if it had gone past 'put features in a key-value store.'

---

## Common follow-up questions

- Data science adds a new feature that requires a slow aggregation that can't run in streaming. What in this design lets credit risk still use it without going stale? _(Tests whether the candidate sees the dual-store as a contract: features that can't be computed in real time are computed on the batch path at their slower cadence and published to the online store; credit risk reads the most recent value at inference time and accepts the staleness budget that feature carries. Different features can have different cadences within the same model.)_
- A scoring service bug overwrites a feature value in the online store. Training runs against the offline store and produces a model that doesn't match production behavior. Where does the design recover from? _(Tests whether the candidate sees the offline store as the source of truth post-event and the online store as derived state. Recovery is replaying the offline history into the online store; the decision log lets the team see which decisions were affected during the bug window.)_
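
The recovery in the second follow-up can be sketched as two small functions: rebuild the online store from offline history, then use the decision log to find decisions scored during the bug window. The row layouts, names, and timestamps below are hypothetical.

```python
from datetime import datetime


def rebuild_online(offline_rows):
    """Derive online state: the latest value per (entity_id, feature)."""
    latest = {}
    for entity_id, feature, ts, value in offline_rows:
        key = (entity_id, feature)
        if key not in latest or ts >= latest[key][0]:
            latest[key] = (ts, value)
    return {key: value for key, (_, value) in latest.items()}


def affected_decisions(decision_log, bug_start, bug_end):
    """Decisions scored while the online store held a corrupted value."""
    return [
        d["decision_id"] for d in decision_log
        if bug_start <= d["scored_at"] < bug_end
    ]


offline_rows = [
    ("u1", "txn_sum_1h", datetime(2024, 1, 1, 10), 10.0),
    ("u1", "txn_sum_1h", datetime(2024, 1, 1, 11), 15.0),
    ("u2", "txn_sum_1h", datetime(2024, 1, 1, 9), 99.0),
]
decision_log = [
    {"decision_id": "d1", "scored_at": datetime(2024, 1, 1, 10, 30)},
    {"decision_id": "d2", "scored_at": datetime(2024, 1, 1, 11, 30)},
    {"decision_id": "d3", "scored_at": datetime(2024, 1, 1, 12, 30)},
]

rebuilt = rebuild_online(offline_rows)
to_review = affected_decisions(
    decision_log, datetime(2024, 1, 1, 11), datetime(2024, 1, 1, 12)
)
```

The rebuild never reads the online store, so whatever the bug wrote there is simply replaced.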

---

Source: DataDriven (https://datadriven.io).