# What Should We Recommend Tonight

> They ordered pad thai twice. That means something.

Canonical URL: <https://datadriven.io/problems/what_should_we_recommend_tonight>

Domain: Pipeline Design · Difficulty: hard · Seniority: L6

## Problem

We run a meal kit delivery service and want to personalize recipe recommendations for each customer. We have two rich data sources: customer order history and our full menu catalog with nutritional and ingredient data. The recommendations team needs a feature store they can query in real time, but right now orders and menu updates live in separate operational systems that are never joined. Design the ingestion and feature pipeline.

## Worked solution and explanation

### Why this problem exists in real interviews

This is a meal-kit recommendations pipeline with a safety constraint inside it: an allergen change has to propagate fast or the service recommends unsafe food. That property forces a streaming path on menu data, alongside checkout-time serving, point-in-time training, and consistent A/B feature reads. The trap is treating recommendations as one nightly batch and safety as a 'we'll be careful' postcondition.

The default reach is one nightly job that joins orders against the current menu and writes a recommendations table. Checkout reads the table and the model scores fine. Then a menu item gets a new allergen at noon, and the change waits for the next nightly run; until then, allergic customers keep getting recommended the unsafe item. Training joins yesterday's orders against today's menu, so the model learns from a future state and offline metrics overstate production performance. A/B variants read from different cached subsets, so the experiment is biased.

> **Trick to Solving**
>
> Online store for tens-of-milliseconds checkout, streaming path for menu changes, point-in-time joins for training, one feature read per user per request shared across variants.
>
> 1. Online store holds the current feature values keyed for tens-of-milliseconds reads at checkout. Both A/B variants read from it through the same call so they get the same values.
> 2. Menu changes (especially allergens) ride a streaming path that updates the online store within minutes; the offline store also gets the change with its effective date.
> 3. Training reads from an offline store with point-in-time joins: for each historical order, the menu and ingredient state at that order's date.
> 4. The same feature definition computes both online and offline values; the streaming and batch paths invoke the same code so the values agree.
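
Point 4 can be sketched as a single feature function that both paths call. This is a minimal illustration rather than a prescribed implementation: `cuisine_counts`, `batch_backfill`, and `StreamingUpdater` are hypothetical names, the order records are assumed to carry `user_id` and `cuisine` fields, and plain dicts stand in for the online and offline stores.

```python
from collections import Counter


def cuisine_counts(orders):
    """Single feature definition: orders per cuisine for one customer."""
    return dict(Counter(order["cuisine"] for order in orders))


def batch_backfill(all_orders):
    """Offline path: group the full order history by user, then apply the definition."""
    by_user = {}
    for order in all_orders:
        by_user.setdefault(order["user_id"], []).append(order)
    return {user_id: cuisine_counts(orders) for user_id, orders in by_user.items()}


class StreamingUpdater:
    """Online path: apply the same definition as each order event arrives."""

    def __init__(self):
        self._orders = {}  # per-user order buffer (stand-in for stream state)
        self.online = {}   # stand-in for the online store

    def on_order(self, order):
        user_orders = self._orders.setdefault(order["user_id"], [])
        user_orders.append(order)
        self.online[order["user_id"]] = cuisine_counts(user_orders)
```

Because both paths route through `cuisine_counts`, replaying the same orders through either one yields the same feature values, which is the agreement the design relies on.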

---

### Walk the requirements

#### Step 1: Checkout-time serving with a tens-of-milliseconds budget

Customer features (order history, dietary preferences) and item features (ingredients, nutritional info) live in an online store keyed for fast point-lookups. At checkout, the recommendation service reads the user's features and the candidate items' features from the online store within tens of milliseconds, the model scores, and the page renders. A 'compute features at request time' design is the version where the page hangs; pre-computed features in an online store sized for the request budget is what makes recommendations feel instant.
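
A rough sketch of that request path, assuming an in-memory dict as a stand-in for the online store and a toy scoring function in place of the real model. The class `OnlineStore`, the field names `cuisine_counts` and `cuisine`, and the `recommend` helper are all hypothetical; the point is only that checkout does key lookups on pre-computed rows and no joins.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class OnlineStore:
    """In-memory stand-in for a low-latency key-value store."""
    users: Dict[str, dict] = field(default_factory=dict)
    items: Dict[str, dict] = field(default_factory=dict)

    def get_user(self, user_id: str) -> dict:
        # Point lookup by user key; unknown users get an empty feature row.
        return self.users.get(user_id, {})

    def get_items(self, item_ids: List[str]) -> Dict[str, dict]:
        # Batched point lookups for the candidate items.
        return {i: self.items[i] for i in item_ids if i in self.items}


def score(user: dict, item: dict) -> int:
    """Toy model: how often the customer ordered this item's cuisine."""
    return user.get("cuisine_counts", {}).get(item["cuisine"], 0)


def recommend(store: OnlineStore, user_id: str, candidate_ids: List[str], top_k: int = 2) -> List[str]:
    user = store.get_user(user_id)
    items = store.get_items(candidate_ids)
    # Highest score first; ties broken by item id so the output is deterministic.
    ranked = sorted(items, key=lambda i: (-score(user, items[i]), i))
    return ranked[:top_k]
```

Nothing in `recommend` computes a feature from raw orders; everything it reads was written ahead of time by the batch or streaming paths.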

#### Step 2: Menu allergen changes propagate within minutes

When a menu item's allergens change, the safety contract is that allergic customers stop getting recommended the item within minutes. A streaming path tails the menu changes (CDC from the menu system) and updates the item-feature row in the online store. The same change also lands in the offline store with its effective date so training learns the right history. Waiting for a nightly batch to refresh menu features is the version where allergic customers get unsafe recommendations for a window the business can't accept.
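
The consumer side of that streaming path might look like the sketch below. The event shape (`item_id`, `allergens`, `effective_at`), the function names, and the dict-based stores are assumptions for illustration. Two further choices here are also assumptions rather than anything the problem specifies: late-arriving events are kept in history but never overwrite a newer online row, and an item with no feature row is treated as unsafe.

```python
def apply_menu_change(event, online_items, offline_history):
    """Apply one menu CDC event; return True if the online row was updated."""
    item_id = event["item_id"]
    row = {
        "allergens": frozenset(event["allergens"]),
        "effective_at": event["effective_at"],
    }

    # Offline store: append-only history, kept ordered by effective time
    # so training can later query it as of any order date.
    history = offline_history.setdefault(item_id, [])
    history.append(row)
    history.sort(key=lambda r: r["effective_at"])

    # Online store: only the newest state is kept. An out-of-order event
    # must not roll the row back to an older allergen list.
    current = online_items.get(item_id)
    if current is not None and current["effective_at"] >= row["effective_at"]:
        return False
    online_items[item_id] = row
    return True


def is_safe(item_id, user_allergies, online_items):
    """Check an item against a customer's allergies using the online row."""
    row = online_items.get(item_id)
    if row is None:
        return False  # assumption: fail closed when the item has no features
    return not (row["allergens"] & set(user_allergies))
```

With this shape, a noon allergen event changes the answer of `is_safe` as soon as the consumer processes it, without waiting for any batch run.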

#### Step 3: Point-in-time joins for training

When the model retrains on an old order, it has to see the menu and ingredient state from that order's date, not today's. The offline store keeps each menu item's history with effective dates; for each order (order_id, order_date), training looks up the item's features in the offline store as of order_date. A model trained on today's menu against historical orders has seen the future and disappoints in production. Point-in-time joins keep the offline metric in line with what production will see.
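
A minimal as-of lookup, assuming each item's history is a list of `(effective_date, features)` pairs sorted by date. The names `as_of` and `build_training_rows` and the record fields are hypothetical, and a change is assumed to apply to orders placed on its effective date.

```python
from bisect import bisect_right
from datetime import date


def as_of(history, when: date):
    """Return the features in effect on `when`, or None if none existed yet.

    `history` is a list of (effective_date, features) sorted by effective_date.
    """
    effective_dates = [effective for effective, _ in history]
    # bisect_right counts versions whose effective date is <= `when`.
    idx = bisect_right(effective_dates, when)
    return history[idx - 1][1] if idx else None


def build_training_rows(orders, menu_history):
    """Attach to each order the item features as they stood on the order date."""
    rows = []
    for order in orders:
        history = menu_history.get(order["item_id"], [])
        rows.append({**order, "item_features": as_of(history, order["order_date"])})
    return rows
```

An order placed before an allergen change gets the old feature row and an order placed on or after it gets the new one, so no training row ever contains state from after its own order date.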

#### Step 4: Both A/B variants read the same features for the same user

When 10% of users are on a new model variant, both variants read the same online-store features for the same user. The feature read happens once per request and the value is shared across the variants; the experiment compares model behavior on the same input. Letting each variant read from its own cache or computation produces different feature values per variant and biases the result; the experiment can't tell whether the variant or the cache made the difference.
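
One way to picture the shared read is a request handler that fetches features before it knows or cares which variant will score them. In this sketch the hash-based bucketing, the experiment name `recs_v2`, and the callables `read_features` and `models` are illustrative assumptions, not part of the original design.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, treatment_pct: int = 10) -> str:
    """Deterministically bucket a user so repeat requests get the same variant."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100
    return "treatment" if bucket < treatment_pct else "control"


def handle_request(user_id, read_features, models, experiment="recs_v2"):
    """Read features once, then hand the same values to whichever variant serves."""
    # Single read from the online store, independent of variant assignment.
    features = read_features(user_id)
    variant = assign_variant(user_id, experiment)
    # The chosen model scores the shared snapshot; it never re-reads or caches its own copy.
    score = models[variant](features)
    return {"variant": variant, "features": features, "score": score}
```

Because the read sits above the variant branch, any difference in outcomes between control and treatment comes from the models rather than from the inputs they were given.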

---

### The shape that fits

> **What this design gives up**
>
> An online store sized for checkout latency costs more than serving from the warehouse; the streaming menu-change path adds CDC and a stream consumer; point-in-time history grows the offline store with every change; the shared-feature-read constraint means request-time logic that reads once and shares across variants. Implementation cost is the price; the win is checkout that feels instant, allergen safety the business can vouch for, training that matches production, and A/B results that aren't biased by feature drift.

> **What reviewers check**
>
> A reviewer looks at the canvas for these properties:
> - Checkout reads customer and item features from an online store with a tens-of-milliseconds budget.
> - Menu changes propagate through a streaming path so allergen updates reach the online store within minutes.
> - Training joins historical orders against the offline store with point-in-time semantics.
> - Both A/B variants read the same feature values for the same user at the same moment.

> **The mistake that ships**
>
> What gets shipped runs nightly batch features into a single store the model reads at request time. Checkout latency includes the warehouse query and the page hangs at peak. A menu allergen update at noon doesn't reach recommendations until tomorrow's batch; an allergic customer gets recommended the unsafe item for the whole window in between. Training joins yesterday's orders against today's menu and the model sees the future; offline metrics look great and production performance disappoints. A/B variants read from independently cached subsets and the result is uninterpretable. The eventual rebuild is the dual store, the streaming menu path, the point-in-time join, and the shared feature read; each was reachable in the original conversation if 'instant at checkout' had been treated as architecture.

---

## Common follow-up questions

- A customer's dietary preferences update mid-session. How fast does the next checkout see the change, and what doesn't change? _(Tests whether the candidate sees the order_stream carrying the customer-feature update along the same kind of streaming path as menu changes, so the next checkout reads the updated features within minutes. The request-time service sees the change; the training path does not yet, because it lags by the offline cadence.)_
- A new feature has to be added that requires a heavy batch computation. How does this design serve it at checkout latency? _(Tests whether the candidate sees that not every feature has to be sub-minute: the new feature computes on the offline path, lands in the online store on its slower cadence, and serves at the same tens-of-milliseconds budget. The cadence is per-feature, not per-pipeline.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/what_should_we_recommend_tonight)
- [System Design Interview Questions](https://datadriven.io/data-engineering-system-design)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.