# The Decision Before the Door Closes

> The window to stop it is smaller than you think.

Canonical URL: <https://datadriven.io/problems/the_decision_before_the_door_closes>

Domain: Pipeline Design · Difficulty: hard · Seniority: L5

## Problem

We process millions of card transactions per day, and our fraud team needs a scoring pipeline that can flag suspicious activity before authorization completes. The current approach is batch-based and catches fraud only after the fact. Design the pipeline.

## Worked solution and explanation

### Why this problem exists in real interviews

The requirement is card authorization in well under a second, with fraud scoring inside that budget, a sane fallback when scoring is unavailable, and a feedback loop from chargebacks. The trap is wiring scoring synchronously into auth and treating chargebacks as something to add later; both shortcuts compound until either auth latency breaks or model accuracy plateaus.

The default instinct is a synchronous call from auth into the scoring service for every transaction. The first time scoring is slow, auth times out and the on-call engineer adds an approve-on-timeout rule. The fallback policy then emerges in hotfixes; nobody documents what 'big' means versus 'small'. Chargebacks land weeks later in a separate system the model never sees.

> **Trick to Solving**
>
> Decouple scoring through a queue with a deadline; when scoring is unavailable, fall back per the documented size-based policy; chargebacks loop back as labels.
> 
> 1. A queue between auth and scoring decouples them. Auth publishes the transaction with the scoring deadline; the scorer reads it, scores it, and returns the decision. Auth proceeds on the fallback if no decision arrives in time.
> 2. When scoring is unavailable, the documented policy gates: small transactions approve, big ones block. The threshold lives in policy config, not in on-call code.
> 3. Chargebacks (confirmed weeks later) feed a labels store the next training run reads alongside scoring decisions, so the model learns from real outcomes.

---

### Walk the requirements

#### Step 1: Score inside the auth budget on a decoupled path

Auth publishes each transaction with its scoring deadline onto the queue; the scoring service reads it, scores it, and writes the decision back inside the deadline. Auth waits up to the deadline, then proceeds with either the score or the fallback, so end-to-end latency fits inside the well-under-a-second budget. With a synchronous call into the scoring service, a slow score blocks auth and timeouts pile up; the queue plus a per-transaction deadline is what gives auth a real upper bound on how long it waits.
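
The deadline-bound hand-off can be sketched as follows. This is a minimal illustration, not a production design: an in-process `queue.Queue` stands in for the real message queue, each request carries its own reply channel, and the names `ScoringRequest`, `scorer_loop`, and `authorize` are hypothetical.

```python
import queue
import threading
import time
from dataclasses import dataclass, field


@dataclass
class ScoringRequest:
    txn_id: str
    amount_cents: int
    deadline: float  # absolute time.monotonic() value set by auth
    # Assumption: a one-slot reply queue stands in for the real reply channel.
    reply: queue.Queue = field(default_factory=lambda: queue.Queue(maxsize=1))


def scorer_loop(requests, score_fn):
    """Scoring worker: read a request, score it, write the decision back."""
    while True:
        req = requests.get()
        if req is None:  # sentinel: shut the worker down
            return
        if time.monotonic() >= req.deadline:
            continue  # auth has already moved on; skip the stale request
        req.reply.put(score_fn(req))


def authorize(requests, txn_id, amount_cents, budget_s, fallback):
    """Publish with a deadline, wait up to the deadline, then proceed."""
    req = ScoringRequest(txn_id, amount_cents, time.monotonic() + budget_s)
    requests.put(req)
    remaining = max(0.0, req.deadline - time.monotonic())
    try:
        return req.reply.get(timeout=remaining)
    except queue.Empty:
        return fallback(req)  # no decision in time: apply the fallback
```

The point of the sketch is that auth's wait is bounded by `budget_s` regardless of how slow `score_fn` is; a late decision is simply never read.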

#### Step 2: Size-based fallback when scoring is unavailable, by policy

When scoring is unavailable or the deadline expires, the fallback applies the documented policy: small transactions approve, large ones block. The threshold is config the business has signed off on. The fallback path is part of the design, not a hotfix from on-call. If auth implements the fallback ad hoc, the policy emerges from incident decisions; the documented threshold is what makes the failure mode predictable.
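
A minimal sketch of the fallback as versioned config rather than on-call code. The `FallbackPolicy` type, the field names, and any threshold value passed in are hypothetical; the real threshold is whatever the business signed off on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FallbackPolicy:
    version: str              # recorded on every fallback decision
    approve_below_cents: int  # hypothetical threshold; set by the business


def fallback_decision(amount_cents, policy):
    """Apply the size-based policy when no score arrived in time."""
    action = "approve" if amount_cents < policy.approve_below_cents else "block"
    return {
        "action": action,
        "source": "fallback",  # distinguishes this from a scored decision
        "policy_version": policy.version,
    }
```

Recording `policy_version` on each decision is what lets a later review tie an approval back to the exact threshold that was in force.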

#### Step 3: Chargebacks loop into training as real labels

Chargebacks confirm fraud weeks after the score. A labels store records each transaction's score, the actual outcome, and the gap between the two. Retraining reads predictions joined to labels and learns from real outcomes. Without the loop, the model retrains on its own prior predictions and accuracy plateaus; with it, the model gets better as chargebacks accumulate.
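
The join between scoring decisions and chargebacks might look like the sketch below. The record shapes, the `day` field, and the 90-day maturity window are assumptions for illustration; the source only says chargebacks arrive weeks later.

```python
def build_training_rows(decisions, chargebacks, as_of_day, maturity_days=90):
    """Join scoring decisions to confirmed outcomes by transaction id.

    decisions:   dicts with 'txn_id', 'score' (0..1), 'day' (int)
    chargebacks: dicts with 'txn_id' (confirmed fraud)
    """
    fraud_ids = {cb["txn_id"] for cb in chargebacks}
    rows = []
    for d in decisions:
        if d["txn_id"] in fraud_ids:
            label = 1  # a chargeback confirmed fraud
        elif as_of_day - d["day"] >= maturity_days:
            label = 0  # assumed window closed with no chargeback
        else:
            continue  # outcome not yet known; skip rather than guess
        rows.append({
            "txn_id": d["txn_id"],
            "score": d["score"],
            "label": label,
            "gap": label - d["score"],  # outcome minus prediction
        })
    return rows
```

Skipping transactions whose window has not closed avoids labeling recent fraud as legitimate just because the chargeback has not arrived yet.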

---

### The shape that fits

> **What this design gives up**
>
> A queue between auth and scoring is more infrastructure than a synchronous call; an explicit fallback policy adds config and a review queue for fallback approvals; the chargeback labels store grows for years and joins back to historical scores. Implementation cost is the price; the win is auth latency that doesn't break under scoring slowness, an outage policy the business signed for, and a model that learns from real fraud.

> **What reviewers check**
>
> A reviewer looks at the canvas for these properties:
> - A queue decouples authorization from scoring with a deadline so a slow score doesn't block auth.
> - A fallback path applies the size-based policy when scoring is unavailable: small transactions approve and go to review, large ones block.
> - Confirmed chargeback labels feed back into the model's training data alongside the scoring decisions.

> **The mistake that ships**
>
> The version that ships wires auth synchronously to scoring. The first time scoring is slow, auth hangs and the on-call engineer adds approve-on-timeout. The fallback policy emerges from hotfixes. Chargebacks land in a separate system the model never reads; the model retrains on its own prior predictions and stops improving. The eventual rebuild adds the queue, the documented size-based fallback, and the chargeback feedback loop.

---

## Common follow-up questions

- The fallback policy approves a large transaction during a scoring outage and it turns out to be fraud. What does this design surface, and where does the post-mortem look? _(Tests whether the candidate sees that the fallback approval is logged with the policy version applied, the chargeback eventually labels the transaction as fraud, and the post-mortem reads the policy + chargeback log to assess whether the threshold needs adjustment. The fallback isn't a free pass; it's a tradeoff the business signed for.)_
- A scorer change causes a small uptick in chargebacks two weeks later. What in this design makes that visible, and how does the training loop respond? _(Tests whether the candidate sees the labels store joining scoring decisions to chargeback outcomes; the next training run reads the joined labels and learns from the change. A regression in scoring quality shows up in retrospective metrics the team can investigate before the next deploy.)_
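
The retrospective metric in the second follow-up can be sketched as a chargeback rate per scorer version over the joined labels. This assumes each labeled row also carries a `scorer_version` field, which is a hypothetical addition to the labels store.

```python
from collections import defaultdict


def chargeback_rate_by_version(rows):
    """rows: dicts with 'scorer_version' and 'label' (1 = confirmed chargeback)."""
    totals = defaultdict(int)
    frauds = defaultdict(int)
    for r in rows:
        totals[r["scorer_version"]] += 1
        frauds[r["scorer_version"]] += r["label"]
    # Rate of confirmed chargebacks among transactions each version scored.
    return {version: frauds[version] / totals[version] for version in totals}
```

A jump in the rate between two adjacent versions is the signal that a scorer change, rather than a shift in traffic, deserves a closer look.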

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/the_decision_before_the_door_closes)
- [System Design Interview Questions](https://datadriven.io/data-engineering-system-design)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.