# Rapid Retry Detection

> Detect retried API calls within 5 minutes of failure.

Canonical URL: <https://datadriven.io/problems/rapid_retry_detection>

Domain: SQL · Difficulty: medium · Seniority: L5

## Problem

During an incident postmortem, the on-call engineer needs to detect retry storms. Using the api_calls table, find all calls where the same user hit the same endpoint within 5 minutes after a call that returned an error status (status >= 400). Return the user_id, endpoint, the failed call time, and the retry call time.

## Worked solution and explanation

### Why this problem exists in real interviews

Reliability teams hunt for retry storms because they distort SLO math and can DDoS internal services. Interviewers use this prompt to test the candidate's ability to express a 'within N minutes after' temporal join in SQLite without window functions, handle NULL user_id correctly, and pick the right comparison direction so the retry comes AFTER the failure (not before).

---

### Break down the requirements

#### Step 1: Self join on user and endpoint

Alias api_calls as a (the failed call) and b (the retry). Join on a.user_id = b.user_id AND a.endpoint = b.endpoint. The retry must hit the SAME endpoint, so endpoint equality is part of the join key, not a post filter.

#### Step 2: Time window: 5 minutes after, strictly after

b.call_time > a.call_time keeps only retries that come AFTER the failure (no self matches, no time travel). (julianday(b.call_time) minus julianday(a.call_time)) times 24 times 60 <= 5 enforces 'within 5 minutes' in SQLite. julianday returns days, so multiply by 24 times 60 for minutes.

#### Step 3: Filter failures and drop NULL users

WHERE a.status >= 400 AND a.user_id IS NOT NULL ensures only failed origin calls qualify and avoids spurious matches on NULL = NULL semantics. The prompt explicitly says retries can have any status, so no filter on b.status.

---

### The solution

**Self join with a 5 minute forward window**

```sql
SELECT a.user_id, a.endpoint, a.call_time AS failed_call_time, b.call_time AS retry_call_time
FROM api_calls a
JOIN api_calls b
  ON a.user_id = b.user_id
 AND a.endpoint = b.endpoint
 AND b.call_time > a.call_time
 AND (julianday(b.call_time) - julianday(a.call_time)) * 24 * 60 <= 5
WHERE a.status >= 400 AND a.user_id IS NOT NULL
```

> **Cost Analysis**
>
> 300M rows self joined is the hard part. The join predicate equality on (user_id, endpoint) limits the pairing to per user per endpoint sub partitions, which the planner can handle with a hash join or index lookup if (user_id, endpoint, call_time) is indexed. The time window filter applies after the equi join. In production, replacing the self join with LEAD over a partition by (user_id, endpoint) is far faster.

> **Interviewers Watch For**
>
> Whether the candidate uses strict greater than on call_time so a row does not match itself, whether they push the time math into the join condition rather than into WHERE (semantically equivalent here but signals understanding of join cost), and whether the user_id IS NOT NULL filter is on the failure side (a) to short circuit early.

> **Common Pitfall**
>
> Writing b.call_time >= a.call_time instead of strict greater than causes every failed call to match itself (status >= 400 means the same row joins to itself with delta 0). Forgetting a.user_id IS NOT NULL pulls in NULL = NULL pairings, which SQLite treats as not equal so it accidentally works, but the resulting NULL output column is usually a bug downstream.

---

## Common follow-up questions

- Replace the self join with LEAD over a window. What does the SQL look like? _(Tests scalability awareness. PARTITION BY user_id, endpoint ORDER BY call_time, then LEAD(call_time) and a WHERE on the time delta plus a status filter. Drops the join from quadratic per partition to linear.)_
- What if 'retry' should mean 'same endpoint AND same HTTP method'? _(Tests requirement clarification. Add a.method = b.method to the join condition. The prompt currently does not require it, but it is a common follow up.)_
- How would you count distinct retry storms (clusters of failures within 5 minutes) per user? _(Tests gap and islands again. After this query, group consecutive failures within 5 minutes using rn minus ROW_NUMBER per user.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/rapid_retry_detection)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.