# Top Percentile API Tokens

> The most suspicious tokens.

Canonical URL: <https://datadriven.io/problems/top_percentile_api_tokens>

Domain: SQL · Difficulty: hard · Seniority: L5

## Problem

Consider tokens at or above the 95th percentile by total requests within each scope as high-risk. Show the token ID, scope, request count, and percentile rank. Round to 2 decimal places.

## Worked solution and explanation

### Why this problem exists in real interviews

"95th percentile" sounds unambiguous until you try three functions and get three answers. PERCENT_RANK, CUME_DIST, and NTILE all return something defensible, and only one matches what a security engineer means when they flag the noisiest tokens in each scope.

---

### Break down the requirements

#### Step 1: Partition by scope

High-volume is relative inside each `scope`. A read-only token doing 10k requests is normal; an admin token doing the same is not. Every percentile calculation must `PARTITION BY scope`.

#### Step 2: Pick the right percentile function

Use `PERCENT_RANK()` with `ORDER BY requests ASC`. It returns `(rank - 1) / (n - 1)`, so the >= 0.95 filter captures the top 5 percent within each partition.

#### Step 3: Round for presentation only

`ROUND(pct_rank, 2)` shapes the output column. Do not round inside the window or before the comparison or you lose precision on the boundary.

---

### The solution

**TOP PERCENTILE API TOKENS**

```sql
WITH ranked AS (
  SELECT token_id, scope, requests,
         PERCENT_RANK() OVER (
           PARTITION BY scope
           ORDER BY requests ASC
         ) AS pct_rank
  FROM api_tokens
)
SELECT token_id, scope, requests,
       ROUND(pct_rank, 2) AS percentile_rank
FROM ranked
WHERE pct_rank >= 0.95
ORDER BY scope, requests DESC
```

> **Cost Analysis**
>
> 600k rows with a per-scope window sort. If scope cardinality is low (a handful of values), each partition is still hundreds of thousands of rows and the sort dominates. An index on `(scope, requests)` makes the window a streaming operation instead of a spill.

> **Interviewers Watch For**
>
> Choice of percentile function. NTILE(100) buckets rows into 100 groups of equal count, which is not the same as percentile rank when partitions are small. CUME_DIST includes ties in the numerator. Naming the difference matters more than the SQL.

> **Common Pitfall**
>
> Flipping the ORDER BY to DESC and filtering `pct_rank <= 0.05`. It looks symmetric but PERCENT_RANK is not. The lowest-ranked row always gets 0.0, so DESC ordering hands 0.0 to the highest-request token and the comparison drops it.

---

### COMMON FOLLOW-UP QUESTIONS

## Common follow-up questions

- How would the output differ if you used `CUME_DIST` instead of `PERCENT_RANK`? _(Tests the practical distinction: CUME_DIST counts ties on the same side, so it can return values above 0.95 for rows that PERCENT_RANK puts at 0.94.)_
- What happens to the result if a scope has only 20 tokens? _(Probes awareness that PERCENT_RANK on a 20-row partition only yields discrete steps of roughly 0.053, so the top single token clears 0.95 trivially.)_
- Rewrite this to also exclude tokens where `status = 'revoked'` without changing the percentile semantics. _(Forces the filter into the CTE before the window function runs, since filtering after `PERCENT_RANK()` would keep the revoked tokens in the denominator.)_

> **Why this query runs at 3 a.m.**
>
> Security teams sweep this nightly and page on tokens that jump scopes overnight. The percentile framing survives growth: as token counts climb, an absolute threshold rots, but "top 5 percent inside your scope" keeps its meaning.

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/top_percentile_api_tokens)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.