# Weighted Variant Selection

> Select a row using cumulative weight probabilities.

Canonical URL: <https://datadriven.io/problems/weighted_variant_selection>

Domain: SQL · Difficulty: hard · Seniority: L5

## Problem

The A/B testing platform uses rollout percentages to weight feature flag selection. Compute each flag's selection probability (its rollout divided by the total rollout across all flags) and its cumulative probability (the running sum of probabilities in rollout order).

## Worked solution and explanation

### Why this problem exists in real interviews

This tests window functions for running totals and ratio computation. Interviewers check whether you can compute proportional weights and cumulative distributions in a single query.

> **Trick to Solving**
>
> Two separate window operations are needed: a global SUM for the total weight, and a running SUM for the cumulative probability. Both can be computed in a single SELECT.
> 
> 1. Compute total rollout across all flags: `SUM(rollout) OVER ()`
> 2. Compute each flag's probability: `rollout / total_rollout`
> 3. Compute cumulative probability: `SUM(probability) OVER (ORDER BY rollout)`

---

### Break down the requirements

#### Step 1: Compute total rollout

`SUM(rollout) OVER ()` gives the total rollout across all 500 flags as a scalar accessible per row.

#### Step 2: Compute selection probability

`rollout / SUM(rollout) OVER ()` gives each flag's share of the total.

#### Step 3: Compute cumulative probability

`SUM(rollout / total) OVER (ORDER BY rollout)` produces the running sum of probabilities.

---

### The solution

**Probability and cumulative distribution via window functions**

```sql
SELECT
    flag_id,
    flag_name,
    rollout,
    rollout * 1.0 / SUM(rollout) OVER () AS selection_probability,
    SUM(rollout * 1.0 / SUM(rollout) OVER ()) OVER (ORDER BY rollout) AS cumulative_probability
FROM feat_flags
```

> **Cost Analysis**
>
> With only 500 rows, the nested window functions are trivially fast. The outer SUM OVER (ORDER BY rollout) sorts 500 rows, which is negligible.

> **Interviewers Watch For**
>
> Nested window functions. `SUM(rollout) OVER ()` inside another window expression is a pattern some candidates have not seen. Some engines require a CTE to stage the total.

> **Common Pitfall**
>
> The nested `SUM(rollout) OVER ()` inside another window function may not work in all SQL engines. A safer approach uses a CTE to compute the total first, then reference it.

---

## Common follow-up questions

- How would you rewrite this without nested window functions? _(Use a CTE: `WITH totals AS (SELECT SUM(rollout) AS total FROM feat_flags)` then CROSS JOIN.)_
- What if rollout values could be zero? _(Zero rollout means zero probability; the cumulative distribution would skip them.)_
- How would you select a random flag using the cumulative distribution? _(Generate a random number between 0 and 1, then find the first flag where cumulative_probability exceeds it.)_
- What if you needed to update rollout percentages to sum to exactly 100? _(Normalize: UPDATE SET rollout = rollout * 100.0 / (SELECT SUM(rollout) FROM feat_flags).)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/weighted_variant_selection)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.