# Largest A/B Test by Participants

> The biggest experiment ever run.

Canonical URL: <https://datadriven.io/problems/largest_a_b_test_by_participants>

Domain: SQL · Difficulty: medium · Seniority: L3

## Problem

The data science team is identifying which A/B test drew the widest audience. Which test had the most unique participants? Return the test name and count for the top one.

## Worked solution and explanation

### Why this problem exists in real interviews

`ab_results` holds eight million rows, and its grain is per-event, not per-person. A power user who triggered the conversion metric forty times shows up as forty rows. Anyone who writes `COUNT(*)` is counting telemetry pings and calling it audience reach.
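
To make the grain problem concrete, here is a minimal sketch on made-up data. The four sample rows and the test name `checkout_button` are invented for illustration; only the column names come from the problem.

```sql
-- Hypothetical sample rows: user 1 fires the conversion metric three times.
WITH ab_results (test_name, user_id, variant, metric) AS (
    VALUES
        ('checkout_button', 1, 'A', 'conversion'),
        ('checkout_button', 1, 'A', 'conversion'),
        ('checkout_button', 1, 'A', 'conversion'),
        ('checkout_button', 2, 'B', 'conversion')
)
SELECT test_name,
       COUNT(*)                AS event_rows,          -- 4: every event row
       COUNT(DISTINCT user_id) AS unique_participants  -- 2: each person once
FROM ab_results
GROUP BY test_name;
```

Both aggregates read the same rows; only the second one answers the audience question.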

---

### Break down the requirements

#### Step 1: Aggregate to the human

Group by `test_name` and reduce per-event rows with `COUNT(DISTINCT user_id)`. A given `user_id` may appear across `variant` values or many `metric` rows in the same test.
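
One way to see how often a user spans variants is a quick diagnostic query. This is a sketch that assumes `variant` is a plain column on `ab_results`, as the prompt implies; the alias `variants_seen` is illustrative.

```sql
-- List users who appear under more than one variant of the same test.
SELECT test_name,
       user_id,
       COUNT(DISTINCT variant) AS variants_seen
FROM ab_results
GROUP BY test_name, user_id
HAVING COUNT(DISTINCT variant) > 1;
```

Any user this returns still counts once per test under `COUNT(DISTINCT user_id)`, because the distinct is taken over `user_id` alone.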

#### Step 2: Rank and slice

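Sort descending on the distinct count and take the top entry with `LIMIT 1`. The prompt asks for a single largest test and does not specify tie-breaking, so if two tests tie, `LIMIT 1` returns whichever row the engine happens to emit first. Add `test_name` as a secondary sort key if the result must be deterministic.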
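
If an interviewer does ask for ties, one possible variation ranks the per-test counts and keeps every test in first place. This is a sketch, not part of the requested answer; the names `per_test`, `ranked`, and `rnk` are illustrative.

```sql
WITH per_test AS (
    -- Same aggregation as the main solution, without the LIMIT.
    SELECT test_name,
           COUNT(DISTINCT user_id) AS unique_participants
    FROM ab_results
    GROUP BY test_name
)
SELECT test_name,
       unique_participants
FROM (
    SELECT test_name,
           unique_participants,
           -- RANK gives tied tests the same rank, so all leaders get 1.
           RANK() OVER (ORDER BY unique_participants DESC) AS rnk
    FROM per_test
) AS ranked
WHERE rnk = 1;
```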

---

### The solution

**TOP TEST BY UNIQUE PARTICIPANTS**

```sql
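-- One output row per test; each user counts once no matter how many events they fired.
SELECT test_name,
       COUNT(DISTINCT user_id) AS unique_participants
FROM ab_results
GROUP BY test_name
-- Largest audience first, then keep only the top row.
ORDER BY unique_participants DESC
LIMIT 1;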
```

> **Cost Analysis**
>
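> `COUNT(DISTINCT)` has to deduplicate `user_id` values within each group, which engines do with either a hash set or a sort. With 8M rows spread across a handful of tests, each group carries a large set of IDs, so memory use or sort spill can dominate the runtime. An index on `(test_name, user_id)` covers both columns the query reads and can let the engine scan the pairs in sorted order, so duplicates arrive next to each other and are cheap to skip.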
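
A sketch of that index in PostgreSQL syntax follows. The index name `idx_ab_results_test_user` is hypothetical, and whether the planner actually uses the index depends on table statistics and configuration, so check the plan rather than assuming.

```sql
-- Hypothetical index name; the columns are the two the query reads.
CREATE INDEX idx_ab_results_test_user
    ON ab_results (test_name, user_id);

-- Inspect the actual plan and buffer usage for the solution query.
EXPLAIN (ANALYZE, BUFFERS)
SELECT test_name,
       COUNT(DISTINCT user_id) AS unique_participants
FROM ab_results
GROUP BY test_name
ORDER BY unique_participants DESC
LIMIT 1;
```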

> **Interviewers Watch For**
>
> Whether you say the word grain before you type. `ab_results` has one row per metric event, so `COUNT(*)` answers event volume, not audience size. Name that distinction up front.

> **Common Pitfall**
>
> Wrapping the aggregate in `SELECT DISTINCT test_name, COUNT(user_id)` to dedupe rows. `DISTINCT` is applied to the output rows after `GROUP BY` and the aggregate have run, so there is already one row per test and `COUNT(user_id)` has already counted every event. Deduplication has to live inside the aggregate.
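
If the dedupe is kept outside the aggregate, it has to run before the count rather than after it. A sketch of that two-step form, with `participants` as an illustrative alias:

```sql
SELECT test_name,
       COUNT(*) AS unique_participants
FROM (
    -- Collapse to one row per (test, user) before counting.
    SELECT DISTINCT test_name, user_id
    FROM ab_results
    -- COUNT(DISTINCT user_id) ignores NULLs, so drop them here to match.
    WHERE user_id IS NOT NULL
) AS participants
GROUP BY test_name
ORDER BY unique_participants DESC
LIMIT 1;
```

This returns the same top test as the `COUNT(DISTINCT user_id)` version, provided at least one row has a non-null `user_id`.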

---

## Common follow-up questions

- How would you also report the variant split for that winning test? _(Tests per-variant aggregation in one pass.)_
- If someone appears under two `variant` values for the same test, do you count them once? _(Probes whether you push back on ambiguous prompts.)_
- How would you rewrite this if `ab_results` were partitioned by month and you only cared about the last 30 days? _(Tests partition pruning on event tables.)_
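
For the first follow-up, one possible shape is to find the winning test in a CTE and then break it down by variant. This is a sketch that reads the table twice rather than in a single pass; `winner`, `r`, and `w` are illustrative names.

```sql
WITH winner AS (
    -- Same ranking as the main solution, keeping only the test name.
    SELECT test_name
    FROM ab_results
    GROUP BY test_name
    ORDER BY COUNT(DISTINCT user_id) DESC
    LIMIT 1
)
SELECT r.test_name,
       r.variant,
       COUNT(DISTINCT r.user_id) AS unique_participants
FROM ab_results AS r
JOIN winner AS w
  ON w.test_name = r.test_name
GROUP BY r.test_name, r.variant
ORDER BY r.variant;
```

A user who appears under two variants is counted in each variant's row, so the per-variant counts can sum to more than the test-level total. That is the ambiguity the second follow-up is probing.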

> **Real World**
>
> Experimentation platforms often expose a pre-aggregated `exposures` table at the person-by-test grain to avoid this trap. If one exists, prefer it.
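
If such a table exists, the query gets simpler. The sketch below assumes a hypothetical `exposures` table with columns `test_name` and `user_id` and exactly one row per person per test; the real name and schema will vary by platform.

```sql
-- Assumption: exposures(test_name, user_id) holds one row per person per test,
-- so a plain row count is already a count of unique participants.
SELECT test_name,
       COUNT(*) AS unique_participants
FROM exposures
GROUP BY test_name
ORDER BY unique_participants DESC
LIMIT 1;
```

Confirm the grain of the table before relying on `COUNT(*)` here; if it is not strictly one row per person per test, fall back to `COUNT(DISTINCT user_id)`.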

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/largest_a_b_test_by_participants)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.