# A/B Experiment Assignment Schema

> One user, one experiment, one variant. No exceptions.

Canonical URL: <https://datadriven.io/problems/a_b_experiment_assignment_schema>

Domain: Data Modeling · Difficulty: medium · Seniority: L5

## Problem

We run product experiments across our consumer app. When a user is assigned to an experiment, we need to track which variant they saw and when. Analysts need to compute metric lifts between variants. Design the data model to support experimentation analysis.

## Worked solution and explanation

### Why this problem exists in real interviews

This probes whether you understand **separation of concerns in schema design**, **uniqueness constraints as correctness guarantees**, and **temporal join patterns**. These three concepts are the foundation of analytics-ready data modeling, and getting any one of them wrong produces silently invalid query results downstream.

> **Trick to Solving**
>
> The key phrase is "compute metric lifts between variants." This means analysts will join assignments to events. The moment you hear "join two fact tables by user and time," you need a dedicated assignment table with a timestamp column. Without it, the temporal join is impossible.
> 
> 1. Spot the temporal join requirement in the prompt
> 2. Create a separate assignments fact table (never embed on events)
> 3. Make `assignment_timestamp` a first-class column
> 4. Add `UNIQUE(user_id, experiment_id)` to prevent variant leakage

---

### Break down the requirements

#### Step 1: Identify the entities

Three distinct entities: **experiments** (the test), **variants** (control/treatment groups), and **assignments** (user placed into a variant at a specific time).

#### Step 2: Separate assignments from events

Assignments must live in their own table. Embedding variant info on the event log makes temporal filtering impossible and correction rewrites catastrophic.

#### Step 3: Add the uniqueness constraint

`UNIQUE(user_id, experiment_id)` prevents a user from leaking into multiple variants, which would invalidate every metric for that experiment.

#### Step 4: Include assignment_timestamp

This is the join key for post-assignment analysis. The query pattern is `WHERE event_timestamp >= assignment_timestamp`.

---

### The solution

> **Interviewers Watch For**
>
> Four things separate pass from fail:
> 1. Separate assignments table (not columns on events)
> 2. `UNIQUE(user_id, experiment_id)` constraint
> 3. `assignment_timestamp` as a first-class field
> 4. Understanding the temporal join pattern

> **Common Pitfall**
>
> Adding `variant_name` or `experiment_id` as columns on the event log. This creates two fatal problems: you cannot filter to post-assignment events, and late corrections require rewriting millions of event rows.

---

### The analysis pattern

**Metric lift query**

```sql
SELECT
    v.variant_name,
    COUNT(DISTINCT e.user_id) AS users,
    AVG(e.revenue)            AS avg_revenue
FROM fact_experiment_assignments a
JOIN dim_variants v ON v.variant_id = a.variant_id
JOIN fact_events e
  ON e.user_id = a.user_id
 AND e.event_timestamp >= a.assignment_timestamp
WHERE a.experiment_id = 42
  AND a.is_active = true
GROUP BY v.variant_name
```

---

---

### Trade-offs and alternatives

**Dedicated assignments table**

Clean separation of concerns. Temporal joins are clean via `assignment_timestamp`. Constraint enforcement is straightforward. Joining at query time costs an extra step on every metric query.

**Alternative: variant on event log**

Variant column denormalized onto each event row. Faster reads since no join. Backfills become catastrophic when assignments are corrected. Cannot reason about pre-assignment behavior. Late-arriving data is hard to attribute correctly.

## Common follow-up questions

- What if 0.1% of assignments need to be corrected after the fact? _(Tests the is_active flag pattern: mark old assignment inactive, insert new one, without deleting history.)_
- How would you handle users in 30+ concurrent experiments at 5M DAU? _(Tests partitioning strategy: partition fact_experiment_assignments by experiment_id for query performance.)_
- What if you needed to measure pre-assignment vs post-assignment behavior? _(Tests whether you understand why assignment_timestamp is load-bearing for this comparison.)_
- How would the schema change to support multi-arm bandits with dynamic traffic allocation? _(Tests extensibility: traffic_pct becomes time-varying, assignments need a version or epoch column.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/a_b_experiment_assignment_schema)
- [Data Modeling Interview Questions](https://datadriven.io/data-modeling-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.