# Fitness App Data Model

> Reps, sets, streaks, and personal bests. Gym rats love their stats.

Canonical URL: <https://datadriven.io/problems/fitness_app_data_model>

Domain: Data Modeling · Difficulty: easy · Seniority: L4

## Problem

We're building a fitness app where users log workouts. Each workout has exercises with sets and reps. We need to track progress over time, like 'how much can this user bench press now vs. 3 months ago?' Design the data model.

## Worked solution and explanation

### Why this problem exists in real interviews

This probes whether a candidate can **declare grain correctly** on a nested hierarchy (user, workout, exercise, set) and recognize when a raw string attribute should become a conformed dimension. Progress tracking forces the conversation onto aggregation, which only works if the lowest fact grain is a single set and the exercise vocabulary is normalized.

> **Trick to Solving**
>
> Any phrase like “how much can this user bench press now vs three months ago” is a signal that the fact grain has to sit at the set level, not the workout level. Before drawing any tables, a strong candidate asks: what is the lowest thing the user logs, and does the exercise name need to be canonical for aggregation?
> 
> 1. Ask what a user actually logs on each tap (one set at a time)
> 2. Declare grain as one row per set of one exercise in one workout
> 3. Pull exercise names into a catalog dimension
> 4. Keep weight and reps as additive measures on the fact

---

### Break down the requirements

#### Step 1: Declare the grain first

One row in `workout_sets` equals one set of one exercise inside one workout. Anything coarser makes personal-record queries impossible.

#### Step 2: Identify the four entities

`users`, `workouts`, `exercises` (a catalog dimension), and `workout_sets` (the fact). Exercises are a dimension because “Bench Press” must aggregate cleanly across users and time.

#### Step 3: Normalize the exercise catalog

If exercise name lives as a free-text column on each set, casing and typos shatter every progress query. A dimension with a canonical name column fixes that at ingest.

#### Step 4: Keep measures additive

`weight_lbs` and `reps` on `workout_sets` stay numeric and per-set so that MAX, AVG, and volume calculations are trivial downstream.

---

### The solution

Below is one conceptually sound model. The set-level grain anchors everything else: once the fact row represents a single set, the dimensions and keys fall out naturally.

> **Why this works**
>
> Declaring grain at the set level is the single load-bearing decision. It lets personal-record, volume, and trend queries be pure GROUP BY aggregations over a narrow fact table, and it pushes exercise-name consistency upstream into the catalog where it belongs.

> **Interviewers watch for**
>
> Strong candidates say “grain” out loud before drawing tables and justify why sets (not workouts) are the fact. They also pull exercises into a dimension without being prompted. Weaker candidates stop at a `workouts` table with a JSON column or comma-separated exercise list.

> **Common pitfall**
>
> Storing exercise name as inline TEXT on each set row. It looks harmless until two users type “bench press” and “Bench Press”, and suddenly neither one has a personal record. Normalizing the vocabulary into `exercises` is a one-line fix that pays for itself forever.

---

### The analysis pattern

**Per-user personal record progression**

```sql
SELECT
    w.user_id,
    e.canonical_name,
    DATE_TRUNC('month', w.started_at) AS month,
    MAX(s.weight_lbs) AS max_weight
FROM workout_sets s
JOIN workouts w ON w.workout_id = s.workout_id
JOIN exercises e ON e.exercise_id = s.exercise_id
WHERE e.canonical_name = 'Bench Press'
GROUP BY w.user_id, e.canonical_name, DATE_TRUNC('month', w.started_at)
ORDER BY w.user_id, month
```

---

### Trade-offs and alternatives

**Normalized catalog + set fact**

Set-level grain with `exercises` dimension.

* Aggregations across users and time are trivial
* Exercise taxonomy can evolve without rewriting fact rows
* Slightly more joins on every read

**Embedded exercise name**

Exercise name stored as TEXT on each set row.

* One less join on read
* Typos and casing fragment the vocabulary
* Fixing bad data requires mass updates on the fact table

---

## Common follow-up questions

- How would you model exercise variations like 'incline bench press' vs 'flat bench press'? _(Tests whether the candidate introduces a parent_exercise_id on the catalog or a variation attribute, instead of creating a new row per variation with duplicated metadata.)_
- A user edits a past workout to correct a weight. How does the schema handle it? _(Tests whether edits are destructive updates or versioned rows and whether audit requirements exist.)_
- What changes if we add cardio workouts where reps and weight don’t apply? _(Tests schema evolvability: nullable measures vs typed subtype tables vs event-specific fact tables.)_
- How would you compute volume (sets times reps times weight) per week without scanning the full fact table? _(Tests whether the candidate reaches for a weekly pre-aggregate keyed on user and exercise.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/fitness_app_data_model)
- [Data Modeling Interview Questions](https://datadriven.io/data-modeling-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.