# Who Comes Back

Canonical URL: <https://datadriven.io/problems/who-comes-back>

Domain: Data Modeling · Difficulty: medium · Seniority: mid

## Problem

We run a short-video social platform and the growth team wants to track how many new users come back on each day after they sign up, broken down by signup cohort and acquisition channel. Design the warehouse model that lets analysts compute day-N return rates for any offset they ask for later, without rescanning the raw event stream. Activity is high-volume, so the model has to keep these queries cheap as daily actives grow into the hundreds of millions.

## Worked solution and explanation

### Why this problem exists in real interviews

This is a cohort set-membership question wearing a retention costume. The real skill being probed: can you pick a grain that makes 'did this user come back on day N' a date subtraction over a join, instead of a self-join that rediscovers each user's signup day from a raw event stream? Anyone can name a fact and a few dimensions. The trap is that retention is a moving target (day 1, day 7, day 30, whatever the dashboard asks next), so the moment you collapse retention to one fixed flag on the fact, the model can answer exactly one question and nothing else.

> **Trick to solving**
>
> Recognize the two distinct dates. The signup date is an attribute of the user (the cohort anchor) and belongs on dim_users. The activity date is an attribute of each day the user showed up and belongs on the fact. Day-N return is then activity_date minus signup_date. Get those two dates on the right tables and retention becomes a plain aggregation.

---

### Break down the requirements

#### Step 1: Declare the activity grain

One row per user per active day. This is a periodic snapshot, not an event log. A user who opens five sessions on Tuesday produces one Tuesday row with additive counts. If you keep per-session rows instead, every retention query needs a COUNT(DISTINCT user) with dedup, and one chatty user inflates the cohort's return rate.

#### Step 2: Anchor the cohort on the user

Put signup_date on dim_users alongside acquisition_channel and signup_country. The cohort a user belongs to never changes, so it is a stable dimension attribute, not a fact measure. This is what lets you slice retention by channel without touching the fact's grain.

#### Step 3: Keep day-N flexible

Anchor the offset on the two dates, not on a fixed flag. The offset is activity_date minus signup_date. Storing a hard-coded day-7 retention flag locks the model to whatever offsets you imagined on day one and forces a full backfill the first time someone asks for day-3. Materializing a per-row days_since_signup column is a defensible denormalization since it still supports any offset; what you must avoid is collapsing retention to a single precomputed answer.

#### Step 4: Decide star vs snowflake deliberately

Conform dim_date and dim_users and keep the model a star. Acquisition channel is stable and low cardinality, so it stays denormalized on dim_users. Device platform and OS version repeat across hundreds of millions of rows and evolve on their own cadence, so a separate dim_device (a snowflake branch) earns its extra join. Snowflaking everything by reflex just adds joins to every slice.

---

### The reference model

Below is one defensible design: a star centered on a daily activity snapshot, with dim_users carrying the cohort anchor and a snowflaked dim_device for the repeating platform attributes. The grain of fact_user_daily_activity is one row per user per active day, which keeps every measure additive and every retention question a join plus a date difference.

> **Why this works**
>
> The fact remembers only that a user was active on a given day. Signup lives on the user, activity lives on the fact, and the offset between them is computed when asked. That separation is what lets one model answer day-1, day-7, and day-90 retention, plus resurrection, without a schema change.

> **Interviewers watch for**
>
> A strong candidate states the grain out loud first, names signup_date as a dimension attribute, and refuses to collapse retention into a single fixed flag. They also justify the snowflake on dim_device with cardinality and churn rather than snowflaking everything because the prompt mentioned the word.

> **Common pitfall**
>
> Keeping the raw per-session event log and computing retention with a self-join that re-derives each user's first-active date, then matches rows exactly N days later. It double counts multi-session days unless you dedup, and the self-join explodes at hundreds of millions of daily rows. A second pattern to watch: duplicating signup_date onto every fact row. It is a defensible read-speed denormalization, but it drifts if a backfill touches only some rows, so the cohort anchor still belongs on dim_users as the source of truth.

> **How it scales**
>
> At 300 million daily actives the fact grows by roughly 300 million rows a day. Partition by date_key so a 90-day retention window prunes to 90 partitions instead of a full scan. Because each active day is already collapsed to one additive row, the cohort aggregation is a single pass with no distinct-count dedup on the hot path.

---

### The analysis pattern

**Day-N return rate by cohort and channel**

```sql
SELECT
    u.signup_date,
    u.acquisition_channel,
    (d.full_date - u.signup_date) AS days_since_signup,
    COUNT(DISTINCT a.user_key) AS returning_users
FROM fact_user_daily_activity a
JOIN dim_users u ON u.user_key = a.user_key
JOIN dim_date d ON d.date_key = a.date_key
WHERE u.signup_date >= DATE '2026-01-01'
  AND d.full_date BETWEEN u.signup_date AND u.signup_date + 90
GROUP BY u.signup_date, u.acquisition_channel, (d.full_date - u.signup_date)
ORDER BY u.signup_date, days_since_signup
```

*Retention is a join plus a date difference; no self-join over raw events.*

---

### Trade-offs and alternatives

**Daily snapshot fact**

One additive row per user per active day.

* Retention is a plain join plus date diff
* No dedup on the hot path
* Partitions prune to the window
* Slightly more storage than nothing

**Raw event log only**

Every session or view kept as a row.

* Maximum flexibility, nothing lost
* Retention needs a self-join to find day N
* Multi-session days must be deduped
* Scans blow up at 300M actives per day

---

## Common follow-up questions

- The team now wants resurrection: users who were dormant for 28 days and then returned. Does your model already support it? _(Tests whether the candidate sees that absence of rows plus a row at offset N answers resurrection without a schema change.)_
- Acquisition channel turns out to be re-attributed weeks after signup for some users. How does that change dim_users? _(Tests SCD reasoning and whether a re-attributed cohort key should version or stay fixed.)_
- How would you partition and cluster fact_user_daily_activity so both 7-day and 90-day cohort scans stay cheap? _(Tests partition-by-date_key plus clustering on user_key or cohort for pruning at scale.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/who-comes-back)
- [Data Modeling Interview Questions](https://datadriven.io/data-modeling-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.