# Subscription Churn Analysis Model

> Subscribers are leaving. The data knows why.

Canonical URL: <https://datadriven.io/problems/subscription_churn_analysis_model>

Domain: Data Modeling · Difficulty: medium · Seniority: L5

## Problem

We run a music streaming service like Spotify. Our premium subscribers are churning at 8% monthly and we need to understand why. We want to analyze churn patterns by looking at how their listening habits, playlist activity, and subscription history changed before they cancelled. Design the data model.

## Worked solution and explanation

### Why this problem exists in real interviews

Churn modeling probes whether a candidate can coexist three different grains cleanly: a lifecycle event log, a pre-aggregated daily snapshot, and an atomic play-event fact. The signal is whether the candidate understands when to pay the storage cost of pre-aggregation versus reconstructing daily engagement on every query.

> **Trick to Solving**
>
> The tell is "churn analysis". Churn is inherently a lifecycle question joined to an engagement signal. Before drawing any tables, a strong candidate asks: at what grain do analysts need engagement, and how far back? If the answer is "daily, two years", the pre-aggregated daily fact earns its keep.
> 
> 1. Separate lifecycle events from engagement events
> 2. Pre-aggregate daily engagement for BI speed
> 3. Keep fact_play_events atomic for ad-hoc analysis
> 4. Use event_type on fact_subscription_events to cover signup, upgrade, cancel

---

### Break down the requirements

#### Step 1: Lifecycle as an event log

`fact_subscription_events` captures signup, upgrade, downgrade, pause, cancel, reactivate. Append-only, one row per state change, event_type as a discriminator.

#### Step 2: Daily engagement pre-aggregated

`fact_daily_engagement` is one row per user per day with rolled-up sessions, minutes, and distinct titles. This is the table most dashboards hit; it pays for itself the first time a cohort analysis runs.

#### Step 3: Atomic play events for deep dives

`fact_play_events` is the raw source. It feeds `fact_daily_engagement` and stays available for content-level analytics, A/B tests, and anomaly investigations.

#### Step 4: Conform dim_users and dim_content

Both dimensions are shared across all three facts. Conformity makes "did users who canceled last month watch less content in week 3" a single query.

---

### The solution

Below is one defensible approach. The coexistence of three grains is the anchor: lifecycle, daily, and atomic each serve different query patterns without collapsing into one table.

> **Why This Design Works**
>
> Pre-aggregation is a deliberate storage-for-speed trade. The daily fact is 1/100th the size of the play-event fact and serves the 90% of queries that only need daily granularity. The atomic fact stays available for the 10% that need content-level detail. This is the classic pattern for retention and churn analytics at scale.

> **Interviewers Watch For**
>
> Strong candidates explicitly name the three grains and justify the daily pre-aggregation with a query frequency argument. They also mention that `fact_daily_engagement` must be idempotently rebuildable from `fact_play_events` for late-arriving data. Weaker candidates build only the atomic fact and watch dashboards time out.

> **Common Pitfall**
>
> Mutating `fact_daily_engagement` in place when a late play event arrives. The daily table should be rebuilt for affected days from `fact_play_events`, not patched, otherwise drift between the two is inevitable.

---

### The analysis pattern

**Seven-day engagement before cancellation**

```sql
WITH cancels AS (
    SELECT user_sk, event_timestamp::date AS cancel_date
    FROM fact_subscription_events
    WHERE event_type = 'cancel'
      AND event_timestamp >= '2025-01-01'
)
SELECT
    u.plan_tier,
    u.country,
    AVG(e.minutes_watched) AS avg_daily_minutes,
    AVG(e.distinct_titles) AS avg_daily_titles,
    COUNT(DISTINCT c.user_sk) AS cancelled_users
FROM cancels c
JOIN dim_users u ON u.user_sk = c.user_sk
JOIN fact_daily_engagement e
    ON e.user_sk = c.user_sk
   AND e.engagement_date BETWEEN c.cancel_date - INTERVAL '7 days' AND c.cancel_date - INTERVAL '1 day'
GROUP BY u.plan_tier, u.country
ORDER BY avg_daily_minutes
```

---

### Trade-offs and alternatives

**Three-grain layered model**

Fast dashboards, atomic detail when needed, idempotent rebuilds of the daily fact. Cost: ETL must handle backfills correctly and storage footprint is roughly 1.01x the atomic fact alone.

**Atomic fact only**

Single source of truth, no derivation pipeline. Cost: every daily roll-up query scans the full play-event fact, and dashboards slow proportionally with watch volume.

---

## Common follow-up questions

- A two-week-late batch of play events arrives. How do you update fact_daily_engagement without drift? _(Tests idempotent rebuild of affected daily partitions from fact_play_events.)_
- Product asks for content-level retention curves. Is fact_daily_engagement enough? _(Tests whether the candidate recognizes that content-level questions need fact_play_events.)_
- The platform grows to 500M users with 2B play events per day. How does the model partition? _(Tests partitioning by engagement_date and started_at with clustering on user_sk.)_
- A paused subscription counts as neither active nor cancelled. How does churn math handle it? _(Tests event_type vocabulary and the definition of churn as a function of lifecycle states.)_
- Content is delisted mid-year. Does dim_content need Type 2 SCD? _(Tests whether historical play events should still resolve to the original genre classification.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/subscription_churn_analysis_model)
- [Data Modeling Interview Questions](https://datadriven.io/data-modeling-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.