# Monthly Service Retention

> Users came back. Or they did not.

Canonical URL: <https://datadriven.io/problems/monthly_service_retention>

Domain: SQL · Difficulty: hard · Seniority: L5

## Problem

For December 2025 and January 2026, find the percentage of users active in a given month who also appear in any future month. Treat missing future activity as zero retention. Return the service name, month, and retention percentage.

## Worked solution and explanation

### Why this problem exists in real interviews

Extracting insights from svc_health.svc_name grouped by status via grouping and CTE composition is the central task. It is used in senior-level rounds to test whether you pick the right aggregation function and partition boundary on the first attempt.

> **Trick to Solving**
>
> NULL propagation silently corrupts aggregates and joins. The trick is explicit handling.
> 
> 1. Identify which columns can be NULL (check the schema)
> 2. Use `COALESCE(col, 0)` for numeric defaults or `LEFT JOIN` to preserve unmatched rows
> 3. Verify that your WHERE clause does not accidentally filter out NULLs

---

### Break down the requirements

#### Step 1: Structure the query as multi-step CTEs

This solution uses 3 CTEs to break the logic into readable stages. Each CTE produces an intermediate result that feeds the next.

#### Step 2: Self-join the table

Join `svc_health` to itself to compare or pair rows within the same table. Use an inequality condition to avoid duplicate pairs.

#### Step 3: Filter to the target rows

Apply the date filter using `STRFTIME` to extract and compare the relevant time component. This restricts rows before aggregation.

#### Step 4: Aggregate with COUNT DISTINCT

Group by the output grain and apply `COUNT DISTINCT()` to compute the metric. The `GROUP BY` must match exactly what the output needs: one row per group key.

#### Step 5: Handle NULLs with COALESCE

`COALESCE(expr, 0)` replaces NULL with a default value. This prevents NULL propagation from silently corrupting sums, counts, or display values.

#### Step 6: Order the final output

Apply `ORDER BY` as specified to produce the expected row sequence. When tied values exist, add a secondary sort column for determinism.

---

### The solution

**Period-pair comparison for retention metric**

```sql
WITH monthly_users AS (
    SELECT svc_name,
        STRFTIME('%Y-%m', checked) AS month,
        COUNT(DISTINCT check_id) AS user_count
    FROM svc_health
    WHERE STRFTIME('%Y-%m', checked) IN ('2024-12', '2025-01')
    GROUP BY svc_name, STRFTIME('%Y-%m', checked)
),
curr AS (
    SELECT * FROM monthly_users
),
future AS (
    SELECT svc_name, month
    FROM monthly_users m
    WHERE EXISTS (
        SELECT 1 FROM monthly_users m2
        WHERE m2.svc_name = m.svc_name
            AND m2.month > m.month
    )
)
SELECT c.svc_name, c.month,
    COALESCE(ROUND(f.user_count * 100.0 / c.user_count, 2), 0) AS retention_pct
FROM (SELECT svc_name, month, user_count FROM monthly_users) c
LEFT JOIN (SELECT svc_name, month, user_count FROM monthly_users) f
    ON c.svc_name = f.svc_name AND f.month > c.month
ORDER BY c.svc_name, c.month
```

> **Cost Analysis**
>
> The query scans 50M rows from `svc_health`.

> **Interviewers Watch For**
>
> Explicitly mentioning NULL handling before being asked signals production awareness.

> **Common Pitfall**
>
> A WHERE clause with `= NULL` matches nothing. Use `IS NULL` instead. This is one of the most common SQL mistakes in interviews.

---

## Common follow-up questions

- What happens to your result if svc_health.latency contains NULLs for some rows? _(Tests whether the candidate accounts for NULL behavior in aggregates and comparisons on latency.)_
- How would you handle users in svc_health who have activity before their recorded signup date? _(Tests edge case handling in cohort assignment when data quality is imperfect.)_
- With millions of distinct values in svc_health.check_id, what index strategy would you use to keep this query performant? _(Tests indexing knowledge specific to high-cardinality columns like check_id.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/monthly_service_retention)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.