# Active User Penetration Rate

> How much of the user base is actually alive?

Canonical URL: <https://datadriven.io/problems/active_user_penetration_rate>

Domain: SQL · Difficulty: hard · Seniority: L5

## Problem

The product analytics team defines an active user on a given device type as someone whose most recent session on that device is within the last 30 days, who has logged at least 5 sessions, and whose total session time on that device is at least 10 hours (36,000 seconds). For each device type, calculate the penetration rate: the number of active users divided by the total number of users on that device. Include device types where no users meet the active threshold, showing a zero rate. Return the device type and penetration rate.

## Worked solution and explanation

### Why this problem exists in real interviews

Product analytics teams probe whether you anchor the denominator before joining. Penetration rate is a filtered cohort over a full population, and getting the LEFT JOIN direction wrong silently drops every `device_type` with zero active users from the output. The interviewer wants to hear you say 'denominator first, numerator joined in' out loud.

---

### Break down the requirements

#### Step 1: Per-user-per-device aggregates

Join `user_sessions` to `devices` on `device_id`, then `GROUP BY user_id, device_type`. Emit `MAX(session_start)`, `COUNT(*)`, `SUM(session_duration_sec)`. One row per (user, device_type).

#### Step 2: Apply the active filter

From those aggregates, keep rows where recency, session count, and total seconds all clear the thresholds. `COUNT(DISTINCT user_id) GROUP BY device_type` gives the numerator per device.

#### Step 3: Total users per device

Separate CTE: same join, `COUNT(DISTINCT user_id) GROUP BY device_type` on the unfiltered population. This is the denominator and the spine of the final result.

#### Step 4: LEFT JOIN total to active

`total_users` on the left, `active_users` on the right. Wrap the numerator in `COALESCE(active_count, 0)` and `CAST` to `REAL` before dividing so device_types with zero actives stay in the output.

---

### The solution

**PENETRATION RATE BY DEVICE TYPE**

```sql
WITH user_stats AS (
  SELECT
    us.user_id,
    d.device_type,
    MAX(us.session_start) AS last_session,
    COUNT(*) AS session_count,
    SUM(us.session_duration_sec) AS total_duration
  FROM user_sessions us
  INNER JOIN devices d ON us.device_id = d.device_id
  GROUP BY us.user_id, d.device_type
),
active_users AS (
  SELECT
    device_type,
    COUNT(DISTINCT user_id) AS active_count
  FROM user_stats
  WHERE julianday('now') - julianday(last_session) <= 30
    AND session_count >= 5
    AND total_duration >= 36000
  GROUP BY device_type
),
total_users AS (
  SELECT
    d.device_type,
    COUNT(DISTINCT us.user_id) AS total_count
  FROM user_sessions us
  INNER JOIN devices d ON us.device_id = d.device_id
  GROUP BY d.device_type
)
SELECT
  tu.device_type,
  CAST(COALESCE(au.active_count, 0) AS REAL) / tu.total_count AS penetration_rate
FROM total_users tu
LEFT JOIN active_users au ON tu.device_type = au.device_type
```

> **Cost Analysis**
>
> `user_sessions` at 120M rows joined to 15M `devices` is scanned twice (once per CTE that builds aggregates). `GROUP BY user_id, device_type` is very high cardinality and shuffles hard. In production, materialize a daily `user_device_stats` rollup keyed on (user_id, device_type) and read from that.

> **Interviewers Watch For**
>
> Before writing, ask: do I include device_types with zero active users? If yes, the denominator drives the row set, so `total_users` is the left side of the join. Say that out loud before writing; otherwise you stumble into the zero-handling later.

> **Common Pitfall**
>
> Reversing the join, `LEFT JOIN total_users` onto `active_users`, silently drops every device_type with no actives. The result still looks reasonable, which is what makes it dangerous in a review. Always anchor on the denominator for ratio queries.

---

### COMMON FOLLOW-UP QUESTIONS

## Common follow-up questions

- Why is `julianday('now')` a problem for reproducibility? _(Probes whether you flag non-deterministic time functions that break backfills and replays.)_
- How would you rewrite this against a daily `user_device_stats` rollup? _(Tests whether you recognize the 120M-row scan as a rollup candidate and can name the grain.)_
- What changes if a user appears under multiple `device_id` values mapping to the same `device_type`? _(Checks your grasp of `COUNT(DISTINCT user_id)` semantics versus session-level counts.)_
- How would you break out penetration rate by both `device_type` and `os_name`? _(Forces you to think about cardinality and whether the denominator changes per slice.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/active_user_penetration_rate)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.