# Beyond the Signup

> Anyone can create an account. Fewer actually return.

Canonical URL: <https://datadriven.io/problems/active_users_by_session_count>

Domain: SQL · Difficulty: medium · Seniority: L4

## Problem

The product org tracks engagement trends month over month. For the last 6 months, show the count of unique active users and the average session duration per month. Only include months where the total number of sessions exceeded 3. Present the results chronologically.

## Worked solution and explanation

### What this is really asking

`date('now', '-6 months')` is a rolling cutoff, not a calendar boundary, so the earliest bucket is a partial month. Group by `strftime('%Y-%m', session_start)`, count distinct users, average duration, drop tiny months.

---

### Break down the requirements

#### Step 1: Filter the partitioned column directly

`session_start >= date('now', '-6 months')` lets the partition pruner skip 54 of every 60 monthly partitions on the 60M-row table. Wrapping it in strftime first would defeat that.

#### Step 2: Bucket by month, two aggregates side by side

`COUNT(DISTINCT user_id)` gives unique actives; `AVG(session_duration_sec)` is a row-level average, weighted by session count, not by user.

#### Step 3: HAVING runs after GROUP BY

`HAVING COUNT(*) > 3` drops months with fewer than four sessions total. On a 60M-row table this is a no-op, but the clause is what the prompt requires.

---

### The solution

**ACTIVE USERS BY SESSION COUNT**

```sql
SELECT strftime('%Y-%m', us.session_start) AS month,
       COUNT(DISTINCT us.user_id) AS active_users,
       AVG(us.session_duration_sec) AS avg_duration_sec
FROM user_sessions us
WHERE us.session_start >= date('now', '-6 months')
GROUP BY month
HAVING COUNT(*) > 3
ORDER BY month
```

> **Cost Analysis**
>
> Predicate is sargable on the partition key, so the scan touches roughly 6M of 60M rows. COUNT(DISTINCT user_id) needs a hash set per month; on 6M rows that fits comfortably in memory. No join, no window.

> **Interviewers Watch For**
>
> Whether you put month boundaries in `WHERE` (sargable, prunes partitions) or in `HAVING` (correct, but reads everything). And whether `AVG(duration)` is weighted by session or by user; the prompt says session-level here.

> **Common Pitfall**
>
> Using `BETWEEN date('now','-6 months') AND date('now')` excludes the current day's late sessions on most engines. Open-ended `>=` is safer and matches the rolling-window intent.

---

### COMMON FOLLOW-UP QUESTIONS

## Common follow-up questions

- How would you change this to full calendar months only, excluding the partial current month? _(Add `AND session_start < date('now','start of month')` so the current month drops out of the window cleanly.)_
- What if average duration should be per user, not per session? _(First aggregate to (user, month, avg_duration), then take AVG of that in the outer query so heavy users do not dominate.)_
- How does the answer change if `users.account_status` matters? _(Join on `user_id` and filter to active accounts in the WHERE clause; keep the join inner so deleted users drop out before the distinct count.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/active_users_by_session_count)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.