# Multi-Month Active Users

> Active this month and last month. Who stuck around?

Canonical URL: <https://datadriven.io/problems/multi_month_active_users>

Domain: SQL · Difficulty: hard · Seniority: L4

## Problem

Identify users who logged sessions in at least 3 separate calendar months, based on session_start dates from user_sessions. Return the user ID.

## Worked solution and explanation

### Why this problem exists in real interviews

This evaluates whether you can extract date components, deduplicate at the right level, and apply a threshold filter on a count of distinct time periods. It is a common retention-style query that tests date manipulation and multi-level aggregation.

> **Trick to Solving**
>
> "At least 3 separate calendar months" signals a distinct-count-within-group problem. Spot it when the prompt asks for entities active across multiple distinct time periods.
> 
> 1. Extract the year-month from `session_start`
> 2. `COUNT(DISTINCT year_month)` per `user_id`
> 3. Filter with `HAVING` for the threshold

---

### Break down the requirements

#### Step 1: Extract calendar month from session_start

Use `strftime('%Y-%m', session_start)` to normalize each session timestamp to its year-month. This ensures sessions in the same month are collapsed.

#### Step 2: Count distinct months per user

`COUNT(DISTINCT year_month)` per `user_id` gives the number of separate calendar months each user was active.

#### Step 3: Filter with HAVING

`HAVING COUNT(DISTINCT ...) >= 3` retains only users active in at least 3 months. Return `user_id`.

---

### The solution

**Distinct month count per user**

```sql
SELECT user_id
FROM user_sessions
GROUP BY user_id
HAVING COUNT(DISTINCT strftime('%Y-%m', session_start)) >= 3
```

> **Cost Analysis**
>
> At `user_sessions` (150,000,000 rows), a full table scan is expensive. Partition pruning (if the table is partitioned on the filter column) is the first optimization. A covering index on the `GROUP BY` + filter columns eliminates random I/O. Consider a materialized view for repeated dashboard queries.

> **Interviewers Watch For**
>
> Interviewers evaluate whether you translate the English requirements into the correct SQL clauses on the first attempt. They watch for clean syntax, correct column references, and whether you verify edge cases before declaring the query complete.

> **Common Pitfall**
>
> Forgetting `DISTINCT` in `COUNT(DISTINCT strftime(...))` counts total sessions instead of distinct months. A user with 50 sessions in January would appear to have 50 months of activity.

---

## Common follow-up questions

- How would this query perform on a distributed database like BigQuery or Redshift? _(Tests understanding of shuffle/redistribute costs in MPP systems.)_
- What if the data arrived as a stream rather than a batch table? _(Tests whether the candidate can adapt the logic to a streaming window.)_
- How would you test this query's correctness with a minimal reproducible dataset? _(Tests data engineering rigor: constructing edge-case test fixtures.)_
- What happens if a dependent table is eventually consistent and lags by a few minutes? _(Tests awareness of consistency guarantees and their impact on join correctness.)_
- Could you decompose this into smaller queries for a pipeline DAG? _(Tests whether the candidate can break a complex query into idempotent, testable stages.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/multi_month_active_users)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.