# Session Count Distribution

> How are sessions distributed among the newest users?

Canonical URL: <https://datadriven.io/problems/session_count_distribution>

Domain: SQL · Difficulty: hard · Seniority: L4

## Problem

For users who signed up between 2024 and 2026, count how many sessions each had in February 2026, then show the session count and how many users had that exact count, from fewest sessions to most.

## Worked solution and explanation

### Why this problem exists in real interviews

Built around the `users` and `user_sessions` tables, this challenge probes your ability to apply date extraction for time bucketing in a session analysis setting. Correctly referencing columns like `user_id`, `signup_date`, and `session_start` is essential to a working solution.

> **Trick to Solving**
>
> Complex queries with multiple transformation stages are best solved by writing one CTE per stage.
> 
> 1. Identify the intermediate results needed
> 2. Write each as a named CTE
> 3. Chain them: each CTE reads from the previous one
> 4. The final SELECT assembles the output

---

### Break down the requirements

#### Step 1: Structure the logic with 2 CTEs

Break the problem into named stages: `qualifying_users`, `feb_sessions`. Each CTE isolates one transformation, making the query readable and debuggable.

#### Step 2: Join `users` to `user_sessions`

The join connects the two tables on their shared key. This brings the columns needed for filtering and aggregation into a single row set.

#### Step 3: Apply the range filter

The WHERE clause restricts rows to the target range. Applying this filter early reduces the volume flowing into downstream operations.

---

### The solution

**Structure the logic with 2 ctes to find session count distribution**

```sql
WITH qualifying_users AS (
    SELECT user_id FROM users
    WHERE signup_date BETWEEN '2024-01-01' AND '2026-12-31'
),
feb_sessions AS (
    SELECT us.user_id, COUNT(*) AS session_count
    FROM user_sessions us
    INNER JOIN qualifying_users qu ON us.user_id = qu.user_id
    WHERE strftime('%Y-%m', us.session_start) = '2026-02'
    GROUP BY us.user_id
)
SELECT session_count, COUNT(*) AS user_count
FROM feb_sessions
GROUP BY session_count
ORDER BY session_count ASC
```

> **Cost Analysis**
>
> With ~115M rows, the GROUP BY reduces the working set before any downstream operations; the join cost depends on the smaller table's cardinality; CTEs materialize intermediate results, which can be beneficial or costly depending on the engine. An index on the filter/join columns would reduce the scan to a seek.

> **Interviewers Watch For**
>
> Interviewers watch for whether you decompose the problem into named, testable stages rather than nesting everything; how you handle date arithmetic and whether you account for edge cases like month boundaries; whether you use a subquery or self-join, and can explain the tradeoffs.

> **Common Pitfall**
>
> Returning extra columns that the prompt did not ask for, or using the wrong column alias, causes a grading mismatch even when the logic is correct.

---

## Common follow-up questions

- What would happen to your result if `user_sessions.session_duration_sec` contained duplicate values that you did not expect? _(Tests whether the candidate considers data quality issues in `session_duration_sec` and uses DISTINCT or deduplication where needed.)_
- If `user_sessions` grew to contain billions of rows, which part of your query would become the bottleneck given the cardinality of `session_id`? _(Tests ability to identify performance hotspots related to `user_sessions.session_id` at scale.)_
- Would materializing the CTE as a temp table improve performance when joining `users` and `user_sessions`? _(Tests understanding of CTE materialization behavior across different database engines.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/session_count_distribution)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.