# Activity Histogram

> How many users did X things? Build the distribution.

Canonical URL: <https://datadriven.io/problems/activity_histogram>

Domain: SQL · Difficulty: easy · Seniority: L3

## Problem

Retention modeling suspects the platform has a bimodal usage pattern: a cluster of one-and-done visitors and a cluster of heavy regulars with little in between. Build a histogram of session counts so the team can see how many users fall at each level of activity, from lowest to highest.

## Worked solution and explanation

### Why this problem exists in real interviews

The core skill being tested is grouped aggregation combined with nested subqueries over `user_sessions`. Candidates must decide how `session_start`, `session_duration_sec`, `pages_viewed` interact before choosing a join strategy or aggregation level.

---

### Break down the requirements

#### Step 1: Count sessions per user

`GROUP BY user_id` with `COUNT(*)` gives each user's session count.

#### Step 2: Count users per session-count level

The outer query groups by session_count and counts how many users fall at each level.

#### Step 3: Order by session count ascending

Low to high ordering reveals the distribution shape: bimodal, normal, or skewed.

---

### The solution

**CTE-decompose for activity histogram**

```sql
WITH user_counts AS (
    SELECT user_id, COUNT(*) AS session_count
    FROM user_sessions
    GROUP BY user_id
)
SELECT session_count, COUNT(*) AS user_count
FROM user_counts
GROUP BY session_count
ORDER BY session_count
```

> **Cost Analysis**
>
> The main table has 40M rows (15 GB). Partitioned on `session_start`, so queries filtering on that column skip most partitions. The GROUP BY reduces the row count early, keeping downstream operations cheap.

> **Interviewers Watch For**
>
> Strong candidates state the correct `GROUP BY` grain before writing any SQL, showing they think about the output shape first. Interviewers look for correct CASE ordering; putting the most specific condition first prevents logic bugs.

> **Common Pitfall**
>
> Selecting a non-aggregated column without including it in `GROUP BY` is the most common error. Some engines reject it; others silently return arbitrary values.

---

## Common follow-up questions

- What happens to your results if `session_start` in `user_sessions` contains trailing whitespace or mixed casing? _(Tests awareness of text normalization issues that silently fragment GROUP BY results.)_
- Does the database engine materialize your CTE or inline it? How would you check, and when does it matter? _(Tests understanding of CTE materialization behavior, which varies by engine (PostgreSQL materializes by default before v12).)_
- `session_id` in `user_sessions` has ~40M distinct values. What index strategy keeps your query from doing a full table scan? _(Tests whether the candidate can design indexes for high-cardinality columns and understands selectivity.)_
- If the business definition of `device_id` changed mid-quarter (e.g., a status value was renamed), how would you handle historical consistency? _(Tests awareness of slowly changing dimensions and backward-compatible query design.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/activity_histogram)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.