# Most Active Chat Users

> The loudest voices on the platform.

Canonical URL: <https://datadriven.io/problems/most_active_chat_users>

Domain: SQL · Difficulty: medium · Seniority: L4

## Problem

Rank all chat users by total message volume so that tied counts share the same rank with no gaps. Show each user's rank, ID, and total messages, from most to least active.

## Worked solution and explanation

### Why this problem exists in real interviews

The chat_msgs table contains channel and content values that must be processed with dense ranking and grouping. This appears in mid-level screens to probe whether you reason about the correct aggregation grain before writing any window or GROUP BY clause.

> **Trick to Solving**
>
> Look for language about ties or "include all at position N." This signals `DENSE_RANK` over `ROW_NUMBER` or `LIMIT`.
> 
> 1. Identify tie-inclusion language in the prompt
> 2. Use `DENSE_RANK()` instead of `ROW_NUMBER()` or `LIMIT`
> 3. Aggregate to the correct grain before ranking

---

### Break down the requirements

#### Step 1: Rank with DENSE_RANK for tie inclusion

`DENSE_RANK()` assigns the same rank to tied values and never skips numbers. This ensures all tied rows appear in the result.

#### Step 2: Aggregate with COUNT

Group by the output grain and apply `COUNT()` to compute the metric. The `GROUP BY` must match exactly what the output needs: one row per group key.

#### Step 3: Order the final output

Apply `ORDER BY` as specified to produce the expected row sequence. When tied values exist, add a secondary sort column for determinism.

---

### The solution

**DENSE_RANK on aggregate for tie-inclusive ranking**

```sql
SELECT DENSE_RANK() OVER (ORDER BY COUNT(*) DESC) AS rnk,
    sender_id,
    COUNT(*) AS total_messages
FROM chat_msgs
GROUP BY sender_id
ORDER BY rnk
```

> **Cost Analysis**
>
> The query scans 40M rows from `chat_msgs`. The window function requires a sort, which is O(n log n). Pre-aggregating reduces the sort input. The aggregation reduces the row count before any downstream processing, which is the key performance lever. CTEs in most engines are optimization fences. For production workloads, consider inlining or materializing the intermediate results.

> **Interviewers Watch For**
>
> Strong candidates explain their choice of window function (`ROW_NUMBER` vs `RANK` vs `DENSE_RANK`) and why it matches the tie semantics. Walking through comparison logic step by step, rather than writing it in one pass, demonstrates structured thinking.

> **Common Pitfall**
>
> Applying a window function before aggregating inverts the grain. Always aggregate first, then rank or compare across groups.

---

## Common follow-up questions

- What happens to your result if chat_msgs.reply_to contains NULLs for some rows? _(Tests whether the candidate accounts for NULL behavior in aggregates and comparisons on reply_to.)_
- If two rows in chat_msgs have identical values in the ORDER BY columns, how does your ranking handle the tie? _(Tests understanding of RANK vs DENSE_RANK vs ROW_NUMBER tie-breaking behavior.)_
- With millions of distinct values in chat_msgs.msg_id, what index strategy would you use to keep this query performant? _(Tests indexing knowledge specific to high-cardinality columns like msg_id.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/most_active_chat_users)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.