# Top Chat Contributors

> The ten most active chat users.

Canonical URL: <https://datadriven.io/problems/top_chat_contributors>

Domain: SQL · Difficulty: medium · Seniority: L4

## Problem

The chat_messages table records senders and reply_to references. Find the top 10 most active users by total messages. A user can be either a sender or a receiver (via reply-to). Count both sent and received messages per user, sorted from most active to least, limited to 10.

## Worked solution and explanation

### What this is really asking

`chat_msgs.reply_to` is treated as a user identifier here, even though the column name suggests a message id. The whole problem is: union two per-user counts (sender side, replied-to side), then sum and top-10.

---

### Break down the requirements

#### Step 1: Two contribution streams

Each user accumulates messages via two columns: `sender_id` (always populated) and `reply_to` (nullable). Build each stream independently, then merge.

#### Step 2: Filter nulls before the second GROUP BY

`WHERE reply_to IS NOT NULL` keeps the receiver-side aggregation from producing a NULL bucket that would later collide with real user_ids of zero or get carried into the LIMIT 10.

#### Step 3: UNION ALL, then re-aggregate

UNION ALL preserves duplicate user rows across streams. The outer GROUP BY + SUM collapses them. UNION (distinct) would drop legitimate equal counts.

---

### The solution

**TOP CHAT CONTRIBUTORS**

```sql
WITH user_msgs AS (
  SELECT sender_id AS user_id, COUNT(*) AS msg_count
  FROM chat_msgs
  GROUP BY sender_id
  UNION ALL
  SELECT reply_to AS user_id, COUNT(*) AS msg_count
  FROM chat_msgs
  WHERE reply_to IS NOT NULL
  GROUP BY reply_to
)
SELECT user_id, SUM(msg_count) AS total_messages
FROM user_msgs
GROUP BY user_id
ORDER BY total_messages DESC
LIMIT 10
```

> **Cost Analysis**
>
> Two passes over 40M rows, each pre-aggregated to one row per user before the union. The outer SUM operates on a tiny intermediate set. Partitioning by `sent_at` does not help here; both scans are full-table.

> **Interviewers Watch For**
>
> UNION vs UNION ALL. If you use UNION, two users with identical (user_id, count) collapse into one row and the SUM under-counts. Also the IS NOT NULL guard on `reply_to` is non-optional.

> **Common Pitfall**
>
> Skipping the inner GROUP BY and unioning raw rows scans the table twice with no early reduction. On 40M rows the planner may spill. Pre-aggregate per stream first.

---

### COMMON FOLLOW-UP QUESTIONS

## Common follow-up questions

- If `reply_to` stored a message id instead of a user id, how would the query change? _(You would self-join `chat_msgs` to itself on `reply_to = msg_id` to recover the parent sender, then count that sender.)_
- How would you scope this to a single channel or time window? _(Push the filter into both inner selects so the partition pruning on `sent_at` kicks in before aggregation.)_
- What if you needed sender and receiver totals as separate columns, not summed? _(Use conditional aggregation: SUM(CASE WHEN role='sender' THEN msg_count END) per side after the UNION ALL.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/top_chat_contributors)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.