# Frequent Message Senders

> Someone is sending too many messages.

Canonical URL: <https://datadriven.io/problems/frequent_message_senders>

Domain: SQL · Difficulty: medium · Seniority: L3

## Problem

Surface all sender IDs who have posted more than one message, along with each sender's message count. A downstream anomaly-detection system needs these high-frequency senders.

## Worked solution and explanation

### What this is really asking

`sender_id` is the grain, not `msg_id`. Collapse the 50M message rows to one row per sender, count each sender's messages, and keep only senders with more than one. The anomaly job needs the count attached to each sender.

---

### Break down the requirements

#### Step 1: Aggregate by sender_id

GROUP BY sender_id with COUNT(*) yields one row per sender carrying their total message count.
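
As a minimal sketch of this step in isolation, the query below aggregates without any filter, so senders with a single message still appear in the output. It uses only the `chat_msgs` table and `sender_id` column named in this problem.

```sql
-- Step 1 only: one row per sender with its total message count.
-- No filter yet, so single-message senders are still included.
SELECT
    sender_id,
    COUNT(*) AS msg_count
FROM chat_msgs
GROUP BY sender_id;
```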

#### Step 2: Filter on the aggregate

The predicate is on COUNT(*), so it belongs in HAVING. WHERE runs before the group forms and cannot see the count.

#### Step 3: Sort by frequency

ORDER BY msg_count DESC puts the loudest senders first, which is what the downstream job wants to scan.

---

**FREQUENT SENDERS**

```sql
SELECT
    sender_id,
    COUNT(*) AS msg_count
FROM chat_msgs
GROUP BY sender_id
HAVING COUNT(*) > 1
ORDER BY msg_count DESC;
```
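
To try the query on a small data set, the sketch below builds a throwaway table and runs the same logic against it. The table name `chat_msgs_demo`, the column types, and the sample rows are assumptions for illustration; only `msg_id`, `sender_id`, and `sent_at` are named in this problem.

```sql
-- Hypothetical demo table: column types are assumed, not taken from the
-- real chat_msgs schema.
CREATE TEMP TABLE chat_msgs_demo (
    msg_id    BIGINT PRIMARY KEY,
    sender_id BIGINT NOT NULL,
    sent_at   TIMESTAMPTZ NOT NULL
);

INSERT INTO chat_msgs_demo (msg_id, sender_id, sent_at) VALUES
    (1, 101, '2024-01-01 09:00+00'),
    (2, 101, '2024-01-01 09:05+00'),
    (3, 101, '2024-01-01 09:10+00'),
    (4, 102, '2024-01-01 10:00+00'),  -- sender 102 has a single message
    (5, 103, '2024-01-01 11:00+00'),
    (6, 103, '2024-01-01 11:30+00');

-- Same shape as the solution, pointed at the demo table.
SELECT
    sender_id,
    COUNT(*) AS msg_count
FROM chat_msgs_demo
GROUP BY sender_id
HAVING COUNT(*) > 1
ORDER BY msg_count DESC;
-- Returns (101, 3) then (103, 2); sender 102 is filtered out by HAVING.
```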

> **Cost Analysis**
>
> A full scan of all 50M rows is unavoidable because every sender must be counted. The table is partitioned on sent_at and the query has no sent_at filter, so partition pruning cannot help. A hash aggregate keeps state proportional to the number of distinct senders, not the row count.
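
One way to check these cost claims in a Postgres sandbox is to inspect the plan. This is only a sketch: the actual plan depends on table statistics, configuration, and partition layout, so no particular output is assumed here.

```sql
-- EXPLAIN shows the planned strategy without executing the query.
-- Look at which aggregate node is chosen (HashAggregate or GroupAggregate)
-- and whether every partition of chat_msgs appears in the scan.
EXPLAIN
SELECT
    sender_id,
    COUNT(*) AS msg_count
FROM chat_msgs
GROUP BY sender_id
HAVING COUNT(*) > 1
ORDER BY msg_count DESC;
```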

> **Interviewers Watch For**
>
> Whether you reach for HAVING reflexively or wrap the GROUP BY in a subquery and filter outside. Both work, and many optimizers produce the same plan for either, but HAVING is shorter and shows you know the logical clause order.

> **Common Pitfall**
>
> Writing `COUNT(DISTINCT msg_id)`. Each msg_id is already unique, so the DISTINCT adds a deduplication step (a sort or a hash, depending on the engine) without changing the result. Plain COUNT(*) is correct and faster.

> **The False Start**
>
> First instinct is `WHERE COUNT(*) > 1`. The engine rejects it: WHERE evaluates before grouping, so COUNT does not exist yet. Pivot to HAVING, which runs after GROUP BY and can see the aggregate.
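
The false start and its fix can be sketched side by side. The rejected statement is left commented out so the block runs as written. The `sender_id IS NOT NULL` predicate is an illustrative assumption, included only to show where a row-level filter belongs.

```sql
-- Invalid: aggregates are not allowed in WHERE, because WHERE filters
-- individual rows before any groups exist. Postgres rejects this statement.
-- SELECT sender_id, COUNT(*) AS msg_count
-- FROM chat_msgs
-- WHERE COUNT(*) > 1
-- GROUP BY sender_id;

-- Valid: WHERE handles row-level predicates, HAVING handles group-level ones.
SELECT
    sender_id,
    COUNT(*) AS msg_count
FROM chat_msgs
WHERE sender_id IS NOT NULL      -- row-level filter, applied before grouping
GROUP BY sender_id
HAVING COUNT(*) > 1;             -- group-level filter, applied after grouping
```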

---

## Common follow-up questions

- How would you restrict this to the last 24 hours of messages? _(Add `WHERE sent_at >= NOW() - INTERVAL '1 day'` so the partition key prunes the scan.)_
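- What if the system only cares about senders active in multiple channels? _(Keep the grouping on sender_id and count distinct channels: `GROUP BY sender_id HAVING COUNT(DISTINCT channel) > 1`. Grouping by `sender_id, channel` would instead change the grain to sender-channel and return pairs with repeated messages, not senders who span channels.)_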
- How would you express this without HAVING? _(Wrap the aggregation in a CTE and filter on `msg_count > 1` in an outer WHERE. Same plan in most engines but more verbose.)_
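
The follow-ups above could be sketched as the three queries below. These are illustrative variants, not reference answers. The `channel` column is assumed to exist on `chat_msgs`, and the second query reads "active in multiple channels" as "has messages in at least two distinct channels".

```sql
-- Follow-up 1: last 24 hours only. The sent_at predicate is row-level,
-- so it goes in WHERE and is applied before grouping.
SELECT sender_id, COUNT(*) AS msg_count
FROM chat_msgs
WHERE sent_at >= NOW() - INTERVAL '1 day'
GROUP BY sender_id
HAVING COUNT(*) > 1;

-- Follow-up 2 (one interpretation): senders with messages in two or more
-- distinct channels. Assumes a hypothetical `channel` column.
SELECT sender_id, COUNT(DISTINCT channel) AS channel_count
FROM chat_msgs
GROUP BY sender_id
HAVING COUNT(DISTINCT channel) > 1;

-- Follow-up 3: same result as the HAVING version, written with a CTE
-- and an outer WHERE on the aliased count.
WITH sender_counts AS (
    SELECT sender_id, COUNT(*) AS msg_count
    FROM chat_msgs
    GROUP BY sender_id
)
SELECT sender_id, msg_count
FROM sender_counts
WHERE msg_count > 1
ORDER BY msg_count DESC;
```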

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/frequent_message_senders)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.