# Chat Activity

> Which channels are ghost towns?

Canonical URL: <https://datadriven.io/problems/chat_activity>

Domain: SQL · Difficulty: easy · Seniority: L3

## Problem

The community team is evaluating channel health and wants to know which chat channels are active and which are dead weight. For each channel, show the total number of messages, the number of unique senders, and the percentage of messages that have been edited. Round the edited percentage to one decimal place. Only include channels with at least five messages, sorted from most to fewest.

## Worked solution and explanation

### Why this problem exists in real interviews

This tests multi-metric aggregation with a percentage calculation and HAVING filter. Interviewers probe whether you can mix COUNT, COUNT(DISTINCT), and conditional aggregation in a single GROUP BY query.

---

### Break down the requirements

#### Step 1: Group by channel

`GROUP BY channel` produces one row per channel.

#### Step 2: Compute three metrics

`COUNT(*)` for total messages, `COUNT(DISTINCT sender_id)` for unique senders, and `ROUND(100.0 * SUM(CASE WHEN edited = 1 THEN 1 ELSE 0 END) / COUNT(*), 1)` for edited percentage.

#### Step 3: Filter and sort

`HAVING COUNT(*) >= 5` removes low-traffic channels. `ORDER BY COUNT(*) DESC` ranks most active first.

---

### The solution

**Multi-metric channel health report**

```sql
SELECT
    channel,
    COUNT(*) AS total_messages,
    COUNT(DISTINCT sender_id) AS unique_senders,
    ROUND(100.0 * SUM(CASE WHEN edited = 1 THEN 1 ELSE 0 END) / COUNT(*), 1) AS edited_pct
FROM chat_msgs
GROUP BY channel
HAVING COUNT(*) >= 5
ORDER BY total_messages DESC
```

> **Cost Analysis**
>
> Single scan of 10M rows. The COUNT(DISTINCT sender_id) requires maintaining a hash set per channel, but channel cardinality is typically low (hundreds), keeping memory usage reasonable.

> **Interviewers Watch For**
>
> Whether the candidate uses `edited = 1` or `edited = true` depends on the column type. Strong candidates ask about the data type or use a safe comparison like `edited::BOOLEAN` or `edited != 0`.

> **Common Pitfall**
>
> Using `AVG(edited)` works if the column is 0/1, but fails if it contains other values. Explicit conditional aggregation with CASE is safer and more readable.

---

## Common follow-up questions

- How would you identify channels where the edited percentage is increasing over time? _(Tests time-series analysis with LAG or monthly aggregation.)_
- What if edited is a boolean column, not integer? _(Tests casting: SUM(edited::INT) or COUNT(*) FILTER (WHERE edited).)_
- How would you also show the most recent message timestamp per channel? _(Adding MAX(sent_at) to the SELECT list.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/chat_activity)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.