# Low-Volume Stream Topics

> Quiet topics in the stream.

Canonical URL: <https://datadriven.io/problems/low_volume_stream_topics>

Domain: SQL · Difficulty: medium · Seniority: L3

## Problem

The streaming platform has some low-traffic topics that might be candidates for compaction. Find topics with fewer than 11 messages, showing the topic and its message count.

## Worked solution and explanation

### Why this problem exists in real interviews

Against the stream_msgs table, HAVING filter and grouping on topic values is the key operation. Interviewers favor this in mid-level screens because it exposes whether candidates handle ties, NULLs, and ordering edge cases correctly.

> **Trick to Solving**
>
> Read the prompt carefully for implicit constraints. The phrase structure hints at the grain of the output: what each row represents.
> 
> 1. Identify the output grain from the prompt (one row per what?)
> 2. Work backward from the desired output columns
> 3. Build the query inside-out: innermost subquery first, then layer on filters and aggregates

---

### Break down the requirements

#### Step 1: Aggregate with COUNT

Group by the output grain and apply `COUNT()` to compute the metric. The `GROUP BY` must match exactly what the output needs: one row per group key.

#### Step 2: Filter groups with HAVING

The `HAVING` clause filters after aggregation, unlike `WHERE` which filters before. This is necessary when the condition depends on an aggregate result.

#### Step 3: Order the final output

Apply `ORDER BY` as specified to produce the expected row sequence. When tied values exist, add a secondary sort column for determinism.

---

### The solution

**HAVING for post-aggregation threshold**

```sql
SELECT topic, COUNT(*) AS message_count
FROM stream_msgs
GROUP BY topic
HAVING COUNT(*) < 11
ORDER BY message_count ASC
```

> **Cost Analysis**
>
> The query scans 600M rows from `stream_msgs`. The aggregation reduces the row count before any downstream processing, which is the key performance lever. CTEs in most engines are optimization fences. For production workloads, consider inlining or materializing the intermediate results.

> **Interviewers Watch For**
>
> Naming the output grain ("one row per X") before writing the GROUP BY shows you think about data shape, not just syntax. Breaking complex logic into named CTEs shows the interviewer you prioritize readability and debuggability.

> **Common Pitfall**
>
> Comparing dates stored as TEXT without casting can produce lexicographic instead of chronological ordering. Always confirm the column type.

---

## Common follow-up questions

- What happens to your result if stream_msgs.part_key contains NULLs for some rows? _(Tests whether the candidate accounts for NULL behavior in aggregates and comparisons on part_key.)_
- What is the difference between filtering in WHERE versus HAVING for this query against stream_msgs? _(Tests whether the candidate understands pre-aggregation vs post-aggregation filtering.)_
- With millions of distinct values in stream_msgs.msg_id, what index strategy would you use to keep this query performant? _(Tests indexing knowledge specific to high-cardinality columns like msg_id.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/low_volume_stream_topics)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.