# Threads Excluding User

> Every thread they're not part of.

Canonical URL: <https://datadriven.io/problems/threads_excluding_user>

Domain: SQL · Difficulty: easy · Seniority: L3

## Problem

The trust and safety team needs to audit messaging activity that doesn't involve user 1. A thread is defined as a unique sender_id and reply_to pair. Count how many threads exist where neither the sender_id nor the reply_to is user 1, considering only messages that have a non-null reply_to.

## Worked solution and explanation

### The mental model

Trust and safety teams ask this to see if you remember that `COUNT(DISTINCT a, b)` is not portable SQL and that NULL comparisons are three-valued. Pick a delimiter that can't collide with the column values before you write the concatenation, not after a bug shows up in production.

---

### The three traps

#### Step 1: NULL exclusion is non-trivial

The prompt says to consider only messages with a non-null `reply_to`. Most candidates assume `reply_to != 1` already handles that, since the WHERE clause drops rows where the predicate is not TRUE. It does, because `NULL != 1` evaluates to NULL, not TRUE. But the intent is clearer with `AND reply_to IS NOT NULL` written out. A reviewer scanning the query in six months should not have to remember three-valued logic to understand why null replies got dropped.

#### Step 2: Delimiter collision

If you write `sender_id || reply_to` with no delimiter, the pair (sender=12, reply=3) becomes the string `123`, identical to (sender=1, reply=23). With 30M rows and integer IDs of varying width, this collision is not theoretical. It will silently undercount. A single non-numeric character between the two values fixes it. Engines that support `CONCAT_WS('-', sender_id, reply_to)` make the delimiter explicit and harder to forget.

#### Step 3: Directional vs symmetric threads

The prompt defines a thread as a unique (sender_id, reply_to) pair, so (5, 7) and (7, 5) count as two threads. In a real chat product, those are usually the same conversation seen from both sides. A senior candidate names this assumption out loud before writing SQL, because the symmetric version requires ordering the two IDs first: `LEAST(a,b) || '-' || GREATEST(a,b)`.

---

### The solution

**Count distinct composite keys with a safe delimiter**

```sql
SELECT COUNT(DISTINCT sender_id || '-' || reply_to) AS thread_count
FROM chat_msgs
WHERE sender_id != 1 AND reply_to != 1
  AND reply_to IS NOT NULL
```

> **30M rows, one scan, hash-distinct dominates**
>
> The plan is a single scan of `chat_msgs` with a hash-distinct on the concatenated key. The WHERE filter is very weak selectivity-wise; if there are 100k users, excluding user 1 keeps ~99.998% of rows. The bottleneck is the hash table for the distinct count. The table is partitioned by `sent_at`, but since the prompt has no date predicate, partition pruning gives you nothing. On a warehouse like Snowflake or BigQuery this still finishes in seconds because the hash key is small and the aggregate parallelizes cleanly.

> **What a senior candidate flags**
>
> Three scoping questions earn points before you touch a keyboard. (a) Is the thread (5, 7) the same as (7, 5)? Directional or symmetric? (b) Does `msg_type` matter? Should system-generated notifications be excluded from a trust-and-safety audit of human messaging? (c) The phrase "doesn't involve user 1" is ambiguous. Does it mean user 1 is not on either end of this specific message, or that user 1 never participates in the broader conversation at all? The expected query answers the narrow reading.

> **The seductive non-portable shortcut**
>
> `COUNT(DISTINCT sender_id, reply_to)` works in PostgreSQL and a handful of others. In MySQL, SQLite, Snowflake, and BigQuery it is a syntax error or means something different. Concatenation with an explicit delimiter is the portable form that works everywhere and survives a migration between warehouses.

> **When IDs are strings, the delimiter trap gets worse**
>
> If `sender_id` were a UUID like `7f-3a-...`, then `-` as a delimiter would silently collide with hyphens inside the values. Pick a character or sequence that cannot appear in the data: a non-printable like CHR(31), or a multi-character token like `|||` via `CONCAT_WS`. The choice is data-driven, not stylistic.

---

## Common follow-up questions

- If a thread (5, 7) and (7, 5) are the same logical conversation, how would you rewrite this? _(Hint: order the two IDs before concatenating, e.g. `LEAST(sender_id, reply_to) || '-' || GREATEST(sender_id, reply_to)`. Tests whether the candidate sees the symmetry collapse.)_
- Why does `sender_id != 1` accidentally also filter out rows where `sender_id IS NULL`? _(Tests three-valued logic: `NULL != 1` evaluates to NULL, and a WHERE clause keeps only rows whose predicate is TRUE, so NULL rows are silently dropped.)_
- What is the smallest delimiter choice that is safe when IDs are strings containing hyphens? _(Tests data-aware delimiter selection: a character outside the alphabet of the data, or a multi-character token via CONCAT_WS, or a control character like CHR(31).)_
- How would you count distinct threads broken down by `channel`? _(Tests extending the pattern with GROUP BY: `SELECT channel, COUNT(DISTINCT sender_id || '-' || reply_to) FROM chat_msgs WHERE ... GROUP BY channel`.)_
- Does `COUNT(DISTINCT sender_id || reply_to)` (no delimiter) give the same answer? If not, construct an example where it differs. _(Tests recognition of the collision bug: (12, 3) and (1, 23) both become `123` and merge into one distinct value, undercounting the true number of pairs.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/threads_excluding_user)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.