# Distinct Chat Conversations

> How many unique conversations?

Canonical URL: <https://datadriven.io/problems/distinct_chat_conversations>

Domain: SQL · Difficulty: medium · Seniority: L4

## Problem

How many unique conversations exist in chat_msgs? Two users share one conversation regardless of who sent first, so (A,B) and (B,A) count as the same conversation. Return a single count.

## Worked solution and explanation

### Why this problem exists in real interviews

This tests deduplication of symmetric pairs. Counting unique conversations where (A,B) = (B,A) probes whether you can normalize pair ordering before counting.

> **Trick to Solving**
>
> "(A,B) and (B,A) count as the same conversation" means you need to normalize the pair. The trick is `LEAST(sender, receiver)` and `GREATEST(sender, receiver)` to create a canonical pair, then COUNT DISTINCT.
> 
> 1. Identify sender and receiver from the message structure
> 2. Normalize with LEAST/GREATEST
> 3. COUNT DISTINCT pairs

---

### Break down the requirements

#### Step 1: Identify conversation participants

From `chat_msgs`, each message has a `sender_id` and `reply_to` (recipient). A conversation is a unique (sender, recipient) pair.

#### Step 2: Normalize pair ordering

`LEAST(sender_id, reply_to)` and `GREATEST(sender_id, reply_to)` create a canonical pair regardless of direction.

#### Step 3: Count distinct pairs

`COUNT(DISTINCT (least_id, greatest_id))` or a GROUP BY on the normalized pair.

---

### The solution

**Symmetric pair normalization with count**

```sql
SELECT COUNT(DISTINCT (LEAST(sender_id, reply_to), GREATEST(sender_id, reply_to))) AS conversation_count
FROM chat_msgs
WHERE reply_to IS NOT NULL
```

> **Cost Analysis**
>
> Scan of 30M rows with LEAST/GREATEST computation per row. The DISTINCT aggregation on the pair requires hashing. Memory depends on the number of unique conversations.

> **Interviewers Watch For**
>
> Whether the candidate uses the LEAST/GREATEST trick to normalize pairs. Candidates who try self-joins or UNION approaches are typically less efficient.

> **Common Pitfall**
>
> Without LEAST/GREATEST normalization, (A,B) and (B,A) are counted as separate conversations, doubling the count. This is the core challenge of the problem.

---

## Common follow-up questions

- What if the chat model uses channels instead of direct replies? _(The conversation definition changes; tests adaptability.)_
- How would you extend this to count messages per conversation? _(GROUP BY the normalized pair instead of COUNT DISTINCT.)_
- What if reply_to is NULL for non-reply messages? _(The WHERE filter excludes them. Tests whether to include channel-based conversations.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/distinct_chat_conversations)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.