# Shared Channel Contacts

> User networks mapped through messages.

Canonical URL: <https://datadriven.io/problems/shared_channel_contacts>

Domain: SQL · Difficulty: hard · Seniority: L5

## Problem

Using our chat message data, build a contact overlap picture. For each sender, count how many other unique senders share at least one channel with them. Show each sender and their shared contact count, sorted from most contacts to least.

## Worked solution and explanation

### Why this problem exists in real interviews

Built around the `chat_msgs` table, this challenge probes your ability to apply self-join in a analytics setting. Correctly referencing the `sender_id` column is essential to a working solution.

> **Trick to Solving**
>
> When the prompt asks for pairs or combinations within the same table, a self-join is the pattern.
> 
> 1. Alias the table twice (e.g., `a` and `b`)
> 2. Join on the shared attribute (region, user, etc.)
> 3. Add `a.id < b.id` to avoid duplicate and self-pairs

---

### Break down the requirements

#### Step 1: Self-join the table to pair rows

Join `chat_msgs` to itself to compare rows within the same table. The join condition controls which pairs are valid and prevents duplicate mirrors.

#### Step 2: Aggregate by `a.sender_id`

`GROUP BY a.sender_id` collapses rows to one per group. The aggregate functions (`SUM`, `COUNT`, `AVG`, etc.) compute the metric for each group.

#### Step 3: Deduplicate the result with DISTINCT

`SELECT DISTINCT` removes duplicate rows from the output. This is necessary when joins or subqueries can produce repeated combinations.

#### Step 4: Sort the final output

The `ORDER BY` clause ensures the result appears in the expected sequence. Interviewers check that the sort direction matches the prompt.

---

### The solution

**Self-join the table to pair rows to find shared channel contacts**

```sql
SELECT a.sender_id, COUNT(DISTINCT b.sender_id) AS mutual_connections
FROM chat_msgs a
INNER
JOIN chat_msgs b ON a.channel = b.channel AND a.sender_id != b.sender_id
INNER
JOIN chat_msgs c ON b.channel = c.channel AND b.sender_id != c.sender_id AND a.sender_id = c.sender_id
GROUP BY a.sender_id
ORDER BY mutual_connections DESC
```

> **Cost Analysis**
>
> With ~50M rows, the GROUP BY reduces the working set before any downstream operations; the join cost depends on the smaller table's cardinality. An index on the filter/join columns would reduce the scan to a seek.

> **Interviewers Watch For**
>
> Interviewers watch for how you prevent duplicate pairs and ensure the join condition is correct; whether you know when DISTINCT is needed and when it masks a logic error.

> **Common Pitfall**
>
> A self-join without an inequality condition (`a.id < b.id`) produces duplicate mirrored pairs and a self-pair row for each entry.

---

## Common follow-up questions

- The `reply_to` column in `chat_msgs` has a 70% null rate. How does your query handle rows where `reply_to` is NULL, and could that silently change the result count? _(Tests whether the candidate accounts for NULLs in `chat_msgs.reply_to` and understands how aggregates skip NULL values.)_
- `chat_msgs.content` has roughly 35,000,000 distinct values. What index strategy would you use to avoid a full scan on `chat_msgs`? _(Tests indexing knowledge specific to the high-cardinality `content` column in `chat_msgs`.)_
- How would you modify this query if the business logic required grouping by both `msg_id` and `channel` instead of just one? _(Tests ability to adapt the query structure to changing requirements.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/shared_channel_contacts)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.