# User Connection Score

> Every user has a social score.

Canonical URL: <https://datadriven.io/problems/user_connection_score>

Domain: SQL · Difficulty: hard · Seniority: L5

## Problem

In the chat messages table, sender and reply-to pairs represent bidirectional connections. Each user's connection score is their total unique connections divided by total unique users on the platform, expressed as a percentage.

## Worked solution and explanation

### Why this problem exists in real interviews

This tests bidirectional relationship extraction from messaging data. Interviewers check whether you can model connections from reply pairs and compute a per-user network density metric.

> **Trick to Solving**
>
> The tricky part is that connections are bidirectional: if A replies to B, both A and B are connected. You need to extract both sides of each reply pair and count distinct connections per user.
> 
> 1. Self-join to find (sender, replied-to-sender) pairs
> 2. Union both directions for bidirectional connections
> 3. Count distinct connections per user
> 4. Divide by total distinct users on the platform

---

### Break down the requirements

#### Step 1: Find reply pairs

Self-join `chat_msgs` on `reply_to = msg_id` to get (sender_id, original_sender_id) pairs where reply_to is not null.

#### Step 2: Count distinct connections per user

For each sender, count the distinct users they have exchanged replies with (in either direction).

#### Step 3: Compute percentage

Divide unique connections by total unique platform users, multiply by 100 for percentage.

---

### The solution

**Self-join for bidirectional connections**

```sql
WITH connections AS (
    SELECT DISTINCT a.sender_id AS user_id, b.sender_id AS connected_to
    FROM chat_msgs a
    JOIN chat_msgs b ON a.reply_to = b.msg_id
    WHERE a.sender_id != b.sender_id
    UNION
    SELECT DISTINCT b.sender_id AS user_id, a.sender_id AS connected_to
    FROM chat_msgs a
    JOIN chat_msgs b ON a.reply_to = b.msg_id
    WHERE a.sender_id != b.sender_id
),
total_users AS (
    SELECT COUNT(DISTINCT sender_id) AS total FROM chat_msgs
)
SELECT
    c.user_id,
    COUNT(DISTINCT c.connected_to) AS unique_connections,
    COUNT(DISTINCT c.connected_to) * 100.0 / t.total AS connection_score
FROM connections c
CROSS JOIN total_users t
GROUP BY c.user_id, t.total
```

> **Cost Analysis**
>
> The self-join on 50M rows (with 70% NULL reply_to filtered out) processes ~15M reply rows. The join on msg_id needs an index. The UNION deduplicates bidirectional pairs. This is compute-intensive but correct.

> **Interviewers Watch For**
>
> Whether you handle bidirectionality. A replies to B means both A and B are connected. Missing one direction halves the connection count for many users.

> **Common Pitfall**
>
> Counting self-connections (where sender replies to their own message). The `WHERE a.sender_id != b.sender_id` filter prevents this.

---

## Common follow-up questions

- How would you find the top 10 most connected users? _(Wrap in an outer query with ORDER BY connection_score DESC LIMIT 10.)_
- What if you only wanted connections from the last 30 days? _(Add date filters to both sides of the self-join.)_
- How would you compute a mutual connection score (both must reply to each other)? _(Intersect the two directions instead of unioning them.)_
- What if the table had billions of messages? _(Pre-compute a connections graph table; real-time computation would be too expensive.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/user_connection_score)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.