# Mentorship User Pairs

> Pair them up. Mentor and mentee.

Canonical URL: <https://datadriven.io/problems/mentorship_user_pairs>

Domain: SQL · Difficulty: medium · Seniority: L4

## Problem

We're setting up an internal mentorship program. Find all pairs of users who have different age brackets, the same account status, and signed up in different years. Each pair should appear only once, with the smaller user ID first. Show both user IDs.

## Worked solution and explanation

### Why this problem exists in real interviews

Querying users for username data using date extraction tests whether you can translate a business requirement into the right column references and filter sequence. It shows up in mid-level screens to verify practical fluency.

> **Trick to Solving**
>
> Pair-generation from a single table requires a self-join. The key constraint `a.id < b.id` avoids duplicates and self-pairs.
> 
> 1. Join the table to itself on the pairing condition
> 2. Use an inequality on the ID column to ensure each pair appears once
> 3. Apply any additional filters on the pair

---

### Break down the requirements

#### Step 1: Self-join the table

Join `users` to itself to compare or pair rows within the same table. Use an inequality condition to avoid duplicate pairs.

#### Step 2: Filter to the target rows

Apply the date filter using `STRFTIME` to extract and compare the relevant time component. This restricts rows before aggregation.

---

### The solution

**Self-join with inequality constraints for unique pairs**

```sql
SELECT u1.user_id AS user_id_1, u2.user_id AS user_id_2
FROM users u1
JOIN users u2 ON u1.user_id < u2.user_id
WHERE u1.age_bucket <> u2.age_bucket
    AND u1.account_status = u2.account_status
    AND STRFTIME('%Y', u1.signup_date) <> STRFTIME('%Y', u2.signup_date)
```

> **Cost Analysis**
>
> The query scans 15M rows from `users`. The aggregation reduces the row count before any downstream processing, which is the key performance lever. CTEs in most engines are optimization fences. For production workloads, consider inlining or materializing the intermediate results.

> **Interviewers Watch For**
>
> Naming the output grain ("one row per X") before writing the GROUP BY shows you think about data shape, not just syntax. Breaking complex logic into named CTEs shows the interviewer you prioritize readability and debuggability.

> **Common Pitfall**
>
> Comparing dates stored as TEXT without casting can produce lexicographic instead of chronological ordering. Always confirm the column type.

---

## Common follow-up questions

- If users.user_id could contain unexpected NULL values, how would your query behave? _(Tests NULL awareness even when the schema does not currently allow NULLs in user_id.)_
- How would you verify that your aggregation on users.user_id is not double-counting due to duplicate rows? _(Tests data quality awareness and deduplication strategies.)_
- With millions of distinct values in users.user_id, what index strategy would you use to keep this query performant? _(Tests indexing knowledge specific to high-cardinality columns like user_id.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/mentorship_user_pairs)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.