# Content Recommendation Engine

> Pages they haven't discovered yet.

Canonical URL: <https://datadriven.io/problems/content_recommendation_engine>

Domain: SQL · Difficulty: medium · Seniority: L5

## Problem

For each user, find content items they haven't viewed but at least one of their chat contacts (from chat_msgs) has viewed. Return the user ID and the recommended content ID.

## Worked solution and explanation

### Why this problem exists in real interviews

This tests a collaborative filtering pattern in SQL: finding items viewed by contacts but not by the target user. It combines self-joins on the chat table with anti-joins on the views table, probing multi-table reasoning.

---

### Break down the requirements

#### Step 1: Identify each user's chat contacts

From `chat_msgs`, a user's contacts are all distinct `sender_id` values in channels where the user has also sent a message (or vice versa via `reply_to`).

#### Step 2: Find content viewed by contacts

Join the contact list to `page_views` to get content (page URLs) those contacts have viewed.

#### Step 3: Exclude content the user has already viewed

Anti-join: remove content items the user has already seen using NOT EXISTS or LEFT JOIN with NULL check.

#### Step 4: Return user-content pairs

Output `user_id` and recommended `content_id` (page_url).

---

### The solution

**Collaborative filtering via contact views**

```sql
WITH user_contacts AS (
    SELECT DISTINCT a.sender_id AS user_id, b.sender_id AS contact_id
    FROM chat_msgs a
    JOIN chat_msgs b ON a.channel = b.channel
    WHERE a.sender_id != b.sender_id
),
contact_views AS (
    SELECT DISTINCT uc.user_id, pv.page_url AS content_id
    FROM user_contacts uc
    JOIN page_views pv ON uc.contact_id = pv.user_id
)
SELECT cv.user_id, cv.content_id
FROM contact_views cv
WHERE NOT EXISTS (
    SELECT 1 FROM page_views pv
    WHERE pv.user_id = cv.user_id
      AND pv.page_url = cv.content_id
)
```

> **Cost Analysis**
>
> The self-join on `chat_msgs` (60M rows) by channel can be expensive if channels are large. The subsequent join to `page_views` (500M rows) is also heavy. In production, pre-materializing the contact graph and using bloom filters for the anti-join would be essential.

> **Interviewers Watch For**
>
> Whether the candidate defines "chat contacts" correctly. The simplest definition is co-participants in the same channel. Strong candidates ask for clarification on the contact definition before writing SQL.

> **Common Pitfall**
>
> The self-join on chat_msgs can produce an enormous intermediate result if channels have many participants. Adding DISTINCT early and limiting the contact definition reduces the explosion.

---

## Common follow-up questions

- How would you limit recommendations to the top 5 per user? _(Tests ROW_NUMBER partitioned by user with a LIMIT pattern.)_
- What if you wanted to rank recommendations by how many contacts viewed them? _(Add COUNT(DISTINCT contact_id) and order by it descending.)_
- How would this scale with 100M users? _(Discusses materialized views, incremental updates, and moving to a graph database.)_
- What if contacts were defined as users who directly replied to each other? _(Tests using the reply_to column for a more specific contact definition.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/content_recommendation_engine)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.