# The Vanishing Rows

> Some records disappear when the tables meet. Figure out why.

Canonical URL: <https://datadriven.io/problems/null_keys_in_joins>

Domain: SQL · Difficulty: easy · Seniority: L3

## Problem

A downstream dashboard is reporting fewer content views than expected. The content_views table has 100 records, but some views were logged without a user association. Write a query that pairs content_views to users on user_id and returns each view's ID, user ID, and username. Then reason about why the result has fewer than 100 rows.

## Worked solution and explanation

### Why this problem exists in real interviews

This challenge forces you to reconcile content_views.viewed_at with users.username through query construction. It surfaces as a fundamentals check because candidates who skip the grain check produce plausible but incorrect numbers.

> **Trick to Solving**
>
> When the prompt asks you to explain why the result has fewer rows than expected, the answer is almost always NULL join keys. `NULL = NULL` evaluates to UNKNOWN, not TRUE, so inner joins drop those rows.
> 
> 1. Write the inner join
> 2. Note that rows where `user_id IS NULL` vanish
> 3. Explain that NULL never matches in an equi-join

---

### Break down the requirements

#### Step 1: Write the inner join

`SELECT cv.view_id, cv.user_id, u.username FROM content_views cv JOIN users u ON cv.user_id = u.user_id` pairs views with users.

#### Step 2: Observe the row count drop

Rows in `content_views` where `user_id IS NULL` are dropped because `NULL = NULL` is UNKNOWN, not TRUE.

#### Step 3: Explain the behavior

The result has fewer rows than the view table because inner join excludes any row where the join key is NULL on either side.

---

### The solution

**Inner join showing NULL exclusion**

```sql
SELECT cv.view_id, cv.user_id, u.username
FROM content_views cv
JOIN users u ON cv.user_id = u.user_id
```

> **Cost Analysis**
>
> At `content_views` (100,000,000 rows), `users` (4,000,000 rows), a full table scan is expensive. Partition pruning (if the table is partitioned on the filter column) is the first optimization. A covering index on the `GROUP BY` + filter columns eliminates random I/O. Consider a materialized view for repeated dashboard queries.

> **Interviewers Watch For**
>
> The query itself is trivial. What interviewers really want is your explanation of why rows are missing. Candidates who say "NULL is not equal to NULL" and suggest a `LEFT JOIN` or `COALESCE` workaround score well.

> **Common Pitfall**
>
> Using `LEFT JOIN` instead of `INNER JOIN` would include NULL-keyed rows and mask the behavior the prompt asks you to explain.

---

## Common follow-up questions

- What happens to your result if content_views.user_id contains NULLs for some rows? _(Tests whether the candidate accounts for NULL behavior in aggregates and comparisons on user_id.)_
- If the join between content_views and users produces a fan-out, how does that affect your aggregate? _(Tests awareness of join cardinality and its impact on SUM, COUNT, and AVG results.)_
- With millions of distinct values in content_views.view_id, what index strategy would you use to keep this query performant? _(Tests indexing knowledge specific to high-cardinality columns like view_id.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/null_keys_in_joins)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.