# Searches by Users With Email

> One user's search behavior.

Canonical URL: <https://datadriven.io/problems/searches_by_users_with_email>

Domain: SQL · Difficulty: easy · Seniority: L3

## Problem

The search quality team is analyzing result effectiveness but only for identifiable users. Pull all search query details for users who have an email address on file, limited to queries that returned at least one result.

## Worked solution and explanation

### Why this problem exists in real interviews

This problem targets self-join across the `search_queries` and `users` tables. You need to work with the `email` column to satisfy the requirements.

---

### Break down the requirements

#### Step 1: Join `search_queries` to `users`

The join connects the two tables on their shared key. This brings the columns needed for filtering and aggregation into a single row set.

#### Step 2: Filter out null values

Exclude rows where `email` is NULL. This prevents nulls from polluting aggregations or creating phantom groups.

---

### The solution

**Join `search_queries` to `users` to find searches by users with email**

```sql
SELECT sq.query_id, sq.user_id, sq.search_term, sq.results_count, sq.clicked_result, sq.query_time
FROM search_queries sq
JOIN users u ON sq.user_id = u.user_id
WHERE u.email IS NOT NULL AND sq.results_count > 0
```

> **Cost Analysis**
>
> With ~60M rows, the join cost depends on the smaller table's cardinality. An index on the filter/join columns would reduce the scan to a seek.

> **Interviewers Watch For**
>
> Interviewers watch for how you handle NULL values and whether you account for them in filters and aggregations.

> **Common Pitfall**
>
> Forgetting to filter NULLs creates phantom groups or inflated counts. Always check `null_fraction` in the schema before assuming columns are clean.

---

## Common follow-up questions

- What result would you get if every value in `users.email` were NULL? Would your query return an empty set or something unexpected? _(Tests extreme NULL scenarios and whether the candidate guards against edge cases in `email`.)_
- If `search_queries` grew to contain billions of rows, which part of your query would become the bottleneck given the cardinality of `user_id`? _(Tests ability to identify performance hotspots related to `search_queries.user_id` at scale.)_
- How would you modify this query if the business logic required grouping by both `query_id` and `user_id` instead of just one? _(Tests ability to adapt the query structure to changing requirements.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/searches_by_users_with_email)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.