# Impressions by Search Keyword

> Campaign performance, keyword by keyword.

Canonical URL: <https://datadriven.io/problems/impressions_by_search_keyword>

Domain: SQL · Difficulty: hard · Seniority: L3

## Problem

For each search term that contains 'laptop', count how many ad impressions came from users who searched that term. Rank sorted from most impressions to least.

## Worked solution and explanation

### Why this problem exists in real interviews

Ad analytics teams measure which search keywords correlate with the most ad exposure, which feeds keyword bidding and inventory pricing. Interviewers love this prompt because it bundles three traps into one query: a case-insensitive substring filter, a many-to-many join on `user_id` that explodes row counts, and the need for `COUNT(DISTINCT impression_id)` to avoid double-counting impressions that get joined to multiple search rows.

---

### Break down the requirements

#### Step 1: Filter search rows by keyword

`WHERE LOWER(sq.search_term) LIKE '%laptop%'` keeps only search rows whose term contains 'laptop' regardless of case. The LOWER call normalizes both sides; without it, 'Laptop' and 'LAPTOP' would slip through with wrong casing.

#### Step 2: Join impressions to those users

`INNER JOIN search_queries sq ON ai.user_id = sq.user_id` attaches each impression to every laptop-related search the same user issued. Because users issue many searches and see many impressions, this join is many-to-many and will multiply rows.

#### Step 3: Count distinct impressions per term

`COUNT(DISTINCT ai.impression_id)` is the only way to avoid inflating the count when one impression joins to multiple search rows for the same user. Group by `sq.search_term` so each distinct laptop variant gets its own row, then `ORDER BY impression_count DESC, sq.search_term ASC` for the final ranking.

---

### The solution

**Case-insensitive filter, many-to-many join, distinct count**

```sql
SELECT sq.search_term, COUNT(DISTINCT ai.impression_id) AS impression_count
FROM ad_impressions ai
INNER JOIN search_queries sq ON ai.user_id = sq.user_id
WHERE LOWER(sq.search_term) LIKE '%laptop%'
GROUP BY sq.search_term
ORDER BY impression_count DESC, sq.search_term ASC
```

> **Cost Analysis**
>
> This is the most expensive query in the set: `ad_impressions` holds 600M rows (~115 GB) and `search_queries` holds 100M rows (~26 GB). The join key `user_id` has fan-out 24 on impressions and 12.5 on searches, so a naive hash join can multiply intermediate rows into the tens of billions. A real optimizer pushes the `LIKE '%laptop%'` filter down on `search_queries` first (cutting it to a small fraction), then probes a hash table built from those laptop searchers against the much larger `ad_impressions`. Pre-filtering by impression date partition would shrink the right side further.

> **Interviewers Watch For**
>
> Interviewers check three things in order: did you use `COUNT(DISTINCT impression_id)` instead of `COUNT(*)`, did you LOWER both sides of the LIKE, and did you understand that the join is per-user, not per-search-event? Strong candidates also call out the fan-out explicitly ('this join multiplies rows by the average impressions per user') before writing the SELECT.

> **Common Pitfall**
>
> Writing `COUNT(*)` or `COUNT(impression_id)` instead of `COUNT(DISTINCT impression_id)` inflates the count by the number of times a user issued the same search term. A user who searched 'laptop' three times and saw 100 impressions contributes 300 to the count instead of 100. Another trap: `LIKE 'laptop%'` (no leading `%`) only catches terms starting with 'laptop', missing 'gaming laptop' or 'best laptop 2026'.

---

## Common follow-up questions

- The prompt says a user who issued multiple laptop terms contributes their impressions to each term separately. Trace through what your query would return if user 42 searched both 'laptop' and 'gaming laptop' and saw 10 impressions. _(Tests join-semantics understanding. The query attributes all 10 impressions to 'laptop' and all 10 to 'gaming laptop' (20 row-counts total), which is the documented behavior. A candidate who answers '10' is incorrectly assuming impressions are split, not duplicated.)_
- If you added `AND ai.impression_time BETWEEN sq.query_time AND datetime(sq.query_time, '+1 hour')` to the join, what real-world question are you now answering, and why is it usually more useful? _(Tests product intuition: the unrestricted join attributes any impression the user ever saw to any laptop search they ever made. Restricting to impressions within an hour of the search captures attribution-window logic, which matches how ad platforms actually credit keywords.)_
- `search_term` has 8M distinct values across 100M rows. If 'laptop' matches 10K distinct terms (gaming laptop, laptop bag, etc.), how does the GROUP BY size affect query performance and result usability? _(Tests scale intuition. A 10K-row result is fine for a query, but for a dashboard you would want to bucket variants ('laptop', 'laptop accessory', 'laptop part') with a CASE or a controlled vocabulary. A candidate should mention that long-tail variants drown out signal in a flat ranking.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/impressions_by_search_keyword)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.