# Every Touch Counts

> A bid, a click, a search: each one is a vote. Total them honestly.

Canonical URL: <https://datadriven.io/problems/the-most-engaged-shoppers>

Domain: SQL · Difficulty: medium · Seniority: junior

## Problem

We run an online vehicle auction marketplace and want to surface our most engaged shoppers from 2026. Treating every ad impression and every site search a person made that year as one touchpoint, total the touchpoints per user across both sources, most active first.

## Worked solution and explanation

### Why this problem exists in real interviews

This is the UNION vs UNION ALL question wearing a marketing hat. The real skill being probed: do you know that UNION quietly deduplicates rows while UNION ALL keeps every one? You are combining two activity streams and counting events per user. If you reach for UNION, two ad impressions from the same shopper collapse into a single row before the count ever runs, so a power user with 40 touchpoints reports the same number as someone who showed up once. Your 'most engaged' ranking becomes noise, and nothing in the output looks obviously wrong.

---

### Break down the requirements

#### Step 1: Reduce each source to one row per event

From ad_impressions and from search_queries, select just user_id, each filtered to 2026 via strftime('%Y', ...). One row out per impression and per search is exactly what 'one touchpoint' means.

#### Step 2: Stack the streams, keep every row

UNION ALL the two selects. This is the whole problem: UNION ALL preserves duplicates so the row count equals the event count. UNION would fold identical user_id rows together and destroy the metric before you can measure it.

#### Step 3: Count and order

Wrap the stacked stream in a subquery, GROUP BY user_id, COUNT(*) as touchpoint_count, then ORDER BY touchpoint_count DESC with user_id as a deterministic tiebreaker.

---

### The solution

**TOUCHPOINTS PER USER**

```sql
SELECT user_id,
       COUNT(*) AS touchpoint_count
FROM (
    SELECT user_id FROM ad_impressions
    WHERE strftime('%Y', impression_time) = '2026'
    UNION ALL
    SELECT user_id FROM search_queries
    WHERE strftime('%Y', query_time) = '2026'
) activity
GROUP BY user_id
ORDER BY touchpoint_count DESC, user_id
```

**UNION ALL (correct here)**

Appends both streams row for row. A user with 12 impressions and 8 searches contributes 20 rows, so COUNT(*) returns 20. No hidden work, no sort to dedup.

**UNION (wrong here)**

Runs an implicit DISTINCT across the whole stacked set. Selecting only user_id means each user collapses to one row, so every COUNT(*) returns 1 and the ranking is meaningless. It also pays for a sort or hash you never wanted.

> **Cost Analysis**
>
> At scale picture ad_impressions near 800M rows and search_queries near 300M, both partitioned by their time column. The 2026 predicates prune to a single year on each side before anything stacks. UNION ALL is a pure concatenation: no dedup pass, so the engine streams straight into the GROUP BY. Swapping in UNION forces a global DISTINCT over a billion rows, a sort or hash you gain nothing from.

> **Interviewers Watch For**
>
> The tell is that you say out loud why UNION ALL, not just that you typed it: 'I want every event counted, and UNION would deduplicate.' Strong candidates also note that selecting only user_id is what makes the UNION trap so destructive, since there is nothing left to keep the rows distinct.

> **Common Pitfall**
>
> Using UNION out of habit because it 'feels cleaner.' On a passing fixture where each user appears once it gives the right answer, then silently undercounts the moment real users have repeat activity. When you mean 'how many events,' always UNION ALL.

---

## Common follow-up questions

- Now break the total into two columns: impressions and searches side by side per user. _(Pushes toward conditional aggregation, tagging each branch with a source literal before the UNION ALL, then SUM(CASE ...).)_
- What changes if you only want shoppers active in BOTH channels, not either? _(Shifts from UNION ALL to an INTERSECT or a join/HAVING on distinct sources, testing set-operator fluency.)_
- How would you keep this incremental as new impressions and searches stream in daily? _(Probes partition pruning on the time columns and whether they would pre-aggregate per day rather than rescan the full history.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/the-most-engaged-shoppers)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.