# Daily Spam Impression Rate

> How much of the ad feed is spam?

Canonical URL: <https://datadriven.io/problems/daily_spam_impression_rate>

Domain: SQL · Difficulty: medium · Seniority: L4

## Problem

An ad impression is considered spam if its campaign name contains 'spam'. Limit to users who also appear in page_views. For each day, compute the percentage of spam impressions among all qualifying impressions. Show the date and spam percentage.

## Worked solution and explanation

### Why this problem exists in real interviews

This tests a semi-join combined with conditional aggregation. Filtering to users who appear in both tables, then computing a percentage of rows matching a substring condition, probes multi-table reasoning.

---

### Break down the requirements

#### Step 1: Restrict to users in page_views

Use `EXISTS` or `IN` to keep only `ad_impressions` rows for users present in `page_views`.

#### Step 2: Identify spam impressions

`ad_campaign LIKE '%spam%'` matches spam campaigns.

#### Step 3: Compute daily spam percentage

`GROUP BY impression_time::DATE` with conditional count for spam vs total.

---

### The solution

**Semi-join with conditional daily percentage**

```sql
SELECT
    ai.impression_time::DATE AS day,
    ROUND(100.0 * SUM(CASE WHEN ai.ad_campaign LIKE '%spam%' THEN 1 ELSE 0 END) / COUNT(*), 2) AS spam_pct
FROM ad_impressions ai
WHERE EXISTS (
    SELECT 1 FROM page_views pv WHERE pv.user_id = ai.user_id
)
GROUP BY ai.impression_time::DATE
ORDER BY day
```

> **Cost Analysis**
>
> The EXISTS subquery on 600M page_views is correlated on user_id. With an index on `page_views(user_id)`, each probe is fast. The outer scan of 400M impressions is the bottleneck. A hash semi-join would be more efficient.

> **Interviewers Watch For**
>
> Whether the candidate uses EXISTS (correct, no fan-out) vs JOIN (risky, can multiply rows if a user has multiple page views). The semi-join pattern is key here.

> **Common Pitfall**
>
> Using an INNER JOIN to page_views would duplicate ad_impressions rows for users with multiple page views, inflating both the numerator and denominator. EXISTS avoids this.

---

## Common follow-up questions

- How would you make the spam detection case-insensitive? _(Use ILIKE or LOWER(ad_campaign) LIKE '%spam%'.)_
- What if you needed to exclude known spam campaigns from the total? _(Tests whether spam impressions should be in the denominator.)_
- How would you identify the top spam campaign by impression volume? _(Add a GROUP BY on ad_campaign for a campaign-level breakdown.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/daily_spam_impression_rate)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.