# Most Efficient High-Volume Campaign

> High volume. Low cost. The dream campaign.

Canonical URL: <https://datadriven.io/problems/most_efficient_high_volume_campaign>

Domain: SQL · Difficulty: hard · Seniority: L4

## Problem

We want the campaign with the lowest peak monthly spend that still delivers at least 1 click every month it runs. Monthly clicks are the sum of clicked impressions, and monthly spend is the sum of revenue. Return just the campaign name.

## Worked solution and explanation

### Why this problem exists in real interviews

The interviewer wants to see you apply conditional aggregation and HAVING filter to ad_impressions.ad_campaign while accounting for the distribution of impression_time. This surfaces in senior-level rounds because small logic errors produce results that look correct at a glance.

> **Trick to Solving**
>
> Read the prompt carefully for implicit constraints. The phrase structure hints at the grain of the output: what each row represents.
> 
> 1. Identify the output grain from the prompt (one row per what?)
> 2. Work backward from the desired output columns
> 3. Build the query inside-out: innermost subquery first, then layer on filters and aggregates

---

### Break down the requirements

#### Step 1: Structure the query as multi-step CTEs

This solution uses 2 CTEs to break the logic into readable stages. Each CTE produces an intermediate result that feeds the next.

#### Step 2: Self-join the table

Join `ad_impressions` to itself to compare or pair rows within the same table. Use an inequality condition to avoid duplicate pairs.

#### Step 3: Aggregate with SUM/MIN/MAX

Group by the output grain and apply `SUM()` to compute the metric. The `GROUP BY` must match exactly what the output needs: one row per group key.

#### Step 4: Filter groups with HAVING

The `HAVING` clause filters after aggregation, unlike `WHERE` which filters before. This is necessary when the condition depends on an aggregate result.

#### Step 5: Order and limit the output

Sort by the target metric and apply `LIMIT` to return the requested number of rows. Ensure the sort is deterministic to produce reproducible results.

---

### The solution

**Multi-CTE qualification with peak metric filter**

```sql
WITH monthly AS (
    SELECT ad_campaign,
        STRFTIME('%Y-%m', impression_time) AS month,
        SUM(CASE WHEN clicked = 1 THEN 1 ELSE 0 END) AS monthly_clicks,
        SUM(revenue) AS monthly_spend
    FROM ad_impressions
    GROUP BY ad_campaign, STRFTIME('%Y-%m', impression_time)
),
qualified AS (
    SELECT ad_campaign
    FROM monthly
    GROUP BY ad_campaign
    HAVING MIN(monthly_clicks) >= 1
)
SELECT m.ad_campaign, MAX(m.monthly_spend) AS peak_monthly_spend
FROM monthly m
JOIN qualified q ON m.ad_campaign = q.ad_campaign
GROUP BY m.ad_campaign
ORDER BY peak_monthly_spend ASC
LIMIT 1
```

> **Cost Analysis**
>
> The query scans 500M rows from `ad_impressions`. The aggregation reduces the row count before any downstream processing, which is the key performance lever. CTEs in most engines are optimization fences. For production workloads, consider inlining or materializing the intermediate results.

> **Interviewers Watch For**
>
> Naming the output grain ("one row per X") before writing the GROUP BY shows you think about data shape, not just syntax. Breaking complex logic into named CTEs shows the interviewer you prioritize readability and debuggability.

> **Common Pitfall**
>
> Comparing dates stored as TEXT without casting can produce lexicographic instead of chronological ordering. Always confirm the column type.

---

## Common follow-up questions

- If ad_impressions.impression_id could contain unexpected NULL values, how would your query behave? _(Tests NULL awareness even when the schema does not currently allow NULLs in impression_id.)_
- What happens to your CASE expressions if a new category value appears in ad_campaign, impression_time? _(Tests whether the candidate recognizes that hard-coded CASE values miss future categories.)_
- With millions of distinct values in ad_impressions.impression_id, what index strategy would you use to keep this query performant? _(Tests indexing knowledge specific to high-cardinality columns like impression_id.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/most_efficient_high_volume_campaign)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.