# API Traffic by CDN Edge

> CDN paths carrying API traffic. Which edges?

Canonical URL: <https://datadriven.io/problems/api_traffic_by_cdn_edge>

Domain: SQL · Difficulty: medium · Seniority: L4

## Problem

The CDN team is analyzing API traffic efficiency across edge locations. For request paths that include 'api', compute the average bytes served at each edge location and path combination. Only include edge locations that have delivered at least one cache hit across any path. Return the edge location, request path, and average bytes, from lowest average to highest.

## Worked solution and explanation

### Why this problem exists in real interviews

`cdn_logs` is 2B rows partitioned by `served_at`. The interviewer is watching which shape you reach for to express 'edges that ever had a cache hit': a `JOIN (SELECT DISTINCT ...)`, an `IN` subquery, or `EXISTS`. All three are correct; one of them scans the partition twice cleanly, one materializes a huge intermediate, one confuses the optimizer. They want to see you think about that before typing.

---

### Break down the requirements

#### Step 1: Build the qualifying edge set

`SELECT DISTINCT edge_loc FROM cdn_logs WHERE cache_hit = 1`. This is a separate logical pass over `cdn_logs`. Distinct edge count is small (hundreds), so the result is tiny even though the scan is large.

#### Step 2: Filter main scan to API paths

`WHERE cl.req_path LIKE '%api%'`. Leading wildcard kills index use; you are scanning anyway because of the aggregation, so this is fine. State that out loud so the interviewer knows you noticed.

#### Step 3: Inner join, not IN, not EXISTS

`JOIN (subquery) ON edge_loc = edge_loc`. The qualifying set is small enough to broadcast or hash-build cheaply. `IN (SELECT ...)` and `EXISTS` work but force the planner to re-evaluate per row on some engines.

#### Step 4: Aggregate and sort low to high

`GROUP BY cl.edge_loc, cl.req_path`, `AVG(cl.bytes)`, `ORDER BY avg_bytes` (ascending, the prompt says lowest first). No `LIMIT` requested.

---

### The solution

**API TRAFFIC BY CDN EDGE**

```sql
SELECT
  cl.edge_loc,
  cl.req_path,
  AVG(cl.bytes) AS avg_bytes
FROM cdn_logs cl
JOIN (
  SELECT DISTINCT edge_loc
  FROM cdn_logs
  WHERE cache_hit = 1
) cached
  ON cl.edge_loc = cached.edge_loc
WHERE cl.req_path LIKE '%api%'
GROUP BY cl.edge_loc, cl.req_path
ORDER BY avg_bytes;
```

> **Cost Analysis**
>
> Two passes over 2B rows. Pass one filters on `cache_hit = 1`, projects only `edge_loc`, distincts to ~hundreds of rows. Pass two scans the same data filtered to API paths and hash-joins against the tiny set. If `served_at` is the only partition key, no pruning helps here; ask whether a date window is implied.

> **Interviewers Watch For**
>
> Ask aloud: 'Is the qualifying-edge check across all time, or scoped to the same window as the metric?' The prompt is silent. Picking 'all time' silently is fine if you name the choice. Also ask whether `req_path LIKE '%api%'` should be anchored (`/api/`) to avoid matching `/rapidapi-test/`.

> **Common Pitfall**
>
> Putting `cache_hit = 1` in the outer `WHERE` instead of the subquery. That filters the averaging set to cache hits only, so `avg_bytes` becomes 'avg bytes on cache hits', not 'avg bytes on all traffic at edges that ever cached'. Different number, wrong answer, easy to miss in review.

---

### COMMON FOLLOW-UP QUESTIONS

## Common follow-up questions

- Rewrite this with EXISTS instead of a join. When does each shape win? _(Probes correlated subquery semantics and how the optimizer treats semi-joins on partitioned tables.)_
- Add a 7-day window. Where do you push the served_at predicate and why? _(Tests partition-pruning awareness; the predicate has to land on the base scan, not after the join.)_
- What changes if `cache_hit` is BOOLEAN instead of INT? _(Surfaces three-valued logic and whether you write `cache_hit IS TRUE` vs `= 1`.)_
- How would you detect edges where `avg_bytes` for API paths jumped 50% week-over-week? _(Stretches the candidate into windowed comparisons, `LAG` over a weekly aggregate, and ratio thresholds.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/api_traffic_by_cdn_edge)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.