# Low-Byte CDN Responses

> Tiny responses from the edge.

Canonical URL: <https://datadriven.io/problems/low_byte_cdn_responses>

Domain: SQL · Difficulty: easy · Seniority: L3

## Problem

The CDN team suspects some responses are suspiciously small, possibly indicating truncated or error payloads. Pull all log entries where bytes served is under 5000, showing every available field, ordered from smallest response up.

## Worked solution and explanation

### Why this problem exists in real interviews

Against the cdn_logs table, query construction on edge_loc values is the key operation. Interviewers favor this as a fundamentals check because it exposes whether candidates handle ties, NULLs, and ordering edge cases correctly.

---

### Break down the requirements

#### Step 1: Filter to the target rows

Apply the `WHERE` filter to restrict the working set before aggregation. Filtering early reduces the number of rows that downstream operations process.

#### Step 2: Order the final output

Apply `ORDER BY` as specified to produce the expected row sequence. When tied values exist, add a secondary sort column for determinism.

---

### The solution

**Threshold filter with ascending sort**

```sql
SELECT log_id, edge_loc, req_path, status, bytes, cache_hit, served_at
FROM cdn_logs
WHERE bytes < 5000
ORDER BY bytes ASC
```

> **Cost Analysis**
>
> The query scans 400M rows from `cdn_logs`.

> **Interviewers Watch For**
>
> Candidates who verbalize their approach before typing, naming the output columns and expected row count, consistently perform better.

> **Common Pitfall**
>
> Placing a filter in `WHERE` instead of `HAVING` (or vice versa) is a common mistake. `WHERE` filters rows before aggregation; `HAVING` filters groups after.

---

## Common follow-up questions

- If cdn_logs.log_id could contain unexpected NULL values, how would your query behave? _(Tests NULL awareness even when the schema does not currently allow NULLs in log_id.)_
- How would you verify that your aggregation on cdn_logs.log_id is not double-counting due to duplicate rows? _(Tests data quality awareness and deduplication strategies.)_
- With millions of distinct values in cdn_logs.log_id, what index strategy would you use to keep this query performant? _(Tests indexing knowledge specific to high-cardinality columns like log_id.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/low_byte_cdn_responses)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.