# Average Latency by Health Status

> Healthy versus degraded. The latency gap is real.

Canonical URL: <https://datadriven.io/problems/average_latency_by_health_status>

Domain: SQL · Difficulty: easy · Seniority: L3

## Problem

After an on-call page last night, the SRE team wants to quantify how much latency worsens as service health deteriorates. Show the average latency for health checks in the 'healthy', 'degraded', and 'critical' buckets so they can set alert thresholds.

## Worked solution and explanation

### What this is really asking

`svc_health.status` has more buckets than the three the SRE team cares about. With 8M rows, an unfiltered GROUP BY happily returns 'unknown', 'maintenance', or whatever else exists, and the alert thresholds get set against noise.

---

### Break down the requirements

#### Step 1: Pin the three buckets

Filter `status IN ('healthy', 'degraded', 'critical')` in WHERE so unrelated states never reach the aggregate. Doing it post-aggregation wastes 8M rows of arithmetic.

#### Step 2: Average per bucket

GROUP BY status and project AVG(latency) as avg_latency. One row per bucket, three rows out.

---

### The solution

**AVERAGE LATENCY BY STATUS**

```sql
SELECT
    status,
    AVG(latency) AS avg_latency
FROM svc_health
WHERE status IN ('healthy', 'degraded', 'critical')
GROUP BY status
```

> **Cost Analysis**
>
> 8M rows, single scan. The IN filter prunes before the hash aggregate, so memory stays proportional to distinct status values (three), not row count. The `checked` partition key is not in play here, so expect a full scan unless `status` is also indexed.

> **Interviewers Watch For**
>
> Did you filter, or did you trust that only those three values exist? Production `status` columns drift: a new 'warning' state ships, the aggregate quietly distorts. Naming the filter explicitly shows you treat enums as untrusted input.

> **Common Pitfall**
>
> Putting `status IN (...)` in HAVING instead of WHERE. It returns the same rows but aggregates every status first, then discards. On 8M rows that is wasted CPU and a slower plan.

---

### COMMON FOLLOW-UP QUESTIONS

## Common follow-up questions

- How would you add p95 latency alongside the mean? _(Means hide tail blowups. Use PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY latency) per status bucket.)_
- What if latency contains nulls for failed checks? _(AVG ignores nulls, which inflates the mean for the failing bucket. Decide whether nulls mean zero, infinity, or exclusion before reporting.)_
- How would you split this by region without a second query? _(Add region to both SELECT and GROUP BY. Three buckets times N regions, and you can rank worst regions per status.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/average_latency_by_health_status)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.