# Healthiest Service Check History

> The healthiest service. Full history.

Canonical URL: <https://datadriven.io/problems/healthiest_service_check_history>

Domain: SQL · Difficulty: hard · Seniority: L4

## Problem

Find the service(s) with the highest-ever uptime score, then for those services show each check date, the previous check date, and the gap in days between them. Results should be ordered by check date, most recent first.

## Worked solution and explanation

### Why this problem exists in real interviews

Reliability teams routinely audit health-check cadence on their best-performing services to confirm the monitoring system is firing on schedule. The prompt has two parts: pick the service(s) with the best uptime, then walk that service's check history with `LAG` to compute gaps between consecutive checks. Interviewers stack these on purpose. They want to see whether you can identify the top group with proper tie handling, then carry that selection into a windowed self-comparison without spurious joins. Almost every candidate who fails this question fails on one of three things: missing `PARTITION BY` in the window, using `LIMIT 1` when the prompt says "service(s)", or sorting the final result the wrong direction.

> **Trick to Solving**
>
> Two-stage pattern: filter to the entity, then run a window function over its history.
> 
> 1. Find max `uptime` across all rows, capture every service tied at that max
> 2. For rows belonging to those services, use `LAG(checked) OVER (PARTITION BY svc_name ORDER BY checked)` to get the previous check date
> 3. `JULIANDAY(checked) - JULIANDAY(prev_checked)` gives day gap
> 4. `ORDER BY checked DESC` because the prompt says most recent first
> 
> The `PARTITION BY svc_name` is non-negotiable: without it, the LAG would compare the last check of one service to the first check of another.

---

### Break down the requirements

#### Step 1: Find every service tied at max uptime

`(SELECT MAX(uptime) FROM svc_health)` is a scalar. The prompt says "service(s) with the highest-ever uptime score" - plural - so identify all services that have at least one row at that max uptime, not the single top row. `LIMIT 1` after `ORDER BY uptime DESC` looks right but silently drops ties. With `uptime` cardinality 10000 and single-hot skew on the top values, ties are realistic.

#### Step 2: Carry that selection into a CTE

Wrap the qualifying services in a CTE so the next stage can filter to them with `WHERE svc_name IN (SELECT svc_name FROM healthiest)`. Keep the CTE small - just the names - so the planner can hash-probe efficiently when the outer scan hits 60M rows.

#### Step 3: LAG within each service's history

`LAG(checked) OVER (PARTITION BY svc_name ORDER BY checked)` is the previous check timestamp within the same service. The `PARTITION BY` is what isolates each service's history; without it, when the data has multiple tied services, the LAG would cross service boundaries and produce nonsense. The first row in each partition has a NULL `prev_checked` - keep it or filter it depending on what the consumer needs (the prompt does not say to drop it).

#### Step 4: Compute day gap with JULIANDAY

`JULIANDAY(a) - JULIANDAY(b)` returns the difference in days as a real number. Cast to INTEGER if the prompt expects whole days. NULL minus anything is NULL, so the first row per service correctly shows a NULL gap rather than a misleading zero.

#### Step 5: Sort the result most recent first

The prompt explicitly says "ordered by check date, most recent first" - that is `ORDER BY checked DESC`. Easy to misread as ASC because you computed prev_checked which feels chronological, but read the words. Final-sort direction is one of the most common silent errors on this kind of prompt.

---

### The solution

**Top-uptime services with gap-from-previous-check**

```sql
WITH healthiest AS (
    SELECT DISTINCT svc_name
    FROM svc_health
    WHERE uptime = (SELECT MAX(uptime) FROM svc_health)
)
SELECT
    s.svc_name,
    s.checked,
    LAG(s.checked) OVER (PARTITION BY s.svc_name ORDER BY s.checked) AS prev_checked,
    JULIANDAY(s.checked)
    - JULIANDAY(LAG(s.checked) OVER (PARTITION BY s.svc_name ORDER BY s.checked)) AS days_diff
FROM svc_health s
WHERE s.svc_name IN (SELECT svc_name FROM healthiest)
ORDER BY s.checked DESC
```

**Equivalent form using DENSE_RANK**

```sql
/* Postgres / BigQuery: filter window-derived rows with QUALIFY-style pattern */
WITH ranked AS (
  SELECT
    svc_name,
    checked,
    LAG(checked) OVER (PARTITION BY svc_name ORDER BY checked) AS prev_checked,
    DENSE_RANK() OVER (ORDER BY uptime DESC) AS uptime_rk
  FROM svc_health
)
SELECT svc_name, checked, prev_checked,
       (checked - prev_checked) AS days_diff
FROM ranked
WHERE uptime_rk = 1
ORDER BY checked DESC;
```

> **Cost Analysis**
>
> `svc_health` is 60M rows / ~11.5 GB, partitioned by `checked` across 1095 daily partitions. The MAX(uptime) subquery scans the full table once - on a partitioned table this is unavoidable without a global pre-aggregate. The healthiest CTE returns at most a handful of svc_names (cardinality 200 total, single-hot skew on uptime). The outer query's `WHERE svc_name IN (...)` should be pushed down to the partition scan, but svc_name is not the partition key; an index on `svc_name` is what makes this fast. The window pass partitions by svc_name and sorts by `checked`, which the planner can satisfy by reading partitions in date order if `checked` has a btree. Expected plan: Aggregate (MAX scan) -> Hash Join probe on svc_name -> Sort partitions by (svc_name, checked) -> Window. Mitigation at this scale: maintain a `svc_uptime_max` rollup refreshed hourly so the MAX subquery doesn't re-scan 60M rows.

> **Interviewers Watch For**
>
> Three things the interviewer is checking. (1) Did you handle ties on uptime, or did `LIMIT 1` drop services? Strong candidates use `= MAX(...)`. (2) Does your `LAG` have `PARTITION BY svc_name`? Without it, when ties exist, the LAG crosses services. Weak candidates write a single OVER (ORDER BY checked) clause and silently get wrong gaps. (3) Did you sort DESC as the prompt says? Strong candidates re-read the output requirements out loud before writing the final ORDER BY.

> **Common Pitfall**
>
> The silent killer is omitting `PARTITION BY svc_name` in the LAG. If exactly one service ties at max uptime, the query happens to produce the right output and you ship a bug. The bug surfaces the first time two services tie, and now the gap-days column is meaningless because it's comparing across services. Always partition window functions by the entity you're walking, even if you think there's only one. The other pitfall is `LIMIT 1` instead of `= MAX(uptime)` for top-N-with-ties.

---

## Common follow-up questions

- Health checks should fire every 15 minutes. How would you extend this query to flag rows where the gap from the previous check exceeds the expected cadence? _(Tests gap-detection on top of LAG. Real reliability work uses this query to find missed checks - the column is interesting only when it exceeds the expected cadence. Add `WHERE days_diff > <expected_cadence>` to the outer query, or compute a flag column.)_
- Suppose the prompt also asked for the change in latency between consecutive checks. `latency` has 2% NULLs - how does that affect your computation, and where would you handle it? _(Tests NULL handling in window functions. `latency` has 2% NULLs, `uptime` has 0%, `checked` has 0%. None of those affect this query directly, but a follow-up often asks you to compute average latency between consecutive checks - and there NULLs propagate through arithmetic.)_
- If the team only cares about the last 30 days of history, how does adding a `WHERE checked >= DATE('now', '-30 days')` change the query plan, and where do you put it? _(Tests partition-pruning intuition. The table is partitioned by `checked` over 1095 daily partitions. A bounded date range allows partition pruning; without it the MAX subquery scans all partitions.)_
- If `svc_health` grew to 60 billion rows, the MAX(uptime) subquery becomes a bottleneck. What would you change about the architecture to avoid scanning the full table every time? _(Tests scale awareness. At 60M rows the MAX(uptime) subquery is a full scan; at 60B it's a problem. Production answer is a maintained rollup table or a materialized view refreshed on a schedule.)_
- Rewrite the gap calculation as a self-join instead of using LAG. Which form is easier to get right when two checks fire at the same `checked` timestamp? _(Tests window-function fluency. PARTITION BY svc_name plus ORDER BY checked makes the LAG correct per-service. Some candidates instinctively reach for a self-join (LEFT JOIN to svc_health on the prior date) which is slower and harder to get right with ties on `checked`.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/healthiest_service_check_history)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.