# The Usual Suspects

> Same services, same checks, same problems.

Canonical URL: <https://datadriven.io/problems/the_usual_suspects>

Domain: SQL · Difficulty: hard · Seniority: L4

## Problem

Which services had the most health checks in 2025? Show each qualifying service and its 2025 check count, including ties.

## Worked solution and explanation

### Why this problem exists in real interviews

Find-the-tied-maximum is a classic hard SQL pattern because it forces two passes: one to compute the per-group counts and another to compare them against the global maximum. Interviewers use it to see whether you reach for a CTE plus a scalar subquery (the clean answer) or fumble with `LIMIT 1` (which silently drops the tied rows the prompt explicitly asks for).

---

### Break down the requirements

#### Step 1: Scope to the previous year

`svc_health.checked` is TEXT. Use `strftime('%Y', checked) = '2025'` to keep only rows from the prior year. Apply this in `WHERE` so the predicate runs before the GROUP BY and the planner can prune partitions (the table is partitioned by `checked` over 730 days).

#### Step 2: Aggregate per service in a CTE

`GROUP BY svc_name` with `COUNT(*) AS check_count` produces one row per service (cardinality 200). Materialize this in a CTE called `yearly_counts` so you can reuse it twice: once to find the maximum, once to filter against it.

#### Step 3: Match every row at the global max

`WHERE check_count = (SELECT MAX(check_count) FROM yearly_counts)` keeps every service tied at the top, not just one. This is the whole point: `LIMIT 1` is wrong because the prompt says 'every service tied at the maximum'.

#### Step 4: Sort with explicit tie-break

`ORDER BY check_count DESC, svc_name ASC` is deterministic even though every output row shares the same count value. The svc_name tiebreak guarantees stable ordering.

---

### The solution

**CTE plus scalar-max subquery to keep ties**

```sql
WITH yearly_counts AS (
  SELECT svc_name, COUNT(*) AS check_count
  FROM svc_health
  WHERE strftime('%Y', checked) = '2025'
  GROUP BY svc_name
)
SELECT svc_name, check_count
FROM yearly_counts
WHERE check_count = (SELECT MAX(check_count) FROM yearly_counts)
ORDER BY check_count DESC, svc_name ASC
```

> **Cost Analysis**
>
> `svc_health` is 50,000,000 rows partitioned by `checked` over 730 days, so partition pruning to 2025 cuts the scan roughly in half (about 25M rows). The hash aggregate on 200 distinct `svc_name` values is trivial. The scalar subquery `SELECT MAX(check_count) FROM yearly_counts` runs once against the materialized CTE, not the base table, so the second pass costs nothing.

> **Interviewers Watch For**
>
> Interviewers want to see the CTE-plus-scalar-subquery pattern, not `ORDER BY count DESC LIMIT 1` (which loses tied rows) and not a window-function rewrite that overcomplicates a 4-line problem. They also watch whether you write `strftime('%Y', checked) = '2025'` (string-to-string) or accidentally `= 2025` (which still works in SQLite but is a portability red flag).

> **Common Pitfall**
>
> `LIMIT 1` looks correct on a sample where there is no tie, but the prompt explicitly says 'every service tied at the maximum count.' If two services both hit 1.2M checks in 2025, `LIMIT 1` returns one of them at random and the grader rejects the row count.

---

## Common follow-up questions

- How would you rewrite this with `RANK()` instead of a scalar `MAX` subquery, and when would you prefer one over the other? _(Tests window-function alternative. `RANK() OVER (ORDER BY check_count DESC)` in an outer SELECT with `WHERE rnk = 1` is equivalent and avoids the scalar subquery, but candidates should call out that it requires reading the CTE one extra time.)_
- How would you change the year filter so the partition pruner on `checked` could be used more aggressively? _(Tests sargable date predicates. `checked >= '2025-01-01' AND checked < '2026-01-01'` lets the partition pruner work directly on the base column instead of the opaque `strftime` expression.)_
- How would you extend this to find the tied-max service per `region` instead of globally? _(Tests adding a second filter dimension. The candidate should add `region` to the GROUP BY (`GROUP BY svc_name, region`) and partition the MAX subquery by region (`SELECT MAX(check_count) FROM yearly_counts WHERE region = outer.region` correlated, or rewrite with a window function).)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/the_usual_suspects)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.