# Top Error-Service Pair

> Which error-service pair triggered the most resolved incidents

Canonical URL: <https://datadriven.io/problems/top_error_service_pair>

Domain: SQL · Difficulty: medium · Seniority: L4

## Problem

During an incident postmortem, you need to find the error-service combination responsible for the most resolved incidents. Build an identifier by combining the error type and service name from the error tracker, then count how many unique errors that combination produced where the corresponding alert was eventually resolved. Return only the top-ranked combinations, including ties, sequenced alphabetically.

## Worked solution and explanation

### Why this problem exists in real interviews

The interviewer is probing whether you negotiate the grain before you write. `err_tracks` joined to `alert_events` on `svc_name` alone fans out (8M alerts against 3M errors), and `top pair` is ambiguous: do they want one row, or all ties? Ask both questions out loud. The signal is whether you treat `(err_type, svc_name)` as a composite identifier without losing the tie semantics.

---

### Break down the requirements

#### Step 1: Confirm the join key and resolved predicate

Join `err_tracks` to `alert_events` on `svc_name`. The brief says `resolved` is the resolution timestamp, so `a.resolved IS NOT NULL` filters to closed alerts. Confirm there is no stronger join key like `err_id`.

#### Step 2: Build the composite identifier

Most engines do not allow `COUNT(DISTINCT col1, col2)`. Concatenate with a safe delimiter: `e.err_type || '-' || e.svc_name AS reporter_identity`. Pick a delimiter that cannot appear in either column.

#### Step 3: Count distinct errors per pair

`COUNT(DISTINCT e.err_id)` deduplicates the fan-out from the join to `alert_events`. Without DISTINCT, one error with five resolved alerts counts as five, inflating the leader.

#### Step 4: Rank with ties and order alphabetically

`RANK() OVER (ORDER BY COUNT(DISTINCT e.err_id) DESC)` then filter `rnk = 1`. Use `RANK` (not `ROW_NUMBER`) so ties surface. Final `ORDER BY reporter_identity` for the alphabetical sequence.

---

### The solution

**TOP ERROR SERVICE PAIR**

```sql
SELECT reporter_identity, distinct_errors
FROM (
  SELECT e.err_type || '-' || e.svc_name AS reporter_identity,
         COUNT(DISTINCT e.err_id) AS distinct_errors,
         RANK() OVER (ORDER BY COUNT(DISTINCT e.err_id) DESC) AS rnk
  FROM err_tracks e
  JOIN alert_events a ON e.svc_name = a.svc_name
  WHERE a.resolved IS NOT NULL
  GROUP BY e.err_type || '-' || e.svc_name
) ranked
WHERE rnk = 1
ORDER BY reporter_identity
```

> **Cost Analysis**
>
> Joining 3M `err_tracks` to 8M `alert_events` on `svc_name` is the hot spot; expect a hash join with substantial fan-out before `COUNT(DISTINCT err_id)` collapses it. An index on `alert_events(svc_name, resolved)` helps the filter push down. The window `RANK` runs over the grouped output, which is tiny by comparison.

> **Interviewers Watch For**
>
> Ask out loud: top means just rank 1 or top N including ties, and should I dedupe `err_id` before or after joining alerts? If you skip the DISTINCT, an error tied to many resolved alerts wins by fan-out, not by real frequency. State the tie policy before writing.

> **Common Pitfall**
>
> Reaching for `ROW_NUMBER()` drops ties silently, returning one pair when two are tied. The other trap: using a delimiter that appears in `err_type` or `svc_name` (a hyphen in `auth-service` plus an `err_type` of `db`) collides with a different pair. Pick something like `'||'` or a control character.

---

### COMMON FOLLOW-UP QUESTIONS

## Common follow-up questions

- Why `RANK()` instead of `DENSE_RANK()` or `ROW_NUMBER()` here? _(Probes the tie-handling tradeoff: `RANK` and `DENSE_RANK` both include ties at the top; `ROW_NUMBER` arbitrarily breaks them. For `rnk = 1` either works, but `DENSE_RANK` matters when filtering top N.)_
- How would you do this without string concatenation? _(Tests `GROUP BY (err_type, svc_name)` with a tuple key, or grouping on both columns and projecting them separately, avoiding delimiter collisions entirely.)_
- What changes if you must join on both `svc_name` and a time window between `first_at` and `fired_at`? _(Probes inequality joins and whether you can keep the `COUNT(DISTINCT err_id)` semantics correct when the alert window itself can duplicate.)_
- How would you make this incremental for a streaming postmortem dashboard? _(Tests whether you can decompose `COUNT(DISTINCT)` into HyperLogLog sketches per `(err_type, svc_name)` pair, mergeable across time windows.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/top_error_service_pair)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.