# Most Frequent Error Types

> The errors that keep coming back.

Canonical URL: <https://datadriven.io/problems/most_frequent_error_types>

Domain: SQL · Difficulty: medium · Seniority: L3

## Problem

During an incident postmortem, the on-call engineer needs to rank error types by how frequently they fire across the platform, sorted from most frequent to least.

## Worked solution and explanation

### The mental model

The prompt looks like a one-liner, but `err_tracks` ships with a pre-aggregated `count` column. That changes the question. `COUNT(*)` counts incident records (rows ingested into `err_tracks`). `SUM(count)` counts how many times those errors actually fired in production, since `count` is a rollup. The expected answer here is `COUNT(*)`, but a senior on-call asking 'rank by frequency' usually wants `SUM(count)`. Flag the ambiguity before you write SQL.

---

### The two traps

#### Step 1: Trap 1: which 'frequency'?

`COUNT(*)` vs `SUM(count)`. They produce different rankings whenever any `err_type` has rollup rows with `count > 1`. A `NullPointerException` recorded 50 times in one row beats a `TimeoutException` recorded as 10 separate rows under `SUM(count)`, and loses under `COUNT(*)`. The expected query uses `COUNT(*)`, treating each `err_tracks` row as one incident. In a real postmortem, ask: 'do you want distinct incident records, or total fire events?'

#### Step 2: Trap 2: no time bound means full scan

`err_tracks` is partitioned by `first_at` and holds 35M rows. The prompt is silent on the incident window, so the expected query scans everything. In a real on-call workflow you would bound it to the outage window with `WHERE first_at >= '2026-05-17 02:00' AND first_at < '2026-05-17 04:00'`, which lets the planner prune partitions instead of reading the full table.

#### Step 3: Trap 3: unstable tie order

`ORDER BY err_count DESC` alone is non-deterministic when two `err_type` values tie. Postmortem reports get diffed across runs; ties that flip order look like real changes. Add `, err_type ASC` as a tiebreak so the ranking is reproducible.

---

### The solution

**Rank error types by incident record count**

```sql
SELECT
    err_type,
    COUNT(*) AS err_count
FROM err_tracks
GROUP BY err_type
ORDER BY err_count DESC;
```

> **Where the time goes**
>
> 35M rows is large, but `err_type` is low cardinality (hundreds of distinct values), so the hash aggregate stays tiny and fits in memory. `EXPLAIN` will show a `HashAggregate` over a `Seq Scan` of `err_tracks`. The dominant cost is the full scan: adding a `first_at` predicate would let the planner prune partitions and cut IO by an order of magnitude. There is no benefit to indexing `err_type` for this query, since you touch every row anyway.

> **What a senior flags up front**
>
> Two clarifying questions earn points before you type. First, 'across the platform' is vague: is the on-call ranking globally, or per `svc_name`? If a `Timeout` happens in `auth-svc` and `payments-svc`, do they merge or stay split? Second, 'frequently' is the `COUNT(*)` vs `SUM(count)` fork above. Stating the assumption out loud is worth more than the query.

> **Three ways to fumble this**
>
> Three easy ways to lose this question. Writing `COUNT(err_id)` instead of `COUNT(*)`: identical only if `err_id` is `NOT NULL`, and the semantics drift the moment that changes. Writing `ORDER BY 2 DESC`: positional ordering breaks the second someone adds a column to the `SELECT` list. Aliasing the result as `count`: now you have shadowed the pre-aggregated column name, and any follow-up query referencing `count` is ambiguous.

> **Why COUNT(*) over COUNT(col)**
>
> `COUNT(*)` is the row-counting primitive and is universally optimized by every planner. `COUNT(col)` adds a `col IS NOT NULL` check per row and means something subtly different. When the question is 'how many records', reach for `COUNT(*)` every time.

---

## Common follow-up questions

- How would the top of the ranking change if you used `SUM(count)` instead of `COUNT(*)`? _(Forces you to articulate the row-vs-event distinction and how it changes the ranking.)_
- If `err_type` has NULLs, does a NULL group appear in your output? How do you suppress or surface it? _(Tests NULL handling in GROUP BY: most engines emit a single NULL group rather than dropping the rows.)_
- How would you return the top 5 error types with all ties at rank 5 included? _(Probes window function fluency. LIMIT 5 drops ties; DENSE_RANK keeps them.)_
- Production wants this metric broken down by hour of `first_at`. What changes in the query? _(Operationalizes the query for an incident dashboard rather than a one-shot postmortem.)_
- How would you rewrite this to be partition-pruning friendly on `first_at`? _(Connects query shape to physical layout. Partition pruning requires predicates on the partition key in a form the planner can evaluate at plan time.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/most_frequent_error_types)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.