# Exact Keyword Counts in Logs

> Errors and warnings. Count every single one.

Canonical URL: <https://datadriven.io/problems/exact_keyword_counts_in_logs>

Domain: SQL · Difficulty: hard · Seniority: L3

## Problem

The SRE team is building a severity distribution chart and needs exact counts of ERROR and WARN log entries from the server logs. Show each log level and its total count.

## Worked solution and explanation

### Why this problem exists in real interviews

SRE dashboards probe whether you reach for the simplest shape that fits the output. The expected output is one row per `log_level`, which means a plain `GROUP BY`, not a `SUM(CASE WHEN ...)` pivot. The interviewer is also checking whether you ask about a time window before scanning all 200M rows in `server_logs`.

---

### Break down the requirements

#### Step 1: Restrict to the two levels

Use `WHERE log_level IN ('ERROR', 'WARN')`. `IN` beats two `OR`s for readability and the planner treats them identically.

#### Step 2: Group on log_level

One row per level. `GROUP BY log_level` is the only correct grain. No `DISTINCT`, no window.

#### Step 3: Count rows per group

`COUNT(*)` aliased to `total_count`. `COUNT(log_level)` is equivalent here since the column is non-null inside the filter, but `COUNT(*)` is the convention.

#### Step 4: Order deterministically

`ORDER BY log_level` so ERROR comes before WARN alphabetically. Without it, partition-parallel scans return groups in arbitrary order.

---

### The solution

**SEVERITY COUNTS**

```sql
SELECT log_level, COUNT(*) AS total_count
FROM server_logs
WHERE log_level IN ('ERROR', 'WARN')
GROUP BY log_level
ORDER BY log_level
```

> **Cost Analysis**
>
> 200M rows, partitioned by `log_timestamp`. With no time predicate this scans every partition. A partial index on `log_level` for ('ERROR','WARN') helps, but the planner usually picks a parallel seq scan with a hash aggregate over two groups. Cost is bounded by I/O, not the aggregate.

> **Interviewers Watch For**
>
> Ask aloud: do they want a time window? On a 200M row table you almost always want `WHERE log_timestamp >= now() - interval '24 hours'` to hit one partition. If they say full history, fine, write it. Asking it shows you read the partition key before scanning.

> **Common Pitfall**
>
> Reaching for `SUM(CASE WHEN log_level = 'ERROR' THEN 1 END)` to pivot into one row. The prompt asks for one row per level, so a pivot inverts the requirement. Read the output shape before you write the SELECT.

> **The Move**
>
> If `log_level` ever has values like `error` or `Error`, `IN` misses them. Confirm the enum is uppercase before writing the filter. If not, wrap in `UPPER(log_level)` and accept the index loss.

---

### COMMON FOLLOW-UP QUESTIONS

## Common follow-up questions

- How would you return zero counts for ERROR or WARN if no rows exist in the window? _(Probes whether you reach for a `VALUES` table left-joined to the aggregate, since `GROUP BY` cannot invent missing groups.)_
- Add a column for the share of total log volume each level represents. _(Tests window aggregates: `COUNT(*) * 1.0 / SUM(COUNT(*)) OVER ()` after the GROUP BY.)_
- Bucket by hour as well as level for the last 24 hours. _(Forces a partition-pruning predicate on `log_timestamp` plus `date_trunc('hour', log_timestamp)` in the GROUP BY.)_
- How would the plan change if `log_level` were stored as TEXT versus an ENUM type? _(Checks understanding of dictionary encoding, comparison cost, and index selectivity on low-cardinality columns.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/exact_keyword_counts_in_logs)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.