# Top Pattern Matches

> A needle in a haystack, but how many haystacks?

Canonical URL: <https://datadriven.io/problems/top_pattern_matches>

Domain: SQL · Difficulty: medium · Seniority: L4

## Problem

Our server logs contain messages with embedded reference numbers. Find the 10 servers whose log messages most frequently match a pattern of a 3-digit prefix, a dash, then 4 or more digits. Return server_name and match count, highest count first.

## Worked solution and explanation

### What this is really asking

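`message LIKE '%___-____%'` is the entire trick. `_` is LIKE's single-character wildcard, so three underscores, a dash, and four underscores match any 3 characters, a dash, and at least 4 more characters anywhere in the message. That approximates '3 digits, dash, 4+ digits' on 60M rows without regex, but it does not check that the characters are actually digits.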

---

### Break down the requirements

#### Step 1: Match shape with LIKE

`%___-____%` reads as: anything, 3 chars, dash, 4 chars, anything. The trailing `%` lets the digit run exceed 4.
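
A quick way to see what the pattern accepts is to run it against a few literals. This is a minimal sketch with made-up sample strings (none of them come from the real `server_logs` table); it assumes Postgres, where `LIKE` returns a boolean that can be selected directly.

```sql
-- Hypothetical sample strings; only the LIKE pattern comes from the problem.
SELECT sample,
       sample LIKE '%___-____%' AS like_match
FROM (VALUES
    ('ticket 123-4567 opened'),  -- true: 3 chars, dash, 4 chars
    ('abc-defg'),                -- true: _ accepts letters, not just digits
    ('ref 12-3456'),             -- true: the space counts as one of the 3 chars
    ('12-3456'),                 -- false: only 2 chars before the dash
    ('123-456')                  -- false: only 3 chars after the dash
) AS t(sample);
```

The second and third rows are the false positives that LIKE cannot rule out, since `_` matches any character.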

#### Step 2: Aggregate per server

`GROUP BY server_name`, `COUNT(*)`. WHERE filters before grouping, so the count is messages-with-a-match, not total matches.
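
To make the messages-versus-matches distinction concrete, here is a small sketch that stands in for `server_logs` with a CTE of invented rows. The server names and messages are assumptions for illustration only.

```sql
-- Hypothetical rows shadowing the real table so the query runs standalone.
WITH server_logs(server_name, message) AS (
    VALUES
        ('web-01', 'refs 123-4567 and 890-12345'),  -- two refs in one row
        ('web-01', 'ref 555-0000'),
        ('web-01', 'heartbeat ok'),                  -- no match: removed by WHERE
        ('db-01',  'ref 111-2222')
)
SELECT server_name, COUNT(*) AS match_count
FROM server_logs
WHERE message LIKE '%___-____%'   -- filter runs before GROUP BY
GROUP BY server_name
ORDER BY match_count DESC, server_name;
-- web-01 | 2   (the two-ref message counts once)
-- db-01  | 1
```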

#### Step 3: Top 10

`ORDER BY match_count DESC LIMIT 10`. Production code would add `server_name` as a deterministic tie-break.
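
A sketch of the tie-break variant follows. Sorting ties by `server_name` ascending is an assumption; the problem statement does not say how ties should be resolved.

```sql
SELECT server_name, COUNT(*) AS match_count
FROM server_logs
WHERE message LIKE '%___-____%'
GROUP BY server_name
-- Secondary key makes the cut at row 10 repeatable when counts tie.
ORDER BY match_count DESC, server_name ASC
LIMIT 10;
```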

---

### The solution

**TOP 10 SERVERS BY LIKE-PATTERN HITS**

```sql
SELECT server_name, COUNT(*) AS match_count
FROM server_logs
WHERE message LIKE '%___-____%'
GROUP BY server_name
ORDER BY match_count DESC
LIMIT 10
```

> **Cost Analysis**
>
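> Leading `%` rules out any B-tree index on `message`, so this is a full scan over 60M rows. The `log_timestamp` partition doesn't help without a time filter. On Postgres, a `pg_trgm` trigram index can speed up LIKE patterns that contain literal text, but this pattern is almost all wildcards and gives the index no trigrams to search on, so it is unlikely to help here.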

> **Interviewers Watch For**
>
> Whether you know `_` is single-char (not 'zero or one'). Also: count messages or matches? `COUNT(*)` counts messages; one with two refs still counts once.

> **Common Pitfall**
>
> `LIKE '___-____'` without surrounding `%` only matches messages that ARE the pattern, not ones that CONTAIN it. Drops match count to near zero on real log text.

> **The False Start**
>
> First instinct is `REGEXP '[0-9]{3}-[0-9]{4,}'` (MySQL syntax; Postgres uses the `~` operator). Right shape, and it actually enforces digits. The expected query trades that strictness for LIKE's portability and speed. Flag the tradeoff aloud and ship the LIKE.
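
For comparison, the stricter regex version might look like the sketch below. It assumes Postgres and its `~` match operator; it is not the expected answer, only the alternative the callout describes.

```sql
SELECT server_name, COUNT(*) AS match_count
FROM server_logs
-- ~ is an unanchored, case-sensitive regex match in Postgres.
WHERE message ~ '[0-9]{3}-[0-9]{4,}'
GROUP BY server_name
ORDER BY match_count DESC
LIMIT 10;
```

Because the regex is unanchored, a message containing `1234-5678` still matches through its `234-5678` substring. Whether a longer prefix should count is not specified in the problem, so treat that as an open question to raise.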

---

## Common follow-up questions

- How do you enforce that matched chars are actually digits? _(Switch to `REGEXP_LIKE(message, '[0-9]{3}-[0-9]{4,}')` or Postgres `~`. Slower, but no false positives on `abc-defg`.)_
- What if you wanted total occurrences, not messages? _(Use `REGEXP_COUNT(message, '[0-9]{3}-[0-9]{4,}')` and `SUM()` per server. `COUNT(*)` undercounts multi-match messages.)_
- How do you exploit the `log_timestamp` partition? _(Add `WHERE log_timestamp >= ...` so the planner prunes partitions. Without it, all partitions scan regardless of LIKE.)_
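
The second and third follow-ups can be combined in one sketch. It assumes Postgres 15 or later (where `regexp_count` is available) and that `log_timestamp` is a timestamp column; the 7-day window is a hypothetical value chosen only to show a prunable filter.

```sql
SELECT server_name,
       -- Counts non-overlapping occurrences per message, then totals them.
       SUM(regexp_count(message, '[0-9]{3}-[0-9]{4,}')) AS total_refs
FROM server_logs
WHERE log_timestamp >= now() - interval '7 days'  -- hypothetical window; enables partition pruning
  AND message ~ '[0-9]{3}-[0-9]{4,}'              -- skip rows with zero occurrences
GROUP BY server_name
ORDER BY total_refs DESC, server_name
LIMIT 10;
```

The `~` filter is optional for correctness, since rows with no match contribute 0 to the sum, but it keeps servers with no matches out of the result.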

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/top_pattern_matches)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.