# Incident Keyword Messages

> Certain words trigger an investigation.

Canonical URL: <https://datadriven.io/problems/incident_keyword_messages>

Domain: SQL · Difficulty: hard · Seniority: L4

## Problem

During an incident review, the on-call engineer needs to find chat messages containing the words 'latency', 'down', 'back', or 'pipeline' and match them with error tracking entries from the same month. For each matching message, pull the associated severity. Show the channel, message content, and severity.

## Worked solution and explanation

### Why this problem exists in real interviews

Postmortem queries like this probe whether you spot coarse-grain joins hiding inside string functions. `SUBSTR(sent_at, 1, 7)` looks innocuous but produces a same-month cartesian product that the interviewer wants you to name out loud before submitting. They also want to hear you ask whether the consumer wants every (chat, error) pair or a deduplicated view.

---

### Break down the requirements

#### Step 1: Filter messages by keyword

OR-chain of `cm.content LIKE` on `'%latency%'`, `'%down%'`, `'%back%'`, `'%pipeline%'`. Leading wildcards mean no btree index helps. Wrap in `LOWER()` if the engine is case-sensitive.

#### Step 2: Define same-month

`SUBSTR(cm.sent_at, 1, 7) = SUBSTR(et.first_at, 1, 7)` gives `'YYYY-MM'`. Works on ISO strings or timestamps cast to text. Both sides are derived, so neither partition key gets pruned.

#### Step 3: Project the three columns

`cm.channel`, `cm.content`, `et.severity`. No aggregation in the spec, so duplicates from the many-to-many join land in the result set.

#### Step 4: Decide on shape

One chat row pairs with every same-month error. Confirm with the interviewer: per-message-per-error, or DISTINCT on `(channel, content, severity)`, or one row per message with max severity.

---

### The solution

**INCIDENT KEYWORD MATCH**

```sql
SELECT cm.channel, cm.content, et.severity
FROM chat_msgs cm
INNER JOIN err_tracks et
  ON SUBSTR(cm.sent_at, 1, 7) = SUBSTR(et.first_at, 1, 7)
WHERE cm.content LIKE '%latency%'
   OR cm.content LIKE '%down%'
   OR cm.content LIKE '%back%'
   OR cm.content LIKE '%pipeline%'
```

> **Cost Analysis**
>
> 60M x 30M with a derived-string join key. Partition pruning is dead because the planner cannot push `SUBSTR(...) = SUBSTR(...)` to the partition column. Expect a full scan on both sides, hash join on the month string, and a result that can balloon past 100M rows per month bucket.

> **Interviewers Watch For**
>
> Ask aloud: when one chat message matches multiple errors in the same month, do they want every pair, distinct severities, or the max severity per message? The prompt is silent. Naming the ambiguity before writing scores more than the query.

> **Common Pitfall**
>
> Writing `WHERE cm.content LIKE '%latency%' OR ... AND SUBSTR(...) = SUBSTR(...)` puts the join condition in WHERE with no parens. Precedence makes `AND` bind tighter than `OR`, so three of the four keyword branches drop the join filter entirely. Keep the join in `ON`.

---

### COMMON FOLLOW-UP QUESTIONS

## Common follow-up questions

- How would you collapse to one row per message with the highest severity? _(Tests window functions or GROUP BY with a severity rank.)_
- Rewrite the month join so partition pruning actually fires. _(Probes whether you reach for `DATE_TRUNC('month', ...)` and range predicates instead of `SUBSTR`.)_
- How do you make the keyword filter case-insensitive across Postgres and Snowflake? _(Checks awareness of `LOWER()` vs `ILIKE` portability.)_
- Would a tokenized full-text index change your approach at 60M rows? _(Opens the door to GIN, search services, or precomputed keyword flags.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/incident_keyword_messages)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.