# Messages Containing Keyword

> Flagged terms in the messages.

Canonical URL: <https://datadriven.io/problems/messages_containing_keyword>

Domain: SQL · Difficulty: easy · Seniority: L3

## Problem

The on-call team is searching chat history for latency-related discussions during a recent incident. Pull all fields for messages that contain the word 'latency'.

## Worked solution and explanation

### Why this problem exists in real interviews

The chat_msgs table contains channel and content values that must be processed with pattern matching. This appears as a fundamentals check to probe whether you reason about the correct aggregation grain before writing any window or GROUP BY clause.

---

### Break down the requirements

#### Step 1: Read from `chat_msgs`

The query targets `chat_msgs` with 8 columns. Identify which columns are needed for the output.

#### Step 2: Filter to the target rows

Apply the `LIKE` pattern match in the `WHERE` clause. This narrows the dataset before any grouping or aggregation.

#### Step 3: Return the result set

Select the required columns with any necessary aliasing or formatting.

---

### The solution

**LIKE pattern match for keyword search**

```sql
SELECT msg_id, channel, sender_id, content, msg_type, sent_at, edited, reply_to
FROM chat_msgs
WHERE content LIKE '%latency%'
```

> **Cost Analysis**
>
> The query scans 30M rows from `chat_msgs`.

> **Interviewers Watch For**
>
> Candidates who verbalize their approach before typing, naming the output columns and expected row count, consistently perform better.

> **Common Pitfall**
>
> Placing a filter in `WHERE` instead of `HAVING` (or vice versa) is a common mistake. `WHERE` filters rows before aggregation; `HAVING` filters groups after.

---

## Common follow-up questions

- What happens to your result if chat_msgs.reply_to contains NULLs for some rows? _(Tests whether the candidate accounts for NULL behavior in aggregates and comparisons on reply_to.)_
- How would you verify that your aggregation on chat_msgs.msg_id is not double-counting due to duplicate rows? _(Tests data quality awareness and deduplication strategies.)_
- With millions of distinct values in chat_msgs.msg_id, what index strategy would you use to keep this query performant? _(Tests indexing knowledge specific to high-cardinality columns like msg_id.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/messages_containing_keyword)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.