# Keyword-Based User Search

> The search terms reveal intent.

Canonical URL: <https://datadriven.io/problems/keyword_based_user_search>

Domain: SQL · Difficulty: medium · Seniority: L4

## Problem

We ran a search quality audit and need to find users whose search terms contain 'desk', 'monitor', 'cable', or 'mouse' in singular form only. Exclude any entries that use the plural forms of those words. Return unique user IDs only.

## Worked solution and explanation

### Why this problem exists in real interviews

This is the OR-chain `LIKE` pattern dressed up with a singular-only twist. The skill being tested: do you reach for `IN` (wrong, `IN` is exact equality), do you handle case folding portably, and do you encode the singular-only rule as a `NOT LIKE` exclusion on the plural forms? `search_queries` is 80M rows partitioned by `query_time`, so the second question is whether you understand that `%kw%` torches any index on `search_term`.

---

### Break down the requirements

#### Step 1: Build the inclusion OR-chain

Four substring matches joined by `OR`, each as `LOWER(search_term) LIKE '%kw%'`. `LOWER` on the column is the portable case-insensitivity move. `ILIKE` is Postgres-only and won't survive a cross-engine review.

#### Step 2: Parenthesize the OR group

The plural exclusion is an `AND` against the whole inclusion group. Without parens, `OR` and `AND` precedence binds the last `OR` term to the first `NOT LIKE` and the filter silently includes wrong rows.

#### Step 3: Exclude the plural forms

Four `AND LOWER(search_term) NOT LIKE '%kws%'` clauses. A row with both 'desk' and 'desks' is excluded (the prompt asks singular-only entries, not entries-that-mention-the-singular).

#### Step 4: Collapse to distinct users

`SELECT DISTINCT user_id`. A user who searched both 'desk' and 'monitor' shows up on multiple rows; the audit wants the user list, not the query list.

---

### The solution

**OR-CHAIN INCLUSION WITH PLURAL EXCLUSION**

```sql
SELECT DISTINCT user_id
FROM search_queries
WHERE (
        LOWER(search_term) LIKE '%desk%'
     OR LOWER(search_term) LIKE '%monitor%'
     OR LOWER(search_term) LIKE '%cable%'
     OR LOWER(search_term) LIKE '%mouse%'
      )
  AND LOWER(search_term) NOT LIKE '%desks%'
  AND LOWER(search_term) NOT LIKE '%monitors%'
  AND LOWER(search_term) NOT LIKE '%cables%'
  AND LOWER(search_term) NOT LIKE '%mouses%'
```

> **Cost Analysis**
>
> 80M rows, partitioned by `query_time`. No time filter here, so partition pruning gives you nothing and the planner runs a full scan across every partition. Worse: `LOWER(col) LIKE '%kw%'` has a leading wildcard AND a function wrap, so any B-tree on `search_term` is unusable. Eight regex-grade row-by-row checks per row. Add a `query_time` window in any real audit.

> **Interviewers Watch For**
>
> Whether you flag the singular-only ambiguity out loud. Does 'desk mouse pad' count? Yes by this query (contains 'desk' and 'mouse', no plurals). Does 'I bought desks and a mouse' count? No, the `NOT LIKE '%desks%'` kills the whole row even though 'mouse' is singular. Surface that interpretation choice before writing the query.

> **Common Pitfall**
>
> Writing `WHERE search_term IN ('desk', 'monitor', 'cable', 'mouse')`. `IN` is exact equality, not substring; 'wireless mouse' is excluded. The other classic: `LIKE 'desk' OR LIKE 'monitor' ...` (no wildcards), same bug. And dropping the parens around the OR group lets operator precedence quietly include plural rows.

> **Postgres Shortcut**
>
> Postgres lets you collapse the OR-chain with `LOWER(search_term) LIKE ANY (ARRAY['%desk%','%monitor%','%cable%','%mouse%'])`. Cleaner, same plan. Not portable to MySQL / SQLite / BigQuery. Mention it as engine-specific; don't lead with it.

> **When LIKE Stops Scaling**
>
> If this audit query starts taking 30+ seconds on the 80M-row table you're past what `LIKE` can do. Real fix is a full-text index: Postgres `GIN` on `to_tsvector('english', search_term)`, SQLite `FTS5`, or punt the search corpus to Elasticsearch / OpenSearch. FTS also gives you free stemming, which solves the singular/plural problem at index time instead of in the `WHERE` clause.

---

### COMMON FOLLOW-UP QUESTIONS

## Common follow-up questions

- Why not `WHERE search_term IN ('desk','monitor','cable','mouse')`? _(Tests whether the writer knows `IN` is exact equality, not substring containment. The whole point of `LIKE '%kw%'` is the wildcard.)_
- Rewrite this for a Postgres warehouse with a 1B-row table and a SLA of under 2 seconds. _(Probes whether they reach for a `GIN` index on `to_tsvector(search_term)` and `@@` matching, or push the corpus to a dedicated search engine.)_
- What if the audit also needs the count of matching searches per user, not just the user list? _(Forces a switch from `SELECT DISTINCT` to `GROUP BY user_id` with `COUNT(*)`, and exposes whether they understand `DISTINCT` and `GROUP BY` solve different shapes.)_
- How would the query change if 'mouse' should match the pest but not the device? _(Opens the door to negative keyword lists, stemmers, or moving to a real search index. Demonstrates that `LIKE` has no semantic understanding.)_
- Could you push the singular-only rule into the data pipeline instead of the query? _(Tests whether they recognize this as a tokenization / normalization problem that belongs upstream, not a thing to relitigate on every audit run.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/keyword_based_user_search)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.