# Active Searchers

> They typed a query. That means something.

Canonical URL: <https://datadriven.io/problems/active_searchers>

Domain: SQL · Difficulty: medium · Seniority: L3

## Problem

The API gateway team is reviewing which clients have been rate-limited. Pull a deduplicated list of every client that has had at least one request blocked.

## Worked solution and explanation

### Why this problem exists in real interviews

Rate-limit and audit tables frequently land in the warehouse as all-TEXT columns, either because the ingestion layer is schema-on-read or because the producer writes JSON fields as strings. The interviewer wants to see whether you notice the type, cast correctly, and deduplicate on a high-cardinality column without writing a five-CTE answer for what is really two filters.

> **Trick to Solving**
>
> Read the column types before writing a predicate. `blocked` is declared TEXT, not INTEGER. Comparing a TEXT column to a number works on some engines and silently coerces, fails outright on others, and never uses an index on the raw column. Cast explicitly, filter first, then deduplicate.

---

### Break down the requirements

#### Step 1: Recognize that `blocked` is a TEXT-encoded integer

Every column in `rate_limits` is TEXT. `blocked` stores an integer as a string. `CAST(blocked AS INTEGER) > 0` is the predicate you actually want. `blocked > 0` compiles on some engines by implicit coercion but is non-portable and defeats any future index on a typed expression.

#### Step 2: Filter before deduplicating

Push the WHERE clause below the DISTINCT. Filtering first means DISTINCT only has to sort or hash the clients that actually hit a block, not all 3M rows. `client` has ~60k distinct values with zipf skew, so the unique set after filtering is much smaller than the raw row count.

#### Step 3: Project exactly one column

Return a single column of distinct clients. No aggregation, no ORDER BY required, no case branching. The scope is deliberately narrow: the interviewer is checking whether you over-engineer a simple question.

---

### The solution

**Distinct clients with at least one block**

```sql
SELECT DISTINCT client
FROM rate_limits
WHERE CAST(blocked AS INTEGER) > 0
```

**Equivalent forms**

```sql
/* GROUP BY rewrite */
SELECT client
FROM rate_limits
WHERE CAST(blocked AS INTEGER) > 0
GROUP BY client;

/* Semi-join rewrite (scales when the filter grows) */
SELECT DISTINCT r.client
FROM rate_limits r
WHERE EXISTS (
  SELECT 1 FROM rate_limits r2
  WHERE r2.client = r.client
    AND CAST(r2.blocked AS INTEGER) > 0
);
```

> **Cost Analysis**
>
> Scan cost is dominated by the 3M-row table read. The CAST is cheap per row but blocks index use on `blocked` unless a functional index `(CAST(blocked AS INTEGER))` exists. After the WHERE, the row count drops sharply (zipf skew on `blocked` means most rows are `'0'`), so the DISTINCT stage sorts or hashes a small fraction of the input. Expected plan: Seq Scan → Filter → HashAggregate on `client`.

> **Interviewers Watch For**
>
> Strong candidates call out the TEXT type immediately and ask whether `blocked` can be NULL or contain non-numeric strings before writing a cast. They also note that `DISTINCT client` and `GROUP BY client` produce the same result here and pick one consciously. Weak candidates write `WHERE blocked > 0` and move on, missing that the query is implicitly coercing every row.

> **Common Pitfall**
>
> Assuming implicit coercion is safe. On Postgres, `'3' > 0` errors. On SQLite, it returns results but via string comparison rules that disagree with integer ordering. On MySQL, it coerces but loses predicate pushdown. Always cast the TEXT side explicitly, and decide what to do with rows where the cast itself would fail (use `TRY_CAST` or a regex filter if the source is dirty).

---

## Common follow-up questions

- What happens if `blocked` contains values like 'null', empty string, or '3 '? How do you make the cast robust? _(Tests whether the candidate accounts for dirty data in a TEXT column that should hold integers. The production answer is usually TRY_CAST or a regex guard, not letting the query crash mid-scan.)_
- If this query ran every minute and the table was 300M rows, what index would you add? Why does a btree on `blocked` not help? _(Tests indexing intuition on a TEXT column that is always used as an integer. The answer is a functional index or a generated column, not a plain btree on `blocked`.)_
- Is `SELECT DISTINCT client` cheaper or more expensive than `GROUP BY client` on a table this size? When would the two diverge? _(Tests whether the candidate sees that DISTINCT and GROUP BY collapse to the same plan on modern engines, and whether they know when the forms diverge.)_
- Given `client` is heavily skewed, how would you estimate the result-set size before running the query, and what does that tell you about whether to materialize or stream the output? _(Tests understanding of cardinality and skew. `client` is zipf-distributed over 60k values, so a tiny head set owns most of the blocked rows.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/active_searchers)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.