# Feature Name Intersection

> Training names versus serving names. The overlap.

Canonical URL: <https://datadriven.io/problems/feature_name_intersection>

Domain: SQL · Difficulty: hard · Seniority: L4

## Problem

The ML feature store pulls feature names from two sources: 'ad_impressions' and 'search_queries'. Find which feature names appear in both sources.

## Worked solution and explanation

### Why this problem exists in real interviews

Built around `ml_features`, this problem requires pattern matching on text columns. Interviewers watch whether candidates handle the relationship between `feat_name`, `dtype`, `avg_val` correctly under grouping pressure.

---

### Break down the requirements

#### Step 1: Apply string filter on `feat_name`

Use `LIKE '%keyword%'` or `INSTR(feat_name, 'keyword') > 0` to match the target pattern.

#### Step 2: Return matching rows

Select the relevant columns for the matching records.

---

### The solution

**Pattern-match for feature name intersection**

```sql
SELECT *
FROM ml_features
WHERE feat_name LIKE '%pattern%'
ORDER BY feat_id
```

> **Cost Analysis**
>
> `LIKE '%pattern%'` cannot use a B-tree index. For production, consider a full-text index or pre-computed column.

> **Interviewers Watch For**
>
> Strong candidates mention case sensitivity: `LIKE` is case-sensitive in most dialects. Use `LOWER()` for insensitive matching.

> **Common Pitfall**
>
> Leading wildcards (`%pattern%`) prevent index usage. Trailing-only wildcards (`pattern%`) can use an index.

---

## Common follow-up questions

- The `avg_val` column in `ml_features` has roughly 8% NULLs. How does your query handle those rows, and would the result change if NULLs were replaced with zeros? _(Tests whether the candidate understands how NULLs propagate through aggregation functions and whether their WHERE/JOIN conditions implicitly filter them out.)_
- Your LIKE pattern on `feat_name` is case-sensitive by default. Would switching to ILIKE change your result set, and what is the index cost? _(Tests awareness of case sensitivity in pattern matching and its impact on index usage.)_
- `feat_id` in `ml_features` has ~20M distinct values. What index strategy keeps your query from doing a full table scan? _(Tests whether the candidate can design indexes for high-cardinality columns and understands selectivity.)_
- If `feat_id` in `ml_features` contained negative values, would your query still produce correct results? _(Tests whether the candidate validated assumptions about the domain of numeric columns.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/feature_name_intersection)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.