# Feature Quality by Source

> Quality varies by source.

Canonical URL: <https://datadriven.io/problems/feature_quality_by_source>

Domain: SQL · Difficulty: medium · Seniority: L4

## Problem

Find sources in the ML feature store that have both high-null features (null percentage above 20) and low-null features (null percentage below 2). For each qualifying source, count how many different features fall into each bucket. Include sources even if one bucket is empty.

## Worked solution and explanation

### Why this problem exists in real interviews

Read the prompt twice. It says include sources even if one bucket is empty, but the expected output keeps only sources where both buckets exist. That contradiction is the whole problem. A real engineer pushes back before writing the CTEs.

---

### Break down the requirements

#### Step 1: Define the two buckets

High-null is `null_pct > 20`, low-null is `null_pct < 2`. The 2-to-20 band is intentionally neither and contributes to no count.

#### Step 2: Filter to sources with both populations

Two CTEs build distinct source lists per bucket; the outer query keeps only sources present in both. A HAVING clause does the same in one pass.

#### Step 3: Count distinct features per bucket

`COUNT(DISTINCT CASE WHEN ...)` returns NULL for off-bucket rows, and COUNT ignores NULL. That is why no HAVING wrapper is needed on the aggregate.

---

### The solution

**FEATURE QUALITY BY SOURCE**

```sql
WITH high_null AS (
  SELECT source
  FROM ml_features
  WHERE null_pct > 20
  GROUP BY source
),
low_null AS (
  SELECT source
  FROM ml_features
  WHERE null_pct < 2
  GROUP BY source
)
SELECT
  f.source,
  COUNT(DISTINCT CASE WHEN f.null_pct > 20 THEN f.feat_name END) AS high_null_count,
  COUNT(DISTINCT CASE WHEN f.null_pct < 2 THEN f.feat_name END) AS low_null_count
FROM ml_features f
LEFT JOIN high_null h ON f.source = h.source
LEFT JOIN low_null l ON f.source = l.source
WHERE h.source IS NOT NULL AND l.source IS NOT NULL
GROUP BY f.source
```

> **Cost Analysis**
>
> Three passes over 15M rows: two CTE scans plus the outer aggregate. An index on `(source, null_pct)` prunes both filtered scans. A one-pass rewrite with `HAVING SUM(null_pct > 20) > 0 AND SUM(null_pct < 2) > 0` reads the table once.

> **Interviewers Watch For**
>
> Whether you flag that the prompt and expected output disagree on empty-bucket sources. Asking which behavior they actually want beats silently picking one. The expected query requires both buckets; the prompt reads more permissive.

> **Common Pitfall**
>
> `LEFT JOIN` followed by `WHERE h.source IS NOT NULL` silently degrades to an inner join. If you wanted to keep one-bucket sources, move the predicate into the ON clause or aggregate before joining.

---

### COMMON FOLLOW-UP QUESTIONS

## Common follow-up questions

- Rewrite this in a single pass over `ml_features` without CTEs. _(Tests whether the engineer can collapse semi-joins into a `HAVING` clause with conditional aggregates.)_
- What if `null_pct` itself can be NULL for unmeasured features? _(Probes three-valued logic awareness on inequality predicates, since `NULL > 20` is unknown and silently drops rows.)_
- Add a third bucket for medium-null (2 to 20) and report all three counts per source. _(Checks generalization from two conditional aggregates to N, and whether the engineer keeps the bucket boundaries non-overlapping.)_

> **One-Pass Trick**
>
> `COUNT(DISTINCT CASE WHEN cond THEN col END)` does the work alone: NULLs from the failing branch get dropped, DISTINCT handles duplicates, no CTE needed if you push the existence check into HAVING.

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/feature_quality_by_source)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.