# Transaction-Only Features

> Exclusive to one source. Missing from the other.

Canonical URL: <https://datadriven.io/problems/transaction_only_features>

Domain: SQL · Difficulty: hard · Seniority: L5

## Problem

The ML team is isolating features exclusive to the 'transactions' source that do not appear in 'page_views' or 'ad_impressions'. Two features are considered equal if they share the same name and data type. Show the feature name, data type, average value, and null percentage, ordered by feature name.

## Worked solution and explanation

### Why this problem exists in real interviews

This tests set difference logic in SQL. Interviewers check whether you can identify rows in one subset that do not exist in another, using the correct equality criteria across multiple columns.

> **Trick to Solving**
>
> "Features exclusive to X that do not appear in Y or Z" is a set difference pattern. The equality condition spans two columns (`feat_name` and `dtype`), so you need a compound NOT EXISTS or EXCEPT.
> 
> 1. Filter to `source = 'transactions'`
> 2. Exclude any (feat_name, dtype) pair that also appears in 'page_views' or 'ad_impressions'
> 3. Aggregate the remaining features

---

### Break down the requirements

#### Step 1: Identify transaction-source features

Filter `ml_features` to `source = 'transactions'` to get the candidate set.

#### Step 2: Exclude features shared with other sources

Use `NOT EXISTS` with a correlated subquery matching on `(feat_name, dtype)` where `source IN ('page_views', 'ad_impressions')`.

#### Step 3: Aggregate and order

Select `feat_name`, `dtype`, `AVG(avg_val)`, `AVG(null_pct)` grouped by name and type, ordered by `feat_name`.

---

### The solution

**Set difference with compound key exclusion**

```sql
SELECT feat_name, dtype, AVG(avg_val) AS avg_val, AVG(null_pct) AS null_pct
FROM ml_features t
WHERE source = 'transactions'
  AND NOT EXISTS (
      SELECT 1 FROM ml_features other
      WHERE other.feat_name = t.feat_name
        AND other.dtype = t.dtype
        AND other.source IN ('page_views', 'ad_impressions')
  )
GROUP BY feat_name, dtype
ORDER BY feat_name
```

> **Cost Analysis**
>
> The correlated subquery executes for each row in the transactions subset. With 20M total rows and ~2M per source, an index on `(feat_name, dtype, source)` is critical to avoid a nested-loop full scan.

> **Interviewers Watch For**
>
> Whether you match on both `feat_name` AND `dtype` as the prompt specifies. Matching on name alone would incorrectly exclude features that share a name but differ in type.

> **Common Pitfall**
>
> Using `NOT IN` with a subquery that could contain NULLs. If `feat_name` is nullable, `NOT IN` returns no rows when the sublist contains NULL. `NOT EXISTS` is safer.

---

## Common follow-up questions

- How would you rewrite this using EXCEPT? _(Tests EXCEPT syntax for set differences across compound keys.)_
- What if you needed features unique to each source, not just transactions? _(Generalizes to a self-join or GROUP BY with HAVING COUNT(DISTINCT source) = 1.)_
- What if avg_val has NULLs and you need to exclude those from the average? _(AVG already ignores NULLs, but the prompt says 'recorded average value' which may imply filtering.)_
- How would query performance change with 100 sources instead of 10? _(The IN list grows, but the index-based lookup remains efficient per probe.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/transaction_only_features)
- [SQL Interview Questions](https://datadriven.io/sql-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.