# Schema Checker

> The schema says one thing. The data says another.

Canonical URL: <https://datadriven.io/problems/schema_checker>

Domain: Python · Difficulty: easy · Seniority: L3

## Problem

Given a record dict and a list of required keys, return a sorted list of keys that appear in required but NOT in record.

## Worked solution and explanation

### Why this problem exists in real interviews

Schema validation shows up at every API boundary: request bodies, ETL records, config loaders. Interviewers use this to test whether you reach for set algebra, whether you respect the input contract (sorted output), and whether you handle the empty-required and empty-record edge cases without a tangle of branches.

---

### Break down the requirements

#### Step 1: Frame as set difference

'Keys in required but not in record' is set(required) - set(record.keys()). Set difference is O(len(required)) average and reads exactly like the spec sentence, which is the strongest signal of correctness you can give an interviewer.

#### Step 2: Sort the result

Sets are unordered, but the spec wants a sorted list. Wrap the difference in sorted(...) to lock in deterministic output, which also makes the function trivially testable with assertEqual.

#### Step 3: Handle empty inputs naturally

Empty required returns []. Empty record returns sorted(required). The set-difference idiom handles both without a special case, so do not introduce one; extra branches are negative signal here.

---

### The solution

**Set difference, sorted**

```python
def missing_keys(record: dict, required: list) -> list:
    return sorted(set(required) - set(record.keys()))
```

> **Cost Analysis**
>
> Time is O(r + k + d log d) where r is len(required), k is len(record), and d is the size of the difference. Space is O(r + k) for the two sets. For typical schemas with under a hundred keys this is effectively constant.

> **Interviewers Watch For**
>
> Whether you reach for set difference instead of a list comprehension with 'in record' (which is also correct but reads less cleanly), whether you remember to sort, and whether you avoid mutating the inputs. Strong candidates also note that record.keys() returns a view, not a list, and is O(1) to construct.

> **Common Pitfall**
>
> Iterating with 'for k in required: if k not in record.keys(): ...' which is correct but quadratic if you forget that .keys() over a dict supports O(1) membership only because Python special-cases it (in older versions or when iterating .keys() into a list, this becomes O(k) per check). The set form is unambiguous. Another miss is returning the set directly; the spec wants a list.

---

## Common follow-up questions

- What changes if required keys can be nested (dotted paths like 'user.email')? _(discuss splitting on '.', recursing into sub-dicts, or using jsonpath / pydantic for production cases.)_
- How would you also report keys in record that are NOT in required (extras)? _(second set difference in the other direction; return a tuple or dict with both lists.)_
- How would you scale this to validate millions of records against the same schema? _(compute set(required) once outside the hot loop, use frozenset, or compile a pydantic model for cached validation.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/schema_checker)
- [Python Interview Questions](https://datadriven.io/python-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.