# The Schema Differ

> Schema from yesterday vs today. Something changed.

Canonical URL: <https://datadriven.io/problems/the_schema_differ>

Domain: Python · Difficulty: medium · Seniority: L4

## Problem

Given old_schema and new_schema (each a list of {'name', 'type'} dicts), return a dict with: 'added' (list of column names in new only), 'removed' (list of column names in old only), 'type_changed' (list of {'name', 'old_type', 'new_type'} dicts where the same column name has different types). Sort each list by name ascending.

## Worked solution and explanation

### Why this problem exists in real interviews

This tests **set-based comparison on structured records**, specifically column schema drift detection. It mirrors a real data governance task where staging and production schemas must be compared for compatibility.

---

### Break down the requirements

#### Step 1: Index both schemas by column name

Build a dict mapping column name to column type for each schema.

#### Step 2: Find added and removed columns

Columns in schema2 but not schema1 are added. The reverse are removed.

#### Step 3: Find columns with changed types

For columns in both schemas, compare their types and record mismatches.

---

### The solution

**Name-indexed schema comparison**

```python
def diff_schemas(schema1, schema2):
    map1 = {}
    for col in schema1:
        map1[col['name']] = col['type']
    map2 = {}
    for col in schema2:
        map2[col['name']] = col['type']
    names1 = set(map1.keys())
    names2 = set(map2.keys())
    added = sorted(names2 - names1)
    removed = sorted(names1 - names2)
    type_changed = {}
    for name in names1 & names2:
        if map1[name] != map2[name]:
            type_changed[name] = {'old': map1[name], 'new': map2[name]}
    result = {
        'added': added,
        'removed': removed,
        'type_changed': type_changed
    }
    return result
```

> **Time and Space Complexity**
>
> **Time:** O(n + m) where n and m are the column counts in each schema.
> 
> **Space:** O(n + m) for the indexed maps and result.

> **Interviewers Watch For**
>
> Do you index by column name before comparing? This is O(n + m) instead of O(n * m) for nested-loop comparison.

> **Common Pitfall**
>
> Treating column order changes as schema drift. The problem cares about names and types, not positions. Indexing by name naturally ignores ordering.

---

## Common follow-up questions

- How would you detect column renames? _(Tests heuristic matching: same type, similar position, only one candidate.)_
- What if schemas had nullable flags and default values? _(Tests extending the comparison to additional column attributes.)_
- How would you auto-generate ALTER TABLE statements from the diff? _(Tests translating diff categories to DDL: ADD COLUMN, DROP COLUMN, ALTER COLUMN TYPE.)_
- What if there were hundreds of schema versions to compare? _(Tests pairwise diff chaining or baseline comparison strategies.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/the_schema_differ)
- [Python Interview Questions](https://datadriven.io/python-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.