# The Schema Migrator

> Old schema in, new schema out.

Canonical URL: <https://datadriven.io/problems/the_schema_migrator>

Domain: Python · Difficulty: hard · Seniority: L5

## Problem

Given a list of record dicts and a list of operations, apply each operation in order to every record and return the transformed records. Each op is a dict with key 'op' (the operation) and key 'path' (dot-separated for nested fields, e.g. 'addr.country'). Supported ops: 'rename' (also carries 'new_name'), 'add' (also carries 'default' as the value to insert), 'remove' (no extra fields), 'cast' (also carries 'target_type' in 'int' / 'float' / 'str'). For 'add' on a nested path, create missing intermediate dicts. 'rename' / 'remove' / 'cast' on a missing path are no-ops.

## Worked solution and explanation

### Why this problem exists in real interviews

Schema migrations are dispatch-style code: the operation kind decides the action, the dot-path decides where in the record to act. The interviewer is looking for a clean separation of (1) walking the dot-path to the parent dict and final key, and (2) dispatching on `op['op']` to apply the change. They also watch for the missing-parent case in `add`.

> **Trick to Solving**
>
> Walk the path **to the parent of the leaf**, not all the way to the leaf.
> 
> 1. `parts = op['path'].split('.')`; `*parents, leaf = parts`
> 2. For `add`, create missing intermediate dicts via `setdefault`
> 3. For `rename` / `remove` / `cast`, no-op if the leaf isn't there
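
The split-and-unpack step can be sketched in isolation. This is a minimal illustration, assuming paths are non-empty strings with no leading, trailing, or doubled dots; the helper name `split_path` is hypothetical and does not appear in the solution.

```python
def split_path(path: str) -> tuple[list[str], str]:
    """Split a dot-path into (parent segments, leaf key)."""
    # Star-unpacking puts every segment except the last into `parents`.
    *parents, leaf = path.split('.')
    return parents, leaf


# A nested path yields the keys to walk plus the final key to act on.
print(split_path('addr.geo.country'))  # (['addr', 'geo'], 'country')

# A top-level path has no parents, so the record itself is the container.
print(split_path('age'))  # ([], 'age')
```

The empty `parents` list for a top-level path is why the same walking loop handles flat and nested fields without a special case.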

---

### Break down the requirements

#### Step 1: Per-record, per-op iteration

Process records independently and operations in order. Make a deep copy of each record so you can safely mutate it without affecting the input or sibling records.

#### Step 2: Resolve the dot-path

Split `op['path']` on `.` and walk the record dict to the parent container. For the three ops that never create structure (rename, remove, cast), skip the op silently if any segment is missing. For `add`, call `container.setdefault(segment, {})` at each step so missing intermediates are created.
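
One way to factor this step out is a small resolver with a `create` flag. The sketch below is an illustration, not the solution's own structure: the name `resolve_parent`, the `create` parameter, and the choice to return `None` when an intermediate value is not a dict are all assumptions (the problem statement does not say what should happen in that last case).

```python
from typing import Optional


def resolve_parent(record: dict, path: str, create: bool = False) -> Optional[tuple[dict, str]]:
    """Return (parent dict, leaf key) for `path`, or None if the walk fails.

    With create=True (the `add` case), missing intermediate dicts are created.
    """
    *parents, leaf = path.split('.')
    container = record
    for key in parents:
        if not isinstance(container, dict):
            return None  # assumption: a non-dict intermediate blocks the walk
        if create:
            # Reuse the existing child, or insert and descend into a new dict.
            container = container.setdefault(key, {})
        elif key in container:
            container = container[key]
        else:
            return None  # missing segment: caller treats the op as a no-op
    if not isinstance(container, dict):
        return None
    return container, leaf


rec = {'addr': {'city': 'Oslo'}}
print(resolve_parent(rec, 'addr.city'))             # ({'city': 'Oslo'}, 'city')
print(resolve_parent(rec, 'geo.lat'))               # None
print(resolve_parent(rec, 'geo.lat', create=True))  # ({}, 'lat')
print(rec)  # {'addr': {'city': 'Oslo'}, 'geo': {}}
```

Note that the resolver stops at the parent and never checks whether the leaf exists; that check belongs to the op being applied.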

#### Step 3: Apply the four op kinds

Dispatch on `op['op']`:

- **rename**: `parent[op['new_name']] = parent.pop(leaf)`
- **add**: `parent[leaf] = op['default']`
- **remove**: `parent.pop(leaf, None)`
- **cast**: `parent[leaf] = {'int': int, 'float': float, 'str': str}[op['target_type']](parent[leaf])`
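
The four actions above can also be written as a dispatch table instead of an `if`/`elif` chain. This is a sketch of that alternative, assuming the parent dict has already been resolved; the handler names, `HANDLERS`, and `apply_to_parent` are hypothetical and are not part of the solution.

```python
CAST_MAP = {'int': int, 'float': float, 'str': str}


def do_rename(parent: dict, leaf: str, op: dict) -> None:
    parent[op['new_name']] = parent.pop(leaf)


def do_add(parent: dict, leaf: str, op: dict) -> None:
    parent[leaf] = op['default']


def do_remove(parent: dict, leaf: str, op: dict) -> None:
    parent.pop(leaf, None)


def do_cast(parent: dict, leaf: str, op: dict) -> None:
    parent[leaf] = CAST_MAP[op['target_type']](parent[leaf])


# Hypothetical dispatch table: op kind -> handler.
HANDLERS = {'rename': do_rename, 'add': do_add, 'remove': do_remove, 'cast': do_cast}


def apply_to_parent(parent: dict, leaf: str, op: dict) -> None:
    """Apply one op to an already-resolved parent dict."""
    # Only `add` may act when the leaf is absent; the other three are no-ops.
    if op['op'] != 'add' and leaf not in parent:
        return
    HANDLERS[op['op']](parent, leaf, op)


rec = {'age': '42', 'nick': 'al'}
apply_to_parent(rec, 'age', {'op': 'cast', 'path': 'age', 'target_type': 'int'})
apply_to_parent(rec, 'nick', {'op': 'rename', 'path': 'nick', 'new_name': 'alias'})
apply_to_parent(rec, 'gone', {'op': 'remove', 'path': 'gone'})  # missing leaf: no-op
apply_to_parent(rec, 'active', {'op': 'add', 'path': 'active', 'default': True})
print(rec)  # {'age': 42, 'alias': 'al', 'active': True}
```

A table keeps each action testable on its own and makes adding a fifth op a one-line change, at the cost of a little indirection compared with the inline chain.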

---

### The solution

**Dot-path resolver with op dispatch**

```python
import copy

# Maps the 'target_type' string carried by a cast op to the builtin converter.
CAST_MAP = {'int': int, 'float': float, 'str': str}

def migrate_records(records: list[dict], operations: list[dict]) -> list[dict]:
    out = []
    for record in records:
        # Deep copy so nested dicts are not shared with the caller's input.
        rec = copy.deepcopy(record)
        for op in operations:
            # Split into the keys to walk (parents) and the key to act on (leaf).
            parts = op['path'].split('.')
            *parents, leaf = parts
            kind = op['op']
            if kind == 'add':
                # 'add' is the only op that creates missing intermediate dicts.
                container = rec
                for p in parents:
                    container = container.setdefault(p, {})
                container[leaf] = op['default']
                continue
            # rename / remove / cast: walk without creating anything.
            container = rec
            missing = False
            for p in parents:
                if not isinstance(container, dict) or p not in container:
                    missing = True
                    break
                container = container[p]
            # Missing parent, non-dict parent, or missing leaf: silent no-op.
            if missing or not isinstance(container, dict) or leaf not in container:
                continue
            if kind == 'rename':
                container[op['new_name']] = container.pop(leaf)
            elif kind == 'remove':
                del container[leaf]
            elif kind == 'cast':
                container[leaf] = CAST_MAP[op['target_type']](container[leaf])
        out.append(rec)
    return out
```

> **Time and Space Complexity**
>
> **Time:** O(r * (s + o * d)) where r is records, s is the average record size, o is operations, and d is dot-path depth. The deep copy costs O(s) per record, and each op walks at most d steps.
> 
> **Space:** O(r * s) for the deep-copied output where s is the average record size.

> **Interviewers Watch For**
>
> Strong candidates separate path resolution from op dispatch and treat `add` as the only path-creating op. They reach for `setdefault` to build intermediates lazily and recognize that `pop(leaf, None)` makes `remove` idempotent.

> **Common Pitfall**
>
> Mutating the original record. `dict(record)` is a shallow copy, so `record['addr']` is shared between the input and your output. After `add 'addr.country'` you've poisoned the input. Use `copy.deepcopy(record)` once per record.
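
The aliasing described above is easy to reproduce. The sketch below uses made-up sample records (`original_a`, `original_b`) purely for illustration.

```python
import copy

original_a = {'name': 'Ada', 'addr': {'city': 'London'}}
original_b = {'name': 'Ada', 'addr': {'city': 'London'}}

# Shallow copy: the top-level dict is new, but 'addr' is the same object.
shallow = dict(original_a)
shallow['addr']['country'] = 'UK'
print(original_a['addr'])  # {'city': 'London', 'country': 'UK'}  (input mutated)

# Deep copy: nested dicts are duplicated, so the input stays untouched.
deep = copy.deepcopy(original_b)
deep['addr']['country'] = 'UK'
print(original_b['addr'])  # {'city': 'London'}
```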

---

## Common follow-up questions

- What should happen when `rename` targets a path whose parent doesn't exist? Silent no-op vs raise vs log? _(Tests defensive navigation with try/except or existence checks.)_
- How would you support array indexing in dot paths (e.g., 'items.0.name')? _(Tests parsing numeric path segments as list indices.)_
- If a rename runs before a cast on the same field, do you target the old name or the new one? How does the order of operations encode that? _(Tests operation ordering and dependency resolution.)_
- How would you validate the operations list against a schema before touching any records? _(Tests dry-run validation that checks path existence and type compatibility.)_
- At what data volume would you stop running this in Python and push the migration into Spark or Beam? _(Tests awareness of Spark or Beam for large-scale schema migration.)_
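
For the validation follow-up, one possible starting point is a shape check on the operations list before any record is touched. This sketch covers only whether each op is well-formed according to the problem statement; checking path existence or type compatibility against a record schema would need a schema representation that the problem does not define. The names `validate_operations`, `REQUIRED_EXTRA`, and `SUPPORTED_CASTS`, and the error message wording, are hypothetical.

```python
SUPPORTED_CASTS = ('int', 'float', 'str')
# Extra key each op kind must carry, per the problem statement.
REQUIRED_EXTRA = {'rename': 'new_name', 'add': 'default', 'remove': None, 'cast': 'target_type'}


def validate_operations(operations: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means every op is well-formed."""
    errors = []
    for i, op in enumerate(operations):
        kind = op.get('op')
        if not isinstance(kind, str) or kind not in REQUIRED_EXTRA:
            errors.append(f"op {i}: unsupported op {kind!r}")
            continue
        path = op.get('path')
        # Reject non-strings and paths with empty segments such as '' or 'a..b'.
        if not isinstance(path, str) or not all(path.split('.')):
            errors.append(f"op {i}: invalid path {path!r}")
        extra = REQUIRED_EXTRA[kind]
        if extra is not None and extra not in op:
            errors.append(f"op {i}: {kind} requires {extra!r}")
        elif kind == 'cast' and op['target_type'] not in SUPPORTED_CASTS:
            errors.append(f"op {i}: unsupported target_type {op['target_type']!r}")
    return errors


ops = [
    {'op': 'cast', 'path': 'age', 'target_type': 'int'},
    {'op': 'cast', 'path': 'age', 'target_type': 'bool'},
    {'op': 'rename', 'path': 'addr..city'},
    {'op': 'upsert', 'path': 'x'},
]
for problem in validate_operations(ops):
    print(problem)
```

Collecting every problem instead of raising on the first one lets a caller report all malformed ops in a single dry run.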

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.