# The Record Reconciler

> Two versions of the same truth.

Canonical URL: <https://datadriven.io/problems/the_record_reconciler>

Domain: Python · Difficulty: medium · Seniority: L4

## Problem

Given source_a and source_b (each a list of dicts) and an id_field name, reconcile the two by id and return a dict with four keys: 'only_a' (sorted list of ids in source_a but not source_b), 'only_b' (sorted list of ids in source_b but not source_a), 'matches' (sorted list of ids whose records are identical in both sources), and 'mismatches' (sorted list of {'id': id, 'differences': {field: {'a': a_val, 'b': b_val}}} dicts for ids whose records differ on at least one non-id field).

## Worked solution and explanation

### Why this problem exists in real interviews

This is the canonical "diff two record sets by id" task that data engineers do every week. The interviewer is checking that you index both sources by id, use set difference and intersection for the membership classification, and produce a structured per-id diff for the records that don't match.

---

### Break down the requirements

#### Step 1: Index both sources by id

Build `map_a` and `map_b`: dicts keyed by `record[id_field]` so id lookups are O(1). The output has four keys: `only_a`, `only_b`, `matches`, `mismatches`.

#### Step 2: Membership classification with set operations

Use Python set operations on the two id sets: `only_a = ids_a - ids_b`, `only_b = ids_b - ids_a`, `both = ids_a & ids_b`. Sort the first two before returning.

#### Step 3: Field-level diff for matched ids

For each id in `both`, compare the two records field by field over the union of their non-id keys. If every non-id field is equal, the id goes in `matches`. Otherwise, build `{field: {'a': a_val, 'b': b_val}}` for each differing field and append `{'id': id, 'differences': ...}` to `mismatches`.

#### Step 4: Sort and assemble

Return the four keys with `only_a`, `only_b`, `matches` sorted by id, and `mismatches` sorted by `id`. Use `key=lambda d: d['id']` for the mismatch sort.

---

### The solution

**Hash-by-id, set diff for membership, field-by-field diff for content**

```python
def reconcile(source_a: list[dict], source_b: list[dict], id_field: str) -> dict:
    map_a = {r[id_field]: r for r in source_a}
    map_b = {r[id_field]: r for r in source_b}
    ids_a = set(map_a)
    ids_b = set(map_b)
    only_a = sorted(ids_a - ids_b)
    only_b = sorted(ids_b - ids_a)
    matches = []
    mismatches = []
    for rid in sorted(ids_a & ids_b):
        a, b = map_a[rid], map_b[rid]
        fields = (set(a) | set(b)) - {id_field}
        diffs = {}
        for f in fields:
            if a.get(f) != b.get(f):
                diffs[f] = {'a': a.get(f), 'b': b.get(f)}
        if diffs:
            mismatches.append({'id': rid, 'differences': diffs})
        else:
            matches.append(rid)
    return {
        'only_a': only_a,
        'only_b': only_b,
        'matches': matches,
        'mismatches': mismatches,
    }
```

> **Time and Space Complexity**
>
> **Time:** O(n + m + k * f) where n = |source_a|, m = |source_b|, k = matched ids, f = avg fields per record.
> 
> **Space:** O(n + m) for the two id-indexed maps.

> **Interviewers Watch For**
>
> Strong candidates use the **union** of field names per matched pair so a field that exists in only one source still shows up in the diff. They also keep `id` out of the diff loop so it isn't double-reported.

> **Common Pitfall**
>
> Iterating only over the keys of `source_a` records when diffing. If `source_b` has a field that `source_a` doesn't, that difference is silently missed. Using `set(a) | set(b)` over each matched pair catches both directions.

---

## Common follow-up questions

- How would you present the diff in a human-readable report? _(Tests formatting diffs with old/new values in a table layout.)_
- What if records had nested structures? _(Tests recursive field comparison.)_
- How would this scale to millions of records? _(Tests streaming reconciliation with sorted merge or database-side comparison.)_
- What if the ID field were not unique in one of the lists? _(Tests handling multi-valued mappings or flagging the anomaly.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/the_record_reconciler)
- [Python Interview Questions](https://datadriven.io/python-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.