# The Change Data Capture

> Inserts, updates, deletes :  all present.

Canonical URL: <https://datadriven.io/problems/the_change_data_capture>

Domain: Python · Difficulty: hard · Seniority: L5

## Problem

Given old_snapshot, new_snapshot (lists of dicts keyed by id_field), return a dict with four sorted lists: 'added' (records in new only), 'removed' (in old only), 'unchanged' (identical across both), 'modified' (same id, different other fields). Sort each list by id_field ascending.

## Worked solution and explanation

### Why this problem exists in real interviews

CDC (Change Data Capture) is a core data engineering pattern. Comparing two snapshots to produce inserts, updates, and deletes tests **set-based diffing** and whether you can identify record-level changes efficiently.

---

### Break down the requirements

#### Step 1: Index both snapshots by the key field

Build dicts keyed by the record's unique identifier for O(1) lookup.

#### Step 2: Compute inserts, deletes, and updates

Inserts: keys in new but not old. Deletes: keys in old but not new. Updates: keys in both where the record differs.

#### Step 3: Compare records for changes

For updates, compare all non-key fields to detect modifications.

---

### The solution

**Snapshot diffing with set operations for CDC**

```python
def compute_cdc(old_records, new_records, key_field):
    old_index = {}
    for record in old_records:
        old_index[record[key_field]] = record
    new_index = {}
    for record in new_records:
        new_index[record[key_field]] = record
    old_keys = set(old_index.keys())
    new_keys = set(new_index.keys())
    inserts = []
    for k in sorted(new_keys - old_keys):
        inserts.append(new_index[k])
    deletes = []
    for k in sorted(old_keys - new_keys):
        deletes.append(old_index[k])
    updates = []
    for k in sorted(old_keys & new_keys):
        if old_index[k] != new_index[k]:
            updates.append(new_index[k])
    result = {
        'inserts': inserts,
        'deletes': deletes,
        'updates': updates
    }
    return result
```

> **Time and Space Complexity**
>
> **Time:** O(n + m) for indexing plus O(k log k) for sorting the change sets, where n and m are snapshot sizes and k is the change count.
> 
> **Space:** O(n + m) for the two index dicts.

> **Interviewers Watch For**
>
> Using set operations (`-` for difference, `&` for intersection) to categorize keys. This is cleaner and faster than nested loops.

> **Common Pitfall**
>
> Comparing records with `==` when dicts contain float values or nested structures. Floating-point comparison issues may cause false positives on updates.

---

## Common follow-up questions

- What if records have a last_modified timestamp you can use? _(Tests optimizing update detection by comparing timestamps instead of full record equality.)_
- How would you handle soft deletes? _(Tests adding a 'deleted' flag instead of removing records from the new snapshot.)_
- How does this scale to millions of records? _(Tests hash-based partitioning: split by key range and process each partition independently.)_
- What is the difference between snapshot CDC and log-based CDC? _(Tests understanding that log-based CDC reads database change logs (WAL) rather than comparing snapshots.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/the_change_data_capture)
- [Python Interview Questions](https://datadriven.io/python-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.