# Batch Partitioner

> One pile becomes many. Split wisely.

Canonical URL: <https://datadriven.io/problems/batch_partitioner>

Domain: Python · Difficulty: medium · Seniority: L4

## Problem

Given a list of dicts and a key name, return a dict mapping each distinct value of records[i][key] to the list of records with that value (in original order).

## Worked solution and explanation

### Why this problem exists in real interviews

Grouping a list of dicts by a key is the Python-side analog of SQL GROUP BY, and it shows up throughout ETL pipelines: partition events by user, bucket records by status, split API responses by tenant. Interviewers grade whether you reach for collections.defaultdict (the right tool), whether you preserve input order within each partition, and whether you avoid the copy-paste-from-StackOverflow itertools.groupby trap, which only works on input already sorted by the grouping key.

---

### Break down the requirements

#### Step 1: Use defaultdict(list) for the output

defaultdict(list) creates an empty list the first time a missing key is accessed, so you can .append() without checking if the key exists. This is exactly the 'append to bucket, create bucket if missing' idiom the problem needs. A plain dict with .setdefault(k, []).append(v) is equivalent but reads noisier.
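
A minimal sketch of the two idioms side by side, assuming a small hypothetical rows list with a status field (the sample data and variable names are illustrative, not part of the problem statement):

```python
from collections import defaultdict

# Hypothetical sample data, for illustration only.
rows = [
    {"id": 1, "status": "ok"},
    {"id": 2, "status": "failed"},
    {"id": 3, "status": "ok"},
]

# Idiom 1: defaultdict(list) builds the empty list when a missing key is looked up.
with_default = defaultdict(list)
for row in rows:
    with_default[row["status"]].append(row)

# Idiom 2: plain dict with setdefault; same result, more punctuation per line.
with_setdefault = {}
for row in rows:
    with_setdefault.setdefault(row["status"], []).append(row)

print(with_default == with_setdefault)  # True
```

Either version is acceptable in an interview; the defaultdict form mostly wins on readability.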

#### Step 2: Iterate once, append in place

One pass over records: extract record[key], append the record to the bucket. Order within each bucket is preserved because list.append adds to the end, and buckets appear in first-seen order because Python dicts preserve insertion order (guaranteed since 3.7). The first record with a new key creates the bucket; later ones append to it.
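
To make the ordering concrete, the sketch below snapshots the buckets after each record. The events list and the seq field are hypothetical, chosen only so the order is easy to read.

```python
from collections import defaultdict

# Hypothetical events, for illustration only.
events = [
    {"user": "b", "seq": 1},
    {"user": "a", "seq": 2},
    {"user": "b", "seq": 3},
]

buckets = defaultdict(list)
trace = []
for event in events:
    buckets[event["user"]].append(event)
    # Snapshot: bucket keys in creation order, with the seq numbers in each bucket.
    trace.append({k: [e["seq"] for e in v] for k, v in buckets.items()})

for step in trace:
    print(step)
# {'b': [1]}
# {'b': [1], 'a': [2]}
# {'b': [1, 3], 'a': [2]}
```

Bucket 'b' is created first because its record arrives first, and seq 3 lands after seq 1 inside it.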

#### Step 3: Return a plain dict, not the defaultdict

defaultdict compares equal to a plain dict with the same contents, so returning it directly would still pass an equality check, but dict(buckets) converts it to a plain dict. This protects a caller who reads result[missing_key]: they get a KeyError instead of silently creating an empty list as a side effect.
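
A small sketch of the side effect in question, using hypothetical key names:

```python
from collections import defaultdict

buckets = defaultdict(list)
buckets["ok"].append({"id": 1})

# Plain-dict copy: a lookup of an absent key raises instead of inserting.
result = dict(buckets)
try:
    result["missing"]
    raised = False
except KeyError:
    raised = True

# Reading the same absent key on the defaultdict inserts an empty list.
leaked = buckets["missing"]

print(raised)                 # True
print("missing" in buckets)   # True
print("missing" in result)    # False
```

Note that dict(buckets) is a shallow copy: the new dict has its own keys, but the bucket lists themselves are shared with the defaultdict.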

---

### The solution

**defaultdict(list), one pass, return dict**

```python
from collections import defaultdict

def partition_by(records, key):
    buckets = defaultdict(list)
    for r in records:
        buckets[r[key]].append(r)
    return dict(buckets)
```
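
A short usage sketch, repeating the function so the block runs on its own. The orders data and the tenant field are hypothetical.

```python
from collections import defaultdict

def partition_by(records, key):
    buckets = defaultdict(list)
    for r in records:
        buckets[r[key]].append(r)
    return dict(buckets)

# Hypothetical input, for illustration only.
orders = [
    {"order_id": 101, "tenant": "acme"},
    {"order_id": 102, "tenant": "globex"},
    {"order_id": 103, "tenant": "acme"},
]

by_tenant = partition_by(orders, "tenant")
print({k: [o["order_id"] for o in v] for k, v in by_tenant.items()})
# {'acme': [101, 103], 'globex': [102]}

# Buckets hold references to the input dicts, not copies.
print(by_tenant["acme"][0] is orders[0])  # True

# Empty input yields an empty dict.
print(partition_by([], "tenant"))  # {}
```

Because the buckets share objects with the input, a caller who mutates a record inside a bucket also mutates the original list's record; copy the records first if that matters.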

> **Cost Analysis**
>
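> Time is O(n): one pass over records, with O(1) average cost per dict access and amortized O(1) per list append. Space is O(n) for the output (every input record ends up in some bucket). Both defaultdict and dict.setdefault are implemented in C; the difference is that setdefault(k, []) builds a throwaway empty list on every call, while defaultdict only calls list() when the key is missing. In practice the two run at essentially the same speed.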

> **Interviewers Watch For**
>
> Whether you pick defaultdict over itertools.groupby (groupby needs sorted input and is the wrong tool here), whether you preserve input order within each bucket (a natural consequence of iterating in order), and whether you handle missing keys gracefully. Strong candidates mention that records[i][key] raises KeyError if the key is missing and ask whether to skip, default, or raise.

> **Common Pitfall**
>
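> Using itertools.groupby without sorting first. groupby only groups consecutive equal keys, so on unsorted input it creates one 'group' per run, not per distinct key. The fix is sorted(records, key=lambda r: r[key]) before groupby, which is O(n log n) and requires the key values to be mutually comparable. Because sorted is stable, order within each bucket survives, but buckets come out in sorted-key order rather than first-seen order. Sort-then-groupby therefore does extra work and adds a constraint without buying anything over the one-pass defaultdict.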
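
The pitfall is easy to reproduce. The sketch below uses hypothetical records and collects only the ids so the output stays short:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical unsorted input, for illustration only.
records = [
    {"id": 1, "status": "ok"},
    {"id": 2, "status": "failed"},
    {"id": 3, "status": "ok"},
]
get_status = itemgetter("status")

# Unsorted: groupby yields one group per run, so "ok" appears twice.
runs = [(k, [r["id"] for r in g]) for k, g in groupby(records, key=get_status)]
print(runs)  # [('ok', [1]), ('failed', [2]), ('ok', [3])]

# Building a dict from those runs silently keeps only the last "ok" run.
lossy = {k: ids for k, ids in runs}
print(lossy)  # {'ok': [3], 'failed': [2]}

# Sorted first: one group per key, with keys emitted in sorted order.
ordered = sorted(records, key=get_status)
grouped = {k: [r["id"] for r in g] for k, g in groupby(ordered, key=get_status)}
print(grouped)  # {'failed': [2], 'ok': [1, 3]}
```

The lossy dict is the dangerous case: nothing raises, and record 1 simply disappears from the result.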

---

## Common follow-up questions

- What changes if records may be missing the key entirely? _(record.get(key, DEFAULT) to bucket missing-key records together, or a try/except KeyError to skip them. Discuss the product decision.)_
- How would you return only the counts per bucket instead of the records themselves? _(Counter(r[key] for r in records); one line, also O(n). Compare to the defaultdict(int) idiom.)_
- How would you parallelize this for a billion records? _(hash-partition by r[key] across workers, each builds a local defaultdict, then merge. Mention Spark/Beam groupByKey as the production version.)_
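
The three follow-ups can be sketched as below. All function names (partition_by_lenient, count_by, merge_partitions) and the sentinel are hypothetical, and the merge step only imitates in-process what a framework's shuffle and combine would do; it is not a distributed implementation.

```python
from collections import Counter, defaultdict

_MISSING = object()  # hypothetical sentinel meaning "no default bucket requested"


def partition_by_lenient(records, key, default=_MISSING):
    """Skip records lacking `key`, or bucket them under `default` if one is given."""
    buckets = defaultdict(list)
    for r in records:
        value = r.get(key, default)
        if value is _MISSING:
            continue  # policy choice: skip; raising is the other common option
        buckets[value].append(r)
    return dict(buckets)


def count_by(records, key):
    """Counts per distinct value, without keeping the records."""
    return Counter(r[key] for r in records)


def merge_partitions(partials):
    """Combine per-worker dicts; record order follows the order of `partials`."""
    merged = defaultdict(list)
    for partial in partials:
        for value, bucket in partial.items():
            merged[value].extend(bucket)
    return dict(merged)


# Hypothetical input: the second record has no "status" key.
rows = [{"id": 1, "status": "ok"}, {"id": 2}, {"id": 3, "status": "ok"}]
print(len(partition_by_lenient(rows, "status")))            # 1
print(sorted(partition_by_lenient(rows, "status", "n/a")))  # ['n/a', 'ok']
print(count_by([rows[0], rows[2]], "status"))               # Counter({'ok': 2})
print(merge_partitions([{"a": [1]}, {"a": [2], "b": [3]}]))  # {'a': [1, 2], 'b': [3]}
```

If the workers are hash-partitioned by key, each key lives on exactly one worker and the merge reduces to a union of disjoint dicts; the extend-based version above also covers the case where the same key appears in several partial results.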

---

Source: DataDriven (https://datadriven.io).