# Ways of Reading

> One key, no separators, and more than one story it could tell.

Canonical URL: <https://datadriven.io/problems/ways-of-reading-key-segmentation>

Domain: Python · Difficulty: medium · Seniority: mid

## Problem

A metrics pipeline names each time series by gluing known segments together into one separator-free `key`, so a single string like `diskmemio` can be read more than one way. Given the `key` and the `vocab` of known segments, produce every sequence of segments drawn from the vocabulary whose concatenation is exactly the key.

## Worked solution and explanation

### Why this problem exists in real interviews

This is word-break-all-segmentations wearing a metrics-parsing costume. The skill being probed is enumerating every way a string factors into dictionary segments, not just finding one. Everyone reaches for the greedy move first, grab the longest prefix that is in the vocab and march forward, and it fails twice over. It only ever yields a single split, so it misses the others outright; and on a key where that longest prefix strands an unsplittable tail, it dead-ends and reports no split exists when several do. Get this wrong on `memprod` with vocab `mem`, `memp`, `prod` and you return nothing, because greedy takes `memp` and chokes on `rod`. Two families of correct answers exist: backtracking that rewinds to the last choice, and an iterative table that accumulates every way to reach each position. This dive shows the backtracking form and points at the table form in the follow-ups.

---

**Greedy longest-prefix**

At each position take the longest vocab match and continue. One string of work, no memory of alternatives. On `memprod` it grabs `memp`, hits the dead tail `rod`, and cannot rewind, so it returns nothing. It also never sees `cpu + host + prod` once it has committed to `cpuhost`.

**Backtracking**

Try every vocab prefix at each position. When a branch dead-ends the call stack rewinds to the last choice point and tries the next prefix, and every branch that reaches the end of the string is one recorded segmentation. It finds all of them, and it never falsely reports failure.

### Break down the requirements

#### Step 1: Convert vocab to a set once

Membership is checked inside the inner loop, once per candidate prefix. A list makes each `in` an O(v) scan and turns the whole thing quadratic in the vocab size for no reason; `set(vocab)` makes it O(1) average.

#### Step 2: Recurse on the remaining suffix

Carry a start index. From `start`, try each end position; if `key[start:end]` is a known segment, append it to the current path and recurse from `end`. The recursion frame IS your backtracking memory: it remembers exactly where to resume when a deeper branch fails. An iterative dp[i]-holds-all-ways table reaches the same answer by building solutions for every prefix bottom-up instead of top-down.

#### Step 3: Record a copy at the base case, then undo

When `start` reaches the end of the key, the path spells a full segmentation, so append a COPY of it. Then pop the last segment before trying the next prefix. Appending the path object itself (not a copy) is the classic bug: every stored result aliases the same list, which the pops later empty out.

---

### The solution

**Recursive enumeration with append-recurse-pop**

```python
def segment_key(key, vocab):
    words = set(vocab)
    results = []

    def build(start, path):
        if start == len(key):
            results.append(path[:])
            return
        for end in range(start + 1, len(key) + 1):
            piece = key[start:end]
            if piece in words:
                path.append(piece)
                build(end, path)
                path.pop()

    build(0, [])
    return results
```

> **Complexity**
>
> Worst case is exponential in the length of the key, because the number of segmentations can itself be exponential (think a key of all `a` with vocab `a`), and each complete path costs O(n) to copy out. Space is O(n) for the recursion depth plus the output. In practice metric keys are short (tens of characters) and a real vocab constrains the branching sharply, so it runs fine; the exponential is a property of the answer size, not wasted work. An iterative table hits the same exponential floor, since it must still materialize every path.

> **Interviewers Watch For**
>
> Three tells of seniority: converting `vocab` to a set once instead of scanning a list inside the loop; copying the path at the base case with `path[:]` rather than storing the live reference; and articulating out loud WHY greedy is wrong (it cannot enumerate, and it can dead-end). Naming that a memoized version counts or finds one split in polynomial time, while all-solutions is inherently exponential in output whether you recurse or fill a table, is a strong close.

> **Common Pitfall**
>
> `results.append(path)` instead of `results.append(path[:])`. Every result then points at the same list, and the trailing `path.pop()` calls unwind it, so you finish with a list of identical (usually empty) lists. The second trap is testing `piece in vocab` against a list, which is correct but silently O(len(vocab)) per check and blows up on a large dictionary.

---

## Common follow-up questions

- Rewrite this as a generator that yields one segmentation at a time instead of building the whole list. What changes in the recursion? _(Tests generators and `yield from` in a recursive setting: lazy production so a caller can stop early, and bounded memory when the full result set is huge.)_
- Now I only want the COUNT of segmentations, or just any one valid split. How do you make that fast? _(Tests memoized dynamic programming (word break I / count): O(n^2) with a suffix cache, since counting and existence do not need to materialize every path.)_
- vocab has ten million entries and keys can be two hundred characters. What breaks and how do you bound it? _(Tests capping the inner loop by the longest segment length, or a trie/Aho-Corasick over the vocab, to avoid probing every suffix length.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/ways-of-reading-key-segmentation)
- [Python Interview Questions](https://datadriven.io/python-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.