# The Vocabulary Test

> Can you spell out the whole sentence using only the words you know?

Canonical URL: <https://datadriven.io/problems/the_vocabulary_test>

Domain: Python · Difficulty: medium · Seniority: L4

## Problem

Given a string s and a list of words (word_dict), return True if s can be split into a sequence of one or more words from the dictionary. Words may be reused.

## Worked solution and explanation

### Why this problem exists in real interviews

This tests **dynamic programming** for string segmentation (the classic "word break" problem). It probes whether a candidate can build a DP table that tracks reachable positions in the string using a known vocabulary.

> **Trick to Solving**
>
> Build a boolean DP array where `dp[i]` is True if the first i characters of the string can be segmented into dictionary words. For each position, check all possible word endings.

---

### Break down the requirements

#### Step 1: Convert the word list to a set for O(1) lookups

Checking membership in a list is O(n) per check; a set makes it O(1).

#### Step 2: Build the DP table

`dp[0] = True` (empty prefix is always segmentable). For each position i, check if any `dp[j]` is True where `s[j:i]` is a valid word.

#### Step 3: Return dp[len(s)\]

If the entire string can be segmented, the last entry is True.

---

### The solution

**Bottom-up DP with set lookup for word matching**

```python
def word_break(s: str, words: list) -> bool:
    word_set = set(words)
    n = len(s)
    dp = [False] * (n + 1)
    dp[0] = True
    for i in range(1, n + 1):
        for j in range(i):
            if dp[j] and s[j:i] in word_set:
                dp[i] = True
                break
    result = dp[n]
    return result
```

> **Time and Space Complexity**
>
> **Time:** O(n^2 * m) where n is the string length and m is the average word length (for the substring comparison). The `break` provides early exit.
> 
> **Space:** O(n) for the DP array plus O(w) for the word set.

> **Interviewers Watch For**
>
> The `break` after setting `dp[i] = True` is an important optimization. Once you know position i is reachable, checking more splits is unnecessary.

> **Common Pitfall**
>
> Greedy matching (always taking the longest or shortest word) fails. For `s = 'catsanddog'` with words `['cats', 'cat', 'sand', 'and', 'dog']`, greedy from the left with 'cats' misses the valid segmentation 'cat sand dog'.

---

## Common follow-up questions

- How would you return all valid segmentations, not just True/False? _(Tests backtracking with memoization to enumerate all word break paths.)_
- What if the dictionary was very large (millions of words)? _(Tests using a Trie for prefix-based pruning instead of checking all substrings.)_
- What if words could overlap (same character used in two words)? _(Tests that this is not how the problem works; the segmentation is a partition.)_
- How does this relate to tokenization in NLP? _(Tests awareness that subword tokenizers like BPE use similar segmentation logic.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/the_vocabulary_test)
- [Python Interview Questions](https://datadriven.io/python-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.