# The Word Map

> Input text. Output: word frequency.

Canonical URL: <https://datadriven.io/problems/the_word_map>

Domain: Python · Difficulty: easy · Seniority: L4

## Problem

Given a string of text, return a dict mapping each distinct word to its count. Lowercase the text, split on whitespace, and strip surrounding punctuation from each token before counting.

## Worked solution and explanation

### Why this problem exists in real interviews

Word frequency is the warmup that lets the interviewer see whether you reach for `Counter`, whether you normalize case before counting, and whether you handle punctuation correctly. It is the canonical setup for follow ups about top-k words, streaming counts, and tokenization choices.

---

### Break down the requirements

#### Step 1: Normalize case before counting

Lowercase the entire input first so `'The'` and `'the'` collapse to one bucket. Doing this once on the full string is cheaper than lowercasing each token in a loop and removes a class of off by one bugs.

#### Step 2: Tokenize on letters, not whitespace

Use `re.findall(r'[a-zA-Z]+', text.lower())` so punctuation never reaches the counter. A whitespace split followed by `strip(string.punctuation)` works but leaves contractions like `don't` as a single token, which interviewers will probe.

#### Step 3: Aggregate with `Counter` and return a plain dict

`Counter` gives you O(n) counting in one pass and exposes `most_common` for the inevitable follow up. Cast to `dict` at the boundary because the spec asks for a dict, not a `Counter` subclass.

---

### The solution

**Regex tokenize, then Counter**

```python
import re
from collections import Counter

def word_frequency(text):
    words = re.findall(r'[a-zA-Z]+', text.lower())
    return dict(Counter(words))
```

> **Cost Analysis**
>
> Time is O(n) for the single regex scan plus O(n) for the counter increments where n is the character length of the input. Space is O(k) for the output dict where k is the number of distinct words.

> **Interviewers Watch For**
>
> Whether you state the tokenization rule out loud (letters only? include digits?), whether you normalize before counting, and whether you know `Counter` exists. Reaching for a manual `defaultdict(int)` is fine but you should explain why.

> **Common Pitfall**
>
> Splitting on whitespace alone leaves trailing punctuation attached, so `'cat,'` and `'cat'` count as different words. Stripping punctuation per token also fails on apostrophes inside words unless you decide what to do with `don't`.

---

## Common follow-up questions

- How would you return only the top 5 most frequent words? _(Show `Counter(words).most_common(5)`. Interviewers want to know you do not need to sort the full dict.)_
- How would you stream counts across a 50 GB file? _(Iterate line by line, feed each tokenized batch into a single persistent `Counter`. Discuss memory bounds: the counter scales with vocabulary, not corpus size.)_
- How would you treat `don't` and `dont` as the same word? _(Strip apostrophes before tokenizing, or extend the regex to `[a-zA-Z']+` then post process. Surface the tradeoff with possessives like `Alice's`.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/the_word_map)
- [Python Interview Questions](https://datadriven.io/python-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.