# The Word Census

> Who said what - and how many times?

Canonical URL: <https://datadriven.io/problems/the_word_census>

Domain: Python · Difficulty: easy · Seniority: L3

## Problem

Given a string of whitespace-separated words, return a dict mapping each distinct lowercased word to its count. The test harness accepts any key order. The expected output is shown sorted by count, but dict iteration does not guarantee that ordering, so return a regular dict and do not rely on key order.

## Worked solution and explanation

### Why this problem exists in real interviews

This tests **word frequency counting** with **case normalization**. Building a frequency dict from text is fundamental to text processing pipelines, search indexing, and analytics.

---

### Break down the requirements

#### Step 1: Split the string on whitespace and lowercase each word

Treat `'The'` and `'the'` as the same word.
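
As a small illustration of this step (the sample text is made up for the example), calling `str.split()` with no arguments splits on any run of whitespace and discards leading and trailing whitespace, so no empty strings reach the counting loop:

```python
# Sample input with mixed case, a double space, a tab, a newline,
# and a trailing space.
text = "The  cat\tsaw THE\nother cat "

# Lowercase first, then split on any run of whitespace.
words = text.lower().split()
print(words)  # ['the', 'cat', 'saw', 'the', 'other', 'cat']

# By contrast, split(" ") keeps an empty string between adjacent spaces.
print("a  b".split(" "))  # ['a', '', 'b']
```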

#### Step 2: Count occurrences in a dict

Accumulate each word's count using dict operations.
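
One alternative way to do the accumulation, sketched here under the assumption that standard-library helpers are allowed in the interview, is `collections.Counter`. The function name `word_frequencies_counter` is hypothetical and used only for this example:

```python
from collections import Counter


def word_frequencies_counter(text: str) -> dict:
    # Counter is a dict subclass that tallies hashable items;
    # dict(...) converts the result back to a regular dict.
    return dict(Counter(text.lower().split()))


print(word_frequencies_counter("the cat The dog"))
# {'the': 2, 'cat': 1, 'dog': 1}
```

Some interviewers prefer to see the manual `dict.get` loop first, so it is worth being able to write both.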

---

### The solution

**Case-normalized frequency accumulation**

```python
def word_frequencies(text: str) -> dict:
    words = text.lower().split()
    freq = {}
    for word in words:
        freq[word] = freq.get(word, 0) + 1
    return freq
```

> **Time and Space Complexity**
>
> **Time:** O(n), where n is the total number of characters. Lowercasing and splitting are each O(n), and each dict update is O(1) on average.
> 
> **Space:** O(w) where w is the number of unique words.

> **Interviewers Watch For**
>
> Lowercasing the text before splitting (or, equivalently, lowercasing each word after splitting) so that counts are consistent. Skipping this step gives separate counts for `'The'` and `'the'`.

> **Common Pitfall**
>
> The expected output is shown sorted by count descending, even though the harness accepts any key order. Regular dicts in Python 3.7+ preserve insertion order but do not sort themselves, so if you want that ordering you must sort the items before inserting them.
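
If count-descending order is wanted, a minimal sketch is to count first and then rebuild the dict from sorted items. The name `word_frequencies_sorted` is hypothetical, and the tie-breaking rule (first-seen order) is an assumption, since the problem does not specify one:

```python
def word_frequencies_sorted(text: str) -> dict:
    freq = {}
    for word in text.lower().split():
        freq[word] = freq.get(word, 0) + 1
    # sorted() is stable, even with reverse=True, so words with equal
    # counts keep their first-seen order (an assumed tie-break).
    ordered = sorted(freq.items(), key=lambda item: item[1], reverse=True)
    # Inserting in sorted order fixes the dict's iteration order.
    return dict(ordered)


print(word_frequencies_sorted("a b b c c c"))
# {'c': 3, 'b': 2, 'a': 1}
```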

---

## Common follow-up questions

- How would you handle punctuation attached to words? _(Tests stripping punctuation with `str.strip` or regex before counting.)_
- How would you return only words above a frequency threshold? _(Tests adding a filter after counting.)_
- How would you process a file too large for memory? _(Tests streaming line-by-line and accumulating counts incrementally.)_
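
The three follow-ups above can be sketched together. This is one possible approach, not a reference answer: the helper names `normalize`, `count_words`, and `above_threshold` are hypothetical, only ASCII punctuation from `string.punctuation` is stripped, and "above" is read as strictly greater than the threshold.

```python
import string
from typing import Iterable


def normalize(word: str) -> str:
    # Lowercase and strip leading/trailing ASCII punctuation only;
    # internal characters such as the apostrophe in "don't" are kept.
    return word.lower().strip(string.punctuation)


def count_words(lines: Iterable[str]) -> dict:
    # Accepts any iterable of lines, including an open file object,
    # so the input is read one line at a time.
    freq = {}
    for line in lines:
        for raw in line.split():
            word = normalize(raw)
            if word:  # skip tokens that were pure punctuation, e.g. "--"
                freq[word] = freq.get(word, 0) + 1
    return freq


def above_threshold(freq: dict, threshold: int) -> dict:
    # Keep only words whose count is strictly greater than the threshold.
    return {word: count for word, count in freq.items() if count > threshold}


lines = ["The cat, the dog.", "Don't stop -- the cat!"]
counts = count_words(lines)
print(counts)  # {'the': 3, 'cat': 2, 'dog': 1, "don't": 1, 'stop': 1}
print(above_threshold(counts, 1))  # {'the': 3, 'cat': 2}
```

Passing an open file handle to `count_words` avoids loading the whole file, although the counts dict itself still grows with the number of distinct words.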

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/the_word_census)
- [Python Interview Questions](https://datadriven.io/python-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.