# Word Counter

> Words in, counts out.

Canonical URL: <https://datadriven.io/problems/word_counter>

Domain: Python · Difficulty: easy · Seniority: L3

## Problem

Given a string of words separated by whitespace, return a dict mapping each distinct word to the number of times it appears.

## Worked solution and explanation

### Why this problem exists in real interviews

Tokenizing a string and counting the tokens is the canonical Counter problem, and the same pattern recurs across data pipelines: event tag frequency, log severity counts, word clouds. Interviewers grade three things: whether you reach for collections.Counter (the right tool), whether you call str.split() with no argument (which handles runs of whitespace and leading or trailing whitespace automatically), and whether you return a plain dict when the spec asks for one.

---

### Break down the requirements

#### Step 1: Tokenize with str.split() and no argument

text.split() (with no arg) splits on any whitespace run and drops empty strings, so '  a   b  '.split() returns ['a', 'b']. Passing ' ' (a single space) splits on every individual space, so consecutive, leading, or trailing spaces produce empty tokens, and tabs and newlines are not treated as separators at all. That makes it a subtle bug source. Default split is what the spec means by 'whitespace-separated.'
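
As a quick illustration of the default behavior, here is a minimal sketch on a made-up input that mixes spaces, a tab, and a newline. The sample string is an assumption chosen for the demo, not part of the problem's test data.

```python
# Hypothetical messy input: leading spaces, a tab, a newline, trailing spaces.
text = "  click\tview \n click  "

# No argument: split on runs of any whitespace and drop empty strings.
tokens = text.split()
print(tokens)  # ['click', 'view', 'click']

# Whitespace-only input produces no tokens at all.
print("   \n".split())  # []
```
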

#### Step 2: Count with collections.Counter

Counter(text.split()) builds the tally in one expression, O(n) in token count. In CPython the counting loop runs in a C helper, so it is typically faster than a hand-rolled dict with .get(k, 0) + 1. Counter is a dict subclass, so it compares equal to a plain dict with the same contents.
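
To make the comparison concrete, the sketch below puts a hand-rolled tally next to Counter and checks that they agree. The helper name count_by_hand and the sample tokens are illustrative assumptions.

```python
from collections import Counter

def count_by_hand(tokens):
    """Hand-rolled tally using dict.get, shown for comparison only."""
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    return counts

tokens = "click view click purchase view click".split()

by_hand = count_by_hand(tokens)
by_counter = Counter(tokens)

print(by_counter)  # Counter({'click': 3, 'view': 2, 'purchase': 1})
print(by_counter == by_hand)  # True
print(by_counter == {"click": 3, "view": 2, "purchase": 1})  # True
```

Both approaches do one pass over the tokens; Counter mainly saves you the loop and the default-value bookkeeping.
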

#### Step 3: Return a plain dict

The spec says 'return a dict,' and most tests use == against a dict literal. Counter compares equal and also passes isinstance(result, dict), but dict(counter) is the safest return type: it avoids surprising a caller that checks type(result) is dict.
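
The difference between the three checks can be sketched as follows. The variable names are illustrative, and the missing-key lookup at the end is one extra behavioral difference worth knowing about, not something the spec itself tests.

```python
from collections import Counter

counter = Counter("a b a".split())
plain = dict(counter)

print(counter == {"a": 2, "b": 1})  # True: equality only looks at contents
print(isinstance(counter, dict))    # True: Counter is a dict subclass
print(type(counter) is dict)        # False: a strict type check fails
print(type(plain) is dict)          # True: after the dict() conversion

# A Counter returns 0 for a missing key, where a plain dict raises KeyError.
print(counter["missing"])  # 0
```
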

---

### The solution

**Counter over split tokens**

```python
from collections import Counter

def word_counts(text: str) -> dict:
    return dict(Counter(text.split()))
```

> **Cost Analysis**
>
> Time is O(n) where n is the length of text: split scans the string once, Counter scans the tokens once. Space is O(n) for the intermediate token list plus O(u) for the u unique tokens in the result. In CPython, Counter's counting loop runs in C, so this is typically at least as fast as a Python-level loop. A hand-rolled dict with .get() is usually somewhat slower, though the exact gap depends on the input and the interpreter version.
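
If you want to check the speed claim on your own machine, a rough timing sketch along these lines works. The function names, the sample input, and the repeat count are all assumptions made for the demo, and the printed timings will vary by machine and Python version.

```python
import timeit
from collections import Counter

def with_counter(text):
    return dict(Counter(text.split()))

def with_get(text):
    counts = {}
    for word in text.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

# Hypothetical sample input: 20,000 tokens drawn from 3 distinct words.
sample = "click view click purchase " * 5_000

# Both versions must agree before their timings are worth comparing.
assert with_counter(sample) == with_get(sample)

for fn in (with_counter, with_get):
    seconds = timeit.timeit(lambda: fn(sample), number=20)
    print(f"{fn.__name__}: {seconds:.4f}s for 20 runs")
```
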

> **Interviewers Watch For**
>
> Whether you reach for Counter instead of a hand-rolled dict, whether you use split() without an argument, and whether you handle empty strings cleanly (''.split() returns [], Counter([]) returns an empty Counter, and dict() of that returns {}). Strong candidates mention that Counter has .most_common() if the follow-up asks for the top k.
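
The empty-input chain can be traced step by step. This sketch simply walks the three calls in order; the variable names are illustrative.

```python
from collections import Counter

tokens = "".split()       # no tokens in an empty string
tally = Counter(tokens)   # empty Counter
result = dict(tally)      # empty plain dict

print(tokens)  # []
print(tally)   # Counter()
print(result)  # {}
```

No special-casing is needed: each step accepts the empty output of the previous one.
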

> **Common Pitfall**
>
> Calling text.split(' ') with an explicit single space. On input '  click view  ' this yields ['', '', 'click', 'view', '', ''], and the Counter picks up an empty-string key with a count of 4. Default split() collapses runs and trims. Another miss: returning the Counter directly and failing a strict test because the caller compared type() instead of using ==.
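
A short sketch of the pitfall, using the same input as the callout above. The names buggy and fixed are illustrative.

```python
from collections import Counter

text = "  click view  "

# Explicit single-space separator: every extra space yields an empty token.
buggy = Counter(text.split(" "))
print(buggy[""])  # 4

# Default split(): no empty-string key at all.
fixed = Counter(text.split())
print("" in fixed)  # False
print(dict(fixed))  # {'click': 1, 'view': 1}
```
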

---

## Common follow-up questions

- What changes if words should be counted case-insensitively? _(Counter(text.lower().split()); note that .lower() copies the string, costing O(n) extra memory.)_
- How would you return the top 3 most common words? _(Counter(...).most_common(3) returns a list of (word, count) tuples sorted by descending count. Discuss tie-break behavior: words with equal counts keep the order in which they were first encountered.)_
- How would you stream this over a 10 GB file? _(Iterate line by line, split per line, and call Counter.update(tokens) in place; memory then scales with the number of distinct words rather than the file size.)_
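
The three follow-ups above can be sketched together. This is a minimal illustration, not a reference answer: count_stream is a hypothetical helper name, and io.StringIO stands in for a real file handle so the example runs on its own.

```python
import io
from collections import Counter

# 1. Case-insensitive counting: normalize before splitting.
text = "Click click VIEW view click"
print(dict(Counter(text.lower().split())))  # {'click': 3, 'view': 2}

# 2. Top 3, with a three-way tie resolved by first-encountered order.
counter = Counter("b a c b a c d".split())
print(counter.most_common(3))  # [('b', 2), ('a', 2), ('c', 2)]

# 3. Streaming: update one Counter per line instead of loading everything.
def count_stream(lines):
    """Count words from any iterable of lines (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return dict(counts)

fake_file = io.StringIO("click view\nclick\n\npurchase view click\n")
print(count_stream(fake_file))  # {'click': 3, 'view': 2, 'purchase': 1}
```

With a real file you would pass the open file object in place of fake_file. Because a newline is itself whitespace, splitting per line gives the same tokens as splitting the whole text at once.
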

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/word_counter)
- [Python Interview Questions](https://datadriven.io/python-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.