# The Top Words

> In every document, some words dominate the conversation.

Canonical URL: <https://datadriven.io/problems/the_top_words>

Domain: Python · Difficulty: medium · Seniority: L4

## Problem

Given a text string and integer k, split on whitespace, count occurrences, and return the k words with the highest counts. Sort by count descending, tie-break alphabetically ascending. If fewer than k distinct words exist, return all of them.

## Worked solution and explanation

### Why this problem exists in real interviews

This tests **frequency counting** combined with **custom sorting** (by count descending, then alphabetically for ties). It mirrors the common data pipeline pattern of ranking items by a metric with tiebreaker rules.

---

### Break down the requirements

#### Step 1: Split text and count word frequencies

Split on whitespace and accumulate counts in a dict.

#### Step 2: Sort by frequency descending, then alphabetically for ties

Use a composite sort key: negate the count for descending order, use the word itself for alphabetical tiebreaking.

#### Step 3: Return the top k words

Slice the sorted list to the first k entries.

---

### The solution

**Frequency sort with alphabetical tiebreaker**

```python
def top_k_words(text: str, k: int) -> list:
    words = text.split()
    freq = {}
    for word in words:
        freq[word] = freq.get(word, 0) + 1
    sorted_words = sorted(freq.keys(), key=lambda w: (-freq[w], w))
    result = []
    for i in range(min(k, len(sorted_words))):
        result.append(sorted_words[i])
    return result
```

> **Time and Space Complexity**
>
> **Time:** O(n + u log u) where n is the number of words and u is the number of unique words. Counting is O(n), sorting is O(u log u).
> 
> **Space:** O(u) for the frequency dict.

> **Interviewers Watch For**
>
> The negation trick `(-freq[w], w)` gives descending-by-count, ascending-by-word in a single sort call. This is idiomatic Python for multi-key sorting.

> **Common Pitfall**
>
> Forgetting the alphabetical tiebreaker. Without it, tied words appear in arbitrary order, making the output non-deterministic.

---

## Common follow-up questions

- How would you handle this for a very large text (gigabytes)? _(Tests streaming frequency counting with bounded memory.)_
- What if k is much smaller than the vocabulary size? _(Tests using `heapq.nlargest` for O(u log k) instead of full sort.)_
- What if the text contained punctuation? _(Tests stripping punctuation before counting.)_
- How would you make this case-insensitive? _(Tests lowercasing words before insertion into the frequency dict.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/the_top_words)
- [Python Interview Questions](https://datadriven.io/python-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.