# The Word Families

> Different spellings, same letters - they belong together.

Canonical URL: <https://datadriven.io/problems/the_word_families>

Domain: Python · Difficulty: medium · Seniority: L4

## Problem

Group a list of strings such that any two strings that are anagrams of each other are in the same group. Return a list of groups where each inner group is sorted alphabetically and the outer list is sorted by each group's first word.

## Worked solution and explanation

### Why this problem exists in real interviews

This tests **grouping by a canonical key**, specifically sorting characters to identify anagrams. It probes dict-based grouping where the key is a derived property of the input.

> **Trick to Solving**
>
> Two words are anagrams if and only if their sorted character sequences are identical. Use the sorted string as a dict key to group words into families.

---

### Break down the requirements

#### Step 1: Compute a canonical key for each word

Sort the characters of each word to produce a key. All anagrams share the same sorted key.

#### Step 2: Group words by their canonical key

Use a dict mapping sorted keys to lists of original words.

#### Step 3: Sort each group and the overall result

Sort words within each group alphabetically, then sort groups for deterministic output.

---

### The solution

**Sorted-character key grouping**

```python
def group_anagrams(words: list) -> list:
    groups = {}
    for word in words:
        key = "".join(sorted(word))
        if key not in groups:
            groups[key] = []
        groups[key].append(word)
    result = []
    for key in groups:
        groups[key].sort()
        result.append(groups[key])
    result.sort()
    return result
```

> **Time and Space Complexity**
>
> **Time:** O(n * m log m) where n is the number of words and m is the maximum word length. Sorting each word is O(m log m).
> 
> **Space:** O(n * m) for storing all words in the groups dict.

> **Interviewers Watch For**
>
> Using a character frequency tuple as the key instead of sorting is O(m) per word instead of O(m log m). For lowercase English letters, count the 26 frequencies and use the tuple as a hashable key.

> **Common Pitfall**
>
> Comparing every pair of words for the anagram relationship is O(n^2 * m), vastly slower than the grouping approach.

---

## Common follow-up questions

- What if the comparison needed to be case-insensitive? _(Tests lowercasing words before computing the key.)_
- How would you optimize for very long words? _(Tests using a frequency-count tuple key instead of sorting.)_
- How would you handle this in a distributed system with millions of words? _(Tests MapReduce: map emits (sorted_key, word), reduce collects groups.)_
- What if words could contain unicode characters? _(Tests that sorted() and dict keys work with unicode, but normalization may be needed.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/the_word_families)
- [Python Interview Questions](https://datadriven.io/python-interview-questions)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.