# The Word Count Shuffle Trap

> groupByKey works. Your cluster disagrees.

Canonical URL: <https://datadriven.io/problems/spark_reducebykey_vs_groupbykey_word_count>

Domain: PySpark · Difficulty: easy · Seniority: L5

## Problem

Your team's text analytics pipeline runs a nightly word count job over a 50 GB corpus. It ran fine for months, but after the corpus grew 3x last quarter, the job started failing. The Spark UI shows 48 GB of shuffle write and three executors killed by out-of-memory (OOM) errors. The code uses `groupByKey`. Fix it.
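
As a starting point, here is a minimal sketch of one likely fix. It assumes the job builds `(word, 1)` pairs from an RDD of text lines; the real pipeline code is not shown in the problem. The names `word_count_group`, `word_count_reduce`, and `shuffle_records` are hypothetical. The first two functions use the standard RDD calls `flatMap`, `map`, `groupByKey`, `mapValues`, and `reduceByKey`. The last is a pure-Python model of how many records each approach would send through the shuffle, so the difference is visible without a cluster.

```python
from collections import Counter
from operator import add


def word_count_group(lines):
    """Failing pattern (assumed): every (word, 1) pair crosses the shuffle."""
    return (
        lines.flatMap(lambda line: line.split())
        .map(lambda word: (word, 1))
        .groupByKey()      # no map-side combine: all pairs are shuffled
        .mapValues(sum)    # values for a key are gathered on one executor first
    )


def word_count_reduce(lines):
    """Candidate fix: sum within each partition before shuffling."""
    return (
        lines.flatMap(lambda line: line.split())
        .map(lambda word: (word, 1))
        .reduceByKey(add)  # partial sums are combined map-side
    )


def shuffle_records(partitions, combine):
    """Toy model, not Spark: count records each partition emits to the shuffle.

    Without combining, a partition emits one record per word occurrence.
    With combining, it emits one record per distinct word it holds.
    """
    total = 0
    for part in partitions:
        words = [w for line in part for w in line.split()]
        total += len(Counter(words)) if combine else len(words)
    return total


# Hypothetical two-partition corpus, used only to illustrate the model.
partitions = [
    ["the cat sat on the mat", "the dog sat"],
    ["the cat ran", "the the the"],
]

print("records shuffled without combine:", shuffle_records(partitions, combine=False))
print("records shuffled with combine:   ", shuffle_records(partitions, combine=True))
```

Both Spark functions should return the same counts. The difference is where the summing happens: with `reduceByKey`, shuffle volume scales with the number of distinct words per partition rather than with total word occurrences, which is why it tends to relieve both the shuffle write and the executor memory pressure described above.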

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.