# The Cache That Ate the Cluster

> You cached 200 GB and forgot to let go.

Canonical URL: <https://datadriven.io/problems/spark_cache_unpersist_iterative_ml>

Domain: PySpark · Difficulty: medium · Seniority: L5

## Problem

An iterative ML feature-engineering pipeline reads a 200 GB base DataFrame and runs 8 sequential enrichment steps, each of which joins against a different dimension table and adds columns. A previous engineer cached the base DataFrame to speed up the repeated reads, but after step 4 executors start dying with out-of-memory (OOM) errors: the cache occupies so much executor memory that later steps have no room for shuffle data. Fix the caching strategy so the pipeline completes without OOM errors.
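
One possible direction, sketched under assumptions rather than offered as the reference solution: keep at most one persisted DataFrame alive between steps, and release the previous one (including the originally cached base) as soon as its successor has been materialized. The helper name `run_enrichment_steps`, the shape of `steps` (callables that take a DataFrame and return the joined result), and the `storage_level` parameter are all hypothetical. The only DataFrame methods used are `persist`, `count`, and `unpersist`, which PySpark DataFrames provide.

```python
from typing import Any, Callable, Iterable, Optional


def run_enrichment_steps(
    base_df: Any,
    steps: Iterable[Callable[[Any], Any]],
    storage_level: Optional[Any] = None,
) -> Any:
    """Apply each enrichment step in order with a rolling persist/unpersist handoff.

    Hypothetical helper: `base_df` is assumed to be a (possibly cached) PySpark
    DataFrame and each step a function such as
    `lambda df: df.join(dim, "key", "left")`.
    The returned DataFrame is still persisted; the caller should unpersist it
    once the final output has been written.
    """
    current = base_df
    for step in steps:
        enriched = step(current)
        # Persist the new result. With no level given, fall back to Spark's
        # default; a caller could pass pyspark's StorageLevel.DISK_ONLY to keep
        # cached blocks out of executor memory.
        if storage_level is not None:
            enriched = enriched.persist(storage_level)
        else:
            enriched = enriched.persist()
        # Run an action so the new result is materialized before its
        # predecessor is released.
        enriched.count()
        # Release the predecessor. On the first iteration this is the base
        # DataFrame, which lets go of the original cache.
        current.unpersist()
        current = enriched
    return current
```

With real DataFrames, a call might look like `run_enrichment_steps(base_df, steps, StorageLevel.DISK_ONLY)` after `from pyspark import StorageLevel`. The trade-off is that materializing every step costs an extra pass over the data, so whether this beats simply dropping the base cache after its last use depends on how often each intermediate result is re-read.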

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.