# Push It Down

> You renamed the column. Catalyst forgot how to prune.

Canonical URL: <https://datadriven.io/problems/spark_catalyst_predicate_pushdown>

Domain: PySpark · Difficulty: medium · Seniority: L5

## Problem

A daily analytics job reads a 3 TB `user_events` Parquet table partitioned by `event_date`, filters it to yesterday's data (about 10 GB), and joins the result against `user_profiles`. The job takes 40 minutes but should take 5 minutes. A colleague wrote the pipeline using a subquery pattern that defeats partition pruning, so the physical plan shows a full scan of all 3 TB. Rewrite the query so that Catalyst pushes the date filter down to the file scan.
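The colleague's subquery is not shown here, so the following is only a minimal sketch of one possible direction, not the reference solution. It assumes the date can be resolved on the driver and compared as a plain literal directly against `event_date` before the join. The function names, the path parameters, and the `user_id` join key are hypothetical. The plan check also assumes the scan node prints a `PartitionFilters: [...]` entry, as Parquet file scans typically do in Spark's physical plan output.

```python
import re
from datetime import date, timedelta


def yesterday_iso(today: date) -> str:
    """Resolve the target partition on the driver so the filter is a plain literal."""
    return (today - timedelta(days=1)).isoformat()


def has_partition_filter(plan_text: str, column: str = "event_date") -> bool:
    """Rough check: does any PartitionFilters entry in the plan text mention the column?

    Assumes the plan prints `PartitionFilters: [...]` followed by a comma or end of text.
    """
    bodies = re.findall(r"PartitionFilters: \[(.*?)\](?:,|$)", plan_text)
    return any(column in body for body in bodies)


def yesterday_events_with_profiles(spark, events_path, profiles_path, today):
    """Filter directly on the partition column with a literal, then join.

    `user_id` as the join key and the inner join are assumptions; the problem
    statement does not name them.
    """
    from pyspark.sql import functions as F  # lazy import; requires pyspark

    events = spark.read.parquet(events_path).where(
        F.col("event_date") == F.lit(yesterday_iso(today))
    )
    profiles = spark.read.parquet(profiles_path)
    return events.join(profiles, on="user_id", how="inner")
```

To verify the rewrite, call `explain()` on the resulting DataFrame and look at the `user_events` scan node: an empty `PartitionFilters: []` suggests every partition is still being read, while an entry that references `event_date` suggests the filter reached the file scan. `has_partition_filter` automates that check for a plan captured as text.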


---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.