Push It Down
A medium-difficulty Spark interview practice problem on DataDriven. Write and execute real Spark code with instant grading.
- Domain: Spark
- Difficulty: medium
- Seniority: L5
Problem
A daily analytics job reads a 3 TB user_events Parquet table partitioned by event_date, filters to yesterday (about 10 GB), and joins against user_profiles. The job takes 40 minutes but should take 5. A colleague wrote the pipeline using a subquery pattern that defeats partition pruning. The physical plan shows a full table scan of all 3 TB. Rewrite the query so Catalyst pushes the date filter down to the file scan.
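One possible shape for the rewrite is sketched below in PySpark. It is not the reference solution. It assumes things the problem text does not state: both tables are read from Parquet paths, the join key is named user_id, and event_date partition values are formatted as yyyy-MM-dd. The idea is to compare the raw partition column to a literal computed on the driver, and to do so directly on the scan, before any projection, rename, or join.

```python
from datetime import date, timedelta


def yesterday_literal(today: date) -> str:
    """Return the day before `today` as an ISO yyyy-MM-dd string."""
    return (today - timedelta(days=1)).isoformat()


def build_daily_join(spark, events_path: str, profiles_path: str, run_date: date):
    """Build the daily join with the date filter applied directly to the scan.

    `spark` is an existing SparkSession. The paths and the `user_id` join key
    are hypothetical names chosen for illustration.
    """
    target = yesterday_literal(run_date)

    events = (
        spark.read.parquet(events_path)
        # Plain equality between the untouched partition column and a
        # literal: the form a planner can most readily turn into a
        # partition filter on the file scan.
        .where(f"event_date = '{target}'")
    )
    profiles = spark.read.parquet(profiles_path)

    # Join only after the events side has been narrowed to one partition.
    return events.join(profiles, on="user_id", how="inner")


# Usage sketch (needs a Spark installation; paths are placeholders):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("push-it-down").getOrCreate()
#   df = build_daily_join(spark, "/data/user_events", "/data/user_profiles", date.today())
#   df.explain(True)
```

To check the result, inspect the physical plan printed by explain: the file scan node for user_events should list a partition filter on event_date rather than reading every partition. If it does not, the filter is probably still being applied to a derived or wrapped form of the column.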
Summary
You renamed the column. Catalyst forgot how to prune.