# Three Hours for Yesterday's Numbers

> 18 terabytes scanned. 50 megabytes needed.

Canonical URL: <https://datadriven.io/problems/full_scan_pivot_fix>

Domain: PySpark · Difficulty: medium · Seniority: L5

## Problem

The nightly `daily_store_sales` Spark job is breaching SLA. It reads a source table of store-level daily sales (one row per store per product per day) and pivots it into one row per store per day with product-level metrics as columns. The job takes 3 hours against a 45-minute SLA because it reads the entire source table every night. Diagnose and fix it.

## Worked solution and explanation

### Why this problem exists in real interviews

Retail companies run nightly batch jobs that transform transactional data into analytics-ready formats. A common mistake is scanning the entire history when you only need the latest increment. This question tests partition pruning, predicate pushdown, and incremental processing, which are the most impactful Spark optimizations in practice.

---

### The two optimizations

#### Step 1: Filter on the partition column

The source is partitioned by `sale_date`. Adding `.filter(F.col('sale_date') == yesterday)` tells Spark to read only one partition directory (50 MB) instead of all 1,095 (18 TB). The physical plan will show `PartitionFilters: [sale_date = ...]`.

#### Step 2: Pass known values to pivot

`.pivot('product_category', categories)` avoids an extra pass to discover distinct values. Without the list, Spark runs a preliminary aggregation to collect all distinct categories, then pivots. With a fixed list, it skips that step.

---

### The solution

```python

```

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/full_scan_pivot_fix)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.