# The Box That Won't Fit the Data

Canonical URL: <https://datadriven.io/problems/five-gigs-to-move-a-mountain>

Domain: Pipeline Design · Difficulty: hard · Seniority: senior

## Problem

Your nightly Spark job rolls a 100GB event export up to per-account daily totals, but the only box it runs on has 5GB of RAM and no cluster to fall back on. Land those totals durably in the local data lake without the job dying when the 100GB refuses to fit in memory.

## Worked solution and explanation

### Why this problem exists in real interviews

This is an out-of-core processing problem wearing a Spark-trivia costume. The interviewer is not testing whether you know Spark runs on one machine; they are testing whether you understand that 100GB through 5GB of RAM is fine as long as no single operation ever needs to hold the whole thing at once. The trap is reaching for a bigger box or a cluster, because the constraint forbids both. The real answer is: process in partitions, aggregate as you go, and let disk and the local lake hold what RAM cannot.

The default sketch is read the file, group by account, write the totals. It looks correct and it OOMs on the first run. The group-by triggers a wide shuffle that tries to buffer state for 2 million accounts across 100GB of rows, and the moment someone adds an orderBy to make the output tidy, Spark attempts a global sort that materializes far more than 5GB. The job dies at hour two with a heap error, and the candidate blames the box instead of the plan.

> **Trick to Solving**
>
> 100GB does not need 100GB of RAM. It needs one partition's worth of RAM at a time, plus disk for the spill and a durable lake for the partials.
> 
> 1. Read the export in bounded partitions so any one task's working set is a few hundred MB, not the whole file.
> 2. Aggregate partially per partition (a hash aggregate keyed by account plus date), so the in-memory state is bounded by distinct accounts in that slice, not by row count.
> 3. Stage the small partials to the local object store and merge them into exact totals, letting Spark spill the shuffle to local disk instead of holding it in the heap.

---

### Walk the requirements

#### Step 1: Read in partitions, never all at once

Point the read at the 100GB export with a partition size small enough that one task's input plus its aggregate state fits comfortably inside 5GB, with headroom for JVM overhead. Process partitions sequentially or with low parallelism. A first pass that rewrites the line-delimited JSON to Parquet in the lake is optional but pays off: projection pushdown means later passes touch only account_id, date, and amount, slashing bytes scanned per pass.

#### Step 2: Aggregate partially before you shuffle

The key move is map-side partial aggregation. Each partition emits a partial per-account daily total, so the data leaving a partition is bounded by the number of distinct accounts in that slice, not by its raw row count. This shrinks 100GB of rows to a handful of partials per account before anything crosses a partition boundary. A naive group-by that skips partial aggregation ships every raw row into the shuffle and that is what blows the heap.

#### Step 3: Let disk and the lake hold what RAM can't

Configure the job so the shuffle spills to local disk via spark.local.dir rather than buffering in memory, and stage each partition's partial output durably to the local object store. The final per-account table is small (2 million accounts fits in memory), so the closing merge of partials is cheap. The durable object-store and lakehouse tier is doing the heavy lifting that a bigger cluster would otherwise do, and it is why the job survives on 5GB.

#### Step 4: Make it resumable

Stage completed partition outputs in the lake so a reboot or a kill at hour three resumes from the partitions already done rather than restarting 100GB. On a single box with a 6-hour window and no redundancy, checkpoint-style staging is the difference between a one-hour recovery and a full re-run that misses the SLA.

---

### The shape that fits

**Out-of-core roll-up on one 5GB box**

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Single box, 5GB RAM: low parallelism, force disk spill, never collect.
# /mnt/lake is the MinIO-backed local data lake; /mnt/spill is scratch disk.
spark = (
    SparkSession.builder
    .master("local[2]")                                 # not local[*]: bound concurrent tasks
    .config("spark.driver.memory", "4g")                # leave headroom under 5GB
    .config("spark.sql.files.maxPartitionBytes", str(256 * 1024 * 1024))  # ~256MB read tasks
    .config("spark.sql.shuffle.partitions", "400")      # small per-task shuffle working set
    .config("spark.shuffle.spill.compress", "true")
    .config("spark.local.dir", "/mnt/spill")            # local disk absorbs shuffle spill
    .getOrCreate()
)

# Pass 1 (optional): rewrite 100GB JSONL to columnar Parquet in the lake so
# later passes read only the three columns we roll up, via projection pushdown.
(
    spark.read.json("/data/events_100gb.jsonl")
    .select("account_id", F.to_date("event_ts").alias("day"), "amount")
    .write.mode("overwrite").parquet("/mnt/lake/staging/events_parquet")
)

# Pass 2: bounded-memory roll-up. Spark does map-side partial aggregation for
# groupBy().sum() automatically, so each task emits per-account partials and the
# shuffle spills to disk. No orderBy on the full dataset, no collect().
daily_totals = (
    spark.read.parquet("/mnt/lake/staging/events_parquet")
    .groupBy("account_id", "day")
    .agg(F.sum("amount").alias("total_amount"))
)

# Land the result in the local lakehouse, partitioned by day, so a killed run
# resumes per finished day partition instead of reprocessing 100GB.
(
    daily_totals.write
    .mode("overwrite")
    .partitionBy("day")
    .parquet("/mnt/lake/account_daily_totals")
)
```

*Low parallelism caps the per-task working set, groupBy runs map-side partial aggregation, and the shuffle spills to local disk under spark.local.dir. Partials and results land in the local lakehouse. No global sort, no collect to the driver.*

> **Where the memory goes**
>
> Peak RAM is one partition's input plus its hash-aggregate state, not the dataset. With a few hundred MB per partition and aggregate state bounded by distinct accounts in that slice, the job sits well under 5GB. The cost moves to disk I/O: the export is read once (twice if you convert to Parquet first), and staged partials are written and re-read once from the local lake. On a 6-hour window that I/O is the bottleneck, and it is a tunable one.

> **Interviewers Watch For**
>
> The tell of seniority is naming the operations that OOM regardless of partition size: a global orderBy, an unbounded wide shuffle, collect() to the driver. A strong candidate says map-side partial aggregation and disk spill in the same breath, and explains that partition sizing controls the working set while disk and the object store absorb the rest. Reaching for a bigger machine is the answer that ends the screen early.

> **Common Pitfall**
>
> Adding orderBy to make the output sorted. A global sort over 100GB tries to materialize and order the whole dataset and ignores your careful partitioning, OOMing on a box that was otherwise fine. If you need sorted output, sort within partitions or sort the small final result, never the raw input.

---

## Common follow-up questions

- The export grows to 1TB on the same 5GB box. What changes in your design and what stays the same? _(Tests whether the candidate sees that the partition-and-stage design already scales by adding passes and disk, while the in-memory final merge eventually needs its own bounded handling as distinct keys grow.)_
- One account is responsible for 30% of all rows. What does that do to your partial aggregation, and how do you keep that partition from OOMing? _(Tests awareness of key skew: a single hot key concentrates aggregate state and shuffle volume; the fix is salting the hot key and a two-stage aggregate, not just smaller partitions.)_

## Related

- [All practice problems](https://datadriven.io/problems)
- [Mock interview mode](https://datadriven.io/interview/five-gigs-to-move-a-mountain)
- [System Design Interview Questions](https://datadriven.io/data-engineering-system-design)
- [Data Engineering Interview Prep Guide](https://datadriven.io/data-engineer-interview-prep)
- [Daily Challenge](https://datadriven.io/daily)

---

Source: DataDriven (https://datadriven.io). 100% free data engineering interview prep. Live code execution against Postgres 16, Python 3.11, and Spark sandboxes. No paywall, no premium tier, no signup gate.