Small Files and Compaction
A million tiny files will crush your query engine; compaction is the unsexy but essential fix
- Category: Pipeline Architecture
- Difficulty: advanced
- Duration: 25 minutes
- Challenges: 0 hands-on challenges
Topics covered: "Why Is My Query So Slow?", Why Small Files Kill Performance, Compaction Strategies, Preventing Small Files at the Source, Compaction in Modern Table Formats
Lesson Sections
- "Why Is My Query So Slow?"
Every file carries a fixed overhead: metadata, a file handle, an S3/HDFS listing entry, and Spark task scheduling. When you have 1 million 1 KB files instead of 1,000 1 MB files, you pay 1,000x the overhead for the same data, and the query engine spends more time opening files than reading data. Covers: What They're Really Testing, The Unlock, The 60-Second Framework, How Small Files Are Created.
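The 1,000x figure above can be checked with a small back-of-the-envelope model. This is a minimal sketch, not a benchmark: the 10 ms per-file overhead and the 100 MB/s read throughput are assumed placeholder values, and `scan_cost_ms` is a hypothetical helper name.

```python
def scan_cost_ms(num_files, file_bytes,
                 per_file_overhead_ms=10,             # assumed: open + list + schedule cost per file
                 throughput_bytes_per_s=100_000_000):  # assumed: 100 MB/s sequential read
    """Return (overhead_ms, read_ms) for scanning num_files files of file_bytes each."""
    overhead_ms = num_files * per_file_overhead_ms
    read_ms = num_files * file_bytes * 1000 // throughput_bytes_per_s
    return overhead_ms, read_ms


# Same 1 GB of data, laid out two ways (decimal units: 1 KB = 1,000 bytes).
small = scan_cost_ms(num_files=1_000_000, file_bytes=1_000)   # 1 million 1 KB files
large = scan_cost_ms(num_files=1_000, file_bytes=1_000_000)   # 1,000 1 MB files

print("small files: overhead", small[0], "ms, read", small[1], "ms")
print("large files: overhead", large[0], "ms, read", large[1], "ms")
```

Under these assumptions the time spent reading bytes is identical for both layouts; only the per-file overhead term changes, and it scales linearly with the file count.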
- Why Small Files Kill Performance
The performance impact of small files hits at every layer of the stack. Knowing the specific mechanics is what separates a senior answer from a generic one. Covers: The Four Performance Killers, The Numbers.
- Compaction Strategies
Compaction merges many small files into fewer large files. It runs as a background maintenance job, not as part of the main pipeline. The interview tests whether you can design a compaction job with the right target file size, scheduling, and partition awareness. Covers: Compaction Design, The Follow-Up Trap. The strong-hire detail: 'I would add a compaction monitoring metric: average file size per partition. When average file size drops below 64 MB, the compaction job triggers automati
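The monitoring metric and the merge step can be sketched in plain Python. This is an illustration under stated assumptions, not the lesson's own implementation: the 64 MB trigger comes from the text above, while the 128 MB target size, the helper names, and the greedy grouping are assumptions made for the example.

```python
MB = 1024 * 1024


def average_file_size(sizes):
    """Average file size in bytes for one partition (0.0 if the partition is empty)."""
    return sum(sizes) / len(sizes) if sizes else 0.0


def needs_compaction(sizes, trigger_bytes=64 * MB):
    """Trigger when the partition's average file size drops below the threshold."""
    # A partition with fewer than two files has nothing to merge.
    return len(sizes) > 1 and average_file_size(sizes) < trigger_bytes


def plan_compaction(sizes, target_bytes=128 * MB):  # assumed target file size
    """Greedily group files so each merged output stays at or under target_bytes."""
    groups, current, current_bytes = [], [], 0
    for size in sorted(sizes):
        if current and current_bytes + size > target_bytes:
            groups.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        groups.append(current)
    return groups


# Hypothetical partitions: one fragmented, one already healthy.
partitions = {
    "dt=2024-01-01": [1 * MB] * 300,
    "dt=2024-01-02": [128 * MB, 120 * MB],
}
for name, sizes in partitions.items():
    if needs_compaction(sizes):
        print(name, "->", len(plan_compaction(sizes)), "output files")
```

Planning per partition keeps the job partition-aware: each group contains only files from one partition, so a rewrite never mixes data across partition boundaries.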
- Preventing Small Files at the Source
The best compaction strategy is not needing compaction. Preventing small files at the source is cheaper, simpler, and more reliable than cleaning them up after the fact. Covers: Prevention Strategies, The Spark Repartition Trap.
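One common way to prevent small files at write time is to size the number of output files to the data volume rather than to the cluster's parallelism. The helper below is a minimal sketch of that arithmetic; the 128 MB target and the function name are assumptions, and the lesson's own prevention strategies may differ.

```python
MB = 1024 * 1024


def output_file_count(total_bytes, target_file_bytes=128 * MB):  # assumed target size
    """Number of output files needed so each file is at most target_file_bytes."""
    if total_bytes <= 0:
        return 1
    # Integer ceiling division avoids floating-point rounding on large byte counts.
    return max(1, (total_bytes + target_file_bytes - 1) // target_file_bytes)


# A 10 GB batch written as ~128 MB files.
n = output_file_count(10 * 1024 * MB)
print(n)
# In Spark, a value like n could be passed to df.repartition(n) before the write.
```

The estimate is only as good as the size figure fed into it: compressed columnar output is usually smaller than the in-memory or raw input size, so the count may need tuning against the files actually produced.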
- Compaction in Modern Table Formats
Iceberg, Delta Lake, and Hudi all have built-in compaction. Knowing which command to use and when manual compaction is still needed is the senior differentiator. Covers: Table Format Compaction Comparison, When Manual Compaction Is Still Needed, Vocabulary That Signals Seniority, The Bridge Move.
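As a rough illustration of the built-in commands, the sketch below builds the Spark SQL statement for two of the three formats: Iceberg's `rewrite_data_files` procedure and Delta Lake's `OPTIMIZE`. The catalog name `my_catalog`, the table name, and the helper itself are hypothetical. Hudi is left out on purpose, because how compaction is invoked there depends on the table type and version; check the Hudi documentation for the form that applies.

```python
def compaction_sql(table_format, table, catalog="my_catalog"):  # catalog name is a placeholder
    """Return the Spark SQL statement that compacts `table` for the given format."""
    fmt = table_format.lower()
    if fmt == "iceberg":
        # Iceberg exposes compaction as a stored procedure on the catalog.
        return f"CALL {catalog}.system.rewrite_data_files(table => '{table}')"
    if fmt == "delta":
        # Delta Lake compacts small files with the OPTIMIZE command.
        return f"OPTIMIZE {table}"
    raise ValueError(f"no compaction statement defined for format: {table_format}")


for fmt in ("iceberg", "delta"):
    print(compaction_sql(fmt, "analytics.events"))
# Each statement would be run with spark.sql(...) in a session configured for that format.
```

Keeping the statements behind one function makes it straightforward to schedule the same maintenance job across tables that use different formats.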