Storage Layers and Table Formats: Intermediate
Columns, partitions, compression, and pushdown turn a 10TB scan into a 100GB scan
- Category
- Pipeline Architecture
- Difficulty
- intermediate
- Duration
- 32 minutes
- Challenges
- 0 hands-on challenges
Topics covered: Columnar Versus Row Storage, Partitioning and File Pruning, Compression: Bytes Versus CPU, Predicate Pushdown, 10TB Versus 100GB: A Worked Example
Lesson Sections
- Columnar Versus Row Storage (concepts: paColumnarVsRow, paParquet, paVectorizedExecution)
The single most important fact about a storage format is whether it lays out rows or columns contiguously on disk. The choice flips which queries are fast. A row store wins when queries fetch entire rows by key. A column store wins when queries scan a few columns across many rows. Modern analytical workloads are dominated by the second pattern, which is why every cloud warehouse and every serious lake format uses columnar storage.
How the Bytes Are Arranged
Consider a table with twenty columns a…
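The row-versus-column contrast above can be sketched in a few lines. The table, column names, and sizes here are invented for illustration; the point is that the analytical query touches one contiguous array in the column layout but every record in the row layout.

```python
# Sketch: the same table in a row-oriented and a column-oriented layout.
# Column names and row count are illustrative, not from the lesson.
rows = [
    {"user_id": i, "country": "SE" if i % 2 else "US", "amount": i * 0.5}
    for i in range(1000)
]

# Row layout: each record's fields sit together on disk; scanning a
# single column still walks every record.
row_store = rows

# Column layout: each column's values sit together; scanning one
# column touches only that column's array.
col_store = {
    "user_id": [r["user_id"] for r in rows],
    "country": [r["country"] for r in rows],
    "amount":  [r["amount"]  for r in rows],
}

# Analytical query: SUM(amount). The column store reads 1 of 3 columns.
total_row = sum(r["amount"] for r in row_store)  # touches all fields
total_col = sum(col_store["amount"])             # touches one array
assert total_row == total_col
```

With twenty-plus columns the asymmetry grows: the row store still pays for every field of every record, while the column store's read cost stays proportional to the one column the query names.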
- Partitioning and File Pruning (concepts: paPartitioning, paClusteringKey, paFilePruning)
Columnar layout helps a query read fewer columns. Partitioning helps a query read fewer files. The combination is what turns an analytical scan from minutes into seconds. Partitioning splits a table into separate file paths organized by the value of one or more partition columns. A query with a filter on a partition column reads only the matching paths. Done well, partitioning is the largest single performance improvement available to a data engineer. Done badly, it produces millions of small fi…
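A minimal sketch of file pruning over hive-style partition paths, assuming a table partitioned by an event date column `dt` (the path scheme and file names here are hypothetical):

```python
from datetime import date, timedelta

# Hypothetical file listing for a table partitioned by event date.
start = date(2024, 1, 1)
files = [
    f"events/dt={start + timedelta(days=d)}/part-0.parquet"
    for d in range(90)
]

def prune(paths, lo, hi):
    """Keep only paths whose dt= partition value falls in [lo, hi].
    ISO dates compare correctly as strings, so no parsing is needed."""
    kept = []
    for p in paths:
        value = p.split("dt=")[1].split("/")[0]
        if str(lo) <= value <= str(hi):
            kept.append(p)
    return kept

survivors = prune(files, date(2024, 3, 1), date(2024, 3, 7))
print(len(files), "->", len(survivors))  # 90 -> 7
```

The engine never opens the 83 pruned files; the filter is resolved entirely from the directory listing, which is why partition pruning is the cheapest pushdown of all.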
- Compression: Bytes Versus CPU (concepts: paCompression, paSplittability, paZstdSnappy)
Compression is the lever that trades CPU cycles for bytes. Smaller files mean fewer bytes read from disk or network, which usually wins. Smaller files also mean more CPU spent decompressing on read and compressing on write. The tradeoff is rarely close in modern analytical workloads: I/O is slow and getting slower relative to CPU, so the bytes saved are almost always worth the cycles spent. The interesting choice is which codec, not whether to compress.
The Codecs in Use
ZSTD has become the mode…
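The bytes-versus-CPU tradeoff is easy to demonstrate. As a stand-in for the fast-but-larger versus slower-but-smaller choice (the Snappy-versus-ZSTD axis), this sketch uses stdlib `zlib` at a low and a high effort level on some repetitive, table-shaped bytes:

```python
import time
import zlib

# Repetitive CSV-shaped payload: analytical data compresses well.
data = b"user_id,country,amount\n" + b"12345,US,19.99\n" * 50_000

# Low effort ~ fast codec (Snappy-like point); high effort ~ ZSTD-like
# point. zlib is only an illustration of the tradeoff, not the codecs.
for level in (1, 9):
    t0 = time.perf_counter()
    out = zlib.compress(data, level)
    elapsed_ms = (time.perf_counter() - t0) * 1000
    print(f"level {level}: {len(out):>8} bytes in {elapsed_ms:.1f} ms")
```

The higher level spends more CPU to emit fewer bytes; whether that wins depends on how many times the file is read relative to how often it is written, which is why write-once, read-many analytical tables lean toward the heavier codec.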
- Predicate Pushdown (concepts: paPredicatePushdown, paBloomFilter, paQueryPlanning)
Predicate pushdown is the technique of moving filter conditions as close to the storage layer as possible so the engine reads only the bytes that could match. Partition pruning is the coarsest form of pushdown. File-level statistics, row-group min/max, and bloom filters are finer forms. A well-designed Parquet file plus a smart query engine produces queries that scan a tiny fraction of the table while returning the same answer. The wins compound with partitioning and columnar layout.
The Three L…
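Row-group min/max skipping, the Parquet-style middle layer of pushdown, can be sketched directly. The row groups and values below are made up; the mechanism is the real one: statistics prove a group cannot match, so its values are never read.

```python
# Sketch of row-group skipping via min/max statistics.
row_groups = [
    {"min": 0,   "max": 99,  "values": list(range(0, 100))},
    {"min": 100, "max": 199, "values": list(range(100, 200))},
    {"min": 200, "max": 299, "values": list(range(200, 300))},
]

def scan_where_ge(groups, threshold):
    """Evaluate `value >= threshold`, reading only groups whose
    [min, max] range could contain a match."""
    scanned = 0
    hits = []
    for g in groups:
        if g["max"] < threshold:  # statistics prove no row matches
            continue              # -> group skipped, zero bytes read
        scanned += 1
        hits.extend(v for v in g["values"] if v >= threshold)
    return scanned, hits

scanned, hits = scan_where_ge(row_groups, 250)
print(scanned, len(hits))  # 1 group scanned, 50 matching rows
```

Bloom filters extend the same idea to equality predicates on high-cardinality columns, where min/max ranges are too wide to exclude anything.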
- 10TB Versus 100GB: A Worked Example (concepts: paStorageOptimization, paBytesScanned, paLayoutComposition)
The four levers (columnar layout, partitioning, compression, predicate pushdown) compose. A worked example shows how the savings multiply. The setup is a real-shaped clickstream table at moderate scale: eighteen months of mobile events, twenty-eight columns wide, around two trillion total rows. The query is unremarkable: a daily count of unique users for one country in the last seven days. The same SQL runs three different ways across three table layouts. The bytes scanned change by two and a ha…
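The multiplication itself is worth making explicit. The reduction factors below are invented round numbers chosen to match the 10TB-to-100GB headline, not the lesson's measured results; the point is that independent levers multiply rather than add:

```python
# Hypothetical per-lever reduction factors (illustrative only).
full_scan_tb = 10.0     # naive full scan of the row-oriented table
column_pruning = 5      # read a handful of the 28 columns
partition_pruning = 10  # 7-day date filter prunes most partitions
compression = 2         # codec halves the bytes on disk

reduction = column_pruning * partition_pruning * compression  # 100x
bytes_scanned_gb = full_scan_tb * 1000 / reduction
print(f"{reduction}x reduction -> {bytes_scanned_gb:.0f} GB scanned")
```

Because the factors multiply, even modest individual wins compound into orders of magnitude, and conversely a single missing lever (say, an unpartitioned table) caps the whole product.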