An ad-tech company at roughly Series D scale ran a daily revenue report against a clickstream table holding eighteen months of events. The query took fifty-eight minutes. The data engineer who inherited the workload looked at the table layout: CSV files in S3, no partitioning, no compression worth mentioning. The same logical query, after the table was rewritten as Parquet partitioned by event_date with snappy compression, took forty-one seconds. No SQL changed. No new hardware appeared. The eighty-five-fold speedup came entirely from how the bytes were laid out on disk. This lesson is the mechanics that turn an abstract storage layer into the difference between forty-one seconds and fifty-eight minutes. Columnar layout, partitioning, compression, and predicate pushdown are the four levers, and they compose.
Columnar Versus Row Storage
Daily Life
Interviews
Explain why columnar storage accelerates analytical queries through I/O reduction, compression, and vectorized execution.
The single most important fact about a storage format is whether it lays out rows or columns contiguously on disk. The choice flips which queries are fast. A row store wins when queries fetch entire rows by key. A column store wins when queries scan a few columns across many rows. Modern analytical workloads are dominated by the second pattern, which is why every cloud warehouse and every serious lake format uses columnar storage.
How the Bytes Are Arranged
Layout
Disk Order
What a Scan of One Column Reads
Row store
(r1.c1, r1.c2, r1.c3) (r2.c1, r2.c2, r2.c3) ...
Every column of every row, even when only one is needed
Consider a table with twenty columns and a billion rows. A query computes the sum of one column. A row store reads the bytes of all twenty columns for every row, throwing away nineteen twentieths of what was read. A column store reads only the bytes for that one column. The reduction in bytes scanned is roughly twenty to one. The reduction in wall clock time is similar, because most analytical queries are bounded by how many bytes the engine has to scan, not by how many CPU cycles it spends on the bytes.
A Parquet file is divided into row groups (typically 128MB each). Within each row group, columns are stored as separate chunks. Each chunk has a header with min/max statistics, a null count, and the encoded data. The same shape applies to ORC files, which use stripes instead of row groups but the principle is identical. The column-by-column layout is what enables most of the optimizations later in this lesson.
Why Columnar Compresses Better
Adjacent values in a column are similar by nature. Country codes from the same region cluster together. Timestamps within a row group span minutes, not years. Customer IDs are often repeated across rows for the same buyer. Compression algorithms exploit similarity. Run-length encoding compresses a hundred US country codes into the pair (US, 100). Dictionary encoding maps customer IDs to small integers. Delta encoding stores timestamps as offsets from a base. A row store cannot exploit any of this because adjacent bytes belong to different columns with unrelated content. Columnar formats routinely achieve five to ten times better compression ratios on real analytical data than uncompressed row formats.
[US, US, US, US] becomes (US, 4). Effective for low-cardinality columns like country code or status.
Dictionary encoding
Strings become small integers
{u_42 -> 0, u_87 -> 1} replaces every customer_id with a 1-byte index. Used heavily in Parquet by default.
Delta encoding
Sequences become differences
Timestamps stored as offsets from a base value. Saves bytes when a column is sorted or near-sorted.
When Row Storage Still Wins
Columnar is not universally better. Row storage wins when the access pattern reads entire rows. An application fetching a user record by user_id wants every column of that one row. A column store has to assemble the row from twenty separate column chunks, which is slower than reading one row out of a row store. This is precisely why operational databases stayed row-oriented and analytical systems went columnar. The shape follows the access pattern, not the other way around.
•Row Storage Wins When
Queries read most or all columns of a small number of rows
Workload is dominated by single-row INSERT, UPDATE, DELETE
Transactions span multiple columns and need to commit atomically
✓Column Storage Wins When
Queries scan a few columns across many rows
Workload is bulk-load with infrequent updates
Compression matters because storage cost is significant
Vectorized execution can process column chunks at high speed
Vectorized Execution
Columnar storage enables vectorized execution. Modern CPUs operate on batches of values per instruction (SIMD: single instruction, multiple data). Summing a thousand integers stored contiguously can run as a few SIMD instructions instead of a thousand scalar adds. Row storage cannot use SIMD effectively because the integers are interleaved with bytes from other columns. Engines like DuckDB, Snowflake, BigQuery, and Photon (Databricks) are built around vectorized execution on columnar data, and the throughput gain over scalar row-by-row execution is often an order of magnitude on top of the I/O reduction. The two effects multiply.
1
# Conceptually, vectorized execution on a column chunk:
# One SIMD-friendly sum across all 8 values, no row iteration:
6
total=amount_cents.sum()
7
8
# A row-store equivalent must visit each row, fetching the amount
9
# alongside columns it does not need:
10
total=0
11
forrowinrows:
12
total+=row.amount_cents# disk reads pull all 20 columns per row
Columnar storage reads only the columns a query asks for and skips the rest entirely.
Compression ratios are five to ten times better in columnar formats because adjacent values are similar.
Vectorized execution multiplies the I/O win with a per-CPU-instruction throughput win.
Partitioning and File Pruning
Daily Life
Interviews
Choose a partition key by query filter pattern and avoid the small-files trap that kills naive partitioning.
Columnar layout helps a query read fewer columns. Partitioning helps a query read fewer files. The combination is what turns an analytical scan from minutes into seconds. Partitioning splits a table into separate file paths organized by the value of one or more partition columns. A query with a filter on a partition column reads only the matching paths. Done well, partitioning is the largest single performance improvement available to a data engineer. Done badly, it produces millions of small files that destroy performance worse than no partitioning at all.
The folder names encode partition values. A query engine like Spark, Trino, Athena, or Snowflake recognizes the convention and uses it to prune files. A query filtering on event_date = '2026-04-22' reads two files (US and DE for that date) and ignores the other four entirely. The engine never opens the pruned files; it does not even list them in detail. The pruning happens before the bytes are read.
Choosing a Partition Key
Partition Key Property
Why It Matters
Common Choice
Used in most query filters
Pruning only helps queries that filter on the partition column
event_date or ingestion_date in nearly every fact table
Has moderate cardinality
Too few partitions yields no pruning; too many partitions creates millions of small files
Date (365 partitions/year) is the sweet spot for daily fact tables
Stable over time
If a value can change for a row, partition reassignment becomes a rewrite
Immutable timestamps; never partition by status that flips
Skewed but not catastrophically
A partition that holds 90% of the data defeats the purpose
Country sometimes works; user_id never works
The Partition Cardinality Tradeoff
A daily fact table with one year of data has 365 date partitions. Most queries filter on a date range and prune to a handful. This is the right cardinality. Partitioning the same table by user_id, in contrast, would create one partition per user. With ten million users, the table now has ten million tiny directories, each containing a small file or a fragment of one. The metadata overhead alone makes queries slow before any data is read. The lesson is that partition cardinality should match the typical query selectivity. Daily filters want daily partitions. Hourly filters want hourly partitions. Per-user lookups belong on a clustering key, not a partition key.
Partitioning failure modes:
▸Too many partitions: millions of tiny files, listing dominates query time, cost explodes
▸Too few partitions: filters cannot prune effectively, scans read more than necessary
▸Wrong partition key: filters in actual queries do not match what the table is partitioned by
▸Skewed partitions: one partition is 100x larger than the rest, parallelism collapses
Multi-Level Partitioning
Tables can be partitioned by more than one column. A common pattern is event_date plus country, or event_date plus event_type. Multi-level partitioning works well when the secondary partition column has low cardinality and shows up in query filters often. It fails when the cardinalities multiply into a partition count that exceeds the data volume. Two columns at 365 dates by 200 countries equals 73,000 directories, with potentially small files in each. The right shape is one partition with on average tens to hundreds of megabytes of data; partitioning that produces sub-megabyte files has gone too far.
Partitioning is one tool; clustering or sorting within partitions is another. Snowflake supports clustering keys, BigQuery supports clustered tables, and Iceberg has hidden partitioning plus sort orders. Clustering reorders rows within files (or within partitions) so that values close together in sort order are physically adjacent. A query with a filter on a clustered column scans fewer pages even within a single file because the relevant rows cluster together. Clustering is most useful when the column is too high-cardinality to partition on but still appears in many filters (for instance customer_id in a fact table partitioned by date). Use partitioning for the broad cut, clustering for the finer cut inside each partition.
✓Partition Key
Splits data into separate file paths
Pruning happens before files are opened
Best for low-to-moderate cardinality columns
Wrong key creates the small-files problem
•Clustering / Sort Key
Reorders rows within files or partitions
Pruning happens at the row group or page level
Best for high-cardinality columns frequently filtered
Adds maintenance: clustering can drift and need re-sorting
Pick the partition key by what queries actually filter on, not by what feels like a natural primary key. The right key is the one that appears in WHERE clauses across most of the workload.
✓Do
Partition fact tables by event_date or ingestion_date by default
Aim for partitions that hold tens to hundreds of megabytes of data each
Add clustering on high-cardinality filter columns instead of partitioning on them
✗Don't
Partition by a high-cardinality key like user_id or order_id
Add a second partition column unless its cardinality stays low
Partition by a column whose value can change after a row is written
1
Compression: Bytes Versus CPU
Daily Life
Interviews
Choose a compression codec by balancing ratio, decompression speed, and splittability for the workload's I/O versus CPU profile.
Compression is the lever that trades CPU cycles for bytes. Smaller files mean fewer bytes read from disk or network, which usually wins. Smaller files also mean more CPU spent decompressing on read and compressing on write. The tradeoff is rarely close in modern analytical workloads: I/O is slow and getting slower relative to CPU, so the bytes saved are almost always worth the cycles spent. The interesting choice is which codec, not whether to compress.
The Codecs in Use
Codec
Compression Ratio
Decompression Speed
Typical Use
Snappy
Moderate (about 2-3x)
Very fast
Default for Parquet in Spark; balanced choice
GZIP
Higher (about 3-4x)
Slower than Snappy
Archival data where read frequency is low
ZSTD
Higher (about 3-5x)
Fast at low levels, slower at high
Modern default replacing both Snappy and GZIP in many stacks
LZ4
Lower (about 2x)
Fastest decompression
Hot data where decompression latency matters most
Brotli
Highest (about 4-5x)
Slow
Cold archival; rarely used for analytical files
ZSTD has become the modern default in many stacks because it offers GZIP-like ratios at near-Snappy speeds. Spark, Iceberg, Delta, and Trino all support ZSTD natively. Snappy remains the historical default for Parquet because it was the original choice when Parquet emerged at Twitter and Cloudera around 2013, and changing the default has compatibility costs. New tables with read-heavy workloads benefit from picking ZSTD explicitly.
Why Columnar Files Compress So Well
A column of country codes contains hundreds of US values in a row before the next country shows up. A row-oriented file interleaves these with timestamps, integers, and strings, and adjacent bytes are unrelated. Compression works on similarity, and similarity is dense in a column and sparse in a row. The same dataset stored as row-oriented JSON might compress 1.5x with GZIP. The same data stored column-by-column in Parquet with the same GZIP often hits 4-5x. The two compressions are not different by accident; the layout is doing most of the work and GZIP is finishing it.
# Result: 380 MB on disk (about 6x smaller than CSV)
19
20
# Same data, ZSTD level 3
21
pq.write_table(
22
pa.Table.from_pandas(df),
23
"orders_zstd.parquet",
24
compression="zstd",
25
compression_level=3
26
)
27
# Result: 240 MB on disk (about 10x smaller than CSV)
Splittability
A second consideration in compression is splittability. Spark and similar engines parallelize work by splitting files among executors. If a file format allows splits inside a compressed file, parallelism scales to many executors per file. If splits require decompressing the file from the beginning, parallelism is limited to one executor per file, which is fatal for large files. GZIP-compressed CSV is famously not splittable, which is why a single 5GB GZIP CSV is processed by exactly one executor. Parquet is splittable internally because each row group is independently compressed and decompressible. Splittability is invisible until the file is large enough to need it; at that point, it becomes the dominant performance factor.
Splittability cheat sheet:
▸Parquet with any of Snappy, ZSTD, GZIP, LZ4: splittable at row group boundaries
▸ORC with any codec: splittable at stripe boundaries
▸GZIP-compressed CSV or JSON: not splittable; one executor per file
▸BZIP2-compressed CSV: technically splittable but rarely supported by tooling
Compression Inside the Row Group
Parquet applies compression to column chunks within a row group, not to the whole file. Each chunk has its own header and codec choice. The encoding (dictionary, run-length, delta, plain) is applied first; then a general-purpose codec like Snappy or ZSTD is applied on top. The two layers compose: a dictionary-encoded country chunk that already collapsed to small integers compresses further with Snappy. The encoded-and-compressed result is often a hundredth of the size of the raw column. This is why Parquet on real analytical data routinely achieves order-of-magnitude compression versus the same data as CSV.
When Compression Hurts
Compression is not free. Writing compressed data costs CPU. Reading compressed data costs CPU. On hot in-memory workloads where the data already fits in RAM, compression can slow things down because the CPU is the bottleneck and bytes-on-disk does not matter. This is why DuckDB and similar in-memory analytical engines sometimes default to lighter compression or none at all. The general rule is: when storage is far away (object storage, network filesystems), compress aggressively; when storage is fast and local (NVMe SSDs in the same machine as compute), compression is a smaller win.
✓Heavy Compression Wins
Data lives in object storage; reads cross the network
Cold data read infrequently relative to write rate
Storage cost is a meaningful line item
Workloads are I/O-bound, not CPU-bound
•Light or No Compression Wins
Data is hot in RAM or on local NVMe
Workloads are CPU-bound, not I/O-bound
Decompression latency is in the critical path
Tooling has weak compression support and read errors are common
SnappyZSTDGZIP
Snappy
The historical default
Moderate ratio, very fast decompression. Default for Parquet in Spark since 2013. Still a safe choice when tool compatibility matters.
ZSTD
The modern default
Higher ratio than Snappy at near-equivalent speed. Tunable via compression level. The right pick for new Parquet tables on object storage.
GZIP
Archival workhorse
Strong ratio, slower decompression. Suitable for cold data read infrequently. Avoid on hot analytical tables where ZSTD wins on every axis.
1
Predicate Pushdown
Daily Life
Interviews
Apply predicate pushdown by writing filters that allow partition, row group, and bloom filter pruning, and verify pushdown via EXPLAIN.
Predicate pushdown is the technique of moving filter conditions as close to the storage layer as possible so the engine reads only the bytes that could match. Partition pruning is the coarsest form of pushdown. File-level statistics, row-group min/max, and bloom filters are finer forms. A well-designed Parquet file plus a smart query engine produces queries that scan a tiny fraction of the table while returning the same answer. The wins compound with partitioning and columnar layout.
The Three Levels of Pushdown
Level
Where It Happens
What It Skips
Partition pruning
Before any file is opened
Whole directories that cannot match the filter
Row group pruning (file statistics)
After the file footer is read
Whole row groups whose min/max disagrees with the filter
Page pruning (column statistics)
After the row group's column metadata is read
Individual pages within a column whose min/max disagrees
Bloom filter pruning
After the bloom filter is consulted
Row groups whose bloom filter says the value is definitely absent
Each level skips bytes the previous level could not. A query that filters on event_date prunes by partition, then by row group statistics within each partition's file, then by column page statistics. By the time the engine reads actual data, it has often pruned 99% of the table without a single full scan. The cost of the metadata reads is microseconds; the cost of the bytes saved is seconds or minutes.
Min/max statistics are written into the Parquet footer at write time. A reader reads the footer first (a few kilobytes), looks at the per-row-group statistics, and decides which row groups to read. The decision is precise when the data is sorted on the filter column, because sorted data produces non-overlapping row groups. The decision is fuzzy when the data is unsorted, because every row group's min/max range overlaps with the filter and pruning is less effective. Sorting on filter columns is the cheap way to make pushdown work harder.
Bloom Filters
Bloom filters are a probabilistic data structure that answers 'is this value in this set' with no false negatives but possible false positives. A bloom filter on a column lets a reader skip a row group if the filter says 'this value is definitely not here.' Bloom filters are valuable for high-cardinality equality filters (find rows where order_id = X) where min/max is too coarse. Parquet supports bloom filters as an optional feature; ORC has its own bloom filter implementation. They cost some space in the file metadata; they save large amounts of scanning when used well.
Pushdown in SQL Engines
1
/* Snowflake, BigQuery, Athena, Trino, Spark all do this: */
2
/* The query planner pushes the filter into the scan operator, */
3
/* which pushes it into the file format reader. */
GROUPBY/* The engine reads only the partitions, files, row groups, *//* and (if supported) pages whose statistics could match the filter. */customer_id
When Pushdown Fails
Pushdown does not work when the filter cannot be pushed. A function applied to the partition column (WHERE DATE(event_ts) = '2026-04-22' on a table partitioned by event_ts) often defeats partition pruning, because the engine cannot evaluate the function over the metadata. UDFs in the filter clause defeat statistics-based pruning. CAST inside the filter sometimes defeats it, depending on the engine. The discipline is to write filters that match the partition and clustering columns directly: WHERE event_date = '2026-04-22' rather than WHERE DATE(event_ts) = '2026-04-22'. The two queries return the same answer; only one allows pushdown.
Filter shapes that defeat pushdown:
▸Functions applied to partition columns: DATE(event_ts), CAST(amount AS STRING)
▸User-defined functions in WHERE clauses
▸OR clauses spanning multiple unrelated columns; only some engines push these
▸Subqueries that the planner cannot prove are deterministic
▸Filters using LIKE on a column not configured with a substring index
Verifying Pushdown
Every serious SQL engine exposes an EXPLAIN command that shows the query plan. The plan reveals whether filters are pushed into the scan or applied after the scan. A scan with a Filter operator above it has not pushed the filter; a scan with the predicate annotated inside it has pushed it. The habit of running EXPLAIN before assuming pushdown works is what separates engineers who write fast queries from engineers who think they did. Snowflake, Spark, Trino, and BigQuery all have variations of EXPLAIN that show pruned bytes or pruned partitions explicitly.
Predicate pushdown moves the filter to the storage layer so the engine reads only matching bytes.
Three pruning levels compose: partition, row group, and page or bloom filter.
Functions and casts in the WHERE clause can silently defeat pushdown; verify with EXPLAIN.
TIP
When a query is slower than expected, run EXPLAIN and check whether the filter sits inside the scan or after it. Most pushdown surprises are visible in the plan.
10TB Versus 100GB: A Worked Example
Daily Life
Interviews
Walk through how columnar layout, partitioning, compression, and pushdown compose to turn a 10TB scan into a 100GB scan.
The four levers (columnar layout, partitioning, compression, predicate pushdown) compose. A worked example shows how the savings multiply. The setup is a real-shaped clickstream table at moderate scale: eighteen months of mobile events, twenty-eight columns wide, around two trillion total rows. The query is unremarkable: a daily count of unique users for one country in the last seven days. The same SQL runs three different ways across three table layouts. The bytes scanned change by two and a half orders of magnitude.
The Workload
1
/* The query, identical across all three setups */
2
SELECT
3
event_date,
4
COUNT(DISTINCTuser_id)ASdau
5
FROMevents.mobile_clickstream
6
WHEREevent_dateBETWEEN'2026-04-17'AND'2026-04-23'
7
ANDcountry='US'
8
GROUPBYevent_date
9
ORDERBYevent_date
Setup 1: CSV, No Partitioning
Property
Value
Format
GZIP-compressed CSV files in S3
Partitioning
None
Total size on disk
Roughly 10 TB
Bytes the query scans
All 10 TB
Wall clock time
Hours, depending on cluster size
Cost (Snowflake medium warehouse)
Tens of dollars per run
The engine has no way to skip files because there are no partitions. The engine has no way to skip columns because CSV is row-oriented. The engine has no way to use min/max statistics because CSV does not embed any. Every byte gets read, decompressed, parsed, and discarded. This is the worst case and is also the actual state of many lake tables that started as raw landings and never got rewritten.
Setup 2: Parquet, No Partitioning
Property
Value
Format
Parquet with Snappy compression
Partitioning
None
Total size on disk
Roughly 1.5 TB (CSV compressed about 6-7x by columnar layout)
Bytes the query scans
Roughly 200 GB (only the columns event_date, user_id, country read)
Wall clock time
Minutes
Cost (Snowflake medium warehouse)
Single dollars per run
Columnar layout cuts both storage size and scan size dramatically. Storage drops from 10TB to 1.5TB because adjacent values in a column compress well. Scan size drops further because only three of twenty-eight columns are referenced. The engine reads roughly the bytes for those three columns across the full table, plus footer metadata for row group statistics. Min/max pruning helps a little: row groups whose event_date range falls outside the seven-day window get skipped. The order-of-magnitude improvement comes from columnar; the additional pruning is a smaller bonus.
Setup 3: Parquet, Partitioned by event_date
Property
Value
Format
Parquet with ZSTD compression
Partitioning
event_date (daily)
Total size on disk
Roughly 1.2 TB (slightly better than Snappy with ZSTD)
Bytes the query scans
Roughly 100 GB (7 day partitions out of 540 total)
Wall clock time
Tens of seconds
Cost (Snowflake medium warehouse)
Cents per run
Partition pruning eliminates 533 out of 540 day partitions. The seven days that match get scanned, with the same columnar advantage as before. Within each day's partition, row group statistics on country prune further because the data is sorted by country within each day. The engine ends up reading roughly 100GB out of 1.2TB on disk: a hundred-fold reduction in bytes scanned versus the partitioned-but-still-CSV alternative, and a hundred-fold reduction in dollars. The cost for the same business answer dropped from tens of dollars to cents.
How the Savings Stack
Optimization
Bytes Scanned (10TB Baseline)
Reduction
CSV, no partitioning
10 TB
1x baseline
Parquet, no partitioning
200 GB
50x
Parquet, partitioned by event_date
100 GB
100x
Parquet, partitioned by event_date, sorted by country
20 GB
500x
Add bloom filter on user_id
Same scan, faster aggregation
100x with finer pruning on user_id queries
The reductions are not additive but multiplicative. Each lever cuts a fraction of what the previous lever left. The numbers in the table are typical for clickstream-shaped workloads; specific tables can do better or worse depending on column cardinality, data sortedness, and query selectivity. The qualitative shape (orders-of-magnitude wins from layered optimization) is universal.
1
An identical SQL string can produce a forty-second result on a well-laid-out table or an hours-long result on a badly laid-out one. The query writer often gets blamed for slow queries, but the file format and partitioning decisions made years earlier, by someone else, often dominate. The discipline is to invest in the table layout once and pay the dividend on every query thereafter. Migrating a critical analytical table from CSV to partitioned Parquet is a one-day project that frequently pays for itself in compute savings within a quarter.
TIP
When a slow analytical query lands on the desk, the cheapest first check is the table format and partitioning. Most slow queries are slow because the bytes are laid out wrong, not because the SQL is wrong. Run EXPLAIN; verify pruning; only then look at warehouse sizing or SQL rewrites.
❯❯❯PUTTING IT ALL TOGETHER
> An ad-tech company has a daily revenue report that takes fifty-eight minutes to run against eighteen months of clickstream data stored as CSV in S3. The dashboard's freshness SLA is twenty minutes. The team is being asked to add another year of history and a second region. The data engineer is asked: 'What is the smallest set of changes that would make this query fast enough to meet SLA, and what does each change actually buy?'
Step one: rewrite the table from CSV to Parquet with ZSTD compression. Columnar layout reads only the columns the query references and compresses the rest of the storage by an order of magnitude. The query drops from fifty-eight minutes to several minutes with no other changes.
Step two: partition the Parquet table by event_date. The seven-day query prunes 533 of 540 partitions before reading any bytes. Combined with columnar layout, the bytes scanned drop by roughly two orders of magnitude versus the original CSV. This is the single largest win.
Step three: sort within each partition by country, the second filter column. Row group min/max statistics now allow the engine to skip entire row groups that do not contain US rows. Predicate pushdown takes the scan from 100GB down to roughly 20GB, well inside SLA budget.
Step four: verify with EXPLAIN that the filter is pushed into the scan. Functions like DATE(event_ts) silently defeat pruning, so the migrated table uses event_date as a stored column. The freshness SLA from Lesson 2 is satisfied because the daily run now finishes in under a minute, well under the twenty-minute window.
The four levers compose. Lesson 1 introduced storage as a single role; this rewrite makes that role earn its title by laying out bytes for the access pattern instead of for the writer's convenience. Adding the new region and another year of history extends the table by a known fraction without changing the optimization story.
KEY TAKEAWAYS
Columnar layout is the largest single optimization: analytical queries read only the columns they need, with built-in compression and statistics.
Partition cardinality must match query selectivity: daily filters want daily partitions; per-user filters want clustering, not partitioning.
Compression is almost always worth its CPU cost: ZSTD at level 3 is a strong modern default in Parquet on object storage.
Predicate pushdown composes with everything else: partition, row group, page, and bloom filter pruning each cut another fraction of what survived the previous step.
Identical SQL can be 100x slower or faster: the file format and partitioning decisions made before the query runs dominate performance.
Storage Layers and Table Formats: Intermediate
Columns, partitions, compression, and pushdown turn a 10TB scan into a 100GB scan
Category
Pipeline Architecture
Difficulty
intermediate
Duration
32 minutes
Challenges
0 hands-on challenges
Topics covered: Columnar Versus Row Storage, Partitioning and File Pruning, Compression: Bytes Versus CPU, Predicate Pushdown, 10TB Versus 100GB: A Worked Example
The single most important fact about a storage format is whether it lays out rows or columns contiguously on disk. The choice flips which queries are fast. A row store wins when queries fetch entire rows by key. A column store wins when queries scan a few columns across many rows. Modern analytical workloads are dominated by the second pattern, which is why every cloud warehouse and every serious lake format uses columnar storage. How the Bytes Are Arranged Consider a table with twenty columns a
Columnar layout helps a query read fewer columns. Partitioning helps a query read fewer files. The combination is what turns an analytical scan from minutes into seconds. Partitioning splits a table into separate file paths organized by the value of one or more partition columns. A query with a filter on a partition column reads only the matching paths. Done well, partitioning is the largest single performance improvement available to a data engineer. Done badly, it produces millions of small fi
Compression is the lever that trades CPU cycles for bytes. Smaller files mean fewer bytes read from disk or network, which usually wins. Smaller files also mean more CPU spent decompressing on read and compressing on write. The tradeoff is rarely close in modern analytical workloads: I/O is slow and getting slower relative to CPU, so the bytes saved are almost always worth the cycles spent. The interesting choice is which codec, not whether to compress. The Codecs in Use ZSTD has become the mode
Predicate pushdown is the technique of moving filter conditions as close to the storage layer as possible so the engine reads only the bytes that could match. Partition pruning is the coarsest form of pushdown. File-level statistics, row-group min/max, and bloom filters are finer forms. A well-designed Parquet file plus a smart query engine produces queries that scan a tiny fraction of the table while returning the same answer. The wins compound with partitioning and columnar layout. The Three L
The four levers (columnar layout, partitioning, compression, predicate pushdown) compose. A worked example shows how the savings multiply. The setup is a real-shaped clickstream table at moderate scale: eighteen months of mobile events, twenty-eight columns wide, around two trillion total rows. The query is unremarkable: a daily count of unique users for one country in the last seven days. The same SQL runs three different ways across three table layouts. The bytes scanned change by two and a ha