Survive the storage follow-ups: encoding, small files, schema evolution
Topics covered: Encoding Types, The Small File Problem, Predicate Pushdown, Storage Tiering, Schema Evolution
What They Want to Hear 'Parquet applies encoding per column before compression. Dictionary encoding maps repeated values to small integers, so a column of 1 million country codes becomes 1 million tiny integers plus a 200-entry dictionary. RLE (run-length encoding) stores repeated consecutive values as (value, count). Delta encoding stores differences between sequential values, perfect for timestamps.' The key insight: encoding converts data into a more compressible form BEFORE the compression algorithm runs.
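The three encodings can be sketched in a few lines each. This is an illustrative toy, not Parquet's actual bit-packed hybrid implementations, but it shows exactly what each transform does to a column:

```python
from itertools import groupby

def dictionary_encode(values):
    """Map each distinct value to a small integer index plus a dictionary."""
    dictionary, indices = {}, []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        indices.append(dictionary[v])
    return dictionary, indices

def rle_encode(values):
    """Store runs of repeated consecutive values as (value, count) pairs."""
    return [(v, len(list(run))) for v, run in groupby(values)]

def delta_encode(values):
    """Store the first value, then differences between neighbors."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

countries = ["US", "US", "DE", "DE", "DE", "US"]
print(dictionary_encode(countries))      # ({'US': 0, 'DE': 1}, [0, 0, 1, 1, 1, 0])
print(rle_encode(countries))             # [('US', 2), ('DE', 3), ('US', 1)]
print(delta_encode([1000, 1005, 1007]))  # [1000, 5, 2]
```

Notice how the outputs are full of small, repetitive integers: that is the form a general-purpose compressor like Snappy or Zstd squeezes far better than the raw strings.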
What They Want to Hear 'Small file problem. Over-partitioning or high-frequency writes create thousands of tiny files. The per-file overhead of open/read/close dominates query time. I fix it with compaction: merge small files into 128 MB to 1 GB targets. Delta OPTIMIZE or Iceberg rewrite_data_files.' This is the #1 practical storage question. Most candidates know partitioning but cannot diagnose why their partitioned table is still slow.
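A compaction planner is essentially greedy bin-packing toward the target size. The sketch below is a hypothetical planner, not what Delta OPTIMIZE or Iceberg rewrite_data_files actually do internally, but it captures the idea of merging many small files into few near-target files:

```python
def plan_compaction(file_sizes_mb, target_mb=128):
    """Greedily group small files into merge jobs of roughly target_mb each."""
    bins, current, size = [], [], 0
    for f in sorted(file_sizes_mb, reverse=True):
        # Start a new merge job once adding this file would overshoot the target.
        if current and size + f > target_mb:
            bins.append(current)
            current, size = [], 0
        current.append(f)
        size += f
    if current:
        bins.append(current)
    return bins

# 32 one-row-group files of 8 MB collapse into 2 merge jobs of ~128 MB each.
print(plan_compaction([8] * 32))
```

In practice you would run the table format's built-in command (e.g. `OPTIMIZE my_table` in Delta) rather than hand-rolling this; the sketch is only to show what "compaction" computes.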
What They Want to Hear 'Predicate pushdown pushes the WHERE clause to the storage layer. In Parquet, each row group stores min/max statistics per column. If the query asks for revenue > 1000 and a row group's max revenue is 500, the entire row group is skipped without reading it. Combined with partition pruning, this can skip 99%+ of the data.' The key insight they are testing is that pushdown works at TWO levels: partition pruning (skipping folders) and row-group pruning (skipping chunks within files).
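Row-group pruning is just an interval check against the stored statistics. A minimal sketch with made-up stats (the field names here are illustrative, not Parquet's actual metadata layout):

```python
# Hypothetical per-row-group min/max statistics for a 'revenue' column.
row_groups = [
    {"id": 0, "min_revenue": 10,  "max_revenue": 500,  "rows": 100_000},
    {"id": 1, "min_revenue": 800, "max_revenue": 5000, "rows": 100_000},
]

def prune_for_gt(groups, threshold):
    """Keep only row groups that COULD contain rows with revenue > threshold.

    If a group's max is <= threshold, no row in it can match, so the
    reader skips the group without touching its data pages.
    """
    return [g for g in groups if g["max_revenue"] > threshold]

# WHERE revenue > 1000: group 0 (max 500) is skipped entirely.
survivors = prune_for_gt(row_groups, 1000)
print([g["id"] for g in survivors])  # [1]
```

Note the stats can only prove a group is irrelevant; surviving groups still get scanned and filtered row by row.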
What They Want to Hear 'Not all data is accessed equally. I tier data by access pattern: hot (SSD, recent days/weeks, most expensive), warm (standard S3, recent months), cold (Glacier, historical years, cheapest). This alone can cut storage bills by 60-80% because the 80/20 rule applies: 80% of queries touch 20% of data.'
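The savings math is simple to demonstrate. The per-GB-month prices below are hypothetical placeholders (check your cloud provider's current pricing); the point is how much of the bill the cold tier absorbs when most data is rarely touched:

```python
# Hypothetical $/GB-month prices per tier -- NOT real cloud pricing.
TIER_PRICE = {"hot": 0.125, "warm": 0.023, "cold": 0.004}

def monthly_cost(gb_by_tier):
    """Total monthly storage cost for data spread across tiers."""
    return sum(TIER_PRICE[tier] * gb for tier, gb in gb_by_tier.items())

# 100 TB all on hot storage vs. tiered by access pattern.
flat   = monthly_cost({"hot": 100_000})
tiered = monthly_cost({"hot": 2_000, "warm": 18_000, "cold": 80_000})
print(f"flat: ${flat:,.0f}/mo, tiered: ${tiered:,.0f}/mo")
```

With these illustrative numbers the tiered layout costs a small fraction of the flat one; the exact percentage depends entirely on your access distribution and real prices.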
What They Want to Hear 'Schema changes are inevitable. My approach: adding a nullable column is always safe. Widening a type (int to long) is safe. Removing a column or narrowing a type is a breaking change that requires a migration plan with dual-write during the transition.' Then mention table formats: 'Iceberg and Delta handle safe schema evolution natively. For breaking changes, I use a dual-write period where both old and new schemas are produced simultaneously.'
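The safe/breaking rules above can be encoded as a tiny compatibility checker. This is a hedged sketch (hypothetical helper, schemas as simple `{column: type}` dicts), not any table format's actual API:

```python
# Type widenings considered safe, per the rules above (int->long, etc.).
SAFE_WIDENINGS = {("int", "long"), ("int", "double"), ("float", "double")}

def breaking_changes(old_schema, new_schema):
    """Return a list of breaking changes between two {column: type} schemas.

    Added columns are treated as safe (assumed nullable); removed columns
    and non-widening type changes are breaking.
    """
    issues = []
    for col, old_type in old_schema.items():
        if col not in new_schema:
            issues.append(f"removed column: {col}")
        elif new_schema[col] != old_type and (old_type, new_schema[col]) not in SAFE_WIDENINGS:
            issues.append(f"narrowed/changed type: {col} {old_type} -> {new_schema[col]}")
    return issues

old = {"id": "int", "amount": "int"}
new = {"id": "long", "country": "string"}  # id widened (safe), country added, amount dropped
print(breaking_changes(old, new))  # ['removed column: amount']
```

A non-empty result is the signal to plan a migration with a dual-write period instead of evolving the schema in place.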