DataDriven

© 2026 DataDriven


Where Data Lives

Survive the storage follow-ups: encoding, small files, schema evolution


Category: Pipeline Architecture
Difficulty: Intermediate
Duration: 25 minutes
Challenges: 0 hands-on challenges

Topics covered: Encoding Types, The Small File Problem, Predicate Pushdown, Storage Tiering, Schema Evolution

Lesson Sections

  1. Encoding Types (concepts: paCompression)

    What They Want to Hear: 'Parquet applies encoding per column before compression. Dictionary encoding maps repeated values to small integers, so a column of 1 million country codes becomes 1 million tiny integers plus a 200-entry dictionary. RLE (run-length encoding) stores repeated consecutive values as (value, count). Delta encoding stores differences between sequential values, perfect for timestamps.' The key insight: encoding converts data into a more compressible form BEFORE the compression algorithm ever sees it.
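The three encodings above can be shown in a few lines of pure Python. This is a minimal sketch of the ideas, not Parquet's actual on-disk format; the function names are illustrative:

```python
def dictionary_encode(values):
    # Map each distinct value to a small integer; the dictionary is stored once.
    dictionary = {}
    codes = []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        codes.append(dictionary[v])
    return list(dictionary), codes

def run_length_encode(values):
    # Collapse runs of identical consecutive values into (value, count) pairs.
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]

def delta_encode(ints):
    # Keep the first value as-is; store each later value as a difference
    # from its predecessor. Sorted timestamps become tiny deltas.
    return [ints[0]] + [b - a for a, b in zip(ints, ints[1:])] if ints else []
```

Note how each output is "smaller-shaped" than the input: repeated codes, short runs, and tiny deltas are exactly what a general-purpose compressor like Snappy or Zstd squeezes well.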

  2. The Small File Problem (concepts: paPartitioning)

    What They Want to Hear 'Small file problem. Over-partitioning or high-frequency writes create thousands of tiny files. The per-file overhead of open/read/close dominates query time. I fix it with compaction: merge small files into 128 MB to 1 GB targets. Delta OPTIMIZE or Iceberg rewrite_data_files.' This is the #1 practical storage question. Most candidates know partitioning but cannot diagnose why their partitioned table is still slow.
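The compaction step can be sketched as a greedy bin-packing of file sizes into target-sized groups, which is roughly what Delta's OPTIMIZE and Iceberg's rewrite_data_files do under the hood. A minimal sketch with an illustrative function name and a common 128 MiB default:

```python
TARGET_BYTES = 128 * 1024 * 1024  # 128 MiB, a common compaction target

def plan_compaction(file_sizes, target=TARGET_BYTES):
    """Greedy first-fit: group small files into bins of roughly `target` bytes.
    Each returned group would be rewritten as one larger file."""
    groups, current, current_size = [], [], 0
    for size in file_sizes:
        if current and current_size + size > target:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups
```

In an interview, the point to land is that compaction trades a one-time rewrite cost for permanently lower per-query open/read/close overhead.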

  3. Predicate Pushdown (concepts: paColumnarVsRow)

    What They Want to Hear 'Predicate pushdown pushes the WHERE clause to the storage layer. In Parquet, each row group stores min/max statistics per column. If the query asks for revenue > 1000 and a row group's max revenue is 500, the entire row group is skipped without reading it. Combined with partition pruning, this can skip 99%+ of the data.' The key insight they are testing: pushdown works at TWO levels: partition pruning (skip folders) and row group pruning (skip chunks within files).
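Row-group pruning with min/max statistics is easy to demonstrate. A minimal sketch, assuming row groups are dicts carrying the per-column stats that Parquet writes into its footer (the structure here is illustrative, not Parquet's real metadata layout):

```python
def prune_row_groups(row_groups, column, predicate_min):
    """Keep only row groups whose max for `column` could satisfy
    the predicate `column > predicate_min`; skip the rest unread."""
    kept = []
    for rg in row_groups:
        stats = rg["stats"][column]  # min/max recorded when the file was written
        if stats["max"] > predicate_min:
            kept.append(rg)
    return kept

row_groups = [
    {"name": "rg0", "stats": {"revenue": {"min": 0, "max": 500}}},
    {"name": "rg1", "stats": {"revenue": {"min": 200, "max": 5000}}},
]
# For `revenue > 1000`, rg0 (max 500) is skipped without reading a single row.
survivors = prune_row_groups(row_groups, "revenue", 1000)
```

Partition pruning is the same idea one level up: the "statistic" is the partition value encoded in the folder path, and the skipped unit is a whole directory instead of a chunk within a file.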

  4. Storage Tiering (concepts: paDataLake)

    What They Want to Hear 'Not all data is accessed equally. I tier data by access pattern: hot (SSD, recent days/weeks, most expensive), warm (standard S3, recent months), cold (Glacier, historical years, cheapest). This alone can cut storage bills by 60-80% because the 80/20 rule applies: 80% of queries touch 20% of data.'
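A tiering policy is usually just a rule on days since last access. A minimal sketch with illustrative thresholds (real lifecycle policies, e.g. S3 lifecycle rules, are configured declaratively rather than in application code):

```python
from datetime import date

def storage_tier(last_access: date, today: date) -> str:
    """Assign a storage tier from days since last access.
    Thresholds are illustrative, not a standard."""
    age_days = (today - last_access).days
    if age_days <= 30:
        return "hot"    # SSD / premium: recent days and weeks
    if age_days <= 365:
        return "warm"   # standard object storage: recent months
    return "cold"       # archival class: historical years
```

The 80/20 claim is what justifies the thresholds: if 80% of queries touch the last few weeks of data, only that slice needs to pay the hot-tier price.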

  5. Schema Evolution (concepts: paTableFormats)

    What They Want to Hear 'Schema changes are inevitable. My approach: adding a nullable column is always safe. Widening a type (int to long) is safe. Removing a column or narrowing a type is a breaking change that requires a migration plan with dual-write during the transition.' Then mention table formats: 'Iceberg and Delta handle safe schema evolution natively. For breaking changes, I use a dual-write period where both old and new schemas are produced simultaneously.'
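The safe-vs-breaking rules above can be sketched as a small compatibility check. This is a simplified illustration (schemas as column-name-to-type dicts, a hand-picked widening table), not the evolution logic Iceberg or Delta actually implement:

```python
# Widening conversions that old readers can still handle safely.
SAFE_WIDENINGS = {("int", "long"), ("float", "double"), ("int", "double")}

def evolution_kind(old_schema: dict, new_schema: dict) -> str:
    """Classify a schema change as 'safe' or 'breaking' (simplified rules):
    added columns and widenings are safe; removals and narrowings break."""
    for col, old_type in old_schema.items():
        if col not in new_schema:
            return "breaking"  # removed column: old readers' queries fail
        new_type = new_schema[col]
        if new_type != old_type and (old_type, new_type) not in SAFE_WIDENINGS:
            return "breaking"  # narrowed or incompatible type change
    # Columns present only in new_schema are additions: safe when nullable.
    return "safe"
```

A check like this is exactly what you would run in CI before a deploy: safe changes ship directly, breaking changes trigger the dual-write migration plan.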
