Where Data Lives: Intermediate

You said 'Parquet, columnar, compressed.' The interviewer nods. Now: 'How does Parquet compress data so well?' or 'Your queries are slow. What do you check first?' or 'How do you handle schema changes without breaking everything?' These follow-ups test whether you actually operate data infrastructure.

Encoding Types

Daily Life
Interviews

Explain encoding types when they probe Parquet internals

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "How does Parquet compress data so well?"
  • "What is dictionary encoding?"
  • "RLE vs dictionary: when do you use each?"

What They Want to Hear

'Parquet applies encoding per column before compression. Dictionary encoding maps repeated values to small integers, so a column of 1 million country codes becomes 1 million tiny integers plus a 200-entry dictionary. RLE (run-length encoding) stores repeated consecutive values as (value, count). Delta encoding stores differences between sequential values, perfect for timestamps.' The key insight: encoding converts data into a more compressible form BEFORE the compression algorithm runs.
EncodingHow It WorksBest For
DictionaryMap unique values to integersLow-cardinality: country, status, type
RLEStore (value, count) pairsSorted columns, boolean flags
DeltaStore differences between valuesTimestamps, sequential IDs
Bit-packingUse minimum bits per valueSmall integers, enums
The Curveball Follow-ups

After your initial answer, expect these probes

  • "What happens if dictionary encoding is applied to a high-cardinality column?" It gets WORSE. If a column has millions of unique values (like email addresses), the dictionary itself is huge, and the file is larger than plain encoding. Parquet falls back automatically when cardinality is too high.
  • "How do you choose the encoding?" You usually do not. Parquet auto-selects based on column statistics. But knowing WHY dictionary encoding works shows you understand the mechanics, not just the tool.
KEY TAKEAWAYS
Say: 'Encoding converts data into a more compressible form before compression runs.'
Dictionary = low-cardinality, RLE = repeated values, Delta = sequential values
Know the gotcha: dictionary encoding HURTS on high-cardinality columns

The Small File Problem

Daily Life
Interviews

Diagnose the small file problem on the spot

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "Your queries are slow. What do you check first?"
  • "How do you handle thousands of small files?"
  • "What is file compaction?"

What They Want to Hear

'Small file problem. Over-partitioning or high-frequency writes create thousands of tiny files. The per-file overhead of open/read/close dominates query time. I fix it with compaction: merge small files into 128 MB to 1 GB targets. Delta OPTIMIZE or Iceberg rewrite_data_files.' This is the #1 practical storage question. Most candidates know partitioning but cannot diagnose why their partitioned table is still slow.
Diagnosis Flow
remediationprevention
Over-Partitioning
Hourly partitions on low-volume data
Slow Reads
File listing alone takes minutes
Compaction
Merge into 128 MB - 1 GB files
Right Partition Key
Ensure adequate data per partition
The Curveball Follow-ups

After your initial answer, expect these probes

  • "How often do you run compaction?" Depends on write frequency. Daily for most batch pipelines. Hourly for high-throughput streaming. The tradeoff: compaction costs compute (you rewrite data that has not changed). Too frequent wastes money; too rare lets the problem grow.
  • "How do you prevent the problem in the first place?" Right-size your partitions. If hourly partitions are too small, use daily. Configure the writer to buffer and produce fewer, larger output files.
  • "What about streaming into a lake?" Streaming writes lots of tiny files by nature. Schedule compaction to run every 1-2 hours behind the streaming writer.
KEY TAKEAWAYS
Say: 'Small files. Over-partitioning or streaming writes create thousands of tiny files. Compaction merges them.'
This is the #1 'your queries are slow' diagnosis. Always check file count and sizes first.
Target: 128 MB to 1 GB per file. Anything under 10 MB is a problem.

Predicate Pushdown

Daily Life
Interviews

Explain predicate pushdown at both levels

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "How does predicate pushdown work?"
  • "Why does adding a WHERE clause make my query 10x faster?"
  • "What are row group statistics?"

What They Want to Hear

'Predicate pushdown pushes the WHERE clause to the storage layer. In Parquet, each row group stores min/max statistics per column. If the query asks for revenue > 1000 and a row group's max revenue is 500, the entire row group is skipped without reading it. Combined with partition pruning, this can skip 99%+ of the data.' The key insight they are testing: pushdown works at TWO levels: partition pruning (skip folders) and row group pruning (skip chunks within files).
Two Levels of Pruning
date filtercolumn filterminimal reads
WHERE revenue > 1000
Partition Pruning
Skip irrelevant date folders
Row Group Pruning
Skip chunks where max < 1000
Read Qualifying Data
Only matching row groups
The Curveball Follow-ups

After your initial answer, expect these probes

  • "What makes pushdown ineffective?" Randomly ordered data. If every row group's min is 0 and max is 10 million, no groups get skipped. Sorting data on the filter column makes pushdown dramatically more effective.
  • "Is this the same as an index?" Conceptually similar but lighter weight. Indexes are separate data structures that point to rows. Row group statistics are embedded in the file and work at chunk granularity, not row granularity.
KEY TAKEAWAYS
Say: 'Pushdown works at two levels: partition pruning skips folders, row group statistics skip chunks within files.'
Sorting data on filter columns makes pushdown dramatically more effective
Combined, partition + row group pruning can skip 99%+ of data

Storage Tiering

Daily Life
Interviews

Answer storage cost questions with tiering

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "How do you manage storage costs at scale?"
  • "What is storage tiering?"
  • "When do you archive old data?"

What They Want to Hear

'Not all data is accessed equally. I tier data by access pattern: hot (SSD, recent days/weeks, most expensive), warm (standard S3, recent months), cold (Glacier, historical years, cheapest). This alone can cut storage bills by 60-80% because the 80/20 rule applies: 80% of queries touch 20% of data.'
What to Whiteboard
Hot
SSD/in-memoryRecent dataMost expensive, fastest
Warm
Standard S3/GCSRecent monthsGood cost-speed balance
Cold
S3 Glacier/ArchiveHistorical yearsCheapest, hours to retrieve
The Curveball Follow-ups

After your initial answer, expect these probes

  • "An analyst needs data from 3 years ago. How long does it take?" If it is in Glacier, 3-12 hours for retrieval. That is the tradeoff. Make sure retention policies are communicated to data consumers.
  • "What triggers the tier transition?" Time-based policies. Data older than 90 days moves to warm. Older than 1 year moves to cold. Some teams also tier based on query frequency.
  • "What about compliance requirements for data retention?" Compliance may require keeping raw data for 7+ years. Cold storage makes this affordable ($4/TB/month on Glacier vs $23/TB/month on standard S3).
KEY TAKEAWAYS
Say: 'Tier by access pattern. Hot for recent, warm for months, cold for years. Cuts bills 60-80%.'
The 80/20 rule: 80% of queries touch 20% of data. Tier the rest cheaply.
Know the retrieval tradeoff: Glacier is 80% cheaper but takes hours to retrieve

Schema Evolution

Daily Life
Interviews

Handle schema evolution questions with migration strategies

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "What happens when a source adds a new column?"
  • "How do you handle schema changes without breaking pipelines?"
  • "Schema-on-read vs schema-on-write?"

What They Want to Hear

'Schema changes are inevitable. My approach: adding a nullable column is always safe. Widening a type (int to long) is safe. Removing a column or narrowing a type is a breaking change that requires a migration plan with dual-write during the transition.' Then mention table formats: 'Iceberg and Delta handle safe schema evolution natively. For breaking changes, I use a dual-write period where both old and new schemas are produced simultaneously.'
Schema Change Lifecycle
source changesevaluatesafe (additive)breakingconsumers updated
Stable
All consumers agree on schema
Change Proposed
New column, type change
Compatibility Check
Safe or breaking?
Migrating
Dual-write period
Adopted
All consumers updated
The Curveball Follow-ups

After your initial answer, expect these probes

  • "A vendor renames a column without telling you. How do you handle it?" The pipeline should validate schemas on ingestion. If the expected column is missing, alert and hold the data in Bronze (do not propagate the error downstream).
  • "What is the difference between schema-on-read and schema-on-write?" Schema-on-write (warehouse) rejects bad data at write time. Schema-on-read (lake) accepts anything and validates at query time. The tradeoff: early error detection vs flexibility.
  • "How do you version schemas?" Schema registry (Confluent for Kafka, Glue for AWS). Each schema version is stored with compatibility rules. Producers cannot break consumers.
Do
  • Adding a nullable column is always safe
  • Widening a type (int to long) is safe
  • Use schema registries for streaming data
Don't
  • Remove a column without checking downstream consumers
  • Narrow a type without verifying data range
  • Rename a column without a migration period
KEY TAKEAWAYS
Say: 'Additive changes are safe. Breaking changes need a dual-write migration period.'
Table formats (Iceberg, Delta) handle safe evolution natively. Use them.
The pro move: mention schema registries for streaming sources

Survive the storage follow-ups: encoding, small files, schema evolution

Category
Pipeline Architecture
Difficulty
intermediate
Duration
25 minutes
Challenges
0 hands-on challenges

Topics covered: Encoding Types, The Small File Problem, Predicate Pushdown, Storage Tiering, Schema Evolution

Lesson Sections

  1. Encoding Types (concepts: paCompression)

    What They Want to Hear 'Parquet applies encoding per column before compression. Dictionary encoding maps repeated values to small integers, so a column of 1 million country codes becomes 1 million tiny integers plus a 200-entry dictionary. RLE (run-length encoding) stores repeated consecutive values as (value, count). Delta encoding stores differences between sequential values, perfect for timestamps.' The key insight: encoding converts data into a more compressible form BEFORE the compression a

  2. The Small File Problem (concepts: paPartitioning)

    What They Want to Hear 'Small file problem. Over-partitioning or high-frequency writes create thousands of tiny files. The per-file overhead of open/read/close dominates query time. I fix it with compaction: merge small files into 128 MB to 1 GB targets. Delta OPTIMIZE or Iceberg rewrite_data_files.' This is the #1 practical storage question. Most candidates know partitioning but cannot diagnose why their partitioned table is still slow.

  3. Predicate Pushdown (concepts: paColumnarVsRow)

    What They Want to Hear 'Predicate pushdown pushes the WHERE clause to the storage layer. In Parquet, each row group stores min/max statistics per column. If the query asks for revenue > 1000 and a row group's max revenue is 500, the entire row group is skipped without reading it. Combined with partition pruning, this can skip 99%+ of the data.' The key insight they are testing: pushdown works at TWO levels: partition pruning (skip folders) and row group pruning (skip chunks within files).

  4. Storage Tiering (concepts: paDataLake)

    What They Want to Hear 'Not all data is accessed equally. I tier data by access pattern: hot (SSD, recent days/weeks, most expensive), warm (standard S3, recent months), cold (Glacier, historical years, cheapest). This alone can cut storage bills by 60-80% because the 80/20 rule applies: 80% of queries touch 20% of data.'

  5. Schema Evolution (concepts: paTableFormats)

    What They Want to Hear 'Schema changes are inevitable. My approach: adding a nullable column is always safe. Widening a type (int to long) is safe. Removing a column or narrowing a type is a breaking change that requires a migration plan with dual-write during the transition.' Then mention table formats: 'Iceberg and Delta handle safe schema evolution natively. For breaking changes, I use a dual-write period where both old and new schemas are produced simultaneously.'