Where Data Lives: Intermediate
You said 'Parquet, columnar, compressed.' The interviewer nods. Now: 'How does Parquet compress data so well?' or 'Your queries are slow. What do you check first?' or 'How do you handle schema changes without breaking everything?' These follow-ups test whether you actually operate data infrastructure.
Encoding Types
Explain encoding types when they probe Parquet internals
When you hear these in an interview, this is the concept being tested
- "How does Parquet compress data so well?"
- "What is dictionary encoding?"
- "RLE vs dictionary: when do you use each?"
What They Want to Hear
| Encoding | How It Works | Best For |
|---|---|---|
| Dictionary | Map unique values to integers | Low-cardinality: country, status, type |
| RLE | Store (value, count) pairs | Sorted columns, boolean flags |
| Delta | Store differences between values | Timestamps, sequential IDs |
| Bit-packing | Use minimum bits per value | Small integers, enums |
After your initial answer, expect these probes
- "What happens if dictionary encoding is applied to a high-cardinality column?" It gets WORSE. If a column has millions of unique values (like email addresses), the dictionary is nearly as large as the data itself, so the dictionary plus the codes can take more space than plain encoding. Parquet writers typically fall back to plain encoding automatically once the dictionary grows too large.
- "How do you choose the encoding?" You usually do not. The Parquet writer picks an encoding per column based on the column's type and the data it sees. But knowing WHY dictionary encoding works shows you understand the mechanics, not just the tool.
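If you want to show the mechanics on a whiteboard, the three main encodings fit in a few lines. The sketch below is a toy illustration in plain Python, not Parquet's actual on-disk format; the function names and sample values are made up for the example.

```python
from itertools import groupby


def dictionary_encode(values):
    """Map each distinct value to a small integer code."""
    dictionary = []   # code -> original value
    index = {}        # original value -> code
    codes = []
    for v in values:
        if v not in index:
            index[v] = len(dictionary)
            dictionary.append(v)
        codes.append(index[v])
    return dictionary, codes


def rle_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(v, len(list(run))) for v, run in groupby(values)]


def delta_encode(values):
    """Keep the first value, then store differences between neighbours."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


countries = ["US", "US", "DE", "US", "DE", "DE"]
dictionary, codes = dictionary_encode(countries)
# dictionary == ["US", "DE"], codes == [0, 0, 1, 0, 1, 1]

runs = rle_encode(sorted(countries))
# runs == [("DE", 3), ("US", 3)]  (sorting is what creates the long runs)

deltas = delta_encode([1700000000, 1700000060, 1700000120])
# deltas == [1700000000, 60, 60]

# High-cardinality case: every value is unique, so the dictionary
# is as long as the column and nothing is saved.
emails = [f"user{i}@example.com" for i in range(1000)]
email_dictionary, _ = dictionary_encode(emails)
# len(email_dictionary) == 1000
```

The last case is the follow-up probe in miniature: with all-unique values you pay for the dictionary and the codes, which is why writers give up on dictionary encoding for such columns.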
The Small File Problem
Diagnose the small file problem on the spot
When you hear these in an interview, this is the concept being tested
- "Your queries are slow. What do you check first?"
- "How do you handle thousands of small files?"
- "What is file compaction?"
What They Want to Hear
After your initial answer, expect these probes
- "How often do you run compaction?" Depends on write frequency. Daily for most batch pipelines. Hourly for high-throughput streaming. The tradeoff: compaction costs compute (you rewrite data that has not changed). Run it too often and you waste money; run it too rarely and the problem grows.
- "How do you prevent the problem in the first place?" Right-size your partitions. If hourly partitions are too small, use daily. Configure the writer to buffer and produce fewer, larger output files.
- "What about streaming into a lake?" Streaming writes lots of tiny files by nature. Schedule compaction to run every 1-2 hours behind the streaming writer.
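Compaction itself is conceptually simple: group small files until each group adds up to roughly one target-sized file, then rewrite each group as a single file. The sketch below only plans the groups, in plain Python; the 128 MB target, the function name, and the file names are assumptions for illustration, and in practice a table format's own compaction command does this for you.

```python
TARGET_BYTES = 128 * 1024 * 1024  # assumed target size for one output file


def plan_compaction(file_sizes, target_bytes=TARGET_BYTES):
    """Group small files into batches of roughly target_bytes each.

    file_sizes: dict of path -> size in bytes (a hypothetical partition listing).
    Returns a list of batches; each batch is a list of paths to merge into one file.
    Files already at or above the target are left alone.
    """
    batches = []
    current, current_size = [], 0
    for path in sorted(file_sizes):
        size = file_sizes[path]
        if size >= target_bytes:
            continue  # already large enough, no rewrite needed
        if current and current_size + size > target_bytes:
            if len(current) > 1:  # rewriting a lone file gains nothing
                batches.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if len(current) > 1:
        batches.append(current)
    return batches


# 1,000 files of 1 MB each, the kind of layout a streaming writer leaves behind.
one_mb = 1024 * 1024
listing = {f"part-{i:04d}.parquet": one_mb for i in range(1000)}
batches = plan_compaction(listing)
# len(batches) == 8: seven batches of 128 files and one of 104,
# so a query opens 8 files instead of 1,000.
```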
Predicate Pushdown
Explain predicate pushdown at both levels
When you hear these in an interview, this is the concept being tested
- "How does predicate pushdown work?"
- "Why does adding a WHERE clause make my query 10x faster?"
- "What are row group statistics?"
What They Want to Hear
After your initial answer, expect these probes
- "What makes pushdown ineffective?" Randomly ordered data. If every row group's min is 0 and max is 10 million, every group might contain a match, so none can be skipped. Sorting data on the filter column makes pushdown dramatically more effective.
- "Is this the same as an index?" Conceptually similar but lighter weight. Indexes are separate data structures that point to rows. Row group statistics are embedded in the file and work at chunk granularity, not row granularity.
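The sorted-versus-random point is easy to demonstrate. The sketch below models row group statistics as simple min/max records and applies a `revenue > 1000` filter; the class, function, and numbers are hypothetical stand-ins for what a query engine reads from the Parquet footer.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Min/max of one column in one row group (simplified footer metadata)."""
    group_id: int
    min_value: float
    max_value: float


def groups_to_read(stats, lower_bound):
    """Return ids of row groups that might hold rows with value > lower_bound."""
    # A group can be skipped only when even its max fails the predicate.
    return [s.group_id for s in stats if s.max_value > lower_bound]


# Data sorted on revenue: each group covers a narrow, non-overlapping range.
sorted_layout = [
    RowGroupStats(0, 0, 500),
    RowGroupStats(1, 501, 1000),
    RowGroupStats(2, 1001, 5000),
]

# Randomly ordered data: every group spans almost the full range.
random_layout = [
    RowGroupStats(0, 3, 4990),
    RowGroupStats(1, 0, 5000),
    RowGroupStats(2, 12, 4875),
]

sorted_hits = groups_to_read(sorted_layout, 1000)   # [2]
random_hits = groups_to_read(random_layout, 1000)   # [0, 1, 2]
```

Same rows, same filter: the sorted layout reads one group out of three, the random layout reads all of them.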
Storage Tiering
Answer storage cost questions with tiering
When you hear these in an interview, this is the concept being tested
- "How do you manage storage costs at scale?"
- "What is storage tiering?"
- "When do you archive old data?"
What They Want to Hear
After your initial answer, expect these probes
- "An analyst needs data from 3 years ago. How long does it take?" If it is in Glacier, expect retrieval to take hours (roughly 3-12, depending on the retrieval option) before the query can even start. That is the tradeoff. Make sure retention policies are communicated to data consumers.
- "What triggers the tier transition?" Time-based policies. Data older than 90 days moves to warm. Data older than 1 year moves to cold. Some teams also tier based on query frequency.
- "What about compliance requirements for data retention?" Compliance may require keeping raw data for 7+ years. Cold storage makes this affordable (roughly $4/TB/month on Glacier vs $23/TB/month on standard S3).
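The policy and the cost argument can both be put into numbers. The sketch below uses the 90-day and 1-year thresholds and the per-TB prices quoted above; the function names, the 100 TB total, and the 20/80 split are illustrative assumptions, not a recommendation for any particular workload.

```python
from datetime import date, timedelta

WARM_AFTER_DAYS = 90    # older than this moves to warm
COLD_AFTER_DAYS = 365   # older than this moves to cold

STANDARD_USD_PER_TB = 23  # standard S3, per month
GLACIER_USD_PER_TB = 4    # Glacier, per month


def tier_for(partition_date, today):
    """Pick a tier from the age of a partition (time-based policy)."""
    age_days = (today - partition_date).days
    if age_days > COLD_AFTER_DAYS:
        return "cold"
    if age_days > WARM_AFTER_DAYS:
        return "warm"
    return "hot"


def monthly_cost(tb_standard, tb_glacier):
    """Monthly storage cost in USD for a given split between the two classes."""
    return tb_standard * STANDARD_USD_PER_TB + tb_glacier * GLACIER_USD_PER_TB


today = date(2024, 6, 1)
recent = tier_for(today - timedelta(days=30), today)     # "hot"
last_quarter = tier_for(today - timedelta(days=120), today)  # "warm"
historical = tier_for(today - timedelta(days=800), today)    # "cold"

# Assumed example: 100 TB total, 80 TB of it old enough for Glacier.
all_standard = monthly_cost(100, 0)   # 2300
tiered = monthly_cost(20, 80)         # 780
saving = 1 - tiered / all_standard    # about 0.66, i.e. roughly two thirds cheaper
```

This ignores retrieval fees and any premium for a faster hot tier, so treat the result as a rough upper-level estimate rather than a bill.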
Schema Evolution
Handle schema evolution questions with migration strategies
When you hear these in an interview, this is the concept being tested
- "What happens when a source adds a new column?"
- "How do you handle schema changes without breaking pipelines?"
- "Schema-on-read vs schema-on-write?"
What They Want to Hear
After your initial answer, expect these probes
- "A vendor renames a column without telling you. How do you handle it?" The pipeline should validate schemas on ingestion. If the expected column is missing, alert and hold the data in Bronze (do not propagate the error downstream).
- "What is the difference between schema-on-read and schema-on-write?" Schema-on-write (warehouse) rejects bad data at write time. Schema-on-read (lake) accepts anything and applies the schema at query time, so problems surface only when someone reads the data. The tradeoff: early error detection vs flexibility.
- "How do you version schemas?" Use a schema registry (Confluent Schema Registry for Kafka, AWS Glue Schema Registry on AWS). Each schema version is stored with compatibility rules, so producers cannot publish a change that breaks consumers.
Do
- Treat adding a nullable column as safe
- Treat widening a type (int to long) as safe
- Use schema registries for streaming data
Don't
- Remove a column without checking downstream consumers
- Narrow a type without verifying data range
- Rename a column without a migration period
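These rules are mechanical enough to check in code at ingestion time. The sketch below compares an expected schema with an incoming one and lists the breaking changes; the schema representation (a dict of column name to type and nullability), the function name, and the sample columns are assumptions made for the example, not the API of any schema registry.

```python
# Type changes treated as safe. Only int -> long comes from the rules above;
# extend this set deliberately, never by default.
SAFE_WIDENINGS = {("int", "long")}


def breaking_changes(old, new):
    """Compare two schemas given as {column: (type, nullable)}.

    Returns a list of human-readable problems; an empty list means the
    change is safe under the rules above.
    """
    problems = []
    for name, (old_type, _) in old.items():
        if name not in new:
            # A rename also lands here: the old name simply disappears.
            problems.append(f"removed column: {name}")
            continue
        new_type, _ = new[name]
        if new_type != old_type and (old_type, new_type) not in SAFE_WIDENINGS:
            problems.append(f"type change on {name}: {old_type} -> {new_type}")
    for name, (_, nullable) in new.items():
        if name not in old and not nullable:
            problems.append(f"new column is not nullable: {name}")
    return problems


expected = {"id": ("int", False), "email": ("string", True)}

# Safe: id widened to long, nullable column added.
safe = {"id": ("long", False), "email": ("string", True), "country": ("string", True)}
safe_result = breaking_changes(expected, safe)        # []

# Breaking: the vendor renamed email to email_address.
renamed = {"id": ("int", False), "email_address": ("string", True)}
renamed_result = breaking_changes(expected, renamed)  # ["removed column: email"]
```

A non-empty result is the signal to alert and hold the batch in Bronze instead of letting it flow downstream.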
Survive the storage follow-ups: encoding, small files, schema evolution
- Category: Pipeline Architecture
- Difficulty: intermediate
- Duration: 25 minutes
- Challenges: 0 hands-on challenges
Topics covered: Encoding Types, The Small File Problem, Predicate Pushdown, Storage Tiering, Schema Evolution
Lesson Sections
- Encoding Types (concepts: paCompression)
What They Want to Hear 'Parquet applies encoding per column before compression. Dictionary encoding maps repeated values to small integers, so a column of 1 million country codes becomes 1 million tiny integers plus a 200-entry dictionary. RLE (run-length encoding) stores repeated consecutive values as (value, count). Delta encoding stores differences between sequential values, perfect for timestamps.' The key insight: encoding converts data into a more compressible form BEFORE the compression a
- The Small File Problem (concepts: paPartitioning)
What They Want to Hear 'Small file problem. Over-partitioning or high-frequency writes create thousands of tiny files. The per-file overhead of open/read/close dominates query time. I fix it with compaction: merge small files into 128 MB to 1 GB targets, using Delta OPTIMIZE or Iceberg rewrite_data_files.' This is the #1 practical storage question. Most candidates know partitioning but cannot diagnose why their partitioned table is still slow.
- Predicate Pushdown (concepts: paColumnarVsRow)
What They Want to Hear 'Predicate pushdown pushes the WHERE clause to the storage layer. In Parquet, each row group stores min/max statistics per column. If the query asks for revenue > 1000 and a row group's max revenue is 500, the entire row group is skipped without reading it. Combined with partition pruning, this can skip 99%+ of the data.' The key insight they are testing: pushdown works at TWO levels: partition pruning (skip folders) and row group pruning (skip chunks within files).
- Storage Tiering (concepts: paDataLake)
What They Want to Hear 'Not all data is accessed equally. I tier data by access pattern: hot (SSD, recent days/weeks, most expensive), warm (standard S3, recent months), cold (Glacier, historical years, cheapest). This alone can cut storage bills by 60-80% because the 80/20 rule applies: 80% of queries touch 20% of data.'
- Schema Evolution (concepts: paTableFormats)
What They Want to Hear 'Schema changes are inevitable. My approach: adding a nullable column is always safe. Widening a type (int to long) is safe. Removing a column or narrowing a type is a breaking change that requires a migration plan with dual-write during the transition.' Then mention table formats: 'Iceberg and Delta handle safe schema evolution natively. For breaking changes, I use a dual-write period where both old and new schemas are produced simultaneously.'