The Storage Question

Parquet pop quiz, partitioning, and the physical layer

Category: Pipeline Architecture
Difficulty: Intermediate
Duration: 35 minutes
Challenges: 0 hands-on challenges

Topics covered: Why Parquet?, How Would You Partition?, Delta or Iceberg?, Data Lake or Warehouse?, How Much Will This Cost?

Lesson Sections

  1. Why Parquet? (concepts: paColumnarVsRow, paCompression)

    This is asked as a screener because it instantly reveals whether you've worked with production data at scale. The interviewer doesn't want "it's columnar." They want you to connect physical layout to the queries you actually run.

    Row vs. Columnar Layout

    CSV and JSON store data row by row. To answer "what's the average order amount?" on a 500-column table, a row-oriented reader must load all 500 columns into memory, skip 499 of them, and aggregate the one it needs. Parquet stores each column contiguously, so the same query reads only the one column it needs.
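    The layout difference can be sketched in pure Python, with no Parquet library involved. The table shape below is invented for illustration: the same 500-column table is serialized row-wise and column-wise, and averaging one column touches every byte of every record in the row layout but only 1/500th of the data in the columnar layout.

```python
import struct

# Hypothetical 1,000-row, 500-column table of float64 values (made-up data).
NUM_ROWS, NUM_COLS = 1_000, 500
rows = [[float(r * NUM_COLS + c) for c in range(NUM_COLS)] for r in range(NUM_ROWS)]

# Row-oriented layout (CSV/JSON-style): each full record stored contiguously.
row_store = b"".join(struct.pack(f"{NUM_COLS}d", *row) for row in rows)

# Column-oriented layout (Parquet-style): each column stored contiguously.
col_store = [struct.pack(f"{NUM_ROWS}d", *(row[c] for row in rows))
             for c in range(NUM_COLS)]

def avg_col_row_layout(data: bytes, col: int) -> float:
    """Averaging one column in a row layout still walks every full record."""
    total, rec_size = 0.0, NUM_COLS * 8
    for r in range(NUM_ROWS):
        record = data[r * rec_size:(r + 1) * rec_size]  # whole 500-col record
        total += struct.unpack_from("d", record, col * 8)[0]
    return total / NUM_ROWS

def avg_col_col_layout(cols: list[bytes], col: int) -> float:
    """A columnar layout touches only the one column the query needs."""
    values = struct.unpack(f"{NUM_ROWS}d", cols[col])
    return sum(values) / NUM_ROWS

assert avg_col_row_layout(row_store, 3) == avg_col_col_layout(col_store, 3)
# Bytes touched for the same answer: all 500 columns vs. just one.
print(len(row_store), len(col_store[3]))  # 4000000 8000
```

    Real Parquet adds row groups, encodings, and compression on top, but the pruning win starts with exactly this contiguity.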

  2. How Would You Partition? (concepts: paPartitioning)

    Partitioning is how you turn a 10 TB table scan into a 50 GB targeted read. The interviewer wants to hear your thought process for choosing a partition key - not just "partition by date."

    Choosing a Partition Key

    Start with how the data is queried. If 95% of queries filter on event_date, that's your partition key. If analysts always filter by region first, consider region. The goal is pruning: the query engine should eliminate partitions before reading any data. A table partitioned by date with a filter on event_date scans only the matching partitions.
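    Pruning can be sketched with nothing but directory names - the paths, dates, and sizes below are invented. The key point is that the engine eliminates partitions from the Hive-style path alone, before opening a single file:

```python
# Hypothetical Hive-style partition layout for an events table (made-up sizes).
partitions = {
    f"s3://lake/events/event_date={d}/": size_gb
    for d, size_gb in [
        ("2024-01-01", 50), ("2024-01-02", 55), ("2024-01-03", 48),
        ("2024-06-01", 60), ("2024-06-02", 62),
    ]
}

def prune(parts: dict[str, int], start: str, end: str) -> dict[str, int]:
    """Keep only partitions whose event_date falls in [start, end].
    Decided from the directory name alone - no data file is read."""
    kept = {}
    for path, size in parts.items():
        d = path.rstrip("/").split("event_date=")[1]
        if start <= d <= end:   # ISO dates compare correctly as strings
            kept[path] = size
    return kept

hit = prune(partitions, "2024-06-01", "2024-06-30")
print(sorted(hit))  # only the two June partitions survive
print(f"{sum(hit.values())} GB scanned instead of {sum(partitions.values())} GB")
```

    The same logic also shows the failure mode: a query that does not filter on event_date gets no pruning and scans everything.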

  3. Delta or Iceberg? (concepts: paTableFormats)

    Both Delta Lake and Apache Iceberg add ACID transactions to files sitting on object storage. They solve the same core problem: Parquet files are immutable, so updates, deletes, and schema changes require a metadata layer. The interviewer wants you to know what each does well and where they diverge.

    The Core Problem They Solve

    Without a table format, a "table" is just a directory of Parquet files with a naming convention. There's no atomic commit - if a write fails halfway, you have partial data.
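    A toy sketch of the metadata-layer idea - not the actual Delta or Iceberg on-disk format - makes the atomic-commit point concrete. Data files are immutable; a table "version" is just a manifest listing them, and a commit is an atomic rename of that manifest, so readers see either the old version or the new one, never half a write:

```python
import json
import os
import tempfile

table = tempfile.mkdtemp()  # stand-in for an object-store prefix

def write_data_file(name: str, rows: list[dict]) -> str:
    """Immutable data file: written once, never modified in place."""
    with open(os.path.join(table, name), "w") as f:
        json.dump(rows, f)
    return name

def commit(version: int, files: list[str]) -> None:
    """Write the manifest to a temp name, then atomically rename it in.
    A crash before os.replace() leaves the previous version untouched."""
    tmp = os.path.join(table, f"_tmp_{version}.json")
    with open(tmp, "w") as f:
        json.dump({"version": version, "files": files}, f)
    os.replace(tmp, os.path.join(table, f"v{version:05d}.json"))

def current_files() -> list[str]:
    """Readers resolve the table by loading the latest committed manifest."""
    manifests = sorted(m for m in os.listdir(table) if m.startswith("v"))
    with open(os.path.join(table, manifests[-1])) as f:
        return json.load(f)["files"]

f1 = write_data_file("part-0.json", [{"id": 1}])
commit(0, [f1])
f2 = write_data_file("part-1.json", [{"id": 2}])
commit(1, [f1, f2])     # "append": the new manifest references both files
print(current_files())  # ['part-0.json', 'part-1.json']
```

    Delta's transaction log and Iceberg's snapshot/manifest tree are far richer than this, but both reduce to the same move: mutate metadata atomically instead of mutating data files.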

  4. Data Lake or Warehouse? (concepts: paDataLake)

    This question tests whether you understand the economics and tradeoffs, not just the definitions. The answer has shifted dramatically since 2022. The interviewer wants to hear you reason about it, not recite a comparison chart.

    The Traditional Split

    Data warehouses couple storage and compute into a managed service. You load structured data, it's optimized for SQL analytics, and you pay per query or per compute-second. Data lakes (object storage + Spark) store raw files in any format. You bring your own compute.
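    One way to "reason about it" out loud is a back-of-envelope cost model. Every price below is an invented placeholder, not a vendor quote; the point is the shape of the comparison - flat storage plus ops overhead on one side, per-compute billing on the other - not the specific numbers:

```python
# All rates are ASSUMED, illustrative placeholders - not real pricing.
STORAGE_WH = 0.040    # $/GB-month, warehouse storage (assumed)
STORAGE_LAKE = 0.023  # $/GB-month, object storage (assumed)
COMPUTE_WH = 5.00     # $/hour, managed warehouse compute (assumed)
COMPUTE_LAKE = 1.50   # $/hour, self-managed Spark cluster (assumed)
OPS_LAKE = 1_500      # $/month, flat engineering/ops overhead (assumed)

def monthly(gb: int, hours: float, storage: float,
            compute: float, fixed: float = 0.0) -> float:
    """Monthly cost = storage + metered compute + any flat overhead."""
    return gb * storage + hours * compute + fixed

# 50 TB resident, at three very different query workloads.
for hours in (10, 200, 2_000):
    wh = monthly(50_000, hours, STORAGE_WH, COMPUTE_WH)
    lake = monthly(50_000, hours, STORAGE_LAKE, COMPUTE_LAKE, OPS_LAKE)
    winner = "warehouse" if wh < lake else "lake"
    print(f"{hours:>5} query-hours: warehouse ${wh:,.0f} vs lake ${lake:,.0f} -> {winner}")
```

    Under these made-up inputs the warehouse wins at low query volume and the lake wins once compute dominates - which is exactly the reasoning the interviewer wants to hear you walk through with the real numbers for their workload.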

  5. How Much Will This Cost? (concepts: paCompression)

    Storage cost is the question that separates engineers who build pipelines from engineers who own pipelines. The interviewer wants to see that you think about money as a first-class engineering constraint.

    S3 Storage Tiers

    A common production setup: 500 GB/day ingestion in Parquet. That's ~15 TB/month raw. With 2-year retention, you're looking at 360 TB. At S3 Standard pricing, that's $8,280/month. Move data older than 90 days to IA and older than 1 year to Glacier Instant Retrieval, and the same 360 TB costs well under half as much.
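    The arithmetic can be checked in a few lines. The per-GB-month tier prices below are assumptions close to published S3 rates; verify against current pricing before relying on them:

```python
# Tier prices are ASSUMED (close to published S3 rates at time of writing).
STANDARD = 0.023    # $/GB-month, S3 Standard (assumed)
IA = 0.0125         # $/GB-month, S3 Infrequent Access (assumed)
GLACIER_IR = 0.004  # $/GB-month, Glacier Instant Retrieval (assumed)

gb_per_month = 500 * 30       # 500 GB/day ingestion -> 15 TB/month
total_gb = gb_per_month * 24  # 2-year retention -> 360 TB resident

all_standard = total_gb * STANDARD
print(f"all Standard: ${all_standard:,.0f}/month")  # $8,280

# Steady state: 3 of the 24 retained months are < 90 days old (Standard),
# 9 months sit between 90 days and 1 year (IA), 12 months are > 1 year (Glacier).
tiered = (gb_per_month * 3 * STANDARD
          + gb_per_month * 9 * IA
          + gb_per_month * 12 * GLACIER_IR)
print(f"tiered:       ${tiered:,.2f}/month")
```

    Under these assumed rates, tiering cuts the bill to well under half of all-Standard - the kind of concrete saving that demonstrates you own the pipeline, not just build it.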