The Storage Question: Intermediate

"Why Parquet?" is the single most common screening question in data engineering interviews - over 40 mentions across top tech and finance companies. It's asked early because the answer immediately reveals whether you've operated at scale or only read about it. The interviewer isn't looking for a textbook definition of columnar storage. They want to hear you connect physical layout to query performance, compression ratios to cloud bills, and partition strategy to pipeline reliability. This lesson covers the five storage questions that surface in 75+ real interview recordings, from file format basics through cost modeling.

What you will be able to do

Explain columnar storage, compression, and predicate pushdown with real numbers
Explain columnar storage, compression, and predicate pushdown with real numbers
Choose partition keys using cardinality analysis and avoid the over-partition trap
Choose partition keys using cardinality analysis and avoid the over-partition trap
Compare Delta Lake and Apache Iceberg on ACID guarantees, time travel, and schema evolution
Compare Delta Lake and Apache Iceberg on ACID guarantees, time travel, and schema evolution
Articulate when a data lake, warehouse, or lakehouse architecture fits
Articulate when a data lake, warehouse, or lakehouse architecture fits
Model storage costs including lifecycle policies and format impact
Model storage costs including lifecycle policies and format impact

Why Parquet?

Daily Life
Interviews

Explain why Parquet outperforms CSV/JSON for analytics - with numbers, not buzzwords

This is asked as a screener because it instantly reveals whether you've worked with production data at scale. The interviewer doesn't want "it's columnar." They want you to connect physical layout to the queries you actually run.

Row vs. Columnar Layout

Bronze (raw)
Silver (cleaned)
Gold (business)
raw
bronze
S3
cleaned
silver
Delta
business
gold
warehouse

The medallion storage layering: bronze keeps raw immutable data, silver is cleaned/conformed, gold is business-ready aggregates serving BI. Each layer trades storage cost for query readiness.

CSV and JSON store data row-by-row. To answer "what's the average order amount?" on a 500-column table, a row-oriented reader must load all 500 columns into memory, skip 499 of them, and aggregate the one it needs. Parquet stores each column contiguously on disk. That same query reads exactly one column - roughly 0.2% of the total bytes for a wide table.
Row Format (CSV/JSON)
  • Reads all columns to access one
  • Poor compression - mixed data types per block
  • No column statistics for query planning
  • Human-readable but bloated at scale
Columnar Format (Parquet)
  • Reads only columns referenced in query
  • 10:1 to 100:1 compression - same types cluster
  • Min/max stats enable predicate pushdown
  • Binary but 5-20x smaller on disk

Compression and Encoding

Parquet compresses well because columns contain homogeneous data. A status column with 5 distinct values across 100M rows uses dictionary encoding - each value stored once, then referenced by integer index. This routinely achieves 50:1 compression on low-cardinality string columns. Numeric columns use delta encoding (store differences between consecutive values) or run-length encoding for sorted data. A 200 GB CSV dataset commonly shrinks to 8-15 GB in Parquet with Snappy compression, or 5-10 GB with ZSTD.

Predicate Pushdown

Every Parquet file stores min/max statistics per column per row group (default ~128 MB uncompressed). When a query filters WHERE order_date > '2025-01-01', the engine reads the footer metadata first, checks the min/max stats, and skips entire row groups that can't contain matching rows. On a well-sorted file this eliminates 90%+ of I/O before a single data page is decompressed.
1/* Spark: predicate pushdown in action */
2/* Only reads row groups where amount could be > 1000 */
3SELECT
4 customer_id,
5 amount
6FROM orders_parquet
7WHERE amount > 1000
8AND order_date >= '2025-01-01' /* Check Spark UI: "files read" vs "files listed" */ /* shows how many files were skipped */
TIP
Senior signal: mention that predicate pushdown depends on how the data is sorted. If your date column is randomly distributed across row groups, the min/max stats won't help - every row group's range will span the full table. Sort your data on the most commonly filtered column before writing Parquet.
KEY TAKEAWAYS
Parquet's columnar layout means queries only read the columns they reference - critical for wide tables
Dictionary, delta, and RLE encoding routinely compress data 10-100x depending on cardinality
Predicate pushdown uses per-row-group min/max stats to skip irrelevant data before decompression
The combination of column pruning + pushdown + compression is why Parquet dominates analytical workloads

How Would You Partition?

Daily Life
Interviews

Choose partition keys using cardinality analysis and avoid the over-partition trap

Partitioning is how you turn a 10 TB table scan into a 50 GB targeted read. The interviewer wants to hear your thought process for choosing a partition key - not just "partition by date."

Choosing a Partition Key

Start with how the data is queried. If 95% of queries filter on event_date, that's your partition key. If analysts always filter by region first, consider region. The goal is pruning: the query engine should eliminate partitions before reading any data. A table partitioned by date with 3 years of daily partitions has ~1,095 partitions - a query for one day reads 0.09% of the data.
Interview Question
  • "You have a 500 GB/day event stream. How would you partition it in S3?"
  • They want: partition key analysis, cardinality reasoning, file size targets, and awareness of downstream query patterns.

Cardinality Analysis

The partition key's cardinality determines how many directories you create. Too few partitions (e.g., by continent - 7 values) means each partition is massive and you get minimal pruning benefit. Too many partitions (e.g., by user_id - 50M values) creates the small file problem: millions of tiny files that overwhelm the metastore and make listing operations painfully slow.
Partition KeyCardinalityFiles/DayAvg File SizeVerdict
date1~1050 GBToo coarse - no pruning within day
date + hour24~2402 GBGood for hourly queries
date + hour + user_id24 × 50Mmillions< 1 KBCatastrophic - small file hell
date + region~30~3001.7 GBSweet spot for region-filtered queries

The Over-Partition Trap

Hive metastore starts degrading around 100K partitions. Cloud catalog services have a soft limit of 10M partitions per table. But performance degrades long before you hit limits - listing 500K partitions takes 30+ seconds in most catalogs. The sweet spot is 128 MB to 1 GB per file, with partition counts in the low thousands for actively queried tables.
TIP
Senior signal: mention that you'd use Spark's repartition() or coalesce() after writing to consolidate small files. Say you'd add a compaction job that runs daily to merge files under 128 MB. This shows you've dealt with this in production.
KEY TAKEAWAYS
Partition key choice is driven by query patterns, not data structure
Cardinality between 100-10,000 is the sweet spot for most analytical tables
Over-partitioning creates small files that degrade listing performance and increase S3 API costs
File compaction is a required operational concern, not an optimization

Delta or Iceberg?

Daily Life
Interviews

Compare Delta Lake and Iceberg on ACID guarantees, time travel, and schema evolution

Both Delta Lake and Apache Iceberg add ACID transactions to files sitting on object storage. They solve the same core problem: Parquet files are immutable, so updates, deletes, and schema changes require a metadata layer. The interviewer wants you to know what each does well and where they diverge.

The Core Problem They Solve

Without a table format, a "table" is just a directory of Parquet files with a naming convention. There's no atomic commit - if a write fails halfway, you have partial data. There's no isolation - a reader might see a half-written batch. There's no schema enforcement - a new file could have different column names. Delta and Iceberg both fix this by adding a transaction log that tracks which files constitute the current table state.
Delta Lake
  • JSON-based transaction log (_delta_log/)
  • Tight Spark integration - native in its parent platform
  • MERGE INTO for upserts since day one
  • Optimized for its parent platform ecosystem
  • Change Data Feed for downstream CDC
  • Log compaction via checkpoints every 10 commits
Apache Iceberg
  • Manifest-based metadata (Avro manifest lists)
  • Engine-agnostic - Spark, Trino, Flink, Dremio
  • Hidden partitioning - partition evolution without rewrite
  • Snapshot-based branching and tagging
  • Time travel via snapshot isolation
  • Adopted by major cloud providers and query engines

Time Travel

Both formats support querying historical snapshots. Delta retains commits for 30 days by default (configurable). Iceberg keeps snapshots until you explicitly expire them. The mechanism differs: Delta replays the transaction log to reconstruct state, while Iceberg points directly to a snapshot's manifest list. For large tables (10K+ commits), Iceberg's approach is faster because it doesn't need to replay a log.

Schema Evolution

Both support adding columns, but Iceberg handles partition evolution - changing partition strategy without rewriting data. If you partition by day and later need by hour, Iceberg tracks both schemes and applies the correct one per file based on when it was written. Delta requires a full rewrite to change partitioning. This is a significant operational advantage for long-lived tables.
Interview Question
  • "When would you choose Iceberg over Delta Lake?"
  • Strong answer: multi-engine environment, need partition evolution, want vendor-neutral format. Mention the compute coupling tradeoff - Delta's platform-specific optimizations are real.
TIP
Senior signal: acknowledge that Delta open-sourced UniForm to write Iceberg-compatible metadata alongside Delta logs. The formats are converging. Say "the choice is increasingly about ecosystem, not features" - this shows you track the space actively.
KEY TAKEAWAYS
Delta and Iceberg both provide ACID transactions on top of Parquet files in object storage
Delta excels in platform-centric environments with tight Spark integration
Iceberg wins on engine-agnosticism and partition evolution without data rewrites
Time travel works differently - snapshot-based (Iceberg) vs log-replay (Delta)
The formats are converging; ecosystem fit matters more than feature comparison

Data Lake or Warehouse?

Daily Life
Interviews

Articulate when a data lake, warehouse, or lakehouse architecture fits - and why

This question tests whether you understand the economics and tradeoffs, not just the definitions. The answer has shifted dramatically since 2022. The interviewer wants to hear you reason about it, not recite a comparison chart.

The Traditional Split

Data warehouses couple storage and compute into a managed service. You load structured data, it's optimized for SQL analytics, and you pay per query or per compute-second. Data lakes (object storage + Spark) store raw files in any format. You bring your own compute engine and pay for storage + compute separately.
DimensionData WarehouseData Lake
Data formatProprietary internal formatOpen (Parquet, ORC, Avro)
SchemaSchema-on-write enforcedSchema-on-read flexible
Query engineBuilt-in, optimizedBring your own (Spark, Trino, etc.)
Cost modelCompute-time or per-queryStorage + compute separately
Best forBI dashboards, SQL analyticsML, unstructured data, raw archives
GovernanceBuilt-in RBAC and auditRequires external catalog + policies

The Lakehouse Convergence

The lakehouse combines open file formats on object storage with warehouse-grade query performance. The term was coined by one vendor, but every major cloud provider and warehouse moved here. The idea: store everything as Parquet/Iceberg on your object storage, but expose it through a high-performance SQL engine with ACID guarantees.
This means the question "lake or warehouse?" is increasingly "which engine on top of open storage?" The data lives in one place. You choose compute based on the workload - SQL engine for dashboards, Spark for transformations, Python for ML.
Interview Question
  • "Your team has data in both a managed warehouse and S3. How would you unify the architecture?"
  • Good answer: migrate warehouse-only tables to Iceberg on S3, use the warehouse as a compute engine via external tables. Keep it for BI workloads, Spark for heavy ETL. One storage layer, multiple compute engines.

When the Warehouse Still Wins

For small-to-medium teams (< 20 data people) with primarily SQL workloads, a managed warehouse is hard to beat. You skip the operational overhead of managing catalogs, compaction, file optimization, and access control. A managed warehouse gets you from zero to production dashboards in days. The lake/lakehouse path makes sense when you need multi-engine access, have ML workloads, process unstructured data, or want to avoid vendor lock-in on storage.
TIP
Senior signal: frame your answer around team size and workload mix, not technology preference. "It depends on whether your team spends 80% of their time in SQL or 40% in Python" is more credible than "lakehouse is always better."
KEY TAKEAWAYS
The lake vs warehouse distinction is dissolving into "open storage + choice of compute engine"
Lakehouse architecture keeps data in open formats on object storage with warehouse-grade SQL access
Pure warehouses still win for small teams with SQL-only workloads - operational simplicity matters
The real decision: single-engine simplicity vs multi-engine flexibility

How Much Will This Cost?

Daily Life
Interviews

Model storage costs including lifecycle policies and format impact on cloud bills

Storage cost is the question that separates engineers who build pipelines from engineers who own pipelines. The interviewer wants to see that you think about money as a first-class engineering constraint.

S3 Storage Tiers

TierCost/GB/MonthRetrieval CostBest For
S3 Standard$0.023NoneFrequently queried data (< 90 days)
S3 Infrequent Access$0.0125$0.01/GBMonthly reports, 90-365 day data
S3 Glacier Instant$0.004$0.03/GBCompliance archives, rare queries
S3 Glacier Deep$0.00099$0.02/GB + 12hr waitRegulatory retention, never queried
A common production setup: 500 GB/day ingestion in Parquet. That's ~15 TB/month raw. With 2-year retention, you're looking at 360 TB. At S3 Standard pricing, that's $8,280/month. Move data older than 90 days to IA and older than 1 year to Glacier Instant, and the same 360 TB costs ~$2,100/month - a 75% reduction with a lifecycle policy that takes 20 minutes to configure.

Format Impact on Cost

File format directly affects your storage bill. A 1 TB/day JSON pipeline costs $23/GB × 30 TB = $690/month in S3 Standard. Convert to Parquet with Snappy compression (typical 5:1 ratio on JSON) and the same data costs $138/month. Switch to ZSTD compression (8:1 on this data) and you're at $86/month. The format conversion job that runs once saves $7,200/year.

Lifecycle Policies

Every production table should have a retention policy. Not "we'll clean it up later" - an automated lifecycle rule. S3 lifecycle policies can transition objects between tiers based on age and delete objects after a retention period. The biggest cost mistake in data engineering is storing everything forever in hot storage because nobody wrote the policy.
TIP
Senior signal: mention that you'd set up S3 Inventory reports and Storage Lens dashboards to track storage growth by prefix. Proactive cost monitoring beats quarterly bill shock every time.
KEY TAKEAWAYS
Lifecycle policies from hot → IA → Glacier can reduce storage costs 60-75%
File format choice (JSON vs Parquet + ZSTD) has 5-10x cost impact on the same data
S3 API costs (LIST, GET) are a hidden expense when you have millions of small files
Cost modeling is an engineering skill - build it into pipeline design, not as an afterthought

Parquet pop quiz, partitioning, and the physical layer

Category
Pipeline Architecture
Difficulty
intermediate
Duration
35 minutes
Challenges
0 hands-on challenges

Topics covered: Why Parquet?, How Would You Partition?, Delta or Iceberg?, Data Lake or Warehouse?, How Much Will This Cost?

Lesson Sections

  1. Why Parquet? (concepts: paColumnarVsRow)

    This is asked as a screener because it instantly reveals whether you've worked with production data at scale. The interviewer doesn't want "it's columnar." They want you to connect physical layout to the queries you actually run. Row vs. Columnar Layout CSV and JSON store data row-by-row. To answer "what's the average order amount?" on a 500-column table, a row-oriented reader must load all 500 columns into memory, skip 499 of them, and aggregate the one it needs. Parquet stores each column cont

  2. How Would You Partition? (concepts: paPartitioning)

    Partitioning is how you turn a 10 TB table scan into a 50 GB targeted read. The interviewer wants to hear your thought process for choosing a partition key - not just "partition by date." Choosing a Partition Key Start with how the data is queried. If 95% of queries filter on event_date, that's your partition key. If analysts always filter by region first, consider region. The goal is pruning: the query engine should eliminate partitions before reading any data. A table partitioned by date wit

  3. Delta or Iceberg? (concepts: paTableFormats)

    Both Delta Lake and Apache Iceberg add ACID transactions to files sitting on object storage. They solve the same core problem: Parquet files are immutable, so updates, deletes, and schema changes require a metadata layer. The interviewer wants you to know what each does well and where they diverge. The Core Problem They Solve Without a table format, a "table" is just a directory of Parquet files with a naming convention. There's no atomic commit - if a write fails halfway, you have partial dat

  4. Data Lake or Warehouse? (concepts: paDataLake)

    This question tests whether you understand the economics and tradeoffs, not just the definitions. The answer has shifted dramatically since 2022. The interviewer wants to hear you reason about it, not recite a comparison chart. The Traditional Split Data warehouses couple storage and compute into a managed service. You load structured data, it's optimized for SQL analytics, and you pay per query or per compute-second. Data lakes (object storage + Spark) store raw files in any format. You bring y

  5. How Much Will This Cost? (concepts: paCostOptimization)

    Storage cost is the question that separates engineers who build pipelines from engineers who own pipelines. The interviewer wants to see that you think about money as a first-class engineering constraint. S3 Storage Tiers A common production setup: 500 GB/day ingestion in Parquet. That's ~15 TB/month raw. With 2-year retention, you're looking at 360 TB. At S3 Standard pricing, that's $8,280/month. Move data older than 90 days to IA and older than 1 year to Glacier Instant, and the same 360 TB co