The Storage Question: Intermediate

"Why Parquet?" is the single most common screening question in data engineering interviews - over 40 mentions across top tech and finance companies. It's asked early because the answer immediately reveals whether you've operated at scale or only read about it. The interviewer isn't looking for a textbook definition of columnar storage. They want to hear you connect physical layout to query performance, compression ratios to cloud bills, and partition strategy to pipeline reliability. This lesson covers the five storage questions that surface in 75+ real interview recordings, from file format basics through cost modeling.

What you will be able to do

Explain columnar storage, compression, and predicate pushdown with real numbers

Choose partition keys using cardinality analysis and avoid the over-partition trap

Compare Delta Lake and Apache Iceberg on ACID guarantees, time travel, and schema evolution

Articulate when a data lake, warehouse, or lakehouse architecture fits

Model storage costs including lifecycle policies and format impact

Why Parquet?

Daily Life

Interviews

Explain why Parquet outperforms CSV/JSON for analytics - with numbers, not buzzwords

This is asked as a screener because it instantly reveals whether you've worked with production data at scale. The interviewer doesn't want "it's columnar." They want you to connect physical layout to the queries you actually run.

Row vs. Columnar Layout

Bronze (raw)

Silver (cleaned)

Gold (business)

raw

bronze

• S3

cleaned

silver

• Delta

business

gold

• warehouse

The medallion storage layering: bronze keeps raw immutable data, silver is cleaned/conformed, gold is business-ready aggregates serving BI. Each layer trades storage cost for query readiness.

CSV and JSON store data row-by-row. To answer "what's the average order amount?" on a 500-column table, a row-oriented reader must load all 500 columns into memory, skip 499 of them, and aggregate the one it needs. Parquet stores each column contiguously on disk. That same query reads exactly one column - roughly 0.2% of the total bytes for a wide table.

•Row Format (CSV/JSON)

Reads all columns to access one
Poor compression - mixed data types per block
No column statistics for query planning
Human-readable but bloated at scale

•Columnar Format (Parquet)

Reads only columns referenced in query
10:1 to 100:1 compression - same types cluster
Min/max stats enable predicate pushdown
Binary but 5-20x smaller on disk

Compression and Encoding

Parquet compresses well because columns contain homogeneous data. A status column with 5 distinct values across 100M rows uses dictionary encoding - each value stored once, then referenced by integer index. This routinely achieves 50:1 compression on low-cardinality string columns. Numeric columns use delta encoding (store differences between consecutive values) or run-length encoding for sorted data. A 200 GB CSV dataset commonly shrinks to 8-15 GB in Parquet with Snappy compression, or 5-10 GB with ZSTD.

Predicate Pushdown

Every Parquet file stores min/max statistics per column per row group (default ~128 MB uncompressed). When a query filters WHERE order_date > '2025-01-01', the engine reads the footer metadata first, checks the min/max stats, and skips entire row groups that can't contain matching rows. On a well-sorted file this eliminates 90%+ of I/O before a single data page is decompressed.

1	/* Spark: predicate pushdown in action */
2	/* Only reads row groups where amount could be > 1000 */
3	SELECT
4	customer_id,
5	amount
6	FROM orders_parquet
7	WHERE amount > 1000
8	AND order_date >= '2025-01-01' /* Check Spark UI: "files read" vs "files listed" / / shows how many files were skipped */

TIP

Senior signal: mention that predicate pushdown depends on how the data is sorted. If your date column is randomly distributed across row groups, the min/max stats won't help - every row group's range will span the full table. Sort your data on the most commonly filtered column before writing Parquet.

KEY TAKEAWAYS

Parquet's columnar layout means queries only read the columns they reference - critical for wide tables

Dictionary, delta, and RLE encoding routinely compress data 10-100x depending on cardinality

Predicate pushdown uses per-row-group min/max stats to skip irrelevant data before decompression

The combination of column pruning + pushdown + compression is why Parquet dominates analytical workloads

How Would You Partition?

Daily Life

Interviews

Choose partition keys using cardinality analysis and avoid the over-partition trap

Partitioning is how you turn a 10 TB table scan into a 50 GB targeted read. The interviewer wants to hear your thought process for choosing a partition key - not just "partition by date."

Choosing a Partition Key

Start with how the data is queried. If 95% of queries filter on event_date, that's your partition key. If analysts always filter by region first, consider region. The goal is pruning: the query engine should eliminate partitions before reading any data. A table partitioned by date with 3 years of daily partitions has ~1,095 partitions - a query for one day reads 0.09% of the data.

Interview Question

▸"You have a 500 GB/day event stream. How would you partition it in S3?"
▸They want: partition key analysis, cardinality reasoning, file size targets, and awareness of downstream query patterns.

Cardinality Analysis

The partition key's cardinality determines how many directories you create. Too few partitions (e.g., by continent - 7 values) means each partition is massive and you get minimal pruning benefit. Too many partitions (e.g., by user_id - 50M values) creates the small file problem: millions of tiny files that overwhelm the metastore and make listing operations painfully slow.

Partition Key	Cardinality	Files/Day	Avg File Size	Verdict
date	1	~10	50 GB	Too coarse - no pruning within day
date + hour	24	~240	2 GB	Good for hourly queries
date + hour + user_id	24 × 50M	millions	< 1 KB	Catastrophic - small file hell
date + region	~30	~300	1.7 GB	Sweet spot for region-filtered queries

The Over-Partition Trap

Hive metastore starts degrading around 100K partitions. Cloud catalog services have a soft limit of 10M partitions per table. But performance degrades long before you hit limits - listing 500K partitions takes 30+ seconds in most catalogs. The sweet spot is 128 MB to 1 GB per file, with partition counts in the low thousands for actively queried tables.

TIP

Senior signal: mention that you'd use Spark's repartition() or coalesce() after writing to consolidate small files. Say you'd add a compaction job that runs daily to merge files under 128 MB. This shows you've dealt with this in production.

KEY TAKEAWAYS

Partition key choice is driven by query patterns, not data structure

Cardinality between 100-10,000 is the sweet spot for most analytical tables

Over-partitioning creates small files that degrade listing performance and increase S3 API costs

File compaction is a required operational concern, not an optimization

Delta or Iceberg?

Daily Life

Interviews

Compare Delta Lake and Iceberg on ACID guarantees, time travel, and schema evolution

Both Delta Lake and Apache Iceberg add ACID transactions to files sitting on object storage. They solve the same core problem: Parquet files are immutable, so updates, deletes, and schema changes require a metadata layer. The interviewer wants you to know what each does well and where they diverge.

The Core Problem They Solve

Without a table format, a "table" is just a directory of Parquet files with a naming convention. There's no atomic commit - if a write fails halfway, you have partial data. There's no isolation - a reader might see a half-written batch. There's no schema enforcement - a new file could have different column names. Delta and Iceberg both fix this by adding a transaction log that tracks which files constitute the current table state.

•Delta Lake

JSON-based transaction log (_delta_log/)
Tight Spark integration - native in its parent platform
MERGE INTO for upserts since day one
Optimized for its parent platform ecosystem
Change Data Feed for downstream CDC
Log compaction via checkpoints every 10 commits

•Apache Iceberg

Manifest-based metadata (Avro manifest lists)
Engine-agnostic - Spark, Trino, Flink, Dremio
Hidden partitioning - partition evolution without rewrite
Snapshot-based branching and tagging
Time travel via snapshot isolation
Adopted by major cloud providers and query engines

Time Travel

Both formats support querying historical snapshots. Delta retains commits for 30 days by default (configurable). Iceberg keeps snapshots until you explicitly expire them. The mechanism differs: Delta replays the transaction log to reconstruct state, while Iceberg points directly to a snapshot's manifest list. For large tables (10K+ commits), Iceberg's approach is faster because it doesn't need to replay a log.

Schema Evolution

Both support adding columns, but Iceberg handles partition evolution - changing partition strategy without rewriting data. If you partition by day and later need by hour, Iceberg tracks both schemes and applies the correct one per file based on when it was written. Delta requires a full rewrite to change partitioning. This is a significant operational advantage for long-lived tables.

Interview Question

▸"When would you choose Iceberg over Delta Lake?"
▸Strong answer: multi-engine environment, need partition evolution, want vendor-neutral format. Mention the compute coupling tradeoff - Delta's platform-specific optimizations are real.

TIP

Senior signal: acknowledge that Delta open-sourced UniForm to write Iceberg-compatible metadata alongside Delta logs. The formats are converging. Say "the choice is increasingly about ecosystem, not features" - this shows you track the space actively.

KEY TAKEAWAYS

Delta and Iceberg both provide ACID transactions on top of Parquet files in object storage

Delta excels in platform-centric environments with tight Spark integration

Iceberg wins on engine-agnosticism and partition evolution without data rewrites

Time travel works differently - snapshot-based (Iceberg) vs log-replay (Delta)

The formats are converging; ecosystem fit matters more than feature comparison

Data Lake or Warehouse?

Daily Life

Interviews

Articulate when a data lake, warehouse, or lakehouse architecture fits - and why

This question tests whether you understand the economics and tradeoffs, not just the definitions. The answer has shifted dramatically since 2022. The interviewer wants to hear you reason about it, not recite a comparison chart.

The Traditional Split

Data warehouses couple storage and compute into a managed service. You load structured data, it's optimized for SQL analytics, and you pay per query or per compute-second. Data lakes (object storage + Spark) store raw files in any format. You bring your own compute engine and pay for storage + compute separately.

Dimension	Data Warehouse	Data Lake
Data format	Proprietary internal format	Open (Parquet, ORC, Avro)
Schema	Schema-on-write enforced	Schema-on-read flexible
Query engine	Built-in, optimized	Bring your own (Spark, Trino, etc.)
Cost model	Compute-time or per-query	Storage + compute separately
Best for	BI dashboards, SQL analytics	ML, unstructured data, raw archives
Governance	Built-in RBAC and audit	Requires external catalog + policies

The Lakehouse Convergence

The lakehouse combines open file formats on object storage with warehouse-grade query performance. The term was coined by one vendor, but every major cloud provider and warehouse moved here. The idea: store everything as Parquet/Iceberg on your object storage, but expose it through a high-performance SQL engine with ACID guarantees.

This means the question "lake or warehouse?" is increasingly "which engine on top of open storage?" The data lives in one place. You choose compute based on the workload - SQL engine for dashboards, Spark for transformations, Python for ML.

Interview Question

▸"Your team has data in both a managed warehouse and S3. How would you unify the architecture?"
▸Good answer: migrate warehouse-only tables to Iceberg on S3, use the warehouse as a compute engine via external tables. Keep it for BI workloads, Spark for heavy ETL. One storage layer, multiple compute engines.

When the Warehouse Still Wins

For small-to-medium teams (< 20 data people) with primarily SQL workloads, a managed warehouse is hard to beat. You skip the operational overhead of managing catalogs, compaction, file optimization, and access control. A managed warehouse gets you from zero to production dashboards in days. The lake/lakehouse path makes sense when you need multi-engine access, have ML workloads, process unstructured data, or want to avoid vendor lock-in on storage.

TIP

Senior signal: frame your answer around team size and workload mix, not technology preference. "It depends on whether your team spends 80% of their time in SQL or 40% in Python" is more credible than "lakehouse is always better."

KEY TAKEAWAYS

The lake vs warehouse distinction is dissolving into "open storage + choice of compute engine"

Lakehouse architecture keeps data in open formats on object storage with warehouse-grade SQL access

Pure warehouses still win for small teams with SQL-only workloads - operational simplicity matters

The real decision: single-engine simplicity vs multi-engine flexibility

How Much Will This Cost?

Daily Life

Interviews

Model storage costs including lifecycle policies and format impact on cloud bills

Storage cost is the question that separates engineers who build pipelines from engineers who own pipelines. The interviewer wants to see that you think about money as a first-class engineering constraint.

S3 Storage Tiers

Tier	Cost/GB/Month	Retrieval Cost	Best For
S3 Standard	$0.023	None	Frequently queried data (< 90 days)
S3 Infrequent Access	$0.0125	$0.01/GB	Monthly reports, 90-365 day data
S3 Glacier Instant	$0.004	$0.03/GB	Compliance archives, rare queries
S3 Glacier Deep	$0.00099	$0.02/GB + 12hr wait	Regulatory retention, never queried

A common production setup: 500 GB/day ingestion in Parquet. That's ~15 TB/month raw. With 2-year retention, you're looking at 360 TB. At S3 Standard pricing, that's $8,280/month. Move data older than 90 days to IA and older than 1 year to Glacier Instant, and the same 360 TB costs ~$2,100/month - a 75% reduction with a lifecycle policy that takes 20 minutes to configure.

Format Impact on Cost

File format directly affects your storage bill. A 1 TB/day JSON pipeline costs $23/GB × 30 TB = $690/month in S3 Standard. Convert to Parquet with Snappy compression (typical 5:1 ratio on JSON) and the same data costs $138/month. Switch to ZSTD compression (8:1 on this data) and you're at $86/month. The format conversion job that runs once saves $7,200/year.

Lifecycle Policies

Every production table should have a retention policy. Not "we'll clean it up later" - an automated lifecycle rule. S3 lifecycle policies can transition objects between tiers based on age and delete objects after a retention period. The biggest cost mistake in data engineering is storing everything forever in hot storage because nobody wrote the policy.

TIP

Senior signal: mention that you'd set up S3 Inventory reports and Storage Lens dashboards to track storage growth by prefix. Proactive cost monitoring beats quarterly bill shock every time.

KEY TAKEAWAYS

Lifecycle policies from hot → IA → Glacier can reduce storage costs 60-75%

File format choice (JSON vs Parquet + ZSTD) has 5-10x cost impact on the same data

S3 API costs (LIST, GET) are a hidden expense when you have millions of small files

Cost modeling is an engineering skill - build it into pipeline design, not as an afterthought

Parquet pop quiz, partitioning, and the physical layer

Category: Pipeline Architecture
Difficulty: intermediate
Duration: 35 minutes
Challenges: 0 hands-on challenges

Topics covered: Why Parquet?, How Would You Partition?, Delta or Iceberg?, Data Lake or Warehouse?, How Much Will This Cost?

Lesson Sections

Why Parquet? (concepts: paColumnarVsRow)
This is asked as a screener because it instantly reveals whether you've worked with production data at scale. The interviewer doesn't want "it's columnar." They want you to connect physical layout to the queries you actually run. Row vs. Columnar Layout CSV and JSON store data row-by-row. To answer "what's the average order amount?" on a 500-column table, a row-oriented reader must load all 500 columns into memory, skip 499 of them, and aggregate the one it needs. Parquet stores each column cont
How Would You Partition? (concepts: paPartitioning)
Partitioning is how you turn a 10 TB table scan into a 50 GB targeted read. The interviewer wants to hear your thought process for choosing a partition key - not just "partition by date." Choosing a Partition Key Start with how the data is queried. If 95% of queries filter on event_date, that's your partition key. If analysts always filter by region first, consider region. The goal is pruning: the query engine should eliminate partitions before reading any data. A table partitioned by date wit
Delta or Iceberg? (concepts: paTableFormats)
Both Delta Lake and Apache Iceberg add ACID transactions to files sitting on object storage. They solve the same core problem: Parquet files are immutable, so updates, deletes, and schema changes require a metadata layer. The interviewer wants you to know what each does well and where they diverge. The Core Problem They Solve Without a table format, a "table" is just a directory of Parquet files with a naming convention. There's no atomic commit - if a write fails halfway, you have partial dat
Data Lake or Warehouse? (concepts: paDataLake)
This question tests whether you understand the economics and tradeoffs, not just the definitions. The answer has shifted dramatically since 2022. The interviewer wants to hear you reason about it, not recite a comparison chart. The Traditional Split Data warehouses couple storage and compute into a managed service. You load structured data, it's optimized for SQL analytics, and you pay per query or per compute-second. Data lakes (object storage + Spark) store raw files in any format. You bring y
How Much Will This Cost? (concepts: paCostOptimization)
Storage cost is the question that separates engineers who build pipelines from engineers who own pipelines. The interviewer wants to see that you think about money as a first-class engineering constraint. S3 Storage Tiers A common production setup: 500 GB/day ingestion in Parquet. That's ~15 TB/month raw. With 2-year retention, you're looking at 360 TB. At S3 Standard pricing, that's $8,280/month. Move data older than 90 days to IA and older than 1 year to Glacier Instant, and the same 360 TB co