Where Data Lives: Beginner

'Why Parquet?' is the most common screening question in pipeline interviews. If you cannot answer it in 10 seconds, the interview is over before it starts. Beyond format choice, the interviewer will ask about partitioning, lake vs warehouse, and table formats. Here is every answer you need.

What you will be able to do

Answer 'Why Parquet?' in 10 seconds flat

Explain partitioning and immediately name the pitfall (small files)

Distinguish lake vs warehouse vs lakehouse without rambling

Columnar vs Row

Daily Life

Interviews

Answer 'Why Parquet?' as a screening question

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

▸"Why Parquet over CSV?"
▸"Row-oriented vs columnar: explain the difference."
▸"How does a columnar format speed up queries?"

The 10-Second Answer

'Parquet stores data by column instead of by row. Analytical queries that only need 3 out of 100 columns read 97% less data. Same-type values grouped together compress 10x better. And row group statistics let the engine skip chunks without reading them.' Done. Three benefits, one sentence each. That is the answer.

✓Row-Oriented (CSV, Postgres)

All columns for each row together
Fast for reading full records
Used by transactional databases
Must read all columns even for one

✗Columnar (Parquet, BigQuery)

All values of each column together
Fast for reading selected columns
Used by analytical systems
Excellent compression (same types)

The Curveball Follow-ups

After your initial answer, expect these probes

▸"When would you NOT use Parquet?" Transactional workloads (inserting single rows). Human inspection (CSV is readable). Very small files (Parquet overhead outweighs benefits under ~10 MB).
▸"How does compression work better in columnar?" A column of country codes (US, US, US, CA, US) is just 2 unique values repeated. Compression algorithms love that. A row mixing strings, numbers, and dates compresses poorly.
▸"What is predicate pushdown?" The engine reads column statistics (min/max per chunk) and skips chunks that cannot match the query filter. 'WHERE revenue > 1000' skips any chunk where max(revenue) < 1000.

KEY TAKEAWAYS

The 10-second answer: columnar storage, 10x compression, predicate pushdown

Row-oriented = transactions (Postgres). Columnar = analytics (BigQuery, Parquet).

Know the exception: CSV for small human-readable exchanges, Avro for streaming

Compression

Daily Life

Interviews

Answer compression algorithm questions

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

▸"What compression do you use and why?"
▸"Snappy vs Gzip: when do you use each?"
▸"How does compression affect query performance?"

What They Want to Hear

'Snappy for interactive queries because it decompresses fast. Gzip for archival because it compresses more. Zstd is the emerging winner: Gzip-level compression at nearly Snappy-level speed.' That is the complete answer. They are testing whether you understand the speed-vs-ratio tradeoff, not whether you know the algorithms' internals.

Algorithm	Speed	Compression	Say This
Snappy	Very fast	Moderate (2-4x)	Default for analytics. Fast reads.
Gzip	Slow	High (5-8x)	Archival. Small files, slow queries.
Zstd	Fast	High (4-7x)	Best of both. Growing fast.
LZ4	Fastest	Low (2-3x)	Ultra-low latency streaming.

The Curveball Follow-ups

After your initial answer, expect these probes

▸"Why not always use the highest compression?" CPU cost. Gzip takes 5-10x more CPU to decompress than Snappy. On interactive queries, decompression time dominates. You trade smaller files for slower queries.
▸"Does compression affect write performance?" Yes. Higher compression = slower writes. For streaming ingestion, use Snappy or LZ4 to keep write latency low.

KEY TAKEAWAYS

Say: 'Snappy for queries, Gzip for archives, Zstd for best of both.'

The tradeoff is always speed vs ratio. Know which end your use case needs.

You do not need to know algorithms internally. Just the tradeoff.

Partitioning

Daily Life

Interviews

Answer partitioning questions and name the pitfall

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

▸"How do you partition this table?"
▸"What is partition pruning?"
▸"What happens if you over-partition?"

What They Want to Hear

'I partition by the column that appears most often in WHERE clauses, usually date. Partition pruning lets the engine skip all other date folders entirely. A query for one day reads 1/365th of the data.' Then immediately add the pitfall: 'The risk is over-partitioning. Too many partitions create thousands of tiny files, which is actually slower than no partitioning at all because of per-file overhead.'

Source

WHERE date = '2024-03-19'

Transform

Partition Pruning

Consumer

Read 1 Partition

What to Whiteboard

The Curveball Follow-ups

After your initial answer, expect these probes

▸"What is the small file problem?" Over-partitioning creates thousands of tiny files. Opening 10,000 files of 1 MB each is far slower than 10 files of 1 GB, even though total data is the same. Per-file overhead dominates.
▸"How big should each partition be?" Target 128 MB to 1 GB per partition. If partitions are smaller, coarsen the partition key (hourly to daily) or add compaction.
▸"Can you partition on two columns?" Yes, but be careful. date + region creates (365 x number_of_regions) partitions. If that is too many, partition by date only and use Z-ordering or sorting on region.

KEY TAKEAWAYS

Say: 'Partition by the most common filter column, usually date. Pruning skips irrelevant partitions entirely.'

Always mention the pitfall: over-partitioning creates the small file problem

Rule of thumb: each partition should be 128 MB to 1 GB

Lake vs Warehouse

Daily Life

Interviews

Distinguish lake, warehouse, and lakehouse in interviews

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

▸"Data lake vs warehouse: what is the difference?"
▸"When would you use a lake instead of a warehouse?"
▸"What is a lakehouse?"

What They Want to Hear

'A lake stores raw data cheaply on object storage. A warehouse stores structured, query-optimized data in a purpose-built engine. A lakehouse puts a table format (Delta, Iceberg) on top of lake storage to get warehouse features without the warehouse cost.' Then the critical insight: 'Most modern platforms use both. The lake is the cheap source of truth. The warehouse materializes the hot queries.'

✓Data Lake

Raw data on S3/GCS (cheap)
Schema-on-read (flexible)
Risk: becomes a data swamp
Best for: raw archive, ML training

✗Data Warehouse

Structured data, optimized engine
Schema-on-write (enforced)
Risk: expensive at scale
Best for: dashboards, SQL analytics

The Curveball Follow-ups

After your initial answer, expect these probes

▸"What is the lakehouse and why is it popular?" Lake cost + warehouse features. Add Delta Lake or Iceberg on top of S3 files and you get ACID transactions, schema enforcement, and time travel without paying for a separate warehouse.
▸"Would you use JUST a lake or JUST a warehouse?" Almost never. Raw data lands in the lake (cheap, flexible). Query-ready data is served from the warehouse or lakehouse (fast, governed). Both have a role.
▸"What is a data swamp?" A lake with no governance: no schema documentation, no ownership, no quality checks. Data goes in but nobody knows what is there or whether it is correct.

KEY TAKEAWAYS

Say: 'Lake for cheap raw storage, warehouse for fast queries, lakehouse for both via Delta/Iceberg.'

The smart answer: 'Most platforms use both. Lake is the archive, warehouse serves the hot queries.'

Always mention the swamp risk: a lake without governance is useless

Table Formats

Daily Life

Interviews

Answer table format questions with practical awareness

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

▸"What is Delta Lake?"
▸"Delta vs Iceberg vs Hudi: what is the difference?"
▸"How do you get ACID on object storage?"

What They Want to Hear

'Raw Parquet on S3 has no transaction guarantees. A failed write corrupts the table. Table formats like Delta Lake and Apache Iceberg add a metadata layer that provides ACID transactions, time travel, and schema enforcement on top of Parquet files.' Then the cheat sheet: 'Delta if you use Databricks, Iceberg if you want engine-agnostic, Hudi if you need fast upserts.'

What Table Formats Add

ParquetColumnar storage; Compression; No transactions
MetadataTransaction log; Schema history; Partition manifest
FeaturesACID writes; Time travel; Schema evolution

Format	Created By	Say This
Delta Lake	Databricks	Tight Spark integration. Default if using Databricks.
Apache Iceberg	Netflix	Engine-agnostic. Best partition evolution. Growing fast.
Apache Hudi	Uber	Fastest upserts. Good for streaming into a lake.

The Curveball Follow-ups

After your initial answer, expect these probes

▸"Which would you choose?"'Depends on the engine. If we are on Databricks, Delta. If we want flexibility across Spark, Trino, and Presto, Iceberg. I would not choose based on features alone; ecosystem fit matters more.'
▸"What is time travel?" Query the table as it existed at a specific timestamp or version number. Useful for debugging, auditing, and reproducing ML training data.
▸"Can you migrate between formats?" Yes. In-place migration adds new metadata on top of existing files (fast, limited). Full rewrite reads and rewrites all data (slow, complete).

KEY TAKEAWAYS

Say: 'Table formats add ACID, time travel, and schema enforcement to files on object storage.'

Delta = Databricks, Iceberg = engine-agnostic, Hudi = upserts. Know the one-liner for each.

The pro move: 'I would choose based on ecosystem fit, not feature comparison.'

Answer the storage questions: Parquet, partitioning, lake vs warehouse

Category: Pipeline Architecture
Difficulty: beginner
Duration: 20 minutes
Challenges: 0 hands-on challenges

Topics covered: Columnar vs Row, Compression, Partitioning, Lake vs Warehouse, Table Formats

Lesson Sections

Columnar vs Row (concepts: paColumnarVsRow)
The 10-Second Answer 'Parquet stores data by column instead of by row. Analytical queries that only need 3 out of 100 columns read 97% less data. Same-type values grouped together compress 10x better. And row group statistics let the engine skip chunks without reading them.' Done. Three benefits, one sentence each. That is the answer.
Compression (concepts: paCompression)
What They Want to Hear 'Snappy for interactive queries because it decompresses fast. Gzip for archival because it compresses more. Zstd is the emerging winner: Gzip-level compression at nearly Snappy-level speed.' That is the complete answer. They are testing whether you understand the speed-vs-ratio tradeoff, not whether you know the algorithms' internals.
Partitioning (concepts: paPartitioning)
What They Want to Hear 'I partition by the column that appears most often in WHERE clauses, usually date. Partition pruning lets the engine skip all other date folders entirely. A query for one day reads 1/365th of the data.' Then immediately add the pitfall: 'The risk is over-partitioning. Too many partitions create thousands of tiny files, which is actually slower than no partitioning at all because of per-file overhead.'
Lake vs Warehouse (concepts: paDataLake)
What They Want to Hear 'A lake stores raw data cheaply on object storage. A warehouse stores structured, query-optimized data in a purpose-built engine. A lakehouse puts a table format (Delta, Iceberg) on top of lake storage to get warehouse features without the warehouse cost.' Then the critical insight: 'Most modern platforms use both. The lake is the cheap source of truth. The warehouse materializes the hot queries.'
Table Formats (concepts: paTableFormats)
What They Want to Hear 'Raw Parquet on S3 has no transaction guarantees. A failed write corrupts the table. Table formats like Delta Lake and Apache Iceberg add a metadata layer that provides ACID transactions, time travel, and schema enforcement on top of Parquet files.' Then the cheat sheet: 'Delta if you use Databricks, Iceberg if you want engine-agnostic, Hudi if you need fast upserts.'