Where Data Lives: Beginner

'Why Parquet?' is the most common screening question in pipeline interviews. If you cannot answer it in 10 seconds, the interview is over before it starts. Beyond format choice, the interviewer will ask about partitioning, lake vs warehouse, and table formats. Here is every answer you need.

Columnar vs Row

Daily Life
Interviews

Answer 'Why Parquet?' as a screening question

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "Why Parquet over CSV?"
  • "Row-oriented vs columnar: explain the difference."
  • "How does a columnar format speed up queries?"

The 10-Second Answer

'Parquet stores data by column instead of by row. Analytical queries that only need 3 out of 100 columns read 97% less data. Same-type values grouped together compress 10x better. And row group statistics let the engine skip chunks without reading them.' Done. Three benefits, one sentence each. That is the answer.
What to Whiteboard
Row-Oriented (CSV, Postgres)
All columns for each row together
Fast for reading full records
Used by transactional databases
Must read all columns even for one
Columnar (Parquet, BigQuery)
All values of each column together
Fast for reading selected columns
Used by analytical systems
Excellent compression (same types)
The Curveball Follow-ups

After your initial answer, expect these probes

  • "When would you NOT use Parquet?" Transactional workloads (inserting single rows). Human inspection (CSV is readable). Very small files (Parquet overhead outweighs benefits under ~10 MB).
  • "How does compression work better in columnar?" A column of country codes (US, US, US, CA, US) is just 2 unique values repeated. Compression algorithms love that. A row mixing strings, numbers, and dates compresses poorly.
  • "What is predicate pushdown?" The engine reads column statistics (min/max per chunk) and skips chunks that cannot match the query filter. 'WHERE revenue > 1000' skips any chunk where max(revenue) < 1000.
KEY TAKEAWAYS
The 10-second answer: columnar storage, 10x compression, predicate pushdown
Row-oriented = transactions (Postgres). Columnar = analytics (BigQuery, Parquet).
Know the exception: CSV for small human-readable exchanges, Avro for streaming

Compression

Daily Life
Interviews

Answer compression algorithm questions

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "What compression do you use and why?"
  • "Snappy vs Gzip: when do you use each?"
  • "How does compression affect query performance?"

What They Want to Hear

'Snappy for interactive queries because it decompresses fast. Gzip for archival because it compresses more. Zstd is the emerging winner: Gzip-level compression at nearly Snappy-level speed.' That is the complete answer. They are testing whether you understand the speed-vs-ratio tradeoff, not whether you know the algorithms' internals.
AlgorithmSpeedCompressionSay This
SnappyVery fastModerate (2-4x)Default for analytics. Fast reads.
GzipSlowHigh (5-8x)Archival. Small files, slow queries.
ZstdFastHigh (4-7x)Best of both. Growing fast.
LZ4FastestLow (2-3x)Ultra-low latency streaming.
The Curveball Follow-ups

After your initial answer, expect these probes

  • "Why not always use the highest compression?" CPU cost. Gzip takes 5-10x more CPU to decompress than Snappy. On interactive queries, decompression time dominates. You trade smaller files for slower queries.
  • "Does compression affect write performance?" Yes. Higher compression = slower writes. For streaming ingestion, use Snappy or LZ4 to keep write latency low.
KEY TAKEAWAYS
Say: 'Snappy for queries, Gzip for archives, Zstd for best of both.'
The tradeoff is always speed vs ratio. Know which end your use case needs.
You do not need to know algorithms internally. Just the tradeoff.

Partitioning

Daily Life
Interviews

Answer partitioning questions and name the pitfall

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "How do you partition this table?"
  • "What is partition pruning?"
  • "What happens if you over-partition?"

What They Want to Hear

'I partition by the column that appears most often in WHERE clauses, usually date. Partition pruning lets the engine skip all other date folders entirely. A query for one day reads 1/365th of the data.' Then immediately add the pitfall: 'The risk is over-partitioning. Too many partitions create thousands of tiny files, which is actually slower than no partitioning at all because of per-file overhead.'
What to Whiteboard
filter pushdown99.7% skipped
WHERE date = '2024-03-19'
Partition Pruning
Skip all other date folders
Read 1 Partition
1/365th of the data
The Curveball Follow-ups

After your initial answer, expect these probes

  • "What is the small file problem?" Over-partitioning creates thousands of tiny files. Opening 10,000 files of 1 MB each is far slower than 10 files of 1 GB, even though total data is the same. Per-file overhead dominates.
  • "How big should each partition be?" Target 128 MB to 1 GB per partition. If partitions are smaller, coarsen the partition key (hourly to daily) or add compaction.
  • "Can you partition on two columns?" Yes, but be careful. date + region creates (365 x number_of_regions) partitions. If that is too many, partition by date only and use Z-ordering or sorting on region.
KEY TAKEAWAYS
Say: 'Partition by the most common filter column, usually date. Pruning skips irrelevant partitions entirely.'
Always mention the pitfall: over-partitioning creates the small file problem
Rule of thumb: each partition should be 128 MB to 1 GB

Lake vs Warehouse

Daily Life
Interviews

Distinguish lake, warehouse, and lakehouse in interviews

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "Data lake vs warehouse: what is the difference?"
  • "When would you use a lake instead of a warehouse?"
  • "What is a lakehouse?"

What They Want to Hear

'A lake stores raw data cheaply on object storage. A warehouse stores structured, query-optimized data in a purpose-built engine. A lakehouse puts a table format (Delta, Iceberg) on top of lake storage to get warehouse features without the warehouse cost.' Then the critical insight: 'Most modern platforms use both. The lake is the cheap source of truth. The warehouse materializes the hot queries.'
What to Whiteboard
Data Lake
Raw data on S3/GCS (cheap)
Schema-on-read (flexible)
Risk: becomes a data swamp
Best for: raw archive, ML training
Data Warehouse
Structured data, optimized engine
Schema-on-write (enforced)
Risk: expensive at scale
Best for: dashboards, SQL analytics
The Curveball Follow-ups

After your initial answer, expect these probes

  • "What is the lakehouse and why is it popular?" Lake cost + warehouse features. Add Delta Lake or Iceberg on top of S3 files and you get ACID transactions, schema enforcement, and time travel without paying for a separate warehouse.
  • "Would you use JUST a lake or JUST a warehouse?" Almost never. Raw data lands in the lake (cheap, flexible). Query-ready data is served from the warehouse or lakehouse (fast, governed). Both have a role.
  • "What is a data swamp?" A lake with no governance: no schema documentation, no ownership, no quality checks. Data goes in but nobody knows what is there or whether it is correct.
KEY TAKEAWAYS
Say: 'Lake for cheap raw storage, warehouse for fast queries, lakehouse for both via Delta/Iceberg.'
The smart answer: 'Most platforms use both. Lake is the archive, warehouse serves the hot queries.'
Always mention the swamp risk: a lake without governance is useless

Table Formats

Daily Life
Interviews

Answer table format questions with practical awareness

Interview Trigger Phrases

When you hear these in an interview, this is the concept being tested

  • "What is Delta Lake?"
  • "Delta vs Iceberg vs Hudi: what is the difference?"
  • "How do you get ACID on object storage?"

What They Want to Hear

'Raw Parquet on S3 has no transaction guarantees. A failed write corrupts the table. Table formats like Delta Lake and Apache Iceberg add a metadata layer that provides ACID transactions, time travel, and schema enforcement on top of Parquet files.' Then the cheat sheet: 'Delta if you use Databricks, Iceberg if you want engine-agnostic, Hudi if you need fast upserts.'
What Table Formats Add
Parquet
Columnar storageCompressionNo transactions
Metadata
Transaction logSchema historyPartition manifest
Features
ACID writesTime travelSchema evolution
FormatCreated BySay This
Delta LakeDatabricksTight Spark integration. Default if using Databricks.
Apache IcebergNetflixEngine-agnostic. Best partition evolution. Growing fast.
Apache HudiUberFastest upserts. Good for streaming into a lake.
The Curveball Follow-ups

After your initial answer, expect these probes

  • "Which would you choose?"'Depends on the engine. If we are on Databricks, Delta. If we want flexibility across Spark, Trino, and Presto, Iceberg. I would not choose based on features alone; ecosystem fit matters more.'
  • "What is time travel?" Query the table as it existed at a specific timestamp or version number. Useful for debugging, auditing, and reproducing ML training data.
  • "Can you migrate between formats?" Yes. In-place migration adds new metadata on top of existing files (fast, limited). Full rewrite reads and rewrites all data (slow, complete).
KEY TAKEAWAYS
Say: 'Table formats add ACID, time travel, and schema enforcement to files on object storage.'
Delta = Databricks, Iceberg = engine-agnostic, Hudi = upserts. Know the one-liner for each.
The pro move: 'I would choose based on ecosystem fit, not feature comparison.'

Answer the storage questions: Parquet, partitioning, lake vs warehouse

Category
Pipeline Architecture
Difficulty
beginner
Duration
20 minutes
Challenges
0 hands-on challenges

Topics covered: Columnar vs Row, Compression, Partitioning, Lake vs Warehouse, Table Formats

Lesson Sections

  1. Columnar vs Row (concepts: paColumnarVsRow)

    The 10-Second Answer 'Parquet stores data by column instead of by row. Analytical queries that only need 3 out of 100 columns read 97% less data. Same-type values grouped together compress 10x better. And row group statistics let the engine skip chunks without reading them.' Done. Three benefits, one sentence each. That is the answer.

  2. Compression (concepts: paCompression)

    What They Want to Hear 'Snappy for interactive queries because it decompresses fast. Gzip for archival because it compresses more. Zstd is the emerging winner: Gzip-level compression at nearly Snappy-level speed.' That is the complete answer. They are testing whether you understand the speed-vs-ratio tradeoff, not whether you know the algorithms' internals.

  3. Partitioning (concepts: paPartitioning)

    What They Want to Hear 'I partition by the column that appears most often in WHERE clauses, usually date. Partition pruning lets the engine skip all other date folders entirely. A query for one day reads 1/365th of the data.' Then immediately add the pitfall: 'The risk is over-partitioning. Too many partitions create thousands of tiny files, which is actually slower than no partitioning at all because of per-file overhead.'

  4. Lake vs Warehouse (concepts: paDataLake)

    What They Want to Hear 'A lake stores raw data cheaply on object storage. A warehouse stores structured, query-optimized data in a purpose-built engine. A lakehouse puts a table format (Delta, Iceberg) on top of lake storage to get warehouse features without the warehouse cost.' Then the critical insight: 'Most modern platforms use both. The lake is the cheap source of truth. The warehouse materializes the hot queries.'

  5. Table Formats (concepts: paTableFormats)

    What They Want to Hear 'Raw Parquet on S3 has no transaction guarantees. A failed write corrupts the table. Table formats like Delta Lake and Apache Iceberg add a metadata layer that provides ACID transactions, time travel, and schema enforcement on top of Parquet files.' Then the cheat sheet: 'Delta if you use Databricks, Iceberg if you want engine-agnostic, Hudi if you need fast upserts.'