Where Data Lives: Beginner
'Why Parquet?' is the most common screening question in pipeline interviews. If you cannot answer it in 10 seconds, the interview is over before it starts. Beyond format choice, the interviewer will ask about partitioning, lake vs warehouse, and table formats. Here is every answer you need.
Columnar vs Row
Answer 'Why Parquet?' as a screening question
When you hear these in an interview, this is the concept being tested
- "Why Parquet over CSV?"
- "Row-oriented vs columnar: explain the difference."
- "How does a columnar format speed up queries?"
The 10-Second Answer
After your initial answer, expect these probes
- "When would you NOT use Parquet?" Transactional workloads (inserting single rows). Human inspection (CSV is readable). Very small files (Parquet overhead outweighs benefits under ~10 MB).
- "How does compression work better in columnar?" A column of country codes (US, US, US, CA, US) is just two unique values repeated. Dictionary and run-length encoding shrink that to almost nothing. A row mixing strings, numbers, and dates compresses poorly.
- "What is predicate pushdown?" The engine reads column statistics (min/max per chunk) and skips chunks that cannot match the query filter. 'WHERE revenue > 1000' skips any chunk where max(revenue) <= 1000.
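If the interviewer asks you to show how statistics-based skipping works, a toy model is enough. The sketch below is not a Parquet reader: `RowGroup` and `scan_revenue_greater_than` are hypothetical names, and the min/max values are computed on the fly here, whereas a real file stores them precomputed in its metadata. Because the filter is a strict `>`, a group whose maximum equals the threshold is skipped too.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """Toy stand-in for a row group: one column of values plus its max."""
    revenue: List[int]

    @property
    def max_revenue(self) -> int:
        # A real file stores this statistic; computing it here keeps the toy short.
        return max(self.revenue)


def scan_revenue_greater_than(
    groups: List[RowGroup], threshold: int
) -> Tuple[List[int], int]:
    """Return the matching values and how many row groups were actually read."""
    matches: List[int] = []
    groups_read = 0
    for group in groups:
        # If even the largest value fails the filter, no row can match: skip.
        if group.max_revenue <= threshold:
            continue
        groups_read += 1
        matches.extend(v for v in group.revenue if v > threshold)
    return matches, groups_read


groups = [
    RowGroup([120, 450, 980]),    # max 980: skipped
    RowGroup([300, 1000, 700]),   # max 1000: skipped, the filter is strict
    RowGroup([900, 1500, 2200]),  # max 2200: read
]
matches, groups_read = scan_revenue_greater_than(groups, 1000)
print(matches, groups_read)
```

Only one of the three groups is read. That is the whole point to make out loud: the filter was answered mostly from metadata.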
Compression
Answer compression algorithm questions
When you hear these in an interview, this is the concept being tested
- "What compression do you use and why?"
- "Snappy vs Gzip: when do you use each?"
- "How does compression affect query performance?"
What They Want to Hear
| Algorithm | Speed | Compression Ratio | Say This |
|---|---|---|---|
| Snappy | Very fast | Moderate (2-4x) | Default for analytics. Fast reads. |
| Gzip | Slow | High (5-8x) | Archival. Small files, slow queries. |
| Zstd | Fast | High (4-7x) | Best of both. Growing fast. |
| LZ4 | Fastest | Low (2-3x) | Ultra-low latency streaming. |
After your initial answer, expect these probes
- "Why not always use the highest compression?" CPU cost. Gzip costs several times more CPU to decompress than Snappy. On interactive queries, decompression time can dominate. You trade smaller files for slower queries.
- "Does compression affect write performance?" Yes. Higher compression means slower writes. For streaming ingestion, use Snappy or LZ4 to keep write latency low.
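You can see the speed-vs-ratio tradeoff on your own machine with nothing but the standard library. The sketch below uses two zlib levels as stand-ins for a "fast" and a "small" codec. That is an assumption made for portability: zlib is the algorithm behind Gzip, and Snappy, Zstd, and LZ4 are not in the Python standard library. The synthetic data and the `measure` helper are hypothetical, and timings will vary by machine.

```python
import random
import time
import zlib

# Synthetic "country,revenue" rows; the seed keeps the payload reproducible.
rng = random.Random(0)
codes = ["US", "CA", "MX", "GB", "DE"]
rows = [f"{rng.choice(codes)},{rng.randint(0, 5000)}\n" for _ in range(200_000)]
payload = "".join(rows).encode("utf-8")


def measure(level: int):
    """Compress the payload at a zlib level; return (blob, ratio, seconds)."""
    start = time.perf_counter()
    blob = zlib.compress(payload, level)
    elapsed = time.perf_counter() - start
    return blob, len(payload) / len(blob), elapsed


fast_blob, fast_ratio, fast_secs = measure(1)     # favors speed
small_blob, small_ratio, small_secs = measure(9)  # favors size

print(f"level 1: ratio {fast_ratio:.2f}x, compress {fast_secs * 1000:.1f} ms")
print(f"level 9: ratio {small_ratio:.2f}x, compress {small_secs * 1000:.1f} ms")
```

Both blobs decompress to the same bytes. What changes is how much CPU you spend to get there, which is exactly the tradeoff the interviewer is probing.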
Partitioning
Answer partitioning questions and name the pitfall
When you hear these in an interview, this is the concept being tested
- "How do you partition this table?"
- "What is partition pruning?"
- "What happens if you over-partition?"
What They Want to Hear
After your initial answer, expect these probes
- "What is the small file problem?" Over-partitioning creates thousands of tiny files. Opening 10,000 files of 1 MB each is far slower than 10 files of 1 GB, even though total data is the same. Per-file overhead dominates.
- "How big should each partition be?" Target 128 MB to 1 GB per partition. If partitions are smaller, coarsen the partition key (hourly to daily) or add compaction.
- "Can you partition on two columns?" Yes, but be careful. date + region creates (365 x number_of_regions) partitions per year. If that is too many, partition by date only and use Z-ordering or sorting on region.
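The partition arithmetic and the pruning step are easy to sketch. The example below is a minimal model, assuming a Hive-style directory layout (`date=.../region=...`). The `events/` prefix, the three regions, and the helper names are hypothetical, and no real engine or filesystem is involved.

```python
from datetime import date, timedelta
from typing import List

REGIONS = ["us", "eu", "apac"]  # hypothetical region values


def partition_paths(start: date, days: int, regions: List[str]) -> List[str]:
    """Build one Hive-style partition directory per (date, region) pair."""
    return [
        f"events/date={start + timedelta(days=offset)}/region={region}"
        for offset in range(days)
        for region in regions
    ]


def prune(paths: List[str], wanted_date: date) -> List[str]:
    """Keep only the partitions whose date key matches the filter value."""
    needle = f"/date={wanted_date}/"
    return [path for path in paths if needle in path]


all_paths = partition_paths(date(2023, 1, 1), 365, REGIONS)
pruned = prune(all_paths, date(2023, 6, 15))

print(len(all_paths))  # 365 days x 3 regions
print(pruned)
```

One year of date + region with three regions is already 1,095 partitions, and a single-day filter touches only 3 of them. Swap in 50 regions and hourly data and you can see how quickly the small file problem appears.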
Lake vs Warehouse
Distinguish lake, warehouse, and lakehouse in interviews
When you hear these in an interview, this is the concept being tested
- "Data lake vs warehouse: what is the difference?"
- "When would you use a lake instead of a warehouse?"
- "What is a lakehouse?"
What They Want to Hear
After your initial answer, expect these probes
- "What is the lakehouse and why is it popular?" Lake cost + warehouse features. Add Delta Lake or Iceberg on top of S3 files and you get ACID transactions, schema enforcement, and time travel without paying for a separate warehouse.
- "Would you use JUST a lake or JUST a warehouse?" Almost never. Raw data lands in the lake (cheap, flexible). Query-ready data is served from the warehouse or lakehouse (fast, governed). Both have a role.
- "What is a data swamp?" A lake with no governance: no schema documentation, no ownership, no quality checks. Data goes in but nobody knows what is there or whether it is correct.
Table Formats
Answer table format questions with practical awareness
When you hear these in an interview, this is the concept being tested
- "What is Delta Lake?"
- "Delta vs Iceberg vs Hudi: what is the difference?"
- "How do you get ACID on object storage?"
What They Want to Hear
| Format | Created By | Say This |
|---|---|---|
| Delta Lake | Databricks | Tight Spark integration. Default if using Databricks. |
| Apache Iceberg | Netflix | Engine-agnostic. Best partition evolution. Growing fast. |
| Apache Hudi | Uber | Fastest upserts. Good for streaming into a lake. |
After your initial answer, expect these probes
- "Which would you choose?" 'Depends on the engine. If we are on Databricks, Delta. If we want flexibility across Spark, Trino, and Presto, Iceberg. I would not choose based on features alone; ecosystem fit matters more.'
- "What is time travel?" Query the table as it existed at a specific timestamp or version number. Useful for debugging, auditing, and reproducing ML training data.
- "Can you migrate between formats?" Yes. In-place migration adds new metadata on top of existing files (fast, limited). Full rewrite reads and rewrites all data (slow, complete).
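If you are asked how a metadata layer delivers ACID and time travel on top of plain files, a toy commit log makes the idea concrete. The sketch below is a deliberately simplified model, not the actual Delta Lake, Iceberg, or Hudi metadata layout. `ToyTableLog` and the file names are hypothetical. The key assumption is that readers only trust files listed in committed log entries, so a write that never commits is simply invisible.

```python
from typing import Dict, List, Optional, Sequence, Set


class ToyTableLog:
    """Append-only commit log; each commit lists data files added and removed."""

    def __init__(self) -> None:
        self._commits: List[Dict[str, List[str]]] = []

    def commit(self, add: Sequence[str], remove: Sequence[str] = ()) -> int:
        """Record one atomic change and return its version number."""
        self._commits.append({"add": list(add), "remove": list(remove)})
        return len(self._commits) - 1

    def snapshot(self, version: Optional[int] = None) -> Set[str]:
        """Replay the log up to a version to find the live data files."""
        if version is None:
            version = len(self._commits) - 1
        live: Set[str] = set()
        for entry in self._commits[: version + 1]:
            live |= set(entry["add"])
            live -= set(entry["remove"])
        return live


log = ToyTableLog()
v0 = log.commit(add=["part-0.parquet"])
v1 = log.commit(add=["part-1.parquet"])
# Compaction: one commit swaps two small files for a single larger one.
v2 = log.commit(
    add=["part-2.parquet"],
    remove=["part-0.parquet", "part-1.parquet"],
)

print(sorted(log.snapshot()))    # current table state
print(sorted(log.snapshot(v1)))  # time travel to version 1
```

Reading at an older version is just replaying fewer commits, which is why time travel comes almost for free once the log exists.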
Answer the storage questions: Parquet, partitioning, lake vs warehouse
- Category: Pipeline Architecture
- Difficulty: beginner
- Duration: 20 minutes
- Challenges: 0 hands-on challenges
Topics covered: Columnar vs Row, Compression, Partitioning, Lake vs Warehouse, Table Formats
Lesson Sections
- Columnar vs Row (concepts: paColumnarVsRow)
The 10-Second Answer: 'Parquet stores data by column instead of by row. An analytical query that needs only 3 of 100 columns skips roughly 97% of the data. Values of the same type stored together compress far better than mixed rows. And row group statistics let the engine skip chunks without reading them.' Done. Three benefits, one sentence each. That is the answer.
- Compression (concepts: paCompression)
What They Want to Hear: 'Snappy for interactive queries because it decompresses fast. Gzip for archival because it compresses more. Zstd is the emerging winner: Gzip-level compression at nearly Snappy-level speed.' That is the complete answer. They are testing whether you understand the speed-vs-ratio tradeoff, not whether you know the algorithms' internals.
- Partitioning (concepts: paPartitioning)
What They Want to Hear: 'I partition by the column that appears most often in WHERE clauses, usually date. Partition pruning lets the engine skip all other date folders entirely. With a year of data, a query for one day reads roughly 1/365th of it.' Then immediately add the pitfall: 'The risk is over-partitioning. Too many partitions create thousands of tiny files, which can be slower than no partitioning at all because of per-file overhead.'
- Lake vs Warehouse (concepts: paDataLake)
What They Want to Hear: 'A lake stores raw data cheaply on object storage. A warehouse stores structured, query-optimized data in a purpose-built engine. A lakehouse puts a table format (Delta, Iceberg) on top of lake storage to get warehouse features without the warehouse cost.' Then the critical insight: 'Most modern platforms use both. The lake is the cheap source of truth. The warehouse materializes the hot queries.'
- Table Formats (concepts: paTableFormats)
What They Want to Hear: 'Raw Parquet on S3 has no transaction guarantees. A failed write can leave partial files behind, so readers see an inconsistent table. Table formats like Delta Lake and Apache Iceberg add a metadata layer that provides ACID transactions, time travel, and schema enforcement on top of Parquet files.' Then the cheat sheet: 'Delta if you use Databricks, Iceberg if you want engine-agnostic, Hudi if you need fast upserts.'