

Where Data Lives

Answer the storage questions: Parquet, partitioning, lake vs warehouse

Category: Pipeline Architecture
Difficulty: Beginner
Duration: 20 minutes
Challenges: 0 hands-on challenges

Topics covered: Columnar vs Row, Compression, Partitioning, Lake vs Warehouse, Table Formats

Lesson Sections

  1. Columnar vs Row (concepts: paColumnarVsRow)

    The 10-Second Answer 'Parquet stores data by column instead of by row. Analytical queries that only need 3 out of 100 columns read 97% less data. Same-type values grouped together compress 10x better. And row group statistics let the engine skip chunks without reading them.' Done. Three benefits, one sentence each. That is the answer.
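The column-pruning benefit can be sketched with toy arithmetic (not a real Parquet reader; the table dimensions are illustrative assumptions matching the 3-of-100-columns example above):

```python
# Illustrative sketch: bytes a query touches under row vs column layout.
ROWS, COLS, BYTES_PER_VALUE = 1_000_000, 100, 8

def bytes_read_row_layout(cols_needed: int) -> int:
    # Row storage interleaves all columns, so every row's full width
    # is read even when the query needs only a few columns.
    return ROWS * COLS * BYTES_PER_VALUE

def bytes_read_column_layout(cols_needed: int) -> int:
    # Column storage keeps each column contiguous on disk, so only the
    # requested columns are read at all.
    return ROWS * cols_needed * BYTES_PER_VALUE

row_bytes = bytes_read_row_layout(3)
col_bytes = bytes_read_column_layout(3)
print(f"row layout:    {row_bytes:,} bytes")
print(f"column layout: {col_bytes:,} bytes")
print(f"savings:       {1 - col_bytes / row_bytes:.0%}")  # 97%
```

Reading 3 of 100 equal-width columns touches 3% of the bytes, which is where the "97% less data" figure comes from.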

  2. Compression (concepts: paCompression)

    What They Want to Hear 'Snappy for interactive queries because it decompresses fast. Gzip for archival because it compresses more. Zstd is the emerging winner: Gzip-level compression at nearly Snappy-level speed.' That is the complete answer. They are testing whether you understand the speed-vs-ratio tradeoff, not whether you know the algorithms' internals.
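The speed-vs-ratio tradeoff is easy to demonstrate. Snappy and Zstd are not in Python's standard library, so this sketch uses zlib's own compression levels as a stand-in: level 1 plays the "fast" role and level 9 the "maximum ratio" role. Same-type, repetitive data (like a Parquet column chunk of dates) compresses far better than mixed rows would:

```python
import time
import zlib

# Repetitive, same-type values, as grouped in a Parquet column chunk.
data = b"".join(f"2024-01-{d:02d}".encode() for d in range(1, 29)) * 10_000

for level, role in [(1, "fast, snappy-like role"), (9, "max ratio, archival role")]:
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed_ms = (time.perf_counter() - start) * 1000
    ratio = len(data) / len(compressed)
    print(f"level {level} ({role}): {ratio:.0f}x in {elapsed_ms:.1f} ms")
```

On a typical run the high level squeezes out a better ratio while the low level finishes faster, which is exactly the interactive-vs-archival choice described above.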

  3. Partitioning (concepts: paPartitioning)

    What They Want to Hear 'I partition by the column that appears most often in WHERE clauses, usually date. Partition pruning lets the engine skip all other date folders entirely. A query for one day reads 1/365th of the data.' Then immediately add the pitfall: 'The risk is over-partitioning. Too many partitions create thousands of tiny files, which is actually slower than no partitioning at all because of per-file overhead.'
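Partition pruning can be sketched as pure metadata filtering: the engine never opens a file in a skipped partition. The bucket path and layout below are made-up examples in the Hive-style `date=` convention:

```python
from datetime import date

# Toy partition layout: one "folder" per day, each listing its data files.
partitions = {
    date(2024, 1, d): [f"s3://bucket/events/date=2024-01-{d:02d}/part-0.parquet"]
    for d in range(1, 32)
}

def files_to_scan(predicate_date: date) -> list[str]:
    # Partition pruning: compare the WHERE predicate against partition
    # metadata only; folders that fail the predicate are skipped entirely,
    # without opening a single file inside them.
    return [f for part_date, files in partitions.items()
            if part_date == predicate_date
            for f in files]

total = sum(len(files) for files in partitions.values())
scanned = files_to_scan(date(2024, 1, 15))
print(f"scanned {len(scanned)} of {total} files")  # 1 of 31
```

The over-partitioning pitfall falls out of the same model: partition by `(date, user_id)` instead and `partitions` explodes into millions of one-row folders, so per-file open/seek overhead dominates the pruning win.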

  4. Lake vs Warehouse (concepts: paDataLake)

    What They Want to Hear 'A lake stores raw data cheaply on object storage. A warehouse stores structured, query-optimized data in a purpose-built engine. A lakehouse puts a table format (Delta, Iceberg) on top of lake storage to get warehouse features without the warehouse cost.' Then the critical insight: 'Most modern platforms use both. The lake is the cheap source of truth. The warehouse materializes the hot queries.'
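The "use both" pattern can be sketched as a toy ELT flow, with dicts standing in for object storage and the warehouse (all paths and schemas here are invented for illustration): the lake keeps raw events as the cheap source of truth, and one scan materializes the hot aggregate into the warehouse so dashboards never rescan the lake.

```python
# Lake: raw, append-only objects -- the cheap source of truth.
lake = {
    "s3://lake/events/2024-01-01.json": [{"user": "a", "amount": 10},
                                         {"user": "b", "amount": 5}],
    "s3://lake/events/2024-01-02.json": [{"user": "a", "amount": 7}],
}

def materialize_daily_revenue(lake_objects: dict) -> dict:
    # ELT step: scan the lake once, write the aggregate to the warehouse.
    revenue: dict[str, int] = {}
    for path, events in lake_objects.items():
        day = path.rsplit("/", 1)[-1].removesuffix(".json")
        revenue[day] = sum(e["amount"] for e in events)
    return revenue

# Warehouse: structured, query-optimized copies of the hot queries.
warehouse = {"daily_revenue": materialize_daily_revenue(lake)}
print(warehouse["daily_revenue"]["2024-01-01"])  # 15
```

A lakehouse collapses the two boxes: the table format makes the lake itself queryable enough that the hot aggregate can live beside the raw data.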

  5. Table Formats (concepts: paTableFormats)

    What They Want to Hear 'Raw Parquet on S3 has no transaction guarantees. A failed write corrupts the table. Table formats like Delta Lake and Apache Iceberg add a metadata layer that provides ACID transactions, time travel, and schema enforcement on top of Parquet files.' Then the cheat sheet: 'Delta if you use Databricks, Iceberg if you want engine-agnostic, Hudi if you need fast upserts.'
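The metadata-layer idea can be sketched in a few lines, in the spirit of Delta/Iceberg but greatly simplified (this is not either format's actual protocol): data files are immutable, a commit is a single atomic append to an ordered log, and readers replay the log to a version, which gives consistent snapshots and time travel for free.

```python
log: list[dict] = []           # the transaction log: one entry per commit
storage: dict[str, list] = {}  # fake object store: path -> rows

def commit(added_files: dict[str, list]) -> int:
    # Data files land in storage first; the table only "sees" them once
    # the single log append succeeds. A crash before that point leaves
    # readers on the previous version -- no partially written table.
    storage.update(added_files)
    log.append({"version": len(log), "files": sorted(storage)})
    return len(log) - 1

def snapshot(version: int) -> list:
    # Time travel: read exactly the file list recorded at `version`.
    return [row for path in log[version]["files"] for row in storage[path]]

v0 = commit({"part-0.parquet": [1, 2]})
v1 = commit({"part-1.parquet": [3]})
print(len(snapshot(v0)), len(snapshot(v1)))  # 2 3
```

Raw Parquet on S3 lacks exactly this log: a writer that dies after uploading half its files leaves readers with no way to tell which files belong to the table.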

Related

  • All Lessons
  • Practice Problems
  • Mock Interview Practice
  • Daily Challenges