Where Data Lives
System design depth: partition strategy, cost modeling, lakehouse
Lesson Sections
- Partition Design (concepts: paPartitioning)
What They Want to Hear 'I start from query patterns, not data structure. Which columns appear in WHERE clauses? What is the data volume per partition? Is there skew?' Then walk through: 'For this use case, I would partition by date because 90% of queries filter on date range. Each daily partition holds ~500 MB, which is large enough to avoid a small-files problem and small enough to scan quickly. If one customer generates 80% of events, I would add hash bucketing within each partition on a high-cardinality key such as event ID, or on a salted customer key, so that customer's rows spread evenly across buckets.'
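The three questions in that answer (which columns get filtered, how big partitions are, whether there is skew) can be checked against a query log and a sample of keys. The following is a minimal sketch under stated assumptions: the column names, the sample data, and the helper names are all hypothetical, and the bucket key is an event ID rather than the customer ID, because hashing the hot customer's ID would send all of its rows to a single bucket.

```python
import hashlib
from collections import Counter

def filter_column_share(queries):
    """Fraction of queries that filter on each column."""
    counts = Counter(col for cols in queries for col in set(cols))
    return {col: n / len(queries) for col, n in counts.items()}

def top_key_share(keys):
    """Share of rows owned by the most frequent key (a simple skew signal)."""
    counts = Counter(keys)
    return max(counts.values()) / len(keys)

def bucket_for(event_id, num_buckets):
    """Stable hash bucket derived from a high-cardinality key."""
    digest = hashlib.md5(event_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets

# Hypothetical query log: the columns each query filters on.
queries = [["event_date"]] * 6 + [["event_date", "customer_id"]] * 3 + [["customer_id"]]
share = filter_column_share(queries)

# Hypothetical key sample: one customer owns 8 of 10 rows.
customer_keys = ["acme"] * 8 + ["beta", "gamma"]
skew = top_key_share(customer_keys)

# Spread the rows across 8 buckets by event ID.
bucket_counts = Counter(bucket_for(f"evt-{i}", 8) for i in range(1000))

print(share, skew, sorted(bucket_counts.items()))
```

Here `event_date` appears in 9 of 10 queries, which supports partitioning by date, and the skew check flags the dominant customer before any bucketing scheme is chosen.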
- Cost Modeling (concepts: paDataLake)
What They Want to Hear 'Storage is typically 40-60% of platform budget, compute 30-40%, and network 5-15%. The biggest cost lever is data retention: archiving data older than 90 days to a cold tier such as Glacier can cut storage bills by 60-80%. The second lever is right-sizing compute: incremental models instead of full reloads can cut compute by 5-10x.'
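The retention claim is easy to sanity-check with arithmetic. The sketch below assumes a year of data at a constant daily volume and two illustrative per-GB monthly prices. The prices, the volume, and the function name are placeholders chosen for the example, not quoted rates, and retrieval and transition fees are ignored.

```python
HOT_PRICE = 0.023      # assumed $/GB-month for the hot tier (placeholder)
ARCHIVE_PRICE = 0.004  # assumed $/GB-month for the archive tier (placeholder)

def monthly_storage_cost(total_days, gb_per_day, hot_days=None):
    """Monthly bill for `total_days` of retained data.

    If `hot_days` is given, only the newest `hot_days` stay in the hot tier
    and everything older is billed at the archive price.
    """
    if hot_days is None or hot_days >= total_days:
        return total_days * gb_per_day * HOT_PRICE
    archived_days = total_days - hot_days
    return gb_per_day * (hot_days * HOT_PRICE + archived_days * ARCHIVE_PRICE)

baseline = monthly_storage_cost(365, gb_per_day=100)
tiered = monthly_storage_cost(365, gb_per_day=100, hot_days=90)
savings = 1 - tiered / baseline

print(f"baseline=${baseline:.2f} tiered=${tiered:.2f} savings={savings:.0%}")
```

Under these assumed prices the saving comes out at about 62%, at the low end of the quoted range. The result depends only on the price ratio and the share of data older than the cutoff, so longer retention windows push it higher.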
- Format Migration (concepts: paTableFormats)
What They Want to Hear 'Two approaches. In-place migration adds Iceberg metadata on top of the existing Parquet files. It is fast but limited: some features, like Z-ordering, require a full rewrite. For zero downtime, I use dual-write: write to both the old and new formats during a transition period, validate that queries produce identical results, switch reads to the new format, then decommission the old one.'
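The dual-write sequence (write to both, validate, switch reads) can be sketched as a small state machine. This is a toy model, assuming in-memory lists stand in for the old and new tables and that rows are hashable tuples. The class and method names are hypothetical and do not correspond to any table-format API.

```python
from collections import Counter

class DualWriteTable:
    """Toy stand-in for a table being migrated between two formats."""

    def __init__(self):
        self.old_rows = []       # stands in for the legacy-format table
        self.new_rows = []       # stands in for the new-format table
        self.read_from = "old"   # reads stay on the old copy until cut-over

    def write(self, row):
        # During the transition every write lands in both copies.
        self.old_rows.append(row)
        self.new_rows.append(row)

    def query(self, predicate, source=None):
        rows = self.old_rows if (source or self.read_from) == "old" else self.new_rows
        return [r for r in rows if predicate(r)]

    def results_match(self, predicate):
        # Compare as multisets so row order does not cause false mismatches.
        return Counter(self.query(predicate, "old")) == Counter(self.query(predicate, "new"))

    def cut_over(self, validation_predicates):
        # Switch reads only if every validation query agrees on both copies.
        if all(self.results_match(p) for p in validation_predicates):
            self.read_from = "new"
        return self.read_from

table = DualWriteTable()
for row in [("2024-01-01", "acme", 3), ("2024-01-02", "beta", 5)]:
    table.write(row)

checks = [lambda r: r[1] == "acme", lambda r: r[2] > 1]
print(table.cut_over(checks))
```

Comparing results as multisets rather than ordered lists is a deliberate choice here, since two formats rarely return rows in the same order.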
- Z-Ordering (concepts: paColumnarVsRow)
What They Want to Hear 'Partitioning optimizes for one filter column. Z-ordering optimizes for 2-4 filter columns by interleaving their sort order. Think of it like organizing a library: partitioning separates shelves by genre. Z-ordering arranges books within a shelf so books by the same author AND from the same decade are near each other.'
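The interleaving idea can be made concrete with a Morton code, the classic way to build a Z-order key from two integer columns. This is a minimal sketch, assuming both columns are already encoded as small non-negative integers. The names `author_id` and `decade` follow the library analogy and are hypothetical, and real engines use their own variants of this scheme.

```python
def interleave_bits(x, y, bits=16):
    """Morton code: bit i of x goes to position 2i, bit i of y to position 2i+1."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

# Hypothetical rows: (author_id, decade) pairs on a 4x4 grid.
rows = [(author_id, decade) for author_id in range(4) for decade in range(4)]

# Sorting by the interleaved key keeps rows that are close in BOTH columns adjacent.
z_sorted = sorted(rows, key=lambda r: interleave_bits(r[0], r[1]))

print(z_sorted[:4])
```

After sorting, the first four rows are exactly the 2x2 block where both values are 0 or 1. A plain sort on `author_id` alone would instead group one author across all four decades, so a filter on decade would touch every group.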
- Lakehouse Architecture (concepts: paDataLake)
What They Want to Hear 'A lakehouse is four layers: storage (S3/GCS, cheap and durable), format (Delta or Iceberg for ACID and time travel), compute (Spark or Trino, separated from storage so you scale independently), and governance (Unity Catalog or Polaris for access control, lineage, and audit). The key difference from a lake with a query engine: the table format layer provides warehouse-like guarantees without the warehouse cost.'
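To illustrate what the format layer adds on top of plain files, the sketch below models a table as a list of immutable snapshots, each naming the data files that make up one version. It is a toy model under stated assumptions: the class, its methods, and the file names are hypothetical, and neither Delta nor Iceberg stores metadata this way. It only shows the shape of the two guarantees mentioned, atomic commits and time travel.

```python
class ToyTable:
    """Toy snapshot log: each version is an immutable tuple of data file names."""

    def __init__(self):
        self.snapshots = [()]  # version 0 is the empty table

    def commit(self, expected_version, add=(), remove=()):
        # Optimistic concurrency: reject the commit if another writer got in first.
        if expected_version != len(self.snapshots) - 1:
            raise RuntimeError("conflict: table changed since it was read")
        current = self.snapshots[-1]
        new_snapshot = tuple(f for f in current if f not in remove) + tuple(add)
        self.snapshots.append(new_snapshot)  # readers see old or new, never a mix
        return len(self.snapshots) - 1

    def files(self, version=None):
        # Time travel: any earlier version can still be read by number.
        return self.snapshots[-1 if version is None else version]

table = ToyTable()
v1 = table.commit(0, add=["part-000.parquet", "part-001.parquet"])
v2 = table.commit(v1, add=["part-002.parquet"], remove=["part-000.parquet"])

print(table.files())     # latest version
print(table.files(v1))   # the table as of version 1
```

Because the data files themselves never change, storage stays cheap object storage and any compute engine that can read the snapshot list sees a consistent table.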