Parquet pop quiz, partitioning, and the physical layer
The harder version of this question is "When would you NOT use Parquet?" If you only say "it's columnar and compresses well," you've given a surface-level answer. The interviewer wants to hear the exceptions, the tradeoffs, and when other formats win.

When Parquet Hurts

The stronger signal is knowing when NOT to use Parquet, not reciting its benefits. Your interviewer expects you to lead with the failure modes: high-frequency writes, row-level access patterns, and small payloads are the three scenarios where Parquet's columnar design works against you.
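A toy sketch of the first failure mode, using JSON Lines as a stand-in for a row format (the layouts here are simplified illustrations, not Parquet's actual file structure): appending one record to a row-oriented file only touches the tail, while appending to a columnar layout means rewriting every column chunk, which is why columnar writers buffer rows into large row groups instead of appending one at a time.

```python
import io
import json

# Row-oriented (JSON Lines): appending one record only writes to the tail.
def append_row(buffer: io.StringIO, record: dict) -> None:
    buffer.write(json.dumps(record) + "\n")

# Columnar (toy layout: one list per column): appending one record touches
# every column chunk. On object storage, where files are immutable, this
# becomes a full rewrite - the core reason high-frequency writes hurt.
def append_columnar(columns: dict, record: dict) -> dict:
    return {name: values + [record[name]] for name, values in columns.items()}

records = [{"id": 1, "event": "click"}, {"id": 2, "event": "view"}]

row_file = io.StringIO()
for rec in records:
    append_row(row_file, rec)

cols = {"id": [], "event": []}
for rec in records:
    cols = append_columnar(cols, rec)

print(cols["id"])  # [1, 2]
```

The same asymmetry drives row-level reads: fetching one full record from the columnar layout means touching every column, while the row file serves it in one contiguous read.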
The real question isn't "what should the partition key be?" It's "the current partition scheme is wrong; how do you migrate 50 TB of live data without downtime?"

Dynamic Partition Discovery

Most catalogs use static partition registration: you explicitly add partitions via ALTER TABLE ADD PARTITION or MSCK REPAIR TABLE. At scale (100K+ partitions, dozens of pipelines writing concurrently), this breaks. Partition registration becomes a bottleneck, and stale metadata causes queries to miss data.
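A sketch of what dynamic discovery replaces: parsing Hive-style `key=value` path segments so partitions can be derived from an object-store listing instead of a static catalog. The paths and keys below are hypothetical; a real system would page through storage prefixes and register each distinct partition tuple it finds.

```python
def parse_hive_partitions(path: str) -> dict:
    """Extract Hive-style partition values (key=value path segments)."""
    parts = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts[key] = value
    return parts

# Hypothetical object-store listing for a table partitioned by (dt, region).
listing = [
    "warehouse/events/dt=2024-01-01/region=us/part-0000.parquet",
    "warehouse/events/dt=2024-01-01/region=eu/part-0000.parquet",
    "warehouse/events/dt=2024-01-02/region=us/part-0000.parquet",
]

# Deduplicate into distinct partition tuples - what MSCK REPAIR TABLE does
# as a slow batch job, and what dynamic discovery does at query-planning time.
partitions = sorted({tuple(sorted(parse_hive_partitions(p).items())) for p in listing})
print(len(partitions))  # 3 distinct (dt, region) partitions
```

The bottleneck at 100K+ partitions is exactly this listing-and-registration step done eagerly; table formats that track partitions in metadata files avoid re-listing storage on every repair.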
This stops being a feature comparison and becomes a vendor strategy question. The interviewer wants to hear how you'd make this choice for an organization, not a project.

Compute Coupling and Lock-In

Delta Lake is open source, but the optimizations that make it fast - Z-ordering, liquid clustering, predictive I/O - are proprietary to its parent platform. Open-source Delta on vanilla Spark is 2-5x slower on complex queries than the commercial runtime. This is intentional: the vendor monetizes the performance gap.
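Since Z-ordering carries much of that performance gap, it helps to know the underlying idea. A minimal sketch (real implementations range-normalize column values first; this toy version interleaves raw non-negative integers): interleaving the bits of two clustering keys produces a sort order where rows close in both dimensions land close on disk, so a filter on either column prunes contiguous file ranges.

```python
def z_order(x: int, y: int, bits: int = 16) -> int:
    """Morton code: interleave the bits of x and y into one sort key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # even bit positions from x
        z |= ((y >> i) & 1) << (2 * i + 1)  # odd bit positions from y
    return z

# Sorting by the Morton code clusters neighbors in BOTH dimensions,
# unlike a plain (x, y) sort, which only clusters by x.
rows = [(3, 5), (0, 0), (2, 2), (7, 1)]
rows.sort(key=lambda r: z_order(*r))
print(rows)  # [(0, 0), (2, 2), (7, 1), (3, 5)]
```

The open-source format stores the data either way; what's proprietary is the engine machinery that maintains this clustering incrementally and exploits it at scan time.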
The answer is never "lake" or "warehouse." It's an architecture involving multiple engines, governance boundaries, ownership models, and cost allocation. The interviewer is testing whether you can design a data platform, not choose a product.

Multi-Engine Strategy

A modern data platform typically runs 3-5 engines on the same storage layer: Spark for batch ETL (high throughput, complex transformations), Trino or Athena for interactive SQL (low latency, ad-hoc exploration), and Flink for streaming (real-time, low-latency event processing).
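In practice this routing lives in platform docs and access policies rather than code, but as an illustrative sketch (the workload categories and engine names below mirror the list above and are not a prescribed taxonomy):

```python
# Illustrative workload-to-engine mapping for a shared storage layer.
ENGINE_FOR_WORKLOAD = {
    "batch_etl": "spark",        # high throughput, complex transformations
    "interactive_sql": "trino",  # low latency, ad-hoc exploration
    "streaming": "flink",        # real-time, low-latency event processing
}

def pick_engine(workload: str) -> str:
    """Route a workload type to its engine; fail loudly on unknown types."""
    try:
        return ENGINE_FOR_WORKLOAD[workload]
    except KeyError:
        raise ValueError(f"unknown workload type: {workload}")

print(pick_engine("interactive_sql"))  # trino
```

The design point is that all three engines read and write the same table format on the same storage, so the choice is per-workload, not per-dataset.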
Cost isn't a line item - it's an architectural constraint that shapes every decision. The interviewer wants to see you model costs at the per-query and per-team level, not just estimate monthly S3 bills.

Per-Query Cost Modeling

Every query has a measurable cost. In BigQuery, it's explicit: $6.25 per TB scanned. In Spark, it's compute-seconds × instance cost. In Snowflake, it's credits consumed × credit price. A strong engineer can estimate the cost of any query before it runs and identify the levers that would reduce it.
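A sketch of that estimation as arithmetic. Only the $6.25/TB BigQuery on-demand rate comes from the text above; the Spark instance price and job shape are hypothetical placeholders.

```python
def bigquery_scan_cost(bytes_scanned: float, usd_per_tb: float = 6.25) -> float:
    """On-demand BigQuery model: dollars per TB scanned."""
    return bytes_scanned / 1e12 * usd_per_tb

def spark_compute_cost(compute_seconds: float, usd_per_instance_hour: float,
                       instances: int) -> float:
    """Spark model: compute-seconds x instance cost x instance count."""
    return compute_seconds / 3600 * usd_per_instance_hour * instances

# A query scanning 500 GB in BigQuery:
print(round(bigquery_scan_cost(500e9), 3))  # 3.125
# A 10-minute Spark job on 8 hypothetical $2.40/hr instances:
print(round(spark_compute_cost(600, 2.40, 8), 2))  # 3.2
```

The point of the back-of-envelope model is the levers it exposes: in the scan-priced engine, pruning columns and partitions cuts cost directly; in the compute-priced engine, only reducing runtime or cluster size does.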