Loading section...
Data Lake or Warehouse?
Concepts: paDataLake, paSparkExecutionModel
The answer is never "lake" or "warehouse." It's an architecture involving multiple engines, governance boundaries, ownership models, and cost allocation. The interviewer is testing whether you can design a data platform, not choose a product. Multi-Engine Strategy A modern data platform typically runs 3-5 engines on the same storage layer. Spark for batch ETL (high throughput, complex transformations). Trino or Athena for interactive SQL (low latency, ad-hoc exploration). Flink for streaming (real-time aggregation, CDC processing). A managed warehouse engine for BI dashboards (governed, optimized for concurrency). Python/Ray for ML training (direct file access, no SQL overhead). The architectural challenge isn't connecting these engines to storage - that's solved by Iceberg/Delta. It's m