Storage Layers and Table Formats: Beginner
Different shapes of storage exist because different jobs need different physics
- Category: Pipeline Architecture
- Difficulty: beginner
- Duration: 25 minutes
- Challenges: 0 hands-on challenges
Topics covered: Storage Is Not Just the Database, The Data Warehouse, The Data Lake, The Operational Database, Picking the Right Storage Shape
Lesson Sections
- Storage Is Not Just the Database (concepts: paStorageLayers, paOperationalVsAnalytical)
Engineers entering data work often picture storage as one thing: a database. The mental model collapses every kind of persistent data into the same shape. That mental model breaks the moment a real workload meets it. A row that an app writes once and reads once belongs in a different physical layout than a row an analyst scans across two billion peers to compute a sum. The storage layer that is fast for the first job is slow for the second, and the layer that is fast for the second is wrong for
- The Data Warehouse (concepts: paDataWarehouse, paColumnarVsRow)
A data warehouse is the storage layer optimized for analytics. The shapes that win in a warehouse are very different from the shapes that win in an operational database. A warehouse stores data column by column rather than row by row. It enforces schema before data is written. It scales compute and storage independently so an analyst can run a thousand-dollar query without buying a thousand-dollar machine. The dominant cloud warehouses in 2026 are Snowflake, Google BigQuery, and Amazon Redshift,
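The difference between row-by-row and column-by-column storage can be sketched in a few lines of plain Python. This is a toy model, not how any warehouse is implemented: the `rows` data and the `to_columns` helper are hypothetical names invented for the illustration.

```python
# Hypothetical order records, used only for illustration.
rows = [
    {"order_id": 1, "customer": "ana", "amount": 20.0},
    {"order_id": 2, "customer": "bo", "amount": 35.5},
    {"order_id": 3, "customer": "ana", "amount": 4.5},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: the sum walks every record and picks one field out of each.
total_from_rows = sum(record["amount"] for record in rows)

# Column layout: the sum reads one list and never touches the other columns.
total_from_columns = sum(columns["amount"])
```

Both totals are the same number. What changes is how much unrelated data the aggregate has to pass over to reach it, which is the reason analytical scans favor the column layout.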
- The Data Lake (concepts: paDataLake, paParquet, paLakeZones)
A data lake is files in object storage. That sentence sounds anticlimactic and is. The lake is not a database. It is a directory of files in S3, GCS, or Azure Data Lake Storage, organized by convention rather than enforced rules. Each file holds a chunk of data in some format (Parquet, ORC, JSON, CSV). Files are immutable once written. Reading is done by some external compute engine (Spark, Presto, Athena, Trino) that opens the files and parses them. The lake's superpower is cheap storage and co
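A minimal sketch of "files in object storage, organized by convention" follows, assuming a local temporary directory stands in for the bucket and JSON Lines stands in for Parquet so the example needs only the standard library. The `date=...` directory naming, `write_partition`, and `scan` are hypothetical choices for the illustration, not part of any particular lake product.

```python
import json
import tempfile
from pathlib import Path


def write_partition(lake_root, table, date, records, part=0):
    """Write one file under a date=... directory (a naming convention, not an enforced rule)."""
    directory = Path(lake_root) / table / f"date={date}"
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"part-{part:04d}.jsonl"
    # Mode "x" fails if the file already exists, which mimics write-once files.
    with open(path, "x", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
    return path


def scan(lake_root, table):
    """Play the role of the external engine: open every file and parse it."""
    for path in sorted((Path(lake_root) / table).glob("date=*/*.jsonl")):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                yield json.loads(line)


def demo():
    """Write two daily partitions, then total the amounts by scanning the files."""
    with tempfile.TemporaryDirectory() as lake_root:
        write_partition(lake_root, "orders", "2026-01-01", [{"order_id": 1, "amount": 20.0}])
        write_partition(lake_root, "orders", "2026-01-02", [{"order_id": 2, "amount": 35.5}])
        return sum(record["amount"] for record in scan(lake_root, "orders"))
```

Nothing in the directory itself checks the records. Any structure comes from the writer following the convention and the reader knowing how to parse what it finds.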
- The Operational Database (concepts: paOperationalDb, paAcidTransactions, paReadReplica)
An operational database is the storage layer the application reads and writes during its normal operation. Postgres, MySQL, SQL Server, Oracle, and DynamoDB are all operational databases. The defining property is that the access pattern is small and frequent. A user logs in: read one row by user_id. A user places an order: insert one row, update one row in inventory, write one row to a payment log. Thousands of these tiny operations per second is the design center.
Row Storage in One Picture
An
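The "user places an order" pattern can be sketched with Python's built-in `sqlite3` module, used here as a stand-in for a server database such as Postgres. The table names, columns, and the `place_order` and `make_db` helpers are assumptions made up for the illustration.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a hypothetical three-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, sku TEXT, quantity INTEGER);
        CREATE TABLE inventory (sku TEXT PRIMARY KEY, stock INTEGER);
        CREATE TABLE payment_log (id INTEGER PRIMARY KEY, user_id INTEGER, amount REAL);
        INSERT INTO inventory VALUES ('widget', 5);
        """
    )
    return conn


def place_order(conn, user_id, sku, quantity, amount):
    """Insert an order, decrement inventory, and log the payment as one atomic unit."""
    # Using the connection as a context manager commits on success
    # and rolls back if an exception escapes the block.
    with conn:
        conn.execute(
            "INSERT INTO orders (user_id, sku, quantity) VALUES (?, ?, ?)",
            (user_id, sku, quantity),
        )
        updated = conn.execute(
            "UPDATE inventory SET stock = stock - ? WHERE sku = ? AND stock >= ?",
            (quantity, sku, quantity),
        )
        if updated.rowcount != 1:
            raise ValueError("insufficient stock")
        conn.execute(
            "INSERT INTO payment_log (user_id, amount) VALUES (?, ?)",
            (user_id, amount),
        )
```

Each call touches three rows and either commits all of them or none. If the inventory update finds too little stock, the order row inserted a moment earlier is rolled back along with it.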
- Picking the Right Storage Shape (concepts: paStorageSelection, paLayeredStorage)
Three shapes, three jobs, one rule. The rule is short and worth memorizing: warehouses for queries people read, lakes for raw and bulk, operational databases for the app. Most architectural confusion at junior levels collapses once that rule sits in working memory. The rest of this section unpacks the rule into the questions that select between the three when the choice is not obvious.
The Selection Question Tree
The tree is not exhaustive. Real architectures often store the same logical data in
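The one-line rule can be written down as a small function. This is only a sketch of the rule as stated, not the lesson's full question tree: the `Shape` enum, the `pick_storage` name, its two boolean questions, and the order in which they are asked are all assumptions made for the illustration.

```python
from enum import Enum


class Shape(Enum):
    OPERATIONAL_DB = "operational database"
    WAREHOUSE = "data warehouse"
    LAKE = "data lake"


def pick_storage(serves_the_app: bool, queried_by_people: bool) -> Shape:
    """Encode the rule: operational databases for the app, warehouses for queries people read, lakes for raw and bulk."""
    if serves_the_app:
        # The application reads and writes this data during normal operation.
        return Shape.OPERATIONAL_DB
    if queried_by_people:
        # Analysts run queries and read the results.
        return Shape.WAREHOUSE
    # Everything else is treated as raw or bulk data (an assumed default).
    return Shape.LAKE
```

For example, `pick_storage(serves_the_app=False, queried_by_people=True)` returns `Shape.WAREHOUSE`. A function like this picks a single shape, so it does not model the case where the same logical data is stored in more than one place.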