Storage Layers and Table Formats: Beginner

A consumer fintech at Series C scale ran the entire business on one Postgres database. The app wrote to it, the analysts queried it, the CEO's dashboard read from it, and the data science team trained models against it. One Tuesday at 3pm a junior analyst ran an ad-hoc query that joined three large tables without an index. The query held a lock for nine minutes. During those nine minutes the mobile app stopped accepting payments because every transaction was waiting on the same database. The outage cost roughly forty thousand dollars in failed checkouts and a much larger amount in customer trust. The root cause was not the analyst. The root cause was that the company had one storage layer doing four different jobs that wanted four different physical layouts. This lesson is about why storage is plural, what each shape is good at, and how to pick the right one for the job at hand.

Storage Is Not Just the Database

Daily Life
Interviews

Recognize that storage is plural and name the three foundational shapes that solve different jobs.

Engineers entering data work often picture storage as one thing: a database. The mental model collapses every kind of persistent data into the same shape. That mental model breaks the moment a real workload meets it. A row that an app writes once and reads once belongs in a different physical layout than a row an analyst scans across two billion peers to compute a sum. The storage layer that is fast for the first job is slow for the second, and the layer that is fast for the second is wrong for the first.

Three Jobs, Three Shapes

JobAccess PatternStorage Shape That Wins
Run the applicationRead and write a few rows by primary key, thousands of times per secondOperational database (Postgres, MySQL, DynamoDB)
Answer business questionsScan billions of rows, summing or counting a few columns at a timeData warehouse (Snowflake, BigQuery, Redshift)
Hold raw and bulk dataLand enormous files cheaply, read them later in flexible shapesData lake (S3, GCS, Azure Data Lake Storage)
These three shapes are not the only storage layers in production environments, but every other shape is a specialization of one of them. A document store is an operational database with a flexible schema. A time-series database is an operational database optimized for one access pattern. A search index is a derived view, not a primary store. The three foundational shapes are operational, warehouse, and lake. Lesson 1 introduced storage as one of the four pipeline roles; this lesson opens that single role into the three shapes that the role actually contains in production.
Why one storage shape cannot serve every job:
  • An operational database optimized for fast point lookups is slow at scanning a billion rows
  • A data warehouse optimized for analytical scans cannot handle thousands of small writes per second
  • A data lake optimized for cheap bulk storage has no concept of transactions or row-level updates
  • Mixing the jobs into one system means one of them gets bad service all of the time

The Postgres-as-Everything Antipattern

Many startups begin life with one Postgres database serving every job. For the first year this is the right call. Postgres is reliable, well documented, and free. The trouble starts when the analytical workload begins to compete with the operational workload for the same compute. An analyst's two-minute query holds locks that block the application. A data scientist's full-table scan of the orders table evicts hot pages from the buffer cache and slows every transaction for the next ten minutes. The fix is not a faster database. The fix is moving the analytical workload to a different storage shape. The Postgres remains the operational database; a warehouse or lake takes over the analytical work.
One Database Doing Everything
  • Analytical queries lock production tables and slow the app
  • Schema changes for analytics break the application
  • Storage costs scale with the largest workload, not the smallest
  • Backups have to be tuned for two different shapes simultaneously
Separate Storage by Job
  • Production stays fast; analytical work runs on a copy that cannot affect the app
  • Analytical schema evolves separately from operational schema
  • Each storage layer pays only for the work it actually does
  • Each layer is backed up and tuned for its own access pattern

The Pipeline Maps to the Shapes

A pipeline reads from operational sources and writes to analytical destinations. The three storage shapes appear at known positions in the picture. The operational database is upstream, on the left of the diagram, as a source. The data lake is in the middle, as the raw and curated zones. The data warehouse is downstream, as the serving layer the dashboards and analyses read from. The shape of the pipeline becomes legible once the storage shapes are named.
1Postgres app operational lake warehouse consumer
Operational databases serve the application: small reads and writes by primary key, thousands per second.
Data warehouses serve analysts: scans across billions of rows summing a few columns at a time.
Data lakes hold raw and bulk data cheaply: large files in object storage, read later in flexible shapes.

Naming the storage shape at every node in a pipeline diagram is the single most useful documentation habit. A diagram that does not name the shapes hides the most important physics of the system.

1
TIP
When a pipeline misbehaves, the first diagnostic is to ask whether the storage shape matches the access pattern. A warehouse being used for point lookups and a lake being used for transactional updates are both common configurations of disappointment.

The Data Warehouse

Daily Life
Interviews

Identify a data warehouse by columnar layout, enforced schema, and separated storage and compute, and name three production examples.

A data warehouse is the storage layer optimized for analytics. The shapes that win in a warehouse are very different from the shapes that win in an operational database. A warehouse stores data column by column rather than row by row. It enforces schema before data is written. It scales compute and storage independently so an analyst can run a thousand-dollar query without buying a thousand-dollar machine. The dominant cloud warehouses in 2026 are Snowflake, Google BigQuery, and Amazon Redshift, with Databricks SQL Warehouse competing in the same lane.

Columnar Storage in One Picture

LayoutHow Rows Are Stored on DiskFast Access Pattern
Row store (Postgres, MySQL)All columns of row 1, then all columns of row 2Read or write an entire row by key
Column store (Snowflake, BigQuery)All values of column A, then all values of column BRead a few columns across many rows
An analytical query that computes the total revenue for the last quarter touches two columns: order_date and order_amount. A row-store database has to read every column of every order to find the two it needs. A column-store warehouse reads only the two columns and skips the rest. On a billion-row table with twenty columns, the warehouse reads roughly one twentieth of the bytes. The same query that runs in eighteen minutes on Postgres runs in nine seconds on Snowflake, and the gap widens as the table gets larger.

Schema Enforced at Write Time

A warehouse enforces schema. Every column has a declared type, every row that is inserted must conform, and the catalog always knows what shape a table has. This sounds like a small property and is not. It is the foundation that makes warehouse SQL fast and the source of friction that pushes raw data into the lake first. A semi-structured JSON event with a different shape every minute does not belong in a warehouse table directly; it belongs in a lake first, where it gets cleaned and conformed before being loaded.
1CREATE TABLE analytics.fct_orders(order_id STRING NOT NULL, customer_id STRING NOT NULL, order_timestamp TIMESTAMP_TZ NOT NULL, amount_cents NUMBER(18, 0) NOT NULL, country_code STRING) ; COPY INTO analytics.fct_orders
2
3FROM @ raw_stage / orders / dt = 2026 - 04 - 24 / FILE_FORMAT =(TYPE = PARQUET) ;

Separated Storage and Compute

Cloud warehouses split storage from compute. The data sits on object storage at near-trivial cost per terabyte per month. Compute clusters spin up on demand, run queries, and shut off. A team of one can run a query that briefly uses a hundred-CPU cluster for three minutes and pay only for those three minutes. The economics of analytics changed when this split happened around 2015. Before the split, every analytical query competed with every other query for a fixed pool of compute, and capacity planning was an annual event. After the split, capacity is elastic and the cost shifted from machine purchases to per-query bills.
SnowflakeBigQueryRedshift
Snowflake
Multi-cloud columnar warehouse
Built on top of S3, GCS, or Azure Blob. Virtual warehouses are isolated compute clusters that share the same storage. Heavily SQL-first.
BigQuery
Google Cloud serverless warehouse
No clusters to size. Compute is on-demand slots; storage is Capacitor format on Colossus. Pay per byte scanned or per slot-hour.
Redshift
AWS warehouse with two flavors
Original provisioned model with named clusters; Redshift Serverless is the elastic alternative. Spectrum extends queries to S3.

What Warehouses Are Bad At

Cloud warehouses are not good at point lookups by primary key, single-row updates, sub-second latency for application traffic, or thousands of tiny writes per second. The architecture is wrong for those jobs. The internal storage format batches data into immutable files, and a single row update has to rewrite a file. The query planner is built for scans, not for fetching a single row at low latency. An application that tries to use Snowflake as its operational database will quickly learn what each transaction costs in seconds and dollars.
Warehouse Excels At
  • Aggregations across billions of rows
  • Joining large tables with SQL
  • Storing structured fact and dimension tables long term
  • Concurrent analytical workloads with elastic compute
Warehouse Struggles With
  • Point lookups by primary key at sub-50ms latency
  • Single-row INSERT/UPDATE/DELETE at high frequency
  • Application traffic that demands thousands of small operations per second
  • Storing unstructured blobs (images, audio, raw event JSON before parsing)
1
Do
  • Reach for a warehouse when scans dominate the workload
  • Declare schema explicitly and let the warehouse reject bad data at load time
  • Use separate compute warehouses for ETL, BI, and ad-hoc work to avoid contention
Don't
  • Use the warehouse as the operational database for an application
  • Issue thousands of single-row writes when a bulk load would do the same work
  • Store raw unstructured payloads in warehouse tables when a lake would be cheaper and more flexible

The Data Lake

Daily Life
Interviews

Identify a data lake as files in object storage, name the typical zone structure, and pick Parquet for analytical reads.

A data lake is files in object storage. That sentence sounds anticlimactic and is. The lake is not a database. It is a directory of files in S3, GCS, or Azure Data Lake Storage, organized by convention rather than enforced rules. Each file holds a chunk of data in some format (Parquet, ORC, JSON, CSV). Files are immutable once written. Reading is done by some external compute engine (Spark, Presto, Athena, Trino) that opens the files and parses them. The lake's superpower is cheap storage and complete schema flexibility; its weakness is that flexibility comes at the cost of consistency guarantees.

What Lives in a Lake

LayerTypical ContentsFormat
Raw zoneSource data unchanged from the producerJSON, CSV, Avro, sometimes Parquet
Staging zoneCleaned, deduplicated, conformedParquet
Curated zoneBusiness-ready datasets shared across teamsParquet, often through an open table format like Iceberg or Delta
Archival zoneCold data kept for complianceParquet on infrequent-access storage class (S3 Glacier, GCS Archive)
The folder structure is organized by date in nearly every production lake, because partition by ingestion date is the cheapest and most useful organizing principle. Files arrive once and never move. A reader that wants last Tuesday's data reads the files under that date's prefix and ignores the rest. The folder structure is not a schema; it is a routing scheme that lets the reader prune what to open.
1s3 : / / company - data - lake / raw / orders / dt = 2026 - 04 - 23 / orders_001.json orders / dt = 2026 - 04 - 24 / orders_001.json payments / dt = 2026 - 04 - 23 / payments_001.json staging / orders / dt = 2026 - 04 - 23 / part - 00000. parquet curated / fct_orders / dt = 2026 - 04 - 23 / part - 00000. parquet

Why the Lake Is Cheap

Object storage charges roughly two cents per gigabyte per month for hot tiers and a fraction of a cent for cold tiers. A petabyte of data costs roughly twenty thousand dollars a month in S3 Standard, much less in Glacier. The same petabyte on a managed warehouse like Snowflake costs around twenty-three dollars per terabyte per month for storage alone, before any compute cost. The lake's price advantage compounds at scale. A company with petabytes of historical clickstream data keeps that data in the lake because keeping it in a warehouse would be financially absurd.

Schema on Read, Not on Write

A lake accepts whatever bytes are written. A producer can land a JSON event with seventeen fields one day and the same event with eighteen fields the next day, and the lake will not complain. The schema is interpreted at read time by whatever engine is reading. This flexibility is the point. It is also the trap: without discipline, a lake becomes a swamp, where nobody knows what columns exist, what types they are, or whether a given file is parseable. The fix for swampiness is the curated zone, where conformed Parquet files have stable schemas readable by any engine.
What makes a lake different from a folder of files:
  • Object storage offers eleven nines of durability with no operational work
  • Files are immutable once written; updates produce new files
  • Listings are eventually consistent; readers must handle missing files gracefully
  • Compute is external; readers like Spark, Athena, Trino open files in parallel

The Two Dominant File Formats

FormatShapeWhen To Use It
JSON / CSVRow-oriented, human-readable, no compression of repeated valuesRaw landing zone where the source decides the format
Parquet / ORCColumnar, binary, compressed per column, embedded statisticsCurated zone where read efficiency matters
Raw data lands in whatever shape the source produced (often JSON or CSV). Pipeline transforms convert it into Parquet for the curated zone. Parquet is the de facto columnar format for lakes, used by Spark, Trino, Athena, Snowflake external tables, and BigQuery external tables. ORC is functionally similar and dominant in the Hive ecosystem. Choosing between them is a near-tie; choosing either over JSON for analytical reads is a giant win.
1# Reading a Parquet partition from S3 with pandas via PyArrow.
2import pyarrow.dataset as ds
3
4dataset = ds.dataset(
5 "s3://company-data-lake/curated/fct_orders/",
6 format="parquet",
7 partitioning="hive"
8)
9
10# Push a filter to the engine; only matching partitions are read.
11table = dataset.to_table(
12 columns=["order_id", "amount_cents", "country_code"],
13 filter=ds.field("dt") == "2026-04-23"
14)

What the Lake Cannot Do

A pure lake (files in object storage with no metadata layer) has no transactions. Two writers landing files in the same directory at the same time can race. There is no concept of an atomic update across multiple files. There is no built-in time travel. There is no efficient row-level update. The advanced tier of this lesson covers the open table formats (Iceberg, Delta, Hudi) that solve these problems by adding a metadata layer on top of the files. For now, the mental model is: files are immutable, the lake is cheap, and consistency is achieved by the discipline of writing whole partitions at a time and never modifying a file in place.
A lake is files in object storage organized by convention, not a database with enforced rules.
Storage cost is roughly an order of magnitude cheaper than warehouses, especially at petabyte scale.
Parquet and ORC are the columnar formats that turn a lake from a dumping ground into a queryable analytical store.
TIP
Treat the raw zone as immutable and append-only. Any cleanup, dedup, or join belongs in a downstream zone. The discipline preserves the ability to recompute everything from raw if a transform turns out to be wrong.

The Operational Database

Daily Life
Interviews

Distinguish an operational database by row layout, ACID transactions, and OLTP access pattern, and explain why analytical queries belong elsewhere.

An operational database is the storage layer the application reads and writes during its normal operation. Postgres, MySQL, SQL Server, Oracle, and DynamoDB are all operational databases. The defining property is that the access pattern is small and frequent. A user logs in: read one row by user_id. A user places an order: insert one row, update one row in inventory, write one row to a payment log. Thousands of these tiny operations per second is the design center.

Row Storage in One Picture

An operational database stores rows together on disk. All the columns of one row sit next to each other in a single block. Reading a row by primary key is one disk seek. Writing a row is one block update. This is the right physics for the application's access pattern, and it is the wrong physics for analytics. An analytical scan that wants two columns out of twenty has to read all twenty for every row, because the disk does not know how to give the two without the eighteen.
OperationOperational Database PerformanceWhy
SELECT row WHERE id = 'user_123'Sub-millisecond on a hot indexIndex points to one block; one disk seek; row stored together
UPDATE row SET balance = balance - 10 WHERE id = 'user_123'Single-digit millisecondsB-tree index plus row-level locking; small in-place update
SELECT SUM(amount) FROM orders WHERE country = 'US'Slow at scale; locks pages; competes with app trafficFull scan of row blocks; reads every column whether needed or not
INSERT 1,000 rows in one bulk commandFast, but transaction log fills; replication lag possibleOperational databases are tuned for many small writes, not few large ones

ACID Transactions

Operational databases provide ACID transactions: atomic, consistent, isolated, durable. A bank transfer that debits one account and credits another either both happen or neither happens. A retailer's checkout that decrements inventory and creates an order succeeds completely or fails completely. ACID is a hard guarantee, expensive to implement, and the reason operational databases exist as a category. Lakes and warehouses sometimes offer transactional semantics on top of object storage, but the strongest, most efficient transaction implementations live in operational databases.
1BEGIN ; UPDATE accounts SET balance = balance - 100
2WHERE id = 42 ; UPDATE accounts SET balance = balance + 100
3WHERE id = 87 ; INSERT INTO transfers(from_id, to_id, amount, ts) VALUES(42, 87, 100, NOW()) ; COMMIT ;
4

Postgres Versus MySQL Versus DynamoDB

The three names cover most application workloads. Postgres is the modern open-source relational default, with rich data types, strong SQL, and an active extension ecosystem. MySQL has the longer history and slightly different tradeoffs around replication and locking; many large web companies still run MySQL or its fork MariaDB. DynamoDB is the AWS managed key-value store, optimized for predictable single-digit-millisecond latency at any scale; it gives up SQL flexibility for guaranteed performance. The selection between them is workload-driven and not within the scope of this lesson; what matters is that all three are operational databases serving the application, not analytical layers serving analysts.
Operational Database Job
  • Read one row by primary key in under a millisecond
  • Insert thousands of small rows per second
  • Maintain ACID transactions across multiple rows
  • Handle live customer traffic without queueing
Not the Operational Database's Job
  • Scan a billion rows for a quarterly revenue total
  • Hold raw clickstream events for analytical replay
  • Join three large fact tables for a dashboard
  • Train a machine learning model over a year of history

Why the Operational Database Should Not Be the Analytical Database

Mixing the workloads on one database creates the failure mode from this lesson's introduction. An analytical query holds locks. The application slows. Customers notice. The fix is not heroic indexing; it is moving the analytical workload to a separate storage layer with the right physics. The pipeline reads from the operational database (a source) and lands data in a warehouse or lake (a destination). The two systems serve different jobs and never compete for the same compute. That separation is the entire reason data engineering exists as a discipline.

The Read Replica as a Halfway Step

Before a full pipeline, many teams add a read replica: a copy of the production database that handles read traffic. Replicas are a database feature, not a data engineering build, and they solve the lock-contention problem cheaply. The replica still has the operational shape (row store, OLTP-tuned indexes), so it does not solve the analytical-shape problem. It buys time. When the schema's operational shape becomes painful for analysts (deeply normalized, full of internal columns, missing the joins they need), the replica's value runs out and the pipeline pays for itself. Lesson 1's beginner tier discussed when a read replica is the right answer; this section adds that the replica is operational shape on operational physics, only on a different machine.
Three signals that the analytical workload should leave the operational database:
  • Analytical queries are degrading application latency or causing lock contention
  • Analysts ask for joins or shapes that the operational schema makes painful
  • The cost of scaling the operational database for analytics exceeds the cost of a separate warehouse
1

ACID is the operational database's flagship property. Lakes and modern warehouses offer transactional semantics in narrower forms, but small-write ACID at high frequency is still the operational database's exclusive job.

Picking the Right Storage Shape

Daily Life
Interviews

Pick the appropriate storage layer for a workload by access pattern, freshness needs, and scale.

Three shapes, three jobs, one rule. The rule is short and worth memorizing: warehouses for queries people read, lakes for raw and bulk, operational databases for the app. Most architectural confusion at junior levels collapses once that rule sits in working memory. The rest of this section unpacks the rule into the questions that select between the three when the choice is not obvious.

The Selection Question Tree

QuestionIf YesIf No
Will the application read or write this data during a user request?Operational databaseContinue to the next question
Will analysts or dashboards scan this data with SQL?Data warehouseContinue to the next question
Is the data raw, bulk, semi-structured, or larger than the warehouse will tolerate cheaply?Data lakeReconsider whether the data needs to be stored at all
The tree is not exhaustive. Real architectures often store the same logical data in two of the three layers: raw events in the lake, conformed marts in the warehouse, and a small denormalized copy in the operational database for the application to read. The tree is a starting point, not a verdict. Most workloads land cleanly on one shape; a few span two; very few demand all three.

Three Workloads, Three Decisions

WorkloadStorage ChoiceReason
User profile read on every page loadOperational databaseApplication traffic, sub-millisecond reads, point lookups by user_id
Daily revenue dashboard for the executive teamData warehouseAnalytical scan over a fact table, SQL-shaped, runs once a day
Three years of mobile event JSON for ad-hoc analysisData lakeBulk semi-structured data, queried infrequently, must be cheap to retain
Live inventory count visible in the appOperational databaseApplication traffic, must be fresh and consistent at the row level
Year-over-year sales report by regionData warehouseAnalytical aggregation that scans many rows of a few columns
Training data for a recommendation modelData lake plus warehouseLake holds the raw history; warehouse or lake compute extracts features

The Three Layers Working Together

1App writes | | | | CDC every 15 MIN v v User request S3 raw zone | | daily Spark TRANSFORM v S3 curated/fct_orders | | load to warehouse v Snowflake mart_orders | v Looker dashboards
The same business event, an order, lives in three places. Postgres holds the order while the application is processing it. The lake holds the raw extract for replay and historical analysis. The warehouse holds the conformed mart that powers dashboards. Each copy is shaped for the job that reads it. The pipeline is the machinery that keeps the three copies in sync.
When To Add a Warehouse
  • Analytical SQL is the dominant downstream consumer
  • BI tools (Looker, Tableau, Mode) need a place to point
  • Data is structured and benefits from enforced schema
  • Concurrency among analysts matters more than per-byte storage cost
When To Add a Lake
  • Raw, semi-structured, or unstructured data dominates volume
  • Storage cost at petabyte scale is part of the budget conversation
  • Multiple compute engines (Spark, Trino, Athena) need to read the same data
  • Schema flexibility outweighs the friction of conforming on read

The Common Mistakes

Three mistakes appear repeatedly in production architectures. First: an analytical workload is left on the operational database because moving it requires building a pipeline. The cost of the lock contention exceeds the cost of the pipeline within a year. Second: raw data is loaded straight into the warehouse without a lake, because the warehouse is the only system the team knows. Storage cost climbs and schema evolution becomes painful. Third: the lake becomes the only storage layer, with no warehouse, and analysts spend their afternoons writing PySpark when SQL on Snowflake would be faster, cheaper, and easier to share. The right answer is almost always all three layers, each doing the job it is shaped for.
The short rule for picking storage:
  • Operational database for application traffic and ACID transactions
  • Data warehouse for analytical SQL that produces dashboards and reports
  • Data lake for raw, bulk, and semi-structured data that needs to be cheap
OperationalWarehouseLake
Operational
Application traffic, sub-ms reads
Postgres, MySQL, DynamoDB. Row-stored, ACID, OLTP-tuned. Powers user-facing requests; never used for analytical scans of the full table.
Warehouse
Analytical SQL, dashboards
Snowflake, BigQuery, Redshift. Columnar, schema-enforced, separated storage and compute. Powers BI tools, internal reporting, and ad-hoc analyst queries.
Lake
Raw and bulk, cheap retention
S3, GCS, ADLS. Files in object storage organized by convention. Powers raw landing, archival, and the curated zones that warehouses load from.
Do
  • Name the storage shape for every node in the architecture diagram
  • Move analytical workloads off the operational database before they cause incidents
  • Hold raw data in the lake even when it is also loaded into the warehouse
Don't
  • Use the warehouse as a key-value store for application reads
  • Treat the lake as a free-for-all without a curated zone
  • Pick a single storage layer for every job and tune around the mismatch
TIP
When a system feels like it is fighting itself, the diagnosis is often a workload running on a storage shape that does not match. The fix is to add the right layer, not to keep tuning the wrong one.
PUTTING IT ALL TOGETHER

> A growing e-commerce company has one Postgres database serving the app, the analytics, and the data science team. Checkouts have started slowing during the analyst's afternoon ad-hoc queries. The CEO wants a daily revenue dashboard, the ML team wants three years of clickstream history for a recommendation model, and the application still needs to support a thousand transactions per second. The new lead engineer is asked: 'What storage layers should this company actually have, and what goes where?'

The application keeps Postgres as its operational database. ACID transactions for checkout, point lookups for user profiles, and thousands of small writes per second all want row storage and OLTP physics. Removing the analytical workload here is the immediate win that ends the afternoon slowdowns.
A data lake on S3 holds the raw clickstream history and a daily extract from Postgres orders. Three years of semi-structured event data in Parquet costs an order of magnitude less than the same data in a warehouse. The raw zone is the source of truth that lets the recommendation model retrain from history.
A data warehouse on Snowflake or BigQuery holds the conformed marts the daily revenue dashboard reads. Columnar storage and elastic compute make the executive query fast and isolated from the application. The pipeline that builds these marts pulls from the lake's curated zone.
The three layers map to the four roles from Lesson 1. Postgres is a source. The lake is the storage layer where raw lands and curated tables form. The warehouse is the serving layer for analytical consumers. The pipeline closes the location, shape, and time gaps between them, and each layer pays only for the job it actually does.
KEY TAKEAWAYS
Storage is plural because access patterns are plural: operational, analytical, and bulk are different jobs that want different physical layouts.
Warehouses store columns, operational databases store rows: the layout decides which queries are fast and which are catastrophic.
A data lake is files in object storage, not a database: cheap, schema-flexible, and the right home for raw and bulk data.
ACID transactions are the operational database's flagship property: they are why the application uses Postgres and not Snowflake for checkout.
Most production architectures use all three layers: the same business event lives in operational, lake, and warehouse forms, each shaped for the job that reads it.

Storage Layers and Table Formats: Beginner

Different shapes of storage exist because different jobs need different physics

Category
Pipeline Architecture
Difficulty
beginner
Duration
25 minutes
Challenges
0 hands-on challenges

Topics covered: Storage Is Not Just the Database, The Data Warehouse, The Data Lake, The Operational Database, Picking the Right Storage Shape

Lesson Sections

  1. Storage Is Not Just the Database (concepts: paStorageLayers, paOperationalVsAnalytical)

    Engineers entering data work often picture storage as one thing: a database. The mental model collapses every kind of persistent data into the same shape. That mental model breaks the moment a real workload meets it. A row that an app writes once and reads once belongs in a different physical layout than a row an analyst scans across two billion peers to compute a sum. The storage layer that is fast for the first job is slow for the second, and the layer that is fast for the second is wrong for

  2. The Data Warehouse (concepts: paDataWarehouse, paColumnarVsRow)

    A data warehouse is the storage layer optimized for analytics. The shapes that win in a warehouse are very different from the shapes that win in an operational database. A warehouse stores data column by column rather than row by row. It enforces schema before data is written. It scales compute and storage independently so an analyst can run a thousand-dollar query without buying a thousand-dollar machine. The dominant cloud warehouses in 2026 are Snowflake, Google BigQuery, and Amazon Redshift,

  3. The Data Lake (concepts: paDataLake, paParquet, paLakeZones)

    A data lake is files in object storage. That sentence sounds anticlimactic and is. The lake is not a database. It is a directory of files in S3, GCS, or Azure Data Lake Storage, organized by convention rather than enforced rules. Each file holds a chunk of data in some format (Parquet, ORC, JSON, CSV). Files are immutable once written. Reading is done by some external compute engine (Spark, Presto, Athena, Trino) that opens the files and parses them. The lake's superpower is cheap storage and co

  4. The Operational Database (concepts: paOperationalDb, paAcidTransactions, paReadReplica)

    An operational database is the storage layer the application reads and writes during its normal operation. Postgres, MySQL, SQL Server, Oracle, and DynamoDB are all operational databases. The defining property is that the access pattern is small and frequent. A user logs in: read one row by user_id. A user places an order: insert one row, update one row in inventory, write one row to a payment log. Thousands of these tiny operations per second is the design center. Row Storage in One Picture An

  5. Picking the Right Storage Shape (concepts: paStorageSelection, paLayeredStorage)

    Three shapes, three jobs, one rule. The rule is short and worth memorizing: warehouses for queries people read, lakes for raw and bulk, operational databases for the app. Most architectural confusion at junior levels collapses once that rule sits in working memory. The rest of this section unpacks the rule into the questions that select between the three when the choice is not obvious. The Selection Question Tree The tree is not exhaustive. Real architectures often store the same logical data in