Netflix had a problem at sufficient scale that no existing system solved it. The data lake at petabyte scale gave them cheap storage and flexible compute. The lack of transactional semantics meant that two writers landing files in the same partition could produce incoherent reads. Schema changes required rewriting partitions or maintaining shadow tables. A query engine could see partial writes if it listed files at the wrong moment. Netflix open-sourced Iceberg in 2018 to solve these problems, Databricks released Delta Lake the same year, and Uber released Hudi shortly before. The three formats look different on the surface and converge on the same architectural idea: put a metadata layer on top of object storage that provides snapshot isolation, schema evolution, and time travel without giving up the lake's economics. This lesson is about that metadata layer, why it is the actual product, and how to operate it under load. The framing connects to Lesson 1's pipeline-as-product framing and Lesson 2's freshness tier analysis: open table formats are the storage technology that lets one set of tables serve batch consumers, streaming consumers, and ad-hoc analytical consumers without splitting into separate stores.
The Lakehouse: ACID on Object
Daily Life
Interviews
Recognize the lakehouse as object storage plus a metadata layer and explain why the metadata is the actual product.
The lakehouse is a marketing term that names a real architectural shift. The shift is the addition of a metadata layer on top of files in object storage that provides the consistency guarantees a database has and a folder of files lacks. Iceberg, Delta Lake, and Apache Hudi are three implementations of the same idea. The data files are still Parquet (or ORC). The folders look mostly the same. The difference is a small set of metadata files that turn the directory of Parquet into a transactional table.
What the Metadata Layer Adds
Capability
Plain Lake
Lakehouse Table Format
Atomic writes across files
Best-effort: a writer fails halfway leaves partial files
Atomic: either the snapshot commits or it does not
Snapshot isolation between readers and writers
Readers can see partial writes mid-operation
Readers see a consistent snapshot regardless of concurrent writes
Time travel to historical state
Possible only if old files were never overwritten
First-class: every snapshot is reachable by version or timestamp
Schema evolution without rewrites
Adding a column means rewriting partitions or splitting tables
Add, drop, rename at the metadata level; data files unchanged
Row-level updates and deletes
Possible only by rewriting whole partitions
Supported via merge-on-read or copy-on-write strategies
Concurrent writers
Last writer wins; data corruption likely
Optimistic concurrency: conflicting writers fail and retry
Each metadata.json file describes the table at a moment in time: schema, partition spec, and pointer to the current snapshot. Each snapshot points to a list of manifest files. Each manifest file lists the data files that are part of that snapshot, with their statistics. A read that wants the current state reads the latest metadata, then the snapshot, then the manifests, then the data files. A read that wants a historical state reads an older metadata.json instead. The data files themselves are usually unchanged across snapshots; the metadata is what evolves.
The Three Major Open Formats
Apache IcebergDelta LakeApache Hudi
Apache Iceberg
Snapshot-based, vendor-neutral
Originated at Netflix in 2017. Snapshot manifests in Avro. Partition evolution and hidden partitioning. Strong neutrality across compute engines (Spark, Trino, Snowflake, Athena, BigQuery, Flink).
Delta Lake
Transaction-log based, Databricks-led
Originated at Databricks in 2017. Transaction log in JSON files (_delta_log). Strong tooling on Databricks; broader open-source support has grown rapidly since 2022.
Apache Hudi
Mutation-first, streaming-oriented
Originated at Uber in 2017. First-class support for upserts and deletes via merge-on-read. Strongest fit for streaming ingest and CDC workloads where rows mutate frequently.
Why the Metadata Is the Product
From a distance, all three formats store Parquet files in object storage. The differences live entirely in the metadata layer. The metadata is what enables atomic writes, isolation, time travel, and schema evolution. Without the metadata, the data files are still Parquet but the table behavior is what a directory of Parquet has always been: best-effort, with all the operational headaches that implies. The marketing focuses on the formats; the engineering substance is the metadata. Engineers who understand which metadata operations are cheap and which are expensive operate these tables at scale; engineers who only understand the data files run into surprises.
Operations that touch metadata only (cheap):
▸Adding a column to the schema
▸Renaming a column at the schema level
▸Tagging a snapshot for time travel reference
▸Setting a retention policy on old snapshots
What Plain Object Storage Cannot Do
S3, GCS, and Azure Data Lake Storage are eventually consistent on listings (less so today than five years ago, but the model still permeates the API). They have no notion of multi-object atomicity. A writer that uploads ten files to a directory cannot make all ten visible at the same instant; readers can see five of them while the upload is in progress. A table format hides this by recording 'these ten files are now part of snapshot N' in a single metadata.json upload. The single upload is atomic at the object level, and that atomicity propagates to the whole snapshot. The metadata layer is, in effect, a way to launder object-level atomicity into table-level atomicity.
1
# Conceptual sketch of how Iceberg achieves atomic multi-file writes
2
3
# Step 1: Write new data files (no readers see them yet)
# Until step 4, no reader sees the new files. After step 4, every reader does.
✓Do
Choose an open table format for any analytical table that needs concurrent writers or schema evolution
Treat the metadata layer as the operational surface, not just the data files
Use hidden partitioning (Iceberg) or generated columns (Delta) so the partition logic survives schema evolution
✗Don't
Edit data files directly; the metadata loses track of them
Mix table-format-aware writers with plain object-store writers; the table state diverges
Assume all three formats interoperate fully today; check engine compatibility before standardizing
1
1
Snapshot Isolation and Time Travel
Daily Life
Interviews
Apply snapshot isolation and time travel to operate a table format under concurrent writers and reproduce historical state.
Snapshot isolation is the consistency guarantee that turns a table format into a real table. A reader sees a consistent point-in-time view of the table, even when writers are landing new data concurrently. Time travel is the operational use of the same machinery: read the table as it existed at a previous snapshot. The two features come from the same underlying mechanism, which is that the table's state is defined by an immutable chain of snapshots and the metadata pointer that names which snapshot is current.
Snapshot Isolation in Practice
1
TIMEWriterReader10:00Readsmetadata10:00metadatapointsATsnapshotS110:00:01Beginstowritebatch10:00:02Issuesquery(plannerseesS1)10:00:03Writes12newfilesContinuesscanningS1's files
10:00:04 Reaches a partition affected by writer
10:00:04 Reads the OLD files in S1, not the new ones
10:00:05 Atomic commit -> S2
10:00:06 Done Still scanning S1
10:00:07 Finishes; result is consistent with S1'
The reader's planner pinned the snapshot at the start of the query. Every file read during execution comes from S1's manifest list. The writer landed new files and committed S2 in the middle of the read, but the reader does not see them: those files are not in S1's manifest. There is no race condition, no dirty read, no inconsistent intermediate state. The reader sees the table as a single moment in time.
Time travel reads an older metadata.json (Iceberg) or replays the transaction log to an earlier point (Delta). The data files referenced by that older snapshot must still exist, which they will until the retention policy expires them. Querying historical state takes the same machinery the planner uses for snapshot isolation; it points at a different snapshot id.
Operational Uses of Time Travel
Use Case
What It Enables
What To Watch For
Debug a bad pipeline run
Compare today's output to yesterday's at the row level
Retention must extend back to the bad run
Reproduce an ML training run
Train against the exact snapshot the original model used
Pin the snapshot id at training time and store it with the model
Audit historical reporting
Reproduce the exact numbers a regulator saw last quarter
Snapshots required for compliance must not be expired by retention
Roll back an accidental DELETE
Restore the table to a snapshot before the destructive write
Roll back is itself a write; storage cost briefly doubles
Compare two versions of a transform
Run the new transform against an old snapshot and diff the result
Both snapshots must remain in retention for the comparison
Optimistic Concurrency Control
Two writers committing to the same table at the same time race for the metadata pointer. Both read the current snapshot S1 as the parent. Both prepare their new manifests. Both attempt to commit a new metadata.json that names S1 as the parent. Object storage allows one of the writes to succeed first. The second writer's commit fails because the pointer no longer references S1 (it references the first writer's S2). The second writer reloads, computes its commit against S2 instead, and tries again. This is optimistic concurrency: assume no conflict, retry on conflict. It works well when conflicts are rare, which they are for most analytical workloads (writers usually touch different partitions). It works poorly under heavy contention on the same partition, which is why heavy-update workloads sometimes prefer Hudi's mutation-first design.
When optimistic concurrency degrades:
▸Many writers update the same partition simultaneously
▸Long-running writes hold a stale parent snapshot for minutes
▸Streaming jobs commit at high frequency and each commit blocks the next
▸Compaction and ingestion compete for the same partition's metadata
Retention and Cost
Snapshots are not free. Each snapshot retains the data files it references. A daily-write table with thirty days of retention holds roughly thirty snapshots' worth of overlapping data. Compaction may rewrite files; the old files cannot be deleted while a snapshot still references them. The retention policy is the dial that controls cost: longer retention means more storage, shorter retention means less time travel. Common settings are seven days for hot operational tables, ninety days for analytical tables that need debug-by-time-travel, and several years for tables under regulatory hold.
Snapshot isolation lets readers and writers operate on the same table without seeing each other's partial state.
Time travel reuses the same machinery: pin the metadata pointer at an older snapshot id.
Retention is the cost dial: longer retention means more storage and longer recovery windows.
Optimistic concurrency works when writers usually touch different partitions; degrades when many writers compete for the same partition.
Snapshot retention controls how far back time travel reaches and how much storage the table consumes.
Pinning a snapshot id at the start of a long-running job is the durable definition of reproducibility.
TIP
Pin snapshot ids in long-running pipelines and ML training jobs. A reproducible run is one whose input snapshot is named, not one whose input table happens to look the same when re-read.
Schema Evolution Without Rewrites
Daily Life
Interviews
Apply schema evolution at the table-format level: add, drop, rename, and widen columns without rewriting data files.
A real production table changes shape over time. Producers add fields. Old fields get renamed. Columns become obsolete and need to be dropped. In a plain lake, every shape change requires rewriting partitions or splitting into a new table. In an open table format, additive and renaming changes happen at the metadata level and the data files stay where they are. The cost is bytes of metadata, not bytes of data.
Metadata-only; data files retain bytes but readers ignore them
Reorder columns
Application-level concern only
Metadata-only; column ordering is logical
Change a column type (widening)
Rewrite all partitions
Metadata-only for safe widenings (int -> long, decimal precision up)
Change a column type (narrowing)
Lossy; usually new column instead
Disallowed without rewrite; force a new column for the new type
Why Renaming Is Hard in Plain Parquet
A plain Parquet file embeds the column name in the file's footer. A rename at the table level means the new name does not match the old name in any historical file. A reader either sees the old name in old files and the new name in new files (fragmenting the schema) or the table has to be rewritten so every file uses the new name. Iceberg solves this by giving every column a stable internal id; the schema maps human-readable names to these ids, and a rename changes only the schema mapping, not the data. Delta Lake's column-mapping mode does something equivalent. The architectural trick is the same: separate the logical schema from the physical column layout.
1
# Iceberg: rename a column at the metadata level
2
frompyiceberg.catalogimportload_catalog
3
4
catalog=load_catalog("prod")
5
table=catalog.load_table("curated.fct_orders")
6
7
# Rename customer_id -> account_id at the schema level only
8
withtable.update_schema()asupd:
9
upd.rename_column("customer_id","account_id")
10
11
# All historical Parquet files still have a column physically named
12
# customer_id, but Iceberg's schema mapping translates reads to account_id.
13
# The data files were not rewritten; the metadata was.
Adding a Column
Adding a nullable column is the safest schema change. The new column has no historical data. Old rows return NULL when read with the new schema. New writes populate the column going forward. In Iceberg, adding a column writes a single metadata commit and creates a new schema version. Existing data files are unchanged. A read against an old snapshot sees the old schema; a read against the current snapshot sees the new schema with NULLs for old rows. The semantics are clean and the operation is cheap.
Dropping a Column
Dropping a column at the metadata level is also cheap. The column id is removed from the current schema. Existing data files still contain the column's bytes, but readers no longer reference them. Future compactions will eventually rewrite files without the dropped column, reclaiming the bytes. The discipline is to plan for a delay between the logical drop and the physical reclaim. Most operators schedule a quarterly compaction cycle that purges dropped column data; storage stays within budget without requiring an immediate rewrite.
Type Widening Versus Type Change
Type changes split into two cases. Widening (int to long, decimal(10,2) to decimal(18,2)) is safe because every old value still fits the new type. Iceberg and Delta allow widening without rewriting data files; the schema declares the wider type, and reads upcast as needed. Narrowing (string to int, long to int when values exceed the int range) is lossy and disallowed at the metadata level. A real type change of this kind happens by adding a new column, populating it, dual-reading for a while, and finally dropping the old column. The expand-contract pattern from data modeling applies directly.
✓Safe Schema Changes
Add a nullable column
Rename a column (with id-mapping table format)
Drop a column
Widen an integer or decimal type
•Unsafe Schema Changes
Narrow a type or change a type incompatibly
Drop a column that downstream consumers still read
Rename a column without coordinating with consumers
Add a non-nullable column without a default value
Partition Evolution
Iceberg goes further than the others on a related dimension: it allows the partition spec itself to evolve. A table partitioned by month can be re-partitioned by day going forward without rewriting the historical data. Old data stays partitioned by month; new data lands partitioned by day. Queries that filter on day take advantage of day partitioning for new data and fall back to month-level pruning for old data. Delta and Hudi support partition changes through more involved migrations. The capability is rarely needed but invaluable when a table outgrows its original partition design at scale.
Schema evolution best practices:
▸Treat the schema as a contract; coordinate breaking changes with consumers
▸Add new columns nullable; populate them in a separate write
▸Use the expand-contract pattern for type changes that are not pure widenings
▸Audit schema changes via the snapshot history; every change is recorded
•Iceberg Approach
Stable column ids embedded in the table schema
Renames change only the schema mapping; data files unchanged
Hidden partitioning lets partition spec evolve independently of the schema
Engine neutrality is a design goal: same metadata works across Spark, Trino, Snowflake
✓Delta Approach
Column-mapping mode required to support physical-name renames
Default mode keeps physical and logical names aligned
Generated columns express derived partition values without separate spec
Strongest tooling on Databricks; broad open-source support since 2023
Schema evolution is the feature that turns table format adoption from a nice-to-have into a load-bearing capability. A table that needs a column added every quarter is a table that needs a metadata-driven format.
The Small Files Problem
Daily Life
Interviews
Diagnose the small files problem from file counts and sizes and run compaction or OPTIMIZE to restore query performance.
A 30-second streaming job writes a file every 30 seconds per partition. That is 2,880 files per partition per day. A daily ingestion that writes one large file per partition produces one. The two designs run the same SQL the same way, but the streaming version is dramatically slower because every read has to open thousands of tiny files instead of a few large ones. This is the small files problem, and it is the single most common operational headache in lake and lakehouse environments. The fix is compaction: periodically rewriting many small files into fewer large files, with no change to the table's logical contents.
Why Small Files Hurt
Cost
Per-File Overhead
Why Many Files Compounds
Object storage requests
Each open is a billable request
Thousands of files = thousands of billed requests per query
Metadata reads
Each Parquet footer has to be parsed
Footer reads dominate when files are small relative to footer size
Task scheduling
Each file is a separate task in Spark / Trino
Scheduler overhead exceeds actual work for small files
Compression efficiency
Per-file dictionary encoding is less effective on small batches
Total bytes on disk grow even though raw data is the same
Query planner cost
Manifest lists grow with file count
Iceberg / Delta metadata operations slow as manifests bloat
The Numbers
The sweet spot is files between 128 MB and 1 GB after compression. At 256 MB, a typical Spark task processes the file in one stage with good vectorization and the metadata overhead is a small fraction of the work. Below 64 MB, the metadata costs start to dominate. Below 16 MB, queries spend more time opening files than reading them. Streaming jobs that write tiny micro-batches at high frequency can produce hundreds of thousands of sub-megabyte files per day. Without compaction, the table becomes unqueryable within a quarter.
Compaction reads the small files in a partition, merges them into target-sized files, and commits a new snapshot in which the small files are replaced. The data is unchanged. The snapshot retains the old small files (so time travel still works) until they are vacuumed. Compaction is safe to run concurrently with reads thanks to snapshot isolation; readers see a consistent snapshot regardless of whether compaction is in progress.
Where the Small Files Come From
Source
Why It Produces Small Files
Mitigation
Streaming micro-batches
Trigger interval forces a write per batch per partition
Schedule hourly or daily compaction; tune trigger interval up if SLA permits
Partition cardinality too high
Each partition gets a small slice of each batch
Reduce partition cardinality; cluster on the high-cardinality column instead
Concurrent writers without coordination
Each writer produces its own files for the same partition
Single ingestion pipeline per table; or coordinate via merge
CDC ingestion of high-update workloads
Each captured change becomes a small change file
Periodic merge-on-read compaction; switch to copy-on-write for read-heavy tables
Backfills with many small batches
Backfill granularity exceeds final partition granularity
Run backfill into a staging table and bulk-rewrite into the production table
Compaction As an Operational Discipline
Compaction is not a one-time fix. A streaming pipeline accumulates small files continuously, and compaction has to run continuously as well. Production environments schedule compaction as a low-priority maintenance DAG: hourly for hot streaming tables, daily for batch-ingest tables, weekly or monthly for cold archival tables. The compaction DAG is its own pipeline, with its own monitoring, its own SLA, and its own cost line. Engineers who treat compaction as someone else's job watch their tables degrade in legibility and query speed over months.
Compaction operational checklist:
▸Set a target file size (typically 128 MB to 512 MB)
▸Schedule compaction at a cadence proportional to write rate (hourly for streams, daily for batch)
▸Monitor average and median file size per table; alert when median drops below 64 MB
▸Run vacuum or expire_snapshots to reclaim storage from rewritten files
▸Keep compaction in a separate compute pool from query workloads to avoid contention
Sorting and Z-Ordering During Compaction
Compaction is also the natural point to apply sorting. Rewriting files anyway means the cost of sorting is marginal. Iceberg supports a sort order on the table that compaction respects. Delta supports Z-ordering, a multi-dimensional clustering technique that improves pruning across multiple filter columns. The combination of compaction plus sort produces files that are not only the right size but also internally organized so that predicate pushdown prunes more aggressively. The same query that scanned 100 GB on unsorted files might scan 10 GB on sorted-and-compacted files.
1
TIP
Make compaction a first-class pipeline with its own DAG and SLA. A table whose compaction stops getting maintenance is a table whose query latency drifts upward by orders of magnitude over months.
Choosing Storage Across Workloads
Daily Life
Interviews
Design a multi-layer storage architecture that combines an operational store, a lakehouse archive, and a warehouse mart for a workload with multiple concurrent access patterns.
A real production system rarely has one workload. The example here is a financial services platform with three concurrent demands on the same logical data: a regulatory archive that must retain seven years of transactions, a customer-facing app that needs single-row lookups under 50 milliseconds, and an analytical BI workload that runs daily aggregations across the entire history. No single storage layer is correct for all three. The right answer is a multi-layer architecture in which each workload reads from the storage shape that matches it, and pipelines move data between the shapes as needed.
The Three Workloads
Workload
Access Pattern
Freshness Need
Regulatory archive
Bulk read of historical transactions for audit and compliance
Daily ingest is fine; reads happen quarterly
Customer-facing app: account history view
Single-row lookups by account_id, sub-50ms latency
Latest transaction must appear within seconds
Analytical BI: revenue and risk dashboards
Aggregations across millions to billions of rows by date and product
Daily by 7am for executive review; near-real-time for risk
The Storage Layer Per Workload
Operational store: DynamoDBOpen table format: Iceberg on S3Warehouse: Snowflake mart
Operational store: DynamoDB
App reads, single-row lookups
Single-digit-millisecond reads by account_id. Stores last 90 days of transactions for fast app access. Backed by a CDC stream that propagates writes to the lake.
Open table format: Iceberg on S3
Analytical and archival storage
Seven years of transactions in Parquet, partitioned by event_date, with snapshot isolation and time travel. One physical storage layer serves both BI and the regulatory archive.
Warehouse: Snowflake mart
BI dashboards, executive reports
Curated marts loaded from Iceberg via daily Spark transforms. Holds the last 18 months pre-aggregated for fast dashboard queries; older history queried directly against Iceberg.
Each workload reads from the layer that matches its access pattern. The customer-facing app reads from DynamoDB. The risk dashboard reads recent days from Iceberg directly because it needs sub-hour freshness and full row-level scans. The executive BI dashboard reads from Snowflake because it wants pre-aggregated marts. Compliance reads time-traveled snapshots of Iceberg. Pipelines move data from DynamoDB into Iceberg via CDC, and from Iceberg into Snowflake via daily Spark. Each pipeline has its own freshness SLA, its own retry semantics, and its own owner. The storage layers do not compete; they cooperate.
Compliance, BI, and risk teams use different engines
Iceberg has the broadest neutral engine support
Time travel for compliance
Auditors ask for historical state at a specified date
Iceberg's snapshot model exposes time travel cleanly via SQL
Schema evolution over seven years
Transaction shape changes every couple of years
Iceberg's id-based schema renames without rewriting
Concurrent CDC writers and analytical readers
Streaming writes from DynamoDB run alongside dashboard reads
Snapshot isolation handles the concurrency
Delta Lake or Hudi could each serve this workload too. The choice between the three is rarely about feature gaps in 2026; all three support the core ACID, time travel, and schema evolution. The selection is about ecosystem fit. A Databricks-heavy shop chooses Delta because Databricks tooling is most native there. An Uber-style high-mutation streaming shop chooses Hudi for its merge-on-read advantage. A multi-engine shop with Spark, Trino, Snowflake, and BigQuery often picks Iceberg for the engine neutrality. The decision rule is: pick the format that matches the dominant compute engine, then accept the format's quirks rather than trying to interoperate three formats simultaneously.
Connecting to Earlier Lessons
Lesson 1 introduced the four roles: source, transform, storage, consumer. The architecture above contains all four with the storage role expanded into three concrete shapes. Lesson 2 introduced the freshness tier analysis: real-time for the app via DynamoDB, sub-hour for risk via Iceberg streaming reads, daily for the BI dashboard via Snowflake. Mixing tiers is fine because each is labeled. The same logical transactions live in three storage shapes serving three freshness tiers, with the pipelines bearing the cost of keeping them in sync. The bridge from the simple beginner picture (one storage layer per pipeline) to the production reality (multiple storage layers serving different workloads) is the open table format that makes one of the layers do double duty as both archive and analytical store.
•Three Separate Stores
Three separate copies of the transaction history
Three separate ingestion pipelines
Three different schema evolution policies
Reconciliation across stores is a perpetual problem
✓Lakehouse Plus Operational Plus Warehouse
Operational copy in DynamoDB for the app
Single Iceberg archive serving compliance, risk, and analytical preprocessing
Snowflake mart derived from Iceberg for BI consumers
Snapshot isolation guarantees the layers stay coherent
Decision rules for picking storage in a multi-workload system:
▸Application workloads with sub-50ms latency belong in an operational store, not a lake
▸Analytical workloads with aggregations belong in a columnar layer (warehouse or lakehouse)
▸Bulk historical retention belongs in object storage with a table format on top
▸Multi-engine analytical access pushes the choice toward Iceberg
▸Heavy mutation rate via CDC pushes the choice toward Hudi or copy-on-write strategies
What Goes Wrong Without This Discipline
The failure mode is using one storage layer for everything. The team standardized on Snowflake for all data three years ago. The app's account-history view now hits Snowflake on every request, and average latency is 800 milliseconds. Compliance scans seven years of transactions on warehouse compute and the bill is dominated by audit queries. Real-time risk runs on a 15-minute schedule because Snowflake cannot serve sub-minute freshness affordably. The fix is not abandoning Snowflake; it is splitting the workloads onto storage layers that match. DynamoDB takes the app traffic, Iceberg takes the archive and the risk reads, Snowflake keeps serving BI. The pipelines that move data between them become the operational backbone, and each layer pays only for its own job.
✓Do
Map each workload to a storage shape based on access pattern, not historical preference
Use one open table format consistently rather than mixing Iceberg, Delta, and Hudi without reason
Treat the operational copy as derived from the lake archive, not the source of truth
✗Don't
Force application traffic onto a warehouse with 100ms latency
Force analytical scans onto an operational database with row storage
Maintain three separate copies of transaction history when one Iceberg archive can serve archive, audit, and analytical scans
❯❯❯PUTTING IT ALL TOGETHER
> A fintech platform serves three concurrent workloads against the same logical transactions: a customer-facing account view at sub-50ms latency, a daily executive BI dashboard, and a seven-year regulatory archive that auditors query quarterly. The current architecture loads everything into Snowflake, the app feels slow, audit costs are high, and a streaming risk service is being added at sub-hour freshness. The new staff data engineer is asked: 'What is the smallest architecture that makes all four workloads operable, and what does each piece actually do?'
Step one: keep DynamoDB or Postgres as the operational store for the customer-facing app. Sub-50ms reads by account_id are an OLTP access pattern, and Lesson 1's four roles place this layer as a source. The warehouse cannot meet this latency at any reasonable cost.
Step two: introduce an Iceberg lakehouse on S3 as the durable archive. Snapshot isolation, time travel, and schema evolution turn seven years of Parquet into a queryable transactional table. Compliance reads time-traveled snapshots; risk reads recent partitions directly. The intermediate tier's columnar layout, partitioning, and predicate pushdown all apply, with compaction scheduled hourly to manage the small files the streaming ingest creates.
Step three: keep Snowflake but narrow its job to BI marts. A daily Spark transform reads the most recent partitions from Iceberg and produces pre-aggregated tables for the executive dashboard. The warehouse stops being the universal store and becomes the analytical serving layer it is shaped for. Lesson 2's freshness tier analysis applies: this layer is daily, the risk service is sub-hour off Iceberg directly, and the app is real-time off the operational store.
Step four: connect the layers with explicit pipelines. CDC streams from the operational store into Iceberg via Kinesis or Kafka. A daily Spark job builds the Snowflake mart from Iceberg. Each pipeline has its own DAG, its own owner, and its own SLA, in the pipeline-as-product sense from Lesson 1's advanced tier. The compaction DAG is itself a first-class pipeline that runs hourly to keep file sizes near 256 MB.
Step five: validate with concrete numbers. App latency drops from 800 ms to under 50 ms because the access path no longer goes through warehouse SQL. Audit query cost drops because Iceberg time travel reads only the snapshots auditors ask for, not the live warehouse. Risk freshness improves because Iceberg accepts streaming writes and snapshot isolation lets the risk reader work concurrently. The system that violated the storage-shape rule for everything now matches each workload to its shape, and the pipelines bear the cost of keeping them coherent.
The bridge move worth remembering: the storage layer is not a tool choice; it is a physics choice. Pick the shape that matches the access pattern, build the pipelines that move data between shapes, and the rest of the architecture follows.
KEY TAKEAWAYS
Open table formats put ACID on object storage: Iceberg, Delta, and Hudi turn a directory of Parquet into a transactional table via a metadata layer.
Snapshot isolation and time travel share machinery: every commit creates an immutable snapshot, and any historical state is reachable by snapshot id or timestamp.
Schema evolution lives in the metadata, not the data files: add, rename, drop, or widen columns at the metadata level without rewriting partitions.
The small files problem demands ongoing compaction: streaming jobs create thousands of tiny files per partition; OPTIMIZE or rewrite_data_files keeps median file size near 256 MB.
Multi-workload systems use multiple storage layers: operational store for app traffic, lakehouse for archive and analytical scans, warehouse for BI marts; pipelines keep them coherent.
Storage Layers and Table Formats: Advanced
Open table formats turn object storage into a transactional database without giving up the lake
Category
Pipeline Architecture
Difficulty
advanced
Duration
38 minutes
Challenges
0 hands-on challenges
Topics covered: The Lakehouse: ACID on Object, Snapshot Isolation and Time Travel, Schema Evolution Without Rewrites, The Small Files Problem, Choosing Storage Across Workloads
The lakehouse is a marketing term that names a real architectural shift. The shift is the addition of a metadata layer on top of files in object storage that provides the consistency guarantees a database has and a folder of files lacks. Iceberg, Delta Lake, and Apache Hudi are three implementations of the same idea. The data files are still Parquet (or ORC). The folders look mostly the same. The difference is a small set of metadata files that turn the directory of Parquet into a transactional
Snapshot isolation is the consistency guarantee that turns a table format into a real table. A reader sees a consistent point-in-time view of the table, even when writers are landing new data concurrently. Time travel is the operational use of the same machinery: read the table as it existed at a previous snapshot. The two features come from the same underlying mechanism, which is that the table's state is defined by an immutable chain of snapshots and the metadata pointer that names which snaps
A real production table changes shape over time. Producers add fields. Old fields get renamed. Columns become obsolete and need to be dropped. In a plain lake, every shape change requires rewriting partitions or splitting into a new table. In an open table format, additive and renaming changes happen at the metadata level and the data files stay where they are. The cost is bytes of metadata, not bytes of data. Operations and Their Costs Why Renaming Is Hard in Plain Parquet A plain Parquet file
A 30-second streaming job writes a file every 30 seconds per partition. That is 2,880 files per partition per day. A daily ingestion that writes one large file per partition produces one. The two designs run the same SQL the same way, but the streaming version is dramatically slower because every read has to open thousands of tiny files instead of a few large ones. This is the small files problem, and it is the single most common operational headache in lake and lakehouse environments. The fix i
A real production system rarely has one workload. The example here is a financial services platform with three concurrent demands on the same logical data: a regulatory archive that must retain seven years of transactions, a customer-facing app that needs single-row lookups under 50 milliseconds, and an analytical BI workload that runs daily aggregations across the entire history. No single storage layer is correct for all three. The right answer is a multi-layer architecture in which each workl