Storage Layers and Table Formats: Advanced

Netflix had a problem at sufficient scale that no existing system solved it. The data lake at petabyte scale gave them cheap storage and flexible compute. The lack of transactional semantics meant that two writers landing files in the same partition could produce incoherent reads. Schema changes required rewriting partitions or maintaining shadow tables. A query engine could see partial writes if it listed files at the wrong moment. Netflix open-sourced Iceberg in 2018 to solve these problems, Databricks released Delta Lake the same year, and Uber released Hudi shortly before. The three formats look different on the surface and converge on the same architectural idea: put a metadata layer on top of object storage that provides snapshot isolation, schema evolution, and time travel without giving up the lake's economics. This lesson is about that metadata layer, why it is the actual product, and how to operate it under load. The framing connects to Lesson 1's pipeline-as-product framing and Lesson 2's freshness tier analysis: open table formats are the storage technology that lets one set of tables serve batch consumers, streaming consumers, and ad-hoc analytical consumers without splitting into separate stores.

The Lakehouse: ACID on Object

Daily Life
Interviews

Recognize the lakehouse as object storage plus a metadata layer and explain why the metadata is the actual product.

The lakehouse is a marketing term that names a real architectural shift. The shift is the addition of a metadata layer on top of files in object storage that provides the consistency guarantees a database has and a folder of files lacks. Iceberg, Delta Lake, and Apache Hudi are three implementations of the same idea. The data files are still Parquet (or ORC). The folders look mostly the same. The difference is a small set of metadata files that turn the directory of Parquet into a transactional table.

What the Metadata Layer Adds

CapabilityPlain LakeLakehouse Table Format
Atomic writes across filesBest-effort: a writer fails halfway leaves partial filesAtomic: either the snapshot commits or it does not
Snapshot isolation between readers and writersReaders can see partial writes mid-operationReaders see a consistent snapshot regardless of concurrent writes
Time travel to historical statePossible only if old files were never overwrittenFirst-class: every snapshot is reachable by version or timestamp
Schema evolution without rewritesAdding a column means rewriting partitions or splitting tablesAdd, drop, rename at the metadata level; data files unchanged
Row-level updates and deletesPossible only by rewriting whole partitionsSupported via merge-on-read or copy-on-write strategies
Concurrent writersLast writer wins; data corruption likelyOptimistic concurrency: conflicting writers fail and retry

How an Iceberg Table Is Laid Out

1s3 : / / lake / curated / fct_orders / data / < - Parquet data files event_date = 2026 - 04 - 22 / 00000 - 1 - aaaa.parquet 00001 - 2 - bbbb.parquet metadata / v1.metadata.json < - TABLE metadata, v1 v2.metadata.json < - TABLE metadata, v2(after a write) v3.metadata.json < - v3, etc.snap - 1234. avro < - snapshot manifest list snap - 1235. avro abc - m0.avro < - manifest files(which data files are live) abc - m1.avro
Each metadata.json file describes the table at a moment in time: schema, partition spec, and pointer to the current snapshot. Each snapshot points to a list of manifest files. Each manifest file lists the data files that are part of that snapshot, with their statistics. A read that wants the current state reads the latest metadata, then the snapshot, then the manifests, then the data files. A read that wants a historical state reads an older metadata.json instead. The data files themselves are usually unchanged across snapshots; the metadata is what evolves.

The Three Major Open Formats

Apache IcebergDelta LakeApache Hudi
Apache Iceberg
Snapshot-based, vendor-neutral
Originated at Netflix in 2017. Snapshot manifests in Avro. Partition evolution and hidden partitioning. Strong neutrality across compute engines (Spark, Trino, Snowflake, Athena, BigQuery, Flink).
Delta Lake
Transaction-log based, Databricks-led
Originated at Databricks in 2017. Transaction log in JSON files (_delta_log). Strong tooling on Databricks; broader open-source support has grown rapidly since 2022.
Apache Hudi
Mutation-first, streaming-oriented
Originated at Uber in 2017. First-class support for upserts and deletes via merge-on-read. Strongest fit for streaming ingest and CDC workloads where rows mutate frequently.

Why the Metadata Is the Product

From a distance, all three formats store Parquet files in object storage. The differences live entirely in the metadata layer. The metadata is what enables atomic writes, isolation, time travel, and schema evolution. Without the metadata, the data files are still Parquet but the table behavior is what a directory of Parquet has always been: best-effort, with all the operational headaches that implies. The marketing focuses on the formats; the engineering substance is the metadata. Engineers who understand which metadata operations are cheap and which are expensive operate these tables at scale; engineers who only understand the data files run into surprises.
Operations that touch metadata only (cheap):
  • Adding a column to the schema
  • Renaming a column at the schema level
  • Tagging a snapshot for time travel reference
  • Setting a retention policy on old snapshots

What Plain Object Storage Cannot Do

S3, GCS, and Azure Data Lake Storage are eventually consistent on listings (less so today than five years ago, but the model still permeates the API). They have no notion of multi-object atomicity. A writer that uploads ten files to a directory cannot make all ten visible at the same instant; readers can see five of them while the upload is in progress. A table format hides this by recording 'these ten files are now part of snapshot N' in a single metadata.json upload. The single upload is atomic at the object level, and that atomicity propagates to the whole snapshot. The metadata layer is, in effect, a way to launder object-level atomicity into table-level atomicity.
1# Conceptual sketch of how Iceberg achieves atomic multi-file writes
2
3# Step 1: Write new data files (no readers see them yet)
4new_files = write_parquet_files(rows, partition='event_date=2026-04-22')
5
6# Step 2: Build a new manifest listing them
7manifest = build_manifest(new_files, stats=collect_stats(new_files))
8
9# Step 3: Build a new snapshot referencing the new manifest plus existing manifests
10new_snapshot = Snapshot(
11 parent_id=current_snapshot_id,
12 manifests=[manifest, *existing_manifests],
13 schema_id=current_schema_id
14)
15
16# Step 4: Write the new metadata.json that points at the new snapshot
17# This is the SINGLE atomic operation that commits the write
18write_metadata_json("v5.metadata.json", current_snapshot=new_snapshot)
19
20# Until step 4, no reader sees the new files. After step 4, every reader does.
Do
  • Choose an open table format for any analytical table that needs concurrent writers or schema evolution
  • Treat the metadata layer as the operational surface, not just the data files
  • Use hidden partitioning (Iceberg) or generated columns (Delta) so the partition logic survives schema evolution
Don't
  • Edit data files directly; the metadata loses track of them
  • Mix table-format-aware writers with plain object-store writers; the table state diverges
  • Assume all three formats interoperate fully today; check engine compatibility before standardizing
1
1

Snapshot Isolation and Time Travel

Daily Life
Interviews

Apply snapshot isolation and time travel to operate a table format under concurrent writers and reproduce historical state.

Snapshot isolation is the consistency guarantee that turns a table format into a real table. A reader sees a consistent point-in-time view of the table, even when writers are landing new data concurrently. Time travel is the operational use of the same machinery: read the table as it existed at a previous snapshot. The two features come from the same underlying mechanism, which is that the table's state is defined by an immutable chain of snapshots and the metadata pointer that names which snapshot is current.

Snapshot Isolation in Practice

1TIME Writer Reader 10 : 00 Reads metadata 10 : 00 metadata points AT snapshot S1 10 : 00 : 01 Begins to write batch 10 : 00 : 02 Issues query(planner sees S1) 10 : 00 : 03 Writes 12 new files Continues scanning S1 's files 10:00:04 Reaches a partition affected by writer 10:00:04 Reads the OLD files in S1, not the new ones 10:00:05 Atomic commit -> S2 10:00:06 Done Still scanning S1 10:00:07 Finishes; result is consistent with S1'
The reader's planner pinned the snapshot at the start of the query. Every file read during execution comes from S1's manifest list. The writer landed new files and committed S2 in the middle of the read, but the reader does not see them: those files are not in S1's manifest. There is no race condition, no dirty read, no inconsistent intermediate state. The reader sees the table as a single moment in time.

Time Travel

1SELECT *
2FROM curated.fct_orders FOR VERSION AS OF 1683847291 ;
3
4SELECT *
5FROM curated.fct_orders FOR TIMESTAMP AS OF '2026-04-22 06:00:00' ;
6
7
8SELECT *
9FROM curated.fct_orders VERSION AS OF 47 ;
10
11SELECT *
12FROM curated.fct_orders TIMESTAMP AS OF '2026-04-22 06:00:00' ;
Time travel reads an older metadata.json (Iceberg) or replays the transaction log to an earlier point (Delta). The data files referenced by that older snapshot must still exist, which they will until the retention policy expires them. Querying historical state takes the same machinery the planner uses for snapshot isolation; it points at a different snapshot id.

Operational Uses of Time Travel

Use CaseWhat It EnablesWhat To Watch For
Debug a bad pipeline runCompare today's output to yesterday's at the row levelRetention must extend back to the bad run
Reproduce an ML training runTrain against the exact snapshot the original model usedPin the snapshot id at training time and store it with the model
Audit historical reportingReproduce the exact numbers a regulator saw last quarterSnapshots required for compliance must not be expired by retention
Roll back an accidental DELETERestore the table to a snapshot before the destructive writeRoll back is itself a write; storage cost briefly doubles
Compare two versions of a transformRun the new transform against an old snapshot and diff the resultBoth snapshots must remain in retention for the comparison

Optimistic Concurrency Control

Two writers committing to the same table at the same time race for the metadata pointer. Both read the current snapshot S1 as the parent. Both prepare their new manifests. Both attempt to commit a new metadata.json that names S1 as the parent. Object storage allows one of the writes to succeed first. The second writer's commit fails because the pointer no longer references S1 (it references the first writer's S2). The second writer reloads, computes its commit against S2 instead, and tries again. This is optimistic concurrency: assume no conflict, retry on conflict. It works well when conflicts are rare, which they are for most analytical workloads (writers usually touch different partitions). It works poorly under heavy contention on the same partition, which is why heavy-update workloads sometimes prefer Hudi's mutation-first design.
When optimistic concurrency degrades:
  • Many writers update the same partition simultaneously
  • Long-running writes hold a stale parent snapshot for minutes
  • Streaming jobs commit at high frequency and each commit blocks the next
  • Compaction and ingestion compete for the same partition's metadata

Retention and Cost

Snapshots are not free. Each snapshot retains the data files it references. A daily-write table with thirty days of retention holds roughly thirty snapshots' worth of overlapping data. Compaction may rewrite files; the old files cannot be deleted while a snapshot still references them. The retention policy is the dial that controls cost: longer retention means more storage, shorter retention means less time travel. Common settings are seven days for hot operational tables, ninety days for analytical tables that need debug-by-time-travel, and several years for tables under regulatory hold.
1CALL system.expire_snapshots(TABLE = > 'curated.fct_orders', older_than = > TIMESTAMP '2026-01-01 00:00:00', retain_last = > 30) ; VACUUM curated.fct_orders RETAIN 168 HOURS ;
Snapshot isolation lets readers and writers operate on the same table without seeing each other's partial state.
Time travel reuses the same machinery: pin the metadata pointer at an older snapshot id.
Retention is the cost dial: longer retention means more storage and longer recovery windows.
Optimistic concurrency works when writers usually touch different partitions; degrades when many writers compete for the same partition.
Snapshot retention controls how far back time travel reaches and how much storage the table consumes.
Pinning a snapshot id at the start of a long-running job is the durable definition of reproducibility.
TIP
Pin snapshot ids in long-running pipelines and ML training jobs. A reproducible run is one whose input snapshot is named, not one whose input table happens to look the same when re-read.

Schema Evolution Without Rewrites

Daily Life
Interviews

Apply schema evolution at the table-format level: add, drop, rename, and widen columns without rewriting data files.

A real production table changes shape over time. Producers add fields. Old fields get renamed. Columns become obsolete and need to be dropped. In a plain lake, every shape change requires rewriting partitions or splitting into a new table. In an open table format, additive and renaming changes happen at the metadata level and the data files stay where they are. The cost is bytes of metadata, not bytes of data.

Operations and Their Costs

OperationCost in Plain Parquet LakeCost in Iceberg / Delta
Add a nullable columnRewrite all partitions or split tableMetadata-only; data files unchanged
Rename a columnRewrite all partitionsMetadata-only (Iceberg); Delta requires column-mapping mode
Drop a columnRewrite all partitionsMetadata-only; data files retain bytes but readers ignore them
Reorder columnsApplication-level concern onlyMetadata-only; column ordering is logical
Change a column type (widening)Rewrite all partitionsMetadata-only for safe widenings (int -> long, decimal precision up)
Change a column type (narrowing)Lossy; usually new column insteadDisallowed without rewrite; force a new column for the new type

Why Renaming Is Hard in Plain Parquet

A plain Parquet file embeds the column name in the file's footer. A rename at the table level means the new name does not match the old name in any historical file. A reader either sees the old name in old files and the new name in new files (fragmenting the schema) or the table has to be rewritten so every file uses the new name. Iceberg solves this by giving every column a stable internal id; the schema maps human-readable names to these ids, and a rename changes only the schema mapping, not the data. Delta Lake's column-mapping mode does something equivalent. The architectural trick is the same: separate the logical schema from the physical column layout.
1# Iceberg: rename a column at the metadata level
2from pyiceberg.catalog import load_catalog
3
4catalog = load_catalog("prod")
5table = catalog.load_table("curated.fct_orders")
6
7# Rename customer_id -> account_id at the schema level only
8with table.update_schema() as upd:
9 upd.rename_column("customer_id", "account_id")
10
11# All historical Parquet files still have a column physically named
12# customer_id, but Iceberg's schema mapping translates reads to account_id.
13# The data files were not rewritten; the metadata was.

Adding a Column

Adding a nullable column is the safest schema change. The new column has no historical data. Old rows return NULL when read with the new schema. New writes populate the column going forward. In Iceberg, adding a column writes a single metadata commit and creates a new schema version. Existing data files are unchanged. A read against an old snapshot sees the old schema; a read against the current snapshot sees the new schema with NULLs for old rows. The semantics are clean and the operation is cheap.

Dropping a Column

Dropping a column at the metadata level is also cheap. The column id is removed from the current schema. Existing data files still contain the column's bytes, but readers no longer reference them. Future compactions will eventually rewrite files without the dropped column, reclaiming the bytes. The discipline is to plan for a delay between the logical drop and the physical reclaim. Most operators schedule a quarterly compaction cycle that purges dropped column data; storage stays within budget without requiring an immediate rewrite.

Type Widening Versus Type Change

Type changes split into two cases. Widening (int to long, decimal(10,2) to decimal(18,2)) is safe because every old value still fits the new type. Iceberg and Delta allow widening without rewriting data files; the schema declares the wider type, and reads upcast as needed. Narrowing (string to int, long to int when values exceed the int range) is lossy and disallowed at the metadata level. A real type change of this kind happens by adding a new column, populating it, dual-reading for a while, and finally dropping the old column. The expand-contract pattern from data modeling applies directly.
Safe Schema Changes
  • Add a nullable column
  • Rename a column (with id-mapping table format)
  • Drop a column
  • Widen an integer or decimal type
Unsafe Schema Changes
  • Narrow a type or change a type incompatibly
  • Drop a column that downstream consumers still read
  • Rename a column without coordinating with consumers
  • Add a non-nullable column without a default value

Partition Evolution

Iceberg goes further than the others on a related dimension: it allows the partition spec itself to evolve. A table partitioned by month can be re-partitioned by day going forward without rewriting the historical data. Old data stays partitioned by month; new data lands partitioned by day. Queries that filter on day take advantage of day partitioning for new data and fall back to month-level pruning for old data. Delta and Hudi support partition changes through more involved migrations. The capability is rarely needed but invaluable when a table outgrows its original partition design at scale.
Schema evolution best practices:
  • Treat the schema as a contract; coordinate breaking changes with consumers
  • Add new columns nullable; populate them in a separate write
  • Use the expand-contract pattern for type changes that are not pure widenings
  • Audit schema changes via the snapshot history; every change is recorded
Iceberg Approach
  • Stable column ids embedded in the table schema
  • Renames change only the schema mapping; data files unchanged
  • Hidden partitioning lets partition spec evolve independently of the schema
  • Engine neutrality is a design goal: same metadata works across Spark, Trino, Snowflake
Delta Approach
  • Column-mapping mode required to support physical-name renames
  • Default mode keeps physical and logical names aligned
  • Generated columns express derived partition values without separate spec
  • Strongest tooling on Databricks; broad open-source support since 2023

Schema evolution is the feature that turns table format adoption from a nice-to-have into a load-bearing capability. A table that needs a column added every quarter is a table that needs a metadata-driven format.

The Small Files Problem

Daily Life
Interviews

Diagnose the small files problem from file counts and sizes and run compaction or OPTIMIZE to restore query performance.

A 30-second streaming job writes a file every 30 seconds per partition. That is 2,880 files per partition per day. A daily ingestion that writes one large file per partition produces one. The two designs run the same SQL the same way, but the streaming version is dramatically slower because every read has to open thousands of tiny files instead of a few large ones. This is the small files problem, and it is the single most common operational headache in lake and lakehouse environments. The fix is compaction: periodically rewriting many small files into fewer large files, with no change to the table's logical contents.

Why Small Files Hurt

CostPer-File OverheadWhy Many Files Compounds
Object storage requestsEach open is a billable requestThousands of files = thousands of billed requests per query
Metadata readsEach Parquet footer has to be parsedFooter reads dominate when files are small relative to footer size
Task schedulingEach file is a separate task in Spark / TrinoScheduler overhead exceeds actual work for small files
Compression efficiencyPer-file dictionary encoding is less effective on small batchesTotal bytes on disk grow even though raw data is the same
Query planner costManifest lists grow with file countIceberg / Delta metadata operations slow as manifests bloat

The Numbers

The sweet spot is files between 128 MB and 1 GB after compression. At 256 MB, a typical Spark task processes the file in one stage with good vectorization and the metadata overhead is a small fraction of the work. Below 64 MB, the metadata costs start to dominate. Below 16 MB, queries spend more time opening files than reading them. Streaming jobs that write tiny micro-batches at high frequency can produce hundreds of thousands of sub-megabyte files per day. Without compaction, the table becomes unqueryable within a quarter.

Compaction Mechanics

1CALL system.rewrite_data_files(TABLE = > 'curated.fct_events', options = > MAP('target-file-size-bytes', '268435456', 'min-input-files', '5', 'rewrite-all', 'false')) ; OPTIMIZE curated.fct_events
2
3
4WHERE event_date >= CURRENT_DATE() - INTERVAL 7 DAYS ; OPTIMIZE curated.fct_events ZORDER BY(user_id) ;
Compaction reads the small files in a partition, merges them into target-sized files, and commits a new snapshot in which the small files are replaced. The data is unchanged. The snapshot retains the old small files (so time travel still works) until they are vacuumed. Compaction is safe to run concurrently with reads thanks to snapshot isolation; readers see a consistent snapshot regardless of whether compaction is in progress.

Where the Small Files Come From

SourceWhy It Produces Small FilesMitigation
Streaming micro-batchesTrigger interval forces a write per batch per partitionSchedule hourly or daily compaction; tune trigger interval up if SLA permits
Partition cardinality too highEach partition gets a small slice of each batchReduce partition cardinality; cluster on the high-cardinality column instead
Concurrent writers without coordinationEach writer produces its own files for the same partitionSingle ingestion pipeline per table; or coordinate via merge
CDC ingestion of high-update workloadsEach captured change becomes a small change filePeriodic merge-on-read compaction; switch to copy-on-write for read-heavy tables
Backfills with many small batchesBackfill granularity exceeds final partition granularityRun backfill into a staging table and bulk-rewrite into the production table

Compaction As an Operational Discipline

Compaction is not a one-time fix. A streaming pipeline accumulates small files continuously, and compaction has to run continuously as well. Production environments schedule compaction as a low-priority maintenance DAG: hourly for hot streaming tables, daily for batch-ingest tables, weekly or monthly for cold archival tables. The compaction DAG is its own pipeline, with its own monitoring, its own SLA, and its own cost line. Engineers who treat compaction as someone else's job watch their tables degrade in legibility and query speed over months.
Compaction operational checklist:
  • Set a target file size (typically 128 MB to 512 MB)
  • Schedule compaction at a cadence proportional to write rate (hourly for streams, daily for batch)
  • Monitor average and median file size per table; alert when median drops below 64 MB
  • Run vacuum or expire_snapshots to reclaim storage from rewritten files
  • Keep compaction in a separate compute pool from query workloads to avoid contention

Sorting and Z-Ordering During Compaction

Compaction is also the natural point to apply sorting. Rewriting files anyway means the cost of sorting is marginal. Iceberg supports a sort order on the table that compaction respects. Delta supports Z-ordering, a multi-dimensional clustering technique that improves pruning across multiple filter columns. The combination of compaction plus sort produces files that are not only the right size but also internally organized so that predicate pushdown prunes more aggressively. The same query that scanned 100 GB on unsorted files might scan 10 GB on sorted-and-compacted files.
1
TIP
Make compaction a first-class pipeline with its own DAG and SLA. A table whose compaction stops getting maintenance is a table whose query latency drifts upward by orders of magnitude over months.

Choosing Storage Across Workloads

Daily Life
Interviews

Design a multi-layer storage architecture that combines an operational store, a lakehouse archive, and a warehouse mart for a workload with multiple concurrent access patterns.

A real production system rarely has one workload. The example here is a financial services platform with three concurrent demands on the same logical data: a regulatory archive that must retain seven years of transactions, a customer-facing app that needs single-row lookups under 50 milliseconds, and an analytical BI workload that runs daily aggregations across the entire history. No single storage layer is correct for all three. The right answer is a multi-layer architecture in which each workload reads from the storage shape that matches it, and pipelines move data between the shapes as needed.

The Three Workloads

WorkloadAccess PatternFreshness Need
Regulatory archiveBulk read of historical transactions for audit and complianceDaily ingest is fine; reads happen quarterly
Customer-facing app: account history viewSingle-row lookups by account_id, sub-50ms latencyLatest transaction must appear within seconds
Analytical BI: revenue and risk dashboardsAggregations across millions to billions of rows by date and productDaily by 7am for executive review; near-real-time for risk

The Storage Layer Per Workload

Operational store: DynamoDBOpen table format: Iceberg on S3Warehouse: Snowflake mart
Operational store: DynamoDB
App reads, single-row lookups
Single-digit-millisecond reads by account_id. Stores last 90 days of transactions for fast app access. Backed by a CDC stream that propagates writes to the lake.
Open table format: Iceberg on S3
Analytical and archival storage
Seven years of transactions in Parquet, partitioned by event_date, with snapshot isolation and time travel. One physical storage layer serves both BI and the regulatory archive.
Warehouse: Snowflake mart
BI dashboards, executive reports
Curated marts loaded from Iceberg via daily Spark transforms. Holds the last 18 months pre-aggregated for fast dashboard queries; older history queried directly against Iceberg.

How the Pipeline Glues Them Together

1App writes transactions | v DynamoDB: hot 90 days | | DynamoDB Streams(CDC) -> Kinesis -> Spark Structured Streaming v Iceberg: 7-year archive on S3 | + | + | + Looker < - reads Snowflake mart_revenue Risk service < - reads Iceberg WITH sub - hour SLA Compliance < - reads Iceberg TIME - traveled to a regulator - specified DATE
Each workload reads from the layer that matches its access pattern. The customer-facing app reads from DynamoDB. The risk dashboard reads recent days from Iceberg directly because it needs sub-hour freshness and full row-level scans. The executive BI dashboard reads from Snowflake because it wants pre-aggregated marts. Compliance reads time-traveled snapshots of Iceberg. Pipelines move data from DynamoDB into Iceberg via CDC, and from Iceberg into Snowflake via daily Spark. Each pipeline has its own freshness SLA, its own retry semantics, and its own owner. The storage layers do not compete; they cooperate.

Choosing the Open Format

PropertyWhy It Mattered HereFormat Choice
Multi-engine reads (Spark, Trino, Snowflake, BigQuery)Compliance, BI, and risk teams use different enginesIceberg has the broadest neutral engine support
Time travel for complianceAuditors ask for historical state at a specified dateIceberg's snapshot model exposes time travel cleanly via SQL
Schema evolution over seven yearsTransaction shape changes every couple of yearsIceberg's id-based schema renames without rewriting
Concurrent CDC writers and analytical readersStreaming writes from DynamoDB run alongside dashboard readsSnapshot isolation handles the concurrency
Delta Lake or Hudi could each serve this workload too. The choice between the three is rarely about feature gaps in 2026; all three support the core ACID, time travel, and schema evolution. The selection is about ecosystem fit. A Databricks-heavy shop chooses Delta because Databricks tooling is most native there. An Uber-style high-mutation streaming shop chooses Hudi for its merge-on-read advantage. A multi-engine shop with Spark, Trino, Snowflake, and BigQuery often picks Iceberg for the engine neutrality. The decision rule is: pick the format that matches the dominant compute engine, then accept the format's quirks rather than trying to interoperate three formats simultaneously.

Connecting to Earlier Lessons

Lesson 1 introduced the four roles: source, transform, storage, consumer. The architecture above contains all four with the storage role expanded into three concrete shapes. Lesson 2 introduced the freshness tier analysis: real-time for the app via DynamoDB, sub-hour for risk via Iceberg streaming reads, daily for the BI dashboard via Snowflake. Mixing tiers is fine because each is labeled. The same logical transactions live in three storage shapes serving three freshness tiers, with the pipelines bearing the cost of keeping them in sync. The bridge from the simple beginner picture (one storage layer per pipeline) to the production reality (multiple storage layers serving different workloads) is the open table format that makes one of the layers do double duty as both archive and analytical store.
Three Separate Stores
  • Three separate copies of the transaction history
  • Three separate ingestion pipelines
  • Three different schema evolution policies
  • Reconciliation across stores is a perpetual problem
Lakehouse Plus Operational Plus Warehouse
  • Operational copy in DynamoDB for the app
  • Single Iceberg archive serving compliance, risk, and analytical preprocessing
  • Snowflake mart derived from Iceberg for BI consumers
  • Snapshot isolation guarantees the layers stay coherent
Decision rules for picking storage in a multi-workload system:
  • Application workloads with sub-50ms latency belong in an operational store, not a lake
  • Analytical workloads with aggregations belong in a columnar layer (warehouse or lakehouse)
  • Bulk historical retention belongs in object storage with a table format on top
  • Multi-engine analytical access pushes the choice toward Iceberg
  • Heavy mutation rate via CDC pushes the choice toward Hudi or copy-on-write strategies

What Goes Wrong Without This Discipline

The failure mode is using one storage layer for everything. The team standardized on Snowflake for all data three years ago. The app's account-history view now hits Snowflake on every request, and average latency is 800 milliseconds. Compliance scans seven years of transactions on warehouse compute and the bill is dominated by audit queries. Real-time risk runs on a 15-minute schedule because Snowflake cannot serve sub-minute freshness affordably. The fix is not abandoning Snowflake; it is splitting the workloads onto storage layers that match. DynamoDB takes the app traffic, Iceberg takes the archive and the risk reads, Snowflake keeps serving BI. The pipelines that move data between them become the operational backbone, and each layer pays only for its own job.
Do
  • Map each workload to a storage shape based on access pattern, not historical preference
  • Use one open table format consistently rather than mixing Iceberg, Delta, and Hudi without reason
  • Treat the operational copy as derived from the lake archive, not the source of truth
Don't
  • Force application traffic onto a warehouse with 100ms latency
  • Force analytical scans onto an operational database with row storage
  • Maintain three separate copies of transaction history when one Iceberg archive can serve archive, audit, and analytical scans
PUTTING IT ALL TOGETHER

> A fintech platform serves three concurrent workloads against the same logical transactions: a customer-facing account view at sub-50ms latency, a daily executive BI dashboard, and a seven-year regulatory archive that auditors query quarterly. The current architecture loads everything into Snowflake, the app feels slow, audit costs are high, and a streaming risk service is being added at sub-hour freshness. The new staff data engineer is asked: 'What is the smallest architecture that makes all four workloads operable, and what does each piece actually do?'

Step one: keep DynamoDB or Postgres as the operational store for the customer-facing app. Sub-50ms reads by account_id are an OLTP access pattern, and Lesson 1's four roles place this layer as a source. The warehouse cannot meet this latency at any reasonable cost.
Step two: introduce an Iceberg lakehouse on S3 as the durable archive. Snapshot isolation, time travel, and schema evolution turn seven years of Parquet into a queryable transactional table. Compliance reads time-traveled snapshots; risk reads recent partitions directly. The intermediate tier's columnar layout, partitioning, and predicate pushdown all apply, with compaction scheduled hourly to manage the small files the streaming ingest creates.
Step three: keep Snowflake but narrow its job to BI marts. A daily Spark transform reads the most recent partitions from Iceberg and produces pre-aggregated tables for the executive dashboard. The warehouse stops being the universal store and becomes the analytical serving layer it is shaped for. Lesson 2's freshness tier analysis applies: this layer is daily, the risk service is sub-hour off Iceberg directly, and the app is real-time off the operational store.
Step four: connect the layers with explicit pipelines. CDC streams from the operational store into Iceberg via Kinesis or Kafka. A daily Spark job builds the Snowflake mart from Iceberg. Each pipeline has its own DAG, its own owner, and its own SLA, in the pipeline-as-product sense from Lesson 1's advanced tier. The compaction DAG is itself a first-class pipeline that runs hourly to keep file sizes near 256 MB.
Step five: validate with concrete numbers. App latency drops from 800 ms to under 50 ms because the access path no longer goes through warehouse SQL. Audit query cost drops because Iceberg time travel reads only the snapshots auditors ask for, not the live warehouse. Risk freshness improves because Iceberg accepts streaming writes and snapshot isolation lets the risk reader work concurrently. The system that violated the storage-shape rule for everything now matches each workload to its shape, and the pipelines bear the cost of keeping them coherent.
The bridge move worth remembering: the storage layer is not a tool choice; it is a physics choice. Pick the shape that matches the access pattern, build the pipelines that move data between shapes, and the rest of the architecture follows.
KEY TAKEAWAYS
Open table formats put ACID on object storage: Iceberg, Delta, and Hudi turn a directory of Parquet into a transactional table via a metadata layer.
Snapshot isolation and time travel share machinery: every commit creates an immutable snapshot, and any historical state is reachable by snapshot id or timestamp.
Schema evolution lives in the metadata, not the data files: add, rename, drop, or widen columns at the metadata level without rewriting partitions.
The small files problem demands ongoing compaction: streaming jobs create thousands of tiny files per partition; OPTIMIZE or rewrite_data_files keeps median file size near 256 MB.
Multi-workload systems use multiple storage layers: operational store for app traffic, lakehouse for archive and analytical scans, warehouse for BI marts; pipelines keep them coherent.

Storage Layers and Table Formats: Advanced

Open table formats turn object storage into a transactional database without giving up the lake

Category
Pipeline Architecture
Difficulty
advanced
Duration
38 minutes
Challenges
0 hands-on challenges

Topics covered: The Lakehouse: ACID on Object, Snapshot Isolation and Time Travel, Schema Evolution Without Rewrites, The Small Files Problem, Choosing Storage Across Workloads

Lesson Sections

  1. The Lakehouse: ACID on Object (concepts: paLakehouse, paTableFormats, paIceberg, paDeltaLake)

    The lakehouse is a marketing term that names a real architectural shift. The shift is the addition of a metadata layer on top of files in object storage that provides the consistency guarantees a database has and a folder of files lacks. Iceberg, Delta Lake, and Apache Hudi are three implementations of the same idea. The data files are still Parquet (or ORC). The folders look mostly the same. The difference is a small set of metadata files that turn the directory of Parquet into a transactional

  2. Snapshot Isolation and Time Travel (concepts: paSnapshotIsolation, paTimeTravel, paOptimisticConcurrency)

    Snapshot isolation is the consistency guarantee that turns a table format into a real table. A reader sees a consistent point-in-time view of the table, even when writers are landing new data concurrently. Time travel is the operational use of the same machinery: read the table as it existed at a previous snapshot. The two features come from the same underlying mechanism, which is that the table's state is defined by an immutable chain of snapshots and the metadata pointer that names which snaps

  3. Schema Evolution Without Rewrites (concepts: paSchemaEvolution, paColumnMapping, paPartitionEvolution)

    A real production table changes shape over time. Producers add fields. Old fields get renamed. Columns become obsolete and need to be dropped. In a plain lake, every shape change requires rewriting partitions or splitting into a new table. In an open table format, additive and renaming changes happen at the metadata level and the data files stay where they are. The cost is bytes of metadata, not bytes of data. Operations and Their Costs Why Renaming Is Hard in Plain Parquet A plain Parquet file

  4. The Small Files Problem (concepts: paSmallFiles, paCompaction, paZOrdering)

    A 30-second streaming job writes a file every 30 seconds per partition. That is 2,880 files per partition per day. A daily ingestion that writes one large file per partition produces one. The two designs run the same SQL the same way, but the streaming version is dramatically slower because every read has to open thousands of tiny files instead of a few large ones. This is the small files problem, and it is the single most common operational headache in lake and lakehouse environments. The fix i

  5. Choosing Storage Across Workloads (concepts: paStorageSelection, paLakehouse, paMultiLayerStorage)

    A real production system rarely has one workload. The example here is a financial services platform with three concurrent demands on the same logical data: a regulatory archive that must retain seven years of transactions, a customer-facing app that needs single-row lookups under 50 milliseconds, and an analytical BI workload that runs daily aggregations across the entire history. No single storage layer is correct for all three. The right answer is a multi-layer architecture in which each workl