Data Lakehouse Architecture

The lakehouse merges the warehouse and the lake into a single architecture. Store everything in open formats on cheap object storage. Get ACID transactions, schema enforcement, and warehouse query performance through table formats like Delta Lake, Apache Iceberg, and Apache Hudi. No more copying data between a lake and a warehouse. No more maintaining two systems with different access patterns.

What this guide actually says

A lakehouse is Parquet on object storage plus a transaction log — everything else is implementation detail. Delta Lake won the migration; Iceberg is winning the greenfield. Hudi is what you pick when streaming upserts at sub-minute SLA matter more than ecosystem. ACID on object storage is real and performance is real; operability is uneven — compaction cadence, manifest growth, VACUUM windows, and concurrent writer coordination are where production lakehouses bleed. The medallion pattern is a contract (bronze = replayable, silver = queryable, gold = trustworthy), not an architecture. All three formats give snapshot isolation, not serializability — be precise.

2021
Databricks paper
27
System design rounds analyzed
172
L6 staff questions surveyed
1,042
Interviews in the dataset

Why the lakehouse exists

Data lakes gave us cheap storage and schema flexibility but no transactions, no enforcement, and poor query performance. Data warehouses gave us ACID transactions and fast queries but at higher cost, with vendor lock-in, and with the need to load data from the lake into the warehouse.

The two-tier architecture (lake + warehouse) means data gets copied, pipelines multiply, and engineers maintain two systems. Raw data lands in S3. ETL jobs transform it and load it into Snowflake or Redshift. Analysts query the warehouse. Data scientists go back to the lake because the warehouse doesn't have the raw data they need. This loop creates cost, latency, and inconsistency.

The lakehouse eliminates this by adding warehouse capabilities directly to the lake. Open table formats provide the metadata layer that enables ACID, time travel, schema evolution, and efficient query planning on files in object storage. One copy of the data serves both BI and data science.

Prepare for the interview
01 / Open invite
02min.

Know data lakehouse the way the interviewer who asks it knows it.

a data lakehouse query, the same shape a screen would give you.
The diff against expected. Where ties broke. What you missed.
sandbox
1fact_orders
2 order_id bigint PK
3 customer_sk bigint FK
4 order_date date SCD2
5
Execute your solution0.4s avg.
StripeInterview question
Solve a data lakehouse problem
Prepare for the interview
03 / From the bank02 of many
02hand-picked.

Marketplace Sales Warehouse

Hard40 min

No schema given. The interviewer is watching.

Pulled from debriefs where modeling rounds went sideways.

Lake vs warehouse vs lakehouse

The trade-offs that drove the lakehouse design, in one row-by-row read.

DimensionData lakeData warehouseLakehouse
Storage costLow (object storage)High (managed warehouse)Low (object storage)
Schema enforcementNoneStrictStrict via table format
ACID transactionsNoYesYes via metadata layer
Query performanceSlow without tuningFast, vendor tunedFast with compaction and pruning
Open formatYes (Parquet, ORC)Often proprietaryYes (Delta, Iceberg, Hudi)
Best forRaw archives, ML feature storesBI dashboards, high concurrencyUnified BI, ML, ad-hoc

How the lakehouse works

Three layers stack to give you ACID guarantees on object storage.

Layer 1

Storage layer

Object storage (S3, GCS, ADLS) holding data files in columnar formats (Parquet, ORC). Cheap, durable, decoupled from compute.

Layer 2

Table format layer

Delta Lake, Iceberg, or Hudi adds a metadata layer that tracks file-level details: which files belong to each table version, column statistics, partition information, transaction logs.

Layer 3

Query engine layer

Spark, Trino, Databricks SQL, Athena, StarRocks. The engine reads the metadata to plan queries efficiently and reads only the files needed.

What the metadata layer unlocks

ACID transactions

Optimistic concurrency control. Writers create new data files and atomically update the metadata to point to them. Readers see a consistent snapshot. Failed writes leave no partial data.

Time travel

Each write creates a new table version. Query any previous version by timestamp or version number. Enables reproducible analytics, debugging, and audit.

Schema evolution

Add, rename, or reorder columns without rewriting existing data files. The metadata layer maps old files to the new schema using column IDs (Iceberg) or schema versioning (Delta).

Partition evolution

Iceberg supports changing the partition scheme without rewriting data. Start with daily partitions; switch to hourly when volume increases. Old files keep their original partitioning; new files use the new scheme.

The transaction log is the whole point

Strip everything else and the lakehouse is one idea: Parquet plus a small, ordered, atomically updated log of which files belong to the current snapshot. Three formats, three implementations.

Delta's _delta_log

Every commit lands as a numbered JSON file (000...0.json, 000...1.json, ...) inside _delta_log next to the data. Each file lists Add and Remove actions plus optional Metadata, Protocol, and CommitInfo entries. Every ten commits Delta writes a Parquet checkpoint summarizing state so readers don't replay the full log. A reader lists _delta_log, finds the latest checkpoint, replays JSON entries after it, and knows which files belong to the current snapshot.

Iceberg's metadata.json + manifest list

A single current metadata.json is the entry point. It records schema, partition spec, sort order, and pointers to each historical snapshot. Each snapshot points at a manifest list. Each manifest list enumerates manifests. Each manifest enumerates data files with column-level min/max stats. Three levels of indirection means the planner can prune entire manifest lists without opening a Parquet footer. The expensive part is keeping the tree healthy; the cheap part is querying it.

Hudi's timeline

Hudi stores its transaction record as a sequence of instants under .hoodie/timeline. Each instant has a state (REQUESTED, INFLIGHT, COMPLETED) and an action (commit, deltacommit, compaction, clean, rollback). The instant carries metadata for the action so a reader can reconstruct table state at any point. This makes Hudi feel like a stream of changes rather than a series of snapshots, and is why incremental queries are first-class.

The catalog holds the current pointer

For Delta the table location plus _delta_log is enough. For Iceberg the catalog (Glue, Nessie, Polaris, Unity, Hive Metastore, REST) holds the pointer to the current metadata.json and is what makes atomic commits possible. Lose the catalog without backups and you've lost the table even if every Parquet file is still in S3.

Delta Lake, Iceberg, and Hudi

Three open table formats compete for the lakehouse standard. Each started at a different company with different priorities. All three have converged on similar feature sets, but their architectures and ecosystems differ.

Databricks

Delta Lake

Created by Databricks. JSON-based transaction log (_delta_log) records every change. Periodic checkpoints (Parquet summaries) speed up reads. Default table format on Databricks; deepest Spark integration. Key features: ACID, time travel, Z-ordering for data skipping, OPTIMIZE for file compaction, schema enforcement and evolution, Change Data Feed for CDC consumers.

Netflix

Apache Iceberg

Created at Netflix. Tree of metadata files: a metadata file points to manifest lists, which point to manifests, which point to data files. The tree enables fast query planning on tables with millions of files because the engine reads only the metadata it needs. Key features: ACID, time travel, hidden partitioning, partition evolution without rewriting data, column-level stats in manifests for fine-grained file pruning. Broadest engine support: Spark, Trino, Flink, Snowflake, BigQuery, Athena, Dremio, StarRocks.

Uber

Apache Hudi

Created at Uber. Designed for incremental data processing and upserts. Tables can be Copy-on-Write (best for read-heavy) or Merge-on-Read (best for write-heavy with streaming ingestion). Key features: ACID, time travel, record-level upserts and deletes, incremental queries (read only changed records since a timestamp), built-in CDC support.

Delta vs Iceberg vs Hudi: how interviewers grade your answer

Interviewers aren't looking for a winner. They're looking for whether you can name the actual axes of difference and pick correctly for a given constraint.

AxisDelta LakeIcebergHudi
Origin and governanceDatabricks; donated to Linux FoundationNetflix; top-level Apache projectUber; top-level Apache project
Transaction modelOptimistic concurrency on a serialized JSON logOptimistic concurrency on a metadata tree (atomic swap)MVCC timeline with lock-free conflict detection
Time travelLog replay with checkpoints (Parquet summaries)Snapshot pointer per commit, manifest tree per snapshotTimeline of instants (commits, compactions, cleans)
Schema evolutionSchema versioned in log; add and reorder safe; rename via mappingColumn IDs in metadata; add, drop, rename safe without rewriteSchema versioned per commit; add safe; rename requires care
CompactionOPTIMIZE compacts files; Z-ORDER reorders within filesRewrite manifests; rewrite data files; bin-packInline (CoW) or async services (MoR)
Partition evolutionRequires rewrite or table replacementNative, in place, no rewriteRequires rewrite
File pruningData skipping via file-level min/max statsManifest stats + partition stats; finest grainedFile-level stats + column indexes (recent versions)
Streaming upsert ergonomicsMERGE INTO; works at minute-plus cadenceMERGE INTO; row-level deletes via positional or equalityNative upsert as first-class; MoR shines under sub-second SLA

What ACID on object storage actually means

Object storage is not a database. It doesn't support transactions natively. Table formats achieve ACID by using optimistic concurrency control with atomic metadata updates.

Write path

A writer creates new data files in object storage. It then attempts to atomically update the metadata pointer (Delta log entry, Iceberg metadata file, Hudi timeline entry) to include the new files. If another writer committed first, the current writer retries with conflict resolution.

Read path

A reader reads the current metadata pointer, which gives it a consistent snapshot of the table. It sees only files committed before the snapshot. In-progress writes are invisible.

Atomicity guarantee

Delta uses S3's conditional PUT (or DynamoDB for coordination). Iceberg uses atomic rename of the metadata pointer through the catalog. Hudi uses a timeline with lock-free conflict detection. Different mechanisms, same result: writes are all-or-nothing.

Optimistic concurrency in practice

Writers A and B both read snapshot 100, both stage files, both attempt to commit snapshot 101. The winner succeeds. The loser sees the snapshot advanced, replays its change against snapshot 101, tries again as 102. Non-overlapping partitions almost always succeed. Overlapping rows (two writers updating the same key): the loser fails with a concurrent modification exception. This is why most production lakehouses serialize writes to a single ingestion job per table.

The medallion architecture

Bronze, silver, gold. Three layers each carrying a different contract. Interviewers grade on whether you understand the contracts, not whether you list the colors.

Bronze

Raw fidelity

Records arrive exactly as the source produced them, including malformed rows, duplicates, late-arriving updates. Contract: replayability. If a downstream layer is wrong, rebuild from bronze without re-fetching. Append-only, partitioned by ingestion time, rarely deleted. Cheapest layer to operate, most expensive to recover.

Silver

Clean schema

Records deduplicated, conformed to canonical type, joined against reference data where the join is cheap and stable. What data scientists query for ad-hoc work. Contract: queryability — a column with the same name means the same thing across rows; a row with the same key represents the same entity over time.

Gold

Business semantics

Aggregates, dimensional models, and metrics the BI layer points at. The public API of the data platform; renaming a column or changing a metric definition is a breaking change. Contract: trust — if a number changes, it changed because the business changed, not the pipeline.

What goes wrong at 3 AM

Six failure modes that page on-call engineers, and how to defuse them. Interviewers love this list because it separates people who have run a lakehouse in production from people who have read about one.

Small file problem after autoloader

Streaming ingest at one micro-batch per minute creates 1,440 tiny files per day per partition. Reads degrade because the planner opens hundreds of thousands of files and listing dominates wall clock. Fix: OPTIMIZE on Delta or rewrite_data_files on Iceberg, scheduled often enough that file count stays bounded.

VACUUM removing files in active read

A long-running batch job started at snapshot 100 still has file references when VACUUM 0 hours runs and removes everything not in the current snapshot. The reader fails with a file-not-found error halfway through. Fix: never set retention shorter than the longest possible read. Default Delta retention is seven days for exactly this reason.

Concurrent MERGE failures from optimistic concurrency

Two MERGE INTO jobs target the same partition. Both read snapshot 100, both stage updates, the second fails with concurrent append exception. Fix: partition writers by key range, serialize MERGEs through a single orchestrated job, or accept the retry cost.

Z-order slowness on wide tables

OPTIMIZE ZORDER BY (col1, col2, col3, col4, col5) on a 10 TB table runs for hours and produces marginal improvement. Z-order is most effective on the first one or two columns; later columns contribute almost nothing. Fix: pick the highest-cardinality predicate column you actually filter on and stop after one or two.

Iceberg metadata explosion

Every commit creates a new manifest list. After a million commits the metadata directory is tens of gigabytes and the planner slows even though the table is small. Fix: schedule rewrite_manifests and expire_snapshots on a cadence proportional to write rate.

Hudi MoR compaction backlog

A Merge-on-Read table accepts upserts as log files. If the async compaction service falls behind, reads must merge an ever-growing log against base files at query time and latency spikes. Fix: size the compaction service for peak write rate, not average, and alert on log-file count per partition.

Migration patterns: lake to lakehouse

Order of operations for converting an existing lake into a lakehouse without taking downtime or duplicating data. Skip a step and the migration stalls.

  1. 01

    Land everything in Parquet on S3 first

    Before introducing a table format, prove your data lands cleanly in columnar Parquet with sane partitioning. Bad partitioning is invisible until you put a transaction log on top, then it's expensive to fix. Cheap step. Get it right.

  2. 02

    Pick a format

    Delta if your stack is Databricks-centric and engines are mostly Spark and Databricks SQL. Iceberg if you have or expect multiple engines (Trino, Flink, Snowflake external tables, BigQuery external tables, Athena, Dremio). Hudi if streaming upserts at sub-minute SLA are the central requirement and you accept narrower ecosystem support.

  3. 03

    Convert in place

    Delta provides CONVERT TO DELTA which writes a transaction log against existing Parquet without moving data. Iceberg has a hidden-partition migration that scans the existing layout, infers a partition spec, and produces an initial metadata tree pointing at the same files. Both finish in minutes for terabyte tables. Move-and-rewrite migration is almost never necessary.

  4. 04

    Bring over a small dataset, prove ACID and time travel

    Pick one bronze table with a manageable footprint. Run a write, concurrent read, failed write that rolls back, time-travel query, and OPTIMIZE. Confirm each behaves the way the documentation says. This is where you discover catalog misconfiguration or IAM too narrow, before migrating anything important.

  5. 05

    Move analytics to query the lakehouse

    Point your BI tools at the lakehouse via Spark, Trino, or external tables in Snowflake. Validate dashboards against the warehouse copy. The goal is parity, not speed wins. Once parity is proven, decommission the duplicate warehouse copies.

  6. 06

    Rebuild ingestion to write through the format directly

    Last step, not first. Update upstream pipelines to write directly to the lakehouse table rather than to raw Parquet plus a separate load step. This is where operational simplification actually arrives. Skipping the earlier steps and starting here is how migrations stall.

How the lakehouse changes the DE stack

No ETL from lake to warehouse

Data stays in object storage. Transformations write back to the same storage layer. Eliminates an entire category of pipelines and their failure modes.

Table maintenance is your job

File compaction, orphan file cleanup, snapshot expiration, and data ordering (Z-order, sort order) are now the data engineer's responsibility. In a managed warehouse the vendor handles this. In a lakehouse you run OPTIMIZE and VACUUM (Delta) or rewrite manifests and expire snapshots (Iceberg).

Catalog management matters

A lakehouse catalog (AWS Glue, Nessie, Polaris, Unity Catalog) is the central registry mapping table names to metadata locations. Choosing and managing the catalog is a new infrastructure concern.

Multi-engine flexibility

Since data is in an open format, different teams can use different engines: Spark for heavy transformations, Trino for interactive queries, Flink for streaming, Athena for ad-hoc. Powerful but requires understanding how each engine interacts with the table format.

What interviewers actually grade on

Five questions that show up in lakehouse-flavored system design rounds.

Q01

Walk me through your medallion architecture and what each layer guarantees.

Strong: bronze keeps the raw record exactly as it arrived, including malformed rows, with a write timestamp and a source identifier. Silver applies the schema, deduplicates, and produces typed columns. Gold contains business-meaningful aggregates and dimensional models the BI layer hits. Contract: bronze is replayable, silver is queryable, gold is trustworthy. Weak: just lists the colors in order.

Q02

How does optimistic concurrency in Delta differ from MVCC in Iceberg?

Strong: Delta serializes through a JSON log; the n-th commit must produce the n-th log file via a conditional write, so two writers racing for the same n compete and one retries. Iceberg uses atomic rename of the metadata pointer through the catalog, with snapshot-isolated commits that detect overlap on data files; non-overlapping commits can succeed even if they target the same partition. Both are optimistic. The difference is the unit of conflict and where the atomic operation lives.

Q03

Your VACUUM is reclaiming files an active reader still needs. Diagnose.

Strong: VACUUM removes files no longer referenced by the current snapshot beyond the retention window. A long-running reader holds a snapshot reference older than that window, but VACUUM doesn't consult readers; it consults retention. Fix: set retention longer than the longest possible read, not shorten reads. If retention has to stay tight for cost, serialize VACUUM behind the read. Weak: blames the storage layer.

Q04

Design a CDC pipeline that lands into a lakehouse with exactly-once semantics.

Strong: capture from the source with Debezium or DMS, write to a Kafka topic with idempotent producers, run a structured streaming job with checkpointing into a bronze table keyed on source PK plus a monotonic LSN, MERGE INTO silver on the natural key with a tiebreaker on LSN to preserve the last write. Bronze gets exactly-once via checkpoints. Silver gets idempotency via MERGE on a stable key. Gold is rebuilt from silver and is therefore deterministic.

Q05

When would you NOT pick a lakehouse?

Strong: high-concurrency BI with sub-second latency budgets where a managed warehouse like Snowflake or BigQuery still wins; tight transactional workloads with heavy small-row updates that belong in OLTP, not analytics; very small data volumes where the operational overhead of compaction, VACUUM, and catalog management is more cost than the storage savings recover. Weak: always pick lakehouse.

Myth vs reality

Myth: Lakehouse replaces the warehouse

Reality: it replaces the lake side of the lambda architecture. Warehouses still win on low-latency BI for many workloads, especially high-concurrency dashboards with sub-second budgets. Common shape: lakehouse for storage and transformation, warehouse as a fast serving layer fed from gold tables.

Myth: Iceberg is just Delta's open competitor

Reality: concurrency models differ, migration semantics differ, integrations differ, partition stories differ materially. Iceberg supports partition evolution natively; Delta does not without rewrite. Delta has deepest Spark/Databricks integration; Iceberg has broadest cross-engine story. Treating them as interchangeable misses the decision.

Myth: ACID equals serializable

Reality: most table formats give snapshot isolation, not serializability. Two non-conflicting writers can commit against the same starting snapshot and both succeed. A reader sees a consistent snapshot, but two readers at different times can disagree about row counts without anyone violating ACID. Be precise.

Myth: Time travel is for analytics

Reality: mostly used for incident recovery (revert a bad batch), debugging (what did the table look like when this row was wrong), and ML training reproducibility (train against the snapshot the model was trained on). Production analytics queries almost always read latest. Building time travel into daily BI is a smell.

Myth: OPTIMIZE fixes performance

Reality: OPTIMIZE compacts small files into larger ones. That helps when your bottleneck is file count and listing overhead. Z-ORDER changes data layout so predicate pushdown can prune more aggressively — but only on the first one or two columns and only if you actually filter on them. Without the right keys, OPTIMIZE alone helps less than expected.

Decision matrix

Single-screen pick guide. Match your situation on the left; the format on the right is almost always the correct first answer.

SituationPickReason
Databricks shop, Spark workloads, Unity CatalogDelta LakeNative default, deepest integration, lowest friction.
Multi-engine plans (Spark + Trino + Snowflake + Flink)IcebergBroadest engine support and a true open standard.
Heavy streaming upserts with sub-minute SLAHudi MoRNative upsert is first class; CoW alternatives lag.
Greenfield analytics platform on AWSIcebergGlue, Athena, Redshift Spectrum, EMR all speak it natively.
Already on Snowflake, considering external tablesIcebergSnowflake-managed Iceberg tables are the supported open path.
GCP shop, BigQuery is the BI layerIcebergBigLake tables and BigQuery external tables both use it.
Single-team, single-engine, mostly batch ETLDelta LakeSimpler operational footprint, fewer moving parts.
CDC from many OLTP databases at onceHudi or IcebergHudi if SLA is tight; Iceberg with MERGE if ecosystem matters more.

Data lakehouse FAQ

What is a data lakehouse?+
A data architecture combining the low-cost storage and schema flexibility of a data lake with the ACID transactions, schema enforcement, and query performance of a warehouse. Achieved through open table formats (Delta, Iceberg, Hudi) that add a metadata layer on top of files in object storage (S3, GCS, ADLS). The table format tracks which files belong to each version, enabling transactions, time travel, and efficient query planning without copying data into a proprietary warehouse.
What is the difference between Delta Lake, Iceberg, and Hudi?+
All three add ACID transactions to files on object storage. Delta (Databricks) uses a JSON-based transaction log. Iceberg (Netflix) uses a tree of manifest files for faster query planning on very large tables. Hudi (Uber) focuses on incremental processing and upserts. All three support ACID, time travel, schema evolution, partition pruning. Differences: ecosystem support, performance at scale, community momentum. Iceberg has the broadest engine support; Delta is tightly integrated with Databricks; Hudi is strong for CDC and streaming upsert workloads.
Does a lakehouse replace a data warehouse?+
For some workloads, yes. Lakehouses handle BI queries, ad-hoc analytics, and data science on the same data without copying it to a separate warehouse. For workloads requiring sub-second latency, high concurrency, and fully managed performance tuning, a dedicated warehouse (Snowflake, BigQuery, Redshift) may still be better. Many organizations run a hybrid: raw and transformed data in the lakehouse, a warehouse as the fast serving layer for high-concurrency BI.
What does ACID actually guarantee on object storage?+
Snapshot isolation, atomic commits, and durable writes. Not serializability. Two writers committing non-overlapping changes can both succeed against the same starting snapshot. A reader sees a consistent snapshot at the moment its query starts. Failed writes leave no visible state behind. The atomic step is the metadata commit, not the file write; data files exist before the commit but are invisible to readers until the commit references them.
When should I pick Iceberg over Delta?+
When you have or expect multiple query engines reading the same tables, when you need partition evolution without rewriting data, or when you want a vendor-neutral open standard with first-party support across Spark, Trino, Flink, Snowflake, BigQuery, Athena. Delta is the better pick if your stack is centered on Databricks and Spark and you value the OPTIMIZE + Z-ORDER + Change Data Feed integration there.
02 / Why practice

Practice the architecture round on real problems

  1. 01

    Active recall beats re-reading by 50%

    Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom

  2. 02

    76% of hiring managers reject on the coding task, not the resume

    From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice

  3. 03

    Five problem shapes cover 80% of data engineer loops

    Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition

Related guides