Data Engineering

Data Lakehouse Architecture

The data lakehouse merges the warehouse and the lake into a single architecture. Store everything in open formats on cheap object storage. Get ACID transactions, schema enforcement, and warehouse query performance through table formats like Delta Lake, Apache Iceberg, and Apache Hudi. No more copying data between a lake and a warehouse. No more maintaining two systems with different access patterns. This page covers the architecture, the three major table formats, how the lakehouse changes the data engineering stack, and how interviewers test your understanding of it.

Key numbers: 2021 Databricks paper; 27 system design rounds; 172 L6 staff questions; 1,042 interviews analyzed.

Source: DataDriven analysis of 1,042 verified data engineering interview rounds.

Why the Lakehouse Exists

Data lakes gave us cheap storage and schema flexibility but no transactions, no enforcement, and poor query performance. Data warehouses gave us ACID transactions and fast queries but at higher cost, with vendor lock-in, and with the need to load data from the lake into the warehouse.

The two-tier architecture (lake + warehouse) means data gets copied, pipelines multiply, and engineers maintain two systems. Raw data lands in S3. ETL jobs transform it and load it into Snowflake or Redshift. Analysts query the warehouse. Data scientists go back to the lake because the warehouse does not have the raw data they need. This loop creates cost, latency, and inconsistency.

The lakehouse eliminates this by adding warehouse capabilities directly to the lake. Open table formats provide the metadata layer that enables ACID transactions, time travel, schema evolution, and efficient query planning on files stored in object storage. One copy of the data serves both BI and data science.

How the Lakehouse Works

A lakehouse has three layers. The storage layer is object storage (S3, GCS, ADLS) holding data files in columnar formats (Parquet, ORC). The table format layer (Delta Lake, Iceberg, or Hudi) adds a metadata layer that tracks file-level details: which files belong to each table version, column statistics, partition information, and transaction logs. The query engine layer (Spark, Trino, Databricks SQL, Athena, StarRocks) reads the metadata to plan queries efficiently and reads only the files needed.

ACID transactions: The table format uses optimistic concurrency control. Writers create new data files and atomically update the metadata to point to them. Readers see a consistent snapshot. Failed writes do not leave partial data.

Time travel: Each write creates a new table version. You can query any previous version by specifying a timestamp or version number. This enables reproducible analytics, debugging, and audit.
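The version-and-timestamp lookup described above can be sketched with plain Python. This is a toy model of a commit history, not any real table format's API; the field names and file names are illustrative.

```python
from bisect import bisect_right

# Toy commit history: each commit records a version number, a commit
# timestamp, and the set of data files live at that version.
commits = [
    {"version": 0, "ts": 1000, "files": {"a.parquet"}},
    {"version": 1, "ts": 2000, "files": {"a.parquet", "b.parquet"}},
    {"version": 2, "ts": 3000, "files": {"b.parquet", "c.parquet"}},
]

def snapshot_by_version(version):
    """Time travel by version number: return that commit's file set."""
    for c in commits:
        if c["version"] == version:
            return c["files"]
    raise KeyError(version)

def snapshot_by_timestamp(ts):
    """Time travel by timestamp: the latest commit at or before ts."""
    idx = bisect_right([c["ts"] for c in commits], ts) - 1
    if idx < 0:
        raise KeyError(ts)
    return commits[idx]["files"]

print(sorted(snapshot_by_version(1)))       # ['a.parquet', 'b.parquet']
print(sorted(snapshot_by_timestamp(2500)))  # ['a.parquet', 'b.parquet']
```

Because every commit is immutable, answering "what did this table look like at 2:30 yesterday" is just a lookup over the commit log, which is what makes reproducible analytics and audits cheap.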

Schema evolution: Add, rename, or reorder columns without rewriting existing data files. The metadata layer maps old files to the new schema using column IDs (Iceberg) or schema versioning (Delta Lake).
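The column-ID mechanism is easiest to see in a toy sketch. In the Iceberg-style model, data files store values keyed by a stable numeric column ID, and the schema maps names to IDs, so a rename touches only the mapping. The structures below are illustrative; real formats persist this in metadata files.

```python
# Every column has a stable numeric ID. Data files record values keyed
# by column ID; the table schema maps names to IDs. Renaming a column
# changes only the mapping, never the files. (Toy sketch.)

old_schema = {"user_id": 1, "email": 2}       # name -> column ID
file_row = {1: 42, 2: "a@example.com"}        # stored data, keyed by ID

# Rename "email" -> "contact_email": only the schema mapping changes.
new_schema = {"user_id": 1, "contact_email": 2}

def read_row(row_by_id, schema):
    """Project a stored row through the current schema.
    Columns added after the file was written read as None."""
    return {name: row_by_id.get(cid) for name, cid in schema.items()}

print(read_row(file_row, old_schema))   # {'user_id': 42, 'email': 'a@example.com'}
print(read_row(file_row, new_schema))   # same data under the new name

# Adding a column assigns a fresh ID; old files simply lack it.
newer_schema = {**new_schema, "signup_ts": 3}
print(read_row(file_row, newer_schema)["signup_ts"])  # None
```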

Partition evolution: Iceberg supports changing the partition scheme without rewriting data. Start with daily partitions. Switch to hourly when volume increases. Old files retain their original partitioning; new files use the new scheme. The metadata layer handles both.
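A query planner can handle the mixed layout because each file remembers the partition spec it was written under. A minimal sketch, with invented field names, of planning a one-hour query over a table that switched from daily to hourly partitions:

```python
# Toy planner over a table whose partition spec changed from daily to
# hourly. Each file carries the spec it was written under, so the
# planner evaluates the filter against each file's own partition value.
files = [
    {"path": "f1.parquet", "spec": "day",  "partition": "2024-06-01"},
    {"path": "f2.parquet", "spec": "day",  "partition": "2024-06-02"},
    {"path": "f3.parquet", "spec": "hour", "partition": "2024-06-02-14"},
    {"path": "f4.parquet", "spec": "hour", "partition": "2024-06-02-15"},
]

def plan(day, hour):
    """Files that may contain rows for the given day and hour."""
    keep = []
    for f in files:
        if f["spec"] == "day" and f["partition"] == day:
            keep.append(f["path"])     # daily file: the whole day matches
        elif f["spec"] == "hour" and f["partition"] == f"{day}-{hour:02d}":
            keep.append(f["path"])     # hourly file: exact hour only
    return keep

print(plan("2024-06-02", 14))  # ['f2.parquet', 'f3.parquet']
```

The old daily file still has to be scanned in full for that hour; only data written after the spec change benefits from the finer pruning, which is why partition evolution is cheap but not retroactive.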

Delta Lake, Iceberg, and Hudi

Three open table formats compete for the lakehouse standard. Each started at a different company with different priorities. All three have converged on similar feature sets, but their architectures and ecosystems differ.

Delta Lake

Created by Databricks. Uses a JSON-based transaction log (the _delta_log directory) that records every change to the table. Periodic checkpoints (Parquet files summarizing the log) speed up reads. Delta Lake is the default table format on the Databricks platform and has the deepest Spark integration.

Key features: ACID transactions, time travel, Z-ordering for data skipping, OPTIMIZE for file compaction, schema enforcement and evolution, Change Data Feed for CDC consumers.
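The transaction-log idea can be shown in a few lines. Each commit appends JSON actions; replaying "add" and "remove" actions in order yields the current set of live files. This is a toy sketch of the _delta_log concept, not the full Delta protocol (which also has checkpoints, protocol actions, and metadata actions).

```python
import json

# Minimal replay of a Delta-style transaction log. "add" registers a
# data file; "remove" retires one. Replaying the actions in commit
# order yields the table's current live file set.
log = [
    '{"add": {"path": "part-000.parquet"}}',
    '{"add": {"path": "part-001.parquet"}}',
    '{"remove": {"path": "part-000.parquet"}}',
    '{"add": {"path": "part-002.parquet"}}',
]

def replay(actions):
    live = set()
    for line in actions:
        action = json.loads(line)
        if "add" in action:
            live.add(action["add"]["path"])
        elif "remove" in action:
            live.discard(action["remove"]["path"])
    return live

print(sorted(replay(log)))  # ['part-001.parquet', 'part-002.parquet']
```

Replaying the whole log gets slow as commits accumulate, which is exactly why Delta writes periodic Parquet checkpoints summarizing the state so far.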

Apache Iceberg

Created at Netflix. Uses a tree of metadata files: a metadata file points to manifest lists, which point to manifests, which point to data files. This tree structure enables fast query planning even on tables with millions of files because the engine reads only the metadata it needs.

Key features: ACID transactions, time travel, hidden partitioning (partition transforms applied automatically), partition evolution without rewriting data, column-level statistics in manifests for fine-grained file pruning. Iceberg has the broadest engine support: Spark, Trino, Flink, Snowflake, BigQuery, Athena, Dremio, StarRocks.
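The manifest tree pays off at query-planning time: the engine can skip an entire manifest (and every file under it) when the manifest's stat range cannot match the filter. A toy sketch with illustrative structures, not Iceberg's actual metadata layout:

```python
# Sketch of Iceberg-style pruning. Manifests carry per-file min/max
# column stats, and the manifest itself has aggregate bounds; the
# planner skips whole manifests, then whole files, outside the filter.
manifests = [
    {"min_id": 1, "max_id": 200, "files": [
        {"path": "a.parquet", "min_id": 1,   "max_id": 100},
        {"path": "b.parquet", "min_id": 101, "max_id": 200},
    ]},
    {"min_id": 201, "max_id": 300, "files": [
        {"path": "c.parquet", "min_id": 201, "max_id": 300},
    ]},
]

def prune(target_id):
    """Files that could contain rows where id == target_id."""
    hits = []
    for m in manifests:
        if not (m["min_id"] <= target_id <= m["max_id"]):
            continue                   # skip the whole manifest unread
        for f in m["files"]:
            if f["min_id"] <= target_id <= f["max_id"]:
                hits.append(f["path"])  # file stats overlap the filter
    return hits

print(prune(150))  # ['b.parquet']
```

On a table with millions of files, this hierarchy is what keeps planning time proportional to the data you touch rather than the size of the table.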

Apache Hudi

Created at Uber. Designed for incremental data processing and upserts. Hudi tables can be Copy-on-Write (best for read-heavy workloads) or Merge-on-Read (best for write-heavy workloads with streaming ingestion).

Key features: ACID transactions, time travel, record-level upserts and deletes, incremental queries (read only changed records since a given timestamp), built-in support for CDC ingestion from databases.
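The incremental-query idea can be sketched as follows: each record version carries the commit time that wrote it, and a downstream consumer asks for everything changed since its last checkpoint. The record shape and field names here are illustrative, not Hudi's actual storage layout.

```python
# Sketch of a Hudi-style incremental read. Records are kept in commit
# order; an upsert produces a newer version of the same key. A consumer
# reads only keys whose latest version is newer than its checkpoint.
records = [
    {"key": "u1", "value": 10, "commit_time": "001"},
    {"key": "u2", "value": 20, "commit_time": "002"},
    {"key": "u1", "value": 11, "commit_time": "003"},  # upsert of u1
]

def incremental_read(since_commit):
    """Latest version of each key changed strictly after the checkpoint."""
    latest = {}
    for r in records:          # records are in commit order
        latest[r["key"]] = r   # later versions win
    return [r for r in latest.values() if r["commit_time"] > since_commit]

for r in incremental_read("001"):
    print(r["key"], r["value"])  # u2's insert and u1's upserted value
```

This pattern is what lets downstream pipelines avoid full-table rescans: they checkpoint a commit time and pull only the delta on each run.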

| Feature | Delta Lake | Iceberg | Hudi |
| --- | --- | --- | --- |
| Origin | Databricks | Netflix | Uber |
| Metadata | JSON transaction log | Manifest tree | Timeline + metadata table |
| Partition evolution | Requires rewrite | In-place, no rewrite | Requires rewrite |
| Engine support | Spark, Trino, Flink | Broadest (Spark, Trino, Flink, Snowflake, BQ) | Spark, Flink, Trino |
| Best for | Databricks-centric stacks | Multi-engine, large-scale tables | Streaming upserts, CDC |

ACID on Object Storage

Object storage (S3, GCS) is not a database. It does not support transactions natively. Table formats achieve ACID by using optimistic concurrency control with atomic metadata updates.

Write path: A writer creates new data files in object storage. It then attempts to atomically update the metadata pointer (Delta log entry, Iceberg metadata file, Hudi timeline entry) to include the new files. If another writer committed first, the current writer retries with conflict resolution.

Read path: A reader reads the current metadata pointer, which gives it a consistent snapshot of the table. It sees only files that were committed before the snapshot. In-progress writes are invisible.

Atomicity guarantee: Delta Lake relies on an atomic put-if-absent of the next log entry (S3 conditional PUT on current setups, or a DynamoDB-backed LogStore for coordination on older ones). Iceberg commits by atomically swapping the current-metadata pointer in the catalog via compare-and-swap; on filesystems with atomic rename, a rename of the metadata file serves the same role. Hudi records commits on a timeline and detects conflicts at commit time, with pluggable lock providers for concurrent writers. The specifics differ, but the result is the same: writes are all-or-nothing.

Interview note: If asked how a lakehouse provides ACID on S3, explain optimistic concurrency control with atomic metadata updates. Mention that the data files are immutable (append-only). Deletes and updates write new files and update the metadata to exclude old ones. Old files are removed later by retention jobs (VACUUM in Delta, snapshot expiration and orphan-file removal in Iceberg).
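The write path above can be simulated end to end. This is a toy commit protocol against a fake object store that supports put-if-absent; class and method names are invented for illustration, and a lock stands in for the store's own atomicity guarantee.

```python
import threading

# Toy optimistic-concurrency commit. Each commit targets version N+1;
# if another writer claimed that version first, put-if-absent fails and
# the loser retries against the refreshed table state.
class FakeObjectStore:
    def __init__(self):
        self._objects = {}
        self._lock = threading.Lock()  # stands in for the store's atomicity

    def put_if_absent(self, key, value):
        with self._lock:
            if key in self._objects:
                return False   # another writer committed this version first
            self._objects[key] = value
            return True

    def latest_version(self):
        return max((int(k) for k in self._objects), default=-1)

def commit(store, new_files, max_retries=5):
    for _ in range(max_retries):
        next_version = store.latest_version() + 1
        # Data files were already written; only the metadata pointer races.
        if store.put_if_absent(str(next_version), {"files": new_files}):
            return next_version
        # Lost the race: re-read state and retry (conflict resolution
        # would also check that the concurrent commit doesn't overlap).
    raise RuntimeError("too many commit conflicts")

store = FakeObjectStore()
v1 = commit(store, ["a.parquet"])
v2 = commit(store, ["b.parquet"])
print(v1, v2)  # 0 1
```

A reader that captured version 1 before `v2` committed would keep reading version 1's file list unchanged, which is the snapshot-isolation behavior described in the read path above.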

How the Lakehouse Changes the DE Stack

The lakehouse consolidates what used to be two separate systems. This changes the daily work of data engineers in several ways.

No ETL from lake to warehouse: Data stays in object storage. Transformations write back to the same storage layer. This eliminates an entire category of pipelines and their associated failure modes.

Table maintenance becomes your job: File compaction, orphan file cleanup, snapshot expiration, and data ordering (Z-order, sort order) are now the DE's responsibility. In a managed warehouse, the vendor handles this. In a lakehouse, you run OPTIMIZE and VACUUM (Delta) or rewrite manifests and expire snapshots (Iceberg).
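Compaction planning is mostly bin-packing: gather files below a size threshold and group them into rewrite batches near a target file size. A minimal sketch of the idea behind Delta's OPTIMIZE or Iceberg's file rewrite; the greedy strategy and the 128 MB target are illustrative choices, not either tool's actual algorithm.

```python
# Greedy small-file compaction planner. Sizes are in MB; each returned
# bin would be rewritten as one larger file by the maintenance job.
TARGET_MB = 128

def plan_compaction(file_sizes_mb, target=TARGET_MB):
    """Group files smaller than `target` into bins of roughly `target` MB."""
    small = sorted(s for s in file_sizes_mb if s < target)
    bins, current, current_size = [], [], 0
    for size in small:
        if current and current_size + size > target:
            bins.append(current)          # close the bin when it would overflow
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins

print(plan_compaction([5, 10, 200, 60, 70, 8]))  # [[5, 8, 10, 60], [70]]
```

The 200 MB file is left alone; only the small files are rewritten. Skewed file sizes like this are the main reason compaction has to run continuously on streaming-ingested tables.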

Catalog management matters: A lakehouse catalog (AWS Glue, Nessie, Polaris, Unity Catalog) is the central registry that maps table names to metadata locations. Choosing and managing the catalog is a new infrastructure concern.

Multi-engine flexibility: Since data is in an open format, different teams can use different engines. Spark for heavy transformations. Trino for interactive queries. Flink for streaming. Athena for ad-hoc. This flexibility is powerful but requires the DE to understand how each engine interacts with the table format.

Data Lakehouse FAQ

What is a data lakehouse?
A data lakehouse is a data architecture that combines the low-cost storage and schema flexibility of a data lake with the ACID transactions, schema enforcement, and query performance of a data warehouse. It achieves this through open table formats (Delta Lake, Apache Iceberg, Apache Hudi) that add a metadata layer on top of files stored in object storage (S3, GCS, ADLS). The table format tracks which files belong to each table version, enabling transactions, time travel, and efficient query planning without copying data into a proprietary warehouse.
What is the difference between Delta Lake, Iceberg, and Hudi?
All three are open table formats that add ACID transactions to files on object storage. Delta Lake was created by Databricks and uses a JSON-based transaction log. Apache Iceberg was created at Netflix and uses a tree of manifest files for faster query planning on very large tables. Apache Hudi was created at Uber and focuses on incremental data processing and upserts. In practice, all three support ACID, time travel, schema evolution, and partition pruning. The differences are in ecosystem support, performance characteristics at scale, and community momentum. Iceberg has the broadest engine support (Spark, Trino, Flink, Snowflake, BigQuery, Athena). Delta Lake is tightly integrated with Databricks. Hudi is strong for CDC and streaming upsert workloads.
Does a lakehouse replace a data warehouse?
For some workloads, yes. Lakehouses handle BI queries, ad-hoc analytics, and data science on the same data without copying it to a separate warehouse. For workloads that require sub-second latency, high concurrency, and fully managed performance tuning, a dedicated warehouse (Snowflake, BigQuery, Redshift) may still be better. Many organizations run a hybrid: raw and transformed data lives in the lakehouse, and a warehouse serves as a fast serving layer for high-concurrency BI dashboards.
How does the lakehouse affect the data engineering stack?
The lakehouse simplifies the stack by eliminating the ETL step from lake to warehouse. Data lands in object storage, gets processed in place using Spark or Flink, and is queryable by BI tools through engines like Trino, Databricks SQL, or Athena. This means fewer data copies, lower storage costs, and less pipeline maintenance. Data engineers shift from managing warehouse loading to managing table formats, compaction, optimization, and catalog management. The tradeoff is that you take on more infrastructure responsibility compared to a fully managed warehouse.

Prepare for Architecture Interview Questions

Lakehouse, warehouse, and system design questions test your understanding of modern data architectures. Practice with real problems and instant feedback.