Data Lake vs Data Warehouse vs Lakehouse

Three approaches to storing analytical data. Interviewers test whether you know when to use each and can explain the reasoning.

Three-Way Comparison

Schema

Data Lake

Schema-on-read. Raw format (JSON, CSV, Parquet). Schema applied at query time. Flexible but can have quality issues.

Data Warehouse

Schema-on-write. Must conform to defined schema at load time. Enforces quality but requires upfront design.

Lakehouse

Schema enforcement at table level with evolution support. Delta Lake, Iceberg, Hudi add validation on lake storage.

Data Types

Data Lake

Any: structured, semi-structured, unstructured (images, logs, video).

Data Warehouse

Structured only. Tables with defined columns. Some semi-structured support (BigQuery JSON, Snowflake VARIANT).

Lakehouse

All types. Structured and semi-structured get ACID transactions and schema enforcement.

Cost

Data Lake

Cheapest (S3, GCS, ADLS). No cost when idle.

Data Warehouse

More expensive per GB. Storage and compute may be bundled or separated.

Lakehouse

Lake-level storage costs with near-warehouse query performance.

Query Performance

Data Lake

Slower for SQL without optimization. Depends on file format and partitioning.

Data Warehouse

Fastest for SQL. Built-in optimizer, statistics, caching.

Lakehouse

Good with table formats. File-level statistics, Z-ordering, compaction.

ACID Transactions

Data Lake

None natively. Concurrent writes can corrupt data.

Data Warehouse

Full ACID. Concurrent reads/writes are safe.

Lakehouse

ACID via Delta Lake, Iceberg, Hudi. Time travel enabled.

Best For

Data Lake

Raw storage, ML, unknown future use cases, very large datasets.

Data Warehouse

BI, SQL analytics, well-defined models, fast interactive queries.

Lakehouse

Teams needing both analytics and ML, cost-conscious orgs with large data.

The Evolution: Why Lakehouses Emerged

Each era solved the previous era's problems and created new ones. Understanding this history helps you explain to interviewers why the current architecture exists.

Traditional Warehouse Era (1990s to 2010)

On-premise systems like Teradata, Oracle, and IBM Netezza. Data was structured, ETL was rigid, and storage was expensive. Everything went through a defined schema before it entered the warehouse. Capacity planning meant buying hardware months in advance. This worked for stable, well-understood reporting workloads, but could not handle the data volume explosion that came with web-scale applications.

Hadoop / Data Lake Era (2010 to 2016)

Hadoop promised unlimited storage on cheap commodity hardware. Companies dumped everything into HDFS: logs, clickstreams, JSON, CSV, images. The idea was 'store now, figure out the schema later.' In practice, many data lakes became data swamps. Without schema enforcement, data quality degraded. Finding and trusting data became harder than storing it. MapReduce was powerful but painful to write. Hive added SQL on top, but performance lagged behind dedicated warehouses.

Cloud Warehouse Era (2016 to 2021)

Snowflake, BigQuery, and Amazon Redshift moved warehouses to the cloud, increasingly with storage separated from compute. Suddenly you could scale compute up for a big query and scale it back down. No capacity planning, no hardware purchases. Cloud warehouses made SQL analytics accessible and fast. But they were expensive for very large datasets (petabyte scale), and they were not designed for ML workloads that need raw file access.

Lakehouse Era (2021 to Present)

Delta Lake, Apache Iceberg, and Apache Hudi add warehouse features (ACID transactions, schema enforcement, time travel) on top of lake storage (S3, GCS). You get the cost of a lake with the reliability of a warehouse. Engines like Spark, Trino, and Dremio query lakehouse tables directly. Snowflake and BigQuery now read Iceberg tables natively. The boundary between lake and warehouse is blurring, and interviewers want you to understand this convergence.

Schema-on-Read vs Schema-on-Write

This is the core conceptual difference between lakes and warehouses. Interviewers test whether you understand the trade-offs, not just the definitions.

Schema-on-Write (Warehouse)

-- Table must exist with defined types BEFORE data loads
CREATE TABLE orders (
  order_id     BIGINT       NOT NULL,
  customer_id  BIGINT       NOT NULL,
  order_date   DATE         NOT NULL,
  total_amount DECIMAL(10,2) NOT NULL,
  status       VARCHAR(20)  NOT NULL
);

-- This INSERT fails if types do not match
INSERT INTO orders VALUES (1, 42, '2025-03-15', 99.99, 'shipped');

-- Schema violations are caught at write time
-- Bad data never enters the warehouse

Schema-on-write catches errors at ingestion time. If a source system sends a string where a number is expected, the load fails immediately. This is the warehouse approach: strict, reliable, but requires knowing your schema upfront. Changes to the schema require ALTER TABLE and potentially reloading historical data.

Schema-on-Read (Data Lake)

-- Raw JSON files land in S3 with no schema enforcement
-- s3://data-lake/orders/2025/03/15/orders.json
-- {"order_id": 1, "customer_id": 42, "total": "99.99", "status": "shipped"}
-- {"order_id": 2, "customer_id": null, "total": "bad_value"}

-- Schema applied at query time
SELECT
  CAST(order_id AS BIGINT) AS order_id,
  CAST(customer_id AS BIGINT) AS customer_id,
  TRY_CAST(total AS DECIMAL(10,2)) AS total_amount
FROM read_json('s3://data-lake/orders/2025/03/15/*.json');

Schema-on-read stores raw data as-is. The schema is applied when you query, not when you load. This is flexible: you can change your interpretation of the data without reloading it. But it pushes data quality problems downstream. That null customer_id and the 'bad_value' string will not cause errors until someone queries the data and the CAST fails. Every consumer must handle these issues independently.

When to Use Each

The right architecture depends on the specific workload. Here are common scenarios and what to recommend in each.

SQL-based BI dashboards for the business team

Data Warehouse

BI tools like Tableau, Looker, and Power BI are built for warehouse connections. They expect fast SQL responses, defined schemas, and consistent data types. A warehouse gives the best query performance and the most predictable experience for business users.

Storing raw event logs from a high-traffic application

Data Lake

Event logs can generate terabytes per day. Storing them in a warehouse is expensive and unnecessary if most of the data will never be queried directly. Land the raw data in S3 as compressed Parquet files. Run batch processing with Spark when you need to analyze it. Move aggregated results to the warehouse for dashboards.
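The landing layout described above is usually a Hive-style date-partitioned prefix, so batch jobs can read one day at a time. A minimal sketch (bucket and dataset names are hypothetical):

```python
from datetime import date

def partition_path(bucket: str, dataset: str, d: date) -> str:
    # Hive-style partition layout: one prefix per day of raw events,
    # so a Spark job can scan a single day without listing the whole bucket.
    return f"s3://{bucket}/{dataset}/year={d.year}/month={d.month:02d}/day={d.day:02d}/"

path = partition_path("data-lake", "events", date(2025, 3, 15))
# path == "s3://data-lake/events/year=2025/month=03/day=15/"
```

Zero-padding the month and day keeps prefixes lexicographically sorted, which makes date-range listing and pruning straightforward.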

ML training data for a recommendation engine

Data Lake or Lakehouse

ML frameworks (PyTorch, TensorFlow, Spark MLlib) read directly from object storage. They need file-level access, not SQL access. A data lake gives the cheapest storage for large training datasets. A lakehouse adds versioning and schema enforcement so your training data is reproducible.

Regulatory data that requires audit trails and versioning

Lakehouse

Lakehouse table formats (Iceberg, Delta Lake) provide time travel and audit logs natively. You can query the state of any table as of any point in time. This is valuable for compliance, debugging, and reproducing historical reports without maintaining separate snapshots.

A startup with 5 GB of data and 3 analysts

Data Warehouse (managed)

At this scale, a managed warehouse like BigQuery (pay per query) or Snowflake (auto-suspend) costs almost nothing. The operational simplicity of a single system outweighs the cost savings of a lake. Do not over-architect for scale you do not have.

A media company storing millions of images and videos with metadata

Data Lake

Unstructured binary files (images, video, audio) are impractical to store in a warehouse. The metadata can go in a warehouse for querying, but the files themselves belong in object storage. A lake is the only practical option for binary data at scale.

How Interviewers Test This

Interviewers do not ask you to recite definitions. They present a scenario and want to hear you reason through the architecture choice.

They give you a scenario and ask which architecture to recommend

The interviewer describes a company's data needs: volume, types, users, latency requirements. They want you to pick lake, warehouse, or lakehouse and explain why. The strongest answers name specific trade-offs: 'A warehouse is ideal here because the data is structured, the team uses Looker, and the total volume is under 1 TB. A lake would save on storage but add operational complexity the team cannot afford.'

They ask you to design a data platform from scratch

This is a system design question in disguise. The expected answer usually involves both a lake and a warehouse: raw data lands in the lake, transformations move clean data to the warehouse, and ML workloads read directly from the lake. The lakehouse pattern is the modern answer for companies that want to reduce duplication. They want to hear you reason through the decision, not just name the components.

They ask about schema evolution

What happens when a source system adds a new column? In a warehouse, you ALTER TABLE or rebuild the model. In a lake, the new column just appears in the raw files and downstream queries need updating. In a lakehouse with Iceberg, schema evolution is a first-class operation: add, rename, or reorder columns without rewriting data. This question tests whether you understand the operational burden of each approach.
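The reader-side burden of schema evolution can be shown with a toy sketch: files written before and after a column was added must be projected onto the current schema. This is illustrative only, not a real table-format API (real formats like Iceberg track columns by ID rather than by name):

```python
# Current table schema; discount_pct was added after some files were written.
CURRENT_SCHEMA = ["order_id", "customer_id", "discount_pct"]

def project(record: dict, schema: list[str]) -> dict:
    """Project a stored record onto the current schema; missing columns read as None."""
    return {col: record.get(col) for col in schema}

old_file_record = {"order_id": 1, "customer_id": 42}                 # pre-evolution file
new_file_record = {"order_id": 2, "customer_id": 7, "discount_pct": 10}

rows = [project(r, CURRENT_SCHEMA) for r in (old_file_record, new_file_record)]
# rows[0]["discount_pct"] is None; rows[1]["discount_pct"] == 10
```

In a raw lake, every consumer must implement this projection themselves; a table format does it once, centrally, at the metadata layer.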

They ask about data freshness vs cost

Real-time data in a warehouse is expensive (continuous loading, always-on compute). Batch loading to a lake is cheap but stale. Interviewers want you to match freshness requirements to architecture: real-time dashboards need a warehouse with streaming ingestion, while weekly reports can use batch-loaded lake data. The answer is always 'it depends' followed by the specific factors you would evaluate.

Lakehouse Table Formats: Delta Lake vs Iceberg vs Hudi

These three formats compete to be the standard metadata layer for lakehouses. Knowing the differences is a common interview topic for senior roles.

Delta Lake

Created by Databricks

The first widely adopted lakehouse format. Built on Parquet files with a JSON transaction log. Tight integration with Spark and Databricks. Supports MERGE, UPDATE, DELETE, time travel, and Z-ordering for query optimization. The transaction log records every change, enabling audit trails and rollbacks. Strongest ecosystem within the Databricks platform. Now open-sourced and supported by other engines, but the deepest feature set remains on Databricks.

Apache Iceberg

Created by Netflix (now Apache)

Designed for vendor independence. Uses a metadata tree structure (catalog, metadata files, manifest lists, manifest files) that tracks every data file. Supports hidden partitioning (partition evolution without rewriting data), schema evolution, and time travel. Adopted by Snowflake, BigQuery, AWS Athena, Trino, Spark, and Dremio. The broadest engine support of any format. Becoming the default choice for teams that want to avoid vendor lock-in.
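The metadata tree can be modeled as a chain of pointers from the catalog down to concrete data files. This is a structural toy, not Iceberg's actual file formats (real metadata and manifests are JSON and Avro files with far more fields):

```python
# Catalog points at current table metadata; metadata points at a manifest list
# for the current snapshot; manifests list the actual Parquet data files.
catalog = {"db.orders": "metadata/v3.json"}
metadata = {"metadata/v3.json": {"current-snapshot": "snap-2", "manifest-list": "snap-2.avro"}}
manifest_lists = {"snap-2.avro": ["manifest-a.avro", "manifest-b.avro"]}
manifests = {
    "manifest-a.avro": ["data/part-00.parquet", "data/part-01.parquet"],
    "manifest-b.avro": ["data/part-02.parquet"],
}

def data_files(table: str) -> list[str]:
    """Walk the tree from the catalog pointer down to concrete data files."""
    meta = metadata[catalog[table]]
    files = []
    for manifest in manifest_lists[meta["manifest-list"]]:
        files.extend(manifests[manifest])
    return files
# data_files("db.orders") -> 3 Parquet files
```

The key property: a commit only swaps the catalog's metadata pointer, so readers always see a consistent snapshot of the whole tree.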

Apache Hudi

Created by Uber (now Apache)

Built for incremental data processing. Specializes in upserts and change data capture on data lakes. Two table types: Copy-on-Write (rewrites entire files on update, fast reads) and Merge-on-Read (appends deltas, fast writes, slower reads until compaction). Strong fit for CDC pipelines where you need to replicate OLTP changes to the lake efficiently. Less broad engine support than Iceberg but strong in the Spark ecosystem.
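The Merge-on-Read trade-off can be sketched in a few lines: writes append cheaply to a delta log, and each read pays to merge the log over the base file until compaction folds it in. Illustrative only, not Hudi's API:

```python
# Base file contents, keyed by record key.
base = {1: {"id": 1, "status": "created"}, 2: {"id": 2, "status": "created"}}

# Upserts arrive as cheap appends to a delta log.
delta_log = [
    {"id": 2, "status": "shipped"},   # update to an existing key
    {"id": 3, "status": "created"},   # insert of a new key
]

def read_merged(base: dict, delta_log: list) -> dict:
    """Reads pay the merge cost: latest delta-log version of each key wins."""
    merged = dict(base)
    for rec in delta_log:
        merged[rec["id"]] = rec
    return merged
# read_merged(...)[2]["status"] == "shipped"; 3 rows total
```

Copy-on-Write does this merge eagerly, rewriting whole files at write time, which is why its reads are fast and its writes are expensive.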

Cost Considerations

Cost is often the deciding factor between architectures. Understanding where the money goes shows interviewers you have production experience.

Storage Costs

Data Lake

Lowest. S3 Standard is roughly $0.023 per GB per month. Infrequent Access drops to $0.0125. For 100 TB, that is about $2,300/month on Standard. No cost when data is idle.

Data Warehouse

Higher per GB. Snowflake compressed storage is roughly $23 to $40 per TB per month. BigQuery active storage is $0.02 per GB ($20 per TB). Warehouses compress data aggressively, so actual costs are lower than raw size, but still more than lake storage.

Lakehouse

Same as lake storage because data lives in S3/GCS. The metadata layer (Iceberg manifests, Delta logs) adds negligible overhead. You get warehouse-like features at lake storage prices.
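A back-of-envelope calculation with the list prices quoted above makes the gap concrete (prices change; always check current vendor pricing):

```python
# Approximate list prices from the text, USD per month.
S3_STANDARD_PER_GB = 0.023
S3_IA_PER_GB = 0.0125
SNOWFLAKE_PER_TB = 23.0   # low end of the quoted $23-40/TB compressed range

def monthly_cost_for_tb(tb: float, per_gb: float) -> float:
    return tb * 1024 * per_gb

lake_100tb = monthly_cost_for_tb(100, S3_STANDARD_PER_GB)   # ~$2,355/month
warehouse_100tb = 100 * SNOWFLAKE_PER_TB                    # $2,300/month at list
```

At 100 TB the raw storage bills look similar, but warehouse figures are for compressed data and a lakehouse pays lake prices while a warehouse also bills compute differently; the divergence grows at petabyte scale.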

Compute Costs

Data Lake

Pay for the compute engine separately (EMR, Dataproc, Spark clusters). Costs vary widely based on cluster size and duration. Can be expensive if clusters run continuously, but cheap for sporadic batch jobs.

Data Warehouse

Bundled or separated. Snowflake charges per credit (roughly $2 to $4 per credit). BigQuery charges per TB scanned ($5 to $6.25 per TB). Costs scale directly with query volume and complexity. Auto-suspend reduces idle costs.

Lakehouse

Same as lake compute. You bring your own engine. But lakehouse metadata (statistics, file pruning) reduces the amount of data scanned, lowering the effective compute cost compared to querying raw lake files.
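The file-pruning effect mentioned above comes from per-file column statistics: the metadata layer records each file's min/max values, so the engine skips files whose range cannot match the predicate. A minimal sketch with hypothetical file names:

```python
# Per-file min/max statistics, as a table format's metadata would record them.
files = [
    {"path": "part-00.parquet", "date_min": "2025-01-01", "date_max": "2025-01-31"},
    {"path": "part-01.parquet", "date_min": "2025-02-01", "date_max": "2025-02-28"},
    {"path": "part-02.parquet", "date_min": "2025-03-01", "date_max": "2025-03-31"},
]

def prune(files: list, lo: str, hi: str) -> list[str]:
    """Keep only files whose [min, max] range overlaps the query range [lo, hi]."""
    return [f["path"] for f in files
            if f["date_max"] >= lo and f["date_min"] <= hi]

to_scan = prune(files, "2025-03-01", "2025-03-15")
# Only part-02.parquet is scanned; two thirds of the data is skipped.
```

ISO date strings compare correctly lexicographically, which is the same trick that makes Hive-style date partitions prune cleanly.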

Operational Costs

Data Lake

Highest hidden cost. Someone must manage file formats, partitioning, compaction, and data quality manually. Without governance tooling, the lake becomes unusable over time.

Data Warehouse

Lowest operational cost. The vendor handles optimization, vacuuming, statistics, and infrastructure. You write SQL and pay the bill.

Lakehouse

Middle ground. Table formats handle compaction and metadata automatically, but you still need to manage the underlying infrastructure (Spark clusters, catalog services, access control). More operational work than a warehouse, less than a raw lake.

Governance and Access Control

Data governance is a frequent interview topic at senior levels. Each architecture handles discovery, access control, and lineage differently.

Data Discovery

Data Lake

Without a catalog, data is invisible. Files accumulate in nested S3 paths with no searchable metadata. Teams create duplicate datasets because they cannot find existing ones.

Data Warehouse

Built-in. Every table is visible in the information schema. BI tools browse the catalog natively. Discovery is a solved problem.

Lakehouse

Requires a catalog service (AWS Glue Catalog, Hive Metastore, Unity Catalog). Once set up, tables are discoverable like a warehouse. Without it, same problem as a raw lake.

Access Control

Data Lake

IAM policies on S3 buckets or paths. Coarse-grained: you can restrict by folder but not by column or row. No built-in row-level security.

Data Warehouse

Fine-grained. Column-level masking, row-level security, role-based access. Snowflake and BigQuery handle this natively.

Lakehouse

Improving. Unity Catalog (Databricks) and Polaris (Snowflake/Iceberg) add column-level security to lakehouse tables. Still less mature than warehouse-native controls.

Lineage

Data Lake

No built-in lineage. Requires external tools (DataHub, OpenLineage) to track where data came from and where it goes.

Data Warehouse

Some built-in lineage (BigQuery tracks query dependencies). Still benefits from external tools for cross-system tracking.

Lakehouse

Table formats track file-level provenance. Combined with orchestration metadata, lineage is achievable but requires assembly.

Interview Questions with Guidance

What is the difference between a data lake and a data warehouse?

Lakes store raw data in any format (schema-on-read). Warehouses store structured data with enforced schemas (schema-on-write). Lakes are cheaper and more flexible. Warehouses are faster for SQL and enforce quality. Most organizations use both.

What is a data lakehouse?

Combines lake storage (cheap, flexible) with warehouse features (ACID, schema enforcement, fast SQL). Delta Lake, Iceberg, and Hudi add metadata layers on top of Parquet files in object storage.

When would you recommend a lake over a warehouse?

Extremely large data where cost matters. Unstructured data (images, logs). Unknown schema. ML workloads needing raw data. For pure SQL analytics, a warehouse is still better.

What is schema-on-read vs schema-on-write?

Schema-on-write enforces structure at load time (warehouse). Schema-on-read applies structure at query time (lake). Schema-on-write catches errors early. Schema-on-read is more flexible but pushes validation downstream.

What is Apache Iceberg?

Open table format adding schema enforcement, ACID transactions, partition evolution, and time travel to Parquet files. Makes lakes behave like warehouses without vendor lock-in. Supported by Snowflake, BigQuery, and Spark.

How would you migrate from a data lake to a lakehouse?

You do not move the data. The data stays in S3/GCS as Parquet files. You add a table format (Iceberg, Delta) on top. Register existing Parquet files with the catalog, define schemas, and start writing through the table format for ACID guarantees. Historical data can be migrated in place or left as-is and read through the new metadata layer. The key point: a lakehouse migration is a metadata migration, not a data migration.
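The "metadata, not data" point can be sketched with a toy catalog: registering a table means writing metadata that points at Parquet files already sitting in object storage. Real tools do this for you (for example Iceberg's `add_files` procedure or Delta Lake's `CONVERT TO DELTA`); the catalog structure here is purely illustrative:

```python
existing_files = [
    "s3://data-lake/orders/2025/03/14/part-00.parquet",
    "s3://data-lake/orders/2025/03/15/part-00.parquet",
]

catalog: dict[str, dict] = {}

def register_table(name: str, schema: dict, files: list) -> None:
    """Create table metadata pointing at files already in place; no data is copied."""
    catalog[name] = {"schema": schema, "snapshots": [{"id": 1, "files": list(files)}]}

register_table("orders", {"order_id": "BIGINT", "order_date": "DATE"}, existing_files)
# catalog["orders"] now references the original S3 paths unchanged.
```

Subsequent writes go through the table format, which appends new snapshots to the same metadata instead of rewriting history.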

Compare Delta Lake and Apache Iceberg

Both add ACID transactions and schema enforcement to lake storage. Delta Lake is tightly integrated with Databricks and Spark. Iceberg has broader engine support (Snowflake, BigQuery, Trino, Spark). Iceberg's hidden partitioning and partition evolution are more flexible. Delta Lake's Z-ordering and OPTIMIZE commands are more mature. For interviews, the safe answer is: Iceberg for multi-engine environments, Delta Lake for Databricks-centric teams.

What is the 'data swamp' problem and how do you prevent it?

A data swamp is a data lake where nobody can find or trust the data. It happens when teams dump data without documentation, quality checks, or governance. Prevention: enforce naming conventions, require metadata at ingestion (owner, schema, freshness SLA), run automated quality checks, catalog everything with a tool like DataHub or Atlan. A lakehouse with schema enforcement also prevents the worst quality issues.

How does time travel work in a lakehouse?

Table formats track every change as a new snapshot. Each snapshot points to a set of data files. Querying an older snapshot reads the files that existed at that point. No data is overwritten; old files are retained until explicitly cleaned up (vacuum/expire). This enables reproducing historical queries, debugging data issues, and rolling back bad writes.
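The snapshot mechanism can be sketched as a list of commits, each pointing at the set of files that made up the table at that moment. Nothing is overwritten, so reading an old version is just reading an old file list (toy model, not a real table-format API):

```python
# Each commit appends a snapshot; a rewrite adds new files but keeps old ones on disk.
snapshots = [
    {"id": 1, "files": ["part-00.parquet"]},
    {"id": 2, "files": ["part-00.parquet", "part-01.parquet"]},
    {"id": 3, "files": ["part-00a.parquet", "part-01.parquet"]},  # part-00 rewritten
]

def files_as_of(snapshot_id: int) -> list[str]:
    """Time travel: return the file set that existed at the given snapshot."""
    return next(s["files"] for s in snapshots if s["id"] == snapshot_id)
# files_as_of(2) still sees part-00.parquet; files_as_of(3) does not.
```

Vacuum/expire operations delete files referenced only by old snapshots, which is why retention settings bound how far back you can travel.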

Frequently Asked Questions

What is a data lake?

Centralized storage for raw data in native format. Structured, semi-structured, or unstructured. No predefined schema required. Common on S3, GCS, ADLS.

What is a data warehouse?

Structured database optimized for analytical queries. Defined schemas, columnar storage, fast SQL performance. Examples: Snowflake, BigQuery, Redshift.

Is the lakehouse replacing warehouses?

Not entirely. Lakehouses are growing, but dedicated warehouses still offer the best SQL performance and BI tool ecosystem. The right choice depends on workload and team.

Do interviews ask about this?

Yes. Tests foundational architecture knowledge. Interviewers want trade-off reasoning and the ability to recommend the right solution for a given scenario.

What is the difference between Parquet and CSV for lake storage?

Parquet is a columnar binary format with compression, type information, and column-level statistics. CSV is a plain text format with no types and poor compression. Parquet files are typically 5 to 10x smaller than CSV and 10 to 100x faster to query. Always use Parquet (or ORC) in production data lakes.

Can I query a data lake with SQL?

Yes, with an engine on top. Trino (formerly Presto), Spark SQL, AWS Athena, and BigQuery can all query files in S3 or GCS directly. Performance depends on file format (Parquet is fast, CSV is slow) and whether you use a table format like Iceberg for metadata.

What is a data catalog and why does it matter for lakes?

A data catalog is a searchable inventory of all datasets, their schemas, owners, and freshness. Warehouses have built-in catalogs (you can see all tables). Lakes do not. Without a catalog, users cannot find data. Tools like AWS Glue Catalog, DataHub, and Atlan solve this.

How do lakehouse formats handle concurrent writes?

Through optimistic concurrency control. Each write creates a new snapshot. If two writers conflict (modifying the same partition), one succeeds and the other retries. The transaction log or metadata tree acts as the coordination layer. This is how Delta Lake, Iceberg, and Hudi provide ACID on top of object storage, which has no built-in locking.
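The optimistic concurrency described in that last answer boils down to a compare-and-swap on the table's version pointer. A toy sketch (illustrative names, not any format's real API):

```python
# The coordination point: a single version pointer per table.
table = {"version": 5, "snapshot": "snap-5"}

def try_commit(table: dict, expected_version: int, new_snapshot: str) -> bool:
    """Commit only if nobody else committed since we started; otherwise retry."""
    if table["version"] != expected_version:
        return False                      # conflict: re-read state and retry
    table["version"] += 1
    table["snapshot"] = new_snapshot
    return True

ok_a = try_commit(table, 5, "snap-6a")    # first writer wins
ok_b = try_commit(table, 5, "snap-6b")    # second writer sees version 6, fails
```

In practice the atomic swap is delegated to something that supports it (a catalog service, DynamoDB, or an atomic rename), since S3 itself offers no locking primitive.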
