Every production pipeline you've ever seen lives on one side of this split. The warehouse is the serving layer, tuned for analyst latency and BI tools that expect sub-second joins. The lake is the substrate, where raw events land before anyone agrees on a schema. Where you put a dataset decides the cost curve, the consumer, and the failure modes your on-call rotation inherits for the next five years. This page maps both architectures to their place in a real stack, then walks through the lakehouse middle ground and when it actually pays off.
Source: DataDriven analysis of 1,042 verified data engineering interview rounds.
The core architectural difference is schema enforcement. A warehouse uses schema-on-write. A lake uses schema-on-read. Everything else follows from that distinction.
A data warehouse stores data in a columnar format optimized for analytical queries. Before data enters the warehouse, it passes through an ETL (or ELT) pipeline that validates, cleans, and transforms it into a defined schema. The warehouse has tables with typed columns, constraints, and indexes. Query engines know the schema in advance and use it to optimize execution plans.
Storage format: Columnar (Snowflake micro-partitions, BigQuery Capacitor, Redshift blocks). Compression ratios are high because values of the same type stored together compress well.
Schema: Required at write time. CREATE TABLE with typed columns. Data that does not match the schema fails to load.
Compute: Built-in query engine with cost-based optimizer. MPP (massively parallel processing) distributes queries across nodes.
Examples: Snowflake, BigQuery, Redshift, Azure Synapse, ClickHouse.
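The schema-on-write contract can be made concrete in a few lines of plain Python. The schema and `load_row` helper below are illustrative stand-ins for a warehouse's load path, not any real warehouse API: the point is that a nonconforming row never reaches storage.

```python
# Minimal sketch of schema-on-write: validate every row against a declared
# schema at load time and reject anything that does not conform.
# SCHEMA and load_row are illustrative, not a real warehouse API.
from datetime import date

SCHEMA = {"order_id": int, "region": str, "order_date": date, "revenue": float}

def load_row(table: list, row: dict) -> None:
    """Admit a row only if every column exists and matches the declared type."""
    if set(row) != set(SCHEMA):
        raise ValueError(f"columns {sorted(row)} do not match schema")
    for col, typ in SCHEMA.items():
        if not isinstance(row[col], typ):
            raise TypeError(f"{col}: expected {typ.__name__}, got {type(row[col]).__name__}")
    table.append(row)

orders: list = []
load_row(orders, {"order_id": 1, "region": "us",
                  "order_date": date(2024, 1, 3), "revenue": 99.5})
try:
    # order_id is a string here, so the load fails before storage is touched
    load_row(orders, {"order_id": "two", "region": "us",
                      "order_date": date(2024, 1, 4), "revenue": 10.0})
except TypeError:
    pass  # the bad row was rejected at write time
```

This is the property the rest of the stack relies on: once a row is in the table, its types are known, so the optimizer can plan without inspecting the data.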
A data lake stores files in object storage without requiring a predefined schema. Data lands in its original format: JSON logs, CSV exports, Parquet files, images, PDFs, whatever the source produces. The schema is applied at query time by the compute engine reading the files.
Storage format: Any file format. Parquet and ORC for structured data, JSON for semi-structured, raw text for logs. Files are organized into folders (partitions) by convention.
Schema: Applied at read time. A query engine (Spark, Trino, Athena) infers or is told the schema when reading files.
Compute: Separate from storage. You bring your own engine: Spark, Presto/Trino, Athena, Hive. You can swap engines without moving data.
Examples: S3 + Athena, GCS + BigQuery external tables, ADLS + Synapse Serverless, HDFS + Spark.
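The schema-on-read contrast can be sketched the same way: files land unvalidated, and a schema is projected onto them only when a query reads them. The sample records and the `read_with_schema` helper below are illustrative, assuming JSON-lines files:

```python
# Minimal sketch of schema-on-read: raw JSON lines land with no validation,
# and a schema is imposed only at query time. Note the drifting fields in
# the raw data -- nothing stopped them from landing.
import json

raw_lines = [
    '{"order_id": 1, "region": "us", "revenue": 99.5}',
    '{"order_id": 2, "region": "eu", "revenue": "12.0", "coupon": "X1"}',
    '{"order_id": 3}',
]

def read_with_schema(lines):
    """Project each record onto the expected columns at read time,
    coercing types where possible and filling gaps with None."""
    for line in lines:
        rec = json.loads(line)
        yield {
            "order_id": rec.get("order_id"),
            "region": rec.get("region"),
            "revenue": float(rec["revenue"]) if "revenue" in rec else None,
        }

rows = list(read_with_schema(raw_lines))
```

Flexibility is the upside; the downside is that every reader must handle the drift, and two engines reading the same files can disagree about what a column means.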
This table summarizes the practical differences that affect day-to-day engineering decisions.
| Dimension | Data Warehouse | Data Lake |
|---|---|---|
| Schema | On write (enforced) | On read (flexible) |
| Data types | Structured only | Structured, semi-structured, unstructured |
| Query speed | Sub-second to seconds | Seconds to minutes |
| Storage cost | Higher (optimized format) | Lower (object storage) |
| Compute cost | Pay per query/credit | Pay for cluster/runtime |
| Users | Analysts, BI tools, dashboards | Data engineers, data scientists, ML pipelines |
| ACID transactions | Built-in | Requires table format (Delta, Iceberg) |
| Governance | Built-in RBAC, column masking | Requires external catalog (Glue, Unity) |
| Data freshness | Near-real-time (Snowpipe, streaming inserts) | Real-time possible (direct file writes) |
The right choice depends on your data volume, query patterns, team skills, and budget. Here are the clear-cut scenarios.
Choose a warehouse when your primary consumers are analysts and dashboards. Business users need fast, interactive queries. They write SQL and expect results in seconds. A warehouse delivers this out of the box.
Your data is structured and well-defined. If most of your data comes from transactional databases, SaaS APIs, and structured event streams, a warehouse handles it natively.
You want managed infrastructure. Snowflake, BigQuery, and Redshift Serverless handle scaling, optimization, and maintenance. Your team focuses on data modeling, not cluster management.
Data volume is under 10 TB. For small-to-medium data volumes, warehouse storage costs are reasonable and query performance is excellent.
Choose a lake when you have large volumes of raw or semi-structured data. Clickstream logs, IoT sensor data, application event streams, and machine learning training data are better stored in a lake, where you can keep everything without paying warehouse storage rates.
Multiple compute engines need access. If Spark jobs, ML training pipelines, and SQL query engines all need the same data, a lake lets each engine read from the same files without copying data into engine-specific formats.
You need to store data before you know how to use it. When new data sources arrive and you are not sure yet what schema to impose, a lake lets you store first and model later.
You process petabytes. At very large scale, object storage is an order of magnitude cheaper than warehouse storage. The compute cost savings of running your own Spark clusters (or spot instances) can be significant too.
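A back-of-envelope calculation shows where the gap comes from. Every price below is an assumed round number for illustration, not a vendor quote; the big gap opens up when cold data is tiered down to cheaper archival object-storage classes, which a flat warehouse storage rate does not capture.

```python
# Illustrative monthly storage bill for 5 PB, split into hot and cold data.
# All $/TB-month figures are assumed round numbers, not vendor pricing.
TB = 1.0
PB = 1024 * TB

hot_tb, cold_tb = 1 * PB, 4 * PB     # assume 80% of the data is rarely read

WAREHOUSE_PER_TB = 40.0              # assumed flat warehouse storage rate
OBJECT_STD_PER_TB = 23.0             # assumed standard object-storage rate
OBJECT_ARCHIVE_PER_TB = 4.0          # assumed infrequent-access/archive tier

warehouse_bill = (hot_tb + cold_tb) * WAREHOUSE_PER_TB
lake_bill = hot_tb * OBJECT_STD_PER_TB + cold_tb * OBJECT_ARCHIVE_PER_TB
```

Under these assumptions the lake bill is roughly a fifth of the warehouse bill, and the ratio keeps improving as the cold fraction grows; swap in your own prices and data mix before drawing conclusions.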
How you query data differs between the two architectures. This affects both developer experience and performance.
```sql
-- Data warehouse query (Snowflake)
-- Fast: columnar storage, partition pruning, result caching
SELECT
    DATE_TRUNC('month', order_date) AS month,
    region,
    SUM(revenue) AS total_revenue,
    COUNT(DISTINCT customer_id) AS unique_customers
FROM analytics.orders
WHERE order_date >= '2024-01-01'
GROUP BY 1, 2
ORDER BY 1, 2;
-- Returns in < 1 second for 100M rows (pruned + cached)
```
```sql
-- Data lake query (Athena / Trino over S3 Parquet files)
-- Partition pruning on year/month folder structure
SELECT
    DATE_TRUNC('month', order_date) AS month,
    region,
    SUM(revenue) AS total_revenue,
    COUNT(DISTINCT customer_id) AS unique_customers
FROM lake_catalog.orders
WHERE year = 2024  -- partition filter (folder structure)
GROUP BY 1, 2
ORDER BY 1, 2;
-- Returns in 5-30 seconds depending on file count and size
```

Same SQL, two very different execution graphs. The warehouse owns the storage layout, so its planner already knows cardinalities, column stats, and which micro-partitions to skip. The lake planner starts from folder names and manifest files, then pushes filters down to whatever Parquet it can reach. That gap shows up on every dashboard that waits more than two seconds.
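The folder-name pruning step the lake engine performs is simple enough to sketch in plain Python. The bucket layout and helper names here are illustrative:

```python
# Sketch of how a lake engine prunes partitions from folder names alone:
# the year=/month= path components are parsed and filtered before any
# file is opened. Paths and layout are illustrative.
paths = [
    "s3://lake/orders/year=2023/month=12/part-0.parquet",
    "s3://lake/orders/year=2024/month=01/part-0.parquet",
    "s3://lake/orders/year=2024/month=01/part-1.parquet",
    "s3://lake/orders/year=2024/month=02/part-0.parquet",
]

def partition_values(path: str) -> dict:
    """Parse key=value folder components out of an object path."""
    return {p.split("=")[0]: p.split("=")[1]
            for p in path.split("/") if "=" in p}

def prune(paths, **filters):
    """Keep only files whose partition folders match every filter."""
    return [p for p in paths
            if all(partition_values(p).get(k) == v for k, v in filters.items())]

to_scan = prune(paths, year="2024")   # WHERE year = 2024 touches 3 of 4 files
```

This is also why a `WHERE order_date >= ...` predicate that does not mention the partition columns forces a full scan on the lake: the engine has nothing to prune on until it opens the files.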
Interview note: If asked “why not just use a data lake for everything,” the answer is query latency. Dashboard queries need sub-second response times. Data lake queries over object storage cannot reliably hit that bar without significant optimization (caching, materialized views, or a lakehouse layer).
A data lakehouse adds warehouse capabilities (ACID transactions, schema enforcement, indexing, time travel) on top of data lake storage. The idea is to get the best of both worlds: cheap storage, flexible formats, and fast queries.
Delta Lake is an open-source storage layer that sits on top of Parquet files in S3/GCS/ADLS. It adds a transaction log (_delta_log) that tracks every change, enabling ACID transactions, time travel, and schema evolution. Databricks uses Delta Lake as its default table format.
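The mechanism is easy to model. The toy log below captures the core idea, an ordered list of add/remove commits that is replayed up to a version, but it simplifies the real _delta_log format considerably:

```python
# Toy version of a Delta-style transaction log: each commit is a JSON entry
# of add/remove file actions, and replaying the log up to a version
# reconstructs that snapshot of the table (this is what enables time travel).
# The structure is simplified, not the actual _delta_log format.
import json

log = [
    json.dumps({"version": 0, "add": ["part-0.parquet"], "remove": []}),
    json.dumps({"version": 1, "add": ["part-1.parquet"], "remove": []}),
    json.dumps({"version": 2, "add": ["part-2.parquet"], "remove": ["part-0.parquet"]}),
]

def snapshot(log, as_of_version: int) -> set:
    """Replay commits up to as_of_version to get the live file set."""
    files: set = set()
    for entry in log:
        commit = json.loads(entry)
        if commit["version"] > as_of_version:
            break
        files |= set(commit["add"])
        files -= set(commit["remove"])
    return files

latest = snapshot(log, 2)        # part-0 was rewritten away in version 2
time_travel = snapshot(log, 1)   # but version 1 still sees it
```

Because old files are only logically removed, reading an earlier version is just replaying fewer commits; physically deleting those files is the separate "vacuum" chore mentioned below.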
Apache Iceberg is an open table format originally developed at Netflix. It tracks data files and metadata in manifest files, enabling snapshot isolation, partition evolution, and hidden partitioning. It is supported by Spark, Trino, Flink, Snowflake, and BigQuery.
Apache Hudi is optimized for upsert-heavy workloads and incremental processing. Originally built at Uber for its data lake, it supports Copy-on-Write and Merge-on-Read table types and is a good fit for CDC pipelines where rows are frequently updated.
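The upsert semantics a CDC pipeline needs can be modeled in a few lines. This sketches the merge behavior (newest timestamp wins per key), not Hudi's actual API; the record shape is made up for illustration:

```python
# Sketch of upsert semantics for a CDC feed: incoming change records are
# merged into the table by key, and the newest timestamp wins. This models
# the behavior a Hudi-style table provides, not Hudi's API.
def upsert(table: dict, changes: list) -> dict:
    """Merge change records into table keyed by id; newest ts wins."""
    for rec in changes:
        current = table.get(rec["id"])
        if current is None or rec["ts"] > current["ts"]:
            table[rec["id"]] = rec
    return table

table = {1: {"id": 1, "status": "placed", "ts": 100}}
cdc_batch = [
    {"id": 1, "status": "shipped", "ts": 140},   # update wins (newer ts)
    {"id": 2, "status": "placed", "ts": 150},    # brand-new row
    {"id": 1, "status": "placed", "ts": 90},     # stale event, ignored
]
upsert(table, cdc_batch)
```

Copy-on-Write applies this merge by rewriting whole files at write time; Merge-on-Read logs the changes and applies them at query time, trading read latency for write throughput.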
The lakehouse is not a silver bullet. Query performance on a lakehouse still lags behind a dedicated warehouse for interactive analytics. Managing table formats (compaction, vacuuming, manifest files) adds operational complexity. And the tooling ecosystem is still maturing. For teams that need simple, fast analytics, a warehouse is still easier to run. For teams processing large-scale mixed workloads, the lakehouse can reduce data duplication and infrastructure cost.
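Compaction is a concrete example of that operational complexity: streaming writers produce many small files, and queries slow down because each file is a separate object to open. A minimal sketch of the planning step, with an assumed 128 MB target file size:

```python
# Sketch of compaction planning: greedily bin-pack small files into groups
# near a target output size, so queries open fewer objects. The 128 MB
# target and 1 MB input files are illustrative numbers.
TARGET_BYTES = 128 * 1024 * 1024

def plan_compaction(file_sizes: list) -> list:
    """Group input file sizes into bins that each stay within the target."""
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes):
        if current and current_size + size > TARGET_BYTES:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups

# 1000 one-megabyte files from a streaming writer collapse into 8 outputs.
small_files = [1024 * 1024] * 1000
groups = plan_compaction(small_files)
```

Real table formats schedule this as a background job (Delta's OPTIMIZE, Hudi's inline/async compaction); the warehouse does the equivalent invisibly, which is part of what you pay for.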
Governance is often the deciding factor for large organizations. Here is how the two approaches handle access control, data quality, and compliance.
Built-in RBAC with role hierarchies. Column-level masking and row-level security are native features in Snowflake, BigQuery, and Redshift. Audit logs track who queried what and when. Data classification tags can be applied to columns. The warehouse is the single access point, so all governance policies are enforced in one place.
Governance requires external tools. AWS Lake Formation, Databricks Unity Catalog, or Apache Ranger provide access control on top of object storage. The challenge is that multiple compute engines (Spark, Athena, Presto) access the same files, and each engine needs to respect the governance layer. Without a centralized catalog, lake governance becomes inconsistent. This is why data lakes without governance investment often degrade into “data swamps” where nobody knows what the data means or who owns it.
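The crux of centralized governance is that every engine consults one policy store before returning data. The sketch below uses made-up role, table, and column names to show the shape of a column-masking check; a real catalog adds authentication, auditing, and policy inheritance on top:

```python
# Sketch of centralized column masking: a single policy table, consulted by
# every engine before rows are returned. Roles, tables, and columns here
# are illustrative, not any catalog's actual model.
MASK_POLICIES = {
    # (table, column) -> roles allowed to see the value unmasked
    ("orders", "customer_email"): {"analyst"},
}

def apply_masks(table: str, row: dict, role: str) -> dict:
    """Redact any column the role is not cleared to see."""
    out = {}
    for col, val in row.items():
        allowed = MASK_POLICIES.get((table, col))
        out[col] = "***" if (allowed is not None and role not in allowed) else val
    return out

row = {"order_id": 7, "customer_email": "a@example.com"}
as_analyst = apply_masks("orders", row, "analyst")
as_engineer = apply_masks("orders", row, "engineer")
```

The hard part in a lake is not this lookup; it is guaranteeing that Spark, Athena, and Trino all route through it instead of reading the Parquet files directly, which is exactly what Lake Formation and Unity Catalog exist to enforce.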
Interviewers don't care whether you've memorized Snowflake docs. They care whether you can place a dataset, justify the layer, and defend the trade-off when the VP of Data asks why the bill tripled. Practice the SQL and the architectural reasoning that goes with it.