Concepts
There are three approaches to storing analytical data: the data lake, the data warehouse, and the lakehouse. Interviewers test whether you know when to use each and can explain the reasoning.
Schema
Lake: schema-on-read. Raw format (JSON, CSV, Parquet). Schema applied at query time. Flexible but can have quality issues.
Warehouse: schema-on-write. Data must conform to a defined schema at load time. Enforces quality but requires upfront design.
Lakehouse: schema enforcement at the table level with evolution support. Delta Lake, Iceberg, and Hudi add validation on lake storage.

Data types
Lake: any type: structured, semi-structured, unstructured (images, logs, video).
Warehouse: structured only. Tables with defined columns. Some semi-structured support (BigQuery JSON, Snowflake VARIANT).
Lakehouse: all types. Structured and semi-structured data get ACID transactions and schema enforcement.

Storage cost
Lake: cheapest (S3, GCS, ADLS). No compute charges while data sits idle.
Warehouse: more expensive per GB. Storage and compute may be bundled or separated.
Lakehouse: lake-level storage costs with warehouse-level query performance.

Query performance
Lake: slower for SQL without optimization. Depends on file format and partitioning.
Warehouse: fastest for SQL. Built-in optimizer, statistics, caching.
Lakehouse: good with table formats. File-level statistics, Z-ordering, compaction.

ACID transactions
Lake: none natively. Concurrent writes can corrupt data.
Warehouse: full ACID. Concurrent reads and writes are safe.
Lakehouse: ACID via Delta Lake, Iceberg, or Hudi. Time travel enabled.

Best for
Lake: raw storage, ML, unknown future use cases, very large datasets.
Warehouse: BI, SQL analytics, well-defined models, fast interactive queries.
Lakehouse: teams needing both analytics and ML, cost-conscious orgs with large data.
Each era solved the previous era's problems and created new ones. Understanding this history helps you explain to interviewers why the current architecture exists.
The on-premise warehouse era. Analytics ran on on-premise systems like Teradata, Oracle, and IBM Netezza. Data was structured, ETL was rigid, and storage was expensive. Everything went through a defined schema before it entered the warehouse. Capacity planning meant buying hardware months in advance. This worked for stable, well-understood reporting workloads, but could not handle the data volume explosion that came with web-scale applications.

The Hadoop data lake era. Hadoop promised unlimited storage on cheap commodity hardware. Companies dumped everything into HDFS: logs, clickstreams, JSON, CSV, images. The idea was 'store now, figure out the schema later.' In practice, many data lakes became data swamps. Without schema enforcement, data quality degraded. Finding and trusting data became harder than storing it. MapReduce was powerful but painful to write. Hive added SQL on top, but performance lagged behind dedicated warehouses.

The cloud warehouse era. Snowflake, BigQuery, and Amazon Redshift moved warehouses to the cloud with separated storage and compute. Suddenly you could scale compute up for a big query and scale it back down. No capacity planning, no hardware purchases. Cloud warehouses made SQL analytics accessible and fast. But they were expensive for very large datasets (petabyte scale), and they were not designed for ML workloads that need raw file access.

The lakehouse era. Delta Lake, Apache Iceberg, and Apache Hudi add warehouse features (ACID transactions, schema enforcement, time travel) on top of lake storage (S3, GCS). You get the cost of a lake with the reliability of a warehouse. Engines like Spark, Trino, and Dremio query lakehouse tables directly. Snowflake and BigQuery now read Iceberg tables natively. The boundary between lake and warehouse is blurring, and interviewers want you to understand this convergence.
This is the core conceptual difference between lakes and warehouses. Interviewers test whether you understand the trade-offs, not just the definitions.
-- Table must exist with defined types BEFORE data loads
CREATE TABLE orders (
    order_id     BIGINT NOT NULL,
    customer_id  BIGINT NOT NULL,
    order_date   DATE NOT NULL,
    total_amount DECIMAL(10,2) NOT NULL,
    status       VARCHAR(20) NOT NULL
);

-- This INSERT fails if types do not match
INSERT INTO orders VALUES (1, 42, '2025-03-15', 99.99, 'shipped');

-- Schema violations are caught at write time
-- Bad data never enters the warehouse
Schema-on-write catches errors at ingestion time. If a source system sends a string where a number is expected, the load fails immediately. This is the warehouse approach: strict, reliable, but requires knowing your schema upfront. Changes to the schema require ALTER TABLE and potentially reloading historical data.
-- Raw JSON files land in S3 with no schema enforcement
-- s3://data-lake/orders/2025/03/15/orders.json
-- {"order_id": 1, "customer_id": 42, "total": "99.99", "status": "shipped"}
-- {"order_id": 2, "customer_id": null, "total": "bad_value"}
-- Schema applied at query time
SELECT
CAST(order_id AS BIGINT) AS order_id,
CAST(customer_id AS BIGINT) AS customer_id,
TRY_CAST(total AS DECIMAL(10,2)) AS total_amount
FROM read_json('s3://data-lake/orders/2025/03/15/*.json');

Schema-on-read stores raw data as-is. The schema is applied when you query, not when you load. This is flexible: you can change your interpretation of the data without reloading it. But it pushes data quality problems downstream. That null customer_id and the 'bad_value' string will not cause errors until someone queries the data and the CAST fails. Every consumer must handle these issues independently.
The right architecture depends on the specific workload. Here are common scenarios and what to recommend in each.
BI dashboards for business users. BI tools like Tableau, Looker, and Power BI are built for warehouse connections. They expect fast SQL responses, defined schemas, and consistent data types. A warehouse gives the best query performance and the most predictable experience for business users.

High-volume event logs. Event logs can generate terabytes per day. Storing them in a warehouse is expensive and unnecessary if most of the data will never be queried directly. Land the raw data in S3 as compressed Parquet files. Run batch processing with Spark when you need to analyze it. Move aggregated results to the warehouse for dashboards.

ML training data. ML frameworks (PyTorch, TensorFlow, Spark MLlib) read directly from object storage. They need file-level access, not SQL access. A data lake gives the cheapest storage for large training datasets. A lakehouse adds versioning and schema enforcement so your training data is reproducible.

Audit history and compliance. Lakehouse table formats (Iceberg, Delta Lake) provide time travel and audit logs natively. You can query the state of any table as of any point in time. This is valuable for compliance, debugging, and reproducing historical reports without maintaining separate snapshots.

Small total data volume (roughly a terabyte or less). At this scale, a managed warehouse like BigQuery (pay per query) or Snowflake (auto-suspend) costs almost nothing. The operational simplicity of a single system outweighs the cost savings of a lake. Do not over-architect for scale you do not have.

Unstructured binary data. Unstructured binary files (images, video, audio) cannot be stored in a warehouse. The metadata can go in a warehouse for querying, but the files themselves belong in object storage. A lake is the only option for binary data at scale.
Interviewers do not ask you to recite definitions. They present a scenario and want to hear you reason through the architecture choice.
'Which architecture would you choose here?' The interviewer describes a company's data needs: volume, types, users, latency requirements. They want you to pick lake, warehouse, or lakehouse and explain why. The strongest answers name specific trade-offs: 'A warehouse is ideal here because the data is structured, the team uses Looker, and the total volume is under 1 TB. A lake would save on storage but add operational complexity the team cannot afford.'

'Design a data platform for this company.' This is a system design question in disguise. The expected answer usually involves both a lake and a warehouse: raw data lands in the lake, transformations move clean data to the warehouse, and ML workloads read directly from the lake. The lakehouse pattern is the modern answer for companies that want to reduce duplication. Interviewers want to hear you reason through the decision, not just name the components.
'How do you handle schema changes?' What happens when a source system adds a new column? In a warehouse, you ALTER TABLE or rebuild the model. In a lake, the new column just appears in the raw files and downstream queries need updating. In a lakehouse with Iceberg, schema evolution is a first-class operation: add, rename, or reorder columns without rewriting data. This question tests whether you understand the operational burden of each approach.
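To make 'downstream queries need updating' concrete, here is a minimal Python sketch of the schema-on-read side. The records, column names, and default value are invented for illustration; the point is that every consumer must carry its own handling for files written before the new column existed.

```python
# Illustrative sketch: lake "files" as lists of dicts, as if parsed from JSON.
old_records = [{"order_id": 1, "total": 99.99}]                    # written before the change
new_records = [{"order_id": 2, "total": 10.00, "channel": "web"}]  # source added "channel"

def read_with_schema(records, schema):
    """Schema-on-read: each reader applies its own schema at query time,
    filling in defaults for columns that older files never had."""
    return [{col: rec.get(col, default) for col, default in schema.items()}
            for rec in records]

# Every downstream consumer must maintain this schema (and its defaults) itself.
schema = {"order_id": None, "total": None, "channel": "unknown"}
rows = read_with_schema(old_records + new_records, schema)
print(rows[0]["channel"])  # old record gets the default: unknown
print(rows[1]["channel"])  # new record keeps its value: web
```

In a lakehouse, the table format's catalog holds this schema once, centrally, instead of every reader reimplementing it.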
'How fresh does the data need to be?' Real-time data in a warehouse is expensive (continuous loading, always-on compute). Batch loading to a lake is cheap but stale. Interviewers want you to match freshness requirements to architecture: real-time dashboards need a warehouse with streaming ingestion, while weekly reports can use batch-loaded lake data. The answer is always 'it depends' followed by the specific factors you would evaluate.
These three formats compete to be the standard metadata layer for lakehouses. Knowing the differences is a common interview topic for senior roles.
Delta Lake. The first widely adopted lakehouse format. Built on Parquet files with a JSON transaction log. Tight integration with Spark and Databricks. Supports MERGE, UPDATE, DELETE, time travel, and Z-ordering for query optimization. The transaction log records every change, enabling audit trails and rollbacks. Strongest ecosystem within the Databricks platform. Now open-sourced and supported by other engines, but the deepest feature set remains on Databricks.
Apache Iceberg. Designed for vendor independence. Uses a metadata tree structure (catalog, metadata files, manifest lists, manifest files) that tracks every data file. Supports hidden partitioning (partition evolution without rewriting data), schema evolution, and time travel. Adopted by Snowflake, BigQuery, AWS Athena, Trino, Spark, and Dremio: the broadest engine support of any format. Becoming the default choice for teams that want to avoid vendor lock-in.
Apache Hudi. Built for incremental data processing. Specializes in upserts and change data capture on data lakes. Two table types: Copy-on-Write (rewrites entire files on update, fast reads) and Merge-on-Read (appends deltas, fast writes, slower reads until compaction). Strong fit for CDC pipelines where you need to replicate OLTP changes to the lake efficiently. Less broad engine support than Iceberg but strong in the Spark ecosystem.
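The Copy-on-Write versus Merge-on-Read trade-off can be sketched in a few lines of Python. This is a toy model, not Hudi's actual file layout; the point is only where the merge work happens: at write time (CoW) or at read time (MoR).

```python
# Toy model of the two table types; records stand in for rows in Parquet files.

def copy_on_write_update(base_file, key, new_row):
    """CoW: rewrite the whole file on update, so reads stay a plain scan."""
    return [new_row if row["id"] == key else row for row in base_file]

def merge_on_read_update(delta_log, key, new_row):
    """MoR: just append the change; writes are cheap."""
    delta_log.append((key, new_row))
    return delta_log

def merge_on_read_scan(base_file, delta_log):
    """MoR reads must merge deltas: slower until compaction folds them in."""
    latest = {row["id"]: row for row in base_file}
    for key, row in delta_log:
        latest[key] = row
    return list(latest.values())

base = [{"id": 1, "status": "pending"}, {"id": 2, "status": "pending"}]
cow = copy_on_write_update(base, 1, {"id": 1, "status": "shipped"})
mor = merge_on_read_scan(base, merge_on_read_update([], 1, {"id": 1, "status": "shipped"}))
print(cow[0]["status"], mor[0]["status"])  # both read back: shipped shipped
```

Compaction in MoR is essentially running the merge once and writing the result back as a new base file, restoring CoW-style read speed.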
Cost is often the deciding factor between architectures. Understanding where the money goes shows interviewers you have production experience.
Storage cost
Lake: lowest. S3 Standard is roughly $0.023 per GB per month; Infrequent Access drops to $0.0125. For 100 TB, that is about $2,300/month on Standard. No compute charges while the data sits idle.
Warehouse: higher per GB. Snowflake compressed storage is roughly $23 to $40 per TB per month. BigQuery active storage is $0.02 per GB ($20 per TB). Warehouses compress data aggressively, so actual costs are lower than raw size, but still more than lake storage.
Lakehouse: same as lake storage, because the data lives in S3/GCS. The metadata layer (Iceberg manifests, Delta logs) adds negligible overhead. You get warehouse-like features at lake storage prices.
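The back-of-envelope arithmetic behind the 100 TB figure, using the approximate rates quoted above (prices change by region and over time; treat these as illustrative):

```python
# Approximate S3 rates quoted above (USD per GB-month); real pricing varies.
S3_STANDARD = 0.023
S3_INFREQUENT_ACCESS = 0.0125

def monthly_storage_cost(terabytes, rate_per_gb):
    return terabytes * 1024 * rate_per_gb  # 1 TB = 1024 GB

standard = monthly_storage_cost(100, S3_STANDARD)             # ~$2,355/month
infrequent = monthly_storage_cost(100, S3_INFREQUENT_ACCESS)  # ~$1,280/month
print(round(standard), round(infrequent))
```

Being able to produce this kind of estimate on a whiteboard is exactly the production-cost fluency the section describes.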
Compute cost
Lake: pay for the compute engine separately (EMR, Dataproc, Spark clusters). Costs vary widely with cluster size and duration. Can be expensive if clusters run continuously, but cheap for sporadic batch jobs.
Warehouse: bundled or separated. Snowflake charges per credit (roughly $2 to $4 per credit). BigQuery charges per TB scanned ($5 to $6.25 per TB). Costs scale directly with query volume and complexity. Auto-suspend reduces idle costs.
Lakehouse: same as lake compute; you bring your own engine. But lakehouse metadata (statistics, file pruning) reduces the amount of data scanned, lowering the effective compute cost compared to querying raw lake files.
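A quick sketch of why file pruning lowers effective compute cost on a pay-per-TB-scanned engine. The $6.25/TB rate is the top of the BigQuery range quoted above; the pruning fraction is an invented illustration:

```python
PRICE_PER_TB_SCANNED = 6.25  # top of the BigQuery on-demand range quoted above (USD)

def query_cost(table_tb, fraction_scanned):
    """Cost of one query when statistics and partitioning let the engine
    skip files and scan only a fraction of the table."""
    return table_tb * fraction_scanned * PRICE_PER_TB_SCANNED

full_scan = query_cost(10, 1.0)  # raw lake files, no statistics: $62.50
pruned = query_cost(10, 0.1)     # metadata prunes 90% of files: $6.25
print(full_scan, pruned)
```

The same pruning logic applies to a self-managed Spark cluster: fewer bytes read means fewer node-hours, even though the billing unit differs.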
Operational cost
Lake: highest hidden cost. Someone must manage file formats, partitioning, compaction, and data quality manually. Without governance tooling, the lake becomes unusable over time.
Warehouse: lowest operational cost. The vendor handles optimization, vacuuming, statistics, and infrastructure. You write SQL and pay the bill.
Lakehouse: middle ground. Table formats handle compaction and metadata automatically, but you still manage the underlying infrastructure (Spark clusters, catalog services, access control). More operational work than a warehouse, less than a raw lake.
Data governance is a frequent interview topic at senior levels. Each architecture handles discovery, access control, and lineage differently.
Discovery
Lake: without a catalog, data is invisible. Files accumulate in nested S3 paths with no searchable metadata. Teams create duplicate datasets because they cannot find existing ones.
Warehouse: built in. Every table is visible in the information schema. BI tools browse the catalog natively. Discovery is a solved problem.
Lakehouse: requires a catalog service (AWS Glue Catalog, Hive Metastore, Unity Catalog). Once set up, tables are discoverable like a warehouse; without it, same problem as a raw lake.

Access control
Lake: IAM policies on S3 buckets or paths. Coarse-grained: you can restrict by folder but not by column or row. No built-in row-level security.
Warehouse: fine-grained. Column-level masking, row-level security, role-based access. Snowflake and BigQuery handle this natively.
Lakehouse: improving. Unity Catalog (Databricks) and Polaris (Snowflake/Iceberg) add column-level security to lakehouse tables, but controls are still less mature than warehouse-native ones.

Lineage
Lake: no built-in lineage. Requires external tools (DataHub, OpenLineage) to track where data came from and where it goes.
Warehouse: some built-in lineage (BigQuery tracks query dependencies). Still benefits from external tools for cross-system tracking.
Lakehouse: table formats track file-level provenance. Combined with orchestration metadata, lineage is achievable but requires assembly.
What is the difference between a data lake and a data warehouse?
Lakes store raw data in any format (schema-on-read). Warehouses store structured data with enforced schemas (schema-on-write). Lakes are cheaper and more flexible; warehouses are faster for SQL and enforce quality. Most organizations use both.

What is a lakehouse?
A lakehouse combines lake storage (cheap, flexible) with warehouse features (ACID, schema enforcement, fast SQL). Delta Lake, Iceberg, and Hudi add metadata layers on top of Parquet files in object storage.

When would you choose a lake over a warehouse?
For extremely large data where cost matters, unstructured data (images, logs), unknown schemas, and ML workloads that need raw data. For pure SQL analytics, a warehouse is still better.

What is the difference between schema-on-write and schema-on-read?
Schema-on-write enforces structure at load time (warehouse). Schema-on-read applies structure at query time (lake). Schema-on-write catches errors early; schema-on-read is more flexible but pushes validation downstream.

What is Apache Iceberg?
An open table format that adds schema enforcement, ACID transactions, partition evolution, and time travel to Parquet files. It makes lakes behave like warehouses without vendor lock-in and is supported by Snowflake, BigQuery, and Spark.

How do you migrate from a data lake to a lakehouse?
You do not move the data. The data stays in S3/GCS as Parquet files; you add a table format (Iceberg, Delta) on top. Register existing Parquet files with the catalog, define schemas, and start writing through the table format for ACID guarantees. Historical data can be migrated in place or left as-is and read through the new metadata layer. The key point: a lakehouse migration is a metadata migration, not a data migration.

How do Delta Lake and Iceberg compare?
Both add ACID transactions and schema enforcement to lake storage. Delta Lake is tightly integrated with Databricks and Spark; Iceberg has broader engine support (Snowflake, BigQuery, Trino, Spark). Iceberg's hidden partitioning and partition evolution are more flexible; Delta Lake's Z-ordering and OPTIMIZE commands are more mature. For interviews, the safe answer is: Iceberg for multi-engine environments, Delta Lake for Databricks-centric teams.

What is a data swamp and how do you prevent one?
A data swamp is a data lake where nobody can find or trust the data. It happens when teams dump data without documentation, quality checks, or governance. Prevention: enforce naming conventions, require metadata at ingestion (owner, schema, freshness SLA), run automated quality checks, and catalog everything with a tool like DataHub or Atlan. A lakehouse with schema enforcement also prevents the worst quality issues.

How does time travel work?
Table formats track every change as a new snapshot. Each snapshot points to a set of data files. Querying an older snapshot reads the files that existed at that point. No data is overwritten; old files are retained until explicitly cleaned up (vacuum/expire). This enables reproducing historical queries, debugging data issues, and rolling back bad writes.
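The snapshot mechanism can be modeled in a few lines of Python. This is a heavy simplification of what Iceberg or Delta metadata actually stores, but the shape is the same: commits add immutable snapshots, and time travel is just reading an older one.

```python
# Toy snapshot log: each commit records the complete set of data files.
snapshots = []  # list of (snapshot_id, tuple_of_files)

def commit(files):
    """A write never mutates old snapshots; it only appends a new one."""
    snapshot_id = len(snapshots)
    snapshots.append((snapshot_id, tuple(files)))
    return snapshot_id

def read_as_of(snapshot_id):
    """Time travel: read the file list exactly as it was at that commit."""
    return snapshots[snapshot_id][1]

v0 = commit(["part-000.parquet"])
v1 = commit(["part-000.parquet", "part-001.parquet"])  # an append
v2 = commit(["part-002.parquet"])                      # e.g. compaction rewrote everything

print(read_as_of(v0))  # ('part-000.parquet',)
print(read_as_of(v2))  # ('part-002.parquet',)
```

Vacuum/expire in the real formats deletes data files no longer referenced by any retained snapshot, which is why time travel has a retention window.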