Data warehouse interview questions for data engineer roles. Warehouse design from raw ingest to gold layer, fact granularity and grain selection, conformed dimensions across facts, ETL versus ELT trade-offs, Kimball versus Inmon versus Data Vault as competing design philosophies, OLAP versus OLTP.
A data warehouse interview question for a data engineer role can mean three different things depending on the company. At Kimball-shop companies (most), the question is about dimensional modeling: star schemas, conformed dimensions, SCD types, fact grain. At Inmon-influenced companies (large financial services, healthcare, traditional enterprise IT), the question is about a 3NF integrated enterprise data warehouse with downstream data marts. At lakehouse-first companies (Databricks, Netflix, Uber, modern startups), the question is about medallion architecture (bronze/silver/gold) with Iceberg or Delta as the table format and dbt or Spark for the transformation layer.
Six warehouse design questions appear most often in 2026 data engineer interviews. ETL versus ELT: ETL transforms in flight (Talend, Informatica, custom code) and lands the clean data in the warehouse. ELT lands raw data in the warehouse first (bronze) and transforms inside the warehouse (dbt or Spark on the bronze tables). ELT dominates in 2026 because compute is cheap inside columnar warehouses and the raw layer enables replay. OLAP versus OLTP: OLAP (analytical processing) is wide read queries over historical data, columnar storage, denormalized stars or OBT. OLTP (transactional processing) is narrow point-lookup and update queries over current state, row storage, normalized 3NF. The data engineer's warehouse is OLAP; the source systems are OLTP. Kimball versus Inmon versus Vault: Kimball is denormalized star schemas optimized for query simplicity, designed bottom-up from business processes. Inmon is normalized 3NF integrated EDW designed top-down from enterprise data architecture, with downstream Kimball-style marts for query. Vault is hub-link-satellite designed for audit and parallel ingestion, with a downstream business vault or Kimball marts for query.
Conformed dimensions across data marts is the senior data engineer warehouse question. A single dim_customer used by the sales mart, the support mart, and the marketing mart enables cross-mart analysis without explicit identifier translation. Without conformed dims, each mart has its own dim_customer schema and identity, and joining "customers who placed orders and opened support tickets" requires custom mapping. Conformed dim discipline is upfront design cost paid back across every cross-mart query. Senior data engineer rubrics weight this; mid-level rubrics often skip it.
Slowly-changing fact corrections is another senior warehouse question. When a transaction correction arrives (a refund processed after the daily aggregation ran), how does the warehouse reflect it? Two patterns. Append-only with version column: every correction is a new fact row with version+1; the latest version is the truth; queries filter on max(version) per natural key. In-place update with audit log: UPDATE the existing fact row, INSERT into an audit table for compliance. Trade-off: query simplicity (in-place) versus audit ease (append-only) versus storage cost (append-only takes more). Financial warehouses almost always need append-only with audit; product analytics warehouses usually accept in-place with a lightweight audit log.
Multi-region warehouse design appears at L6 senior data engineer interviews. Active-active across regions with async CDC replication for cross-region facts; conflict resolution via last-writer-wins or CRDT for counters; SLA tiers (real-time within region, eventually-consistent across regions); 2x storage minimum cost. Multi-region is rarely the right answer in a 45-minute interview but is the expected stretch-question follow-up at L6+.
Data Warehouse Interview Questions
Warehouse architecture and design problems for data engineer interview prep.
57 practice problems matching this filter. Difficulty: medium (32), easy (8), hard (17).
Data Modeling (57)
- A/B Experiment Assignment Schema - medium - One user, one experiment, one variant. No exceptions.
- Where They Used to Live - medium - They moved. The data stayed behind.
- Airline Flight Operations Schema - medium - Flights, passengers, and routes. Before you draw a single table, tell me the grain.
- A Number for the Seller - easy - They want a total. Give them the right schema first.
- B2B Invoicing Data Model - easy - Invoices go out, partial payments trickle in, and some customers are three months overdue.
- Clickstream and Session Schema - medium - Millions of clicks, mostly anonymous.
- Cloud File Storage Metadata Schema - hard - A file is also a folder. A folder is also a file.
- Content Engagement Data Model - hard - Post published. Now measure everything that happens next.
- Content Search and Discovery Schema - hard - Searchable from every angle. Design it so nothing gets lost.
- Customer Address History - easy - People move. Sometimes twice in a month. How do you remember where everyone was, and when?
Common questions
- What is the difference between ETL and ELT?
- ETL transforms data in flight (in a Talend or Informatica or custom-code pipeline) before landing the clean data in the warehouse. ELT lands raw data in the warehouse first (a bronze layer) and transforms inside the warehouse with dbt or Spark on the bronze tables. ELT dominates in 2026 because columnar warehouse compute is cheap and the raw layer enables replay and backfill without re-running ingest.
- What is the difference between OLAP and OLTP?
- OLAP (online analytical processing) is wide read queries over historical data, columnar storage, denormalized star schemas or OBT, optimized for aggregation. OLTP (online transactional processing) is narrow point-lookup and update queries over current state, row storage, normalized 3NF schemas, optimized for transaction throughput. A data engineer's warehouse is OLAP. The source systems feeding the warehouse are OLTP.
- What is the difference between Kimball, Inmon, and Data Vault?
- Kimball: denormalized star schemas optimized for query simplicity, designed bottom-up from business processes. Inmon: normalized 3NF integrated enterprise data warehouse designed top-down from enterprise architecture, with downstream Kimball marts for query. Data Vault: hub-link-satellite designed for audit and parallel ingestion, with a downstream business vault or Kimball marts for query. Most 2026 startups and tech companies use Kimball. Large financial services, healthcare, and traditional enterprise IT often use Inmon or Vault.
- What are conformed dimensions and why do senior data engineer rubrics weight them?
- Conformed dimensions are dimension tables with one schema and one set of surrogate keys, used by multiple fact tables (or data marts). One dim_customer used by the sales mart, support mart, and marketing mart. Without conformed dims, joining 'customers who placed orders and opened support tickets' requires custom identifier mapping. The upfront design cost pays back on every cross-mart query.
- How does a data engineer handle slowly-changing fact corrections?
- Two patterns. Append-only with version column: every correction is a new fact row with version+1; queries filter on max(version) per natural key. In-place update with audit log: UPDATE the fact row, INSERT into a separate audit table for compliance. Trade-off: query simplicity vs audit ease vs storage cost. Financial warehouses usually need append-only with audit; product analytics warehouses often accept in-place with lightweight audit.
- What is the medallion architecture in a modern warehouse?
- Three layers. Bronze: raw, append-only, schema-on-read, ideally still in source format. Silver: cleaned, typed, deduplicated, conformed dimensions ready. Gold: business-ready and modeled, usually star schemas. Each layer has its own ownership and quality contract. A bug in silver does not require re-ingesting from the source. Default on lakehouses like Databricks Delta and Iceberg.
- How should a data engineer design a multi-region warehouse?
- Active-active across regions with async CDC replication for cross-region facts. Conflict resolution via last-writer-wins for ordered data or CRDT for counters. SLA tiers: real-time within region, eventually-consistent across regions. 2x storage minimum cost. Multi-region is rarely the right answer in a 45-minute interview but is the expected stretch-question follow-up at L6+. Most companies do not need it.
- What is the difference between a data warehouse and a data lake?
- Data warehouse: structured tables (rows and columns), columnar storage, optimized for analytical queries, examples are Snowflake, BigQuery, Redshift. Data lake: object storage (S3, GCS, ADLS) holding files in formats like Parquet, ORC, Avro, JSON, with a separate query engine (Athena, Presto, Spark) on top. Lakehouse merges the two: object storage with table format (Iceberg, Delta, Hudi) providing ACID transactions and schema evolution on top of files. Databricks and Snowflake are converging toward the lakehouse.