Concepts

Data Warehouse Design: What Interviewers Test

A data warehouse is a centralized repository optimized for analytical queries. It uses columnar storage, denormalized schemas, and a layered architecture (staging, integration, presentation) to serve reporting, ML, and business intelligence at scale.

The cloud data warehouse market reached $30B in 2025 (Synergy Research), with Snowflake, BigQuery, and Redshift holding 65%+ market share. Interview panelists consistently rate warehouse design as the #1 topic in data modeling rounds, appearing in 80%+ of senior DE interviews.

Architecture Approaches

Three philosophies dominate warehouse design. Interviewers expect you to compare them with nuance, not just name them.

1. Kimball (Dimensional)

Philosophy

Bottom-up. Start with business processes and build dimensional models (star schemas) that users can query directly. Data marts come first; the warehouse is the union of conformed data marts.

Schema Pattern

Star schema with fact tables surrounded by denormalized dimension tables. Optimized for read performance and intuitive querying.

Best For

Organizations that need fast time-to-value, teams with strong BI requirements, and environments where business users query data directly.

2. Inmon (Enterprise)

Philosophy

Top-down. Build a centralized, normalized enterprise data warehouse first. Data marts are then derived from the EDW for departmental use. The single source of truth lives in 3NF.

Schema Pattern

Third normal form (3NF) in the core warehouse. Dimensional models only appear in downstream data marts built from the normalized core.

Best For

Large enterprises with complex data relationships, environments requiring strict consistency, and organizations with dedicated data warehouse teams.

3. Data Vault 2.0

Philosophy

Insert-only, auditable, and resilient to source changes. Separates structure (hubs and links) from context (satellites). Designed for agility when source systems change frequently.

Schema Pattern

Hubs (business keys), links (relationships), and satellites (descriptive attributes with load timestamps). Everything is append-only with full history.

Best For

Regulated industries needing full auditability, environments with volatile source systems, and teams scaling warehouse ingestion across many sources.

Warehouse Layers

Staging

Raw landing zone. Data arrives as-is from source systems with minimal transformation. Truncate-and-reload or append-only patterns.

Integration / ODS

Cleansed, conformed, and deduplicated. Business keys are resolved. This is where Data Vault hubs, links, and satellites live, or where Inmon-style 3NF tables reside.

Presentation / Mart

Business-friendly dimensional models. Star schemas optimized for BI tools. Aggregated metrics, conformed dimensions, and user-facing tables.

Semantic

Metric definitions, access policies, and virtual views. Tools like the dbt Semantic Layer or Cube.js sit here. This layer is the contract between the data and its consumers.

Data flows from left to right through increasingly refined layers, each serving a distinct purpose in the warehouse architecture.

Design Components

The building blocks of every dimensional model. Interviewers expect precise definitions and practical understanding of each.

Fact Tables

Store quantitative, measurable events at a specific grain. Each row represents a business event (an order, a click, a payment). Contain foreign keys to dimensions and numeric measures. Three types: transaction facts (one row per event), periodic snapshots (one row per time period), and accumulating snapshots (one row per process instance, updated as milestones are reached).

Dimension Tables

Store descriptive context for facts. Customer names, product categories, geographic regions. Typically wide and denormalized for query simplicity. Good dimensions have meaningful business attributes, not just IDs. The quality of your dimensions determines how useful your warehouse is for ad-hoc analysis.

Slowly Changing Dimensions

Handle attributes that change over time. Type 1 overwrites the old value (no history). Type 2 adds a new row with effective dates and a current flag (full history). Type 3 stores previous and current values in separate columns (limited history). Interviewers almost always ask about Type 2 because it reveals whether you understand temporal modeling.

Surrogate Keys

System-generated integer keys that replace natural business keys in dimension tables. Provide insulation from source system key changes, support SCD Type 2 versioning (each version gets its own surrogate), and improve join performance. Never expose surrogate keys to business users; they are internal plumbing.

Conformed Dimensions

Shared dimensions used identically across multiple fact tables and data marts. A conformed date dimension, customer dimension, or product dimension means every team uses the same definitions. Without conformance, the same customer has different attributes in different reports, and metrics cannot be compared.

Grain

The level of detail stored in a fact table. Declaring the grain is the single most important design decision. 'One row per order line item per day' is a grain statement. Getting grain wrong causes double-counting, fan-out joins, and incorrect aggregations. Interviewers test this relentlessly.
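The double-counting failure mode is easy to demonstrate. The sketch below uses hypothetical tables in SQLite (a stand-in for a warehouse engine) to show how joining an order-grain fact to a finer-grain line-item table fans out rows and inflates the total:

```python
import sqlite3

# Hypothetical tables illustrating a grain mismatch.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE fct_orders (order_id INTEGER, order_total REAL);   -- grain: one row per order
CREATE TABLE order_lines (order_id INTEGER, line_amount REAL);  -- grain: one row per line item
INSERT INTO fct_orders VALUES (1, 30.0);
INSERT INTO order_lines VALUES (1, 10.0), (1, 20.0);
""")

# Correct: aggregate at the fact table's own grain.
correct = con.execute("SELECT SUM(order_total) FROM fct_orders").fetchone()[0]

# Wrong: joining to a finer-grain table fans out the order row,
# so order_total is counted once per line item.
fanned_out = con.execute("""
    SELECT SUM(f.order_total)
    FROM fct_orders f JOIN order_lines l ON f.order_id = l.order_id
""").fetchone()[0]

print(correct, fanned_out)  # 30.0 60.0 -- the join doubled the total
```

Stating the grain up front ("one row per order" vs "one row per order line") is what tells you this join is unsafe to aggregate over.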

Modern Warehouse Patterns

Warehouse architecture has shifted significantly. These patterns reflect how teams build warehouses today, and what interviewers expect you to know beyond classical Kimball.

ELT-First Architecture

Load raw data into the warehouse first, then transform in-place using SQL. The warehouse engine handles compute. dbt popularized this pattern by making SQL transformations testable and version-controlled. Replaces traditional ETL where transformation happened before loading.

Lakehouse Hybrid

Combine data lake storage (Parquet on S3, Delta Lake, Iceberg) with warehouse query engines. Store once in open formats, query with SQL engines like Spark, Trino, or Snowflake external tables. Eliminates the cost of duplicating data between lake and warehouse.

Separation of Storage and Compute

Decouple where data lives from what processes it. Snowflake, BigQuery, and Redshift Serverless let you scale compute independently of storage. Multiple teams can query the same data with different compute clusters. Eliminates resource contention.

Materialized Views and Incremental Models

Precompute expensive aggregations and store results. Materialized views refresh automatically; dbt incremental models append only new data. Critical for tables with billions of rows where full rebuilds are too slow or expensive.
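The incremental pattern can be sketched with a high-water mark: each run materializes only rows newer than what the target already holds. This is an illustrative SQLite version with hypothetical table names; dbt incremental models express the same idea declaratively.

```python
import sqlite3

# Hypothetical source and target tables for an incremental rollup.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE raw_events (event_id INTEGER, event_ts TEXT);
CREATE TABLE daily_counts (event_ts TEXT PRIMARY KEY, n INTEGER);
INSERT INTO raw_events VALUES (1, '2024-06-01'), (2, '2024-06-01'), (3, '2024-06-02');
""")

def refresh_incremental(con):
    # High-water mark: only process dates newer than the latest one
    # already materialized, instead of rebuilding the whole table.
    con.execute("""
        INSERT OR REPLACE INTO daily_counts
        SELECT event_ts, COUNT(*) FROM raw_events
        WHERE event_ts > COALESCE((SELECT MAX(event_ts) FROM daily_counts), '')
        GROUP BY event_ts
    """)

refresh_incremental(con)                                  # first run: full build
con.execute("INSERT INTO raw_events VALUES (4, '2024-06-03')")
refresh_incremental(con)                                  # second run: only the new date
rows = con.execute("SELECT * FROM daily_counts ORDER BY event_ts").fetchall()
print(rows)  # [('2024-06-01', 2), ('2024-06-02', 1), ('2024-06-03', 1)]
```

A strict `>` cutoff misses late rows for an already-materialized date; production implementations usually widen the window or merge on a unique key for exactly that reason.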

Zero-Copy Clones

Create instant copies of tables or entire databases without duplicating storage. Snowflake, Databricks, and BigQuery support this natively. Enables safe development and testing against production-scale data without doubling storage costs.

How Interviewers Test This

Warehouse design questions are scenario-based. Interviewers describe a business and expect you to model it on the spot.

Scenario 1: Design an e-commerce warehouse

We have an e-commerce platform with orders, products, customers, and inventory. Design a data warehouse schema.

What they want to hear

Start with the grain: one row per order line item. Identify the fact table (fct_order_lines with quantity, unit_price, discount, total_amount). Identify dimensions (dim_customer, dim_product, dim_date, dim_store). Discuss conformed dimensions, SCD Type 2 for customer address changes, and how you would handle returns as a separate fact or a negative line.
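A DDL sketch of that answer follows, using SQLite syntax for illustration. Table and column names match the answer above where given; the remaining columns are plausible assumptions, not a definitive design.

```python
import sqlite3

# Illustrative star schema for the e-commerce scenario.
con = sqlite3.connect(":memory:")
con.executescript("""
-- Grain: one row per order line item.
CREATE TABLE fct_order_lines (
    order_line_key INTEGER PRIMARY KEY,
    order_number   TEXT,     -- degenerate dimension: no separate order table needed
    customer_key   INTEGER REFERENCES dim_customer(customer_key),
    product_key    INTEGER REFERENCES dim_product(product_key),
    date_key       INTEGER REFERENCES dim_date(date_key),
    quantity       INTEGER,
    unit_price     REAL,
    discount       REAL,
    total_amount   REAL
);
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,   -- surrogate key
    customer_id  TEXT,                  -- natural key from the source system
    name TEXT, city TEXT,
    effective_date TEXT, end_date TEXT, is_current INTEGER  -- SCD Type 2 tracking
);
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,
    sku TEXT, name TEXT, category TEXT
);
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY,       -- e.g. 20240601
    full_date TEXT, year INTEGER, month INTEGER, day_of_week TEXT
);
""")
tables = [r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)  # ['dim_customer', 'dim_date', 'dim_product', 'fct_order_lines']
```

Returns would then be either a negative-quantity row in fct_order_lines or a separate fct_returns at its own grain, as the answer above notes.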

Scenario 2: Model a ride-sharing platform

Design the warehouse for a ride-sharing company. The CEO wants to see driver earnings, rider activity, and trip metrics.

What they want to hear

Multiple fact tables at different grains: fct_trips (one row per trip with duration, fare, distance), fct_driver_shifts (one row per shift), fct_rider_sessions (one row per app session). Shared dimensions: dim_driver, dim_rider, dim_location, dim_date, dim_vehicle. Discuss surge pricing as a degenerate dimension or a separate fact, and how to model cancellations.

Scenario 3: Optimize a slow dashboard

A critical dashboard takes 45 seconds to load. It joins five tables with 2 billion rows of transaction data. How do you fix it?

What they want to hear

Pre-aggregate into a periodic snapshot fact table at the grain the dashboard needs. Create materialized views or dbt incremental models. Add proper clustering or sort keys. Consider a semantic layer for caching. Investigate if the grain is too fine for the dashboard requirements and if the joins can be eliminated by denormalizing.
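The pre-aggregation step can be sketched concretely. This hypothetical example rolls a transaction-grain fact up to the daily-store grain a dashboard actually needs, so the BI query scans a few rows instead of billions:

```python
import sqlite3

# Hypothetical fine-grain fact table (stand-in for the 2B-row table).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE fct_transactions (txn_date TEXT, store_id INTEGER, amount REAL);
INSERT INTO fct_transactions VALUES
    ('2024-06-01', 1, 10.0), ('2024-06-01', 1, 15.0), ('2024-06-01', 2, 7.0);
""")

# Periodic-snapshot-style rollup: one row per store per day,
# built once in the pipeline instead of recomputed on every dashboard load.
con.execute("""
    CREATE TABLE agg_daily_store_sales AS
    SELECT txn_date, store_id, SUM(amount) AS total_sales, COUNT(*) AS txn_count
    FROM fct_transactions
    GROUP BY txn_date, store_id
""")

rows = con.execute(
    "SELECT * FROM agg_daily_store_sales ORDER BY store_id").fetchall()
print(rows)  # [('2024-06-01', 1, 25.0, 2), ('2024-06-01', 2, 7.0, 1)]
```

In a real warehouse this would be a materialized view or an incremental model rather than a one-shot CREATE TABLE AS.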

Scenario 4: Handle late-arriving facts

Orders sometimes arrive in our warehouse 3 to 5 days after they occurred. How do you model this?

What they want to hear

Separate the event timestamp (when it happened) from the load timestamp (when it arrived). Use the event date for the date dimension foreign key, not the load date. Discuss idempotent loading to handle reprocessing. Mention that late-arriving dimensions are the harder problem: if a dimension row does not exist yet, insert a placeholder and update it later.
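Idempotent loading with separate event and load timestamps can be sketched as follows (hypothetical schema, SQLite's INSERT OR REPLACE standing in for a warehouse MERGE):

```python
import sqlite3

# Hypothetical fact table with both timestamps.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE fct_orders (
    order_id   TEXT PRIMARY KEY,
    event_date TEXT,   -- when the order happened: drives the dim_date foreign key
    load_date  TEXT,   -- when it reached the warehouse: audit only
    amount     REAL
)""")

def load_batch(con, batch, load_date):
    # Upsert keyed on order_id makes reprocessing safe: replaying
    # the same batch converges to the same final state.
    con.executemany(
        "INSERT OR REPLACE INTO fct_orders VALUES (?, ?, ?, ?)",
        [(oid, event_date, load_date, amt) for oid, event_date, amt in batch])

# An order from June 1 arrives three days late, then the batch is replayed.
load_batch(con, [("O-1", "2024-06-01", 100.0)], load_date="2024-06-04")
load_batch(con, [("O-1", "2024-06-01", 100.0)], load_date="2024-06-05")

rows = con.execute("SELECT order_id, event_date, amount FROM fct_orders").fetchall()
print(rows)  # one row despite two loads: [('O-1', '2024-06-01', 100.0)]
```

Note the fact's date foreign key comes from event_date, so the order lands on June 1 in reports regardless of when it arrived.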

Interview Questions with Guidance

Q1. Explain the difference between Kimball and Inmon approaches.

A strong answer includes:

Kimball is bottom-up: build dimensional star schemas per business process, and the warehouse is the union of conformed data marts. Inmon is top-down: build a normalized 3NF enterprise data warehouse first, then derive dimensional marts. Kimball delivers value faster; Inmon ensures consistency. Most modern teams blend both, using a normalized staging layer with dimensional presentation layers.

Q2. What is grain and why is it critical?

A strong answer includes:

Grain defines what a single row in a fact table represents. It is the most important decision in dimensional modeling because every other choice follows from it. Getting grain wrong causes double-counting from fan-out joins or loss of detail from premature aggregation. Always state the grain explicitly before designing columns.

Q3. Walk me through SCD Type 2 implementation.

A strong answer includes:

Add surrogate_key, effective_date, end_date, and is_current columns to the dimension table. When an attribute changes, close the current row (set end_date and is_current = false) and insert a new row with a new surrogate key, updated attributes, and is_current = true. Fact tables join on surrogate key, so historical facts retain the dimension values as they were at transaction time.
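The close-and-insert mechanics can be shown end to end. This is an illustrative SQLite implementation with a hypothetical dim_customer; a production warehouse would typically express the same two steps as a single MERGE statement.

```python
import sqlite3

# Hypothetical SCD Type 2 dimension with versioned rows.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE dim_customer (
    surrogate_key  INTEGER PRIMARY KEY AUTOINCREMENT,
    customer_id    TEXT,    -- natural/business key
    city           TEXT,
    effective_date TEXT,
    end_date       TEXT,
    is_current     INTEGER
)""")
con.execute("INSERT INTO dim_customer (customer_id, city, effective_date, end_date, is_current) "
            "VALUES ('C1', 'Boston', '2024-01-01', '9999-12-31', 1)")

def apply_scd2(con, customer_id, new_city, change_date):
    # Step 1: close the current version.
    con.execute("""UPDATE dim_customer
                   SET end_date = ?, is_current = 0
                   WHERE customer_id = ? AND is_current = 1""",
                (change_date, customer_id))
    # Step 2: insert a new version with a fresh surrogate key.
    con.execute("""INSERT INTO dim_customer
                   (customer_id, city, effective_date, end_date, is_current)
                   VALUES (?, ?, ?, '9999-12-31', 1)""",
                (customer_id, new_city, change_date))

apply_scd2(con, "C1", "Denver", "2024-06-01")

rows = con.execute("SELECT surrogate_key, city, is_current FROM dim_customer "
                   "ORDER BY surrogate_key").fetchall()
print(rows)  # [(1, 'Boston', 0), (2, 'Denver', 1)]
```

Facts loaded before June 2024 carry surrogate key 1 and keep reporting Boston, which is exactly the point of Type 2.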

Q4. How do you choose between a star schema and a snowflake schema?

A strong answer includes:

Star schemas denormalize dimensions into single wide tables for simpler queries and better performance. Snowflake schemas normalize dimensions into sub-tables, reducing storage but increasing join complexity. Star schemas are almost always preferred in modern warehouses because storage is cheap, query simplicity matters, and columnar engines handle wide tables efficiently. Only snowflake when you have extremely high-cardinality hierarchies that would bloat the dimension table.

Q5. What are conformed dimensions and why do they matter?

A strong answer includes:

Conformed dimensions are shared dimension tables used identically across multiple fact tables. A conformed dim_date or dim_customer means every report uses the same definitions. Without them, different teams define 'active customer' differently, and cross-functional metrics are impossible to reconcile. They are the glue that makes a warehouse a warehouse rather than a collection of disconnected tables.

Q6. How would you model a many-to-many relationship in a star schema?

A strong answer includes:

Use a bridge table (also called an association table; note this is distinct from a factless fact table, which records events with no measures). For example, if one order can have multiple promotions and one promotion applies to many orders, create a bridge table with order_key and promotion_key. Include a weighting factor if you need to allocate measures across the relationship. Acknowledge that bridge tables add complexity and should only be used when the many-to-many relationship is real.
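The weighting-factor allocation works like this (hypothetical tables, SQLite for illustration): weights per order sum to 1.0, so multiplying a measure by the weight splits it across the relationship without double counting.

```python
import sqlite3

# Hypothetical bridge table allocating an order's revenue across promotions.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE fct_orders (order_key INTEGER, revenue REAL);
CREATE TABLE bridge_order_promotion (order_key INTEGER, promotion_key INTEGER, weight REAL);
INSERT INTO fct_orders VALUES (1, 100.0);
-- Two promotions on one order; weights sum to 1.0 so measures allocate cleanly.
INSERT INTO bridge_order_promotion VALUES (1, 10, 0.75), (1, 20, 0.25);
""")

rows = con.execute("""
    SELECT b.promotion_key, SUM(f.revenue * b.weight) AS allocated_revenue
    FROM fct_orders f
    JOIN bridge_order_promotion b ON f.order_key = b.order_key
    GROUP BY b.promotion_key
    ORDER BY b.promotion_key
""").fetchall()
print(rows)  # [(10, 75.0), (20, 25.0)] -- totals still sum to 100.0
```

Summing allocated_revenue across all promotions returns the original 100.0, which is the invariant to check when reviewing a bridge design.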

Q7. Explain the three types of fact tables.

A strong answer includes:

Transaction facts record one row per discrete event (each order line, each click). Periodic snapshots capture the state at regular intervals (daily account balances, monthly inventory levels). Accumulating snapshots track the lifecycle of a process with milestone dates (order placed, shipped, delivered, returned). Each type serves different analytical needs.

Q8. How do you handle data warehouse partitioning and clustering?

A strong answer includes:

Partition large fact tables by date to enable partition pruning (queries that filter by date only scan relevant partitions). Cluster within partitions by high-cardinality columns frequently used in filters or joins (customer_id, product_id). In Snowflake, use clustering keys. In BigQuery, use partitioning plus clustering. In Redshift, use sort keys and distribution keys. The goal is to minimize data scanned per query.

Q9. What is a degenerate dimension?

A strong answer includes:

A dimension attribute stored directly in the fact table rather than in a separate dimension table. Typically transaction identifiers like order_number or invoice_id. They do not warrant their own dimension table because they have no additional descriptive attributes. They exist in the fact table as a reference back to the source transaction.

Q10. How would you migrate a legacy warehouse to a modern architecture?

A strong answer includes:

Phase 1: Inventory existing tables, reports, and consumers using lineage. Phase 2: Replicate the staging layer in the new platform (ELT into cloud warehouse). Phase 3: Rebuild presentation models incrementally, starting with the highest-value dashboards. Phase 4: Run dual systems in parallel, validate outputs match, then cut over consumers. Never attempt a big-bang migration; migrate mart by mart with rollback capability.

Common Interview Mistakes

Designing without declaring the grain first

Grain must be the first decision. Without it, you cannot determine which columns belong in the fact table, which dimensions are needed, or whether joins will produce correct results. State the grain in one sentence before touching any columns.

Using natural keys instead of surrogate keys in fact tables

Natural keys from source systems change, get recycled, or vary across systems. Surrogate keys provide stability, support SCD Type 2 versioning, and improve join performance. Always use integer surrogate keys in fact-to-dimension relationships.

Treating the warehouse as a copy of the source system

A warehouse is not a replica of your OLTP database. It is a purpose-built analytical structure. Simply copying source tables into the warehouse without dimensional modeling means you inherit all the normalization and complexity of the source without any analytical benefit.

Ignoring slowly changing dimensions

If you assume dimensions never change, your historical analysis is wrong. Customer addresses change, product categories are reorganized, employee departments shift. Discuss SCD types proactively. Interviewers specifically probe for this understanding.

Over-normalizing the presentation layer

Normalization belongs in the staging and integration layers. The presentation layer should be denormalized star schemas optimized for query performance and user comprehension. BI tools expect wide dimension tables, not a web of normalized sub-tables.

Frequently Asked Questions

What is data warehouse design?

Data warehouse design is the process of structuring a database specifically for analytical queries and reporting. It involves choosing an architecture (Kimball, Inmon, or Data Vault), defining fact and dimension tables, establishing grain, and building layers from raw staging through to business-facing presentation models.

What is the difference between a data warehouse and a database?

A database (OLTP) is optimized for fast transactional reads and writes with normalized schemas. A data warehouse (OLAP) is optimized for complex analytical queries across large volumes of historical data, using denormalized schemas like star and snowflake designs.

Do interviews ask about data warehouse design?

Yes, frequently. Warehouse design appears in system design rounds, data modeling exercises, and even SQL questions (where understanding grain prevents incorrect joins). It signals that a candidate thinks architecturally about data, not just about individual queries.

What is a star schema?

A star schema is a dimensional modeling pattern with a central fact table (containing measures and foreign keys) surrounded by denormalized dimension tables (containing descriptive attributes). Named for its star-like shape when diagrammed. It is the most common warehouse schema pattern.

What is the difference between ETL and ELT?

ETL transforms data before loading into the warehouse (transformation happens in a separate engine). ELT loads raw data first, then transforms inside the warehouse using SQL. ELT is the modern standard because cloud warehouses provide cheap, scalable compute for in-place transformations.

How do I choose between Kimball and Data Vault?

Kimball is better for teams that need fast, query-friendly models and have relatively stable source systems. Data Vault is better for environments with many volatile sources, strict auditability requirements, or teams that need to scale ingestion independently of presentation modeling.

What are the layers of a data warehouse?

Typically four layers: staging (raw data landing zone), integration or ODS (cleansed and conformed data), presentation or mart (dimensional models for BI), and semantic (metric definitions and access policies). Not every warehouse implements all four, but understanding the layered approach shows architectural maturity.

How does a data warehouse handle real-time data?

Traditional warehouses are batch-oriented, but modern platforms support near-real-time via micro-batch loading (every few minutes), streaming ingestion (Snowpipe, BigQuery streaming inserts), and materialized views that refresh incrementally. True sub-second latency still requires a separate streaming layer alongside the warehouse.
