Data Warehouse Design

A data warehouse is a centralized repository optimized for analytical queries. It uses columnar storage, denormalized schemas, and layered architecture (staging, integration, presentation) to serve reporting, ML, and business intelligence at scale.

What this guide actually says

Data warehouse design is grain first, dimensional model second, layer third. Kimball is the default; Data Vault when source systems are volatile or audit is required. The presentation layer should be denormalized star schemas regardless of the integration layer underneath. SCD Type 2 is the interview-tested temporal pattern. Modern patterns (ELT, lakehouse, separation of storage and compute, zero-copy clones) reshape execution; they don't change the modeling principles.

Architecture approaches

Three philosophies dominate warehouse design. Interviewers expect you to compare them with nuance, not just name them.

Bottom-up

Kimball (Dimensional)

Start with business processes and build dimensional models (star schemas) that users can query directly. Data marts come first; the warehouse is the union of conformed data marts. Schema: star with fact tables surrounded by denormalized dimensions. Best for organizations needing fast time-to-value, teams with strong BI requirements, and environments where business users query data directly.

Top-down

Inmon (Enterprise)

Build a centralized, normalized enterprise data warehouse first. Data marts are then derived from the EDW for departmental use. The single source of truth lives in 3NF. Schema: 3NF in the core warehouse; dimensional models only appear in downstream marts. Best for large enterprises with complex data relationships, environments requiring strict consistency, organizations with dedicated warehouse teams.

Insert-only

Data Vault 2.0

Auditable and resilient to source changes. Separates structure (hubs and links) from context (satellites). Designed for agility when source systems change frequently. Schema: hubs (business keys), links (relationships), satellites (descriptive attributes with load timestamps). Append-only with full history. Best for regulated industries needing full auditability, environments with volatile source systems, and teams scaling warehouse ingestion across many sources.

Warehouse layers

Data flows from raw landing through increasingly refined layers, each serving a distinct purpose.

LayerPurposeDescription
StagingRaw landing zoneData arrives as-is from source systems with minimal transformation. Truncate-and-reload or append-only patterns.
Integration / ODSCleansed and conformedDeduplicated. Business keys resolved. Where Data Vault hubs and links live, or Inmon 3NF tables reside.
Presentation / MartDimensional modelsBusiness-friendly star schemas optimized for BI. Aggregated metrics, conformed dimensions, user-facing tables.
SemanticMetric layer + accessMetric definitions, access policies, virtual views. Tools like dbt metrics layer or Cube.js. The contract between data and consumers.

Design components

The building blocks of every dimensional model. Interviewers expect precise definitions and practical understanding of each.

Fact tables

Store quantitative, measurable events at a specific grain. Each row represents a business event (order, click, payment). Contain foreign keys to dimensions and numeric measures. Three types: transaction facts (one row per event), periodic snapshots (one row per time period), accumulating snapshots (one row per lifecycle).

Dimension tables

Store descriptive context for facts. Customer names, product categories, geographic regions. Typically wide and denormalized for query simplicity. Good dimensions have meaningful business attributes, not just IDs. The quality of your dimensions determines how useful your warehouse is for ad-hoc analysis.

Slowly changing dimensions

Handle attributes that change over time. Type 1 overwrites the old value (no history). Type 2 adds a new row with effective dates and a current flag (full history). Type 3 stores previous and current values in separate columns (limited history). Interviewers almost always ask about Type 2 because it reveals whether you understand temporal modeling.

Surrogate keys

System-generated integer keys that replace natural business keys in dimension tables. Provide insulation from source system key changes, support SCD Type 2 versioning (each version gets its own surrogate), improve join performance. Never expose surrogate keys to business users; they're internal plumbing.

Conformed dimensions

Shared dimensions used identically across multiple fact tables and data marts. A conformed date dimension, customer dimension, or product dimension means every team uses the same definitions. Without conformance, the same customer has different attributes in different reports, and metrics cannot be compared.

Grain

The level of detail stored in a fact table. Declaring the grain is the single most important design decision. 'One row per order line item per day' is a grain statement. Getting grain wrong causes double-counting, fan-out joins, incorrect aggregations. Interviewers test this relentlessly.

Modern warehouse patterns

Warehouse architecture has shifted significantly. These patterns reflect how teams build warehouses today, beyond classical Kimball.

ELT-first architecture

Load raw data into the warehouse first, then transform in-place using SQL. The warehouse engine handles compute. dbt popularized this by making SQL transformations testable and version-controlled. Replaces traditional ETL where transformation happened before loading.

Lakehouse hybrid

Combine data lake storage (Parquet on S3, Delta Lake, Iceberg) with warehouse query engines. Store once in open formats, query with SQL engines like Spark, Trino, or Snowflake external tables. Eliminates the cost of duplicating data between lake and warehouse.

Separation of storage and compute

Decouple where data lives from what processes it. Snowflake, BigQuery, and Redshift Serverless let you scale compute independently of storage. Multiple teams can query the same data with different compute clusters. Eliminates resource contention.

Materialized views and incremental models

Precompute expensive aggregations and store results. Materialized views refresh automatically; dbt incremental models append only new data. Critical for tables with billions of rows where full rebuilds are too slow or expensive.

Zero-copy clones

Create instant copies of tables or entire databases without duplicating storage. Snowflake, Databricks, and BigQuery support this natively. Enables safe development and testing against production-scale data without doubling storage costs.

How interviewers test this

Warehouse design questions are scenario-based. Interviewers describe a business and expect you to model it on the spot.

E-commerce warehouse

Design a warehouse for orders, products, customers, and inventory.

Start with grain: one row per order line item. Identify the fact (fct_order_lines with quantity, unit_price, discount, total_amount). Identify dimensions (dim_customer, dim_product, dim_date, dim_store). Discuss conformed dimensions, SCD Type 2 for customer address changes, and how you'd handle returns as a separate fact or a negative line.

Ride-sharing

Design the warehouse for a ride-sharing company.

Multiple facts at different grains: fct_trips (one row per trip with duration, fare, distance), fct_driver_shifts (one row per shift), fct_rider_sessions (one row per app session). Shared dimensions: dim_driver, dim_rider, dim_location, dim_date, dim_vehicle. Discuss surge pricing as a degenerate dimension or a separate fact, and how to model cancellations.

Slow dashboard

A critical dashboard takes 45 seconds. It joins five tables with 2B rows. How do you fix it?

Pre-aggregate into a periodic snapshot fact at the grain the dashboard needs. Create materialized views or dbt incremental models. Add proper clustering or sort keys. Consider a semantic layer for caching. Investigate if the grain is too fine for the dashboard requirements and if joins can be eliminated by denormalizing.

Late-arriving facts

Orders sometimes arrive in our warehouse 3-5 days after they occurred. How do you model this?

Separate the event timestamp (when it happened) from the load timestamp (when it arrived). Use event date for the date dimension FK, not load date. Discuss idempotent loading to handle reprocessing. Mention that late-arriving dimensions are the harder problem: if a dimension row doesn't exist yet, insert a placeholder and update it later.

Interview questions with guidance

Ten questions covering architecture, grain, SCD types, and modern patterns.

Explain the difference between Kimball and Inmon approaches.

Kimball is bottom-up: build dimensional star schemas per business process, and the warehouse is the union of conformed data marts. Inmon is top-down: build a normalized 3NF enterprise data warehouse first, then derive dimensional marts. Kimball delivers value faster; Inmon ensures consistency. Most modern teams blend both, using a normalized staging layer with dimensional presentation layers.

What is grain and why is it critical?

Grain defines what a single row in a fact table represents. The most important decision in dimensional modeling because every other choice follows from it. Getting grain wrong causes double-counting from fan-out joins or loss of detail from premature aggregation. Always state the grain explicitly before designing columns.

Walk me through SCD Type 2 implementation.

Add surrogate_key, effective_date, end_date, is_current columns to the dimension table. When an attribute changes, close the current row (set end_date and is_current = false) and insert a new row with a new surrogate key, updated attributes, is_current = true. Fact tables join on surrogate key, so historical facts retain the dimension values as they were at transaction time.

How do you choose between a star schema and a snowflake schema?

Star schemas denormalize dimensions into single wide tables for simpler queries and better performance. Snowflake schemas normalize dimensions into sub-tables, reducing storage but increasing join complexity. Star schemas are almost always preferred in modern warehouses because storage is cheap, query simplicity matters, and columnar engines handle wide tables efficiently. Only snowflake when you have extremely high-cardinality hierarchies that would bloat the dimension table.

What are conformed dimensions and why do they matter?

Shared dimension tables used identically across multiple fact tables. A conformed dim_date or dim_customer means every report uses the same definitions. Without them, different teams define 'active customer' differently, and cross-functional metrics are impossible to reconcile. The glue that makes a warehouse a warehouse rather than a collection of disconnected tables.

How would you model a many-to-many relationship in a star schema?

Use a bridge table (also called a factless fact table or association table). If one order can have multiple promotions and one promotion applies to many orders, create a bridge table with order_key and promotion_key. Include a weighting factor if you need to allocate measures across the relationship. Bridge tables add complexity and should only be used when the many-to-many is real.

Explain the three types of fact tables.

Transaction facts record one row per discrete event (each order line, each click). Periodic snapshots capture state at regular intervals (daily account balances, monthly inventory levels). Accumulating snapshots track the lifecycle of a process with milestone dates (order placed, shipped, delivered, returned). Each type serves different analytical needs.

How do you handle warehouse partitioning and clustering?

Partition large fact tables by date for partition pruning. Cluster within partitions by high-cardinality columns frequently used in filters or joins (customer_id, product_id). In Snowflake: clustering keys. In BigQuery: partitioning plus clustering. In Redshift: sort keys and distribution keys. The goal is to minimize data scanned per query.

What is a degenerate dimension?

A dimension attribute stored directly in the fact table rather than in a separate dimension table. Typically transaction identifiers like order_number or invoice_id. They don't warrant their own dimension table because they have no additional descriptive attributes. They exist in the fact table as a reference back to the source transaction.

How would you migrate a legacy warehouse to a modern architecture?

Phase 1: inventory existing tables, reports, consumers using lineage. Phase 2: replicate the staging layer in the new platform (ELT into cloud warehouse). Phase 3: rebuild presentation models incrementally, starting with the highest-value dashboards. Phase 4: run dual systems in parallel, validate outputs match, then cut over consumers. Never attempt a big-bang migration; migrate mart by mart with rollback capability.

Common interview mistakes

Designing without declaring the grain first

Grain must be the first decision. Without it, you cannot determine which columns belong in the fact, which dimensions are needed, or whether joins will produce correct results. State the grain in one sentence before touching any columns.

Using natural keys instead of surrogate keys in fact tables

Natural keys from source systems change, get recycled, or vary across systems. Surrogate keys provide stability, support SCD Type 2 versioning, improve join performance. Always use integer surrogate keys in fact-to-dimension relationships.

Treating the warehouse as a copy of the source system

A warehouse is not a replica of your OLTP database. It is a purpose-built analytical structure. Simply copying source tables into the warehouse without dimensional modeling means you inherit all the normalization and complexity of the source without any analytical benefit.

Ignoring slowly changing dimensions

If you assume dimensions never change, your historical analysis is wrong. Customer addresses change, product categories are reorganized, employee departments shift. Discuss SCD types proactively. Interviewers specifically probe for this understanding.

Over-normalizing the presentation layer

Normalization belongs in the staging and integration layers. The presentation layer should be denormalized star schemas optimized for query performance and user comprehension. BI tools expect wide dimension tables, not a web of normalized sub-tables.

Frequently asked questions

What is data warehouse design?+
Structuring a database specifically for analytical queries and reporting. Involves choosing an architecture (Kimball, Inmon, Data Vault), defining fact and dimension tables, establishing grain, building layers from raw staging through to business-facing presentation models.
What is the difference between a data warehouse and a database?+
A database (OLTP) is optimized for fast transactional reads and writes with normalized schemas. A data warehouse (OLAP) is optimized for complex analytical queries across large volumes of historical data, using denormalized schemas like star and snowflake designs.
What is a star schema?+
A dimensional modeling pattern with a central fact table (containing measures and foreign keys) surrounded by denormalized dimension tables (containing descriptive attributes). Named for its star-like shape when diagrammed. The most common warehouse schema pattern.
What is the difference between ETL and ELT?+
ETL transforms data before loading into the warehouse (transformation happens in a separate engine). ELT loads raw data first, then transforms inside the warehouse using SQL. ELT is the modern standard because cloud warehouses provide cheap, scalable compute for in-place transformations.
How do I choose between Kimball and Data Vault?+
Kimball is better for teams that need fast, query-friendly models and have relatively stable source systems. Data Vault is better for environments with many volatile sources, strict auditability requirements, or teams that need to scale ingestion independently of presentation modeling.
What are the layers of a data warehouse?+
Typically four layers: staging (raw data landing zone), integration or ODS (cleansed and conformed data), presentation or mart (dimensional models for BI), semantic (metric definitions and access policies). Not every warehouse implements all four, but understanding the layered approach shows architectural maturity.
How does a data warehouse handle real-time data?+
Traditional warehouses are batch-oriented, but modern platforms support near-real-time via micro-batch loading (every few minutes), streaming ingestion (Snowpipe, BigQuery streaming inserts), and materialized views that refresh incrementally. True sub-second latency still requires a separate streaming layer alongside the warehouse.
02 / Why practice

Practice data warehouse design

  1. 01

    Active recall beats re-reading by 50%

    Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom

  2. 02

    76% of hiring managers reject on the coding task, not the resume

    From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice

  3. 03

    Five problem shapes cover 80% of data engineer loops

    Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition

Related guides