Data Warehouse Design
A data warehouse is a centralized repository optimized for analytical queries. It uses columnar storage, denormalized schemas, and layered architecture (staging, integration, presentation) to serve reporting, ML, and business intelligence at scale.
What this guide actually says
Data warehouse design is grain first, dimensional model second, layer third. Kimball is the default; Data Vault when source systems are volatile or audit is required. The presentation layer should be denormalized star schemas regardless of the integration layer underneath. SCD Type 2 is the interview-tested temporal pattern. Modern patterns (ELT, lakehouse, separation of storage and compute, zero-copy clones) reshape execution; they don't change the modeling principles.
Architecture approaches
Three philosophies dominate warehouse design. Interviewers expect you to compare them with nuance, not just name them.
Kimball (Dimensional)
Start with business processes and build dimensional models (star schemas) that users can query directly. Data marts come first; the warehouse is the union of conformed data marts. Schema: star with fact tables surrounded by denormalized dimensions. Best for organizations needing fast time-to-value, teams with strong BI requirements, and environments where business users query data directly.
Inmon (Enterprise)
Build a centralized, normalized enterprise data warehouse first. Data marts are then derived from the EDW for departmental use. The single source of truth lives in 3NF. Schema: 3NF in the core warehouse; dimensional models only appear in downstream marts. Best for large enterprises with complex data relationships, environments requiring strict consistency, organizations with dedicated warehouse teams.
Data Vault 2.0
Auditable and resilient to source changes. Separates structure (hubs and links) from context (satellites). Designed for agility when source systems change frequently. Schema: hubs (business keys), links (relationships), satellites (descriptive attributes with load timestamps). Append-only with full history. Best for regulated industries needing full auditability, environments with volatile source systems, and teams scaling warehouse ingestion across many sources.
Warehouse layers
Data flows from raw landing through increasingly refined layers, each serving a distinct purpose.
| Layer | Purpose | Description |
|---|---|---|
| Staging | Raw landing zone | Data arrives as-is from source systems with minimal transformation. Truncate-and-reload or append-only patterns. |
| Integration / ODS | Cleansed and conformed | Deduplicated. Business keys resolved. Where Data Vault hubs and links live, or Inmon 3NF tables reside. |
| Presentation / Mart | Dimensional models | Business-friendly star schemas optimized for BI. Aggregated metrics, conformed dimensions, user-facing tables. |
| Semantic | Metric layer + access | Metric definitions, access policies, virtual views. Tools like dbt metrics layer or Cube.js. The contract between data and consumers. |
Design components
The building blocks of every dimensional model. Interviewers expect precise definitions and practical understanding of each.
Fact tables
Store quantitative, measurable events at a specific grain. Each row represents a business event (order, click, payment). Contain foreign keys to dimensions and numeric measures. Three types: transaction facts (one row per event), periodic snapshots (one row per time period), accumulating snapshots (one row per lifecycle).
Dimension tables
Store descriptive context for facts. Customer names, product categories, geographic regions. Typically wide and denormalized for query simplicity. Good dimensions have meaningful business attributes, not just IDs. The quality of your dimensions determines how useful your warehouse is for ad-hoc analysis.
Slowly changing dimensions
Handle attributes that change over time. Type 1 overwrites the old value (no history). Type 2 adds a new row with effective dates and a current flag (full history). Type 3 stores previous and current values in separate columns (limited history). Interviewers almost always ask about Type 2 because it reveals whether you understand temporal modeling.
Surrogate keys
System-generated integer keys that replace natural business keys in dimension tables. Provide insulation from source system key changes, support SCD Type 2 versioning (each version gets its own surrogate), improve join performance. Never expose surrogate keys to business users; they're internal plumbing.
Conformed dimensions
Shared dimensions used identically across multiple fact tables and data marts. A conformed date dimension, customer dimension, or product dimension means every team uses the same definitions. Without conformance, the same customer has different attributes in different reports, and metrics cannot be compared.
Grain
The level of detail stored in a fact table. Declaring the grain is the single most important design decision. 'One row per order line item per day' is a grain statement. Getting grain wrong causes double-counting, fan-out joins, incorrect aggregations. Interviewers test this relentlessly.
Modern warehouse patterns
Warehouse architecture has shifted significantly. These patterns reflect how teams build warehouses today, beyond classical Kimball.
ELT-first architecture
Load raw data into the warehouse first, then transform in-place using SQL. The warehouse engine handles compute. dbt popularized this by making SQL transformations testable and version-controlled. Replaces traditional ETL where transformation happened before loading.
Lakehouse hybrid
Combine data lake storage (Parquet on S3, Delta Lake, Iceberg) with warehouse query engines. Store once in open formats, query with SQL engines like Spark, Trino, or Snowflake external tables. Eliminates the cost of duplicating data between lake and warehouse.
Separation of storage and compute
Decouple where data lives from what processes it. Snowflake, BigQuery, and Redshift Serverless let you scale compute independently of storage. Multiple teams can query the same data with different compute clusters. Eliminates resource contention.
Materialized views and incremental models
Precompute expensive aggregations and store results. Materialized views refresh automatically; dbt incremental models append only new data. Critical for tables with billions of rows where full rebuilds are too slow or expensive.
Zero-copy clones
Create instant copies of tables or entire databases without duplicating storage. Snowflake, Databricks, and BigQuery support this natively. Enables safe development and testing against production-scale data without doubling storage costs.
How interviewers test this
Warehouse design questions are scenario-based. Interviewers describe a business and expect you to model it on the spot.
Design a warehouse for orders, products, customers, and inventory.
Start with grain: one row per order line item. Identify the fact (fct_order_lines with quantity, unit_price, discount, total_amount). Identify dimensions (dim_customer, dim_product, dim_date, dim_store). Discuss conformed dimensions, SCD Type 2 for customer address changes, and how you'd handle returns as a separate fact or a negative line.
Design the warehouse for a ride-sharing company.
Multiple facts at different grains: fct_trips (one row per trip with duration, fare, distance), fct_driver_shifts (one row per shift), fct_rider_sessions (one row per app session). Shared dimensions: dim_driver, dim_rider, dim_location, dim_date, dim_vehicle. Discuss surge pricing as a degenerate dimension or a separate fact, and how to model cancellations.
A critical dashboard takes 45 seconds. It joins five tables with 2B rows. How do you fix it?
Pre-aggregate into a periodic snapshot fact at the grain the dashboard needs. Create materialized views or dbt incremental models. Add proper clustering or sort keys. Consider a semantic layer for caching. Investigate if the grain is too fine for the dashboard requirements and if joins can be eliminated by denormalizing.
Orders sometimes arrive in our warehouse 3-5 days after they occurred. How do you model this?
Separate the event timestamp (when it happened) from the load timestamp (when it arrived). Use event date for the date dimension FK, not load date. Discuss idempotent loading to handle reprocessing. Mention that late-arriving dimensions are the harder problem: if a dimension row doesn't exist yet, insert a placeholder and update it later.
Interview questions with guidance
Ten questions covering architecture, grain, SCD types, and modern patterns.
Explain the difference between Kimball and Inmon approaches.
Kimball is bottom-up: build dimensional star schemas per business process, and the warehouse is the union of conformed data marts. Inmon is top-down: build a normalized 3NF enterprise data warehouse first, then derive dimensional marts. Kimball delivers value faster; Inmon ensures consistency. Most modern teams blend both, using a normalized staging layer with dimensional presentation layers.
What is grain and why is it critical?
Grain defines what a single row in a fact table represents. The most important decision in dimensional modeling because every other choice follows from it. Getting grain wrong causes double-counting from fan-out joins or loss of detail from premature aggregation. Always state the grain explicitly before designing columns.
Walk me through SCD Type 2 implementation.
Add surrogate_key, effective_date, end_date, is_current columns to the dimension table. When an attribute changes, close the current row (set end_date and is_current = false) and insert a new row with a new surrogate key, updated attributes, is_current = true. Fact tables join on surrogate key, so historical facts retain the dimension values as they were at transaction time.
How do you choose between a star schema and a snowflake schema?
Star schemas denormalize dimensions into single wide tables for simpler queries and better performance. Snowflake schemas normalize dimensions into sub-tables, reducing storage but increasing join complexity. Star schemas are almost always preferred in modern warehouses because storage is cheap, query simplicity matters, and columnar engines handle wide tables efficiently. Only snowflake when you have extremely high-cardinality hierarchies that would bloat the dimension table.
What are conformed dimensions and why do they matter?
Shared dimension tables used identically across multiple fact tables. A conformed dim_date or dim_customer means every report uses the same definitions. Without them, different teams define 'active customer' differently, and cross-functional metrics are impossible to reconcile. The glue that makes a warehouse a warehouse rather than a collection of disconnected tables.
How would you model a many-to-many relationship in a star schema?
Use a bridge table (also called a factless fact table or association table). If one order can have multiple promotions and one promotion applies to many orders, create a bridge table with order_key and promotion_key. Include a weighting factor if you need to allocate measures across the relationship. Bridge tables add complexity and should only be used when the many-to-many is real.
Explain the three types of fact tables.
Transaction facts record one row per discrete event (each order line, each click). Periodic snapshots capture state at regular intervals (daily account balances, monthly inventory levels). Accumulating snapshots track the lifecycle of a process with milestone dates (order placed, shipped, delivered, returned). Each type serves different analytical needs.
How do you handle warehouse partitioning and clustering?
Partition large fact tables by date for partition pruning. Cluster within partitions by high-cardinality columns frequently used in filters or joins (customer_id, product_id). In Snowflake: clustering keys. In BigQuery: partitioning plus clustering. In Redshift: sort keys and distribution keys. The goal is to minimize data scanned per query.
What is a degenerate dimension?
A dimension attribute stored directly in the fact table rather than in a separate dimension table. Typically transaction identifiers like order_number or invoice_id. They don't warrant their own dimension table because they have no additional descriptive attributes. They exist in the fact table as a reference back to the source transaction.
How would you migrate a legacy warehouse to a modern architecture?
Phase 1: inventory existing tables, reports, consumers using lineage. Phase 2: replicate the staging layer in the new platform (ELT into cloud warehouse). Phase 3: rebuild presentation models incrementally, starting with the highest-value dashboards. Phase 4: run dual systems in parallel, validate outputs match, then cut over consumers. Never attempt a big-bang migration; migrate mart by mart with rollback capability.
Common interview mistakes
Designing without declaring the grain first
Grain must be the first decision. Without it, you cannot determine which columns belong in the fact, which dimensions are needed, or whether joins will produce correct results. State the grain in one sentence before touching any columns.
Using natural keys instead of surrogate keys in fact tables
Natural keys from source systems change, get recycled, or vary across systems. Surrogate keys provide stability, support SCD Type 2 versioning, improve join performance. Always use integer surrogate keys in fact-to-dimension relationships.
Treating the warehouse as a copy of the source system
A warehouse is not a replica of your OLTP database. It is a purpose-built analytical structure. Simply copying source tables into the warehouse without dimensional modeling means you inherit all the normalization and complexity of the source without any analytical benefit.
Ignoring slowly changing dimensions
If you assume dimensions never change, your historical analysis is wrong. Customer addresses change, product categories are reorganized, employee departments shift. Discuss SCD types proactively. Interviewers specifically probe for this understanding.
Over-normalizing the presentation layer
Normalization belongs in the staging and integration layers. The presentation layer should be denormalized star schemas optimized for query performance and user comprehension. BI tools expect wide dimension tables, not a web of normalized sub-tables.
Frequently asked questions
What is data warehouse design?+
What is the difference between a data warehouse and a database?+
What is a star schema?+
What is the difference between ETL and ELT?+
How do I choose between Kimball and Data Vault?+
What are the layers of a data warehouse?+
How does a data warehouse handle real-time data?+
Practice data warehouse design
- 01
Active recall beats re-reading by 50%
Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom
- 02
76% of hiring managers reject on the coding task, not the resume
From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice
- 03
Five problem shapes cover 80% of data engineer loops
Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition