A data warehouse is a centralized repository optimized for analytical queries. It uses columnar storage, denormalized schemas, and layered architecture (staging, integration, presentation) to serve reporting, ML, and business intelligence at scale.
The cloud data warehouse market reached $30B in 2025 (Synergy Research), with Snowflake, BigQuery, and Redshift holding 65%+ market share. Interview panelists consistently rate warehouse design as the #1 topic in data modeling rounds; it appears in 80%+ of senior data engineering interviews.
Three philosophies dominate warehouse design. Interviewers expect you to compare them with nuance, not just name them.
Kimball (Dimensional Modeling)
Philosophy
Bottom-up. Start with business processes and build dimensional models (star schemas) that users can query directly. Data marts come first; the warehouse is the union of conformed data marts.
Schema Pattern
Star schema with fact tables surrounded by denormalized dimension tables. Optimized for read performance and intuitive querying.
Best For
Organizations that need fast time-to-value, teams with strong BI requirements, and environments where business users query data directly.
Inmon (Enterprise Data Warehouse)
Philosophy
Top-down. Build a centralized, normalized enterprise data warehouse first. Data marts are then derived from the EDW for departmental use. The single source of truth lives in 3NF.
Schema Pattern
Third normal form (3NF) in the core warehouse. Dimensional models only appear in downstream data marts built from the normalized core.
Best For
Large enterprises with complex data relationships, environments requiring strict consistency, and organizations with dedicated data warehouse teams.
Data Vault
Philosophy
Insert-only, auditable, and resilient to source changes. Separates structure (hubs and links) from context (satellites). Designed for agility when source systems change frequently.
Schema Pattern
Hubs (business keys), links (relationships), and satellites (descriptive attributes with load timestamps). Everything is append-only with full history.
Best For
Regulated industries needing full auditability, environments with volatile source systems, and teams scaling warehouse ingestion across many sources.
Staging
Raw landing zone. Data arrives as-is from source systems with minimal transformation. Truncate-and-reload or append-only patterns.
Integration / ODS
Cleansed, conformed, and deduplicated. Business keys are resolved. This is where Data Vault hubs and links live, or Inmon 3NF tables reside.
Presentation / Mart
Business-friendly dimensional models. Star schemas optimized for BI tools. Aggregated metrics, conformed dimensions, and user-facing tables.
Semantic
Metric definitions, access policies, and virtual views. Tools like dbt metrics layer or Cube.js sit here. The contract between data and consumers.
Data flows from left to right through increasingly refined layers, each serving a distinct purpose in the warehouse architecture.
The building blocks of every dimensional model. Interviewers expect precise definitions and practical understanding of each.
Fact Tables
Store quantitative, measurable events at a specific grain. Each row represents a business event (an order, a click, a payment) and contains foreign keys to dimensions plus numeric measures. Three types: transaction facts (one row per event), periodic snapshots (one row per entity per time period), and accumulating snapshots (one row per process instance, updated as milestones complete).
Dimension Tables
Store descriptive context for facts. Customer names, product categories, geographic regions. Typically wide and denormalized for query simplicity. Good dimensions have meaningful business attributes, not just IDs. The quality of your dimensions determines how useful your warehouse is for ad-hoc analysis.
Slowly Changing Dimensions (SCDs)
Handle attributes that change over time. Type 1 overwrites the old value (no history). Type 2 adds a new row with effective dates and a current flag (full history). Type 3 stores previous and current values in separate columns (limited history). Interviewers almost always ask about Type 2 because it reveals whether you understand temporal modeling.
Surrogate Keys
System-generated integer keys that replace natural business keys in dimension tables. Provide insulation from source system key changes, support SCD Type 2 versioning (each version gets its own surrogate), and improve join performance. Never expose surrogate keys to business users; they are internal plumbing.
Conformed Dimensions
Shared dimensions used identically across multiple fact tables and data marts. A conformed date dimension, customer dimension, or product dimension means every team uses the same definitions. Without conformance, the same customer has different attributes in different reports, and metrics cannot be compared.
Grain
The level of detail stored in a fact table. Declaring the grain is the single most important design decision. 'One row per order line item per day' is a grain statement. Getting grain wrong causes double-counting, fan-out joins, and incorrect aggregations. Interviewers test this relentlessly.
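The double-counting risk is easy to demonstrate. In this minimal sketch (hypothetical tables, SQLite used only for illustration), an order-grain fact is joined to a finer-grained line-item table, and the join fans each order total out across its lines:

```python
import sqlite3

# fct_orders is at order grain; order_lines is at a finer grain.
# Joining across grains repeats each order_total once per line.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE fct_orders (order_id INTEGER, order_total REAL);
CREATE TABLE order_lines (order_id INTEGER, line_no INTEGER, amount REAL);
INSERT INTO fct_orders VALUES (1, 100.0);
INSERT INTO order_lines VALUES (1, 1, 60.0), (1, 2, 40.0);
""")

# Correct: aggregate at the fact table's own grain.
correct = con.execute("SELECT SUM(order_total) FROM fct_orders").fetchone()[0]

# Wrong: the join fans out to two rows, so the total is counted twice.
fanned = con.execute("""
    SELECT SUM(o.order_total)
    FROM fct_orders o JOIN order_lines l ON o.order_id = l.order_id
""").fetchone()[0]

print(correct, fanned)  # 100.0 200.0
```

Stating the grain up front makes this class of bug visible before any SQL is written.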
Warehouse architecture has shifted significantly. These patterns reflect how teams build warehouses today, and what interviewers expect you to know beyond classical Kimball.
ELT (Extract, Load, Transform)
Load raw data into the warehouse first, then transform in-place using SQL. The warehouse engine handles compute. dbt popularized this pattern by making SQL transformations testable and version-controlled. Replaces traditional ETL where transformation happened before loading.
Lakehouse
Combine data lake storage (Parquet on S3, Delta Lake, Iceberg) with warehouse query engines. Store once in open formats, query with SQL engines like Spark, Trino, or Snowflake external tables. Eliminates the cost of duplicating data between lake and warehouse.
Separation of Storage and Compute
Decouple where data lives from what processes it. Snowflake, BigQuery, and Redshift Serverless let you scale compute independently of storage. Multiple teams can query the same data with different compute clusters. Eliminates resource contention.
Incremental Models and Materialized Views
Precompute expensive aggregations and store results. Materialized views refresh automatically; dbt incremental models append only new data. Critical for tables with billions of rows where full rebuilds are too slow or expensive.
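The core of an incremental model is a high-water mark: load only source rows newer than what the target already holds. A minimal sketch (hypothetical tables, SQLite standing in for the warehouse; dbt generates equivalent SQL from an incremental model config):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE src_events (event_id INTEGER, event_ts TEXT);
CREATE TABLE tgt_events (event_id INTEGER, event_ts TEXT);
INSERT INTO src_events VALUES (1, '2024-01-01'), (2, '2024-01-02');
INSERT INTO tgt_events VALUES (1, '2024-01-01');  -- already loaded
""")

def incremental_load(con):
    # COALESCE handles the first run, when the target is still empty.
    con.execute("""
        INSERT INTO tgt_events
        SELECT event_id, event_ts FROM src_events
        WHERE event_ts > COALESCE((SELECT MAX(event_ts) FROM tgt_events), '')
    """)

incremental_load(con)
incremental_load(con)  # re-running is a no-op: nothing newer exists
rows = con.execute("SELECT COUNT(*) FROM tgt_events").fetchone()[0]
print(rows)  # 2
```

Note the second run loads nothing: filtering on the high-water mark is what makes the load cheap and safely re-runnable.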
Zero-Copy Cloning
Create instant copies of tables or entire databases without duplicating storage. Snowflake, Databricks, and BigQuery support this natively. Enables safe development and testing against production-scale data without doubling storage costs.
Warehouse design questions are scenario-based. Interviewers describe a business and expect you to model it on the spot.
“We have an e-commerce platform with orders, products, customers, and inventory. Design a data warehouse schema.”
Start with the grain: one row per order line item. Identify the fact table (fct_order_lines with quantity, unit_price, discount, total_amount). Identify dimensions (dim_customer, dim_product, dim_date, dim_store). Discuss conformed dimensions, SCD Type 2 for customer address changes, and how you would handle returns as a separate fact or a negative line.
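The answer above can be sketched as DDL. Table and column names are illustrative, and SQLite stands in for the warehouse engine; the point is the shape: one fact table at line-item grain with foreign keys out to denormalized dimensions:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE dim_product  (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date     (date_key INTEGER PRIMARY KEY, full_date TEXT);
CREATE TABLE fct_order_lines (
    order_id     INTEGER,                         -- degenerate dimension
    customer_key INTEGER REFERENCES dim_customer,
    product_key  INTEGER REFERENCES dim_product,
    date_key     INTEGER REFERENCES dim_date,
    quantity     INTEGER,
    unit_price   REAL,
    total_amount REAL
);
INSERT INTO dim_customer VALUES (1, 'Ada', 'London');
INSERT INTO dim_product  VALUES (1, 'Widget', 'Hardware');
INSERT INTO dim_date     VALUES (20240101, '2024-01-01');
INSERT INTO fct_order_lines VALUES (1001, 1, 1, 20240101, 2, 9.99, 19.98);
""")

# Typical star-schema query: facts joined out to a dimension, then grouped.
revenue_by_category = con.execute("""
    SELECT p.category, SUM(f.total_amount)
    FROM fct_order_lines f
    JOIN dim_product p ON f.product_key = p.product_key
    GROUP BY p.category
""").fetchall()
print(revenue_by_category)  # [('Hardware', 19.98)]
```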
“Design the warehouse for a ride-sharing company. The CEO wants to see driver earnings, rider activity, and trip metrics.”
Multiple fact tables at different grains: fct_trips (one row per trip with duration, fare, distance), fct_driver_shifts (one row per shift), fct_rider_sessions (one row per app session). Shared dimensions: dim_driver, dim_rider, dim_location, dim_date, dim_vehicle. Discuss surge pricing as a degenerate dimension or a separate fact, and how to model cancellations.
“A critical dashboard takes 45 seconds to load. It joins five tables with 2 billion rows of transaction data. How do you fix it?”
Pre-aggregate into a periodic snapshot fact table at the grain the dashboard needs. Create materialized views or dbt incremental models. Add proper clustering or sort keys. Consider a semantic layer for caching. Investigate if the grain is too fine for the dashboard requirements and if the joins can be eliminated by denormalizing.
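The pre-aggregation step looks like this in miniature (hypothetical tables, SQLite for illustration): roll the fine-grained fact up into a periodic snapshot at the grain the dashboard needs, so the dashboard scans a handful of rows instead of billions:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE fct_transactions (txn_date TEXT, store_id INTEGER, amount REAL);
INSERT INTO fct_transactions VALUES
  ('2024-01-01', 1, 10.0), ('2024-01-01', 1, 5.0), ('2024-01-02', 1, 7.0);

-- Periodic snapshot: one row per store per day, rebuilt or refreshed
-- incrementally instead of recomputed per dashboard load.
CREATE TABLE agg_daily_store_sales AS
SELECT txn_date, store_id, SUM(amount) AS total_sales, COUNT(*) AS txn_count
FROM fct_transactions
GROUP BY txn_date, store_id;
""")
snapshot = con.execute(
    "SELECT txn_date, total_sales FROM agg_daily_store_sales ORDER BY txn_date"
).fetchall()
print(snapshot)  # [('2024-01-01', 15.0), ('2024-01-02', 7.0)]
```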
“Orders sometimes arrive in our warehouse 3 to 5 days after they occurred. How do you model this?”
Separate the event timestamp (when it happened) from the load timestamp (when it arrived). Use the event date for the date dimension foreign key, not the load date. Discuss idempotent loading to handle reprocessing. Mention that late-arriving dimensions are the harder problem: if a dimension row does not exist yet, insert a placeholder and update it later.
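A minimal sketch of both halves of that answer (hypothetical tables, SQLite for illustration): the fact row keeps its event date for the date foreign key while the load date is tracked separately, and a missing dimension row gets a placeholder that can be enriched later:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, source_id TEXT, name TEXT);
CREATE TABLE fct_orders (order_id INTEGER, customer_key INTEGER,
                         event_date TEXT, load_date TEXT);
""")

def load_order(con, order_id, customer_source_id, event_date, load_date):
    row = con.execute("SELECT customer_key FROM dim_customer WHERE source_id = ?",
                      (customer_source_id,)).fetchone()
    if row is None:
        # Late-arriving dimension: insert a placeholder, update it when
        # the real customer record eventually lands.
        cur = con.execute(
            "INSERT INTO dim_customer (source_id, name) VALUES (?, 'UNKNOWN')",
            (customer_source_id,))
        key = cur.lastrowid
    else:
        key = row[0]
    con.execute("INSERT INTO fct_orders VALUES (?, ?, ?, ?)",
                (order_id, key, event_date, load_date))

# Order happened Jan 1 but arrived Jan 4; customer C42 is not in the dim yet.
load_order(con, 1, 'C42', '2024-01-01', '2024-01-04')
placeholder = con.execute("SELECT name FROM dim_customer").fetchone()[0]
event_date = con.execute("SELECT event_date FROM fct_orders").fetchone()[0]
print(placeholder, event_date)  # UNKNOWN 2024-01-01
```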
Explain the difference between Kimball and Inmon approaches.
Kimball is bottom-up: build dimensional star schemas per business process, and the warehouse is the union of conformed data marts. Inmon is top-down: build a normalized 3NF enterprise data warehouse first, then derive dimensional marts. Kimball delivers value faster; Inmon ensures consistency. Most modern teams blend both, using a normalized staging layer with dimensional presentation layers.
What is grain and why is it critical?
Grain defines what a single row in a fact table represents. It is the most important decision in dimensional modeling because every other choice follows from it. Getting grain wrong causes double-counting from fan-out joins or loss of detail from premature aggregation. Always state the grain explicitly before designing columns.
Walk me through SCD Type 2 implementation.
Add surrogate_key, effective_date, end_date, and is_current columns to the dimension table. When an attribute changes, close the current row (set end_date and is_current = false) and insert a new row with a new surrogate key, updated attributes, and is_current = true. Fact tables join on surrogate key, so historical facts retain the dimension values as they were at transaction time.
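The two-step flow described above can be sketched directly (hypothetical table, SQLite for illustration; a real warehouse would wrap both steps in a MERGE or a transaction):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE dim_customer (
    surrogate_key  INTEGER PRIMARY KEY,   -- fresh key per version
    customer_id    TEXT,                  -- natural key from the source
    city           TEXT,
    effective_date TEXT,
    end_date       TEXT,
    is_current     INTEGER
)""")
con.execute("""INSERT INTO dim_customer
               (customer_id, city, effective_date, end_date, is_current)
               VALUES ('C1', 'London', '2023-01-01', '9999-12-31', 1)""")

def scd2_update(con, customer_id, new_city, change_date):
    # Step 1: close the current version.
    con.execute("""UPDATE dim_customer SET end_date = ?, is_current = 0
                   WHERE customer_id = ? AND is_current = 1""",
                (change_date, customer_id))
    # Step 2: insert the new version; the surrogate key auto-increments.
    con.execute("""INSERT INTO dim_customer
                   (customer_id, city, effective_date, end_date, is_current)
                   VALUES (?, ?, ?, '9999-12-31', 1)""",
                (customer_id, new_city, change_date))

scd2_update(con, 'C1', 'Paris', '2024-06-01')
history = con.execute(
    "SELECT city, is_current FROM dim_customer ORDER BY surrogate_key").fetchall()
print(history)  # [('London', 0), ('Paris', 1)]
```

Because facts join on the surrogate key, rows loaded while the customer was in London keep pointing at the London version forever.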
How do you choose between a star schema and a snowflake schema?
Star schemas denormalize dimensions into single wide tables for simpler queries and better performance. Snowflake schemas normalize dimensions into sub-tables, reducing storage but increasing join complexity. Star schemas are almost always preferred in modern warehouses because storage is cheap, query simplicity matters, and columnar engines handle wide tables efficiently. Only snowflake when you have extremely high-cardinality hierarchies that would bloat the dimension table.
What are conformed dimensions and why do they matter?
Conformed dimensions are shared dimension tables used identically across multiple fact tables. A conformed dim_date or dim_customer means every report uses the same definitions. Without them, different teams define 'active customer' differently, and cross-functional metrics are impossible to reconcile. They are the glue that makes a warehouse a warehouse rather than a collection of disconnected tables.
How would you model a many-to-many relationship in a star schema?
Use a bridge table (also called an association table). For example, if one order can have multiple promotions and one promotion applies to many orders, create a bridge table with order_key and promotion_key. Include a weighting factor if you need to allocate measures across the relationship. Acknowledge that bridge tables add complexity and should only be used when the many-to-many relationship is real.
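The weighting factor works like this (hypothetical tables, SQLite for illustration): each order's amount is split across its promotions by weight, and because the weights sum to 1.0 the allocated totals still add back to the order amount:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE fct_orders (order_key INTEGER PRIMARY KEY, amount REAL);
CREATE TABLE bridge_order_promotion (order_key INTEGER, promotion_key INTEGER, weight REAL);
INSERT INTO fct_orders VALUES (1, 100.0);
-- One order, two promotions, weights summing to 1.0.
INSERT INTO bridge_order_promotion VALUES (1, 10, 0.7), (1, 20, 0.3);
""")
allocated = con.execute("""
    SELECT b.promotion_key, SUM(o.amount * b.weight)
    FROM fct_orders o
    JOIN bridge_order_promotion b ON o.order_key = b.order_key
    GROUP BY b.promotion_key ORDER BY b.promotion_key
""").fetchall()
print(allocated)  # promotion 10 is allocated 70.0, promotion 20 gets 30.0
```

Without the weight column, summing order amount through the bridge would double-count it, one more instance of the fan-out problem.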
Explain the three types of fact tables.
Transaction facts record one row per discrete event (each order line, each click). Periodic snapshots capture the state at regular intervals (daily account balances, monthly inventory levels). Accumulating snapshots track the lifecycle of a process with milestone dates (order placed, shipped, delivered, returned). Each type serves different analytical needs.
How do you handle data warehouse partitioning and clustering?
Partition large fact tables by date to enable partition pruning (queries that filter by date only scan relevant partitions). Cluster within partitions by high-cardinality columns frequently used in filters or joins (customer_id, product_id). In Snowflake, use clustering keys. In BigQuery, use partitioning plus clustering. In Redshift, use sort keys and distribution keys. The goal is to minimize data scanned per query.
What is a degenerate dimension?
A dimension attribute stored directly in the fact table rather than in a separate dimension table. Typically transaction identifiers like order_number or invoice_id. They do not warrant their own dimension table because they have no additional descriptive attributes. They exist in the fact table as a reference back to the source transaction.
How would you migrate a legacy warehouse to a modern architecture?
Phase 1: Inventory existing tables, reports, and consumers using lineage. Phase 2: Replicate the staging layer in the new platform (ELT into cloud warehouse). Phase 3: Rebuild presentation models incrementally, starting with the highest-value dashboards. Phase 4: Run dual systems in parallel, validate outputs match, then cut over consumers. Never attempt a big-bang migration; migrate mart by mart with rollback capability.
Designing without declaring the grain first
Grain must be the first decision. Without it, you cannot determine which columns belong in the fact table, which dimensions are needed, or whether joins will produce correct results. State the grain in one sentence before touching any columns.
Using natural keys instead of surrogate keys in fact tables
Natural keys from source systems change, get recycled, or vary across systems. Surrogate keys provide stability, support SCD Type 2 versioning, and improve join performance. Always use integer surrogate keys in fact-to-dimension relationships.
Treating the warehouse as a copy of the source system
A warehouse is not a replica of your OLTP database. It is a purpose-built analytical structure. Simply copying source tables into the warehouse without dimensional modeling means you inherit all the normalization and complexity of the source without any analytical benefit.
Ignoring slowly changing dimensions
If you assume dimensions never change, your historical analysis is wrong. Customer addresses change, product categories are reorganized, employee departments shift. Discuss SCD types proactively. Interviewers specifically probe for this understanding.
Over-normalizing the presentation layer
Normalization belongs in the staging and integration layers. The presentation layer should be denormalized star schemas optimized for query performance and user comprehension. BI tools expect wide dimension tables, not a web of normalized sub-tables.