Ralph Kimball published "The Data Warehouse Toolkit" in 1996, and the star schema he described in that book still runs most of the analytics you touch today. Thirty years later, interviewers still ask you to draw one on a whiteboard, state the grain of the fact table, and defend why the customer dimension shouldn't link to a separate city table.
The reason it survived four generations of warehouse technology is boring and important: it matches how analysts actually ask questions. Kimball wrote about it in 1996 for the same reason Snowflake benchmarks it in 2026.
[Chart: interview statistics covering the Kimball Toolkit publication, the share of dimensional-modeling questions involving star schemas, modeling questions analyzed, and onsite modeling rounds. Source: DataDriven analysis of 1,042 verified data engineering interview rounds.]
A star schema is a data warehouse design pattern where a single fact table sits at the center, connected by foreign keys to multiple dimension tables around it. The name comes from the visual shape: the fact table is the body of the star, and each dimension table is a point radiating outward.
Kimball was reacting to a specific problem. Bill Inmon's 3NF warehouse designs from the late 1980s were technically correct but hostile to analysts, who had to write seven-join SQL to answer "revenue by product category last quarter." Kimball's 1996 book flipped the constraint: optimize for how humans ask questions, not for insert speed. That choice aged well. Snowflake, BigQuery, and Redshift all perform best against star schemas, and BI tools like Tableau, Power BI, and Looker were built assuming one.
The structure is straightforward. Consider an e-commerce company tracking orders. The fact table (fact_orders) contains one row per order line item. Each row has foreign keys to dim_date, dim_customer, dim_product, dim_store, and dim_promotion. Each row also has measures: quantity, unit_price, discount_amount, and line_total. To answer "total revenue by product category last quarter," an analyst joins fact_orders to dim_product and dim_date, filters on date range, groups by category, and sums line_total. Two joins, one aggregation, done.
                 dim_date       dim_promotion
                     \              /
dim_customer ---- fact_orders ---- dim_product
                     |
                 dim_store
The fact table at the center connects to each dimension through a foreign key. No dimension connects to another dimension.
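The two-join query described above can be sketched end to end. This is a minimal illustration using SQLite as a stand-in for a warehouse; the table and column names follow the e-commerce example, and the sample rows are invented for demonstration.

```python
import sqlite3

# In-memory SQLite database standing in for a columnar warehouse.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,   -- surrogate key
    name TEXT, category TEXT, brand TEXT
);
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY,      -- surrogate key, e.g. 20251001
    full_date TEXT, year INTEGER, quarter INTEGER
);
CREATE TABLE fact_orders (             -- grain: one row per order line item
    order_id INTEGER, line_number INTEGER,
    product_key INTEGER REFERENCES dim_product(product_key),
    date_key INTEGER REFERENCES dim_date(date_key),
    quantity INTEGER, unit_price REAL, line_total REAL
);
""")

con.executemany("INSERT INTO dim_product VALUES (?,?,?,?)", [
    (1, "USB-C Cable", "Electronics", "Acme"),
    (2, "Desk Lamp", "Home", "Lumen"),
])
con.executemany("INSERT INTO dim_date VALUES (?,?,?,?)", [
    (20251001, "2025-10-01", 2025, 4),
    (20251102, "2025-11-02", 2025, 4),
])
con.executemany("INSERT INTO fact_orders VALUES (?,?,?,?,?,?,?)", [
    (100, 1, 1, 20251001, 2, 9.99, 19.98),
    (100, 2, 2, 20251001, 1, 39.00, 39.00),
    (101, 1, 1, 20251102, 3, 9.99, 29.97),
])

# "Total revenue by product category last quarter": two joins, one aggregation.
rows = con.execute("""
    SELECT p.category, SUM(f.line_total) AS revenue
    FROM fact_orders f
    JOIN dim_product p ON p.product_key = f.product_key
    JOIN dim_date    d ON d.date_key    = f.date_key
    WHERE d.year = 2025 AND d.quarter = 4
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
print(rows)
```

Note how neither join fans out: each fact row matches exactly one row per dimension, which is what keeps the aggregation correct.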
Three building blocks make up every star schema. Interviewers expect you to explain each one with concrete examples and justify design decisions around them.
The fact table sits at the center of the star. Each row records a single measurable event: a sale, a page view, a shipment. Columns are either foreign keys pointing to dimension tables or numeric measures like revenue, quantity, or duration. Fact tables are tall and narrow. A mid-size e-commerce company might generate 50 million rows per month in its orders fact table. The grain of the fact table (what one row represents) is the single most important design decision in the entire schema.
Measures in a fact table fall into three categories. Additive measures like revenue and quantity can be summed across every dimension. Semi-additive measures like account_balance can be summed across some dimensions but not time (you would not sum Monday's balance with Tuesday's balance). Non-additive measures like unit_price or conversion_rate cannot be meaningfully summed at all and must be averaged or weighted. Interviewers ask you to classify measures because the classification determines how downstream reports aggregate data. Getting this wrong leads to dashboards that show nonsensical numbers.
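The semi-additive pitfall is easy to demonstrate with the account-balance example. The sketch below uses hypothetical data (two accounts over two days) to show which sums are meaningful and which are nonsense.

```python
# Daily account balances: a semi-additive measure.
# Hypothetical sample data: two accounts, two days.
balances = {
    ("mon", "acct_a"): 100, ("mon", "acct_b"): 50,
    ("tue", "acct_a"): 120, ("tue", "acct_b"): 50,
}

# Additive across the account dimension: total money held on Monday.
mon_total = sum(v for (day, _), v in balances.items() if day == "mon")
print(mon_total)  # 150

# NOT additive across time: summing Monday and Tuesday double-counts money.
naive_sum = sum(balances.values())
print(naive_sum)  # 320 -- nonsense as a "balance"

# Correct across time: take the period-end balance (or an average) instead.
tue_total = sum(v for (day, _), v in balances.items() if day == "tue")
print(tue_total)  # 170
```

A BI tool that treats balance as fully additive will happily report the 320 figure, which is exactly the kind of nonsensical dashboard number the classification exists to prevent.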
When asked to design a fact table, state the grain before listing any columns. 'One row per order line item' or 'one row per daily session per user.' If you skip the grain, the interviewer will stop you and ask for it. Having a clear grain statement is the strongest signal of modeling maturity.
Dimension tables surround the fact table and provide descriptive context: the who, what, where, and when of each event. A customer dimension includes name, city, signup_date, and acquisition_channel. A product dimension includes name, category, brand, and price_tier. Dimension tables are short and wide, typically with 10 to 50 columns and thousands to millions of rows. They change slowly compared to fact tables.
A well-designed dimension table is richly attributed. Thin dimensions with only an ID and a name force analysts to write complex queries to filter and group data. Rich dimensions with 20 or more descriptive columns let analysts slice data by any attribute without modifying the fact query. In interviews, err toward putting more attributes into the dimension rather than fewer. The cost of extra columns in a dimension table is trivial compared to the analytical flexibility they provide.
If an interviewer asks whether a specific column belongs in the fact table or a dimension table, apply this rule: if you would aggregate it (sum, average, count), it is a fact. If you would filter or group by it, it is a dimension attribute. Price is a classic trick question: unit_price at the time of a sale is a fact measure, while standard_list_price is a dimension attribute on the product table.
Every dimension table in a star schema uses a surrogate key as its primary key: a system-generated integer with no business meaning. The fact table references dimensions through these surrogate keys, not through natural business keys like customer_email or product_sku. This decoupling is fundamental to the star schema pattern.
Surrogate keys solve three problems that natural keys cannot. First, they handle slowly changing dimensions: when a customer changes their address, you insert a new row with a new surrogate key, preserving the old row for historical accuracy. Second, they absorb source system changes: if the ERP system changes its product ID format, only the ETL pipeline needs to know. The star schema is unaffected. Third, they handle the 'unknown' case: surrogate key 0 or -1 can represent missing or not-yet-loaded dimension members, letting fact rows load even when dimension data is incomplete.
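The third case, the unknown member, is worth seeing concretely. Below is a minimal sketch (SQLite again, with invented names such as lookup_key and C-style sample emails) showing a fact row loading cleanly even though its customer has not yet arrived in the dimension.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,  -- surrogate key
    customer_email TEXT,               -- natural key from the source system
    name TEXT, city TEXT
);
CREATE TABLE fact_orders (
    order_id INTEGER,
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    line_total REAL
);
""")
# Row -1 is the 'unknown member': facts can load before the dimension does.
con.execute("INSERT INTO dim_customer VALUES (-1, NULL, 'Unknown', 'Unknown')")
con.execute("INSERT INTO dim_customer VALUES (1, 'ada@example.com', 'Ada', 'London')")

def lookup_key(email):
    """Resolve a natural key to a surrogate key, defaulting to unknown (-1)."""
    row = con.execute(
        "SELECT customer_key FROM dim_customer WHERE customer_email = ?",
        (email,)).fetchone()
    return row[0] if row else -1

# One order from a known customer, one whose dimension row hasn't landed yet.
for order_id, email, total in [(1, "ada@example.com", 40.0),
                               (2, "new@example.com", 15.0)]:
    con.execute("INSERT INTO fact_orders VALUES (?,?,?)",
                (order_id, lookup_key(email), total))

# Both fact rows loaded; the second points at the unknown member.
keys = [k for (k,) in con.execute(
    "SELECT customer_key FROM fact_orders ORDER BY order_id")]
print(keys)  # [1, -1]
```

When the late-arriving customer row eventually lands, the ETL pipeline can re-point the fact row at the real surrogate key without touching any other data.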
Interviewers test surrogate keys to see if you understand why they exist, not just that they exist. Be ready to explain all three use cases: SCD support, source system independence, and unknown member handling. Candidates who say 'we use surrogate keys because that is the standard' are weaker than those who explain the specific problems surrogate keys solve.
This comparison is one of the most frequently asked questions in data modeling interviews. The difference comes down to how dimension tables are structured: flat (star) or normalized into sub-tables (snowflake).
| Property | Star Schema | Snowflake Schema |
|---|---|---|
| Dimension Structure | Denormalized. Each dimension is a single flat table. | Normalized. Dimensions are split into sub-tables linked by foreign keys. |
| Number of Joins | Fewer joins. One hop from fact to any dimension. | More joins. Two or three hops to reach normalized sub-dimensions. |
| Query Complexity | Simpler SQL. Analysts join fact to dimension directly. | More complex SQL. Extra joins to reach normalized attributes. |
| Storage | Higher. Dimension attributes are repeated across rows. | Lower. Normalization eliminates redundant text values. |
| BI Tool Compatibility | Excellent. Looker, Tableau, Power BI are optimized for star schemas. | Adequate, but BI tools often auto-generate suboptimal SQL for normalized dimensions. |
| ETL Complexity | Simpler. Fewer tables to load and manage. | More complex. More tables, more dependency chains in the load order. |
| When Recommended | Default for analytics warehouses. Almost always the right starting point. | When dimension tables are very large (millions of rows) and normalization produces significant storage savings. |
The correct default answer in 2026 interviews is star schema. Modern columnar warehouses compress denormalized dimensions so efficiently that snowflaking rarely saves meaningful storage. State your preference for star, then acknowledge the edge case where snowflaking might help (very large dimensions with deeply hierarchical sub-entities).
Star schemas are not universal. They are purpose-built for analytical workloads. Here are the four scenarios where they provide the most value.
Star schemas are the gold standard for OLAP workloads. Every major BI tool generates more efficient queries against star schemas than against normalized models. Tableau, Power BI, and Looker all assume a central fact table surrounded by dimension tables. When you build a star schema, you are designing for the tools your analysts already use. A warehouse serving 200 analysts at a Series C startup will almost always use star schemas for its presentation layer.
Power BI's internal engine (VertiPaq) is explicitly designed for star schemas. Microsoft's documentation recommends star schema as the default data model. Relationships in Power BI map directly to the fact-to-dimension foreign key pattern. DAX measures perform best when the model follows star schema conventions. If you are building a Power BI semantic model, a star schema is not just recommended; it is the expected architecture.
Snowflake, BigQuery, and Redshift all use columnar storage with aggressive compression. Denormalized dimension tables compress efficiently because repeated values in a column (like 'Electronics' appearing 50,000 times in a product category column) compress down to nearly nothing. The storage savings from snowflaking are negligible in columnar engines, which eliminates the primary reason to normalize dimensions.
When multiple teams query the same warehouse, the star schema's simplicity is a major advantage. A marketing analyst and a finance analyst can both write queries against the same fact table, joining to the same conformed dimensions. The structure is self-documenting: the fact table name tells you the business process, and the dimension names tell you the available filters. Normalized schemas require more institutional knowledge to query correctly.
These questions come up repeatedly in data engineering and analytics engineering interviews. The guidance below shows how strong candidates structure their answers.
A star schema is a data warehouse design pattern where a central fact table connects to multiple dimension tables through foreign keys. The visual shape resembles a star: the fact table at the center with dimension tables radiating outward. It was formalized by Ralph Kimball as the foundation of dimensional modeling. The design optimizes for read-heavy analytical queries by denormalizing dimension tables into single flat tables, reducing the number of joins needed. Close by naming why it matters: BI tools, query simplicity, and columnar compression all favor this layout.

Start with the grain: one row per completed ride. The fact table (fact_rides) contains measures: fare_amount, tip_amount, ride_duration_minutes, distance_miles, surge_multiplier. Foreign keys point to five dimensions: dim_date (pickup date and time attributes), dim_rider (name, signup_date, rider_tier, city), dim_driver (name, signup_date, vehicle_type, rating_bucket), dim_pickup_location (zone, neighborhood, city, state), dim_dropoff_location (same structure). Classify the measures: fare_amount and tip_amount are additive. surge_multiplier is non-additive (averaging it across rides is meaningful; summing it is not). If the interviewer asks about ride cancellations, that is a separate fact table with a different grain.
In practice, almost never in a modern cloud warehouse. Columnar compression handles dimension redundancy efficiently, so the storage savings from snowflaking are minimal. The added join complexity slows queries and confuses analysts. The main exception is when a dimension table has millions of rows and contains a sub-entity with many attributes (like a product dimension with a deeply attributed category hierarchy). Even then, many teams prefer to keep the star schema and accept the redundancy. State this preference clearly, then acknowledge the exception.
SCD Type 1: overwrite the old value. Simple but loses history. SCD Type 2: insert a new row with a new surrogate key, mark the old row as expired. Preserves full history. SCD Type 3: add a column (e.g., previous_city, current_city) to track the last change. Tracks one change only. In a star schema, Type 2 is the most common because it preserves history without changing the fact table. The fact table continues to reference surrogate keys; old facts point to the old dimension row, new facts point to the new row. Explain why surrogate keys make this possible: natural keys cannot distinguish between the 'before' and 'after' versions of the same entity.
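The Type 2 mechanics can be sketched in a few lines. This is an illustrative implementation under simplifying assumptions (SQLite, a single tracked attribute, an invented scd2_upsert helper and '9999-12-31' end-date convention), not a production merge.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    customer_id TEXT,          -- natural/business key
    city TEXT,
    valid_from TEXT, valid_to TEXT,
    is_current INTEGER
)""")

def scd2_upsert(customer_id, city, as_of):
    """SCD Type 2: expire the current row (if changed) and insert a new one."""
    cur = con.execute(
        "SELECT customer_key, city FROM dim_customer "
        "WHERE customer_id = ? AND is_current = 1", (customer_id,)).fetchone()
    if cur and cur[1] == city:
        return  # nothing changed: no new version
    if cur:
        # Expire the old version; existing facts keep pointing at its key.
        con.execute("UPDATE dim_customer SET valid_to = ?, is_current = 0 "
                    "WHERE customer_key = ?", (as_of, cur[0]))
    con.execute("INSERT INTO dim_customer "
                "(customer_id, city, valid_from, valid_to, is_current) "
                "VALUES (?,?,?,'9999-12-31',1)", (customer_id, city, as_of))

scd2_upsert("C-42", "Austin", "2024-01-01")
scd2_upsert("C-42", "Austin", "2024-06-01")   # no-op: city unchanged
scd2_upsert("C-42", "Denver", "2025-03-15")   # expires row 1, inserts row 2

rows = con.execute("SELECT customer_key, city, is_current FROM dim_customer "
                   "ORDER BY customer_key").fetchall()
print(rows)  # [(1, 'Austin', 0), (2, 'Denver', 1)]
```

Facts recorded before the move still join to surrogate key 1 (Austin) and facts after it join to key 2 (Denver), which is exactly the history preservation the interviewer is probing for.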
Check five things in order. First, the query plan: is the engine doing a full table scan on the fact table? Add a partition on the date column if not already present. Second, join keys: are fact-to-dimension joins on integer surrogate keys or on string natural keys? Strings are slower. Third, the fact table grain: is the table at a finer grain than needed? An hourly grain table queried for monthly aggregates does unnecessary work; consider a pre-aggregated monthly fact table. Fourth, dimension table size: a dimension with millions of rows (a 'monster dimension') may need to be split or filtered. Fifth, the warehouse configuration: in Snowflake or BigQuery, is the warehouse/slot allocation sized for the query volume? These five checks cover 90% of star schema performance issues.
Practice the same four-step design process Kimball taught in 1996: pick the business process, declare the grain, identify the dimensions, identify the facts. Do it on real interview scenarios until it's muscle memory.