Most candidates explain snowflake schemas as if they're a real choice. They aren't, not in a columnar warehouse built after 2014. Interviewers know this. When they ask "star or snowflake," they aren't looking for a balanced pros-and-cons list. They're checking whether you've actually queried a warehouse or only read about them.
The trap is sounding diplomatic. Candidates who hedge lose the round. The real answer is short: star, unless the dimension is huge and the hierarchy is clean. Everything else is a tell that you've never paid a Snowflake bill.
[Stat cards: Modeling Qs Default to Star · Modeling Qs Analyzed · Onsite Modeling Rounds · Companies in Corpus. Source: DataDriven analysis of 1,042 verified data engineering interview rounds.]
A snowflake schema extends the star schema by normalizing dimension tables. Instead of storing all attributes in a single flat dimension table, the snowflake schema splits dimensions along natural hierarchical boundaries into separate related tables.
Consider a product dimension in a retail warehouse. In a star schema, dim_product is one table with columns like product_name, sku, category_name, category_description, department_name, and department_budget_code. All product and category and department attributes live in a single row. In a snowflake schema, this splits into three tables: dim_product (product_name, sku, category_key), dim_category (category_name, category_description, department_key), and dim_department (department_name, department_budget_code). Each table links to the next through a foreign key.
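The split can be sketched as DDL. This is a minimal, self-contained example using SQLite (the article targets warehouse engines, but the table shapes are identical); the column names come from the text, while the sample rows and key values are hypothetical:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Star layout: one flat, denormalized product dimension.
con.execute("""
CREATE TABLE dim_product_star (
    product_key INTEGER PRIMARY KEY,
    product_name TEXT, sku TEXT,
    category_name TEXT, category_description TEXT,
    department_name TEXT, department_budget_code TEXT
)""")

# Snowflake layout: the same attributes split along the hierarchy.
con.executescript("""
CREATE TABLE dim_department (
    department_key INTEGER PRIMARY KEY,
    department_name TEXT, department_budget_code TEXT
);
CREATE TABLE dim_category (
    category_key INTEGER PRIMARY KEY,
    category_name TEXT, category_description TEXT,
    department_key INTEGER REFERENCES dim_department(department_key)
);
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,
    product_name TEXT, sku TEXT,
    category_key INTEGER REFERENCES dim_category(category_key)
);
""")

# Hypothetical sample data.
con.execute("INSERT INTO dim_product_star VALUES "
            "(100, 'Noise-cancelling headphones', 'SKU-001', 'Electronics', "
            "'Headphones and speakers', 'Consumer Goods', 'CG-100')")
con.execute("INSERT INTO dim_department VALUES (1, 'Consumer Goods', 'CG-100')")
con.execute("INSERT INTO dim_category VALUES "
            "(10, 'Electronics', 'Headphones and speakers', 1)")
con.execute("INSERT INTO dim_product VALUES "
            "(100, 'Noise-cancelling headphones', 'SKU-001', 10)")

# Star: department attributes are on the dimension row itself, zero extra joins.
star_rows = con.execute(
    "SELECT product_name, department_name FROM dim_product_star").fetchall()

# Snowflake: the same attribute now costs two extra joins.
snowflake_rows = con.execute("""
SELECT p.product_name, d.department_name
FROM dim_product p
JOIN dim_category c ON c.category_key = p.category_key
JOIN dim_department d ON d.department_key = c.department_key
""").fetchall()
```

Both queries return the same answer; the snowflake version simply pays two joins to get it.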
The name comes from the visual shape. When you draw the schema diagram, the multi-level branching of normalized dimensions resembles a snowflake crystal. The fact table is still at the center, but instead of single-hop joins to flat dimensions (the star), you see chains of two or three hops extending outward.
People defend snowflake schemas by pointing to storage savings. That argument was real in 2001 on Oracle row-store. It's mostly theater on BigQuery in 2026. If 20,000 products share the "Electronics" category, a columnar engine stores "Electronics" approximately once in the compressed column block. You saved nothing by splitting the table. What you did do was add a join, force your BI tool to auto-generate worse SQL, and give analysts another table to learn. Test the claim before you repeat it: run the same query against both layouts and read the bytes scanned.
```
                                        dim_department
                                              |
dim_date -- fact_sales -- dim_product -- dim_category
                |
          dim_customer -- dim_geography
```
Dimension chains extend outward. dim_product connects to dim_category, which connects to dim_department. Each hop is an additional join in analytical queries.
This comparison is the core of the snowflake schema interview question. Know every row of this table and be ready to explain the reasoning behind each difference.
| Property | Star | Snowflake |
|---|---|---|
| Dimension Design | Each dimension is one flat, denormalized table | Dimensions are normalized into multiple related tables |
| Joins per Query | 1 join per dimension (fact to dimension) | 2-3 joins per dimension (fact to dimension to sub-dimension) |
| Storage | Higher redundancy in dimension text values | Lower redundancy through normalization |
| Query Complexity | Simple. Analysts join fact to dimension directly. | More complex. Sub-dimension joins required for some attributes. |
| ETL Complexity | Fewer tables to populate and maintain | More tables with dependency ordering in the load sequence |
| BI Tool Performance | Optimized. Tableau, Power BI, Looker assume flat dimensions. | Adequate, but auto-generated SQL may be suboptimal. |
| Storage Savings (Columnar Engine) | Columnar compression handles repeated values efficiently | Savings are marginal because columnar compression already deduplicates |
| Recommendation (2026) | Default choice for analytics warehouses | Consider only for very large dimensions with deep hierarchies |
The strongest answer in an interview is: "I default to star schema because columnar compression, BI tool compatibility, and query simplicity all favor it. I would only snowflake a specific dimension if it has millions of rows and a clearly separable hierarchy causing measurable storage or maintenance issues." That shows both a clear default and the ability to reason about trade-offs.
Snowflake schemas are rarely the right default, but there are specific situations where normalizing dimensions produces genuine benefits. Knowing these exceptions shows interviewers that you think critically rather than applying rules blindly.
When a dimension table has millions of rows and contains a sub-entity with many text attributes, normalization can produce meaningful savings. A product dimension with 10 million SKUs where each row includes 8 category-level attributes (category_name, category_description, department_name, department_head, department_budget_code, category_margin_target, category_seasonal_flag, category_display_order) repeats those same 8 values across thousands of products per category. Normalizing the category into its own table with 500 rows eliminates that repetition. On a row-store database, this saves gigabytes. On a columnar engine, the savings are smaller but still measurable at this scale.
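The savings are easy to estimate on the back of an envelope. The row and category counts below come from the paragraph above; the 200-byte average width of the 8 category columns is an illustrative assumption, not a measurement:

```python
# Back-of-envelope storage estimate for snowflaking the category
# sub-entity out of a 10M-row product dimension.
ROWS = 10_000_000      # product rows (from the text)
CATEGORIES = 500       # distinct categories (from the text)
CATEGORY_BYTES = 200   # assumed average width of the 8 category columns

# Star: category text repeats on every product row.
star_bytes = ROWS * CATEGORY_BYTES

# Snowflake: category text stored once per category,
# plus a 4-byte surrogate key on every product row.
snowflake_bytes = CATEGORIES * CATEGORY_BYTES + ROWS * 4

print(f"star: {star_bytes/1e9:.1f} GB, snowflake: {snowflake_bytes/1e6:.1f} MB")
```

On a row store, that roughly 2 GB versus 40 MB gap is real disk. On a columnar engine, compression reclaims most of the star layout's redundancy before you ever see it, which is why the savings shrink to "measurable but marginal."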
Some organizations require that entity definitions live in exactly one table. If the category team owns category attributes and the product team owns product attributes, a snowflake schema enforces clear ownership boundaries. Each team maintains its own table, and changes to category attributes never touch the product table. This organizational benefit is independent of performance. In regulated industries (banking, healthcare), this separation can simplify audit trails because attribute changes are isolated to a single table per entity.
On row-store databases like PostgreSQL, MySQL, or Oracle OLTP systems, denormalized dimensions consume substantially more storage because the engine stores each row as a contiguous block. A customer dimension with a repeated region_name value of 'Asia Pacific' stored 2 million times uses real disk space. Normalization matters on row stores. If you are building an analytical model on a row-store engine (not recommended, but sometimes required), snowflake schemas are a reasonable trade-off. On columnar engines like Snowflake, BigQuery, or Redshift, this argument largely disappears.
When multiple star schemas in the same warehouse share a hierarchical sub-entity, normalizing it into a standalone table avoids duplication across stars. A geography hierarchy (city, state, country, region) used by a sales star, a marketing star, and a support star can be maintained as a single normalized geography table referenced by dimension tables in all three stars. This is a practical hybrid approach: the individual stars remain mostly denormalized, but the shared hierarchy is normalized once.
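The hybrid shape described above can be sketched in DDL. The table and column names here are illustrative (only `dim_geography` and the hierarchy levels come from the text); SQLite stands in for the warehouse engine:

```python
import sqlite3

# One normalized geography hierarchy, shared by dimensions
# that belong to two different stars.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_geography (
    geography_key INTEGER PRIMARY KEY,
    city TEXT, state TEXT, country TEXT, region TEXT
);
-- Customer dimension in the sales star references the shared hierarchy...
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,
    customer_name TEXT,
    geography_key INTEGER REFERENCES dim_geography(geography_key)
);
-- ...and so does a store dimension in a separate marketing star.
CREATE TABLE dim_store (
    store_key INTEGER PRIMARY KEY,
    store_name TEXT,
    geography_key INTEGER REFERENCES dim_geography(geography_key)
);
""")

tables = {row[0] for row in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")}
```

Each star stays flat internally; only the hierarchy that crosses star boundaries is maintained once.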
Interviewers rarely ask you to design a snowflake schema from scratch. Instead, they ask you to compare it with star schema and defend a recommendation. These questions test that skill.
**Q: How does a snowflake schema differ from a star schema?** A snowflake schema normalizes dimension tables into multiple related tables, while a star schema keeps each dimension as a single flat table. In a snowflake schema, a product dimension might split into a product table and a category table, where product references category through a foreign key. The fact table still sits at the center, but reaching certain dimension attributes now requires two or three joins instead of one. The trade-off is reduced storage redundancy at the cost of increased query complexity. In modern columnar warehouses, the storage savings are minimal because columnar compression already handles repeated values efficiently.
**Q: When would you choose a snowflake schema over a star schema?** In most cases, you would not. Star schemas are the default for modern analytics warehouses because columnar engines compress denormalized dimensions efficiently. The exception is when a dimension table is very large (millions of rows) and contains a deeply attributed sub-entity that repeats across many parent rows. A product dimension with 10 million rows where category attributes (8 columns of text) repeat across 20,000 products per category is a case where snowflaking the category produces measurable savings. State the default (star), then describe this specific exception with numbers.
**Q: A colleague wants to snowflake every dimension to save storage. How do you respond?** Push back with three concrete arguments. First, query complexity increases because analysts need more joins to reach the same attributes, leading to longer SQL and more opportunities for errors. Second, BI tools like Tableau and Power BI generate more efficient queries against flat dimensions; snowflaked dimensions often produce suboptimal auto-generated SQL. Third, the storage savings on a columnar engine are negligible because repeated values in a column compress to nearly zero. Propose a compromise: keep dimensions flat by default and only snowflake specific sub-entities where the dimension table exceeds a threshold (e.g., over 5 million rows with a clearly separable hierarchy).
**Q: How does snowflaking affect ETL complexity?** Snowflaking increases ETL complexity in three ways. First, more tables means more load jobs to orchestrate with correct dependency ordering (sub-dimensions must load before parent dimensions). Second, surrogate key generation and lookup becomes multi-step: the ETL must resolve category surrogate keys before inserting product rows that reference them. Third, SCD (slowly changing dimension) processing is more complex because a single attribute change in a sub-entity can cascade: changing a category name requires updating the category table, which may trigger SCD logic in the product dimension depending on how keys are managed. These costs are real and ongoing.
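The dependency ordering and key-resolution steps can be sketched concretely. This is a minimal, hypothetical load job in SQLite; the table names follow the article's examples and the staging rows are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_category (
    category_key INTEGER PRIMARY KEY AUTOINCREMENT,
    category_name TEXT UNIQUE
);
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY AUTOINCREMENT,
    sku TEXT,
    category_key INTEGER REFERENCES dim_category(category_key)
);
""")

# Incoming staging rows: (sku, category_name). Hypothetical data.
staging = [("SKU-1", "Electronics"), ("SKU-2", "Electronics"), ("SKU-3", "Grocery")]

# Step 1: the sub-dimension must load first (dependency ordering).
for _, cat in staging:
    con.execute("INSERT OR IGNORE INTO dim_category (category_name) VALUES (?)",
                (cat,))

# Step 2: resolve category surrogate keys before inserting product rows
# that reference them.
keys = dict(con.execute("SELECT category_name, category_key FROM dim_category"))
con.executemany(
    "INSERT INTO dim_product (sku, category_key) VALUES (?, ?)",
    [(sku, keys[cat]) for sku, cat in staging],
)

product_count = con.execute("SELECT COUNT(*) FROM dim_product").fetchone()[0]
```

In a flat star dimension, both steps collapse into a single load; the extra ordering and lookup logic is the ongoing cost the answer above describes.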
Practice defending a clear default against a skeptical interviewer. DataDriven gives you real schema prompts and pushes back when your answer sounds like a textbook.
Start Practicing