Data Modeling
Kimball dimensional modeling is the foundation of data warehouse design and the most frequently tested modeling methodology in DE interviews. Interviewers want you to define the grain, separate facts from dimensions, classify measures, and defend your trade-offs.
Data modeling questions appear in roughly a third of DE interview loops. Dimensional modeling is the single most common sub-topic. If the job involves a data warehouse, expect these questions.
Every dimensional modeling interview comes back to these building blocks. Know them cold, with examples and trade-offs ready.
Fact tables store measurable events: transactions, clicks, shipments, payments. Each row represents one event at a specific grain. Columns are either foreign keys pointing to dimension tables or numeric measures (revenue, quantity, duration). Fact tables are tall and narrow. They grow continuously as new events occur.
Measures fall into three categories. Additive measures (revenue, quantity) can be summed across any dimension. Semi-additive measures (account balance) can be summed across some dimensions but not time. Non-additive measures (unit price, ratio) cannot be meaningfully summed and must be averaged or weighted. Knowing these categories matters in interviews because the interviewer may ask you to classify a measure and explain how to aggregate it correctly.
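A minimal Python sketch of the semi-additive case, using hypothetical monthly balance rows: summing across accounts for one month is valid, but summing one account's balance across months is not, so the query takes each account's latest value instead.

```python
# Hypothetical (account, month, end_of_month_balance) snapshot rows.
balances = [
    ("A", "2024-01", 100), ("A", "2024-02", 120),
    ("B", "2024-01", 200), ("B", "2024-02", 250),
]

# Semi-additive: summing across accounts within one month is meaningful.
jan_total = sum(bal for acct, month, bal in balances if month == "2024-01")

# Summing across time is not: 100 + 120 is not A's "balance for the quarter".
# Instead, take each account's latest snapshot, then sum across accounts.
latest = {}
for acct, month, bal in balances:
    if acct not in latest or month > latest[acct][0]:
        latest[acct] = (month, bal)
current_total = sum(bal for _, bal in latest.values())
```

Here `jan_total` is 300 (100 + 200), while `current_total` is 370 (120 + 250), the sum of each account's most recent balance.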
When asked to design a fact table, always state the grain first: 'one row per order line item' or 'one row per daily user session.' If you skip the grain, the interviewer will push you on it. The grain is the single most important decision in dimensional modeling.
Dimension tables store descriptive context: who, what, where, when. A customer dimension has name, city, signup_date, acquisition_channel. A product dimension has name, category, brand, price_tier. Dimension tables are short and wide. They change slowly, which is why SCD types exist.
Good dimension tables are richly attributed. A thin dimension with only an ID and a name forces analysts to do filtering and grouping in the fact query, which is slower and harder to read. A rich dimension with 20 to 30 descriptive columns lets analysts slice data by any attribute without modifying the fact query. In interviews, err on the side of putting more attributes into the dimension rather than fewer.
Interviewers often ask you to classify an attribute as fact or dimension. The rule: if it is a measurement that you would aggregate (sum, average, count), it belongs in the fact table. If it is a descriptor that you would filter or group by, it belongs in a dimension. Price is tricky: unit_price at the time of sale is a fact; standard_list_price is a dimension attribute.
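That split can be sketched as SQLite DDL; the table and column names here are illustrative, not a fixed standard. Note that standard_list_price sits in the product dimension while unit_price at the time of sale sits in the fact table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,
    name TEXT, category TEXT, brand TEXT,
    standard_list_price REAL          -- descriptor: a dimension attribute
);
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,
    name TEXT, city TEXT, acquisition_channel TEXT
);
CREATE TABLE fact_order_line (        -- grain: one row per order line item
    date_key INTEGER,
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    order_number TEXT,                -- descriptor kept in the fact table
    quantity INTEGER,                 -- additive measure
    unit_price REAL,                  -- measure: price at time of sale
    line_total REAL                   -- additive measure
);
""")
```

The fact table carries only foreign keys, one text descriptor, and numeric measures; everything an analyst filters or groups by lives in the dimensions.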
Grain is the level of detail in a fact table. 'One row per order' is a different grain than 'one row per order line item.' Choosing the wrong grain means you either lose detail you need later or create bloated tables with redundant data. The grain dictates which dimensions can meaningfully attach to the fact table and which measures make sense.
A common mistake is mixing grains in a single fact table. If some rows represent individual transactions and other rows represent daily summaries, the table is unusable: you cannot sum the measures without double-counting. Every row in a fact table must represent exactly the same thing at exactly the same level of detail.
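A tiny Python illustration of the double-counting problem, with made-up rows: one table mixing transaction rows with a daily summary row.

```python
# Hypothetical table mixing two grains: individual transactions plus a
# daily summary row that already covers those same transactions.
rows = [
    {"kind": "txn", "day": "2024-01-01", "revenue": 40},
    {"kind": "txn", "day": "2024-01-01", "revenue": 60},
    {"kind": "daily_summary", "day": "2024-01-01", "revenue": 100},
]

naive = sum(r["revenue"] for r in rows)  # 200: the day is counted twice
correct = sum(r["revenue"] for r in rows if r["kind"] == "txn")  # 100
```

Any consumer of this table must know to filter by `kind` before aggregating, which is exactly the kind of hidden rule a single-grain design avoids.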
This is the single most important concept in dimensional modeling interviews. State it early and clearly: 'One row represents X.' If you can articulate the grain, the rest of the design falls into place. If you cannot, the interviewer knows you are guessing. Practice stating the grain for different business scenarios: retail, SaaS, logistics, advertising.
A bus matrix maps business processes (rows) to dimensions (columns). An X in a cell means that dimension applies to that business process. The bus matrix is the architectural plan for the entire warehouse. It forces alignment so that when both the sales fact and the returns fact reference the customer dimension, they use the same definition of 'customer.'
Without a bus matrix, each team designs its own dimensions independently. Marketing defines 'active customer' as someone who logged in this month. Sales defines it as someone who purchased this quarter. The bus matrix forces alignment: there is one customer dimension, with one definition, shared across all fact tables. Inconsistent definitions across teams are one of the most expensive problems in data warehousing.
Mentioning the bus matrix in an interview shows you understand enterprise-scale dimensional modeling, not just single-table design. Even if you are not asked about it directly, referencing it when discussing conformed dimensions demonstrates strategic thinking about the warehouse as a whole.
A conformed dimension is shared identically across multiple fact tables. The date dimension is used by every fact table. The customer dimension is used by sales, support, and marketing facts. Without conformed dimensions, you get inconsistent metrics: 'active customer' means one thing in one report and another thing in a different report.
Conformed dimensions are the mechanism that makes a data warehouse more valuable than a collection of independent data marts. They enable cross-process analysis: 'show me the customers who bought product X and then called support within 30 days.' This query only works if the sales fact and the support fact share the same customer dimension with the same surrogate keys.
If the interviewer asks about data consistency across teams or departments, conformed dimensions are the answer. This shows you think about the warehouse as a platform, not just a collection of tables. It is a Kimball concept that separates experienced engineers from those who have only built single-team solutions.
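The cross-process query can be sketched in SQLite; the schema and rows are hypothetical, but the point holds: the join only works because both facts carry the same customer surrogate keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE fact_sales   (customer_key INTEGER, product TEXT, sale_date TEXT);
CREATE TABLE fact_support (customer_key INTEGER, call_date TEXT);
INSERT INTO dim_customer VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO fact_sales   VALUES (1, 'X', '2024-03-01'), (2, 'X', '2024-03-01');
INSERT INTO fact_support VALUES (1, '2024-03-15'), (2, '2024-06-01');
""")

# Customers who bought product X and then called support within 30 days.
rows = conn.execute("""
    SELECT DISTINCT c.name
    FROM fact_sales s
    JOIN fact_support t ON t.customer_key = s.customer_key
    JOIN dim_customer c ON c.customer_key = s.customer_key
    WHERE s.product = 'X'
      AND julianday(t.call_date) - julianday(s.sale_date) BETWEEN 0 AND 30
""").fetchall()
```

Ada bought X and called 14 days later, so she qualifies; Grace called 92 days later and does not. With two independent customer tables and mismatched keys, this query would be impossible to write reliably.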
A degenerate dimension is a dimension key that lives directly in the fact table without its own dimension table. The classic example is order_number: it is a descriptor (not a measure), but creating a separate table for it adds no value because the only attribute would be the order number itself.
Other examples include invoice_number, transaction_id, and confirmation_code. These values are used for grouping (all line items in the same order) or drill-down (look up a specific transaction), but they carry no additional attributes that would justify a separate dimension table.
Interviewers use degenerate dimensions to test whether you blindly follow rules or think critically. If you can explain why some dimension keys do not need their own table, you demonstrate practical judgment. If you try to create a dimension table for order_number, the interviewer knows you are applying the pattern mechanically.
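A short sketch of a degenerate dimension in use: in these hypothetical fact rows, order_number exists purely to group line items back into orders, with no dimension table to join.

```python
from collections import defaultdict

# Hypothetical fact rows at the order-line grain; order_number is a
# degenerate dimension carried directly in the fact table.
fact_order_lines = [
    {"order_number": "SO-1001", "line_total": 25.0},
    {"order_number": "SO-1001", "line_total": 10.0},
    {"order_number": "SO-1002", "line_total": 40.0},
]

# Roll line items up to order totals by grouping on the degenerate dimension.
order_totals = defaultdict(float)
for row in fact_order_lines:
    order_totals[row["order_number"]] += row["line_total"]
```

A `dim_order` table would add nothing here: there is no attribute of an order number beyond the number itself.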
Knowing all three fact table types distinguishes you from candidates who think of fact tables only as transaction logs. Each type serves a different analytical purpose.
Transaction fact: one row per discrete event. Each row captures a single occurrence: a purchase, a click, a shipment. This is the most common fact table type. It grows continuously and can become very large. There is no upper bound on the number of rows because new events keep arriving.
Example: an order_line_items fact with one row per product per order. Measures include quantity, unit_price, discount_amount, and line_total.
Use when: you need to capture individual events with full detail. This is the default fact table type for most business processes.
Periodic snapshot fact: one row per entity per time period. Captures the state of something at regular intervals. The table has a predetermined number of rows based on the number of entities multiplied by the number of time periods. Rows are added predictably on schedule.
Example: a monthly_account_balance fact with one row per account per month. Measures include end_of_month_balance, deposits_this_month, withdrawals_this_month.
Use when: you need to track cumulative or point-in-time metrics that do not have discrete events. Balances, inventory levels, and enrollment counts are classic periodic snapshot use cases.
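A minimal sketch of building such a snapshot, with hypothetical accounts and net monthly changes; note the row count is exactly accounts times months, the predictable growth described above.

```python
accounts = ["A", "B"]
months = ["2024-01", "2024-02", "2024-03"]
# Hypothetical net balance change per (account, month); missing keys mean 0.
net_change = {("A", "2024-01"): 50, ("B", "2024-02"): -20}

snapshot = []
balance = {acct: 0 for acct in accounts}
for month in months:              # outer loop over periods keeps balances cumulative
    for acct in accounts:
        balance[acct] += net_change.get((acct, month), 0)
        snapshot.append({"account": acct, "month": month,
                         "end_of_month_balance": balance[acct]})
```

The snapshot always has `len(accounts) * len(months)` rows (six here), regardless of how many underlying transactions occurred.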
Accumulating snapshot fact: one row per process instance, updated as milestones are reached. Unlike transaction facts, these rows are updated after initial insert. The row tracks the progress of a multi-step process from start to completion. Useful for workflows with defined stages.
Example: an order_fulfillment fact with one row per order. Columns include order_date, payment_date, ship_date, delivery_date. Each date is NULL until that milestone is reached, then filled in. Measures include days_to_ship and days_to_deliver.
Use when: you need to analyze throughput, bottlenecks, and stage durations in a process. Manufacturing pipelines, loan origination, and order fulfillment are common use cases.
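Sketched in SQLite with a hypothetical schema: the fulfillment row is inserted once, then the same row is updated when the order ships, filling in ship_date and the days_to_ship lag measure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE fact_order_fulfillment (
    order_key INTEGER PRIMARY KEY,
    order_date TEXT, payment_date TEXT,
    ship_date TEXT, delivery_date TEXT,
    days_to_ship INTEGER              -- lag measure, filled at ship time
)""")

# Row created when the order is placed; later milestones are still NULL.
conn.execute("""INSERT INTO fact_order_fulfillment (order_key, order_date)
                VALUES (1, '2024-05-01')""")

# When the order ships, the SAME row is updated rather than a new one inserted.
conn.execute("""
UPDATE fact_order_fulfillment
SET ship_date = '2024-05-04',
    days_to_ship = CAST(julianday('2024-05-04') - julianday(order_date) AS INTEGER)
WHERE order_key = 1
""")
```

After the update, days_to_ship is 3 while delivery_date remains NULL until that milestone arrives: the defining behavior of an accumulating snapshot.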
Follow this four-step process in every design interview. It gives you a structured approach that interviewers recognize as Kimball-aligned, and it prevents the most common mistake: jumping straight to column names without establishing the grain.
Step 1: select the business process. Start with a business process, not a department or a report. A business process generates events: orders, shipments, support tickets, ad impressions. Each process becomes its own fact table. Do not try to model the entire business in one fact table.
In an interview, the business process is usually given to you ('design a model for our e-commerce platform'). Clarify which specific process: are we modeling orders, returns, sessions, or all three? Each one is a separate fact table with its own grain.
Step 2: declare the grain. State what one row in the fact table represents. 'One row per order line item.' 'One row per ad impression.' 'One row per daily active user.' The grain determines everything else: which dimensions apply, which measures are possible, and how large the table will be.
Say the grain out loud before doing anything else. 'Before I start listing columns, let me define the grain: one row represents a single line item in a customer order.' This is the highest-signal move in a dimensional modeling interview. If you get the grain wrong, everything downstream is wrong too.
Step 3: identify the dimensions. List every descriptive context that applies at the declared grain. For an order line item, the dimensions are: date (when), customer (who), product (what), store or channel (where), promotion (why the discount). Each dimension becomes a foreign key in the fact table.
Use the journalistic questions (who, what, where, when, why, how) as a checklist. Walk through each one and identify the relevant dimension. If you miss a dimension, the interviewer may prompt you. Having a systematic approach shows maturity.
Step 4: identify the measures. List the numeric, measurable values that occur at the declared grain. For an order line item: quantity, unit_price, discount_amount, line_total, tax_amount. Classify each as additive, semi-additive, or non-additive. This classification determines how the measure can be aggregated in queries.
Avoid putting derived measures in the fact table. Store quantity and unit_price; let the query compute line_total as quantity * unit_price. Exception: when the derivation involves business logic that should be applied once (like tax calculation), store the result. The interviewer may challenge you on whether a measure should be stored or computed.
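A two-line illustration with made-up values: line_total is derived at query time from the stored base measures, while tax_amount is stored because the tax rule ran once upstream.

```python
# Hypothetical stored fact row: base measures only, plus tax applied at load.
row = {"quantity": 3, "unit_price": 9.99, "tax_amount": 2.40}

# Derived at query time rather than stored in the table:
line_total = row["quantity"] * row["unit_price"]

# tax_amount IS stored: the tax calculation is business logic applied once,
# so every consumer sees the same result without re-implementing the rule.
```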
These questions appear repeatedly in data modeling interview rounds. For each one, the guidance shows how a strong candidate structures their answer.
Start with the grain: 'one row per order line item.' Then list dimensions: date, customer, product, store/channel, promotion. The fact table measures are quantity, unit_price, discount_amount, line_total, shipping_cost_allocated. Stating the grain first is the most important thing you can do. Then walk through the dimensions systematically and justify each measure's classification (additive vs non-additive).
Fact tables store measurements (numeric values you aggregate). Dimension tables store descriptive context (text, categories, hierarchies you filter and group by). Facts answer 'how much' or 'how many.' Dimensions answer 'who, what, where, when.' Facts are tall and narrow, growing with every event. Dimensions are short and wide, changing slowly. In a star schema, the fact table sits at the center with foreign keys pointing outward to dimensions.
Snowflaking normalizes dimension tables to eliminate redundancy. A product dimension might split into product and category tables. The trade-off: storage savings vs query complexity (more joins). In modern cloud warehouses where storage is cheap and columnar compression handles redundancy efficiently, star schemas are almost always preferred because they are simpler to query and BI tools are optimized for them. Mention snowflaking as an option, but explain why you would default to star.
The bus matrix is a grid with business processes (fact tables) as rows and dimensions as columns. An X indicates a shared dimension. Its purpose is dimensional consistency across the enterprise. Without it, each team builds their own version of 'customer' or 'product' with different definitions. The bus matrix forces a single, conformed definition. Sketch one on the whiteboard: orders fact uses date, customer, product. Returns fact uses date, customer, product. Customer and product are conformed because they appear in both.
For a product that belongs to multiple categories, there are three options. (1) A bridge table between the fact and a category dimension, enabling many-to-many relationships; this is the most flexible but adds a join. (2) Pick the primary category and store it in the product dimension; simpler, but it loses secondary categories. (3) A multi-valued dimension with a group key (one row per unique combination of categories). The right choice depends on the business requirement: does the end user need to filter by any category, or just the primary one? Ask the interviewer to clarify before committing to an approach.
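The bridge-table option can be sketched in SQLite; the names and rows here are hypothetical. One product resolves to every category it belongs to through the bridge.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product  (product_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_category (category_key INTEGER PRIMARY KEY, name TEXT);
-- Bridge table: one row per (product, category) pairing.
CREATE TABLE bridge_product_category (product_key INTEGER, category_key INTEGER);
INSERT INTO dim_product  VALUES (1, 'Trail Runner');
INSERT INTO dim_category VALUES (10, 'Footwear'), (20, 'Outdoor');
INSERT INTO bridge_product_category VALUES (1, 10), (1, 20);
""")

rows = conn.execute("""
    SELECT c.name
    FROM dim_product p
    JOIN bridge_product_category b ON b.product_key = p.product_key
    JOIN dim_category c ON c.category_key = b.category_key
    WHERE p.name = 'Trail Runner'
    ORDER BY c.name
""").fetchall()
```

One caution worth mentioning in an interview: joining a fact through a bridge can fan out rows, so additive measures may need weighting factors to avoid over-counting.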
The three fact table types are: transaction (one row per event, grows continuously, no upper bound); periodic snapshot (one row per entity per period, captures state at regular intervals, predictable growth); and accumulating snapshot (one row per process instance, updated as milestones are reached, tracks workflow progress). Most candidates only know transaction facts. Naming all three with concrete examples (orders, monthly balance, order fulfillment) demonstrates depth.
Start with the business processes: user signups (transaction fact), monthly subscription snapshots (periodic snapshot), feature usage events (transaction fact), support tickets (transaction fact or accumulating snapshot if you track resolution stages). Then identify conformed dimensions that cross-cut: user, account/company, plan, date. The bus matrix for this scenario shows how the user dimension is shared across all fact tables, ensuring consistent user metrics across product, finance, and support analytics.
These are the errors interviewers see most often. Avoiding them puts you ahead of the majority of candidates.
A common interview anti-pattern is to look at a specific dashboard mockup and design a table that matches it exactly. This creates a table that serves one report and is useless for every other question. Dimensional modeling starts with the business process (orders, shipments, clicks), not the output format. The correct design supports any report about that process.
If some rows represent individual transactions and others represent daily summaries, the table is broken. You cannot sum the measures without double-counting. Every row must represent the same thing at the same level of detail. If you need both transaction-level and daily-level data, create two separate fact tables.
Another frequent mistake is storing customer_name, product_category, or region directly in the fact table instead of joining to a dimension table. This creates data redundancy, update anomalies, and inconsistent descriptions. The only text values in a fact table should be degenerate dimensions (order_number) or status flags.
If you store unit_price in a fact table and a downstream report sums it across 1,000 line items, the result is meaningless. Knowing that unit_price is non-additive (it should be averaged or weighted, not summed) prevents this. Interviewers test this classification because it reveals whether you understand how measures will be consumed.
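A short demonstration with hypothetical line items: the naive sum of unit_price is meaningless, while the quantity-weighted average is the defensible aggregate.

```python
# Hypothetical order line items with a non-additive unit_price measure.
lines = [
    {"quantity": 10, "unit_price": 2.0},
    {"quantity": 1,  "unit_price": 50.0},
]

# Summing unit_price yields 52.0, a number with no business meaning.
wrong = sum(line["unit_price"] for line in lines)

# The quantity-weighted average price is the correct aggregate: 70 / 11.
weighted_avg = (sum(line["quantity"] * line["unit_price"] for line in lines)
                / sum(line["quantity"] for line in lines))
```

Note that a plain average of the two prices (26.0) would also mislead here; the weighting by quantity is what makes the aggregate meaningful.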
Interviewers may ask how Kimball compares to other methodologies. The key is knowing which layer of the warehouse each approach targets.
3NF minimizes redundancy through normalization. It is optimized for transactional workloads (OLTP) where updates must be fast and consistent. Dimensional models are deliberately denormalized, optimized for analytical workloads (OLAP) where reads are far more frequent than writes. In an interview, explain that 3NF is for the source system; dimensional modeling is for the warehouse.
Data Vault (hubs, links, satellites) is designed for the integration layer of a warehouse. It excels when you have many source systems with conflicting definitions and need full auditability. Dimensional modeling is designed for the presentation layer. Many enterprises use Data Vault for integration and then build dimensional models (star schemas) on top for analytics consumption.
OBT (One Big Table) pre-joins everything into a single wide table. It trades storage and update complexity for query simplicity. OBT works for specific, well-defined use cases (a single dashboard, a feature store). It does not scale to a warehouse with dozens of business processes and hundreds of dimensions. Dimensional modeling provides structure that OBT does not.
DataDriven covers dimensional modeling, SQL, and system design with hands-on challenges at interview difficulty. Practice the concepts interviewers test, not textbook exercises.
Start Practicing