What is the Kimball four-step dimensional modeling process?

Step 1: identify the business process (sales, payments, ad impressions). Step 2: declare the grain of the fact table (one row per X). Step 3: identify the dimensions that apply at that grain (who, what, where, when, how). Step 4: identify the facts (numeric measures: revenue, quantity, duration). Most data engineer modeling rounds implicitly expect this framework even when not named.

What is a role-playing dimension?

A single dimension table referenced multiple times by the same fact table, each time in a different role. The standard example is dim_date with the foreign keys order_date_key, ship_date_key, and return_date_key all pointing to the same dim_date table. This saves storage and maintenance compared with separate date dimensions. The query layer typically creates views (dim_order_date, dim_ship_date, dim_return_date) that alias the same underlying table.
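
A minimal sketch of the role-playing pattern, using Python's built-in sqlite3 module so it runs without a warehouse. The table contents and the fact_orders name are illustrative assumptions; only dim_date, the key names, and the view names come from the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, full_date TEXT);

-- Hypothetical fact table with two foreign keys into the same dim_date.
CREATE TABLE fact_orders (
    order_number   TEXT PRIMARY KEY,
    order_date_key INTEGER REFERENCES dim_date (date_key),
    ship_date_key  INTEGER REFERENCES dim_date (date_key)
);

-- One view per role; each aliases the same underlying table.
CREATE VIEW dim_order_date AS
    SELECT date_key AS order_date_key, full_date AS order_date FROM dim_date;
CREATE VIEW dim_ship_date AS
    SELECT date_key AS ship_date_key, full_date AS ship_date FROM dim_date;

INSERT INTO dim_date VALUES (20240101, '2024-01-01'), (20240103, '2024-01-03');
INSERT INTO fact_orders VALUES ('A-1', 20240101, 20240103);
""")

# Join the fact to each role view on its own key.
order_and_ship_dates = conn.execute("""
    SELECT f.order_number, o.order_date, s.ship_date
    FROM fact_orders AS f
    JOIN dim_order_date AS o ON o.order_date_key = f.order_date_key
    JOIN dim_ship_date  AS s ON s.ship_date_key  = f.ship_date_key
""").fetchall()
print(order_and_ship_dates)
```

Renaming the columns inside each view keeps the two roles distinguishable in query results, which is the main reason to prefer views over joining dim_date twice with ad-hoc aliases.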

What is a junk dimension?

A consolidated dimension for low-cardinality flags. Instead of separate FKs in the fact for is_promotional, is_first_purchase, channel, payment_type, build one dim_junk with all combinations of these flags as rows. One FK in the fact. Avoids bloating the fact with many low-cardinality FKs. Especially useful when the flags have few combinations relative to fact volume.
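
One way to populate such a dimension is to enumerate every combination of the flag values up front. The sketch below does that in plain Python; the specific value lists for channel and payment_type are assumptions chosen for illustration, as is the junk_key_for helper.

```python
from itertools import product

# Assumed value domains for each low-cardinality attribute.
flag_domains = {
    "is_promotional": [False, True],
    "is_first_purchase": [False, True],
    "channel": ["web", "app", "store"],
    "payment_type": ["card", "wallet"],
}
columns = list(flag_domains)

# One dim_junk row per combination, with a sequential surrogate key.
dim_junk = [
    {"junk_key": key, **dict(zip(columns, combo))}
    for key, combo in enumerate(product(*flag_domains.values()), start=1)
]

# Lookup used at load time to turn an event's flags into a single FK.
junk_lookup = {tuple(row[c] for c in columns): row["junk_key"] for row in dim_junk}


def junk_key_for(event):
    """Return the dim_junk surrogate key for one source event (hypothetical helper)."""
    return junk_lookup[tuple(event[c] for c in columns)]


event = {"is_promotional": False, "is_first_purchase": False,
         "channel": "web", "payment_type": "card"}
print(len(dim_junk), junk_key_for(event))
```

Here four attributes collapse into 24 dimension rows and one key on the fact. If the full cross product were large, the alternative would be to insert only the combinations actually observed in the source data.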

What is a degenerate dimension?

A natural key stored on the fact without a separate dim table. Typical examples are order_number, invoice_number, and transaction_id. There are no attributes beyond the key itself, so a dim table would be redundant. The key remains on the fact for traceability and ad-hoc joins back to the source system.

How do conformed dimensions differ from regular dimensions?

Conformed dimensions are designed to span multiple business processes. One dim_customer used by the sales fact, the returns fact, and the support fact, with identical schema and surrogate keys. Regular (non-conformed) dimensions live in only one fact. Senior data engineer rubrics weight conformed dimensions because they enable cross-fact analysis (a customer's lifetime value combining sales and returns) without explicit identifier translation.
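
The cross-fact analysis mentioned above can be sketched as a drill-across query: aggregate each fact separately, then join the results on the shared surrogate key. This example uses sqlite3 with made-up rows; the column names amount_cents, sales_cents, and returned_cents are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT);
CREATE TABLE fact_sales   (customer_key INTEGER, amount_cents INTEGER);
CREATE TABLE fact_returns (customer_key INTEGER, amount_cents INTEGER);

INSERT INTO dim_customer VALUES (1, 'Ada'), (2, 'Bo');
INSERT INTO fact_sales   VALUES (1, 10000), (1, 5000), (2, 8000);
INSERT INTO fact_returns VALUES (1, 3000);
""")

# Aggregate each fact on its own first; joining the raw facts directly
# would multiply rows and inflate both totals.
net_value_by_customer = conn.execute("""
    WITH s AS (
        SELECT customer_key, SUM(amount_cents) AS sales_cents
        FROM fact_sales GROUP BY customer_key
    ),
    r AS (
        SELECT customer_key, SUM(amount_cents) AS returned_cents
        FROM fact_returns GROUP BY customer_key
    )
    SELECT c.customer_name, s.sales_cents - COALESCE(r.returned_cents, 0)
    FROM dim_customer AS c
    JOIN s ON s.customer_key = c.customer_key
    LEFT JOIN r ON r.customer_key = c.customer_key
    ORDER BY c.customer_name
""").fetchall()
print(net_value_by_customer)
```

The join works without any identifier translation only because both facts carry the same customer_key from the same dim_customer.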

What is an additive fact and why does it matter?

A fact measure that can be summed across all dimensions. Revenue is additive (it sums across customer, product, and date), and so are quantity and cost. Semi-additive measures (account balance) sum across some dimensions but not time. Non-additive measures (ratios, percentages) must be recomputed at the desired aggregation level from their additive components. It matters because additivity determines which aggregations downstream queries can safely apply, so identifying it is part of the modeling rubric.
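
A small worked example of the semi-additive and non-additive cases, in plain Python with invented numbers. Taking the period-end balance is one common convention for rolling a balance up over time; averaging is another.

```python
# Semi-additive: daily balance snapshots for two accounts.
balance_snapshots = [
    {"date": "2024-01-01", "account": "A", "balance": 100},
    {"date": "2024-01-01", "account": "B", "balance": 50},
    {"date": "2024-01-02", "account": "A", "balance": 120},
    {"date": "2024-01-02", "account": "B", "balance": 50},
]

# Summing across accounts within one date is meaningful.
total_by_date = {}
for row in balance_snapshots:
    total_by_date[row["date"]] = total_by_date.get(row["date"], 0) + row["balance"]

# Across time, take the period-end value instead of summing.
period_end_total = total_by_date[max(total_by_date)]
naive_total = sum(row["balance"] for row in balance_snapshots)  # not a real balance

# Non-additive: a margin ratio must be rebuilt from additive parts.
sales = [{"revenue": 100, "cost": 90}, {"revenue": 900, "cost": 450}]
total_revenue = sum(r["revenue"] for r in sales)
total_cost = sum(r["cost"] for r in sales)
margin_from_totals = (total_revenue - total_cost) / total_revenue
mean_of_row_margins = sum((r["revenue"] - r["cost"]) / r["revenue"] for r in sales) / len(sales)

print(period_end_total, naive_total)
print(round(margin_from_totals, 2), round(mean_of_row_margins, 2))
```

The two margin figures differ because averaging row-level ratios ignores how much revenue each row carries, which is why the fact table should store revenue and cost rather than the ratio.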

How does dimensional modeling handle hierarchies?

Two patterns. Parent-child columns: dim_employee with manager_id FK to itself (works for fixed-depth or limited-depth hierarchies). Hierarchy bridge: a separate bridge table that flattens all ancestor-descendant relationships (handles arbitrary depth, supports SQL queries for any level). Categories with parent-categories often use parent-child; org charts often use hierarchy bridge.
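
To make the relationship between the two patterns concrete, the sketch below derives a hierarchy bridge from parent-child rows in plain Python. The four-person org chart and the build_bridge function are hypothetical, and the code assumes the hierarchy has no cycles.

```python
# Parent-child representation: each employee points at a manager.
manager_of = {"ceo": None, "vp": "ceo", "mgr": "vp", "ic": "mgr"}


def build_bridge(parent_of):
    """Flatten parent-child links into (ancestor, descendant, depth) rows.

    Every node is also its own ancestor at depth 0. Assumes no cycles.
    """
    rows = []
    for node in parent_of:
        ancestor, depth = node, 0
        while ancestor is not None:
            rows.append((ancestor, node, depth))
            ancestor = parent_of[ancestor]
            depth += 1
    return rows


bridge = build_bridge(manager_of)

# Everyone under the VP, at any depth, with a single filter and no recursion.
reports_of_vp = sorted(d for a, d, depth in bridge if a == "vp" and depth > 0)
print(len(bridge), reports_of_vp)
```

The bridge trades extra rows (one per ancestor-descendant pair) for queries that need no recursive SQL, which is the trade-off that makes it attractive for deep or ragged hierarchies.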

When is a snowflake schema better than a star schema?

When a specific dimension is genuinely too large to broadcast (millions of rows with high-cardinality attributes that the optimizer cannot fit in driver memory or shuffle efficiently). Snowflake that one dimension by normalizing its attribute group into a sub-dim. Do not snowflake the whole model; the join cost outweighs the storage saving for most dimensions in modern columnar warehouses.

Dimensional Modeling Interview Questions (Kimball)

Kimball dimensional modeling problems for data engineer interview prep.

Dimensional modeling interview questions for data engineer roles. Kimball-style design across e-commerce, marketplace, payments, ad tech, and content platform domains. Grain selection, conformed dimensions across multiple facts, SCD Type 1 vs Type 2 vs Type 3, additive measures, bridge tables for many-to-many.

Dimensional modeling for data engineer interviews follows the Kimball methodology: identify the business process, declare the grain of the fact table, identify the dimensions that apply at that grain, and identify the facts (measures). The Kimball four-step process is the framework most data engineer modeling rounds implicitly expect, even when the interviewer does not name it.

Step one: identify the business process. Sales, payments, ad impressions, content views, ride trips, support interactions. The business process is the source of the events that flow into the fact table. Conformed dimensions are designed to span multiple business processes (a single dim_customer used across the sales fact, the returns fact, and the support fact).

Step two: declare the grain of the fact table. One row per X, where X is the atomic unit of the business process. For sales, that is usually one row per order line item, not one row per order. For trips, one row per trip. For ad impressions, one row per impression. Mixing grains in one fact table is the failure mode to avoid.

Step three: identify the dimensions that apply at that grain. Each dimension is a context: who (customer, user), what (product, content), where (store, location, geography), when (date, time), how (channel, payment method). Some dimensions are degenerate (an order number with no attributes beyond the natural key), and some are role-playing (the same dim_date used as order_date, ship_date, and return_date, with three FKs in the fact).

Step four: identify the facts. These are the numeric measures the business cares about: revenue, quantity, duration, cost. Mark each as additive, semi-additive, or non-additive.
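
The four steps can be sketched end to end for a sales process, here with sqlite3 so the schema is runnable as written. Table names, column names, and rows are illustrative assumptions, and amounts are stored as integer cents to keep the arithmetic exact.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Step 1: the business process is sales.
-- Step 3: dimensions that apply at the declared grain.
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT);
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, full_date TEXT);

-- Step 2: the grain is one row per order line item.
CREATE TABLE fact_sales (
    order_number   TEXT,     -- degenerate dimension, no dim table
    line_number    INTEGER,
    product_key    INTEGER REFERENCES dim_product (product_key),  -- what
    order_date_key INTEGER REFERENCES dim_date (date_key),        -- when
    quantity       INTEGER,  -- Step 4: additive fact
    revenue_cents  INTEGER,  -- Step 4: additive fact
    PRIMARY KEY (order_number, line_number)
);

INSERT INTO dim_product VALUES (1, 'Keyboard'), (2, 'Mouse');
INSERT INTO dim_date VALUES (20240101, '2024-01-01'), (20240102, '2024-01-02');
INSERT INTO fact_sales VALUES
    ('A-1', 1, 1, 20240101, 1, 4500),
    ('A-1', 2, 2, 20240101, 2, 4000),
    ('A-2', 1, 1, 20240102, 1, 4500);
""")

# Because revenue is additive, it can be summed along any dimension.
revenue_by_product = conn.execute("""
    SELECT p.product_name, SUM(f.revenue_cents)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    GROUP BY p.product_name
    ORDER BY p.product_name
""").fetchall()
print(revenue_by_product)
```

Declaring the grain as the primary key (order_number, line_number) makes a mixed-grain load fail loudly instead of silently double counting.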

Dimensional modeling interview questions for data engineer roles test six recurring patterns.

Conformed dimensions: same dim_customer in multiple facts, with one schema and one set of surrogate keys.
Role-playing dimensions: dim_date referenced as order_date_key, ship_date_key, return_date_key in the same fact.
Junk dimensions: low-cardinality flags (is_promotional, is_first_purchase, channel) consolidated into one dim_junk to avoid bloating the fact with multiple FKs.
Degenerate dimensions: natural keys (order_number, invoice_number) stored on the fact without a separate dim table.
Bridge tables: many-to-many product-category, patient-diagnosis, impression-conversion.
SCD Type 2 mechanics: surrogate_key, natural_key, effective_from, effective_to, is_current; expire-and-insert pattern on change; fact joins on surrogate for point-in-time correctness.
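
The SCD Type 2 expire-and-insert pattern can be sketched in plain Python over a list of rows. The function names, the far-future OPEN_END sentinel, and the half-open interval convention (effective_from inclusive, effective_to exclusive) are assumptions made for this illustration; real pipelines usually express the same logic as a warehouse MERGE.

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # assumed sentinel meaning "still current"


def apply_scd2(dim_rows, natural_key, attributes, change_date):
    """Expire the current row if attributes changed, then insert a new version."""
    current = next(
        (r for r in dim_rows if r["natural_key"] == natural_key and r["is_current"]),
        None,
    )
    if current is not None:
        if current["attributes"] == attributes:
            return current["surrogate_key"]  # nothing changed, keep the row
        current["effective_to"] = change_date  # expire the old version
        current["is_current"] = False
    surrogate_key = max((r["surrogate_key"] for r in dim_rows), default=0) + 1
    dim_rows.append({
        "surrogate_key": surrogate_key,
        "natural_key": natural_key,
        "attributes": dict(attributes),
        "effective_from": change_date,
        "effective_to": OPEN_END,
        "is_current": True,
    })
    return surrogate_key


def key_as_of(dim_rows, natural_key, as_of):
    """Find the surrogate key whose validity interval contains as_of."""
    for r in dim_rows:
        if r["natural_key"] == natural_key and r["effective_from"] <= as_of < r["effective_to"]:
            return r["surrogate_key"]
    return None


dim_customer = []
first_key = apply_scd2(dim_customer, "C-100", {"city": "Austin"}, date(2023, 1, 1))
second_key = apply_scd2(dim_customer, "C-100", {"city": "Denver"}, date(2024, 6, 1))
print(first_key, second_key, key_as_of(dim_customer, "C-100", date(2023, 12, 31)))
```

A fact row loaded for an event on 2023-12-31 stores the first surrogate key, so it keeps reporting the customer as they were at that time even after the move.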

Senior data engineer dimensional modeling rounds add platform concerns. Conformed dimensions across multiple data marts (a single dim_customer used by the sales mart, the marketing mart, and the support mart). Slowly changing fact tables for corrections (append-only with version column versus in-place with audit log). Late-arriving dimensions with placeholder rows. Hierarchies in dimensions (employee-to-manager, category-to-parent-category) modeled as parent-child columns versus a hierarchy bridge table. Multi-valued dimensions modeled as bridges with weighting factors for fractional attribution.
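
As an illustration of the last point, the sketch below allocates fact revenue through a weighted bridge in plain Python. The campaign names, group keys, and weights are invented; the only rule taken as given is that the weights within one group sum to 1, so the allocated total matches the fact total.

```python
# Fact rows point at a group key rather than at a single campaign.
fact_sales = [
    {"order_number": "A-1", "campaign_group_key": 10, "revenue_cents": 1000},
    {"order_number": "A-2", "campaign_group_key": 11, "revenue_cents": 600},
]

# Bridge rows: one per (group, campaign), with a weighting factor.
bridge_campaign_group = [
    {"campaign_group_key": 10, "campaign": "email", "weight": 0.5},
    {"campaign_group_key": 10, "campaign": "search", "weight": 0.25},
    {"campaign_group_key": 10, "campaign": "social", "weight": 0.25},
    {"campaign_group_key": 11, "campaign": "email", "weight": 1.0},
]

# Join fact to bridge on the group key and apply the weight.
revenue_by_campaign = {}
for fact in fact_sales:
    for link in bridge_campaign_group:
        if link["campaign_group_key"] == fact["campaign_group_key"]:
            allocated = fact["revenue_cents"] * link["weight"]
            revenue_by_campaign[link["campaign"]] = (
                revenue_by_campaign.get(link["campaign"], 0.0) + allocated
            )

print(revenue_by_campaign)
```

Dropping the weight from the calculation would credit each campaign with the full order amount, which is the double counting the weighting factor exists to prevent.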

Star schema interview questions - Star is the default dimensional model; conformed dims, additive facts.
Full data modeling interview catalog - Star, snowflake, vault, medallion, SCD types.
Slowly changing dimension interview questions - Type 1 vs Type 2 vs Type 3 merge logic.
Fact table interview questions - Grain selection, additivity, snapshot vs transaction.
Data warehouse interview questions - Warehouse design with dimensional modeling as the gold layer.
Data modeling interview prep guide - Round-by-round dimensional modeling prep.
Data modeling practice problems - Hands-on Kimball-style schema design across 6 domains.

63 practice problems. Difficulty: easy (9), medium (33), hard (21).

Data Modeling (63)

Split Decision - medium - One user, one experiment, one variant. No exceptions.
Where They Used to Live - medium - They moved. The data stayed behind.
The Double Count - medium - One flight carries hundreds of seats; one ticket spans many flights. Model them so neither gets counted twice.
A Number for the Seller - easy - They want a total. Give them the right schema first.
B2B Invoicing Data Model - easy - Invoices go out, partial payments trickle in, and some customers are three months overdue.
The Anonymous Majority - medium - Millions of clicks, mostly anonymous.
Cloud File Storage Metadata Schema - hard - A file is also a folder. A folder is also a file.
Content Engagement Data Model - hard - Post published. Now measure everything that happens next.
Content Search and Discovery Schema - hard - Searchable from every angle. Design it so nothing gets lost.
Customer Address History - easy - People move. Sometimes twice in a month. How do you remember where everyone was, and when?
E-Commerce Supply Chain Tracking - hard - A package splits, reroutes, and (maybe) arrives.
Signal and Silence - medium - They opened the assignment. Did they actually read it?
Employee Application Time Tracking - medium - Every minute tracked. Every app accounted for.
Employee Transfer Tracking System - medium - People switch teams. HR loses track.
Event Ticketing System Data Model - easy - JSON in. Reporting warehouse out. Design both ends.
Financial Trading Warehouse - hard - Every trade, every tick, every fraction of a share. The regulators want receipts.
Personal Best - easy - Reps, sets, streaks, and personal bests. Gym rats love their stats.
The No-Show - easy - Every reserved seat ends one of five ways. Build the model that can tell them apart.
Food Truck Operations Data Model - medium - Mobile vendor, fixed menu, unpredictable locations.
Deal Flow - medium - Sellers want buyers. Buyers want deals.
Insurance Claims Lifecycle - hard - A claim gets filed. Then it gets complicated. Then it gets reassigned. Then it loops back.
Livestream Analytics Schema - medium - Someone goes live, thousands tune in, chat explodes, and virtual gifts start flying.
Approval and After - medium - Approved, declined, or pending. Design the tables that say so.
The Balance Always Reconciles - easy - Money out, payments back. The balance has to be exact.
The Vital Few - medium - Two terabytes a day, and the lines that matter are a rounding error in the noise.
The Shape of a Run - medium - Two log lines bracket every process. Pair them and the fleet's rhythm appears.
Marketplace Sales Warehouse - hard - No schema given. The interviewer is watching.
Metric Definition Reverse Engineering - hard - Five numbers on a dashboard. Your job: figure out where they come from.
Movie Streaming Analytics Schema - medium - They pressed play. What happened next is the whole question.
Multiplayer Game Match History - medium - Millions of matches. The leaderboard refreshes in fifteen minutes.
Online Marketplace - Seller Payouts - hard - The buyer paid one number. The seller got a different one.
The Retail Blueprint - medium - One business. A thousand transactions. Only one layout survives the analytics layer.
The Last Mile - medium - Order placed. Now track it to the door.
POS Sales Data Warehouse - medium - Every beep at the register. Coupons, returns, all of it.
Property Booking Platform - hard - Five-star listing. Three-star reality.
Retailer Data Warehouse Design - medium - Queries are crawling. The analysts are not happy.
Ride-Sharing Platform Schema - medium - Riders, drivers, and fares. Everyone takes a cut.
The Sales Architecture - medium - Numbers are easy. Making them queryable at scale is the real job.
The Customer Who Changed - hard - She moved. She upgraded. She became someone new. The record has to keep up.
The Endless Thread - medium - Follows, likes, replies to replies. It never stops.
Two Wallets - medium - Two user types. Multiple payment methods. One messy billing table.
When the Music Stops - medium - Subscribers are leaving. The data knows why.
The Heat of the Map - hard
Telecom Network Connectivity Warehouse - hard - One device goes down. The ripple keeps going.
The Celebrity Problem - medium - One post. A million notifications. Something has to give.
The Churner Who Came Back - hard - They cancelled. They came back. The report has to tell both stories correctly.
The Gaps Between Clicks - hard
The JSON Files That Became a Data Mart - medium - Three semi-structured inputs. One queryable warehouse.
The League With Too Many Loyalties - hard - A player can belong to many teams. The schema must agree.
The Other Seat - hard
The Person They Were Then - easy
The Plan That Changed Twice This Month - medium - Subscribers come, go, downgrade, and share. The schema has to keep up.
The Retail Tables That Need a New Home - medium - A working system. Now redesign it so the analysts can actually use it.
The Schema That Could Not Answer Back - hard - Forty columns in. Zero useful answers out.
The Slow Yes - hard
The Table That Lies - medium - Every query comes out wrong. The data is all there.
The Talent Funnel - medium - Thousands applied. One accepted. Where did the rest go?
The Territory That Keeps Moving - hard - Reps get reassigned. The receipts have to survive.
The Transfer Request - medium - Apply, wait, get approved or denied. Track all of it.
Three-Sided Marketplace Delivery Schema - hard - One order. Two deliveries. Revenue counted twice. Where is the bug in your schema?
Toll Road Sensor Analytics - easy - Cars enter, cars exit. Except when they don't.
Trending Dishes Dashboard - medium - What's everyone eating? The answer changes hourly.
Who Comes Back - medium