What is the grain of a fact table?

The grain is the unit of analysis: one row per X. For an orders fact, the grain is usually one row per order line item (not one row per order, which loses line-item detail). For an impressions fact, one row per impression. State the grain in one sentence before drawing the fact table. Mixed-grain fact tables are the failure mode interviewers explicitly test.
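To see why mixed grain is a failure mode, the sketch below uses hypothetical rows (the tuple layout and the `None` line-number convention are assumptions made for illustration). It shows how a single order-level row inside a line-item fact inflates a naive sum.

```python
# Hypothetical rows: (order_id, line_no, revenue).
# Assumption: line_no None marks an order-level summary row that was
# wrongly loaded into a fact declared as "one row per order line item".
clean_fact = [
    ("A", 1, 30.0),
    ("A", 2, 20.0),
    ("B", 1, 15.0),
]
mixed_fact = clean_fact + [("A", None, 50.0)]  # order-level total for order A


def total_revenue(rows):
    """Sum revenue over every row, as a naive SUM(revenue) would."""
    return sum(revenue for _, _, revenue in rows)


def off_grain_rows(rows):
    """Return rows that do not match the declared line-item grain."""
    return [row for row in rows if row[1] is None]


print(total_revenue(clean_fact))   # 65.0
print(total_revenue(mixed_fact))   # 115.0, order A is counted twice
print(off_grain_rows(mixed_fact))  # the offending order-level row
```

A check like `off_grain_rows` is only one possible guard. The broader point is that every row must answer to the same one-sentence grain statement.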

What is an accumulating snapshot fact?

One row per process instance, updated as the process moves through stages. For example, one row per order with timestamp columns for order placed, paid, shipped, and delivered. The row starts with order_placed_ts populated and the other timestamps NULL, and it is updated as each milestone happens. Useful for funnel analysis, lifecycle tracking, and SLA monitoring.

What is an additive measure?

A measure that can be summed across all dimensions. Revenue is additive (sums across customer, product, date, region). Quantity is additive. Cost is additive. Most transaction-fact measures are additive at the chosen grain.

What is a semi-additive measure?

A measure that sums across some dimensions but not others. Account balance is semi-additive: sums across customers (total balance across all customers), does not sum across dates (you take the latest balance, not the sum of 30 daily balances). Inventory level is semi-additive similarly. Periodic snapshot facts usually contain semi-additive measures.

What is a degenerate dimension?

A natural key stored on the fact without a separate dim table because there are no attributes beyond the key itself. Examples: order_number, invoice_number, transaction_id. Storing them on the fact preserves traceability to the source system without the overhead of a single-column dim table. Senior data engineer modeling rounds explicitly ask for degenerate dimensions to be identified.

How does a fact table FK to an SCD Type 2 dimension?

The fact FKs to the surrogate key of the dim version that was current at the time of the fact event. When customer 42 has surrogate_key 1001 from 2025-01-01 up to (but not including) 2025-06-30 and surrogate_key 1002 from 2025-06-30 onward, an order on 2025-05-15 links to surrogate_key 1001. This is what enables point-in-time correctness: a query for the order's customer attributes joins to the version that was current at the time of the order, not the version that is current today.
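The lookup that a fact load performs can be sketched as follows. The `DIM_CUSTOMER` rows and the `lookup_surrogate_key` helper are hypothetical. The sketch assumes `valid_to` is exclusive and that `None` marks the current version, which is one common convention rather than the only one.

```python
from datetime import date

# Hypothetical dim_customer versions:
# (surrogate_key, customer_id, valid_from, valid_to).
# Assumption: valid_to is exclusive; None means "still current".
DIM_CUSTOMER = [
    (1001, 42, date(2025, 1, 1), date(2025, 6, 30)),
    (1002, 42, date(2025, 6, 30), None),
]


def lookup_surrogate_key(customer_id, event_date, dim_rows=DIM_CUSTOMER):
    """Return the surrogate key of the dim version current on event_date."""
    for surrogate_key, cid, valid_from, valid_to in dim_rows:
        if cid != customer_id:
            continue
        if valid_from <= event_date and (valid_to is None or event_date < valid_to):
            return surrogate_key
    raise LookupError(f"no dim version for customer {customer_id} on {event_date}")


print(lookup_surrogate_key(42, date(2025, 5, 15)))  # 1001
print(lookup_surrogate_key(42, date(2025, 6, 30)))  # 1002
```

Because the surrogate key is resolved once at load time, later queries join fact to dim on that key alone and need no date-range predicate.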

Should a fact table have a primary key?

Usually yes, even though most fact tables do not enforce one. The primary key is typically the set of columns that defines the grain: a subset of the dimension FKs, or a degenerate dimension (order_number) plus a sequence number for line items. Some warehouses skip the explicit PK for ingest performance and rely on dedup logic in the pipeline. Newer table-format features (row lineage in Iceberg v3, row tracking in Delta Lake) can assign internal row identifiers regardless of any declared PK.
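When the PK is not enforced, pipeline dedup logic might look like the sketch below. The field names (`order_number`, `line_no`, `loaded_at`) and the "latest load wins" rule are assumptions for illustration; real pipelines choose their own tie-breaker.

```python
def dedup_by_grain(rows, key_fields=("order_number", "line_no"), order_field="loaded_at"):
    """Keep one row per grain key, preferring the most recently loaded copy."""
    latest = {}
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        # Replace the stored row only if this copy was loaded later.
        if key not in latest or row[order_field] > latest[key][order_field]:
            latest[key] = row
    return list(latest.values())


# Hypothetical staged rows: the first line item arrived twice.
staged = [
    {"order_number": "SO-1", "line_no": 1, "quantity": 2, "loaded_at": "2025-03-01T10:00:00"},
    {"order_number": "SO-1", "line_no": 2, "quantity": 1, "loaded_at": "2025-03-01T10:00:00"},
    {"order_number": "SO-1", "line_no": 1, "quantity": 3, "loaded_at": "2025-03-01T11:00:00"},
]

deduped = dedup_by_grain(staged)
print(len(deduped))  # 2
```

The key columns passed to `dedup_by_grain` are exactly the logical primary key, which is why declaring the PK is still worth doing even when the warehouse does not enforce it.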

Fact Table Interview Questions: Grain + Additivity

Fact table design problems for data engineer interview prep.

Fact table interview questions, isolated from the data modeling catalog. Topics: grain selection (one row per X), additive vs semi-additive vs non-additive measures, transaction fact versus periodic snapshot versus accumulating snapshot, degenerate dimensions, and the surrogate-key-versus-natural-key decision on facts.

Fact tables are where a warehouse's numeric measures live. Each fact table has a declared grain (one row per X), FKs to the dimensions that apply at that grain, and one or more numeric measures. Three fact table types appear in 2026 data engineer interviews: transaction fact, periodic snapshot fact, and accumulating snapshot fact. Each fits a different analytical workload.

Transaction fact: one row per event at the moment it happens. One row per order line item, one row per impression, one row per click, one row per trip. The most common fact type, used for activity tracking and event aggregation. Measures are usually additive (revenue, quantity, duration). FKs point to dim_customer, dim_product, dim_date_time, dim_store. The grain is atomic and immutable: once a row is written, it does not change. Corrections arrive as new adjustment or reversal rows rather than as updates to the original row.

Periodic snapshot fact: one row per entity per period, capturing state at the end of the period. One row per account per day with the daily balance, one row per inventory item per week with the weekly count. Measures are usually semi-additive: they sum across entities such as customer or product but not across time. Summing a daily balance across 30 days does not produce a monthly balance; take the period-end value or an average instead. The benefit is fast period-over-period analysis without aggregating the transaction fact. The cost is data redundancy: if nothing happens, today's row repeats yesterday's balance unchanged.
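A minimal sketch of the semi-additive rule, using Python's built-in sqlite3 module and a hypothetical `fact_account_daily` table. It assumes every account has a row on every snapshot date, which is what a periodic snapshot is meant to guarantee.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_account_daily (account_id TEXT, snapshot_date TEXT, balance REAL)"
)
conn.executemany(
    "INSERT INTO fact_account_daily VALUES (?, ?, ?)",
    [
        ("acct_1", "2025-01-30", 100.0),
        ("acct_1", "2025-01-31", 120.0),
        ("acct_2", "2025-01-30", 50.0),
        ("acct_2", "2025-01-31", 40.0),
    ],
)

# Wrong: summing across dates adds the same money once per day.
wrong_total = conn.execute(
    "SELECT SUM(balance) FROM fact_account_daily"
).fetchone()[0]

# Right: sum across accounts, but only at the period-end snapshot.
period_end_total = conn.execute(
    """
    SELECT SUM(balance)
    FROM fact_account_daily
    WHERE snapshot_date = (SELECT MAX(snapshot_date) FROM fact_account_daily)
    """
).fetchone()[0]

print(wrong_total)       # 310.0
print(period_end_total)  # 160.0
```

Swapping the period-end filter for an average over dates gives the other common treatment, average balance for the period.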

Accumulating snapshot fact: one row per process instance, updated as the process moves through stages. One row per order with timestamps for order_placed, paid, shipped, delivered. The row starts with order_placed_ts populated and the other timestamps NULL; it updates as each milestone happens. Measures are durations (paid_ts minus placed_ts, shipped_ts minus paid_ts) and the row is mutable until the process ends. Useful for funnel analysis, lifecycle tracking, and SLA monitoring. Less common than transaction fact but distinct enough to be tested in data engineer modeling rounds at L5+.
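The row lifecycle described above can be sketched with a small in-memory model. The `OrderSnapshot` class and its method names are hypothetical; in a warehouse the same idea is an UPDATE or MERGE against the order's single row.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class OrderSnapshot:
    """One mutable row per order; milestone timestamps start as None (NULL)."""

    order_id: str
    order_placed_ts: datetime
    paid_ts: Optional[datetime] = None
    shipped_ts: Optional[datetime] = None
    delivered_ts: Optional[datetime] = None

    def record(self, milestone: str, ts: datetime) -> None:
        """Fill in one milestone timestamp on the existing row."""
        if milestone not in ("paid", "shipped", "delivered"):
            raise ValueError(f"unknown milestone: {milestone}")
        setattr(self, f"{milestone}_ts", ts)

    def placed_to_paid(self) -> Optional[timedelta]:
        """Duration measure; None until the paid milestone has happened."""
        if self.paid_ts is None:
            return None
        return self.paid_ts - self.order_placed_ts


order = OrderSnapshot("SO-1", datetime(2025, 3, 1, 9, 0))
print(order.placed_to_paid())  # None, not yet paid
order.record("paid", datetime(2025, 3, 1, 9, 30))
print(order.placed_to_paid())  # 0:30:00
```

Because milestones that have not happened stay NULL, a funnel count per stage is a count of non-NULL timestamps in each column.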

Fact measure additivity is a recurring question. Additive measures sum across all dimensions (revenue, quantity, cost). Semi-additive measures sum across some dimensions but not others (account balance sums across customers but the daily balance does not sum across dates: you take the latest balance, not the sum of 30 daily balances). Non-additive measures (ratios, percentages, distinct counts) must be computed at the desired aggregation level (do not pre-compute a customer-day-level conversion rate and expect it to roll up to month: aggregate the raw counts, then compute the ratio). Identifying additivity per measure is part of the data engineer modeling rubric.
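The roll-up rule for ratios is easy to check with arithmetic. The sketch below uses two hypothetical days of conversion data and compares averaging the pre-computed daily rates against aggregating the raw counts first.

```python
# Hypothetical daily rows with raw counts stored at the fact grain.
daily = [
    {"day": "2025-03-01", "visits": 1000, "orders": 10},  # daily rate 1%
    {"day": "2025-03-02", "visits": 100, "orders": 10},   # daily rate 10%
]


def average_of_ratios(rows):
    """Wrong roll-up: average the pre-computed daily conversion rates."""
    return sum(r["orders"] / r["visits"] for r in rows) / len(rows)


def ratio_of_sums(rows):
    """Correct roll-up: sum the raw counts, then compute the ratio once."""
    return sum(r["orders"] for r in rows) / sum(r["visits"] for r in rows)


print(round(average_of_ratios(daily), 4))  # 0.055
print(round(ratio_of_sums(daily), 4))      # 0.0182
```

The two results differ by a factor of three here because the low-traffic day gets the same weight as the high-traffic day in the average. Storing the additive numerator and denominator on the fact avoids the problem.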

Degenerate dimensions on the fact: a natural key stored on the fact without a separate dim table because there are no attributes beyond the key itself. Examples: order_number, invoice_number, transaction_id. Storing them on the fact preserves traceability back to the source system without the overhead of a single-column dim table. Senior data engineer modeling rounds explicitly ask for degenerate dimensions to be identified; junior rounds often skip this nuance.

Full data modeling interview catalog - Star, snowflake, vault, SCD, fact tables.
Star schema interview questions - Star = one fact table surrounded by dims.
Dimensional modeling interview questions - Kimball methodology including fact-table grain.
SCD interview questions - How facts FK to surrogate keys of SCD Type 2 dims.
Data warehouse interview questions - Warehouse design with fact tables in the gold layer.
Data modeling interview prep guide - Prep covering fact grain and additivity.
Data modeling practice problems - Hands-on fact-table design across multiple domains.

What is the difference between a transaction fact and a periodic snapshot fact?: Transaction fact: one row per event at the moment it happens (one row per order line item). Immutable once written. Measures are usually additive. Periodic snapshot fact: one row per entity per period, capturing state at the end of the period (one row per account per day with the daily balance). A new row is appended each period rather than updating the old one, so yesterday's row exists alongside today's. Measures are usually semi-additive.

63 practice problems matching this filter. Difficulty: medium (33), easy (9), hard (21).

Data Modeling (63)

Split Decision - medium - One user, one experiment, one variant. No exceptions.
Where They Used to Live - medium - They moved. The data stayed behind.
The Double Count - medium - One flight carries hundreds of seats; one ticket spans many flights. Model them so neither gets counted twice.
A Number for the Seller - easy - They want a total. Give them the right schema first.
B2B Invoicing Data Model - easy - Invoices go out, partial payments trickle in, and some customers are three months overdue.
The Anonymous Majority - medium - Millions of clicks, mostly anonymous.
Cloud File Storage Metadata Schema - hard - A file is also a folder. A folder is also a file.
Content Engagement Data Model - hard - Post published. Now measure everything that happens next.
Content Search and Discovery Schema - hard - Searchable from every angle. Design it so nothing gets lost.
Customer Address History - easy - People move. Sometimes twice in a month. How do you remember where everyone was, and when?
E-Commerce Supply Chain Tracking - hard - A package splits, reroutes, and (maybe) arrives.
Signal and Silence - medium - They opened the assignment. Did they actually read it?
Employee Application Time Tracking - medium - Every minute tracked. Every app accounted for.
Employee Transfer Tracking System - medium - People switch teams. HR loses track.
Event Ticketing System Data Model - easy - JSON in. Reporting warehouse out. Design both ends.
Financial Trading Warehouse - hard - Every trade, every tick, every fraction of a share. The regulators want receipts.
Personal Best - easy - Reps, sets, streaks, and personal bests. Gym rats love their stats.
The No-Show - easy - Every reserved seat ends one of five ways. Build the model that can tell them apart.
Food Truck Operations Data Model - medium - Mobile vendor, fixed menu, unpredictable locations.
Deal Flow - medium - Sellers want buyers. Buyers want deals.
Insurance Claims Lifecycle - hard - A claim gets filed. Then it gets complicated. Then it gets reassigned. Then it loops back.
Livestream Analytics Schema - medium - Someone goes live, thousands tune in, chat explodes, and virtual gifts start flying.
Approval and After - medium - Approved, declined, or pending. Design the tables that say so.
The Balance Always Reconciles - easy - Money out, payments back. The balance has to be exact.
The Vital Few - medium - Two terabytes a day, and the lines that matter are a rounding error in the noise.
The Shape of a Run - medium - Two log lines bracket every process. Pair them and the fleet's rhythm appears.
Marketplace Sales Warehouse - hard - No schema given. The interviewer is watching.
Metric Definition Reverse Engineering - hard - Five numbers on a dashboard. Your job: figure out where they come from.
Movie Streaming Analytics Schema - medium - They pressed play. What happened next is the whole question.
Multiplayer Game Match History - medium - Millions of matches. The leaderboard refreshes in fifteen minutes.
Online Marketplace - Seller Payouts - hard - The buyer paid one number. The seller got a different one.
The Retail Blueprint - medium - One business. A thousand transactions. Only one layout survives the analytics layer.
The Last Mile - medium - Order placed. Now track it to the door.
POS Sales Data Warehouse - medium - Every beep at the register. Coupons, returns, all of it.
Property Booking Platform - hard - Five-star listing. Three-star reality.
Retailer Data Warehouse Design - medium - Queries are crawling. The analysts are not happy.
Ride-Sharing Platform Schema - medium - Riders, drivers, and fares. Everyone takes a cut.
The Sales Architecture - medium - Numbers are easy. Making them queryable at scale is the real job.
The Customer Who Changed - hard - She moved. She upgraded. She became someone new. The record has to keep up.
The Endless Thread - medium - Follows, likes, replies to replies. It never stops.
Two Wallets - medium - Two user types. Multiple payment methods. One messy billing table.
When the Music Stops - medium - Subscribers are leaving. The data knows why.
The Heat of the Map - hard
Telecom Network Connectivity Warehouse - hard - One device goes down. The ripple keeps going.
The Celebrity Problem - medium - One post. A million notifications. Something has to give.
The Churner Who Came Back - hard - They cancelled. They came back. The report has to tell both stories correctly.
The Gaps Between Clicks - hard
The JSON Files That Became a Data Mart - medium - Three semi-structured inputs. One queryable warehouse.
The League With Too Many Loyalties - hard - A player can belong to many teams. The schema must agree.
The Other Seat - hard
The Person They Were Then - easy
The Plan That Changed Twice This Month - medium - Subscribers come, go, downgrade, and share. The schema has to keep up.
The Retail Tables That Need a New Home - medium - A working system. Now redesign it so the analysts can actually use it.
The Schema That Could Not Answer Back - hard - Forty columns in. Zero useful answers out.
The Slow Yes - hard
The Table That Lies - medium - Every query comes out wrong. The data is all there.
The Talent Funnel - medium - Thousands applied. One accepted. Where did the rest go?
The Territory That Keeps Moving - hard - Reps get reassigned. The receipts have to survive.
The Transfer Request - medium - Apply, wait, get approved or denied. Track all of it.
Three-Sided Marketplace Delivery Schema - hard - One order. Two deliveries. Revenue counted twice. Where is the bug in your schema?
Toll Road Sensor Analytics - easy - Cars enter, cars exit. Except when they don't.
Trending Dishes Dashboard - medium - What's everyone eating? The answer changes hourly.
Who Comes Back - medium