Pre-Aggregation

A dashboard that takes 45 seconds to load is a dashboard nobody uses. Pre-aggregation is how you make analytical queries fast: compute the answer once, store the result, and serve it instantly. But not every metric can be pre-aggregated safely. Sum revenue by day works. Average order value by day does not (you cannot average averages). This lesson teaches you which metrics are safe to pre-aggregate, how to design summary tables at the right granularity, and how to keep them up to date.

Why Pre-Aggregate?

Daily Life

Interviews

Make dashboards load in seconds, not minutes

The Performance Problem

A fact table with 500 million rows. A dashboard with 10 tiles. Each tile runs a GROUP BY query that scans the entire table. Ten full-table scans every time someone opens the dashboard. At $5 per TB scanned (BigQuery pricing), that is real money. At 30 seconds per query, that is a terrible user experience.

Pre-aggregation solves this by computing the answer ahead of time. Instead of scanning 500 million rows to get daily revenue, create a daily_revenue summary table with 365 rows per year. The dashboard reads 365 rows instead of 500 million. Load time drops from 30 seconds to milliseconds.

✗No Pre-Aggregation

Every dashboard load scans 500M rows
30-second query time
$5 per TB scanned, 10 queries per load
Users stop checking the dashboard

✓With Pre-Aggregation

Dashboard reads 365-row summary table
Millisecond query time
Negligible scan cost
Users check the dashboard 10x per day

The Tradeoff

Pre-aggregation is a deliberate tradeoff between freshness and performance. The summary table is only as current as its last refresh. If the daily_revenue table is refreshed overnight, the dashboard shows yesterday's numbers, not today's. If the business needs real-time numbers, you need a different approach (streaming aggregation, materialized views with auto-refresh).

Approach	Freshness	Cost	Complexity
No pre-aggregation	Real-time (queries hit raw data)	High (full-table scans)	Low (no summary tables to maintain)
Nightly summary tables	T+1 (yesterday's data)	Low (365 rows per year)	Medium (ETL job to refresh)
Streaming aggregation	Near-real-time (seconds)	Medium (streaming infrastructure)	High (Flink/Spark Structured Streaming)
Materialized views (auto-refresh)	Minutes to hours	Medium (depends on platform)	Low (database handles refresh)

TIP

Start with nightly summary tables. They cover 90% of dashboard use cases. Only invest in streaming aggregation when the business has a proven need for sub-minute freshness.

Metric Types and Additivity

Daily Life

Interviews

Know which metrics you can safely roll up

Not Every Metric Can Be Pre-Aggregated

This is the most important concept in pre-aggregation. Some metrics can be safely SUMmed across any dimension. Others cannot. If you pre-aggregate a non-additive metric, the numbers will be mathematically wrong and nobody will notice until someone reconciles against the source.

Fully AdditiveSemi-AdditiveNon-Additive

Fully Additive

Can SUM across every dimension

Revenue, quantity, cost, discount amount. SUM by product, by region, by month, by any combination. Always correct.

Semi-Additive

Can SUM across some dimensions, not time

Account balance, inventory level, headcount. SUM across accounts on one day is correct. SUM across days is meaningless (you are adding up daily snapshots).

Non-Additive

Cannot SUM across any dimension

Unit price, conversion rate, average order value. These must be computed from additive components (SUM(revenue) / COUNT(orders)), never pre-stored.

The Average-of-Averages Trap

You pre-compute average order value (AOV) per day: Monday = $50, Tuesday = $100. What is the weekly AOV? NOT ($50 + $100) / 2 = $75. If Monday had 1,000 orders and Tuesday had 10 orders, the true AOV is ($50,000 + $1,000) / 1,010 = $50.50. Averaging the daily averages gives $75, which is 48% wrong.

The fix: never store ratios or averages in summary tables. Store the additive components: SUM(revenue) and COUNT(orders). Compute the ratio at query time: SUM(revenue) / SUM(orders). This is correct at every rollup level.

daily_revenue_summary

PKdate_skINT

total_revenueNUMERIC

order_countINT

fact_orders

PKorder_idBIGINT

FKdate_skINT

revenueNUMERIC

Pre-aggregation: daily_revenue_summary rolls 500M fact rows down to 365/year, storing ADDITIVE components (total_revenue, order_count) not ratios. AOV = SUM(total_revenue)/SUM(order_count) stays correct at every rollup.

Metric	Additive?	Pre-Aggregate?	Store Instead
Revenue	Fully	Yes, SUM directly	SUM(revenue)
Order count	Fully	Yes, COUNT directly	COUNT(*)
Average order value	Non-additive	No	SUM(revenue) and COUNT(orders) separately
Conversion rate	Non-additive	No	SUM(conversions) and SUM(impressions) separately
Account balance	Semi-additive	Yes, but only per snapshot	One row per account per day, SUM across accounts only
Distinct users	Non-additive	Approximate only	HyperLogLog sketch (mergeable)

TIP

The rule: store additive components, not derived ratios. If you store clicks and impressions, you can always compute CTR. If you store only CTR, you cannot re-aggregate it correctly at a different granularity.

OLAP Cubes & Rollups

Daily Life

Interviews

Build multi-dimensional summary tables

Pre-Computing Every Dimension Combination

An OLAP cube is a summary table that pre-computes aggregates for every combination of dimensions. Revenue by product. Revenue by region. Revenue by product AND region. Revenue by product AND region AND month. If you have 3 dimensions with 10 values each, the cube has 10 x 10 x 10 = 1,000 cells. Each cell stores a pre-computed SUM.

Modern cloud warehouses do not use literal cubes (multidimensional arrays). They use flat summary tables with GROUP BY ROLLUP or GROUP BY CUBE to generate all combinations. But the concept is the same: pre-compute aggregates at every granularity so any dashboard filter combination is answered instantly.

ROLLUP vs CUBE vs GROUPING SETS

Syntax	What It Computes	Row Count
GROUP BY ROLLUP(year, quarter, month)	Hierarchical subtotals: month, quarter, year, grand total	Moderate (hierarchical only)
GROUP BY CUBE(product, region, month)	All possible combinations including subtotals	High (2^n combinations)
GROUP BY GROUPING SETS((product, region), (product), ())	Only the specific combinations you list	Controlled (you choose)

CUBE generates 2^n grouping combinations where n is the number of dimensions. With 5 dimensions, that is 32 combinations. With 10 dimensions, that is 1,024. At some point, the cube becomes larger than the source fact table, defeating the purpose. GROUPING SETS let you specify exactly which combinations matter.

When Cubes Are Worth It

Pre-computed cubes shine when the same dimension combinations are queried frequently (dashboards with fixed filter sets), the fact table is large (billions of rows), and the number of dimensions is manageable (3-5). They are overkill when queries are ad-hoc (unpredictable dimension combinations), the fact table is small, or the number of dimensions is high (combinatorial explosion).

Granularity Design

Daily Life

Interviews

Choose the right summary level for each use

Choosing the Right Summary Level

The granularity of your summary table determines both its size and its usefulness. A daily summary has 365 rows per year. An hourly summary has 8,760. A per-minute summary has 525,600. The finer the grain, the more flexible the queries, but the larger the table and the more it costs to refresh.

The key question: what is the finest grain that any dashboard or report needs? If no dashboard filters below daily, a daily summary is sufficient. If the operations team needs hourly metrics, you need hourly granularity. Do not build minute-level summaries unless someone actually queries by minute.

Multiple Summary Levels

Production warehouses typically maintain multiple summary tables at different granularities. A daily summary for operational dashboards. A monthly summary for executive reports. A yearly summary for board presentations. Each is derived from the one below it or directly from the fact table.

Summary Level	Rows per Year	Refresh Frequency	Use Case
Hourly by product	8,760 x products	Hourly	Real-time operations dashboard
Daily by product by region	365 x products x regions	Nightly	Standard analytics dashboard
Monthly by region	12 x regions	Monthly	Executive summary
Yearly total	1	Annually	Board reports

TIP

Build the finest summary that any consumer needs. Coarser summaries can be derived from it. Do not build multiple independent summaries from the fact table if they can be cascaded: daily -> monthly -> yearly. Cascade reduces compute cost and guarantees consistency.

A common mistake: building a summary table at a granularity that no consumer uses. If nobody filters by hour, an hourly summary wastes storage and refresh compute. Build summaries based on actual query patterns, not theoretical flexibility.

Refresh Strategies & Materialized Views

Daily Life

Interviews

Keep aggregated data current

A summary table is only useful if it stays current. The refresh strategy determines how often the summary is updated, how long it takes, and what happens when it fails. This is the operationally consequential part of pre-aggregation.

Full Refresh vs Incremental

✓Full Refresh

Delete and rebuild the entire summary table
Simple: no state tracking
Correct: no accumulated bugs
Expensive: re-scans the entire fact table
Best for: small to medium fact tables

✓Incremental Refresh

Only process new/changed rows since last refresh
Complex: must track what has been processed
Risk: bugs accumulate if logic is wrong
Cheap: proportional to change volume, not table size
Best for: large fact tables with append-only writes

The production pattern: incremental refresh daily, full refresh weekly as a reconciliation check. If the full refresh produces different numbers than the incremental, you have a bug in the incremental logic. The weekly full refresh catches it.

Materialized Views

A materialized view is a database-managed summary table. You define the query, and the database handles storage and refresh. This is the simplest form of pre-aggregation because you do not need to build or schedule ETL jobs. The database does it for you.

Platform	Auto-Refresh?	Incremental?	Notes
BigQuery	Yes (up to every 30 min)	Yes	Best materialized view support. Handles joins and aggregations.
Snowflake	Yes (automatic)	Yes	Called 'Dynamic Tables.' Specify lag time.
Redshift	Manual (REFRESH)	Yes (since 2023)	Must schedule refresh via Airflow or cron.
PostgreSQL	Manual (REFRESH)	No (full refresh only)	REFRESH MATERIALIZED VIEW CONCURRENTLY avoids locks.

When Materialized Views Are Not Enough

Materialized views handle simple aggregations well but struggle with complex transformations: multi-step joins, window functions, conditional logic, or cross-database references. For these, you need a proper ETL pipeline (dbt model, Spark job) that computes the summary and writes it to a table.

The decision framework: use materialized views for simple, single-source aggregations that the database can optimize. Use ETL-managed summary tables for complex, multi-source, multi-step transformations.

TIP

Use materialized views as your first choice for pre-aggregation. If they cannot handle your transformation, upgrade to a dbt model or Spark job. Do not over-engineer: if the database can do it, let the database do it.

❯❯❯PUTTING IT ALL TOGETHER

> The product team asks for a dashboard showing daily revenue by product category and region. The fact table has 500M rows and growing.

You create a daily_revenue_summary table with grain: (date, product_category, region). Pre-computes SUM(revenue), SUM(quantity), COUNT(orders). Does NOT store AOV.

The dashboard computes AOV at query time: SUM(revenue) / COUNT(orders). This is correct at every filter combination. A pre-stored AOV would break when the user filters by region.

The summary refreshes incrementally each night (new day's data only). A full rebuild runs every Sunday as a reconciliation check. If numbers diverge, you investigate.

KEY TAKEAWAYS

Pre-aggregate for performance: 365-row daily summary vs 500M-row fact table

Store components, not ratios: SUM(revenue) and COUNT(orders) separately; compute AOV at query time

ROLLUP for hierarchical, CUBE for all combos: use GROUPING SETS when you need specific combinations only

Build the finest grain any consumer needs: cascade coarser summaries from it for consistency

Incremental daily + full weekly: catches incremental bugs while keeping compute cost low

Pre-computing answers before anyone asks

Category: Data Modeling
Duration: 23 minutes
Challenges: 12 hands-on challenges

Topics covered: Why Pre-Aggregate?, Metric Types and Additivity, OLAP Cubes & Rollups, Granularity Design, Refresh Strategies & Materialized Views

Lesson Sections

Why Pre-Aggregate? (concepts: dmPreAggregation)
The Performance Problem A fact table with 500 million rows. A dashboard with 10 tiles. Each tile runs a GROUP BY query that scans the entire table. Ten full-table scans every time someone opens the dashboard. At $5 per TB scanned (BigQuery pricing), that is real money. At 30 seconds per query, that is a terrible user experience. Pre-aggregation solves this by computing the answer ahead of time. Instead of scanning 500 million rows to get daily revenue, create a daily_revenue summary table with 3
Metric Types and Additivity (concepts: dmMetricAdditivity)
Not Every Metric Can Be Pre-Aggregated This is the most important concept in pre-aggregation. Some metrics can be safely SUMmed across any dimension. Others cannot. If you pre-aggregate a non-additive metric, the numbers will be mathematically wrong and nobody will notice until someone reconciles against the source. The Average-of-Averages Trap You pre-compute average order value (AOV) per day: Monday = $50, Tuesday = $100. What is the weekly AOV? NOT ($50 + $100) / 2 = $75. If Monday had 1,000
OLAP Cubes & Rollups (concepts: dmOlapCubes)
Pre-Computing Every Dimension Combination An OLAP cube is a summary table that pre-computes aggregates for every combination of dimensions. Revenue by product. Revenue by region. Revenue by product AND region. Revenue by product AND region AND month. If you have 3 dimensions with 10 values each, the cube has 10 x 10 x 10 = 1,000 cells. Each cell stores a pre-computed SUM. Modern cloud warehouses do not use literal cubes (multidimensional arrays). They use flat summary tables with GROUP BY ROLLUP
Granularity Design (concepts: dmGrainDefinition)
Choosing the Right Summary Level The granularity of your summary table determines both its size and its usefulness. A daily summary has 365 rows per year. An hourly summary has 8,760. A per-minute summary has 525,600. The finer the grain, the more flexible the queries, but the larger the table and the more it costs to refresh. The key question: what is the finest grain that any dashboard or report needs? If no dashboard filters below daily, a daily summary is sufficient. If the operations team n
Refresh Strategies & Materialized Views (concepts: dmPreAggregation)
A summary table is only useful if it stays current. The refresh strategy determines how often the summary is updated, how long it takes, and what happens when it fails. This is the operationally consequential part of pre-aggregation. Full Refresh vs Incremental The production pattern: incremental refresh daily, full refresh weekly as a reconciliation check. If the full refresh produces different numbers than the incremental, you have a bug in the incremental logic. The weekly full refresh catche