Pre-Aggregation

A dashboard that takes 45 seconds to load is a dashboard nobody uses. Pre-aggregation is how you make analytical queries fast: compute the answer once, store the result, and serve it instantly. But not every metric can be pre-aggregated safely. Sum revenue by day works. Average order value by day does not (you cannot average averages). This lesson teaches you which metrics are safe to pre-aggregate, how to design summary tables at the right granularity, and how to keep them up to date.

Why Pre-Aggregate?

Daily Life
Interviews

Make dashboards load in seconds, not minutes

The Performance Problem

A fact table with 500 million rows. A dashboard with 10 tiles. Each tile runs a GROUP BY query that scans the entire table. Ten full-table scans every time someone opens the dashboard. At $5 per TB scanned (BigQuery pricing), that is real money. At 30 seconds per query, that is a terrible user experience.
Pre-aggregation solves this by computing the answer ahead of time. Instead of scanning 500 million rows to get daily revenue, create a daily_revenue summary table with 365 rows per year. The dashboard reads 365 rows instead of 500 million. Load time drops from 30 seconds to milliseconds.
No Pre-Aggregation
  • Every dashboard load scans 500M rows
  • 30-second query time
  • $5 per TB scanned, 10 queries per load
  • Users stop checking the dashboard
With Pre-Aggregation
  • Dashboard reads 365-row summary table
  • Millisecond query time
  • Negligible scan cost
  • Users check the dashboard 10x per day

The Tradeoff

Pre-aggregation is a deliberate tradeoff between freshness and performance. The summary table is only as current as its last refresh. If the daily_revenue table is refreshed overnight, the dashboard shows yesterday's numbers, not today's. If the business needs real-time numbers, you need a different approach (streaming aggregation, materialized views with auto-refresh).
ApproachFreshnessCostComplexity
No pre-aggregationReal-time (queries hit raw data)High (full-table scans)Low (no summary tables to maintain)
Nightly summary tablesT+1 (yesterday's data)Low (365 rows per year)Medium (ETL job to refresh)
Streaming aggregationNear-real-time (seconds)Medium (streaming infrastructure)High (Flink/Spark Structured Streaming)
Materialized views (auto-refresh)Minutes to hoursMedium (depends on platform)Low (database handles refresh)
TIP
Start with nightly summary tables. They cover 90% of dashboard use cases. Only invest in streaming aggregation when the business has a proven need for sub-minute freshness.

Metric Types and Additivity

Daily Life
Interviews

Know which metrics you can safely roll up

Not Every Metric Can Be Pre-Aggregated

This is the most important concept in pre-aggregation. Some metrics can be safely SUMmed across any dimension. Others cannot. If you pre-aggregate a non-additive metric, the numbers will be mathematically wrong and nobody will notice until someone reconciles against the source.
Fully AdditiveSemi-AdditiveNon-Additive
Fully Additive
Can SUM across every dimension
Revenue, quantity, cost, discount amount. SUM by product, by region, by month, by any combination. Always correct.
Semi-Additive
Can SUM across some dimensions, not time
Account balance, inventory level, headcount. SUM across accounts on one day is correct. SUM across days is meaningless (you are adding up daily snapshots).
Non-Additive
Cannot SUM across any dimension
Unit price, conversion rate, average order value. These must be computed from additive components (SUM(revenue) / COUNT(orders)), never pre-stored.

The Average-of-Averages Trap

You pre-compute average order value (AOV) per day: Monday = $50, Tuesday = $100. What is the weekly AOV? NOT ($50 + $100) / 2 = $75. If Monday had 1,000 orders and Tuesday had 10 orders, the true AOV is ($50,000 + $1,000) / 1,010 = $50.50. Averaging the daily averages gives $75, which is 48% wrong.
The fix: never store ratios or averages in summary tables. Store the additive components: SUM(revenue) and COUNT(orders). Compute the ratio at query time: SUM(revenue) / SUM(orders). This is correct at every rollup level.
//

MetricAdditive?Pre-Aggregate?Store Instead
RevenueFullyYes, SUM directlySUM(revenue)
Order countFullyYes, COUNT directlyCOUNT(*)
Average order valueNon-additiveNoSUM(revenue) and COUNT(orders) separately
Conversion rateNon-additiveNoSUM(conversions) and SUM(impressions) separately
Account balanceSemi-additiveYes, but only per snapshotOne row per account per day, SUM across accounts only
Distinct usersNon-additiveApproximate onlyHyperLogLog sketch (mergeable)
TIP
The rule: store additive components, not derived ratios. If you store clicks and impressions, you can always compute CTR. If you store only CTR, you cannot re-aggregate it correctly at a different granularity.

OLAP Cubes & Rollups

Daily Life
Interviews

Build multi-dimensional summary tables

Pre-Computing Every Dimension Combination

An OLAP cube is a summary table that pre-computes aggregates for every combination of dimensions. Revenue by product. Revenue by region. Revenue by product AND region. Revenue by product AND region AND month. If you have 3 dimensions with 10 values each, the cube has 10 x 10 x 10 = 1,000 cells. Each cell stores a pre-computed SUM.
Modern cloud warehouses do not use literal cubes (multidimensional arrays). They use flat summary tables with GROUP BY ROLLUP or GROUP BY CUBE to generate all combinations. But the concept is the same: pre-compute aggregates at every granularity so any dashboard filter combination is answered instantly.

ROLLUP vs CUBE vs GROUPING SETS

SyntaxWhat It ComputesRow Count
GROUP BY ROLLUP(year, quarter, month)Hierarchical subtotals: month, quarter, year, grand totalModerate (hierarchical only)
GROUP BY CUBE(product, region, month)All possible combinations including subtotalsHigh (2^n combinations)
GROUP BY GROUPING SETS((product, region), (product), ())Only the specific combinations you listControlled (you choose)
CUBE generates 2^n grouping combinations where n is the number of dimensions. With 5 dimensions, that is 32 combinations. With 10 dimensions, that is 1,024. At some point, the cube becomes larger than the source fact table, defeating the purpose. GROUPING SETS let you specify exactly which combinations matter.

When Cubes Are Worth It

Pre-computed cubes shine when the same dimension combinations are queried frequently (dashboards with fixed filter sets), the fact table is large (billions of rows), and the number of dimensions is manageable (3-5). They are overkill when queries are ad-hoc (unpredictable dimension combinations), the fact table is small, or the number of dimensions is high (combinatorial explosion).

Granularity Design

Daily Life
Interviews

Choose the right summary level for each use

Choosing the Right Summary Level

The granularity of your summary table determines both its size and its usefulness. A daily summary has 365 rows per year. An hourly summary has 8,760. A per-minute summary has 525,600. The finer the grain, the more flexible the queries, but the larger the table and the more it costs to refresh.
The key question: what is the finest grain that any dashboard or report needs? If no dashboard filters below daily, a daily summary is sufficient. If the operations team needs hourly metrics, you need hourly granularity. Do not build minute-level summaries unless someone actually queries by minute.

Multiple Summary Levels

Production warehouses typically maintain multiple summary tables at different granularities. A daily summary for operational dashboards. A monthly summary for executive reports. A yearly summary for board presentations. Each is derived from the one below it or directly from the fact table.
Summary LevelRows per YearRefresh FrequencyUse Case
Hourly by product8,760 x productsHourlyReal-time operations dashboard
Daily by product by region365 x products x regionsNightlyStandard analytics dashboard
Monthly by region12 x regionsMonthlyExecutive summary
Yearly total1AnnuallyBoard reports
TIP
Build the finest summary that any consumer needs. Coarser summaries can be derived from it. Do not build multiple independent summaries from the fact table if they can be cascaded: daily -> monthly -> yearly. Cascade reduces compute cost and guarantees consistency.

A common mistake: building a summary table at a granularity that no consumer uses. If nobody filters by hour, an hourly summary wastes storage and refresh compute. Build summaries based on actual query patterns, not theoretical flexibility.

Refresh Strategies & Materialized Views

Daily Life
Interviews

Keep aggregated data current

A summary table is only useful if it stays current. The refresh strategy determines how often the summary is updated, how long it takes, and what happens when it fails. This is the operationally consequential part of pre-aggregation.

Full Refresh vs Incremental

Full Refresh
  • Delete and rebuild the entire summary table
  • Simple: no state tracking
  • Correct: no accumulated bugs
  • Expensive: re-scans the entire fact table
  • Best for: small to medium fact tables
Incremental Refresh
  • Only process new/changed rows since last refresh
  • Complex: must track what has been processed
  • Risk: bugs accumulate if logic is wrong
  • Cheap: proportional to change volume, not table size
  • Best for: large fact tables with append-only writes
The production pattern: incremental refresh daily, full refresh weekly as a reconciliation check. If the full refresh produces different numbers than the incremental, you have a bug in the incremental logic. The weekly full refresh catches it.

Materialized Views

A materialized view is a database-managed summary table. You define the query, and the database handles storage and refresh. This is the simplest form of pre-aggregation because you do not need to build or schedule ETL jobs. The database does it for you.
//

PlatformAuto-Refresh?Incremental?Notes
BigQueryYes (up to every 30 min)YesBest materialized view support. Handles joins and aggregations.
SnowflakeYes (automatic)YesCalled 'Dynamic Tables.' Specify lag time.
RedshiftManual (REFRESH)Yes (since 2023)Must schedule refresh via Airflow or cron.
PostgreSQLManual (REFRESH)No (full refresh only)REFRESH MATERIALIZED VIEW CONCURRENTLY avoids locks.

When Materialized Views Are Not Enough

Materialized views handle simple aggregations well but struggle with complex transformations: multi-step joins, window functions, conditional logic, or cross-database references. For these, you need a proper ETL pipeline (dbt model, Spark job) that computes the summary and writes it to a table.
The decision framework: use materialized views for simple, single-source aggregations that the database can optimize. Use ETL-managed summary tables for complex, multi-source, multi-step transformations.
TIP
Use materialized views as your first choice for pre-aggregation. If they cannot handle your transformation, upgrade to a dbt model or Spark job. Do not over-engineer: if the database can do it, let the database do it.
PUTTING IT ALL TOGETHER

> The product team asks for a dashboard showing daily revenue by product category and region. The fact table has 500M rows and growing.

You create a daily_revenue_summary table with grain: (date, product_category, region). Pre-computes SUM(revenue), SUM(quantity), COUNT(orders). Does NOT store AOV.
The dashboard computes AOV at query time: SUM(revenue) / COUNT(orders). This is correct at every filter combination. A pre-stored AOV would break when the user filters by region.
The summary refreshes incrementally each night (new day's data only). A full rebuild runs every Sunday as a reconciliation check. If numbers diverge, you investigate.
KEY TAKEAWAYS
Pre-aggregate for performance: 365-row daily summary vs 500M-row fact table
Store components, not ratios: SUM(revenue) and COUNT(orders) separately; compute AOV at query time
ROLLUP for hierarchical, CUBE for all combos: use GROUPING SETS when you need specific combinations only
Build the finest grain any consumer needs: cascade coarser summaries from it for consistency
Incremental daily + full weekly: catches incremental bugs while keeping compute cost low

Pre-Aggregation

Pre-computing answers before anyone asks

Category
Data Modeling
Duration
23 minutes
Challenges
12 hands-on challenges

Topics covered: Why Pre-Aggregate?, Metric Types and Additivity, OLAP Cubes & Rollups, Granularity Design, Refresh Strategies & Materialized Views

Lesson Sections

  1. Why Pre-Aggregate? (concepts: dmPreAggregation)

    The Performance Problem A fact table with 500 million rows. A dashboard with 10 tiles. Each tile runs a GROUP BY query that scans the entire table. Ten full-table scans every time someone opens the dashboard. At $5 per TB scanned (BigQuery pricing), that is real money. At 30 seconds per query, that is a terrible user experience. Pre-aggregation solves this by computing the answer ahead of time. Instead of scanning 500 million rows to get daily revenue, create a daily_revenue summary table with 3

  2. Metric Types and Additivity (concepts: dmMetricAdditivity)

    Not Every Metric Can Be Pre-Aggregated This is the most important concept in pre-aggregation. Some metrics can be safely SUMmed across any dimension. Others cannot. If you pre-aggregate a non-additive metric, the numbers will be mathematically wrong and nobody will notice until someone reconciles against the source. The Average-of-Averages Trap You pre-compute average order value (AOV) per day: Monday = $50, Tuesday = $100. What is the weekly AOV? NOT ($50 + $100) / 2 = $75. If Monday had 1,000

  3. OLAP Cubes & Rollups (concepts: dmOlapCubes)

    Pre-Computing Every Dimension Combination An OLAP cube is a summary table that pre-computes aggregates for every combination of dimensions. Revenue by product. Revenue by region. Revenue by product AND region. Revenue by product AND region AND month. If you have 3 dimensions with 10 values each, the cube has 10 x 10 x 10 = 1,000 cells. Each cell stores a pre-computed SUM. Modern cloud warehouses do not use literal cubes (multidimensional arrays). They use flat summary tables with GROUP BY ROLLUP

  4. Granularity Design

    Choosing the Right Summary Level The granularity of your summary table determines both its size and its usefulness. A daily summary has 365 rows per year. An hourly summary has 8,760. A per-minute summary has 525,600. The finer the grain, the more flexible the queries, but the larger the table and the more it costs to refresh. The key question: what is the finest grain that any dashboard or report needs? If no dashboard filters below daily, a daily summary is sufficient. If the operations team n

  5. Refresh Strategies & Materialized Views

    A summary table is only useful if it stays current. The refresh strategy determines how often the summary is updated, how long it takes, and what happens when it fails. This is the operationally consequential part of pre-aggregation. Full Refresh vs Incremental The production pattern: incremental refresh daily, full refresh weekly as a reconciliation check. If the full refresh produces different numbers than the incremental, you have a bug in the incremental logic. The weekly full refresh catche