Pre-Aggregation
Why Pre-Aggregate?
Make dashboards load in seconds, not minutes
The Performance Problem
- Every dashboard load scans 500M rows
- 30-second query time
- $5 per TB scanned, 10 queries per load
- Users stop checking the dashboard
- Dashboard reads 365-row summary table
- Millisecond query time
- Negligible scan cost
- Users check the dashboard 10x per day
The Tradeoff
| Approach | Freshness | Cost | Complexity |
|---|---|---|---|
| No pre-aggregation | Real-time (queries hit raw data) | High (full-table scans) | Low (no summary tables to maintain) |
| Nightly summary tables | T+1 (yesterday's data) | Low (365 rows per year) | Medium (ETL job to refresh) |
| Streaming aggregation | Near-real-time (seconds) | Medium (streaming infrastructure) | High (Flink/Spark Structured Streaming) |
| Materialized views (auto-refresh) | Minutes to hours | Medium (depends on platform) | Low (database handles refresh) |
Metric Types and Additivity
Know which metrics you can safely roll up
Not Every Metric Can Be Pre-Aggregated
The Average-of-Averages Trap
| Metric | Additive? | Pre-Aggregate? | Store Instead |
|---|---|---|---|
| Revenue | Fully | Yes, SUM directly | SUM(revenue) |
| Order count | Fully | Yes, COUNT directly | COUNT(*) |
| Average order value | Non-additive | No | SUM(revenue) and COUNT(orders) separately |
| Conversion rate | Non-additive | No | SUM(conversions) and SUM(impressions) separately |
| Account balance | Semi-additive | Yes, but only per snapshot | One row per account per day, SUM across accounts only |
| Distinct users | Non-additive | Approximate only | HyperLogLog sketch (mergeable) |
OLAP Cubes & Rollups
Build multi-dimensional summary tables
Pre-Computing Every Dimension Combination
ROLLUP vs CUBE vs GROUPING SETS
| Syntax | What It Computes | Row Count |
|---|---|---|
| GROUP BY ROLLUP(year, quarter, month) | Hierarchical subtotals: month, quarter, year, grand total | Moderate (hierarchical only) |
| GROUP BY CUBE(product, region, month) | All possible combinations including subtotals | High (2^n combinations) |
| GROUP BY GROUPING SETS((product, region), (product), ()) | Only the specific combinations you list | Controlled (you choose) |
When Cubes Are Worth It
Granularity Design
Choose the right summary level for each use
Choosing the Right Summary Level
Multiple Summary Levels
| Summary Level | Rows per Year | Refresh Frequency | Use Case |
|---|---|---|---|
| Hourly by product | 8,760 x products | Hourly | Real-time operations dashboard |
| Daily by product by region | 365 x products x regions | Nightly | Standard analytics dashboard |
| Monthly by region | 12 x regions | Monthly | Executive summary |
| Yearly total | 1 | Annually | Board reports |
A common mistake: building a summary table at a granularity that no consumer uses. If nobody filters by hour, an hourly summary wastes storage and refresh compute. Build summaries based on actual query patterns, not theoretical flexibility.
Refresh Strategies & Materialized Views
Keep aggregated data current
Full Refresh vs Incremental
- Delete and rebuild the entire summary table
- Simple: no state tracking
- Correct: no accumulated bugs
- Expensive: re-scans the entire fact table
- Best for: small to medium fact tables
- Only process new/changed rows since last refresh
- Complex: must track what has been processed
- Risk: bugs accumulate if logic is wrong
- Cheap: proportional to change volume, not table size
- Best for: large fact tables with append-only writes
Materialized Views
| Platform | Auto-Refresh? | Incremental? | Notes |
|---|---|---|---|
| BigQuery | Yes (up to every 30 min) | Yes | Best materialized view support. Handles joins and aggregations. |
| Snowflake | Yes (automatic) | Yes | Called 'Dynamic Tables.' Specify lag time. |
| Redshift | Manual (REFRESH) | Yes (since 2023) | Must schedule refresh via Airflow or cron. |
| PostgreSQL | Manual (REFRESH) | No (full refresh only) | REFRESH MATERIALIZED VIEW CONCURRENTLY avoids locks. |
When Materialized Views Are Not Enough
> The product team asks for a dashboard showing daily revenue by product category and region. The fact table has 500M rows and growing.
Pre-Aggregation
Pre-computing answers before anyone asks
- Category
- Data Modeling
- Duration
- 23 minutes
- Challenges
- 12 hands-on challenges
Topics covered: Why Pre-Aggregate?, Metric Types and Additivity, OLAP Cubes & Rollups, Granularity Design, Refresh Strategies & Materialized Views
Lesson Sections
- Why Pre-Aggregate? (concepts: dmPreAggregation)
The Performance Problem A fact table with 500 million rows. A dashboard with 10 tiles. Each tile runs a GROUP BY query that scans the entire table. Ten full-table scans every time someone opens the dashboard. At $5 per TB scanned (BigQuery pricing), that is real money. At 30 seconds per query, that is a terrible user experience. Pre-aggregation solves this by computing the answer ahead of time. Instead of scanning 500 million rows to get daily revenue, create a daily_revenue summary table with 3
- Metric Types and Additivity (concepts: dmMetricAdditivity)
Not Every Metric Can Be Pre-Aggregated This is the most important concept in pre-aggregation. Some metrics can be safely SUMmed across any dimension. Others cannot. If you pre-aggregate a non-additive metric, the numbers will be mathematically wrong and nobody will notice until someone reconciles against the source. The Average-of-Averages Trap You pre-compute average order value (AOV) per day: Monday = $50, Tuesday = $100. What is the weekly AOV? NOT ($50 + $100) / 2 = $75. If Monday had 1,000
- OLAP Cubes & Rollups (concepts: dmOlapCubes)
Pre-Computing Every Dimension Combination An OLAP cube is a summary table that pre-computes aggregates for every combination of dimensions. Revenue by product. Revenue by region. Revenue by product AND region. Revenue by product AND region AND month. If you have 3 dimensions with 10 values each, the cube has 10 x 10 x 10 = 1,000 cells. Each cell stores a pre-computed SUM. Modern cloud warehouses do not use literal cubes (multidimensional arrays). They use flat summary tables with GROUP BY ROLLUP
- Granularity Design
Choosing the Right Summary Level The granularity of your summary table determines both its size and its usefulness. A daily summary has 365 rows per year. An hourly summary has 8,760. A per-minute summary has 525,600. The finer the grain, the more flexible the queries, but the larger the table and the more it costs to refresh. The key question: what is the finest grain that any dashboard or report needs? If no dashboard filters below daily, a daily summary is sufficient. If the operations team n
- Refresh Strategies & Materialized Views
A summary table is only useful if it stays current. The refresh strategy determines how often the summary is updated, how long it takes, and what happens when it fails. This is the operationally consequential part of pre-aggregation. Full Refresh vs Incremental The production pattern: incremental refresh daily, full refresh weekly as a reconciliation check. If the full refresh produces different numbers than the incremental, you have a bug in the incremental logic. The weekly full refresh catche