Aggregating: Intermediate
Uber's operations team uses HAVING and complex GROUP BY queries to identify which cities have enough high-rated drivers to support a surge in demand, filtering out markets where the average driver rating falls below threshold or where completed trips per driver drop under a minimum volume. That kind of analysis, where you aggregate first and then filter on the result of that aggregation, is something WHERE cannot do on its own. HAVING is what makes it possible, and mastering it is what turns basic aggregation into the tool that drives real operational decisions.
Row-level vs aggregate behavior
Know when a query collapses rows
COUNT is the most common aggregate function, and its two forms look nearly identical. Run them on the same column and they can return different numbers, with no error to tell you why.
Victor Huang flagged it this morning: five engineers on the headcount report, three on the compensation report. He has a leadership sync at two o'clock and he needs a number he can defend. Nobody can explain the gap.
Ten employees. Three departments. Not every row looks the same.
| id | name | department | salary | bonus |
|---|---|---|---|---|
| 1 | Chen | Engineering | 85000 | 6000 |
| 2 | Patel | Engineering | 92000 | NULL |
| 3 | Kim | Engineering | 78000 | 4000 |
| 4 | Osei | Engineering | 88000 | 5000 |
| 5 | Ruiz | Engineering | 74000 | NULL |
| 6 | Tanaka | Marketing | 71000 | 3000 |
| 7 | Adams | Marketing | 82000 | 4000 |
| 8 | Silva | Marketing | 76000 | 2000 |
| 9 | Lee | Sales | 68000 | 2000 |
| 10 | Costa | Sales | 73000 | NULL |
Understanding the difference between row-level and aggregate functions is essential for writing correct SQL. Functions like SUM(salary * 1.1) work because the row-level expression (salary * 1.1) is wrapped inside an aggregate. Let's understand why this matters and why the reverse doesn't work.
Understanding Differences
Two Worlds of SQL
- Evaluate once per row, independently
- Examples: =, >, <, +, -, *, CASE WHEN
- Input: values from the current row
- Output: one value per row
- Evaluate once per group (or entire table)
- Examples: COUNT, SUM, AVG, MIN, MAX
- Input: all values from all rows in group
- Output: one value per group
How They Work Together
- 1. For EACH row: Calculate salary * 1.1 (row-level math)
- 2. Group the results by department
- 3. For EACH group:
SUMall the calculated values (aggregate) - 4. Return one row per group
The row-level expression (salary * 1.1) is evaluated independently for every single row. Then SUM collects all those individual results and adds them up per group.
CASE in Aggregates
This is why CASE WHEN works inside aggregates. CASE WHEN is a row-level expression, so it evaluates for each row first:
- 1. Row 1 (Alice, manager):
CASEevaluates to 120000 - 2. Row 2 (Bob, not manager):
CASEevaluates to 0 - 3. Row 3 (Carol, manager):
CASEevaluates to 95000 - 4. Row 4 (Dave, not manager):
CASEevaluates to 0 - 5.
SUMadds up the evaluated values per department
The CASE expression produces a concrete number for each row. SUM doesn't know or care that the number came from a CASE. It just sums whatever values it receives.
Patterns and Pitfalls
Reverse Doesn't Work
Now here's the crucial insight: you cannot put an aggregate inside a row-level expression in the SELECT clause. This is a syntax error:
- IF/
CASEis evaluated per row, before grouping SUMneeds all rows in a group to calculate- You can't use
SUM's result beforeSUMhas finished - The row-level expression runs first, but needs aggregate result
The Correct Pattern
The Mental Model
- 1.
FROM: Get the rows - 2.
WHERE: Filter rows (row-level, can't use aggregates) - 3.
GROUP BY: Organize into groups - 4.
HAVING: Filter groups (can use aggregates) - 5.
SELECT: Calculate results (row-level expressions first, then aggregates) - 6.
ORDER BY: Sort results (can use aggregates and aliases)
Row-level expressions in SELECT are evaluated as the aggregate scans each row. The aggregate collects these intermediate results. You can nest row-level inside aggregate (row happens during the scan), but not aggregate inside row-level (aggregate needs the full scan to complete first).
Practical Applications
Each aggregate receives row-level evaluated values:
COUNT(*) receives 1 for every row
SUM receives 1 for returns, 0 for non-returns
AVG receives amount for discounted orders, NULL for full-price (which AVG ignores)
This pattern lets you calculate multiple different metrics in a single pass through the data.
COUNT Deep Dive
COUNT is deceptively simple. Understanding its variations and NULL handling helps you write accurate data quality queries.
COUNT(*) Deep Dive
COUNT(*) seems simple, but understanding its nuances helps you write better queries. Let's explore what it actually does and when to use alternatives.
COUNT(*) Behavior
COUNT(*) counts rows, period. It doesn't look at column values. It doesn't care about NULLs. If a row exists in the result set, it gets counted.
Consider this table with some NULL values:
| id | name | |
|---|---|---|
| 1 | Alice | alice@example.com |
| 2 | Bob | NULL |
| 3 | Carol | carol@example.com |
| 4 | NULL | NULL |
| 5 | Eve | eve@example.com |
COUNT(*) returns 5 (all rows exist). COUNT(name) returns 4 (one NULL name). COUNT(email) returns 3 (two NULL emails). NULL represents missing values.
COUNT(*) Best Practices
You might see COUNT(1) in legacy code or tutorials. It's functionally identical to COUNT(*) since both count rows. Use COUNT(*) for clarity because it's the standard convention and clearly expresses "count all rows."
When to Use COUNT(column)
Use COUNT(column) when you specifically need to count non-NULL values. Common use cases:
Which COUNT variant answers "how many rows have a value in the email column"?
COUNT variants have different performance characteristics depending on how the database processes them.
COUNT(*)is highly optimized and often uses index metadataCOUNT(column)must check each value forNULLCOUNT(DISTINCT column)is expensive because it builds a hash set- Prefer
COUNT(*)when you just need row counts
> Complete this query to find each department's total metric count and how many have a non-null value.
SELECT department, COUNT() AS total, COUNT() AS with_value FROM employee_metrics GROUP BY department
CASE WHEN inside SUM or AVG is essentially conditional aggregation: you decide, row by row, whether a value should contribute to the aggregate. This single pattern replaces many common reporting subqueries.
CASE WHEN branches when writing complex conditional aggregations.Understanding cardinality
Predict how joins change your row count
Choose effective GROUP BY columns (low cardinality = fewer groups)
Predict query performance (high cardinality = more memory for DISTINCT)
Design efficient indexes (high cardinality columns make better indexes)
Identify data quality issues (unexpected cardinality suggests problems)
Low cardinality columns like status, region, or category are ideal for GROUP BY. High cardinality columns like user_id or order_id are better for filtering or joining.
Which column would produce fewer groups in a GROUP BY?
> Complete this query to find how many different customers placed orders in each region, with the busiest region first.
SELECT region, ( ) AS unique_customers FROM orders GROUP BY region unique_customers DESC
Cardinality awareness shapes how you write GROUP BY queries. Low-cardinality columns like status or region produce compact, fast results. High-cardinality columns like user_id or order_id produce row-per-entity results that are better suited to filtering or joining.
COUNT(DISTINCT) is the primary tool for measuring cardinality directly in SQL. It tells you exactly how many unique values exist, which is essential for data profiling and deduplication work.
COUNT(DISTINCT column) to understand its cardinality. A column with 5 million distinct values will produce a 5-million-row result set, which is rarely what you want from a GROUP BY.APPROX_DISTINCT (HyperLogLog)
Count unique values in massive datasets
COUNT(DISTINCT) is essential for measuring cardinality, answering questions like how many unique users, sessions, or products. But on massive datasets, it becomes a performance bottleneck. APPROX_DISTINCT solves this with probabilistic counting that's fast at any scale.
The Scaling Problem
COUNT(DISTINCT) Problem
- Memory pressure: Hash set size grows linearly with unique values
- CPU overhead: Every row requires a hash computation and lookup
- Parallelism complexity: Merging results from parallel workers is expensive
- Time cost: Can take minutes on billion-row tables
In a distributed system, COUNT(DISTINCT) requires shuffling data to ensure the same value is processed by the same worker, adding network overhead to an already expensive operation.
How APPROX_DISTINCT Works
APPROX_DISTINCT uses HyperLogLog (HLL), a probabilistic algorithm that estimates cardinality using a fixed amount of memory regardless of input size. The key insight: instead of storing every value, it stores a compact "sketch" that can estimate the count.
Complexity Comparison
The performance difference between COUNT(DISTINCT) and APPROX_DISTINCT grows dramatically with data size. Here's how they compare as cardinality increases:
- 10K unique values: COUNT(DISTINCT) ~12 KB vs APPROX_DISTINCT ~12 KB (1x)
- 100K unique values: ~120 KB vs ~12 KB (2-3x speedup)
- 1M unique values: ~1.2 MB vs ~12 KB (5-10x speedup)
- 10M unique values: ~12 MB vs ~12 KB (20-50x speedup)
- 100M unique values: ~120 MB vs ~12 KB (50-100x speedup)
- 1B unique values: ~1.2 GB vs ~12 KB (100x+ speedup)
Notice that COUNT(DISTINCT) memory grows linearly with cardinality, while APPROX_DISTINCT stays constant at approximately 12 KB (the HLL sketch size). The speedup column shows typical query time improvement.
Choosing the Right Approach
When to Use Each
- Financial reporting:
COUNT(DISTINCT). Exact numbers required - Dashboard metrics:
APPROX_DISTINCT. Speed over precision - Trend analysis:
APPROX_DISTINCT. Relative change matters - Real-time monitoring:
APPROX_DISTINCT. Latency critical - A/B test analysis:
COUNT(DISTINCT). Statistical rigor needed - Log analysis:
APPROX_DISTINCT. Volume too high for exact
Scale is the deciding factor: under 1 million rows, COUNT(DISTINCT) runs fast enough that there is no reason to sacrifice precision. For 1–100 million rows, switch to APPROX_DISTINCT when query latency is a constraint. Beyond 100 million rows, strongly prefer APPROX_DISTINCT unless exactness is a hard requirement. Real-time dashboards should almost always use APPROX_DISTINCT to keep response times below the perceptible threshold.
HyperLogLog Internals
- Constant memory: Fixed size regardless of input cardinality
- Mergeable: Sketches can be combined for distributed computing
- Streaming: Processes data in a single pass
- Order-independent: Same result regardless of input order
> Complete this query to efficiently estimate the number of unique visitors per page.
SELECT page_url, () AS unique_visitors FROM page_views GROUP BY page_url
APPROX_DISTINCT uses constant memory regardless of how many unique values exist, making it ideal for billion-row tables.
Use COUNT(DISTINCT) when precision matters, such as financial reporting or compliance metrics where every count must be exact.
ORDER BY with aggregates
Rank groups by their aggregate values
Aggregated results are often more useful when sorted. ORDER BY lets you arrange your groups by any column or aggregate value: highest revenue first, most recent activity on top, or alphabetically by name.
Basic Sorting Patterns
ORDER BY works with aggregate columns just like regular columns. Sort by metrics to surface top or bottom performers.
Sorting by Aggregates
The result shows your top customers ranked by total spending. DESC means descending (highest first). Use ASC for ascending (lowest first), which is the default if you don't specify.
Using Column Aliases
You can reference your aggregate columns by their alias in ORDER BY. This makes queries more readable:
Top-N Queries
Combining ORDER BY with LIMIT creates powerful reporting patterns for finding top or bottom performers.
LIMIT with ORDER BY
Combine ORDER BY with LIMIT to get "top N" or "bottom N" results. This is one of the most common reporting patterns.
Execution Order
Understanding when ORDER BY executes explains why aggregate aliases work in sorting.
- 1.
FROM: Get the table - 2.
WHERE: Filter rows - 3.
GROUP BY: Create groups - 4.
HAVING: Filter groups - 5.
SELECT: Calculate aggregates - 6.
ORDER BY: Sort results - 7.
LIMIT: Truncate to N rows
ORDER BY runs near the end, after aggregation is complete. This is why you can sort by aggregate values since they exist by this point.
> Complete this query to rank regions from highest to lowest total cloud spend.
SELECT region, SUM(amount) AS total_spend FROM cloud_costs GROUP BY region SUM(amount)
ORDER BY on aggregated results is how raw group data becomes an actionable ranked list. Top-N reports, leaderboards, and worst-performer alerts all rely on this pattern.
Combining ORDER BY with LIMIT is one of the most efficient patterns in analytical SQL. The database can often short-circuit the sort once it has identified the top N values, rather than sorting the entire result set.
ORDER BY rather than repeating the full aggregate expression. It makes queries shorter, easier to read, and less prone to inconsistency if you later change the aggregate.APPROX_PERCENTILE and ARBITRARY
Estimate percentiles at warehouse scale
Percentile Calculations
APPROX_PERCENTILE
APPROX_PERCENTILE provides fast percentile calculations. It takes the column and the percentile as a decimal (0.5 for median, 0.95 for p95):
Median (p50)
Percentile Use Cases
- p50 (median): Typical value, robust against outliers
- p75: Upper quartile, where top 25% begins
- p90: Most users' experience (90% faster than this)
- p95: Service level agreement (SLA) threshold
- p99: Tail latency, worst-case scenarios
The ARBITRARY Function
ARBITRARY is a specialized aggregate for when you need any value from a group, particularly when all values are identical.
The ARBITRARY Function
ARBITRARY is an aggregate function that returns any value from a group. Unlike MIN, MAX, or AVG, which compute a specific result, ARBITRARY simply picks one value and returns it, and that value could be different between query executions.
This might sound useless at first. Why would you want a random value? The answer lies in performance and intent: when you genuinely don't care which value you get, ARBITRARY is dramatically faster than alternatives.
Why ARBITRARY Exists
Syntax and Basic Usage
Without ARBITRARY, you'd get an error: "user_name must appear in GROUP BY or be used in an aggregate function." Using ARBITRARY tells the database "I know all values are the same, just give me one."
How ARBITRARY Works
Watch how ARBITRARY processes rows. When a new group appears, it takes the first value encountered. Subsequent rows for that group are ignored since the value is already set.
| region | manager |
|---|---|
| North | Alice |
| South | Bob |
| North | Carol |
| East | Diana |
| South | Eve |
| East | Frank |
| region | any_manager |
|---|
Performance: O(1) vs O(n)
ARBITRARY has O(1) time complexity per group. It stores one value and ignores subsequent values. Compare this to MIN or MAX, which must compare every value to find the extreme, an O(n) operation where n is the group size.
ARBITRARY runs in O(1): it stores the first value it encounters and ignores the rest. MIN and MAX are O(n): they must compare every value in the group to find the extreme. SUM, COUNT, and AVG also run in O(n), processing every row per group. On billion-row tables this constant-time vs linear distinction becomes significant.
- Use conditional counting for single-pass multi-metric analysis
- Prefer COUNT(*) for simple row counts because it's optimized
- Sort by the metric that matters most to your analysis
- Use median alongside mean to detect skewed distributions
- Add LIMIT when you only need top/bottom results
- Use APPROX_DISTINCT for cardinality on large datasets
- Don't use COUNT(1) over COUNT(*) because there's no benefit and it's less clear
- Don't forget ORDER BY because unsorted aggregates are harder to read
- Don't rely only on averages because percentiles reveal the full picture
- Avoid exact distinct counts on billion-row tables when approximate suffices
- Avoid exact percentiles on billion-row tables and use approximate instead
> Complete this query to find the median response time per service endpoint.
SELECT endpoint, (latency, ) AS median_ms FROM api_calls GROUP BY endpoint
The median (0.5 percentile) is more representative than AVG for skewed distributions like response times or salaries.
ARBITRARY returns any single value from a group in O(1) time, useful when you need a representative value but do not care which specific one.
> You are a data engineer at Airbnb building a nightly metrics pipeline that computes booking performance across listing categories at scale. The query must handle hundreds of millions of rows without blowing memory budgets on exact distinct counts.
date_trunc compute derived fields before aggregate functions collect them per category.APPROX_DISTINCT replaces COUNT(DISTINCT ...) across 500M rows, cutting memory usage from gigabytes to kilobytes.ORDER BY total_bookings DESC with LIMIT 10 on the aggregated result surfaces the top-performing listing categories.APPROX_PERCENTILE(0.5) on payout_amount reveals the median booking value, unaffected by outlier luxury listings.COUNT(*) is fastest: counts rows without checking column valuesCOUNT(column) excludes NULLs; COUNT(DISTINCT column) counts unique non-NULL valuesAPPROX_DISTINCT uses HyperLogLog for ~3% error with O(1) memory at any scaleORDER BY + LIMIT on aggregates gives top/bottom N patternsAPPROX_PERCENTILE(0.5) is median; P95/P99 reveal distribution tailsARBITRARY returns any value from a group in O(1) time (useful for denormalization)NULL values; empty groups return NULLAVG lied to you. Meet the percentiles.
- Category
- SQL
- Difficulty
- intermediate
- Duration
- 33 minutes
- Challenges
- 0 hands-on challenges
Topics covered: Row-level vs aggregate behavior, Understanding cardinality, APPROX_DISTINCT (HyperLogLog), ORDER BY with aggregates, APPROX_PERCENTILE and ARBITRARY
Lesson Sections
- Row-level vs aggregate behavior
Understanding Differences Row-level and aggregate operations follow different execution models. Grasping this distinction is key to writing correct SQL. Two Worlds of SQL SQL has two fundamentally different kinds of operations: How They Work Together Here's the key insight: row-level operations happen first, for each row individually. Then aggregate functions collect those results and summarize them. Consider this query: Here's the execution order: CASE in Aggregates Patterns and Pitfalls Unders
- Understanding cardinality
Cardinality means the number of unique values in a column. It's a fundamental concept for understanding your data and optimizing queries. Understanding cardinality helps you: Try counting distinct customers per region and sorting to find the busiest regions first.
- APPROX_DISTINCT (HyperLogLog) (concepts: sqlApproxAgg)
The Scaling Problem Exact distinct counts become expensive as data grows. Understanding the cost helps you choose the right approach. COUNT(DISTINCT) Problem To count distinct values exactly, the database must track every unique value it has seen. This requires building a hash set in memory that grows with cardinality. For a table with 1 billion rows and 100 million unique values, that hash set can consume gigabytes of memory. How APPROX_DISTINCT Works The approximate counts are within 0.3% of e
- ORDER BY with aggregates
Basic Sorting Patterns Sorting by Aggregates The most common use case: sort groups by their aggregated metric to find the top or bottom performers. Using Column Aliases Top-N Queries LIMIT with ORDER BY This shows your top 5 best-selling products by units sold. Perfect for dashboards, reports, and quick insights. Execution Order
- APPROX_PERCENTILE and ARBITRARY
Averages can be misleading. If one customer spends $100,000 and ninety-nine customers spend $100 each, the average is $1,089.90, which is far higher than what a typical customer spends. Percentiles tell you what typical really looks like. A percentile tells you what value a certain percentage of data falls below. The 50th percentile (median) is where half the values are below and half are above. The 95th percentile tells you the value that 95% of data falls below, which is useful for understandi