Aggregating: Intermediate

Uber's operations team uses HAVING and complex GROUP BY queries to identify which cities have enough high-rated drivers to support a surge in demand, filtering out markets where the average driver rating falls below threshold or where completed trips per driver drop under a minimum volume. That kind of analysis, where you aggregate first and then filter on the result of that aggregation, is something WHERE cannot do on its own. HAVING is what makes it possible, and mastering it is what turns basic aggregation into the tool that drives real operational decisions.

Row-level vs aggregate behavior

Daily Life
Interviews

Know when a query collapses rows

COUNT is the most common aggregate function, and its two forms look nearly identical. Run them on the same column and they can return different numbers, with no error to tell you why.

THE CASE OF THE PHANTOM COUNT

Victor Huang flagged it this morning: five engineers on the headcount report, three on the compensation report. He has a leadership sync at two o'clock and he needs a number he can defend. Nobody can explain the gap.

Ten employees. Three departments. Not every row looks the same.

employees
idnamedepartmentsalarybonus
1ChenEngineering850006000
2PatelEngineering92000NULL
3KimEngineering780004000
4OseiEngineering880005000
5RuizEngineering74000NULL
6TanakaMarketing710003000
7AdamsMarketing820004000
8SilvaMarketing760002000
9LeeSales680002000
10CostaSales73000NULL

Understanding the difference between row-level and aggregate functions is essential for writing correct SQL. Functions like SUM(salary * 1.1) work because the row-level expression (salary * 1.1) is wrapped inside an aggregate. Let's understand why this matters and why the reverse doesn't work.

Understanding Differences

Row-level and aggregate operations follow different execution models. Grasping this distinction is key to writing correct SQL.
//

Two Worlds of SQL

SQL has two fundamentally different kinds of operations:
Row-Level Operations
  • Evaluate once per row, independently
  • Examples: =, >, <, +, -, *, CASE WHEN
  • Input: values from the current row
  • Output: one value per row
Aggregate Operations
  • Evaluate once per group (or entire table)
  • Examples: COUNT, SUM, AVG, MIN, MAX
  • Input: all values from all rows in group
  • Output: one value per group
//

How They Work Together

Here's the key insight: row-level operations happen first, for each row individually. Then aggregate functions collect those results and summarize them.
Consider this query:
1SELECT
2 department,
3 SUM(salary * 1.1) AS total_with_raise
4FROM employee_metrics
5GROUP BY department
Result
departmenttotal_with_raise
Engineering594000.00
Sales316800.00
Here's the execution order:
Step-by-Step Evaluation
  • 1. For EACH row: Calculate salary * 1.1 (row-level math)
  • 2. Group the results by department
  • 3. For EACH group: SUM all the calculated values (aggregate)
  • 4. Return one row per group

The row-level expression (salary * 1.1) is evaluated independently for every single row. Then SUM collects all those individual results and adds them up per group.

//

CASE in Aggregates

This is why CASE WHEN works inside aggregates. CASE WHEN is a row-level expression, so it evaluates for each row first:

1SELECT
2 department,
3 SUM(
4 CASE
5 WHEN is_manager THEN salary
6 ELSE 0
7 END
8 ) AS manager_total
9FROM employee_metrics
10GROUP BY department
Result
departmentmanager_total
Engineering180000
Sales95000
Execution Breakdown
  • 1. Row 1 (Alice, manager): CASE evaluates to 120000
  • 2. Row 2 (Bob, not manager): CASE evaluates to 0
  • 3. Row 3 (Carol, manager): CASE evaluates to 95000
  • 4. Row 4 (Dave, not manager): CASE evaluates to 0
  • 5. SUM adds up the evaluated values per department

The CASE expression produces a concrete number for each row. SUM doesn't know or care that the number came from a CASE. It just sums whatever values it receives.

Patterns and Pitfalls

Understanding what works and what doesn't helps you avoid common mistakes when mixing row-level and aggregate operations.
//

Reverse Doesn't Work

Now here's the crucial insight: you cannot put an aggregate inside a row-level expression in the SELECT clause. This is a syntax error:

1SELECT
2 department,
3 CASE
4 WHEN SUM(salary) > 100000 THEN 'Large'
5 ELSE 'Small'
6 END AS size
7FROM employee_metrics
8GROUP BY department
Result
departmentsize
EngineeringLarge
SalesSmall
MarketingLarge
Why doesn't this work? Think about when things happen:
The Logical Problem
  • IF/CASE is evaluated per row, before grouping
  • SUM needs all rows in a group to calculate
  • You can't use SUM's result before SUM has finished
  • The row-level expression runs first, but needs aggregate result
It's like trying to know the final score of a game before it's played. The aggregate hasn't happened yet when the row-level expression needs its value.
//

The Correct Pattern

To make decisions based on aggregate results, you have two options:
Use HAVING to filter groups after aggregation. Groups that don't meet the condition are excluded entirely from the result.
1SELECT
2 department,
3 SUM(salary) AS total
4FROM employee_metrics
5GROUP BY department
6HAVING SUM(salary) > 100000
//

The Mental Model

Think of SQL execution as a pipeline:
Query Execution Pipeline
  • 1. FROM: Get the rows
  • 2. WHERE: Filter rows (row-level, can't use aggregates)
  • 3. GROUP BY: Organize into groups
  • 4. HAVING: Filter groups (can use aggregates)
  • 5. SELECT: Calculate results (row-level expressions first, then aggregates)
  • 6. ORDER BY: Sort results (can use aggregates and aliases)

Row-level expressions in SELECT are evaluated as the aggregate scans each row. The aggregate collects these intermediate results. You can nest row-level inside aggregate (row happens during the scan), but not aggregate inside row-level (aggregate needs the full scan to complete first).

Which approach correctly calculates total salary with a 10% raise per department?
SUM(salary * 1.1) applies the raise per row first, then sums. Each employee gets their own 10% raise before totaling.
1SELECT
2 department,
3 SUM(salary * 1.1) AS total_raised
4FROM employee_metrics
5GROUP BY department
//

Practical Applications

Understanding this unlocks powerful patterns. Here's a common real-world example combining multiple row-level expressions inside aggregates:
1SELECT
2 product_category,
3 COUNT(*) AS total_orders,
4 SUM(
5 CASE
6 WHEN status = 'returned' THEN 1
7 ELSE 0
8 END
9 ) AS returns,
10 AVG(
11 CASE
12 WHEN discount > 0 THEN amount
13 END
14 ) AS avg_discounted
15FROM orders
16GROUP BY product_category
Result
product_categorytotal_ordersreturnsavg_discounted
Electronics152389245.50
Clothing234123467.80
Home87645123.40

Each aggregate receives row-level evaluated values: COUNT(*) receives 1 for every row SUM receives 1 for returns, 0 for non-returns AVG receives amount for discounted orders, NULL for full-price (which AVG ignores) This pattern lets you calculate multiple different metrics in a single pass through the data.

COUNT Deep Dive

COUNT is deceptively simple. Understanding its variations and NULL handling helps you write accurate data quality queries.

//

COUNT(*) Deep Dive

COUNT(*) seems simple, but understanding its nuances helps you write better queries. Let's explore what it actually does and when to use alternatives.

//

COUNT(*) Behavior

COUNT(*) counts rows, period. It doesn't look at column values. It doesn't care about NULLs. If a row exists in the result set, it gets counted.

Consider this table with some NULL values:

idnameemail
1Alicealice@example.com
2BobNULL
3Carolcarol@example.com
4NULLNULL
5Eveeve@example.com
1SELECT
2 COUNT(*) AS total_rows,
3 COUNT(name) AS names_present,
4 COUNT(email) AS emails_present
5FROM users
Result
total_rowsnames_presentemails_present
543

COUNT(*) returns 5 (all rows exist). COUNT(name) returns 4 (one NULL name). COUNT(email) returns 3 (two NULL emails). NULL represents missing values.

//

COUNT(*) Best Practices

You might see COUNT(1) in legacy code or tutorials. It's functionally identical to COUNT(*) since both count rows. Use COUNT(*) for clarity because it's the standard convention and clearly expresses "count all rows."

//

When to Use COUNT(column)

Use COUNT(column) when you specifically need to count non-NULL values. Common use cases:

• Data quality checks: How many customers have emails? • Optional field analysis: How many orders have discount codes? • Completeness metrics: What percentage of profiles are complete?
1SELECT
2 COUNT(*) AS total_customers,
3 COUNT(phone) AS have_phone,
4 COUNT(email) AS have_email,
5 COUNT(address) AS have_address,
6 ROUND(
7 100 * COUNT(phone) / COUNT(*),
8 1
9 ) AS phone_rate
10FROM customers
Result
total_customershave_phonehave_emailhave_addressphone_rate
1000082349567712382.3
This data quality report shows that 82.3% of customers have phone numbers on file, and email capture is better than address capture.

Which COUNT variant answers "how many rows have a value in the email column"?

COUNT(*) counts every row regardless of column values. It tells you the total number of customers, not how many have emails.
1SELECT
2 COUNT(*) AS total
3FROM customers

COUNT variants have different performance characteristics depending on how the database processes them.

Database Execution
  • COUNT(*) is highly optimized and often uses index metadata
  • COUNT(column) must check each value for NULL
  • COUNT(DISTINCT column) is expensive because it builds a hash set
  • Prefer COUNT(*) when you just need row counts

> Complete this query to find each department's total metric count and how many have a non-null value.

SELECT
  department,
  COUNT(___) AS total,
  COUNT(___) AS with_value
FROM employee_metrics
GROUP BY department
*
metric_name
1
metric_value
Row-level expressions inside aggregates are a powerful composition tool. The database evaluates each row independently, then the aggregate function collects those individual results into a single summary value.

CASE WHEN inside SUM or AVG is essentially conditional aggregation: you decide, row by row, whether a value should contribute to the aggregate. This single pattern replaces many common reporting subqueries.

TIP
When combining multiple aggregate metrics in one query, always verify the results against simpler single-metric queries first. It is easy to mix up CASE WHEN branches when writing complex conditional aggregations.

Understanding cardinality

Daily Life
Interviews

Predict how joins change your row count

Cardinality means the number of unique values in a column. It's a fundamental concept for understanding your data and optimizing queries.
Status column
Status column
Values like [pending, completed, cancelled] have cardinality 3.
User ID column
User ID column
1 million unique users means cardinality of 1,000,000.
Country column
Country column
About ~200 unique values (number of countries).
Boolean column
Boolean column
Cardinality 2 (true/false) or 3 if NULLs exist.
Understanding cardinality helps you:

Choose effective GROUP BY columns (low cardinality = fewer groups) Predict query performance (high cardinality = more memory for DISTINCT) Design efficient indexes (high cardinality columns make better indexes) Identify data quality issues (unexpected cardinality suggests problems)

1SELECT
2 'status' AS column_name,
3 COUNT(DISTINCT status) AS cardinality
4FROM orders
Result
column_namecardinality
status3

Low cardinality columns like status, region, or category are ideal for GROUP BY. High cardinality columns like user_id or order_id are better for filtering or joining.

Which column would produce fewer groups in a GROUP BY?

Status has low cardinality (perhaps 3-5 values like pending, shipped, delivered). This produces very few groups, making the query fast and the results easy to read.
1SELECT
2 status,
3 COUNT(*) AS cnt
4FROM orders
5GROUP BY status
Try counting distinct customers per region and sorting to find the busiest regions first.

> Complete this query to find how many different customers placed orders in each region, with the busiest region first.

SELECT
  region,
  ___(___ ___) AS unique_customers
FROM orders
GROUP BY region
___ unique_customers DESC
SUM
COUNT
ORDER BY
customer_id
DISTINCT
GROUP BY

Cardinality awareness shapes how you write GROUP BY queries. Low-cardinality columns like status or region produce compact, fast results. High-cardinality columns like user_id or order_id produce row-per-entity results that are better suited to filtering or joining.

COUNT(DISTINCT) is the primary tool for measuring cardinality directly in SQL. It tells you exactly how many unique values exist, which is essential for data profiling and deduplication work.

TIP
Before grouping on a column you are unfamiliar with, run a quick COUNT(DISTINCT column) to understand its cardinality. A column with 5 million distinct values will produce a 5-million-row result set, which is rarely what you want from a GROUP BY.

APPROX_DISTINCT (HyperLogLog)

Daily Life
Interviews

Count unique values in massive datasets

COUNT(DISTINCT) is essential for measuring cardinality, answering questions like how many unique users, sessions, or products. But on massive datasets, it becomes a performance bottleneck. APPROX_DISTINCT solves this with probabilistic counting that's fast at any scale.

The Scaling Problem

Exact distinct counts become expensive as data grows. Understanding the cost helps you choose the right approach.
//

COUNT(DISTINCT) Problem

To count distinct values exactly, the database must track every unique value it has seen. This requires building a hash set in memory that grows with cardinality. For a table with 1 billion rows and 100 million unique values, that hash set can consume gigabytes of memory.
Why COUNT(DISTINCT) Struggles at Scale
  • Memory pressure: Hash set size grows linearly with unique values
  • CPU overhead: Every row requires a hash computation and lookup
  • Parallelism complexity: Merging results from parallel workers is expensive
  • Time cost: Can take minutes on billion-row tables

In a distributed system, COUNT(DISTINCT) requires shuffling data to ensure the same value is processed by the same worker, adding network overhead to an already expensive operation.

//

How APPROX_DISTINCT Works

APPROX_DISTINCT uses HyperLogLog (HLL), a probabilistic algorithm that estimates cardinality using a fixed amount of memory regardless of input size. The key insight: instead of storing every value, it stores a compact "sketch" that can estimate the count.

1SELECT
2 DATE_TRUNC('day', event_time) AS event_day,
3 COUNT(DISTINCT user_id) AS exact_users,
4 APPROX_DISTINCT(user_id) AS approx_users
5FROM events
6GROUP BY DATE_TRUNC('day', event_time)
Result
event_dayexact_usersapprox_users
2025-11-2512345671231024
2025-11-2613456781349215
2025-11-2714567891452387
The approximate counts are within 0.3% of exact in this example. In practice, HyperLogLog provides about 2% standard error, meaning 95% of estimates are within 4% of the true value.
//

Complexity Comparison

The performance difference between COUNT(DISTINCT) and APPROX_DISTINCT grows dramatically with data size. Here's how they compare as cardinality increases:

Memory Comparison
  • 10K unique values: COUNT(DISTINCT) ~12 KB vs APPROX_DISTINCT ~12 KB (1x)
  • 100K unique values: ~120 KB vs ~12 KB (2-3x speedup)
  • 1M unique values: ~1.2 MB vs ~12 KB (5-10x speedup)
  • 10M unique values: ~12 MB vs ~12 KB (20-50x speedup)
  • 100M unique values: ~120 MB vs ~12 KB (50-100x speedup)
  • 1B unique values: ~1.2 GB vs ~12 KB (100x+ speedup)

Notice that COUNT(DISTINCT) memory grows linearly with cardinality, while APPROX_DISTINCT stays constant at approximately 12 KB (the HLL sketch size). The speedup column shows typical query time improvement.

Choosing the Right Approach

The choice between exact and approximate counts depends on your use case. Consider accuracy requirements and performance constraints.
//

When to Use Each

When to Use Each
  • Financial reporting: COUNT(DISTINCT). Exact numbers required
  • Dashboard metrics: APPROX_DISTINCT. Speed over precision
  • Trend analysis: APPROX_DISTINCT. Relative change matters
  • Real-time monitoring: APPROX_DISTINCT. Latency critical
  • A/B test analysis: COUNT(DISTINCT). Statistical rigor needed
  • Log analysis: APPROX_DISTINCT. Volume too high for exact

Scale is the deciding factor: under 1 million rows, COUNT(DISTINCT) runs fast enough that there is no reason to sacrifice precision. For 1–100 million rows, switch to APPROX_DISTINCT when query latency is a constraint. Beyond 100 million rows, strongly prefer APPROX_DISTINCT unless exactness is a hard requirement. Real-time dashboards should almost always use APPROX_DISTINCT to keep response times below the perceptible threshold.

//

HyperLogLog Internals

HyperLogLog works by hashing each value and observing patterns in the hash bits. The algorithm counts leading zeros in the hash, using the maximum number of leading zeros seen as an estimator for cardinality. More unique values means higher probability of seeing rare patterns (many leading zeros).
Despite its compact implementation, HyperLogLog provides several powerful properties that make it ideal for distributed systems.
HyperLogLog Properties
  • Constant memory: Fixed size regardless of input cardinality
  • Mergeable: Sketches can be combined for distributed computing
  • Streaming: Processes data in a single pass
  • Order-independent: Same result regardless of input order
COUNT(DISTINCT) gives an exact answer but must track every unique value in memory. Best for small-to-medium datasets or when precision is required.
1SELECT
2 region,
3 COUNT(DISTINCT customer_id) AS unique_customers
4FROM orders
5GROUP BY region
Choosing between exact and approximate counts depends on your dataset size and precision requirements.

> Complete this query to efficiently estimate the number of unique visitors per page.

SELECT
  page_url,
  ___(___) AS unique_visitors
FROM page_views
GROUP BY page_url
APPROX_DISTINCT
COUNT
visitor_id
page_url

APPROX_DISTINCT uses constant memory regardless of how many unique values exist, making it ideal for billion-row tables.

The ~2-3% error margin is acceptable for dashboards and trend analysis where exact counts provide no additional business value.

Use COUNT(DISTINCT) when precision matters, such as financial reporting or compliance metrics where every count must be exact.

ORDER BY with aggregates

Daily Life
Interviews

Rank groups by their aggregate values

Aggregated results are often more useful when sorted. ORDER BY lets you arrange your groups by any column or aggregate value: highest revenue first, most recent activity on top, or alphabetically by name.

Basic Sorting Patterns

ORDER BY works with aggregate columns just like regular columns. Sort by metrics to surface top or bottom performers.

//

Sorting by Aggregates

The most common use case: sort groups by their aggregated metric to find the top or bottom performers.
1SELECT
2 customer_id,
3 COUNT(*) AS order_count,
4 SUM(amount) AS total_spent
5FROM orders
6GROUP BY customer_id
7ORDER BY total_spent DESC
Result
customer_idorder_counttotal_spent
C0424715234.50
C0893112876.00
C0155211923.75
C103289845.00
C067198234.25

The result shows your top customers ranked by total spending. DESC means descending (highest first). Use ASC for ascending (lowest first), which is the default if you don't specify.

//

Using Column Aliases

You can reference your aggregate columns by their alias in ORDER BY. This makes queries more readable:

Using the alias is cleaner and avoids repeating the expression.
1SELECT
2 category,
3 SUM(sales) AS total_sales
4FROM products
5GROUP BY category
6ORDER BY total_sales DESC

Top-N Queries

Combining ORDER BY with LIMIT creates powerful reporting patterns for finding top or bottom performers.

//

LIMIT with ORDER BY

Combine ORDER BY with LIMIT to get "top N" or "bottom N" results. This is one of the most common reporting patterns.

1SELECT
2 product_id,
3 product_name,
4 SUM(quantity_sold) AS units_sold
5FROM transactions
6GROUP BY product_id, product_name
7ORDER BY units_sold DESC
8LIMIT 10
Result
product_idproduct_nameunits_sold
P101Wireless Earbuds15234
P089USB-C Cable12876
P156Phone Case11923
P042Screen Protector9845
P203Power Bank8234
This shows your top 5 best-selling products by units sold. Perfect for dashboards, reports, and quick insights.
//

Execution Order

Understanding when ORDER BY executes explains why aggregate aliases work in sorting.

Query Execution with ORDER BY
  • 1. FROM: Get the table
  • 2. WHERE: Filter rows
  • 3. GROUP BY: Create groups
  • 4. HAVING: Filter groups
  • 5. SELECT: Calculate aggregates
  • 6. ORDER BY: Sort results
  • 7. LIMIT: Truncate to N rows

ORDER BY runs near the end, after aggregation is complete. This is why you can sort by aggregate values since they exist by this point.

> Complete this query to rank regions from highest to lowest total cloud spend.

SELECT
  region,
  SUM(amount) AS total_spend
FROM cloud_costs
GROUP BY region
___ SUM(amount) ___
ASC
ORDER BY
GROUP BY
DESC

ORDER BY on aggregated results is how raw group data becomes an actionable ranked list. Top-N reports, leaderboards, and worst-performer alerts all rely on this pattern.

Combining ORDER BY with LIMIT is one of the most efficient patterns in analytical SQL. The database can often short-circuit the sort once it has identified the top N values, rather than sorting the entire result set.

TIP
Use column aliases in ORDER BY rather than repeating the full aggregate expression. It makes queries shorter, easier to read, and less prone to inconsistency if you later change the aggregate.

APPROX_PERCENTILE and ARBITRARY

Daily Life
Interviews

Estimate percentiles at warehouse scale

Averages can be misleading. If one customer spends $100,000 and ninety-nine customers spend $100 each, the average is $1,089.90, which is far higher than what a typical customer spends. Percentiles tell you what typical really looks like.
A percentile tells you what value a certain percentage of data falls below. The 50th percentile (median) is where half the values are below and half are above. The 95th percentile tells you the value that 95% of data falls below, which is useful for understanding outliers.

Percentile Calculations

Percentiles reveal data distributions that averages hide. They tell you what "typical" really means for your data.
//

APPROX_PERCENTILE

APPROX_PERCENTILE provides fast percentile calculations. It takes the column and the percentile as a decimal (0.5 for median, 0.95 for p95):

1SELECT
2 APPROX_PERCENTILE(
3 response_time,
4 0.5
5 ) AS median_response,
6 APPROX_PERCENTILE(
7 response_time,
8 0.95
9 ) AS p95_response,
10 APPROX_PERCENTILE(
11 response_time,
12 0.99
13 ) AS p99_response
14FROM api_calls
Result
median_responsep95_responsep99_response
45235568
//

Median (p50)

The median is the 50th percentile, which is the middle value. It's often more representative than the average because it's not skewed by outliers.
1SELECT
2 AVG(order_amount) AS mean_order,
3 APPROX_PERCENTILE(
4 order_amount,
5 0.5
6 ) AS median_order
7FROM orders
Result
mean_ordermedian_order
127.4385.00
The mean order is $127.43 but the median is only $85.00. This gap tells you that a few large orders are pulling the average up. Most orders are actually around $85.
AVG is sensitive to outliers. A single executive salary of $500K can dramatically shift the department average upward.
1SELECT
2 department,
3 AVG(salary) AS avg_salary
4FROM employee_metrics
5GROUP BY department
//

Percentile Use Cases

Different percentiles answer different questions about your data distribution.
Common Percentile Use Cases
  • p50 (median): Typical value, robust against outliers
  • p75: Upper quartile, where top 25% begins
  • p90: Most users' experience (90% faster than this)
  • p95: Service level agreement (SLA) threshold
  • p99: Tail latency, worst-case scenarios

The ARBITRARY Function

ARBITRARY is a specialized aggregate for when you need any value from a group, particularly when all values are identical.

//

The ARBITRARY Function

ARBITRARY is an aggregate function that returns any value from a group. Unlike MIN, MAX, or AVG, which compute a specific result, ARBITRARY simply picks one value and returns it, and that value could be different between query executions.

This might sound useless at first. Why would you want a random value? The answer lies in performance and intent: when you genuinely don't care which value you get, ARBITRARY is dramatically faster than alternatives.

//

Why ARBITRARY Exists

Denormalized data
Denormalized data
A user's name appears on every order row.
One-to-one relationships
One-to-one relationships
Each order has exactly one shipping address.
Sampling
Sampling
You want any example from each category.
Metadata columns
Metadata columns
Timestamps or IDs that are constant per group.
//

Syntax and Basic Usage

1SELECT
2 user_id,
3 ARBITRARY(user_name) AS user_name,
4 ARBITRARY(country) AS country,
5 COUNT(*) AS order_count,
6 SUM(amount) AS total_spent
7FROM orders
8GROUP BY user_id
Result
user_iduser_namecountryorder_counttotal_spent
U101Alice ChenUSA121456.00
U102Bob SmithCanada8923.50
U103Carol JonesUK152134.75

Without ARBITRARY, you'd get an error: "user_name must appear in GROUP BY or be used in an aggregate function." Using ARBITRARY tells the database "I know all values are the same, just give me one."

//

How ARBITRARY Works

Watch how ARBITRARY processes rows. When a new group appears, it takes the first value encountered. Subsequent rows for that group are ignored since the value is already set.

1SELECT
2 region,
3 ARBITRARY(manager) AS any_manager
4FROM orders
5GROUP BY region
Arbitrary
orders
regionmanager
NorthAlice
SouthBob
NorthCarol
EastDiana
SouthEve
EastFrank
Result
regionany_manager
Step 1/6
Notice that "North" got "Alice" (the first North row), and "Carol" was skipped. The result is non-deterministic because a different execution might process rows in a different order and return "Carol" instead.
//

Performance: O(1) vs O(n)

ARBITRARY has O(1) time complexity per group. It stores one value and ignores subsequent values. Compare this to MIN or MAX, which must compare every value to find the extreme, an O(n) operation where n is the group size.

ARBITRARY runs in O(1): it stores the first value it encounters and ignores the rest. MIN and MAX are O(n): they must compare every value in the group to find the extreme. SUM, COUNT, and AVG also run in O(n), processing every row per group. On billion-row tables this constant-time vs linear distinction becomes significant.

Do
  • Use conditional counting for single-pass multi-metric analysis
  • Prefer COUNT(*) for simple row counts because it's optimized
  • Sort by the metric that matters most to your analysis
  • Use median alongside mean to detect skewed distributions
  • Add LIMIT when you only need top/bottom results
  • Use APPROX_DISTINCT for cardinality on large datasets
Don't
  • Don't use COUNT(1) over COUNT(*) because there's no benefit and it's less clear
  • Don't forget ORDER BY because unsorted aggregates are harder to read
  • Don't rely only on averages because percentiles reveal the full picture
  • Avoid exact distinct counts on billion-row tables when approximate suffices
  • Avoid exact percentiles on billion-row tables and use approximate instead

> Complete this query to find the median response time per service endpoint.

SELECT
  endpoint,
  ___(latency, ___) AS median_ms
FROM api_calls
GROUP BY endpoint
PERCENTILE
0.95
0.5
APPROX_PERCENTILE

The median (0.5 percentile) is more representative than AVG for skewed distributions like response times or salaries.

P95 and P99 percentiles reveal the tail of the distribution, showing worst-case performance that averages hide.

ARBITRARY returns any single value from a group in O(1) time, useful when you need a representative value but do not care which specific one.

PUTTING IT ALL TOGETHER

> You are a data engineer at Airbnb building a nightly metrics pipeline that computes booking performance across listing categories at scale. The query must handle hundreds of millions of rows without blowing memory budgets on exact distinct counts.

Row-level expressions like date_trunc compute derived fields before aggregate functions collect them per category.
APPROX_DISTINCT replaces COUNT(DISTINCT ...) across 500M rows, cutting memory usage from gigabytes to kilobytes.
ORDER BY total_bookings DESC with LIMIT 10 on the aggregated result surfaces the top-performing listing categories.
APPROX_PERCENTILE(0.5) on payout_amount reveals the median booking value, unaffected by outlier luxury listings.
KEY TAKEAWAYS
Row-level expressions evaluate first, then aggregate functions collect the results
COUNT(*) is fastest: counts rows without checking column values
COUNT(column) excludes NULLs; COUNT(DISTINCT column) counts unique non-NULL values
APPROX_DISTINCT uses HyperLogLog for ~3% error with O(1) memory at any scale
ORDER BY + LIMIT on aggregates gives top/bottom N patterns
APPROX_PERCENTILE(0.5) is median; P95/P99 reveal distribution tails
ARBITRARY returns any value from a group in O(1) time (useful for denormalization)
Percentiles ignore NULL values; empty groups return NULL

AVG lied to you. Meet the percentiles.

Category
SQL
Difficulty
intermediate
Duration
33 minutes
Challenges
0 hands-on challenges

Topics covered: Row-level vs aggregate behavior, Understanding cardinality, APPROX_DISTINCT (HyperLogLog), ORDER BY with aggregates, APPROX_PERCENTILE and ARBITRARY

Lesson Sections

  1. Row-level vs aggregate behavior

    Understanding Differences Row-level and aggregate operations follow different execution models. Grasping this distinction is key to writing correct SQL. Two Worlds of SQL SQL has two fundamentally different kinds of operations: How They Work Together Here's the key insight: row-level operations happen first, for each row individually. Then aggregate functions collect those results and summarize them. Consider this query: Here's the execution order: CASE in Aggregates Patterns and Pitfalls Unders

  2. Understanding cardinality

    Cardinality means the number of unique values in a column. It's a fundamental concept for understanding your data and optimizing queries. Understanding cardinality helps you: Try counting distinct customers per region and sorting to find the busiest regions first.

  3. APPROX_DISTINCT (HyperLogLog) (concepts: sqlApproxAgg)

    The Scaling Problem Exact distinct counts become expensive as data grows. Understanding the cost helps you choose the right approach. COUNT(DISTINCT) Problem To count distinct values exactly, the database must track every unique value it has seen. This requires building a hash set in memory that grows with cardinality. For a table with 1 billion rows and 100 million unique values, that hash set can consume gigabytes of memory. How APPROX_DISTINCT Works The approximate counts are within 0.3% of e

  4. ORDER BY with aggregates

    Basic Sorting Patterns Sorting by Aggregates The most common use case: sort groups by their aggregated metric to find the top or bottom performers. Using Column Aliases Top-N Queries LIMIT with ORDER BY This shows your top 5 best-selling products by units sold. Perfect for dashboards, reports, and quick insights. Execution Order

  5. APPROX_PERCENTILE and ARBITRARY

    Averages can be misleading. If one customer spends $100,000 and ninety-nine customers spend $100 each, the average is $1,089.90, which is far higher than what a typical customer spends. Percentiles tell you what typical really looks like. A percentile tells you what value a certain percentage of data falls below. The 50th percentile (median) is where half the values are below and half are above. The 95th percentile tells you the value that 95% of data falls below, which is useful for understandi