Aggregating: Intermediate

Uber's operations team uses HAVING and complex GROUP BY queries to identify which cities have enough high-rated drivers to support a surge in demand, filtering out markets where the average driver rating falls below threshold or where completed trips per driver drop under a minimum volume. That kind of analysis, where you aggregate first and then filter on the result of that aggregation, is something WHERE cannot do on its own. HAVING is what makes it possible, and mastering it is what turns basic aggregation into the tool that drives real operational decisions.

Row-level vs aggregate behavior

Daily Life

Interviews

Know when a query collapses rows

COUNT is the most common aggregate function, and its two forms look nearly identical. Run them on the same column and they can return different numbers, with no error to tell you why.

THE CASE OF THE PHANTOM COUNT

Victor Huang flagged it this morning: five engineers on the headcount report, three on the compensation report. He has a leadership sync at two o'clock and he needs a number he can defend. Nobody can explain the gap.

employees

id	name	department	salary	bonus
1	Chen	Engineering	85000	6000
2	Patel	Engineering	92000	null
3	Kim	Engineering	78000	4000
4	Osei	Engineering	88000	5000
5	Ruiz	Engineering	74000	null
6	Tanaka	Marketing	71000	3000
7	Adams	Marketing	82000	4000
8	Silva	Marketing	76000	2000
9	Lee	Sales	68000	2000
10	Costa	Sales	73000	null

Understanding the difference between row-level and aggregate functions is essential for writing correct SQL. Functions like SUM(salary * 1.1) work because the row-level expression (salary * 1.1) is wrapped inside an aggregate. Let's understand why this matters and why the reverse doesn't work.

Understanding Differences

Row-level and aggregate operations follow different execution models. Grasping this distinction is key to writing correct SQL.

Two Worlds of SQL

SQL has two fundamentally different kinds of operations:

•Row-Level Operations

Evaluate once per row, independently
Examples: =, >, <, +, -, *, CASE WHEN
Input: values from the current row
Output: one value per row

•Aggregate Operations

Evaluate once per group (or entire table)
Examples: COUNT, SUM, AVG, MIN, MAX
Input: all values from all rows in group
Output: one value per group

How They Work Together

Here's the key insight: row-level operations happen first, for each row individually. Then aggregate functions collect those results and summarize them.

Consider this query:

	SELECT
	department,
	SUM(salary * 1.1) AS total_with_raise
	FROM employee_metrics
	GROUP BY department

Result

department	total_with_raise
Engineering	594000.00
Sales	316800.00

Here's the execution order:

Step-by-Step Evaluation

1. For EACH row: Calculate salary * 1.1 (row-level math)
2. Group the results by department
3. For EACH group: SUM all the calculated values (aggregate)
4. Return one row per group

The row-level expression (salary * 1.1) is evaluated independently for every single row. Then SUM collects all those individual results and adds them up per group.

CASE in Aggregates

This is why CASE WHEN works inside aggregates. CASE WHEN is a row-level expression, so it evaluates for each row first:

	SELECT
	department,
	SUM(
	CASE
	WHEN is_manager THEN salary
	ELSE 0
	END
	) AS manager_total
	FROM employee_metrics
	GROUP BY department

Result

department	manager_total
Engineering	180000
Sales	95000

Execution Breakdown

1. Row 1 (Alice, manager): CASE evaluates to 120000
2. Row 2 (Bob, not manager): CASE evaluates to 0
3. Row 3 (Carol, manager): CASE evaluates to 95000
4. Row 4 (Dave, not manager): CASE evaluates to 0
5. SUM adds up the evaluated values per department

The CASE expression produces a concrete number for each row. SUM doesn't know or care that the number came from a CASE. It just sums whatever values it receives.

Patterns and Pitfalls

Understanding what works and what doesn't helps you avoid common mistakes when mixing row-level and aggregate operations.

Reverse Doesn't Work

Now here's the crucial insight: you cannot put an aggregate inside a row-level expression in the SELECT clause. This is a syntax error:

	SELECT
	department,
	CASE
	WHEN SUM(salary) > 100000 THEN 'Large'
	ELSE 'Small'
	END AS size
	FROM employee_metrics
	GROUP BY department

Result

department	size
Engineering	Large
Sales	Small
Marketing	Large

Why doesn't this work? Think about when things happen:

The Logical Problem

IF/CASE is evaluated per row, before grouping
SUM needs all rows in a group to calculate
You can't use SUM's result before SUM has finished
The row-level expression runs first, but needs aggregate result

It's like trying to know the final score of a game before it's played. The aggregate hasn't happened yet when the row-level expression needs its value.

The Correct Pattern

To make decisions based on aggregate results, you have two options:

Use HAVING to filter groups after aggregation. Groups that don't meet the condition are excluded entirely from the result.

	SELECT
	department,
	SUM(salary) AS total
	FROM employee_metrics
	GROUP BY department
	HAVING SUM(salary) > 100000

The Mental Model

Think of SQL execution as a pipeline:

Query Execution Pipeline

1. FROM: Get the rows
2. WHERE: Filter rows (row-level, can't use aggregates)
3. GROUP BY: Organize into groups
4. HAVING: Filter groups (can use aggregates)
5. SELECT: Calculate results (row-level expressions first, then aggregates)
6. ORDER BY: Sort results (can use aggregates and aliases)

Row-level expressions in SELECT are evaluated as the aggregate scans each row. The aggregate collects these intermediate results. You can nest row-level inside aggregate (row happens during the scan), but not aggregate inside row-level (aggregate needs the full scan to complete first).

Which approach correctly calculates total salary with a 10% raise per department?

SUM(salary * 1.1) applies the raise per row first, then sums. Each employee gets their own 10% raise before totaling.

	SELECT
	department,
	SUM(salary * 1.1) AS total_raised
	FROM employee_metrics
	GROUP BY department

Practical Applications

Understanding this unlocks powerful patterns. Here's a common real-world example combining multiple row-level expressions inside aggregates:

	SELECT
	product_category,
	COUNT(*) AS total_orders,
	SUM(
	CASE
	WHEN status = 'returned' THEN 1
	ELSE 0
	END
	) AS returns,
	AVG(
	CASE
	WHEN discount > 0 THEN amount
	END
	) AS avg_discounted
	FROM orders
	GROUP BY product_category

Result

product_category	total_orders	returns	avg_discounted
Electronics	1523	89	245.50
Clothing	2341	234	67.80
Home	876	45	123.40

Each aggregate receives row-level evaluated values: COUNT(*) receives 1 for every row SUM receives 1 for returns, 0 for non-returns AVG receives amount for discounted orders, NULL for full-price (which AVG ignores) This pattern lets you calculate multiple different metrics in a single pass through the data.

COUNT Deep Dive

COUNT is deceptively simple. Understanding its variations and NULL handling helps you write accurate data quality queries.

COUNT(*) Deep Dive

COUNT(*) seems simple, but understanding its nuances helps you write better queries. Let's explore what it actually does and when to use alternatives.

COUNT(*) Behavior

COUNT(*) counts rows, period. It doesn't look at column values. It doesn't care about NULLs. If a row exists in the result set, it gets counted.

Consider this table with some NULL values:

id	name	email
1	Alice	alice@example.com
2	Bob	NULL
3	Carol	carol@example.com
4	NULL	NULL
5	Eve	eve@example.com

	SELECT
	COUNT(*) AS total_rows,
	COUNT(name) AS names_present,
	COUNT(email) AS emails_present
	FROM users

Result

total_rows	names_present	emails_present
5	4	3

COUNT(*) returns 5 (all rows exist). COUNT(name) returns 4 (one NULL name). COUNT(email) returns 3 (two NULL emails). NULL represents missing values.

COUNT(*) Best Practices

You might see COUNT(1) in legacy code or tutorials. It's functionally identical to COUNT(*) since both count rows. Use COUNT(*) for clarity because it's the standard convention and clearly expresses "count all rows."

When to Use COUNT(column)

Use COUNT(column) when you specifically need to count non-NULL values. Common use cases:

• Data quality checks: How many customers have emails? • Optional field analysis: How many orders have discount codes? • Completeness metrics: What percentage of profiles are complete?

	SELECT
	COUNT(*) AS total_customers,
	COUNT(phone) AS have_phone,
	COUNT(email) AS have_email,
	COUNT(address) AS have_address,
	ROUND(
	100 * COUNT(phone) / COUNT(*),
	1
	) AS phone_rate
	FROM customers

Result

total_customers	have_phone	have_email	have_address	phone_rate
10000	8234	9567	7123	82.3

This data quality report shows that 82.3% of customers have phone numbers on file, and email capture is better than address capture.

Which COUNT variant answers "how many rows have a value in the email column"?

COUNT(*) counts every row regardless of column values. It tells you the total number of customers, not how many have emails.

	SELECT
	COUNT(*) AS total
	FROM customers

COUNT variants have different performance characteristics depending on how the database processes them.

Database Execution

COUNT(*) is highly optimized and often uses index metadata
COUNT(column) must check each value for NULL
COUNT(DISTINCT column) is expensive because it builds a hash set
Prefer COUNT(*) when you just need row counts

> Complete this query to find each department's total metric count and how many have a non-null value.

SELECT
  department,
  COUNT(___) AS total,
  COUNT(___) AS with_value
FROM employee_metrics
GROUP BY department

metric_value

metric_name

Row-level expressions inside aggregates are a powerful composition tool. The database evaluates each row independently, then the aggregate function collects those individual results into a single summary value.

CASE WHEN inside SUM or AVG is essentially conditional aggregation: you decide, row by row, whether a value should contribute to the aggregate. This single pattern replaces many common reporting subqueries.

TIP

When combining multiple aggregate metrics in one query, always verify the results against simpler single-metric queries first. It is easy to mix up CASE WHEN branches when writing complex conditional aggregations.

Understanding cardinality

Daily Life

Interviews

Predict how joins change your row count

Cardinality means the number of unique values in a column. It's a fundamental concept for understanding your data and optimizing queries.

Status column

Values like [pending, completed, cancelled] have cardinality 3.

User ID column

1 million unique users means cardinality of 1,000,000.

Country column

About ~200 unique values (number of countries).

Boolean column

Cardinality 2 (true/false) or 3 if NULLs exist.

Understanding cardinality helps you:

Choose effective GROUP BY columns (low cardinality = fewer groups) Predict query performance (high cardinality = more memory for DISTINCT) Design efficient indexes (high cardinality columns make better indexes) Identify data quality issues (unexpected cardinality suggests problems)

	SELECT
	'status' AS column_name,
	COUNT(DISTINCT status) AS cardinality
	FROM orders

Result

column_name	cardinality
status	3

Low cardinality columns like status, region, or category are ideal for GROUP BY. High cardinality columns like user_id or order_id are better for filtering or joining.

Which column would produce fewer groups in a GROUP BY?

Status has low cardinality (perhaps 3-5 values like pending, shipped, delivered). This produces very few groups, making the query fast and the results easy to read.

	SELECT
	status,
	COUNT(*) AS cnt
	FROM orders
	GROUP BY status

Try counting distinct customers per region and sorting to find the busiest regions first.

> Complete this query to find how many different customers placed orders in each region, with the busiest region first.

SELECT
  region,
  ___(___ ___) AS unique_customers
FROM orders
GROUP BY region
___ unique_customers DESC

COUNT

DISTINCT

customer_id

ORDER BY

SUM

GROUP BY

Cardinality awareness shapes how you write GROUP BY queries. Low-cardinality columns like status or region produce compact, fast results. High-cardinality columns like user_id or order_id produce row-per-entity results that are better suited to filtering or joining.

COUNT(DISTINCT) is the primary tool for measuring cardinality directly in SQL. It tells you exactly how many unique values exist, which is essential for data profiling and deduplication work.

TIP

Before grouping on a column you are unfamiliar with, run a quick COUNT(DISTINCT column) to understand its cardinality. A column with 5 million distinct values will produce a 5-million-row result set, which is rarely what you want from a GROUP BY.

APPROX_DISTINCT (HyperLogLog)

Daily Life

Interviews

Count unique values in massive datasets

COUNT(DISTINCT) is essential for measuring cardinality, answering questions like how many unique users, sessions, or products. But on massive datasets, it becomes a performance bottleneck. APPROX_DISTINCT solves this with probabilistic counting that's fast at any scale.

The Scaling Problem

Exact distinct counts become expensive as data grows. Understanding the cost helps you choose the right approach.

COUNT(DISTINCT) Problem

To count distinct values exactly, the database must track every unique value it has seen. This requires building a hash set in memory that grows with cardinality. For a table with 1 billion rows and 100 million unique values, that hash set can consume gigabytes of memory.

Why COUNT(DISTINCT) Struggles at Scale

Memory pressure: Hash set size grows linearly with unique values
CPU overhead: Every row requires a hash computation and lookup
Parallelism complexity: Merging results from parallel workers is expensive
Time cost: Can take minutes on billion-row tables

In a distributed system, COUNT(DISTINCT) requires shuffling data to ensure the same value is processed by the same worker, adding network overhead to an already expensive operation.

How APPROX_DISTINCT Works

APPROX_DISTINCT uses HyperLogLog (HLL), a probabilistic algorithm that estimates cardinality using a fixed amount of memory regardless of input size. The key insight: instead of storing every value, it stores a compact "sketch" that can estimate the count.

	SELECT
	DATE_TRUNC('day', event_time) AS event_day,
	COUNT(DISTINCT user_id) AS exact_users,
	APPROX_DISTINCT(user_id) AS approx_users
	FROM events
	GROUP BY DATE_TRUNC('day', event_time)

Result

event_day	exact_users	approx_users
2025-11-25	1234567	1231024
2025-11-26	1345678	1349215
2025-11-27	1456789	1452387

The approximate counts are within 0.3% of exact in this example. In practice, HyperLogLog provides about 2% standard error, meaning 95% of estimates are within 4% of the true value.

Complexity Comparison

The performance difference between COUNT(DISTINCT) and APPROX_DISTINCT grows dramatically with data size. Here's how they compare as cardinality increases:

Memory Comparison

10K unique values: COUNT(DISTINCT) ~12 KB vs APPROX_DISTINCT ~12 KB (1x)
100K unique values: ~120 KB vs ~12 KB (2-3x speedup)
1M unique values: ~1.2 MB vs ~12 KB (5-10x speedup)
10M unique values: ~12 MB vs ~12 KB (20-50x speedup)
100M unique values: ~120 MB vs ~12 KB (50-100x speedup)
1B unique values: ~1.2 GB vs ~12 KB (100x+ speedup)

Notice that COUNT(DISTINCT) memory grows linearly with cardinality, while APPROX_DISTINCT stays constant at approximately 12 KB (the HLL sketch size). The speedup column shows typical query time improvement.

Choosing the Right Approach

The choice between exact and approximate counts depends on your use case. Consider accuracy requirements and performance constraints.

When to Use Each

Financial reporting: COUNT(DISTINCT). Exact numbers required
Dashboard metrics: APPROX_DISTINCT. Speed over precision
Trend analysis: APPROX_DISTINCT. Relative change matters
Real-time monitoring: APPROX_DISTINCT. Latency critical
A/B test analysis: COUNT(DISTINCT). Statistical rigor needed
Log analysis: APPROX_DISTINCT. Volume too high for exact

Scale is the deciding factor: under 1 million rows, COUNT(DISTINCT) runs fast enough that there is no reason to sacrifice precision. For 1–100 million rows, switch to APPROX_DISTINCT when query latency is a constraint. Beyond 100 million rows, strongly prefer APPROX_DISTINCT unless exactness is a hard requirement. Real-time dashboards should almost always use APPROX_DISTINCT to keep response times below the perceptible threshold.

HyperLogLog Internals

HyperLogLog works by hashing each value and observing patterns in the hash bits. The algorithm counts leading zeros in the hash, using the maximum number of leading zeros seen as an estimator for cardinality. More unique values means higher probability of seeing rare patterns (many leading zeros).

Despite its compact implementation, HyperLogLog provides several powerful properties that make it ideal for distributed systems.

HyperLogLog Properties

Constant memory: Fixed size regardless of input cardinality
Mergeable: Sketches can be combined for distributed computing
Streaming: Processes data in a single pass
Order-independent: Same result regardless of input order

COUNT(DISTINCT) gives an exact answer but must track every unique value in memory. Best for small-to-medium datasets or when precision is required.

	SELECT
	region,
	COUNT(DISTINCT customer_id) AS unique_customers
	FROM orders
	GROUP BY region

Choosing between exact and approximate counts depends on your dataset size and precision requirements.

> Complete this query to efficiently estimate the number of unique visitors per page.

SELECT
  page_url,
  ___(___) AS unique_visitors
FROM page_views
GROUP BY page_url

APPROX_DISTINCT

visitor_id

COUNT

page_url

APPROX_DISTINCT uses constant memory regardless of how many unique values exist, making it ideal for billion-row tables.

The ~2-3% error margin is acceptable for dashboards and trend analysis where exact counts provide no additional business value.

Use COUNT(DISTINCT) when precision matters, such as financial reporting or compliance metrics where every count must be exact.

ORDER BY with aggregates

Daily Life

Interviews

Rank groups by their aggregate values

Aggregated results are often more useful when sorted. ORDER BY lets you arrange your groups by any column or aggregate value: highest revenue first, most recent activity on top, or alphabetically by name.

Basic Sorting Patterns

ORDER BY works with aggregate columns just like regular columns. Sort by metrics to surface top or bottom performers.

Sorting by Aggregates

The most common use case: sort groups by their aggregated metric to find the top or bottom performers.

	SELECT
	customer_id,
	COUNT(*) AS order_count,
	SUM(amount) AS total_spent
	FROM orders
	GROUP BY customer_id
	ORDER BY total_spent DESC

Result

customer_id	order_count	total_spent
C042	47	15234.50
C089	31	12876.00
C015	52	11923.75
C103	28	9845.00
C067	19	8234.25

The result shows your top customers ranked by total spending. DESC means descending (highest first). Use ASC for ascending (lowest first), which is the default if you don't specify.

Using Column Aliases

You can reference your aggregate columns by their alias in ORDER BY. This makes queries more readable:

Using the alias is cleaner and avoids repeating the expression.

	SELECT
	category,
	SUM(sales) AS total_sales
	FROM products
	GROUP BY category
	ORDER BY total_sales DESC

Top-N Queries

Combining ORDER BY with LIMIT creates powerful reporting patterns for finding top or bottom performers.

LIMIT with ORDER BY

Combine ORDER BY with LIMIT to get "top N" or "bottom N" results. This is one of the most common reporting patterns.

	SELECT
	product_id,
	product_name,
	SUM(quantity_sold) AS units_sold
	FROM transactions
	GROUP BY product_id, product_name
	ORDER BY units_sold DESC
	LIMIT 10

Result

product_id	product_name	units_sold
P101	Wireless Earbuds	15234
P089	USB-C Cable	12876
P156	Phone Case	11923
P042	Screen Protector	9845
P203	Power Bank	8234

This shows your top 5 best-selling products by units sold. Perfect for dashboards, reports, and quick insights.

Execution Order

Understanding when ORDER BY executes explains why aggregate aliases work in sorting.

Query Execution with ORDER BY

1. FROM: Get the table
2. WHERE: Filter rows
3. GROUP BY: Create groups
4. HAVING: Filter groups
5. SELECT: Calculate aggregates
6. ORDER BY: Sort results
7. LIMIT: Truncate to N rows

ORDER BY runs near the end, after aggregation is complete. This is why you can sort by aggregate values since they exist by this point.

> Complete this query to rank regions from highest to lowest total cloud spend.

SELECT
  region,
  SUM(amount) AS total_spend
FROM cloud_costs
GROUP BY region
___ SUM(amount) ___

ORDER BY

DESC

GROUP BY

ASC

ORDER BY on aggregated results is how raw group data becomes an actionable ranked list. Top-N reports, leaderboards, and worst-performer alerts all rely on this pattern.

Combining ORDER BY with LIMIT is one of the most efficient patterns in analytical SQL. The database can often short-circuit the sort once it has identified the top N values, rather than sorting the entire result set.

TIP

Use column aliases in ORDER BY rather than repeating the full aggregate expression. It makes queries shorter, easier to read, and less prone to inconsistency if you later change the aggregate.

APPROX_PERCENTILE and ARBITRARY

Daily Life

Interviews

Estimate percentiles at warehouse scale

Averages can be misleading. If one customer spends $100,000 and ninety-nine customers spend $100 each, the average is $1,089.90, which is far higher than what a typical customer spends. Percentiles tell you what typical really looks like.

A percentile tells you what value a certain percentage of data falls below. The 50th percentile (median) is where half the values are below and half are above. The 95th percentile tells you the value that 95% of data falls below, which is useful for understanding outliers.

Percentile Calculations

Percentiles reveal data distributions that averages hide. They tell you what "typical" really means for your data.

APPROX_PERCENTILE

APPROX_PERCENTILE provides fast percentile calculations. It takes the column and the percentile as a decimal (0.5 for median, 0.95 for p95):

	SELECT
	APPROX_PERCENTILE(
	response_time,
	0.5
	) AS median_response,
	APPROX_PERCENTILE(
	response_time,
	0.95
	) AS p95_response,
	APPROX_PERCENTILE(
	response_time,
	0.99
	) AS p99_response
	FROM api_calls

Result

median_response	p95_response	p99_response
45	235	568

Median (p50)

The median is the 50th percentile, which is the middle value. It's often more representative than the average because it's not skewed by outliers.

	SELECT
	AVG(order_amount) AS mean_order,
	APPROX_PERCENTILE(
	order_amount,
	0.5
	) AS median_order
	FROM orders

Result

mean_order	median_order
127.43	85.00

The mean order is $127.43 but the median is only $85.00. This gap tells you that a few large orders are pulling the average up. Most orders are actually around $85.

AVG is sensitive to outliers. A single executive salary of $500K can dramatically shift the department average upward.

	SELECT
	department,
	AVG(salary) AS avg_salary
	FROM employee_metrics
	GROUP BY department

Percentile Use Cases

Different percentiles answer different questions about your data distribution.

Common Percentile Use Cases

p50 (median): Typical value, robust against outliers
p75: Upper quartile, where top 25% begins
p90: Most users' experience (90% faster than this)
p95: Service level agreement (SLA) threshold
p99: Tail latency, worst-case scenarios

The ARBITRARY Function

ARBITRARY is a specialized aggregate for when you need any value from a group, particularly when all values are identical.

The ARBITRARY Function

ARBITRARY is an aggregate function that returns any value from a group. Unlike MIN, MAX, or AVG, which compute a specific result, ARBITRARY simply picks one value and returns it, and that value could be different between query executions.

This might sound useless at first. Why would you want a random value? The answer lies in performance and intent: when you genuinely don't care which value you get, ARBITRARY is dramatically faster than alternatives.

Why ARBITRARY Exists

Denormalized data

A user's name appears on every order row.

One-to-one relationships

Each order has exactly one shipping address.

Sampling

You want any example from each category.

Metadata columns

Timestamps or IDs that are constant per group.

Syntax and Basic Usage

	SELECT
	user_id,
	ARBITRARY(user_name) AS user_name,
	ARBITRARY(country) AS country,
	COUNT(*) AS order_count,
	SUM(amount) AS total_spent
	FROM orders
	GROUP BY user_id

Result

user_id	user_name	country	order_count	total_spent
U101	Alice Chen	USA	12	1456.00
U102	Bob Smith	Canada	8	923.50
U103	Carol Jones	UK	15	2134.75

Without ARBITRARY, you'd get an error: "user_name must appear in GROUP BY or be used in an aggregate function." Using ARBITRARY tells the database "I know all values are the same, just give me one."

How ARBITRARY Works

Watch how ARBITRARY processes rows. When a new group appears, it takes the first value encountered. Subsequent rows for that group are ignored since the value is already set.

	SELECT
	region,
	ARBITRARY(manager) AS any_manager
	FROM orders
	GROUP BY region

SELECT
  region,
  ANY_VALUE(manager) AS any_manager
FROM orders
GROUP BY region

orders

region

manager

North

Alice

South

Bob

North

Carol

East

Diana

South

Eve

East

Frank

→

Result

region

any_manager

empty

GROUP BY region: every row lands in exactly one bucket, and ANY_VALUE(manager) updates as its group receives rows.

1/8

Notice that "North" got "Alice" (the first North row), and "Carol" was skipped. The result is non-deterministic because a different execution might process rows in a different order and return "Carol" instead.

Performance: O(1) vs O(n)

ARBITRARY has O(1) time complexity per group. It stores one value and ignores subsequent values. Compare this to MIN or MAX, which must compare every value to find the extreme, an O(n) operation where n is the group size.

ARBITRARY runs in O(1): it stores the first value it encounters and ignores the rest. MIN and MAX are O(n): they must compare every value in the group to find the extreme. SUM, COUNT, and AVG also run in O(n), processing every row per group. On billion-row tables this constant-time vs linear distinction becomes significant.

✓Do

Use conditional counting for single-pass multi-metric analysis
Prefer COUNT(*) for simple row counts because it's optimized
Sort by the metric that matters most to your analysis
Use median alongside mean to detect skewed distributions
Add LIMIT when you only need top/bottom results
Use APPROX_DISTINCT for cardinality on large datasets

✗Don't

Don't use COUNT(1) over COUNT(*) because there's no benefit and it's less clear
Don't forget ORDER BY because unsorted aggregates are harder to read
Don't rely only on averages because percentiles reveal the full picture
Avoid exact distinct counts on billion-row tables when approximate suffices
Avoid exact percentiles on billion-row tables and use approximate instead

> Complete this query to find the median response time per service endpoint.

SELECT
  endpoint,
  ___(latency, ___) AS median_ms
FROM api_calls
GROUP BY endpoint

APPROX_PERCENTILE

0.5

PERCENTILE

0.95

The median (0.5 percentile) is more representative than AVG for skewed distributions like response times or salaries.

P95 and P99 percentiles reveal the tail of the distribution, showing worst-case performance that averages hide.

ARBITRARY returns any single value from a group in O(1) time, useful when you need a representative value but do not care which specific one.

❯❯❯PUTTING IT ALL TOGETHER

> You are a data engineer at Airbnb building a nightly metrics pipeline that computes booking performance across listing categories at scale. The query must handle hundreds of millions of rows without blowing memory budgets on exact distinct counts.

Row-level expressions like date_trunc compute derived fields before aggregate functions collect them per category.

APPROX_DISTINCT replaces COUNT(DISTINCT ...) across 500M rows, cutting memory usage from gigabytes to kilobytes.

ORDER BY total_bookings DESC with LIMIT 10 on the aggregated result surfaces the top-performing listing categories.

APPROX_PERCENTILE(0.5) on payout_amount reveals the median booking value, unaffected by outlier luxury listings.

KEY TAKEAWAYS

Row-level expressions evaluate first, then aggregate functions collect the results

COUNT(*) is fastest: counts rows without checking column values

COUNT(column) excludes NULLs; COUNT(DISTINCT column) counts unique non-NULL values

APPROX_DISTINCT uses HyperLogLog for ~3% error with O(1) memory at any scale

ORDER BY + LIMIT on aggregates gives top/bottom N patterns

APPROX_PERCENTILE(0.5) is median; P95/P99 reveal distribution tails

ARBITRARY returns any value from a group in O(1) time (useful for denormalization)

Percentiles ignore NULL values; empty groups return NULL

AVG lied to you. Meet the percentiles.

Category: SQL
Difficulty: intermediate
Duration: 33 minutes
Challenges: 0 hands-on challenges

Topics covered: Row-level vs aggregate behavior, Understanding cardinality, APPROX_DISTINCT (HyperLogLog), ORDER BY with aggregates, APPROX_PERCENTILE and ARBITRARY

Lesson Sections

Row-level vs aggregate behavior (concepts: sqlCount)
Understanding Differences Row-level and aggregate operations follow different execution models. Grasping this distinction is key to writing correct SQL. Two Worlds of SQL SQL has two fundamentally different kinds of operations: How They Work Together Here's the key insight: row-level operations happen first, for each row individually. Then aggregate functions collect those results and summarize them. Consider this query: Here's the execution order: CASE in Aggregates Patterns and Pitfalls Unders
Understanding cardinality (concepts: sqlJoinCardinality)
Cardinality means the number of unique values in a column. It's a fundamental concept for understanding your data and optimizing queries. Understanding cardinality helps you: Try counting distinct customers per region and sorting to find the busiest regions first.
APPROX_DISTINCT (HyperLogLog) (concepts: sqlApproxAgg)
The Scaling Problem Exact distinct counts become expensive as data grows. Understanding the cost helps you choose the right approach. COUNT(DISTINCT) Problem To count distinct values exactly, the database must track every unique value it has seen. This requires building a hash set in memory that grows with cardinality. For a table with 1 billion rows and 100 million unique values, that hash set can consume gigabytes of memory. How APPROX_DISTINCT Works The approximate counts are within 0.3% of e
ORDER BY with aggregates (concepts: sqlOrderBy)
Basic Sorting Patterns Sorting by Aggregates The most common use case: sort groups by their aggregated metric to find the top or bottom performers. Using Column Aliases Top-N Queries LIMIT with ORDER BY This shows your top 5 best-selling products by units sold. Perfect for dashboards, reports, and quick insights. Execution Order
APPROX_PERCENTILE and ARBITRARY (concepts: sqlApproxAgg)
Averages can be misleading. If one customer spends $100,000 and ninety-nine customers spend $100 each, the average is $1,089.90, which is far higher than what a typical customer spends. Percentiles tell you what typical really looks like. A percentile tells you what value a certain percentage of data falls below. The 50th percentile (median) is where half the values are below and half are above. The 95th percentile tells you the value that 95% of data falls below, which is useful for understandi