Aggregating: Beginner
Amazon lists over 350 million products, and every seller dashboard metric you see, from units sold to average star rating to revenue by category, is produced by aggregate SQL queries running across billions of order records. When Amazon's retail team identifies that a product category is underperforming, they are looking at COUNT, SUM, and AVG applied across millions of rows grouped by category, marketplace, and time window. The same functions that power those eight-figure business decisions are the ones you are about to learn: COUNT, SUM, AVG, MIN, MAX, and GROUP BY.
GROUP BY for categorizing data
Group rows into meaningful categories
GROUP BY collapses individual rows into summary groups. Without it, you can count every order in your database but not how many belong to each customer. That gap is what GROUP BY closes.
Lisa Park flagged a summary report. Twelve orders in the system. The report shows four rows. Nobody deleted anything. But eight orders have vanished from the page. She has a board meeting in two hours.
One table. Twelve orders. Four categories. The report gives back four rows.
| order_id | category | customer | amount |
|---|---|---|---|
| ORD-101 | Electronics | Chen | $1,200 |
| ORD-102 | Clothing | Patel | $650 |
| ORD-103 | Books | Kim | $150 |
| ORD-104 | Electronics | Ruiz | $2,500 |
| ORD-105 | Home | Osei | $1,800 |
| ORD-106 | Clothing | Tanaka | $800 |
| ORD-107 | Books | Chen | $200 |
| ORD-108 | Electronics | Kim | $3,200 |
| ORD-109 | Home | Patel | $1,300 |
| ORD-110 | Clothing | Osei | $500 |
| ORD-111 | Books | Ruiz | $100 |
| ORD-112 | Electronics | Tanaka | $1,300 |
GROUP BY is the organizing principle behind aggregation. It tells SQL: "Put all rows with the same value into the same bucket, then calculate something for each bucket."
Without GROUP BY, an aggregate function operates on the entire table and returns a single row. With GROUP BY, you get one row per unique value in the grouping column.
Understanding GROUP BY
GROUP BY transforms your data by organizing rows into buckets based on shared values, then summarizing each bucket independently.
How GROUP BY Works
| sale_id | region | amount |
|---|---|---|
| 1 | North | 150 |
| 2 | South | 200 |
| 3 | North | 175 |
| 4 | South | 225 |
| 5 | North | 190 |
| 6 | East | 160 |
| sale_id | region | amount |
|---|---|---|
| 1 | North | 150 |
| 2 | South | 200 |
| 3 | North | 175 |
| 4 | South | 225 |
| 5 | North | 190 |
| 6 | East | 160 |
| region | sale_count |
|---|
When you GROUP BY region, SQL creates three buckets: one for "North" (3 rows), one for "South" (2 rows), and one for "East" (1 row). Each bucket is then summarized independently.
Rules and Patterns
GROUP BY has strict rules about what you can select. Understanding these rules prevents common errors and helps you write correct queries.
GROUP BY Golden Rule
When you use GROUP BY, every column in your SELECT must either:
1. Be in the GROUP BY clause, OR
2. Be inside an aggregate function (COUNT, SUM, AVG, etc.)
GROUP BY, check that every non-aggregated column appears in your GROUP BY clause.Multiple Column Grouping
Database Execution
Understanding the execution sequence helps you write efficient GROUP BY queries.
GROUP BYexecutes afterWHEREbut beforeSELECT- Rows are sorted into buckets by grouping column values
- Aggregate functions calculate one result per bucket
- The result has one row per unique group
> Complete this query to find how many products exist in each category.
SELECT , (*) AS num_products FROM products category
GROUP BY is the single most important clause in analytical SQL. Every dashboard, report, and summary table ultimately relies on it to collapse raw rows into meaningful categories.
The golden rule is simple: every column in SELECT must either be in GROUP BY or wrapped in an aggregate function. Violating this rule produces an error, and rightly so, because SQL needs to know how to handle multi-valued columns.
GROUP BY query returns more rows than expected, you are probably grouping on a higher-cardinality column than intended. Check whether you need to combine columns or use a coarser grouping.COUNT variations (*,col,DISTINCT)
Count rows, values, or unique entries
COUNT() is the simplest and most frequently used aggregate function. It counts things. Depending on how you use it, you can count rows, count non-empty values, or count unique values.
COUNT Variations
COUNT has three forms, each answering a different question about your data. Choosing the right one depends on what you need to measure.
COUNT(*) - Count All Rows
COUNT(*) counts every row in the group, regardless of what values are in those rows. It counts rows where columns are empty and rows where columns have values.
With GROUP BY, COUNT(*) tells you how many rows are in each group:
COUNT(column)
COUNT(column_name) counts only rows where that specific column is not empty (not NULL). This is useful when some rows have missing data.
| customer_id | name | |
|---|---|---|
| 1 | Alice | alice@email.com |
| 2 | Bob | NULL |
| 3 | Carol | carol@email.com |
| 4 | Dave | NULL |
| 5 | Eve | eve@email.com |
COUNT(*) returns 5 (all rows). COUNT(email) returns 3 (only rows with an email). The NULL entries represent missing values which COUNT(column) skips.
COUNT(DISTINCT)
Adding DISTINCT inside COUNT tells SQL to count only unique values, ignoring duplicates.
Practical Applications
Knowing when to use each COUNT variant helps you answer different business questions accurately.
Comparing COUNT Variants
The three variants of COUNT answer different questions. Choose the right one for what you need to measure:
Database Execution
Each COUNT variant has different performance characteristics based on the work required.
> Complete this query to count orders and find unique customers per region.
SELECT region, COUNT(*) AS total_orders, ( customer_id) AS unique_customers FROM orders GROUP BY region
COUNT(*), COUNT(column), and COUNT(DISTINCT column) each answer a different question. Choosing the right variant prevents subtle errors in data quality reports and business metrics.
A common real-world use case is audit reporting: COUNT(*) gives total records, COUNT(email) gives records with a value, and COUNT(DISTINCT customer_id) gives unique entities, all in one query.
COUNT result surprises you, ask whether NULLs or duplicates might explain it. COUNT(*) vs COUNT(column) vs COUNT(DISTINCT) each handle those cases differently.SUM and AVG calculations
Calculate totals and averages per group
SUM() adds up all the values in a column. It is used for any question involving totals: total revenue, total quantity, total hours, total anything numeric.
SUM Functions
SUM adds up numeric values to produce totals. It works with or without GROUP BY, calculating grand totals or group-level totals.
Basic Example
The simplest SUM query adds up every value in a column across the entire table.
Without GROUP BY, SUM adds up every row in the table, giving you a grand total.
SUM with GROUP BY
The real power of SUM appears when combined with GROUP BY. Now you can see totals broken down by category.
| sale_id | region | amount |
|---|---|---|
| 1 | North | 150 |
| 2 | South | 200 |
| 3 | North | 175 |
| 4 | South | 225 |
| 5 | North | 190 |
| 6 | East | 160 |
| region | total_amount |
|---|
SUM with Expressions
Grand vs Group Totals
SUM behaves differently depending on whether you include GROUP BY. Compare the two approaches:
Handling Empty Data
SUM ignores NULL values. If a column has some NULL entries, they are simply skipped in the calculation. If ALL values are NULL, SUM returns NULL (not zero).
Database Execution
SUM has specific behaviors around NULL values and data types that affect your results.
- Iterates through each row in the group
- NULL values are skipped (not treated as zero)
- Only works on numeric columns
- Returns NULL if all values are NULL
- Integer overflow on very large sums
- Text columns cause errors
AVG Functions
AVG calculates arithmetic means to find typical values. Understanding how it handles NULLs is crucial for accurate analysis.
AVG()
AVG() calculates the arithmetic mean: sum of all values divided by the count of values. It is essential for understanding typical values: average order size, average response time, average salary.
Basic Example
AVG without GROUP BY calculates the mean across all rows in the table.
AVG with GROUP BY
When combined with GROUP BY, AVG calculates a separate mean for each group.
| emp_id | department | salary |
|---|---|---|
| 1 | Engineering | 120000 |
| 2 | Sales | 80000 |
| 3 | Engineering | 130000 |
| 4 | Sales | 90000 |
| 5 | Engineering | 125000 |
| 6 | Marketing | 95000 |
| department | avg_salary |
|---|
AVG vs Manual Calculation
AVG() is equivalent to SUM() / COUNT(), but there is a subtle difference with NULL values:
AVG() divides by the count of non-NULL values only. If you have 10 rows but 2 are NULL, it divides by 8.The manual calculation using SUM/COUNT handles NULLs the same way as AVG.
Database Execution
AVG has important behaviors around NULL handling that can affect your calculations.
- AVG = SUM / COUNT (of non-
NULLvalues) NULLvalues are excluded from both the sum and count- Returns
NULLif all values areNULL - Result is typically a decimal, even if inputs are integers
Since AVG is sensitive to outliers, be cautious when interpreting results from skewed datasets.
> Complete this query to calculate the total and average cloud spend per region.
SELECT region, (amount) AS total_rev, (amount) AS avg_rev FROM cloud_costs GROUP BY region
SUM and AVG both ignore NULL values. If a column has NULLs, they are silently excluded from the calculation.
AVG can produce misleading results on skewed data. Consider using percentiles alongside averages for a fuller picture.
Combining SUM and AVG in the same GROUP BY query gives you both the total and typical value per group in one pass.
MIN and MAX for extremes
Find the highest and lowest values
MIN() and MAX() find the smallest and largest values in a column. They work with numbers, dates, and even text (alphabetical order).
Basic MIN/MAX Usage
MIN and MAX find the smallest and largest values in your data. They scan through groups to identify extremes.
Finding Extremes
MIN() scans through all values in each group, keeping track of the smallest one found so far:
| product_id | category | price |
|---|---|---|
| P001 | Electronics | 299 |
| P002 | Clothing | 45 |
| P003 | Electronics | 149 |
| P004 | Clothing | 89 |
| P005 | Electronics | 599 |
| P006 | Clothing | 29 |
| category | min_price |
|---|
MAX() works the same way, but keeps track of the largest value in each group:
| product_id | category | price |
|---|---|---|
| P001 | Electronics | 299 |
| P002 | Clothing | 45 |
| P003 | Electronics | 149 |
| P004 | Clothing | 89 |
| P005 | Electronics | 599 |
| P006 | Clothing | 29 |
| category | max_price |
|---|
Now the largest price per category is selected instead. Electronics surfaces 599 and Clothing surfaces 89, the opposite extremes from MIN.
Together with GROUP BY, they let you find price ranges within each category:
Advanced Applications
MIN and MAX work beyond numbers. They handle dates for time-based analysis and even text for alphabetical ordering.
MIN/MAX with Dates
MIN and MAX are extremely useful for date analysis: finding first and last events, date ranges, and activity spans.
MIN/MAX with Text
When applied to text columns, MIN returns the alphabetically first value and MAX returns the alphabetically last value.
Compare how MIN and MAX behave on the same data:
Combining Aggregates
MIN and MAX are often used alongside other aggregate functions to give a complete picture:
Database Execution
MIN and MAX have straightforward execution but work differently depending on data types.
MIN/MAXscan all values in the group to find extremesNULLvalues are silently ignored- Work with numbers, dates, and text columns
- Text comparison is alphabetical and case-sensitive
> Complete this query to find the cheapest and most expensive product in each category.
SELECT category, (price) AS lowest, (price) AS highest FROM products GROUP BY category
MIN and MAX scan all values in the group to find extremes. NULL values are silently excluded from the comparison.
For text columns, MIN returns the alphabetically first value and MAX returns the last. This comparison is case-sensitive.
Combining MIN, MAX, and AVG in one query shows the full range and central tendency, giving a quick data distribution summary.
HAVING for filtered groups
Filter groups after aggregation
HAVING filters groups after aggregation, just as WHERE filters rows before aggregation. You need HAVING when your filter condition involves an aggregate function.
The key distinction: WHERE cannot reference aggregate functions because it runs before the aggregation happens. HAVING runs after aggregation, so it CAN reference aggregated values.
Understanding HAVING
HAVING exists because WHERE cannot filter on aggregate results. Understanding when to use each is essential for correct queries.
Why We Need HAVING
You cannot use aggregate functions in WHERE because it runs before aggregation. HAVING runs after aggregation, so it can reference aggregated values:
WHERE filters individual rows (before grouping). HAVING filters entire groups (after grouping).
| order_id | region | amount |
|---|---|---|
| 1 | North | 150 |
| 2 | South | 200 |
| 3 | North | 175 |
| 4 | South | 225 |
| 5 | North | 190 |
| 6 | East | 160 |
| 7 | North | 180 |
| 8 | South | 210 |
| region | order_count |
|---|
HAVING Examples
HAVING conditions can use any aggregate function, letting you filter on totals, averages, counts, or any calculated metric.
This shows only customers who have spent more than $1,000 total. Customers who spent less are filtered out by HAVING.
Practical Patterns
WHERE and HAVING can work together in the same query. Understanding the execution order helps you write efficient and correct queries.
WHERE vs HAVING
You can use both in the same query. WHERE filters rows first, then GROUP BY creates groups, then HAVING filters groups:
| order_id | region | status |
|---|---|---|
| 1 | North | completed |
| 2 | South | pending |
| 3 | North | completed |
| 4 | South | completed |
| 5 | North | completed |
| 6 | East | completed |
| 7 | South | completed |
| 8 | North | pending |
| 9 | East | completed |
| 10 | South | completed |
| 11 | North | completed |
| 12 | West | completed |
| region | completed_orders |
|---|
First, WHERE keeps only completed orders (filtering rows). Then, GROUP BY counts completed orders per region. Finally, HAVING keeps only regions with more than 50 completed orders (filtering groups).
These guidelines help you choose between WHERE and HAVING for optimal query performance.
- Use WHERE for filtering individual rows
- Use HAVING for filtering based on aggregate values
- Filter with WHERE first to reduce data before aggregation (better performance)
- Remember: HAVING comes after GROUP BY
- Don't try to use aggregate functions in WHERE
- Don't use HAVING when WHERE would work because it's slower
- Don't forget that HAVING filters entire groups, not rows
Execution Order
SQL clauses execute in a specific order. Understanding this sequence clarifies why HAVING can reference aggregates but WHERE cannot.
> Fill in the missing parts to find only departments where the average metric value exceeds 60,000.
SELECT department, AVG(metric_value) AS avg_metric FROM employee_metrics GROUP BY department (metric_value) >
HAVING is the clause that unlocks group-level filtering. Without it, you would need subqueries to accomplish what HAVING does in a single, readable step.
The WHERE and HAVING combination is extremely powerful: WHERE eliminates unwanted rows before any grouping occurs, reducing the data the database must aggregate. HAVING then eliminates groups whose summary metrics do not meet your criteria.
If your HAVING condition does not reference an aggregate function, move it to WHERE instead. Filtering rows before grouping is always faster than filtering groups after aggregation.
> You are a product analyst at Stripe building a feature usage dashboard that summarizes transaction activity across subscription tiers. The product team needs counts, totals, and averages per plan so they can identify which tiers drive the most volume.
GROUP BY groups every transaction row by subscription tier, producing one summary row per tier.COUNT(*) tallies total transactions per tier while COUNT(DISTINCT ...) measures unique active users.SUM computes total revenue per tier and AVG reveals the typical transaction size.HAVING filters the grouped results to only tiers exceeding a minimum transaction threshold for the report.GROUP BY buckets rows by column values; each group produces one output rowCOUNT(*) counts all rows; COUNT(column) counts non-NULL values onlySUM and AVG work only on numeric columns; NULL values are ignoredMIN and MAX work on any comparable type: numbers, strings, datesSELECT column must appear in GROUP BYHAVING filters groups after aggregation; WHERE filters rows beforeWHERE to exclude rows from calculation; use HAVING to exclude groups from resultsGROUP BY treats the entire table as one groupAggregating: Beginner
A million rows walk into a SUM...
- Category
- SQL
- Difficulty
- beginner
- Duration
- 38 minutes
- Challenges
- 0 hands-on challenges
Topics covered: GROUP BY for categorizing data, COUNT variations (*,col,DISTINCT), SUM and AVG calculations, MIN and MAX for extremes, HAVING for filtered groups
Lesson Sections
- GROUP BY for categorizing data (concepts: sqlGroupBy)
Understanding GROUP BY How GROUP BY Works Consider this sales data: Watch how rows with the same region value are gathered into groups, then collapsed into a single summary row per group. The result has one row per region. Each row summarizes all the sales in that region. The original 6 rows have been collapsed into 3 summary rows. Rules and Patterns GROUP BY Golden Rule This rule exists because SQL needs to know how to handle columns that have multiple values per group. If you group by region,
- COUNT variations (*,col,DISTINCT) (concepts: sqlCount, sqlCountDistinct)
COUNT Variations COUNT(*) - Count All Rows COUNT(column) Consider a table where some customers have not provided an email: COUNT(DISTINCT) 1,847 orders came from 892 unique customers. This tells you that some customers placed multiple orders. Practical Applications Comparing COUNT Variants Database Execution
- SUM and AVG calculations (concepts: sqlSumAvg)
SUM Functions Basic Example SUM with GROUP BY Each group's amounts are added together into a single total. North's three sales (150 + 175 + 190) become 515, while East's lone sale stays at 160. Now you can instantly see which region generates the most revenue and sells the most units. SUM with Expressions You can sum calculated values, not just raw columns. This is common for computing totals from unit prices and quantities. The expression price * quantity is calculated for each row, then those
- MIN and MAX for extremes (concepts: sqlMinMax)
Basic MIN/MAX Usage Finding Extremes Within each category, the smallest price is selected. Electronics keeps 149 (ignoring 299 and 599), while Clothing keeps 29. This tells you the price range within each category. Electronics spans $450 while Clothing only spans $60. Advanced Applications MIN/MAX with Dates Now you can see each customer's entire order history at a glance: when they first ordered, their most recent order, and how many orders they have placed. MIN/MAX with Text Combining Aggregat
- HAVING for filtered groups (concepts: sqlHaving)
Understanding HAVING Why We Need HAVING Consider this question: "Which regions have more than 100 orders?" Groups are formed first, then the count threshold is applied. North (4 orders) and South (3 orders) pass the filter, while East (1 order) is eliminated entirely. HAVING Examples This finds highly-rated products with enough reviews to be statistically meaningful. Both conditions must be true: at least 10 reviews AND average rating of 4.0 or higher. Practical Patterns WHERE vs HAVING Executio