SQL Aggregate Functions: COUNT, SUM, AVG
SUM and AVG appear in 15% of verified DE SQL interview questions. COUNT shows up in 13%. GROUP BY, their constant companion, runs at 32% in our corpus. Put those numbers together and aggregates sit inside something like half of every SQL round you'll ever do, with failure modes that compound at scale.
The Five Aggregate Functions
Five functions, each with its own NULL story. COUNT(*) ignores nothing. COUNT(column) ignores NULLs. AVG divides by the non-null count, which is the single most common source of 'off by a tiny bit' reporting bugs in our corpus of graded answers.
Aggregates as Window Functions
Every aggregate function can also be used as a window function by adding an OVER clause. This computes the aggregate without collapsing rows. The result is attached to every row as an additional column.
Adding ORDER BY inside the OVER clause changes the behavior from a static aggregate to a running aggregate. SUM with ORDER BY produces a running total. AVG with ORDER BY produces a running average. This default frame behavior (UNBOUNDED PRECEDING to CURRENT ROW) surprises many candidates.
Know Aggregation the way the interviewer who asks it knows it.
Pulled from debriefs where SQL was the gate.
Function Overview
COUNT
COUNT has three forms, and interviewers test all of them. COUNT(*) counts all rows including NULLs. COUNT(column) counts non-NULL values in that column. COUNT(DISTINCT column) counts unique non-NULL values. The difference between COUNT(*) and COUNT(column) is one of the most common SQL interview traps. If a column has 100 rows but 10 are NULL, COUNT(*) returns 100 and COUNT(column) returns 90. Syntax: COUNT(*) | COUNT(column) | COUNT(DISTINCT column) NULL behavior: COUNT(*) includes NULLs. COUNT(column) and COUNT(DISTINCT column) skip NULLs. Interview note: If the question asks 'how many rows,' use COUNT(*). If it asks 'how many users have an email,' use COUNT(email). If it asks 'how many different cities,' use COUNT(DISTINCT city). Getting the wrong COUNT form is an instant red flag.
SUM
SUM adds up numeric values. It ignores NULLs entirely: SUM of (10, 20, NULL) is 30, not NULL. If all values are NULL, SUM returns NULL, not 0. This catches people off guard in interviews because they expect 0. SUM(DISTINCT column) adds only unique values, which is rarely used in practice but occasionally tested. Syntax: SUM(column) | SUM(DISTINCT column) NULL behavior: Skips NULLs. Returns NULL if all values are NULL (not 0). Interview note: Wrap SUM in COALESCE(SUM(x), 0) when downstream code cannot handle NULL. This is especially important in subqueries and CTEs where a NULL propagates through arithmetic and turns entire expressions to NULL.
AVG
AVG computes the arithmetic mean. Like SUM, it ignores NULLs. This means AVG of (10, 20, NULL) is 15 (sum of 30 divided by count of 2), not 10 (sum of 30 divided by count of 3). The denominator is the count of non-NULL values. This behavior has a real impact: if missing data is coded as NULL vs 0, the average changes dramatically. Interviewers often set up scenarios where this distinction matters. Syntax: AVG(column) | AVG(DISTINCT column) NULL behavior: Skips NULLs in both numerator and denominator. Returns NULL if all values are NULL. Interview note: The difference between AVG(bonus) and SUM(bonus)/COUNT(*) is a classic trap. AVG skips NULLs in the denominator. SUM/COUNT(*) includes NULL rows in the denominator (they contribute 0 to SUM but 1 to COUNT). If 5 out of 10 employees have bonuses, AVG(bonus) divides by 5 while SUM(bonus)/COUNT(*) divides by 10.
MIN
MIN returns the smallest value in a set. It works on numbers, strings (alphabetical order), dates, and timestamps. NULLs are ignored. If all values are NULL, MIN returns NULL. For strings, MIN returns the value that sorts first alphabetically (case-sensitive in PostgreSQL, case-insensitive in MySQL with default collation). MIN is frequently combined with GROUP BY to find earliest dates, lowest prices, or first events per entity. Syntax: MIN(column) NULL behavior: Skips NULLs. Returns NULL if all values are NULL. Interview note: MIN(order_date) is the standard way to find the first event per entity. Combined with PARTITION BY in a window function, it gives you the first date per group on every row without collapsing the result set.
MAX
MAX returns the largest value. Same type support as MIN: numbers, strings, dates, timestamps. NULLs are ignored. MAX is the most common way to find the latest timestamp, highest value, or most recent event. In data engineering pipelines, MAX(updated_at) is a standard watermark query for incremental loads: process only rows where updated_at > last known maximum. Syntax: MAX(column) NULL behavior: Skips NULLs. Returns NULL if all values are NULL. Interview note: MAX(updated_at) is the bread-and-butter pattern for incremental data loads. If the interviewer asks about pipeline optimization or avoiding full table scans, mention watermark queries with MAX.
COUNT
-- Three forms of COUNT
SELECT
COUNT(*) AS total_rows, -- 100 (includes NULLs)
COUNT(email) AS has_email, -- 90 (10 NULLs skipped)
COUNT(DISTINCT city) AS cities -- 25 (unique non-NULL cities)
FROM users;SUM
-- Total and distinct sum
SELECT
department,
SUM(salary) AS total_payroll,
SUM(DISTINCT salary) AS unique_salary_total,
COUNT(*) AS headcount
FROM employees
GROUP BY department;
-- SUM with COALESCE to guarantee 0 instead of NULL
SELECT COALESCE(SUM(bonus), 0) AS total_bonus
FROM employees
WHERE department = 'Research';AVG
-- Average salary per department
SELECT
department,
AVG(salary) AS avg_salary,
AVG(salary)::NUMERIC(10,2) AS avg_rounded -- PostgreSQL cast
FROM employees
GROUP BY department;
-- The NULL trap: these return different results
SELECT
AVG(bonus) AS avg_with_nulls_skipped, -- ignores NULLs
SUM(bonus) / COUNT(*) AS avg_with_nulls_as_zero -- treats NULLs as 0
FROM employees;MIN
-- Earliest order per customer
SELECT
customer_id,
MIN(order_date) AS first_order,
MIN(amount) AS smallest_order
FROM orders
GROUP BY customer_id;
-- MIN on strings (alphabetical)
SELECT MIN(last_name) FROM employees; -- 'Adams' before 'Zhang'MAX
-- Latest event per user (watermark pattern)
SELECT
user_id,
MAX(event_timestamp) AS last_seen,
MAX(amount) AS largest_purchase
FROM user_events
GROUP BY user_id;
-- Pipeline watermark
SELECT MAX(updated_at) AS high_watermark
FROM source_table;NULL Handling Summary
NULL handling is the single biggest source of aggregate function bugs. Here is the complete picture in one table.
| Function | Skips NULLs? | All NULLs Result |
|---|---|---|
| COUNT(*) | No | Row count |
| COUNT(col) | Yes | 0 |
| SUM(col) | Yes | NULL |
| AVG(col) | Yes | NULL |
| MIN(col) | Yes | NULL |
| MAX(col) | Yes | NULL |
GROUP BY Patterns with Aggregates
Aggregate functions almost always appear with GROUP BY. These four patterns cover the majority of interview and production scenarios.
Basic GROUP BY with Aggregates
Every column in SELECT that is not inside an aggregate function must appear in GROUP BY. This rule is enforced in PostgreSQL and standard SQL. MySQL historically allowed non-grouped, non-aggregated columns (picking an arbitrary value), but this is a bug magnet and should be avoided.
GROUP BY with HAVING
HAVING filters groups after aggregation. WHERE filters rows before aggregation. This ordering is critical. WHERE cannot reference aggregate results because aggregation has not happened yet when WHERE executes. HAVING can. A common interview question is to find departments where the average salary exceeds a threshold, which requires HAVING.
GROUP BY with Multiple Columns
GROUP BY can accept multiple columns, creating one group for each unique combination. GROUP BY department, fiscal_year produces one row per department-year pair. The grain of your output is determined by the GROUP BY columns. Adding more columns makes the grain finer (more rows, smaller groups). Removing columns makes it coarser (fewer rows, larger groups).
Aggregate Subqueries
A frequent pattern is computing an aggregate in a subquery and using it in the outer query for comparison. For example, finding all employees whose salary is above the company average. The subquery computes the scalar aggregate; the outer query compares each row against it. This can also be done with a window function, but the subquery approach is more portable.
Basic GROUP BY with Aggregates
SELECT
department,
COUNT(*) AS headcount,
AVG(salary) AS avg_salary,
MAX(salary) AS top_salary,
MIN(hire_date) AS earliest_hire
FROM employees
GROUP BY department
ORDER BY headcount DESC;GROUP BY with HAVING
-- Departments with average salary above 100k
SELECT
department,
AVG(salary) AS avg_salary,
COUNT(*) AS headcount
FROM employees
GROUP BY department
HAVING AVG(salary) > 100000;
-- WHERE vs HAVING: different stages
SELECT department, AVG(salary)
FROM employees
WHERE hire_date >= '2020-01-01' -- filters rows BEFORE aggregation
GROUP BY department
HAVING AVG(salary) > 100000; -- filters groups AFTER aggregationGROUP BY with Multiple Columns
SELECT
department,
EXTRACT(YEAR FROM hire_date) AS hire_year,
COUNT(*) AS hires,
AVG(salary) AS avg_starting_salary
FROM employees
GROUP BY department, EXTRACT(YEAR FROM hire_date)
ORDER BY department, hire_year;Aggregate Subqueries
-- Employees earning above company average
SELECT employee_name, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);
-- Equivalent with window function
SELECT employee_name, salary
FROM (
SELECT
employee_name, salary,
AVG(salary) OVER () AS company_avg
FROM employees
) sub
WHERE salary > company_avg;Aggregate Window Function Examples
-- Each employee with their department average and total
SELECT
department,
employee_name,
salary,
AVG(salary) OVER (PARTITION BY department) AS dept_avg,
SUM(salary) OVER (PARTITION BY department) AS dept_total,
COUNT(*) OVER (PARTITION BY department) AS dept_size,
salary - AVG(salary) OVER (PARTITION BY department) AS diff_from_avg
FROM employees;
-- Running total vs static total
SELECT
customer_id,
order_date,
amount,
SUM(amount) OVER (PARTITION BY customer_id) AS lifetime_total,
SUM(amount) OVER (
PARTITION BY customer_id ORDER BY order_date
) AS running_total
FROM orders;5 Aggregate Function Interview Questions
These questions test practical aggregation skills. Each includes the interviewer's intent and a step-by-step approach.
Q1: What is the difference between COUNT(*), COUNT(column), and COUNT(DISTINCT column)?
What they test: This is the single most asked aggregate function question. COUNT(*) counts all rows. COUNT(column) skips NULLs. COUNT(DISTINCT column) counts unique non-NULL values. A complete answer includes a concrete example with NULLs showing different numbers for each form. Approach: Example: a users table with 100 rows, 10 NULL emails, 25 unique cities. COUNT(*) = 100, COUNT(email) = 90, COUNT(DISTINCT city) = 25. Mention that COUNT(*) is the only aggregate that includes NULLs.
Q2: Write a query to find departments where more than 5 employees earn above $100,000.
What they test: This combines WHERE (to filter individual rows by salary), GROUP BY (to aggregate by department), and HAVING (to filter groups by count). The interviewer checks that you know WHERE runs before GROUP BY, and HAVING runs after. Getting the order wrong produces an error or wrong results. Approach: WHERE salary > 100000 filters rows first. GROUP BY department groups the survivors. HAVING COUNT(*) > 5 keeps only departments with more than 5 high earners. SELECT department, COUNT(*) gives the final output.
Q3: Explain why AVG(bonus) and SUM(bonus)/COUNT(*) can return different results.
What they test: NULL handling. AVG(bonus) divides by COUNT(bonus), which excludes NULLs. SUM(bonus)/COUNT(*) divides by the total row count including NULLs. If half the rows have NULL bonuses, the AVG is twice as large as SUM/COUNT(*). This tests whether the candidate truly understands how NULLs propagate through aggregation. Approach: Walk through a concrete example: 4 employees with bonuses (1000, 2000, NULL, NULL). AVG(bonus) = 3000/2 = 1500. SUM(bonus)/COUNT(*) = 3000/4 = 750. The denominator is the difference. AVG uses COUNT(bonus) = 2; COUNT(*) = 4.
Q4: Given an orders table, find each customer's total spend, average order value, and number of orders. Only include customers with at least 3 orders and total spend above $500.
What they test: Practical aggregation with compound HAVING conditions. The interviewer wants GROUP BY customer_id, multiple aggregates in SELECT, and HAVING with two conditions. This tests whether the candidate can compose a complete analytical query from a business requirement. Approach: SELECT customer_id, SUM(amount) AS total_spend, AVG(amount) AS avg_order, COUNT(*) AS order_count FROM orders GROUP BY customer_id HAVING COUNT(*) >= 3 AND SUM(amount) > 500. Note: HAVING uses the aggregate expressions, not the aliases (some engines allow aliases in HAVING, but standard SQL does not).
Q5: What happens when you run SUM on a column where every value is NULL? What about COUNT?
What they test: Edge case knowledge. SUM returns NULL when all values are NULL. COUNT(column) returns 0. COUNT(*) returns the row count regardless of NULLs. This is a gotcha that affects pipeline logic: a CASE expression checking SUM(x) = 0 will not match when all values are NULL because NULL != 0. Approach: SUM(all NULLs) = NULL. AVG(all NULLs) = NULL. MIN(all NULLs) = NULL. MAX(all NULLs) = NULL. COUNT(all NULLs) = 0. COUNT(*) = row count. Use COALESCE(SUM(x), 0) when you need 0 instead of NULL. Mention that this is especially important in LEFT JOINs where the right side has no matches.
Aggregate Functions FAQ
What are SQL aggregate functions?+
How do aggregate functions handle NULL values?+
What is the difference between WHERE and HAVING?+
Can I use aggregate functions without GROUP BY?+
What is the difference between COUNT(*) and COUNT(1)?+
28% of Rounds Hinge on COUNT, SUM, or AVG
- 01
Active recall beats re-reading by 50%
Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom
- 02
76% of hiring managers reject on the coding task, not the resume
From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice
- 03
Five problem shapes cover 80% of data engineer loops
Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition