SQL Practice

SQL Aggregate Functions

SUM and AVG appear in 15% of verified DE SQL interview questions. COUNT shows up in 13%. GROUP BY, their constant companion, runs at 32% in our corpus. Put those numbers together and aggregates sit inside something like half of every SQL round you'll ever do, with failure modes that compound at scale.

Five functions, one guide. NULL handling, GROUP BY mechanics, WHERE versus HAVING, COUNT(DISTINCT) versus COUNT(*). Every one of these has a specific frequency in our dataset and a specific mistake pattern that interviewers watch for.

15%

SUM / AVG question rate

13%

COUNT questions

32%

GROUP BY companion rate

5

Functions covered here

Source: DataDriven analysis of 1,042 verified data engineering interview rounds.

The Five Aggregate Functions

Five functions, each with its own NULL story. COUNT(*) ignores nothing. COUNT(column) ignores NULLs. AVG divides by the non-null count, which is the single most common source of "off by a tiny bit" reporting bugs in our corpus of graded answers.

COUNT

COUNT has three forms, and interviewers test all of them. COUNT(*) counts all rows including NULLs. COUNT(column) counts non-NULL values in that column. COUNT(DISTINCT column) counts unique non-NULL values. The difference between COUNT(*) and COUNT(column) is one of the most common SQL interview traps. If a column has 100 rows but 10 are NULL, COUNT(*) returns 100 and COUNT(column) returns 90.

Syntax: COUNT(*) | COUNT(column) | COUNT(DISTINCT column)

NULL behavior: COUNT(*) includes NULLs. COUNT(column) and COUNT(DISTINCT column) skip NULLs.

-- Three forms of COUNT
SELECT
  COUNT(*) AS total_rows,           -- 100 (includes NULLs)
  COUNT(email) AS has_email,        -- 90 (10 NULLs skipped)
  COUNT(DISTINCT city) AS cities    -- 25 (unique non-NULL cities)
FROM users;

Interview note: If the question asks 'how many rows,' use COUNT(*). If it asks 'how many users have an email,' use COUNT(email). If it asks 'how many different cities,' use COUNT(DISTINCT city). Getting the wrong COUNT form is an instant red flag.

SUM

SUM adds up numeric values. It ignores NULLs entirely: SUM of (10, 20, NULL) is 30, not NULL. If all values are NULL, SUM returns NULL, not 0. This catches people off guard in interviews because they expect 0. SUM(DISTINCT column) adds only unique values, which is rarely used in practice but occasionally tested.

Syntax: SUM(column) | SUM(DISTINCT column)

NULL behavior: Skips NULLs. Returns NULL if all values are NULL (not 0).

-- Total and distinct sum
SELECT
  department,
  SUM(salary) AS total_payroll,
  SUM(DISTINCT salary) AS unique_salary_total,
  COUNT(*) AS headcount
FROM employees
GROUP BY department;

-- SUM with COALESCE to guarantee 0 instead of NULL
SELECT COALESCE(SUM(bonus), 0) AS total_bonus
FROM employees
WHERE department = 'Research';

Interview note: Wrap SUM in COALESCE(SUM(x), 0) when downstream code cannot handle NULL. This is especially important in subqueries and CTEs where a NULL propagates through arithmetic and turns entire expressions to NULL.

AVG

AVG computes the arithmetic mean. Like SUM, it ignores NULLs. This means AVG of (10, 20, NULL) is 15 (sum of 30 divided by count of 2), not 10 (sum of 30 divided by count of 3). The denominator is the count of non-NULL values. This behavior has a real impact: if missing data is coded as NULL vs 0, the average changes dramatically. Interviewers often set up scenarios where this distinction matters.

Syntax: AVG(column) | AVG(DISTINCT column)

NULL behavior: Skips NULLs in both numerator and denominator. Returns NULL if all values are NULL.

-- Average salary per department
SELECT
  department,
  AVG(salary) AS avg_salary,
  AVG(salary)::NUMERIC(10,2) AS avg_rounded  -- PostgreSQL cast
FROM employees
GROUP BY department;

-- The NULL trap: these return different results
SELECT
  AVG(bonus) AS avg_with_nulls_skipped,  -- ignores NULLs
  SUM(bonus) / COUNT(*) AS avg_with_nulls_as_zero  -- treats NULLs as 0
FROM employees;

Interview note: The difference between AVG(bonus) and SUM(bonus)/COUNT(*) is a classic trap. AVG skips NULLs in the denominator. SUM/COUNT(*) includes NULL rows in the denominator (they contribute 0 to SUM but 1 to COUNT). If 5 out of 10 employees have bonuses, AVG(bonus) divides by 5 while SUM(bonus)/COUNT(*) divides by 10.

MIN

MIN returns the smallest value in a set. It works on numbers, strings (alphabetical order), dates, and timestamps. NULLs are ignored. If all values are NULL, MIN returns NULL. For strings, MIN returns the value that sorts first alphabetically (case-sensitive in PostgreSQL, case-insensitive in MySQL with default collation). MIN is frequently combined with GROUP BY to find earliest dates, lowest prices, or first events per entity.

Syntax: MIN(column)

NULL behavior: Skips NULLs. Returns NULL if all values are NULL.

-- Earliest order per customer
SELECT
  customer_id,
  MIN(order_date) AS first_order,
  MIN(amount) AS smallest_order
FROM orders
GROUP BY customer_id;

-- MIN on strings (alphabetical)
SELECT MIN(last_name) FROM employees;  -- 'Adams' before 'Zhang'

Interview note: MIN(order_date) is the standard way to find the first event per entity. Combined with PARTITION BY in a window function, it gives you the first date per group on every row without collapsing the result set.

MAX

MAX returns the largest value. Same type support as MIN: numbers, strings, dates, timestamps. NULLs are ignored. MAX is the most common way to find the latest timestamp, highest value, or most recent event. In data engineering pipelines, MAX(updated_at) is a standard watermark query for incremental loads: process only rows where updated_at > last known maximum.

Syntax: MAX(column)

NULL behavior: Skips NULLs. Returns NULL if all values are NULL.

-- Latest event per user (watermark pattern)
SELECT
  user_id,
  MAX(event_timestamp) AS last_seen,
  MAX(amount) AS largest_purchase
FROM user_events
GROUP BY user_id;

-- Pipeline watermark
SELECT MAX(updated_at) AS high_watermark
FROM source_table;

Interview note: MAX(updated_at) is the bread-and-butter pattern for incremental data loads. If the interviewer asks about pipeline optimization or avoiding full table scans, mention watermark queries with MAX.

NULL Handling Summary

NULL handling is the single biggest source of aggregate function bugs. Here is the complete picture in one table.

FunctionSkips NULLs?All NULLs Result
COUNT(*)NoRow count
COUNT(col)Yes0
SUM(col)YesNULL
AVG(col)YesNULL
MIN(col)YesNULL
MAX(col)YesNULL

GROUP BY Patterns with Aggregates

Aggregate functions almost always appear with GROUP BY. These four patterns cover the majority of interview and production scenarios.

Basic GROUP BY with Aggregates

Every column in SELECT that is not inside an aggregate function must appear in GROUP BY. This rule is enforced in PostgreSQL and standard SQL. MySQL historically allowed non-grouped, non-aggregated columns (picking an arbitrary value), but this is a bug magnet and should be avoided.

SELECT
  department,
  COUNT(*) AS headcount,
  AVG(salary) AS avg_salary,
  MAX(salary) AS top_salary,
  MIN(hire_date) AS earliest_hire
FROM employees
GROUP BY department
ORDER BY headcount DESC;

GROUP BY with HAVING

HAVING filters groups after aggregation. WHERE filters rows before aggregation. This ordering is critical. WHERE cannot reference aggregate results because aggregation has not happened yet. HAVING can. A common interview question is to find departments where the average salary exceeds a threshold, which requires HAVING.

-- Departments with average salary above 100k
SELECT
  department,
  AVG(salary) AS avg_salary,
  COUNT(*) AS headcount
FROM employees
GROUP BY department
HAVING AVG(salary) > 100000;

-- WHERE vs HAVING: different stages
SELECT department, AVG(salary)
FROM employees
WHERE hire_date >= '2020-01-01'   -- filters rows BEFORE aggregation
GROUP BY department
HAVING AVG(salary) > 100000;     -- filters groups AFTER aggregation

GROUP BY with Multiple Columns

GROUP BY can accept multiple columns, creating one group for each unique combination. GROUP BY department, fiscal_year produces one row per department-year pair. The grain of your output is determined by the GROUP BY columns. Adding more columns makes the grain finer (more rows, smaller groups). Removing columns makes it coarser (fewer rows, larger groups).

SELECT
  department,
  EXTRACT(YEAR FROM hire_date) AS hire_year,
  COUNT(*) AS hires,
  AVG(salary) AS avg_starting_salary
FROM employees
GROUP BY department, EXTRACT(YEAR FROM hire_date)
ORDER BY department, hire_year;

Aggregate Subqueries

A frequent pattern is computing an aggregate in a subquery and using it in the outer query for comparison. For example, finding all employees whose salary is above the company average. The subquery computes the scalar aggregate; the outer query compares each row against it. This can also be done with a window function, but the subquery approach is more portable.

-- Employees earning above company average
SELECT employee_name, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);

-- Equivalent with window function
SELECT employee_name, salary
FROM (
  SELECT
    employee_name, salary,
    AVG(salary) OVER () AS company_avg
  FROM employees
) sub
WHERE salary > company_avg;

Aggregates as Window Functions

Every aggregate function can also be used as a window function by adding an OVER clause. This computes the aggregate without collapsing rows. The result is attached to every row as an additional column.

-- Each employee with their department average and total
SELECT
  department,
  employee_name,
  salary,
  AVG(salary) OVER (PARTITION BY department) AS dept_avg,
  SUM(salary) OVER (PARTITION BY department) AS dept_total,
  COUNT(*) OVER (PARTITION BY department) AS dept_size,
  salary - AVG(salary) OVER (PARTITION BY department) AS diff_from_avg
FROM employees;

Adding ORDER BY inside the OVER clause changes the behavior from a static aggregate to a running aggregate. SUM with ORDER BY produces a running total. AVG with ORDER BY produces a running average. This default frame behavior (UNBOUNDED PRECEDING to CURRENT ROW) surprises many candidates.

-- Running total vs static total
SELECT
  customer_id,
  order_date,
  amount,
  SUM(amount) OVER (PARTITION BY customer_id) AS lifetime_total,
  SUM(amount) OVER (
    PARTITION BY customer_id ORDER BY order_date
  ) AS running_total
FROM orders;

5 Aggregate Function Interview Questions

These questions test practical aggregation skills. Each includes the interviewer's intent and a step-by-step approach.

Q1: What is the difference between COUNT(*), COUNT(column), and COUNT(DISTINCT column)?

What they test:

This is the single most asked aggregate function question. COUNT(*) counts all rows. COUNT(column) skips NULLs. COUNT(DISTINCT column) counts unique non-NULL values. A complete answer includes a concrete example with NULLs showing different numbers for each form.

Approach:

Example: a users table with 100 rows, 10 NULL emails, 25 unique cities. COUNT(*) = 100, COUNT(email) = 90, COUNT(DISTINCT city) = 25. Mention that COUNT(*) is the only aggregate that includes NULLs.

Q2: Write a query to find departments where more than 5 employees earn above $100,000.

What they test:

This combines WHERE (to filter individual rows by salary), GROUP BY (to aggregate by department), and HAVING (to filter groups by count). The interviewer checks that you know WHERE runs before GROUP BY, and HAVING runs after. Getting the order wrong produces an error or wrong results.

Approach:

WHERE salary > 100000 filters rows first. GROUP BY department groups the survivors. HAVING COUNT(*) > 5 keeps only departments with more than 5 high earners. SELECT department, COUNT(*) gives the final output.

Q3: Explain why AVG(bonus) and SUM(bonus)/COUNT(*) can return different results.

What they test:

NULL handling. AVG(bonus) divides by COUNT(bonus), which excludes NULLs. SUM(bonus)/COUNT(*) divides by the total row count including NULLs. If half the rows have NULL bonuses, the AVG is twice as large as SUM/COUNT(*). This tests whether the candidate truly understands how NULLs propagate through aggregation.

Approach:

Walk through a concrete example: 4 employees with bonuses (1000, 2000, NULL, NULL). AVG(bonus) = 3000/2 = 1500. SUM(bonus)/COUNT(*) = 3000/4 = 750. The denominator is the difference. AVG uses COUNT(bonus) = 2; COUNT(*) = 4.

Q4: Given an orders table, find each customer's total spend, average order value, and number of orders. Only include customers with at least 3 orders and total spend above $500.

What they test:

Practical aggregation with compound HAVING conditions. The interviewer wants GROUP BY customer_id, multiple aggregates in SELECT, and HAVING with two conditions. This tests whether the candidate can compose a complete analytical query from a business requirement.

Approach:

SELECT customer_id, SUM(amount) AS total_spend, AVG(amount) AS avg_order, COUNT(*) AS order_count FROM orders GROUP BY customer_id HAVING COUNT(*) >= 3 AND SUM(amount) > 500. Note: HAVING uses the aggregate expressions, not the aliases (some engines allow aliases in HAVING, but standard SQL does not).

Q5: What happens when you run SUM on a column where every value is NULL? What about COUNT?

What they test:

Edge case knowledge. SUM returns NULL when all values are NULL. COUNT(column) returns 0. COUNT(*) returns the row count regardless of NULLs. This is a gotcha that affects pipeline logic: a CASE expression checking SUM(x) = 0 will not match when all values are NULL because NULL != 0.

Approach:

SUM(all NULLs) = NULL. AVG(all NULLs) = NULL. MIN(all NULLs) = NULL. MAX(all NULLs) = NULL. COUNT(all NULLs) = 0. COUNT(*) = row count. Use COALESCE(SUM(x), 0) when you need 0 instead of NULL. Mention that this is especially important in LEFT JOINs where the right side has no matches.

Aggregate Functions FAQ

What are SQL aggregate functions?+
Aggregate functions compute a single value from a set of rows. The five standard aggregate functions are COUNT, SUM, AVG, MIN, and MAX. They are used with GROUP BY to compute values per group, or without GROUP BY to compute a single value for the entire result set. All standard aggregates (except COUNT(*)) ignore NULL values.
How do aggregate functions handle NULL values?+
All aggregate functions except COUNT(*) skip NULL values entirely. COUNT(*) counts all rows including those with NULLs. COUNT(column) counts only non-NULL values. SUM, AVG, MIN, and MAX ignore NULLs. If all values are NULL, SUM, AVG, MIN, and MAX return NULL, while COUNT returns 0. This NULL-skipping behavior affects AVG the most: the denominator excludes NULL rows, so the average may be higher than expected.
What is the difference between WHERE and HAVING?+
WHERE filters individual rows before aggregation. HAVING filters groups after aggregation. WHERE cannot reference aggregate functions because aggregation has not happened yet when WHERE executes. HAVING can reference aggregate results like COUNT(*) > 5 or AVG(salary) > 100000. Execution order: FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY.
Can I use aggregate functions without GROUP BY?+
Yes. Without GROUP BY, the aggregate function operates on the entire result set and returns a single row. SELECT COUNT(*) FROM users returns one row with the total count. SELECT AVG(salary) FROM employees returns one row with the company-wide average. If you add non-aggregated columns to SELECT without GROUP BY, most engines will throw an error (PostgreSQL, SQL Server) or return unpredictable results (older MySQL versions).
What is the difference between COUNT(*) and COUNT(1)?+
There is no practical difference. Both count all rows including those with NULLs. COUNT(1) does not count the literal value 1; it counts rows where the expression 1 is not NULL, which is always true. Modern query optimizers treat COUNT(*) and COUNT(1) identically. Use COUNT(*) because it is more readable and its intent is clearer.

28% of Rounds Hinge on COUNT, SUM, or AVG

That's combined frequency from our verified corpus. Practice the NULL cases first, the DISTINCT variants second, and watch your error rate crash.