SUM and AVG appear in 15% of verified DE SQL interview questions. COUNT shows up in 13%. GROUP BY, their constant companion, runs at 32% in our corpus. Put those numbers together and aggregates show up in roughly half of the SQL rounds you'll face, with failure modes that compound at scale.
Five functions, one guide. NULL handling, GROUP BY mechanics, WHERE versus HAVING, COUNT(DISTINCT) versus COUNT(*). Every one of these has a specific frequency in our dataset and a specific mistake pattern that interviewers watch for.
SUM / AVG question rate: 15%
COUNT questions: 13%
GROUP BY companion rate: 32%
Functions covered here: 5
Source: DataDriven analysis of 1,042 verified data engineering interview rounds.
Five functions, each with its own NULL story. COUNT(*) ignores nothing. COUNT(column) ignores NULLs. AVG divides by the non-null count, which is the single most common source of "off by a tiny bit" reporting bugs in our corpus of graded answers.
COUNT has three forms, and interviewers test all of them. COUNT(*) counts all rows including NULLs. COUNT(column) counts non-NULL values in that column. COUNT(DISTINCT column) counts unique non-NULL values. The difference between COUNT(*) and COUNT(column) is one of the most common SQL interview traps. If a table has 100 rows but 10 values in a column are NULL, COUNT(*) returns 100 and COUNT(column) returns 90.
NULL behavior: COUNT(*) includes NULLs. COUNT(column) and COUNT(DISTINCT column) skip NULLs.
-- Three forms of COUNT
SELECT
COUNT(*) AS total_rows, -- 100 (includes NULLs)
COUNT(email) AS has_email, -- 90 (10 NULLs skipped)
COUNT(DISTINCT city) AS cities -- 25 (unique non-NULL cities)
FROM users;

Interview note: If the question asks 'how many rows,' use COUNT(*). If it asks 'how many users have an email,' use COUNT(email). If it asks 'how many different cities,' use COUNT(DISTINCT city). Getting the wrong COUNT form is an instant red flag.
SUM adds up numeric values. It ignores NULLs entirely: SUM of (10, 20, NULL) is 30, not NULL. If all values are NULL, SUM returns NULL, not 0. This catches people off guard in interviews because they expect 0. SUM(DISTINCT column) adds only unique values, which is rarely used in practice but occasionally tested.
NULL behavior: Skips NULLs. Returns NULL if all values are NULL (not 0).
-- Total and distinct sum
SELECT
department,
SUM(salary) AS total_payroll,
SUM(DISTINCT salary) AS unique_salary_total,
COUNT(*) AS headcount
FROM employees
GROUP BY department;
-- SUM with COALESCE to guarantee 0 instead of NULL
SELECT COALESCE(SUM(bonus), 0) AS total_bonus
FROM employees
WHERE department = 'Research';

Interview note: Wrap SUM in COALESCE(SUM(x), 0) when downstream code cannot handle NULL. This is especially important in subqueries and CTEs, where a NULL propagates through arithmetic and turns entire expressions to NULL.
AVG computes the arithmetic mean. Like SUM, it ignores NULLs. This means AVG of (10, 20, NULL) is 15 (sum of 30 divided by count of 2), not 10 (sum of 30 divided by count of 3). The denominator is the count of non-NULL values. This behavior has a real impact: if missing data is coded as NULL vs 0, the average changes dramatically. Interviewers often set up scenarios where this distinction matters.
NULL behavior: Skips NULLs in both numerator and denominator. Returns NULL if all values are NULL.
-- Average salary per department
SELECT
department,
AVG(salary) AS avg_salary,
AVG(salary)::NUMERIC(10,2) AS avg_rounded -- PostgreSQL cast
FROM employees
GROUP BY department;
-- The NULL trap: these return different results
SELECT
AVG(bonus) AS avg_with_nulls_skipped, -- ignores NULLs
SUM(bonus) * 1.0 / COUNT(*) AS avg_with_nulls_as_zero -- treats NULLs as 0 (* 1.0 avoids integer division)
FROM employees;

Interview note: The difference between AVG(bonus) and SUM(bonus)/COUNT(*) is a classic trap. AVG skips NULLs in the denominator. SUM/COUNT(*) includes NULL rows in the denominator (they contribute 0 to SUM but 1 to COUNT). If 5 out of 10 employees have bonuses, AVG(bonus) divides by 5 while SUM(bonus)/COUNT(*) divides by 10.
MIN returns the smallest value in a set. It works on numbers, strings (alphabetical order), dates, and timestamps. NULLs are ignored. If all values are NULL, MIN returns NULL. For strings, MIN returns the value that sorts first alphabetically (case-sensitive in PostgreSQL, case-insensitive in MySQL with default collation). MIN is frequently combined with GROUP BY to find earliest dates, lowest prices, or first events per entity.
NULL behavior: Skips NULLs. Returns NULL if all values are NULL.
-- Earliest order per customer
SELECT
customer_id,
MIN(order_date) AS first_order,
MIN(amount) AS smallest_order
FROM orders
GROUP BY customer_id;
-- MIN on strings (alphabetical)
SELECT MIN(last_name) FROM employees; -- 'Adams' before 'Zhang'

Interview note: MIN(order_date) is the standard way to find the first event per entity. Combined with PARTITION BY in a window function, it gives you the first date per group on every row without collapsing the result set.
MAX returns the largest value. Same type support as MIN: numbers, strings, dates, timestamps. NULLs are ignored. MAX is the most common way to find the latest timestamp, highest value, or most recent event. In data engineering pipelines, MAX(updated_at) is a standard watermark query for incremental loads: process only rows where updated_at > last known maximum.
NULL behavior: Skips NULLs. Returns NULL if all values are NULL.
-- Latest event per user (watermark pattern)
SELECT
user_id,
MAX(event_timestamp) AS last_seen,
MAX(amount) AS largest_purchase
FROM user_events
GROUP BY user_id;
-- Pipeline watermark
SELECT MAX(updated_at) AS high_watermark
FROM source_table;

Interview note: MAX(updated_at) is the bread-and-butter pattern for incremental data loads. If the interviewer asks about pipeline optimization or avoiding full table scans, mention watermark queries with MAX.
NULL handling is the single biggest source of aggregate function bugs. Here is the complete picture in one table.
| Function | Skips NULLs? | All NULLs Result |
|---|---|---|
| COUNT(*) | No | Row count |
| COUNT(col) | Yes | 0 |
| SUM(col) | Yes | NULL |
| AVG(col) | Yes | NULL |
| MIN(col) | Yes | NULL |
| MAX(col) | Yes | NULL |
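The table above can be verified end to end with a short script. This is a sketch using Python's sqlite3 module, which follows the same NULL semantics as standard SQL for these six forms; the table `t` and column `col` are invented for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col INTEGER)")
conn.executemany("INSERT INTO t (col) VALUES (?)", [(10,), (20,), (None,)])

# Mixed values: only COUNT(*) sees the NULL row
mixed = conn.execute(
    "SELECT COUNT(*), COUNT(col), SUM(col), AVG(col), MIN(col), MAX(col) FROM t"
).fetchone()
print(mixed)  # (3, 2, 30, 15.0, 10, 20)

# All-NULL column: COUNT(col) is 0, the rest return NULL (None in Python)
conn.execute("DELETE FROM t")
conn.executemany("INSERT INTO t (col) VALUES (?)", [(None,), (None,)])
all_null = conn.execute(
    "SELECT COUNT(*), COUNT(col), SUM(col), AVG(col), MIN(col), MAX(col) FROM t"
).fetchone()
print(all_null)  # (2, 0, None, None, None, None)
```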
Aggregate functions almost always appear with GROUP BY. These four patterns cover the majority of interview and production scenarios.
Every column in SELECT that is not inside an aggregate function must appear in GROUP BY. This rule is enforced in PostgreSQL and standard SQL. MySQL historically allowed non-grouped, non-aggregated columns (picking an arbitrary value), but this is a bug magnet, and modern MySQL rejects it by default via the ONLY_FULL_GROUP_BY mode.
SELECT
department,
COUNT(*) AS headcount,
AVG(salary) AS avg_salary,
MAX(salary) AS top_salary,
MIN(hire_date) AS earliest_hire
FROM employees
GROUP BY department
ORDER BY headcount DESC;

HAVING filters groups after aggregation. WHERE filters rows before aggregation. This ordering is critical: WHERE cannot reference aggregate results because aggregation has not happened yet; HAVING can. A common interview question is to find departments where the average salary exceeds a threshold, which requires HAVING.
-- Departments with average salary above 100k
SELECT
department,
AVG(salary) AS avg_salary,
COUNT(*) AS headcount
FROM employees
GROUP BY department
HAVING AVG(salary) > 100000;
-- WHERE vs HAVING: different stages
SELECT department, AVG(salary)
FROM employees
WHERE hire_date >= '2020-01-01' -- filters rows BEFORE aggregation
GROUP BY department
HAVING AVG(salary) > 100000; -- filters groups AFTER aggregation

GROUP BY can accept multiple columns, creating one group for each unique combination. GROUP BY department, fiscal_year produces one row per department-year pair. The grain of your output is determined by the GROUP BY columns. Adding more columns makes the grain finer (more rows, smaller groups). Removing columns makes it coarser (fewer rows, larger groups).
SELECT
department,
EXTRACT(YEAR FROM hire_date) AS hire_year,
COUNT(*) AS hires,
AVG(salary) AS avg_starting_salary
FROM employees
GROUP BY department, EXTRACT(YEAR FROM hire_date)
ORDER BY department, hire_year;

A frequent pattern is computing an aggregate in a subquery and using it in the outer query for comparison: for example, finding all employees whose salary is above the company average. The subquery computes the scalar aggregate; the outer query compares each row against it. This can also be done with a window function, but the subquery approach is more portable.
-- Employees earning above company average
SELECT employee_name, salary
FROM employees
WHERE salary > (SELECT AVG(salary) FROM employees);
-- Equivalent with window function
SELECT employee_name, salary
FROM (
SELECT
employee_name, salary,
AVG(salary) OVER () AS company_avg
FROM employees
) sub
WHERE salary > company_avg;

Every aggregate function can also be used as a window function by adding an OVER clause. This computes the aggregate without collapsing rows. The result is attached to every row as an additional column.
-- Each employee with their department average and total
SELECT
department,
employee_name,
salary,
AVG(salary) OVER (PARTITION BY department) AS dept_avg,
SUM(salary) OVER (PARTITION BY department) AS dept_total,
COUNT(*) OVER (PARTITION BY department) AS dept_size,
salary - AVG(salary) OVER (PARTITION BY department) AS diff_from_avg
FROM employees;

Adding ORDER BY inside the OVER clause changes the behavior from a static aggregate to a running aggregate. SUM with ORDER BY produces a running total; AVG with ORDER BY produces a running average. The default frame with ORDER BY (UNBOUNDED PRECEDING to CURRENT ROW) surprises many candidates.
-- Running total vs static total
SELECT
customer_id,
order_date,
amount,
SUM(amount) OVER (PARTITION BY customer_id) AS lifetime_total,
SUM(amount) OVER (
PARTITION BY customer_id ORDER BY order_date
) AS running_total
FROM orders;

These questions test practical aggregation skills. Each includes the interviewer's intent and a step-by-step approach.
Question: What is the difference between COUNT(*), COUNT(column), and COUNT(DISTINCT column)?

What they test:
This is the single most asked aggregate function question. COUNT(*) counts all rows. COUNT(column) skips NULLs. COUNT(DISTINCT column) counts unique non-NULL values. A complete answer includes a concrete example with NULLs showing different numbers for each form.
Approach:
Example: a users table with 100 rows, 10 NULL emails, 25 unique cities. COUNT(*) = 100, COUNT(email) = 90, COUNT(DISTINCT city) = 25. Mention that COUNT(*) is the only aggregate that includes NULLs.
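That example can be scaled down and run directly. This is a sketch via Python's sqlite3 with a four-row invented users table; the counts differ for exactly the reasons above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO users (email, city) VALUES (?, ?)",
    [("a@x.com", "Austin"), ("b@x.com", "Boston"),
     (None, "Austin"), ("c@x.com", None)],
)

# COUNT(*) sees all 4 rows; COUNT(email) skips the NULL email;
# COUNT(DISTINCT city) counts Austin and Boston, skipping the NULL city
total, with_email, cities = conn.execute(
    "SELECT COUNT(*), COUNT(email), COUNT(DISTINCT city) FROM users"
).fetchone()
print(total, with_email, cities)  # 4 3 2
```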
Question: Find departments where more than 5 employees earn over 100k.

What they test:
This combines WHERE (to filter individual rows by salary), GROUP BY (to aggregate by department), and HAVING (to filter groups by count). The interviewer checks that you know WHERE runs before GROUP BY, and HAVING runs after. Getting the order wrong produces an error or wrong results.
Approach:
WHERE salary > 100000 filters rows first. GROUP BY department groups the survivors. HAVING COUNT(*) > 5 keeps only departments with more than 5 high earners. SELECT department, COUNT(*) gives the final output.
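The steps above can be sketched as a runnable example through Python's sqlite3, with invented data (six Engineering high earners, two in Sales) so only one department survives the HAVING filter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (department TEXT, salary INTEGER)")
rows = ([("Engineering", 120000)] * 6 + [("Engineering", 90000)]
        + [("Sales", 110000)] * 2 + [("Sales", 80000)])
conn.executemany("INSERT INTO employees VALUES (?, ?)", rows)

result = conn.execute("""
    SELECT department, COUNT(*) AS high_earners
    FROM employees
    WHERE salary > 100000      -- row filter runs BEFORE grouping
    GROUP BY department
    HAVING COUNT(*) > 5        -- group filter runs AFTER aggregation
""").fetchall()
print(result)  # [('Engineering', 6)]
```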
Question: Why do AVG(bonus) and SUM(bonus)/COUNT(*) return different results?

What they test:
NULL handling. AVG(bonus) divides by COUNT(bonus), which excludes NULLs. SUM(bonus)/COUNT(*) divides by the total row count including NULLs. If half the rows have NULL bonuses, the AVG is twice as large as SUM/COUNT(*). This tests whether the candidate truly understands how NULLs propagate through aggregation.
Approach:
Walk through a concrete example: 4 employees with bonuses (1000, 2000, NULL, NULL). AVG(bonus) = 3000/2 = 1500. SUM(bonus)/COUNT(*) = 3000/4 = 750. The denominator is the difference. AVG uses COUNT(bonus) = 2; COUNT(*) = 4.
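That walkthrough checks out in code. This sketch uses Python's sqlite3 with the example's four invented rows; the multiplication by 1.0 guards against integer division.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (bonus INTEGER)")
conn.executemany("INSERT INTO employees (bonus) VALUES (?)",
                 [(1000,), (2000,), (None,), (None,)])

# AVG divides by COUNT(bonus) = 2; SUM/COUNT(*) divides by COUNT(*) = 4
avg_skip, avg_all = conn.execute(
    "SELECT AVG(bonus), SUM(bonus) * 1.0 / COUNT(*) FROM employees"
).fetchone()
print(avg_skip, avg_all)  # 1500.0 750.0
```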
Question: Find customers with at least 3 orders and total spend over 500, returning total spend, average order value, and order count.

What they test:
Practical aggregation with compound HAVING conditions. The interviewer wants GROUP BY customer_id, multiple aggregates in SELECT, and HAVING with two conditions. This tests whether the candidate can compose a complete analytical query from a business requirement.
Approach:
SELECT customer_id, SUM(amount) AS total_spend, AVG(amount) AS avg_order, COUNT(*) AS order_count FROM orders GROUP BY customer_id HAVING COUNT(*) >= 3 AND SUM(amount) > 500. Note: HAVING uses the aggregate expressions, not the aliases (some engines allow aliases in HAVING, but standard SQL does not).
Question: What does each aggregate function return when all values in the column are NULL?

What they test:
Edge case knowledge. SUM returns NULL when all values are NULL. COUNT(column) returns 0. COUNT(*) returns the row count regardless of NULLs. This is a gotcha that affects pipeline logic: a CASE expression checking SUM(x) = 0 will not match when all values are NULL because NULL != 0.
Approach:
SUM(all NULLs) = NULL. AVG(all NULLs) = NULL. MIN(all NULLs) = NULL. MAX(all NULLs) = NULL. COUNT(all NULLs) = 0. COUNT(*) = row count. Use COALESCE(SUM(x), 0) when you need 0 instead of NULL. Mention that this is especially important in LEFT JOINs where the right side has no matches.
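A minimal sketch of both gotchas, assuming SQLite via Python's sqlite3: the all-NULL SUM, the CASE comparison that silently misses it, and the COALESCE fix.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t (x) VALUES (?)", [(None,), (None,)])

raw_sum, safe_sum, check = conn.execute("""
    SELECT SUM(x),                              -- NULL, not 0
           COALESCE(SUM(x), 0),                 -- forced to 0
           CASE WHEN SUM(x) = 0 THEN 'zero'
                ELSE 'not zero' END             -- NULL = 0 is not true
    FROM t
""").fetchone()
print(raw_sum, safe_sum, check)  # None 0 not zero
```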
Those are the combined frequencies from our verified corpus. Practice the NULL cases first, the DISTINCT variants second, and watch your error rate crash.