SQL Aggregate Functions Practice

Aggregation is the highest-frequency SQL category in interviews and the highest-frequency category of silent bugs. COUNT(*) versus COUNT(col), AVG ignoring NULLs in both numerator and denominator, JOIN multiplication inflating SUM. 7 graded problems below cover the patterns; the tables and worked snippets show the gotchas.

Start with problem 1 Aggregate reference

Aggregation in interview SQL

Aggregation share

25%

of SQL interview questions have aggregation as the primary subject. Highest single category.

GROUP BY presence

32%

of SQL questions use GROUP BY somewhere in the answer (often as a supporting clause, not the main test).

COUNT DISTINCT

14%

of aggregation questions explicitly require COUNT(DISTINCT), the most-tested aggregate variant.

NULL-related bugs

~40%

of failed aggregation submissions trace back to NULL semantics — COUNT(*) vs COUNT(col), SUM with all-NULL inputs, AVG dividing by non-NULL count.

Aggregate NULL behavior, all in 1 table

The single source of most aggregate bugs. Memorize before drilling problems.

Function

Counts NULLs?

All-NULL input returns

Interview trap

COUNT(*)

Includes every row

Confused with COUNT(col); they diverge when col has NULL

COUNT(col)

Skips NULLs in col

Used when COUNT(*) was wanted; undercount

COUNT(DISTINCT col)

Skips NULLs

Slow on large columns; sometimes a HyperLogLog approximation is expected

SUM(col)

Skips NULLs in col

NULL (not 0)

Downstream code expects 0; NULL propagates

AVG(col)

Skips in numerator AND denominator

NULL

Differs from SUM(col)/COUNT(*); interviewers test which is wanted

MIN(col), MAX(col)

Skips NULLs

NULL

Generally safe; MIN over empty group returns NULL not 0

STRING_AGG / ARRAY_AGG

Skips NULLs by default

NULL

Order is undefined without ORDER BY inside the agg

FILTER vs CASE WHEN, side by side

-- Same conditional count, 2 ways. The output rows are identical.
-- The query plan often isn't.

-- FILTER (Postgres, DuckDB, ANSI SQL since 2003):
SELECT
  department,
  COUNT(*) FILTER (WHERE status = 'active')  AS active_n,
  COUNT(*) FILTER (WHERE status = 'churned') AS churned_n,
  COUNT(*) FILTER (WHERE status = 'active' AND tier = 'pro') AS pro_active_n
FROM users
GROUP BY department;

-- CASE WHEN (universal; works on Snowflake, BigQuery, MySQL, every Postgres):
SELECT
  department,
  SUM(CASE WHEN status = 'active' THEN 1 ELSE 0 END)  AS active_n,
  SUM(CASE WHEN status = 'churned' THEN 1 ELSE 0 END) AS churned_n,
  SUM(CASE WHEN status = 'active' AND tier = 'pro' THEN 1 ELSE 0 END) AS pro_active_n
FROM users
GROUP BY department;

-- Postgres planner notes:
-- FILTER variant: HashAggregate (1 pass, internal filter predicates)
-- CASE  variant: HashAggregate (1 pass, internal CASE evaluation)
-- Output: identical. Cost: FILTER ~3-5% lower on average.

Both queries produce identical output. FILTER is cleaner on Postgres; CASE is universal across engines.

The JOIN-then-aggregate double-counting bug

-- The double-counting trap: aggregating after a JOIN inflates the metric.

-- WRONG. Joining orders (1 row per order) to order_items (3-7 rows per order)
-- before averaging makes AVG(o.total) average each order N times where
-- N is its item count.
SELECT o.customer_id, AVG(o.total) AS avg_order
FROM orders o
JOIN order_items i ON i.order_id = o.order_id
GROUP BY o.customer_id;

-- RIGHT, option A. Aggregate orders first, then join.
WITH per_customer AS (
  SELECT customer_id, AVG(total) AS avg_order
  FROM orders
  GROUP BY customer_id
)
SELECT pc.customer_id, pc.avg_order, COUNT(DISTINCT i.item_id) AS items
FROM per_customer pc
LEFT JOIN orders o ON o.customer_id = pc.customer_id
LEFT JOIN order_items i ON i.order_id = o.order_id
GROUP BY pc.customer_id, pc.avg_order;

-- RIGHT, option B. Use COUNT(DISTINCT) inside the AVG to deduplicate.
-- Useful when the JOIN is unavoidable.
SELECT o.customer_id,
       SUM(o.total) / COUNT(DISTINCT o.order_id) AS avg_order
FROM orders o
JOIN order_items i ON i.order_id = o.order_id
GROUP BY o.customer_id;

The WRONG version inflates AVG by the average item count per order. Both RIGHT versions return the correct per-customer average.

7 aggregate practice problems

Each targets a specific aggregation gotcha. Submissions run against 10 randomized seeds, including seeds engineered for NULL bias and JOIN duplication.

01Distinguish the 3 COUNT forms

Easy · ~5 min

users(user_id, email, signup_country)

Return total users, users with an email (some are NULL), and distinct signup countries. 3 columns, 1 row.

Seed trap

Seed 5 has ~20% NULL emails. COUNT(*) and COUNT(email) diverge by exactly that amount; the grader checks both numbers explicitly.

COUNT(*)COUNT(col)COUNT(DISTINCT col)

02AVG with NULL bonuses

Easy · ~8 min

employees(emp_id, department, salary, bonus)

Per department, return average salary and average bonus. Treat employees without a bonus as having 0 bonus, not as missing data.

Seed trap

AVG(bonus) skips NULL employees; the prompt says treat NULL as 0. The grader accepts AVG(COALESCE(bonus, 0)) and SUM(bonus)/COUNT(*) but rejects bare AVG(bonus).

NULL semanticsAVG vs SUM/COUNTCOALESCE

03HAVING with multiple aggregate conditions

Easy-Medium · ~10 min

employees(emp_id, department, salary)

Return departments where headcount is at least 5 AND average salary exceeds $80,000.

Seed trap

Putting AVG(salary) > 80000 in WHERE is a syntax error. The grader rejects the query rather than returning empty rows, so the failure is obvious.

HAVINGGROUP BYaggregate predicates

04Conditional aggregation for active vs churned

Medium · ~12 min

users(user_id, signup_month, status)

For each signup_month, return total users, active count, churned count, and active percentage.

Seed trap

Computing active_pct as AVG(CASE WHEN status='active' THEN 1.0 ELSE 0.0 END) silently works but is dialect-fragile. The grader expects FILTER on Postgres and either FILTER or CASE on other engines.

FILTER vs CASE WHENpercent-of-total

05Cohort retention pivot

Medium · ~18 min

user_events(user_id, signup_month, event_month)

Pivot retention so each row is a signup_month with columns month_0..month_3 showing active user counts.

Seed trap

Seed 2 has users with no events past month_1. The grader checks that month_2 and month_3 columns contain 0 (not NULL) for those cohorts.

Pivoting via conditional aggregationCOALESCE on counts

06Multi-level grouping with missing combos

Medium-Hard · ~20 min

sales(sale_id, sale_date, region, revenue)

Return total revenue per (year, quarter, region). Include rows where revenue is 0 for combinations with no sales.

Seed trap

A plain GROUP BY only emits rows that exist. Seed 8 has a region with no sales in Q2; the grader expects a row with revenue = 0, which requires a CROSS JOIN against generate_series.

Multi-column GROUP BYgenerate_seriesmissing-row fill

07AVG order value with JOIN duplication

Hard · ~20 min

orders(order_id, customer_id, total), order_items(item_id, order_id, qty, unit_price)

Return the average order total per customer.

Seed trap

Seed 1 has orders with 1 to 12 items each. Joining before averaging weights orders with more items more. The grader uses a SUM/COUNT(DISTINCT) check; off-by-multiplication is caught explicitly.

JOIN duplicationCOUNT(DISTINCT) inside aggregateCTE pre-aggregation

Aggregate function FAQ

Why do COUNT(*) and COUNT(column) return different numbers?+

COUNT(*) counts every row. COUNT(column) counts only rows where the column is not NULL. If a 100-row table has 10 NULLs in the column, COUNT(*) returns 100 and COUNT(column) returns 90. The distinction is the most-tested NULL gotcha in SQL interviews.

Does AVG include NULL values in the denominator?+

No. AVG(col) skips NULL in both numerator and denominator. AVG over 10 employees where 5 have NULL bonuses returns the sum of the 5 non-NULL bonuses divided by 5, not by 10. If the prompt says 'average across all employees including no-bonus as zero,' the answer is SUM(COALESCE(bonus, 0)) / COUNT(*), not AVG(bonus).

When do I use HAVING instead of WHERE?+

HAVING for conditions on aggregate results (COUNT(*) > 10, AVG(salary) > 80000). WHERE for conditions on raw rows (status = 'active'). Both can appear together: WHERE runs first to filter rows, then GROUP BY, then HAVING to filter groups. An aggregate in WHERE is a syntax error in standards-compliant engines.

What's the difference between FILTER and CASE WHEN?+

Functionally equivalent in result. FILTER (WHERE ...) is the ANSI SQL syntax supported on Postgres 9.4-plus and DuckDB. CASE WHEN inside SUM or COUNT is the universal idiom and is required on Snowflake, BigQuery, and MySQL. FILTER reads cleaner and often optimizes slightly better; use it where supported.

Why is AVG sometimes wrong after a JOIN?+

The JOIN multiplies rows. AVG(o.total) on a query that joined orders to order_items returns the average weighted by item count, not the average per order. 2 fixes: aggregate orders before joining (CTE), or use SUM(o.total) / COUNT(DISTINCT o.order_id) inside the aggregate.

02 / Why practice

Start with problem 1 (COUNT semantics)

01
Active recall beats re-reading by 50%
Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom
02
76% of hiring managers reject on the coding task, not the resume
From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice
03
Five problem shapes cover 80% of data engineer loops
Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition

Open problem 1

Adjacent topics

Aggregate Reference→

COUNT, SUM, AVG, MIN, MAX with syntax, NULL behavior, edge cases.

GROUP BY Practice→

5 problems combining aggregation with HAVING and conditional logic.

NULL Handling→

3-valued logic, COALESCE chains, the NOT IN trap.