SQL Aggregate Functions Practice
Aggregation is the highest-frequency SQL category in interviews and the highest-frequency category of silent bugs. COUNT(*) versus COUNT(col), AVG ignoring NULLs in both numerator and denominator, JOIN multiplication inflating SUM. 7 graded problems below cover the patterns; the tables and worked snippets show the gotchas.
Aggregation is the highest-frequency SQL category in interviews and the highest-frequency category of silent bugs. COUNT(*) versus COUNT(col), AVG ignoring NULLs in both numerator and denominator, JOIN multiplication inflating SUM. 7 graded problems below cover the patterns; the tables and worked snippets show the gotchas.
Know aggregate functions the way the interviewer who asks it knows it.
Aggregation in interview SQL
of SQL interview questions have aggregation as the primary subject. Highest single category.
of SQL questions use GROUP BY somewhere in the answer (often as a supporting clause, not the main test).
of aggregation questions explicitly require COUNT(DISTINCT), the most-tested aggregate variant.
of failed aggregation submissions trace back to NULL semantics (the table above).
Aggregate NULL behavior, all in 1 table
The single source of most aggregate bugs. Memorize before drilling problems.
FILTER vs CASE WHEN, side by side
-- Same conditional count, 2 ways. The output rows are identical.
-- The query plan often isn't.
-- FILTER (Postgres, DuckDB, ANSI SQL since 2003):
SELECT
department,
COUNT(*) FILTER (WHERE status = 'active') AS active_n,
COUNT(*) FILTER (WHERE status = 'churned') AS churned_n,
COUNT(*) FILTER (WHERE status = 'active' AND tier = 'pro') AS pro_active_n
FROM users
GROUP BY department;
-- CASE WHEN (universal; works on Snowflake, BigQuery, MySQL, every Postgres):
SELECT
department,
SUM(CASE WHEN status = 'active' THEN 1 ELSE 0 END) AS active_n,
SUM(CASE WHEN status = 'churned' THEN 1 ELSE 0 END) AS churned_n,
SUM(CASE WHEN status = 'active' AND tier = 'pro' THEN 1 ELSE 0 END) AS pro_active_n
FROM users
GROUP BY department;
-- Postgres planner notes:
-- FILTER variant: HashAggregate (1 pass, internal filter predicates)
-- CASE variant: HashAggregate (1 pass, internal CASE evaluation)
-- Output: identical. Cost: FILTER ~3-5% lower on average.Both queries produce identical output. FILTER is cleaner on Postgres; CASE is universal across engines.
The JOIN-then-aggregate double-counting bug
-- The double-counting trap: aggregating after a JOIN inflates the metric.
-- WRONG. Joining orders (1 row per order) to order_items (3-7 rows per order)
-- before averaging makes AVG(o.total) average each order N times where
-- N is its item count.
SELECT o.customer_id, AVG(o.total) AS avg_order
FROM orders o
JOIN order_items i ON i.order_id = o.order_id
GROUP BY o.customer_id;
-- RIGHT, option A. Aggregate orders first, then join.
WITH per_customer AS (
SELECT customer_id, AVG(total) AS avg_order
FROM orders
GROUP BY customer_id
)
SELECT pc.customer_id, pc.avg_order, COUNT(DISTINCT i.item_id) AS items
FROM per_customer pc
LEFT JOIN orders o ON o.customer_id = pc.customer_id
LEFT JOIN order_items i ON i.order_id = o.order_id
GROUP BY pc.customer_id, pc.avg_order;
-- RIGHT, option B. Use COUNT(DISTINCT) inside the AVG to deduplicate.
-- Useful when the JOIN is unavoidable.
SELECT o.customer_id,
SUM(o.total) / COUNT(DISTINCT o.order_id) AS avg_order
FROM orders o
JOIN order_items i ON i.order_id = o.order_id
GROUP BY o.customer_id;The WRONG version inflates AVG by the average item count per order. Both RIGHT versions return the correct per-customer average.
7 aggregate practice problems
Each targets a specific aggregation gotcha. Submissions run against 10 randomized seeds, including seeds engineered for NULL bias and JOIN duplication.
Return total users, users with an email (some are NULL), and distinct signup countries. 3 columns, 1 row.
Seed 5 has ~20% NULL emails. COUNT(*) and COUNT(email) diverge by exactly that amount; the grader checks both numbers explicitly.
Per department, return average salary and average bonus. Treat employees without a bonus as having 0 bonus, not as missing data.
AVG(bonus) skips NULL employees; the prompt says treat NULL as 0. The grader accepts AVG(COALESCE(bonus, 0)) and SUM(bonus)/COUNT(*) but rejects bare AVG(bonus).
Return departments where headcount is at least 5 AND average salary exceeds $80,000.
Putting AVG(salary) > 80000 in WHERE is a syntax error. The grader rejects the query rather than returning empty rows, so the failure is obvious.
For each signup_month, return total users, active count, churned count, and active percentage.
Computing active_pct as AVG(CASE WHEN status='active' THEN 1.0 ELSE 0.0 END) silently works but is dialect-fragile. The grader expects FILTER on Postgres and either FILTER or CASE on other engines.
Pivot retention so each row is a signup_month with columns month_0..month_3 showing active user counts.
Seed 2 has users with no events past month_1. The grader checks that month_2 and month_3 columns contain 0 (not NULL) for those cohorts.
Return total revenue per (year, quarter, region). Include rows where revenue is 0 for combinations with no sales.
A plain GROUP BY only emits rows that exist. Seed 8 has a region with no sales in Q2; the grader expects a row with revenue = 0, which requires a CROSS JOIN against generate_series.
Return the average order total per customer.
Seed 1 has orders with 1 to 12 items each. Joining before averaging weights orders with more items more. The grader uses a SUM/COUNT(DISTINCT) check; off-by-multiplication is caught explicitly.
Aggregate function FAQ
Why do COUNT(*) and COUNT(column) return different numbers?+
Does AVG include NULL values in the denominator?+
When do I use HAVING instead of WHERE?+
What's the difference between FILTER and CASE WHEN?+
Why is AVG sometimes wrong after a JOIN?+
Start with problem 1 (COUNT semantics)
- 01
Active recall beats re-reading by 50%
Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom
- 02
76% of hiring managers reject on the coding task, not the resume
From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice
- 03
Five problem shapes cover 80% of data engineer loops
Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition