SQL Aggregate Functions Practice

Aggregation is the highest-frequency SQL category in interviews and the highest-frequency category of silent bugs. COUNT(*) versus COUNT(col), AVG ignoring NULLs in both numerator and denominator, JOIN multiplication inflating SUM. 7 graded problems below cover the patterns; the tables and worked snippets show the gotchas.

Aggregation is the highest-frequency SQL category in interviews and the highest-frequency category of silent bugs. COUNT(*) versus COUNT(col), AVG ignoring NULLs in both numerator and denominator, JOIN multiplication inflating SUM. 7 graded problems below cover the patterns; the tables and worked snippets show the gotchas.

Prepare for the interview
01 / Open invite
02min.

Know aggregate functions the way the interviewer who asks it knows it.

a aggregate functions query, the same shape a screen would give you.
The diff against expected. Where ties broke. What you missed.
sandbox
1SELECT user_id,
2 COUNT(*) AS sessions
3FROM events
4WHERE ts >= NOW() - INTERVAL '7 day'
5
Execute your solution0.4s avg.
PayPalInterview question
Solve a aggregate functions problem

Aggregation in interview SQL

Aggregation share
25%

of SQL interview questions have aggregation as the primary subject. Highest single category.

GROUP BY presence
32%

of SQL questions use GROUP BY somewhere in the answer (often as a supporting clause, not the main test).

COUNT DISTINCT
14%

of aggregation questions explicitly require COUNT(DISTINCT), the most-tested aggregate variant.

NULL-related bugs
~40%

of failed aggregation submissions trace back to NULL semantics (the table above).

Aggregate NULL behavior, all in 1 table

The single source of most aggregate bugs. Memorize before drilling problems.

Function
Counts NULLs?
All-NULL input returns
Interview trap
COUNT(*)
Includes every row
0
Confused with COUNT(col); they diverge when col has NULL
COUNT(col)
Skips NULLs in col
0
Used when COUNT(*) was wanted; undercount
COUNT(DISTINCT col)
Skips NULLs
0
Slow on large columns; sometimes a HyperLogLog approximation is expected
SUM(col)
Skips NULLs in col
NULL (not 0)
Downstream code expects 0; NULL propagates
AVG(col)
Skips in numerator AND denominator
NULL
Differs from SUM(col)/COUNT(*); interviewers test which is wanted
MIN(col), MAX(col)
Skips NULLs
NULL
Generally safe; MIN over empty group returns NULL not 0
STRING_AGG / ARRAY_AGG
Skips NULLs by default
NULL
Order is undefined without ORDER BY inside the agg

FILTER vs CASE WHEN, side by side

-- Same conditional count, 2 ways. The output rows are identical.
-- The query plan often isn't.

-- FILTER (Postgres, DuckDB, ANSI SQL since 2003):
SELECT
  department,
  COUNT(*) FILTER (WHERE status = 'active')  AS active_n,
  COUNT(*) FILTER (WHERE status = 'churned') AS churned_n,
  COUNT(*) FILTER (WHERE status = 'active' AND tier = 'pro') AS pro_active_n
FROM users
GROUP BY department;

-- CASE WHEN (universal; works on Snowflake, BigQuery, MySQL, every Postgres):
SELECT
  department,
  SUM(CASE WHEN status = 'active' THEN 1 ELSE 0 END)  AS active_n,
  SUM(CASE WHEN status = 'churned' THEN 1 ELSE 0 END) AS churned_n,
  SUM(CASE WHEN status = 'active' AND tier = 'pro' THEN 1 ELSE 0 END) AS pro_active_n
FROM users
GROUP BY department;

-- Postgres planner notes:
-- FILTER variant: HashAggregate (1 pass, internal filter predicates)
-- CASE  variant: HashAggregate (1 pass, internal CASE evaluation)
-- Output: identical. Cost: FILTER ~3-5% lower on average.

Both queries produce identical output. FILTER is cleaner on Postgres; CASE is universal across engines.

The JOIN-then-aggregate double-counting bug

-- The double-counting trap: aggregating after a JOIN inflates the metric.

-- WRONG. Joining orders (1 row per order) to order_items (3-7 rows per order)
-- before averaging makes AVG(o.total) average each order N times where
-- N is its item count.
SELECT o.customer_id, AVG(o.total) AS avg_order
FROM orders o
JOIN order_items i ON i.order_id = o.order_id
GROUP BY o.customer_id;

-- RIGHT, option A. Aggregate orders first, then join.
WITH per_customer AS (
  SELECT customer_id, AVG(total) AS avg_order
  FROM orders
  GROUP BY customer_id
)
SELECT pc.customer_id, pc.avg_order, COUNT(DISTINCT i.item_id) AS items
FROM per_customer pc
LEFT JOIN orders o ON o.customer_id = pc.customer_id
LEFT JOIN order_items i ON i.order_id = o.order_id
GROUP BY pc.customer_id, pc.avg_order;

-- RIGHT, option B. Use COUNT(DISTINCT) inside the AVG to deduplicate.
-- Useful when the JOIN is unavoidable.
SELECT o.customer_id,
       SUM(o.total) / COUNT(DISTINCT o.order_id) AS avg_order
FROM orders o
JOIN order_items i ON i.order_id = o.order_id
GROUP BY o.customer_id;

The WRONG version inflates AVG by the average item count per order. Both RIGHT versions return the correct per-customer average.

7 aggregate practice problems

Each targets a specific aggregation gotcha. Submissions run against 10 randomized seeds, including seeds engineered for NULL bias and JOIN duplication.

01Distinguish the 3 COUNT forms
Easy · ~5 min
users(user_id, email, signup_country)

Return total users, users with an email (some are NULL), and distinct signup countries. 3 columns, 1 row.

Seed trap

Seed 5 has ~20% NULL emails. COUNT(*) and COUNT(email) diverge by exactly that amount; the grader checks both numbers explicitly.

COUNT(*)COUNT(col)COUNT(DISTINCT col)
02AVG with NULL bonuses
Easy · ~8 min
employees(emp_id, department, salary, bonus)

Per department, return average salary and average bonus. Treat employees without a bonus as having 0 bonus, not as missing data.

Seed trap

AVG(bonus) skips NULL employees; the prompt says treat NULL as 0. The grader accepts AVG(COALESCE(bonus, 0)) and SUM(bonus)/COUNT(*) but rejects bare AVG(bonus).

NULL semanticsAVG vs SUM/COUNTCOALESCE
03HAVING with multiple aggregate conditions
Easy-Medium · ~10 min
employees(emp_id, department, salary)

Return departments where headcount is at least 5 AND average salary exceeds $80,000.

Seed trap

Putting AVG(salary) > 80000 in WHERE is a syntax error. The grader rejects the query rather than returning empty rows, so the failure is obvious.

HAVINGGROUP BYaggregate predicates
04Conditional aggregation for active vs churned
Medium · ~12 min
users(user_id, signup_month, status)

For each signup_month, return total users, active count, churned count, and active percentage.

Seed trap

Computing active_pct as AVG(CASE WHEN status='active' THEN 1.0 ELSE 0.0 END) silently works but is dialect-fragile. The grader expects FILTER on Postgres and either FILTER or CASE on other engines.

FILTER vs CASE WHENpercent-of-total
05Cohort retention pivot
Medium · ~18 min
user_events(user_id, signup_month, event_month)

Pivot retention so each row is a signup_month with columns month_0..month_3 showing active user counts.

Seed trap

Seed 2 has users with no events past month_1. The grader checks that month_2 and month_3 columns contain 0 (not NULL) for those cohorts.

Pivoting via conditional aggregationCOALESCE on counts
06Multi-level grouping with missing combos
Medium-Hard · ~20 min
sales(sale_id, sale_date, region, revenue)

Return total revenue per (year, quarter, region). Include rows where revenue is 0 for combinations with no sales.

Seed trap

A plain GROUP BY only emits rows that exist. Seed 8 has a region with no sales in Q2; the grader expects a row with revenue = 0, which requires a CROSS JOIN against generate_series.

Multi-column GROUP BYgenerate_seriesmissing-row fill
07AVG order value with JOIN duplication
Hard · ~20 min
orders(order_id, customer_id, total), order_items(item_id, order_id, qty, unit_price)

Return the average order total per customer.

Seed trap

Seed 1 has orders with 1 to 12 items each. Joining before averaging weights orders with more items more. The grader uses a SUM/COUNT(DISTINCT) check; off-by-multiplication is caught explicitly.

JOIN duplicationCOUNT(DISTINCT) inside aggregateCTE pre-aggregation

Aggregate function FAQ

Why do COUNT(*) and COUNT(column) return different numbers?+
COUNT(*) counts every row. COUNT(column) counts only rows where the column is not NULL. If a 100-row table has 10 NULLs in the column, COUNT(*) returns 100 and COUNT(column) returns 90. The distinction is the most-tested NULL gotcha in SQL interviews.
Does AVG include NULL values in the denominator?+
No. AVG(col) skips NULL in both numerator and denominator. AVG over 10 employees where 5 have NULL bonuses returns the sum of the 5 non-NULL bonuses divided by 5, not by 10. If the prompt says 'average across all employees including no-bonus as zero,' the answer is SUM(COALESCE(bonus, 0)) / COUNT(*), not AVG(bonus).
When do I use HAVING instead of WHERE?+
HAVING for conditions on aggregate results (COUNT(*) > 10, AVG(salary) > 80000). WHERE for conditions on raw rows (status = 'active'). Both can appear together: WHERE runs first to filter rows, then GROUP BY, then HAVING to filter groups. An aggregate in WHERE is a syntax error in standards-compliant engines.
What's the difference between FILTER and CASE WHEN?+
Functionally equivalent in result. FILTER (WHERE ...) is the ANSI SQL syntax supported on Postgres 9.4-plus and DuckDB. CASE WHEN inside SUM or COUNT is the universal idiom and is required on Snowflake, BigQuery, and MySQL. FILTER reads cleaner and often optimizes slightly better; use it where supported.
Why is AVG sometimes wrong after a JOIN?+
The JOIN multiplies rows. AVG(o.total) on a query that joined orders to order_items returns the average weighted by item count, not the average per order. 2 fixes: aggregate orders before joining (CTE), or use SUM(o.total) / COUNT(DISTINCT o.order_id) inside the aggregate.
02 / Why practice

Start with problem 1 (COUNT semantics)

  1. 01

    Active recall beats re-reading by 50%

    Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom

  2. 02

    76% of hiring managers reject on the coding task, not the resume

    From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice

  3. 03

    Five problem shapes cover 80% of data engineer loops

    Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition

Adjacent topics