142 aggregation problems isolated from the data engineer SQL catalog. GROUP BY, HAVING, COUNT(*) versus COUNT(col) versus COUNT(DISTINCT col), conditional aggregation with SUM CASE WHEN or FILTER (WHERE), PERCENTILE_CONT for medians, and the NULL semantics that show up in 40 percent of problems.

Aggregation is the second-largest topic in the data engineer SQL interview catalog at 142 problems (17 percent). The questions look simple. COUNT, SUM, AVG, MIN, MAX over a GROUP BY. The syntax fits in a single line. What makes them interview-grade is the edge-case density: NULL semantics differ between COUNT(*) and COUNT(col), conditional aggregation has two correct idioms with different portability, and AVG of a CASE expression silently mishandles NULL denominators in a way that gives the wrong percentage without raising an error.

Six sub-patterns covered. Basic GROUP BY with HAVING: HAVING filters after aggregation, WHERE filters before. Data engineer interviewers ask which filter belongs where, especially for "departments with average salary greater than 100k" type questions. COUNT(*) versus COUNT(col) versus COUNT(DISTINCT col): COUNT(*) counts rows including NULLs, COUNT(col) skips NULL rows, COUNT(DISTINCT col) counts unique non-NULL values. The trap is using COUNT(*) when COUNT(DISTINCT user_id) is what the business wants. Conditional aggregation pivots: SUM(CASE WHEN status equals 'active' THEN 1 ELSE 0 END) is the portable form, COUNT(*) FILTER (WHERE status equals 'active') is the Postgres-and-SQLite native, both compose for percentages as 100.0 multiplied-by SUM(CASE WHEN ... THEN 1 END) divided-by COUNT(*). Percentile functions: PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY val) for interpolated median, PERCENTILE_DISC for nearest discrete value, Spark and Hive use percentile_approx for performance at scale. NULL handling in aggregations: SUM ignores NULL, AVG of a CASE WHEN expression treats CASE-with-no-match as NULL and drops it from the denominator silently producing the wrong percentage. HAVING with composite conditions: HAVING SUM(amount) greater-than 1000 AND COUNT(DISTINCT category) greater-than-or-equal-to 3 for "high-spend customers across multiple categories".

The most-engineered failure mode in this catalog's seeds is the AVG-of-CASE-WHEN-for-percentage bug. A query like AVG(CASE WHEN status equals 'active' THEN 1 ELSE 0 END) returns the active fraction. But a data engineer who writes AVG(CASE WHEN status equals 'active' THEN 1 END) without the ELSE 0 produces NULL when the row is not active. AVG drops NULLs from the denominator. The result is always 1.0 regardless of the data. The multi-seed grader catches this immediately because the apparent answer is constant across all 10 seeds, clearly wrong against the expected variable percentage.

Dialect notes. Postgres, Snowflake, and BigQuery support FILTER (WHERE) for conditional aggregation. Hive and Redshift require SUM(CASE WHEN). MySQL 8 added FILTER but not all installations enable it; SUM CASE is the safe choice. PERCENTILE_CONT is standard SQL but support varies: Spark uses percentile_approx with explicit accuracy parameter, BigQuery uses APPROX_QUANTILES, Snowflake supports both PERCENTILE_CONT and APPROX_PERCENTILE. The choice matters at scale; a data engineer designing a daily aggregation pipeline picks approximate functions for the cost-versus-precision trade-off.

SQL Aggregation Interview Questions

Aggregation-focused SQL interview problems with NULL-engineered seeds for data engineer prep.

Common questions

What is the difference between COUNT(*), COUNT(col), and COUNT(DISTINCT col)?
COUNT(*) counts rows including rows where every column is NULL. COUNT(col) counts rows where col is not NULL, useful for counting non-missing values in a specific column. COUNT(DISTINCT col) counts unique non-NULL values. For 'how many active users this month', use COUNT(DISTINCT user_id), not COUNT(*) (which double-counts users with multiple events) and not COUNT(user_id) (same result as COUNT(*) when user_id is not nullable).
When does WHERE belong versus HAVING in SQL?
WHERE filters rows before aggregation. HAVING filters groups after aggregation. 'Customers in California' is WHERE state equals 'CA' (filter rows before GROUP BY). 'Customers with at least 3 orders' is HAVING COUNT(*) greater-than-or-equal-to 3 (filter groups after GROUP BY). The litmus test: if the predicate references an aggregate function, it goes in HAVING. Otherwise WHERE. Putting an aggregate in WHERE is a syntax error.
How do I compute a percentage with conditional aggregation in SQL?
100.0 multiplied-by SUM(CASE WHEN status equals 'active' THEN 1 ELSE 0 END) divided-by COUNT(*). The 100.0 promotes to floating point in dialects that integer-divide. The ELSE 0 is critical: without it, CASE returns NULL when not active, and silent behavior differs by function. With SUM, NULLs are ignored so numerator is unaffected. With AVG of the CASE alone, NULLs drop from the denominator and the result is always 1.0. Postgres-native form: 100.0 multiplied-by COUNT(*) FILTER (WHERE status equals 'active') divided-by COUNT(*).
What is the correct way to compute a median in SQL?
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY val) for an interpolated median (between two middle values for even N). PERCENTILE_DISC(0.5) for the nearest discrete value. Both are standard SQL but support varies. Spark and Hive use percentile_approx for scale-performance reasons. MySQL pre-8.0 has no median function; the workaround is a window-function rank and an average of the middle two.
Why does AVG(CASE WHEN x THEN 1 END) produce the wrong percentage?
Without an ELSE, CASE returns NULL when the condition is false. AVG ignores NULLs in both numerator and denominator. AVG averages only the matching rows, all of which are 1, producing 1.0 always. The fix is AVG(CASE WHEN x THEN 1.0 ELSE 0 END) or, more idiomatic, 100.0 multiplied-by SUM(CASE WHEN x THEN 1 ELSE 0 END) divided-by COUNT(*). Multi-seed graders catch this because the apparent answer is constant regardless of data.
How do I aggregate over multiple metrics in one SQL pass?
Conditional aggregation pivots all the metrics into a single SELECT with one row per group: SELECT month, SUM(CASE WHEN type equals 'A' THEN amount END) AS a_total, SUM(CASE WHEN type equals 'B' THEN amount END) AS b_total, COUNT(*) FILTER (WHERE type equals 'A') AS a_count FROM events GROUP BY month. Alternative is the PIVOT operator in Snowflake and SQL Server, but conditional aggregation is portable everywhere.
What does GROUP BY ROLLUP do in SQL?
ROLLUP adds subtotal rows. GROUP BY ROLLUP (region, city) produces per-city totals, per-region subtotals (where city is NULL), and a grand total (where both are NULL). Useful for reports that need totals at multiple levels. CUBE produces all combinations. GROUPING SETS lets you specify exactly which subtotals. The grand total row's NULLs are real NULLs in the result, distinguishable from data NULLs using the GROUPING() function.