GROUP BY shows up in 32% of verified data engineering SQL interview questions (139 of the 429 in our corpus). It's the second most frequent SQL construct in the entire dataset, trailing SELECT/FROM at 36%. HAVING, its quieter sibling, appears in 7%. If you're allocating prep time by frequency, GROUP BY sits near the top of the list.
The syntax is five words, but the interview surface is huge: execution order, WHERE versus HAVING, fan-out from joins, ROLLUP and CUBE. We'll walk through each with PostgreSQL examples you can run live.
Quick stats: GROUP BY appears in 139 of 429 SQL questions; only SELECT/FROM is more common; HAVING appears in 7%. Source: DataDriven analysis of 1,042 verified data engineering interview rounds.
GROUP BY runs third in the execution pipeline, while SELECT, the clause you write first, runs fifth. That single fact explains roughly 80% of the "why doesn't this alias work" confusion in interviews. The execution pipeline is fixed across every ANSI engine, and memorizing the order pays back on maybe 1 in 3 SQL rounds you'll ever sit in.
1. FROM: Tables are loaded and joined. This produces the raw row set that every subsequent step operates on.
2. WHERE: Rows are filtered before any grouping happens. You cannot reference aggregates here because groups do not exist yet.
3. GROUP BY: Remaining rows are collapsed into groups based on the specified columns. Each unique combination of GROUP BY values becomes one output row.
4. HAVING: Groups are filtered. HAVING runs after GROUP BY, so it can reference aggregate functions like COUNT(*) or SUM(amount).
5. SELECT: Expressions and aggregates are evaluated. Column aliases are created here, which is why standard SQL does not let you reference aliases in GROUP BY or HAVING.
6. ORDER BY: Results are sorted. This is the only clause that can reference column aliases in every major engine, because it runs after SELECT.
The rule is simple: every column in SELECT must either be listed in GROUP BY or wrapped in an aggregate function. No exceptions in standard SQL.
SELECT column_a, column_b, AGG_FUNCTION(column_c) AS result
FROM table_name
WHERE row_filter_condition
GROUP BY column_a, column_b
HAVING AGG_FUNCTION(column_c) > threshold
ORDER BY result DESC;
Interview note: If asked to write GROUP BY from scratch, start with SELECT and GROUP BY in sync. Add the same columns to both. Then add aggregates to SELECT. This prevents the most common GROUP BY error: forgetting a column.
Six patterns that cover the vast majority of GROUP BY usage in production pipelines and interview questions. Each includes PostgreSQL code you can run directly.
The simplest GROUP BY: pick a column, apply an aggregate function, and get one row per unique value. This pattern is the foundation of every reporting query. If your table has 1M orders across 50 regions, GROUP BY region collapses those million rows into 50.
SELECT
region,
COUNT(*) AS order_count,
SUM(amount) AS total_revenue,
AVG(amount) AS avg_order_value
FROM orders
GROUP BY region;
Interview note: Every non-aggregated column in SELECT must appear in GROUP BY. PostgreSQL enforces this strictly. MySQL in permissive mode does not, which can return non-deterministic results.
HAVING filters groups after aggregation. WHERE filters individual rows before grouping. This distinction trips up roughly half of interview candidates. Think of WHERE as a row-level gate and HAVING as a group-level gate.
-- Find users who placed more than 5 orders
-- with total spend above $500
SELECT
user_id,
COUNT(*) AS order_count,
SUM(amount) AS total_spent
FROM orders
WHERE status = 'completed'
GROUP BY user_id
HAVING COUNT(*) > 5
AND SUM(amount) > 500;
Interview note: WHERE status = 'completed' removes rows before grouping. HAVING COUNT(*) > 5 removes groups after. Using WHERE for row filters is more efficient because it reduces the data GROUP BY has to process.
Grouping by two or more columns creates one row per unique combination. This is how you build cross-tabulation reports: revenue by region and product category, counts by year and department, rates by country and device type.
SELECT
region,
product_category,
COUNT(*) AS order_count,
SUM(amount) AS revenue,
ROUND(AVG(amount), 2) AS avg_order
FROM orders o
JOIN products p ON o.product_id = p.id
GROUP BY region, product_category
ORDER BY region, revenue DESC;
Interview note: The number of output rows equals the number of unique (region, product_category) combinations. If you have 10 regions and 8 categories, the maximum is 80 rows (fewer if some combinations have no data).
CASE WHEN inside GROUP BY lets you create custom buckets on the fly. This is how you segment users by behavior tiers, bucket orders by size ranges, or classify transactions without creating a lookup table.
SELECT
CASE
WHEN amount < 50 THEN 'small'
WHEN amount < 200 THEN 'medium'
WHEN amount < 1000 THEN 'large'
ELSE 'enterprise'
END AS order_tier,
COUNT(*) AS order_count,
SUM(amount) AS tier_revenue,
ROUND(AVG(amount), 2) AS avg_amount
FROM orders
GROUP BY
CASE
WHEN amount < 50 THEN 'small'
WHEN amount < 200 THEN 'medium'
WHEN amount < 1000 THEN 'large'
ELSE 'enterprise'
END;
Interview note: Standard SQL and SQL Server require repeating the full CASE expression in GROUP BY. PostgreSQL, MySQL, and BigQuery let you reference the alias instead (GROUP BY order_tier) as an extension. Know which engine your interviewer targets.
Time-series aggregation is the bread and butter of analytics engineering. Truncate timestamps to day, week, month, or quarter, then GROUP BY the truncated value. This pattern powers every time-series dashboard.
-- Daily revenue for the last 90 days
SELECT
DATE_TRUNC('day', created_at) AS order_date,
COUNT(*) AS orders,
SUM(amount) AS revenue,
COUNT(DISTINCT user_id) AS unique_buyers
FROM orders
WHERE created_at >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY DATE_TRUNC('day', created_at)
ORDER BY order_date;
-- Monthly aggregation
SELECT
DATE_TRUNC('month', created_at) AS month,
SUM(amount) AS monthly_revenue
FROM orders
GROUP BY DATE_TRUNC('month', created_at)
ORDER BY month;
Interview note: DATE_TRUNC is PostgreSQL/Snowflake syntax. In MySQL, use DATE_FORMAT(created_at, '%Y-%m-01'). In BigQuery, use DATE_TRUNC(created_at, MONTH). Interviewers want to see that you know the function for their stack.
ROLLUP adds subtotals and a grand total row. CUBE generates subtotals for every combination of grouped columns. GROUPING SETS lets you specify exactly which groupings you want. These are tested in senior-level interviews and are standard in warehouse reporting.
-- ROLLUP: subtotals by region, then grand total
SELECT
COALESCE(region, 'ALL REGIONS') AS region,
COALESCE(product_category, 'ALL PRODUCTS') AS category,
SUM(amount) AS revenue
FROM orders o
JOIN products p ON o.product_id = p.id
GROUP BY ROLLUP(region, product_category);
-- CUBE: every combination of subtotals
SELECT
COALESCE(region, 'ALL') AS region,
COALESCE(product_category, 'ALL') AS category,
SUM(amount) AS revenue
FROM orders o
JOIN products p ON o.product_id = p.id
GROUP BY CUBE(region, product_category);
-- GROUPING SETS: pick exactly which groupings
SELECT
region,
product_category,
SUM(amount) AS revenue
FROM orders o
JOIN products p ON o.product_id = p.id
GROUP BY GROUPING SETS (
(region, product_category),
(region),
()
);
Interview note: ROLLUP(a, b) produces groupings (a, b), (a), and (). CUBE(a, b) produces (a, b), (a), (b), and (). Use the GROUPING() function to distinguish real NULLs from subtotal NULLs in the output.
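The COALESCE trick above silently mislabels rows if region or product_category can themselves be NULL in the data. GROUPING() avoids that ambiguity. A sketch against the same orders/products join:

```sql
-- GROUPING(col) returns 1 when the NULL in col marks a subtotal row,
-- 0 when it is a real NULL from the data
SELECT
  CASE WHEN GROUPING(region) = 1
       THEN 'ALL REGIONS' ELSE region END AS region,
  CASE WHEN GROUPING(product_category) = 1
       THEN 'ALL PRODUCTS' ELSE product_category END AS category,
  SUM(amount) AS revenue
FROM orders o
JOIN products p ON o.product_id = p.id
GROUP BY ROLLUP(region, product_category);
```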
The execution order explains three things that confuse every SQL beginner (and trip up experienced engineers in interviews):
GROUP BY executes at step 3. SELECT executes at step 5. When GROUP BY runs, the alias you defined in SELECT does not exist yet, which is why standard SQL does not let you group by it. SQL Server enforces the standard strictly. PostgreSQL, MySQL, and BigQuery accept simple alias references in GROUP BY as a convenience extension, but your query will fail if you port it to an engine that follows the standard.
-- Relies on a non-standard alias extension;
-- fails in SQL Server
SELECT DATE_TRUNC('month', created_at) AS month,
SUM(amount) AS revenue
FROM orders
GROUP BY month;
-- Portable: repeat the expression
SELECT DATE_TRUNC('month', created_at) AS month,
SUM(amount) AS revenue
FROM orders
GROUP BY DATE_TRUNC('month', created_at);
The standard does not guarantee alias resolution in GROUP BY, and SQL Server rejects it. When in doubt, repeat the expression.
WHERE runs at step 2, before GROUP BY at step 3. This means WHERE reduces the row count that GROUP BY has to process. A WHERE clause that eliminates 80% of rows before grouping is significantly faster than grouping all rows and then filtering with HAVING. Use WHERE for every filter that operates on individual row values.
-- Efficient: WHERE filters 80% of rows
-- before GROUP BY processes them
SELECT region, SUM(amount) AS revenue
FROM orders
WHERE created_at >= '2024-01-01'
AND status = 'completed'
GROUP BY region;
-- Less efficient: ALL rows are grouped, and the
-- filtering happens inside the aggregate instead
SELECT region,
SUM(CASE WHEN status = 'completed'
AND created_at >= '2024-01-01'
THEN amount ELSE 0 END) AS revenue
FROM orders
GROUP BY region;
HAVING runs at step 4, after GROUP BY has created groups. HAVING is the only place where you can write conditions on aggregate results. "Show me departments with more than 10 employees" requires HAVING COUNT(*) > 10 because COUNT(*) does not exist until the groups are formed.
-- Find products with declining sales:
-- at least 100 orders total, but avg order value
-- dropped below $25
SELECT
product_id,
COUNT(*) AS total_orders,
ROUND(AVG(amount), 2) AS avg_order_value
FROM orders
WHERE created_at >= '2024-01-01'
GROUP BY product_id
HAVING COUNT(*) >= 100
AND AVG(amount) < 25
ORDER BY avg_order_value ASC;
GROUP BY questions test more than syntax. They reveal whether you understand query execution, can debug aggregation issues, and know performance implications. These seven cover the range from phone screens to on-site rounds.
What they test:
Foundational understanding. Most candidates can say 'it groups rows,' but fewer can place it correctly in the execution order: after FROM/WHERE, before HAVING/SELECT/ORDER BY.
Approach:
State the execution order explicitly: FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY. Explain that GROUP BY collapses rows into groups, and each group becomes one output row. Only aggregated columns and GROUP BY columns can appear in SELECT.
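A quick way to demonstrate the order out loud is to annotate a query with its logical step numbers (the employees table and columns here are illustrative):

```sql
SELECT department, COUNT(*) AS headcount  -- 5. SELECT: aliases created here
FROM employees                            -- 1. FROM: load the rows
WHERE hire_date >= '2023-01-01'           -- 2. WHERE: filter rows
GROUP BY department                       -- 3. GROUP BY: collapse into groups
HAVING COUNT(*) > 10                      -- 4. HAVING: filter groups
ORDER BY headcount DESC;                  -- 6. ORDER BY: sort (alias visible)
```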
What they test:
Execution order knowledge. SELECT (where aliases are defined) runs after GROUP BY but before ORDER BY. So GROUP BY cannot see aliases, while ORDER BY can.
Approach:
Walk through the order: GROUP BY fires before SELECT, so the alias does not exist yet. ORDER BY fires after SELECT, so the alias is available. Note that several engines (MySQL, PostgreSQL, BigQuery) accept aliases in GROUP BY as a non-standard extension; SQL Server does not.
What they test:
This is the single most common GROUP BY interview question. Interviewers want the timing distinction: WHERE filters rows before grouping, HAVING filters groups after aggregation.
Approach:
WHERE operates on individual rows and cannot use aggregate functions. HAVING operates on groups and can reference aggregates like COUNT(*) > 10. Using WHERE to filter early is more efficient because it reduces the rows GROUP BY processes.
What they test:
Combining WHERE, GROUP BY, ORDER BY, and LIMIT in the right order. The WHERE clause filters first, GROUP BY aggregates, ORDER BY sorts, and LIMIT caps the output.
Approach:
WHERE amount > 100 filters individual orders. GROUP BY product_category aggregates. ORDER BY SUM(amount) DESC sorts by revenue. LIMIT 3 takes the top three. Do not use HAVING here because the $100 threshold is a row filter, not a group filter.
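One way the finished query might look, reusing the orders/products tables from the earlier examples (the $100 threshold and top-3 cap come from the question):

```sql
-- Top 3 categories by revenue, counting only orders over $100
SELECT
  p.product_category,
  SUM(o.amount) AS revenue
FROM orders o
JOIN products p ON o.product_id = p.id
WHERE o.amount > 100        -- row filter: belongs in WHERE, not HAVING
GROUP BY p.product_category
ORDER BY revenue DESC       -- alias is fine here: ORDER BY runs after SELECT
LIMIT 3;
```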
What they test:
Real debugging skill. Users who belong to multiple regions get counted once in the total but once per region in the grouped sum. This is a fan-out problem caused by a JOIN before aggregation.
Approach:
Explain the fan-out: if a user has rows in 2 regions, they appear in 2 groups. COUNT(*) counts rows, not distinct users. Fix: use COUNT(DISTINCT user_id) or aggregate before joining.
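A minimal sketch of the fan-out and the COUNT(DISTINCT) fix, reusing the users/orders tables that appear elsewhere in this guide:

```sql
-- Bug: the join duplicates each user once per matching order,
-- so COUNT(u.id) counts rows, not users
SELECT u.region, COUNT(u.id) AS buyers
FROM users u
JOIN orders o ON o.user_id = u.id
GROUP BY u.region;

-- Fix: deduplicate inside the aggregate
SELECT u.region, COUNT(DISTINCT u.id) AS buyers
FROM users u
JOIN orders o ON o.user_id = u.id
GROUP BY u.region;
```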
What they test:
SQL standard compliance knowledge. In strict SQL mode (PostgreSQL, SQL Server), this is an error. In MySQL permissive mode, it returns an arbitrary value from the group, which is non-deterministic.
Approach:
State that the SQL standard requires every SELECT column to be either in GROUP BY or inside an aggregate function. PostgreSQL raises an error. MySQL historically allowed it but would pick an arbitrary row's value. Always follow the standard.
What they test:
Whether you can combine GROUP BY with window functions or subqueries. A common pattern: GROUP BY first to get category totals, then divide by the grand total.
Approach:
Use a CTE or subquery. First, GROUP BY category to get each category's revenue. Then divide by SUM(revenue) OVER () to get the percentage. Alternatively, use a scalar subquery for the grand total: category_revenue / (SELECT SUM(amount) FROM orders).
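The CTE-plus-window variant, sketched against the orders/products tables used earlier:

```sql
-- Aggregate first, then divide each category by the grand total
WITH category_revenue AS (
  SELECT p.product_category, SUM(o.amount) AS revenue
  FROM orders o
  JOIN products p ON o.product_id = p.id
  GROUP BY p.product_category
)
SELECT
  product_category,
  revenue,
  ROUND(100.0 * revenue / SUM(revenue) OVER (), 2) AS pct_of_total
FROM category_revenue
ORDER BY pct_of_total DESC;
```

The window function runs over the already-grouped CTE rows, so the grand total is computed once without a second scan of orders.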
These four mistakes account for the majority of GROUP BY bugs in interviews and production code. Each one has a clear fix.
If a column is in SELECT but not in GROUP BY and not wrapped in an aggregate function, the query is technically invalid. PostgreSQL rejects it outright. MySQL in permissive mode silently picks an arbitrary value from the group, which means your results could change between runs.
Wrong
-- PostgreSQL ERROR: column "name" must appear
-- in GROUP BY or be used in an aggregate function
SELECT
department,
name,
COUNT(*) AS headcount
FROM employees
GROUP BY department;
Correct
-- Fix: add name to GROUP BY or aggregate it
SELECT
department,
COUNT(*) AS headcount,
ARRAY_AGG(name) AS employee_names
FROM employees
GROUP BY department;
WHERE runs before GROUP BY, so it cannot reference aggregate results. Writing WHERE COUNT(*) > 5 is an error in every SQL engine. The fix is always HAVING.
Wrong
-- ERROR: aggregate functions not allowed in WHERE
SELECT department, COUNT(*) AS cnt
FROM employees
WHERE COUNT(*) > 5
GROUP BY department;
Correct
-- Correct: use HAVING for aggregate conditions
SELECT department, COUNT(*) AS cnt
FROM employees
GROUP BY department
HAVING COUNT(*) > 5;
Some engines allow GROUP BY 1, 2 to reference SELECT columns by position. This works but is fragile: if someone reorders the SELECT columns, the GROUP BY silently changes meaning. Production queries should use explicit column names.
Wrong
-- Fragile: if SELECT columns are reordered,
-- the grouping changes silently
SELECT region, product_type, SUM(amount)
FROM orders
GROUP BY 1, 2;Correct
-- Explicit: the grouping is clear regardless
-- of SELECT column order
SELECT region, product_type, SUM(amount)
FROM orders
GROUP BY region, product_type;
When building queries incrementally (adding joins, subqueries, new columns), it is easy to add a column to SELECT but forget to add it to GROUP BY. This is especially common when a query has 6+ columns and multiple joins. Review GROUP BY against SELECT every time you modify a grouped query.
Wrong
-- Added u.signup_source but forgot GROUP BY
SELECT
u.region,
u.signup_source,
COUNT(o.id) AS orders,
SUM(o.amount) AS revenue
FROM users u
JOIN orders o ON u.id = o.user_id
GROUP BY u.region; -- missing u.signup_source
Correct
SELECT
u.region,
u.signup_source,
COUNT(o.id) AS orders,
SUM(o.amount) AS revenue
FROM users u
JOIN orders o ON u.id = o.user_id
GROUP BY u.region, u.signup_source;
139 GROUP BY questions in our verified corpus. DataDriven serves them back with live grading so the patterns stick before loop day.
Hands-on GROUP BY problems with aggregation, HAVING, and multi-column grouping
Complete guide to every SQL topic tested in data engineering interviews
Handle NULLs in GROUP BY results with COALESCE patterns for clean aggregations