SQL GROUP BY: Complete Reference for Data Engineers

GROUP BY shows up in 32% of verified DE SQL interview questions: 139 of the 429 in our corpus. It's the second most frequent SQL construct in the entire dataset, trailing only SELECT/FROM at 36%. HAVING, its quieter sibling, appears in 7%. If you're allocating prep time by frequency, GROUP BY sits near the top of the list.

32%: Questions using GROUP BY (139 of 429 SQL questions)

36%: SELECT/FROM (only clause more common)

7%: HAVING frequency

How GROUP BY Works

SELECT is written first but runs fifth; GROUP BY runs third. That single fact explains roughly 80% of the "why doesn't this alias work" confusion in interviews. The logical evaluation order is the same across ANSI-style engines, and memorizing it pays off in roughly 1 in 3 SQL rounds.

SQL Execution Order

Each clause runs in this fixed sequence. GROUP BY at step 3 means it cannot see SELECT aliases defined at step 5.

1. FROM / JOIN

Tables are loaded and joined. This produces the raw row set that every subsequent step operates on.

2. WHERE

Rows are filtered before any grouping happens. You cannot reference aggregates here because groups do not exist yet.

3. GROUP BY

Remaining rows are collapsed into groups based on the specified columns. Each unique combination of GROUP BY values becomes one output row.

4. HAVING

Groups are filtered. HAVING runs after GROUP BY, so it can reference aggregate functions like COUNT(*) or SUM(amount).

5. SELECT

Expressions and aggregates are evaluated. Column aliases are created here, which is why standard SQL does not let you reference aliases in GROUP BY or HAVING (several engines relax this for GROUP BY as an extension).

6. ORDER BY

Results are sorted. This is the only clause that can reference column aliases (in most engines) because it runs after SELECT.

Basic GROUP BY Syntax

SELECT
  region,
  status,
  SUM(profit) AS result
FROM orders
WHERE profit > 0
GROUP BY region, status
HAVING SUM(profit) > 100
ORDER BY result DESC;

Interview note: If asked to write GROUP BY from scratch, start with SELECT and GROUP BY in sync. Add the same columns to both. Then add aggregates to SELECT. This prevents the most common GROUP BY error: forgetting a column.
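
To watch the clause order act on actual rows, here is a minimal sketch that runs the query above through Python's built-in sqlite3 module. The sample rows are invented for illustration, and SQLite stands in for PostgreSQL only because it needs no setup.

```python
import sqlite3

# Hypothetical sample data: the rows below are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, status TEXT, profit REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("West", "completed", 80.0),
        ("West", "completed", 70.0),
        ("West", "pending", 40.0),
        ("East", "completed", 120.0),
        ("East", "completed", -30.0),  # dropped by WHERE before grouping
        ("East", "pending", 60.0),
    ],
)

rows = conn.execute(
    """
    SELECT region, status, SUM(profit) AS result
    FROM orders
    WHERE profit > 0              -- step 2: row filter
    GROUP BY region, status       -- step 3: one row per (region, status)
    HAVING SUM(profit) > 100      -- step 4: group filter
    ORDER BY result DESC          -- step 6: the alias is visible here
    """
).fetchall()

for row in rows:
    print(row)
```

Only the two "completed" groups survive: the pending groups sum to 40 and 60, so HAVING removes them, and the negative East row never reaches GROUP BY at all.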

GROUP BY Patterns for Data Engineers

Six patterns that cover the vast majority of GROUP BY usage in production pipelines and interview questions. Each includes PostgreSQL code you can run directly.

Basic Aggregation

The simplest GROUP BY: pick a column, apply an aggregate function, and get one row per unique value. This pattern is the foundation of every reporting query. If your table has 1M orders across 50 regions, GROUP BY region collapses those million rows into 50. Interview note: Every non-aggregated column in SELECT must appear in GROUP BY. PostgreSQL enforces this strictly. MySQL with ONLY_FULL_GROUP_BY disabled does not, which can return non-deterministic results.

GROUP BY with HAVING

HAVING filters groups after aggregation. WHERE filters individual rows before grouping. This distinction trips up roughly half of interview candidates. Think of WHERE as a row-level gate and HAVING as a group-level gate. Interview note: WHERE status = 'completed' removes rows before grouping. HAVING COUNT(*) > 5 removes groups after. Using WHERE for row filters is more efficient because it reduces the data GROUP BY has to process.

GROUP BY Multiple Columns

Grouping by two or more columns creates one row per unique combination. This is how you build cross-tabulation reports: revenue by region and product category, counts by year and department, rates by country and device type. Interview note: The number of output rows equals the number of unique (region, product_category) combinations. If you have 10 regions and 8 categories, the maximum is 80 rows (fewer if some combinations have no data).

GROUP BY with CASE WHEN

CASE WHEN inside GROUP BY lets you create custom buckets on the fly. This is how you segment users by behavior tiers, bucket orders by size ranges, or classify transactions without creating a lookup table. Interview note: In standard SQL and in SQL Server, the full CASE expression must be repeated in GROUP BY. PostgreSQL, MySQL, and BigQuery accept the SELECT alias in GROUP BY as an extension. Know which engine your interviewer targets.

GROUP BY with Date Truncation

Time-series aggregation is the bread and butter of analytics engineering. Truncate timestamps to day, week, month, or quarter, then GROUP BY the truncated value. This pattern powers every time-series dashboard. Interview note: DATE_TRUNC is PostgreSQL/Snowflake syntax. In MySQL, use DATE_FORMAT(created_at, '%Y-%m-01'). In BigQuery, use DATE_TRUNC(created_at, MONTH). Interviewers want to see that you know the function for their stack.

GROUP BY with ROLLUP, CUBE, and GROUPING SETS

ROLLUP adds subtotals and a grand total row. CUBE generates subtotals for every combination of grouped columns. GROUPING SETS lets you specify exactly which groupings you want. These are tested in senior-level interviews and are standard in warehouse reporting. Interview note: ROLLUP(a, b) produces groupings (a, b), (a), and (). CUBE(a, b) produces (a, b), (a), (b), and (). Use the GROUPING() function to distinguish real NULLs from subtotal NULLs in the output.

Basic Aggregation

SELECT
  region,
  COUNT(*) AS order_count,
  SUM(profit) AS total_profit,
  AVG(profit) AS avg_order_profit
FROM orders
GROUP BY region;

GROUP BY with HAVING

-- Find users who placed more than 5 transactions
-- with total spend above $500
SELECT
  user_id,
  COUNT(*) AS txn_count,
  SUM(total_amount) AS total_spent
FROM transactions
WHERE total_amount > 0
GROUP BY user_id
HAVING COUNT(*) > 5
   AND SUM(total_amount) > 500;

GROUP BY Multiple Columns

SELECT
  o.region,
  p.category,
  COUNT(*) AS line_count,
  SUM(oi.quantity * oi.unit_price) AS revenue,
  ROUND(AVG(oi.quantity * oi.unit_price), 2) AS avg_line
FROM order_items oi
JOIN orders o ON oi.order_id = o.order_id
JOIN products p ON oi.product_id = p.product_id
GROUP BY o.region, p.category
ORDER BY o.region, revenue DESC;

GROUP BY with CASE WHEN

SELECT
  CASE
    WHEN total_amount < 50 THEN 'small'
    WHEN total_amount < 200 THEN 'medium'
    WHEN total_amount < 1000 THEN 'large'
    ELSE 'enterprise'
  END AS order_tier,
  COUNT(*) AS txn_count,
  SUM(total_amount) AS tier_revenue,
  ROUND(AVG(total_amount), 2) AS avg_amount
FROM transactions
GROUP BY
  CASE
    WHEN total_amount < 50 THEN 'small'
    WHEN total_amount < 200 THEN 'medium'
    WHEN total_amount < 1000 THEN 'large'
    ELSE 'enterprise'
  END;
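
One portable way to avoid repeating the CASE expression is to compute the bucket in a CTE and group on the resulting column. The sketch below assumes a tiny, made-up transactions table and uses Python's sqlite3 module purely as a convenient runner; the SQL inside the string is the part that matters.

```python
import sqlite3

# Hypothetical sample data (amounts chosen so each tier is populated).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (txn_id INTEGER, total_amount INTEGER)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [(1, 20), (2, 45), (3, 150), (4, 199), (5, 500), (6, 2500)],
)

rows = conn.execute(
    """
    WITH tiered AS (
      SELECT
        total_amount,
        CASE
          WHEN total_amount < 50 THEN 'small'
          WHEN total_amount < 200 THEN 'medium'
          WHEN total_amount < 1000 THEN 'large'
          ELSE 'enterprise'
        END AS order_tier
      FROM transactions
    )
    SELECT order_tier,
           COUNT(*) AS txn_count,
           SUM(total_amount) AS tier_revenue
    FROM tiered
    GROUP BY order_tier           -- a real column of the CTE, not an alias
    ORDER BY MIN(total_amount)    -- keeps tiers in ascending size order
    """
).fetchall()

for row in rows:
    print(row)
```

Because order_tier is an ordinary column by the time the outer query runs, the GROUP BY is valid even on engines that reject SELECT aliases.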

GROUP BY with Date Truncation

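-- Daily revenue since the start of 2024
-- (substr assumes transaction_date is stored as ISO-8601 text;
-- on a PostgreSQL timestamp column, use DATE_TRUNC('day', transaction_date))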
SELECT
  substr(transaction_date, 1, 10) AS txn_date,
  COUNT(*) AS txns,
  SUM(total_amount) AS revenue,
  COUNT(DISTINCT user_id) AS unique_buyers
FROM transactions
WHERE transaction_date >= '2024-01-01'
GROUP BY substr(transaction_date, 1, 10)
ORDER BY txn_date;

-- Monthly aggregation
SELECT
  substr(transaction_date, 1, 7) AS month,
  SUM(total_amount) AS monthly_revenue
FROM transactions
GROUP BY substr(transaction_date, 1, 7)
ORDER BY month;

GROUP BY with ROLLUP, CUBE, and GROUPING SETS

-- ROLLUP: subtotals by region, then grand total
SELECT
  COALESCE(o.region, 'ALL REGIONS') AS region,
  COALESCE(p.category, 'ALL PRODUCTS') AS category,
  SUM(oi.quantity * oi.unit_price) AS revenue
FROM order_items oi
JOIN orders o ON oi.order_id = o.order_id
JOIN products p ON oi.product_id = p.product_id
GROUP BY ROLLUP(o.region, p.category);

-- CUBE: every combination of subtotals
SELECT
  COALESCE(o.region, 'ALL') AS region,
  COALESCE(p.category, 'ALL') AS category,
  SUM(oi.quantity * oi.unit_price) AS revenue
FROM order_items oi
JOIN orders o ON oi.order_id = o.order_id
JOIN products p ON oi.product_id = p.product_id
GROUP BY CUBE(o.region, p.category);

-- GROUPING SETS: pick exactly which groupings
SELECT
  o.region,
  p.category,
  SUM(oi.quantity * oi.unit_price) AS revenue
FROM order_items oi
JOIN orders o ON oi.order_id = o.order_id
JOIN products p ON oi.product_id = p.product_id
GROUP BY GROUPING SETS (
  (o.region, p.category),
  (o.region),
  ()
);
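
To make the ROLLUP grouping sets concrete, the sketch below builds the same three groupings, (region, category), (region), and (), by hand with UNION ALL. This is an emulation, not the ROLLUP syntax itself: it runs on SQLite through Python's sqlite3 module, and SQLite does not implement ROLLUP. The sales table and its rows are assumptions made up for the example.

```python
import sqlite3

# Hypothetical, pre-joined sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, category TEXT, revenue INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("East", "books", 100),
        ("East", "toys", 50),
        ("West", "books", 70),
        ("West", "books", 30),
    ],
)

rows = conn.execute(
    """
    -- grouping (region, category): the detail rows
    SELECT region, category, SUM(revenue) AS revenue
    FROM sales GROUP BY region, category
    UNION ALL
    -- grouping (region): one subtotal per region, category is NULL
    SELECT region, NULL, SUM(revenue)
    FROM sales GROUP BY region
    UNION ALL
    -- grouping (): the grand total, both columns NULL
    SELECT NULL, NULL, SUM(revenue)
    FROM sales
    """
).fetchall()

for row in rows:
    print(row)
```

The NULLs in the subtotal rows are the ones ROLLUP would emit, which is why a real NULL category in the data becomes indistinguishable from a subtotal unless the engine's GROUPING() function is used.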

GROUP BY Execution Order: Why Aliases Break

The execution order explains three things that confuse every SQL beginner (and trip up experienced engineers in interviews).

SELECT Aliases in GROUP BY

-- Non-portable: strict engines (SQL Server, for example)
-- reject an alias in GROUP BY
SELECT substr(transaction_date, 1, 7) AS month,
       SUM(total_amount) AS revenue
FROM transactions
GROUP BY month;

-- Portable: repeat the expression
SELECT substr(transaction_date, 1, 7) AS month,
       SUM(total_amount) AS revenue
FROM transactions
GROUP BY substr(transaction_date, 1, 7);

GROUP BY executes at step 3. SELECT executes at step 5. When GROUP BY runs, the alias you defined in SELECT does not exist yet. SQL Server enforces this strictly. PostgreSQL, MySQL, and BigQuery resolve output-column aliases in GROUP BY as a convenience extension; in PostgreSQL, a name that matches both an input column and an alias is treated as the input column.

WHERE Filters Rows Before Grouping

-- Efficient: WHERE filters 80% of rows
-- before GROUP BY processes them
SELECT region, SUM(profit) AS revenue
FROM orders
WHERE status = 'completed'
GROUP BY region;

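-- Less efficient, and not equivalent: this groups ALL rows,
-- so revenue still includes non-completed orders; HAVING
-- only drops regions with no positive completed profit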
SELECT region, SUM(profit) AS revenue
FROM orders
GROUP BY region
HAVING SUM(CASE WHEN status != 'completed'
             THEN 0 ELSE profit END) > 0;
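
The two queries above differ in more than cost. A small sketch, run through Python's sqlite3 module on invented rows, shows that the HAVING version keeps non-completed profit in the reported revenue:

```python
import sqlite3

# Hypothetical sample data: West has one completed and one cancelled order,
# East has only a cancelled order.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, status TEXT, profit INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("West", "completed", 100),
        ("West", "cancelled", 50),
        ("East", "cancelled", 40),
    ],
)

# Row filter first: only completed orders reach GROUP BY.
where_rows = conn.execute(
    """
    SELECT region, SUM(profit) AS revenue
    FROM orders
    WHERE status = 'completed'
    GROUP BY region
    """
).fetchall()

# Group filter only: every row is aggregated, then groups are dropped.
having_rows = conn.execute(
    """
    SELECT region, SUM(profit) AS revenue
    FROM orders
    GROUP BY region
    HAVING SUM(CASE WHEN status != 'completed'
                 THEN 0 ELSE profit END) > 0
    """
).fetchall()

print(where_rows)   # West revenue counts completed orders only
print(having_rows)  # West revenue also includes the cancelled order
```

Both versions drop East, but West reports 100 with WHERE and 150 with HAVING, so the choice affects correctness as well as efficiency.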

HAVING Filters Groups After Aggregation

-- Find low-value, repeat-ordered products:
-- ordered on more than one line, but average
-- line value still under $200
SELECT
  product_id,
  COUNT(*) AS total_lines,
  ROUND(AVG(quantity * unit_price), 2) AS avg_line_value
FROM order_items
GROUP BY product_id
HAVING COUNT(*) >= 2
   AND AVG(quantity * unit_price) < 200
ORDER BY avg_line_value ASC;

7 GROUP BY Interview Questions

GROUP BY questions test more than syntax. They reveal whether you understand query execution, can debug aggregation issues, and know performance implications.

Q1: What does GROUP BY do, and when does it execute in the query pipeline?

What they test: Foundational understanding. Most candidates can say 'it groups rows,' but fewer can place it correctly in the execution order: after FROM/WHERE, before HAVING/SELECT/ORDER BY. Approach: State the execution order explicitly: FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY. Explain that GROUP BY collapses rows into groups, and each group becomes one output row. Only aggregated columns and GROUP BY columns can appear in SELECT.

Q2: Why can you use a column alias in ORDER BY but not in GROUP BY?

What they test: Execution order knowledge. SELECT (where aliases are defined) runs after GROUP BY but before ORDER BY. So GROUP BY cannot see aliases, while ORDER BY can. Approach: Walk through the order: GROUP BY fires before SELECT, so the alias does not exist yet. ORDER BY fires after SELECT, so the alias is available. Note that PostgreSQL, MySQL, and BigQuery are exceptions: they allow aliases in GROUP BY as a non-standard extension.

Q3: Explain the difference between WHERE and HAVING.

What they test: This is the single most common GROUP BY interview question. Interviewers want the timing distinction: WHERE filters rows before grouping, HAVING filters groups after aggregation. Approach: WHERE operates on individual rows and cannot use aggregate functions. HAVING operates on groups and can reference aggregates like COUNT(*) > 10. Using WHERE to filter early is more efficient because it reduces the rows GROUP BY processes.

Q4: Write a query to find the top 3 product categories by revenue, only counting orders over $100.

What they test: Combining WHERE, GROUP BY, ORDER BY, and LIMIT in the right order. The WHERE clause filters first, GROUP BY aggregates, ORDER BY sorts, and LIMIT caps the output. Approach: WHERE amount > 100 filters individual orders. GROUP BY product_category aggregates. ORDER BY SUM(amount) DESC sorts by revenue. LIMIT 3 takes the top three. Do not use HAVING here because the $100 threshold is a row filter, not a group filter.
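
A minimal sketch of that answer, run with Python's sqlite3 module. The orders table, its product_category and amount columns, and the sample rows are assumptions based on the names used in the approach above.

```python
import sqlite3

# Hypothetical sample data: five categories, some orders at or below $100.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (product_category TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        ("books", 150), ("books", 200), ("books", 90),
        ("toys", 120),
        ("games", 500), ("games", 101),
        ("garden", 300), ("garden", 100),
        ("food", 80),
    ],
)

top3 = conn.execute(
    """
    SELECT product_category, SUM(amount) AS revenue
    FROM orders
    WHERE amount > 100            -- row filter: only orders over $100
    GROUP BY product_category
    ORDER BY revenue DESC
    LIMIT 3
    """
).fetchall()

for row in top3:
    print(row)
```

The $90, $100, and $80 orders never reach the aggregation, so "food" has no group at all and "toys" falls outside the top three.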

Q5: A dashboard shows 'total users' as 1,247 but 'SUM of users per region' as 1,302. What went wrong?

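What they test: Real debugging skill. Users who belong to multiple regions get counted once in the total but once per region in the grouped sum. This is a fan-out problem, often caused by a JOIN before aggregation. Approach: Explain the fan-out: if a user has rows in 2 regions, they appear in 2 groups. COUNT(*) counts rows, not distinct users. Fix: use COUNT(DISTINCT user_id) or aggregate before joining, so duplicate rows inside a region are not double-counted. A user who genuinely belongs to 2 regions is still counted once in each, so per-region counts only add up to the total if every user is assigned to exactly one region.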
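
The mismatch is easy to reproduce. The sketch below uses Python's sqlite3 module and a made-up activity table in which user 1 has a duplicate row and user 3 appears in two regions; picking MIN(region) as a user's single "home" region is an arbitrary rule chosen only for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE activity (user_id INTEGER, region TEXT)")
conn.executemany(
    "INSERT INTO activity VALUES (?, ?)",
    [(1, "East"), (1, "East"), (2, "West"), (3, "East"), (3, "West")],
)

def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]

# The dashboard's "total users" figure.
total_users = scalar("SELECT COUNT(DISTINCT user_id) FROM activity")

# Summing COUNT(*) per region counts rows, so duplicates inflate it.
sum_of_row_counts = scalar(
    "SELECT SUM(n) FROM "
    "(SELECT COUNT(*) AS n FROM activity GROUP BY region) AS t"
)

# COUNT(DISTINCT) removes duplicates within a region, but user 3
# is still counted once in East and once in West.
sum_of_distinct_counts = scalar(
    "SELECT SUM(n) FROM "
    "(SELECT COUNT(DISTINCT user_id) AS n FROM activity GROUP BY region) AS t"
)

# Assign each user exactly one region first, then count per region.
sum_single_region = scalar(
    """
    SELECT SUM(n) FROM (
      SELECT home_region, COUNT(*) AS n
      FROM (SELECT user_id, MIN(region) AS home_region
            FROM activity GROUP BY user_id) AS u
      GROUP BY home_region
    ) AS t
    """
)

print(total_users, sum_of_row_counts, sum_of_distinct_counts, sum_single_region)
# prints: 3 5 4 3
```

Only the last variant reconciles with the total, which is the point to make in the interview: per-group distinct counts are not additive when an entity can belong to more than one group.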

Q6: What happens if you SELECT a column that is not in GROUP BY and not in an aggregate?

What they test: SQL standard compliance knowledge. In strict SQL mode (PostgreSQL, SQL Server), this is an error. In MySQL permissive mode, it returns an arbitrary value from the group, which is non-deterministic. Approach: State that the SQL standard requires every SELECT column to be either in GROUP BY or inside an aggregate function. PostgreSQL raises an error. MySQL historically allowed it but would pick a random row's value. Always follow the standard.

Q7: How would you calculate a running percentage of total using GROUP BY?

What they test: Whether you can combine GROUP BY with window functions or subqueries. A common pattern: GROUP BY first to get category totals, then divide by the grand total. Approach: Use a CTE or subquery. First, GROUP BY category to get each category's revenue. Then divide by SUM(revenue) OVER () to get the percentage. Alternatively, use a scalar subquery for the grand total: category_revenue / (SELECT SUM(amount) FROM orders). If the interviewer means a cumulative percentage, add an ORDER BY inside the window, for example SUM(revenue) OVER (ORDER BY revenue DESC), and divide that running sum by the grand total.
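
A sketch of the CTE-plus-window approach, run through Python's sqlite3 module. It assumes an SQLite build with window-function support (3.25 or later) and an invented orders table with category and amount columns.

```python
import sqlite3

# Hypothetical sample data: total revenue is 200.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (category TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("books", 50), ("books", 50), ("toys", 60), ("games", 40)],
)

rows = conn.execute(
    """
    WITH category_revenue AS (
      SELECT category, SUM(amount) AS revenue
      FROM orders
      GROUP BY category
    )
    SELECT
      category,
      revenue,
      -- empty OVER () makes the window the whole result set
      ROUND(100.0 * revenue / SUM(revenue) OVER (), 1) AS pct_of_total
    FROM category_revenue
    ORDER BY revenue DESC
    """
).fetchall()

for row in rows:
    print(row)
```

Multiplying by 100.0 before dividing forces floating-point arithmetic; with integer columns, plain revenue / SUM(revenue) would truncate to 0 in engines that use integer division.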

Common GROUP BY Mistakes

These four mistakes account for the majority of GROUP BY bugs in interviews and production code. Each one has a clear fix.

Selecting Non-Aggregated Columns

If a column is in SELECT but not in GROUP BY and not wrapped in an aggregate function, the query is invalid under the SQL standard. PostgreSQL rejects it outright. MySQL with ONLY_FULL_GROUP_BY disabled silently picks an arbitrary value from the group, which means your results could change between runs.

Using WHERE Instead of HAVING for Aggregate Filters

WHERE runs before GROUP BY, so it cannot reference aggregate results. Writing WHERE COUNT(*) > 5 is an error in every mainstream SQL engine. The fix is always HAVING.

GROUP BY Position Numbers

Some engines allow GROUP BY 1, 2 to reference SELECT columns by position. This works but is fragile: if someone reorders the SELECT columns, the GROUP BY silently changes meaning. Production queries should use explicit column names.

Forgetting GROUP BY with Aggregates in Complex Queries

When building queries incrementally (adding joins, subqueries, new columns), it is easy to add a column to SELECT but forget to add it to GROUP BY. This is especially common when a query has 6+ columns and multiple joins. Review GROUP BY against SELECT every time you modify a grouped query.

Selecting Non-Aggregated Columns

-- Wrong:
-- PostgreSQL ERROR: column "emp_name" must appear
-- in GROUP BY or be used in an aggregate function
SELECT
  department,
  emp_name,
  COUNT(*) AS headcount
FROM employees
GROUP BY department;

-- Correct:
-- Fix: add emp_name to GROUP BY or aggregate it
-- (STRING_AGG is PostgreSQL; MySQL and SQLite use GROUP_CONCAT)
SELECT
  department,
  COUNT(*) AS headcount,
  STRING_AGG(emp_name, ', ') AS employee_names
FROM employees
GROUP BY department;

Using WHERE Instead of HAVING for Aggregate Filters

-- Wrong:
-- ERROR: aggregate functions not allowed in WHERE
SELECT department, COUNT(*) AS cnt
FROM employees
WHERE COUNT(*) > 5
GROUP BY department;

-- Correct:
-- Use HAVING for aggregate conditions
SELECT department, COUNT(*) AS cnt
FROM employees
GROUP BY department
HAVING COUNT(*) > 5;

GROUP BY Position Numbers

-- Wrong:
-- Fragile: if SELECT columns are reordered,
-- the grouping changes silently
SELECT region, status, SUM(profit)
FROM orders
GROUP BY 1, 2;

-- Correct:
-- Explicit: the grouping is clear regardless
-- of SELECT column order
SELECT region, status, SUM(profit)
FROM orders
GROUP BY region, status;

Forgetting GROUP BY with Aggregates in Complex Queries

-- Wrong:
-- Added u.age_bucket but forgot GROUP BY
SELECT
  u.account_status,
  u.age_bucket,
  COUNT(oi.item_id) AS lines,
  SUM(oi.quantity * oi.unit_price) AS revenue
FROM users u
JOIN order_items oi ON u.user_id = oi.user_id
GROUP BY u.account_status;  -- missing u.age_bucket

-- Correct:
SELECT
  u.account_status,
  u.age_bucket,
  COUNT(oi.item_id) AS lines,
  SUM(oi.quantity * oi.unit_price) AS revenue
FROM users u
JOIN order_items oi ON u.user_id = oi.user_id
GROUP BY u.account_status, u.age_bucket;

GROUP BY FAQ

What does GROUP BY do in SQL?

GROUP BY collapses multiple rows into summary rows based on shared column values. If you GROUP BY region, all rows with region = 'West' become a single output row. You then use aggregate functions (COUNT, SUM, AVG, MIN, MAX) to compute values across each group. Without GROUP BY, aggregate functions operate on the entire table as one group.

Can you GROUP BY multiple columns?

Yes. GROUP BY region, product_category creates one group for each unique combination of region and product_category. If you have 10 regions and 5 categories, you get up to 50 groups. Rows must share the same value in every GROUP BY column to belong to the same group.

What is the difference between WHERE and HAVING?

WHERE filters individual rows before grouping happens. HAVING filters groups after aggregation. WHERE cannot reference aggregate functions (like COUNT or SUM) because groups do not exist when WHERE runs. HAVING can reference aggregates because it runs after GROUP BY. Use WHERE for row-level conditions and HAVING for group-level conditions.

Why can I not use a column alias in GROUP BY?

In standard SQL, GROUP BY executes before SELECT, where column aliases are defined. The alias does not exist yet when GROUP BY runs. PostgreSQL, MySQL, and BigQuery allow aliases in GROUP BY as a non-standard convenience, but relying on this makes your queries non-portable.

Does GROUP BY appear in data engineering interviews?

GROUP BY appears in 32% of the verified SQL interview questions in our corpus (139 of 429), drawn from interviews across major tech companies. It is tested at every level from junior to staff. Common areas: basic aggregation, WHERE vs HAVING, multiple column grouping, date truncation patterns, and debugging fan-out issues from JOINs before GROUP BY.
