SQL Reference

SQL GROUP BY: The Complete Guide for Data Engineers

GROUP BY shows up in 32% of verified DE SQL interview questions, that's 139 of 429 rounds in our corpus. It's the second most frequent SQL construct in the entire dataset, trailing SELECT/FROM at 36%. HAVING, its quieter sibling, appears in 7%. If you're allocating prep time by frequency, GROUP BY sits near the top of the list.

The syntax is five words. The interview surface is huge. Execution order, WHERE versus HAVING, fan-out from joins, ROLLUP and CUBE. We'll walk each one with PostgreSQL examples you can run live.

32%

Questions using GROUP BY

139

Of 429 SQL questions

36%

SELECT/FROM (only clause more common)

7%

HAVING frequency

Source: DataDriven analysis of 1,042 verified data engineering interview rounds.

How GROUP BY Works

GROUP BY runs fifth, not first. That single fact explains roughly 80% of the "why doesn't this alias work" confusion in interviews. The execution pipeline is fixed across every ANSI engine, and memorizing the order pays back on maybe 1 in 3 SQL rounds you'll ever sit in.

1

FROM / JOIN

Tables are loaded and joined. This produces the raw row set that every subsequent step operates on.

2

WHERE

Rows are filtered before any grouping happens. You cannot reference aggregates here because groups do not exist yet.

3

GROUP BY

Remaining rows are collapsed into groups based on the specified columns. Each unique combination of GROUP BY values becomes one output row.

4

HAVING

Groups are filtered. HAVING runs after GROUP BY, so it can reference aggregate functions like COUNT(*) or SUM(amount).

5

SELECT

Expressions and aggregates are evaluated. Column aliases are created here, which is why most engines do not let you reference aliases in GROUP BY or HAVING.

6

ORDER BY

Results are sorted. This is the only clause that can reference column aliases (in most engines) because it runs after SELECT.

Basic GROUP BY Syntax

The rule is simple: every column in SELECT must either be listed in GROUP BY or wrapped in an aggregate function. No exceptions in standard SQL.

SELECT
  column_a,
  column_b,
  AGG_FUNCTION(column_c) AS result
FROM table_name
WHERE row_filter_condition
GROUP BY column_a, column_b
HAVING AGG_FUNCTION(column_c) > threshold
ORDER BY result DESC;

Interview note: If asked to write GROUP BY from scratch, start with SELECT and GROUP BY in sync. Add the same columns to both. Then add aggregates to SELECT. This prevents the most common GROUP BY error: forgetting a column.

GROUP BY Patterns for Data Engineers

Six patterns that cover the vast majority of GROUP BY usage in production pipelines and interview questions. Each includes PostgreSQL code you can run directly.

Basic Aggregation

The simplest GROUP BY: pick a column, apply an aggregate function, and get one row per unique value. This pattern is the foundation of every reporting query. If your table has 1M orders across 50 regions, GROUP BY region collapses those million rows into 50.

SELECT
  region,
  COUNT(*) AS order_count,
  SUM(amount) AS total_revenue,
  AVG(amount) AS avg_order_value
FROM orders
GROUP BY region;

Interview note: Every non-aggregated column in SELECT must appear in GROUP BY. PostgreSQL enforces this strictly. MySQL in permissive mode does not, which can return non-deterministic results.

GROUP BY with HAVING

HAVING filters groups after aggregation. WHERE filters individual rows before grouping. This distinction trips up roughly half of interview candidates. Think of WHERE as a row-level gate and HAVING as a group-level gate.

-- Find users who placed more than 5 orders
-- with total spend above $500
SELECT
  user_id,
  COUNT(*) AS order_count,
  SUM(amount) AS total_spent
FROM orders
WHERE status = 'completed'
GROUP BY user_id
HAVING COUNT(*) > 5
   AND SUM(amount) > 500;

Interview note: WHERE status = 'completed' removes rows before grouping. HAVING COUNT(*) > 5 removes groups after. Using WHERE for row filters is more efficient because it reduces the data GROUP BY has to process.

GROUP BY Multiple Columns

Grouping by two or more columns creates one row per unique combination. This is how you build cross-tabulation reports: revenue by region and product category, counts by year and department, rates by country and device type.

SELECT
  region,
  product_category,
  COUNT(*) AS order_count,
  SUM(amount) AS revenue,
  ROUND(AVG(amount), 2) AS avg_order
FROM orders o
JOIN products p ON o.product_id = p.id
GROUP BY region, product_category
ORDER BY region, revenue DESC;

Interview note: The number of output rows equals the number of unique (region, product_category) combinations. If you have 10 regions and 8 categories, the maximum is 80 rows (fewer if some combinations have no data).

GROUP BY with CASE WHEN

CASE WHEN inside GROUP BY lets you create custom buckets on the fly. This is how you segment users by behavior tiers, bucket orders by size ranges, or classify transactions without creating a lookup table.

SELECT
  CASE
    WHEN amount < 50 THEN 'small'
    WHEN amount < 200 THEN 'medium'
    WHEN amount < 1000 THEN 'large'
    ELSE 'enterprise'
  END AS order_tier,
  COUNT(*) AS order_count,
  SUM(amount) AS tier_revenue,
  ROUND(AVG(amount), 2) AS avg_amount
FROM orders
GROUP BY
  CASE
    WHEN amount < 50 THEN 'small'
    WHEN amount < 200 THEN 'medium'
    WHEN amount < 1000 THEN 'large'
    ELSE 'enterprise'
  END;

Interview note: The full CASE expression must be repeated in GROUP BY (in PostgreSQL and SQL Server). Some engines like MySQL and BigQuery allow referencing the alias. Know which engine your interviewer targets.

GROUP BY with Date Truncation

Time-series aggregation is the bread and butter of analytics engineering. Truncate timestamps to day, week, month, or quarter, then GROUP BY the truncated value. This pattern powers every time-series dashboard.

-- Daily revenue for the last 90 days
SELECT
  DATE_TRUNC('day', created_at) AS order_date,
  COUNT(*) AS orders,
  SUM(amount) AS revenue,
  COUNT(DISTINCT user_id) AS unique_buyers
FROM orders
WHERE created_at >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY DATE_TRUNC('day', created_at)
ORDER BY order_date;

-- Monthly aggregation
SELECT
  DATE_TRUNC('month', created_at) AS month,
  SUM(amount) AS monthly_revenue
FROM orders
GROUP BY DATE_TRUNC('month', created_at)
ORDER BY month;

Interview note: DATE_TRUNC is PostgreSQL/Snowflake syntax. In MySQL, use DATE_FORMAT(created_at, '%Y-%m-01'). In BigQuery, use DATE_TRUNC(created_at, MONTH). Interviewers want to see that you know the function for their stack.

GROUP BY with ROLLUP, CUBE, and GROUPING SETS

ROLLUP adds subtotals and a grand total row. CUBE generates subtotals for every combination of grouped columns. GROUPING SETS lets you specify exactly which groupings you want. These are tested in senior-level interviews and are standard in warehouse reporting.

-- ROLLUP: subtotals by region, then grand total
SELECT
  COALESCE(region, 'ALL REGIONS') AS region,
  COALESCE(product_category, 'ALL PRODUCTS') AS category,
  SUM(amount) AS revenue
FROM orders o
JOIN products p ON o.product_id = p.id
GROUP BY ROLLUP(region, product_category);

-- CUBE: every combination of subtotals
SELECT
  COALESCE(region, 'ALL') AS region,
  COALESCE(product_category, 'ALL') AS category,
  SUM(amount) AS revenue
FROM orders o
JOIN products p ON o.product_id = p.id
GROUP BY CUBE(region, product_category);

-- GROUPING SETS: pick exactly which groupings
SELECT
  region,
  product_category,
  SUM(amount) AS revenue
FROM orders o
JOIN products p ON o.product_id = p.id
GROUP BY GROUPING SETS (
  (region, product_category),
  (region),
  ()
);

Interview note: ROLLUP(a, b) produces groupings (a, b), (a), and (). CUBE(a, b) produces (a, b), (a), (b), and (). Use the GROUPING() function to distinguish real NULLs from subtotal NULLs in the output.

GROUP BY Execution Order: Why Aliases Break

The execution order explains three things that confuse every SQL beginner (and trip up experienced engineers in interviews):

You cannot use SELECT aliases in GROUP BY (in most engines)

GROUP BY executes at step 3. SELECT executes at step 5. When GROUP BY runs, the alias you defined in SELECT does not exist yet. PostgreSQL and SQL Server enforce this strictly. MySQL and BigQuery allow it as a convenience extension, but your query will fail if you port it to another engine.

-- Fails in PostgreSQL: alias not visible to GROUP BY
SELECT DATE_TRUNC('month', created_at) AS month,
       SUM(amount) AS revenue
FROM orders
GROUP BY month;

-- Works in PostgreSQL: repeat the expression
SELECT DATE_TRUNC('month', created_at) AS month,
       SUM(amount) AS revenue
FROM orders
GROUP BY DATE_TRUNC('month', created_at);

Correction: PostgreSQL actually does allow aliases in GROUP BY as a special extension. But the standard does not guarantee it, and SQL Server does not allow it. When in doubt, repeat the expression.

WHERE filters rows before grouping

WHERE runs at step 2, before GROUP BY at step 3. This means WHERE reduces the row count that GROUP BY has to process. A WHERE clause that eliminates 80% of rows before grouping is significantly faster than grouping all rows and then filtering with HAVING. Use WHERE for every filter that operates on individual row values.

-- Efficient: WHERE filters 80% of rows
-- before GROUP BY processes them
SELECT region, SUM(amount) AS revenue
FROM orders
WHERE created_at >= '2024-01-01'
  AND status = 'completed'
GROUP BY region;

-- Less efficient: grouping ALL rows
-- then filtering groups
SELECT region, SUM(amount) AS revenue
FROM orders
GROUP BY region
HAVING SUM(CASE WHEN status != 'completed'
             THEN 0 ELSE amount END) > 0;

HAVING filters groups after aggregation

HAVING runs at step 4, after GROUP BY has created groups. HAVING is the only place where you can write conditions on aggregate results. "Show me departments with more than 10 employees" requires HAVING COUNT(*) > 10 because COUNT(*) does not exist until the groups are formed.

-- Find products with declining sales:
-- at least 100 orders total, but avg order value
-- dropped below $25
SELECT
  product_id,
  COUNT(*) AS total_orders,
  ROUND(AVG(amount), 2) AS avg_order_value
FROM orders
WHERE created_at >= '2024-01-01'
GROUP BY product_id
HAVING COUNT(*) >= 100
   AND AVG(amount) < 25
ORDER BY avg_order_value ASC;

7 GROUP BY Interview Questions

GROUP BY questions test more than syntax. They reveal whether you understand query execution, can debug aggregation issues, and know performance implications. These seven cover the range from phone screens to on-site rounds.

Q1: What does GROUP BY do, and when does it execute in the query pipeline?

What they test:

Foundational understanding. Most candidates can say 'it groups rows,' but fewer can place it correctly in the execution order: after FROM/WHERE, before HAVING/SELECT/ORDER BY.

Approach:

State the execution order explicitly: FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY. Explain that GROUP BY collapses rows into groups, and each group becomes one output row. Only aggregated columns and GROUP BY columns can appear in SELECT.

Q2: Why can you use a column alias in ORDER BY but not in GROUP BY?

What they test:

Execution order knowledge. SELECT (where aliases are defined) runs after GROUP BY but before ORDER BY. So GROUP BY cannot see aliases, while ORDER BY can.

Approach:

Walk through the order: GROUP BY fires before SELECT, so the alias does not exist yet. ORDER BY fires after SELECT, so the alias is available. Note that MySQL is an exception and allows aliases in GROUP BY as a non-standard extension.

Q3: Explain the difference between WHERE and HAVING.

What they test:

This is the single most common GROUP BY interview question. Interviewers want the timing distinction: WHERE filters rows before grouping, HAVING filters groups after aggregation.

Approach:

WHERE operates on individual rows and cannot use aggregate functions. HAVING operates on groups and can reference aggregates like COUNT(*) > 10. Using WHERE to filter early is more efficient because it reduces the rows GROUP BY processes.

Q4: Write a query to find the top 3 product categories by revenue, only counting orders over $100.

What they test:

Combining WHERE, GROUP BY, ORDER BY, and LIMIT in the right order. The WHERE clause filters first, GROUP BY aggregates, ORDER BY sorts, and LIMIT caps the output.

Approach:

WHERE amount > 100 filters individual orders. GROUP BY product_category aggregates. ORDER BY SUM(amount) DESC sorts by revenue. LIMIT 3 takes the top three. Do not use HAVING here because the $100 threshold is a row filter, not a group filter.

Q5: A dashboard shows 'total users' as 1,247 but 'SUM of users per region' as 1,302. What went wrong?

What they test:

Real debugging skill. Users who belong to multiple regions get counted once in the total but once per region in the grouped sum. This is a fan-out problem caused by a JOIN before aggregation.

Approach:

Explain the fan-out: if a user has rows in 2 regions, they appear in 2 groups. COUNT(*) counts rows, not distinct users. Fix: use COUNT(DISTINCT user_id) or aggregate before joining.

Q6: What happens if you SELECT a column that is not in GROUP BY and not in an aggregate?

What they test:

SQL standard compliance knowledge. In strict SQL mode (PostgreSQL, SQL Server), this is an error. In MySQL permissive mode, it returns an arbitrary value from the group, which is non-deterministic.

Approach:

State that the SQL standard requires every SELECT column to be either in GROUP BY or inside an aggregate function. PostgreSQL raises an error. MySQL historically allowed it but would pick a random row's value. Always follow the standard.

Q7: How would you calculate a running percentage of total using GROUP BY?

What they test:

Whether you can combine GROUP BY with window functions or subqueries. A common pattern: GROUP BY first to get category totals, then divide by the grand total.

Approach:

Use a CTE or subquery. First, GROUP BY category to get each category's revenue. Then divide by SUM(revenue) OVER () to get the percentage. Alternatively, use a scalar subquery for the grand total: category_revenue / (SELECT SUM(amount) FROM orders).

Common GROUP BY Mistakes

These four mistakes account for the majority of GROUP BY bugs in interviews and production code. Each one has a clear fix.

Selecting Non-Aggregated Columns

If a column is in SELECT but not in GROUP BY and not wrapped in an aggregate function, the query is technically invalid. PostgreSQL rejects it outright. MySQL in permissive mode silently picks an arbitrary value from the group, which means your results could change between runs.

Wrong

-- PostgreSQL ERROR: column "name" must appear
-- in GROUP BY or be used in an aggregate function
SELECT
  department,
  name,
  COUNT(*) AS headcount
FROM employees
GROUP BY department;

Correct

-- Fix: add name to GROUP BY or aggregate it
SELECT
  department,
  COUNT(*) AS headcount,
  ARRAY_AGG(name) AS employee_names
FROM employees
GROUP BY department;

Using WHERE Instead of HAVING for Aggregate Filters

WHERE runs before GROUP BY, so it cannot reference aggregate results. Writing WHERE COUNT(*) > 5 is a syntax error in every SQL engine. The fix is always HAVING.

Wrong

-- ERROR: aggregate functions not allowed in WHERE
SELECT department, COUNT(*) AS cnt
FROM employees
WHERE COUNT(*) > 5
GROUP BY department;

Correct

-- Correct: use HAVING for aggregate conditions
SELECT department, COUNT(*) AS cnt
FROM employees
GROUP BY department
HAVING COUNT(*) > 5;

GROUP BY Position Numbers

Some engines allow GROUP BY 1, 2 to reference SELECT columns by position. This works but is fragile: if someone reorders the SELECT columns, the GROUP BY silently changes meaning. Production queries should use explicit column names.

Wrong

-- Fragile: if SELECT columns are reordered,
-- the grouping changes silently
SELECT region, product_type, SUM(amount)
FROM orders
GROUP BY 1, 2;

Correct

-- Explicit: the grouping is clear regardless
-- of SELECT column order
SELECT region, product_type, SUM(amount)
FROM orders
GROUP BY region, product_type;

Forgetting GROUP BY with Aggregates in Complex Queries

When building queries incrementally (adding joins, subqueries, new columns), it is easy to add a column to SELECT but forget to add it to GROUP BY. This is especially common when a query has 6+ columns and multiple joins. Review GROUP BY against SELECT every time you modify a grouped query.

Wrong

-- Added u.signup_source but forgot GROUP BY
SELECT
  u.region,
  u.signup_source,
  COUNT(o.id) AS orders,
  SUM(o.amount) AS revenue
FROM users u
JOIN orders o ON u.id = o.user_id
GROUP BY u.region;  -- missing u.signup_source

Correct

SELECT
  u.region,
  u.signup_source,
  COUNT(o.id) AS orders,
  SUM(o.amount) AS revenue
FROM users u
JOIN orders o ON u.id = o.user_id
GROUP BY u.region, u.signup_source;

GROUP BY FAQ

What does GROUP BY do in SQL?+
GROUP BY collapses multiple rows into summary rows based on shared column values. If you GROUP BY region, all rows with region = 'West' become a single output row. You then use aggregate functions (COUNT, SUM, AVG, MIN, MAX) to compute values across each group. Without GROUP BY, aggregate functions operate on the entire table as one group.
Can you GROUP BY multiple columns?+
Yes. GROUP BY region, product_category creates one group for each unique combination of region and product_category. If you have 10 regions and 5 categories, you get up to 50 groups. Rows must share the same value in every GROUP BY column to belong to the same group.
What is the difference between WHERE and HAVING?+
WHERE filters individual rows before grouping happens. HAVING filters groups after aggregation. WHERE cannot reference aggregate functions (like COUNT or SUM) because groups do not exist when WHERE runs. HAVING can reference aggregates because it runs after GROUP BY. Use WHERE for row-level conditions and HAVING for group-level conditions.
Why can I not use a column alias in GROUP BY?+
In standard SQL and PostgreSQL, GROUP BY executes before SELECT, where column aliases are defined. The alias does not exist yet when GROUP BY runs. MySQL and BigQuery allow aliases in GROUP BY as a non-standard convenience, but relying on this makes your queries non-portable.
Does GROUP BY appear in data engineering interviews?+
GROUP BY appears in roughly 85% of SQL interview rounds based on analysis of verified interview data across major tech companies. It is tested at every level from junior to staff. Common areas: basic aggregation, WHERE vs HAVING, multiple column grouping, date truncation patterns, and debugging fan-out issues from JOINs before GROUP BY.

32% of SQL Rounds. Train Like It.

139 GROUP BY questions in our verified corpus. DataDriven serves them back with live grading so the patterns stick before loop day.