GROUP BY appears in 32% of verified DE interview SQL questions. It's the second most common construct in the corpus, behind SELECT/FROM. Roughly 14% of those GROUP BY questions use a composite key of two or more columns, which puts multi-column grouping at about 1 in every 22 SQL rounds.
The higher you go in seniority, the more often it shows up. At L5 (61% of the corpus at 632 rounds), composite GROUP BY is usually paired with a HAVING filter. At L6, it's almost always the setup for a window function or a ranking follow-up. Four patterns cover nearly every shape you'll see: hierarchical rollups, cross-tabs, HAVING on composites, and GROUP BY chained into a second aggregation.
Questions use GROUP BY
Use composite grouping
Most common clause
SQL rounds analyzed
Source: DataDriven analysis of 1,042 verified data engineering interview rounds.
The 14% of SQL questions that use composite GROUP BY share one mechanic. SQL walks every row, builds a group key from the column tuple, and buckets matching keys together. A row with ('Sales', 'Q1') and another with ('Sales', 'Q2') land in separate groups even though the first column matches. Two columns with 5 distinct values each can produce up to 25 groups, but in practice the corpus shows most real problems return 6 to 12 groups after filtering.
-- Source data (5 rows)
-- department | quarter | revenue
-- Sales | Q1 | 100
-- Sales | Q1 | 200
-- Sales | Q2 | 150
-- Marketing | Q1 | 80
-- Marketing | Q2 | 120
SELECT department, quarter, SUM(revenue) AS total
FROM sales
GROUP BY department, quarter;
-- Result (4 groups):
-- Sales | Q1 | 300 (2 rows combined)
-- Sales | Q2 | 150 (1 row)
-- Marketing | Q1 | 80 (1 row)
-- Marketing | Q2 | 120 (1 row)Key insight: The number of groups is bounded by the number of distinct value combinations that exist in the data, not the theoretical maximum. If Marketing never has Q3 sales, there is no Marketing/Q3 group.
From basic composite grouping to hierarchical analysis, cross-tabulation, and HAVING filters on composite groups.
When you GROUP BY two or more columns, SQL creates one group for each unique combination of values across those columns. A table with 3 departments and 4 job titles could produce up to 12 groups (3 x 4), though only combinations that exist in the data create actual groups. Each group gets one row in the result.
-- Revenue by department and quarter
SELECT
department,
quarter,
SUM(revenue) AS total_revenue,
COUNT(*) AS deal_count
FROM sales
GROUP BY department, quarter
ORDER BY department, quarter;
-- If department has values (Sales, Marketing)
-- and quarter has values (Q1, Q2, Q3, Q4)
-- up to 8 groups are possible:
-- Sales/Q1, Sales/Q2, Sales/Q3, Sales/Q4
-- Marketing/Q1, Marketing/Q2, Marketing/Q3, Marketing/Q4Hierarchical grouping produces summaries at the intersection of organizational levels. Grouping by country and city shows totals for each city within each country. Grouping by year, month, and day creates daily summaries nested within months and years. This pattern powers most business reporting dashboards.
-- Headcount by country and city
SELECT
country,
city,
COUNT(*) AS employee_count,
AVG(salary) AS avg_salary
FROM employees
GROUP BY country, city
ORDER BY country, city;
-- Result:
-- USA | New York | 45 | 95000
-- USA | San Francisco | 32 | 110000
-- UK | London | 28 | 85000
-- UK | Manchester | 12 | 72000
-- Each row is a unique (country, city) pair
-- "USA" alone is not a group; "USA, New York" isCombining multi-column GROUP BY with conditional aggregation produces a cross-tabulation: one dimension on the rows, another on the columns. This is the SQL equivalent of a pivot table. The GROUP BY defines the row dimension. CASE WHEN inside aggregate functions creates the column dimension.
-- Sales by product (rows) and quarter (columns)
SELECT
product_name,
SUM(CASE WHEN quarter = 'Q1' THEN revenue ELSE 0 END) AS q1,
SUM(CASE WHEN quarter = 'Q2' THEN revenue ELSE 0 END) AS q2,
SUM(CASE WHEN quarter = 'Q3' THEN revenue ELSE 0 END) AS q3,
SUM(CASE WHEN quarter = 'Q4' THEN revenue ELSE 0 END) AS q4,
SUM(revenue) AS full_year
FROM sales
GROUP BY product_name
ORDER BY full_year DESC;
-- Two-dimension grouping with conditional aggregation
SELECT
department,
job_level,
COUNT(*) AS headcount,
COUNT(*) FILTER (WHERE gender = 'F') AS female_count,
COUNT(*) FILTER (WHERE gender = 'M') AS male_count
FROM employees
GROUP BY department, job_level;HAVING filters groups after aggregation. With multi-column GROUP BY, HAVING filters based on the aggregate values of composite groups. This lets you ask questions like 'which department-quarter combinations had more than $1M in revenue' or 'which city-product pairs sold fewer than 10 units.'
-- Department-quarter pairs with high revenue
SELECT
department,
quarter,
SUM(revenue) AS total_revenue
FROM sales
GROUP BY department, quarter
HAVING SUM(revenue) > 1000000
ORDER BY total_revenue DESC;
-- City-category pairs with low sales volume
SELECT
city,
product_category,
COUNT(*) AS order_count,
SUM(amount) AS total_sales
FROM orders
JOIN products ON orders.product_id = products.product_id
GROUP BY city, product_category
HAVING COUNT(*) < 10;Multi-column GROUP BY introduces specific pitfalls that single-column grouping does not.
Every column in SELECT must either appear in GROUP BY or be inside an aggregate function. Adding a column to SELECT without adding it to GROUP BY causes an error in PostgreSQL and Snowflake. MySQL may silently pick an arbitrary value.
Fix: Add the column to GROUP BY if you want finer groups, or wrap it in an aggregate like MAX() or MIN() if you need a representative value.
Adding unnecessary columns to GROUP BY creates more groups with fewer rows each. If you GROUP BY department, quarter, employee_name, you get per-employee groups, not per-department groups. The aggregates become meaningless because each group might contain a single row.
Fix: Only include columns in GROUP BY that define the level of granularity you need. Ask yourself: what is one row in my result? That answer defines your GROUP BY columns.
WHERE filters rows before grouping. HAVING filters groups after aggregation. Putting an aggregate condition in WHERE causes a syntax error. Putting a row-level condition in HAVING is valid but less efficient because it filters after the expensive grouping operation.
Fix: Row-level filters go in WHERE. Aggregate-level filters go in HAVING. For example: WHERE status = 'active' (row filter) and HAVING COUNT(*) > 10 (group filter).
These questions test composite group understanding, practical multi-dimension analysis, and SQL standard compliance.
What they test:
Whether you understand composite group formation. Single-column GROUP BY creates one group per unique value in that column. Multi-column GROUP BY creates one group per unique combination of values across all listed columns. The number of possible groups is the Cartesian product of distinct values (though only existing combinations appear).
Approach:
Explain with a concrete example: employees table with department (3 values) and job_level (4 values). GROUP BY department alone gives 3 groups. GROUP BY department, job_level gives up to 12 groups, one for each (department, job_level) pair that has at least one row. Mention that the aggregate functions (SUM, COUNT, AVG) compute within each composite group independently.
What they test:
Ability to combine multi-column GROUP BY with ranking. This requires grouping by region, category, and product to get totals, then ranking within (region, category) to pick the top product. Window functions or a subquery with GROUP BY are both valid approaches.
Approach:
Step 1: GROUP BY region, category, product_name to get SUM(sales) per product per category per region. Step 2: Use ROW_NUMBER() OVER (PARTITION BY region, category ORDER BY total_sales DESC) to rank products within each (region, category). Step 3: Filter to rank = 1. Walk through the query step by step so the interviewer sees your reasoning.
What they test:
Understanding of the GROUP BY contract: every non-aggregated column in SELECT must appear in the GROUP BY clause (or be functionally dependent on it). MySQL with ONLY_FULL_GROUP_BY disabled allows this and picks an arbitrary value. PostgreSQL enforces the SQL standard strictly.
Approach:
Explain the rule: SELECT can only contain columns that are in the GROUP BY list or inside aggregate functions. employee_name is not in GROUP BY and not aggregated, so PostgreSQL rejects it. MySQL (without ONLY_FULL_GROUP_BY) picks an arbitrary employee_name from the group, which is unpredictable and usually wrong. The fix: add employee_name to GROUP BY if you want per-employee groups, or use an aggregate like MAX(employee_name) if you need a representative value.
Roughly 60 out of the 429 verified SQL rounds we analyzed needed a composite GROUP BY to answer. Practice the four patterns against a real engine and you'll recognize them cold.