Aggregating: Advanced

Walmart operates over 10,500 store locations across 19 countries, and its analytics team needs to simultaneously compare sales performance across every combination of store, product category, and time period without running hundreds of separate queries. Multi-dimensional aggregation with GROUPING SETS, ROLLUP, and CUBE lets a single SQL query produce subtotals for every region, grand totals across all categories, and individual store breakdowns all in one pass, a computation that would take hours of manual report assembly otherwise. When a category manager wants to see how electronics performed in the Southeast versus the Midwest versus globally, all in the same report, these are the tools that make it possible in milliseconds.

ARRAY_AGG for collecting values

Daily Life

Interviews

Collect grouped values into a single array

ARRAY_AGG collects all values in a group into an array. Unlike SUM or COUNT which produce a single number, ARRAY_AGG preserves every individual value in a structured format that's easy for downstream systems to consume.

ARRAY_AGG Fundamentals

ARRAY_AGG collects multiple values into a single array column. This preserves detail while still grouping data.

Basic Syntax

ARRAY_AGG takes all values from a column within each group and returns them as a single array.

	SELECT
	customer_id,
	ARRAY_AGG(order_id) AS order_ids
	FROM orders
	GROUP BY customer_id

Result

customer_id	order_ids
C001	[1, 2, 3]
C002	[4, 5]
C003	[6]

For each customer, this collects all their order IDs into a single array. The result is an array type that can be processed by other functions.

Practical Example

Consider an orders table:

order_id	customer_id	product
1	C001	Laptop
2	C001	Mouse
3	C001	Keyboard
4	C002	Monitor
5	C002	Webcam
6	C003	Headphones

	SELECT
	customer_id,
	COUNT(*) AS order_count,
	ARRAY_AGG(product) AS products
	FROM orders
	GROUP BY customer_id

Result

customer_id	order_count	products
C001	3	[Laptop, Mouse, Keyboard]
C002	2	[Monitor, Webcam]
C003	1	[Headphones]

Each customer appears once with all their products collected in an array. The array preserves each individual value.

Arrays vs Strings

Arrays provide significant advantages over comma-separated strings for storing collected values.

Why Arrays Beat Strings

Arrays preserve structure in a way that comma-separated strings cannot.

Array Advantages

No parsing required: Elements are already separated and accessible
Rich typing: Numbers stay numbers, dates stay dates, strings stay strings
No delimiter collisions: What if a value contains a comma?
Native operations: Use CARDINALITY, ELEMENT_AT, UNNEST without string splitting
Clean JSON serialization: Arrays map directly to JSON arrays

Arrays as Strings

Analysts sometimes prefer readable strings over arrays. ARRAY_JOIN converts an array to a delimited string for display:

	SELECT
	customer_id,
	ARRAY_JOIN(ARRAY_AGG(product), ', ') AS products
	FROM orders
	GROUP BY customer_id

Result

customer_id	products
C001	Laptop, Mouse, Keyboard
C002	Monitor, Webcam

Which approach preserves type information when collecting values?

ARRAY_JOIN produces a flat string. Downstream code must split on the delimiter, and values containing commas will break the parsing.

	SELECT
	customer_id,
	ARRAY_JOIN(ARRAY_AGG(product), ', ') AS products
	FROM orders
	GROUP BY customer_id

Practice using ARRAY_AGG to collect values into a list grouped by a category.

> Complete this query to gather all metric names grouped by their department.

SELECT
  department,
  ___(___) AS all_metrics
FROM employee_metrics
GROUP BY department

ARRAY_AGG

metric_name

ARRAY_JOIN

department

ARRAY_AGG is a natural fit for building team rosters, product tag lists, and activity trails. Anywhere you currently use comma-joined strings, ARRAY_AGG is the structured upgrade.

Arrays produced by ARRAY_AGG can be passed directly to downstream systems like Python, Spark, or JSON APIs without any string parsing, which eliminates a whole class of data pipeline bugs.

TIP

If your downstream tool does not support native array types, convert with ARRAY_JOIN at the final output step. Keep the array form as long as possible inside your pipeline.

MAP_AGG for key-value pairs

Daily Life

Interviews

Build key-value lookups from grouped data

MAP_AGG builds a key-value map (dictionary) from two columns. While ARRAY_AGG collects values into a list, MAP_AGG creates structured lookups where each key maps to a value.

MAP_AGG Fundamentals

MAP_AGG creates key-value dictionaries from two columns, enabling direct lookups without parsing.

Basic Syntax

MAP_AGG takes two columns and builds key-value pairs for each group.

	SELECT
	customer_id,
	MAP_AGG(
	product_name,
	quantity
	) AS product_quantities
	FROM orders
	GROUP BY customer_id

Result

customer_id	product_quantities
C001	{Laptop=1, Mouse=2, Keyboard=1}
C002	{Monitor=1, Desk=1}
C003	{Chair=2}

For each customer, this creates a map where product names are keys and quantities are values. The result is a MAP type that provides O(1) key lookup.

Practical Example

Consider a product configuration table:

product_id	setting_name	setting_value
P001	color	red
P001	size	large
P001	material	cotton
P002	color	blue
P002	size	medium

	SELECT
	product_id,
	MAP_AGG(setting_name, setting_value) AS settings
	FROM products
	GROUP BY product_id

Result

product_id	settings
P001	{color=red, size=large, material=cotton}
P002	{color=blue, size=medium}

Each product now has a single row with all its settings in a map. Downstream code can access settings.color or settings.size directly without parsing.

ARRAY_AGG vs MAP_AGG

Choose the right aggregation based on your access pattern:

•ARRAY_AGG

Collects all values into a list
Iterate, check membership, count
Tags: ["sql", "tutorial", "beginner"]
Good for ordered sequences

•MAP_AGG

Builds key-value pairs
Direct lookup by key name
Prefs: {theme="dark", lang="en"}
Good for named attributes

In large-scale pipelines, MAP_AGG often eliminates the need for expensive joins back to configuration tables.

Accessing Map Values

Once you have a map, access values using bracket notation or the ELEMENT_AT function:

	SELECT
	product_id,
	settings 'color' AS color,
	settings 'size' AS size,
	ELEMENT_AT(settings, 'material') AS material
	FROM(SELECT product_id, MAP_AGG(setting_name, setting_value) AS settings FROM products GROUP BY product_id)

Result

product_id	color	size	material
P001	red	large	cotton
P002	blue	medium	NULL

Missing keys return NULL. This is useful for sparse data where not all entities have all attributes.

> Complete this query to build a map of feature names to their data types for each source.

SELECT
  source,
  ___(___, ___) AS feature_types
FROM ml_features
GROUP BY source

MAP_AGG

feat_name

dtype

ARRAY_AGG

source

MAP_AGG pivots multiple rows of key-value pairs into a single map per group, eliminating the need for self-joins on configuration tables.

If duplicate keys exist within a group, the last value wins. Ensure your data has unique keys per group or use deduplication first.

After building a map with MAP_AGG, use bracket notation like settings['color'] to access individual values efficiently.

Aggregate modifiers (ORDER BY)

Daily Life

Interviews

Control the order inside aggregated results

Both ARRAY_AGG and other aggregate functions support modifiers that control their behavior: ORDER BY for sequencing, and DISTINCT for removing duplicates.

Controlling Order

By default, aggregated values have no guaranteed order. Modifiers give you precise control over sequencing.

ORDER BY in Aggregates

By default, the order of aggregated values is not guaranteed. Use ORDER BY inside the aggregate to control the sequence:

	SELECT
	department,
	ARRAY_AGG(employee_name ORDER BY hire_date) AS employees_by_tenure
	FROM employee_metrics
	GROUP BY department

Result

department	employees_by_tenure
Engineering	[Alice, Bob, Carol, Dave]
Sales	[Eve, Frank, Grace]
Marketing	[Henry, Ivy]

Now employees appear in order of hire date (earliest first). You can use any column and ASC or DESC for ordering.

Time-Ordered Arrays

ORDER BY is especially useful for building ordered sequences like activity trails or event timelines:

	SELECT
	user_id,
	ARRAY_AGG(activity_type ORDER BY activity_time DESC) AS activity_trail
	FROM user_activities
	GROUP BY user_id

Result

user_id	activity_trail
U001	[purchase, cart, browse, login]
U002	[logout, browse, login]

The array preserves the sequence with most recent activity first. Downstream code can iterate through these events in order.

Removing Duplicates

DISTINCT inside aggregates removes duplicate values before collection, giving you unique entries only.

DISTINCT in Aggregates

Add DISTINCT to remove duplicate values before aggregating:

	SELECT
	customer_id,
	ARRAY_AGG(DISTINCT
	product_category
	) AS categories_purchased
	FROM orders
	GROUP BY customer_id

Result

customer_id	categories_purchased
C001	[Electronics, Accessories]
C002	[Electronics, Home Office]
C003	[Audio]

Even if a customer ordered multiple Electronics products, the category appears only once. Without DISTINCT, it would repeat.

DISTINCT inside ARRAY_AGG requires hashing or sorting, so it adds computational overhead. Only use it when uniqueness matters for your downstream consumers.

ORDER BY + DISTINCT

You can use both modifiers together:

	SELECT
	customer_id,
	ARRAY_AGG(DISTINCT
	product_category ORDER BY product_category
	) AS categories
	FROM orders
	GROUP BY customer_id

Result

customer_id	categories
C001	[Accessories, Electronics, Home]
C002	[Clothing, Electronics]
C003	[Home]

This gives you unique categories in alphabetical order for each customer.

COUNT_IF conditional counting

Daily Life

Interviews

Count only rows that match a condition

Sometimes you need to count only rows that meet a certain condition. For example: "How many orders are completed vs pending?" You could write two separate queries, but COUNT_IF lets you do it in one.

Basic COUNT_IF

COUNT_IF provides a concise way to count rows matching specific conditions without complex CASE expressions.

COUNT_IF Syntax

COUNT_IF counts only rows where the condition is TRUE:

	SELECT
	COUNT_IF(status = 'completed') AS completed,
	COUNT_IF(status = 'pending') AS pending,
	COUNT_IF(status = 'cancelled') AS cancelled
	FROM orders

Result

completed	pending	cancelled
941	232	104

Calculating Percentages

Conditional counts become even more powerful when converted to percentages. This reveals patterns that raw counts might hide.

	SELECT
	region,
	COUNT(*) AS total_orders,
	ROUND(
	100 * COUNT_IF(
	status = 'completed'
	) / COUNT(*),
	1
	) AS completion_rate
	FROM orders
	GROUP BY region

Result

region	total_orders	completion_rate
North	423	73.8
South	389	76.6
East	267	70.8
West	198	71.7

South has the highest completion rate (76.6%) while East has the lowest (70.8%). Raw counts alone wouldn't reveal this because South has fewer total orders.

Multiple Conditions

You can combine conditions using AND and OR within your conditional count:

	SELECT
	product_category,
	COUNT(*) AS total_reviews,
	COUNT_IF(
	rating >= 4
	AND verified_purchase = TRUE
	) AS verified_positive
	FROM reviews
	GROUP BY product_category

Result

product_category	total_reviews	verified_positive
Electronics	1523	892
Clothing	2156	1203
Home	987	645

This counts only positive reviews (4+ stars) from verified purchases, giving you a more trustworthy quality metric.

COUNT_IF Patterns

COUNT_IF enables counting multiple conditions in a single query, unlike WHERE which filters before aggregation.

COUNT_IF vs WHERE

Before reading on, predict which approach lets you count multiple statuses in one query:

WHERE filters rows before aggregation, so you can only count one status per query. Getting both completed and pending requires two separate queries.

	SELECT
	region,
	COUNT(*) AS completed
	FROM orders
	WHERE status = 'completed'
	GROUP BY region

See It In Action

Watch how COUNT_IF iterates through each row, evaluates the condition, and only counts matching rows:

Only rows where status = 'completed' are counted toward the result. The pending and cancelled rows are skipped entirely.

> Complete this query to break down each category's orders into completed and pending totals.

SELECT
  category,
  COUNT(*) AS total_orders,
  ___(status = ___) AS completed,
  ___(status = ___) AS pending
FROM orders
___ category

COUNT_IF

'completed'

COUNT_IF

'pending'

GROUP BY

SUM_IF

HAVING

COUNT_IF replaces a pattern that would otherwise require multiple subqueries or a CASE WHEN inside SUM. It expresses conditional counting in the most direct, readable way possible.

Once you start using COUNT_IF, you will find yourself replacing WHERE-filtered subqueries throughout your codebase. A single pass through the data is always faster than multiple filtered scans.

TIP

Not all SQL engines support COUNT_IF natively. In those cases, use SUM(CASE WHEN condition THEN 1 ELSE 0 END) as the equivalent pattern.

CASE WHEN in aggregates

Daily Life

Interviews

Split one aggregation into multiple buckets

COUNT_IF works great for counting, but what about SUM, AVG, and other aggregates? Use CASE WHEN to conditionally include values in any aggregate function.

Conditional Aggregation

CASE WHEN inside aggregates lets you apply different logic to different rows. The pattern differs for SUM versus AVG.

Conditional SUM and AVG

CASE WHEN inside SUM or AVG lets you selectively include values based on row-level conditions.

	SELECT
	department,
	SUM(salary) AS total_payroll,
	SUM(
	CASE
	WHEN is_manager THEN salary
	ELSE 0
	END
	) AS manager_payroll,
	AVG(
	CASE
	WHEN tenure_years >= 5 THEN salary
	END
	) AS avg_senior_salary
	FROM employee_metrics
	GROUP BY department

Result

department	total_payroll	manager_payroll	avg_senior_salary
Engineering	5400000	840000	135000.00
Sales	2880000	420000	98000.00
Marketing	1620000	310000	105000.00

Notice the critical difference: SUM uses ELSE 0 to avoid counting non-managers, while AVG omits ELSE entirely. Including ELSE 0 in AVG would drag down the average by counting zeros.

NULL Handling

Advanced aggregates require careful NULL handling to avoid unexpected results.

When collecting values into arrays, NULL values are typically excluded by default. Use COALESCE or filtering to control this behavior:

	SELECT
	customer_id,
	ARRAY_AGG(notes) AS notes,
	CARDINALITY(ARRAY_AGG(notes)) AS note_count
	FROM orders
	GROUP BY customer_id

Result

customer_id	notes	note_count
C001	[Rush order, Gift wrap]	2
C002	[]	0

Which query safely handles groups where all values are NULL?

If all notes are NULL, the array is empty and ELEMENT_AT returns NULL or may error depending on the engine. The result is ambiguous.

	SELECT
	customer_id,
	ELEMENT_AT(ARRAY_AGG(notes), 1) AS first_note
	FROM orders
	GROUP BY customer_id

Performance Tips

Advanced aggregates can be computationally expensive. Understanding their behavior helps you write efficient queries at scale.

Memory bound

Arrays built in-memory per group

DISTINCT cost

Requires hashing or sorting

ORDER BY cost

Adds sort overhead per group

Subquery LIMIT

Pre-filter to cap array sizes

These guidelines help you write efficient and predictable advanced aggregation queries.

✓Do

Use ARRAY_AGG with ORDER BY to control element ordering
Use COUNT_IF instead of SUM(CASE WHEN ... THEN 1 ELSE 0 END)
Check CARDINALITY for empty arrays before accessing elements
Use DISTINCT inside aggregates only when uniqueness matters
Pre-filter with subqueries to cap array sizes on large groups

✗Don't

Don't use string concatenation when ARRAY_AGG preserves type information
Don't use ELSE 0 inside AVG because it silently drags averages down
Don't assume array order is guaranteed without ORDER BY
Don't nest aggregation functions directly; use subqueries instead

Try using the SUM and CASE WHEN pattern to conditionally count rows within each group.

> Fill in the missing parts to count only active products within each category.

SELECT
  category,
  ___(
    ___
      WHEN status = 'active' THEN 1
      ELSE 0
    END
    ) AS active_count
FROM products
GROUP BY category

SUM

CASE

COUNT

These advanced aggregation patterns unlock powerful data transformations that go beyond simple summaries. Put them to the test with hands-on challenges.

The CASE WHEN pattern inside aggregates is universally supported across all major SQL engines: Trino, Spark SQL, BigQuery, Snowflake, PostgreSQL, and more. It is a transferable skill regardless of which platform you use.

TIP

Remember the ELSE rule: use ELSE 0 inside SUM so non-matching rows contribute zero. Omit ELSE inside AVG so non-matching rows are excluded from the denominator entirely.

❯❯❯PUTTING IT ALL TOGETHER

> You are a data engineer at Shopify building a merchant analytics report that rolls up sales events into structured summaries per seller. The pipeline must produce both array-level detail and conditional breakdowns in a single aggregation pass.

ARRAY_AGG() collects all product SKUs sold per merchant into a single array for downstream inventory analysis.

MAP_AGG() builds a key-value map of payment_method to total amount, enabling pivot-style breakdowns without multiple queries.

ORDER BY sale_time inside ARRAY_AGG() ensures the collected SKU sequence reflects chronological purchase order per merchant.

COUNT_IF() cleanly counts refunded transactions per merchant without nested CASE WHEN expressions.

KEY TAKEAWAYS

ARRAY_AGG collects all values into an array; use ARRAY_JOIN to convert to string

MAP_AGG builds key-value maps from two columns; great for pivot-style lookups

ORDER BY inside aggregates controls the sequence of collected values

DISTINCT inside aggregates removes duplicates before collecting

COUNT_IF(condition) is cleaner than SUM(CASE WHEN ... THEN 1 ELSE 0 END)

Use ELSE 0 for conditional SUM, but omit ELSE for conditional AVG

Arrays and maps preserve types; string concatenation loses type information

Check CARDINALITY(array) = 0 for empty groups before accessing elements

Aggregate into arrays, maps, and beyond

Category: SQL
Difficulty: advanced
Duration: 27 minutes
Challenges: 0 hands-on challenges

Topics covered: ARRAY_AGG for collecting values, MAP_AGG for key-value pairs, Aggregate modifiers (ORDER BY), COUNT_IF conditional counting, CASE WHEN in aggregates

Lesson Sections

ARRAY_AGG for collecting values (concepts: sqlArrayAgg)
ARRAY_AGG Fundamentals Basic Syntax For each customer, this collects all their order IDs into a single array. The result is an array type that can be processed by other functions. Practical Example Consider an orders table: Each customer appears once with all their products collected in an array. The array preserves each individual value. Arrays vs Strings Arrays provide significant advantages over comma-separated strings for storing collected values. Why Arrays Beat Strings Arrays preserve stru
MAP_AGG for key-value pairs (concepts: sqlMapAgg)
MAP_AGG Fundamentals Basic Syntax Practical Example Consider a product configuration table: Each product now has a single row with all its settings in a map. Downstream code can access settings.color or settings.size directly without parsing. ARRAY_AGG vs MAP_AGG Choose the right aggregation based on your access pattern: Accessing Map Values If duplicate keys exist within a group, the last value wins. Ensure your data has unique keys per group or use deduplication first.
Aggregate modifiers (ORDER BY) (concepts: sqlArrayAgg)
Controlling Order By default, aggregated values have no guaranteed order. Modifiers give you precise control over sequencing. ORDER BY in Aggregates Time-Ordered Arrays The array preserves the sequence with most recent activity first. Downstream code can iterate through these events in order. Removing Duplicates DISTINCT in Aggregates ORDER BY + DISTINCT You can use both modifiers together: This gives you unique categories in alphabetical order for each customer.
COUNT_IF conditional counting (concepts: sqlCountIf)
Basic COUNT_IF COUNT_IF Syntax Calculating Percentages Conditional counts become even more powerful when converted to percentages. This reveals patterns that raw counts might hide. South has the highest completion rate (76.6%) while East has the lowest (70.8%). Raw counts alone wouldn't reveal this because South has fewer total orders. Multiple Conditions This counts only positive reviews (4+ stars) from verified purchases, giving you a more trustworthy quality metric. COUNT_IF Patterns COUNT_IF
CASE WHEN in aggregates (concepts: sqlConditionalAgg)
Conditional Aggregation Conditional SUM and AVG NULL Handling Performance Tips Advanced aggregates can be computationally expensive. Understanding their behavior helps you write efficient queries at scale. These guidelines help you write efficient and predictable advanced aggregation queries. These advanced aggregation patterns unlock powerful data transformations that go beyond simple summaries. Put them to the test with hands-on challenges.