Aggregating: Advanced

Walmart operates over 10,500 store locations across 19 countries, and its analytics team needs to simultaneously compare sales performance across every combination of store, product category, and time period without running hundreds of separate queries. Multi-dimensional aggregation with GROUPING SETS, ROLLUP, and CUBE lets a single SQL query produce subtotals for every region, grand totals across all categories, and individual store breakdowns all in one pass, a computation that would take hours of manual report assembly otherwise. When a category manager wants to see how electronics performed in the Southeast versus the Midwest versus globally, all in the same report, these are the tools that make it possible in milliseconds.

ARRAY_AGG for collecting values

Daily Life
Interviews

Collect grouped values into a single array

ARRAY_AGG collects all values in a group into an array. Unlike SUM or COUNT which produce a single number, ARRAY_AGG preserves every individual value in a structured format that's easy for downstream systems to consume.

ARRAY_AGG Fundamentals

ARRAY_AGG collects multiple values into a single array column. This preserves detail while still grouping data.

//

Basic Syntax

ARRAY_AGG takes all values from a column within each group and returns them as a single array.

1SELECT
2 customer_id,
3 ARRAY_AGG(order_id) AS order_ids
4FROM orders
5GROUP BY customer_id
Result
customer_idorder_ids
C001[1, 2, 3]
C002[4, 5]
C003[6]
For each customer, this collects all their order IDs into a single array. The result is an array type that can be processed by other functions.
//

Practical Example

Consider an orders table:
order_idcustomer_idproduct
1C001Laptop
2C001Mouse
3C001Keyboard
4C002Monitor
5C002Webcam
6C003Headphones
1SELECT
2 customer_id,
3 COUNT(*) AS order_count,
4 ARRAY_AGG(product) AS products
5FROM orders
6GROUP BY customer_id
Result
customer_idorder_countproducts
C0013[Laptop, Mouse, Keyboard]
C0022[Monitor, Webcam]
C0031[Headphones]
Each customer appears once with all their products collected in an array. The array preserves each individual value.

Arrays vs Strings

Arrays provide significant advantages over comma-separated strings for storing collected values.
//

Why Arrays Beat Strings

Arrays preserve structure in a way that comma-separated strings cannot.
Array Advantages
  • No parsing required: Elements are already separated and accessible
  • Rich typing: Numbers stay numbers, dates stay dates, strings stay strings
  • No delimiter collisions: What if a value contains a comma?
  • Native operations: Use CARDINALITY, ELEMENT_AT, UNNEST without string splitting
  • Clean JSON serialization: Arrays map directly to JSON arrays
//

Arrays as Strings

Analysts sometimes prefer readable strings over arrays. ARRAY_JOIN converts an array to a delimited string for display:

1SELECT
2 customer_id,
3 ARRAY_JOIN(ARRAY_AGG(product), ', ') AS products
4FROM orders
5GROUP BY customer_id
Result
customer_idproducts
C001Laptop, Mouse, Keyboard
C002Monitor, Webcam
Which approach preserves type information when collecting values?
ARRAY_JOIN produces a flat string. Downstream code must split on the delimiter, and values containing commas will break the parsing.
1SELECT
2 customer_id,
3 ARRAY_JOIN(ARRAY_AGG(product), ', ') AS products
4FROM orders
5GROUP BY customer_id

Practice using ARRAY_AGG to collect values into a list grouped by a category.

> Complete this query to gather all metric names grouped by their department.

SELECT
  department,
  ___(___) AS all_metrics
FROM employee_metrics
GROUP BY department
metric_name
ARRAY_JOIN
ARRAY_AGG
department

ARRAY_AGG is a natural fit for building team rosters, product tag lists, and activity trails. Anywhere you currently use comma-joined strings, ARRAY_AGG is the structured upgrade.

Arrays produced by ARRAY_AGG can be passed directly to downstream systems like Python, Spark, or JSON APIs without any string parsing, which eliminates a whole class of data pipeline bugs.

TIP
If your downstream tool does not support native array types, convert with ARRAY_JOIN at the final output step. Keep the array form as long as possible inside your pipeline.

MAP_AGG for key-value pairs

Daily Life
Interviews

Build key-value lookups from grouped data

MAP_AGG builds a key-value map (dictionary) from two columns. While ARRAY_AGG collects values into a list, MAP_AGG creates structured lookups where each key maps to a value.

MAP_AGG Fundamentals

MAP_AGG creates key-value dictionaries from two columns, enabling direct lookups without parsing.

//

Basic Syntax

MAP_AGG takes two columns and builds key-value pairs for each group.

1SELECT
2 customer_id,
3 MAP_AGG(
4 product_name,
5 quantity
6 ) AS product_quantities
7FROM orders
8GROUP BY customer_id
Result
customer_idproduct_quantities
C001{Laptop=1, Mouse=2, Keyboard=1}
C002{Monitor=1, Desk=1}
C003{Chair=2}

For each customer, this creates a map where product names are keys and quantities are values. The result is a MAP type that provides O(1) key lookup.

//

Practical Example

Consider a product configuration table:
product_idsetting_namesetting_value
P001colorred
P001sizelarge
P001materialcotton
P002colorblue
P002sizemedium
1SELECT
2 product_id,
3 MAP_AGG(setting_name, setting_value) AS settings
4FROM products
5GROUP BY product_id
Result
product_idsettings
P001{color=red, size=large, material=cotton}
P002{color=blue, size=medium}
Each product now has a single row with all its settings in a map. Downstream code can access settings.color or settings.size directly without parsing.
//

ARRAY_AGG vs MAP_AGG

Choose the right aggregation based on your access pattern:
ARRAY_AGG
  • Collects all values into a list
  • Iterate, check membership, count
  • Tags: ["sql", "tutorial", "beginner"]
  • Good for ordered sequences
MAP_AGG
  • Builds key-value pairs
  • Direct lookup by key name
  • Prefs: {theme="dark", lang="en"}
  • Good for named attributes

In large-scale pipelines, MAP_AGG often eliminates the need for expensive joins back to configuration tables.

//

Accessing Map Values

Once you have a map, access values using bracket notation or the ELEMENT_AT function:

1SELECT
2 product_id,
3 settings 'color' AS color,
4 settings 'size' AS size,
5 ELEMENT_AT(settings, 'material') AS material
6FROM(SELECT product_id, MAP_AGG(setting_name, setting_value) AS settings FROM products GROUP BY product_id)
Result
product_idcolorsizematerial
P001redlargecotton
P002bluemediumNULL

Missing keys return NULL. This is useful for sparse data where not all entities have all attributes.

> Complete this query to build a map of feature names to their data types for each source.

SELECT
  source,
  ___(___, ___) AS feature_types
FROM ml_features
GROUP BY source
source
MAP_AGG
feat_name
ARRAY_AGG
dtype

MAP_AGG pivots multiple rows of key-value pairs into a single map per group, eliminating the need for self-joins on configuration tables.

If duplicate keys exist within a group, the last value wins. Ensure your data has unique keys per group or use deduplication first.

After building a map with MAP_AGG, use bracket notation like settings['color'] to access individual values efficiently.

Aggregate modifiers (ORDER BY)

Daily Life
Interviews

Control the order inside aggregated results

Both ARRAY_AGG and other aggregate functions support modifiers that control their behavior: ORDER BY for sequencing, and DISTINCT for removing duplicates.

Controlling Order

By default, aggregated values have no guaranteed order. Modifiers give you precise control over sequencing.
//

ORDER BY in Aggregates

By default, the order of aggregated values is not guaranteed. Use ORDER BY inside the aggregate to control the sequence:

1SELECT
2 department,
3 ARRAY_AGG(employee_name ORDER BY hire_date) AS employees_by_tenure
4FROM employee_metrics
5GROUP BY department
Result
departmentemployees_by_tenure
Engineering[Alice, Bob, Carol, Dave]
Sales[Eve, Frank, Grace]
Marketing[Henry, Ivy]

Now employees appear in order of hire date (earliest first). You can use any column and ASC or DESC for ordering.

//

Time-Ordered Arrays

ORDER BY is especially useful for building ordered sequences like activity trails or event timelines:

1SELECT
2 user_id,
3 ARRAY_AGG(activity_type ORDER BY activity_time DESC) AS activity_trail
4FROM user_activities
5GROUP BY user_id
Result
user_idactivity_trail
U001[purchase, cart, browse, login]
U002[logout, browse, login]
The array preserves the sequence with most recent activity first. Downstream code can iterate through these events in order.

Removing Duplicates

DISTINCT inside aggregates removes duplicate values before collection, giving you unique entries only.

//

DISTINCT in Aggregates

Add DISTINCT to remove duplicate values before aggregating:

1SELECT
2 customer_id,
3 ARRAY_AGG(DISTINCT
4 product_category
5 ) AS categories_purchased
6FROM orders
7GROUP BY customer_id
Result
customer_idcategories_purchased
C001[Electronics, Accessories]
C002[Electronics, Home Office]
C003[Audio]

Even if a customer ordered multiple Electronics products, the category appears only once. Without DISTINCT, it would repeat.

DISTINCT inside ARRAY_AGG requires hashing or sorting, so it adds computational overhead. Only use it when uniqueness matters for your downstream consumers.

//

ORDER BY + DISTINCT

You can use both modifiers together:
1SELECT
2 customer_id,
3 ARRAY_AGG(DISTINCT
4 product_category ORDER BY product_category
5 ) AS categories
6FROM orders
7GROUP BY customer_id
Result
customer_idcategories
C001[Accessories, Electronics, Home]
C002[Clothing, Electronics]
C003[Home]
This gives you unique categories in alphabetical order for each customer.

COUNT_IF conditional counting

Daily Life
Interviews

Count only rows that match a condition

Sometimes you need to count only rows that meet a certain condition. For example: "How many orders are completed vs pending?" You could write two separate queries, but COUNT_IF lets you do it in one.

Basic COUNT_IF

COUNT_IF provides a concise way to count rows matching specific conditions without complex CASE expressions.

//

COUNT_IF Syntax

COUNT_IF counts only rows where the condition is TRUE:

1SELECT
2 COUNT_IF(status = 'completed') AS completed,
3 COUNT_IF(status = 'pending') AS pending,
4 COUNT_IF(status = 'cancelled') AS cancelled
5FROM orders
Result
completedpendingcancelled
941232104
//

Calculating Percentages

Conditional counts become even more powerful when converted to percentages. This reveals patterns that raw counts might hide.
1SELECT
2 region,
3 COUNT(*) AS total_orders,
4 ROUND(
5 100 * COUNT_IF(
6 status = 'completed'
7 ) / COUNT(*),
8 1
9 ) AS completion_rate
10FROM orders
11GROUP BY region
Result
regiontotal_orderscompletion_rate
North42373.8
South38976.6
East26770.8
West19871.7
South has the highest completion rate (76.6%) while East has the lowest (70.8%). Raw counts alone wouldn't reveal this because South has fewer total orders.
//

Multiple Conditions

You can combine conditions using AND and OR within your conditional count:

1SELECT
2 product_category,
3 COUNT(*) AS total_reviews,
4 COUNT_IF(
5 rating >= 4
6 AND verified_purchase = TRUE
7 ) AS verified_positive
8FROM reviews
9GROUP BY product_category
Result
product_categorytotal_reviewsverified_positive
Electronics1523892
Clothing21561203
Home987645
This counts only positive reviews (4+ stars) from verified purchases, giving you a more trustworthy quality metric.

COUNT_IF Patterns

COUNT_IF enables counting multiple conditions in a single query, unlike WHERE which filters before aggregation.

//

COUNT_IF vs WHERE

Before reading on, predict which approach lets you count multiple statuses in one query:
WHERE filters rows before aggregation, so you can only count one status per query. Getting both completed and pending requires two separate queries.
1SELECT
2 region,
3 COUNT(*) AS completed
4FROM orders
5WHERE status = 'completed'
6GROUP BY region
//

See It In Action

Watch how COUNT_IF iterates through each row, evaluates the condition, and only counts matching rows:

Only rows where status = 'completed' are counted toward the result. The pending and cancelled rows are skipped entirely.

> Complete this query to break down each category's orders into completed and pending totals.

SELECT
  category,
  COUNT(*) AS total_orders,
  ___(status = ___) AS completed,
  ___(status = ___) AS pending
FROM orders
___ category
GROUP BY
'pending'
COUNT_IF
HAVING
COUNT_IF
SUM_IF
'completed'

COUNT_IF replaces a pattern that would otherwise require multiple subqueries or a CASE WHEN inside SUM. It expresses conditional counting in the most direct, readable way possible.

Once you start using COUNT_IF, you will find yourself replacing WHERE-filtered subqueries throughout your codebase. A single pass through the data is always faster than multiple filtered scans.

TIP
Not all SQL engines support COUNT_IF natively. In those cases, use SUM(CASE WHEN condition THEN 1 ELSE 0 END) as the equivalent pattern.

CASE WHEN in aggregates

Daily Life
Interviews

Split one aggregation into multiple buckets

COUNT_IF works great for counting, but what about SUM, AVG, and other aggregates? Use CASE WHEN to conditionally include values in any aggregate function.

Conditional Aggregation

CASE WHEN inside aggregates lets you apply different logic to different rows. The pattern differs for SUM versus AVG.

//

Conditional SUM and AVG

CASE WHEN inside SUM or AVG lets you selectively include values based on row-level conditions.

1SELECT
2 department,
3 SUM(salary) AS total_payroll,
4 SUM(
5 CASE
6 WHEN is_manager THEN salary
7 ELSE 0
8 END
9 ) AS manager_payroll,
10 AVG(
11 CASE
12 WHEN tenure_years >= 5 THEN salary
13 END
14 ) AS avg_senior_salary
15FROM employee_metrics
16GROUP BY department
Result
departmenttotal_payrollmanager_payrollavg_senior_salary
Engineering5400000840000135000.00
Sales288000042000098000.00
Marketing1620000310000105000.00

Notice the critical difference: SUM uses ELSE 0 to avoid counting non-managers, while AVG omits ELSE entirely. Including ELSE 0 in AVG would drag down the average by counting zeros.

//

NULL Handling

Advanced aggregates require careful NULL handling to avoid unexpected results.

When collecting values into arrays, NULL values are typically excluded by default. Use COALESCE or filtering to control this behavior:

1SELECT
2 customer_id,
3 ARRAY_AGG(notes) AS notes,
4 CARDINALITY(ARRAY_AGG(notes)) AS note_count
5FROM orders
6GROUP BY customer_id
Result
customer_idnotesnote_count
C001[Rush order, Gift wrap]2
C002[]0

Which query safely handles groups where all values are NULL?

If all notes are NULL, the array is empty and ELEMENT_AT returns NULL or may error depending on the engine. The result is ambiguous.
1SELECT
2 customer_id,
3 ELEMENT_AT(ARRAY_AGG(notes), 1) AS first_note
4FROM orders
5GROUP BY customer_id
//

Performance Tips

Advanced aggregates can be computationally expensive. Understanding their behavior helps you write efficient queries at scale.
01
Memory bound
Arrays built in-memory per group
02
DISTINCT cost
Requires hashing or sorting
03
ORDER BY cost
Adds sort overhead per group
04
Subquery LIMIT
Pre-filter to cap array sizes
These guidelines help you write efficient and predictable advanced aggregation queries.
Do
  • Use ARRAY_AGG with ORDER BY to control element ordering
  • Use COUNT_IF instead of SUM(CASE WHEN ... THEN 1 ELSE 0 END)
  • Check CARDINALITY for empty arrays before accessing elements
  • Use DISTINCT inside aggregates only when uniqueness matters
  • Pre-filter with subqueries to cap array sizes on large groups
Don't
  • Don't use string concatenation when ARRAY_AGG preserves type information
  • Don't use ELSE 0 inside AVG because it silently drags averages down
  • Don't assume array order is guaranteed without ORDER BY
  • Don't nest aggregation functions directly; use subqueries instead

Try using the SUM and CASE WHEN pattern to conditionally count rows within each group.

> Fill in the missing parts to count only active products within each category.

SELECT
  category,
  ___(
    ___
      WHEN status = 'active' THEN 1
      ELSE 0
    END
    ) AS active_count
FROM products
GROUP BY category
CASE
SUM
IF
COUNT
These advanced aggregation patterns unlock powerful data transformations that go beyond simple summaries. Put them to the test with hands-on challenges.

The CASE WHEN pattern inside aggregates is universally supported across all major SQL engines: Trino, Spark SQL, BigQuery, Snowflake, PostgreSQL, and more. It is a transferable skill regardless of which platform you use.

TIP
Remember the ELSE rule: use ELSE 0 inside SUM so non-matching rows contribute zero. Omit ELSE inside AVG so non-matching rows are excluded from the denominator entirely.
PUTTING IT ALL TOGETHER

> You are a data engineer at Shopify building a merchant analytics report that rolls up sales events into structured summaries per seller. The pipeline must produce both array-level detail and conditional breakdowns in a single aggregation pass.

ARRAY_AGG() collects all product SKUs sold per merchant into a single array for downstream inventory analysis.
MAP_AGG() builds a key-value map of payment_method to total amount, enabling pivot-style breakdowns without multiple queries.
ORDER BY sale_time inside ARRAY_AGG() ensures the collected SKU sequence reflects chronological purchase order per merchant.
COUNT_IF() cleanly counts refunded transactions per merchant without nested CASE WHEN expressions.
KEY TAKEAWAYS
ARRAY_AGG collects all values into an array; use ARRAY_JOIN to convert to string
MAP_AGG builds key-value maps from two columns; great for pivot-style lookups
ORDER BY inside aggregates controls the sequence of collected values
DISTINCT inside aggregates removes duplicates before collecting
COUNT_IF(condition) is cleaner than SUM(CASE WHEN ... THEN 1 ELSE 0 END)
Use ELSE 0 for conditional SUM, but omit ELSE for conditional AVG
Arrays and maps preserve types; string concatenation loses type information
Check CARDINALITY(array) = 0 for empty groups before accessing elements

Aggregate into arrays, maps, and beyond

Category
SQL
Difficulty
advanced
Duration
27 minutes
Challenges
0 hands-on challenges

Topics covered: ARRAY_AGG for collecting values, MAP_AGG for key-value pairs, Aggregate modifiers (ORDER BY), COUNT_IF conditional counting, CASE WHEN in aggregates

Lesson Sections

  1. ARRAY_AGG for collecting values (concepts: sqlArrayAgg)

    ARRAY_AGG Fundamentals Basic Syntax For each customer, this collects all their order IDs into a single array. The result is an array type that can be processed by other functions. Practical Example Consider an orders table: Each customer appears once with all their products collected in an array. The array preserves each individual value. Arrays vs Strings Arrays provide significant advantages over comma-separated strings for storing collected values. Why Arrays Beat Strings Arrays preserve stru

  2. MAP_AGG for key-value pairs (concepts: sqlMapAgg)

    MAP_AGG Fundamentals Basic Syntax Practical Example Consider a product configuration table: Each product now has a single row with all its settings in a map. Downstream code can access settings.color or settings.size directly without parsing. ARRAY_AGG vs MAP_AGG Choose the right aggregation based on your access pattern: Accessing Map Values If duplicate keys exist within a group, the last value wins. Ensure your data has unique keys per group or use deduplication first.

  3. Aggregate modifiers (ORDER BY)

    Controlling Order By default, aggregated values have no guaranteed order. Modifiers give you precise control over sequencing. ORDER BY in Aggregates Time-Ordered Arrays The array preserves the sequence with most recent activity first. Downstream code can iterate through these events in order. Removing Duplicates DISTINCT in Aggregates ORDER BY + DISTINCT You can use both modifiers together: This gives you unique categories in alphabetical order for each customer.

  4. COUNT_IF conditional counting (concepts: sqlCountIf)

    Basic COUNT_IF COUNT_IF Syntax Calculating Percentages Conditional counts become even more powerful when converted to percentages. This reveals patterns that raw counts might hide. South has the highest completion rate (76.6%) while East has the lowest (70.8%). Raw counts alone wouldn't reveal this because South has fewer total orders. Multiple Conditions This counts only positive reviews (4+ stars) from verified purchases, giving you a more trustworthy quality metric. COUNT_IF Patterns COUNT_IF

  5. CASE WHEN in aggregates (concepts: sqlConditionalAgg)

    Conditional Aggregation Conditional SUM and AVG NULL Handling Performance Tips Advanced aggregates can be computationally expensive. Understanding their behavior helps you write efficient queries at scale. These guidelines help you write efficient and predictable advanced aggregation queries. These advanced aggregation patterns unlock powerful data transformations that go beyond simple summaries. Put them to the test with hands-on challenges.