Aggregating: Advanced
Walmart operates over 10,500 store locations across 19 countries, and its analytics team needs to simultaneously compare sales performance across every combination of store, product category, and time period without running hundreds of separate queries. Multi-dimensional aggregation with GROUPING SETS, ROLLUP, and CUBE lets a single SQL query produce subtotals for every region, grand totals across all categories, and individual store breakdowns all in one pass, a computation that would take hours of manual report assembly otherwise. When a category manager wants to see how electronics performed in the Southeast versus the Midwest versus globally, all in the same report, these are the tools that make it possible in milliseconds.
ARRAY_AGG for collecting values
Collect grouped values into a single array
ARRAY_AGG collects all values in a group into an array. Unlike SUM or COUNT which produce a single number, ARRAY_AGG preserves every individual value in a structured format that's easy for downstream systems to consume.
ARRAY_AGG Fundamentals
ARRAY_AGG collects multiple values into a single array column. This preserves detail while still grouping data.
Basic Syntax
ARRAY_AGG takes all values from a column within each group and returns them as a single array.
Practical Example
| order_id | customer_id | product |
|---|---|---|
| 1 | C001 | Laptop |
| 2 | C001 | Mouse |
| 3 | C001 | Keyboard |
| 4 | C002 | Monitor |
| 5 | C002 | Webcam |
| 6 | C003 | Headphones |
Arrays vs Strings
Why Arrays Beat Strings
- No parsing required: Elements are already separated and accessible
- Rich typing: Numbers stay numbers, dates stay dates, strings stay strings
- No delimiter collisions: What if a value contains a comma?
- Native operations: Use
CARDINALITY,ELEMENT_AT,UNNESTwithout string splitting - Clean JSON serialization: Arrays map directly to JSON arrays
Arrays as Strings
Analysts sometimes prefer readable strings over arrays. ARRAY_JOIN converts an array to a delimited string for display:
Practice using ARRAY_AGG to collect values into a list grouped by a category.
> Complete this query to gather all metric names grouped by their department.
SELECT department, () AS all_metrics FROM employee_metrics GROUP BY department
ARRAY_AGG is a natural fit for building team rosters, product tag lists, and activity trails. Anywhere you currently use comma-joined strings, ARRAY_AGG is the structured upgrade.
Arrays produced by ARRAY_AGG can be passed directly to downstream systems like Python, Spark, or JSON APIs without any string parsing, which eliminates a whole class of data pipeline bugs.
ARRAY_JOIN at the final output step. Keep the array form as long as possible inside your pipeline.MAP_AGG for key-value pairs
Build key-value lookups from grouped data
MAP_AGG builds a key-value map (dictionary) from two columns. While ARRAY_AGG collects values into a list, MAP_AGG creates structured lookups where each key maps to a value.
MAP_AGG Fundamentals
MAP_AGG creates key-value dictionaries from two columns, enabling direct lookups without parsing.
Basic Syntax
MAP_AGG takes two columns and builds key-value pairs for each group.
For each customer, this creates a map where product names are keys and quantities are values. The result is a MAP type that provides O(1) key lookup.
Practical Example
| product_id | setting_name | setting_value |
|---|---|---|
| P001 | color | red |
| P001 | size | large |
| P001 | material | cotton |
| P002 | color | blue |
| P002 | size | medium |
ARRAY_AGG vs MAP_AGG
- Collects all values into a list
- Iterate, check membership, count
- Tags: ["sql", "tutorial", "beginner"]
- Good for ordered sequences
- Builds key-value pairs
- Direct lookup by key name
- Prefs: {theme="dark", lang="en"}
- Good for named attributes
In large-scale pipelines, MAP_AGG often eliminates the need for expensive joins back to configuration tables.
Accessing Map Values
Once you have a map, access values using bracket notation or the ELEMENT_AT function:
Missing keys return NULL. This is useful for sparse data where not all entities have all attributes.
> Complete this query to build a map of feature names to their data types for each source.
SELECT source, (, ) AS feature_types FROM ml_features GROUP BY source
MAP_AGG pivots multiple rows of key-value pairs into a single map per group, eliminating the need for self-joins on configuration tables.
After building a map with MAP_AGG, use bracket notation like settings['color'] to access individual values efficiently.
Aggregate modifiers (ORDER BY)
Control the order inside aggregated results
Both ARRAY_AGG and other aggregate functions support modifiers that control their behavior: ORDER BY for sequencing, and DISTINCT for removing duplicates.
Controlling Order
ORDER BY in Aggregates
By default, the order of aggregated values is not guaranteed. Use ORDER BY inside the aggregate to control the sequence:
Now employees appear in order of hire date (earliest first). You can use any column and ASC or DESC for ordering.
Time-Ordered Arrays
ORDER BY is especially useful for building ordered sequences like activity trails or event timelines:
Removing Duplicates
DISTINCT inside aggregates removes duplicate values before collection, giving you unique entries only.
DISTINCT in Aggregates
Add DISTINCT to remove duplicate values before aggregating:
Even if a customer ordered multiple Electronics products, the category appears only once. Without DISTINCT, it would repeat.
DISTINCT inside ARRAY_AGG requires hashing or sorting, so it adds computational overhead. Only use it when uniqueness matters for your downstream consumers.
ORDER BY + DISTINCT
COUNT_IF conditional counting
Count only rows that match a condition
Sometimes you need to count only rows that meet a certain condition. For example: "How many orders are completed vs pending?" You could write two separate queries, but COUNT_IF lets you do it in one.
Basic COUNT_IF
COUNT_IF provides a concise way to count rows matching specific conditions without complex CASE expressions.
COUNT_IF Syntax
COUNT_IF counts only rows where the condition is TRUE:
Calculating Percentages
Multiple Conditions
You can combine conditions using AND and OR within your conditional count:
COUNT_IF Patterns
COUNT_IF enables counting multiple conditions in a single query, unlike WHERE which filters before aggregation.
COUNT_IF vs WHERE
See It In Action
Watch how COUNT_IF iterates through each row, evaluates the condition, and only counts matching rows:
> Complete this query to break down each category's orders into completed and pending totals.
SELECT category, COUNT(*) AS total_orders, (status = ) AS completed, (status = ) AS pending FROM orders category
COUNT_IF replaces a pattern that would otherwise require multiple subqueries or a CASE WHEN inside SUM. It expresses conditional counting in the most direct, readable way possible.
Once you start using COUNT_IF, you will find yourself replacing WHERE-filtered subqueries throughout your codebase. A single pass through the data is always faster than multiple filtered scans.
COUNT_IF natively. In those cases, use SUM(CASE WHEN condition THEN 1 ELSE 0 END) as the equivalent pattern.CASE WHEN in aggregates
Split one aggregation into multiple buckets
COUNT_IF works great for counting, but what about SUM, AVG, and other aggregates? Use CASE WHEN to conditionally include values in any aggregate function.
Conditional Aggregation
CASE WHEN inside aggregates lets you apply different logic to different rows. The pattern differs for SUM versus AVG.
Conditional SUM and AVG
CASE WHEN inside SUM or AVG lets you selectively include values based on row-level conditions.
Notice the critical difference: SUM uses ELSE 0 to avoid counting non-managers, while AVG omits ELSE entirely. Including ELSE 0 in AVG would drag down the average by counting zeros.
NULL Handling
Advanced aggregates require careful NULL handling to avoid unexpected results.
When collecting values into arrays, NULL values are typically excluded by default. Use COALESCE or filtering to control this behavior:
Which query safely handles groups where all values are NULL?
Performance Tips
- Use ARRAY_AGG with ORDER BY to control element ordering
- Use COUNT_IF instead of SUM(CASE WHEN ... THEN 1 ELSE 0 END)
- Check CARDINALITY for empty arrays before accessing elements
- Use DISTINCT inside aggregates only when uniqueness matters
- Pre-filter with subqueries to cap array sizes on large groups
- Don't use string concatenation when ARRAY_AGG preserves type information
- Don't use ELSE 0 inside AVG because it silently drags averages down
- Don't assume array order is guaranteed without ORDER BY
- Don't nest aggregation functions directly; use subqueries instead
Try using the SUM and CASE WHEN pattern to conditionally count rows within each group.
> Fill in the missing parts to count only active products within each category.
SELECT category, ( WHEN status = 'active' THEN 1 ELSE 0 END ) AS active_count FROM products GROUP BY category
The CASE WHEN pattern inside aggregates is universally supported across all major SQL engines: Trino, Spark SQL, BigQuery, Snowflake, PostgreSQL, and more. It is a transferable skill regardless of which platform you use.
ELSE rule: use ELSE 0 inside SUM so non-matching rows contribute zero. Omit ELSE inside AVG so non-matching rows are excluded from the denominator entirely.> You are a data engineer at Shopify building a merchant analytics report that rolls up sales events into structured summaries per seller. The pipeline must produce both array-level detail and conditional breakdowns in a single aggregation pass.
ARRAY_AGG() collects all product SKUs sold per merchant into a single array for downstream inventory analysis.MAP_AGG() builds a key-value map of payment_method to total amount, enabling pivot-style breakdowns without multiple queries.ORDER BY sale_time inside ARRAY_AGG() ensures the collected SKU sequence reflects chronological purchase order per merchant.COUNT_IF() cleanly counts refunded transactions per merchant without nested CASE WHEN expressions.ARRAY_AGG collects all values into an array; use ARRAY_JOIN to convert to stringMAP_AGG builds key-value maps from two columns; great for pivot-style lookupsORDER BY inside aggregates controls the sequence of collected valuesDISTINCT inside aggregates removes duplicates before collectingCOUNT_IF(condition) is cleaner than SUM(CASE WHEN ... THEN 1 ELSE 0 END)ELSE 0 for conditional SUM, but omit ELSE for conditional AVGCARDINALITY(array) = 0 for empty groups before accessing elementsAggregate into arrays, maps, and beyond
- Category
- SQL
- Difficulty
- advanced
- Duration
- 27 minutes
- Challenges
- 0 hands-on challenges
Topics covered: ARRAY_AGG for collecting values, MAP_AGG for key-value pairs, Aggregate modifiers (ORDER BY), COUNT_IF conditional counting, CASE WHEN in aggregates
Lesson Sections
- ARRAY_AGG for collecting values (concepts: sqlArrayAgg)
ARRAY_AGG Fundamentals Basic Syntax For each customer, this collects all their order IDs into a single array. The result is an array type that can be processed by other functions. Practical Example Consider an orders table: Each customer appears once with all their products collected in an array. The array preserves each individual value. Arrays vs Strings Arrays provide significant advantages over comma-separated strings for storing collected values. Why Arrays Beat Strings Arrays preserve stru
- MAP_AGG for key-value pairs (concepts: sqlMapAgg)
MAP_AGG Fundamentals Basic Syntax Practical Example Consider a product configuration table: Each product now has a single row with all its settings in a map. Downstream code can access settings.color or settings.size directly without parsing. ARRAY_AGG vs MAP_AGG Choose the right aggregation based on your access pattern: Accessing Map Values If duplicate keys exist within a group, the last value wins. Ensure your data has unique keys per group or use deduplication first.
- Aggregate modifiers (ORDER BY)
Controlling Order By default, aggregated values have no guaranteed order. Modifiers give you precise control over sequencing. ORDER BY in Aggregates Time-Ordered Arrays The array preserves the sequence with most recent activity first. Downstream code can iterate through these events in order. Removing Duplicates DISTINCT in Aggregates ORDER BY + DISTINCT You can use both modifiers together: This gives you unique categories in alphabetical order for each customer.
- COUNT_IF conditional counting (concepts: sqlCountIf)
Basic COUNT_IF COUNT_IF Syntax Calculating Percentages Conditional counts become even more powerful when converted to percentages. This reveals patterns that raw counts might hide. South has the highest completion rate (76.6%) while East has the lowest (70.8%). Raw counts alone wouldn't reveal this because South has fewer total orders. Multiple Conditions This counts only positive reviews (4+ stars) from verified purchases, giving you a more trustworthy quality metric. COUNT_IF Patterns COUNT_IF
- CASE WHEN in aggregates (concepts: sqlConditionalAgg)
Conditional Aggregation Conditional SUM and AVG NULL Handling Performance Tips Advanced aggregates can be computationally expensive. Understanding their behavior helps you write efficient queries at scale. These guidelines help you write efficient and predictable advanced aggregation queries. These advanced aggregation patterns unlock powerful data transformations that go beyond simple summaries. Put them to the test with hands-on challenges.