Window Functions: Beginner
Amazon ranks products across 12 million listings in real time, and it does not accomplish this by running a separate query for each of its hundreds of product categories. Instead, Amazon's data engineers use window functions to compute rankings within every category in a single pass over the data, using ROW_NUMBER and RANK with PARTITION BY to reset the ranking independently for Electronics, Clothing, Books, and every other vertical simultaneously. The alternative would be thousands of individual GROUP BY queries stitched together, which would be impossibly slow and expensive at scale. When you understand window functions, you understand how the world's largest retailer keeps its search results fresh.
What is a Window Function?
Calculate across rows without collapsing them
Window functions compute aggregates while keeping every row in your result intact. GROUP BY can tell you the highest sale per region, but it cannot show that figure alongside each individual transaction. That is exactly the gap window functions fill.
Ava Mitchell flagged a salary report with the same department total repeated on every row. Individual salaries are all different, but the totals look copy-pasted. Something calculated those numbers on purpose.
Six employees. Three departments. Count the output rows.
| id | name | department | salary |
|---|---|---|---|
| 1 | Chen | Engineering | 85000 |
| 2 | Patel | Engineering | 92000 |
| 3 | Kim | Engineering | 78000 |
| 4 | Osei | Marketing | 68000 |
| 5 | Tanaka | Marketing | 71000 |
| 6 | Ruiz | Sales | 88000 |
A window function performs a calculation across a set of rows that are related to the current row. Unlike GROUP BY, which collapses many rows into one summary row, a window function keeps every row and adds the computed value as a new column. The term "window" refers to the set of rows the function looks at when computing its result for the current row.
Think of it this way: GROUP BY answers questions like "what is the average salary per department?" by producing one row per department. Window functions answer questions like "what is each employee's salary compared to their department average?" by keeping every employee row and adding the department average as a new column next to it.
The key to window functions is the OVER clause. Every window function includes OVER() after the function name. The OVER clause defines which rows the function should consider, called the "window" of rows. Without the OVER clause, these functions behave like ordinary aggregate functions that collapse rows.
What Window Functions Solve
Comparing Approaches
Let us walk through a concrete example to see the difference between GROUP BY and window functions.
| employee_name | department | salary |
|---|---|---|
| Alice | Engineering | 120000 |
| Bob | Engineering | 110000 |
| Carol | Engineering | 130000 |
| Dave | Sales | 80000 |
| Eve | Sales | 90000 |
| Frank | Marketing | 95000 |
With GROUP BY, we get one row per department. The individual employees disappear entirely. With a window function, we keep all six rows and add the department average as a new column. Switch between the two approaches below to see the difference:
GROUP BY Result (Collapsed)
| department | avg_salary |
|---|---|
| Engineering | 120000.00 |
| Marketing | 95000.00 |
| Sales | 85000.00 |
Preserved Rows in Results
Notice how every row is preserved. Alice, Bob, and Carol each see the Engineering average of 120,000 next to their own salary. Dave and Eve each see the Sales average of 85,000. The OVER clause with PARTITION BY department tells SQL: "Calculate the average within each department independently, and attach the result to every row in that department."
- Collapses rows into groups
- Returns one row per group
- Loses individual row detail
- Cannot mix detail and summary
- Must list all non-aggregated columns
- Keeps every row intact
- Adds computed column to each row
- Preserves individual detail
- Detail and summary side by side
- No restrictions on selected columns
GROUP BY when you only need summaries (total sales per region, count of users per country). Use window functions when you need each row to have context about its group (each sale's percentage of the regional total, each user's rank within their country).The OVER Clause
The OVER clause is what makes a function a window function. It can contain up to three parts, each controlling a different aspect of which rows are included and how they are processed:
> Complete this query to add a running total of sales alongside each row.
SELECT user_id, total_amount, (total_amount) ( ORDER BY transaction_date ) AS running_total FROM transactions
An empty OVER() clause means "use all rows in the entire result set as the window." This gives every row the same computed value. Adding PARTITION BY narrows each window to rows sharing the same partition value. Adding ORDER BY inside the OVER clause defines row ordering, which is required for ranking functions and controls the direction of running totals.
You can combine all three parts. For example, OVER (PARTITION BY department ORDER BY salary DESC) means: "Within each department, process rows from highest salary to lowest." The function then runs independently for each department, in salary order.
What happens when you change the OVER clause from empty to partitioned?
Window functions execute after WHERE, GROUP BY, and HAVING. They operate on the result set that those clauses produce. You cannot reference a window function in a WHERE clause. If you need to filter on a window function result, wrap it in a subquery and filter in the outer query.
PARTITION BY Deep Dive
Run calculations within grouped subsets
PARTITION BY is the window function equivalent of GROUP BY. It divides the result set into independent subsets (partitions), and the window function runs separately within each partition. Crucially, unlike GROUP BY, it does NOT collapse rows. Every row remains in the output with its partition's computed value attached.
Think of PARTITION BY as drawing invisible boundaries around groups of rows. Within each boundary, the window function operates as if the other groups do not exist. Rankings restart, sums reset, and counts begin from scratch for each partition.
Single Column Partition
The most common use case is partitioning by a category column to rank or aggregate within each category. You have already seen this pattern: partitioning employees by department to rank salaries within each department. Here is another example using MAX to find the highest price in each product category:
Advanced Partitioning
Multiple Partition Columns
You can partition by multiple columns to create finer-grained groups. Each unique combination of partition values becomes its own window. This is useful when your data has multiple dimensions of grouping:
Partitioning by both department and fiscal_year creates separate windows like "Engineering 2024", "Engineering 2025", "Sales 2024", and so on. The average resets for each unique combination. Carol and Alice share the Engineering 2024 average of 125,000, while Carol and Bob in Engineering 2025 share an average of 132,500.
NULL in PARTITION BY
When a partition column contains NULL values, all NULL rows are placed into the SAME partition. This is different from most SQL operations where NULL does not equal NULL. In the context of window function partitioning, NULL values are grouped together as if they were equal:
Dave and Eve have no department (NULL). They land in the same partition and share a dept_count of 2. Be aware of this behavior when your partition columns might have missing values. If you need to exclude NULLs from partitioning, filter them out with a WHERE clause before the window function runs.
PARTITION BY vs None
When you omit PARTITION BY entirely, the entire result set is treated as a single partition. This means the window function considers ALL rows together. You can use both in the same query to get company-wide and department-level stats side by side:
The company_total column uses no partition, so every row sees the same grand total of 625,000. The dept_total column partitions by department, so Engineering rows see 360,000, Sales rows see 170,000, and Marketing sees 95,000. Having both in the same query lets you compute percentages like "Engineering represents 360,000 / 625,000 = 57.6% of total payroll."
With these partition behaviors in mind, here are the key guidelines for working with window functions.
- Use PARTITION BY to scope calculations to logical groups
- Combine window functions with regular columns freely
- Use multiple partition columns for fine-grained grouping
- Filter NULL partition columns with WHERE when you need to exclude them
- Don't use GROUP BY when you need to keep individual rows
- Don't reference window functions in WHERE clauses
- Don't assume NULL rows are excluded from partitions. NULLs form their own group
- Don't confuse PARTITION BY (no row collapse) with GROUP BY (collapses rows)
> Complete this query to show each metric alongside the total for its department.
SELECT metric_name, department, metric_value, (metric_value) OVER ( ) AS dept_total FROM employee_metrics
- No
PARTITION BY: entire result set is one window - Single column: one window per unique value in that column
- Multiple columns: one window per unique combination of values
NULLvalues: all NULLs are grouped into the same partition
Understanding PARTITION BY is the key to unlocking the full power of window functions. Every window function in the following sections builds on the partitioning behavior covered here.
The pattern of PARTITION BY combined with any window function is the foundation: it computes a value across a group of related rows while keeping all rows in the result. The upcoming sections apply this to ranking (ROW_NUMBER, RANK) and running calculations (SUM/AVG/COUNT OVER).
ROW_NUMBER
Assign unique position numbers to rows
ROW_NUMBER() assigns a unique sequential integer to each row within its partition, starting at 1. It is the most commonly used window function and appears in almost every data engineering interview. You will use it for numbering rows, pagination, deduplication, and picking the top N items per group.
Basic ROW_NUMBER
At minimum, ROW_NUMBER() needs an ORDER BY inside the OVER clause to know how to order the numbering. Without ORDER BY, the row numbering is non-deterministic and could change between query executions:
ROW_NUMBER + PARTITION BY
Adding PARTITION BY restarts the numbering for each group. The numbering begins at 1 within each partition independently. To get only the top earner per department, wrap this in a subquery and filter WHERE dept_salary_position = 1.
Multiple ORDER BY Columns
You can order by multiple columns to control tie-breaking. When two rows have the same value in the first column, the second column determines their order. This makes the numbering deterministic even when values repeat:
If two employees had the same salary, the one hired earlier (hire_date ASC) would get the lower position number. Adding a second sort column eliminates ambiguity and ensures consistent results across repeated runs.
Handling NULLs in ORDER BY
When your ordering column contains NULL values, you should explicitly control where they appear using NULLS LAST or NULLS FIRST. Without this, the default NULL positioning depends on your database engine and can surprise you:
NULLS LAST pushes rows with missing bonus values to the end. The default NULL sort order varies, so it is best practice to always specify NULLS LAST or NULLS FIRST explicitly so your query behaves the same across all database engines.
ROW_NUMBER Applications
ROW_NUMBER has powerful applications including deduplication, pagination, and selecting top-N per group.
Dedup with ROW_NUMBER
One of the most common interview patterns is using ROW_NUMBER to deduplicate data. If a table has duplicate rows, you can number them within each group and keep only the first one. This pattern uses ROW_NUMBER in a subquery, then filters for row number 1 in the outer query:
This keeps only the most recent signup per user. The PARTITION BY user_id groups rows by user, ORDER BY created_at DESC puts the newest first, and the outer WHERE rn = 1 keeps only the first row per user. This is cleaner than using GROUP BY with MAX and a self-join.
ROW_NUMBER() always produces unique numbers, even when rows have identical values. If two employees have the same salary, one will get position 3 and the other position 4. The tie-breaking is arbitrary unless you add more columns to ORDER BY. If ties matter, use RANK or DENSE_RANK instead.> Complete this query to number metrics within each department by value.
SELECT department, metric_name, metric_value, () OVER ( PARTITION BY ORDER BY metric_value ) AS rank FROM employee_metrics
ROW_NUMBER always produces unique sequential numbers, even when rows have identical values in the ORDER BY column.
The deduplication pattern filters WHERE rn = 1 after ROW_NUMBER to keep only the top row per partition.
If ties matter and rows with the same value should share a rank, use RANK or DENSE_RANK instead of ROW_NUMBER.
RANK and DENSE_RANK
Rank rows with or without gaps
When rows share the same value, ROW_NUMBER() arbitrarily assigns different numbers. But sometimes ties matter. If two students scored 92 on an exam, they should share the same rank. RANK() and DENSE_RANK() handle ties explicitly, giving identical values the same rank. They differ in what happens to the numbers that come after a tie.
How Ties Work
| student | score | ROW_NUMBER | RANK | DENSE_RANK |
|---|---|---|---|---|
| Alice | 98 | 1 | 1 | 1 |
| Bob | 92 | 2 | 2 | 2 |
| Carol | 92 | 3 | 2 | 2 |
| Dave | 85 | 4 | 4 | 3 |
| Eve | 78 | 5 | 5 | 4 |
ROW_NUMBER: Bob gets 2, Carol gets 3. No ties allowed. The order between Bob and Carol is arbitrary since they have the same score. One of them just happens to be processed first.
RANK: Bob and Carol both get 2. The next value skips to 4. Gaps appear after ties. Think of it like Olympic medals: if two athletes tie for silver, the next athlete gets bronze (3rd place), but in RANK's logic, they skip to 4th. There is no rank 3 because two rows occupied position 2.
DENSE_RANK: Bob and Carol both get 2. The next value is 3. No gaps. Every consecutive rank value is used. If you need to know "how many distinct score levels exist above Dave," DENSE_RANK tells you directly: 2 (which are rank 1 and rank 2).
When Ties Matter: Rankings
Imagine you are building a scholarship system. The school gives scholarships to students ranked in the top 3. If two students tie for rank 2, should the student just below them be ranked 3rd or 4th? The answer depends on your business rules, and that determines whether you use RANK or DENSE_RANK.
RANK Example
RANK assigns tied rows the same number and then skips ahead. If two rows share rank 2, the next row gets rank 4. The rank value tells you: "how many rows scored higher than or equal to this row?"
Bob and Carol tied at 92, so both get rank 2. RANK then skips to 4 for Dave. There is no rank 3 because two rows occupied position 2. Dave's rank of 4 tells you that exactly 3 students scored higher than or tied with him.
DENSE_RANK Example
DENSE_RANK assigns tied rows the same number but never skips values. The next row after a tie gets the very next integer. This is useful when you care about "how many distinct rank levels exist" rather than "what position would this be in a sorted list."
With DENSE_RANK, Dave gets 3 instead of 4. No gaps in the sequence. There are exactly 4 distinct rank levels in this dataset. If your scholarship covers "the top 3 rank levels," DENSE_RANK gives you Alice, Bob, Carol, and Dave (ranks 1, 2, and 3), while RANK would only give you Alice, Bob, and Carol (ranks 1 and 2, with rank 3 never assigned).
Choosing Between Functions
> Rank metrics by value within each department, giving tied values the same rank and skipping ahead after ties.
SELECT metric_name, department, metric_value, () OVER ( department ORDER BY metric_value ) AS value_rank FROM employee_metrics
RANK and DENSE_RANK both give tied rows the same number. The only difference is what happens after a tie: RANK skips the next value (1, 2, 2, 4), while DENSE_RANK never skips (1, 2, 2, 3).
Choose RANK when rank values should reflect position in a sorted list ("3rd out of 10"), and choose DENSE_RANK when you need to count distinct rank levels ("how many score tiers exist above this student?").
When ties do not matter and you need a unique number for every row, use ROW_NUMBER instead. It guarantees no two rows share the same value, making it the right choice for deduplication and pagination.
SUM/COUNT/AVG OVER
Compute running totals alongside detail rows
You already know SUM, COUNT, and AVG as aggregate functions with GROUP BY. When you pair them with OVER(), they become window functions. Instead of collapsing rows, they compute values across a window while preserving every row. This is one of the most practical applications of window functions in day-to-day data work.
Regular SUM vs SUM OVER
Before diving into running totals, let us clearly see the difference between a regular aggregate SUM and a window function SUM. The regular SUM collapses rows; the window SUM preserves them:
Running Totals: SUM OVER
A running total adds each new value to the accumulated sum so far. When you use SUM() with OVER(ORDER BY ...), SQL computes the sum from the first row up to and including the current row. Let us walk through each row step by step:
Grand Totals: SUM OVER ()
An empty OVER() clause (no PARTITION BY, no ORDER BY) means "use all rows." This is great for computing each row's percentage of a total without needing a subquery:
Every row sees the same total_payroll of 625,000 because the window includes all rows. Each employee's percentage of total payroll is calculated inline without needing a subquery or a separate GROUP BY query. Carol at 130,000 represents 20.8% of the total payroll.
Running Calculations
Adding ORDER BY to window aggregates creates running calculations that accumulate row by row.
COUNT OVER: Running Counts
COUNT() as a window function counts rows within the window. With PARTITION BY and no ORDER BY, it gives you the total count for the partition. With ORDER BY, it gives a running count that increments row by row:
The region_total_orders column shows the total count for the entire partition (3 for East, 2 for West) because there is no ORDER BY. The running_count column increments one by one as it processes rows within each partition because the ORDER BY limits the window to rows up to the current row.
AVG OVER: Cumulative Avg
AVG() as a window function computes the average across the window. With ORDER BY, it calculates an expanding average (also called a cumulative average) from the first row to the current row. The average adjusts as more data accumulates:
Default Window Frame
When you use an aggregate window function with ORDER BY inside OVER, the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. This means "from the very first row up to and including the current row." That is why you get running totals and cumulative averages automatically when you include ORDER BY.
Without ORDER BY in the OVER clause, the default frame includes ALL rows in the partition. That is why SUM() OVER (PARTITION BY region) gives you the same total on every row within the partition. The presence or absence of ORDER BY fundamentally changes the behavior from "grand total per partition" to "running total within the partition."
ORDER BY in the OVER clause, the default frame includes ALL rows in the partition (giving partition-wide totals). With ORDER BY, it only includes rows up to the current row (giving running totals). This distinction is the single most important thing to understand about aggregate window functions.> Complete this query to show each product alongside the overall average price across all products.
SELECT product_name, price, (price) () AS avg_price FROM products
Using AVG with an empty OVER() clause is the most efficient way to compare each row to a grand average because the database computes the average once and attaches it to every row, avoiding the need for a subquery or a self-join.
When you pair AVG with ORDER BY inside OVER, the result shifts from a fixed grand average to a cumulative average that changes with each row, reflecting the average of all rows processed so far in the ordering.
Aggregate window functions like SUM, COUNT, AVG, MIN, and MAX all follow the same pattern: add OVER() to turn them into window functions that preserve every row instead of collapsing the result set.
> You are a data analyst at Shopify ranking products within each category by monthly revenue so the merchandising team can identify top performers and allocate promotional budget per vertical.
OVER() clause turns aggregate-style functions into per-row calculations, keeping all product rows visible alongside their ranking.PARTITION BY divides the product catalog into independent category windows so ranking resets cleanly for Electronics, Apparel, and Home separately.ROW_NUMBER() assigns a unique sequential rank to every product even when two items share identical monthly revenue, guaranteeing one top pick per category.RANK() and DENSE_RANK() expose tie behavior differently, so choose DENSE_RANK() when the merchandising team needs gapless tier counts across tied revenue bands.GROUP BYROW_NUMBER gives unique numbers (great for deduplication); RANK and DENSE_RANK handle ties with or without gapsSUM/AVG/COUNT become window functions with OVER; adding ORDER BY makes them cumulativePARTITION BY scopes calculations to independent groups; omitting it treats all rows as one windowWHERE/GROUP BY/HAVING, so filter on them via a subqueryEvery row and the big picture at once
- Category
- SQL
- Difficulty
- beginner
- Duration
- 39 minutes
- Challenges
- 5 hands-on challenges
Topics covered: What is a Window Function?, PARTITION BY Deep Dive, ROW_NUMBER, RANK and DENSE_RANK, SUM/COUNT/AVG OVER
Lesson Sections
- What is a Window Function? (concepts: sqlPartitionBy)
What Window Functions Solve Consider a common analytical question: "Show me each employee's salary alongside the average salary for their department." This requires two things at once: individual row detail (each employee's name and salary) and aggregate context (the department average). Before window functions existed, you had to write a subquery or a self-join to combine these. Window functions let you do it in a single, clean pass. Comparing Approaches Here is our employee data: GROUP BY Resu
- PARTITION BY Deep Dive
Single Column Partition Every Electronics row sees 1,200 as the category max price (the Laptop). Every Clothing row sees 150 (the Jacket). The partitions are completely independent. You can immediately see how each product compares to the most expensive item in its category. Advanced Partitioning Once you understand basic partitioning, you can apply more sophisticated techniques to handle complex real-world scenarios like multi-dimensional groupings and edge cases. Multiple Partition Columns NUL
- ROW_NUMBER (concepts: sqlRowNumber)
Basic ROW_NUMBER Every employee gets a unique number from 1 to 6, ordered by salary from highest to lowest. Carol earns the most so she gets position 1. Dave earns the least so he gets position 6. No two rows ever share the same number. ROW_NUMBER + PARTITION BY Multiple ORDER BY Columns Handling NULLs in ORDER BY ROW_NUMBER Applications Dedup with ROW_NUMBER Which approach would you use to find the most recent signup per user?
- RANK and DENSE_RANK (concepts: sqlRankDenseRank)
How Ties Work Consider five students with test scores where two students scored 92. Here is how the three ranking functions differ on the exact same data: When Ties Matter: Rankings RANK Example DENSE_RANK Example Choosing Between Functions The choice between these three functions comes down to how you want ties handled and whether gaps in the numbering matter. Compare them side by side on the same query:
- SUM/COUNT/AVG OVER (concepts: sqlAggregateOver)
Regular SUM vs SUM OVER Running Totals: SUM OVER Here is how each row's running total is calculated: Row 1 (Jan 1): just 500, nothing before it. Row 2 (Jan 2): 500 + 300 = 800. Row 3 (Jan 3): 500 + 300 + 700 = 1,500. Row 4 (Jan 4): 500 + 300 + 700 + 200 = 1,700. Row 5 (Jan 5): 500 + 300 + 700 + 200 + 400 = 2,100. Each row accumulates all previous amounts plus its own. Running totals are perfect for cumulative revenue, inventory levels, progress toward a goal, or account balances. Any "how much s