Window Functions: Advanced

Quantitative hedge funds like Jane Street and Citadel execute trades worth billions of dollars every day, and their analysts use advanced SQL window expressions to compute the signals that drive those decisions. NTILE and PERCENT_RANK rank security returns into percentile buckets, ROWS BETWEEN frame specifications compute rolling Sharpe ratios over trailing windows, and CUME_DIST identifies regime-change anomalies where a return falls in the extreme tail of its distribution. These are not academic exercises: a single well-designed window query can surface a mispriced asset before the market corrects, and the firms that build those queries fastest win. The advanced patterns in this lesson are what separate a working knowledge of SQL from the kind of analytical power that moves markets.

Windows vs Self-Joins

Daily Life
Interviews

Eliminate costly self-joins with windows

One of the most impactful things you can do with window functions is eliminate self-joins. A self-join is when a table is joined to itself, typically to compare each row with a related row in the same table. Self-joins are expensive: the engine must scan the table twice, build a join hash map, and match rows. Window functions accomplish the same comparison in a single pass over the data.
Self-joins are extremely common in legacy codebases. Before window functions were widely supported, they were the only way to compare a row with its neighbors. You will encounter them frequently when working with older data warehouses or reading code written by analysts who learned SQL before window functions became standard. Recognizing these patterns and knowing how to rewrite them is a high-value skill.

Consider a common requirement: for each order, show the previous order amount for the same customer. The self-join approach requires joining the orders table to itself, matching on customer and finding the most recent prior order. The window function approach uses a single LAG call.

Rewriting Self-Joins

Let us examine the most common self-join patterns and their window function equivalents side by side.
//

Self-Join vs LAG

1SELECT
2 o1.order_id,
3 o1.customer_id,
4 o1.amount,
5 o2.amount AS prev_amount
6FROM orders AS o1
7LEFT JOIN orders AS o2
8 ON o1.customer_id = o2.customer_id
9 AND o2.order_date = (
10 SELECT
11 MAX(o3.order_date)
12 FROM orders AS o3
13 WHERE o3.customer_id = o1.customer_id
14 AND o3.order_date < o1.order_date
15)
Self-Join
  • Multiple table scans
  • Correlated subqueries
  • Complex join conditions
  • Hard to maintain
Window Function
  • Single table scan
  • Declarative intent
  • Readable OVER clause
  • Easier to extend
1SELECT
2 order_id,
3 customer_id,
4 amount,
5 LAG(amount, 1) OVER (
6 PARTITION BY customer_id
7 ORDER BY order_date
8 ) AS prev_amount,
9 amount - LAG(amount, 1) OVER (
10 PARTITION BY customer_id
11 ORDER BY order_date
12 ) AS amount_change
13FROM orders
Result
order_idcustomer_idamountprev_amountamount_change
1001C10050.00
1002C10075.0050.0025.00
1003C10060.0075.00-15.00
1004C200120.00
1005C20095.00120.00-25.00

Notice how the first order for each customer has NULL for prev_amount because there is no prior row. The amount_change column shows the difference, reusing the same LAG expression. With a self-join, computing both the previous value and the difference would require even more complex join logic.

//

Self-Join vs AVG OVER

Another common self-join pattern compares each row to an aggregate of its group. For example, showing each employee's salary alongside their department average. The self-join version requires a subquery to compute the department average, then joins it back to the original table. The window function version computes the average inline.
1SELECT
2 e.employee_id,
3 e.department,
4 e.salary,
5 d.avg_salary,
6 e.salary - d.avg_salary AS diff
7FROM employee_metrics AS e
8INNER JOIN (
9 SELECT
10 department,
11 AVG(salary) AS avg_salary
12 FROM employee_metrics
13 GROUP BY department
14) AS d
15 ON e.department = d.department

Performance and Trade-offs

Understanding when window functions outperform self-joins, and when they fall short, helps you make the right choice for each situation.
//

Single Scan vs Double Scan

Self-joins require scanning the table twice and building a hash table, while window functions scan once and sort partitions locally. In practice, window functions often reduce query time by 40-60% on large datasets.

Performance Comparison
  • Self-join: Scans N rows twice (2N I/O), builds hash table of N entries, probes N times. Work scales as O(N) with large constant factors
  • Window function: Scans N rows once, sorts each partition (typically << N), walks sequentially. No hash table or join overhead
//

COUNT Comparison

Window functions also replace self-joins for counting related rows. Instead of joining a table to an aggregated version of itself, use COUNT with OVER:

1SELECT
2 order_id,
3 customer_id,
4 amount,
5 COUNT(*) OVER (
6 PARTITION BY customer_id
7 ) AS total_orders_by_customer,
8 SUM(amount) OVER (
9 PARTITION BY customer_id
10 ) AS total_spent_by_customer
11FROM orders
Result
order_idcustomer_idamounttotal_orders_by_customertotal_spent_by_customer
1001C10050.003185.00
1002C10075.003185.00
1003C10060.003185.00
1004C200120.002215.00
1005C20095.002215.00

Each row now includes the total number of orders and total amount spent by that customer, without any self-join. The old approach would require a GROUP BY subquery joined back to the original table. The window function approach is both simpler to write and faster to execute.

//

COUNT_IF vs FILTER

Standard SQL defines a FILTER clause for aggregate window functions, but it is not universally supported. A common alternative is COUNT_IF as a direct replacement. Where some references show COUNT(*) FILTER (WHERE condition), the equivalent is COUNT_IF(condition). This function evaluates the boolean condition for each row and counts only the rows where it is true. It works both as a regular aggregate and as a window function with an OVER clause.

1SELECT
2 department,
3 employee_id,
4 salary,
5 COUNT_IF(salary > 80000) OVER (
6 PARTITION BY department
7 ) AS high_earners_in_dept
8FROM employee_metrics
Result
departmentemployee_idsalaryhigh_earners_in_dept
EngineeringE001950002
EngineeringE002720002
EngineeringE003880002
SalesS001650001
SalesS002820001

COUNT_IF counts only rows where the condition is true. Combined with OVER, it counts within each partition. This is cleaner than wrapping a CASE expression inside SUM or AVG.

TIP
Some SQL references show COUNT(*) FILTER (WHERE ...) syntax. The equivalent here is COUNT_IF(...). The semantics are identical, but the syntax differs.
//

When Self-Joins Are Needed

Window functions cannot do everything. Use a self-join instead when:

Use Self-Joins When
  • Pairwise comparisons: Finding all pairs of rows matching complex criteria (e.g., employees hired within 7 days of each other)
  • Non-sequential matching: Rows need to be compared across unordered subsets, not just within ordered partitions
  • Complex aggregation logic: Joining a table to a specific aggregated view of itself with conditions beyond simple partition-level aggregation

Rule of thumb: If the comparison follows linear order within a partition, use a window function. If it requires comparing every row to every other row, use a self-join.

> Complete this query to split metrics into four equal tiers from highest to lowest value.

SELECT
  metric_name,
  metric_value,
  ___(___) OVER (
    ORDER BY metric_value DESC
  ) AS quartile
FROM employee_metrics
NTILE
RANK
3
4

NTILE is the clearest way to assign equal-sized quartiles because it adapts automatically to the actual number of rows, unlike a CASE statement with hardcoded salary thresholds that becomes outdated as the workforce changes.

Replacing a self-join with a window function does not just reduce the number of lines of SQL; it also removes the need for the database to build and probe a hash table, which can cut memory usage significantly on large datasets.
When profiling a slow query that contains a self-join used only to compare each row to an aggregate of its group, rewriting it as an aggregate window function is often the single most impactful optimization you can make.

Windows and Partitioning

Daily Life
Interviews

Optimize window queries for large datasets

There are two completely different meanings of "partitioning" in data engineering. PARTITION BY in a window function groups rows logically for the computation. Table partitioning determines how data is physically stored on disk. Understanding both, and how they interact, is critical for writing performant queries at scale.

Table partitioning divides a large table into smaller physical segments based on a column value, typically a date. A table partitioned by day stores each day of data in a separate directory or file. When a query filters on the partition column, the engine reads only the relevant partitions and skips the rest entirely. This is called partition pruning.

In a real-world data warehouse running Hive or Trino, a table like event_data might be partitioned by a date column called dt. The physical storage layout on HDFS or S3 looks like a directory tree: event_data/dt=2024-01-01/, event_data/dt=2024-01-02/, and so on. Each directory contains the Parquet or ORC files for that day. A query that filters WHERE dt = '2024-01-15' reads only the files in that single directory, potentially skipping terabytes of irrelevant data.

PARTITION BYTable partition
PARTITION BY
Logical scope
Groups rows for window
Table partition
Physical scope
Organizes files on disk

Partition Pruning

When your query filters and window partitions align with the physical table structure, performance improves dramatically.
//

Partition Pruning in Action

When your WHERE clause filters on the physical partition column, the engine skips entire directories of data. This is the single most important optimization in big data querying. A table with 365 daily partitions and a filter on one day reads roughly 1/365th of the data:

1SELECT
2 dt,
3 user_id,
4 event_type,
5 COUNT(*) OVER (
6 PARTITION BY dt, user_id
7 ) AS events_per_user_per_day
8FROM event_data
9WHERE dt = '2024-01-15'
Result
dtuser_idevent_typeevents_per_user_per_day
2024-01-15U001click3
2024-01-15U001view3
2024-01-15U001purchase3
2024-01-15U002click1

This query benefits from partition pruning because the WHERE clause filters on dt, which is the physical partition column. The engine reads only the 2024-01-15 directory. The window function then computes counts within each user for that day. No data from other days is ever loaded into memory.

//

When Partitions Align

When your window function PARTITION BY column matches the table partition column, the engine can process each physical partition independently. Each worker node handles its own partition without needing data from other nodes. This is the ideal scenario.

1SELECT
2 event_date,
3 user_id,
4 event_type,
5 ROW_NUMBER() OVER (
6 PARTITION BY event_date, user_id
7 ORDER BY event_time
8 ) AS event_sequence
9FROM event_data
10WHERE event_date BETWEEN DATE '2024-01-01'
11AND DATE '2024-01-31'
Result
event_dateuser_idevent_typeevent_sequence
2024-01-15U001click1
2024-01-15U001view2
2024-01-15U001purchase3
2024-01-15U002click1
2024-01-16U001click1

If the event_data table is physically partitioned by event_date, the WHERE clause triggers partition pruning (only January data is read). The window function then groups by event_date and user_id, which aligns with the physical layout. Each node processes its local data without shuffling rows across the network.

Understanding Data Skew

Uneven data distribution is one of the most common causes of slow window function queries. Recognizing and addressing skew is essential for production workloads.
//

Data Skew

Data skew occurs when one partition key value contains vastly more data than others. If you partition by customer_id and one customer generates 90% of all events, the worker assigned to that customer becomes a bottleneck while other workers sit idle. The query is only as fast as the slowest worker.
Skew is especially dangerous with window functions because the engine must sort all rows within a partition. A partition with 100 million rows requires sorting 100 million rows on a single worker, even if every other partition has only a few thousand rows. This causes out-of-memory errors and extremely long query times. Consider a real scenario: if customer C1 has 10 million events and all other customers have roughly 10 thousand events each, the worker handling C1 does 1000 times more work than any other worker. Your query effectively runs single-threaded on that one partition.
//

Diagnosing Skew

Before running a window function on a new dataset, always check the distribution of your partition key. A simple GROUP BY ... COUNT(*) query reveals whether one value dominates. Look at the ratio between the largest partition and the median partition. A ratio above 100:1 is a strong signal of problematic skew.

1SELECT
2 customer_id,
3 COUNT(*) AS event_count
4FROM orders
5GROUP BY customer_id
6ORDER BY event_count DESC
7LIMIT 10
Result
customer_idevent_count
C_MEGA10245800
C00215200
C00312400
C00411800
C0059600
C0068900
C0078100
C0087500
C0096800
C0106200
In this example, customer C_MEGA has over 10 million events while the next largest customer has only 15 thousand. That is a 674:1 ratio. Any window function partitioned by customer_id will bottleneck on C_MEGA. This is the kind of skew that requires mitigation through salting, which is covered in the real-world patterns section.
TIP
Consider using compound partition keys to reduce skew. Instead of PARTITION BY customer_id alone, try PARTITION BY customer_id, DATE_TRUNC('month', event_date). This splits the mega-customer into monthly chunks, distributing the work more evenly across workers.
Logical vs Physical Partitioning
Logical vs Physical Partitioning
PARTITION BY groups rows for computation; table partitions organize data on disk for I/O efficiency.
Partition Pruning
Partition Pruning
Filtering on a table partition column lets the engine skip irrelevant files entirely before any processing.
Skew Detection
Skew Detection
Always check value distribution before partitioning. One hot key can bottleneck the entire query.
Alignment Benefit
Alignment Benefit
When window PARTITION BY matches table partitioning, workers process local data without network shuffles.

> Complete this query to detect partition skew by counting events per customer.

SELECT
  ___,
  ___(*) AS event_count
FROM event_data
GROUP BY user_id
ORDER BY event_count DESC
LIMIT 10
SUM
COUNT
event_type
user_id

Before using PARTITION BY in production, always check for skew by counting rows per partition key to identify potential bottlenecks.

When a single partition key dominates, consider compound partition keys to split mega-partitions into smaller, more balanced chunks.

Aligning window PARTITION BY with table partitioning enables local processing without expensive network shuffles in distributed systems.

Cumulative Distribution

Daily Life
Interviews

Calculate percentiles across a dataset

Beyond ranking, window functions provide statistical distribution functions. CUME_DIST and PERCENT_RANK tell you where a value falls within its partition, expressed as a fraction between 0 and 1. These are essential for percentile analysis, grading curves, and performance benchmarking.

Distribution Functions

Two functions express a row's position as a decimal between 0 and 1. Understanding their subtle differences is key to choosing the right one.
//

CUME_DIST

CUME_DIST returns the fraction of rows with values less than or equal to the current row's value. The formula is: (number of rows with value <= current value) / (total rows in partition). The result ranges from just above 0 to exactly 1.0. Think of it as answering the question: "What percentage of the group is at or below this value?"

1SELECT
2 employee_id,
3 department,
4 salary,
5 CUME_DIST() OVER (
6 PARTITION BY department
7 ORDER BY salary
8 ) AS salary_percentile
9FROM employee_metrics
Result
employee_iddepartmentsalarysalary_percentile
E005Engineering700000.25
E003Engineering800000.50
E001Engineering950000.75
E004Engineering1100001.00
S002Sales550000.33
S001Sales650000.67
S003Sales820001.00

In the Engineering department with 4 employees, the lowest-paid employee has a CUME_DIST of 0.25 (1/4 of salaries are at or below theirs). The highest earner is always 1.0 because 100% of values are at or below the maximum.

//

PERCENT_RANK

PERCENT_RANK uses a different formula: (rank - 1) / (total rows - 1). This places the lowest value at exactly 0.0 and the highest at exactly 1.0, distributing other values evenly between them. The key difference from CUME_DIST is that the lowest value always starts at zero, whereas CUME_DIST gives the lowest value a positive fraction. Think of PERCENT_RANK as answering: "What fraction of the group is strictly below this value?"

1SELECT
2 employee_id,
3 department,
4 salary,
5 ROUND(
6 CUME_DIST() OVER (
7 PARTITION BY department
8 ORDER BY salary
9 ),
10 2
11 ) AS cume,
12 ROUND(
13 PERCENT_RANK() OVER (
14 PARTITION BY department
15 ORDER BY salary
16 ),
17 2
18 ) AS pct_rank
19FROM employee_metrics
Result
employee_iddepartmentsalarycumepct_rank
E005Engineering700000.250.00
E003Engineering800000.500.33
E001Engineering950000.750.67
E004Engineering1100001.001.00

Notice the difference: the lowest salary has CUME_DIST = 0.25 but PERCENT_RANK = 0.00. The highest has both at 1.00. Choose CUME_DIST when you want "what fraction of values are at or below this" and PERCENT_RANK when you want a 0-to-1 scale where lowest is 0 and highest is 1.

Distribution Patterns

Distribution functions power some of the most valuable analytics patterns: outlier detection, sessionization, and percentile-based tiering.
//

CUME_DIST Anomaly Detection

A practical application of CUME_DIST is identifying outliers. Values above the 99th percentile (CUME_DIST > 0.99) are statistical anomalies that often indicate fraud, system errors, or data quality issues. By computing CUME_DIST and filtering in an outer query, you can flag these rows for investigation.

1SELECT
2 transaction_id,
3 user_id,
4 amount,
5 ROUND(percentile, 4) AS percentile
6FROM (
7 SELECT
8 transaction_id,
9 user_id,
10 amount,
11 CUME_DIST() OVER (
12 ORDER BY amount
13 ) AS percentile
14 FROM transactions
15)
16WHERE percentile > 0.99
17ORDER BY amount DESC
Result
transaction_iduser_idamountpercentile
T9921U0449850.001.0000
T8812U1128200.000.9980
T7703U0077500.000.9960
T6654U2036900.000.9920
This query identifies the top 1% of transactions by amount. These are the outliers that warrant manual review. The same pattern works for detecting unusually long response times, abnormally high event counts, or any metric where extreme values signal problems.
//

Sessionization Pattern

Sessionization assigns a session identifier to groups of events that are close together in time. The standard approach detects gaps greater than a threshold (commonly 30 minutes) between consecutive events and increments a session counter at each gap. This requires two window function passes.

The first pass uses LAG to compute the time difference between each event and the one before it. The second pass uses a running SUM to count the number of gap boundaries encountered so far. Every time the gap exceeds the threshold, the SUM increments by one, creating a new session ID. Events within the threshold inherit the same session ID because the CASE expression evaluates to zero for them.

1SELECT
2 user_id,
3 event_time,
4 page_url,
5 session_id
6FROM (
7 SELECT
8 *,
9 SUM(
10 CASE
11 WHEN gap_min > 30
12 OR gap_min IS NULL THEN 1
13 ELSE 0
14 END
15 ) OVER (
16 PARTITION BY user_id
17 ORDER BY event_time
18 ROWS UNBOUNDED PRECEDING
19 ) AS session_id
20 FROM (
21 SELECT
22 user_id,
23 event_time,
24 page_url,
25 DATE_DIFF(
26 'minute',
27 LAG(event_time, 1) OVER (
28 PARTITION BY user_id
29 ORDER BY event_time
30 ),
31 event_time
32 ) AS gap_min
33 FROM clickstream
34 )
35)
Result
user_idevent_timepage_urlsession_id
U0012024-01-15 09:00:00/home1
U0012024-01-15 09:05:00/products1
U0012024-01-15 09:12:00/cart1
U0012024-01-15 14:00:00/home2
U0012024-01-15 14:03:00/search2

The inner query uses LAG to compute the gap in minutes between each event and the previous one. The outer query uses a running SUM to increment the session counter whenever a gap exceeds 30 minutes. The first event for each user starts session 1 because the gap is NULL (no previous event). The ROWS UNBOUNDED PRECEDING frame ensures the SUM accumulates from the very first row through the current row, giving a monotonically increasing session counter.

> Fill in the missing parts to calculate a rolling total of revenue using the current row and the two rows before it.

SELECT
  bill_date,
  amount,
  SUM(amount) OVER (
    ORDER BY bill_date
    ___ 2 PRECEDING AND ___
  ) AS rolling_sum
FROM cloud_costs
RANGE BETWEEN
UNBOUNDED FOLLOWING
ROWS BETWEEN
CURRENT ROW

A bounded frame like ROWS BETWEEN 2 PRECEDING AND CURRENT ROW keeps memory usage constant because the database only holds three rows at a time, regardless of how many total rows the partition contains.

The ROWS keyword counts physical rows by position, which makes rolling window calculations predictable even when the ordering column has duplicate values, unlike RANGE which groups duplicates into a single logical boundary.

Workarounds & Engine Limits

Daily Life
Interviews

Work around window function restrictions

Every SQL engine has limitations around window functions. Understanding these constraints and their workarounds prevents hours of debugging cryptic error messages. The most common issues involve where window functions can appear in a query, missing syntax features, and ordering guarantees.

Placement Restrictions

SQL's evaluation order determines where window functions can and cannot appear. These restrictions apply across all SQL engines.
//

Window Functions vs WHERE

Window functions cannot appear in a WHERE clause. This is an SQL standard rule, rooted in the SQL evaluation order. The engine processes a query in this sequence: FROM, WHERE, GROUP BY, HAVING, SELECT, window functions, ORDER BY, LIMIT. Because window functions are evaluated after WHERE, they do not exist yet when the WHERE clause runs. Referencing them in WHERE would create a circular dependency: the filter needs the window result, but the window needs the filtered rows.

The workaround is to wrap the window function in a subquery. The inner query computes the window function and exposes it as a column. The outer query then filters on that column, because by the time the outer WHERE runs, the inner query has already computed the window values.

1SELECT
2 user_id,
3 email,
4 created_at
5FROM (
6 SELECT
7 user_id,
8 email,
9 created_at,
10 ROW_NUMBER() OVER (
11 PARTITION BY user_id
12 ORDER BY created_at DESC
13 ) AS rn
14 FROM event_data
15)
16WHERE rn = 1
Result
user_idemailcreated_at
U001alice@example.com2024-03-15 14:30:00
U002bob@example.com2024-03-14 09:15:00
U003carol@example.com2024-03-15 11:00:00

The inner query computes ROW_NUMBER and exposes it as rn. The outer query filters on rn = 1. This "compute then filter" pattern is the standard way to use window functions in filtering logic. You will use this pattern constantly.

//

Window Functions vs HAVING

The same restriction applies to HAVING. Since HAVING is evaluated before window functions in the SQL processing order, you cannot reference a window function there. The workaround is identical: compute the window function in a subquery, then filter in the outer query. This often surprises people who expect HAVING to work like a post-aggregation filter that runs after everything else.

//

Repeating OVER Clauses

Some SQL references show a WINDOW clause that lets you name a window definition and reuse it across multiple functions. This syntax is not universally supported. When you need the same window specification in multiple functions, repeat the full OVER clause each time.

1SELECT
2 employee_id,
3 salary,
4 RANK() OVER (
5 PARTITION BY department
6 ORDER BY salary DESC
7 ) AS salary_rank,
8 CUME_DIST() OVER (
9 PARTITION BY department
10 ORDER BY salary DESC
11 ) AS salary_percentile,
12 LAG(salary, 1) OVER (
13 PARTITION BY department
14 ORDER BY salary DESC
15 ) AS next_higher_salary
16FROM employee_metrics

Advanced Window Techniques

Beyond the basic restrictions, several advanced techniques and ordering behaviors require careful attention.
//

ARRAY_AGG with OVER

ARRAY_AGG can be used as a window function to build running lists. Combined with a frame specification, it collects values up to the current row. This is powerful for building user journey paths, collecting recent event histories, and constructing ordered sequences without self-joins.

1SELECT
2 user_id,
3 event_time,
4 page_url,
5 ARRAY_AGG(page_url) OVER (
6 PARTITION BY user_id
7 ORDER BY event_time
8 ROWS UNBOUNDED PRECEDING
9 ) AS pages_visited_so_far
10FROM clickstream
Result
user_idevent_timepage_urlpages_visited_so_far
U00109:00:00/home[/home]
U00109:05:00/products[/home, /products]
U00109:12:00/cart[/home, /products, /cart]

Each row sees an array of all page URLs visited so far in that session. The ROWS UNBOUNDED PRECEDING frame ensures the array grows with each row. This is useful for building user journey paths without self-joins.

You can also use ARRAY_AGG with a bounded frame to collect only the last N items. For example, ROWS 2 PRECEDING would collect the current page and the two before it, creating a sliding window of recent pages. This is useful for detecting navigation patterns like "users who visit pricing then features then signup tend to convert."

//

Window Function Order

When a query contains multiple window functions, the engine may evaluate them in any order or even in parallel. You cannot assume that one window function's result is available to another window function in the same SELECT clause. If you need to chain window functions (for example, computing a rank based on a running total), you must use nested subqueries where each layer computes one window function and the next layer references its result.

//

ORDER BY in Subqueries

The optimizer may ignore ORDER BY in subqueries unless paired with LIMIT. The optimizer treats subquery ordering as meaningless because the outer query will reorder as needed. This matters when you expect rows in a specific order within a derived table. The problem is subtle: your query might return results in the expected order during testing (because the data happens to be stored in that order) but produce differently-ordered results in production when the data layout changes or parallelism kicks in.

TIP
Never rely on ORDER BY inside a subquery to guarantee row order in the outer query. Always place the final ORDER BY in the outermost query. Window function ordering within OVER is guaranteed for the window computation itself, but not for the output row order.
Another common source of unexpected NULLs in window function results involves frame clauses.

Common gotcha: NTH_VALUE requires an explicit frame clause. Without one, it may return NULL unexpectedly. Always specify ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING when using NTH_VALUE to ensure it can see the entire partition.

> Complete this query to filter results based on a window function using a subquery.

SELECT
  department,
  metric_name,
  metric_value
FROM (
  SELECT
    department,
    metric_name,
    metric_value,
    ___() OVER (
      PARTITION BY department
      ORDER BY metric_value DESC
    ) AS rn
  FROM employee_metrics
) AS sub
WHERE ___ = 1
ROW_NUMBER
RANK
rn
row_num

You cannot filter on a window function in the same SELECT clause because WHERE executes before window functions are computed.

The subquery pattern computes the window function first, then the outer query filters on the result, enabling top-N-per-group queries.

CTEs (WITH clauses) provide a cleaner alternative to nested subqueries when the query involves multiple window function layers.

Deduplication

Daily Life
Interviews

Remove duplicate rows from production data

Duplicate records are one of the most common data quality problems in production pipelines. Sources send the same event multiple times, ETL pipelines replay data, and upstream systems lack idempotency guarantees. Window functions provide the standard solution: the ROW_NUMBER deduplication pattern numbers duplicates by your chosen criteria and keeps only row number 1.

Data Quality Patterns

The most common use of window functions in production is cleaning and deduplicating data. These patterns form the foundation of reliable data pipelines.
//

Dedup with ROW_NUMBER

Duplicate records are one of the most common data quality issues. Sources send the same event multiple times, ETL pipelines replay data, and upstream systems lack idempotency guarantees. The ROW_NUMBER deduplication pattern is the standard solution: number the duplicates and keep only row number 1.

The key decisions in deduplication are: what defines a duplicate (the PARTITION BY columns) and which copy to keep (the ORDER BY column and direction). ORDER BY created_at DESC keeps the newest record. ORDER BY created_at ASC keeps the oldest. You can also use multiple tiebreaker columns: ORDER BY created_at DESC, updated_at DESC, id DESC ensures deterministic ordering even when timestamps tie.

1SELECT
2 user_id,
3 email,
4 created_at
5FROM (
6 SELECT
7 user_id,
8 email,
9 created_at,
10 ROW_NUMBER() OVER (
11 PARTITION BY user_id
12 ORDER BY created_at DESC
13 ) AS rn
14 FROM event_data
15)
16WHERE rn = 1
1SELECT
2 user_id,
3 email,
4 created_at
5FROM (
6 SELECT
7 user_id,
8 email,
9 created_at,
10 ROW_NUMBER() OVER (
11 PARTITION BY user_id
12 ORDER BY created_at DESC
13 ) AS rn
14 FROM event_data
15)
16WHERE rn = 1
Result
user_idemailcreated_at
U001alice_new@example.com2024-03-15 14:30:00
U002bob@example.com2024-03-14 09:15:00
U003carol_v2@example.com2024-03-15 11:00:00

PARTITION BY user_id groups duplicates together. ORDER BY created_at DESC ensures the most recent record gets row number 1. The outer query keeps only that record. This pattern works for any deduplication key and any tiebreaker column.

> Complete this query to remove duplicate user records, keeping only the most recent entry for each user.

SELECT
  user_id,
  event_type
FROM (
  SELECT
    user_id,
    event_type,
    ___() OVER (
      ___ user_id
      ___ event_timestamp ___
    ) AS rn
  FROM event_data
)
WHERE rn = 1
ROW_NUMBER
ORDER BY
PARTITION BY
RANK
ASC
DESC
//

Funnel Analysis

Funnel analysis tracks how users progress through a sequence of steps: signup, activation, first purchase, repeat purchase. The goal is to measure drop-off at each stage. Window functions enable ordered funnel tracking per user. This is one of the most requested analytics patterns in product companies, because it directly answers the question: "Where are we losing users?"
1SELECT
2 step,
3 users_at_step,
4 ROUND(
5 CAST(users_at_step AS DOUBLE) / FIRST_VALUE(
6 users_at_step
7 ) OVER (
8 ORDER BY step_order
9 ROWS UNBOUNDED PRECEDING
10 ),
11 2
12 ) AS conversion_rate
13FROM (
14 SELECT
15 step,
16 step_order,
17 COUNT(DISTINCT user_id) AS users_at_step
18 FROM funnel_events
19 GROUP BY step, step_order
20)
21ORDER BY step_order
Result
stepusers_at_stepconversion_rate
signup100001.00
activation65000.65
purchase21000.21

FIRST_VALUE grabs the signup count (the first step), and each subsequent step divides its user count by that baseline. The result shows that 65% of users who signed up also activated, and 21% eventually made a purchase. The 35% drop-off between signup and activation is where the product team should focus optimization efforts.

For more detailed funnel analysis, you can add step-over-step conversion rates using LAG. This shows not just the absolute conversion from the top of the funnel, but the conversion between each consecutive pair of steps. A 65% signup-to-activation rate and a 32% activation-to-purchase rate tells a more actionable story than absolute percentages alone.

//

Gap-and-Island Detection

Gap-and-island detection identifies contiguous groups of rows (islands) separated by gaps. This pattern appears when you need to find consecutive days of user activity, identify continuous periods of service uptime, or group sequential status changes. The technique uses the difference between ROW_NUMBER and the actual sequence value to create group identifiers.

1SELECT
2 user_id,
3 MIN(active_date) AS streak_start,
4 MAX(active_date) AS streak_end,
5 COUNT(*) AS streak_length
6FROM (
7 SELECT
8 user_id,
9 active_date,
10 DATE_DIFF(
11 'day',
12 CAST(
13 ROW_NUMBER() OVER (
14 PARTITION BY user_id
15 ORDER BY active_date
16 )
17 AS INTEGER
18 ) * INTERVAL '1' DAY,
19 active_date
20 ) AS island_id
21 FROM daily_logins
22)
23GROUP BY user_id, island_id
24ORDER BY streak_length DESC
Result
user_idstreak_startstreak_endstreak_length
U0012024-01-052024-01-128
U0012024-01-152024-01-184
U0022024-01-012024-01-033
U0022024-01-102024-01-112

The trick works because ROW_NUMBER increments by 1 for every row, and dates in a contiguous streak also increment by 1 day. Subtracting the row number from the date produces the same result for all rows in a contiguous streak, creating a natural group identifier. When there is a gap, the date jumps forward but ROW_NUMBER does not, producing a different group ID.

Which real-world pattern matches your analysis goal?
1SELECT
2 user_id,
3 event_time,
4 SUM(
5 CASE
6 WHEN gap_min > 30
7 OR gap_min IS NULL THEN 1
8 ELSE 0
9 END
10 ) OVER (
11 PARTITION BY user_id
12 ORDER BY event_time
13 ROWS UNBOUNDED PRECEDING
14 ) AS session_id
15FROM (
16 SELECT
17 user_id,
18 event_time,
19 DATE_DIFF(
20 'minute',
21 LAG(event_time, 1) OVER (
22 PARTITION BY user_id
23 ORDER BY event_time
24 ),
25 event_time
26 ) AS gap_min
27 FROM clickstream
28)

Performance at Scale

Production workloads require careful attention to memory usage and data distribution. These techniques keep window functions performant on large datasets.
//

Memory Pressure

Window functions require the engine to hold partition data in memory for sorting and computation. Wide windows, meaning frames that span many rows or entire partitions, consume proportionally more memory. A window function with ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING on a partition with 50 million rows forces the engine to buffer all 50 million rows simultaneously.

To reduce memory pressure, apply these strategies in order of impact: first, filter data before the window function runs (add WHERE clauses in the inner query to reduce partition sizes). Second, use bounded frames like ROWS 30 PRECEDING instead of UNBOUNDED when your logic allows it. Third, reduce the number of columns in the SELECT of the inner query, because the engine must buffer entire rows.

CURRENT ROWN PRECEDINGUNBND. PRECEDINGUNBND. BOTH
CURRENT ROW
Minimal
Point-in-time values
N PRECEDING
Bounded
Rolling window pattern
UNBND. PRECEDING
Grows w/ part.
Running total pattern
UNBND. BOTH
Full partition
Partition-wide compute
//

Skew Mitigation: Salting

When a single partition key dominates, salting breaks it into smaller pieces. The process has three steps. First, you create a salted key by appending a random or hash-derived suffix to the original partition key. This splits one massive partition into N smaller buckets. Second, you perform the aggregation or computation on the salted keys, which distributes the work across multiple workers. Third, you strip the salt and combine the partial results back into the final answer.
The number of salt buckets should be chosen based on your cluster size and the degree of skew. If one customer has 10 million rows and you have 100 workers, using 100 salt buckets gives each worker roughly 100 thousand rows for that customer, which is manageable. Using too few buckets does not solve the problem; using too many creates overhead from the extra aggregation step.
1SELECT
2 original_key,
3 SUM(subtotal) AS total_revenue
4FROM (
5 SELECT
6 customer_id AS original_key,
7 CONCAT(
8 customer_id,
9 '_',
10 CAST(
11 MOD(ABS(HASH(order_id)), 10)
12 AS VARCHAR
13 )
14 ) AS salted_key,
15 SUM(amount) AS subtotal
16 FROM orders
17 GROUP BY customer_id, CONCAT(
18 customer_id,
19 '_',
20 CAST(MOD(ABS(HASH(order_id)), 10) AS VARCHAR)
21 )
22)
23GROUP BY original_key
Result
original_keytotal_revenue
C_MEGA9500000.00
C00245000.00
C00332000.00
The inner query splits the mega-customer into 10 buckets by appending a hash-derived suffix (0 through 9). Each bucket is processed by a separate worker, and each worker computes a subtotal. The outer query combines the subtotals by the original key. This transforms a single-worker bottleneck into parallel processing across 10 workers, reducing wall-clock time by roughly 10x for the skewed partition.
Do
  • Use ROW_NUMBER subquery pattern for deduplication
  • Check partition key distribution before running window queries
  • Use bounded frames (ROWS N PRECEDING) when possible
  • Replace self-joins with LAG/LEAD for row comparisons
  • Place final ORDER BY in the outermost query
  • Filter data before window functions to reduce partition sizes
Don't
  • Put window functions in WHERE or HAVING clauses (use subquery)
  • Assume subquery ORDER BY survives to the outer query
  • Use UNBOUNDED frames on partitions with millions of rows
  • Ignore data skew in PARTITION BY columns
  • Expect named WINDOW syntax to work (it is not supported)
  • Reference one window function from another in the same SELECT
PUTTING IT ALL TOGETHER

> You are a senior analyst at BlackRock building a customer quartile report that groups wealth management clients by assets under management and flags statistically anomalous account balances for the compliance team.

Replacing self-joins with LAG() and AVG OVER cuts query runtime significantly because the engine scans the client table once instead of joining it to itself.
Aligning PARTITION BY with the physical table partition on account region avoids cross-node data shuffles and prevents skew from large institutional accounts bottlenecking workers.
CUME_DIST() computes each client's percentile position so the compliance team can immediately flag accounts above the 99th percentile as statistical outliers.
The ROW_NUMBER() deduplication pattern wrapped in a subquery ensures each client appears exactly once before quartile assignment, preventing inflated tier counts from duplicate records.
KEY TAKEAWAYS
Window functions replace most self-joins for row comparisons and group aggregates, running in a single pass
CUME_DIST and PERCENT_RANK provide percentile positioning for anomaly detection and benchmarking
Key patterns: deduplicate with ROW_NUMBER, sessionize with LAG + running SUM, detect streaks with gap-and-island
Align window PARTITION BY with physical table partitioning to avoid data shuffles; check for skew before querying
Use bounded frames and pre-filtering to manage memory; wrap window functions in subqueries to filter on their results

One pass to rule them all

Category
SQL
Difficulty
advanced
Duration
41 minutes
Challenges
4 hands-on challenges

Topics covered: Windows vs Self-Joins, Windows and Partitioning, Cumulative Distribution, Workarounds & Engine Limits, Deduplication

Lesson Sections

  1. Windows vs Self-Joins

    One of the most impactful things you can do with window functions is eliminate self-joins. A self-join is when a table is joined to itself, typically to compare each row with a related row in the same table. Self-joins are expensive: the engine must scan the table twice, build a join hash map, and match rows. Window functions accomplish the same comparison in a single pass over the data. Self-joins are extremely common in legacy codebases. Before window functions were widely supported, they were

  2. Windows and Partitioning

    Table partitioning divides a large table into smaller physical segments based on a column value, typically a date. A table partitioned by day stores each day of data in a separate directory or file. When a query filters on the partition column, the engine reads only the relevant partitions and skips the rest entirely. This is called partition pruning. Partition Pruning When your query filters and window partitions align with the physical table structure, performance improves dramatically. Partit

  3. Cumulative Distribution (concepts: sqlCumulativeDist)

    Distribution Functions Two functions express a row's position as a decimal between 0 and 1. Understanding their subtle differences is key to choosing the right one. CUME_DIST PERCENT_RANK Distribution Patterns Distribution functions power some of the most valuable analytics patterns: outlier detection, sessionization, and percentile-based tiering. CUME_DIST Anomaly Detection This query identifies the top 1% of transactions by amount. These are the outliers that warrant manual review. The same pa

  4. Workarounds & Engine Limits

    Every SQL engine has limitations around window functions. Understanding these constraints and their workarounds prevents hours of debugging cryptic error messages. The most common issues involve where window functions can appear in a query, missing syntax features, and ordering guarantees. Placement Restrictions SQL's evaluation order determines where window functions can and cannot appear. These restrictions apply across all SQL engines. Window Functions vs WHERE Window Functions vs HAVING Repe

  5. Deduplication (concepts: sqlWindowDedup)

    Data Quality Patterns The most common use of window functions in production is cleaning and deduplicating data. These patterns form the foundation of reliable data pipelines. Dedup with ROW_NUMBER Funnel Analysis Funnel analysis tracks how users progress through a sequence of steps: signup, activation, first purchase, repeat purchase. The goal is to measure drop-off at each stage. Window functions enable ordered funnel tracking per user. This is one of the most requested analytics patterns in pr