Window Functions: Advanced
Quantitative hedge funds like Jane Street and Citadel execute trades worth billions of dollars every day, and their analysts use advanced SQL window expressions to compute the signals that drive those decisions. NTILE and PERCENT_RANK rank security returns into percentile buckets, ROWS BETWEEN frame specifications compute rolling Sharpe ratios over trailing windows, and CUME_DIST identifies regime-change anomalies where a return falls in the extreme tail of its distribution. These are not academic exercises: a single well-designed window query can surface a mispriced asset before the market corrects, and the firms that build those queries fastest win. The advanced patterns in this lesson are what separate a working knowledge of SQL from the kind of analytical power that moves markets.
Windows vs Self-Joins
Eliminate costly self-joins with windows
Consider a common requirement: for each order, show the previous order amount for the same customer. The self-join approach requires joining the orders table to itself, matching on customer and finding the most recent prior order. The window function approach uses a single LAG call.
Rewriting Self-Joins
Self-Join vs LAG
- Multiple table scans
- Correlated subqueries
- Complex join conditions
- Hard to maintain
- Single table scan
- Declarative intent
- Readable OVER clause
- Easier to extend
Notice how the first order for each customer has NULL for prev_amount because there is no prior row. The amount_change column shows the difference, reusing the same LAG expression. With a self-join, computing both the previous value and the difference would require even more complex join logic.
Self-Join vs AVG OVER
Performance and Trade-offs
Single Scan vs Double Scan
Self-joins require scanning the table twice and building a hash table, while window functions scan once and sort partitions locally. In practice, window functions often reduce query time by 40-60% on large datasets.
- Self-join: Scans N rows twice (2N I/O), builds hash table of N entries, probes N times. Work scales as O(N) with large constant factors
- Window function: Scans N rows once, sorts each partition (typically << N), walks sequentially. No hash table or join overhead
COUNT Comparison
Window functions also replace self-joins for counting related rows. Instead of joining a table to an aggregated version of itself, use COUNT with OVER:
Each row now includes the total number of orders and total amount spent by that customer, without any self-join. The old approach would require a GROUP BY subquery joined back to the original table. The window function approach is both simpler to write and faster to execute.
COUNT_IF vs FILTER
Standard SQL defines a FILTER clause for aggregate window functions, but it is not universally supported. A common alternative is COUNT_IF as a direct replacement. Where some references show COUNT(*) FILTER (WHERE condition), the equivalent is COUNT_IF(condition). This function evaluates the boolean condition for each row and counts only the rows where it is true. It works both as a regular aggregate and as a window function with an OVER clause.
COUNT_IF counts only rows where the condition is true. Combined with OVER, it counts within each partition. This is cleaner than wrapping a CASE expression inside SUM or AVG.
COUNT(*) FILTER (WHERE ...) syntax. The equivalent here is COUNT_IF(...). The semantics are identical, but the syntax differs.When Self-Joins Are Needed
Window functions cannot do everything. Use a self-join instead when:
- Pairwise comparisons: Finding all pairs of rows matching complex criteria (e.g., employees hired within 7 days of each other)
- Non-sequential matching: Rows need to be compared across unordered subsets, not just within ordered partitions
- Complex aggregation logic: Joining a table to a specific aggregated view of itself with conditions beyond simple partition-level aggregation
Rule of thumb: If the comparison follows linear order within a partition, use a window function. If it requires comparing every row to every other row, use a self-join.
> Complete this query to split metrics into four equal tiers from highest to lowest value.
SELECT metric_name, metric_value, () OVER ( ORDER BY metric_value DESC ) AS quartile FROM employee_metrics
NTILE is the clearest way to assign equal-sized quartiles because it adapts automatically to the actual number of rows, unlike a CASE statement with hardcoded salary thresholds that becomes outdated as the workforce changes.
Windows and Partitioning
Optimize window queries for large datasets
There are two completely different meanings of "partitioning" in data engineering. PARTITION BY in a window function groups rows logically for the computation. Table partitioning determines how data is physically stored on disk. Understanding both, and how they interact, is critical for writing performant queries at scale.
In a real-world data warehouse running Hive or Trino, a table like event_data might be partitioned by a date column called dt. The physical storage layout on HDFS or S3 looks like a directory tree: event_data/dt=2024-01-01/, event_data/dt=2024-01-02/, and so on. Each directory contains the Parquet or ORC files for that day. A query that filters WHERE dt = '2024-01-15' reads only the files in that single directory, potentially skipping terabytes of irrelevant data.
Partition Pruning
Partition Pruning in Action
When your WHERE clause filters on the physical partition column, the engine skips entire directories of data. This is the single most important optimization in big data querying. A table with 365 daily partitions and a filter on one day reads roughly 1/365th of the data:
This query benefits from partition pruning because the WHERE clause filters on dt, which is the physical partition column. The engine reads only the 2024-01-15 directory. The window function then computes counts within each user for that day. No data from other days is ever loaded into memory.
When Partitions Align
When your window function PARTITION BY column matches the table partition column, the engine can process each physical partition independently. Each worker node handles its own partition without needing data from other nodes. This is the ideal scenario.
If the event_data table is physically partitioned by event_date, the WHERE clause triggers partition pruning (only January data is read). The window function then groups by event_date and user_id, which aligns with the physical layout. Each node processes its local data without shuffling rows across the network.
Understanding Data Skew
Data Skew
Diagnosing Skew
Before running a window function on a new dataset, always check the distribution of your partition key. A simple GROUP BY ... COUNT(*) query reveals whether one value dominates. Look at the ratio between the largest partition and the median partition. A ratio above 100:1 is a strong signal of problematic skew.
PARTITION BY customer_id alone, try PARTITION BY customer_id, DATE_TRUNC('month', event_date). This splits the mega-customer into monthly chunks, distributing the work more evenly across workers.> Complete this query to detect partition skew by counting events per customer.
SELECT , (*) AS event_count FROM event_data GROUP BY user_id ORDER BY event_count DESC LIMIT 10
Before using PARTITION BY in production, always check for skew by counting rows per partition key to identify potential bottlenecks.
When a single partition key dominates, consider compound partition keys to split mega-partitions into smaller, more balanced chunks.
Aligning window PARTITION BY with table partitioning enables local processing without expensive network shuffles in distributed systems.
Cumulative Distribution
Calculate percentiles across a dataset
Beyond ranking, window functions provide statistical distribution functions. CUME_DIST and PERCENT_RANK tell you where a value falls within its partition, expressed as a fraction between 0 and 1. These are essential for percentile analysis, grading curves, and performance benchmarking.
Distribution Functions
CUME_DIST
CUME_DIST returns the fraction of rows with values less than or equal to the current row's value. The formula is: (number of rows with value <= current value) / (total rows in partition). The result ranges from just above 0 to exactly 1.0. Think of it as answering the question: "What percentage of the group is at or below this value?"
In the Engineering department with 4 employees, the lowest-paid employee has a CUME_DIST of 0.25 (1/4 of salaries are at or below theirs). The highest earner is always 1.0 because 100% of values are at or below the maximum.
PERCENT_RANK
PERCENT_RANK uses a different formula: (rank - 1) / (total rows - 1). This places the lowest value at exactly 0.0 and the highest at exactly 1.0, distributing other values evenly between them. The key difference from CUME_DIST is that the lowest value always starts at zero, whereas CUME_DIST gives the lowest value a positive fraction. Think of PERCENT_RANK as answering: "What fraction of the group is strictly below this value?"
Notice the difference: the lowest salary has CUME_DIST = 0.25 but PERCENT_RANK = 0.00. The highest has both at 1.00. Choose CUME_DIST when you want "what fraction of values are at or below this" and PERCENT_RANK when you want a 0-to-1 scale where lowest is 0 and highest is 1.
Distribution Patterns
CUME_DIST Anomaly Detection
A practical application of CUME_DIST is identifying outliers. Values above the 99th percentile (CUME_DIST > 0.99) are statistical anomalies that often indicate fraud, system errors, or data quality issues. By computing CUME_DIST and filtering in an outer query, you can flag these rows for investigation.
Sessionization Pattern
The first pass uses LAG to compute the time difference between each event and the one before it. The second pass uses a running SUM to count the number of gap boundaries encountered so far. Every time the gap exceeds the threshold, the SUM increments by one, creating a new session ID. Events within the threshold inherit the same session ID because the CASE expression evaluates to zero for them.
The inner query uses LAG to compute the gap in minutes between each event and the previous one. The outer query uses a running SUM to increment the session counter whenever a gap exceeds 30 minutes. The first event for each user starts session 1 because the gap is NULL (no previous event). The ROWS UNBOUNDED PRECEDING frame ensures the SUM accumulates from the very first row through the current row, giving a monotonically increasing session counter.
> Fill in the missing parts to calculate a rolling total of revenue using the current row and the two rows before it.
SELECT bill_date, amount, SUM(amount) OVER ( ORDER BY bill_date 2 PRECEDING AND ) AS rolling_sum FROM cloud_costs
A bounded frame like ROWS BETWEEN 2 PRECEDING AND CURRENT ROW keeps memory usage constant because the database only holds three rows at a time, regardless of how many total rows the partition contains.
The ROWS keyword counts physical rows by position, which makes rolling window calculations predictable even when the ordering column has duplicate values, unlike RANGE which groups duplicates into a single logical boundary.
Workarounds & Engine Limits
Work around window function restrictions
Placement Restrictions
Window Functions vs WHERE
Window functions cannot appear in a WHERE clause. This is an SQL standard rule, rooted in the SQL evaluation order. The engine processes a query in this sequence: FROM, WHERE, GROUP BY, HAVING, SELECT, window functions, ORDER BY, LIMIT. Because window functions are evaluated after WHERE, they do not exist yet when the WHERE clause runs. Referencing them in WHERE would create a circular dependency: the filter needs the window result, but the window needs the filtered rows.
The workaround is to wrap the window function in a subquery. The inner query computes the window function and exposes it as a column. The outer query then filters on that column, because by the time the outer WHERE runs, the inner query has already computed the window values.
The inner query computes ROW_NUMBER and exposes it as rn. The outer query filters on rn = 1. This "compute then filter" pattern is the standard way to use window functions in filtering logic. You will use this pattern constantly.
Window Functions vs HAVING
The same restriction applies to HAVING. Since HAVING is evaluated before window functions in the SQL processing order, you cannot reference a window function there. The workaround is identical: compute the window function in a subquery, then filter in the outer query. This often surprises people who expect HAVING to work like a post-aggregation filter that runs after everything else.
Repeating OVER Clauses
Some SQL references show a WINDOW clause that lets you name a window definition and reuse it across multiple functions. This syntax is not universally supported. When you need the same window specification in multiple functions, repeat the full OVER clause each time.
Advanced Window Techniques
ARRAY_AGG with OVER
ARRAY_AGG can be used as a window function to build running lists. Combined with a frame specification, it collects values up to the current row. This is powerful for building user journey paths, collecting recent event histories, and constructing ordered sequences without self-joins.
Each row sees an array of all page URLs visited so far in that session. The ROWS UNBOUNDED PRECEDING frame ensures the array grows with each row. This is useful for building user journey paths without self-joins.
You can also use ARRAY_AGG with a bounded frame to collect only the last N items. For example, ROWS 2 PRECEDING would collect the current page and the two before it, creating a sliding window of recent pages. This is useful for detecting navigation patterns like "users who visit pricing then features then signup tend to convert."
Window Function Order
When a query contains multiple window functions, the engine may evaluate them in any order or even in parallel. You cannot assume that one window function's result is available to another window function in the same SELECT clause. If you need to chain window functions (for example, computing a rank based on a running total), you must use nested subqueries where each layer computes one window function and the next layer references its result.
ORDER BY in Subqueries
The optimizer may ignore ORDER BY in subqueries unless paired with LIMIT. The optimizer treats subquery ordering as meaningless because the outer query will reorder as needed. This matters when you expect rows in a specific order within a derived table. The problem is subtle: your query might return results in the expected order during testing (because the data happens to be stored in that order) but produce differently-ordered results in production when the data layout changes or parallelism kicks in.
ORDER BY inside a subquery to guarantee row order in the outer query. Always place the final ORDER BY in the outermost query. Window function ordering within OVER is guaranteed for the window computation itself, but not for the output row order.Common gotcha: NTH_VALUE requires an explicit frame clause. Without one, it may return NULL unexpectedly. Always specify ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING when using NTH_VALUE to ensure it can see the entire partition.
> Complete this query to filter results based on a window function using a subquery.
SELECT department, metric_name, metric_value FROM ( SELECT department, metric_name, metric_value, () OVER ( PARTITION BY department ORDER BY metric_value DESC ) AS rn FROM employee_metrics ) AS sub WHERE = 1
You cannot filter on a window function in the same SELECT clause because WHERE executes before window functions are computed.
The subquery pattern computes the window function first, then the outer query filters on the result, enabling top-N-per-group queries.
CTEs (WITH clauses) provide a cleaner alternative to nested subqueries when the query involves multiple window function layers.
Deduplication
Remove duplicate rows from production data
Duplicate records are one of the most common data quality problems in production pipelines. Sources send the same event multiple times, ETL pipelines replay data, and upstream systems lack idempotency guarantees. Window functions provide the standard solution: the ROW_NUMBER deduplication pattern numbers duplicates by your chosen criteria and keeps only row number 1.
Data Quality Patterns
Dedup with ROW_NUMBER
Duplicate records are one of the most common data quality issues. Sources send the same event multiple times, ETL pipelines replay data, and upstream systems lack idempotency guarantees. The ROW_NUMBER deduplication pattern is the standard solution: number the duplicates and keep only row number 1.
The key decisions in deduplication are: what defines a duplicate (the PARTITION BY columns) and which copy to keep (the ORDER BY column and direction). ORDER BY created_at DESC keeps the newest record. ORDER BY created_at ASC keeps the oldest. You can also use multiple tiebreaker columns: ORDER BY created_at DESC, updated_at DESC, id DESC ensures deterministic ordering even when timestamps tie.
PARTITION BY user_id groups duplicates together. ORDER BY created_at DESC ensures the most recent record gets row number 1. The outer query keeps only that record. This pattern works for any deduplication key and any tiebreaker column.
> Complete this query to remove duplicate user records, keeping only the most recent entry for each user.
SELECT user_id, event_type FROM ( SELECT user_id, event_type, () OVER ( user_id event_timestamp ) AS rn FROM event_data ) WHERE rn = 1
Funnel Analysis
FIRST_VALUE grabs the signup count (the first step), and each subsequent step divides its user count by that baseline. The result shows that 65% of users who signed up also activated, and 21% eventually made a purchase. The 35% drop-off between signup and activation is where the product team should focus optimization efforts.
For more detailed funnel analysis, you can add step-over-step conversion rates using LAG. This shows not just the absolute conversion from the top of the funnel, but the conversion between each consecutive pair of steps. A 65% signup-to-activation rate and a 32% activation-to-purchase rate tells a more actionable story than absolute percentages alone.
Gap-and-Island Detection
Gap-and-island detection identifies contiguous groups of rows (islands) separated by gaps. This pattern appears when you need to find consecutive days of user activity, identify continuous periods of service uptime, or group sequential status changes. The technique uses the difference between ROW_NUMBER and the actual sequence value to create group identifiers.
The trick works because ROW_NUMBER increments by 1 for every row, and dates in a contiguous streak also increment by 1 day. Subtracting the row number from the date produces the same result for all rows in a contiguous streak, creating a natural group identifier. When there is a gap, the date jumps forward but ROW_NUMBER does not, producing a different group ID.
Performance at Scale
Memory Pressure
Window functions require the engine to hold partition data in memory for sorting and computation. Wide windows, meaning frames that span many rows or entire partitions, consume proportionally more memory. A window function with ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING on a partition with 50 million rows forces the engine to buffer all 50 million rows simultaneously.
To reduce memory pressure, apply these strategies in order of impact: first, filter data before the window function runs (add WHERE clauses in the inner query to reduce partition sizes). Second, use bounded frames like ROWS 30 PRECEDING instead of UNBOUNDED when your logic allows it. Third, reduce the number of columns in the SELECT of the inner query, because the engine must buffer entire rows.
Skew Mitigation: Salting
- Use ROW_NUMBER subquery pattern for deduplication
- Check partition key distribution before running window queries
- Use bounded frames (ROWS N PRECEDING) when possible
- Replace self-joins with LAG/LEAD for row comparisons
- Place final ORDER BY in the outermost query
- Filter data before window functions to reduce partition sizes
- Put window functions in WHERE or HAVING clauses (use subquery)
- Assume subquery ORDER BY survives to the outer query
- Use UNBOUNDED frames on partitions with millions of rows
- Ignore data skew in PARTITION BY columns
- Expect named WINDOW syntax to work (it is not supported)
- Reference one window function from another in the same SELECT
> You are a senior analyst at BlackRock building a customer quartile report that groups wealth management clients by assets under management and flags statistically anomalous account balances for the compliance team.
LAG() and AVG OVER cuts query runtime significantly because the engine scans the client table once instead of joining it to itself.PARTITION BY with the physical table partition on account region avoids cross-node data shuffles and prevents skew from large institutional accounts bottlenecking workers.CUME_DIST() computes each client's percentile position so the compliance team can immediately flag accounts above the 99th percentile as statistical outliers.ROW_NUMBER() deduplication pattern wrapped in a subquery ensures each client appears exactly once before quartile assignment, preventing inflated tier counts from duplicate records.CUME_DIST and PERCENT_RANK provide percentile positioning for anomaly detection and benchmarkingROW_NUMBER, sessionize with LAG + running SUM, detect streaks with gap-and-islandPARTITION BY with physical table partitioning to avoid data shuffles; check for skew before queryingOne pass to rule them all
- Category
- SQL
- Difficulty
- advanced
- Duration
- 41 minutes
- Challenges
- 4 hands-on challenges
Topics covered: Windows vs Self-Joins, Windows and Partitioning, Cumulative Distribution, Workarounds & Engine Limits, Deduplication
Lesson Sections
- Windows vs Self-Joins
One of the most impactful things you can do with window functions is eliminate self-joins. A self-join is when a table is joined to itself, typically to compare each row with a related row in the same table. Self-joins are expensive: the engine must scan the table twice, build a join hash map, and match rows. Window functions accomplish the same comparison in a single pass over the data. Self-joins are extremely common in legacy codebases. Before window functions were widely supported, they were
- Windows and Partitioning
Table partitioning divides a large table into smaller physical segments based on a column value, typically a date. A table partitioned by day stores each day of data in a separate directory or file. When a query filters on the partition column, the engine reads only the relevant partitions and skips the rest entirely. This is called partition pruning. Partition Pruning When your query filters and window partitions align with the physical table structure, performance improves dramatically. Partit
- Cumulative Distribution (concepts: sqlCumulativeDist)
Distribution Functions Two functions express a row's position as a decimal between 0 and 1. Understanding their subtle differences is key to choosing the right one. CUME_DIST PERCENT_RANK Distribution Patterns Distribution functions power some of the most valuable analytics patterns: outlier detection, sessionization, and percentile-based tiering. CUME_DIST Anomaly Detection This query identifies the top 1% of transactions by amount. These are the outliers that warrant manual review. The same pa
- Workarounds & Engine Limits
Every SQL engine has limitations around window functions. Understanding these constraints and their workarounds prevents hours of debugging cryptic error messages. The most common issues involve where window functions can appear in a query, missing syntax features, and ordering guarantees. Placement Restrictions SQL's evaluation order determines where window functions can and cannot appear. These restrictions apply across all SQL engines. Window Functions vs WHERE Window Functions vs HAVING Repe
- Deduplication (concepts: sqlWindowDedup)
Data Quality Patterns The most common use of window functions in production is cleaning and deduplicating data. These patterns form the foundation of reliable data pipelines. Dedup with ROW_NUMBER Funnel Analysis Funnel analysis tracks how users progress through a sequence of steps: signup, activation, first purchase, repeat purchase. The goal is to measure drop-off at each stage. Window functions enable ordered funnel tracking per user. This is one of the most requested analytics patterns in pr