Window Functions: Advanced

Quantitative hedge funds like Jane Street and Citadel execute trades worth billions of dollars every day, and their analysts use advanced SQL window expressions to compute the signals that drive those decisions. NTILE and PERCENT_RANK rank security returns into percentile buckets, ROWS BETWEEN frame specifications compute rolling Sharpe ratios over trailing windows, and CUME_DIST identifies regime-change anomalies where a return falls in the extreme tail of its distribution. These are not academic exercises: a single well-designed window query can surface a mispriced asset before the market corrects, and the firms that build those queries fastest win. The advanced patterns in this lesson are what separate a working knowledge of SQL from the kind of analytical power that moves markets.

Windows vs Self-Joins

Daily Life

Interviews

Eliminate costly self-joins with windows

One of the most impactful things you can do with window functions is eliminate self-joins. A self-join is when a table is joined to itself, typically to compare each row with a related row in the same table. Self-joins are expensive: the engine must scan the table twice, build a join hash map, and match rows. Window functions accomplish the same comparison in a single pass over the data.

Self-joins are extremely common in legacy codebases. Before window functions were widely supported, they were the only way to compare a row with its neighbors. You will encounter them frequently when working with older data warehouses or reading code written by analysts who learned SQL before window functions became standard. Recognizing these patterns and knowing how to rewrite them is a high-value skill.

Consider a common requirement: for each order, show the previous order amount for the same customer. The self-join approach requires joining the orders table to itself, matching on customer and finding the most recent prior order. The window function approach uses a single LAG call.

Rewriting Self-Joins

Let us examine the most common self-join patterns and their window function equivalents side by side.

Self-Join vs LAG

	SELECT
	o1.order_id,
	o1.customer_id,
	o1.amount,
	o2.amount AS prev_amount
	FROM orders AS o1
	LEFT JOIN orders AS o2
	ON o1.customer_id = o2.customer_id
	AND o2.order_date = (
	SELECT
	MAX(o3.order_date)
	FROM orders AS o3
	WHERE o3.customer_id = o1.customer_id
	AND o3.order_date < o1.order_date
	)

•Self-Join

Multiple table scans
Correlated subqueries
Complex join conditions
Hard to maintain

•Window Function

Single table scan
Declarative intent
Readable OVER clause
Easier to extend

	SELECT
	order_id,
	customer_id,
	amount,
	LAG(amount, 1) OVER (
	PARTITION BY customer_id
	ORDER BY order_date
	) AS prev_amount,
	amount - LAG(amount, 1) OVER (
	PARTITION BY customer_id
	ORDER BY order_date
	) AS amount_change
	FROM orders

Result

order_id	customer_id	amount	prev_amount	amount_change
1001	C100	50.00
1002	C100	75.00	50.00	25.00
1003	C100	60.00	75.00	-15.00
1004	C200	120.00
1005	C200	95.00	120.00	-25.00

Notice how the first order for each customer has NULL for prev_amount because there is no prior row. The amount_change column shows the difference, reusing the same LAG expression. With a self-join, computing both the previous value and the difference would require even more complex join logic.

Self-Join vs AVG OVER

Another common self-join pattern compares each row to an aggregate of its group. For example, showing each employee's salary alongside their department average. The self-join version requires a subquery to compute the department average, then joins it back to the original table. The window function version computes the average inline.

	SELECT
	e.employee_id,
	e.department,
	e.salary,
	d.avg_salary,
	e.salary - d.avg_salary AS diff
	FROM employee_metrics AS e
	INNER JOIN (
	SELECT
	department,
	AVG(salary) AS avg_salary
	FROM employee_metrics
	GROUP BY department
	) AS d
	ON e.department = d.department

Performance and Trade-offs

Understanding when window functions outperform self-joins, and when they fall short, helps you make the right choice for each situation.

Single Scan vs Double Scan

Self-joins require scanning the table twice and building a hash table, while window functions scan once and sort partitions locally. In practice, window functions often reduce query time by 40-60% on large datasets.

Performance Comparison

Self-join: Scans N rows twice (2N I/O), builds hash table of N entries, probes N times. Work scales as O(N) with large constant factors
Window function: Scans N rows once, sorts each partition (typically << N), walks sequentially. No hash table or join overhead

COUNT Comparison

Window functions also replace self-joins for counting related rows. Instead of joining a table to an aggregated version of itself, use COUNT with OVER:

	SELECT
	order_id,
	customer_id,
	amount,
	COUNT(*) OVER (
	PARTITION BY customer_id
	) AS total_orders_by_customer,
	SUM(amount) OVER (
	PARTITION BY customer_id
	) AS total_spent_by_customer
	FROM orders

Result

order_id	customer_id	amount	total_orders_by_customer	total_spent_by_customer
1001	C100	50.00	3	185.00
1002	C100	75.00	3	185.00
1003	C100	60.00	3	185.00
1004	C200	120.00	2	215.00
1005	C200	95.00	2	215.00

Each row now includes the total number of orders and total amount spent by that customer, without any self-join. The old approach would require a GROUP BY subquery joined back to the original table. The window function approach is both simpler to write and faster to execute.

COUNT_IF vs FILTER

Standard SQL defines a FILTER clause for aggregate window functions, but it is not universally supported. A common alternative is COUNT_IF as a direct replacement. Where some references show COUNT(*) FILTER (WHERE condition), the equivalent is COUNT_IF(condition). This function evaluates the boolean condition for each row and counts only the rows where it is true. It works both as a regular aggregate and as a window function with an OVER clause.

	SELECT
	department,
	employee_id,
	salary,
	COUNT_IF(salary > 80000) OVER (
	PARTITION BY department
	) AS high_earners_in_dept
	FROM employee_metrics

Result

department	employee_id	salary	high_earners_in_dept
Engineering	E001	95000	2
Engineering	E002	72000	2
Engineering	E003	88000	2
Sales	S001	65000	1
Sales	S002	82000	1

COUNT_IF counts only rows where the condition is true. Combined with OVER, it counts within each partition. This is cleaner than wrapping a CASE expression inside SUM or AVG.

TIP

Some SQL references show COUNT(*) FILTER (WHERE ...) syntax. The equivalent here is COUNT_IF(...). The semantics are identical, but the syntax differs.

When Self-Joins Are Needed

Window functions cannot do everything. Use a self-join instead when:

Use Self-Joins When

Pairwise comparisons: Finding all pairs of rows matching complex criteria (e.g., employees hired within 7 days of each other)
Non-sequential matching: Rows need to be compared across unordered subsets, not just within ordered partitions
Complex aggregation logic: Joining a table to a specific aggregated view of itself with conditions beyond simple partition-level aggregation

Rule of thumb: If the comparison follows linear order within a partition, use a window function. If it requires comparing every row to every other row, use a self-join.

> Complete this query to split metrics into four equal tiers from highest to lowest value.

SELECT
  metric_name,
  metric_value,
  ___(___) OVER (
    ORDER BY metric_value DESC
  ) AS quartile
FROM employee_metrics

NTILE

RANK

NTILE is the clearest way to assign equal-sized quartiles because it adapts automatically to the actual number of rows, unlike a CASE statement with hardcoded salary thresholds that becomes outdated as the workforce changes.

Replacing a self-join with a window function does not just reduce the number of lines of SQL; it also removes the need for the database to build and probe a hash table, which can cut memory usage significantly on large datasets.

When profiling a slow query that contains a self-join used only to compare each row to an aggregate of its group, rewriting it as an aggregate window function is often the single most impactful optimization you can make.

Windows and Partitioning

Daily Life

Interviews

Optimize window queries for large datasets

There are two completely different meanings of "partitioning" in data engineering. PARTITION BY in a window function groups rows logically for the computation. Table partitioning determines how data is physically stored on disk. Understanding both, and how they interact, is critical for writing performant queries at scale.

Table partitioning divides a large table into smaller physical segments based on a column value, typically a date. A table partitioned by day stores each day of data in a separate directory or file. When a query filters on the partition column, the engine reads only the relevant partitions and skips the rest entirely. This is called partition pruning.

In a real-world data warehouse running Hive or Trino, a table like event_data might be partitioned by a date column called dt. The physical storage layout on HDFS or S3 looks like a directory tree: event_data/dt=2024-01-01/, event_data/dt=2024-01-02/, and so on. Each directory contains the Parquet or ORC files for that day. A query that filters WHERE dt = '2024-01-15' reads only the files in that single directory, potentially skipping terabytes of irrelevant data.

PARTITION BYTable partition

PARTITION BY

Logical scope

Groups rows for window

Table partition

Physical scope

Organizes files on disk

Partition Pruning

When your query filters and window partitions align with the physical table structure, performance improves dramatically.

Partition Pruning in Action

When your WHERE clause filters on the physical partition column, the engine skips entire directories of data. This is the single most important optimization in big data querying. A table with 365 daily partitions and a filter on one day reads roughly 1/365th of the data:

	SELECT
	dt,
	user_id,
	event_type,
	COUNT(*) OVER (
	PARTITION BY dt, user_id
	) AS events_per_user_per_day
	FROM event_data
	WHERE dt = '2024-01-15'

Result

dt	user_id	event_type	events_per_user_per_day
2024-01-15	U001	click	3
2024-01-15	U001	view	3
2024-01-15	U001	purchase	3
2024-01-15	U002	click	1

This query benefits from partition pruning because the WHERE clause filters on dt, which is the physical partition column. The engine reads only the 2024-01-15 directory. The window function then computes counts within each user for that day. No data from other days is ever loaded into memory.

When Partitions Align

When your window function PARTITION BY column matches the table partition column, the engine can process each physical partition independently. Each worker node handles its own partition without needing data from other nodes. This is the ideal scenario.

	SELECT
	event_date,
	user_id,
	event_type,
	ROW_NUMBER() OVER (
	PARTITION BY event_date, user_id
	ORDER BY event_time
	) AS event_sequence
	FROM event_data
	WHERE event_date BETWEEN DATE '2024-01-01'
	AND DATE '2024-01-31'

Result

event_date	user_id	event_type	event_sequence
2024-01-15	U001	click	1
2024-01-15	U001	view	2
2024-01-15	U001	purchase	3
2024-01-15	U002	click	1
2024-01-16	U001	click	1

If the event_data table is physically partitioned by event_date, the WHERE clause triggers partition pruning (only January data is read). The window function then groups by event_date and user_id, which aligns with the physical layout. Each node processes its local data without shuffling rows across the network.

Understanding Data Skew

Uneven data distribution is one of the most common causes of slow window function queries. Recognizing and addressing skew is essential for production workloads.

Data Skew

Data skew occurs when one partition key value contains vastly more data than others. If you partition by customer_id and one customer generates 90% of all events, the worker assigned to that customer becomes a bottleneck while other workers sit idle. The query is only as fast as the slowest worker.

Skew is especially dangerous with window functions because the engine must sort all rows within a partition. A partition with 100 million rows requires sorting 100 million rows on a single worker, even if every other partition has only a few thousand rows. This causes out-of-memory errors and extremely long query times. Consider a real scenario: if customer C1 has 10 million events and all other customers have roughly 10 thousand events each, the worker handling C1 does 1000 times more work than any other worker. Your query effectively runs single-threaded on that one partition.

Diagnosing Skew

Before running a window function on a new dataset, always check the distribution of your partition key. A simple GROUP BY ... COUNT(*) query reveals whether one value dominates. Look at the ratio between the largest partition and the median partition. A ratio above 100:1 is a strong signal of problematic skew.

	SELECT
	customer_id,
	COUNT(*) AS event_count
	FROM orders
	GROUP BY customer_id
	ORDER BY event_count DESC
	LIMIT 10

Result

customer_id	event_count
C_MEGA	10245800
C002	15200
C003	12400
C004	11800
C005	9600
C006	8900
C007	8100
C008	7500
C009	6800
C010	6200

In this example, customer C_MEGA has over 10 million events while the next largest customer has only 15 thousand. That is a 674:1 ratio. Any window function partitioned by customer_id will bottleneck on C_MEGA. This is the kind of skew that requires mitigation through salting, which is covered in the real-world patterns section.

TIP

Consider using compound partition keys to reduce skew. Instead of PARTITION BY customer_id alone, try PARTITION BY customer_id, DATE_TRUNC('month', event_date). This splits the mega-customer into monthly chunks, distributing the work more evenly across workers.

Logical vs Physical Partitioning

PARTITION BY groups rows for computation; table partitions organize data on disk for I/O efficiency.

Partition Pruning

Filtering on a table partition column lets the engine skip irrelevant files entirely before any processing.

Skew Detection

Always check value distribution before partitioning. One hot key can bottleneck the entire query.

Alignment Benefit

When window PARTITION BY matches table partitioning, workers process local data without network shuffles.

> Complete this query to detect partition skew by counting events per customer.

SELECT
  ___,
  ___(*) AS event_count
FROM event_data
GROUP BY user_id
ORDER BY event_count DESC
LIMIT 10

COUNT

user_id

SUM

event_type

Before using PARTITION BY in production, always check for skew by counting rows per partition key to identify potential bottlenecks.

When a single partition key dominates, consider compound partition keys to split mega-partitions into smaller, more balanced chunks.

Aligning window PARTITION BY with table partitioning enables local processing without expensive network shuffles in distributed systems.

Cumulative Distribution

Daily Life

Interviews

Calculate percentiles across a dataset

Beyond ranking, window functions provide statistical distribution functions. CUME_DIST and PERCENT_RANK tell you where a value falls within its partition, expressed as a fraction between 0 and 1. These are essential for percentile analysis, grading curves, and performance benchmarking.

Distribution Functions

Two functions express a row's position as a decimal between 0 and 1. Understanding their subtle differences is key to choosing the right one.

CUME_DIST

CUME_DIST returns the fraction of rows with values less than or equal to the current row's value. The formula is: (number of rows with value <= current value) / (total rows in partition). The result ranges from just above 0 to exactly 1.0. Think of it as answering the question: "What percentage of the group is at or below this value?"

	SELECT
	employee_id,
	department,
	salary,
	CUME_DIST() OVER (
	PARTITION BY department
	ORDER BY salary
	) AS salary_percentile
	FROM employee_metrics

Result

employee_id	department	salary	salary_percentile
E005	Engineering	70000	0.25
E003	Engineering	80000	0.50
E001	Engineering	95000	0.75
E004	Engineering	110000	1.00
S002	Sales	55000	0.33
S001	Sales	65000	0.67
S003	Sales	82000	1.00

In the Engineering department with 4 employees, the lowest-paid employee has a CUME_DIST of 0.25 (1/4 of salaries are at or below theirs). The highest earner is always 1.0 because 100% of values are at or below the maximum.

PERCENT_RANK

PERCENT_RANK uses a different formula: (rank - 1) / (total rows - 1). This places the lowest value at exactly 0.0 and the highest at exactly 1.0, distributing other values evenly between them. The key difference from CUME_DIST is that the lowest value always starts at zero, whereas CUME_DIST gives the lowest value a positive fraction. Think of PERCENT_RANK as answering: "What fraction of the group is strictly below this value?"

	SELECT
	employee_id,
	department,
	salary,
	ROUND(
	CUME_DIST() OVER (
	PARTITION BY department
	ORDER BY salary
	),
	2
	) AS cume,
	ROUND(
	PERCENT_RANK() OVER (
	PARTITION BY department
	ORDER BY salary
	),
	2
	) AS pct_rank
	FROM employee_metrics

Result

employee_id	department	salary	cume	pct_rank
E005	Engineering	70000	0.25	0.00
E003	Engineering	80000	0.50	0.33
E001	Engineering	95000	0.75	0.67
E004	Engineering	110000	1.00	1.00

Notice the difference: the lowest salary has CUME_DIST = 0.25 but PERCENT_RANK = 0.00. The highest has both at 1.00. Choose CUME_DIST when you want "what fraction of values are at or below this" and PERCENT_RANK when you want a 0-to-1 scale where lowest is 0 and highest is 1.

Distribution Patterns

Distribution functions power some of the most valuable analytics patterns: outlier detection, sessionization, and percentile-based tiering.

CUME_DIST Anomaly Detection

A practical application of CUME_DIST is identifying outliers. Values above the 99th percentile (CUME_DIST > 0.99) are statistical anomalies that often indicate fraud, system errors, or data quality issues. By computing CUME_DIST and filtering in an outer query, you can flag these rows for investigation.

	SELECT
	transaction_id,
	user_id,
	amount,
	ROUND(percentile, 4) AS percentile
	FROM (
	SELECT
	transaction_id,
	user_id,
	amount,
	CUME_DIST() OVER (
	ORDER BY amount
	) AS percentile
	FROM transactions
	)
	WHERE percentile > 0.99
	ORDER BY amount DESC

Result

transaction_id	user_id	amount	percentile
T9921	U044	9850.00	1.0000
T8812	U112	8200.00	0.9980
T7703	U007	7500.00	0.9960
T6654	U203	6900.00	0.9920

This query identifies the top 1% of transactions by amount. These are the outliers that warrant manual review. The same pattern works for detecting unusually long response times, abnormally high event counts, or any metric where extreme values signal problems.

Sessionization Pattern

Sessionization assigns a session identifier to groups of events that are close together in time. The standard approach detects gaps greater than a threshold (commonly 30 minutes) between consecutive events and increments a session counter at each gap. This requires two window function passes.

The first pass uses LAG to compute the time difference between each event and the one before it. The second pass uses a running SUM to count the number of gap boundaries encountered so far. Every time the gap exceeds the threshold, the SUM increments by one, creating a new session ID. Events within the threshold inherit the same session ID because the CASE expression evaluates to zero for them.

	SELECT
	user_id,
	event_time,
	page_url,
	session_id
	FROM (
	SELECT
	*,
	SUM(
	CASE
	WHEN gap_min > 30
	OR gap_min IS NULL THEN 1
	ELSE 0
	END
	) OVER (
	PARTITION BY user_id
	ORDER BY event_time
	ROWS UNBOUNDED PRECEDING
	) AS session_id
	FROM (
	SELECT
	user_id,
	event_time,
	page_url,
	DATE_DIFF(
	'minute',
	LAG(event_time, 1) OVER (
	PARTITION BY user_id
	ORDER BY event_time
	),
	event_time
	) AS gap_min
	FROM clickstream
	)
	)

Result

user_id	event_time	page_url	session_id
U001	2024-01-15 09:00:00	/home	1
U001	2024-01-15 09:05:00	/products	1
U001	2024-01-15 09:12:00	/cart	1
U001	2024-01-15 14:00:00	/home	2
U001	2024-01-15 14:03:00	/search	2

The inner query uses LAG to compute the gap in minutes between each event and the previous one. The outer query uses a running SUM to increment the session counter whenever a gap exceeds 30 minutes. The first event for each user starts session 1 because the gap is NULL (no previous event). The ROWS UNBOUNDED PRECEDING frame ensures the SUM accumulates from the very first row through the current row, giving a monotonically increasing session counter.

> Fill in the missing parts to calculate a rolling total of revenue using the current row and the two rows before it.

SELECT
  bill_date,
  amount,
  SUM(amount) OVER (
    ORDER BY bill_date
    ___ 2 PRECEDING AND ___
  ) AS rolling_sum
FROM cloud_costs

ROWS BETWEEN

CURRENT ROW

RANGE BETWEEN

UNBOUNDED FOLLOWING

A bounded frame like ROWS BETWEEN 2 PRECEDING AND CURRENT ROW keeps memory usage constant because the database only holds three rows at a time, regardless of how many total rows the partition contains.

The ROWS keyword counts physical rows by position, which makes rolling window calculations predictable even when the ordering column has duplicate values, unlike RANGE which groups duplicates into a single logical boundary.

Workarounds & Engine Limits

Daily Life

Interviews

Work around window function restrictions

Every SQL engine has limitations around window functions. Understanding these constraints and their workarounds prevents hours of debugging cryptic error messages. The most common issues involve where window functions can appear in a query, missing syntax features, and ordering guarantees.

Placement Restrictions

SQL's evaluation order determines where window functions can and cannot appear. These restrictions apply across all SQL engines.

Window Functions vs WHERE

Window functions cannot appear in a WHERE clause. This is an SQL standard rule, rooted in the SQL evaluation order. The engine processes a query in this sequence: FROM, WHERE, GROUP BY, HAVING, SELECT, window functions, ORDER BY, LIMIT. Because window functions are evaluated after WHERE, they do not exist yet when the WHERE clause runs. Referencing them in WHERE would create a circular dependency: the filter needs the window result, but the window needs the filtered rows.

The workaround is to wrap the window function in a subquery. The inner query computes the window function and exposes it as a column. The outer query then filters on that column, because by the time the outer WHERE runs, the inner query has already computed the window values.

	SELECT
	user_id,
	email,
	created_at
	FROM (
	SELECT
	user_id,
	email,
	created_at,
	ROW_NUMBER() OVER (
	PARTITION BY user_id
	ORDER BY created_at DESC
	) AS rn
	FROM event_data
	)
	WHERE rn = 1

Result

user_id	email	created_at
U001	alice@example.com	2024-03-15 14:30:00
U002	bob@example.com	2024-03-14 09:15:00
U003	carol@example.com	2024-03-15 11:00:00

The inner query computes ROW_NUMBER and exposes it as rn. The outer query filters on rn = 1. This "compute then filter" pattern is the standard way to use window functions in filtering logic. You will use this pattern constantly.

Window Functions vs HAVING

The same restriction applies to HAVING. Since HAVING is evaluated before window functions in the SQL processing order, you cannot reference a window function there. The workaround is identical: compute the window function in a subquery, then filter in the outer query. This often surprises people who expect HAVING to work like a post-aggregation filter that runs after everything else.

Repeating OVER Clauses

Some SQL references show a WINDOW clause that lets you name a window definition and reuse it across multiple functions. This syntax is not universally supported. When you need the same window specification in multiple functions, repeat the full OVER clause each time.

	SELECT
	employee_id,
	salary,
	RANK() OVER (
	PARTITION BY department
	ORDER BY salary DESC
	) AS salary_rank,
	CUME_DIST() OVER (
	PARTITION BY department
	ORDER BY salary DESC
	) AS salary_percentile,
	LAG(salary, 1) OVER (
	PARTITION BY department
	ORDER BY salary DESC
	) AS next_higher_salary
	FROM employee_metrics

Advanced Window Techniques

Beyond the basic restrictions, several advanced techniques and ordering behaviors require careful attention.

ARRAY_AGG with OVER

ARRAY_AGG can be used as a window function to build running lists. Combined with a frame specification, it collects values up to the current row. This is powerful for building user journey paths, collecting recent event histories, and constructing ordered sequences without self-joins.

	SELECT
	user_id,
	event_time,
	page_url,
	ARRAY_AGG(page_url) OVER (
	PARTITION BY user_id
	ORDER BY event_time
	ROWS UNBOUNDED PRECEDING
	) AS pages_visited_so_far
	FROM clickstream

Result

user_id	event_time	page_url	pages_visited_so_far
U001	09:00:00	/home	[/home]
U001	09:05:00	/products	[/home, /products]
U001	09:12:00	/cart	[/home, /products, /cart]

Each row sees an array of all page URLs visited so far in that session. The ROWS UNBOUNDED PRECEDING frame ensures the array grows with each row. This is useful for building user journey paths without self-joins.

You can also use ARRAY_AGG with a bounded frame to collect only the last N items. For example, ROWS 2 PRECEDING would collect the current page and the two before it, creating a sliding window of recent pages. This is useful for detecting navigation patterns like "users who visit pricing then features then signup tend to convert."

Window Function Order

When a query contains multiple window functions, the engine may evaluate them in any order or even in parallel. You cannot assume that one window function's result is available to another window function in the same SELECT clause. If you need to chain window functions (for example, computing a rank based on a running total), you must use nested subqueries where each layer computes one window function and the next layer references its result.

ORDER BY in Subqueries

The optimizer may ignore ORDER BY in subqueries unless paired with LIMIT. The optimizer treats subquery ordering as meaningless because the outer query will reorder as needed. This matters when you expect rows in a specific order within a derived table. The problem is subtle: your query might return results in the expected order during testing (because the data happens to be stored in that order) but produce differently-ordered results in production when the data layout changes or parallelism kicks in.

TIP

Never rely on ORDER BY inside a subquery to guarantee row order in the outer query. Always place the final ORDER BY in the outermost query. Window function ordering within OVER is guaranteed for the window computation itself, but not for the output row order.

Another common source of unexpected NULLs in window function results involves frame clauses.

Common gotcha: NTH_VALUE requires an explicit frame clause. Without one, it may return NULL unexpectedly. Always specify ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING when using NTH_VALUE to ensure it can see the entire partition.

> Complete this query to filter results based on a window function using a subquery.

SELECT
  department,
  metric_name,
  metric_value
FROM (
  SELECT
    department,
    metric_name,
    metric_value,
    ___() OVER (
      PARTITION BY department
      ORDER BY metric_value DESC
    ) AS rn
  FROM employee_metrics
) AS sub
WHERE ___ = 1

ROW_NUMBER

RANK

row_num

You cannot filter on a window function in the same SELECT clause because WHERE executes before window functions are computed.

The subquery pattern computes the window function first, then the outer query filters on the result, enabling top-N-per-group queries.

CTEs (WITH clauses) provide a cleaner alternative to nested subqueries when the query involves multiple window function layers.

Deduplication

Daily Life

Interviews

Remove duplicate rows from production data

Duplicate records are one of the most common data quality problems in production pipelines. Sources send the same event multiple times, ETL pipelines replay data, and upstream systems lack idempotency guarantees. Window functions provide the standard solution: the ROW_NUMBER deduplication pattern numbers duplicates by your chosen criteria and keeps only row number 1.

Data Quality Patterns

The most common use of window functions in production is cleaning and deduplicating data. These patterns form the foundation of reliable data pipelines.

Dedup with ROW_NUMBER

Duplicate records are one of the most common data quality issues. Sources send the same event multiple times, ETL pipelines replay data, and upstream systems lack idempotency guarantees. The ROW_NUMBER deduplication pattern is the standard solution: number the duplicates and keep only row number 1.

The key decisions in deduplication are: what defines a duplicate (the PARTITION BY columns) and which copy to keep (the ORDER BY column and direction). ORDER BY created_at DESC keeps the newest record. ORDER BY created_at ASC keeps the oldest. You can also use multiple tiebreaker columns: ORDER BY created_at DESC, updated_at DESC, id DESC ensures deterministic ordering even when timestamps tie.

	SELECT
	user_id,
	email,
	created_at
	FROM (
	SELECT
	user_id,
	email,
	created_at,
	ROW_NUMBER() OVER (
	PARTITION BY user_id
	ORDER BY created_at DESC
	) AS rn
	FROM event_data
	)
	WHERE rn = 1

	SELECT
	user_id,
	email,
	created_at
	FROM (
	SELECT
	user_id,
	email,
	created_at,
	ROW_NUMBER() OVER (
	PARTITION BY user_id
	ORDER BY created_at DESC
	) AS rn
	FROM event_data
	)
	WHERE rn = 1

Result

user_id	email	created_at
U001	alice_new@example.com	2024-03-15 14:30:00
U002	bob@example.com	2024-03-14 09:15:00
U003	carol_v2@example.com	2024-03-15 11:00:00

PARTITION BY user_id groups duplicates together. ORDER BY created_at DESC ensures the most recent record gets row number 1. The outer query keeps only that record. This pattern works for any deduplication key and any tiebreaker column.

> Complete this query to remove duplicate user records, keeping only the most recent entry for each user.

SELECT
  user_id,
  event_type
FROM (
  SELECT
    user_id,
    event_type,
    ___() OVER (
      ___ user_id
      ___ event_timestamp ___
    ) AS rn
  FROM event_data
)
WHERE rn = 1

ROW_NUMBER

PARTITION BY

ORDER BY

DESC

RANK

ASC

Funnel Analysis

Funnel analysis tracks how users progress through a sequence of steps: signup, activation, first purchase, repeat purchase. The goal is to measure drop-off at each stage. Window functions enable ordered funnel tracking per user. This is one of the most requested analytics patterns in product companies, because it directly answers the question: "Where are we losing users?"

	SELECT
	step,
	users_at_step,
	ROUND(
	CAST(users_at_step AS DOUBLE) / FIRST_VALUE(
	users_at_step
	) OVER (
	ORDER BY step_order
	ROWS UNBOUNDED PRECEDING
	),
	2
	) AS conversion_rate
	FROM (
	SELECT
	step,
	step_order,
	COUNT(DISTINCT user_id) AS users_at_step
	FROM funnel_events
	GROUP BY step, step_order
	)
	ORDER BY step_order

Result

step	users_at_step	conversion_rate
signup	10000	1.00
activation	6500	0.65
purchase	2100	0.21

FIRST_VALUE grabs the signup count (the first step), and each subsequent step divides its user count by that baseline. The result shows that 65% of users who signed up also activated, and 21% eventually made a purchase. The 35% drop-off between signup and activation is where the product team should focus optimization efforts.

For more detailed funnel analysis, you can add step-over-step conversion rates using LAG. This shows not just the absolute conversion from the top of the funnel, but the conversion between each consecutive pair of steps. A 65% signup-to-activation rate and a 32% activation-to-purchase rate tells a more actionable story than absolute percentages alone.

Gap-and-Island Detection

Gap-and-island detection identifies contiguous groups of rows (islands) separated by gaps. This pattern appears when you need to find consecutive days of user activity, identify continuous periods of service uptime, or group sequential status changes. The technique uses the difference between ROW_NUMBER and the actual sequence value to create group identifiers.

	SELECT
	user_id,
	MIN(active_date) AS streak_start,
	MAX(active_date) AS streak_end,
	COUNT(*) AS streak_length
	FROM (
	SELECT
	user_id,
	active_date,
	DATE_DIFF(
	'day',
	CAST(
	ROW_NUMBER() OVER (
	PARTITION BY user_id
	ORDER BY active_date
	)
	AS INTEGER
	) * INTERVAL '1' DAY,
	active_date
	) AS island_id
	FROM daily_logins
	)
	GROUP BY user_id, island_id
	ORDER BY streak_length DESC

Result

user_id	streak_start	streak_end	streak_length
U001	2024-01-05	2024-01-12	8
U001	2024-01-15	2024-01-18	4
U002	2024-01-01	2024-01-03	3
U002	2024-01-10	2024-01-11	2

The trick works because ROW_NUMBER increments by 1 for every row, and dates in a contiguous streak also increment by 1 day. Subtracting the row number from the date produces the same result for all rows in a contiguous streak, creating a natural group identifier. When there is a gap, the date jumps forward but ROW_NUMBER does not, producing a different group ID.

Which real-world pattern matches your analysis goal?

	SELECT
	user_id,
	event_time,
	SUM(
	CASE
	WHEN gap_min > 30
	OR gap_min IS NULL THEN 1
	ELSE 0
	END
	) OVER (
	PARTITION BY user_id
	ORDER BY event_time
	ROWS UNBOUNDED PRECEDING
	) AS session_id
	FROM (
	SELECT
	user_id,
	event_time,
	DATE_DIFF(
	'minute',
	LAG(event_time, 1) OVER (
	PARTITION BY user_id
	ORDER BY event_time
	),
	event_time
	) AS gap_min
	FROM clickstream
	)

Performance at Scale

Production workloads require careful attention to memory usage and data distribution. These techniques keep window functions performant on large datasets.

Memory Pressure

Window functions require the engine to hold partition data in memory for sorting and computation. Wide windows, meaning frames that span many rows or entire partitions, consume proportionally more memory. A window function with ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING on a partition with 50 million rows forces the engine to buffer all 50 million rows simultaneously.

To reduce memory pressure, apply these strategies in order of impact: first, filter data before the window function runs (add WHERE clauses in the inner query to reduce partition sizes). Second, use bounded frames like ROWS 30 PRECEDING instead of UNBOUNDED when your logic allows it. Third, reduce the number of columns in the SELECT of the inner query, because the engine must buffer entire rows.

CURRENT ROWN PRECEDINGUNBND. PRECEDINGUNBND. BOTH

CURRENT ROW

Minimal

Point-in-time values

N PRECEDING

Bounded

Rolling window pattern

UNBND. PRECEDING

Grows w/ part.

Running total pattern

UNBND. BOTH

Full partition

Partition-wide compute

Skew Mitigation: Salting

When a single partition key dominates, salting breaks it into smaller pieces. The process has three steps. First, you create a salted key by appending a random or hash-derived suffix to the original partition key. This splits one massive partition into N smaller buckets. Second, you perform the aggregation or computation on the salted keys, which distributes the work across multiple workers. Third, you strip the salt and combine the partial results back into the final answer.

The number of salt buckets should be chosen based on your cluster size and the degree of skew. If one customer has 10 million rows and you have 100 workers, using 100 salt buckets gives each worker roughly 100 thousand rows for that customer, which is manageable. Using too few buckets does not solve the problem; using too many creates overhead from the extra aggregation step.

	SELECT
	original_key,
	SUM(subtotal) AS total_revenue
	FROM (
	SELECT
	customer_id AS original_key,
	CONCAT(
	customer_id,
	'_',
	CAST(
	MOD(ABS(HASH(order_id)), 10)
	AS VARCHAR
	)
	) AS salted_key,
	SUM(amount) AS subtotal
	FROM orders
	GROUP BY customer_id, CONCAT(
	customer_id,
	'_',
	CAST(MOD(ABS(HASH(order_id)), 10) AS VARCHAR)
	)
	)
	GROUP BY original_key

Result

original_key	total_revenue
C_MEGA	9500000.00
C002	45000.00
C003	32000.00

The inner query splits the mega-customer into 10 buckets by appending a hash-derived suffix (0 through 9). Each bucket is processed by a separate worker, and each worker computes a subtotal. The outer query combines the subtotals by the original key. This transforms a single-worker bottleneck into parallel processing across 10 workers, reducing wall-clock time by roughly 10x for the skewed partition.

✓Do

Use ROW_NUMBER subquery pattern for deduplication
Check partition key distribution before running window queries
Use bounded frames (ROWS N PRECEDING) when possible
Replace self-joins with LAG/LEAD for row comparisons
Place final ORDER BY in the outermost query
Filter data before window functions to reduce partition sizes

✗Don't

Put window functions in WHERE or HAVING clauses (use subquery)
Assume subquery ORDER BY survives to the outer query
Use UNBOUNDED frames on partitions with millions of rows
Ignore data skew in PARTITION BY columns
Expect named WINDOW syntax to work (it is not supported)
Reference one window function from another in the same SELECT

❯❯❯PUTTING IT ALL TOGETHER

> You are a senior analyst at BlackRock building a customer quartile report that groups wealth management clients by assets under management and flags statistically anomalous account balances for the compliance team.

Replacing self-joins with LAG() and AVG OVER cuts query runtime significantly because the engine scans the client table once instead of joining it to itself.

Aligning PARTITION BY with the physical table partition on account region avoids cross-node data shuffles and prevents skew from large institutional accounts bottlenecking workers.

CUME_DIST() computes each client's percentile position so the compliance team can immediately flag accounts above the 99th percentile as statistical outliers.

The ROW_NUMBER() deduplication pattern wrapped in a subquery ensures each client appears exactly once before quartile assignment, preventing inflated tier counts from duplicate records.

KEY TAKEAWAYS

Window functions replace most self-joins for row comparisons and group aggregates, running in a single pass

CUME_DIST and PERCENT_RANK provide percentile positioning for anomaly detection and benchmarking

Key patterns: deduplicate with ROW_NUMBER, sessionize with LAG + running SUM, detect streaks with gap-and-island

Align window PARTITION BY with physical table partitioning to avoid data shuffles; check for skew before querying

Use bounded frames and pre-filtering to manage memory; wrap window functions in subqueries to filter on their results

One pass to rule them all

Category: SQL
Difficulty: advanced
Duration: 41 minutes
Challenges: 4 hands-on challenges

Topics covered: Windows vs Self-Joins, Windows and Partitioning, Cumulative Distribution, Workarounds & Engine Limits, Deduplication

Lesson Sections

Windows vs Self-Joins (concepts: sqlSelfJoin)
One of the most impactful things you can do with window functions is eliminate self-joins. A self-join is when a table is joined to itself, typically to compare each row with a related row in the same table. Self-joins are expensive: the engine must scan the table twice, build a join hash map, and match rows. Window functions accomplish the same comparison in a single pass over the data. Self-joins are extremely common in legacy codebases. Before window functions were widely supported, they were
Windows and Partitioning (concepts: sqlPartitionBy)
Table partitioning divides a large table into smaller physical segments based on a column value, typically a date. A table partitioned by day stores each day of data in a separate directory or file. When a query filters on the partition column, the engine reads only the relevant partitions and skips the rest entirely. This is called partition pruning. Partition Pruning When your query filters and window partitions align with the physical table structure, performance improves dramatically. Partit
Cumulative Distribution (concepts: sqlCumulativeDist)
Distribution Functions Two functions express a row's position as a decimal between 0 and 1. Understanding their subtle differences is key to choosing the right one. CUME_DIST PERCENT_RANK Distribution Patterns Distribution functions power some of the most valuable analytics patterns: outlier detection, sessionization, and percentile-based tiering. CUME_DIST Anomaly Detection This query identifies the top 1% of transactions by amount. These are the outliers that warrant manual review. The same pa
Workarounds & Engine Limits (concepts: sqlExecutionOrder)
Every SQL engine has limitations around window functions. Understanding these constraints and their workarounds prevents hours of debugging cryptic error messages. The most common issues involve where window functions can appear in a query, missing syntax features, and ordering guarantees. Placement Restrictions SQL's evaluation order determines where window functions can and cannot appear. These restrictions apply across all SQL engines. Window Functions vs WHERE Window Functions vs HAVING Repe
Deduplication (concepts: sqlWindowDedup)
Data Quality Patterns The most common use of window functions in production is cleaning and deduplicating data. These patterns form the foundation of reliable data pipelines. Dedup with ROW_NUMBER Funnel Analysis Funnel analysis tracks how users progress through a sequence of steps: signup, activation, first purchase, repeat purchase. The goal is to measure drop-off at each stage. Window functions enable ordered funnel tracking per user. This is one of the most requested analytics patterns in pr