Window Functions: Intermediate
Robinhood shows every user a 30-day moving average price line and a daily portfolio change percentage, and those rolling calculations are computed in SQL using LAG, LEAD, and ROWS BETWEEN frame specifications rather than in application code. By computing trailing averages and period-over-period deltas directly in the database, Robinhood's dashboards update in milliseconds instead of seconds, even as millions of users simultaneously check their portfolios during volatile trading sessions. The intermediate window function techniques in this lesson are the exact patterns powering that kind of real-time financial analytics.
LAG and LEAD
Compare each row to the one before or after
LAG and LEAD let you look backward and forward across rows without leaving your query. Every dataset has a first row with nothing before it, and that boundary is where period-over-period comparisons can quietly break.
Nadia Reyes came in first thing this morning. Her growth report covers six months of revenue data. January is blank where every other month has a number. The source data is clean, no errors, no missing records. Nobody can explain the gap.
Six months. Six revenue figures. No gaps. No NULLs. The source data is clean.
| month_num | month | revenue |
|---|---|---|
| 1 | Jan | 45000 |
| 2 | Feb | 52000 |
| 3 | Mar | 48000 |
| 4 | Apr | 61000 |
| 5 | May | 58000 |
| 6 | Jun | 72000 |
LAG and LEAD are window functions that reach backward or forward to access values from other rows without a self-join. They are essential for comparing a row to the row before or after it.
Think of your result set as a spreadsheet. When you are on row 5, LAG lets you peek at row 4 (or row 3, or any row above), and LEAD lets you peek at row 6 (or row 7, or any row below). The database does not need to join the table to itself or use a correlated subquery. It simply remembers what it already processed and knows what comes next.
LAG and LEAD Fundamentals
The Syntax
LAG accepts up to three arguments: the column to read, an optional offset (defaults to 1), and an optional default value for when there is no previous row. LEAD works identically but looks forward instead of backward.
Why Not a Self-Join?
- Scans the table twice
- Silently drops rows with missing data
- Complex join conditions to maintain
- Single ordered pass over data
- Returns NULL or your default for gaps
- Clean, readable syntax
Month-Over-Month Comparison
The most common use of LAG is period-over-period comparison. By subtracting the previous period value from the current one, you get the change between periods in a single pass.
The first row has NULL for prev_month because there is no row before January. The arithmetic with NULL also produces NULL for mom_change. You can handle this with COALESCE or by providing a default value in the LAG call itself.
Practical Applications
Beyond simple comparisons, LAG and LEAD unlock powerful patterns for forecasting, change detection, and anomaly identification.
LEAD for Change Detection
While LAG is used to look backward, LEAD is equally valuable for forward-looking analysis. A common use case is detecting upcoming changes. For example, if you have a table of subscription plan changes, LEAD can show what plan a customer will switch to next, letting you identify potential downgrades before they happen.
Customer C1 is about to downgrade from Premium to Basic. Customer C2 upgraded from Basic to Premium. The NULL values in next_plan indicate the most recent plan, the one they are still on. This kind of forward-looking analysis powers churn prediction models and retention campaigns.
Detecting Sequential Gaps
LAG is also powerful for detecting gaps or anomalies in sequences. If you expect events to occur daily, you can flag days where the gap between consecutive events is larger than expected. This is critical in monitoring pipelines where a missing day of data could indicate a broken ETL job or a sensor failure.
The row with days_since_last = 3 reveals a gap: sensor S1 skipped March 3rd and 4th. You could wrap this in a WHERE clause (via a subquery) to filter only the rows where the gap exceeds your threshold. In a production pipeline, this pattern feeds automated alerting systems that notify engineers when data stops flowing.
LAG and LEAD always require ORDER BY inside the OVER clause. Without ordering, "previous" and "next" have no meaning. The database will reject the query or produce unpredictable results.Larger Offsets
The offset parameter lets you look further back or ahead. LAG(revenue, 12) looks 12 rows back, useful for year-over-year comparison with monthly data:
The first 12 rows will have NULL for same_month_last_year because there are no rows 12 positions back yet. Once you have a full year of data, every row gets a comparison. Year-over-year analysis is one of the most requested metrics in business intelligence dashboards, and LAG with offset 12 is by far the cleanest way to compute it.
> Complete this query to rank metrics by value, with ties sharing the same rank and no gaps in the numbering.
SELECT metric_name, metric_value, () OVER ( ORDER BY metric_value ) AS ranking FROM employee_metrics
DENSE_RANK is the correct choice whenever you want to count how many distinct performance levels exist above a given row, because it assigns consecutive integers with no gaps even when multiple rows share the same value.
LAG and LEAD are complementary to ranking functions because once you have ranked rows you can use LAG to compare each row to the one immediately above or below it in the ranking, which enables tier-boundary analysis without additional joins.
FIRST, LAST, NTH VALUE
Pull specific values from a window
FIRST_VALUE, LAST_VALUE, and NTH_VALUE let you pull specific positional values from a window. They answer questions like "what was the first order amount in each category?" or "what was the third highest salary in each department?"
These functions are different from LAG and LEAD in an important way. LAG and LEAD access rows relative to the current row (1 row back, 2 rows forward). Positional functions access rows relative to the window frame itself (the first row in the frame, the last row, or the Nth row). This distinction matters because the frame can be the entire partition or just a sliding subset of it.
FIRST_VALUE: First purchase date per customer (acquisition dates)LAST_VALUE: Last login time per user (activity monitoring)NTH_VALUE: Third highest sale per region (outlier detection)
Positional Function Basics
FIRST_VALUE
FIRST_VALUE returns the value from the first row in the window frame. This is commonly used to compare every row against the starting point. Unlike LAST_VALUE, FIRST_VALUE works correctly with the default frame because the first row of the frame is always included.
The LAST_VALUE Frame Trap
LAST_VALUE has a subtle but critical gotcha that catches nearly everyone the first time. The default window frame is ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. This means the "last value" in the frame is always the current row itself.
To understand why this happens, picture the frame as a sliding window that starts at the first row of the partition and ends at the current row. When the database processes row 1, the frame contains only row 1, so the last value is row 1. When it processes row 2, the frame contains rows 1 and 2, so the last value is row 2. The frame always ends at "where you are now," so LAST_VALUE just echoes the current row.
| employee | salary | default_last | true_last |
|---|---|---|---|
| Alice | 150000 | 150000 | 120000 |
| Bob | 130000 | 130000 | 120000 |
| Carol | 120000 | 120000 | 120000 |
Notice how default_last just echoes each row's own salary? That is because the frame ends at CURRENT ROW, so the "last" value is always the current row. To get the actual last value in the entire partition, you must explicitly expand the frame:
By specifying ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING, the frame spans the entire partition. Now LAST_VALUE correctly returns the last value (the lowest salary when ordered descending).
Whenever you use LAST_VALUE or NTH_VALUE, always add ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING unless you specifically want the frame to end at the current row. This is one of the most frequently tested window function concepts in interviews.
NTH_VALUE
NTH_VALUE returns the value from the Nth row in the window frame. It is useful when you need a specific position, such as the second highest or third most recent.
Notice the full frame specification again. Without it, NTH_VALUE suffers from the same trap as LAST_VALUE: the first row in each partition would return NULL for second_highest because the frame at that point only contains one row.
> Complete this query to show each employee's salary alongside the highest salary in their department.
SELECT department, metric_name, metric_value, () OVER ( PARTITION BY department ORDER BY metric_value DESC ) AS top_value FROM employee_metrics
FIRST_VALUE with ORDER BY DESC gives the maximum value, while LAST_VALUE needs an explicit frame clause to see the entire partition.
Always add ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING when using LAST_VALUE or NTH_VALUE to avoid unexpected NULLs.
NTH_VALUE is useful for specific positions like second highest or third most recent, but also requires an explicit full-partition frame.
Frame Specifications
Define exactly which rows a window covers
Frame Syntax and Types
ROWS BETWEEN Syntax
The frame clause follows ORDER BY inside the OVER clause. The syntax is ROWS BETWEEN followed by a start bound and an end bound:
ROWS vs RANGE
SQL offers two frame types: ROWS and RANGE. They behave differently when your ORDER BY column contains duplicate values, which is common in real data.
ROWS counts physical rows. "3 PRECEDING" means exactly 3 rows above the current one, regardless of the values in those rows. RANGE, on the other hand, groups by value. "3 PRECEDING" with RANGE means all rows whose ORDER BY value is within 3 units of the current row's value. If multiple rows share the same value, RANGE treats them as a single logical group.
In practice, ROWS is used far more often than RANGE. ROWS is simpler to reason about, more predictable, and more widely supported across database engines. Use RANGE only when you specifically need value-based grouping, such as computing a sum of all transactions within a dollar amount range of the current row.
Common Frame Patterns
Real-World Frame Patterns
Sliding Window Average
The first row only has 2 values in its frame (itself and the next row) because there is no preceding row. The last row similarly only has 2 values. The AVG function automatically handles partial frames by averaging whatever rows are present. This edge behavior is important to understand: moving averages at the boundaries of your data are based on fewer data points and may be less reliable.
7-Day Trailing Average
The frame ROWS BETWEEN 6 PRECEDING AND CURRENT ROW creates a trailing 7-day window (6 previous rows plus the current one). Early rows where fewer than 7 days of data exist will average whatever is available. Note that the trailing average stabilizes after the first full week of data, making subsequent values more reliable for trend analysis.
Bounded vs Unbounded Frames
Bounded frames (like 6 PRECEDING AND CURRENT ROW) limit the number of rows the function examines at each step. Unbounded frames (like UNBOUNDED PRECEDING AND CURRENT ROW) grow as the partition progresses. This distinction has practical implications.
With a bounded frame, the database only needs to maintain a fixed-size window in memory. Adding one row to the front and dropping one from the back is efficient. With an unbounded frame, the running calculation must account for every row seen so far. For SUM and COUNT, the database can maintain a running total incrementally. But functions that require sorting or comparing all values in the frame cannot take this shortcut, making unbounded frames expensive on very large partitions.
> Complete this query to compare each day's cost with the previous day.
SELECT bill_date, amount, (amount, ) OVER ( ORDER BY bill_date ) AS prev_amount FROM cloud_costs
LAG with an offset of 1 is the standard tool for day-over-day or period-over-period comparisons because it reads the value from the preceding row in a single pass without requiring the table to be joined to itself.
Providing a default value as the third argument to LAG replaces the NULL that appears on the first row of each partition, which prevents NULL propagation when you subsequently subtract the lagged value from the current value.
Frame specifications and LAG serve different purposes: frames control which rows an aggregate function sees at each step, while LAG always jumps to a fixed offset regardless of any frame clause you specify.
NTILE and Bucketing
Split rows into equal-sized groups
NTILE divides the ordered rows in a partition into a specified number of roughly equal groups and assigns each row a bucket number from 1 to N. It is the go-to function for creating quartiles, deciles, and custom percentile buckets.
Percentile bucketing is fundamental to data engineering and analytics. Nearly every organization segments its users, customers, or products into tiers based on some metric. The top 25 percent of customers by spending might receive premium support. The bottom 10 percent of products by revenue might be candidates for discontinuation. NTILE automates this segmentation cleanly.
NTILE Fundamentals
The NTILE function follows a predictable pattern for dividing rows into equal groups, making it straightforward to apply once you understand its behavior.
Basic NTILE Usage
Handling Remainders
When the row count does not divide evenly, NTILE distributes the extra rows to the earlier buckets. With 10 rows and 4 buckets, the first two buckets get 3 rows each and the last two get 2 rows each.
- 10 rows into 4 buckets → 3, 3, 2, 2
- 11 rows into 4 buckets → 3, 3, 3, 2
- 12 rows into 4 buckets → 3, 3, 3, 3
- 100 rows into 10 buckets → 10 each (even)
- 103 rows into 10 buckets → 11, 11, 11, 10, 10, 10, 10, 10, 10, 10
Deciles and Percentiles
Use NTILE(3) for tertiles (3 groups), NTILE(10) for deciles (10 groups), or NTILE(100) for percentiles. Deciles are popular for salary analysis and customer segmentation. Percentiles give the finest granularity.
NTILE in Practice
Understanding when and how to use NTILE effectively requires comparing it to alternatives and knowing how to combine it with other window clauses.
NTILE vs CASE Bucketing
Before NTILE, analysts created buckets using manual CASE statements with hardcoded thresholds. This approach is fragile because the thresholds must be updated whenever the data distribution changes. NTILE adapts automatically.
- Hardcoded thresholds
- Must update when data changes
- Buckets may be very unequal
- Verbose: one WHEN per bucket
- Adapts to data automatically
- Always produces equal groups
- One line of SQL
- Works with any partition
Use manual CASE when your bucket boundaries have business meaning (such as "under 50K is junior, 50-100K is mid, over 100K is senior"). Use NTILE when you want equal-sized groups based on relative position, such as "top 25 percent of performers."
NTILE with PARTITION BY
Combine NTILE with PARTITION to create buckets within groups. This is essential when the same absolute value means different things in different contexts.
Note that NTILE does not consider the actual values when creating buckets. It only considers row position. Two employees with identical salaries could end up in different buckets if they fall on a boundary. If you need value-aware bucketing, use PERCENT_RANK or CUME_DIST instead.
> Complete this query to divide customers into 4 spending tiers.
SELECT customer_name, total_spent, () OVER ( ORDER BY total_spent DESC ) AS tier FROM customers
NTILE distributes rows as evenly as possible. If 10 rows are split into 4 buckets, two buckets get 3 rows and two get 2.
NTILE assigns buckets based on row position, not actual values. Two customers with identical spending could end up in different tiers at a boundary.
For value-aware bucketing where identical values should always share the same tier, use PERCENT_RANK or CUME_DIST instead.
Multiple Windows
Run different calculations in one query
There is no named WINDOW clause. You must write the full OVER specification for each window function. This means you will see multiple OVER clauses in a single SELECT, each potentially with different partitioning, ordering, and framing.
Multiple Window Definitions
Different Partitions
The first RANK partitions by category so each category has its own ranking. The second RANK has no partition, ranking all products together. Winter Jacket is ranked 1st in Clothing but 2nd overall.
Different Orderings
Multiple Query Perspectives
This single query produces three different analytical perspectives: a department-level running total, month-over-month comparison via LAG, and a company-wide revenue tier via NTILE. Each OVER clause is tailored to its specific purpose.
Performance Considerations
Shared Window Performance
When multiple window functions share the same OVER clause (same PARTITION BY, same ORDER BY, same frame), the query optimizer can often compute them in a single pass over the data. The database sorts the data once and evaluates all matching window functions together.
When window functions have different OVER clauses, the database may need to sort the data multiple times, once for each distinct window specification. On large datasets, each additional sort is expensive. This is why grouping window functions by their OVER clause is good practice.
- Always include ORDER BY in LAG, LEAD, and positional functions
- Use explicit frame specs with LAST_VALUE and NTH_VALUE
- Provide default values in LAG/LEAD to avoid NULL arithmetic
- Use NTILE for equal-group bucketing instead of manual CASE logic
- Comment your intent when using multiple windows in one query
- Use ROWS frames unless you specifically need value-based RANGE grouping
- Do not use LAST_VALUE without expanding the frame to UNBOUNDED FOLLOWING
- Do not assume NTILE buckets are exactly equal when rows do not divide evenly
- Do not confuse ROWS and RANGE frame types when dealing with duplicate values
- Do not use reserved words like row or rank as column aliases
- Do not forget that LAG and LEAD ignore the frame clause entirely
> Fill in the missing parts to calculate how much cost changed from one billing period to the next.
SELECT bill_date, amount, (, 1) OVER ( bill_date ) AS prev_amount, amount - (amount, 1) OVER ( ORDER BY bill_date ) AS mom_change FROM cloud_costs
Combining LAG for period comparison with NTILE for segmentation in the same query is a common analytics pattern because it simultaneously shows how a row changed from the previous period and where it ranks relative to all other rows.
Each window function in a SELECT list is evaluated independently against its own OVER clause, which means the database may sort the data multiple times if the specifications differ, so aligning OVER clauses where possible improves query performance on large datasets.
> You are a growth analyst at Spotify tracking month-over-month revenue changes per product line to surface trends for the quarterly business review, and your stakeholders want both period comparisons and bucket tiers in a single report.
LAG() pulls the prior month revenue for each product line in one pass, letting you subtract it from the current month to compute the change directly.FIRST_VALUE() and LAST_VALUE() with explicit ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING frames anchor each row to the quarter start and end for boundary comparisons.ROWS BETWEEN N PRECEDING AND CURRENT ROW produce a rolling three-month average that smooths seasonal noise across product lines.NTILE() divides product lines into performance quartiles so leadership can immediately see which lines rank in the top versus bottom tier each quarter.LAG/LEAD access previous/next rows in a single pass; use offset and default value params to handle edgesLAST_VALUE and NTH_VALUE need an explicit ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING frame to see the full partitionROWS for physical row counts (moving averages) and RANGE for value-based groupingNTILE(N) divides rows into N equal buckets for quartile/percentile segmentationLook back. Look forward. Frame it.
- Category
- SQL
- Difficulty
- intermediate
- Duration
- 39 minutes
- Challenges
- 5 hands-on challenges
Topics covered: LAG and LEAD, FIRST, LAST, NTH VALUE, Frame Specifications, NTILE and Bucketing, Multiple Windows
Lesson Sections
- LAG and LEAD (concepts: sqlLagLead)
LAG and LEAD Fundamentals Understanding the syntax and behavior of these functions is essential before applying them to real analytics scenarios. The Syntax Why Not a Self-Join? Before window functions existed, comparing a row to its predecessor required joining a table to itself. You would match each row to the previous row by offsetting a date or sequence column. This works, but it has real costs. Month-Over-Month Comparison Practical Applications LEAD for Change Detection Detecting Sequential
- FIRST, LAST, NTH VALUE (concepts: sqlFirstLastValue)
Positional Function Basics Each positional function accesses a specific row within the window frame. Understanding their syntax and default behaviors is essential for correct usage. FIRST_VALUE Every employee can see how far their salary is from the highest in their department. The person with the top salary has a gap of zero. This pattern is commonly used in compensation analysis to show each employee their relative position without revealing other individuals' exact salaries. The LAST_VALUE Fr
- Frame Specifications (concepts: sqlWindowFrame)
Frame specifications define exactly which rows participate in a window calculation. Every window function operates on a frame, and understanding frames is the key to unlocking the full power of window functions. Up to this point, you have been relying on the default frame, which extends from the beginning of the partition to the current row. That default works for running totals but fails for many other patterns. To build moving averages, trailing windows, and centered calculations, you need to
- NTILE and Bucketing (concepts: sqlNtile)
NTILE Fundamentals Basic NTILE Usage Consider a customer segmentation scenario. A marketing team wants to divide customers into four spending tiers to target campaigns differently. The top quartile gets exclusive offers, the bottom quartile gets re-engagement campaigns. With 8 rows divided into 4 groups, each group gets exactly 2 rows. Quartile 1 contains the highest spenders and quartile 4 the lowest. This is a clean division because 8 divides evenly by 4. Handling Remainders The rule is simple
- Multiple Windows (concepts: sqlMultipleWindows)
Real-world analysis often requires comparing a row from multiple perspectives in the same query. You might need a rank within a department, a running total across the company, and a comparison to the previous month all in one result set. Multiple Window Definitions Each window function in a query can have its own partitioning, ordering, and framing. This flexibility lets you answer multiple questions in a single pass over the data. Different Partitions Here is a practical example: for each sale,