Most candidates think window functions are a ranking trick. The interviewer is checking whether you understand that PARTITION BY doesn't reduce rows. That single confusion tanks more window function questions than syntax errors do. The data backs it up: PARTITION BY shows up in 21% of verified DE SQL rounds, ROW_NUMBER in 15%, RANK and DENSE_RANK in 11%, aggregate windows in 9%.
Most study guides dump a function list and call it a day. We're going to flip the order. Start with what the OVER clause actually does to row cardinality, then the frame semantics candidates always skip, then the six functions worth memorizing. Every example runs on PostgreSQL and copies cleanly to Snowflake, BigQuery, Redshift, and Databricks.
[Chart: topic frequency in verified DE SQL rounds — questions with PARTITION BY (21%), ROW_NUMBER appearances (15%), RANK / DENSE_RANK (11%), aggregate window functions (9%). Source: DataDriven analysis of 1,042 verified data engineering interview rounds.]
Here's where candidates trip. The OVER clause doesn't aggregate rows, it annotates them. Every input row comes out intact with extra columns attached. If you think PARTITION BY acts like GROUP BY, you'll write a query that returns the wrong cardinality and the interviewer will flag it in 30 seconds.
SELECT
employee_id,
department,
salary,
hire_date,
-- Full anatomy: function + OVER(PARTITION BY + ORDER BY + frame)
SUM(salary) OVER (
PARTITION BY department -- 1. Split rows into groups
ORDER BY hire_date -- 2. Sort within each group
ROWS BETWEEN UNBOUNDED PRECEDING -- 3. Define which rows
AND CURRENT ROW -- the function sees
) AS running_dept_salary
FROM employees;

PARTITION BY divides the result set into independent groups. The window function resets for each partition. Think of it as GROUP BY without collapsing rows. If you omit PARTITION BY, the entire result set is one partition.
Interview tip: "PARTITION BY is like GROUP BY that keeps every row." That one sentence shows you understand the core concept.
ORDER BY determines the sequence of rows within each partition. It is required for ranking functions (ROW_NUMBER, RANK, DENSE_RANK) and offset functions (LAG, LEAD). For aggregate windows (SUM, AVG, COUNT), ORDER BY turns the aggregate into a running calculation instead of a partition-wide total.
Without ORDER BY, SUM() OVER (PARTITION BY dept) gives the same total on every row. With ORDER BY, it gives a running total. That distinction is a frequent interview question.
The frame clause controls exactly which rows within the partition the function considers. It is specified as ROWS or RANGE followed by a boundary (UNBOUNDED PRECEDING, N PRECEDING, CURRENT ROW, N FOLLOWING, UNBOUNDED FOLLOWING). The default frame when ORDER BY is present is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which is not always what you want.
The default frame trips people up constantly. If you specify ORDER BY but no frame, you get a running total by default, not a partition-wide total. Spell out the frame explicitly to avoid surprises.
These six cover 90%+ of interview questions. ROW_NUMBER, RANK, and DENSE_RANK handle ranking. LAG and LEAD handle row comparisons. NTILE handles bucketing. Master these and you can derive everything else.
ROW_NUMBER assigns a unique sequential integer to each row within a partition. No ties. If two rows have the same ORDER BY value, the assignment is nondeterministic (the database picks one arbitrarily). This is the most-used window function in data engineering because it powers deduplication: partition by the natural key, order by updated_at DESC, keep row_number = 1.
SELECT
employee_id,
department,
salary,
ROW_NUMBER() OVER (
PARTITION BY department
ORDER BY salary DESC
) AS rn
FROM employees;

Interview note: ROW_NUMBER is the standard dedup tool. Interviewers expect you to write a CTE with ROW_NUMBER and filter WHERE rn = 1 in the outer query. Know that ties are broken arbitrarily unless you add a tiebreaker column.
RANK works like ROW_NUMBER but handles ties. When two rows share the same ORDER BY value, they get the same rank, and the next rank skips ahead. If two employees tie at rank 2, the next employee gets rank 4 (not 3). This creates gaps in the sequence. Interviewers test whether you understand the gap behavior and when it matters.
SELECT
employee_id,
department,
salary,
RANK() OVER (
PARTITION BY department
ORDER BY salary DESC
) AS rnk
FROM employees;
-- Salaries: 90k, 90k, 80k -> Ranks: 1, 1, 3

Interview note: RANK gaps can surprise downstream consumers. If a report says 'show top 3 by rank,' ties at rank 1 mean rank 2 might not exist. Clarify with the interviewer whether gaps are acceptable.
DENSE_RANK ranks with ties but no gaps. Two employees at rank 2 are followed by rank 3, not rank 4. Use DENSE_RANK when you need N distinct rank levels regardless of ties. A 'top 3 departments by revenue' query using DENSE_RANK yields exactly 3 distinct rank values (assuming at least 3 distinct revenues), though more than 3 rows may qualify.
SELECT
employee_id,
department,
salary,
DENSE_RANK() OVER (
PARTITION BY department
ORDER BY salary DESC
) AS drnk
FROM employees;
-- Salaries: 90k, 90k, 80k -> Dense ranks: 1, 1, 2

Interview note: The classic interview question: 'What is the difference between ROW_NUMBER, RANK, and DENSE_RANK?' You need a crisp one-sentence answer for each, plus an example showing where they diverge.
LAG accesses a value from a previous row within the partition, based on the ORDER BY sequence. LAG(column, 1) returns the value from one row back. LAG(column, 2) returns two rows back. If there is no previous row, the result is NULL (or a default you specify as the third argument). This is the go-to function for calculating row-over-row changes: revenue growth, day-over-day deltas, time between events.
SELECT
order_date,
daily_revenue,
LAG(daily_revenue, 1) OVER (
ORDER BY order_date
) AS prev_day_revenue,
daily_revenue - LAG(daily_revenue, 1) OVER (
ORDER BY order_date
) AS day_over_day_change
FROM daily_sales;

Interview note: LAG without a default returns NULL for the first row. If that NULL breaks a downstream calculation (like a percentage change), wrap it in COALESCE or handle it in the WHERE clause.
LEAD is the mirror of LAG. It accesses a value from a subsequent row within the partition. LEAD(column, 1) returns the next row's value. It is useful for calculating time-to-next-event, gap analysis between consecutive records, and session timeout detection. In clickstream data, LEAD(timestamp) minus the current timestamp gives you the idle time between clicks.
SELECT
user_id,
event_timestamp,
LEAD(event_timestamp, 1) OVER (
PARTITION BY user_id
ORDER BY event_timestamp
) AS next_event_time,
LEAD(event_timestamp, 1) OVER (
PARTITION BY user_id
ORDER BY event_timestamp
) - event_timestamp AS idle_time
FROM clickstream;

Interview note: LEAD returns NULL for the last row in each partition. For session detection, that NULL typically means 'session ended.' Do not COALESCE it to NOW() unless the business logic requires it.
NTILE(n) divides the partition into n roughly equal buckets and assigns a bucket number (1 through n) to each row. NTILE(4) creates quartiles. NTILE(100) creates percentiles. When the rows do not divide evenly, earlier buckets get one extra row. This is how you build percentile distributions without needing PERCENT_RANK or CUME_DIST.
SELECT
employee_id,
salary,
NTILE(4) OVER (ORDER BY salary) AS salary_quartile
FROM employees;
-- 100 employees -> 25 per quartile
-- 101 employees -> bucket 1 gets 26, others get 25

Interview note: NTILE does not handle ties intelligently. Two identical salaries can end up in different quartiles if they fall on a bucket boundary. If that matters, use PERCENT_RANK or CUME_DIST instead.
Any aggregate function (SUM, AVG, COUNT, MIN, MAX) can become a window function by adding OVER(). Without ORDER BY, you get a partition-wide total repeated on every row. With ORDER BY, you get a running calculation. Both patterns are common in interviews.
The most common aggregate window pattern. SUM with ORDER BY accumulates values row by row. This is how you compute cumulative revenue, running balances, or progressive totals without a self-join or correlated subquery.
SELECT
order_date,
daily_revenue,
SUM(daily_revenue) OVER (
ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS cumulative_revenue
FROM daily_sales;
-- order_date | daily_revenue | cumulative_revenue
-- 2024-01-01 | 1200 | 1200
-- 2024-01-02 | 800 | 2000
-- 2024-01-03 | 1500          | 3500

Spelling out ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW is optional here (it matches the default), but doing it explicitly signals to the interviewer that you know the default and are being intentional.
A fixed-width sliding window. Instead of accumulating from the beginning, you average over the last N rows. This smooths out daily noise in metrics like DAU, conversion rate, or latency. The frame clause does the work.
SELECT
report_date,
daily_active_users,
AVG(daily_active_users) OVER (
ORDER BY report_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS seven_day_avg
FROM app_metrics;
-- First 6 rows have fewer than 7 data points.
-- Day 1: avg of 1 value. Day 2: avg of 2 values. Etc.

Use ROWS, not RANGE, for moving averages. RANGE would group rows with the same date, which is rarely what you want for a 7-day window.
When you omit ORDER BY, the aggregate computes across the entire partition and repeats the same value on every row. This lets you compare each row to its group total without a self-join or subquery.
SELECT
employee_id,
department,
salary,
SUM(salary) OVER (PARTITION BY department) AS dept_total,
ROUND(
salary * 100.0 / SUM(salary) OVER (PARTITION BY department),
1
) AS pct_of_dept
FROM employees;
-- Each row shows the individual salary AND what
-- percentage of the department total it represents.

This pattern replaces the old-school approach of joining a table to a subquery that groups and aggregates. It is cleaner, faster, and easier to read. Interviewers favor it.
The frame clause is the part most people skip and the part that causes the most bugs. It controls exactly which rows within the ordered partition the window function sees. Two keywords, five boundaries, and one critical default you need to know.
ROWS counts physical rows. RANGE groups rows by value. The difference only matters when duplicate ORDER BY values exist.
-- Data: dates with duplicates
-- 2024-01-01: $100
-- 2024-01-01: $200 (same date)
-- 2024-01-02: $150
-- ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
-- Row 1: 100
-- Row 2: 100 + 200 = 300
-- Row 3: 100 + 200 + 150 = 450
-- RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
-- Row 1: 100 + 200 = 300 (includes all rows with same date)
-- Row 2: 100 + 200 = 300 (same result, same date group)
-- Row 3: 100 + 200 + 150 = 450

RANGE treats rows with the same ORDER BY value as peers and includes all peers in the frame. ROWS processes them one at a time. For running totals, ROWS is almost always what you want.
Every frame has a start and an end. The start must come before (or equal) the end in the ordering. Here are the five boundary options, from earliest to latest.
UNBOUNDED PRECEDING -- first row of the partition
N PRECEDING -- N rows before the current row
CURRENT ROW -- the current row itself
N FOLLOWING -- N rows after the current row
UNBOUNDED FOLLOWING -- last row of the partition
-- Examples:
-- Running total from start to here:
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
-- 3-row centered moving average:
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
-- Everything from here to the end:
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
-- Full partition (same as no frame at all):
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

When you write ORDER BY without an explicit frame, the SQL standard says the default is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. This creates a running calculation, not a partition-wide total. When you omit ORDER BY entirely, there is no frame and the function sees the whole partition. Mixing these up is one of the most common window function bugs.
-- Partition-wide total (no ORDER BY, no frame)
SUM(salary) OVER (PARTITION BY dept)
-- Every row shows the same total: 500000
-- Running total (ORDER BY triggers default frame)
SUM(salary) OVER (PARTITION BY dept ORDER BY hire_date)
-- Row 1: 60000, Row 2: 125000, Row 3: 195000 ...
-- Explicit partition-wide total WITH ordering
SUM(salary) OVER (
PARTITION BY dept
ORDER BY hire_date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)
-- Every row shows the same total: 500000
-- but rows are still ordered by hire_date

If you want ORDER BY for deterministic output but a partition-wide aggregate, you must explicitly set the frame to UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. The interviewer will notice whether you know this.
These are the patterns that come up repeatedly at companies like Meta, Amazon, Airbnb, and Stripe. The approach notes show what the interviewer expects, not just the final query.
Find users who logged in on 3 or more consecutive days.

Approach:
Use ROW_NUMBER() partitioned by user_id, ordered by login_date. Subtract the row number (as an interval) from login_date to create a grouping key. Consecutive dates produce the same grouping key. Then GROUP BY user_id and the grouping key, filter HAVING COUNT(*) >= 3. This is the islands-and-gaps pattern. Make sure to DISTINCT the login_date first to handle multiple logins per day.
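The islands-and-gaps math above can be run end to end. A minimal sketch using Python's sqlite3 module (SQLite ≥ 3.25 supports window functions); the `logins` table and column names are illustrative, and SQLite's julianday() stands in for PostgreSQL's date-minus-interval arithmetic:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE logins (user_id INTEGER, login_date TEXT);
INSERT INTO logins VALUES
  (1, '2024-01-01'), (1, '2024-01-02'), (1, '2024-01-02'),  -- duplicate login same day
  (1, '2024-01-03'), (1, '2024-01-05'),
  (2, '2024-01-01'), (2, '2024-01-03');
""")

rows = conn.execute("""
WITH days AS (
  -- DISTINCT first: multiple logins per day must not break the streak math
  SELECT DISTINCT user_id, login_date FROM logins
),
numbered AS (
  SELECT user_id, login_date,
         ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_date) AS rn
  FROM days
)
-- Consecutive dates share the same (date - rn) value: that is the island key.
-- In PostgreSQL: GROUP BY user_id, login_date - rn * INTERVAL '1 day'
SELECT user_id, MIN(login_date) AS streak_start, COUNT(*) AS streak_len
FROM numbered
GROUP BY user_id, julianday(login_date) - rn
HAVING COUNT(*) >= 3
""").fetchall()

print(rows)  # user 1 has a 3-day streak starting 2024-01-01
```

Note how the duplicate login on 2024-01-02 would have broken the row-number arithmetic without the DISTINCT step.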
Calculate month-over-month revenue growth.

Approach:
Use LAG(revenue, 1) OVER (ORDER BY month) to get the previous month's revenue. Then calculate (revenue - prev_revenue) / prev_revenue * 100.0. The first month returns NULL because LAG has no prior row. Do not COALESCE it to zero because that would imply 0% growth, which is wrong. Use NULLIF in the denominator to guard against division by zero if a prior month had zero revenue.
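A runnable sketch of this LAG-plus-NULLIF pattern, again via sqlite3; the `monthly_revenue` table is an assumption for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE monthly_revenue (month TEXT, revenue REAL);
INSERT INTO monthly_revenue VALUES
  ('2024-01', 1000.0), ('2024-02', 1200.0), ('2024-03', 900.0);
""")

rows = conn.execute("""
WITH with_prev AS (
  SELECT month, revenue,
         LAG(revenue, 1) OVER (ORDER BY month) AS prev_revenue
  FROM monthly_revenue
)
SELECT month,
       -- NULLIF guards against a zero-revenue prior month;
       -- the first month stays NULL (unknown growth, not 0%)
       ROUND((revenue - prev_revenue) * 100.0 / NULLIF(prev_revenue, 0), 1)
         AS pct_growth
FROM with_prev
ORDER BY month
""").fetchall()

print(rows)  # first month is NULL, then +20.0, then -25.0
```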
Deduplicate a table, keeping the latest record per entity.

Approach:
CTE with ROW_NUMBER() OVER (PARTITION BY entity_id ORDER BY updated_at DESC) AS rn. Outer query filters WHERE rn = 1. If updated_at can have ties, add a tiebreaker: ORDER BY updated_at DESC, id DESC. This is the single most common window function pattern in production data pipelines, and interviewers expect you to write it in under two minutes.
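The dedup CTE in full, executed via sqlite3; the `entities` table and its columns are illustrative stand-ins for whatever natural key and version column the interviewer gives you:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entities (id INTEGER, entity_id INTEGER, updated_at TEXT, value TEXT);
INSERT INTO entities VALUES
  (1, 100, '2024-01-01', 'old'),
  (2, 100, '2024-01-05', 'new'),
  (3, 200, '2024-01-03', 'only'),
  (4, 200, '2024-01-03', 'tie-winner');  -- tie on updated_at; id breaks it
""")

rows = conn.execute("""
WITH ranked AS (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY entity_id
           ORDER BY updated_at DESC, id DESC  -- tiebreaker keeps it deterministic
         ) AS rn
  FROM entities
)
SELECT entity_id, value
FROM ranked
WHERE rn = 1
ORDER BY entity_id
""").fetchall()

print(rows)  # latest row per entity_id; the tie resolved by id DESC
```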
Rank products within each category and return the top 3.

Approach:
DENSE_RANK() OVER (PARTITION BY category ORDER BY total_sales DESC). The interviewer is testing whether you pick DENSE_RANK over RANK. If they say 'show the top 3,' DENSE_RANK guarantees exactly 3 distinct rank levels. Mention that more than 3 rows might appear if there are ties at rank 3.
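A sketch showing the tie behavior the interviewer will probe, run via sqlite3; the `product_sales` data is invented to force a tie at rank 1:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_sales (category TEXT, product TEXT, total_sales INTEGER);
INSERT INTO product_sales VALUES
  ('toys', 'a', 500), ('toys', 'b', 500),   -- tie at rank 1
  ('toys', 'c', 300), ('toys', 'd', 200), ('toys', 'e', 100);
""")

rows = conn.execute("""
WITH ranked AS (
  SELECT category, product, total_sales,
         DENSE_RANK() OVER (
           PARTITION BY category ORDER BY total_sales DESC
         ) AS drnk
  FROM product_sales
)
SELECT product, total_sales, drnk
FROM ranked
WHERE drnk <= 3
ORDER BY drnk, product
""").fetchall()

print(rows)  # 4 rows qualify: the tie at rank 1 plus ranks 2 and 3
```

With RANK instead of DENSE_RANK, the tie at rank 1 would push product 'c' to rank 3 and 'd' to rank 4, so 'd' would drop out of the top 3.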
Compute a 7-day moving average of daily active users.

Approach:
AVG(dau) OVER (ORDER BY report_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW). The frame includes the current row plus the 6 before it, for 7 rows total. Stress the difference between ROWS and RANGE: ROWS counts physical rows, RANGE would group rows with the same date (unlikely here, but the interviewer might ask). For the first 6 days, the average uses fewer than 7 data points. Decide whether to show those partial averages or filter them out.
For each employee, find the gap to the highest salary in their department.

Approach:
MAX(salary) OVER (PARTITION BY department) AS dept_max, then compute dept_max - salary AS gap. No ORDER BY needed in the window because you want a single aggregate across the entire partition. An alternative: FIRST_VALUE(salary) OVER (PARTITION BY department ORDER BY salary DESC) gives the same result but requires ORDER BY.
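A minimal runnable version of the MAX-over-partition approach via sqlite3; the three-row `employees` table is invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (employee_id INTEGER, department TEXT, salary INTEGER);
INSERT INTO employees VALUES
  (1, 'eng', 90000), (2, 'eng', 80000), (3, 'sales', 60000);
""")

rows = conn.execute("""
SELECT employee_id, department, salary,
       -- no ORDER BY in the window: one aggregate over the whole partition,
       -- repeated on every row
       MAX(salary) OVER (PARTITION BY department) - salary AS gap_to_max
FROM employees
ORDER BY employee_id
""").fetchall()

print(rows)  # top earners in each department show a gap of 0
```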
Sessionize clickstream events with a 30-minute inactivity timeout.

Approach:
Step 1: LAG(event_timestamp) OVER (PARTITION BY user_id ORDER BY event_timestamp) to get the previous event time. Step 2: CASE WHEN the gap exceeds 30 minutes, flag it as a new session (1), else 0. Step 3: SUM() that flag as a window function (running sum) partitioned by user_id, ordered by event_timestamp. The running sum increments at each session boundary, producing a session_id. This three-step pattern is standard for clickstream sessionization and shows up in interviews at Airbnb, Spotify, and similar product-analytics companies.
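The three steps can be chained as CTEs. A sketch via sqlite3 — the `clickstream` data is invented, and SQLite's strftime('%s', ...) epoch arithmetic stands in for PostgreSQL interval comparison:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE clickstream (user_id INTEGER, event_ts TEXT);
INSERT INTO clickstream VALUES
  (1, '2024-01-01 10:00:00'),
  (1, '2024-01-01 10:10:00'),   -- 10 min gap: same session
  (1, '2024-01-01 11:00:00'),   -- 50 min gap: new session
  (1, '2024-01-01 11:05:00');
""")

rows = conn.execute("""
WITH gaps AS (
  -- Step 1: previous event time per user
  SELECT user_id, event_ts,
         LAG(event_ts) OVER (PARTITION BY user_id ORDER BY event_ts) AS prev_ts
  FROM clickstream
),
flags AS (
  -- Step 2: flag gaps over 30 minutes (1800 seconds) as session starts
  SELECT user_id, event_ts,
         CASE
           WHEN prev_ts IS NULL
                OR strftime('%s', event_ts) - strftime('%s', prev_ts) > 1800
           THEN 1 ELSE 0
         END AS new_session
  FROM gaps
)
-- Step 3: a running sum of the flag yields a session id
SELECT user_id, event_ts,
       SUM(new_session) OVER (
         PARTITION BY user_id ORDER BY event_ts
       ) AS session_id
FROM flags
ORDER BY event_ts
""").fetchall()

print([r[2] for r in rows])  # session ids: [1, 1, 2, 2]
```

The running sum only increments where the flag is 1, so every event between two boundaries inherits the same session_id.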
These four errors account for most window function bugs in interviews and production code.
RANK() OVER (PARTITION BY department) without ORDER BY is valid syntax in most engines, but every row gets rank 1 because the database has no criteria to differentiate them. This is never what you want. Always include ORDER BY when using ranking functions. ROW_NUMBER without ORDER BY produces nondeterministic results: the row numbering changes between executions.
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW counts exactly 2 physical rows back. RANGE BETWEEN 2 PRECEDING AND CURRENT ROW includes all rows whose ORDER BY value is within 2 of the current row's value. With dates, RANGE can pull in multiple rows if several share the same date. Most engines default to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW when you specify ORDER BY without an explicit frame. This default catches people because it is not the same as ROWS UNBOUNDED PRECEDING.
GROUP BY collapses rows: 1000 rows become 10 groups with 10 output rows. PARTITION BY does not collapse anything: 1000 rows in, 1000 rows out. The window function computes a value for each row based on its partition, but every original row survives. Candidates who mix these up try to SELECT non-aggregated columns alongside GROUP BY, get errors, and waste interview time debugging.
You cannot write WHERE ROW_NUMBER() OVER (...) = 1. Window functions execute after WHERE. To filter on a window function result, wrap the query in a CTE or subquery and filter in the outer query. This is such a common slip that some interviewers set it as a deliberate trap: they give you a requirement that requires filtering on a rank, and watch whether you reach for a CTE.
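Both the trap and the fix can be demonstrated in a few lines. A sketch via sqlite3 (which, like PostgreSQL, rejects window functions in WHERE); the `employees` data is invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (employee_id INTEGER, department TEXT, salary INTEGER);
INSERT INTO employees VALUES
  (1, 'eng', 90000), (2, 'eng', 80000), (3, 'sales', 60000);
""")

# The trap: window functions are evaluated after WHERE, so this errors out.
try:
    conn.execute("""
        SELECT * FROM employees
        WHERE ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) = 1
    """)
except sqlite3.Error as e:
    print("rejected:", e)

# The fix: compute the window function in a CTE, filter in the outer query.
rows = conn.execute("""
WITH ranked AS (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY department ORDER BY salary DESC
         ) AS rn
  FROM employees
)
SELECT employee_id, department, salary
FROM ranked
WHERE rn = 1
ORDER BY department
""").fetchall()

print(rows)  # top earner per department
```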
Reading OVER clauses won't save you when the interviewer asks for a dedup CTE. Run the queries yourself, break them, fix them, and see what the planner actually does with your frame spec.
Hands-on problems: ROW_NUMBER dedup, running totals, and sessionization with real SQL execution
Deep dive into ROW_NUMBER vs RANK vs DENSE_RANK with side-by-side output comparisons
Full guide to every SQL topic tested in data engineering interviews