Advanced SQL Cheat Sheet
The cheat sheet we keep pinned in the on-call channel. Every pattern here comes from a pipeline we've had to fix at 2am, a PR that got bounced by a senior reviewer, or an interview one of us almost fumbled.
What this cheat sheet covers
Every pattern here comes from a pipeline we've had to fix at 2am, a PR that got bounced by a senior reviewer, or an interview one of us almost fumbled. Window functions, CTEs, recursive walks, LATERAL, GROUPING SETS, MERGE, and the four optimizer patterns that actually matter under load.
Window functions
Most broken dashboards we've debugged traced back to a window function that ran fine on sample data and melted down on 200M rows. Missing frame clause, wrong default, or a PARTITION BY key that wasn't selective enough.
Know advanced SQL the way the interviewer who asks it knows it.
Repeat Purchase Window
The retention squad is looking for repeat purchasers.
Pulled from debriefs where SQL was the gate.
Ranking: ROW_NUMBER, RANK, DENSE_RANK
-- Top 1 product per category by revenue
SELECT *
FROM (
SELECT
category,
product_name,
revenue,
ROW_NUMBER() OVER (
PARTITION BY category
ORDER BY revenue DESC
) AS rn
FROM products
) ranked
WHERE rn = 1;Use ROW_NUMBER for 'exactly one per group' queries. RANK when ties matter and you want gaps. DENSE_RANK when ties matter and you don't want gaps.
Common table expressions (CTEs)
CTEs make multi-step queries readable. In interviews, using CTEs shows the interviewer that you think in structured, debuggable steps.
LAG and LEAD
-- Month-over-month revenue change per product
SELECT
product_id,
month,
revenue,
LAG(revenue) OVER (
PARTITION BY product_id
ORDER BY month
) AS prev_month_revenue,
revenue - LAG(revenue) OVER (
PARTITION BY product_id
ORDER BY month
) AS revenue_change
FROM monthly_revenue;LAG looks at a previous row; LEAD at a following row. Both accept an offset parameter (default 1) and an optional default value for when no row exists.
Running totals and moving averages
-- Running total of revenue
SELECT
order_date,
revenue,
SUM(revenue) OVER (
ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS running_total
FROM daily_revenue;
-- 7-day moving average
SELECT
order_date,
revenue,
AVG(revenue) OVER (
ORDER BY order_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS moving_avg_7d
FROM daily_revenue;ROWS BETWEEN defines the frame. UNBOUNDED PRECEDING starts from the first row. '6 PRECEDING AND CURRENT ROW' creates a 7-row window. Use RANGE instead of ROWS when you want date-based windows that handle missing dates correctly.
NTILE for percentiles and bucketing
-- Divide customers into quartiles by total spend
SELECT
customer_id,
total_spend,
NTILE(4) OVER (ORDER BY total_spend DESC) AS spend_quartile
FROM customer_totals;NTILE(n) divides ordered rows into n roughly equal groups. If the row count isn't evenly divisible, earlier groups get one extra row. Use for percentile bucketing, A/B test cohorts, segmentation.
Multi-step CTE
WITH monthly_revenue AS (
SELECT
customer_id,
DATE_TRUNC('month', order_date) AS month,
SUM(amount) AS revenue
FROM orders
GROUP BY customer_id, DATE_TRUNC('month', order_date)
),
revenue_with_change AS (
SELECT
customer_id,
month,
revenue,
LAG(revenue) OVER (
PARTITION BY customer_id ORDER BY month
) AS prev_revenue
FROM monthly_revenue
)
SELECT
customer_id,
month,
revenue,
prev_revenue,
ROUND(
(revenue - prev_revenue) / NULLIF(prev_revenue, 0) * 100, 1
) AS pct_change
FROM revenue_with_change
WHERE prev_revenue IS NOT NULL;Chain CTEs to break complex logic into named steps. Each CTE can reference the ones before it. The standard approach for multi-step interview queries.
Recursive CTE
-- Traverse an org chart: find all reports under a manager
WITH RECURSIVE org_tree AS (
-- Base case: the target manager
SELECT employee_id, name, manager_id, 0 AS depth
FROM employees
WHERE employee_id = 42
UNION ALL
-- Recursive step: find direct reports
SELECT e.employee_id, e.name, e.manager_id, ot.depth + 1
FROM employees e
JOIN org_tree ot ON e.manager_id = ot.employee_id
WHERE ot.depth < 10 -- safety limit
)
SELECT * FROM org_tree;Recursive CTEs traverse hierarchies and graphs. The base case seeds the first rows; the recursive step joins back to the CTE. Always include a depth limit to prevent infinite loops in circular data.
Generate date series (recursive)
-- Generate all dates in a range (fill gaps)
WITH RECURSIVE date_series AS (
SELECT DATE '2025-01-01' AS dt
UNION ALL
SELECT dt + INTERVAL '1 day'
FROM date_series
WHERE dt < DATE '2025-12-31'
)
SELECT ds.dt, COALESCE(r.revenue, 0) AS revenue
FROM date_series ds
LEFT JOIN daily_revenue r ON ds.dt = r.order_date;Date spines fill gaps in time-series data. LEFT JOIN your actual data onto the generated dates so days with no activity show as zero rather than disappearing.
LATERAL joins
-- Top 3 orders per customer (cleaner than window + filter)
SELECT c.customer_id, c.name, top_orders.*
FROM customers c
CROSS JOIN LATERAL (
SELECT order_id, amount, order_date
FROM orders o
WHERE o.customer_id = c.customer_id
ORDER BY amount DESC
LIMIT 3
) AS top_orders;LATERAL is what we reach for when 'top N per group' needs a LIMIT and the window-function version keeps hitting a sort spill on the big partition. Supported in PostgreSQL, BigQuery, MySQL 8+. SQL Server uses CROSS APPLY / OUTER APPLY for the same pattern.
GROUPING SETS
-- Revenue by category, by region, and overall total
SELECT
category,
region,
SUM(revenue) AS total_revenue
FROM sales
GROUP BY GROUPING SETS (
(category),
(region),
() -- grand total
);Each set produces a separate aggregation level. NULLs in output indicate the 'all' level for that column. Use GROUPING(column) to distinguish real NULLs from aggregation NULLs.
ROLLUP and CUBE
-- ROLLUP: hierarchical subtotals
-- (year, quarter, month), (year, quarter), (year), ()
SELECT year, quarter, month, SUM(revenue) AS total
FROM sales
GROUP BY ROLLUP (year, quarter, month);
-- CUBE: all combinations
-- Produces every combination of the grouped columns
SELECT category, region, SUM(revenue) AS total
FROM sales
GROUP BY CUBE (category, region);ROLLUP produces hierarchical subtotals (right to left). CUBE produces all possible combinations. CUBE on 3 columns generates 2^3 = 8 grouping levels.
MERGE / UPSERT (Snowflake, BigQuery, SQL Server, Oracle)
MERGE INTO target_table t
USING source_table s
ON t.id = s.id
WHEN MATCHED THEN
UPDATE SET
t.name = s.name,
t.updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN
INSERT (id, name, created_at)
VALUES (s.id, s.name, CURRENT_TIMESTAMP);MERGE is the standard SQL way to do upserts. Not natively supported in older PostgreSQL or MySQL.
PostgreSQL INSERT ON CONFLICT
INSERT INTO users (id, email, name)
VALUES (1, 'alice@example.com', 'Alice')
ON CONFLICT (id) DO UPDATE SET
email = EXCLUDED.email,
name = EXCLUDED.name,
updated_at = CURRENT_TIMESTAMP;EXCLUDED refers to the row that would have been inserted. Requires a unique constraint or index on the conflict column. Use DO NOTHING to skip duplicates without updating.
Performance patterns to discuss in interviews
Common optimization patterns when asked 'how would you make this fast?'
Partition pruning
Partition by date, then filter by date range. The engine reads only the relevant partitions. Partitioning by date is the single most common optimization in DE — if a table has billions of rows and most queries filter by date, partition by date.
Index-only scans (covering index)
All columns the query needs are in the index. Avoids reading the table entirely. Useful for hot queries on large tables. Trade-off: more storage, slower writes.
Materialized views for repeated aggregations
Pre-compute daily revenue or other expensive aggregations. Refresh on a schedule (hourly or daily) for dashboards and reports that don't need real-time data. Trade freshness for speed.
Avoid SELECT * on wide tables
On columnar storage (Redshift, BigQuery, Snowflake), each column is stored separately. SELECT * reads every column's storage. Selecting only needed columns can reduce I/O by 10x or more.
Frequently asked questions
Difference between ROW_NUMBER, RANK, and DENSE_RANK?+
When should I use a CTE vs a subquery?+
How do recursive CTEs work?+
The pager stops waking you up
- 01
Active recall beats re-reading by 50%
Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom
- 02
76% of hiring managers reject on the coding task, not the resume
From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice
- 03
Five problem shapes cover 80% of data engineer loops
Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition