Advanced SQL Cheat Sheet

This is the cheat sheet we keep pinned in the on-call channel. Every pattern here comes from a pipeline we've had to fix at 2 a.m., a PR that got bounced by a senior reviewer, or an interview one of us almost fumbled. Window functions, CTEs, recursive walks, LATERAL, GROUPING SETS, MERGE, and the four optimizer patterns that actually matter under load.

You won't find a gentle walk-through. You'll find the exact syntax, a pattern that works in production, and one line about the failure mode we've seen. Skim it, copy the parts you need, get back to work.

When you'll need this: 2 a.m.
SQL challenges in library: 854
Senior rounds that use window functions: 35%
Interview rounds that are SQL: 41%

Source: DataDriven analysis of 1,042 verified data engineering interview rounds.

Window Functions

Most broken dashboards we've debugged traced back to a window function that ran fine on sample data and melted down on 200M rows. The pattern is always the same: missing frame clause, wrong default, or a PARTITION BY key that wasn't selective enough. Here's what we reach for.

Ranking: ROW_NUMBER, RANK, DENSE_RANK

-- Top 1 product per category by revenue
SELECT *
FROM (
  SELECT
    category,
    product_name,
    revenue,
    ROW_NUMBER() OVER (
      PARTITION BY category
      ORDER BY revenue DESC
    ) AS rn
  FROM products
) ranked
WHERE rn = 1;

Use ROW_NUMBER for "exactly one per group" queries. Use RANK when ties matter and you want gaps. Use DENSE_RANK when ties matter and you don't want gaps.
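The difference is easiest to see side by side. A toy comparison (the inline VALUES list is made up for illustration):

```sql
-- Three scores, two of them tied: 100, 100, 90
SELECT
  score,
  ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num,    -- 1, 2, 3
  RANK()       OVER (ORDER BY score DESC) AS rnk,        -- 1, 1, 3
  DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rnk   -- 1, 1, 2
FROM (VALUES (100), (100), (90)) AS scores(score);
```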

LEAD and LAG

-- Month-over-month revenue change per product
SELECT
  product_id,
  month,
  revenue,
  LAG(revenue) OVER (
    PARTITION BY product_id
    ORDER BY month
  ) AS prev_month_revenue,
  revenue - LAG(revenue) OVER (
    PARTITION BY product_id
    ORDER BY month
  ) AS revenue_change
FROM monthly_revenue;

LAG looks at a previous row; LEAD looks at a following row. Both accept an offset parameter (default 1) and an optional default value for when no row exists.
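Both parameters in one query, assuming the same monthly_revenue table: look 12 rows back for year-over-year, and fall back to 0 when the series is too short.

```sql
-- LAG(expr, offset, default): offset 12, default 0
SELECT
  product_id,
  month,
  revenue,
  LAG(revenue, 12, 0) OVER (
    PARTITION BY product_id
    ORDER BY month
  ) AS revenue_12mo_ago
FROM monthly_revenue;
```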

Running Totals and Moving Averages

-- Running total of revenue
SELECT
  order_date,
  revenue,
  SUM(revenue) OVER (
    ORDER BY order_date
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS running_total
FROM daily_revenue;

-- 7-day moving average
SELECT
  order_date,
  revenue,
  AVG(revenue) OVER (
    ORDER BY order_date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS moving_avg_7d
FROM daily_revenue;

ROWS BETWEEN defines the frame: UNBOUNDED PRECEDING starts at the first row, and "6 PRECEDING AND CURRENT ROW" is a 7-row window. Note that ROWS counts physical rows, so if some dates are missing, the "7-day" average silently stretches across more than 7 calendar days. Use RANGE with an interval offset when the window should be defined by dates rather than row count.
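A RANGE version of the 7-day average over the same daily_revenue table. Interval frame offsets need PostgreSQL 11+; older versions only allow UNBOUNDED PRECEDING / CURRENT ROW with RANGE.

```sql
-- True 7-calendar-day window, regardless of missing dates
SELECT
  order_date,
  revenue,
  AVG(revenue) OVER (
    ORDER BY order_date
    RANGE BETWEEN INTERVAL '6 days' PRECEDING AND CURRENT ROW
  ) AS moving_avg_7d
FROM daily_revenue;
```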

NTILE for Percentiles and Bucketing

-- Divide customers into quartiles by total spend
SELECT
  customer_id,
  total_spend,
  NTILE(4) OVER (ORDER BY total_spend DESC) AS spend_quartile
FROM customer_totals;

NTILE(n) divides ordered rows into n roughly equal groups. If the row count isn't evenly divisible, earlier groups get one extra row. Use it for percentile bucketing, A/B test cohorts, and segmentation.

Common Table Expressions (CTEs)

CTEs make multi-step queries readable. In interviews, using CTEs shows the interviewer that you think in structured, debuggable steps.

Multi-step CTE

WITH monthly_revenue AS (
  SELECT
    customer_id,
    DATE_TRUNC('month', order_date) AS month,
    SUM(amount) AS revenue
  FROM orders
  GROUP BY customer_id, DATE_TRUNC('month', order_date)
),
revenue_with_change AS (
  SELECT
    customer_id,
    month,
    revenue,
    LAG(revenue) OVER (
      PARTITION BY customer_id ORDER BY month
    ) AS prev_revenue
  FROM monthly_revenue
)
SELECT
  customer_id,
  month,
  revenue,
  prev_revenue,
  ROUND(
    100.0 * (revenue - prev_revenue) / NULLIF(prev_revenue, 0), 1
  ) AS pct_change
FROM revenue_with_change
WHERE prev_revenue IS NOT NULL;

Chain CTEs to break complex logic into named steps. Each CTE can reference the ones before it. This pattern is the standard approach for multi-step interview queries.

Recursive CTE

-- Traverse an org chart: find all reports under a manager
WITH RECURSIVE org_tree AS (
  -- Base case: the target manager
  SELECT employee_id, name, manager_id, 0 AS depth
  FROM employees
  WHERE employee_id = 42

  UNION ALL

  -- Recursive step: find direct reports
  SELECT e.employee_id, e.name, e.manager_id, ot.depth + 1
  FROM employees e
  JOIN org_tree ot ON e.manager_id = ot.employee_id
  WHERE ot.depth < 10  -- safety limit
)
SELECT * FROM org_tree;

Recursive CTEs traverse hierarchies and graphs. The base case seeds the first rows; the recursive step joins back to the CTE. Always include a depth limit to prevent infinite loops in circular data.
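PostgreSQL 14+ can detect cycles directly with the CYCLE clause, which the depth limit only approximates. A sketch against the same employees table:

```sql
-- CYCLE marks rows that revisit an employee_id instead of looping forever
WITH RECURSIVE org_tree AS (
  SELECT employee_id, name, manager_id
  FROM employees
  WHERE employee_id = 42

  UNION ALL

  SELECT e.employee_id, e.name, e.manager_id
  FROM employees e
  JOIN org_tree ot ON e.manager_id = ot.employee_id
) CYCLE employee_id SET is_cycle USING path
SELECT * FROM org_tree WHERE NOT is_cycle;
```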

Generate Date Series (Recursive)

-- Generate all dates in a range (useful for filling gaps)
WITH RECURSIVE date_series AS (
  SELECT DATE '2025-01-01' AS dt
  UNION ALL
  SELECT (dt + INTERVAL '1 day')::date
  FROM date_series
  WHERE dt < DATE '2025-12-31'
)
SELECT ds.dt, COALESCE(r.revenue, 0) AS revenue
FROM date_series ds
LEFT JOIN daily_revenue r ON ds.dt = r.order_date;

Date spines fill gaps in time-series data. LEFT JOIN your actual data onto the generated dates so days with no activity show as zero rather than disappearing.
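In PostgreSQL you can skip the recursion entirely; generate_series builds the same spine in one call:

```sql
-- Same date spine without a recursive CTE (PostgreSQL)
SELECT gs::date AS dt
FROM generate_series(
  DATE '2025-01-01',
  DATE '2025-12-31',
  INTERVAL '1 day'
) AS gs;
```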

LATERAL Joins

LATERAL is what we reach for when "top N per group" needs a LIMIT and the window-function version keeps hitting a sort spill on the big partition. We saw a Redshift cluster cut a 14-minute query to 38 seconds by switching from a nested ROW_NUMBER to LATERAL with a LIMIT 5.

-- Top 3 orders per customer (cleaner than window + filter)
SELECT c.customer_id, c.name, top_orders.*
FROM customers c
CROSS JOIN LATERAL (
  SELECT order_id, amount, order_date
  FROM orders o
  WHERE o.customer_id = c.customer_id
  ORDER BY amount DESC
  LIMIT 3
) AS top_orders;

LATERAL is supported in PostgreSQL (9.3+) and MySQL (8.0.14+); Snowflake supports it too. SQL Server uses CROSS APPLY / OUTER APPLY for the same pattern, and in BigQuery a correlated join against UNNEST behaves laterally. Reach for LATERAL when you need "top N per group" with a LIMIT inside the subquery.
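The SQL Server spelling of the same query, using CROSS APPLY and TOP instead of LIMIT:

```sql
-- SQL Server: top 3 orders per customer
SELECT c.customer_id, c.name, top_orders.*
FROM customers c
CROSS APPLY (
  SELECT TOP 3 order_id, amount, order_date
  FROM orders o
  WHERE o.customer_id = c.customer_id
  ORDER BY amount DESC
) AS top_orders;
```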

GROUPING SETS, ROLLUP, CUBE

Generate multiple levels of aggregation in a single query. Useful for reports that need subtotals, grand totals, and cross-tabulations.

GROUPING SETS

-- Revenue by category, by region, and overall total
SELECT
  category,
  region,
  SUM(revenue) AS total_revenue
FROM sales
GROUP BY GROUPING SETS (
  (category),
  (region),
  ()  -- grand total
);

Each set in GROUPING SETS produces a separate aggregation level. NULLs in the output indicate the "all" level for that column. Use GROUPING(column) to distinguish real NULLs from aggregation NULLs.
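Labeling the subtotal rows with GROUPING, against the same sales table:

```sql
-- GROUPING(col) returns 1 when the column is aggregated away, 0 otherwise
SELECT
  CASE WHEN GROUPING(category) = 1 THEN 'ALL' ELSE category END AS category,
  CASE WHEN GROUPING(region)   = 1 THEN 'ALL' ELSE region   END AS region,
  SUM(revenue) AS total_revenue
FROM sales
GROUP BY GROUPING SETS ((category), (region), ());
```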

ROLLUP and CUBE

-- ROLLUP: hierarchical subtotals
-- (year, quarter, month), (year, quarter), (year), ()
SELECT year, quarter, month, SUM(revenue) AS total
FROM sales
GROUP BY ROLLUP (year, quarter, month);

-- CUBE: all combinations
-- Produces every combination of the grouped columns
SELECT category, region, SUM(revenue) AS total
FROM sales
GROUP BY CUBE (category, region);

ROLLUP produces hierarchical subtotals (right to left). CUBE produces all possible combinations. CUBE on 3 columns generates 2^3 = 8 grouping levels.

MERGE / UPSERT

Atomically insert, update, or delete rows based on whether a match exists. Critical for incremental loading in data pipelines.

MERGE (SQL Standard / Snowflake / SQL Server)

MERGE INTO target_table t
USING source_table s
ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET
    t.name = s.name,
    t.updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN
  INSERT (id, name, created_at)
  VALUES (s.id, s.name, CURRENT_TIMESTAMP);

MERGE is the standard SQL way to do upserts. Supported by Snowflake, BigQuery, SQL Server, Oracle, and PostgreSQL 15+. MySQL doesn't have MERGE; use INSERT ... ON DUPLICATE KEY UPDATE instead.

PostgreSQL: INSERT ON CONFLICT

INSERT INTO users (id, email, name)
VALUES (1, 'alice@example.com', 'Alice')
ON CONFLICT (id) DO UPDATE SET
  email = EXCLUDED.email,
  name = EXCLUDED.name,
  updated_at = CURRENT_TIMESTAMP;

EXCLUDED refers to the row that would have been inserted. Requires a unique constraint or index on the conflict column. Use DO NOTHING to skip duplicates without updating.
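The MySQL equivalent of the same upsert. In 8.0.20+ the old VALUES() function is deprecated in favor of a row alias:

```sql
-- MySQL 8.0.20+: alias the incoming row instead of VALUES()
INSERT INTO users (id, email, name)
VALUES (1, 'alice@example.com', 'Alice') AS new_row
ON DUPLICATE KEY UPDATE
  email = new_row.email,
  name = new_row.name;
```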

Performance Patterns

Common optimization patterns to discuss in interviews when asked "how would you make this fast?"

Partition pruning

-- Partition by date, then filter by date range
-- The engine reads only the relevant partitions
SELECT * FROM events
WHERE event_date BETWEEN '2025-01-01' AND '2025-01-31';

-- Create a partitioned table (PostgreSQL)
CREATE TABLE events (
  event_id BIGINT,
  event_date DATE,
  payload JSONB
) PARTITION BY RANGE (event_date);

Partitioning by date is the single most common optimization in DE. If a table has billions of rows and most queries filter by date, partition by date.
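In PostgreSQL the parent table holds no data; you have to create the partitions themselves, or inserts fail. Table names here are illustrative:

```sql
-- One partition per month; inserts route to the right child automatically
CREATE TABLE events_2025_01 PARTITION OF events
  FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');

CREATE TABLE events_2025_02 PARTITION OF events
  FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');
```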

Index-only scans

-- Covering index: all columns the query needs are in the index
CREATE INDEX idx_orders_customer_date
ON orders (customer_id, order_date)
INCLUDE (amount);

-- This query can be served entirely from the index
SELECT customer_id, order_date, amount
FROM orders
WHERE customer_id = 42
  AND order_date > '2025-01-01';

A covering index avoids reading the table entirely. Useful for hot queries on large tables. Trade-off: more storage and slower writes.

Materialized views for repeated aggregations

-- Pre-compute daily revenue
CREATE MATERIALIZED VIEW daily_revenue AS
SELECT
  DATE_TRUNC('day', order_date) AS day,
  product_id,
  SUM(amount) AS revenue,
  COUNT(*) AS order_count
FROM orders
GROUP BY 1, 2;

-- Refresh on a schedule
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_revenue;

Materialized views trade freshness for speed. Refresh them on a schedule (hourly or daily) for dashboards and reports that don't need real-time data. The CONCURRENTLY option lets reads continue during the refresh, but it requires a unique index on the view.
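Pair the view with a unique index covering every row, or the concurrent refresh errors out. The index name is illustrative:

```sql
-- Required before REFRESH MATERIALIZED VIEW CONCURRENTLY will run
CREATE UNIQUE INDEX idx_daily_revenue_day_product
  ON daily_revenue (day, product_id);
```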

Avoid SELECT * on wide tables

-- Bad: reads all 200 columns
SELECT * FROM wide_table WHERE id = 42;

-- Good: reads only what you need
SELECT id, name, created_at FROM wide_table WHERE id = 42;

On columnar storage (Redshift, BigQuery, Snowflake), each column is stored separately. SELECT * reads every column's storage. Selecting only needed columns can reduce I/O by 10x or more.

Frequently Asked Questions

What's the difference between ROW_NUMBER, RANK, and DENSE_RANK?

ROW_NUMBER assigns a unique sequential integer to every row within a partition, with no gaps and no ties. RANK assigns the same number to tied rows but leaves gaps (1, 1, 3). DENSE_RANK assigns the same number to tied rows without gaps (1, 1, 2). Use ROW_NUMBER when you need exactly one row per group (like 'top 1 per category'). Use RANK when you want to preserve the total count (the third-ranked item is rank 3 even if two items are tied for first). Use DENSE_RANK when you want consecutive ranking without gaps.

When should I use a CTE vs. a subquery?

Use a CTE when the derived result is referenced more than once in the query, when the query has multiple logical steps that benefit from named intermediate results, or when readability is important (which is always in interviews). Use a subquery for simple, one-off derivations that don't need a name. In practice, modern query optimizers treat CTEs and subqueries identically in most cases. The choice is about readability and maintainability, not performance.

How do recursive CTEs work?

A recursive CTE has two parts: a base case (the anchor member) and a recursive step that references the CTE itself. The database executes the base case first, then repeatedly executes the recursive step, feeding the output of each iteration back as input to the next, until no new rows are produced. Common uses: traversing hierarchies (org charts, category trees), generating date series, and graph traversal. Always include a termination condition (like a max depth) to prevent infinite loops.

The pager stops waking you up

Write these patterns enough and you'll stop getting called in for the same class of bugs. Interview rounds get easier as a side effect.