Advanced SQL Cheat Sheet

This is the cheat sheet we keep pinned in the on-call channel. Every pattern here comes from a pipeline we've had to fix at 2 a.m., a PR that got bounced by a senior reviewer, or an interview one of us almost fumbled. Window functions, CTEs, recursive walks, LATERAL, GROUPING SETS, MERGE, and the four optimizer patterns that actually matter under load.

You won't find a gentle walk-through. You'll find the exact syntax, a pattern that works in production, and one line about the failure mode we've seen. Skim it, copy the parts you need, get back to work.

When you'll need this: 2 a.m.
SQL challenges in library: 854
Senior rounds that use window functions: 35%
Interview rounds that are SQL: 41%

Source: DataDriven analysis of 1,042 verified data engineering interview rounds.

Window Functions

Most broken dashboards we've debugged traced back to a window function that ran fine on sample data and melted down on 200M rows. The pattern is always the same: missing frame clause, wrong default, or a PARTITION BY key that wasn't selective enough. Here's what we reach for.

Ranking: ROW_NUMBER, RANK, DENSE_RANK

-- Top 1 product per category by revenue
SELECT *
FROM (
  SELECT
    category,
    product_name,
    revenue,
    ROW_NUMBER() OVER (
      PARTITION BY category
      ORDER BY revenue DESC
    ) AS rn
  FROM products
) ranked
WHERE rn = 1;

Use ROW_NUMBER for "exactly one per group" queries. Use RANK when ties matter and you want gaps. Use DENSE_RANK when ties matter and you don't want gaps.
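The difference is easiest to see side by side. A toy comparison (the inline VALUES list is made up for illustration):

```sql
-- Three scores, two of them tied: 100, 100, 90
SELECT
  score,
  ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num,    -- 1, 2, 3
  RANK()       OVER (ORDER BY score DESC) AS rnk,        -- 1, 1, 3
  DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rnk   -- 1, 1, 2
FROM (VALUES (100), (100), (90)) AS scores(score);
```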

LEAD and LAG

-- Month-over-month revenue change per product
SELECT
  product_id,
  month,
  revenue,
  LAG(revenue) OVER (
    PARTITION BY product_id
    ORDER BY month
  ) AS prev_month_revenue,
  revenue - LAG(revenue) OVER (
    PARTITION BY product_id
    ORDER BY month
  ) AS revenue_change
FROM monthly_revenue;

LAG looks at a previous row; LEAD looks at a following row. Both accept an offset parameter (default 1) and an optional default value for when no row exists.
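Both parameters in one query, assuming the same monthly_revenue table: look 12 rows back for year-over-year, and fall back to 0 when the series is too short.

```sql
-- LAG(expr, offset, default): offset 12, default 0
SELECT
  product_id,
  month,
  revenue,
  LAG(revenue, 12, 0) OVER (
    PARTITION BY product_id
    ORDER BY month
  ) AS revenue_12mo_ago
FROM monthly_revenue;
```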

Running Totals and Moving Averages

-- Running total of revenue
SELECT
  order_date,
  revenue,
  SUM(revenue) OVER (
    ORDER BY order_date
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS running_total
FROM daily_revenue;

-- 7-day moving average
SELECT
  order_date,
  revenue,
  AVG(revenue) OVER (
    ORDER BY order_date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS moving_avg_7d
FROM daily_revenue;

ROWS BETWEEN defines the frame: UNBOUNDED PRECEDING starts at the first row, and "6 PRECEDING AND CURRENT ROW" is a 7-row window. Note that ROWS counts physical rows, so if some dates are missing, the "7-day" average silently stretches across more than 7 calendar days. Use RANGE with an interval offset when the window should be defined by dates rather than row count.
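A RANGE version of the 7-day average over the same daily_revenue table. Interval frame offsets need PostgreSQL 11+; older versions only allow UNBOUNDED PRECEDING / CURRENT ROW with RANGE.

```sql
-- True 7-calendar-day window, regardless of missing dates
SELECT
  order_date,
  revenue,
  AVG(revenue) OVER (
    ORDER BY order_date
    RANGE BETWEEN INTERVAL '6 days' PRECEDING AND CURRENT ROW
  ) AS moving_avg_7d
FROM daily_revenue;
```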

NTILE for Percentiles and Bucketing

-- Divide customers into quartiles by total spend
SELECT
  customer_id,
  total_spend,
  NTILE(4) OVER (ORDER BY total_spend DESC) AS spend_quartile
FROM customer_totals;

NTILE(n) divides ordered rows into n roughly equal groups. If the row count isn't evenly divisible, earlier groups get one extra row. Use it for percentile bucketing, A/B test cohorts, and segmentation.

Common Table Expressions (CTEs)

CTEs make multi-step queries readable. In interviews, using CTEs shows the interviewer that you think in structured, debuggable steps.

Multi-step CTE

WITH monthly_revenue AS (
  SELECT
    customer_id,
    DATE_TRUNC('month', order_date) AS month,
    SUM(amount) AS revenue
  FROM orders
  GROUP BY customer_id, DATE_TRUNC('month', order_date)
),
revenue_with_change AS (
  SELECT
    customer_id,
    month,
    revenue,
    LAG(revenue) OVER (
      PARTITION BY customer_id ORDER BY month
    ) AS prev_revenue
  FROM monthly_revenue
)
SELECT
  customer_id,
  month,
  revenue,
  prev_revenue,
  ROUND(
    100.0 * (revenue - prev_revenue) / NULLIF(prev_revenue, 0), 1
  ) AS pct_change
FROM revenue_with_change
WHERE prev_revenue IS NOT NULL;

Chain CTEs to break complex logic into named steps. Each CTE can reference the ones before it. This pattern is the standard approach for multi-step interview queries.

Recursive CTE

-- Traverse an org chart: find all reports under a manager
WITH RECURSIVE org_tree AS (
  -- Base case: the target manager
  SELECT employee_id, name, manager_id, 0 AS depth
  FROM employees
  WHERE employee_id = 42

  UNION ALL

  -- Recursive step: find direct reports
  SELECT e.employee_id, e.name, e.manager_id, ot.depth + 1
  FROM employees e
  JOIN org_tree ot ON e.manager_id = ot.employee_id
  WHERE ot.depth < 10  -- safety limit
)
SELECT * FROM org_tree;

Recursive CTEs traverse hierarchies and graphs. The base case seeds the first rows; the recursive step joins back to the CTE. Always include a depth limit to prevent infinite loops in circular data.
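PostgreSQL 14+ can detect cycles directly with the CYCLE clause, which the depth limit only approximates. A sketch against the same employees table:

```sql
-- CYCLE marks rows that revisit an employee_id instead of looping forever
WITH RECURSIVE org_tree AS (
  SELECT employee_id, name, manager_id
  FROM employees
  WHERE employee_id = 42

  UNION ALL

  SELECT e.employee_id, e.name, e.manager_id
  FROM employees e
  JOIN org_tree ot ON e.manager_id = ot.employee_id
) CYCLE employee_id SET is_cycle USING path
SELECT * FROM org_tree WHERE NOT is_cycle;
```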

Generate Date Series (Recursive)

-- Generate all dates in a range (useful for filling gaps)
WITH RECURSIVE date_series AS (
  SELECT DATE '2025-01-01' AS dt
  UNION ALL
  SELECT (dt + INTERVAL '1 day')::date
  FROM date_series
  WHERE dt < DATE '2025-12-31'
)
SELECT ds.dt, COALESCE(r.revenue, 0) AS revenue
FROM date_series ds
LEFT JOIN daily_revenue r ON ds.dt = r.order_date;

Date spines fill gaps in time-series data. LEFT JOIN your actual data onto the generated dates so days with no activity show as zero rather than disappearing.
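In PostgreSQL you can skip the recursion entirely; generate_series builds the same spine in one call:

```sql
-- Same date spine without a recursive CTE (PostgreSQL)
SELECT gs::date AS dt
FROM generate_series(
  DATE '2025-01-01',
  DATE '2025-12-31',
  INTERVAL '1 day'
) AS gs;
```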

LATERAL Joins

LATERAL is what we reach for when "top N per group" needs a LIMIT and the window-function version keeps hitting a sort spill on the big partition. We saw a Redshift cluster cut a 14-minute query to 38 seconds by switching from a nested ROW_NUMBER to LATERAL with a LIMIT 5.

-- Top 3 orders per customer (cleaner than window + filter)
SELECT c.customer_id, c.name, top_orders.*
FROM customers c
CROSS JOIN LATERAL (
  SELECT order_id, amount, order_date
  FROM orders o
  WHERE o.customer_id = c.customer_id
  ORDER BY amount DESC
  LIMIT 3
) AS top_orders;

LATERAL is supported in PostgreSQL (9.3+) and MySQL (8.0.14+); Snowflake supports it too. SQL Server uses CROSS APPLY / OUTER APPLY for the same pattern, and in BigQuery a correlated join against UNNEST behaves laterally. Reach for LATERAL when you need "top N per group" with a LIMIT inside the subquery.
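The SQL Server spelling of the same query, using CROSS APPLY and TOP instead of LIMIT:

```sql
-- SQL Server: top 3 orders per customer
SELECT c.customer_id, c.name, top_orders.*
FROM customers c
CROSS APPLY (
  SELECT TOP 3 order_id, amount, order_date
  FROM orders o
  WHERE o.customer_id = c.customer_id
  ORDER BY amount DESC
) AS top_orders;
```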

GROUPING SETS, ROLLUP, CUBE

Generate multiple levels of aggregation in a single query. Useful for reports that need subtotals, grand totals, and cross-tabulations.

GROUPING SETS

-- Revenue by category, by region, and overall total
SELECT
  category,
  region,
  SUM(revenue) AS total_revenue
FROM sales
GROUP BY GROUPING SETS (
  (category),
  (region),
  ()  -- grand total
);

Each set in GROUPING SETS produces a separate aggregation level. NULLs in the output indicate the "all" level for that column. Use GROUPING(column) to distinguish real NULLs from aggregation NULLs.
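Labeling the subtotal rows with GROUPING, against the same sales table:

```sql
-- GROUPING(col) returns 1 when the column is aggregated away, 0 otherwise
SELECT
  CASE WHEN GROUPING(category) = 1 THEN 'ALL' ELSE category END AS category,
  CASE WHEN GROUPING(region)   = 1 THEN 'ALL' ELSE region   END AS region,
  SUM(revenue) AS total_revenue
FROM sales
GROUP BY GROUPING SETS ((category), (region), ());
```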

ROLLUP and CUBE

-- ROLLUP: hierarchical subtotals
-- (year, quarter, month), (year, quarter), (year), ()
SELECT year, quarter, month, SUM(revenue) AS total
FROM sales
GROUP BY ROLLUP (year, quarter, month);

-- CUBE: all combinations
-- Produces every combination of the grouped columns
SELECT category, region, SUM(revenue) AS total
FROM sales
GROUP BY CUBE (category, region);

ROLLUP produces hierarchical subtotals (right to left). CUBE produces all possible combinations. CUBE on 3 columns generates 2^3 = 8 grouping levels.

MERGE / UPSERT

Atomically insert, update, or delete rows based on whether a match exists. Critical for incremental loading in data pipelines.

MERGE (SQL Standard / Snowflake / SQL Server)

MERGE INTO target_table t
USING source_table s
ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET
    t.name = s.name,
    t.updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN
  INSERT (id, name, created_at)
  VALUES (s.id, s.name, CURRENT_TIMESTAMP);

MERGE is the standard SQL way to do upserts. Supported by Snowflake, BigQuery, SQL Server, Oracle, and PostgreSQL 15+. MySQL doesn't have MERGE; use INSERT ... ON DUPLICATE KEY UPDATE instead.

PostgreSQL: INSERT ON CONFLICT

INSERT INTO users (id, email, name)
VALUES (1, 'alice@example.com', 'Alice')
ON CONFLICT (id) DO UPDATE SET
  email = EXCLUDED.email,
  name = EXCLUDED.name,
  updated_at = CURRENT_TIMESTAMP;

EXCLUDED refers to the row that would have been inserted. Requires a unique constraint or index on the conflict column. Use DO NOTHING to skip duplicates without updating.
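The MySQL equivalent of the same upsert. In 8.0.20+ the old VALUES() function is deprecated in favor of a row alias:

```sql
-- MySQL 8.0.20+: alias the incoming row instead of VALUES()
INSERT INTO users (id, email, name)
VALUES (1, 'alice@example.com', 'Alice') AS new_row
ON DUPLICATE KEY UPDATE
  email = new_row.email,
  name = new_row.name;
```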

Performance Patterns

Common optimization patterns to discuss in interviews when asked "how would you make this fast?"

Partition pruning

-- Partition by date, then filter by date range
-- The engine reads only the relevant partitions
SELECT * FROM events
WHERE event_date BETWEEN '2025-01-01' AND '2025-01-31';

-- Create a partitioned table (PostgreSQL)
CREATE TABLE events (
  event_id BIGINT,
  event_date DATE,
  payload JSONB
) PARTITION BY RANGE (event_date);

Partitioning by date is the single most common optimization in DE. If a table has billions of rows and most queries filter by date, partition by date.
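In PostgreSQL the parent table holds no data; you have to create the partitions themselves, or inserts fail. Table names here are illustrative:

```sql
-- One partition per month; inserts route to the right child automatically
CREATE TABLE events_2025_01 PARTITION OF events
  FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');

CREATE TABLE events_2025_02 PARTITION OF events
  FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');
```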

Index-only scans

-- Covering index: all columns the query needs are in the index
CREATE INDEX idx_orders_customer_date
ON orders (customer_id, order_date)
INCLUDE (amount);

-- This query can be served entirely from the index
SELECT customer_id, order_date, amount
FROM orders
WHERE customer_id = 42
  AND order_date > '2025-01-01';

A covering index avoids reading the table entirely. Useful for hot queries on large tables. Trade-off: more storage and slower writes.

Materialized views for repeated aggregations

-- Pre-compute daily revenue
CREATE MATERIALIZED VIEW daily_revenue AS
SELECT
  DATE_TRUNC('day', order_date) AS day,
  product_id,
  SUM(amount) AS revenue,
  COUNT(*) AS order_count
FROM orders
GROUP BY 1, 2;

-- Refresh on a schedule
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_revenue;

Materialized views trade freshness for speed. Refresh them on a schedule (hourly or daily) for dashboards and reports that don't need real-time data. The CONCURRENTLY option lets reads continue during the refresh, but it requires a unique index on the view.
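Pair the view with a unique index covering every row, or the concurrent refresh errors out. The index name is illustrative:

```sql
-- Required before REFRESH MATERIALIZED VIEW CONCURRENTLY will run
CREATE UNIQUE INDEX idx_daily_revenue_day_product
  ON daily_revenue (day, product_id);
```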

Avoid SELECT * on wide tables

-- Bad: reads all 200 columns
SELECT * FROM wide_table WHERE id = 42;

-- Good: reads only what you need
SELECT id, name, created_at FROM wide_table WHERE id = 42;

On columnar storage (Redshift, BigQuery, Snowflake), each column is stored separately. SELECT * reads every column's storage. Selecting only needed columns can reduce I/O by 10x or more.

Frequently Asked Questions

What's the difference between ROW_NUMBER, RANK, and DENSE_RANK?

ROW_NUMBER assigns a unique sequential integer to every row within a partition, with no gaps and no ties. RANK assigns the same number to tied rows but leaves gaps (1, 1, 3). DENSE_RANK assigns the same number to tied rows without gaps (1, 1, 2). Use ROW_NUMBER when you need exactly one row per group (like 'top 1 per category'). Use RANK when you want to preserve the total count (the third-ranked item is rank 3 even if two items are tied for first). Use DENSE_RANK when you want consecutive ranking without gaps.

When should I use a CTE vs. a subquery?

Use a CTE when the derived result is referenced more than once in the query, when the query has multiple logical steps that benefit from named intermediate results, or when readability is important (which is always in interviews). Use a subquery for simple, one-off derivations that don't need a name. In practice, modern query optimizers treat CTEs and subqueries identically in most cases. The choice is about readability and maintainability, not performance.

How do recursive CTEs work?

A recursive CTE has two parts: a base case (the anchor member) and a recursive step that references the CTE itself. The database executes the base case first, then repeatedly executes the recursive step, feeding the output of each iteration back as input to the next, until no new rows are produced. Common uses: traversing hierarchies (org charts, category trees), generating date series, and graph traversal. Always include a termination condition (like a max depth) to prevent infinite loops.

The pager stops waking you up

Write these patterns enough and you'll stop getting called in for the same class of bugs. Interview rounds get easier as a side effect.