This is the cheat sheet we keep pinned in the on-call channel. Every pattern here comes from a pipeline we've had to fix at 2 a.m., a PR that got bounced by a senior reviewer, or an interview one of us almost fumbled. Window functions, CTEs, recursive walks, LATERAL, GROUPING SETS, MERGE, and the four optimizer patterns that actually matter under load.
You won't find a gentle walk-through. You'll find the exact syntax, a pattern that works in production, and one line about the failure mode we've seen. Skim it, copy the parts you need, get back to work.
When you'll need this
- SQL challenges in the library
- Senior rounds use window functions
- Interview rounds are SQL-heavy
Source: DataDriven analysis of 1,042 verified data engineering interview rounds.
Most broken dashboards we've debugged traced back to a window function that ran fine on sample data and melted down on 200M rows. The pattern is always the same: missing frame clause, wrong default, or a PARTITION BY key that wasn't selective enough. Here's what we reach for.
-- Top 1 product per category by revenue
SELECT *
FROM (
SELECT
category,
product_name,
revenue,
ROW_NUMBER() OVER (
PARTITION BY category
ORDER BY revenue DESC
) AS rn
FROM products
) ranked
WHERE rn = 1;

Use ROW_NUMBER for "exactly one per group" queries. Use RANK when ties matter and you want gaps. Use DENSE_RANK when ties matter and you don't want gaps.
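To see how the three ranking functions differ on ties, here's a sketch against the same products table — the inline comment shows the ranks a tied revenue column would produce:

```sql
-- On revenues 500, 500, 300:
--   ROW_NUMBER: 1, 2, 3   (arbitrary order within the tie)
--   RANK:       1, 1, 3   (ties share a rank, next rank skips)
--   DENSE_RANK: 1, 1, 2   (ties share a rank, no gap)
SELECT
  product_name,
  revenue,
  ROW_NUMBER() OVER (ORDER BY revenue DESC) AS row_num,
  RANK()       OVER (ORDER BY revenue DESC) AS rnk,
  DENSE_RANK() OVER (ORDER BY revenue DESC) AS dense_rnk
FROM products;
```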
-- Month-over-month revenue change per product
SELECT
product_id,
month,
revenue,
LAG(revenue) OVER (
PARTITION BY product_id
ORDER BY month
) AS prev_month_revenue,
revenue - LAG(revenue) OVER (
PARTITION BY product_id
ORDER BY month
) AS revenue_change
FROM monthly_revenue;

LAG looks at a previous row; LEAD looks at a following row. Both accept an offset parameter (default 1) and an optional default value for when no row exists.
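Both optional parameters in one sketch, against the same monthly_revenue table — the offset jumps 12 rows back, and the default swaps NULL for 0 when a product has less than a year of history:

```sql
SELECT
  product_id,
  month,
  revenue,
  -- offset 12 = same month last year; default 0 when no such row exists
  LAG(revenue, 12, 0) OVER (
    PARTITION BY product_id
    ORDER BY month
  ) AS revenue_year_ago
FROM monthly_revenue;
```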
-- Running total of revenue
SELECT
order_date,
revenue,
SUM(revenue) OVER (
ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS running_total
FROM daily_revenue;
-- 7-day moving average
SELECT
order_date,
revenue,
AVG(revenue) OVER (
ORDER BY order_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS moving_avg_7d
FROM daily_revenue;

ROWS BETWEEN defines the frame. UNBOUNDED PRECEDING starts from the first row. "6 PRECEDING AND CURRENT ROW" creates a 7-row window. Use RANGE instead of ROWS when you want date-based windows that handle missing dates correctly.
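The RANGE version of the 7-day average looks like this (PostgreSQL 11+ syntax for interval frame offsets; same daily_revenue table). Unlike the ROWS version, it spans seven calendar days even when some dates are missing from the table:

```sql
SELECT
  order_date,
  revenue,
  AVG(revenue) OVER (
    ORDER BY order_date
    -- frame is defined by date distance, not row count
    RANGE BETWEEN INTERVAL '6 days' PRECEDING AND CURRENT ROW
  ) AS moving_avg_7d
FROM daily_revenue;
```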
-- Divide customers into quartiles by total spend
SELECT
  customer_id,
  total_spend,
  NTILE(4) OVER (ORDER BY total_spend DESC) AS spend_quartile
FROM customer_totals;
NTILE(n) divides ordered rows into n roughly equal groups. If the row count isn't evenly divisible, earlier groups get one extra row. Use it for percentile bucketing, A/B test cohorts, and segmentation.
CTEs make multi-step queries readable. In interviews, using CTEs shows the interviewer that you think in structured, debuggable steps.
WITH monthly_revenue AS (
SELECT
customer_id,
DATE_TRUNC('month', order_date) AS month,
SUM(amount) AS revenue
FROM orders
GROUP BY customer_id, DATE_TRUNC('month', order_date)
),
revenue_with_change AS (
SELECT
customer_id,
month,
revenue,
LAG(revenue) OVER (
PARTITION BY customer_id ORDER BY month
) AS prev_revenue
FROM monthly_revenue
)
SELECT
customer_id,
month,
revenue,
prev_revenue,
ROUND(
(revenue - prev_revenue) / NULLIF(prev_revenue, 0) * 100, 1
) AS pct_change
FROM revenue_with_change
WHERE prev_revenue IS NOT NULL;

Chain CTEs to break complex logic into named steps. Each CTE can reference the ones before it. This pattern is the standard approach for multi-step interview queries.
-- Traverse an org chart: find all reports under a manager
WITH RECURSIVE org_tree AS (
  -- Base case: the target manager
  SELECT employee_id, name, manager_id, 0 AS depth
  FROM employees
  WHERE employee_id = 42

  UNION ALL

  -- Recursive step: find direct reports
  SELECT e.employee_id, e.name, e.manager_id, ot.depth + 1
  FROM employees e
  JOIN org_tree ot ON e.manager_id = ot.employee_id
  WHERE ot.depth < 10  -- safety limit
)
SELECT * FROM org_tree;
Recursive CTEs traverse hierarchies and graphs. The base case seeds the first rows; the recursive step joins back to the CTE. Always include a depth limit to prevent infinite loops in circular data.
-- Generate all dates in a range (useful for filling gaps)
WITH RECURSIVE date_series AS (
  SELECT DATE '2025-01-01' AS dt
  UNION ALL
  -- cast keeps the column type DATE across the UNION
  -- (in PostgreSQL, date + interval yields a timestamp)
  SELECT CAST(dt + INTERVAL '1 day' AS DATE)
  FROM date_series
  WHERE dt < DATE '2025-12-31'
)
SELECT
  ds.dt,
  COALESCE(r.revenue, 0) AS revenue
FROM date_series ds
LEFT JOIN daily_revenue r ON ds.dt = r.order_date;
Date spines fill gaps in time-series data. LEFT JOIN your actual data onto the generated dates so days with no activity show as zero rather than disappearing.
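On PostgreSQL you can skip the recursion entirely — generate_series builds the same spine in one call (same hypothetical daily_revenue table):

```sql
-- generate_series returns timestamps here, so cast back to DATE
SELECT
  gs::date AS dt,
  COALESCE(r.revenue, 0) AS revenue
FROM generate_series(DATE '2025-01-01', DATE '2025-12-31', INTERVAL '1 day') AS gs
LEFT JOIN daily_revenue r ON gs::date = r.order_date;
```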
LATERAL is what we reach for when "top N per group" needs a LIMIT and the window-function version keeps hitting a sort spill on the big partition. We once saw a 14-minute warehouse query drop to 38 seconds after switching from a nested ROW_NUMBER to a lateral join with LIMIT 5.
-- Top 3 orders per customer (cleaner than window + filter)
SELECT
  c.customer_id,
  c.name,
  top_orders.*
FROM customers c
CROSS JOIN LATERAL (
  SELECT order_id, amount, order_date
  FROM orders o
  WHERE o.customer_id = c.customer_id
  ORDER BY amount DESC
  LIMIT 3
) AS top_orders;
LATERAL is supported in PostgreSQL and in MySQL 8.0.14+. SQL Server uses CROSS APPLY / OUTER APPLY for the same pattern; BigQuery has no LATERAL keyword (its implicitly correlated UNNEST joins cover the array case). Use LATERAL when you need "top N per group" with a LIMIT inside the subquery.
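On SQL Server the same top-3-per-customer query reads like this — a sketch assuming the same hypothetical customers and orders tables:

```sql
SELECT c.customer_id, c.name, top_orders.*
FROM customers c
CROSS APPLY (
    -- TOP plays the role LIMIT plays elsewhere
    SELECT TOP 3 order_id, amount, order_date
    FROM orders o
    WHERE o.customer_id = c.customer_id
    ORDER BY amount DESC
) AS top_orders;
```

OUTER APPLY works the same way but keeps customers with no orders, like a LEFT JOIN would.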
Generate multiple levels of aggregation in a single query. Useful for reports that need subtotals, grand totals, and cross-tabulations.
-- Revenue by category, by region, and overall total
SELECT
  category,
  region,
  SUM(revenue) AS total_revenue
FROM sales
GROUP BY GROUPING SETS (
  (category),
  (region),
  ()  -- grand total
);
Each set in GROUPING SETS produces a separate aggregation level. NULLs in the output indicate the "all" level for that column. Use GROUPING(column) to distinguish real NULLs from aggregation NULLs.
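A sketch of GROUPING in action against the same sales table — it returns 1 when the column is rolled up in that row, so you can print an explicit label instead of guessing from NULLs:

```sql
SELECT
  -- GROUPING(col) = 1 means this row aggregates over all values of col
  CASE WHEN GROUPING(category) = 1 THEN 'ALL' ELSE category END AS category,
  CASE WHEN GROUPING(region)   = 1 THEN 'ALL' ELSE region   END AS region,
  SUM(revenue) AS total_revenue
FROM sales
GROUP BY GROUPING SETS ((category), (region), ());
```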
-- ROLLUP: hierarchical subtotals
-- (year, quarter, month), (year, quarter), (year), ()
SELECT
  year, quarter, month,
  SUM(revenue) AS total
FROM sales
GROUP BY ROLLUP (year, quarter, month);

-- CUBE: all combinations
-- Produces every combination of the grouped columns
SELECT
  category, region,
  SUM(revenue) AS total
FROM sales
GROUP BY CUBE (category, region);
ROLLUP produces hierarchical subtotals (right to left). CUBE produces all possible combinations. CUBE on 3 columns generates 2^3 = 8 grouping levels.
Atomically insert, update, or delete rows based on whether a match exists. Critical for incremental loading in data pipelines.
MERGE INTO target_table t
USING source_table s
ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET
    name = s.name,
    updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN
INSERT (id, name, created_at)
VALUES (s.id, s.name, CURRENT_TIMESTAMP);

MERGE is the standard SQL way to do upserts. Supported by Snowflake, BigQuery, SQL Server, Oracle, and PostgreSQL 15+. MySQL doesn't support MERGE — use INSERT ... ON DUPLICATE KEY UPDATE there; on older PostgreSQL, reach for INSERT ... ON CONFLICT.
INSERT INTO users (id, email, name)
VALUES (1, 'alice@example.com', 'Alice')
ON CONFLICT (id) DO UPDATE SET
  email = EXCLUDED.email,
  name = EXCLUDED.name,
  updated_at = CURRENT_TIMESTAMP;
EXCLUDED refers to the row that would have been inserted. Requires a unique constraint or index on the conflict column. Use DO NOTHING to skip duplicates without updating.
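The DO NOTHING form, for completeness (same hypothetical users table):

```sql
-- Insert only if the id is new; existing rows are left untouched
INSERT INTO users (id, email, name)
VALUES (1, 'alice@example.com', 'Alice')
ON CONFLICT (id) DO NOTHING;
```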
Common optimization patterns to discuss in interviews when asked "how would you make this fast?"
-- Partition by date, then filter by date range
-- The engine reads only the relevant partitions
SELECT *
FROM events
WHERE event_date BETWEEN '2025-01-01' AND '2025-01-31';

-- Create a partitioned table (PostgreSQL)
CREATE TABLE events (
  event_id BIGINT,
  event_date DATE,
  payload JSONB
) PARTITION BY RANGE (event_date);
Partitioning by date is the single most common optimization in DE. If a table has billions of rows and most queries filter by date, partition by date.
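One gotcha with PostgreSQL's declarative syntax: the parent table stores nothing until you create partitions. A sketch of one monthly partition (hypothetical name):

```sql
-- Rows with January dates route here automatically on INSERT
CREATE TABLE events_2025_01 PARTITION OF events
  FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
```

Inserts that match no partition fail unless you also declare a DEFAULT partition.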
-- Covering index: all columns the query needs are in the index
CREATE INDEX idx_orders_customer_date
ON orders (customer_id, order_date)
INCLUDE (amount);

-- This query can be served entirely from the index
SELECT customer_id, order_date, amount
FROM orders
WHERE customer_id = 42
  AND order_date > '2025-01-01';
A covering index lets the engine answer the query from the index alone, never touching the table (the INCLUDE clause is PostgreSQL 11+ and SQL Server syntax; elsewhere, add the column as a trailing index key). Useful for hot queries on large tables. Trade-off: more storage and slower writes.
-- Pre-compute daily revenue
CREATE MATERIALIZED VIEW daily_revenue AS
SELECT
DATE_TRUNC('day', order_date) AS day,
product_id,
SUM(amount) AS revenue,
COUNT(*) AS order_count
FROM orders
GROUP BY 1, 2;
-- Refresh on a schedule
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_revenue;

Materialized views trade freshness for speed. Refresh them on a schedule (hourly or daily) for dashboards and reports that don't need real-time data.
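In PostgreSQL, REFRESH ... CONCURRENTLY fails unless the materialized view has a unique index covering every row — a sketch for the view above (index name is hypothetical):

```sql
-- Without this, CONCURRENTLY errors out;
-- a plain REFRESH works but blocks readers for the duration
CREATE UNIQUE INDEX idx_daily_revenue_day_product
ON daily_revenue (day, product_id);
```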
-- Bad: reads all 200 columns
SELECT * FROM wide_table WHERE id = 42;

-- Good: reads only what you need
SELECT id, name, created_at
FROM wide_table
WHERE id = 42;
On columnar storage (Redshift, BigQuery, Snowflake), each column is stored separately. SELECT * reads every column's storage. Selecting only needed columns can reduce I/O by 10x or more.
Write these patterns enough and you'll stop getting called in for the same class of bugs. Interview rounds get easier as a side effect.