SQL Cheat Sheet for Data Engineer Interviews (2026)

Organized by interview frequency, not alphabetical order. Every syntax example includes the trap interviewers actually test and the mistakes that cost candidates points. Based on thousands of questions tracked on the DataDriven platform.

GROUP BY and aggregation: 24.5% of questions
JOINs: 19.6% of questions
Window functions: 15.1% of questions
Top three combined: nearly 60% of questions

GROUP BY and aggregation

-- 24.5% of SQL interview questions involve GROUP BY.
-- The HAVING/WHERE distinction is the most-tested trap.

-- Basic GROUP BY: every non-aggregated SELECT column must be in GROUP BY
SELECT department, COUNT(*) AS headcount, AVG(salary) AS avg_salary
FROM employees
GROUP BY department;

-- HAVING filters AFTER aggregation; WHERE filters BEFORE.
SELECT department, COUNT(*) AS headcount
FROM employees
WHERE active = true              -- filter rows first
GROUP BY department
HAVING COUNT(*) > 10;             -- then filter groups

-- Conditional aggregation: replace multiple subqueries with one pass.
SELECT
  DATE_TRUNC('month', order_date) AS month,
  COUNT(*) AS total_orders,
  SUM(CASE WHEN status = 'returned' THEN 1 ELSE 0 END) AS returns,
  ROUND(100.0 * SUM(CASE WHEN status = 'returned' THEN 1 ELSE 0 END)
    / COUNT(*), 1) AS return_pct
FROM orders
GROUP BY DATE_TRUNC('month', order_date);

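-- ROLLUP adds subtotal + grand total rows. CUBE adds all combinations.
-- COALESCE also relabels genuine NULL regions/products; where the engine
-- supports it, GROUPING(region) = 1 identifies subtotal rows unambiguously.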
SELECT
  COALESCE(region, 'ALL REGIONS') AS region,
  COALESCE(product, 'ALL PRODUCTS') AS product,
  SUM(revenue) AS total_revenue
FROM sales
GROUP BY ROLLUP(region, product);

24.5% of SQL interview questions. The HAVING-vs-WHERE distinction is tested in every interview at every level.
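
A minimal sketch of the WHERE-then-HAVING order, run against an in-memory SQLite database through Python's sqlite3 module. The sample rows and the threshold of 2 are made up for illustration; the point is only that dropping the WHERE changes what HAVING sees.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data: six employees across two departments.
conn.executescript("""
CREATE TABLE employees (name TEXT, department TEXT, active INTEGER);
INSERT INTO employees VALUES
  ('Ana', 'eng', 1), ('Bo', 'eng', 1), ('Cy', 'eng', 0),
  ('Di', 'ops', 1), ('Ed', 'ops', 0), ('Flo', 'ops', 0);
""")

# WHERE removes inactive rows first; HAVING then drops small groups.
where_then_having = conn.execute("""
    SELECT department, COUNT(*) AS headcount
    FROM employees
    WHERE active = 1
    GROUP BY department
    HAVING COUNT(*) >= 2
    ORDER BY department
""").fetchall()

# Without the WHERE, inactive rows are counted before HAVING runs.
having_only = conn.execute("""
    SELECT department, COUNT(*) AS headcount
    FROM employees
    GROUP BY department
    HAVING COUNT(*) >= 2
    ORDER BY department
""").fetchall()

print(where_then_having)  # [('eng', 2)]
print(having_only)        # [('eng', 3), ('ops', 3)]
```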

JOINs

-- 19.6% of questions. INNER, LEFT, FULL OUTER, CROSS, self-joins.

-- INNER: only matching rows on both sides
SELECT o.order_id, c.name
FROM orders o
INNER JOIN customers c ON o.customer_id = c.id;

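-- LEFT JOIN: keep all left rows even when no match.
-- COUNT(o.order_id) is already 0 for unmatched customers, so no COALESCE is needed here.
-- Use COALESCE with SUM/AVG, which return NULL when nothing matches.
SELECT c.id, c.name, COUNT(o.order_id) AS order_count
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id
GROUP BY c.id, c.name;             -- group by the key; names may not be unique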

-- Anti-join: find left rows with NO right match. Asked in nearly every interview.
SELECT c.id, c.name
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id
WHERE o.order_id IS NULL;

-- Self-join: hierarchies (employees → managers)
SELECT e.name AS employee, m.name AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.id;

-- CROSS JOIN: scaffolding (date spine × product list)
SELECT d.date, p.product_id
FROM date_spine d
CROSS JOIN (SELECT DISTINCT product_id FROM products) p;

-- FULL OUTER: reconciliation (what's in A but not B and vice versa)
SELECT
  COALESCE(a.id, b.id) AS id,
  a.value AS source_a, b.value AS source_b
FROM table_a a
FULL OUTER JOIN table_b b ON a.id = b.id;

19.6% of questions. INNER, LEFT, FULL OUTER, CROSS, and self-joins. Anti-join (LEFT JOIN + IS NULL) is the highest-yield pattern.
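
The anti-join and the LEFT JOIN count can be checked on a toy dataset. This is a sketch using Python's sqlite3 module with made-up customers and orders; it also shows that COUNT(o.order_id) returns 0 for a customer with no orders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data: Bo has no orders.
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
INSERT INTO customers VALUES (1, 'Ana'), (2, 'Bo'), (3, 'Cy');
INSERT INTO orders VALUES (10, 1), (11, 1), (12, 3);
""")

# Anti-join: unmatched left rows carry NULL in every right-side column.
no_orders = conn.execute("""
    SELECT c.id, c.name
    FROM customers c
    LEFT JOIN orders o ON c.id = o.customer_id
    WHERE o.order_id IS NULL
    ORDER BY c.id
""").fetchall()

# Counting a right-side column skips the NULLs produced by the LEFT JOIN.
order_counts = conn.execute("""
    SELECT c.id, c.name, COUNT(o.order_id) AS order_count
    FROM customers c
    LEFT JOIN orders o ON c.id = o.customer_id
    GROUP BY c.id, c.name
    ORDER BY c.id
""").fetchall()

print(no_orders)     # [(2, 'Bo')]
print(order_counts)  # [(1, 'Ana', 2), (2, 'Bo', 0), (3, 'Cy', 1)]
```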

SQL clause execution order

SQL does not execute top-to-bottom. Understanding the real order explains why you can't use SELECT aliases in WHERE, why HAVING can reference aggregates, and why ORDER BY can use aliases. Interviewers test this directly.

Step	Clause	Notes
1	FROM / JOIN	Tables identified and joined. This is the logical order; the optimizer is free to reorder joins physically.
2	WHERE	Rows filtered. Cannot reference SELECT aliases yet.
3	GROUP BY	Remaining rows grouped. Non-aggregated SELECT columns must appear here.
4	HAVING	Groups filtered using aggregate values.
5	SELECT	Columns and expressions computed. Window functions evaluate here.
6	DISTINCT	Duplicate rows removed.
7	ORDER BY	Results sorted. CAN reference SELECT aliases.
8	LIMIT / OFFSET	Row count restricted.
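
One way to internalize the table above is to replay the steps by hand. The sketch below walks a hypothetical employees list through the same logical order in plain Python (DISTINCT is skipped because the grouped rows are already unique). It mirrors the query SELECT department, AVG(salary) AS avg_salary FROM employees WHERE active GROUP BY department HAVING AVG(salary) > 75 ORDER BY avg_salary DESC LIMIT 1.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for an employees table.
employees = [
    {"department": "eng", "salary": 120, "active": True},
    {"department": "eng", "salary": 100, "active": True},
    {"department": "ops", "salary": 90, "active": True},
    {"department": "ops", "salary": 80, "active": False},
    {"department": "hr", "salary": 70, "active": True},
]

# 1. FROM: start from the table's rows.
source = employees

# 2. WHERE: row-level filter. No aggregates or SELECT aliases exist yet.
filtered = [row for row in source if row["active"]]

# 3. GROUP BY department.
groups = defaultdict(list)
for row in filtered:
    groups[row["department"]].append(row["salary"])

# 4. HAVING AVG(salary) > 75: group-level filter, aggregates now available.
kept = {dept: s for dept, s in groups.items() if sum(s) / len(s) > 75}

# 5. SELECT: compute output columns; the alias avg_salary is born here.
selected = [
    {"department": dept, "avg_salary": sum(s) / len(s)}
    for dept, s in kept.items()
]

# 7. ORDER BY avg_salary DESC: runs after SELECT, so the alias is visible.
ordered = sorted(selected, key=lambda r: r["avg_salary"], reverse=True)

# 8. LIMIT 1.
result = ordered[:1]

print(result)  # [{'department': 'eng', 'avg_salary': 110.0}]
```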

Window functions

-- 15.1% of questions. The senior-vs-mid signal.

-- ROW_NUMBER / RANK / DENSE_RANK — different tie behavior
ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) AS rn  -- (1,2,3,4)
RANK()       OVER (PARTITION BY dept ORDER BY salary DESC) AS rk  -- (1,2,2,4)
DENSE_RANK() OVER (PARTITION BY dept ORDER BY salary DESC) AS dr  -- (1,2,2,3)

-- LAG / LEAD: look backward / forward within a partition
SELECT month, revenue,
  LAG(revenue, 1) OVER (ORDER BY month) AS prev_month,
  revenue - LAG(revenue, 1) OVER (ORDER BY month) AS mom_change
FROM monthly_revenue;

-- Running total / moving average — frame clause matters
SUM(amount) OVER (
  ORDER BY date
  ROWS UNBOUNDED PRECEDING
) AS running_total

AVG(amount) OVER (
  ORDER BY date
  ROWS BETWEEN 6 PRECEDING AND CURRENT ROW   -- 7 rows; equals 7 days only with exactly one row per day
) AS avg_7d

-- NTILE: divide rows into N buckets (quartiles, percentiles)
NTILE(4) OVER (ORDER BY total_spend DESC) AS quartile

-- LAST_VALUE classic trap: default frame ends at CURRENT ROW.
-- Must be set to UNBOUNDED FOLLOWING for "last" to actually mean last.
LAST_VALUE(event_type) OVER (
  PARTITION BY user_id ORDER BY event_time
  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS last_action

15.1% of questions and the topic that separates mid-level from senior candidates. The LAST_VALUE frame trap is asked in roughly 1 in 5 senior interviews.
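
The LAST_VALUE trap is easy to reproduce. A sketch using Python's sqlite3 module (window functions need SQLite 3.25 or newer) with three made-up events for one user: the default frame returns the current row's value, while the explicit full frame returns the true last event.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data: one user, three events in time order.
conn.executescript("""
CREATE TABLE events (user_id INTEGER, event_time TEXT, event_type TEXT);
INSERT INTO events VALUES
  (1, '2024-01-01 09:00', 'login'),
  (1, '2024-01-01 09:05', 'search'),
  (1, '2024-01-01 09:10', 'purchase');
""")

rows = conn.execute("""
    SELECT event_type,
      -- Default frame stops at the current row.
      LAST_VALUE(event_type) OVER (
        PARTITION BY user_id ORDER BY event_time
      ) AS default_frame,
      -- Explicit frame spans the whole partition.
      LAST_VALUE(event_type) OVER (
        PARTITION BY user_id ORDER BY event_time
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
      ) AS full_frame
    FROM events
    ORDER BY event_time
""").fetchall()

for row in rows:
    print(row)
# ('login', 'login', 'purchase')
# ('search', 'search', 'purchase')
# ('purchase', 'purchase', 'purchase')
```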

CTEs and recursive queries

-- 4.9% of questions as standalone, but used in MOST complex queries.
-- Interviewers expect CTEs for readability whenever logic stacks.

-- Basic CTE
WITH active_users AS (
  SELECT user_id, COUNT(*) AS sessions
  FROM sessions
  WHERE created_at >= CURRENT_DATE - INTERVAL '30 days'
  GROUP BY user_id
)
SELECT u.name, a.sessions
FROM users u
JOIN active_users a ON u.id = a.user_id;

-- Multiple CTEs in one WITH (commas, not multiple WITH keywords)
WITH
  recent AS (SELECT * FROM orders WHERE order_date >= '2025-01-01'),
  totals AS (SELECT customer_id, SUM(amount) AS total FROM recent GROUP BY customer_id)
SELECT * FROM totals WHERE total > 1000;

-- Recursive CTE: date spine, org chart, graph traversal
WITH RECURSIVE date_spine AS (
  SELECT DATE '2024-01-01' AS dt        -- anchor
  UNION ALL
  SELECT (dt + INTERVAL '1 day')::date   -- recursive member; cast back to DATE because
                                         -- date + interval yields a timestamp in Postgres
  FROM date_spine
  WHERE dt < DATE '2024-12-31'           -- termination
)
SELECT ds.dt, COALESCE(r.revenue, 0) AS revenue
FROM date_spine ds
LEFT JOIN daily_revenue r ON ds.dt = r.day;

Common Table Expressions decompose complex queries. Recursive CTEs power date spines, org charts, and graph traversals.
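
A runnable sketch of the date-spine pattern, using Python's sqlite3 module. SQLite has no DATE type or INTERVAL syntax, so this version stores dates as ISO text and steps with date(dt, '+1 day'); the five-day range and revenue rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data: revenue exists for only two of five days.
conn.executescript("""
CREATE TABLE daily_revenue (day TEXT PRIMARY KEY, revenue REAL);
INSERT INTO daily_revenue VALUES ('2024-01-01', 100.0), ('2024-01-03', 50.0);
""")

spine = conn.execute("""
    WITH RECURSIVE date_spine(dt) AS (
      SELECT '2024-01-01'                 -- anchor
      UNION ALL
      SELECT date(dt, '+1 day')           -- recursive member
      FROM date_spine
      WHERE dt < '2024-01-05'             -- termination
    )
    SELECT ds.dt, COALESCE(r.revenue, 0.0) AS revenue
    FROM date_spine ds
    LEFT JOIN daily_revenue r ON ds.dt = r.day
    ORDER BY ds.dt
""").fetchall()

for row in spine:
    print(row)
# ('2024-01-01', 100.0)
# ('2024-01-02', 0.0)
# ('2024-01-03', 50.0)
# ('2024-01-04', 0.0)
# ('2024-01-05', 0.0)
```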

CASE WHEN and conditional logic

-- 1.8% standalone, but embedded in 40%+ of complex queries.

-- Searched CASE: top-to-bottom evaluation, first TRUE wins
CASE
  WHEN salary >= 150000 THEN 'Executive'
  WHEN salary >= 100000 THEN 'Senior'
  WHEN salary >= 60000  THEN 'Mid'
  ELSE 'Junior'
END AS level

-- Pivoting with CASE: portable across engines
-- (quote or rename the key column where it is a reserved word, e.g. MySQL)
SELECT user_id,
  MAX(CASE WHEN key = 'email' THEN value END) AS email,
  MAX(CASE WHEN key = 'phone' THEN value END) AS phone
FROM user_attributes
GROUP BY user_id;

-- CASE inside COUNT for conditional counts
SELECT
  COUNT(*) AS total_orders,
  COUNT(CASE WHEN status = 'returned' THEN 1 END) AS returned_orders,
  COUNT(CASE WHEN status = 'shipped'  THEN 1 END) AS shipped_orders
FROM orders;

Embedded in 40%+ of analytical queries. The pivot pattern (MAX over CASE) appears in nearly every data-cleaning interview.
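
The pivot pattern can be verified on a small key-value table. This sketch uses Python's sqlite3 module; the column names attr_key and attr_value are hypothetical stand-ins chosen to sidestep reserved words. A user with no phone row gets NULL (None in Python) because MAX over an all-NULL group is NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data: user 2 has no phone attribute.
conn.executescript("""
CREATE TABLE user_attributes (user_id INTEGER, attr_key TEXT, attr_value TEXT);
INSERT INTO user_attributes VALUES
  (1, 'email', 'a@x.com'),
  (1, 'phone', '555-0100'),
  (2, 'email', 'b@x.com');
""")

# Each CASE yields the value for one key and NULL otherwise;
# MAX ignores the NULLs and collapses the group to one row per user.
pivoted = conn.execute("""
    SELECT user_id,
      MAX(CASE WHEN attr_key = 'email' THEN attr_value END) AS email,
      MAX(CASE WHEN attr_key = 'phone' THEN attr_value END) AS phone
    FROM user_attributes
    GROUP BY user_id
    ORDER BY user_id
""").fetchall()

print(pivoted)  # [(1, 'a@x.com', '555-0100'), (2, 'b@x.com', None)]
```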

Engine-specific syntax differences to memorize

Eight high-frequency operations where Postgres, MySQL, SQL Server, Snowflake, and BigQuery diverge. State your dialect before writing code in an interview.

Operation	Postgres / ANSI standard	Other engines
Date arithmetic	CURRENT_DATE - INTERVAL '30 days'	DATEADD(day, -30, GETDATE()) (SQL Server, Snowflake) / DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) (MySQL, BigQuery)
Date truncation	DATE_TRUNC('month', col)	DATE_TRUNC(col, MONTH) (BigQuery), TRUNC(col, 'MONTH') (Oracle)
String concat	col1 || ' ' || col2 (ANSI)	CONCAT(col1, ' ', col2) (MySQL, SQL Server)
NULL-safe equality	col IS NOT DISTINCT FROM other	<=> (MySQL), NVL/COALESCE comparisons
Upsert	INSERT ... ON CONFLICT (Postgres)	MERGE (Snowflake, Postgres 15+, Oracle, SQL Server)
String split	SPLIT_PART(col, '.', 2) (Postgres)	SPLIT(col, '.')[OFFSET(1)] (BigQuery)
Regex replace	REGEXP_REPLACE(col, pattern, repl, 'g')	REGEXP_REPLACE(col, pattern, repl) (BigQuery)
LIMIT with offset	LIMIT 10 OFFSET 20	OFFSET 20 ROWS FETCH NEXT 10 ROWS ONLY (ANSI form; SQL Server, Oracle)

Subqueries and EXISTS

-- EXISTS / NOT EXISTS — NULL-safe, unlike NOT IN
SELECT c.name
FROM customers c
WHERE NOT EXISTS (
  SELECT 1
  FROM orders o
  WHERE o.customer_id = c.id
);
-- Prefer NOT EXISTS over NOT IN whenever the subquery column may be NULL.
-- NOT IN returns NO rows if ANY value in the subquery is NULL. Classic trap.

-- Scalar subquery: returns ONE value
SELECT name, salary,
  salary - (SELECT AVG(salary) FROM employees) AS diff_from_avg
FROM employees;
-- Errors if the subquery returns >1 row.

-- IN with a subquery: most optimizers plan this as a semi-join.
-- A JOIN rewrite is equivalent only when the subquery key is unique; otherwise rows duplicate.
SELECT * FROM orders
WHERE customer_id IN (SELECT id FROM customers WHERE country = 'US');

NOT EXISTS over NOT IN is the most-tested correctness trap. NULL in the subquery silently returns zero rows from NOT IN.
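
The trap is reproducible with a single NULL. A sketch using Python's sqlite3 module and made-up rows: one order has a NULL customer_id, which is enough to make NOT IN return nothing while NOT EXISTS still finds the customers without orders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data: order 11 has no customer_id.
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
INSERT INTO customers VALUES (1, 'Ana'), (2, 'Bo'), (3, 'Cy');
INSERT INTO orders VALUES (10, 1), (11, NULL);
""")

# 2 NOT IN (1, NULL) evaluates to NULL, not TRUE, so every row is filtered out.
not_in = conn.execute("""
    SELECT name FROM customers
    WHERE id NOT IN (SELECT customer_id FROM orders)
    ORDER BY id
""").fetchall()

# NOT EXISTS only asks whether a matching row exists; NULLs never match.
not_exists = conn.execute("""
    SELECT c.name FROM customers c
    WHERE NOT EXISTS (
      SELECT 1 FROM orders o WHERE o.customer_id = c.id
    )
    ORDER BY c.id
""").fetchall()

print(not_in)      # []
print(not_exists)  # [('Bo',), ('Cy',)]
```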

MERGE and upsert

-- MERGE: standard upsert. Postgres got it in v15.

MERGE INTO dim_customer AS target
USING staging AS source ON target.id = source.id
WHEN MATCHED THEN
  UPDATE SET
    name = source.name,
    updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN
  INSERT (id, name, created_at)
  VALUES (source.id, source.name, CURRENT_TIMESTAMP);

-- Postgres alternative (pre-15 and still common): ON CONFLICT
INSERT INTO dim_customer (id, name, updated_at)
VALUES (1, 'Alice', NOW())
ON CONFLICT (id) DO UPDATE SET
  name = EXCLUDED.name,
  updated_at = EXCLUDED.updated_at;
-- EXCLUDED refers to the row that WOULD have been inserted.

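-- MySQL: INSERT ... ON DUPLICATE KEY UPDATE
-- (VALUES(name) is deprecated as of MySQL 8.0.20; newer code uses a row alias:
--  INSERT ... VALUES (1, 'Alice') AS new ON DUPLICATE KEY UPDATE name = new.name)
INSERT INTO dim_customer (id, name) VALUES (1, 'Alice')
ON DUPLICATE KEY UPDATE name = VALUES(name);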

Core to incremental loading patterns. Postgres uses INSERT ON CONFLICT pre-v15; MERGE is now the standard cross-engine syntax.
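
SQLite (3.24 or newer) accepts the same ON CONFLICT ... DO UPDATE form as Postgres, which makes the upsert behavior easy to try locally. In this sketch the load_count column is a hypothetical addition used only to show that the second insert for id 1 took the UPDATE branch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# load_count is a hypothetical column that counts how often a row was loaded.
conn.execute("""
    CREATE TABLE dim_customer (
      id INTEGER PRIMARY KEY,
      name TEXT,
      load_count INTEGER DEFAULT 1
    )
""")

upsert_sql = """
    INSERT INTO dim_customer (id, name) VALUES (?, ?)
    ON CONFLICT (id) DO UPDATE SET
      name = excluded.name,          -- excluded = the row that would have been inserted
      load_count = load_count + 1    -- unqualified = the existing row
"""

conn.execute(upsert_sql, (1, "Alice"))   # inserts
conn.execute(upsert_sql, (1, "Alicia"))  # conflicts on id, so it updates
conn.execute(upsert_sql, (2, "Bob"))     # inserts

rows = conn.execute(
    "SELECT id, name, load_count FROM dim_customer ORDER BY id"
).fetchall()

print(rows)  # [(1, 'Alicia', 2), (2, 'Bob', 1)]
```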

Date functions

-- Almost every interview involves a time dimension.
-- Syntax varies HEAVILY across engines; always name your target.

-- DATE_TRUNC (Postgres / Snowflake)
DATE_TRUNC('month', order_date)

-- DATE_TRUNC (BigQuery — argument order reversed)
DATE_TRUNC(order_date, MONTH)

-- Date arithmetic (Postgres uses INTERVAL)
CURRENT_DATE - INTERVAL '30 days'

-- Date arithmetic (SQL Server / Snowflake use DATEADD)
DATEADD(day, -30, GETDATE())

-- Day difference
DATEDIFF(day, start_date, end_date)         -- SQL Server
DATE_DIFF(end_date, start_date, DAY)        -- BigQuery
end_date - start_date                       -- Postgres: integer days for DATE operands (an INTERVAL for timestamps)

-- EXTRACT works in most engines
EXTRACT(YEAR FROM order_date)
EXTRACT(DOW FROM order_date)   -- day of week (numbering varies)
EXTRACT(EPOCH FROM ts_col)     -- unix timestamp

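-- ISO week: always pair the ISO year with the ISO week number.
-- Plain "week" is NOT consistent across engines.
EXTRACT(ISOYEAR FROM date_col)   -- Postgres, BigQuery
EXTRACT(WEEK FROM date_col)      -- ISO week in Postgres; BigQuery needs EXTRACT(ISOWEEK FROM date_col)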

Almost every interview involves a time dimension. Always name the engine before writing — DATE_TRUNC argument order differs between Postgres and BigQuery.
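
Why the ISO year has to travel with the ISO week: days around New Year can belong to a week of the neighboring year. A quick check with Python's standard datetime module, which follows the same ISO 8601 rules, using a date picked for illustration:

```python
from datetime import date

d = date(2024, 12, 30)  # a Monday

# isocalendar() returns (ISO year, ISO week, ISO weekday).
iso_year, iso_week, iso_weekday = d.isocalendar()

print(d.year)                           # 2024
print(iso_year, iso_week, iso_weekday)  # 2025 1 1
```

Grouping by calendar year plus ISO week would file this day under "2024, week 1", which is the bug the ISOYEAR field exists to prevent.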

Set operations

-- UNION vs UNION ALL — most-tested set-operation trap

-- UNION removes duplicates (slower; needs a sort or hash to dedupe)
SELECT id FROM table_a
UNION
SELECT id FROM table_b;

-- UNION ALL keeps duplicates (faster; use unless you need dedup)
SELECT id FROM table_a
UNION ALL
SELECT id FROM table_b;

-- INTERSECT: rows in BOTH
SELECT id FROM table_a
INTERSECT
SELECT id FROM table_b;

-- EXCEPT: rows in A but not B (called MINUS in Oracle)
SELECT id FROM table_a
EXCEPT
SELECT id FROM table_b;

UNION ALL beats UNION when you don't need dedup. UNION must sort or hash the combined rows to deduplicate, which is expensive on large inputs.
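
The four set operations side by side, as a sketch with Python's sqlite3 module and two tiny made-up tables. Note that table_a holds a duplicate id, which UNION ALL preserves and the other three operations remove.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data: id 2 appears twice in table_a and once in table_b.
conn.executescript("""
CREATE TABLE table_a (id INTEGER);
CREATE TABLE table_b (id INTEGER);
INSERT INTO table_a VALUES (1), (2), (2);
INSERT INTO table_b VALUES (2), (3);
""")

def ids(set_operator):
    """Run table_a <set_operator> table_b and return the ids in sorted order."""
    sql = f"SELECT id FROM table_a {set_operator} SELECT id FROM table_b ORDER BY id"
    return [row[0] for row in conn.execute(sql)]

union = ids("UNION")
union_all = ids("UNION ALL")
intersect = ids("INTERSECT")
except_ = ids("EXCEPT")

print(union)      # [1, 2, 3]
print(union_all)  # [1, 2, 2, 2, 3]
print(intersect)  # [2]
print(except_)    # [1]
```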

String functions and pattern matching

-- String functions appear in data cleaning questions.

-- Concatenation
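first_name || ' ' || last_name              -- ANSI / Postgres (in MySQL, || is logical OR by default)
CONCAT(first_name, ' ', last_name)          -- MySQL / SQL Server / BigQuery
-- || propagates NULL. CONCAT skips NULLs in Postgres and SQL Server,
-- but returns NULL if any argument is NULL in MySQL and BigQuery.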

-- TRIM / LOWER / REPLACE for join key cleanup
TRIM(BOTH ' ' FROM raw_input)
LOWER(email)
REPLACE(phone, '-', '')
-- Always TRIM+LOWER before joining on string keys. Untrimmed
-- whitespace causes silent join failures (the worst kind).

-- Split by delimiter
SPLIT_PART('a.b.c', '.', 2)                          -- Postgres → 'b'
SPLIT('a.b.c', '.')[OFFSET(1)]                       -- BigQuery → 'b'
STRING_TO_ARRAY('a,b,c', ',')                        -- Postgres → array

-- Regex (engine-specific)
REGEXP_REPLACE(phone, '[^0-9]', '', 'g')             -- Postgres ('g' = replace all matches)
REGEXP_REPLACE(phone, '[^0-9]', '')                  -- Snowflake (replaces all matches by default)
REGEXP_REPLACE(phone, r'[^0-9]', '')                 -- BigQuery
phone REGEXP '^[0-9]+$'                              -- MySQL boolean

-- Pattern matching
LIKE 'abc%'                                          -- prefix match
ILIKE 'abc%'                                         -- case-insensitive (Postgres)
SIMILAR TO 'pattern'                                 -- regex-lite

Data-cleaning questions test TRIM+LOWER before joins. Engine-specific regex syntax matters most when interviewing at companies that use BigQuery vs. Postgres.
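
A sketch of the silent join failure, using Python's sqlite3 module. The crm and signups tables are hypothetical; one CRM email carries stray whitespace and mixed case, so the raw equality join quietly drops it while the normalized join keeps it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data: the first CRM email is padded and mixed-case.
conn.executescript("""
CREATE TABLE crm (email TEXT);
CREATE TABLE signups (email TEXT, plan TEXT);
INSERT INTO crm VALUES ('  Ana@Example.com '), ('bo@example.com');
INSERT INTO signups VALUES ('ana@example.com', 'pro'), ('bo@example.com', 'free');
""")

# Raw join: no error is raised, the mismatched row simply disappears.
raw_join = conn.execute("""
    SELECT s.email, s.plan
    FROM crm c
    JOIN signups s ON c.email = s.email
    ORDER BY s.email
""").fetchall()

# Normalized join: TRIM + LOWER on both sides of the key.
clean_join = conn.execute("""
    SELECT s.email, s.plan
    FROM crm c
    JOIN signups s ON LOWER(TRIM(c.email)) = LOWER(TRIM(s.email))
    ORDER BY s.email
""").fetchall()

print(raw_join)    # [('bo@example.com', 'free')]
print(clean_join)  # [('ana@example.com', 'pro'), ('bo@example.com', 'free')]
```

In a real pipeline the cleanup usually belongs in a staging step rather than in the join condition, since wrapping the key in functions can prevent index use.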

What interviewers reward in SQL rounds

Four habits that separate strong submissions from passing ones. Each is independent — apply whichever fits the question.

Readability

Decompose with CTEs, not nested subqueries

Interviewers follow your query top-to-bottom. A 3-level nested subquery forces them to read inside-out, which loses points even when the SQL is correct. Stack CTEs instead: each one named, each one a logical step. The interviewer can ask 'show me what active_users looks like' and you can run that CTE alone.

Nesting > 2 levels = bad signal

Awareness

State engine assumptions out loud

DATE_TRUNC('month', col) is Postgres. DATE_TRUNC(col, MONTH) is BigQuery. Strong candidates say 'assuming Postgres, I'd write...' before the code. It signals you know the syntax isn't universal and you've worked across engines. Weak candidates write one form and look surprised when the interviewer asks about portability.

Name the dialect before writing

Correctness

Predict the row count before running

After writing a JOIN, say what the result row count should be. 'Each order joins to exactly one customer, so 1M orders × 1 customer = 1M rows.' If you can't predict the count, you don't understand the query. Interviewers test this directly: 'How many rows will this produce?' Wrong answer = uncertainty about the join semantics.

Row math = join correctness check
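
The row math can be rehearsed on a toy schema. In this sketch (Python's sqlite3 module, made-up tables) a many-to-one join keeps the order count at 3, while joining the same orders to a hypothetical customer_addresses table with two addresses for one customer fans out to 5 rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data: customer 1 has two orders and two addresses.
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
CREATE TABLE customer_addresses (customer_id INTEGER, kind TEXT);
INSERT INTO customers VALUES (1, 'Ana'), (2, 'Bo');
INSERT INTO orders VALUES (10, 1), (11, 1), (12, 2);
INSERT INTO customer_addresses VALUES (1, 'home'), (1, 'work'), (2, 'home');
""")

# Many-to-one: each order matches exactly one customer, so 3 orders -> 3 rows.
many_to_one = conn.execute("""
    SELECT COUNT(*) FROM orders o
    JOIN customers c ON o.customer_id = c.id
""").fetchone()[0]

# Many-to-many on customer_id: orders 10 and 11 each match 2 addresses,
# order 12 matches 1, so 2 + 2 + 1 = 5 rows.
fan_out = conn.execute("""
    SELECT COUNT(*) FROM orders o
    JOIN customer_addresses a ON o.customer_id = a.customer_id
""").fetchone()[0]

print(many_to_one)  # 3
print(fan_out)      # 5
```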

Edge cases

Mention NULL behavior explicitly

NULL = NULL evaluates to NULL (unknown), not TRUE, so the comparison never matches in a WHERE or JOIN condition. NOT IN against a subquery that contains a NULL returns no rows. SUM ignores NULLs; COUNT(*) counts every row, while COUNT(col) skips rows where col is NULL. Strong candidates flag NULL handling for every join, every aggregate, every comparison. 'If customer_id can be NULL, I'd use IS NULL or COALESCE.' Weak candidates ignore NULL until the interviewer points out the bug.

NULL is the #1 source of silent bugs
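
These NULL rules are quick to confirm. A sketch with Python's sqlite3 module and a hypothetical payments table containing one NULL amount; SQL NULL comes back as None in Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# NULL = NULL yields NULL (treated as not true by WHERE); IS NULL yields 1.
comparison = conn.execute("SELECT NULL = NULL, NULL IS NULL").fetchone()

# Hypothetical sample data: three payments, one with a NULL amount.
conn.executescript("""
CREATE TABLE payments (amount REAL);
INSERT INTO payments VALUES (10.0), (NULL), (5.0);
""")

# COUNT(*) counts rows; COUNT(amount), SUM and AVG all skip the NULL.
aggregates = conn.execute("""
    SELECT COUNT(*), COUNT(amount), SUM(amount), AVG(amount)
    FROM payments
""").fetchone()

print(comparison)  # (None, 1)
print(aggregates)  # (3, 2, 15.0, 7.5)
```

Note that AVG divides by the 2 non-NULL rows, not by all 3, which is a common follow-up question.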

SQL interview FAQ

What SQL topics are tested most in data engineering interviews?

Based on DataDriven platform data: GROUP BY and aggregation (24.5%), JOINs (19.6%), and window functions (15.1%) account for nearly 60% of all SQL questions. CTEs and subqueries add another 4.9%. The remaining questions involve date functions, string manipulation, CASE WHEN, and set operations.

Which SQL engine should I practice on?

PostgreSQL is the most common interview engine. It is open-source, standards-compliant, and supports window functions, CTEs, and MERGE (v15+). If interviewing at a specific company, check their stack: Meta uses Presto, Google uses BigQuery, Amazon uses Redshift, Databricks uses Spark SQL.

Do I need to memorize all SQL syntax?

No. Focus on patterns, not exact syntax. Interviewers care whether you decompose a problem into the right operations. Minor syntax variations between engines are forgiven. What is not forgiven: using a self-join when a window function is the right tool, or NOT IN with NULL-bearing columns.

Should I use CTEs or subqueries in interviews?

CTEs are almost always better in interviews. They make your query readable top-to-bottom, which helps the interviewer follow your logic. Use subqueries only for simple scalar lookups. Never nest more than two levels of subqueries.

How long should a SQL interview answer take?

Most SQL questions should be solvable in 10 to 15 minutes. If you are taking longer, your approach is too complex. Interviewers allocate 45 to 60 minutes for a SQL round and expect 3 to 4 questions.

Why practice

Reading syntax isn't the same as writing it

01. Active recall beats re-reading by 50%. Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom.

02. 76% of hiring managers reject on the coding task, not the resume. From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice.

03. Five problem shapes cover 80% of data engineer loops. Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition.
