SQL Reference

SQL Cheat Sheet for Data Engineering Interviews (2026)

Organized by interview frequency, not alphabetical order. Every syntax example includes what interviewers actually test and the mistakes that cost candidates points. Based on thousands of questions tracked on the DataDriven platform.

GROUP BY and Aggregation

24.5%

The most-tested SQL category. Nearly one in four SQL questions involves grouping and aggregation.

Basic GROUP BY

SELECT department, COUNT(*) AS headcount, AVG(salary) AS avg_salary
FROM employees
GROUP BY department;

Interview note: Every non-aggregated column in SELECT must appear in GROUP BY. This is the most common syntax error under interview pressure.

HAVING (filter after aggregation)

SELECT department, COUNT(*) AS headcount
FROM employees
GROUP BY department
HAVING COUNT(*) > 10;

Interview note: WHERE filters rows before grouping. HAVING filters groups after aggregation. Interviewers test this distinction constantly.

Conditional Aggregation

SELECT
  DATE_TRUNC('month', order_date) AS month,
  COUNT(*) AS total_orders,
  SUM(CASE WHEN status = 'returned' THEN 1 ELSE 0 END) AS returns,
  ROUND(100.0 * SUM(CASE WHEN status = 'returned' THEN 1 ELSE 0 END)
    / COUNT(*), 1) AS return_pct
FROM orders
GROUP BY DATE_TRUNC('month', order_date);

Interview note: Replaces multiple subqueries with a single pass. Tests CASE WHEN, aggregation, and arithmetic in one query.

GROUP BY with ROLLUP

SELECT
  COALESCE(region, 'ALL REGIONS') AS region,
  COALESCE(product, 'ALL PRODUCTS') AS product,
  SUM(revenue) AS total_revenue
FROM sales
GROUP BY ROLLUP(region, product);

Interview note: ROLLUP adds subtotal and grand total rows. CUBE adds all combinations. Knowing these exist shows depth.

JOINs

19.6%

Second most common. Interviewers test INNER, LEFT, FULL OUTER, CROSS, and self-joins. The tricky part is predicting row count and handling NULLs.

INNER JOIN

SELECT o.order_id, c.name
FROM orders o
INNER JOIN customers c ON o.customer_id = c.id;

Interview note: Returns only matching rows. If a customer has no orders, they are excluded.

LEFT JOIN with COALESCE

SELECT c.name, COALESCE(COUNT(o.order_id), 0) AS order_count
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id
GROUP BY c.name;

Interview note: Returns all left-table rows. Right-table columns are NULL for non-matches. Most common join type in analytics.

Anti-Join (find non-matches)

SELECT c.id, c.name
FROM customers c
LEFT JOIN orders o ON c.id = o.customer_id
WHERE o.order_id IS NULL;

Interview note: Finds left-table rows with no right-table match. 'Customers who never ordered.' Tested in nearly every SQL interview.

Self-Join

SELECT e.name AS employee, m.name AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.id;

Interview note: Table joined to itself. Common for hierarchies and comparing rows within the same table.

CROSS JOIN (scaffold)

SELECT d.date, p.product_id
FROM date_spine d
CROSS JOIN (SELECT DISTINCT product_id FROM products) p;

Interview note: Produces every combination. Used to generate a complete grid before LEFT JOINing actual data. Essential for gap-filling.

FULL OUTER JOIN

SELECT
  COALESCE(a.id, b.id) AS id,
  a.value AS source_a, b.value AS source_b
FROM table_a a
FULL OUTER JOIN table_b b ON a.id = b.id;

Interview note: Returns all rows from both sides. Used for reconciliation: what is in A but not B, and vice versa.

Window Functions

15.1%

The topic that separates mid-level from senior candidates. Window functions compute values across related rows without collapsing them.

ROW_NUMBER / RANK / DENSE_RANK

ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC) AS rn
-- ROW_NUMBER: always unique (1,2,3,4)
-- RANK: ties share, next skips (1,2,2,4)
-- DENSE_RANK: ties share, no skip (1,2,2,3)

Interview note: ROW_NUMBER for dedup (WHERE rn = 1). RANK/DENSE_RANK for leaderboards. Know the tie-handling difference.

LAG / LEAD

SELECT month, revenue,
  LAG(revenue, 1) OVER (ORDER BY month) AS prev_month,
  revenue - LAG(revenue, 1) OVER (ORDER BY month) AS mom_change
FROM monthly_revenue;

Interview note: LAG looks backward, LEAD looks forward. Second arg is offset (default 1). Third arg is default for missing rows (default NULL).

Running Total / Moving Average

SUM(amount) OVER (ORDER BY date ROWS UNBOUNDED PRECEDING) AS running_total,
AVG(amount) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg_7d

Interview note: Frame clause (ROWS BETWEEN) controls which rows the window includes. This syntax trips people up most.

NTILE

NTILE(4) OVER (ORDER BY total_spend DESC) AS quartile

Interview note: Divides rows into N roughly equal buckets. NTILE(4) gives quartiles, NTILE(100) gives percentiles.

FIRST_VALUE / LAST_VALUE

FIRST_VALUE(event_type) OVER (
  PARTITION BY user_id ORDER BY event_time
  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS first_action

Interview note: LAST_VALUE requires explicit frame to UNBOUNDED FOLLOWING. Without it, the default frame ends at CURRENT ROW. Classic trap.

CTEs (Common Table Expressions)

4.9%

Under 5% as standalone topic, but interviewers expect CTEs whenever a query gets complex. They test readability and decomposition.

Basic CTE

WITH active_users AS (
  SELECT user_id, COUNT(*) AS sessions
  FROM sessions
  WHERE created_at >= CURRENT_DATE - INTERVAL '30 days'
  GROUP BY user_id
)
SELECT u.name, a.sessions
FROM users u JOIN active_users a ON u.id = a.user_id;

Interview note: Named temporary result. Reference it like a table. Multiple CTEs separated by commas, not multiple WITH keywords.

Recursive CTE

WITH RECURSIVE date_spine AS (
  SELECT DATE '2024-01-01' AS dt
  UNION ALL
  SELECT dt + INTERVAL '1 day' FROM date_spine WHERE dt < '2024-12-31'
)
SELECT ds.dt, COALESCE(r.revenue, 0) AS revenue
FROM date_spine ds LEFT JOIN daily_revenue r ON ds.dt = r.day;

Interview note: Anchor + UNION ALL + recursive member. Used for date spines, org charts, graph traversals. Always include termination condition.

CASE WHEN

1.8% standalone, embedded in 40%+ of queries

Rarely the main topic, but woven into most complex queries. Conditional logic connects raw data to business rules.

Searched CASE

CASE
  WHEN salary >= 150000 THEN 'Executive'
  WHEN salary >= 100000 THEN 'Senior'
  WHEN salary >= 60000  THEN 'Mid'
  ELSE 'Junior'
END AS level

Interview note: Conditions evaluated top to bottom; first TRUE wins. Put most restrictive condition first.

Pivoting with CASE

SELECT user_id,
  MAX(CASE WHEN key = 'email' THEN value END) AS email,
  MAX(CASE WHEN key = 'phone' THEN value END) AS phone
FROM user_attributes GROUP BY user_id;

Interview note: Manual pivot. Works in every engine. MAX as outer aggregate because each group has at most one value per key.

Date Functions

Embedded in most analytical queries

Almost every interview involves a time dimension. Syntax varies heavily across engines.

DATE_TRUNC

-- PostgreSQL / Snowflake
DATE_TRUNC('month', order_date)
-- BigQuery
DATE_TRUNC(order_date, MONTH)

Interview note: Truncates to specified precision. Argument order differs between engines. Know your target.

Date Arithmetic

-- PostgreSQL
CURRENT_DATE - INTERVAL '30 days'
-- SQL Server / Snowflake
DATEADD(day, -30, GETDATE())
-- DATEDIFF
DATEDIFF(day, start_date, end_date)

Interview note: Syntax is completely engine-specific. INTERVAL is more readable. DATEADD/DATEDIFF are more portable across cloud warehouses.

EXTRACT

EXTRACT(YEAR FROM order_date)
EXTRACT(DOW FROM order_date)  -- day of week
EXTRACT(EPOCH FROM ts_col)    -- unix timestamp

Interview note: DOW numbering varies: PostgreSQL Sunday = 0, others differ. Always test or clarify in interviews.

String Functions

Common in data cleaning questions

Parsing, trimming, and pattern matching for raw data ingestion.

CONCAT / ||

-- ANSI: first_name || ' ' || last_name
-- MySQL/SQL Server: CONCAT(first_name, ' ', last_name)

Interview note: || propagates NULL in PostgreSQL. CONCAT in MySQL treats NULL as empty string. Know the difference.

TRIM / LOWER / REPLACE

TRIM(BOTH ' ' FROM raw_input)
LOWER(email)
REPLACE(phone, '-', '')

Interview note: Always TRIM and LOWER before joining on string keys. Untrimmed whitespace causes silent join failures.

SPLIT / REGEXP

-- PostgreSQL: SPLIT_PART('a.b.c', '.', 2) returns 'b'
-- REGEXP_REPLACE(phone, '[^0-9]', '', 'g') -- digits only

Interview note: Engine-specific syntax. These appear in data cleaning questions where source data is messy.

MERGE / Upsert

Tested in pipeline and ETL questions

MERGE powers incremental loading patterns. Critical for idempotent data pipelines.

MERGE (Upsert)

MERGE INTO dim_customer AS target
USING staging AS source ON target.id = source.id
WHEN MATCHED THEN
  UPDATE SET name = source.name, updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN
  INSERT (id, name, created_at)
  VALUES (source.id, source.name, CURRENT_TIMESTAMP);

Interview note: Atomic upsert. Match on key; update if exists, insert if new. PostgreSQL uses INSERT ... ON CONFLICT instead (MERGE added in v15).

INSERT ... ON CONFLICT (PostgreSQL)

INSERT INTO dim_customer (id, name, updated_at)
VALUES (1, 'Alice', NOW())
ON CONFLICT (id) DO UPDATE SET
  name = EXCLUDED.name,
  updated_at = EXCLUDED.updated_at;

Interview note: PostgreSQL's upsert before MERGE was available. EXCLUDED refers to the row that would have been inserted.

Subqueries and EXISTS

Part of CTE/subquery 4.9%

Scalar subqueries, correlated subqueries, and EXISTS/NOT EXISTS. Know when each is the right tool.

EXISTS / NOT EXISTS

-- Customers with no orders (NULL-safe, unlike NOT IN)
SELECT c.name FROM customers c
WHERE NOT EXISTS (
  SELECT 1 FROM orders o WHERE o.customer_id = c.id
);

Interview note: EXISTS is NULL-safe (unlike NOT IN). Prefer NOT EXISTS over NOT IN when the subquery column might contain NULLs.

Scalar Subquery

SELECT name, salary,
  salary - (SELECT AVG(salary) FROM employees) AS diff_from_avg
FROM employees;

Interview note: Returns one value. If it returns multiple rows, the query errors. Use for comparing each row against a global metric.

Set Operations

Occasional

UNION, INTERSECT, EXCEPT for combining result sets.

UNION vs UNION ALL

-- UNION: removes duplicates (slower)
SELECT id FROM table_a UNION SELECT id FROM table_b;
-- UNION ALL: keeps duplicates (faster)
SELECT id FROM table_a UNION ALL SELECT id FROM table_b;

Interview note: Always use UNION ALL unless you need dedup. UNION sorts and deduplicates, which is expensive. Interviewers test this.

EXCEPT (find differences)

SELECT id FROM table_a EXCEPT SELECT id FROM table_b;

Interview note: Rows in A but not B. Called MINUS in Oracle. Useful for reconciliation and migration validation.

SQL Execution Order

SQL does not execute top to bottom. Understanding execution order explains why you cannot use SELECT aliases in WHERE, why HAVING can reference aggregates, and why ORDER BY can use aliases. Interviewers test this directly.

1
FROM / JOINTables identified and joined
2
WHERERows filtered (cannot use SELECT aliases)
3
GROUP BYRemaining rows grouped
4
HAVINGGroups filtered (can use aggregates)
5
SELECTColumns and expressions computed
6
DISTINCTDuplicates removed
7
ORDER BYResults sorted (can use aliases)
8
LIMIT / OFFSETRow count restricted

SQL Interview FAQ

What SQL topics are tested most in data engineering interviews?+
Based on DataDriven platform data: GROUP BY and aggregation (24.5%), JOINs (19.6%), and window functions (15.1%) account for nearly 60% of all SQL questions. CTEs and subqueries add another 4.9%. The remaining questions involve date functions, string manipulation, CASE WHEN, and set operations.
Which SQL engine should I practice on?+
PostgreSQL is the most common interview engine. It is open-source, standards-compliant, and supports window functions, CTEs, and MERGE (v15+). If interviewing at a specific company, check their stack: Meta uses Presto, Google uses BigQuery, Amazon uses Redshift.
Do I need to memorize all SQL syntax?+
No. Focus on patterns, not exact syntax. Interviewers care whether you decompose a problem into the right operations. Minor syntax variations between engines are forgiven. What is not forgiven: using a self-join when a window function is the right tool.
Should I use CTEs or subqueries in interviews?+
CTEs are almost always better in interviews. They make your query readable top-to-bottom, which helps the interviewer follow your logic. Use subqueries only for simple scalar lookups. Never nest more than two levels of subqueries.
How long should a SQL interview answer take?+
Most SQL questions should be solvable in 10 to 15 minutes. If you are taking longer, your approach is too complex. Interviewers allocate 45 to 60 minutes for a SQL round and expect 3 to 4 questions.

Reading Syntax is Not the Same as Writing It

A cheat sheet helps you review. Practice helps you perform. Write real queries against real tables with instant feedback and result comparison.