SQL is the most-tested skill in data engineering interviews, appearing in nearly seven out of ten interview loops. Below: 50+ questions organized by topic, 15 fully worked solutions with code, explanations, common mistakes, and complexity analysis.
Every question below can be practiced with real SQL execution on DataDriven. Write the query, run it against a real database, and see if your output matches.
ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, NTILE, and frame clauses. PARTITION BY is the single most common window function keyword in real DE interviews.
Write a query to find the second-highest salary in each department using window functions.
Calculate a 7-day rolling average of daily revenue using a window frame.
Use LAG to find customers whose order value decreased compared to their previous order.
Rank products by total sales within each category, handling ties with DENSE_RANK.
Use NTILE to divide employees into salary quartiles within each department.
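For the NTILE question, a minimal sketch, assuming an employees table with employee_name, department_id, and salary columns:
SELECT
    employee_name,
    department_id,
    salary,
    NTILE(4) OVER (
        PARTITION BY department_id
        ORDER BY salary DESC
    ) AS salary_quartile  -- 1 = top quartile; bucket sizes differ by at most one row
FROM employees;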
Common Table Expressions for readable multi-step queries. Recursive CTEs for hierarchical data like org charts and category trees.
Write a CTE to find all employees in a management chain starting from the CEO.
Use a recursive CTE to generate a date series for the last 90 days (a sketch follows this list).
Chain multiple CTEs to calculate month-over-month revenue growth rates.
Write a recursive query to find all subcategories of a given parent category.
Use a CTE to deduplicate records, keeping only the most recent entry per user.
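For the date-series question, a PostgreSQL-flavored sketch (CURRENT_DATE and date-plus-integer arithmetic are Postgres idioms; GENERATE_SERIES is the more idiomatic Postgres tool, but the recursive form is what interviewers ask for and ports to engines that lack it):
WITH RECURSIVE dates AS (
    SELECT CURRENT_DATE - 89 AS day   -- anchor: 89 days back, so the series spans 90 days including today
    UNION ALL
    SELECT day + 1                    -- recursive step: add one day until today
    FROM dates
    WHERE day < CURRENT_DATE
)
SELECT day
FROM dates
ORDER BY day;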
INNER JOIN is the second most-tested SQL keyword after GROUP BY. LEFT, RIGHT, FULL OUTER, CROSS, self-joins, and inequality joins round out the category.
Write a self-join to find pairs of employees who share the same manager.
Use a LEFT JOIN to find customers who have never placed an order.
Write an inequality join to match events that occurred within 30 minutes of each other.
Join three tables to find the top-selling product in each region.
Use a CROSS JOIN to generate all possible date-product combinations, then LEFT JOIN to find missing data.
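For the last question, a sketch assuming hypothetical calendar_days, products, and daily_sales tables:
SELECT c.day, p.product_id
FROM calendar_days c
CROSS JOIN products p
LEFT JOIN daily_sales s
    ON s.sale_date = c.day
    AND s.product_id = p.product_id
WHERE s.product_id IS NULL;  -- date-product combinations with no sales row are the gaps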
GROUP BY is the most frequently tested SQL keyword in DE interviews. COUNT, SUM, AVG, MIN, MAX with HAVING and conditional aggregation using CASE WHEN.
Find departments where the average salary exceeds $100K using HAVING.
Calculate the percentage of orders that were returned, grouped by product category.
Use conditional aggregation (CASE WHEN inside SUM) to pivot monthly sales into columns.
Find users who made purchases in at least 3 different months.
Write a query to find the most common pair of products bought together.
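For the product-pairs question, a sketch assuming an order_items table with order_id and product_id; the < condition is the same trick used in the self-join solution later in this guide:
SELECT
    oi1.product_id AS product_a,
    oi2.product_id AS product_b,
    COUNT(*) AS times_bought_together
FROM order_items oi1
JOIN order_items oi2
    ON oi1.order_id = oi2.order_id
    AND oi1.product_id < oi2.product_id  -- drops self-pairs and the (B, A) mirror of (A, B)
GROUP BY oi1.product_id, oi2.product_id
ORDER BY times_bought_together DESC
LIMIT 1;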
Scalar, correlated, and EXISTS subqueries. Critical for complex filtering and comparison logic.
Use a correlated subquery to find each employee's salary as a percentage of their department average.
Write an EXISTS subquery to find products that have never been ordered.
Use a scalar subquery to find orders placed on the day with the highest total revenue (a sketch follows this list).
Write a subquery in the FROM clause to first aggregate, then filter on the aggregated result.
Use NOT EXISTS to find customers who bought product A but never bought product B.
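For the scalar-subquery question, a sketch against the orders table used elsewhere in this guide (LIMIT is PostgreSQL/MySQL syntax; other engines use TOP or FETCH FIRST):
SELECT order_id, customer_id, order_date, total_amount
FROM orders
WHERE order_date = (
    -- scalar subquery: returns exactly one value, the highest-revenue day
    SELECT order_date
    FROM orders
    GROUP BY order_date
    ORDER BY SUM(total_amount) DESC
    LIMIT 1
);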
COALESCE, NULLIF, IS NULL, three-valued logic. Subtle but critical for correctness.
Write a query that correctly handles NULL in a LEFT JOIN where the right side has no match.
Use COALESCE to replace NULL values with a running total from the previous row.
Explain why WHERE column != 'value' does not return rows where column IS NULL.
Use NULLIF to avoid division-by-zero errors in a revenue-per-user calculation.
Write a query to find records where any of five columns contains a NULL value.
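For the five-column NULL check, a sketch with a hypothetical contacts table; there is no elegant portable shortcut, so the OR chain is the standard answer:
SELECT *
FROM contacts
WHERE phone IS NULL
   OR email IS NULL
   OR street IS NULL
   OR city IS NULL
   OR zip IS NULL;
-- PostgreSQL shortcut: WHERE num_nulls(phone, email, street, city, zip) > 0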
DATE_TRUNC, DATE_DIFF, EXTRACT, interval arithmetic. Essential for time-series analysis in data engineering.
Calculate user retention rates at 1, 7, and 30 days after signup.
Use DATE_TRUNC to aggregate daily event data into weekly cohorts.
Find the gap in days between each consecutive login for every user.
Write a query to identify users who were active for at least 3 consecutive days.
Calculate the percentage of monthly active users who were also active the previous month.
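For the month-over-month active-user question, a PostgreSQL-style sketch assuming an events table with user_id and event_timestamp:
WITH monthly_users AS (
    SELECT DISTINCT
        DATE_TRUNC('month', event_timestamp) AS month,
        user_id
    FROM events
)
SELECT
    cur.month,
    COUNT(DISTINCT cur.user_id) AS mau,
    ROUND(
        COUNT(DISTINCT prev.user_id) * 100.0
            / COUNT(DISTINCT cur.user_id),
        1
    ) AS retained_from_prev_month_pct
FROM monthly_users cur
LEFT JOIN monthly_users prev
    ON prev.user_id = cur.user_id
    AND prev.month = cur.month - INTERVAL '1 month'
GROUP BY cur.month
ORDER BY cur.month;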
LIKE, regex, CONCAT, SUBSTRING, SPLIT_PART. Common in data cleaning and ETL pipeline questions.
Extract the domain name from a list of email addresses using string functions.
Use LIKE with wildcards to find products whose names contain a specific pattern.
Split a comma-separated tags column into individual rows.
Write a query to standardize phone numbers into a consistent format.
Use REGEXP_REPLACE to clean HTML tags from a text column.
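For the HTML-cleaning question, a PostgreSQL sketch with assumed posts/body_html names; the pattern is deliberately naive and strips anything between angle brackets, which is usually enough for interview purposes:
SELECT
    post_id,
    REGEXP_REPLACE(body_html, '<[^>]+>', '', 'g') AS body_text  -- 'g' flag replaces every match, not just the first
FROM posts;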
Each solution includes the SQL code, a line-by-line explanation, the most common mistakes candidates make, and complexity analysis. These are the questions that appear most often in real data engineering interviews.
WITH ranked AS (
SELECT
department_id,
employee_name,
salary,
DENSE_RANK() OVER (
PARTITION BY department_id
ORDER BY salary DESC
) AS rnk
FROM employees
)
SELECT department_id, employee_name, salary
FROM ranked
WHERE rnk = 2;
Explanation
DENSE_RANK handles ties correctly: if two employees share the highest salary, the next distinct salary is rank 2, not rank 3. ROW_NUMBER would arbitrarily label one of the tied top earners as 2, and RANK would skip rank 2 entirely and jump to 3. Interviewers test whether you know the difference between all three.
Common Mistakes
Using ROW_NUMBER or RANK where ties matter, and forgetting PARTITION BY, which ranks across the whole company instead of within each department.
Complexity: O(n log n) for the sort within each partition. The CTE adds no extra scan since the optimizer folds it into a single pass.
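To make the three ranking functions concrete, a side-by-side sketch against the same employees table:
SELECT
    employee_name,
    salary,
    ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num,    -- always unique; order among ties is arbitrary
    RANK()       OVER (ORDER BY salary DESC) AS rnk,        -- ties share a rank; a gap follows
    DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rnk   -- ties share a rank; no gap
FROM employees;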
SELECT
order_date,
daily_revenue,
ROUND(
AVG(daily_revenue) OVER (
ORDER BY order_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
), 2
) AS rolling_7d_avg
FROM daily_revenue_summary
ORDER BY order_date;
Explanation
The frame clause ROWS BETWEEN 6 PRECEDING AND CURRENT ROW creates a 7-day window (today + 6 prior). For the first 6 rows, the frame is shorter than 7 days, so the average uses fewer values. To exclude partial windows, wrap in a CTE and filter by row number, as sketched below. Use ROWS, not RANGE: RANGE groups rows with tied ORDER BY values into the same frame position, which changes results when multiple rows share the same date.
Common Mistakes
Writing RANGE instead of ROWS, and reporting the partial averages in the first six rows as if they covered a full week.
Complexity: O(n) with the window function scan. The frame is fixed-width, so AVG computation per row is O(1) amortized.
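A sketch of the partial-window fix described above, using the same daily_revenue_summary table:
WITH rolled AS (
    SELECT
        order_date,
        daily_revenue,
        AVG(daily_revenue) OVER (
            ORDER BY order_date
            ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
        ) AS rolling_7d_avg,
        ROW_NUMBER() OVER (ORDER BY order_date) AS rn
    FROM daily_revenue_summary
)
SELECT order_date, daily_revenue, ROUND(rolling_7d_avg, 2) AS rolling_7d_avg
FROM rolled
WHERE rn >= 7;  -- rows 1-6 have frames covering fewer than 7 days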
-- Wrong version (window functions are not allowed in WHERE):
SELECT
customer_id,
order_date,
order_total,
LAG(order_total) OVER (
PARTITION BY customer_id
ORDER BY order_date
) AS prev_order_total
FROM orders
WHERE order_total < LAG(order_total) OVER (
PARTITION BY customer_id
ORDER BY order_date
);
-- Correct version (cannot use window in WHERE):
WITH with_prev AS (
SELECT
customer_id,
order_date,
order_total,
LAG(order_total) OVER (
PARTITION BY customer_id
ORDER BY order_date
) AS prev_order_total
FROM orders
)
SELECT *
FROM with_prev
WHERE order_total < prev_order_total;
Explanation
Window functions cannot appear in WHERE clauses, because they are evaluated after WHERE in the logical order of query processing. This is the most common mistake for this question. The fix: compute LAG in a CTE or subquery, then filter in the outer query. The first query above is the wrong version that candidates write; the second is the correct approach.
Common Mistakes
Filtering on the window function directly in WHERE, and omitting PARTITION BY customer_id, which compares each order against the previous order from any customer.
Complexity: O(n log n) for the sort per partition. LAG itself is O(1) per row.
WITH ranked AS (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY user_id
ORDER BY updated_at DESC
) AS rn
FROM user_profiles
)
SELECT user_id, email, name, updated_at
FROM ranked
WHERE rn = 1;
Explanation
ROW_NUMBER assigns a unique rank within each user_id partition, ordered by most recent first. Filtering WHERE rn = 1 keeps exactly one row per user. This pattern appears in nearly every DE interview because deduplication is a daily pipeline task. DISTINCT ON (PostgreSQL) is a shortcut, but the CTE pattern works across all SQL engines.
Common Mistakes
Using RANK instead of ROW_NUMBER: ties on updated_at keep multiple rows per user. Ordering ASC instead of DESC keeps the oldest record rather than the most recent.
Complexity: O(n log n) for the sort. Single table scan with window function computation.
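The PostgreSQL shortcut mentioned above, for comparison; DISTINCT ON keeps the first row per user_id under the ORDER BY:
SELECT DISTINCT ON (user_id)
    user_id, email, name, updated_at
FROM user_profiles
ORDER BY user_id, updated_at DESC;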
WITH monthly AS (
SELECT
DATE_TRUNC('month', order_date) AS month,
SUM(total_amount) AS revenue
FROM orders
GROUP BY DATE_TRUNC('month', order_date)
),
with_prev AS (
SELECT
month,
revenue,
LAG(revenue) OVER (ORDER BY month) AS prev_revenue
FROM monthly
)
SELECT
month,
revenue,
prev_revenue,
ROUND(
(revenue - prev_revenue) * 100.0 / NULLIF(prev_revenue, 0),
1
) AS growth_pct
FROM with_prev
ORDER BY month;
Explanation
Three-step pattern: (1) aggregate to monthly grain, (2) use LAG to get previous month, (3) compute percentage change. NULLIF(prev_revenue, 0) prevents division-by-zero when a month had zero revenue; for the first month, prev_revenue is NULL because LAG has no predecessor, so growth_pct is NULL automatically. Chaining CTEs makes each step readable and testable independently.
Common Mistakes
Dividing by prev_revenue without the NULLIF guard, and computing growth at the raw order grain instead of aggregating to months first.
Complexity: O(n log n) for the GROUP BY sort, plus O(m) for the LAG pass over m months.
-- Approach 1: LEFT JOIN + NULL check
SELECT c.customer_id, c.name, c.email
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
WHERE o.customer_id IS NULL;
-- Approach 2: NOT EXISTS (often faster)
SELECT c.customer_id, c.name, c.email
FROM customers c
WHERE NOT EXISTS (
SELECT 1
FROM orders o
WHERE o.customer_id = c.customer_id
);
Explanation
Both approaches find the same result. LEFT JOIN + IS NULL is more intuitive. NOT EXISTS can be faster because it short-circuits: it stops scanning orders as soon as it finds one match. The optimizer in modern engines often produces the same plan for both, but stating the NOT EXISTS alternative shows the interviewer you think about performance.
Common Mistakes
Reaching for NOT IN: if the subquery returns even one NULL customer_id, NOT IN returns zero rows. Also, adding filters on orders columns to the WHERE clause, which silently turns the LEFT JOIN into an INNER JOIN.
Complexity: O(n + m) with a hash join where n = customers, m = orders. The NOT EXISTS variant can be O(n) if orders has an index on customer_id.
SELECT
e1.employee_name AS employee_1,
e2.employee_name AS employee_2,
e1.manager_id
FROM employees e1
JOIN employees e2
ON e1.manager_id = e2.manager_id
AND e1.employee_id < e2.employee_id;
Explanation
The self-join matches employees on the same manager_id. The condition e1.employee_id < e2.employee_id eliminates duplicate pairs (Alice-Bob and Bob-Alice) and self-matches (Alice-Alice). Using < instead of != cuts the result set in half and is more efficient.
Common Mistakes
Using != instead of <, which returns every pair twice, and forgetting that employees with a NULL manager_id drop out of the join because NULL never equals NULL.
Complexity: O(n^2/k) where k is the number of distinct managers. Each group of size g produces g*(g-1)/2 pairs.
SELECT
department_id,
COUNT(*) AS employee_count,
ROUND(AVG(salary), 2) AS avg_salary
FROM employees
GROUP BY department_id
HAVING AVG(salary) > 100000
ORDER BY avg_salary DESC;
Explanation
HAVING filters groups after aggregation. WHERE filters rows before aggregation. This distinction is fundamental. Include COUNT(*) to show the interviewer you think about sample size: a department with 1 employee earning $101K is different from a department with 500 employees averaging $101K.
Common Mistakes
Putting the aggregate condition in WHERE, which is invalid, or pushing row-level filters into HAVING, which forces the engine to aggregate rows it could have discarded earlier.
Complexity: O(n) for a single table scan with hash aggregation. GROUP BY on an indexed column can be faster with sort aggregation.
SELECT
product_id,
SUM(CASE WHEN EXTRACT(MONTH FROM order_date) = 1
THEN total_amount ELSE 0 END) AS jan_sales,
SUM(CASE WHEN EXTRACT(MONTH FROM order_date) = 2
THEN total_amount ELSE 0 END) AS feb_sales,
SUM(CASE WHEN EXTRACT(MONTH FROM order_date) = 3
THEN total_amount ELSE 0 END) AS mar_sales,
SUM(CASE WHEN EXTRACT(MONTH FROM order_date) = 4
THEN total_amount ELSE 0 END) AS apr_sales
FROM orders
WHERE order_date >= '2025-01-01'
AND order_date < '2025-05-01'
GROUP BY product_id
ORDER BY product_id;
Explanation
CASE WHEN inside SUM is the standard cross-engine pivot pattern. Each CASE expression acts as a filter for one month, and SUM aggregates only the matching rows. This avoids PIVOT syntax (which varies between engines) and works everywhere. The pattern extends to any categorical pivot: by status, by region, by type.
Common Mistakes
Trying to isolate each month with WHERE, which filters the whole query rather than one column, and reusing ELSE 0 when switching from SUM to COUNT: COUNT counts the zeroes, so the non-matching branch must return NULL.
Complexity: O(n) single-pass aggregation. Each CASE expression evaluates per row, so total work is O(n * columns).
SELECT DISTINCT c.customer_id, c.name
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
WHERE oi.product_id = 'A'
AND NOT EXISTS (
SELECT 1
FROM orders o2
JOIN order_items oi2 ON o2.order_id = oi2.order_id
WHERE o2.customer_id = c.customer_id
AND oi2.product_id = 'B'
);
Explanation
The outer query finds customers who bought A. The NOT EXISTS subquery checks whether each of those customers also bought B and excludes them. NOT EXISTS short-circuits: it stops scanning as soon as it finds one match, making it efficient for large tables. This pattern is common in interview questions about user segmentation and product analysis.
Common Mistakes
Writing WHERE product_id = 'A' AND product_id = 'B' in a single join, which can never be true for one row, or using NOT IN against a product list that may contain NULLs.
Complexity: O(n * m) worst case where n = orders for product A and m = orders per customer. With an index on (customer_id, product_id), the subquery is O(log m) per customer.
SELECT
campaign_id,
total_revenue,
total_users,
ROUND(
total_revenue / NULLIF(total_users, 0),
2
) AS revenue_per_user
FROM campaign_metrics;
Explanation
NULLIF(total_users, 0) returns NULL when total_users is 0, and division by NULL returns NULL instead of raising an error. This is cleaner than a CASE WHEN and is the idiomatic SQL approach. You can wrap the result in COALESCE(..., 0) if you want zero instead of NULL for display purposes.
Common Mistakes
Handling the zero case after the division instead of before it, and forgetting that the result is NULL rather than 0; wrap in COALESCE when the report needs a zero.
Complexity: O(n) single-pass computation.
WITH signups AS (
SELECT user_id, MIN(event_date) AS signup_date
FROM events
GROUP BY user_id
),
activity AS (
SELECT DISTINCT e.user_id, e.event_date
FROM events e
JOIN signups s ON e.user_id = s.user_id
WHERE e.event_date > s.signup_date
)
SELECT
s.signup_date,
COUNT(DISTINCT s.user_id) AS cohort_size,
COUNT(DISTINCT CASE
WHEN a.event_date = s.signup_date + INTERVAL '1 day'
THEN s.user_id END
) * 100.0 / COUNT(DISTINCT s.user_id) AS d1_retention,
COUNT(DISTINCT CASE
WHEN a.event_date = s.signup_date + INTERVAL '7 days'
THEN s.user_id END
) * 100.0 / COUNT(DISTINCT s.user_id) AS d7_retention,
COUNT(DISTINCT CASE
WHEN a.event_date = s.signup_date + INTERVAL '30 days'
THEN s.user_id END
) * 100.0 / COUNT(DISTINCT s.user_id) AS d30_retention
FROM signups s
LEFT JOIN activity a ON s.user_id = a.user_id
GROUP BY s.signup_date
ORDER BY s.signup_date;
Explanation
Three steps: (1) find each user's signup date, (2) join to their activity history, (3) use conditional COUNT DISTINCT to check if they were active on each target day. LEFT JOIN ensures users with no return visits appear in the cohort count. This is a classic DE interview question because it tests CTEs, JOINs, conditional aggregation, and date arithmetic in one query.
Common Mistakes
Using an INNER JOIN to activity, which drops users who never returned and inflates retention, and mixing up exact-day retention (active on day N) with windowed retention (active within N days).
Complexity: O(n log n) for the GROUP BY in the signups CTE, plus O(n * m) for the join where m is average events per user.
WITH daily_activity AS (
SELECT DISTINCT user_id, DATE(event_timestamp) AS active_date
FROM events
),
with_groups AS (
SELECT
user_id,
active_date,
active_date - ROW_NUMBER() OVER (
PARTITION BY user_id
ORDER BY active_date
) * INTERVAL '1 day' AS grp
FROM daily_activity
)
SELECT user_id, MIN(active_date) AS streak_start,
MAX(active_date) AS streak_end,
COUNT(*) AS streak_days
FROM with_groups
GROUP BY user_id, grp
HAVING COUNT(*) >= 3
ORDER BY user_id, streak_start;
Explanation
The 'gaps and islands' technique: for consecutive dates, subtracting an incrementing integer (via ROW_NUMBER) produces the same value for each group of consecutive days. Grouping by this computed value identifies streaks. This is a senior-level pattern that interviewers love because it tests window functions, date arithmetic, and grouping in one query.
Common Mistakes
Skipping the DISTINCT in the first CTE: multiple events on the same day advance ROW_NUMBER without advancing the date, which breaks the subtraction and splits real streaks.
Complexity: O(n log n) for the DISTINCT and window function sort. The GROUP BY on the computed column is O(n).
-- PostgreSQL / Redshift
SELECT
email,
SPLIT_PART(email, '@', 2) AS domain
FROM users;
-- MySQL
SELECT
email,
SUBSTRING_INDEX(email, '@', -1) AS domain
FROM users;
-- Standard SQL (works everywhere)
SELECT
email,
SUBSTRING(email FROM POSITION('@' IN email) + 1) AS domain
FROM users;
Explanation
Three approaches for different engines. SPLIT_PART is the cleanest in PostgreSQL. The standard SQL approach uses POSITION to find '@' and SUBSTRING to extract everything after it. Interviewers may ask you to also extract the username (before @) or to group by domain to find the most common email providers.
Common Mistakes
Assuming every row contains '@': when it is missing, POSITION returns 0 and SUBSTRING returns the whole string, while SPLIT_PART returns an empty string.
Complexity: O(n * k) where k is average email length. String operations are linear in string length.
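The follow-up the explanation mentions, grouping by domain to find the most common providers (SPLIT_PART syntax as in the PostgreSQL/Redshift variant above):
SELECT
    SPLIT_PART(email, '@', 2) AS domain,
    COUNT(*) AS user_count
FROM users
GROUP BY SPLIT_PART(email, '@', 2)
ORDER BY user_count DESC;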
-- Approach 1: Window function (preferred)
SELECT
employee_name,
department_id,
salary,
ROUND(
salary * 100.0 / AVG(salary) OVER (
PARTITION BY department_id
), 1
) AS pct_of_dept_avg
FROM employees
ORDER BY department_id, pct_of_dept_avg DESC;
-- Approach 2: Correlated subquery
SELECT
e.employee_name,
e.department_id,
e.salary,
ROUND(
e.salary * 100.0 / (
SELECT AVG(e2.salary)
FROM employees e2
WHERE e2.department_id = e.department_id
), 1
) AS pct_of_dept_avg
FROM employees e
ORDER BY e.department_id, pct_of_dept_avg DESC;
Explanation
The window function approach is a single pass: AVG with PARTITION BY computes each department's average inline. The correlated subquery approach runs the inner query once per row, which is O(n^2) without optimization. Show the window function approach first, then mention the subquery as an alternative. This demonstrates you know multiple techniques and can choose the better one.
Common Mistakes
Omitting the * 100.0, which triggers integer division in some engines and truncates the percentage, and writing GROUP BY department_id in the outer query, which collapses the per-employee rows the question asks for.
Complexity: Window function: O(n log n) for the sort. Correlated subquery: O(n * m) where m is department size, but optimizers often transform it to match the window version.
SQL is the most common skill tested in data engineering interviews, appearing in the majority of loops and a significant share of phone screens. The format is predictable: you get a schema description, a business question, and 20-30 minutes to write a query.
Start with JOINs and aggregations. These appear in every interview and are the foundation for harder topics. If GROUP BY and HAVING are second nature, you can focus your limited interview time on the harder parts of the problem.
Master window functions. Window functions are the highest-difficulty topic that appears consistently. Practice until you can write ROW_NUMBER, RANK, LAG, and LEAD without looking up the syntax.
Learn the deduplication pattern cold. ROW_NUMBER + PARTITION BY + WHERE rn = 1 appears in almost every DE interview. It tests CTEs, window functions, and data quality thinking in one question. See solution #4 above.
Practice under time pressure. Knowing the concept is not enough. You need to write correct SQL in under 15 minutes per question. DataDriven's timed drill mode simulates real interview pressure.
Run your code. Reading solutions is not the same as writing them. Every question on DataDriven executes against a real database so you catch syntax errors, NULL edge cases, and logic bugs before the interview.
Write real SQL. Get real results. Know exactly where you stand before the interview.
Quick reference by interview frequency
ROW_NUMBER, RANK, LAG, LEAD
INNER, LEFT, self-joins, anti-joins
Recursive and chained CTEs
Aggregation and HAVING
Cross-tab and UNPIVOT patterns
NULL handling for pipelines
Correlated and scalar subqueries
Continue your prep
50+ guides covering every round, company, role, and technology in the data engineer interview loop. Grounded in 2,817 verified interview reports across 929 companies, collected from real candidates.