Updated April 2026 · By The DataDriven Team
Data Engineering Interview Prep

SQL Interview Questions for Data Engineers

SQL is the most-tested skill in data engineering interviews, appearing in nearly seven out of ten interview loops. Below are 50+ questions organized by topic, plus 15 fully worked solutions with code, explanations, common mistakes, and complexity analysis.

Every question below can be practiced with real SQL execution on DataDriven. Write the query, run it against a real database, and see if your output matches.

1. Window Functions

Difficulty: Medium-Hard · Interview frequency: 15.1% of SQL questions

ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, NTILE, and frame clauses. PARTITION BY is the single most common window function keyword in real DE interviews.
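
For reference, a minimal NTILE sketch along the lines of Q5 below, assuming a hypothetical employees table with department_id and salary columns:

SELECT
  employee_name,
  department_id,
  salary,
  NTILE(4) OVER (
    PARTITION BY department_id
    ORDER BY salary DESC
  ) AS salary_quartile  -- 1 = top quartile within the department
FROM employees;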

Q1

Write a query to find the second-highest salary in each department using window functions.

Q2

Calculate a 7-day rolling average of daily revenue using a window frame.

Q3

Use LAG to find customers whose order value decreased compared to their previous order.

Q4

Rank products by total sales within each category, handling ties with DENSE_RANK.

Q5

Use NTILE to divide employees into salary quartiles within each department.

2. CTEs & Recursive Queries

Difficulty: Medium-Hard · Interview frequency: High

Common Table Expressions for readable multi-step queries. Recursive CTEs for hierarchical data like org charts and category trees.
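
As a syntax reference for the recursive pattern, here is a sketch of the Q1 org-chart traversal below, assuming a hypothetical employees table where the CEO has a NULL manager_id:

WITH RECURSIVE chain AS (
  -- Anchor member: start at the CEO
  SELECT employee_id, employee_name, manager_id, 1 AS depth
  FROM employees
  WHERE manager_id IS NULL

  UNION ALL

  -- Recursive member: add each level's direct reports
  SELECT e.employee_id, e.employee_name, e.manager_id, c.depth + 1
  FROM employees e
  JOIN chain c ON e.manager_id = c.employee_id
)
SELECT * FROM chain
ORDER BY depth;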

Q1

Write a CTE to find all employees in a management chain starting from the CEO.

Q2

Use a recursive CTE to generate a date series for the last 90 days.

Q3

Chain multiple CTEs to calculate month-over-month revenue growth rates.

Q4

Write a recursive query to find all subcategories of a given parent category.

Q5

Use a CTE to deduplicate records, keeping only the most recent entry per user.

3. JOINs & Self-Joins

Difficulty: Medium · Interview frequency: 19.6% of SQL questions

INNER JOIN is the second most-tested SQL keyword after GROUP BY. LEFT, RIGHT, FULL OUTER, CROSS, self-joins, and inequality joins round out the category.
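
Inequality joins are the least practiced of these; a sketch of the Q3 pattern below, assuming a hypothetical events table and PostgreSQL interval syntax:

SELECT
  a.event_id AS first_event,
  b.event_id AS second_event,
  b.event_time - a.event_time AS gap
FROM events a
JOIN events b
  -- b strictly follows a, within 30 minutes; ties at the exact
  -- same timestamp would need >= plus an id tiebreak
  ON b.event_time > a.event_time
  AND b.event_time <= a.event_time + INTERVAL '30 minutes';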

Q1

Write a self-join to find pairs of employees who share the same manager.

Q2

Use a LEFT JOIN to find customers who have never placed an order.

Q3

Write an inequality join to match events that occurred within 30 minutes of each other.

Q4

Join three tables to find the top-selling product in each region.

Q5

Use a CROSS JOIN to generate all possible date-product combinations, then LEFT JOIN to find missing data.

4. Aggregations & GROUP BY

Difficulty: Easy-Medium · Interview frequency: 24.5% of SQL questions

GROUP BY is the most frequently tested SQL keyword in DE interviews. COUNT, SUM, AVG, MIN, MAX with HAVING and conditional aggregation using CASE WHEN.

Q1

Find departments where the average salary exceeds $100K using HAVING.

Q2

Calculate the percentage of orders that were returned, grouped by product category.

Q3

Use conditional aggregation (CASE WHEN inside SUM) to pivot monthly sales into columns.

Q4

Find users who made purchases in at least 3 different months.

Q5

Write a query to find the most common pair of products bought together.

5. Subqueries

Difficulty: Medium-Hard · Interview frequency: High

Scalar, correlated, and EXISTS subqueries. Critical for complex filtering and comparison logic.

Q1

Use a correlated subquery to find each employee's salary as a percentage of their department average.

Q2

Write an EXISTS subquery to find products that have never been ordered.

Q3

Use a scalar subquery to find orders placed on the day with the highest total revenue.

Q4

Write a subquery in the FROM clause to first aggregate, then filter on the aggregated result.

Q5

Use NOT EXISTS to find customers who bought product A but never bought product B.

6. NULL Handling

Difficulty: Easy-Medium · Interview frequency: Medium

COALESCE, NULLIF, IS NULL, three-valued logic. Subtle but critical for correctness.
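
Q3 below hinges on three-valued logic: any comparison with NULL evaluates to UNKNOWN, and WHERE keeps only rows that evaluate to TRUE. A sketch of the fix, assuming a hypothetical table t with a status column:

-- Drops NULL rows: NULL != 'shipped' is UNKNOWN, not TRUE
SELECT * FROM t WHERE status != 'shipped';

-- Keeps them: handle NULL explicitly
SELECT * FROM t WHERE status != 'shipped' OR status IS NULL;

-- PostgreSQL alternative: IS DISTINCT FROM treats NULL as comparable
SELECT * FROM t WHERE status IS DISTINCT FROM 'shipped';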

Q1

Write a query that correctly handles NULL in a LEFT JOIN where the right side has no match.

Q2

Use COALESCE to replace NULL values with a running total from the previous row.

Q3

Explain why WHERE column != 'value' does not return rows where column IS NULL.

Q4

Use NULLIF to avoid division-by-zero errors in a revenue-per-user calculation.

Q5

Write a query to find records where any of five columns contains a NULL value.

7. Date & Time Functions

Difficulty: Medium · Interview frequency: High

DATE_TRUNC, DATE_DIFF, EXTRACT, interval arithmetic. Essential for time-series analysis in data engineering.

Q1

Calculate user retention rates at 1, 7, and 30 days after signup.

Q2

Use DATE_TRUNC to aggregate daily event data into weekly cohorts.

Q3

Find the gap in days between each consecutive login for every user.

Q4

Write a query to identify users who were active for at least 3 consecutive days.

Q5

Calculate the percentage of monthly active users who were also active the previous month.

8. String Functions & Pattern Matching

Difficulty: Easy-Medium · Interview frequency: Medium

LIKE, regex, CONCAT, SUBSTRING, SPLIT_PART. Common in data cleaning and ETL pipeline questions.
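
Splitting a delimited column into rows (Q3 below) has no single standard syntax; a PostgreSQL sketch, assuming a hypothetical posts table with a comma-separated tags column:

SELECT
  post_id,
  TRIM(tag) AS tag  -- trim whitespace left over from ', ' separators
FROM posts,
  UNNEST(STRING_TO_ARRAY(tags, ',')) AS tag;

MySQL has no UNNEST; there you would typically reach for JSON_TABLE (8.0+) or a numbers table.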

Q1

Extract the domain name from a list of email addresses using string functions.

Q2

Use LIKE with wildcards to find products whose names contain a specific pattern.

Q3

Split a comma-separated tags column into individual rows.

Q4

Write a query to standardize phone numbers into a consistent format.

Q5

Use REGEXP_REPLACE to clean HTML tags from a text column.

15 Worked Solutions with Full Explanations

Each solution includes the SQL code, a line-by-line explanation, the most common mistakes candidates make, and complexity analysis. These are the questions that appear most often in real data engineering interviews.

Window Functions · Medium

1. Find the second-highest salary in each department.

WITH ranked AS (
  SELECT
    department_id,
    employee_name,
    salary,
    DENSE_RANK() OVER (
      PARTITION BY department_id
      ORDER BY salary DESC
    ) AS rnk
  FROM employees
)
SELECT department_id, employee_name, salary
FROM ranked
WHERE rnk = 2;

Explanation

DENSE_RANK handles ties correctly: if two employees share the highest salary, the next distinct salary is rank 2, not rank 3. ROW_NUMBER would arbitrarily pick one of the tied top earners as rank 2. RANK would leave a gap (1, 1, 3), so rnk = 2 might match nothing. Interviewers test whether you know the difference between all three.

Common Mistakes

  • Using ROW_NUMBER instead of DENSE_RANK. ROW_NUMBER assigns arbitrary order to ties, so you might miss the true second-highest.
  • Using RANK instead of DENSE_RANK. RANK skips numbers after ties (1, 1, 3), so filtering WHERE rnk = 2 returns nothing if the top salary is tied.
  • Forgetting PARTITION BY. Without it, you get the second-highest across the entire company, not per department.

Complexity: O(n log n) for the sort within each partition. The CTE adds no extra scan since the optimizer folds it into a single pass.

Window Functions · Medium

2. Calculate a 7-day rolling average of daily revenue.

SELECT
  order_date,
  daily_revenue,
  ROUND(
    AVG(daily_revenue) OVER (
      ORDER BY order_date
      ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ), 2
  ) AS rolling_7d_avg
FROM daily_revenue_summary
ORDER BY order_date;

Explanation

The frame clause ROWS BETWEEN 6 PRECEDING AND CURRENT ROW creates a 7-day window (today + 6 prior). For the first 6 rows, the frame is shorter than 7 days, so the average uses fewer values. To exclude partial windows, wrap in a CTE and filter by row number. Use ROWS, not RANGE: RANGE groups rows that share the same ORDER BY value, which changes behavior when two rows carry the same date.
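
A sketch of the partial-window fix mentioned above, using ROW_NUMBER to keep only rows with a full 7-day frame:

WITH windowed AS (
  SELECT
    order_date,
    daily_revenue,
    AVG(daily_revenue) OVER (
      ORDER BY order_date
      ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS rolling_7d_avg,
    ROW_NUMBER() OVER (ORDER BY order_date) AS rn
  FROM daily_revenue_summary
)
SELECT order_date, daily_revenue, ROUND(rolling_7d_avg, 2) AS rolling_7d_avg
FROM windowed
WHERE rn >= 7;  -- the first 6 rows have incomplete windows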

Common Mistakes

  • Using RANGE instead of ROWS. RANGE groups rows with the same ORDER BY value, producing different results when dates have duplicates or gaps.
  • Using ROWS BETWEEN 7 PRECEDING instead of 6 PRECEDING. That gives an 8-day window (7 prior + current).
  • Not addressing partial windows in the first 6 rows. The interviewer will ask about this edge case.

Complexity: O(n) with the window function scan. The frame is fixed-width, so AVG computation per row is O(1) amortized.

Window Functions · Medium

3. Use LAG to find customers whose order value decreased from their previous order.

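-- Wrong version (shown for contrast): a window function cannot appear in WHERE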
SELECT
  customer_id,
  order_date,
  order_total,
  LAG(order_total) OVER (
    PARTITION BY customer_id
    ORDER BY order_date
  ) AS prev_order_total
FROM orders
WHERE order_total < LAG(order_total) OVER (
  PARTITION BY customer_id
  ORDER BY order_date
);

-- Correct version (cannot use window in WHERE):
WITH with_prev AS (
  SELECT
    customer_id,
    order_date,
    order_total,
    LAG(order_total) OVER (
      PARTITION BY customer_id
      ORDER BY order_date
    ) AS prev_order_total
  FROM orders
)
SELECT *
FROM with_prev
WHERE order_total < prev_order_total;

Explanation

Window functions cannot appear in WHERE clauses. This is the most common mistake for this question. The fix: compute LAG in a CTE or subquery, then filter in the outer query. The first query above is the wrong version that candidates write. The second is the correct approach.

Common Mistakes

  • Putting LAG directly in the WHERE clause. SQL evaluates WHERE before window functions, so this is a syntax error.
  • Forgetting PARTITION BY customer_id. Without it, LAG looks at the previous row globally, mixing customers together.
  • Not handling the first order per customer (where prev_order_total is NULL). The WHERE filter naturally excludes these, but state this explicitly.

Complexity: O(n log n) for the sort per partition. LAG itself is O(1) per row.

CTEs · Medium

4. Deduplicate records, keeping only the most recent entry per user.

WITH ranked AS (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY user_id
      ORDER BY updated_at DESC
    ) AS rn
  FROM user_profiles
)
SELECT user_id, email, name, updated_at
FROM ranked
WHERE rn = 1;

Explanation

ROW_NUMBER assigns a unique rank within each user_id partition, ordered by most recent first. Filtering WHERE rn = 1 keeps exactly one row per user. This pattern appears in nearly every DE interview because deduplication is a daily pipeline task. DISTINCT ON (PostgreSQL) is a shortcut, but the CTE pattern works across all SQL engines.
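
For reference, the PostgreSQL-only shortcut mentioned above:

-- DISTINCT ON keeps the first row per user_id in ORDER BY order,
-- so sort by updated_at DESC to keep the most recent entry
SELECT DISTINCT ON (user_id)
  user_id, email, name, updated_at
FROM user_profiles
ORDER BY user_id, updated_at DESC;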

Common Mistakes

  • Using DISTINCT instead of ROW_NUMBER. DISTINCT deduplicates entire rows but does not let you choose which duplicate to keep.
  • Using GROUP BY with MAX(updated_at). This gets the right timestamp but drops other columns unless you add them all to GROUP BY or re-join.
  • Ordering ASC instead of DESC. You want the most recent first, so DESC on the timestamp.

Complexity: O(n log n) for the sort. Single table scan with window function computation.

CTEs · Medium-Hard

5. Calculate month-over-month revenue growth rates.

WITH monthly AS (
  SELECT
    DATE_TRUNC('month', order_date) AS month,
    SUM(total_amount) AS revenue
  FROM orders
  GROUP BY DATE_TRUNC('month', order_date)
),
with_prev AS (
  SELECT
    month,
    revenue,
    LAG(revenue) OVER (ORDER BY month) AS prev_revenue
  FROM monthly
)
SELECT
  month,
  revenue,
  prev_revenue,
  ROUND(
    (revenue - prev_revenue) * 100.0 / NULLIF(prev_revenue, 0),
    1
  ) AS growth_pct
FROM with_prev
ORDER BY month;

Explanation

Three-step pattern: (1) aggregate to monthly grain, (2) use LAG to get the previous month, (3) compute the percentage change. NULLIF(prev_revenue, 0) prevents division-by-zero when a month has zero revenue; for the first month, prev_revenue is NULL and division by NULL simply returns NULL. Chaining CTEs makes each step readable and testable independently.

Common Mistakes

  • Dividing by prev_revenue without handling zero or NULL. The first month has no predecessor (NULL), and a month with zero revenue causes division-by-zero.
  • Using EXTRACT(MONTH FROM ...) instead of DATE_TRUNC. EXTRACT returns the month number (1-12), which collapses January 2025 and January 2026 into the same bucket.
  • Trying to do everything in one query without CTEs. It works but is hard to read and debug. Interviewers penalize unreadable code.

Complexity: O(n log n) for the GROUP BY sort, plus O(m) for the LAG pass over m months.

JOINs · Easy-Medium

6. Find customers who have never placed an order.

-- Approach 1: LEFT JOIN + NULL check
SELECT c.customer_id, c.name, c.email
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
WHERE o.customer_id IS NULL;

-- Approach 2: NOT EXISTS (often faster)
SELECT c.customer_id, c.name, c.email
FROM customers c
WHERE NOT EXISTS (
  SELECT 1
  FROM orders o
  WHERE o.customer_id = c.customer_id
);

Explanation

Both approaches find the same result. LEFT JOIN + IS NULL is more intuitive. NOT EXISTS can be faster because it short-circuits: it stops scanning orders as soon as it finds one match. The optimizer in modern engines often produces the same plan for both, but stating the NOT EXISTS alternative shows the interviewer you think about performance.

Common Mistakes

  • Checking WHERE o.order_id IS NULL instead of WHERE o.customer_id IS NULL. Both work if customer_id is never NULL in orders, but checking the join key is semantically clearer.
  • Using NOT IN (SELECT customer_id FROM orders). This fails silently if any customer_id in orders is NULL, because NOT IN with NULLs returns no rows.
  • Forgetting that LEFT JOIN preserves all left rows. Candidates sometimes use INNER JOIN and wonder why they get zero results.

Complexity: O(n + m) with a hash join where n = customers, m = orders. The NOT EXISTS variant can be O(n) if orders has an index on customer_id.

JOINs · Medium

7. Write a self-join to find pairs of employees who share the same manager.

SELECT
  e1.employee_name AS employee_1,
  e2.employee_name AS employee_2,
  e1.manager_id
FROM employees e1
JOIN employees e2
  ON e1.manager_id = e2.manager_id
  AND e1.employee_id < e2.employee_id;

Explanation

The self-join matches employees on the same manager_id. The condition e1.employee_id < e2.employee_id eliminates duplicate pairs (Alice-Bob and Bob-Alice) and self-matches (Alice-Alice). Using < instead of != cuts the result set in half and is more efficient.

Common Mistakes

  • Using != instead of <. This produces both (Alice, Bob) and (Bob, Alice). The interviewer will ask you to deduplicate.
  • Not excluding self-matches. Without the inequality condition, every employee pairs with themselves.
  • Forgetting to handle employees with no manager (NULL manager_id). NULL = NULL evaluates to UNKNOWN, not TRUE, so they are naturally excluded. State this explicitly.

Complexity: O(n^2/k) where k is the number of distinct managers. Each group of size g produces g*(g-1)/2 pairs.

Aggregations · Easy

8. Find departments where the average salary exceeds $100K.

SELECT
  department_id,
  COUNT(*) AS employee_count,
  ROUND(AVG(salary), 2) AS avg_salary
FROM employees
GROUP BY department_id
HAVING AVG(salary) > 100000
ORDER BY avg_salary DESC;

Explanation

HAVING filters groups after aggregation. WHERE filters rows before aggregation. This distinction is fundamental. Include COUNT(*) to show the interviewer you think about sample size: a department with 1 employee earning $101K is different from a department with 500 employees averaging $101K.

Common Mistakes

  • Using WHERE AVG(salary) > 100000. Aggregate functions cannot appear in WHERE. This is a syntax error.
  • Using HAVING avg_salary > 100000 with an alias. Some SQL engines do not allow aliases in HAVING. Use the full expression to be safe.
  • Not including COUNT(*) in the output. The interviewer may follow up with 'how reliable is this average?' and sample size is the answer.

Complexity: O(n) for a single table scan with hash aggregation. GROUP BY on an indexed column can be faster with sort aggregation.

Aggregations · Medium

9. Use conditional aggregation to pivot monthly sales into columns.

SELECT
  product_id,
  SUM(CASE WHEN EXTRACT(MONTH FROM order_date) = 1
    THEN total_amount ELSE 0 END) AS jan_sales,
  SUM(CASE WHEN EXTRACT(MONTH FROM order_date) = 2
    THEN total_amount ELSE 0 END) AS feb_sales,
  SUM(CASE WHEN EXTRACT(MONTH FROM order_date) = 3
    THEN total_amount ELSE 0 END) AS mar_sales,
  SUM(CASE WHEN EXTRACT(MONTH FROM order_date) = 4
    THEN total_amount ELSE 0 END) AS apr_sales
FROM orders
WHERE order_date >= '2025-01-01'
  AND order_date < '2025-05-01'
GROUP BY product_id
ORDER BY product_id;

Explanation

CASE WHEN inside SUM is the standard cross-engine pivot pattern. Each CASE expression acts as a filter for one month, and SUM aggregates only the matching rows. This avoids PIVOT syntax (which varies between engines) and works everywhere. The pattern extends to any categorical pivot: by status, by region, by type.
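
In PostgreSQL, the FILTER clause expresses the same idea more directly; a sketch of two of the columns (note that FILTER yields NULL rather than 0 when no rows match):

SELECT
  product_id,
  SUM(total_amount) FILTER (WHERE EXTRACT(MONTH FROM order_date) = 1) AS jan_sales,
  SUM(total_amount) FILTER (WHERE EXTRACT(MONTH FROM order_date) = 2) AS feb_sales
FROM orders
WHERE order_date >= '2025-01-01'
  AND order_date < '2025-03-01'
GROUP BY product_id;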

Common Mistakes

  • Using ELSE NULL instead of ELSE 0. A product with no sales in a month then sums nothing but NULLs and returns NULL; ELSE 0 guarantees a clean 0 in the output.
  • Forgetting the date filter in WHERE. Without it, you aggregate all years and January 2024 mixes with January 2025.
  • Hardcoding months. This only works for a fixed, known set of columns; a dynamic number of months needs engine-specific PIVOT syntax or an application-side transformation. State this limitation.

Complexity: O(n) single-pass aggregation. Each CASE expression evaluates per row, so total work is O(n * columns).

Subqueries · Medium

10. Find customers who bought product A but never bought product B.

SELECT DISTINCT c.customer_id, c.name
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
WHERE oi.product_id = 'A'
  AND NOT EXISTS (
    SELECT 1
    FROM orders o2
    JOIN order_items oi2 ON o2.order_id = oi2.order_id
    WHERE o2.customer_id = c.customer_id
      AND oi2.product_id = 'B'
  );

Explanation

The outer query finds customers who bought A. The NOT EXISTS subquery checks whether each of those customers also bought B and excludes them. NOT EXISTS short-circuits: it stops scanning as soon as it finds one match, making it efficient for large tables. This pattern is common in interview questions about user segmentation and product analysis.

Common Mistakes

  • Using NOT IN instead of NOT EXISTS. NOT IN fails when the subquery returns any NULL values, silently returning zero rows.
  • Forgetting DISTINCT. A customer who bought product A three times appears three times without it.
  • Using LEFT JOIN + IS NULL instead of NOT EXISTS. Both work, but the NOT EXISTS version is clearer about intent and often has a better query plan.

Complexity: O(n * m) worst case where n = orders for product A and m = orders per customer. With an index on (customer_id, product_id), the subquery is O(log m) per customer.

NULL Handling · Easy-Medium

11. Use NULLIF to avoid division-by-zero in revenue per user.

SELECT
  campaign_id,
  total_revenue,
  total_users,
  ROUND(
    total_revenue / NULLIF(total_users, 0),
    2
  ) AS revenue_per_user
FROM campaign_metrics;

Explanation

NULLIF(total_users, 0) returns NULL when total_users is 0, and division by NULL returns NULL instead of raising an error. This is cleaner than a CASE WHEN and is the idiomatic SQL approach. You can wrap the result in COALESCE(..., 0) if you want zero instead of NULL for display purposes.
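
The COALESCE variant mentioned above, for display purposes:

SELECT
  campaign_id,
  COALESCE(
    ROUND(total_revenue / NULLIF(total_users, 0), 2),
    0  -- show 0 instead of NULL when total_users is 0
  ) AS revenue_per_user
FROM campaign_metrics;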

Common Mistakes

  • Not handling division by zero at all. This crashes the query in most SQL engines.
  • Using CASE WHEN total_users = 0 THEN NULL ELSE total_revenue / total_users END. This works, but NULLIF is shorter and more idiomatic.
  • Using COALESCE on the wrong value. COALESCE(total_users, 1) replaces NULL with 1 but leaves 0 untouched, so the division still fails. NULLIF targets the zero specifically.

Complexity: O(n) single-pass computation.

Dates · Hard

12. Calculate user retention at 1, 7, and 30 days after signup.

WITH signups AS (
  SELECT user_id, MIN(event_date) AS signup_date
  FROM events
  GROUP BY user_id
),
activity AS (
  SELECT DISTINCT e.user_id, e.event_date
  FROM events e
  JOIN signups s ON e.user_id = s.user_id
  WHERE e.event_date > s.signup_date
)
SELECT
  s.signup_date,
  COUNT(DISTINCT s.user_id) AS cohort_size,
  COUNT(DISTINCT CASE
    WHEN a.event_date = s.signup_date + INTERVAL '1 day'
    THEN s.user_id END
  ) * 100.0 / COUNT(DISTINCT s.user_id) AS d1_retention,
  COUNT(DISTINCT CASE
    WHEN a.event_date = s.signup_date + INTERVAL '7 days'
    THEN s.user_id END
  ) * 100.0 / COUNT(DISTINCT s.user_id) AS d7_retention,
  COUNT(DISTINCT CASE
    WHEN a.event_date = s.signup_date + INTERVAL '30 days'
    THEN s.user_id END
  ) * 100.0 / COUNT(DISTINCT s.user_id) AS d30_retention
FROM signups s
LEFT JOIN activity a ON s.user_id = a.user_id
GROUP BY s.signup_date
ORDER BY s.signup_date;

Explanation

Three steps: (1) find each user's signup date, (2) join to their activity history, (3) use conditional COUNT DISTINCT to check if they were active on each target day. LEFT JOIN ensures users with no return visits appear in the cohort count. This is a classic DE interview question because it tests CTEs, JOINs, conditional aggregation, and date arithmetic in one query.

Common Mistakes

  • Measuring a window instead of an exact day. If you check event_date BETWEEN signup + 1 AND signup + 7, you measure 'returned within 7 days' instead of 'returned on day 7.'
  • Forgetting LEFT JOIN. INNER JOIN drops users who never returned, deflating your denominator and inflating retention rates.
  • Not using COUNT(DISTINCT ...). A user with 5 events on day 1 should count as 1 retained user, not 5.

Complexity: O(n log n) for the GROUP BY in the signups CTE, plus O(n * m) for the join where m is average events per user.

Dates · Hard

13. Find users who were active for at least 3 consecutive days.

WITH daily_activity AS (
  SELECT DISTINCT user_id, DATE(event_timestamp) AS active_date
  FROM events
),
with_groups AS (
  SELECT
    user_id,
    active_date,
    active_date - ROW_NUMBER() OVER (
      PARTITION BY user_id
      ORDER BY active_date
    ) * INTERVAL '1 day' AS grp
  FROM daily_activity
)
SELECT user_id, MIN(active_date) AS streak_start,
  MAX(active_date) AS streak_end,
  COUNT(*) AS streak_days
FROM with_groups
GROUP BY user_id, grp
HAVING COUNT(*) >= 3
ORDER BY user_id, streak_start;

Explanation

The 'gaps and islands' technique: for consecutive dates, subtracting an incrementing integer (via ROW_NUMBER) produces the same value for each group of consecutive days. Grouping by this computed value identifies streaks. This is a senior-level pattern that interviewers love because it tests window functions, date arithmetic, and grouping in one query.
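
To see why the subtraction works, trace a hypothetical user active on March 1, 2, 3, and 5:

-- active_date   rn   grp = active_date - rn * INTERVAL '1 day'
-- 2025-03-01    1    2025-02-28  <- streak starts
-- 2025-03-02    2    2025-02-28  <- same grp: consecutive
-- 2025-03-03    3    2025-02-28  <- same grp: 3-day streak
-- 2025-03-05    4    2025-03-01  <- the gap on March 4 shifts grp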

Common Mistakes

  • Not deduplicating daily activity first. A user with multiple events on the same day creates gaps in the ROW_NUMBER sequence.
  • Using LAG to check consecutive days. LAG only compares adjacent rows, so finding streaks of 3+ requires nested checks. The ROW_NUMBER subtraction method scales to any streak length.
  • Integer subtraction instead of interval subtraction. The exact syntax depends on the engine (PostgreSQL uses intervals, MySQL uses DATE_SUB).

Complexity: O(n log n) for the DISTINCT and window function sort. The GROUP BY on the computed column is O(n).

Strings · Easy

14. Extract the domain from email addresses.

-- PostgreSQL / Redshift
SELECT
  email,
  SPLIT_PART(email, '@', 2) AS domain
FROM users;

-- MySQL
SELECT
  email,
  SUBSTRING_INDEX(email, '@', -1) AS domain
FROM users;

-- Standard SQL (works everywhere)
SELECT
  email,
  SUBSTRING(email FROM POSITION('@' IN email) + 1) AS domain
FROM users;

Explanation

Three approaches for different engines. SPLIT_PART is the cleanest in PostgreSQL. The standard SQL approach uses POSITION to find '@' and SUBSTRING to extract everything after it. Interviewers may ask you to also extract the username (before @) or to group by domain to find the most common email providers.
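
A sketch of that follow-up (most common providers), building on the PostgreSQL version and lowercasing the domain per the note below:

SELECT
  LOWER(SPLIT_PART(email, '@', 2)) AS domain,
  COUNT(*) AS user_count
FROM users
GROUP BY LOWER(SPLIT_PART(email, '@', 2))  -- repeat the expression: not all engines allow aliases here
ORDER BY user_count DESC
LIMIT 10;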

Common Mistakes

  • Assuming all emails have exactly one '@'. Production data is messy. An edge case: emails with no '@' at all. SPLIT_PART returns an empty string, POSITION returns 0.
  • Not clarifying engine differences. Ask which engine you are writing for; if the interviewer leaves it open, use the standard SQL version.
  • Forgetting LOWER() for case-insensitive domain comparison. 'Gmail.com' and 'gmail.com' should be the same domain.

Complexity: O(n * k) where k is average email length. String operations are linear in string length.

Subqueries · Medium

15. Find each employee's salary as a percentage of their department average.

-- Approach 1: Window function (preferred)
SELECT
  employee_name,
  department_id,
  salary,
  ROUND(
    salary * 100.0 / AVG(salary) OVER (
      PARTITION BY department_id
    ), 1
  ) AS pct_of_dept_avg
FROM employees
ORDER BY department_id, pct_of_dept_avg DESC;

-- Approach 2: Correlated subquery
SELECT
  e.employee_name,
  e.department_id,
  e.salary,
  ROUND(
    e.salary * 100.0 / (
      SELECT AVG(e2.salary)
      FROM employees e2
      WHERE e2.department_id = e.department_id
    ), 1
  ) AS pct_of_dept_avg
FROM employees e
ORDER BY e.department_id, pct_of_dept_avg DESC;

Explanation

The window function approach is a single pass: AVG with PARTITION BY computes each department's average inline. The correlated subquery approach runs the inner query once per row, which is O(n^2) without optimization. Show the window function approach first, then mention the subquery as an alternative. This demonstrates you know multiple techniques and can choose the better one.

Common Mistakes

  • Using the correlated subquery when a window function is simpler. The interviewer wants to see that you reach for the right tool.
  • Dividing by the overall average instead of the department average. Forgetting PARTITION BY or the WHERE clause in the subquery.
  • Using integer division. Without * 100.0, you get integer truncation (e.g., 0 instead of 95.3).

Complexity: Window function: O(n log n) for the sort. Correlated subquery: O(n * m) where m is department size, but optimizers often transform it to match the window version.

How to Prepare for SQL Data Engineering Interviews

SQL is the most common skill tested in data engineering interviews, appearing in the majority of loops and a significant share of phone screens. The format is predictable: you get a schema description, a business question, and 20-30 minutes to write a query.

Start with JOINs and aggregations. These appear in every interview and are the foundation for harder topics. If GROUP BY and HAVING are second nature, you can focus your limited interview time on the harder parts of the problem.

Master window functions. Window functions are the highest-difficulty topic that appears consistently. Practice until you can write ROW_NUMBER, RANK, LAG, and LEAD without looking up the syntax.

Learn the deduplication pattern cold. ROW_NUMBER + PARTITION BY + WHERE rn = 1 appears in almost every DE interview. It tests CTEs, window functions, and data quality thinking in one question. See solution #4 above.

Practice under time pressure. Knowing the concept is not enough. You need to write correct SQL in under 15 minutes per question. DataDriven's timed drill mode simulates real interview pressure.

Run your code. Reading solutions is not the same as writing them. Every question on DataDriven executes against a real database so you catch syntax errors, NULL edge cases, and logic bugs before the interview.

SQL Interview Questions FAQ

What SQL topics are most commonly tested in data engineering interviews?
Based on DataDriven's analysis of verified interview data, the three most commonly tested SQL topics are aggregation (24.5%), JOINs (19.6%), and window functions (15.1%). GROUP BY and INNER JOIN are the two most frequent individual keywords. Most interviews test 2-3 of these topics in a single 45-minute round.

How many SQL questions should I practice before a data engineering interview?
Solve at least 50 problems across window functions, JOINs, CTEs, and aggregations. Quality matters more than quantity. Focus on understanding patterns rather than memorizing solutions. If you can solve a medium-difficulty window function problem in under 12 minutes without looking up syntax, you are in good shape.

What is the difference between SQL questions for data engineers vs. data analysts?
Data engineering SQL questions emphasize performance-aware queries, CTEs for pipeline logic, window functions for complex transformations, and handling edge cases (NULLs, duplicates, time zones). Analyst questions lean more toward reporting: simple aggregations, filtering, and basic JOINs. DE interviews also test deduplication, incremental loads, and data quality checks.

Do data engineering interviews use LeetCode-style SQL?
Not usually. LeetCode SQL problems are puzzle-oriented: tricky self-joins, recursive tricks, clever one-liners. Data engineering interviews test practical patterns: pipeline transformations, data quality checks, incremental processing, and multi-step queries chained with CTEs. The difficulty is comparable, but the style is different.

Should I memorize SQL syntax or understand concepts?
Both. You need to write correct syntax without autocomplete (some interviews use plain text editors). But memorizing syntax without understanding the underlying concept leads to brittle solutions. Focus on understanding why a window function is better than a self-join for certain problems, then practice until the syntax is automatic.

What SQL engine should I practice on?
PostgreSQL is the safest bet. Most interview platforms use it, and its syntax is closest to standard SQL. If your target company uses a specific engine (Snowflake, BigQuery, Redshift), learn its DATE functions and window function quirks, but the core SQL patterns are the same everywhere.

Practice These SQL Questions Now

Write real SQL. Get real results. Know exactly where you stand before the interview.

Continue your prep

Data Engineer Interview Prep, explore the full guide

50+ guides covering every round, company, role, and technology in the data engineer interview loop. Grounded in 2,817 verified interview reports across 929 companies, collected from real candidates.

Interview Rounds

By Company

By Role

By Technology

Decisions

Question Formats