SQL Interview Questions for Data Engineers

Q: Which SQL topics come up most in DE interviews?

Aggregation is the largest single bucket at around 24%, JOINs next at 20%, window functions at about 15%. GROUP BY and INNER JOIN appear in the most questions. A typical 45-minute round combines two or three of these in a single problem, so the practice that pays off most is connecting them, not drilling each in isolation.

Q: How many SQL questions should I solve before a loop?

Fifty is a reasonable floor, but only if you actually solve them: write the query, run it, fix the edge case, then close the tab and try to re-derive the answer two days later. The signal you're ready isn't a count. It's whether you can write a medium window-function problem in under 12 minutes without looking up syntax.

Q: How is DE SQL different from analyst SQL?

Analyst SQL leans on reporting: aggregations, filters, the occasional pivot. DE SQL adds patterns analysts never have to write: dedup by composite key, incremental loads with merge semantics, gap-and-island analysis, time-zone-aware bucketing, queries that have to stay correct under late-arriving data. The keywords overlap. The questions don't.

Q: Are DE SQL interviews like LeetCode SQL?

Not really. LeetCode leans into puzzle problems: clever self-joins, NULL gotchas, one-liners that are hard to read. DE interviews lean into pipeline-shaped problems: multi-step CTEs, deduplication, data-quality predicates, time-window matching. Both can be hard, but the kind of hard is different.

Q: What SQL engine should I practice on?

Postgres is the safe default. Most interview platforms use Postgres or something close enough that ANSI features behave the same. If you're interviewing at a Snowflake or BigQuery shop, spend a couple of hours learning the date functions and the QUALIFY clause for those engines, but the bulk of what you practice will transfer.

SQL is the round you will definitely face. It shows up on roughly 7 of every 10 DE loops, and on most of those it shows up twice. This guide groups the questions by topic with frequency weights pulled from real interview debriefs. Fifteen of the most-asked are worked end to end with the code, reasoning, and failure modes.

SQL is the round you will definitely face. It shows up on roughly seven of every ten data engineering loops, and on most of those it shows up twice. Eight topics drive ninety-five percent of the questions. Aggregation is the largest bucket, joins next, window functions third. This guide ranks those topics and works fifteen of the most-asked questions end to end with the SQL, the reasoning, and the failure modes interviewers fish for.

The eight topics, ranked by how often they show up

Topic	Difficulty	Share of SQL questions
Window Functions	Medium-Hard	~15% of SQL questions
CTEs and Recursive Queries	Medium-Hard	Most multi-step problems
JOINs and Self-Joins	Medium	~20% of SQL questions
Aggregations and GROUP BY	Easy-Medium	~24% of SQL questions
Subqueries	Medium-Hard	Common in filtering and exclusion problems
NULL Handling	Easy-Medium	A trap on most queries
Date and Time Functions	Medium	Most analytical questions touch them
String Functions and Pattern Matching	Easy-Medium	Cleaning problems and pipeline take-homes

What each topic actually tests

Window Functions. ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, NTILE, frame clause. Interviewers scrutinize your ORDER BY inside the OVER clause hardest, because the wrong sort silently changes the answer.

CTEs and Recursive Queries. Common Table Expressions are how you make a multi-step query readable and how interviewers expect you to structure anything that isn't a single SELECT. Recursive CTEs handle org charts, category trees, dependency graphs.

JOINs and Self-Joins. INNER and LEFT are the daily bread. The questions that separate candidates: self-joins for pair-finding, anti-joins for exclusion, inequality joins for time-window matching. Almost every wrong join is a duplicate from an unaccounted many-to-many.

Aggregations and GROUP BY. The largest single bucket. COUNT, SUM, AVG, MIN, MAX with HAVING for post-aggregate filtering, conditional aggregation (CASE WHEN inside SUM) for pivots. The real trap is COUNT(*) vs COUNT(col) when col can be NULL.

Subqueries. Scalar subqueries return a single value. Correlated subqueries re-execute per outer row. EXISTS and NOT EXISTS are the safe way to express set membership without the NULL traps of IN and NOT IN.

NULL Handling. COALESCE for defaulting, NULLIF for guarding division, IS NULL because = NULL is always unknown. Three-valued logic is the source of more silent wrong answers than any other SQL feature.

Date and Time Functions. DATE_TRUNC for bucketing, DATE_DIFF for windows, EXTRACT for cohorts, interval arithmetic for retention. Function names differ by engine, which is itself a question interviewers like to ask.

String Functions and Pattern Matching. LIKE for simple patterns, regex when LIKE isn't enough, CONCAT, SUBSTRING, SPLIT_PART for parsing. Questions usually look like data cleaning, because that's where string functions earn their keep.

Prepare for the interview

01 / Open invite

02min.

Know the patterns before the interviewer asks them.

a SQL query, the same shape a screen would give you.

The diff against expected. Where ties broke. What you missed.

sandbox

1SELECT user_id,

2 COUNT(*) AS sessions

3FROM events

4WHERE ts >= NOW() - INTERVAL '7 day'

Execute your solution0.4s avg.

MicrosoftInterview question

Solve a problem

Prepare for the interview

03 / From the bank03 of many

03hand-picked.

Top Active Senders per Channel

Medium26 min

Top three messages per channel by replies.

Open the sandbox

Pick your level

BlockL3Top Metric Values ShopifyL4First Contact CoinbaseL5Alert Severity

Pulled from debriefs where SQL was the gate.

Q1: Find the second-highest salary in each department

WITH ranked AS (
  SELECT
    department_id,
    employee_name,
    salary,
    DENSE_RANK() OVER (
      PARTITION BY department_id
      ORDER BY salary DESC
    ) AS rnk
  FROM employees
)
SELECT department_id, employee_name, salary
FROM ranked
WHERE rnk = 2;

DENSE_RANK handles ties the way the business question intends. If two employees share top salary, both get rank 1 and the next distinct salary is rank 2. RANK would jump to rank 3 and your filter returns nothing. ROW_NUMBER would arbitrarily pick one of the tied employees. Common mistakes: using ROW_NUMBER (arbitrary tie ordering), using RANK (skips numbers after ties), forgetting PARTITION BY. Complexity: O(n log n) per partition for the sort.

Q2: 7-day rolling average of daily revenue

SELECT
  order_date,
  daily_revenue,
  ROUND(
    AVG(daily_revenue) OVER (
      ORDER BY order_date
      ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ), 2
  ) AS rolling_7d_avg
FROM daily_revenue_summary
ORDER BY order_date;

ROWS BETWEEN 6 PRECEDING AND CURRENT ROW gives a 7-day window. For the first 6 rows the frame is shorter than 7. The choice of ROWS over RANGE matters: RANGE collapses ties on the ORDER BY value, which on intraday or duplicated dates silently changes the answer. Common mistakes: using RANGE instead of ROWS, writing ROWS BETWEEN 7 PRECEDING (that's an 8-day window), not handling partial windows in the first 6 rows.

Q3: LAG to find customers whose order value decreased

WITH with_prev AS (
  SELECT
    customer_id,
    order_date,
    order_total,
    LAG(order_total) OVER (
      PARTITION BY customer_id
      ORDER BY order_date
    ) AS prev_order_total
  FROM orders
)
SELECT *
FROM with_prev
WHERE order_total < prev_order_total;

Window functions can't appear in WHERE because SQL evaluates WHERE before windowing. The CTE form is the standard fix. Postgres 16 added QUALIFY to skip the CTE; most engines still require it. Common mistakes: putting LAG directly in WHERE (syntax error), forgetting PARTITION BY (mixes customers), not addressing the first order per customer where prev is NULL.

Q4: Deduplicate records, keeping the most recent per user

WITH ranked AS (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY user_id
      ORDER BY updated_at DESC
    ) AS rn
  FROM user_profiles
)
SELECT user_id, email, name, updated_at
FROM ranked
WHERE rn = 1;

ROW_NUMBER assigns a unique rank inside each user_id partition, most-recent first. Filtering rn = 1 keeps exactly one row per user and preserves all other columns. The pattern shows up in nearly every loop. Postgres has DISTINCT ON; BigQuery and Snowflake have QUALIFY. The CTE form works everywhere. Common mistakes: DISTINCT (no control over which duplicate survives), GROUP BY with MAX (right timestamp, lose other columns), ASC instead of DESC.

Q5: Month-over-month revenue growth rate

WITH monthly AS (
  SELECT
    DATE_TRUNC('month', order_date) AS month,
    SUM(total_amount) AS revenue
  FROM orders
  GROUP BY DATE_TRUNC('month', order_date)
),
with_prev AS (
  SELECT
    month,
    revenue,
    LAG(revenue) OVER (ORDER BY month) AS prev_revenue
  FROM monthly
)
SELECT
  month,
  revenue,
  prev_revenue,
  ROUND(
    (revenue - prev_revenue) * 100.0 / NULLIF(prev_revenue, 0),
    1
  ) AS growth_pct
FROM with_prev
ORDER BY month;

Three steps the CTE chain makes obvious: aggregate to month, LAG one row back, compute % change. NULLIF on prev_revenue is what interviewers want you to volunteer, since the first month has no predecessor. Common mistakes: dividing by prev_revenue without NULLIF (crashes on first row), using EXTRACT(MONTH FROM) instead of DATE_TRUNC (collapses Jan 2025 and Jan 2026), doing it all in one query without CTEs.

Q6: Find customers who have never placed an order

-- Approach 1: LEFT JOIN + NULL check
SELECT c.customer_id, c.name, c.email
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
WHERE o.customer_id IS NULL;

-- Approach 2: NOT EXISTS (often faster)
SELECT c.customer_id, c.name, c.email
FROM customers c
WHERE NOT EXISTS (
  SELECT 1
  FROM orders o
  WHERE o.customer_id = c.customer_id
);

Both return the same answer. LEFT JOIN + IS NULL is more intuitive. NOT EXISTS can be faster because it short-circuits and on engines without a great anti-join optimizer it's the more reliable plan. Worth volunteering NOT EXISTS in the interview to show you've thought about the plan. Common mistakes: NOT IN with potential NULLs (silently returns nothing), INNER JOIN by mistake.

Q7: Self-join to find pairs of employees with the same manager

SELECT
  e1.employee_name AS employee_1,
  e2.employee_name AS employee_2,
  e1.manager_id
FROM employees e1
JOIN employees e2
  ON e1.manager_id = e2.manager_id
  AND e1.employee_id < e2.employee_id;

Self-join on manager_id. The inequality e1.employee_id < e2.employee_id drops both the (Alice, Bob) / (Bob, Alice) duplicates and the self-matches in one condition. Strict less-than over not-equal is the canonical answer because it halves the work and yields one row per real pair. Common mistakes: != instead of < (returns each pair twice), forgetting the inequality (every employee pairs with themselves).

Q8: Departments where avg salary exceeds $100K

SELECT
  department_id,
  COUNT(*) AS employee_count,
  ROUND(AVG(salary), 2) AS avg_salary
FROM employees
GROUP BY department_id
HAVING AVG(salary) > 100000
ORDER BY avg_salary DESC;

WHERE filters rows before aggregation; HAVING filters groups after. Including COUNT(*) is the move that separates passing from great, because the natural follow-up is whether a $101K average over one employee should count the same as over 500. Common mistakes: AVG(salary) in WHERE (errors out), referencing the alias in HAVING (some engines accept it, many don't, so use the full expression).

Q9: Conditional aggregation to pivot monthly sales

SELECT
  product_id,
  SUM(CASE WHEN EXTRACT(MONTH FROM order_date) = 1
    THEN total_amount ELSE 0 END) AS jan_sales,
  SUM(CASE WHEN EXTRACT(MONTH FROM order_date) = 2
    THEN total_amount ELSE 0 END) AS feb_sales,
  SUM(CASE WHEN EXTRACT(MONTH FROM order_date) = 3
    THEN total_amount ELSE 0 END) AS mar_sales,
  SUM(CASE WHEN EXTRACT(MONTH FROM order_date) = 4
    THEN total_amount ELSE 0 END) AS apr_sales
FROM orders
WHERE order_date >= '2025-01-01'
  AND order_date < '2025-05-01'
GROUP BY product_id
ORDER BY product_id;

CASE WHEN inside SUM is the engine-agnostic pivot. Each CASE acts as a per-row filter; SUM only counts matching rows. Same pattern extends to any categorical pivot. Common mistakes: skipping the date filter (Jan 2024 and Jan 2025 collide), hardcoding columns when categories are unbounded (call out the need for dynamic pivot or application-side transformation).

Q10: Customers who bought product A but never bought B

SELECT DISTINCT c.customer_id, c.name
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
WHERE oi.product_id = 'A'
  AND NOT EXISTS (
    SELECT 1
    FROM orders o2
    JOIN order_items oi2 ON o2.order_id = oi2.order_id
    WHERE o2.customer_id = c.customer_id
      AND oi2.product_id = 'B'
  );

Outer query finds customers who bought A. NOT EXISTS excludes any who also bought B. NOT EXISTS short-circuits on the first match. Pattern shows up constantly in user segmentation, cohort analysis, exclusion-style product questions. Common mistakes: NOT IN (returns nothing if inner has NULLs), missing DISTINCT (a customer who bought A three times appears 3x).

Q11: NULLIF to avoid division-by-zero in revenue per user

SELECT
  campaign_id,
  total_revenue,
  total_users,
  ROUND(
    total_revenue / NULLIF(total_users, 0),
    2
  ) AS revenue_per_user
FROM campaign_metrics;

NULLIF returns NULL when its two arguments are equal, so NULLIF(total_users, 0) gives NULL on zero. Dividing by NULL is NULL in standard SQL, not an error. Cleaner than CASE WHEN and idiomatic in every dialect. Wrap in COALESCE if you want a 0 for display. Common mistakes: no guard (engines throw on divide-by-zero), CASE WHEN (works but verbose), COALESCE(total_users, 1) (handles NULL but not zero).

Q12: User retention at 1, 7, 30 days after signup

WITH signups AS (
  SELECT user_id, MIN(event_date) AS signup_date
  FROM events
  GROUP BY user_id
),
activity AS (
  SELECT DISTINCT e.user_id, e.event_date
  FROM events e
  JOIN signups s ON e.user_id = s.user_id
  WHERE e.event_date > s.signup_date
)
SELECT
  s.signup_date,
  COUNT(DISTINCT s.user_id) AS cohort_size,
  COUNT(DISTINCT CASE
    WHEN a.event_date = s.signup_date + INTERVAL '1 day'
    THEN s.user_id END
  ) * 100.0 / COUNT(DISTINCT s.user_id) AS d1_retention,
  COUNT(DISTINCT CASE
    WHEN a.event_date = s.signup_date + INTERVAL '7 days'
    THEN s.user_id END
  ) * 100.0 / COUNT(DISTINCT s.user_id) AS d7_retention,
  COUNT(DISTINCT CASE
    WHEN a.event_date = s.signup_date + INTERVAL '30 days'
    THEN s.user_id END
  ) * 100.0 / COUNT(DISTINCT s.user_id) AS d30_retention
FROM signups s
LEFT JOIN activity a ON s.user_id = a.user_id
GROUP BY s.signup_date
ORDER BY s.signup_date;

Signups CTE captures each user's first event; activity CTE pulls subsequent events; final SELECT uses conditional COUNT DISTINCT to check whether the user was back on day 1, 7, 30. LEFT JOIN keeps users who never returned in the denominator. Common mistakes: confusing 'returned on day 7' with 'returned within 7 days' (exact match vs BETWEEN), INNER JOIN (drops users who never returned, inflating retention), skipping COUNT(DISTINCT) (a user with 5 events on day 7 counts 5x).

Q13: Users active for at least 3 consecutive days (gaps and islands)

WITH daily_activity AS (
  SELECT DISTINCT user_id, DATE(event_timestamp) AS active_date
  FROM events
),
with_groups AS (
  SELECT
    user_id,
    active_date,
    active_date - ROW_NUMBER() OVER (
      PARTITION BY user_id
      ORDER BY active_date
    ) * INTERVAL '1 day' AS grp
  FROM daily_activity
)
SELECT user_id, MIN(active_date) AS streak_start,
  MAX(active_date) AS streak_end,
  COUNT(*) AS streak_days
FROM with_groups
GROUP BY user_id, grp
HAVING COUNT(*) >= 3
ORDER BY user_id, streak_start;

Gaps and islands. The trick: subtract ROW_NUMBER from the date and consecutive dates produce a constant difference. Grouping on that constant identifies a run. Once you've seen the pattern, it's automatic. One of about five tricks interviewers treat as standard at senior level. Common mistakes: skipping the daily-activity dedup (multiple events same day breaks the sequence), reaching for LAG (compares adjacent rows, streaks of 3+ need nested checks), engine-mismatched arithmetic.

Q14: Extract the domain from email addresses

-- PostgreSQL / Redshift
SELECT email, SPLIT_PART(email, '@', 2) AS domain
FROM users;

-- MySQL
SELECT email, SUBSTRING_INDEX(email, '@', -1) AS domain
FROM users;

-- Standard SQL (works everywhere)
SELECT
  email,
  SUBSTRING(email FROM POSITION('@' IN email) + 1) AS domain
FROM users;

Three answers because the right syntax depends on engine. SPLIT_PART is cleanest in Postgres/Redshift. SUBSTRING_INDEX is MySQL's equivalent. POSITION + SUBSTRING is the standard form. Common mistakes: assuming every email has exactly one @ (production data has malformed addresses), not asking which engine first (default to standard SQL and call out shortcuts), forgetting LOWER for case-insensitive grouping.

Q15: Salary as percentage of department average

-- Approach 1: Window function (preferred)
SELECT
  employee_name,
  department_id,
  salary,
  ROUND(
    salary * 100.0 / AVG(salary) OVER (
      PARTITION BY department_id
    ), 1
  ) AS pct_of_dept_avg
FROM employees
ORDER BY department_id, pct_of_dept_avg DESC;

-- Approach 2: Correlated subquery
SELECT
  e.employee_name,
  e.department_id,
  e.salary,
  ROUND(
    e.salary * 100.0 / (
      SELECT AVG(e2.salary)
      FROM employees e2
      WHERE e2.department_id = e.department_id
    ), 1
  ) AS pct_of_dept_avg
FROM employees e
ORDER BY e.department_id, pct_of_dept_avg DESC;

Window function version is one pass. AVG with PARTITION BY computes the dept average inline; you divide each row by it. Correlated subquery runs the inner SELECT once per outer row, which is O(n²) before optimization. Leading with the window version lands well, then naming the subquery alternative shows you know more than one tool and picked the better one. Common mistakes: reaching for the correlated subquery first, dividing by overall average (forgetting PARTITION BY), integer division (without * 100.0, integer math truncates to 0).

How to prep for the SQL round

01
Start with JOINs and aggregation
These show up in every interview and are the foundation for harder questions. When GROUP BY and HAVING are automatic, you have spare cycles for the part of the problem that's actually being tested.
02
Get fluent on window functions
The single biggest skill gap between candidates who pass and candidates who don't. Practice until you can write ROW_NUMBER, RANK, LAG, LEAD without thinking about the OVER clause, then practice until you can reason about ties without looking it up.
03
Memorize the dedup pattern
ROW_NUMBER OVER (PARTITION BY key ORDER BY ts DESC) inside a CTE, then SELECT WHERE rn = 1. Some flavor of this is in nearly every loop because dedup is one of the daily tasks of the job. The keep-most-recent-per-user worked solution drills the exact form.
04
Practice on the clock
Knowing a pattern in your IDE is not the same as writing it in a shared editor under a timer. Do a few problems with a stopwatch and no autocomplete before the loop. The first 10 minutes of the real interview will feel hostile if you've never done that.
05
Run the code
Reading the solution and writing it from a blank file are different skills. Find your NULL bug in practice, not on the call.

SQL interview questions FAQ

Which SQL topics come up most in DE interviews?+

Aggregation is the largest single bucket at around 24%, JOINs next at 20%, window functions at about 15%. GROUP BY and INNER JOIN appear in the most questions. A typical 45-minute round combines two or three of these in a single problem, so the practice that pays off most is connecting them, not drilling each in isolation.

How many SQL questions should I solve before a loop?+

Fifty is a reasonable floor, but only if you actually solve them: write the query, run it, fix the edge case, then close the tab and try to re-derive the answer two days later. The signal you're ready isn't a count. It's whether you can write a medium window-function problem in under 12 minutes without looking up syntax.

How is DE SQL different from analyst SQL?+

Analyst SQL leans on reporting: aggregations, filters, the occasional pivot. DE SQL adds patterns analysts never have to write: dedup by composite key, incremental loads with merge semantics, gap-and-island analysis, time-zone-aware bucketing, queries that have to stay correct under late-arriving data. The keywords overlap. The questions don't.

Are DE SQL interviews like LeetCode SQL?+

Not really. LeetCode leans into puzzle problems: clever self-joins, NULL gotchas, one-liners that are hard to read. DE interviews lean into pipeline-shaped problems: multi-step CTEs, deduplication, data-quality predicates, time-window matching. Both can be hard, but the kind of hard is different.

What SQL engine should I practice on?+

Postgres is the safe default. Most interview platforms use Postgres or something close enough that ANSI features behave the same. If you're interviewing at a Snowflake or BigQuery shop, spend a couple of hours learning the date functions and the QUALIFY clause for those engines, but the bulk of what you practice will transfer.

02 / Why practice

Solve a few of these against a real warehouse

01
Active recall beats re-reading by 50%
Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom
02
76% of hiring managers reject on the coding task, not the resume
From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice
03
Five problem shapes cover 80% of data engineer loops
Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition

Start with a window function problem

Related guides

DE Interview Prep→

The full loop: SQL, Python, modeling, system design, behavioral.

Data Modeling Questions→

Star schemas, normalization, SCD types.

Python Interview Questions→

Data manipulation and ETL patterns for DE interviews.