Interview Round Guide

The SQL Round

The SQL round is the single most common round in a data engineer interview. Across 1,042 interview reports we collected, 95% of loops contained at least one live SQL round, and 32.7% of all rounds in the dataset were phone-screen SQL rounds. This page is part of our data engineer interview prep hub, which covers every round in the loop. If you skip preparation here, the loop ends at the screen.

The Short Answer
Expect a 30 to 60 minute live coding round with 2 to 4 SQL problems against a 3 to 5 table schema. The top patterns tested are GROUP BY, window functions (ROW_NUMBER, RANK, LAG, LEAD), self joins, gap and island problems, and rolling aggregations. To pass, write a working query in under 12 minutes for a medium problem and under 20 for a hard one, talk through edge cases out loud, and never let a query stand without a comment on NULL behavior, duplicates, or grain.
Updated April 2026 · By The DataDriven Team

What the SQL Round Actually Tests

The format is consistent across companies. Frequency numbers below come from 1,042 verified interview reports collected on DataDriven from 2024 to 2026.

Pattern | Share of SQL Questions | Common In
GROUP BY with HAVING | 15.3% | Every loop, all levels
INNER and LEFT JOIN | 21.1% | Every loop
Window functions (PARTITION BY) | 9.7% | L4+, FAANG
ROW_NUMBER deduplication | 6.2% | Every loop
RANK and DENSE_RANK | 4.9% | L4+
Self join on inequality | 4.1% | L4+, FAANG
Gap and island problems | 3.8% | Senior, FAANG
Rolling and moving averages | 3.6% | Analytics-heavy roles
Recursive CTE | 2.7% | Senior, modeling-heavy roles
NULLIF and COALESCE patterns | 5.1% | Every loop
Date trunc and date math | 7.8% | Every loop
EXISTS vs IN | 2.4% | Optimization rounds
Pivot with conditional aggregation | 3.2% | Every loop
EXPLAIN plan reading | 2.1% | Senior+, optimization rounds

Five Worked Solutions From Real Loops

Each problem below appeared in at least three reported interview loops in the last 18 months. The solution shows the query, the gotchas the interviewer is testing for, and the follow-up question they will ask next.

Pattern: Window Dedup

Keep the most recent row per user

A user_events table has duplicate rows because of an at-least-once pipeline. Return one row per user_id, keeping the row with the latest event_ts. Ties broken by event_id descending.
WITH ranked AS (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY user_id
      ORDER BY event_ts DESC, event_id DESC
    ) AS rn
  FROM user_events
)
SELECT user_id, event_ts, event_id, payload
FROM ranked
WHERE rn = 1;
Pattern: Gap and Island

Find users active for 3+ consecutive days

The trick: subtract a per-user ROW_NUMBER, expressed as a number of days, from the date column. Consecutive dates produce the same difference, so they collapse into one group. The interviewer wants you to show the math out loud with a 5-row example.
WITH daily AS (
  SELECT DISTINCT user_id, event_date
  FROM user_events
),
grouped AS (
  SELECT
    user_id,
    event_date,
    event_date - INTERVAL '1 day' *
      ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_date)
      AS streak_key
  FROM daily
)
SELECT user_id, MIN(event_date) AS streak_start, COUNT(*) AS streak_days
FROM grouped
GROUP BY user_id, streak_key
HAVING COUNT(*) >= 3;
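The collapse step is easiest to see on a small worked example. The dates and user below are hypothetical, chosen only to show one streak and one gap:

```sql
-- Walk-through for user_id = 1: consecutive dates minus their row
-- number land on the same streak_key; the gap before Jan 07 starts
-- a new key, so the two streaks group separately.
--
-- event_date   rn   event_date - rn days  (streak_key)
-- 2026-01-01    1   2025-12-31
-- 2026-01-02    2   2025-12-31
-- 2026-01-03    3   2025-12-31
-- 2026-01-07    4   2026-01-03
-- 2026-01-08    5   2026-01-03
```

Saying exactly this, out loud, with a table like the one above sketched on the whiteboard, is what the interviewer is listening for.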
Pattern: Rolling Window

7-day rolling revenue with explicit partial window handling

Most candidates write the window and stop. The follow-up is always: what do the first 6 rows show, and why is that wrong for a dashboard? Mention partial windows before being asked.
SELECT
  order_date,
  daily_revenue,
  CASE
    WHEN ROW_NUMBER() OVER (ORDER BY order_date) >= 7
    THEN AVG(daily_revenue) OVER (
      ORDER BY order_date
      ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    )
  END AS rolling_7d_avg
FROM daily_revenue_summary
ORDER BY order_date;
Pattern: Self Join

Pairs of employees who share a manager

The inequality e1.id < e2.id is the only thing the interviewer cares about. It eliminates self-pairs and reverse duplicates in one condition. Candidates who use !=, then dedupe in a CTE, lose points for inefficiency.
SELECT
  e1.id AS employee_a,
  e2.id AS employee_b,
  e1.manager_id
FROM employees e1
JOIN employees e2
  ON e1.manager_id = e2.manager_id
  AND e1.id < e2.id;
Pattern: Recursive CTE

All reports under a manager, including indirect

The interviewer is checking three things: that you write the base case correctly, that you JOIN the recursive step on the previous iteration, not the source table, and that you have a depth guard against cycles in the org chart.
WITH RECURSIVE org AS (
  SELECT id, manager_id, name, 0 AS depth
  FROM employees
  WHERE id = :target_manager_id

  UNION ALL

  SELECT e.id, e.manager_id, e.name, o.depth + 1
  FROM employees e
  JOIN org o ON e.manager_id = o.id
  WHERE o.depth < 20
)
SELECT id, name, depth FROM org WHERE id != :target_manager_id;

What Interviewers Watch For (And Score You On)

1. Explicit NULL handling

Every JOIN, every WHERE, every aggregate. Saying 'COUNT(col) ignores NULLs but COUNT(*) does not' before being asked is a strong signal. Most candidates think about NULL only after a wrong answer.
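A minimal sketch of the difference, using an inline VALUES list so it is self-contained (the column name is illustrative):

```sql
-- COUNT(*) counts rows; COUNT(expr) skips rows where expr IS NULL.
SELECT
  COUNT(*)          AS total_rows,       -- 4: every row counted
  COUNT(last_login) AS non_null_logins,  -- 3: the NULL row is skipped
  COUNT(*) - COUNT(last_login) AS null_logins  -- 2-second NULL audit
FROM (VALUES
  (DATE '2026-01-01'),
  (DATE '2026-01-02'),
  (NULL),
  (DATE '2026-01-04')
) AS t(last_login);
```

The `COUNT(*) - COUNT(col)` trick doubles as a quick data-quality check, and mentioning it unprompted reads as experience.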
2. Grain awareness

Before the first SELECT, state the grain of your output: 'one row per user per day'. This single sentence tells the interviewer you understand aggregation. Skipping it forces them to ask, which costs points.
3. Window vs aggregate choice

If the result needs a per-row context (running total, rank, lag), the answer is a window function. If the result collapses rows, it is GROUP BY. Mixing the two without a clear reason is the most common L4+ rejection signal.
4. ROW_NUMBER vs RANK vs DENSE_RANK

ROW_NUMBER assigns 1, 2, 3, 4. RANK assigns 1, 1, 3, 4 (skips). DENSE_RANK assigns 1, 1, 2, 3 (no skip). Pick based on whether ties matter for your use case. Saying which you chose and why is required for senior roles.
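All three can be compared in one query. The inline VALUES list below is illustrative, with a tie at the top to match the sequences above:

```sql
-- Scores 95, 95, 90, 85 with a tie on 95:
-- ROW_NUMBER -> 1,2,3,4   RANK -> 1,1,3,4   DENSE_RANK -> 1,1,2,3
SELECT
  score,
  ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num,    -- arbitrary within ties
  RANK()       OVER (ORDER BY score DESC) AS rnk,        -- skips after ties
  DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rnk   -- no gaps
FROM (VALUES (95), (95), (90), (85)) AS s(score);
```

Note that ROW_NUMBER's ordering within a tie is nondeterministic unless you add a tiebreaker column to the ORDER BY, which is itself a point worth saying out loud.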
5. Edge cases stated proactively

Empty table, all NULL inputs, single-row groups, ties at the boundary, partial windows. Saying 'on an empty input this returns an empty set, which I think is correct here' is the experience signal interviewers are looking for.
6. Query readability

CTEs over nested subqueries. Lowercase columns, uppercase keywords. One CTE per logical step with a name that reads like English. Interviewers explicitly grade readability for L5+ roles.
7. Performance instinct

If a problem says 'over 5 billion rows', mention partition keys, predicate pushdown, and avoiding cross joins. You do not need to optimize. You need to show that performance is on your radar.

Why GROUP BY Beats Window Functions in Interviews

This is a hot take. Most candidates rush to window functions because they look senior. In our review of recorded mock interviews on DataDriven, 62% of L4 candidates who reached for a window function on a problem solvable with GROUP BY took 4 minutes longer and made more syntax errors than candidates who used GROUP BY first.

Use the simplest construct that produces the right answer. If the problem reduces to one row per group with no per-row context, GROUP BY is correct. If you need a value from a non-aggregated row alongside an aggregate (e.g., last login alongside total logins), that is when window functions earn their complexity. The interviewer wants the right tool, not the fanciest tool. The same judgment shows up in vanilla Python patterns interviewers test, where reaching for pandas on a problem that needs a dict is the equivalent failure mode.
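The contrast is easy to show side by side. Table and column names here are hypothetical:

```sql
-- Result collapses to one row per user: GROUP BY is the right tool.
SELECT user_id, COUNT(*) AS total_logins
FROM logins
GROUP BY user_id;

-- Per-row context needed: every login row, each annotated with the
-- user's latest login and total count. This is where a window
-- function earns its complexity, because GROUP BY would collapse
-- the rows you need to keep.
SELECT
  user_id,
  login_ts,
  MAX(login_ts) OVER (PARTITION BY user_id) AS last_login,
  COUNT(*)      OVER (PARTITION BY user_id) AS total_logins
FROM logins;
```

The one-line test: if your SELECT needs a column that is neither grouped nor aggregated, you have left GROUP BY territory.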

How the SQL Round Connects to the Rest of the Loop

SQL fluency is the entry condition for the rest of the loop, but the patterns recur in surprising places. Gap-and-island problems reappear inside vanilla Python patterns interviewers test as sessionization questions. ROW_NUMBER deduplication shows up again inside star schema and SCD round prep when you argue for SCD Type 2 over Type 1. EXPLAIN-plan reading and partition pruning are the natural lead-in to system design framework for data engineers, where the same instincts scale up to data warehouse design.

Take-home assignments often bundle SQL with a small modeling task, which is why the data engineer take-home prep overlaps here. And if you're targeting a specific company, the live SQL round at Stripe leans hard on financial-precision SQL while the loop at Airbnb tests experimentation-platform SQL. Read the company-specific guides after this one.

Tools and Technologies Tested in the SQL Round

The SQL dialect of the round usually mirrors the company stack. Snowflake-heavy companies test Snowflake-specific functions (QUALIFY, FLATTEN). BigQuery-heavy companies test ARRAY_AGG and STRUCT manipulation. Postgres companies stay portable. The underlying patterns transfer, but the syntax does not. If you already know the company stack, drill the right dialect: see BigQuery internals interview prep, Redshift internals interview prep, PostgreSQL internals interview prep, or our SQL question reference for Data Engineer interviews reference for the full set.

How to Prepare in Four Weeks

Week 1: Joins and aggregation grain

30 problems on INNER, LEFT, FULL, and self join. Every problem requires you to state the output grain before writing. Use the joins-practice and group-by-practice problem sets on DataDriven. Goal: medium problems in under 12 minutes.
Week 2: Window functions end to end

Every window function: ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, FIRST_VALUE, LAST_VALUE, NTILE, SUM OVER, AVG OVER. 25 problems. Practice ROWS vs RANGE frame clauses. Goal: explain partial-window behavior before the interviewer asks.
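The ROWS vs RANGE distinction is worth one concrete sketch. ROWS counts physical rows; RANGE defines the frame by ORDER BY values, so peer rows with the same value share a frame and date gaps behave differently. The query below is a PostgreSQL-flavored illustration (RANGE with an interval offset needs Postgres 11+), reusing the hypothetical daily_revenue_summary table from the rolling-window example:

```sql
SELECT
  order_date,
  daily_revenue,
  SUM(daily_revenue) OVER (
    ORDER BY order_date
    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW    -- last 3 physical rows,
  ) AS rows_frame,                               -- regardless of date gaps
  SUM(daily_revenue) OVER (
    ORDER BY order_date
    RANGE BETWEEN INTERVAL '2 days' PRECEDING AND CURRENT ROW
  ) AS range_frame                               -- true 3-calendar-day window
FROM daily_revenue_summary;
```

If the table has a missing day, rows_frame silently reaches further back in time than intended, while range_frame does not. Explaining that trade-off is the partial-window answer interviewers want.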
Week 3: Patterns and edge cases

Gap and island, deduplication, sessionization, top-N per group, percentile calculations, pivot with conditional aggregation. 20 problems. Time yourself: medium under 15 minutes, hard under 25.
Week 4: Live mock interviews

10 mock interview rounds with the SQL mock interview tool on DataDriven. Speak out loud the entire time, including state-the-grain, state-edge-cases, and state-the-trade-off. Record yourself and replay. The gap between silent thinking and verbal thinking is what gets candidates rejected.

Data Engineer Interview Prep FAQ

What SQL dialect is used in data engineering interviews?
PostgreSQL syntax is the most common default in 2026, with about 70% of loops accepting it. Some companies use BigQuery (Google), Redshift (Amazon), Snowflake (data-warehouse-heavy roles), or Spark SQL. Write portable ANSI SQL when in doubt and ask the interviewer if a specific dialect is required.
How long should I take to write a medium SQL problem?
12 to 15 minutes for a medium, 20 to 25 minutes for a hard. Faster than this is fine if your query is correct. Slower than this and the interviewer will start helping, which lowers your score even if you finish.
Can I use ChatGPT to practice SQL interview questions?
Generating questions with an LLM is fine. Generating answers is counterproductive because you skip the muscle memory of writing the query. Practice on a tool that actually executes your SQL against real schemas. Reading correct queries does not build the speed you need to write them under interview pressure.
Do I need to memorize SQL window functions?
Yes. ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, SUM OVER, AVG OVER, FIRST_VALUE, LAST_VALUE, and the ROWS BETWEEN frame clause should be reflexive. If you have to look up syntax during the interview, the interviewer will note it and lower your fluency score.
What if I do not finish the SQL problem in time?
Get the structure right, even if it does not run. A correctly structured query with one syntax bug shows 95% of what the interviewer is grading. A complete but incorrect approach shows nothing. Always state your approach in 30 seconds before writing, so the interviewer can give partial credit if you run out of time.
Are SQL questions different at FAANG vs startups?
FAANG SQL rounds are more rigorous on edge cases (NULL behavior, ties, empty inputs) and on optimization (EXPLAIN plans, predicate pushdown). Startups care more about practical correctness on messy real-world data. The query patterns are identical. The depth of follow-up is different.
Should I use CTEs or subqueries?
CTEs, almost always. They are easier to read, easier to debug, and easier to extend. Some legacy MySQL versions do not support them, so know the subquery form too. In any modern interview, CTE is the default.
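The same logic both ways, for reference. The orders table and threshold here are hypothetical:

```sql
-- Subquery form: works everywhere, but the inner step is anonymous.
SELECT user_id, total
FROM (
  SELECT user_id, SUM(amount) AS total
  FROM orders
  GROUP BY user_id
) t
WHERE total > 100;

-- CTE form: the intermediate step has a name you can run and
-- debug on its own, which is exactly why interviewers prefer it.
WITH user_totals AS (
  SELECT user_id, SUM(amount) AS total
  FROM orders
  GROUP BY user_id
)
SELECT user_id, total
FROM user_totals
WHERE total > 100;
```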
How important is performance in a SQL interview?
For L3 to L4 roles, mentioning performance once is enough. For L5+ roles, you should be able to read an EXPLAIN plan, identify a full table scan, and propose a fix. For L6+ data infrastructure roles, expect a dedicated optimization round.

Pass the SQL Round in 4 Weeks

Run live SQL against real schemas in the browser. Get instant feedback on every query. Build the muscle memory that lets you write a clean query in 12 minutes under pressure.

Start the SQL Mock Interview

More Data Engineer Interview Prep Guides

Data Engineer Interview Prep, explore the full guide

50+ guides covering every round, company, role, and technology in the data engineer interview loop. Grounded in 2,817 verified interview reports across 929 companies, collected from real candidates.
