SQL Tutorial for Data Engineers

You're going to write CASE WHEN more than any other SQL construct in your first year on the job. Get fluent now. This path walks you from your first SELECT to recursive CTEs in the order your brain can actually absorb it, with practice problems at the end of every stage. No skipping ahead. The window functions section is where most people get stuck, and we'll tell you why before you hit it.

Eight stages. About 22 hours if you're also solving the problems. Every concept that shows up in real DE interviews, nothing that doesn't.

8 stages to interview-ready. 22h median time to finish. 429 SQL questions mapped. 61% L5 senior-level coverage.

Source: DataDriven analysis of 1,042 verified data engineering interview rounds.

SQL Learning Path

Each stage builds on the one before it. You'll be tempted to skip straight to window functions, since they're the topic that separates junior from senior candidates. Resist. If GROUP BY isn't automatic, window functions will feel impossible. Put in the reps now and they'll click later.

Stage 1 (2 hours)

SELECT, FROM, and WHERE

Every SQL query starts here. SELECT chooses which columns appear in your output. FROM specifies the table. WHERE filters rows before any other processing happens. This is not just syntax to memorize. Understanding the logical order of SQL execution (FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY, LIMIT) is what separates someone who copies queries from someone who writes them.

SELECT specific columns instead of SELECT * to control output and improve performance. In production pipelines, SELECT * is almost always wrong because schema changes break downstream consumers.

WHERE filters rows using comparison operators (=, !=, <, >, <=, >=), BETWEEN for ranges, IN for sets, and LIKE for pattern matching. Combine conditions with AND and OR. Parentheses matter: A OR B AND C groups differently than (A OR B) AND C.

NULL is not a value. It is the absence of a value. WHERE column = NULL does not work. You must use WHERE column IS NULL. This trips up beginners and experienced engineers alike. Three-valued logic (TRUE, FALSE, UNKNOWN) means any comparison involving NULL evaluates to UNKNOWN, which WHERE treats as FALSE.
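To see three-valued logic bite, here is a sketch against a hypothetical users table with a nullable phone column:

```sql
-- Returns zero rows: phone = NULL evaluates to UNKNOWN, and WHERE drops it
SELECT * FROM users WHERE phone = NULL;

-- Correct: IS NULL is the only way to test for absence
SELECT * FROM users WHERE phone IS NULL;

-- The same trap in reverse: rows where phone is NULL are NOT returned here
SELECT * FROM users WHERE phone <> '555-0100';

-- Include the NULLs explicitly if you want them
SELECT * FROM users WHERE phone <> '555-0100' OR phone IS NULL;
```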

Aliasing columns with AS makes your output readable. Aliasing tables with short names (FROM orders o) keeps complex queries manageable. You will use table aliases constantly once you start writing JOINs.

Practice SELECT queries
Stage 2 (1.5 hours)

Filtering and Sorting

Once you can retrieve rows, you need to control which rows appear and in what order. ORDER BY sorts your results (ASC or DESC). LIMIT caps the number of rows returned. DISTINCT removes duplicate rows. OFFSET skips rows for pagination. These four clauses appear in almost every production query you will write.

ORDER BY executes after SELECT, which means you can sort by column aliases or expressions. You can sort by multiple columns: ORDER BY department ASC, salary DESC sorts by department alphabetically, then by salary descending within each department.

LIMIT and OFFSET together enable pagination, but OFFSET-based pagination performs poorly on large tables because the database still scans all the skipped rows. In production, cursor-based pagination (WHERE id > last_seen_id) is faster.
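A side-by-side sketch, assuming a hypothetical orders table with an indexed id primary key:

```sql
-- OFFSET pagination: the database still scans and discards the first 100000 rows
SELECT id, created_at
FROM orders
ORDER BY id
LIMIT 20 OFFSET 100000;

-- Cursor-based (keyset) pagination: the index on id jumps straight to the next page;
-- 100000 here is the last id the client saw on the previous page
SELECT id, created_at
FROM orders
WHERE id > 100000
ORDER BY id
LIMIT 20;
```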

DISTINCT applies to the entire row, not just the first column. SELECT DISTINCT city, state returns unique city-state pairs, not unique cities. If you want unique values in one column regardless of others, use GROUP BY or a subquery.

Sorting NULL values: in PostgreSQL, NULLs sort last in ASC order and first in DESC order by default. Use NULLS FIRST or NULLS LAST to override. This matters when you are building leaderboards or ranked lists where missing data should appear at the bottom.
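For example, a leaderboard sketch over a hypothetical players table where some players have never scored:

```sql
-- Without NULLS LAST, DESC ordering puts the never-scored players on top in PostgreSQL
SELECT player_name, high_score
FROM players
ORDER BY high_score DESC NULLS LAST;
```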

Practice filtering and sorting
Stage 3 (3 hours)

Aggregation and GROUP BY

Aggregation is the most tested SQL skill in data engineering interviews. GROUP BY collapses rows into groups. Aggregate functions (COUNT, SUM, AVG, MIN, MAX) compute a single value per group. HAVING filters groups after aggregation, unlike WHERE which filters rows before. If you want to find departments where average salary exceeds $100K, you need HAVING, not WHERE.

COUNT(*) counts all rows including NULLs. COUNT(column) counts non-NULL values in that column. COUNT(DISTINCT column) counts unique non-NULL values. Knowing which COUNT to use is a common interview question.

GROUP BY groups rows that share the same value in the specified columns. Every column in your SELECT must either appear in GROUP BY or be wrapped in an aggregate function. This rule catches beginners constantly: SELECT name, department, AVG(salary) FROM employees GROUP BY department fails because name is not grouped or aggregated.

HAVING filters groups after aggregation. WHERE filters rows before aggregation. This distinction matters for performance and correctness. WHERE reduces the number of rows the database has to aggregate. HAVING filters the aggregated results. If you can express a condition in WHERE instead of HAVING, do it; it is faster.
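Putting both filters in one query (the employees table and employment_type column are illustrative assumptions):

```sql
-- Departments where full-time average salary exceeds $100K
SELECT department, AVG(salary) AS avg_salary
FROM employees
WHERE employment_type = 'full_time'   -- row filter: runs before grouping
GROUP BY department
HAVING AVG(salary) > 100000;          -- group filter: runs after aggregation
```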

Conditional aggregation using CASE WHEN inside aggregate functions is a power technique. SUM(CASE WHEN status = 'returned' THEN 1 ELSE 0 END) counts returned orders without filtering out non-returned ones. This pattern replaces complex self-joins and subqueries in many interview problems.
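A sketch of the pattern over a hypothetical orders table, computing several conditional metrics in one pass:

```sql
SELECT customer_id,
       COUNT(*) AS total_orders,
       SUM(CASE WHEN status = 'returned' THEN 1 ELSE 0 END) AS returned_orders,
       -- no ELSE branch: non-completed rows become NULL, which AVG ignores
       AVG(CASE WHEN status = 'completed' THEN amount END) AS avg_completed_amount
FROM orders
GROUP BY customer_id;
```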

Practice aggregation problems
Stage 4 (4 hours)

JOINs

JOINs combine rows from two or more tables based on a related column. They are the second most-tested SQL concept after GROUP BY. In real data engineering work, you rarely query a single table. Orders reference customers. Events reference users. Pipeline runs reference configurations. JOINs are how you connect these relationships.

INNER JOIN returns only rows where the join condition matches in both tables. If a customer has no orders, they do not appear in the result. This is the default JOIN type and the one you will use most often.

LEFT JOIN returns all rows from the left table plus matching rows from the right table. Where there is no match, the right side fills with NULLs. This is how you find 'customers who never ordered' or 'products with no reviews': LEFT JOIN then WHERE right_table.id IS NULL.
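The anti-join pattern spelled out, assuming hypothetical customers and orders tables:

```sql
-- Customers who never ordered: unmatched left rows have NULLs on the right side
SELECT c.customer_id, c.name
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.customer_id
WHERE o.order_id IS NULL;
```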

Self-joins connect a table to itself. They are essential for hierarchical data (employees and their managers), finding duplicates (rows with the same email but different IDs), and comparing rows within the same table (orders placed within 30 minutes of each other using an inequality join).
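Two self-join sketches (an employees table with a manager_id column and a users table, both illustrative):

```sql
-- Hierarchy: each employee next to their manager; LEFT JOIN keeps the CEO (no manager)
SELECT e.name AS employee, m.name AS manager
FROM employees e
LEFT JOIN employees m ON m.employee_id = e.manager_id;

-- Duplicates: same email, different ids; a.id < b.id avoids counting each pair twice
SELECT a.id, b.id, a.email
FROM users a
JOIN users b ON b.email = a.email AND a.id < b.id;
```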

JOIN performance depends on the join condition. Joining on indexed columns is fast. Joining on expressions (WHERE LOWER(a.name) = LOWER(b.name)) often forces a full scan. In interviews, mentioning indexing and join order shows you think about production performance, not just correctness.

CROSS JOIN produces the Cartesian product: every row in table A paired with every row in table B. Rarely useful on its own, but combined with LEFT JOIN, it generates 'all possible combinations' that you then fill with actual data. Common in calendar-based reporting where you need a row for every date-product pair even when sales are zero.
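A calendar-reporting sketch in PostgreSQL, assuming hypothetical products and sales tables:

```sql
-- One row for every date-product pair in January, zero-filling days with no sales
SELECT d.day::date AS sale_date,
       p.product_id,
       COALESCE(SUM(s.amount), 0) AS revenue
FROM generate_series('2024-01-01'::date, '2024-01-31'::date, interval '1 day') AS d(day)
CROSS JOIN products p
LEFT JOIN sales s
       ON s.sale_date = d.day::date
      AND s.product_id = p.product_id
GROUP BY d.day, p.product_id;
```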

Practice JOIN problems
Stage 5 (2.5 hours)

Subqueries

A subquery is a query nested inside another query. They appear in WHERE clauses (scalar and correlated subqueries), FROM clauses (derived tables), and SELECT lists (scalar subqueries). Subqueries let you break complex problems into steps without creating temporary tables or CTEs.

Scalar subqueries return a single value. Use them in WHERE clauses: WHERE salary > (SELECT AVG(salary) FROM employees). The database executes the subquery once and uses the result as a constant. Simple and efficient for single-value comparisons.

Correlated subqueries reference the outer query, so they execute once per outer row. SELECT ... FROM employees e1 WHERE salary > (SELECT AVG(salary) FROM employees e2 WHERE e2.department_id = e1.department_id) compares each employee to their department average. Powerful but potentially slow on large tables because of the per-row execution.

EXISTS and NOT EXISTS check whether a subquery returns any rows. EXISTS stops scanning as soon as it finds one match, making it faster than COUNT(*) > 0 for large tables. NOT EXISTS is the cleanest way to express 'find rows with no matching rows in another table.'
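The same 'no matching rows' idea written with NOT EXISTS (the customers/orders schema is an illustrative assumption):

```sql
-- Customers with no orders in 2024; the subquery stops at the first match it finds
SELECT c.customer_id, c.name
FROM customers c
WHERE NOT EXISTS (
    SELECT 1
    FROM orders o
    WHERE o.customer_id = c.customer_id
      AND o.order_date >= '2024-01-01'
);
```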

Subqueries in the FROM clause (derived tables) let you pre-aggregate or transform data before joining. SELECT * FROM (SELECT department_id, AVG(salary) AS avg_sal FROM employees GROUP BY department_id) dept_avg JOIN ... is cleaner than trying to aggregate and join in the same query level.

Practice subquery problems
Stage 6 (3 hours)

Common Table Expressions (CTEs)

CTEs are named temporary result sets defined with the WITH keyword. They make complex queries readable by breaking them into named steps. In interviews, using CTEs signals that you write maintainable code, not just correct code. Most interviewers prefer a CTE-based solution over an equivalent nested subquery because it is easier to follow.

Basic CTE syntax: WITH cte_name AS (SELECT ...) SELECT * FROM cte_name. You can chain multiple CTEs separated by commas: WITH step1 AS (...), step2 AS (SELECT ... FROM step1) SELECT * FROM step2. Each CTE can reference any CTE defined before it.

CTEs do not inherently improve performance. In PostgreSQL, CTEs used to be optimization fences (materialized by default), but since PostgreSQL 12, the optimizer can inline them. Use CTEs for readability, not performance. If you need to force materialization, add MATERIALIZED.

Recursive CTEs traverse hierarchical data. WITH RECURSIVE org_tree AS (base case UNION ALL recursive case) handles org charts, category trees, and graph traversals. The recursive term references the CTE itself and runs until it produces zero new rows. Always include a termination condition to avoid infinite loops.
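A sketch of the org-chart traversal, assuming an employees table where the root row has manager_id IS NULL:

```sql
WITH RECURSIVE org_tree AS (
    -- Base case: the top of the hierarchy
    SELECT employee_id, name, manager_id, 1 AS depth
    FROM employees
    WHERE manager_id IS NULL

    UNION ALL

    -- Recursive case: everyone reporting to someone already in org_tree.
    -- Terminates when a pass adds zero new rows.
    SELECT e.employee_id, e.name, e.manager_id, t.depth + 1
    FROM employees e
    JOIN org_tree t ON e.manager_id = t.employee_id
)
SELECT * FROM org_tree ORDER BY depth;
```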

The deduplication pattern (WITH ranked AS (SELECT *, ROW_NUMBER() OVER (PARTITION BY key ORDER BY ts DESC) AS rn FROM table) SELECT * FROM ranked WHERE rn = 1) combines CTEs with window functions. This pattern appears in nearly every data engineering interview because deduplication is a daily pipeline task.

Practice CTE problems
Stage 7 (4 hours)

Window Functions

Window functions perform calculations across a set of rows related to the current row without collapsing the result set like GROUP BY does. They are the highest-difficulty SQL topic that appears consistently in data engineering interviews. If you can write window functions fluently, you can handle 90% of SQL interview problems.

ROW_NUMBER assigns a unique sequential integer to each row within a partition. RANK assigns the same number to ties but skips the next number. DENSE_RANK assigns the same number to ties without skipping. Interviewers test whether you know which one to use and why. The deduplication pattern uses ROW_NUMBER; ranking with ties uses DENSE_RANK.

LAG and LEAD access previous and next rows without a self-join. LAG(revenue, 1) OVER (ORDER BY month) gets the previous month's revenue. This is how you calculate month-over-month growth, time between events, and change detection. You cannot use LAG in a WHERE clause directly; wrap it in a CTE first.
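Month-over-month growth as a sketch, assuming a hypothetical monthly_revenue table with numeric revenue:

```sql
WITH monthly AS (
    SELECT month,
           revenue,
           LAG(revenue) OVER (ORDER BY month) AS prev_revenue
    FROM monthly_revenue
)
-- The CTE exists because LAG cannot appear in a WHERE clause directly
SELECT month,
       revenue,
       ROUND(100.0 * (revenue - prev_revenue) / prev_revenue, 1) AS mom_growth_pct
FROM monthly
WHERE prev_revenue IS NOT NULL;
```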

Frame clauses (ROWS BETWEEN / RANGE BETWEEN) control which rows the window function considers. ROWS BETWEEN 6 PRECEDING AND CURRENT ROW creates a 7-row sliding window, which is a 7-day window when you have one row per day. Use ROWS, not RANGE, unless you specifically want to group rows that share the same ORDER BY value. Forgetting the frame clause defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which gives running totals instead of moving averages.
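The moving-average vs. running-total distinction in one sketch (daily_sales is an assumed table with one row per day):

```sql
SELECT sale_date,
       revenue,
       -- Explicit frame: average over the current row and the 6 before it
       AVG(revenue) OVER (ORDER BY sale_date
                          ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS moving_avg_7d,
       -- No frame clause: defaults to a running total from the first row
       SUM(revenue) OVER (ORDER BY sale_date) AS running_total
FROM daily_sales;
```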

SUM, AVG, COUNT, MIN, and MAX all work as window functions when paired with OVER(). SUM(revenue) OVER (ORDER BY month ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) gives a cumulative running total. These aggregate window functions avoid the need for self-joins or correlated subqueries.

NTILE(n) divides rows into n roughly equal groups. NTILE(4) creates quartiles, NTILE(100) creates percentiles. Useful for bucketing users by activity level, creating salary bands, or segmenting customers for A/B tests.

Practice window function problems
Stage 8 (2 hours)

Query Optimization Basics

Writing correct SQL is necessary. Writing fast SQL is what gets you hired at senior levels. Optimization questions appear in system design rounds and follow-up questions during SQL rounds. Understanding EXPLAIN plans, indexes, and common performance pitfalls shows interviewers you think about production workloads, not just toy datasets.

EXPLAIN ANALYZE shows you how PostgreSQL actually executes your query: which indexes it uses, how many rows each step processes, and where time is spent. Sequential Scan means the database reads every row in the table. Index Scan means it uses an index to jump directly to matching rows. If your query uses Sequential Scan on a large table, you probably need an index.
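A minimal before-and-after sketch (the orders table and customer_id column are illustrative):

```sql
EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 42;
-- Without an index, look for "Seq Scan on orders" in the plan output.

CREATE INDEX orders_customer_id_idx ON orders (customer_id);

EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 42;
-- With the index (and enough rows), expect "Index Scan using orders_customer_id_idx".
```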

Index your JOIN columns and WHERE filter columns. A JOIN on customer_id without an index on that column forces a full table scan. Create indexes on columns used in equality conditions first, then range conditions. Multi-column indexes follow leftmost prefix rules: an index on (a, b, c) supports queries filtering on (a), (a, b), or (a, b, c), but not (b) or (c) alone.

Avoid functions on indexed columns in WHERE clauses. WHERE LOWER(email) = 'test@example.com' cannot use a regular index on email because the function transforms the value. Use a functional index (CREATE INDEX ON users (LOWER(email))) or store a normalized version of the column.
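A sketch of the functional-index fix, assuming a users table with an email column:

```sql
-- A plain index on email cannot serve this predicate
SELECT * FROM users WHERE LOWER(email) = 'test@example.com';

-- Index the expression itself so the planner can use it
CREATE INDEX users_email_lower_idx ON users (LOWER(email));
```

An alternative in PostgreSQL is storing a pre-normalized column (or using the citext extension) so queries never need the function at all.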

CTEs, subqueries, and temp tables have different performance characteristics. Subqueries are often inlined by the optimizer. CTEs in PostgreSQL 12+ can also be inlined. Temp tables force materialization, which helps when you reuse a result set multiple times but hurts when the intermediate result is large.

Learn query optimization

Recommended Study Schedule

This schedule assumes one hour per day, five days a week. Adjust based on your existing SQL knowledge. If GROUP BY already feels natural, spend less time on Stages 1 through 3 and more on window functions and CTEs.

Weeks 1 to 2: Stages 1 through 4 (SELECT to JOINs). Build your foundation. By the end of week 2, you should be able to write a multi-table JOIN with aggregation and HAVING without referencing documentation. Solve at least 30 problems.

Weeks 3 to 4: Stages 5 and 6 (Subqueries and CTEs). Learn to decompose complex problems into steps. Practice chaining CTEs until the pattern is second nature. Solve 20 more problems.

Weeks 5 to 6: Stage 7 (Window Functions). This is where most of your study time should go. ROW_NUMBER, RANK, LAG, LEAD, and frame clauses. These appear in the majority of mid-to-senior level SQL rounds. Solve at least 25 window function problems.

Weeks 7 to 8: Stage 8 and Timed Drills. Read the optimization section, then switch to timed practice. 15 minutes per problem, random topics. This simulates real interview conditions and surfaces gaps you did not know you had.

SQL Tutorial FAQ

How long does it take to learn SQL from scratch?
Basic SQL (SELECT, WHERE, JOINs, GROUP BY) takes most people 2 to 4 weeks of consistent practice, about an hour a day. Reaching interview readiness, which means confidently writing window functions, CTEs, and multi-step queries under time pressure, typically takes 6 to 8 weeks. The bottleneck is not understanding concepts; it is building the muscle memory to write correct syntax quickly.
Do I need to install anything to follow this SQL tutorial?
No. DataDriven runs your queries against a real PostgreSQL database in the cloud. Open the practice page in your browser, write your query, and hit run. If you prefer a local setup for experimentation, install PostgreSQL and use psql or pgAdmin, but it is not required.
Which SQL dialect should beginners learn?
Start with PostgreSQL. It follows the SQL standard more closely than MySQL, it is the default engine on most interview platforms, and its documentation is excellent. Once you are comfortable with Postgres, switching to Snowflake, BigQuery, or Redshift requires learning a few dialect-specific differences (mostly date functions and some window function syntax), not relearning the language.
Is this tutorial enough to pass a data engineering SQL interview?
This tutorial covers every SQL topic tested in DE interviews. Reading the tutorial teaches you concepts. Passing the interview requires practice. After working through each section, solve at least 10 practice problems for that topic. Focus extra time on window functions and CTEs, which are the highest-difficulty topics that appear most consistently.
What order should I learn SQL topics in?
Follow the order on this page: SELECT/WHERE first, then sorting and filtering, then aggregation, JOINs, subqueries, CTEs, window functions, and optimization. Each topic builds on the previous one. Skipping to window functions before you are comfortable with GROUP BY is like trying to run before you can walk. The foundations are not exciting, but they make the advanced topics learnable.

Your First SELECT Is Waiting

Start at Stage 1. You'll be surprised how much you can learn in one sitting when the queries actually run.