SQL Query Optimization for Data Engineering Interviews (2026)

SQL is the most-tested skill in data engineering interviews, and optimization is where senior-level signal lives. Five patterns cover ~90% of what interviews ask: replacing correlated subqueries, covering indexes, partition pruning, predicate pushdown, and EXISTS vs DISTINCT. Each pattern below includes before/after code and the takeaway that interviewers reward.

Most-tested skill in DE interviews

Core optimization patterns

10–1000×

Typical speedup from a single pattern

L4+

Where optimization becomes table stakes

Replacing correlated subqueries with JOINs

Symptom: a SELECT-clause subquery references a column from the outer query. The engine executes the subquery once per outer row, giving O(n×m) work. Fix: rewrite as a LEFT JOIN against a pre-aggregated derived table. The engine scans the inner table once, then hash-joins. On 1M customers × 10M orders, this typically drops execution time from minutes to seconds.

Slow

BEFORE

SELECT customer_id, (SELECT MAX(order_date) FROM orders o WHERE o.customer_id = c.id) AS last_order FROM customers c; -- Scans orders ONCE PER customer row

Correlated subqueries are the most common performance trap in interview questions. Always check whether a JOIN or window function can replace them.

Fast

AFTER

SELECT c.customer_id, o.last_order FROM customers c LEFT JOIN ( SELECT customer_id, MAX(order_date) AS last_order FROM orders GROUP BY customer_id ) o ON o.customer_id = c.id; -- Scans orders ONCE, hash-joins to customers

Covering indexes for filter-heavy queries

Symptom: a query filters on two columns and selects three more. Without a matching composite index, the engine reads every row and applies filters in memory. Fix: composite index on (status, created_at) WITH INCLUDE (user_id, amount). The engine does an index-only scan — finds rows in the B-tree, reads included columns directly from the index, never touches the heap. Random I/O drops to zero for this query shape.

No index

BEFORE

SELECT user_id, amount, created_at FROM transactions WHERE status = 'completed' AND created_at >= '2025-01-01'; -- Without index: full table scan, 50M rows examined

Covering indexes are the single highest-impact optimization for read-heavy queries. Put equality filters first, range filters second in the column order.

Covering index

AFTER

CREATE INDEX idx_tx_status_date ON transactions (status, created_at) INCLUDE (user_id, amount); -- Same query now does an index-only scan. -- Equality filter (status) comes FIRST in the index, -- range filter (created_at) comes SECOND.

Reading EXPLAIN output

Ten plan node types that interviewers expect you to interpret on sight. These are Postgres terms; MySQL and the data warehouses use slightly different names but the concepts map cleanly.

Plan node	What it means	When to worry
Seq Scan	Full table scan. Reads every row.	Acceptable on small tables; red flag on >1M rows with a filter.
Index Scan	Walks a B-tree to find matching rows, then fetches them from the heap.	Good unless many rows match — high random I/O cost.
Index Only Scan	Reads matching columns directly from the index without touching the heap.	Ideal. Means your index COVERS the query.
Bitmap Heap Scan	Builds a bitmap of matching tuples, then reads heap pages in sequential order.	Often better than Index Scan when many rows match.
Nested Loop	For each outer row, scan the inner table. O(n×m) without an inner index.	OK when outer side is tiny; otherwise prefer hash or merge.
Hash Join	Builds a hash table on the smaller side, probes with the larger side.	Best general-purpose join for unordered inputs.
Merge Join	Both sides pre-sorted on the join key, then walked in parallel.	Best when sort order is already present (e.g. from an index).
Sort	Materializes results into sorted order in memory or on disk.	Disk sort is a red flag — increase work_mem or add an index.
Aggregate / HashAggregate	Groups rows for SUM/COUNT/AVG. HashAgg keeps a hash map of groups.	GroupAggregate requires sorted input; HashAgg works on unordered.
rows= (estimated)	Planner's row-count estimate for this node.	If estimate is wildly off vs actual, run ANALYZE.

Partition pruning for time-series data

Symptom: a 2B-row events table is queried by date but stored as a single heap. Every query scans the full dataset even with an index — the index is huge and random I/O kills the wall time. Fix: range-partition by event_date with monthly partitions. Queries with a date predicate prune all irrelevant partitions before execution begins. Combined with local indexes on each partition, the December 2025 query touches <3% of the data.

Single heap

BEFORE

-- Single heap table CREATE TABLE events ( event_id BIGINT, user_id BIGINT, event_date DATE, payload JSONB ); CREATE INDEX ON events (event_date); SELECT COUNT(*) FROM events WHERE event_date BETWEEN '2025-12-01' AND '2025-12-31'; -- Index helps, but full B-tree on 2B rows is itself huge

Partitioning is not about making queries faster on small data. It is about making queries possible on large data. Always partition time-series tables by the column you filter on most.

Range partitions

AFTER

-- Monthly partitions CREATE TABLE events ( event_id BIGINT, user_id BIGINT, event_date DATE, payload JSONB ) PARTITION BY RANGE (event_date); CREATE TABLE events_2025_12 PARTITION OF events FOR VALUES FROM ('2025-12-01') TO ('2026-01-01'); CREATE INDEX ON events_2025_12 (event_date); -- Planner prunes all other partitions before execution. -- Touches <3% of total data for the same query.

Four-step EXPLAIN diagnosis checklist

When an interviewer hands you a slow query, walk these in order. The first three steps solve about 80% of real-world performance problems.

Step 1

Look for Seq Scan on large tables

The first thing to find in EXPLAIN output is a Seq Scan on a table with millions of rows. That almost always means a missing index or an index that doesn't match the query's filter columns. The fix is usually a composite index ordered (equality filters, then range filters).

Seq Scan + WHERE filter = missing index

Step 2

Compare estimated vs actual row counts

EXPLAIN ANALYZE shows both. If the planner estimated 10 rows but produced 10,000, it picked a Nested Loop join that's now disastrously slow. Run ANALYZE on the affected tables to refresh statistics. On Postgres, set default_statistics_target = 1000 for high-cardinality columns.

Bad estimates → wrong plan

Step 3

Check for disk-based sorts

A Sort node with 'Disk: 245MB' means the sort spilled because work_mem couldn't hold it. Either increase work_mem, add an index that provides the sort order for free, or rewrite to avoid the sort entirely (use ROW_NUMBER instead of ORDER BY + DISTINCT ON, for example).

Sort + Disk = increase work_mem

Step 4

Trace the JOIN order

EXPLAIN shows joins as a tree, bottom-up. The smallest-row-count side should be the inner side of a Nested Loop or the build side of a Hash Join. If the planner picked the wrong order, it's usually because of bad statistics — re-ANALYZE, or hint with materialized CTEs.

Smallest table should drive the join

Predicate pushdown in multi-table JOINs

Symptom: a 3-table JOIN filters on a column from the LAST table in the chain. The engine joins everything first, materializes millions of intermediate rows, then filters. Fix: restructure so the most selective filter applies before the join. Write the filtered subquery explicitly as a CTE — most engines push predicates automatically, but writing it shows architectural awareness in an interview.

Late filter

BEFORE

SELECT * FROM orders o JOIN items i ON o.id = i.order_id JOIN returns r ON i.id = r.item_id WHERE r.reason = 'defective'; -- Joins all orders × items first (millions of rows) -- THEN filters returns. Wastes most of the work.

Filter early, join late. The fewer rows that enter a JOIN, the faster it runs. In interviews, always mention predicate pushdown when discussing JOIN optimization.

Early filter

AFTER

WITH defective AS ( SELECT item_id FROM returns WHERE reason = 'defective' ) SELECT o.*, i.*, d.item_id FROM defective d JOIN items i ON i.id = d.item_id JOIN orders o ON o.id = i.order_id; -- Filter returns FIRST. Smallest set drives the join.

EXISTS vs DISTINCT for deduplication

Symptom: SELECT DISTINCT on a large result set to find customers who ordered. DISTINCT must materialize and dedupe the entire intermediate result. Fix: rewrite as EXISTS — the engine short-circuits per customer the moment it finds a matching order. With an index on (customer_id, order_date), this typically runs 5–20x faster on multi-million-row tables.

DISTINCT

BEFORE

SELECT DISTINCT customer_id FROM orders WHERE order_date >= '2025-01-01'; -- Sorts or hashes all 50M matching rows to dedupe

DISTINCT is a blunt instrument. When you only need to check existence, EXISTS with a semi-join is almost always faster because it short-circuits after the first match.

EXISTS

AFTER

SELECT id FROM customers c WHERE EXISTS ( SELECT 1 FROM orders o WHERE o.customer_id = c.id AND o.order_date >= '2025-01-01' ); -- Stops scanning orders as soon as the first match -- is found per customer. Short-circuit semi-join.

Four mistakes that tank optimization interviews

What separates strong candidates from passing ones. Each mistake is independent — you can fall into any one of them regardless of the others.

Mistake 1

Adding indexes without considering the query

An index on (a, b) does NOT help a query that filters on (b) alone. Composite indexes are usable left-to-right. WHERE a = 1 AND b = 2 → uses the index. WHERE b = 2 → doesn't. Always state the column order rule when proposing an index in an interview.

Leftmost-prefix rule

Mistake 2

Ignoring data volume

An approach that's fine on 10K rows might be unusable on 10M. A correlated subquery with a 1ms inner query is 10ms total at 10 rows, but 2.8 hours at 1M rows. Always ask 'how many rows' before proposing an optimization. The number changes the right answer.

Optimization depends on scale

Mistake 3

Forgetting the write side

Every index speeds up reads and slows down writes. On a high-throughput OLTP table, adding a 4-column composite index can degrade INSERT/UPDATE throughput by 20–30%. Strong candidates mention this tradeoff. 'I'd add this index, but only if write volume on this table is under 1k/sec.'

Indexes are not free

Mistake 4

Optimizing the wrong query

Spending 20 minutes optimizing a query that runs once per month wastes the interview. Ask about call frequency. A query that runs 1000x/sec deserves milliseconds of attention; one that runs once a day can be 'good enough.' Senior interviewers reward candidates who ask before they optimize.

Frequency > absolute runtime

SQL query optimization FAQ

Do data engineering interviews actually ask about query optimization?+

Yes. SQL is the most-tested skill in data engineering interviews, and optimization is a natural extension at mid-level and senior rounds. It might be explicit ('how would you speed up this query?') or embedded in a design question ('this pipeline processes 1B rows daily, how do you keep it under 30 minutes?'). Knowing EXPLAIN output and index strategy is expected.

Should I memorize execution plan syntax for interviews?+

You do not need to memorize exact syntax, but you should know what to look for: sequential scans vs. index scans, nested loops vs. hash joins, sort operations, and estimated row counts. Being able to say 'I would run EXPLAIN ANALYZE and look for sequential scans on large tables' shows practical experience.

What's the most common optimization mistake in SQL interviews?+

Adding indexes without considering the query pattern. An index on (a, b) does not help a query that filters on (b) alone. Candidates who explain composite index column ordering and covering indexes stand out. The second most common mistake is ignoring the data volume; an approach that works on 10K rows might be unusable on 10M.

How do materialized views fit into query optimization?+

Materialized views pre-compute expensive aggregations or joins and store the result. They trade storage and refresh cost for read performance. In interviews, mention them when the question involves repeated expensive reads on data that changes infrequently. Always discuss the refresh strategy: full refresh vs. incremental, and how stale data affects downstream consumers.

Does query optimization differ between Postgres, MySQL, and the data warehouses?+

The optimization principles transfer: filter early, use the right index, prune partitions, avoid materializing intermediate results. The mechanics differ. Snowflake and BigQuery don't expose B-tree indexes — they use micro-partitions and clustering. Redshift uses sort keys. Postgres has classic B-tree, GIN, GiST. In an interview, name the engine before proposing the mechanism. 'In Postgres I'd add a covering index. In Snowflake I'd cluster by user_id.'

02 / Why practice

Build optimization intuition by running real queries

01
Active recall beats re-reading by 50%
Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom
02
76% of hiring managers reject on the coding task, not the resume
From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice
03
Five problem shapes cover 80% of data engineer loops
Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition

Practice slow-query optimization

Related SQL prep

SQL JOIN practice→

JOIN order and type selection are the biggest optimization levers

Date functions practice→

Time-series queries are the most common source of slow SQL in production

Window functions practice→

ROW_NUMBER and PARTITION BY queries that benefit most from optimization techniques