SQL Query Optimization for Data Engineering Interviews (2026)
SQL is the most-tested skill in data engineering interviews, and optimization is where senior-level signal lives. Five patterns cover ~90% of what interviews ask: replacing correlated subqueries, covering indexes, partition pruning, predicate pushdown, and EXISTS vs DISTINCT. Each pattern below includes before/after code and the takeaway that interviewers reward.
Replacing correlated subqueries with JOINs
Symptom: a SELECT-clause subquery references a column from the outer query. The engine executes the subquery once per outer row, giving O(n×m) work. Fix: rewrite as a LEFT JOIN against a pre-aggregated derived table. The engine scans the inner table once, then hash-joins. On 1M customers × 10M orders, this typically drops execution time from minutes to seconds.
BEFORE
SELECT customer_id, (SELECT MAX(order_date) FROM orders o WHERE o.customer_id = c.id) AS last_order FROM customers c; -- Scans orders ONCE PER customer row
AFTER
SELECT c.customer_id, o.last_order FROM customers c LEFT JOIN ( SELECT customer_id, MAX(order_date) AS last_order FROM orders GROUP BY customer_id ) o ON o.customer_id = c.id; -- Scans orders ONCE, hash-joins to customers
Covering indexes for filter-heavy queries
Symptom: a query filters on two columns and selects three more. Without a matching composite index, the engine reads every row and applies filters in memory. Fix: composite index on (status, created_at) WITH INCLUDE (user_id, amount). The engine does an index-only scan — finds rows in the B-tree, reads included columns directly from the index, never touches the heap. Random I/O drops to zero for this query shape.
BEFORE
SELECT user_id, amount, created_at FROM transactions WHERE status = 'completed' AND created_at >= '2025-01-01'; -- Without index: full table scan, 50M rows examined
AFTER
CREATE INDEX idx_tx_status_date ON transactions (status, created_at) INCLUDE (user_id, amount); -- Same query now does an index-only scan. -- Equality filter (status) comes FIRST in the index, -- range filter (created_at) comes SECOND.
Reading EXPLAIN output
Ten plan node types that interviewers expect you to interpret on sight. These are Postgres terms; MySQL and the data warehouses use slightly different names but the concepts map cleanly.
| Plan node | What it means | When to worry |
|---|---|---|
| Seq Scan | Full table scan. Reads every row. | Acceptable on small tables; red flag on >1M rows with a filter. |
| Index Scan | Walks a B-tree to find matching rows, then fetches them from the heap. | Good unless many rows match — high random I/O cost. |
| Index Only Scan | Reads matching columns directly from the index without touching the heap. | Ideal. Means your index COVERS the query. |
| Bitmap Heap Scan | Builds a bitmap of matching tuples, then reads heap pages in sequential order. | Often better than Index Scan when many rows match. |
| Nested Loop | For each outer row, scan the inner table. O(n×m) without an inner index. | OK when outer side is tiny; otherwise prefer hash or merge. |
| Hash Join | Builds a hash table on the smaller side, probes with the larger side. | Best general-purpose join for unordered inputs. |
| Merge Join | Both sides pre-sorted on the join key, then walked in parallel. | Best when sort order is already present (e.g. from an index). |
| Sort | Materializes results into sorted order in memory or on disk. | Disk sort is a red flag — increase work_mem or add an index. |
| Aggregate / HashAggregate | Groups rows for SUM/COUNT/AVG. HashAgg keeps a hash map of groups. | GroupAggregate requires sorted input; HashAgg works on unordered. |
| rows= (estimated) | Planner's row-count estimate for this node. | If estimate is wildly off vs actual, run ANALYZE. |
Partition pruning for time-series data
Symptom: a 2B-row events table is queried by date but stored as a single heap. Every query scans the full dataset even with an index — the index is huge and random I/O kills the wall time. Fix: range-partition by event_date with monthly partitions. Queries with a date predicate prune all irrelevant partitions before execution begins. Combined with local indexes on each partition, the December 2025 query touches <3% of the data.
BEFORE
-- Single heap table CREATE TABLE events ( event_id BIGINT, user_id BIGINT, event_date DATE, payload JSONB ); CREATE INDEX ON events (event_date); SELECT COUNT(*) FROM events WHERE event_date BETWEEN '2025-12-01' AND '2025-12-31'; -- Index helps, but full B-tree on 2B rows is itself huge
AFTER
-- Monthly partitions CREATE TABLE events ( event_id BIGINT, user_id BIGINT, event_date DATE, payload JSONB ) PARTITION BY RANGE (event_date); CREATE TABLE events_2025_12 PARTITION OF events FOR VALUES FROM ('2025-12-01') TO ('2026-01-01'); CREATE INDEX ON events_2025_12 (event_date); -- Planner prunes all other partitions before execution. -- Touches <3% of total data for the same query.
Four-step EXPLAIN diagnosis checklist
When an interviewer hands you a slow query, walk these in order. The first three steps solve about 80% of real-world performance problems.
Look for Seq Scan on large tables
The first thing to find in EXPLAIN output is a Seq Scan on a table with millions of rows. That almost always means a missing index or an index that doesn't match the query's filter columns. The fix is usually a composite index ordered (equality filters, then range filters).
Compare estimated vs actual row counts
EXPLAIN ANALYZE shows both. If the planner estimated 10 rows but produced 10,000, it picked a Nested Loop join that's now disastrously slow. Run ANALYZE on the affected tables to refresh statistics. On Postgres, set default_statistics_target = 1000 for high-cardinality columns.
Check for disk-based sorts
A Sort node with 'Disk: 245MB' means the sort spilled because work_mem couldn't hold it. Either increase work_mem, add an index that provides the sort order for free, or rewrite to avoid the sort entirely (use ROW_NUMBER instead of ORDER BY + DISTINCT ON, for example).
Trace the JOIN order
EXPLAIN shows joins as a tree, bottom-up. The smallest-row-count side should be the inner side of a Nested Loop or the build side of a Hash Join. If the planner picked the wrong order, it's usually because of bad statistics — re-ANALYZE, or hint with materialized CTEs.
Predicate pushdown in multi-table JOINs
Symptom: a 3-table JOIN filters on a column from the LAST table in the chain. The engine joins everything first, materializes millions of intermediate rows, then filters. Fix: restructure so the most selective filter applies before the join. Write the filtered subquery explicitly as a CTE — most engines push predicates automatically, but writing it shows architectural awareness in an interview.
BEFORE
SELECT * FROM orders o JOIN items i ON o.id = i.order_id JOIN returns r ON i.id = r.item_id WHERE r.reason = 'defective'; -- Joins all orders × items first (millions of rows) -- THEN filters returns. Wastes most of the work.
AFTER
WITH defective AS ( SELECT item_id FROM returns WHERE reason = 'defective' ) SELECT o.*, i.*, d.item_id FROM defective d JOIN items i ON i.id = d.item_id JOIN orders o ON o.id = i.order_id; -- Filter returns FIRST. Smallest set drives the join.
EXISTS vs DISTINCT for deduplication
Symptom: SELECT DISTINCT on a large result set to find customers who ordered. DISTINCT must materialize and dedupe the entire intermediate result. Fix: rewrite as EXISTS — the engine short-circuits per customer the moment it finds a matching order. With an index on (customer_id, order_date), this typically runs 5–20x faster on multi-million-row tables.
BEFORE
SELECT DISTINCT customer_id FROM orders WHERE order_date >= '2025-01-01'; -- Sorts or hashes all 50M matching rows to dedupe
AFTER
SELECT id FROM customers c WHERE EXISTS ( SELECT 1 FROM orders o WHERE o.customer_id = c.id AND o.order_date >= '2025-01-01' ); -- Stops scanning orders as soon as the first match -- is found per customer. Short-circuit semi-join.
Four mistakes that tank optimization interviews
What separates strong candidates from passing ones. Each mistake is independent — you can fall into any one of them regardless of the others.
Adding indexes without considering the query
An index on (a, b) does NOT help a query that filters on (b) alone. Composite indexes are usable left-to-right. WHERE a = 1 AND b = 2 → uses the index. WHERE b = 2 → doesn't. Always state the column order rule when proposing an index in an interview.
Ignoring data volume
An approach that's fine on 10K rows might be unusable on 10M. A correlated subquery with a 1ms inner query is 10ms total at 10 rows, but 2.8 hours at 1M rows. Always ask 'how many rows' before proposing an optimization. The number changes the right answer.
Forgetting the write side
Every index speeds up reads and slows down writes. On a high-throughput OLTP table, adding a 4-column composite index can degrade INSERT/UPDATE throughput by 20–30%. Strong candidates mention this tradeoff. 'I'd add this index, but only if write volume on this table is under 1k/sec.'
Optimizing the wrong query
Spending 20 minutes optimizing a query that runs once per month wastes the interview. Ask about call frequency. A query that runs 1000x/sec deserves milliseconds of attention; one that runs once a day can be 'good enough.' Senior interviewers reward candidates who ask before they optimize.
SQL query optimization FAQ
Do data engineering interviews actually ask about query optimization?+
Should I memorize execution plan syntax for interviews?+
What's the most common optimization mistake in SQL interviews?+
How do materialized views fit into query optimization?+
Does query optimization differ between Postgres, MySQL, and the data warehouses?+
Build optimization intuition by running real queries
- 01
Active recall beats re-reading by 50%
Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom
- 02
76% of hiring managers reject on the coding task, not the resume
From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice
- 03
Five problem shapes cover 80% of data engineer loops
Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition