Data Engineering Interview Prep

SQL Query Optimization for Interviews

SQL is the most-tested skill in data engineering interviews, and optimization is where senior-level signal lives. Whether it is a phone screen or an onsite round, interviewers expect you to explain why a query is slow and how to fix it.

The most commonly tested SQL concepts are GROUP BY, INNER JOIN, PARTITION BY, and LEFT JOIN. When queries built on these run slowly at scale, optimization knowledge determines your level.

1. Replacing Correlated Subqueries with JOINs

A query uses a correlated subquery in the SELECT clause to look up the latest order date for each customer. The subquery executes once per row in the outer query, resulting in O(n*m) performance on large tables.

BEFORE

SELECT customer_id, (SELECT MAX(order_date) FROM orders WHERE orders.customer_id = customers.id) AS last_order FROM customers. This scans the orders table once per customer row.

AFTER

Rewrite as a LEFT JOIN with a pre-aggregated subquery: join customers to a derived table that groups orders by customer_id and selects MAX(order_date). The database scans the orders table once, then performs a hash join. On a table with 1M customers and 10M orders, this typically drops execution time from minutes to seconds.
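The rewrite described above might look like this (a sketch assuming PostgreSQL-style syntax; table and column names follow the example):

```sql
-- BEFORE: correlated subquery, re-executed once per customer row
SELECT c.id,
       (SELECT MAX(o.order_date)
        FROM orders o
        WHERE o.customer_id = c.id) AS last_order
FROM customers c;

-- AFTER: pre-aggregate orders once, then hash join the derived table
SELECT c.id, agg.last_order
FROM customers c
LEFT JOIN (
    SELECT customer_id, MAX(order_date) AS last_order
    FROM orders
    GROUP BY customer_id
) agg ON agg.customer_id = c.id;
```

The LEFT JOIN preserves customers with no orders, matching the NULL the correlated subquery would have returned for them.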

Correlated subqueries are the most common performance trap in interview questions. Always check if a JOIN or window function can replace them.

2. Adding Covering Indexes for Filter-Heavy Queries

A reporting query filters on status and date range, then selects three columns. The table has 50M rows. The query does a full table scan because no index covers the filter columns.

BEFORE

SELECT user_id, amount, created_at FROM transactions WHERE status = 'completed' AND created_at >= '2025-01-01'. Without an index, the engine reads every row and applies the filter in memory.

AFTER

Create a composite index on (status, created_at) with INCLUDE (user_id, amount). The engine uses an index-only scan: it finds matching rows in the B-tree and reads the included columns directly from the index without touching the heap. This eliminates random I/O entirely for this query pattern.
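A sketch of the index and the query it covers (INCLUDE is PostgreSQL/SQL Server syntax; on MySQL you would add user_id and amount as trailing key columns instead; the index name is illustrative):

```sql
-- Equality column first, range column second; non-key columns
-- included so the engine can satisfy the query from the index alone.
CREATE INDEX idx_tx_status_created
    ON transactions (status, created_at)
    INCLUDE (user_id, amount);

-- This query can now run as an index-only scan:
SELECT user_id, amount, created_at
FROM transactions
WHERE status = 'completed'
  AND created_at >= '2025-01-01';
```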

Covering indexes are the single highest-impact optimization for read-heavy queries. Put equality filters first, range filters second in the index column order.

3. Partition Pruning for Time-Series Data

An events table has 2B rows spanning three years. Queries always filter by event_date but the table is stored as a single heap. Every query scans the entire dataset.

BEFORE

SELECT COUNT(*) FROM events WHERE event_date BETWEEN '2025-12-01' AND '2025-12-31'. The planner has no way to skip irrelevant data. Even with an index, the sheer volume causes slow performance.

AFTER

Partition the table by RANGE on event_date, using monthly partitions. The same query now reads only the December 2025 partition. The planner prunes all other partitions before execution begins. Combined with a local index on each partition, this query touches less than 3% of the total data.
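The partitioning scheme might be declared like this (PostgreSQL-style declarative partitioning; other engines such as MySQL, BigQuery, or Hive use different DDL, and the column list here is illustrative):

```sql
CREATE TABLE events (
    event_id   bigint,
    event_date date NOT NULL,
    payload    text
) PARTITION BY RANGE (event_date);

-- One partition per month; the upper bound is exclusive.
CREATE TABLE events_2025_12 PARTITION OF events
    FOR VALUES FROM ('2025-12-01') TO ('2026-01-01');

-- The planner prunes every partition except events_2025_12:
SELECT COUNT(*)
FROM events
WHERE event_date BETWEEN '2025-12-01' AND '2025-12-31';
```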

Partitioning is not about making queries faster on small data. It is about making queries possible on large data. Always partition time-series tables by the column you filter on most.

4. Optimizing JOINs with Predicate Pushdown

A three-table JOIN filters on a column from the last table in the chain. The engine joins all rows first, then filters. This materializes millions of intermediate rows that get discarded.

BEFORE

SELECT * FROM orders JOIN items ON orders.id = items.order_id JOIN returns ON items.id = returns.item_id WHERE returns.reason = 'defective'. The engine joins orders to items (millions of rows), then joins to returns, then filters.

AFTER

Rewrite to filter returns first using a CTE or subquery, then join the filtered result to items and orders. Better yet, restructure so the most selective filter is applied earliest. Some engines handle predicate pushdown automatically, but in interviews you should demonstrate awareness by writing it explicitly.
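The explicit CTE rewrite might look like this (a sketch following the example schema; whether the CTE actually changes the plan depends on the engine):

```sql
-- Filter the most selective table first, then join outward.
WITH defective_returns AS (
    SELECT item_id, reason
    FROM returns
    WHERE reason = 'defective'
)
SELECT o.*, i.*, r.reason
FROM defective_returns r
JOIN items  i ON i.id = r.item_id
JOIN orders o ON o.id = i.order_id;
```

Only rows that survive the filter ever enter a join, so the intermediate results stay small.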

Filter early, join late. The fewer rows that enter a JOIN, the faster it runs. In interviews, always mention predicate pushdown when discussing JOIN optimization.

5. Replacing DISTINCT with EXISTS for Deduplication

A query uses SELECT DISTINCT on a large result set to find unique customer IDs who have placed orders. DISTINCT sorts or hashes the entire result, which is expensive when the base result has millions of rows.

BEFORE

SELECT DISTINCT customer_id FROM orders WHERE order_date >= '2025-01-01'. If orders has 50M rows in that date range, the engine must deduplicate all of them.

AFTER

Rewrite using EXISTS: SELECT id FROM customers WHERE EXISTS (SELECT 1 FROM orders WHERE orders.customer_id = customers.id AND order_date >= '2025-01-01'). The engine can stop scanning orders for each customer as soon as it finds one match. With an index on (customer_id, order_date), this is significantly faster.
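Laid out with the supporting index (the index name is illustrative):

```sql
-- Index so each EXISTS probe is a single B-tree seek that
-- short-circuits on the first matching order.
CREATE INDEX idx_orders_customer_date
    ON orders (customer_id, order_date);

SELECT c.id
FROM customers c
WHERE EXISTS (
    SELECT 1
    FROM orders o
    WHERE o.customer_id = c.id
      AND o.order_date >= '2025-01-01'
);
```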

DISTINCT is a blunt instrument. When you only need to check existence, EXISTS with a semi-join is almost always faster because it short-circuits after the first match.

Optimization Questions in Data Engineering Interviews

Two question forms

Optimization questions come in two forms. The first is direct: you are given a slow query and asked to improve it. The second is embedded in a design question where performance constraints force you to think about indexing, partitioning, or query restructuring.

What interviewers evaluate

They want to see a systematic approach. Start by asking about data volume and access patterns. Then reason about the execution plan before jumping to solutions. Candidates who immediately say "add an index" without understanding the query pattern score poorly.

Common patterns

Most optimization questions involve one of: replacing a correlated subquery, adding a composite index, partitioning a large table, rewriting a DISTINCT with EXISTS, or restructuring JOINs to reduce intermediate result sizes. Master these five and you cover the large majority of what interviews test.

Talk about trade-offs

Indexes speed up reads but slow down writes. Partitioning helps range queries but complicates point lookups. Materialized views trade freshness for speed. Interviewers want to hear that you understand the cost, not just the benefit.

SQL Optimization FAQ

Do data engineering interviews actually ask about query optimization?

Yes. SQL is the most-tested skill in data engineering interviews, and optimization is a natural extension at mid-level and senior rounds. It might be explicit ("how would you speed up this query?") or embedded in a design question ("this pipeline processes 1B rows daily, how do you keep it under 30 minutes?"). Knowing EXPLAIN output and index strategy is expected.

Should I memorize execution plan syntax for interviews?

You do not need to memorize exact syntax, but you should know what to look for: sequential scans vs. index scans, nested loops vs. hash joins, sort operations, and estimated row counts. Being able to say "I would run EXPLAIN ANALYZE and look for sequential scans on large tables" shows practical experience.
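In PostgreSQL, for example, inspecting a plan is one statement (EXPLAIN ANALYZE executes the query, so use it with care on expensive writes):

```sql
-- Runs the query and reports the actual plan, row counts, and timings
EXPLAIN ANALYZE
SELECT user_id, amount
FROM transactions
WHERE status = 'completed'
  AND created_at >= '2025-01-01';
-- Look for: Seq Scan vs. Index Scan, Nested Loop vs. Hash Join,
-- sorts, and large gaps between estimated and actual row counts.
```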

What is the most common optimization mistake in SQL interviews?

Adding indexes without considering the query pattern. An index on (a, b) does not help a query that filters on (b) alone. Candidates who explain composite index column ordering and covering indexes stand out. The second most common mistake is ignoring the data volume; an approach that works on 10K rows might be unusable on 10M.
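The leftmost-prefix rule can be illustrated with a hypothetical table t:

```sql
CREATE INDEX idx_ab ON t (a, b);

SELECT * FROM t WHERE a = 1 AND b = 2;  -- can seek idx_ab on both columns
SELECT * FROM t WHERE a = 1;            -- can seek idx_ab (leftmost prefix)
SELECT * FROM t WHERE b = 2;            -- cannot seek idx_ab on b alone
```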

How do materialized views fit into query optimization?

Materialized views pre-compute expensive aggregations or joins and store the result. They trade storage and refresh cost for read performance. In interviews, mention them when the question involves repeated expensive reads on data that changes infrequently. Always discuss the refresh strategy: full refresh vs. incremental, and how stale data affects downstream consumers.
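A minimal PostgreSQL-style sketch (view and column names are illustrative; refresh semantics vary by engine):

```sql
-- Pre-compute a daily revenue aggregation and store the result
CREATE MATERIALIZED VIEW daily_revenue AS
SELECT order_date, SUM(amount) AS revenue
FROM orders
GROUP BY order_date;

-- Full refresh; with a unique index on order_date, adding
-- CONCURRENTLY avoids blocking readers during the refresh.
REFRESH MATERIALIZED VIEW daily_revenue;
```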

Build Optimization Intuition

Write queries, run them against real data, and inspect the execution results. That is the fastest way to develop query optimization instincts.