Data Engineering Interview Prep
SQL is the most-tested skill in data engineering interviews, and optimization is where senior-level signal lives. Whether it is a phone screen or an onsite round, interviewers expect you to explain why a query is slow and how to fix it.
The most-tested SQL concepts are GROUP BY, INNER JOIN, PARTITION BY, and LEFT JOIN. When queries built on them run slowly at scale, optimization knowledge is what separates levels.
A reporting query filters on status and date range, then selects three columns. The table has 50M rows. The query does a full table scan because no index covers the filter columns.
BEFORE
SELECT user_id, amount, created_at FROM transactions WHERE status = 'completed' AND created_at >= '2025-01-01'. Without an index, the engine reads every row and applies the filter in memory.
AFTER
Create a composite index on (status, created_at) with INCLUDE (user_id, amount). The engine uses an index-only scan: it finds matching rows in the B-tree and reads the included columns directly from the index without touching the heap. For this query pattern, that avoids the random heap I/O entirely.
Covering indexes are the single highest-impact optimization for read-heavy queries. Put equality filters first, range filters second in the index column order.
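In PostgreSQL 11+ syntax (SQL Server also supports INCLUDE), the covering index for the example above might look like this sketch; the table and column names follow the scenario, and the index name is illustrative:

```sql
-- Equality column (status) first, range column (created_at) second,
-- with the selected columns carried in the index leaf pages.
CREATE INDEX idx_transactions_status_created
    ON transactions (status, created_at)
    INCLUDE (user_id, amount);

-- The reporting query can now be answered by an index-only scan:
SELECT user_id, amount, created_at
FROM transactions
WHERE status = 'completed'
  AND created_at >= '2025-01-01';
```

One caveat worth mentioning in an interview: in PostgreSQL an index-only scan also depends on the visibility map being up to date, so a heavily modified table may still touch the heap until vacuum runs.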
An events table has 2B rows spanning three years. Queries always filter by event_date but the table is stored as a single heap. Every query scans the entire dataset.
BEFORE
SELECT COUNT(*) FROM events WHERE event_date BETWEEN '2025-12-01' AND '2025-12-31'. The planner has no way to skip irrelevant data. Even with an index, the sheer number of rows the query must visit keeps it slow.
AFTER
Partition the table by RANGE on event_date, using monthly partitions. The same query now reads only the December 2025 partition. The planner prunes all other partitions before execution begins. Combined with a local index on each partition, this query touches less than 3% of the total data.
Partitioning is not about making queries faster on small data. It is about making queries possible on large data. Always partition time-series tables by the column you filter on most.
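With PostgreSQL declarative partitioning, the setup above could be sketched as follows; the column list and partition bounds are illustrative:

```sql
-- Parent table partitioned by range on the filter column.
CREATE TABLE events (
    event_id   bigint,
    event_date date NOT NULL,
    payload    jsonb
) PARTITION BY RANGE (event_date);

-- One partition per month; the upper bound is exclusive.
CREATE TABLE events_2025_12 PARTITION OF events
    FOR VALUES FROM ('2025-12-01') TO ('2026-01-01');

-- An index on the parent cascades to each partition as a local index.
CREATE INDEX ON events (event_date);
```

A query filtering on December 2025 is then pruned to the events_2025_12 partition before execution begins.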
A three-table JOIN filters on a column from the last table in the chain. The engine joins all rows first, then filters. This materializes millions of intermediate rows that get discarded.
BEFORE
SELECT * FROM orders JOIN items ON orders.id = items.order_id JOIN returns ON items.id = returns.item_id WHERE returns.reason = 'defective'. The engine joins orders to items (millions of rows), then joins to returns, then filters.
AFTER
Rewrite to filter returns first using a CTE or subquery, then join the filtered result to items and orders. Better yet, restructure so the most selective filter is applied earliest. Some engines handle predicate pushdown automatically, but in interviews you should demonstrate awareness by writing it explicitly.
Filter early, join late. The fewer rows that enter a JOIN, the faster it runs. In interviews, always mention predicate pushdown when discussing JOIN optimization.
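One way to make the early filter explicit is a CTE; this is a sketch using the schema from the example above:

```sql
WITH defective_returns AS (
    -- Apply the most selective filter before any join.
    SELECT item_id
    FROM returns
    WHERE reason = 'defective'
)
SELECT o.*, i.*
FROM defective_returns r
JOIN items  i ON i.id = r.item_id
JOIN orders o ON o.id = i.order_id;
```

Note that modern planners (e.g. PostgreSQL 12+, which inlines CTEs) often push the predicate down on their own; the rewrite is mainly a way to demonstrate in an interview that you know where the filter belongs.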
A query uses SELECT DISTINCT on a large result set to find unique customer IDs who have placed orders. DISTINCT sorts or hashes the entire result, which is expensive when the base result has millions of rows.
BEFORE
SELECT DISTINCT customer_id FROM orders WHERE order_date >= '2025-01-01'. If orders has 50M rows in that date range, the engine must deduplicate all of them.
AFTER
Rewrite using EXISTS: SELECT id FROM customers WHERE EXISTS (SELECT 1 FROM orders WHERE orders.customer_id = customers.id AND order_date >= '2025-01-01'). The engine can stop scanning orders for each customer as soon as it finds one match. With an index on (customer_id, order_date), this is significantly faster.
DISTINCT is a blunt instrument. When you only need to check existence, EXISTS with a semi-join is almost always faster because it short-circuits after the first match.
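Laid out as runnable statements (a PostgreSQL-flavored sketch; the index name is illustrative):

```sql
-- Supporting index: equality on customer_id, range on order_date.
CREATE INDEX idx_orders_customer_date ON orders (customer_id, order_date);

-- Semi-join: for each customer, the scan of orders stops at the first match.
SELECT c.id
FROM customers c
WHERE EXISTS (
    SELECT 1
    FROM orders o
    WHERE o.customer_id = c.id
      AND o.order_date >= '2025-01-01'
);
```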
Two question forms
Optimization questions come in two forms. The first is direct: you are given a slow query and asked to improve it. The second is embedded in a design question where performance constraints force you to think about indexing, partitioning, or query restructuring.
What interviewers evaluate
They want to see a systematic approach. Start by asking about data volume and access patterns. Then reason about the execution plan before jumping to solutions. Candidates who immediately say "add an index" without understanding the query pattern score poorly.
Common patterns
Most optimization questions involve one of: replacing a correlated subquery, adding a composite index, partitioning a large table, rewriting a DISTINCT with EXISTS, or restructuring JOINs to reduce intermediate result sizes. Master these five and you cover 90% of what interviews test.
Talk about trade-offs
Indexes speed up reads but slow down writes. Partitioning helps range queries but complicates point lookups. Materialized views trade freshness for speed. Interviewers want to hear that you understand the cost, not just the benefit.
Is query optimization actually tested?
Yes. SQL is the most-tested skill in data engineering interviews, and optimization is a natural extension at mid-level and senior rounds. It might be explicit ("how would you speed up this query?") or embedded in a design question ("this pipeline processes 1B rows daily, how do you keep it under 30 minutes?"). Knowing EXPLAIN output and index strategy is expected.
How well do I need to know EXPLAIN output?
You do not need to memorize exact syntax, but you should know what to look for: sequential scans vs. index scans, nested loops vs. hash joins, sort operations, and estimated row counts. Being able to say "I would run EXPLAIN ANALYZE and look for sequential scans on large tables" shows practical experience.
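In PostgreSQL, for example, the diagnostic step might look like this; the output lines in the comments are illustrative of the shapes to scan for, not real plan output:

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT user_id, amount
FROM transactions
WHERE status = 'completed'
  AND created_at >= '2025-01-01';

-- Red flags to look for in the plan:
--   Seq Scan on transactions           -> no usable index for the filter
--   Rows Removed by Filter: <large N>  -> filter applied after the scan
--   estimated rows far from actual     -> stale statistics misleading the planner
```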
What is the most common mistake?
Adding indexes without considering the query pattern. An index on (a, b) does not help a query that filters on (b) alone. Candidates who explain composite index column ordering and covering indexes stand out. The second most common mistake is ignoring the data volume; an approach that works on 10K rows might be unusable on 10M.
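A quick sketch of why column order matters (table and column names are illustrative):

```sql
CREATE INDEX idx_ab ON t (a, b);

SELECT * FROM t WHERE a = 1 AND b = 2;  -- uses idx_ab: both key columns
SELECT * FROM t WHERE a = 1;            -- uses idx_ab: prefix of the key
SELECT * FROM t WHERE b = 2;            -- usually cannot: skips the leading column
```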
When should I bring up materialized views?
Materialized views pre-compute expensive aggregations or joins and store the result. They trade storage and refresh cost for read performance. In interviews, mention them when the question involves repeated expensive reads on data that changes infrequently. Always discuss the refresh strategy: full refresh vs. incremental, and how stale data affects downstream consumers.
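A minimal PostgreSQL-flavored sketch, using an assumed daily revenue rollup as the expensive read:

```sql
-- Pre-compute an aggregation that many dashboards read repeatedly.
CREATE MATERIALIZED VIEW daily_revenue AS
SELECT order_date, SUM(amount) AS revenue
FROM orders
GROUP BY order_date;

-- A unique index is required for CONCURRENTLY below.
CREATE UNIQUE INDEX ON daily_revenue (order_date);

-- Full refresh; CONCURRENTLY avoids blocking readers
-- at the cost of a slower refresh.
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_revenue;
```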
The fastest way to develop query optimization instincts is to write queries, run them against real data, and study the execution results.