168 JOIN problems isolated from the data engineer SQL catalog. INNER, LEFT, RIGHT, FULL OUTER, anti-joins via NOT EXISTS, self-joins on inequality, and the many-to-many duplication trap. The largest single source of wrong answers in submission logs is the JOIN that should have been INNER but was written as LEFT.
JOINs are 20 percent of the data engineer SQL interview catalog by count and roughly 30 percent by graded weight. The disproportionate weight exists because JOINs are where most silent bugs live. A LEFT JOIN that should have been INNER produces extra rows with NULLs that pass simple SELECT but break downstream aggregations. A JOIN on a non-unique key in a many-to-many produces a Cartesian explosion that inflates COUNT and SUM silently. A self-join without an inequality predicate returns pairs in both directions plus self-pairs. Data engineer interviewers test for each of these specifically.
Six JOIN sub-patterns appear in this catalog. INNER versus LEFT distinction: the seeds include orphan rows on both sides. A LEFT JOIN with a WHERE clause on the right table accidentally converts to INNER because the WHERE filters out the NULL rows from non-matches. The fix is to move the predicate into the ON clause. Self-join on inequality: pairs of employees sharing a manager use e1.id less-than e2.id as the strict inequality; not-equal produces both directions and self-pairs. Anti-join via NOT EXISTS or LEFT JOIN with right.key IS NULL: NOT IN with a NULL value silently returns empty because NULL NOT IN list is unknown, not true; this is a graded trap. JOIN on a composite key with NULL: NULL equals NULL is unknown so the row drops; if NULL means "match" use IS NOT DISTINCT FROM in Postgres or COALESCE to a sentinel value. Many-to-many with explicit deduplication: LEFT JOIN to a bridge table inflates by the bridge cardinality, so SELECT DISTINCT on the keys or pre-aggregate the bridge to one row per fact key before joining. SCD2 half-open temporal joins: effective_from less-than-or-equal-to event_time AND (effective_to IS NULL OR event_time less-than effective_to); the closed-interval mistake doubles facts at the boundary microsecond, the open-interval mistake drops them.
The most-asked JOIN variant in 2026 data engineer interviews is the self-join for hierarchies and pairwise comparisons (employees-and-managers, products-frequently-bought-together, friends-of-friends). The second-most-asked is the anti-join for "find X without Y" (users who never made a purchase, products never reviewed, departments without any senior engineers). Both have multiple equivalent SQL formulations: LEFT JOIN with IS NULL, NOT EXISTS with correlated subquery, EXCEPT. The choice between them is a performance question in optimization rounds. NOT EXISTS short-circuits on first match and is almost always the right pick at scale.
Dialect notes for data engineer interviews. Postgres and Snowflake support IS NOT DISTINCT FROM for NULL-safe equality. BigQuery does not, but you can simulate with COALESCE. Hive and Presto support LATERAL VIEW EXPLODE and UNNEST for array column expansion which substitutes for some many-to-many bridges. MySQL 8 added native CTEs and window functions; pre-8 MySQL forces awkward self-join workarounds for ranking. EXPLAIN reading at L5 and above asks the data engineer to identify a hash join versus a sort-merge join versus a nested loop in the plan and explain when each is appropriate.
SQL JOIN Interview Questions
JOIN-focused SQL interview problems with multi-seed grading for data engineer practice.
Common questions
- What is the most common SQL JOIN bug interviewers fish for?
- Using a LEFT JOIN when an INNER JOIN is correct, then putting a WHERE clause on the right table's column. The WHERE clause filters out the NULL rows from non-matches, silently converting the LEFT JOIN to an INNER. The result looks plausible but drops the rows the LEFT was supposed to preserve. The fix is to move the predicate into the ON clause: LEFT JOIN t2 ON t1.key equals t2.key AND t2.col equals 'X'.
- When does a data engineer use a self-join in SQL?
- Hierarchical data (employees reporting to managers, categories reporting to parent categories), pairwise relationships (friends-of-friends, products-frequently-bought-together), and time-series comparisons (this month's revenue vs last month's). Pattern is JOIN the table to itself with aliases. For pairwise without duplicates, use strict inequality (e1.id less-than e2.id). Equality (e1.id not-equal e2.id) produces both (A, B) and (B, A) plus self-pairs.
- What is the difference between NOT IN and NOT EXISTS in SQL?
- NOT EXISTS is correlated and short-circuits on first match; performs well on large data. NOT IN materializes the subquery; performs worse on large data AND silently returns empty if the subquery has any NULLs because NULL NOT IN (1, NULL) is unknown, not true. For 'users who never made a purchase', use NOT EXISTS (SELECT 1 FROM orders WHERE orders.user_id equals users.id) or LEFT JOIN ... WHERE orders.user_id IS NULL.
- How do you JOIN on a composite key when one column is nullable?
- Standard equality (a.k1 equals b.k1 AND a.k2 equals b.k2) drops rows where either side is NULL because NULL equals NULL is unknown. If NULL means 'match', use Postgres IS NOT DISTINCT FROM or COALESCE both sides to a sentinel value that cannot appear in the data (rare in practice). If NULL means 'no match', the standard equality is correct. Verify the semantic with the interviewer.
- What is a many-to-many duplication trap in SQL?
- Joining a fact to a many-to-many bridge inflates row counts by the bridge cardinality. SELECT SUM(fact.amount) after a JOIN to a bridge that has 3 rows per fact returns 3x the true sum. The fix is to pre-aggregate the bridge to one row per fact key before joining, or to use SELECT DISTINCT on the keys before the SUM. Data engineer interviewers engineer this trap in the seeds; the query passes on data where the bridge happens to be 1-to-1 and fails when it is not.
- When should I use FULL OUTER JOIN?
- Rarely. The legitimate use cases are reconciliation (comparing two sources where either may have rows the other does not, plus matches in the middle) and pivoted comparisons (showing both sides of a delta). In most data engineer interview contexts where FULL OUTER seems right, a UNION ALL of two LEFT JOINs is clearer and performs the same. Mention FULL OUTER and defend whether it is the right choice.
- How do you JOIN to a slowly changing dimension Type 2 correctly?
- Half-open interval: ON dim.entity_id equals fact.entity_id AND dim.effective_from less-than-or-equal-to fact.event_time AND (dim.effective_to IS NULL OR fact.event_time less-than dim.effective_to). The half-open (less-than-or-equal on the left, strict less-than on the right) is what prevents two dim rows from matching at the boundary microsecond and doubling the fact. Closed-interval joins double; open-interval joins drop.