SQL JOIN Practice for Data Engineers (2026)

JOINs are the second most-tested SQL category after aggregation. INNER JOIN is the most common join type by far, with LEFT JOIN close behind. This guide covers all seven join types — INNER, LEFT, FULL OUTER, self, inequality, CROSS, LATERAL — with code, practice problems, and the six common traps that cost interview rounds.

#2
Most-tested SQL category (after aggregation)
~85%
INNER + LEFT cover this share of JOIN questions
6
Common interview traps to know
25
Practice problems across all 7 join types

JOIN types ranked by interview frequency

Seven join types. Spend prep time proportional to frequency — INNER and LEFT dominate, the rest cover specific patterns.

Join typeWhat it doesInterview frequencyWhen to reach for it
INNER JOINBoth tables match~55% of interview JOIN questionsMost common; default join
LEFT JOINKeep all left rows~30%Critical for anti-joins and zero-handling
FULL OUTER JOINKeep all from both~3%Reconciliation / data diffing
Self-joinTable joined to itself~7%Hierarchies, consecutive events, pairs
Inequality joinON uses <, >, BETWEEN~3%Date ranges, attribution windows, intervals
CROSS JOINCartesian product~2%Date spines, scaffolds, gap-filling
LATERAL JOINSubquery per row<1%Postgres-specific; advanced top-N per group

INNER JOIN (~55% of questions)

-- INNER JOIN: returns ONLY rows that match in both tables.
-- Default and most common. Interviewers test that you understand
-- unmatched rows are silently dropped.

-- Total spend per customer (matched customers only)
SELECT c.name, SUM(o.amount) AS total_spend
FROM customers c
INNER JOIN orders o ON o.customer_id = c.id
GROUP BY c.name
ORDER BY total_spend DESC;

-- Three-way JOIN to find active paying users
SELECT u.id, u.email
FROM users u
INNER JOIN subscriptions s ON s.user_id = u.id
INNER JOIN payments p ON p.subscription_id = s.id
WHERE s.status = 'active'
  AND p.paid_at >= CURRENT_DATE - INTERVAL '30 days';

-- Practice problems (5):
-- 1. Join orders to customers and calculate total spend per customer
-- 2. Find products ordered at least once (filter never-ordered products)
-- 3. Match employees to department names, filter by hire_date
-- 4. Three-way join: users + subscriptions + payments for active payers
-- 5. Find pairs of students enrolled in the same course (self-join via INNER)

Default join type. Tests whether you understand that unmatched rows are silently dropped. Most candidates write this fluently; the trap is forgetting to verify row counts after multi-table joins.

LEFT JOIN (~30% of questions)

-- LEFT JOIN: returns ALL rows from the left table.
-- NULLs fill in where the right table has no match.
-- The most interview-relevant join type.

-- Customers who NEVER ordered (the anti-join pattern)
SELECT c.id, c.name
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.id
WHERE o.order_id IS NULL;
-- WHERE IS NULL converts LEFT JOIN to "left table only, no match"

-- Revenue per category INCLUDING categories with $0
SELECT cat.name, COALESCE(SUM(o.amount), 0) AS revenue
FROM categories cat
LEFT JOIN products p ON p.category_id = cat.id
LEFT JOIN order_items oi ON oi.product_id = p.id
LEFT JOIN orders o ON o.id = oi.order_id
GROUP BY cat.id, cat.name;
-- COALESCE turns NULL → 0 for zero-revenue categories

-- Practice problems (5):
-- 1. Find customers who have never placed an order
-- 2. List products with most recent review, INCLUDING products with zero reviews
-- 3. Every employee + manager name, including the CEO (no manager)
-- 4. Revenue per category including categories with $0
-- 5. Users who signed up but never completed onboarding

The most interview-relevant join type. The anti-join pattern (LEFT JOIN + WHERE IS NULL) appears in nearly every interview at every level.

Six JOIN traps that cost interviews

Each trap is independent. Mention NULL handling and row count verification proactively — they're signal-positive whether the interviewer asks or not.

Trap 1

LEFT JOIN + WHERE on right table = INNER JOIN

Adding WHERE right_table.column = 'value' after a LEFT JOIN converts it into an INNER JOIN. NULLs from unmatched rows fail the WHERE check and get dropped. Move the filter into the ON clause to keep it a true LEFT JOIN. This is the single most common JOIN interview trap.

ON for join conditions; WHERE for post-join filters
Trap 2

JOIN that silently duplicates rows (fan-out)

If your join key is not unique on one side, each row on the unique side gets duplicated for every match on the 'many' side. Joining orders to order_items gives you one row per item, not one row per order. Interviewers ask you to count orders after a join to see if you catch the fan-out.

Always check row count after joining
Trap 3

NULL != NULL in JOIN conditions

Joining on a nullable column drops rows where either side is NULL because NULL = NULL evaluates to NULL (falsy in WHERE/ON). If you need to match NULLs, use IS NOT DISTINCT FROM (Postgres) or COALESCE(col, sentinel). This trap costs candidates points who don't flag NULL handling.

Flag NULL behavior in every JOIN answer
Trap 4

Ambiguous column references in 3+ table joins

When you join three or more tables, always alias your tables and qualify every column. Interviewers watch for ambiguous references. If two tables both have an 'id' column and you write SELECT id, the query fails with 'ambiguous column'. Alias = professionalism signal.

Alias every table; qualify every column
Trap 5

Aggregate then join, OR join then aggregate

When joining a 1-table to a 1-to-many table and aggregating, the order matters for performance AND correctness. Pre-aggregate the many side to avoid fan-out, then join: smaller intermediate result, no duplicate inflation. Joining then aggregating works but is slower and easier to get wrong on multi-table joins.

Pre-aggregate the many side when possible
Trap 6

FULL OUTER JOIN with COALESCE for the join key

FULL OUTER preserves IDs from both sides; SELECT a.id won't return rows that only exist in B. Always SELECT COALESCE(a.id, b.id) to get the union of join keys. Forgetting this is the most common FULL OUTER mistake and causes silent data loss in reconciliation queries.

COALESCE both sides of the join key

Self-joins (~7% of questions)

-- SELF-JOIN: a table joined to itself.
-- For hierarchies, pairs, and consecutive events.

-- Employee + manager from a single employees table
SELECT e.name AS employee, m.name AS manager
FROM employees e
LEFT JOIN employees m ON e.manager_id = m.id;

-- Customers who ordered on CONSECUTIVE days
SELECT DISTINCT o1.customer_id
FROM orders o1
INNER JOIN orders o2
  ON o1.customer_id = o2.customer_id
  AND o2.order_date = o1.order_date + INTERVAL '1 day';

-- Employees who earn MORE than their manager
SELECT e.name, e.salary, m.name AS manager, m.salary AS manager_salary
FROM employees e
INNER JOIN employees m ON e.manager_id = m.id
WHERE e.salary > m.salary;

-- Practice problems (5):
-- 1. Each employee and their direct manager from a single employees table
-- 2. Customers who placed orders on consecutive days
-- 3. All pairs of products in the same category with price diff under $10
-- 4. Flights where arrival city = departure city of another flight (connections)
-- 5. Employees who earn more than their manager

Table joined to itself. Solve hierarchies (employee/manager), consecutive events, and pairs within a single table. Self-joins feel unnatural at first but solve a specific class no other join can.

Inequality joins (~3% of questions)

-- INEQUALITY JOIN: condition uses <, >, BETWEEN instead of =.
-- Date ranges, attribution windows, interval matching.

-- Match each sale to the price tier active on that date
SELECT s.sale_id, s.sale_date, pt.tier_name, s.amount
FROM sales s
INNER JOIN price_tiers pt
  ON s.sale_date BETWEEN pt.valid_from AND pt.valid_to;

-- Events within 30 minutes of a user signup
SELECT e.event_id, u.user_id, e.event_time
FROM users u
INNER JOIN events e
  ON e.user_id = u.user_id
  AND e.event_time BETWEEN u.signup_at AND u.signup_at + INTERVAL '30 minutes';

-- Attribute conversions to the most recent marketing touch
SELECT c.conversion_id, m.touch_id, c.conversion_time, m.touch_time
FROM conversions c
LEFT JOIN LATERAL (
  SELECT *
  FROM marketing_touches mt
  WHERE mt.user_id = c.user_id
    AND mt.touch_time <= c.conversion_time
  ORDER BY mt.touch_time DESC
  LIMIT 1
) m ON true;

-- Practice problems (5):
-- 1. Match each sale to the price tier active on that date
-- 2. Events within 30 minutes of a user signup
-- 3. Attribute conversions to the most recent marketing touch
-- 4. Salary history records with overlapping date ranges
-- 5. Orders matched to the exchange rate valid on the order date

JOIN with <, >, BETWEEN instead of =. Date ranges, attribution windows, interval matching. Many candidates have never written one.

Four worked-example patterns

Specific patterns that appear in actual interview questions. Each is a recognizable structural move you can deploy when you see the underlying problem shape.

Pattern: zero-handling

Customer LEFT JOIN orders with COALESCE

Given customers and orders tables, find each customer's name, order count, and total spend. INCLUDE customers who never ordered. Query: LEFT JOIN customers to orders, GROUP BY customer, COUNT(o.order_id), COALESCE(SUM(o.amount), 0). The LEFT JOIN keeps every customer; COALESCE turns NULL into 0 for the no-orders case. Without COALESCE, the report shows NULL where the reader expects 0 — silent bug.

LEFT JOIN + COALESCE = zero-handling
Pattern: temporal pairs

Self-join for consecutive-day orders

Find customers who ordered on two consecutive days. Self-join orders to itself on customer_id and o2.order_date = o1.order_date + 1 day. SELECT DISTINCT o1.customer_id. The +1 day inequality is the structural insight — self-joins for temporal patterns hinge on the date arithmetic in the join condition.

Self-join + date arithmetic = temporal patterns
Pattern: interval matching

Inequality join for price-tier matching

Match each sale to the price tier that was active on the sale date. Tiers have valid_from and valid_to columns. JOIN ON sale_date BETWEEN valid_from AND valid_to. The join produces one row per (sale, matching tier) — guard against overlapping tiers by validating the data first.

BETWEEN in ON = interval match
Pattern: scaffolding

CROSS JOIN for date spine + gap fill

Generate a report row for every (day, region) combination even when no sales happened. CROSS JOIN generate_series with regions to create the scaffold. LEFT JOIN sales onto the scaffold. COALESCE revenue to 0. This pattern guarantees the dashboard shows every period regardless of activity.

Spine + CROSS JOIN + LEFT JOIN + COALESCE

CROSS JOIN (~2% of questions)

-- CROSS JOIN: Cartesian product. Every row × every row.
-- For generating combinations, date spines, gap-filling.

-- Date spine × regions for reporting scaffold
SELECT d.dt, r.region_name, COALESCE(s.revenue, 0) AS revenue
FROM generate_series(DATE '2024-01-01', DATE '2024-12-31', INTERVAL '1 day') d(dt)
CROSS JOIN regions r
LEFT JOIN daily_sales s ON s.day = d.dt AND s.region_id = r.id;

-- All product-store combinations (find missing stocks)
SELECT p.product_id, s.store_id, COALESCE(i.qty, 0) AS qty
FROM products p
CROSS JOIN stores s
LEFT JOIN inventory i
  ON i.product_id = p.product_id
  AND i.store_id = s.store_id
WHERE i.qty IS NULL OR i.qty = 0;

-- Round-robin tournament: all team matchups
SELECT t1.name AS home, t2.name AS away
FROM teams t1
CROSS JOIN teams t2
WHERE t1.id < t2.id;  -- avoid (A,A) and duplicate (A,B)/(B,A)

-- Practice problems (5):
-- 1. All product-store combinations to find missing stocks
-- 2. Date spine × regions reporting scaffold
-- 3. Round-robin tournament matchups
-- 4. Pair every user with every survey question (find unanswered)
-- 5. Multiplication table from two integer sequences

Cartesian product. Use for date spines, gap filling, and combination generation. Small tables only — a 1M × 1M cross join produces a trillion rows.

FULL OUTER JOIN (~3% of questions)

-- FULL OUTER JOIN: keeps all rows from BOTH sides.
-- For reconciliation and data diffing.

-- Compare two source systems for data quality
SELECT
  COALESCE(s.id, w.id) AS id,
  s.amount AS source_amount,
  w.amount AS warehouse_amount,
  CASE
    WHEN s.id IS NULL THEN 'missing in source'
    WHEN w.id IS NULL THEN 'missing in warehouse'
    WHEN s.amount <> w.amount THEN 'amount mismatch'
    ELSE 'match'
  END AS diff_type
FROM source_orders s
FULL OUTER JOIN warehouse_orders w ON s.id = w.id
WHERE s.id IS NULL OR w.id IS NULL OR s.amount <> w.amount;

-- Find IDs in A but not B (and vice versa) in one query
SELECT
  COALESCE(a.id, b.id) AS id,
  CASE WHEN a.id IS NULL THEN 'only in B'
       WHEN b.id IS NULL THEN 'only in A'
       ELSE 'both' END AS membership
FROM table_a a
FULL OUTER JOIN table_b b ON a.id = b.id;

Keeps all rows from both sides. Most common use: reconciliation between two source systems. Always SELECT COALESCE(a.id, b.id) for the join key — forgetting this is the most common FULL OUTER mistake.

LATERAL JOIN (Postgres-specific, <1%)

-- LATERAL JOIN (Postgres): subquery executed per row of the outer table.
-- Advanced pattern for "top N per group" without window functions.
-- Often produces cleaner plans than ROW_NUMBER for small N.

-- Top 3 most recent orders per customer
SELECT c.id, c.name, o.order_id, o.order_date, o.amount
FROM customers c
LEFT JOIN LATERAL (
  SELECT order_id, order_date, amount
  FROM orders
  WHERE customer_id = c.id
  ORDER BY order_date DESC
  LIMIT 3
) o ON true;

-- Most recent failed login per user
SELECT u.id, u.email, lf.attempted_at, lf.ip_address
FROM users u
LEFT JOIN LATERAL (
  SELECT attempted_at, ip_address
  FROM login_failures
  WHERE user_id = u.id
  ORDER BY attempted_at DESC
  LIMIT 1
) lf ON true;

Advanced pattern for top-N per group. Postgres-only. Often cleaner than ROW_NUMBER for small N. Snowflake's LATERAL FLATTEN serves a similar purpose for nested JSON.

SQL JOINs FAQ

How many types of SQL JOINs are there?+
Five standard join types: INNER JOIN, LEFT JOIN (LEFT OUTER), RIGHT JOIN (RIGHT OUTER), FULL OUTER JOIN, CROSS JOIN. In practice, INNER and LEFT cover ~85% of real work. Self-joins and inequality joins are techniques that use these same join types with different conditions. LATERAL JOIN is a Postgres-specific addition for per-row subqueries.
What's the difference between LEFT JOIN and INNER JOIN?+
INNER JOIN returns only rows that match in both tables. LEFT JOIN returns all rows from the left table, filling NULLs where the right table has no match. If you need to find 'things that do NOT have a match,' use LEFT JOIN with WHERE right_table.key IS NULL — the canonical anti-join pattern.
Why do SQL JOINs appear in every data engineering interview?+
Because real data lives in normalized tables. Every analytical query requires combining data from multiple sources. JOINs are the second most-tested SQL category after aggregation. Interviewers test JOINs to verify you can reason about cardinality, handle NULLs from outer joins, and avoid row duplication.
How do I avoid row duplication when joining tables?+
Understand the cardinality of your join keys. If the key is one-to-many, each row on the 'one' side gets duplicated for every match on the 'many' side. Solutions: aggregate before joining, use DISTINCT, or join on a unique key. Always check row count after a join to catch unexpected fan-out.
When should I use a CROSS JOIN vs UNION?+
CROSS JOIN multiplies rows (Cartesian product). UNION concatenates rows. They solve different problems. CROSS JOIN: generate every (A, B) combination (scaffolds, date spines). UNION: combine two result sets vertically (deduped) or UNION ALL (kept duplicates). Confusing the two is a junior smell.
What's a LATERAL JOIN and when do I need one?+
Postgres-specific. Subquery executed per outer row. Useful for 'top N per group' patterns where the subquery references the outer row. ROW_NUMBER over window achieves the same goal more portably; LATERAL is sometimes faster on small N and is more readable. Not in MySQL or SQL Server, but Snowflake's lateral flatten serves a similar purpose for nested data.
02 / Why practice

Write multi-table joins under pressure

  1. 01

    Active recall beats re-reading by 50%

    Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom

  2. 02

    76% of hiring managers reject on the coding task, not the resume

    From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice

  3. 03

    Five problem shapes cover 80% of data engineer loops

    Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition

Related SQL prep