Joins: Advanced
GitHub's social graph connects over 100 million developers, and the "suggested repositories" feature works by self-joining a followers table to itself to find developers who share mutual interests with you. When GitHub asks "what repositories are popular among the developers I follow?", it queries a single massive table against a copy of itself, traversing second-degree relationships across hundreds of millions of edges. The same self-join pattern powers org-chart traversal at enterprise platforms like Workday and connection-of-connection discovery at every major professional network. The advanced techniques you are about to learn operate at this scale.
Broadcast joins (distributed)
Optimize joins across distributed systems
The Distribution Problem
Broadcast Join Mechanics
| Aspect | Shuffle Join | Broadcast Join |
|---|---|---|
| Network Traffic | High (both tables moved) | Low (only small table copied) |
| Memory Per Node | Lower | Higher (full small table) |
| Best When | Both tables large | One table is small |
| Scalability | Handles any size | Limited by memory |
Broadcast Selection
- Most engines use size thresholds between 10MB and 100MB
- Tables below the threshold are automatically broadcast
- Some engines use table statistics for smarter decisions
Broadcast Control
Explicit Broadcast Hints
Sometimes the optimizer picks shuffle when broadcast would be faster. Many engines support hints to force a broadcast join, typically using comment-style syntax like /*+ BROADCAST(table) */ after SELECT.
Hint tells the engine to broadcast dim_product to all nodes
Broadcast Hints
Broadcasting a filtered result:
- Use broadcast hints for dimension tables under 1GB
- Run EXPLAIN to verify the join strategy changed
- Update table statistics regularly so hints become unnecessary
- Broadcast tables that won't fit in worker memory
- Use hints as a substitute for proper table statistics
> Complete this query to identify which broadcast strategy is best for joining a 10TB event_data table with a 50MB users table.
SELECT e.event_type, u.username FROM event_data AS e INNER users AS u ON e.user_id = u.user_id
Inequality joins with BETWEEN
Join rows on range conditions
So far, all our joins used equality conditions: ON a.id = b.id. But join conditions can use any comparison operator: <, >, <=, >=, and BETWEEN.
Range Matching Example
| name | salary |
|---|---|
| Alice | $75000 |
| Bob | $35000 |
| Carol | $95000 |
| min_income | max_income | rate |
|---|---|---|
| $0 | $40000 | 12% |
| $40000 | $85000 | 22% |
| $85000 | $200000 | 32% |
> Find events that occurred during a batch job by matching timestamps to the job window.
SELECT e.event_type, e.event_timestamp, b.job_name FROM event_data AS e INNER JOIN batch_jobs AS b ON e.event_timestamp b.started e.event_timestamp b.ended
BETWEEN Syntax
The BETWEEN operator provides cleaner syntax for range conditions:
Real-World Applications
Event Attribution Example
Performance
Inequality joins are computationally expensive. Without an equality condition, the engine may need to compare every row in one table against every row in the other (a Cartesian product with filtering). Adding an equality condition alongside the inequality dramatically improves performance.
c.user_id = i.user_id allows the engine to first partition by user, then apply the time range filter within each partition.- Add an equality condition to partition data before the range comparison
- Pre-filter in subqueries to reduce row counts before the join
- Bucket continuous time ranges into discrete periods for partitioning
- Replace self joins with
window functionswhen possible
Self joins for hierarchical data
Query parent-child relationships in one table
Why Self Join?
| emp_id | name | manager_id |
|---|---|---|
| 1 | Alice | NULL |
| 2 | Bob | 1 |
| 3 | Carol | 1 |
| 4 | Dave | 2 |
| 5 | Eve | 2 |
| 6 | Frank | 3 |
Self Join Syntax
We use LEFT JOIN because Alice has no manager (NULL), and we still want to include her in the results. With INNER JOIN, Alice would disappear.
| emp_id | name | mgr_id |
|---|---|---|
| 1 | Alice | NULL |
| 2 | Bob | 1 |
| 3 | Carol | 1 |
| emp_id | name | mgr_id |
|---|---|---|
| 1 | Alice | NULL |
| 2 | Bob | 1 |
| 3 | Carol | 1 |
| col_0 | col_1 | ||||
|---|---|---|---|---|---|
| 1 | Alice | NULL | 1 | Alice | NULL |
NULL for her manager because she has no manager_id. The LEFT JOIN ensures she isn't dropped from the results, which is critical for including top-level nodes in a hierarchy.Self Join Patterns
Finding Colleagues
> Find metrics that share the same department using a self join.
SELECT e1.metric_name, e2.metric_name FROM employee_metrics AS e1 INNER JOIN employee_metrics AS e2 ON = e2.department AND e1.metric_id e2.metric_id
Comparing Rows
| product | price | date |
|---|---|---|
| Widget | $10 | 2024-01 |
| Widget | $12 | 2024-02 |
| Gadget | $25 | 2024-01 |
| Gadget | $23 | 2024-02 |
LAG() or LEAD() window functions instead of self joins. They're more efficient and easier to read for sequential comparisons.Recursive CTEs for trees
Traverse tree structures with SQL
A recursive CTE (Common Table Expression) is a query that references itself to traverse hierarchical or graph-structured data. It's like a loop in programming: start with base cases, then repeatedly expand until done.
Multi-Level Hierarchies
Recursive CTE Structure
A recursive CTE has two parts connected by UNION ALL:
> Complete the recursive CTE to find all messages in a reply thread.
WITH RECURSIVE thread AS ( SELECT msg_id, content FROM chat_msgs WHERE reply_to IS NULL SELECT c.msg_id, c.content FROM chat_msgs AS c INNER thread AS t ON c.reply_to = t.msg_id ) SELECT * FROM thread
Recursive CTE in Practice
Hierarchy Example
Building a Path
Each iteration appends the current employee's name to the path. The || operator concatenates strings (use CONCAT() in some databases).
Cycle Detection
- Most engines have a default recursion limit (100-1000 iterations)
- Use
CYCLEdetection syntax where supported - Track visited nodes in an array:
ARRAY[emp_id] - Add
WHERENOTCONTAINS(visited, id) to break cycles
CROSS JOIN (Cartesian product)
Generate every possible row combination
A CROSS JOIN produces the Cartesian product of two tables: every row from the first table paired with every row from the second table. If table A has 3 rows and table B has 4 rows, the result has 3 × 4 = 12 rows.
Syntax Options
CROSS JOIN syntax (preferred)
Comma syntax (older SQL-89 style)
The comma syntax is the older SQL-89 style. The explicit CROSS JOIN is clearer and preferred in modern code. Both produce identical results.
| category |
|---|
| Electronics |
| Clothing |
| device_type |
|---|
| phone |
| tablet |
| col_0 | col_1 |
|---|---|
| Electronics | phone |
Product Variants
| category |
|---|
| Electronics |
| Clothing |
| Home |
| device_type |
|---|
| phone |
| tablet |
| laptop |
> Identify which columns come from which table in this cross join that generates all category-device combinations.
SELECT , FROM products AS p CROSS JOIN devices AS d
Practical Cross Join Uses
Date × Category Reporting
The CROSS JOIN creates every date-category pair. The LEFT JOIN fills in actual sales data. Dates with no sales show 0 instead of missing entirely.
The Danger of Cross Joins
LIMIT during development. A forgotten WHERE clause after a comma join is a common cause of runaway queries.- Product variants: generate every combination of color and size
- Calendar scaffolds: cross dates with groups to fill reporting gaps
- Pairwise comparisons: match every row against every other row
- Test data: generate synthetic datasets from small seed tables
- Dense pivot tables: ensure all key combinations exist in reports
- Use explicit CROSS JOIN syntax for clarity
- Verify row counts before cross joining
- Cross join large tables without LIMIT
- Use comma syntax in modern SQL code
A hospital network with 3 facilities needs to build unified patient records by joining siloed clinical, billing, and scheduling data. You lead the data engineering effort to create the patient journey pipeline. Each facility uses a different EHR vendor. Patient IDs overlap across systems (facility A and facility B both have patient_id 1001, but they refer to different people). Clinicians need a single view of each patient across all three facilities for care coordination.
| visit_id | facility | patient_id | admit_date | diagnosis |
|---|---|---|---|---|
| V-001 | A | 1001 | 2024-03-15 | pneumonia |
| V-002 | B | 1001 | 2024-03-20 | fracture |
| V-003 | A | 2045 | 2024-03-22 | diabetes |
| patient_id | facility | name | dob | insurance_id |
|---|---|---|---|---|
| 1001 | A | Maria Garcia | 1985-06-12 | INS-447 |
| 1001 | B | James Wilson | 1972-11-03 | INS-891 |
| 2045 | A | Maria Garcia | 1985-06-12 | INS-447 |
You need to join 50M patient visits to 200K patient profiles. Patient IDs are only unique within a facility, not across the network. The same person (Maria Garcia) may have different IDs at different facilities. Which join approach do you use?
-- Goal: unified patient record SELECT v.visit_id, p.name, v.diagnosis FROM patient_visits v ??? patient_profiles p ON ??? -- your join logic here
> You are a data engineer at Workday building a workforce analytics platform that maps reporting chains, computes salary band compliance, and generates all role-skill combinations for a recommendation engine. All queries touch the same employee_metrics table in different ways.
salary_bands dimension table to all worker nodes, avoiding a shuffle of the 50M employee_metrics fact table.employee_metrics.salary BETWEEN bands.min_salary AND bands.max_salary assigns each employee to their correct compensation band.manager_id counterpart for org-chart traversal.CROSS JOIN between roles and required_skills generates every role-skill combination as the scaffold for the recommendation scoring model.ON salary BETWEEN min AND max); add equality conditions for performanceUNION ALL recursive stepCROSS JOIN produces Cartesian products (all combinations); row count = rows1 x rows2CROSS JOIN for product variants, date scaffolds, and combination generationSelf joins and CROSS joins. Send help.
- Category
- SQL
- Difficulty
- advanced
- Duration
- 32 minutes
- Challenges
- 0 hands-on challenges
Topics covered: Broadcast joins (distributed), Inequality joins with BETWEEN, Self joins for hierarchical data, Recursive CTEs for trees, CROSS JOIN (Cartesian product)
Lesson Sections
- Broadcast joins (distributed)
In distributed query engines, data is split across multiple nodes. When you join two tables, the engine must decide how to bring matching rows together. This decision dramatically affects query performance. The Distribution Problem Consider joining a 100 billion row fact table with a 10,000 row dimension table. The fact table is distributed across 100+ nodes. For each fact row to find its dimension match, the dimension data must be accessible. Broadcast Join Mechanics A broadcast join copies the
- Inequality joins with BETWEEN (concepts: sqlInequalityJoin)
Not all joins match on exact values. Range-based matching opens up powerful patterns for lookups, attribution, and temporal analysis. Range Matching Example Suppose you have employee metrics with salaries and a cost allocation table with budget brackets. Each bracket has a min and max income. To find each employee's tax rate, match their salary to the bracket range: Each employee's salary is matched to the bracket where it falls between min_income and max_income. This is impossible with an equal
- Self joins for hierarchical data
A self join is when a table is joined to itself. This sounds strange at first, but it's essential for querying hierarchical data, comparing rows within the same table, and finding related records. Why Self Join? Self joins solve problems where relationships exist within a single table. The most common example: an employee_metrics table where each employee has a manager_id that references another row in the same table. Alice is the CEO (no manager). Bob and Carol report to Alice. Dave and Eve rep
- Recursive CTEs for trees (concepts: sqlRecursiveCte)
Multi-Level Hierarchies Self joins work for one-level relationships (employee to manager). But what if you need all ancestors: employee to manager to manager's manager to CEO? You'd need multiple self joins, and you'd need to know the maximum depth in advance. Recursive CTE Structure Recursive CTE in Practice Recursive CTEs enable traversal of arbitrary-depth hierarchies, path building, and cycle detection in graph structures. Hierarchy Example Let's find all entries under Alice (the CEO) and th
- CROSS JOIN (Cartesian product) (concepts: sqlCrossJoin)
A cross join combines every row from one table with every row from another, creating all possible combinations. Syntax Options SQL supports two equivalent syntaxes for cross joins: CROSS JOIN syntax (preferred) Comma syntax (older SQL-89 style) Observe how every row from the first table pairs with every row from the second, producing all possible combinations. Two rows crossed with two rows yields four output rows. This multiplicative behavior is why cross joins require caution with larger table