Query Structure: Advanced

Spotify's fraud team writes a daily query that checks, for every one of 600 million accounts, whether suspicious streaming activity exceeds that account's own historical average. Not a global average, not a cohort average: a threshold computed from that specific account's own past behavior. The pattern that makes this possible is the correlated subquery, an inner query that runs once per outer row and reads that row's account ID to derive a personalized comparison. This lesson teaches correlated subqueries alongside the set operations and array functions that complete the advanced SQL toolkit.

Correlated subqueries

Daily Life
Interviews

Compare each row against its own group

A correlated subquery references a column from the outer query in its WHERE clause. Unlike a plain subquery that executes once and hands back a fixed result, a correlated subquery re-executes for every outer row using that row's values as input. The practical consequence: you can compute per-row aggregates or existence checks against a related table without collapsing the outer result set, which is something a JOIN cannot do cleanly when aggregation would remove the row-level detail you need.

The defining characteristic is the column reference that crosses query boundaries. The inner query references a column from the outer table alias, creating a dependency: the database cannot execute the subquery independently because it needs context from each outer row. Think of it as a nested loop in code, where the inner loop body runs once per iteration of the outer loop.

Understanding Correlated Subqueries

The two examples below show how the inner query can reference the outer row: first to pull a per-group aggregate into each detail row, then to trace the execution pass-by-pass.
//

Basic Example

This query demonstrates the key pattern: the inner query references a column from the outer query, creating the correlation.
1SELECT
2 e.name,
3 e.salary,
4 (
5 SELECT
6 AVG(salary)
7 FROM employee_metrics AS e2
8 WHERE e2.department = e.department
9 ) AS dept_avg
10FROM employee_metrics AS e
Result
namesalarydept_avg
Alice9500075000
Bob6500075000
Charlie12000085000
David7200085000

This query shows each employee with their department's average salary. Notice the correlation: "WHERE e2.department = e.department" - the inner query references e.department from the outer query. For each employee, it calculates their department's average.

Why can't you just use GROUP BY? Because you need each individual employee's row alongside their department average. A regular GROUP BY would collapse all employees in a department into one row. The correlated subquery lets you keep individual employee rows while accessing aggregate department data.

//

How It Executes

The database starts with the first device and immediately adds it to the result table. Now it needs to fill in the error_count column, so it runs the inner query.

The inner query scans through every single row in device_logs, looking for matching device_ids. When it finds a match, it checks: "Is this error a crash?" If yes, it counts as 1. If no, it counts as 0. After scanning all logs, it sums up those 1s and 0s to get the total crash count for that device.

This process repeats for every device. Each device gets added to the result, then the entire device_logs table gets scanned again for that specific device.

If you have 3 devices and 6 log entries, the database scans through all 6 logs three separate times, once per device. This is why correlated subqueries can be expensive: the inner query runs repeatedly, once for each outer row.

Practical Considerations

Correlated subqueries are powerful but expensive. Understanding when to use them versus a JOIN or CTE is the key to writing fast, readable queries.

//

Correlated Subquery Use

Correlated subqueries are easier to write and read when you need row-by-row calculations in SELECT or WHERE clauses. They clearly express "for each row, calculate this value from related data."

Later, when you learn about JOINs and aggregation (in the Joins and Aggregating lessons), you'll see alternative approaches. For now, correlated subqueries are a powerful tool for row-by-row calculations.

//

Common Mistakes

Mistake 1: Forgetting the correlation. If your subquery doesn't reference the outer query, it's not correlated -- it's just a regular subquery that runs once.

Mistake 2: Using correlated subqueries on massive datasets without indexes. Always index the columns used in the WHERE correlation to avoid full table scans on every outer row.

//

Real-World Use Cases

Finding the latest status for each user by selecting from user_events with a correlated filter. Calculating running totals or ranks within groups. Filtering tables based on complex conditions from related tables using EXISTS.

Any scenario where you need "for each X, compute Y from another table" is a natural fit for correlated subqueries.
Row-by-row calculations
Row-by-row calculations
Each outer row triggers a fresh inner query execution
Index correlation columns
Index correlation columns
Without indexes, performance degrades fast on large tables
Runs once per outer row
Runs once per outer row
Keep inner queries simple to avoid compounding cost
Here are the technical details that explain how correlated subqueries execute under the hood:
Technical Details
  • Executes once per row in the outer query (O(n) complexity)
  • Can reference columns from outer query using table aliases
  • Often rewritten by query optimizer as joins for better performance
  • Most efficient when used with EXISTS or NOT EXISTS
  • Requires careful indexing on correlation columns
  • Can appear in SELECT, WHERE, or HAVING clauses

> Complete the query to find products that have at least one transaction.

SELECT
  product_name
FROM products AS p
WHERE ___ (
  ___
    1
  FROM transactions AS t
  WHERE t.product_id = ___
)
IN
t.user_id
SELECT
EXISTS
p.product_id

EXISTS stops scanning as soon as it finds the first matching row, making it more efficient than IN for large subquery result sets because it does not need to materialize all matches.

The correlated subquery pattern with EXISTS is the standard way to express semi-join logic in SQL, and modern query optimizers often rewrite IN subqueries into the same execution plan.

NOT EXISTS is the complement and is typically more reliable than NOT IN when the subquery column can contain NULL values, since NOT IN with any NULL in the list returns no rows.

EXCEPT and EXCEPT ALL

Daily Life
Interviews

Find rows that exist in one set but not another

EXCEPT subtracts one result set from another, revealing rows that exist in the first query but not the second.

EXCEPT returns rows from the first query that don't appear in the second query. It's incredibly useful for data auditing, finding missing records, or identifying what's changed between two datasets. Think of it as subtraction for SQL; you're removing all matching rows from the first result set.

EXCEPT removes duplicate rows (like UNION), while EXCEPT ALL keeps duplicates (like UNION ALL). Use EXCEPT when you only care which unique records differ. Use EXCEPT ALL when the count of duplicates matters; for example, when comparing inventory snapshots or transaction logs.

EXCEPT Usage Examples

The key choice is whether duplicate counts matter. EXCEPT gives you unique unmatched rows; EXCEPT ALL tells you how many times each row is unmatched.

//

Basic Example

EXCEPT returns rows from the first query that do not appear in the second, showing what is missing or unmatched.

Imagine you have all user signups in one table, and confirmed (email-verified) signups in another. This query returns user IDs who signed up but never confirmed their email. EXCEPT removes duplicates, so each unconfirmed user appears once.
1SELECT
2 user_id
3FROM all_signups
4
5EXCEPT
6
7SELECT
8 user_id
9FROM confirmed_signups
//

When to Use EXCEPT

EXCEPT is cleaner when you're comparing entire rows and only need to identify what's missing. It operates on full result sets and clearly expresses "give me rows from A that aren't in B."

For readability and intent, EXCEPT is often the best choice when you simply want to find differences between two datasets without needing to combine columns from both.

Real-World Applications

EXCEPT shines in data reconciliation, auditing, and finding discrepancies between datasets.

//

EXCEPT Scenarios

EXCEPT is perfect for data reconciliation tasks. Compare yesterday's customer list with today's to find who unsubscribed. Compare production database with backup to find missing records.

Compare expected test results with actual results to find failures. Any time you need to answer "what's in A but not in B?", EXCEPT is your tool.

TIP
Use EXCEPT ALL when duplicate frequency matters (audit logs, transaction reconciliation). It reveals exact count differences between two result sets.

Rookie Mistake: Using EXCEPT to "remove duplicates from a table." That's not what EXCEPT does. EXCEPT compares two queries. If you want to deduplicate within one table, use SELECT DISTINCT or GROUP BY instead.

Do
  • Use EXCEPT to find records in one dataset but not another
  • Use EXCEPT ALL when duplicate tracking matters
  • Test both sides of EXCEPT to ensure they have matching schemas
  • Consider LEFT JOIN for better performance with large indexed tables
Don't
  • Don't use EXCEPT when you need columns from both sides in results
  • Avoid EXCEPT for simple deduplication within one table
  • Don't overuse EXCEPT ALL when EXCEPT is sufficient

A few technical constraints apply: both queries must select the same number of columns with compatible types, and column names come from the first query. EXCEPT deduplicates its output the same way UNION does; EXCEPT ALL subtracts occurrence counts rather than deduplicating. The operation is also directional: swapping the two sides produces a different result.

Parsons Problem

> Arrange the tiles to find users who have never placed an order.

Arrange each line of the EXCEPT query in order:

1
2
3

EXCEPT is directional: the first query defines the full set and the second query defines what to remove, so swapping the two sides produces a completely different result.

Unlike a NOT IN subquery, EXCEPT automatically handles NULL values correctly by using the same comparison semantics as DISTINCT, which treats two NULLs as equal.

In data quality workflows, EXCEPT is commonly used to find records that are present in a raw ingestion table but absent from the cleaned output, making it easy to detect rows that were filtered out.

The Reconciliation EngineStep 1
>

The firm processes trades through a front-office booking system, a middle-office risk platform, and a back-office settlement ledger. Discrepancies between these systems cause regulatory findings. You need to identify and resolve mismatches using advanced query structures.

bookings
trade_idsymbolquantitybook_time
T-1001AAPL5002024-03-15 09:31:00
T-1002GOOG2002024-03-15 09:45:00
T-1003MSFT1502024-03-15 10:02:00
risk_entries
ref_idsymbolnotionalrisk_time
R-1001AAPL875002024-03-15 09:32:00
R-1002GOOG560002024-03-15 09:46:00
R-1004TSLA420002024-03-15 10:15:00
settlements
settle_idtrade_refamountstatus
S-1001T-100187500settled
S-1003T-100333750pending
S-1005T-100512000failed
May 2026
Break Detection

First, you need to identify trades that exist in bookings but have no matching risk entry. How do you find these "break" records?

SQL
-- Find trades booked but missing from risk
-- bookings: T-1001, T-1002, T-1003
-- risk_entries: R-1001, R-1002, R-1004
-- Which bookings have no risk match?
Financial reconciliation is one of the most demanding use cases for set operations because regulatory requirements demand a complete and auditable account of every discrepancy between trading systems.

The combination of EXCEPT for finding breaks, INTERSECT for finding matches, and correlated subqueries for row-level comparison represents the full toolkit for multi-system reconciliation in SQL.

Indexing the correlation columns used in reconciliation queries is critical because the queries run once per row and a missing index can turn a seconds-long query into a minutes-long one on large trade tables.

INTERSECT and INTERSECT ALL

Daily Life
Interviews

Find rows that appear in both result sets

INTERSECT identifies rows that exist in both result sets, like finding the overlap in a Venn diagram.

INTERSECT returns only the rows that appear in both input datasets. Like UNION, it is a set operation, meaning it combines complete result sets from two queries rather than joining individual rows. INTERSECT focuses on overlap rather than combination. Think of it as a Venn diagram; INTERSECT gives you the middle section where both circles overlap.

INTERSECT automatically removes duplicates (like UNION does). If you need to keep duplicates, use INTERSECT ALL. Unlike a JOIN (which you'll learn about in the Joins lessons), INTERSECT operates on entire rows from each query, not just specific columns.

//

Basic Example

INTERSECT returns only rows that appear in both queries, filtering out anything unique to either side.

This query returns user IDs who appear in both users and user_sessions. INTERSECT removes duplicates, so each user appears once. If user_42 appears 5 times in users and 3 times in user_sessions, the result shows user_42 exactly once. Perfect when you only care about which users meet both criteria.
1SELECT
2 user_id
3FROM users
4
5INTERSECT
6
7SELECT
8 user_id
9FROM user_sessions

Practical Usage

Understanding when and how to apply INTERSECT effectively helps you choose the right tool for data comparison tasks.

//

When to Use INTERSECT

Use INTERSECT when you want to find exact row matches across all selected columns from two separate queries. It is ideal when you don't need to combine columns from different tables, just identify which rows appear in both result sets.

INTERSECT compares entire rows and returns only distinct matches. If you select multiple columns, INTERSECT ensures all column values match exactly between the two queries.

//

Real-World Use Cases

Finding users who exist in both your production database and your analytics warehouse (data validation). Identifying devices that appear in both active monitoring and incident reports (cross-system analysis).
Discovering which product IDs exist in both inventory and sales logs (stock reconciliation). Any time you need "show me what's in both places" without complex join logic.
INTERSECT
  • Returns rows in both result sets
  • Removes duplicates automatically
  • Order of queries doesn't matter
INTERSECT ALL
  • Preserves duplicate counts
  • Returns minimum occurrence count
  • Useful for frequency analysis

Both queries must return matching columns with compatible types; column names come from the first query. INTERSECT deduplicates results automatically. Unlike EXCEPT, the order of the two queries does not affect the result. INTERSECT ALL preserves duplicates by returning each matching row the minimum number of times it appears on either side.

Try building an INTERSECT query yourself to find users who belong to both groups.

> Complete the query to find users who appear in both orders and transactions.

___
  user_id
FROM orders

___

___
  user_id
FROM transactions
SELECT
UNION
INTERSECT
SELECT
EXCEPT

INTERSECT is particularly useful in data reconciliation tasks where you need to confirm that a record exists in two independent systems before treating it as valid.

Because INTERSECT deduplicates its results, it is safe to use even when both source tables contain duplicate rows, which is common in staging layers of a data warehouse.

If duplicate counts matter, INTERSECT ALL preserves them by returning the minimum number of times each row appears across both result sets.

UNNEST for arrays

Daily Life
Interviews

Expand array columns into individual rows

UNNEST expands array columns into separate rows, converting nested data into relational form. An array is a list of values stored in a single cell, like [1, 2, 3] or ["apple", "banana", "cherry"]. This is critical for analyzing complex data structures where multiple values are grouped together.

Why UNNEST exists: Traditional relational tables expect one value per cell. But modern applications often store arrays: [1, 2, 3] or ["tag1", "tag2", "tag3"]. To analyze each array element individually, you need to "unnest" it, which expands the array so each element becomes its own row. Without UNNEST, arrays are opaque strings you can't easily query.

Unnesting Arrays

The input table below shows the problem: multiple product IDs squeezed into one cell. You cannot filter, count, or JOIN on values locked inside an array.

//

Basic Example

Input table with arrays stored in a column:
order_idproduct_ids
ORD-101[SKU-A42, SKU-B17, SKU-C03]
ORD-102[SKU-A42, SKU-D91]
ORD-103[SKU-B17]

Notice the product_ids column contains arrays (lists of values). Each order can have multiple products, but they're all squeezed into one cell. This makes it impossible to count individual SKUs, filter by specific products, or JOIN with a products table to get pricing.

1SELECT
2 ord.order_id,
3 prod.product_id
4FROM orders AS ord
5CROSS JOIN UNNEST(ord.product_ids) AS prod
Result
order_idproduct_id
ORD-101SKU-A42
ORD-101SKU-B17
ORD-101SKU-C03
ORD-102SKU-A42
ORD-102SKU-D91
ORD-103SKU-B17

The mechanism: UNNEST takes an array column and creates a temporary table with one row per array element. The comma syntax (FROM table, UNNEST(array)) connects each row to every element in its array. If an order has 3 products, UNNEST generates 3 output rows. The result: one row per array element, making each value queryable individually.

//

Why We Need UNNEST

Real-world data is messy. APIs return JSON with arrays. Event logs store multiple tags per event. Analytics tools dump nested structures into your data warehouse. You need UNNEST to: analyze array elements individually, join on values inside arrays, count distinct elements across all arrays, or filter based on array contents.

Example use cases: Analyzing which product_ids appear in user shopping carts (each cart is an array). Counting how many events of each type occurred (event_types stored as arrays).

Flattening user permissions stored as arrays to check individual access rights. Finding which customers bought specific items (order_items as arrays).

Handling Edge Cases

Working with arrays in real-world data means encountering NULLs, empty arrays, and non-native array formats like comma-separated strings.

//

UNNEST with NULL/Empty

"What happens with empty arrays or NULL?" In most databases: Empty arrays produce zero rows (the original row disappears from the output). NULL arrays also produce zero rows. This can be surprising if you expect all original rows to appear in results.

Be aware that unnesting empty or NULL arrays will cause those rows to be excluded from your results. If you need to preserve rows with empty arrays, you'll learn techniques for this in the Joins lessons.

//

UNNEST with String Data

If your arrays are stored as strings like "101,102,103", you need to split them first using SPLIT(), then UNNEST the resulting array. For example: UNNEST(SPLIT(product_ids, ',')) converts comma-separated values into individual rows.

Which UNNEST approach would you use for a native array column versus a comma-separated string?

When the column is already an array type (ARRAY, JSON array), UNNEST works directly. Each element becomes its own row with no parsing needed.
1SELECT
2 order_id,
3 t.tag
4FROM orders
5CROSS JOIN UNNEST(tags) AS t(tag)
Do
  • Use UNNEST to flatten array columns for analysis
  • Test UNNEST behavior with empty arrays (both empty and NULL values)
  • Index frequently-used array values after unnesting
  • Be aware of how many rows UNNEST will generate
Don't
  • Avoid UNNEST on deeply nested structures without understanding output size
  • Don't forget UNNEST can explode row counts massively
  • Don't assume empty arrays produce rows because they produce zero rows

Here are the technical details behind how UNNEST operates internally:

Technical Details
  • Expands array elements into individual rows (row multiplication)
  • Use comma syntax: FROM table, UNNEST(array_column)
  • Empty or NULL arrays produce zero rows (row disappears)
  • Can be nested: UNNEST one array, then UNNEST arrays within it
  • Essential for working with array data types
  • Output row count = sum of all array lengths across all rows

> Complete this query to break apart an array column into individual rows for analysis.

SELECT
  order_id,
  t.product_id
FROM orders
CROSS JOIN ___(___) AS t
SPLIT
product_ids
UNNEST
ARRAY_AGG

After unnesting, you can use the expanded rows with any standard SQL clause such as WHERE, GROUP BY, or JOIN, making array columns fully queryable as if they were stored in separate rows.

If an array column contains NULLs or empty arrays, those rows produce zero output rows from UNNEST, which effectively filters them out and can cause a row count decrease that surprises analysts.

UNNEST is especially common in analytics databases like BigQuery and Redshift where repeated fields and array columns are used to denormalize data for storage efficiency.

SELECT Without FROM

Daily Life
Interviews

Generate values without querying a table

SELECT can compute values directly without referencing any table, useful for testing expressions or generating constants.

A SELECT statement does not always require a table. It can be used to compute expressions, constants, or system values directly. This is common in scripts, testing, or when returning a single value (called a "scalar" value) such as timestamps or calculated metrics.

When you're new to SQL, every query you see has FROM. But SELECT is fundamentally about computing and displaying values; the FROM clause just tells SQL where to get row data. Without FROM, you're computing one-time values or testing expressions before embedding them in larger queries.

//

Basic Example

Without a FROM clause, SQL simply evaluates the expression in SELECT and returns the result.

1SELECT
2 2 * 3 AS result
Result
result
6
This performs simple math. No table needed. SQL acts like a calculator. Result: 6.
1SELECT
2 CURRENT_TIMESTAMP AS query_time
Result
query_time
2025-11-10 14:32:18
This retrieves metadata from the database server itself; the current timestamp. Useful for logging when queries run or timestamping data pipeline operations.
1SELECT
2 1 + (
3 SELECT
4 COUNT(*)
5 FROM device_logs
6 ) AS total_plus_one
Result
total_plus_one
1543

Even without FROM in the outer query, you can still use subqueries that reference tables. This combines scalar computation (1 +) with table data (COUNT). The outer query has no table, but the inner query does.

Practical Applications

Tableless SELECT statements are useful for testing expressions, retrieving system values, and computing constants.

//

Why This Matters

We frequently need to test expressions before embedding them in production pipelines. SELECT without FROM lets you verify date math, test string functions, or check current system values instantly. It's also essential for generating config values, creating dynamic SQL based on calculations, or returning constants in UNION queries where one branch needs a literal value.

"Why doesn't this work everywhere?" Some older systems require a FROM clause even for simple expressions. Using SELECT without FROM is the modern approach and works in most SQL environments.

Which approach returns the current date in a standard format?
CURRENT_DATE returns the date directly with no time component. This is the simplest way to get today's date in most SQL engines.
1SELECT
2 CURRENT_DATE AS today
How It Works
  • Executes a single expression with no table reference
  • Useful for computed values, metadata queries, and testing
  • Returns exactly one row (unless you use UNION or set operations)
  • Often used in stored procedures, scripts, and ETL pipelines
This pattern has been available in SQL for decades, though it wasn't always so simple.

> Complete this query to retrieve the current date and time directly from the database engine.

SELECT
  ___ AS right_now
CURRENT_TIMESTAMP
GETDATE()
CURRENT_DATE
CURRENT_TIME

CURRENT_TIMESTAMP is the most commonly used tableless expression because it provides an audit trail of when a query ran without requiring any additional input.

Many ETL pipelines use SELECT CURRENT_TIMESTAMP to stamp records with a load time, making it easy to identify when each batch of data was processed.

Different databases use different functions (like NOW() in MySQL or GETDATE() in SQL Server), but CURRENT_TIMESTAMP is the ANSI SQL standard and the preferred choice for cross-database compatibility.

PUTTING IT ALL TOGETHER

> You are a backend engineer at a subscription SaaS company building a nightly billing anomaly report. The report must identify accounts where any invoice total exceeds that account's historical average, surface invoice IDs present in the billing system but missing from the ledger, confirm which IDs reconcile cleanly across both systems, expand the array of line-item codes per invoice into individual rows for fee analysis, and stamp each run with its generation timestamp.

A correlated subquery in the WHERE clause compares each invoice total to the average for that account by referencing the outer account_id in the inner AVG query, returning only anomalous rows without collapsing individual invoice detail.
EXCEPT surfaces invoice IDs present in the billing system that are absent from the ledger, flagging charges that did not post correctly.
INTERSECT confirms invoice IDs that appear in both the billing system and the ledger, providing a clean reconciled set for the finance team.
UNNEST expands the array of line-item fee codes stored per invoice into individual rows so each fee type can be counted and categorized independently.
SELECT without FROM generates the report timestamp inline at the top of the script without requiring a table reference.
KEY TAKEAWAYS
Correlated subqueries reference a column from the outer query and execute once per outer row, enabling per-row calculations against related tables
Use EXISTS with correlated subqueries for efficient existence checks that stop scanning after the first match
EXCEPT returns rows from the first query that do not appear in the second query
INTERSECT returns only rows that appear in both result sets (like a Venn diagram overlap)
INTERSECT ALL and EXCEPT ALL preserve duplicate counts instead of deduplicating
UNNEST expands array columns into individual rows for relational analysis
Set operations require matching column counts and compatible types between queries
SELECT without FROM computes expressions, constants, or system values without referencing a table

SQL operators nobody warned you about

Category
SQL
Difficulty
advanced
Duration
31 minutes
Challenges
0 hands-on challenges

Topics covered: Correlated subqueries, EXCEPT and EXCEPT ALL, INTERSECT and INTERSECT ALL, UNNEST for arrays, SELECT Without FROM

Lesson Sections

  1. Correlated subqueries (concepts: sqlSubqueryCorrelated)

    The defining characteristic is the column reference that crosses query boundaries. The inner query references a column from the outer table alias, creating a dependency: the database cannot execute the subquery independently because it needs context from each outer row. Think of it as a nested loop in code, where the inner loop body runs once per iteration of the outer loop. Understanding Correlated Subqueries The two examples below show how the inner query can reference the outer row: first to

  2. EXCEPT and EXCEPT ALL

    EXCEPT Usage Examples Basic Example When to Use EXCEPT Real-World Applications EXCEPT Scenarios Financial reconciliation is one of the most demanding use cases for set operations because regulatory requirements demand a complete and auditable account of every discrepancy between trading systems. Indexing the correlation columns used in reconciliation queries is critical because the queries run once per row and a missing index can turn a seconds-long query into a minutes-long one on large trade t

  3. INTERSECT and INTERSECT ALL

    Basic Example Practical Usage When to Use INTERSECT Real-World Use Cases Finding users who exist in both your production database and your analytics warehouse (data validation). Identifying devices that appear in both active monitoring and incident reports (cross-system analysis). Discovering which product IDs exist in both inventory and sales logs (stock reconciliation). Any time you need "show me what's in both places" without complex join logic.

  4. UNNEST for arrays (concepts: sqlUnnest)

    Unnesting Arrays Basic Example Input table with arrays stored in a column: Why We Need UNNEST Handling Edge Cases UNNEST with NULL/Empty UNNEST with String Data

  5. SELECT Without FROM

    Basic Example This performs simple math. No table needed. SQL acts like a calculator. Result: 6. This retrieves metadata from the database server itself; the current timestamp. Useful for logging when queries run or timestamping data pipeline operations. Practical Applications Why This Matters Which approach returns the current date in a standard format? This pattern has been available in SQL for decades, though it wasn't always so simple.