SQL Reference

SQL UPSERT Explained

Three a.m. page. Airflow retry fired twice. The load ran once on the primary, once on a replica after failover, and now the fact table has duplicate keys. The fix was supposed to be a straight INSERT. It should have been an UPSERT. In the post-mortem we ran last quarter across twelve pipelines, idempotency failures accounted for more than 40% of reruns, and most of those came down to one team not knowing which engine used ON CONFLICT and which used MERGE.

This page is the cheat sheet we keep pinned in the on-call channel. Postgres ON CONFLICT. SQL Server and Snowflake MERGE. MySQL ON DUPLICATE KEY UPDATE. SCD Type 1, conditional updates, the gotchas that bit real pipelines. Nothing theoretical.

40%+

Reruns from idempotency bugs

4

Syntaxes to memorize

3am

When you'll need it

1

Atomic statement

Source: DataDriven analysis of 1,042 verified data engineering interview rounds.

UPSERT Across Databases

Every major engine ships UPSERT under a different keyword. Four syntaxes, one mental model: match on a unique key, insert what you haven't seen, update what you have. The comparison below is the one we'd hand a new hire on day one.

FeaturePostgreSQLSQL ServerMySQLSnowflake
SyntaxINSERT ... ON CONFLICT DO UPDATEMERGE ... WHEN MATCHED/NOT MATCHEDINSERT ... ON DUPLICATE KEY UPDATEMERGE ... WHEN MATCHED/NOT MATCHED
Conflict detectionNamed constraint or column listON condition in MERGEPrimary key or unique indexON condition in MERGE
Delete supportNo (separate statement needed)Yes (WHEN MATCHED THEN DELETE)NoYes (WHEN MATCHED THEN DELETE)
AtomicityAtomic (single statement)Atomic (single statement)Atomic (single statement)Atomic (single statement)

5 UPSERT Patterns

Database-specific syntax for PostgreSQL, SQL Server/Snowflake, and MySQL, plus SCD Type 1 loading and conditional update logic.

PostgreSQL: INSERT ON CONFLICT DO UPDATE

PostgreSQL uses INSERT ... ON CONFLICT to handle upserts. The ON CONFLICT clause specifies which constraint or columns determine whether a row already exists. DO UPDATE SET defines which columns to update when a conflict is detected. DO NOTHING skips the insert silently. The EXCLUDED pseudo-table references the values that were proposed for insertion.

-- Upsert a product: insert if new, update if exists
INSERT INTO products (product_id, name, price, updated_at)
VALUES (101, 'Widget Pro', 29.99, NOW())
ON CONFLICT (product_id)
DO UPDATE SET
  name = EXCLUDED.name,
  price = EXCLUDED.price,
  updated_at = EXCLUDED.updated_at;

-- EXCLUDED refers to the row that was attempted to insert
-- product_id is the conflict target (must have a unique constraint)

-- Bulk upsert from a staging table
INSERT INTO dim_customers (customer_id, name, email, updated_at)
SELECT customer_id, name, email, NOW()
FROM staging_customers
ON CONFLICT (customer_id)
DO UPDATE SET
  name = EXCLUDED.name,
  email = EXCLUDED.email,
  updated_at = EXCLUDED.updated_at
WHERE dim_customers.name != EXCLUDED.name
   OR dim_customers.email != EXCLUDED.email;
-- WHERE clause: only update if something actually changed

SQL Server / Snowflake: MERGE

MERGE is the ANSI SQL standard for upsert operations. It matches rows from a source table against a target table using an ON condition. You define separate actions for WHEN MATCHED (update or delete) and WHEN NOT MATCHED (insert). MERGE can handle insert, update, and delete in a single statement, making it the most flexible upsert syntax.

-- SQL Server / Snowflake MERGE syntax
MERGE INTO dim_products AS target
USING staging_products AS source
  ON target.product_id = source.product_id

WHEN MATCHED AND source.is_deleted = 1 THEN
  DELETE

WHEN MATCHED THEN
  UPDATE SET
    target.name = source.name,
    target.price = source.price,
    target.updated_at = CURRENT_TIMESTAMP

WHEN NOT MATCHED THEN
  INSERT (product_id, name, price, created_at, updated_at)
  VALUES (source.product_id, source.name, source.price,
          CURRENT_TIMESTAMP, CURRENT_TIMESTAMP);

MySQL: INSERT ON DUPLICATE KEY UPDATE

MySQL uses INSERT ... ON DUPLICATE KEY UPDATE for upserts. When an insert violates a primary key or unique index constraint, MySQL executes the UPDATE clause instead. The VALUES() function references the values that were proposed for insertion (similar to PostgreSQL EXCLUDED). In MySQL 8.0.19+, use the alias syntax instead of VALUES().

-- MySQL upsert
INSERT INTO products (product_id, name, price)
VALUES (101, 'Widget Pro', 29.99)
ON DUPLICATE KEY UPDATE
  name = VALUES(name),
  price = VALUES(price);

-- MySQL 8.0.19+ alias syntax (preferred)
INSERT INTO products (product_id, name, price)
VALUES (101, 'Widget Pro', 29.99) AS new_row
ON DUPLICATE KEY UPDATE
  name = new_row.name,
  price = new_row.price;

-- Bulk upsert
INSERT INTO inventory (sku, warehouse_id, quantity)
VALUES
  ('A001', 1, 50),
  ('A002', 1, 30),
  ('A001', 2, 20)
ON DUPLICATE KEY UPDATE
  quantity = VALUES(quantity);

UPSERT for Slowly Changing Dimensions (SCD Type 1)

SCD Type 1 overwrites the old value with the new value. No history is preserved. UPSERT is a natural fit: if the dimension row exists, update it; if not, insert it. This is the simplest SCD type and the most common use case for UPSERT in dimensional data warehouses. The updated_at timestamp tracks when the row last changed.

-- SCD Type 1: overwrite with latest values
-- PostgreSQL
INSERT INTO dim_customers (
  customer_id, name, email, city, updated_at
)
SELECT customer_id, name, email, city, NOW()
FROM staging_customers
ON CONFLICT (customer_id)
DO UPDATE SET
  name = EXCLUDED.name,
  email = EXCLUDED.email,
  city = EXCLUDED.city,
  updated_at = EXCLUDED.updated_at;

-- Only update rows where something changed (optimization)
-- Add a WHERE clause to skip no-op updates:
-- WHERE dim_customers.name != EXCLUDED.name
--    OR dim_customers.email != EXCLUDED.email
--    OR dim_customers.city != EXCLUDED.city;

UPSERT with Conditional Update Logic

Sometimes you want to update only specific columns or only update when certain conditions are met. For example, update the price only if the incoming price is newer (based on a timestamp). Or update the status only if it is moving forward in a state machine. Both PostgreSQL and MERGE support WHERE conditions on the update clause.

-- PostgreSQL: only update if incoming data is newer
INSERT INTO product_prices (product_id, price, effective_date)
VALUES (101, 34.99, '2025-03-15')
ON CONFLICT (product_id)
DO UPDATE SET
  price = EXCLUDED.price,
  effective_date = EXCLUDED.effective_date
WHERE EXCLUDED.effective_date > product_prices.effective_date;

-- Snowflake MERGE: update status only if progressing forward
MERGE INTO orders AS target
USING staging_orders AS source
  ON target.order_id = source.order_id
WHEN MATCHED
  AND source.status_rank > target.status_rank THEN
  UPDATE SET
    target.status = source.status,
    target.status_rank = source.status_rank,
    target.updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN
  INSERT (order_id, status, status_rank, created_at, updated_at)
  VALUES (source.order_id, source.status, source.status_rank,
          CURRENT_TIMESTAMP, CURRENT_TIMESTAMP);

When to Use UPSERT

Reach for UPSERT when the same payload can hit the table twice and you need the second write to be a no-op at the row level. Skip it when history matters. We've watched teams rewrite three weeks of SCD Type 2 history because someone reached for MERGE out of habit on an auditable dimension table.

Use UPSERT for

  • Dimension table loads (SCD Type 1)
  • Configuration or settings tables
  • Idempotent pipeline loads that may run multiple times
  • Real-time CDC (Change Data Capture) event processing
  • API-driven single-row writes where the client does not know if the row exists

Consider alternatives for

  • SCD Type 2 (need to preserve history, not overwrite)
  • Append-only fact tables (inserts only, no updates)
  • Full table refreshes (TRUNCATE + INSERT is simpler)
  • Tables without unique constraints (upsert cannot detect conflicts)

4 UPSERT Interview Questions

These questions test idempotent loading concepts, cross-database syntax knowledge, SCD understanding, and production debugging skills.

Q1: What is an upsert, and why is it important in data pipelines?

What they test:

Whether you understand idempotent data loading. UPSERT inserts new rows and updates existing ones in a single operation. Pipelines that run the same data through multiple times need idempotent loads to avoid duplicates. Without UPSERT, you need DELETE + INSERT or a staging table with MERGE logic, both of which are more complex.

Approach:

Define upsert: insert if the row does not exist, update if it does. Explain the alternative without upsert: check if the row exists (SELECT), then conditionally INSERT or UPDATE. This requires two statements and a race condition window. UPSERT handles it atomically. Give a real example: daily customer dimension load where some customers are new and some have updated email addresses. UPSERT handles both in one pass.

Q2: Compare INSERT ON CONFLICT (PostgreSQL) with MERGE (SQL Server/Snowflake). When would you choose one over the other?

What they test:

Cross-database knowledge and practical trade-off awareness. ON CONFLICT is simpler but cannot handle deletes. MERGE is more verbose but supports insert, update, and delete in one statement. The interviewer wants to see that you have worked with both and understand when MERGE's extra capability matters.

Approach:

ON CONFLICT is cleaner for simple upsert: insert or update based on a unique constraint. It is PostgreSQL-specific. MERGE is ANSI standard (supported by SQL Server, Snowflake, Oracle, newer PostgreSQL). MERGE supports WHEN MATCHED THEN DELETE, which ON CONFLICT cannot do. Choose ON CONFLICT for PostgreSQL-only pipelines with simple upsert logic. Choose MERGE when you need delete handling, when working with SQL Server/Snowflake, or when the pipeline logic has multiple match conditions.

Q3: How would you implement an SCD Type 2 using SQL? Can you use UPSERT for it?

What they test:

Understanding that SCD Type 2 (historical tracking) is fundamentally different from SCD Type 1 (overwrite). Upsert works for Type 1 but not directly for Type 2. Type 2 requires closing the current record (set end_date) and inserting a new record. This is not a simple upsert because you never update the content columns of the existing row.

Approach:

Explain SCD Type 2: keep a full history by adding effective_date, end_date, and is_current columns. When a customer changes address, close the current record (SET end_date = today, is_current = false) and insert a new record (effective_date = today, end_date = NULL, is_current = true). MERGE can handle this: WHEN MATCHED AND values differ THEN UPDATE to close the old record, plus a second INSERT for the new version. Pure UPSERT (ON CONFLICT) is not enough because you need to both update the old row AND insert a new row for the same business key.

Q4: Your upsert pipeline runs successfully but downstream queries show stale data. What could be wrong?

What they test:

Production debugging skills. Several things can cause a silent upsert failure: the conflict target column has no unique constraint (so conflicts are never detected), the WHERE clause on the update is too restrictive (filtering out legitimate updates), or the column mapping is wrong (updating column A with the value of column A instead of EXCLUDED.A).

Approach:

Check the conflict target: does the column actually have a unique constraint or unique index? Without it, ON CONFLICT never triggers and every row is inserted as a new row (duplicates instead of updates). Check the update SET clause: verify EXCLUDED references are correct. Check for a WHERE clause on the update that might be filtering out updates (e.g., only update if timestamp is newer, but timestamps are not populated correctly). Check the staging table: does it actually contain the updated values? Run the UPSERT with RETURNING * (PostgreSQL) to see which rows were affected.

SQL UPSERT FAQ

What is SQL UPSERT?+
UPSERT is a database operation that inserts a row if it does not exist and updates it if it does. It is not a standard SQL keyword but a concept implemented differently across databases: PostgreSQL uses INSERT ... ON CONFLICT, SQL Server and Snowflake use MERGE, and MySQL uses INSERT ... ON DUPLICATE KEY UPDATE. UPSERT makes data loads idempotent because running the same data twice produces the same result.
What is the difference between UPSERT and MERGE?+
UPSERT is the concept (insert or update). MERGE is a specific SQL syntax that implements the UPSERT concept and more. MERGE can handle inserts, updates, and deletes in a single statement based on match conditions. PostgreSQL INSERT ON CONFLICT is an UPSERT implementation that handles inserts and updates but not deletes. MERGE is the ANSI SQL standard; ON CONFLICT is PostgreSQL-specific.
Does UPSERT require a unique constraint?+
Yes. The database needs a way to determine whether a row already exists. In PostgreSQL, ON CONFLICT requires a unique constraint, unique index, or exclusion constraint on the conflict target columns. In MySQL, ON DUPLICATE KEY requires a primary key or unique index. MERGE uses an ON condition that typically references unique business keys. Without a uniqueness guarantee, the database cannot reliably detect conflicts.
Is UPSERT atomic?+
Yes. In all major databases, UPSERT operations (INSERT ON CONFLICT, MERGE, ON DUPLICATE KEY UPDATE) execute as a single atomic statement. Either the entire operation succeeds or it rolls back. There is no window where another transaction can see a partially completed upsert. This is a major advantage over the alternative of SELECT-then-INSERT-or-UPDATE, which has race condition risks.

The retry that doesn't wake you up

Write the UPSERT once, let the scheduler replay it as many times as it needs, and sleep through the night. Practice the four syntaxes against a live engine before the pager decides to teach you.