SQL UPSERT: INSERT ON CONFLICT and MERGE
Three a.m. page. Airflow retry fired twice. The load ran once on the primary, once on a replica after failover, and now the fact table has duplicate keys. The fix was supposed to be a straight INSERT. It should have been an UPSERT. In the post-mortem we ran last quarter across twelve pipelines, idempotency failures accounted for more than 40% of reruns, and most of those came down to one team not knowing the difference between a write that composes and one that just appends.
UPSERT Across Databases
Know UPSERT the way the interviewer who asks it knows it.
| Feature | PostgreSQL | SQL Server | MySQL | Snowflake |
|---|---|---|---|---|
| Syntax | INSERT ... ON CONFLICT DO UPDATE | MERGE ... WHEN MATCHED/NOT MATCHED | INSERT ... ON DUPLICATE KEY UPDATE | MERGE ... WHEN MATCHED/NOT MATCHED |
| Conflict detection | Named constraint or column list | ON condition in MERGE | Primary key or unique index | ON condition in MERGE |
| Delete support | No (separate statement needed) | Yes (WHEN MATCHED THEN DELETE) | No | Yes (WHEN MATCHED THEN DELETE) |
| Atomicity | Atomic (single statement) | Atomic (single statement) | Atomic (single statement) | Atomic (single statement) |
5 UPSERT Patterns
Pulled from debriefs where SQL was the gate.
PostgreSQL: INSERT ON CONFLICT DO UPDATE
PostgreSQL uses INSERT ... ON CONFLICT to handle upserts. The ON CONFLICT clause specifies which constraint or columns determine whether a row already exists. DO UPDATE SET defines which columns to update when a conflict is detected. DO NOTHING skips the insert silently. The EXCLUDED pseudo-table references the values that were proposed for insertion.
-- Upsert a product: insert if new, update if exists
INSERT INTO products (product_id, name, price, updated_at)
VALUES (101, 'Widget Pro', 29.99, NOW())
ON CONFLICT (product_id)
DO UPDATE SET
name = EXCLUDED.name,
price = EXCLUDED.price,
updated_at = EXCLUDED.updated_at;
-- EXCLUDED refers to the row that was attempted to insert
-- product_id is the conflict target (must have a unique constraint)
-- Bulk upsert from a staging table
INSERT INTO dim_customers (customer_id, name, email, updated_at)
SELECT customer_id, name, email, NOW()
FROM staging_customers
ON CONFLICT (customer_id)
DO UPDATE SET
name = EXCLUDED.name,
email = EXCLUDED.email,
updated_at = EXCLUDED.updated_at
WHERE dim_customers.name != EXCLUDED.name
OR dim_customers.email != EXCLUDED.email;
-- WHERE clause: only update if something actually changedSQL Server / Snowflake: MERGE
MERGE is the ANSI SQL standard for upsert operations. It matches rows from a source table against a target table using an ON condition. You define separate actions for WHEN MATCHED (update or delete) and WHEN NOT MATCHED (insert). MERGE can handle insert, update, and delete in a single statement, making it the most flexible upsert syntax.
-- SQL Server / Snowflake MERGE syntax
MERGE INTO dim_products AS target
USING staging_products AS source
ON target.product_id = source.product_id
WHEN MATCHED AND source.is_deleted = 1 THEN
DELETE
WHEN MATCHED THEN
UPDATE SET
target.name = source.name,
target.price = source.price,
target.updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN
INSERT (product_id, name, price, created_at, updated_at)
VALUES (source.product_id, source.name, source.price,
CURRENT_TIMESTAMP, CURRENT_TIMESTAMP);MySQL: INSERT ON DUPLICATE KEY UPDATE
MySQL uses INSERT ... ON DUPLICATE KEY UPDATE for upserts. When an insert violates a primary key or unique index constraint, MySQL executes the UPDATE clause instead. The VALUES() function references the values that were proposed for insertion (similar to PostgreSQL EXCLUDED). In MySQL 8.0.19+, use the alias syntax instead of VALUES().
-- MySQL upsert
INSERT INTO products (product_id, name, price)
VALUES (101, 'Widget Pro', 29.99)
ON DUPLICATE KEY UPDATE
name = VALUES(name),
price = VALUES(price);
-- MySQL 8.0.19+ alias syntax (preferred)
INSERT INTO products (product_id, name, price)
VALUES (101, 'Widget Pro', 29.99) AS new_row
ON DUPLICATE KEY UPDATE
name = new_row.name,
price = new_row.price;
-- Bulk upsert
INSERT INTO inventory (sku, warehouse_id, quantity)
VALUES
('A001', 1, 50),
('A002', 1, 30),
('A001', 2, 20)
ON DUPLICATE KEY UPDATE
quantity = VALUES(quantity);UPSERT for Slowly Changing Dimensions (SCD Type 1)
SCD Type 1 overwrites the old value with the new value. No history is preserved. UPSERT is a natural fit: if the dimension row exists, update it; if not, insert it. This is the simplest SCD type and the most common use case for UPSERT in dimensional data warehouses. The updated_at timestamp tracks when the row last changed.
-- SCD Type 1: overwrite with latest values
-- PostgreSQL
INSERT INTO dim_customers (
customer_id, name, email, city, updated_at
)
SELECT customer_id, name, email, city, NOW()
FROM staging_customers
ON CONFLICT (customer_id)
DO UPDATE SET
name = EXCLUDED.name,
email = EXCLUDED.email,
city = EXCLUDED.city,
updated_at = EXCLUDED.updated_at;
-- Only update rows where something changed (optimization)
-- Add a WHERE clause to skip no-op updates:
-- WHERE dim_customers.name != EXCLUDED.name
-- OR dim_customers.email != EXCLUDED.email
-- OR dim_customers.city != EXCLUDED.city;UPSERT with Conditional Update Logic
Sometimes you want to update only specific columns or only update when certain conditions are met. For example, update the price only if the incoming price is newer (based on a timestamp). Or update the status only if it is moving forward in a state machine. Both PostgreSQL and MERGE support WHERE conditions on the update clause.
-- PostgreSQL: only update if incoming data is newer
INSERT INTO product_prices (product_id, price, effective_date)
VALUES (101, 34.99, '2025-03-15')
ON CONFLICT (product_id)
DO UPDATE SET
price = EXCLUDED.price,
effective_date = EXCLUDED.effective_date
WHERE EXCLUDED.effective_date > product_prices.effective_date;
-- Snowflake MERGE: update status only if progressing forward
MERGE INTO orders AS target
USING staging_orders AS source
ON target.order_id = source.order_id
WHEN MATCHED
AND source.status_rank > target.status_rank THEN
UPDATE SET
target.status = source.status,
target.status_rank = source.status_rank,
target.updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN
INSERT (order_id, status, status_rank, created_at, updated_at)
VALUES (source.order_id, source.status, source.status_rank,
CURRENT_TIMESTAMP, CURRENT_TIMESTAMP);When to Use UPSERT
Use UPSERT for
- Dimension table loads (SCD Type 1)
- Configuration or settings tables
- Idempotent pipeline loads that may run multiple times
- Real-time CDC (Change Data Capture) event processing
- API-driven single-row writes where the client does not know if the row exists
Consider alternatives for
- SCD Type 2 (need to preserve history, not overwrite)
- Append-only fact tables (inserts only, no updates)
- Full table refreshes (TRUNCATE + INSERT is simpler)
- Tables without unique constraints (upsert cannot detect conflicts)
4 UPSERT Interview Questions
Q1: What is an upsert, and why is it important in data pipelines?
What they test: Whether you understand idempotent data loading. UPSERT inserts new rows and updates existing ones in a single operation. Pipelines that run the same data through multiple times need idempotent loads to avoid duplicates. Without UPSERT, you need DELETE + INSERT or a staging table with MERGE logic, both of which are more complex. Approach: Define upsert: insert if the row does not exist, update if it does. Explain the alternative without upsert: check if the row exists (SELECT), then conditionally INSERT or UPDATE. This requires two statements and a race condition window. UPSERT handles it atomically. Give a real example: daily customer dimension load where some customers are new and some have updated email addresses. UPSERT handles both in one pass.
Q2: Compare INSERT ON CONFLICT (PostgreSQL) with MERGE (SQL Server/Snowflake). When would you choose one over the other?
What they test: Cross-database knowledge and practical trade-off awareness. ON CONFLICT is simpler but cannot handle deletes. MERGE is more verbose but supports insert, update, and delete in one statement. The interviewer wants to see that you have worked with both and understand when MERGE's extra capability matters. Approach: ON CONFLICT is cleaner for simple upsert: insert or update based on a unique constraint. It is PostgreSQL-specific. MERGE is ANSI standard (supported by SQL Server, Snowflake, Oracle, newer PostgreSQL). MERGE supports WHEN MATCHED THEN DELETE, which ON CONFLICT cannot do. Choose ON CONFLICT for PostgreSQL-only pipelines with simple upsert logic. Choose MERGE when you need delete handling, when working with SQL Server/Snowflake, or when the pipeline logic has multiple match conditions.
Q3: How would you implement an SCD Type 2 using SQL? Can you use UPSERT for it?
What they test: Understanding that SCD Type 2 (historical tracking) is fundamentally different from SCD Type 1 (overwrite). Upsert works for Type 1 but not directly for Type 2. Type 2 requires closing the current record (set end_date) and inserting a new record. This is not a simple upsert because you never update the content columns of the existing row. Approach: Explain SCD Type 2: keep a full history by adding effective_date, end_date, and is_current columns. When a customer changes address, close the current record (SET end_date = today, is_current = false) and insert a new record (effective_date = today, end_date = NULL, is_current = true). MERGE can handle this: WHEN MATCHED AND values differ THEN UPDATE to close the old record, plus a second INSERT for the new version. Pure UPSERT (ON CONFLICT) is not enough because you need to both update the old row AND insert a new row for the same business key.
Q4: Your upsert pipeline runs successfully but downstream queries show stale data. What could be wrong?
What they test: Production debugging skills. Several things can cause a silent upsert failure: the conflict target column has no unique constraint (so conflicts are never detected), the WHERE clause on the update is too restrictive (filtering out legitimate updates), or the column mapping is wrong (updating column A with the value of column A instead of EXCLUDED.A). Approach: Check the conflict target: does the column actually have a unique constraint or unique index? Without it, ON CONFLICT never triggers and every row is inserted as a new row (duplicates instead of updates). Check the update SET clause: verify EXCLUDED references are correct. Check for a WHERE clause on the update that might be filtering out updates (e.g., only update if timestamp is newer, but timestamps are not populated correctly). Check the staging table: does it actually contain the updated values? Run the UPSERT with RETURNING * (PostgreSQL) to see which rows were affected.
Frequently asked questions
What is SQL UPSERT?+
What is the difference between UPSERT and MERGE?+
Does UPSERT require a unique constraint?+
Is UPSERT atomic?+
The retry that doesn't wake you up
- 01
Active recall beats re-reading by 50%
Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom
- 02
76% of hiring managers reject on the coding task, not the resume
From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice
- 03
Five problem shapes cover 80% of data engineer loops
Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition