Three a.m. page. Airflow retry fired twice. The load ran once on the primary, once on a replica after failover, and now the fact table has duplicate keys. The fix was supposed to be a straight INSERT. It should have been an UPSERT. In the post-mortem we ran last quarter across twelve pipelines, idempotency failures accounted for more than 40% of reruns, and most of those came down to one team not knowing which engine used ON CONFLICT and which used MERGE.
This page is the cheat sheet we keep pinned in the on-call channel. Postgres ON CONFLICT. SQL Server and Snowflake MERGE. MySQL ON DUPLICATE KEY UPDATE. SCD Type 1, conditional updates, the gotchas that bit real pipelines. Nothing theoretical.
Reruns from idempotency bugs
Syntaxes to memorize
When you'll need it
Atomic statement
Source: DataDriven analysis of 1,042 verified data engineering interview rounds.
Every major engine ships UPSERT under a different keyword. Four syntaxes, one mental model: match on a unique key, insert what you haven't seen, update what you have. The comparison below is the one we'd hand a new hire on day one.
| Feature | PostgreSQL | SQL Server | MySQL | Snowflake |
|---|---|---|---|---|
| Syntax | INSERT ... ON CONFLICT DO UPDATE | MERGE ... WHEN MATCHED/NOT MATCHED | INSERT ... ON DUPLICATE KEY UPDATE | MERGE ... WHEN MATCHED/NOT MATCHED |
| Conflict detection | Named constraint or column list | ON condition in MERGE | Primary key or unique index | ON condition in MERGE |
| Delete support | No (separate statement needed) | Yes (WHEN MATCHED THEN DELETE) | No | Yes (WHEN MATCHED THEN DELETE) |
| Atomicity | Atomic (single statement) | Atomic (single statement) | Atomic (single statement) | Atomic (single statement) |
Database-specific syntax for PostgreSQL, SQL Server/Snowflake, and MySQL, plus SCD Type 1 loading and conditional update logic.
PostgreSQL uses INSERT ... ON CONFLICT to handle upserts. The ON CONFLICT clause specifies which constraint or columns determine whether a row already exists. DO UPDATE SET defines which columns to update when a conflict is detected. DO NOTHING skips the insert silently. The EXCLUDED pseudo-table references the values that were proposed for insertion.
-- Upsert a product: insert if new, update if exists
INSERT INTO products (product_id, name, price, updated_at)
VALUES (101, 'Widget Pro', 29.99, NOW())
ON CONFLICT (product_id)
DO UPDATE SET
name = EXCLUDED.name,
price = EXCLUDED.price,
updated_at = EXCLUDED.updated_at;
-- EXCLUDED refers to the row that was attempted to insert
-- product_id is the conflict target (must have a unique constraint)
-- Bulk upsert from a staging table
INSERT INTO dim_customers (customer_id, name, email, updated_at)
SELECT customer_id, name, email, NOW()
FROM staging_customers
ON CONFLICT (customer_id)
DO UPDATE SET
name = EXCLUDED.name,
email = EXCLUDED.email,
updated_at = EXCLUDED.updated_at
WHERE dim_customers.name != EXCLUDED.name
OR dim_customers.email != EXCLUDED.email;
-- WHERE clause: only update if something actually changedMERGE is the ANSI SQL standard for upsert operations. It matches rows from a source table against a target table using an ON condition. You define separate actions for WHEN MATCHED (update or delete) and WHEN NOT MATCHED (insert). MERGE can handle insert, update, and delete in a single statement, making it the most flexible upsert syntax.
-- SQL Server / Snowflake MERGE syntax
MERGE INTO dim_products AS target
USING staging_products AS source
ON target.product_id = source.product_id
WHEN MATCHED AND source.is_deleted = 1 THEN
DELETE
WHEN MATCHED THEN
UPDATE SET
target.name = source.name,
target.price = source.price,
target.updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN
INSERT (product_id, name, price, created_at, updated_at)
VALUES (source.product_id, source.name, source.price,
CURRENT_TIMESTAMP, CURRENT_TIMESTAMP);MySQL uses INSERT ... ON DUPLICATE KEY UPDATE for upserts. When an insert violates a primary key or unique index constraint, MySQL executes the UPDATE clause instead. The VALUES() function references the values that were proposed for insertion (similar to PostgreSQL EXCLUDED). In MySQL 8.0.19+, use the alias syntax instead of VALUES().
-- MySQL upsert
INSERT INTO products (product_id, name, price)
VALUES (101, 'Widget Pro', 29.99)
ON DUPLICATE KEY UPDATE
name = VALUES(name),
price = VALUES(price);
-- MySQL 8.0.19+ alias syntax (preferred)
INSERT INTO products (product_id, name, price)
VALUES (101, 'Widget Pro', 29.99) AS new_row
ON DUPLICATE KEY UPDATE
name = new_row.name,
price = new_row.price;
-- Bulk upsert
INSERT INTO inventory (sku, warehouse_id, quantity)
VALUES
('A001', 1, 50),
('A002', 1, 30),
('A001', 2, 20)
ON DUPLICATE KEY UPDATE
quantity = VALUES(quantity);SCD Type 1 overwrites the old value with the new value. No history is preserved. UPSERT is a natural fit: if the dimension row exists, update it; if not, insert it. This is the simplest SCD type and the most common use case for UPSERT in dimensional data warehouses. The updated_at timestamp tracks when the row last changed.
-- SCD Type 1: overwrite with latest values
-- PostgreSQL
INSERT INTO dim_customers (
customer_id, name, email, city, updated_at
)
SELECT customer_id, name, email, city, NOW()
FROM staging_customers
ON CONFLICT (customer_id)
DO UPDATE SET
name = EXCLUDED.name,
email = EXCLUDED.email,
city = EXCLUDED.city,
updated_at = EXCLUDED.updated_at;
-- Only update rows where something changed (optimization)
-- Add a WHERE clause to skip no-op updates:
-- WHERE dim_customers.name != EXCLUDED.name
-- OR dim_customers.email != EXCLUDED.email
-- OR dim_customers.city != EXCLUDED.city;Sometimes you want to update only specific columns or only update when certain conditions are met. For example, update the price only if the incoming price is newer (based on a timestamp). Or update the status only if it is moving forward in a state machine. Both PostgreSQL and MERGE support WHERE conditions on the update clause.
-- PostgreSQL: only update if incoming data is newer
INSERT INTO product_prices (product_id, price, effective_date)
VALUES (101, 34.99, '2025-03-15')
ON CONFLICT (product_id)
DO UPDATE SET
price = EXCLUDED.price,
effective_date = EXCLUDED.effective_date
WHERE EXCLUDED.effective_date > product_prices.effective_date;
-- Snowflake MERGE: update status only if progressing forward
MERGE INTO orders AS target
USING staging_orders AS source
ON target.order_id = source.order_id
WHEN MATCHED
AND source.status_rank > target.status_rank THEN
UPDATE SET
target.status = source.status,
target.status_rank = source.status_rank,
target.updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN
INSERT (order_id, status, status_rank, created_at, updated_at)
VALUES (source.order_id, source.status, source.status_rank,
CURRENT_TIMESTAMP, CURRENT_TIMESTAMP);Reach for UPSERT when the same payload can hit the table twice and you need the second write to be a no-op at the row level. Skip it when history matters. We've watched teams rewrite three weeks of SCD Type 2 history because someone reached for MERGE out of habit on an auditable dimension table.
These questions test idempotent loading concepts, cross-database syntax knowledge, SCD understanding, and production debugging skills.
What they test:
Whether you understand idempotent data loading. UPSERT inserts new rows and updates existing ones in a single operation. Pipelines that run the same data through multiple times need idempotent loads to avoid duplicates. Without UPSERT, you need DELETE + INSERT or a staging table with MERGE logic, both of which are more complex.
Approach:
Define upsert: insert if the row does not exist, update if it does. Explain the alternative without upsert: check if the row exists (SELECT), then conditionally INSERT or UPDATE. This requires two statements and a race condition window. UPSERT handles it atomically. Give a real example: daily customer dimension load where some customers are new and some have updated email addresses. UPSERT handles both in one pass.
What they test:
Cross-database knowledge and practical trade-off awareness. ON CONFLICT is simpler but cannot handle deletes. MERGE is more verbose but supports insert, update, and delete in one statement. The interviewer wants to see that you have worked with both and understand when MERGE's extra capability matters.
Approach:
ON CONFLICT is cleaner for simple upsert: insert or update based on a unique constraint. It is PostgreSQL-specific. MERGE is ANSI standard (supported by SQL Server, Snowflake, Oracle, newer PostgreSQL). MERGE supports WHEN MATCHED THEN DELETE, which ON CONFLICT cannot do. Choose ON CONFLICT for PostgreSQL-only pipelines with simple upsert logic. Choose MERGE when you need delete handling, when working with SQL Server/Snowflake, or when the pipeline logic has multiple match conditions.
What they test:
Understanding that SCD Type 2 (historical tracking) is fundamentally different from SCD Type 1 (overwrite). Upsert works for Type 1 but not directly for Type 2. Type 2 requires closing the current record (set end_date) and inserting a new record. This is not a simple upsert because you never update the content columns of the existing row.
Approach:
Explain SCD Type 2: keep a full history by adding effective_date, end_date, and is_current columns. When a customer changes address, close the current record (SET end_date = today, is_current = false) and insert a new record (effective_date = today, end_date = NULL, is_current = true). MERGE can handle this: WHEN MATCHED AND values differ THEN UPDATE to close the old record, plus a second INSERT for the new version. Pure UPSERT (ON CONFLICT) is not enough because you need to both update the old row AND insert a new row for the same business key.
What they test:
Production debugging skills. Several things can cause a silent upsert failure: the conflict target column has no unique constraint (so conflicts are never detected), the WHERE clause on the update is too restrictive (filtering out legitimate updates), or the column mapping is wrong (updating column A with the value of column A instead of EXCLUDED.A).
Approach:
Check the conflict target: does the column actually have a unique constraint or unique index? Without it, ON CONFLICT never triggers and every row is inserted as a new row (duplicates instead of updates). Check the update SET clause: verify EXCLUDED references are correct. Check for a WHERE clause on the update that might be filtering out updates (e.g., only update if timestamp is newer, but timestamps are not populated correctly). Check the staging table: does it actually contain the updated values? Run the UPSERT with RETURNING * (PostgreSQL) to see which rows were affected.
Write the UPSERT once, let the scheduler replay it as many times as it needs, and sleep through the night. Practice the four syntaxes against a live engine before the pager decides to teach you.