SQL UPSERT: INSERT ON CONFLICT and MERGE

Three a.m. page. Airflow retry fired twice. The load ran once on the primary, once on a replica after failover, and now the fact table has duplicate keys. The fix was supposed to be a straight INSERT. It should have been an UPSERT. In the post-mortem we ran last quarter across twelve pipelines, idempotency failures accounted for more than 40% of reruns, and most of those came down to one team not knowing the difference between a write that composes and one that just appends.

40%+

Reruns from idempotency bugs

Syntaxes to memorize

3am

When you'll need it

Atomic statement

Source: DataDriven analysis of 1,042 verified data engineering interview rounds.

Prepare for the interview

01 / Open invite

02min.

Know UPSERT the way the interviewer who asks it knows it.

a UPSERT query, the same shape a screen would give you.

The diff against expected. Where ties broke. What you missed.

sandbox

1SELECT user_id,

2 COUNT(*) AS sessions

3FROM events

4WHERE ts >= NOW() - INTERVAL '7 day'

Execute your solution0.4s avg.

SpotifyInterview question

Solve a UPSERT problem

UPSERT Across Databases

Every major engine ships UPSERT under a different keyword. Four syntaxes, one mental model: match on a unique key, insert what you haven't seen, update what you have.

Feature	PostgreSQL	SQL Server	MySQL	Snowflake
Syntax	INSERT ... ON CONFLICT DO UPDATE	MERGE ... WHEN MATCHED/NOT MATCHED	INSERT ... ON DUPLICATE KEY UPDATE	MERGE ... WHEN MATCHED/NOT MATCHED
Conflict detection	Named constraint or column list	ON condition in MERGE	Primary key or unique index	ON condition in MERGE
Delete support	No (separate statement needed)	Yes (WHEN MATCHED THEN DELETE)	No	Yes (WHEN MATCHED THEN DELETE)
Atomicity	Atomic (single statement)	Atomic (single statement)	Atomic (single statement)	Atomic (single statement)

5 UPSERT Patterns

Database-specific syntax for PostgreSQL, SQL Server/Snowflake, and MySQL, plus SCD Type 1 loading and conditional update logic.

PostgreSQL: INSERT ON CONFLICT DO UPDATE

PostgreSQL uses INSERT ... ON CONFLICT to handle upserts. The ON CONFLICT clause specifies which constraint or columns determine whether a row already exists. DO UPDATE SET defines which columns to update when a conflict is detected. DO NOTHING skips the insert silently. The EXCLUDED pseudo-table references the values that were proposed for insertion.

-- Upsert a product: insert if new, update if exists
INSERT INTO products (product_id, name, price, updated_at)
VALUES (101, 'Widget Pro', 29.99, NOW())
ON CONFLICT (product_id)
DO UPDATE SET
  name = EXCLUDED.name,
  price = EXCLUDED.price,
  updated_at = EXCLUDED.updated_at;

-- EXCLUDED refers to the row that was attempted to insert
-- product_id is the conflict target (must have a unique constraint)

-- Bulk upsert from a staging table
INSERT INTO dim_customers (customer_id, name, email, updated_at)
SELECT customer_id, name, email, NOW()
FROM staging_customers
ON CONFLICT (customer_id)
DO UPDATE SET
  name = EXCLUDED.name,
  email = EXCLUDED.email,
  updated_at = EXCLUDED.updated_at
WHERE dim_customers.name != EXCLUDED.name
   OR dim_customers.email != EXCLUDED.email;
-- WHERE clause: only update if something actually changed

SQL Server / Snowflake: MERGE

MERGE is the ANSI SQL standard for upsert operations. It matches rows from a source table against a target table using an ON condition. You define separate actions for WHEN MATCHED (update or delete) and WHEN NOT MATCHED (insert). MERGE can handle insert, update, and delete in a single statement, making it the most flexible upsert syntax.

-- SQL Server / Snowflake MERGE syntax
MERGE INTO dim_products AS target
USING staging_products AS source
  ON target.product_id = source.product_id

WHEN MATCHED AND source.is_deleted = 1 THEN
  DELETE

WHEN MATCHED THEN
  UPDATE SET
    target.name = source.name,
    target.price = source.price,
    target.updated_at = CURRENT_TIMESTAMP

WHEN NOT MATCHED THEN
  INSERT (product_id, name, price, created_at, updated_at)
  VALUES (source.product_id, source.name, source.price,
          CURRENT_TIMESTAMP, CURRENT_TIMESTAMP);

MySQL: INSERT ON DUPLICATE KEY UPDATE

MySQL uses INSERT ... ON DUPLICATE KEY UPDATE for upserts. When an insert violates a primary key or unique index constraint, MySQL executes the UPDATE clause instead. The VALUES() function references the values that were proposed for insertion (similar to PostgreSQL EXCLUDED). In MySQL 8.0.19+, use the alias syntax instead of VALUES().

-- MySQL upsert
INSERT INTO products (product_id, name, price)
VALUES (101, 'Widget Pro', 29.99)
ON DUPLICATE KEY UPDATE
  name = VALUES(name),
  price = VALUES(price);

-- MySQL 8.0.19+ alias syntax (preferred)
INSERT INTO products (product_id, name, price)
VALUES (101, 'Widget Pro', 29.99) AS new_row
ON DUPLICATE KEY UPDATE
  name = new_row.name,
  price = new_row.price;

-- Bulk upsert
INSERT INTO inventory (sku, warehouse_id, quantity)
VALUES
  ('A001', 1, 50),
  ('A002', 1, 30),
  ('A001', 2, 20)
ON DUPLICATE KEY UPDATE
  quantity = VALUES(quantity);

UPSERT for Slowly Changing Dimensions (SCD Type 1)

SCD Type 1 overwrites the old value with the new value. No history is preserved. UPSERT is a natural fit: if the dimension row exists, update it; if not, insert it. This is the simplest SCD type and the most common use case for UPSERT in dimensional data warehouses. The updated_at timestamp tracks when the row last changed.

-- SCD Type 1: overwrite with latest values
-- PostgreSQL
INSERT INTO dim_customers (
  customer_id, name, email, city, updated_at
)
SELECT customer_id, name, email, city, NOW()
FROM staging_customers
ON CONFLICT (customer_id)
DO UPDATE SET
  name = EXCLUDED.name,
  email = EXCLUDED.email,
  city = EXCLUDED.city,
  updated_at = EXCLUDED.updated_at;

-- Only update rows where something changed (optimization)
-- Add a WHERE clause to skip no-op updates:
-- WHERE dim_customers.name != EXCLUDED.name
--    OR dim_customers.email != EXCLUDED.email
--    OR dim_customers.city != EXCLUDED.city;

UPSERT with Conditional Update Logic

Sometimes you want to update only specific columns or only update when certain conditions are met. For example, update the price only if the incoming price is newer (based on a timestamp). Or update the status only if it is moving forward in a state machine. Both PostgreSQL and MERGE support WHERE conditions on the update clause.

-- PostgreSQL: only update if incoming data is newer
INSERT INTO product_prices (product_id, price, effective_date)
VALUES (101, 34.99, '2025-03-15')
ON CONFLICT (product_id)
DO UPDATE SET
  price = EXCLUDED.price,
  effective_date = EXCLUDED.effective_date
WHERE EXCLUDED.effective_date > product_prices.effective_date;

-- Snowflake MERGE: update status only if progressing forward
MERGE INTO orders AS target
USING staging_orders AS source
  ON target.order_id = source.order_id
WHEN MATCHED
  AND source.status_rank > target.status_rank THEN
  UPDATE SET
    target.status = source.status,
    target.status_rank = source.status_rank,
    target.updated_at = CURRENT_TIMESTAMP
WHEN NOT MATCHED THEN
  INSERT (order_id, status, status_rank, created_at, updated_at)
  VALUES (source.order_id, source.status, source.status_rank,
          CURRENT_TIMESTAMP, CURRENT_TIMESTAMP);

When to Use UPSERT

Reach for UPSERT when the same payload can hit the table twice and you need the second write to be a no-op at the row level. Skip it when history matters.

Use UPSERT for

Dimension table loads (SCD Type 1)
Configuration or settings tables
Idempotent pipeline loads that may run multiple times
Real-time CDC (Change Data Capture) event processing
API-driven single-row writes where the client does not know if the row exists

Consider alternatives for

SCD Type 2 (need to preserve history, not overwrite)
Append-only fact tables (inserts only, no updates)
Full table refreshes (TRUNCATE + INSERT is simpler)
Tables without unique constraints (upsert cannot detect conflicts)

4 UPSERT Interview Questions

Q1: What is an upsert, and why is it important in data pipelines?

What they test: Whether you understand idempotent data loading. UPSERT inserts new rows and updates existing ones in a single operation. Pipelines that run the same data through multiple times need idempotent loads to avoid duplicates. Without UPSERT, you need DELETE + INSERT or a staging table with MERGE logic, both of which are more complex. Approach: Define upsert: insert if the row does not exist, update if it does. Explain the alternative without upsert: check if the row exists (SELECT), then conditionally INSERT or UPDATE. This requires two statements and a race condition window. UPSERT handles it atomically. Give a real example: daily customer dimension load where some customers are new and some have updated email addresses. UPSERT handles both in one pass.

Q2: Compare INSERT ON CONFLICT (PostgreSQL) with MERGE (SQL Server/Snowflake). When would you choose one over the other?

What they test: Cross-database knowledge and practical trade-off awareness. ON CONFLICT is simpler but cannot handle deletes. MERGE is more verbose but supports insert, update, and delete in one statement. The interviewer wants to see that you have worked with both and understand when MERGE's extra capability matters. Approach: ON CONFLICT is cleaner for simple upsert: insert or update based on a unique constraint. It is PostgreSQL-specific. MERGE is ANSI standard (supported by SQL Server, Snowflake, Oracle, newer PostgreSQL). MERGE supports WHEN MATCHED THEN DELETE, which ON CONFLICT cannot do. Choose ON CONFLICT for PostgreSQL-only pipelines with simple upsert logic. Choose MERGE when you need delete handling, when working with SQL Server/Snowflake, or when the pipeline logic has multiple match conditions.

Q3: How would you implement an SCD Type 2 using SQL? Can you use UPSERT for it?

What they test: Understanding that SCD Type 2 (historical tracking) is fundamentally different from SCD Type 1 (overwrite). Upsert works for Type 1 but not directly for Type 2. Type 2 requires closing the current record (set end_date) and inserting a new record. This is not a simple upsert because you never update the content columns of the existing row. Approach: Explain SCD Type 2: keep a full history by adding effective_date, end_date, and is_current columns. When a customer changes address, close the current record (SET end_date = today, is_current = false) and insert a new record (effective_date = today, end_date = NULL, is_current = true). MERGE can handle this: WHEN MATCHED AND values differ THEN UPDATE to close the old record, plus a second INSERT for the new version. Pure UPSERT (ON CONFLICT) is not enough because you need to both update the old row AND insert a new row for the same business key.

Q4: Your upsert pipeline runs successfully but downstream queries show stale data. What could be wrong?

What they test: Production debugging skills. Several things can cause a silent upsert failure: the conflict target column has no unique constraint (so conflicts are never detected), the WHERE clause on the update is too restrictive (filtering out legitimate updates), or the column mapping is wrong (updating column A with the value of column A instead of EXCLUDED.A). Approach: Check the conflict target: does the column actually have a unique constraint or unique index? Without it, ON CONFLICT never triggers and every row is inserted as a new row (duplicates instead of updates). Check the update SET clause: verify EXCLUDED references are correct. Check for a WHERE clause on the update that might be filtering out updates (e.g., only update if timestamp is newer, but timestamps are not populated correctly). Check the staging table: does it actually contain the updated values? Run the UPSERT with RETURNING * (PostgreSQL) to see which rows were affected.

Frequently asked questions

What is SQL UPSERT?+

UPSERT is a database operation that inserts a row if it does not exist and updates it if it does. It is not a standard SQL keyword but a concept implemented differently across databases: PostgreSQL uses INSERT ... ON CONFLICT, SQL Server and Snowflake use MERGE, and MySQL uses INSERT ... ON DUPLICATE KEY UPDATE. UPSERT makes data loads idempotent because running the same data twice produces the same result.

What is the difference between UPSERT and MERGE?+

UPSERT is the concept (insert or update). MERGE is a specific SQL syntax that implements the UPSERT concept and more. MERGE can handle inserts, updates, and deletes in a single statement based on match conditions. PostgreSQL INSERT ON CONFLICT is an UPSERT implementation that handles inserts and updates but not deletes. MERGE is the ANSI SQL standard; ON CONFLICT is PostgreSQL-specific.

Does UPSERT require a unique constraint?+

Yes. The database needs a way to determine whether a row already exists. In PostgreSQL, ON CONFLICT requires a unique constraint, unique index, or exclusion constraint on the conflict target columns. In MySQL, ON DUPLICATE KEY requires a primary key or unique index. MERGE uses an ON condition that typically references unique business keys. Without a uniqueness guarantee, the database cannot reliably detect conflicts.

Is UPSERT atomic?+

Yes. In all major databases, UPSERT operations (INSERT ON CONFLICT, MERGE, ON DUPLICATE KEY UPDATE) execute as a single atomic statement. Either the entire operation succeeds or it rolls back. There is no window where another transaction can see a partially completed upsert. This is a major advantage over the alternative of SELECT-then-INSERT-or-UPDATE, which has race condition risks.

02 / Why practice

The retry that doesn't wake you up

01
Active recall beats re-reading by 50%
Cognitive-science meta-reviews (Dunlosky et al., 2013) rank practice testing as a top-tier study technique, while re-reading and highlighting rank near the bottom
02
76% of hiring managers reject on the coding task, not the resume
From HackerRank's 2024 Developer Skills Report. Candidates who look strong on paper still fail the live screen if they haven't done timed, executable practice
03
Five problem shapes cover 80% of data engineer loops
Dedup, sessionization, top-N-per-group, slowly-changing dimensions, partition tricks. Writing the shapes by hand turns the unfamiliar into pattern recognition

Try a SQL Problem