Identifying the Duplicate Key

Concepts covered: sqlWindowDedup

The common case: rows are duplicates by some key (customer_id, order_date) but differ in other columns (created_at, amount, status). DISTINCT does not help, because it compares entire rows and these rows are not identical. The canonical tool is ROW_NUMBER() OVER (PARTITION BY key ORDER BY tiebreaker), filtered to rn = 1. The partition defines the key; the order defines which copy is ranked first; the filter keeps only that copy. This is the pattern that handles 80% of real-world dedup questions.

Why a CTE is mandatory here

You cannot filter on the rn column in the same SELECT that defines it: SQL evaluates WHERE before window functions, so the rn column does not exist yet when WHERE runs. You have to compute rn first in a CTE or subquery, then filter on it in an outer query. The classic failure: putting WHERE rn =
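A minimal sketch of the pattern described above, written for PostgreSQL. The orders table, its sample rows, and the choice of created_at DESC as the tiebreaker (keep the newest copy) are assumptions made for illustration; the text names only the key and the other columns.

```sql
-- Hypothetical table and sample rows, invented for illustration only.
CREATE TEMP TABLE orders (
    customer_id INT,
    order_date  DATE,
    created_at  TIMESTAMP,
    amount      NUMERIC(10, 2),
    status      TEXT
);

INSERT INTO orders VALUES
    (1, '2024-01-05', '2024-01-05 09:00', 50.00, 'pending'),
    (1, '2024-01-05', '2024-01-05 09:30', 50.00, 'paid'),   -- same key, later copy
    (2, '2024-01-06', '2024-01-06 14:00', 20.00, 'paid');

-- Step 1: number the copies within each key.
WITH ranked AS (
    SELECT
        o.*,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id, order_date  -- the duplicate key
            ORDER BY created_at DESC              -- assumed tiebreaker: newest copy gets rn = 1
        ) AS rn
    FROM orders AS o
)
-- Step 2: rn exists by now, so the outer query can filter on it.
SELECT customer_id, order_date, created_at, amount, status
FROM ranked
WHERE rn = 1
ORDER BY customer_id, order_date;

-- The single-SELECT form is rejected, because WHERE runs before
-- the window function has produced rn:
-- SELECT o.*, ROW_NUMBER() OVER (...) AS rn FROM orders AS o WHERE rn = 1;
```

With these sample rows the query returns one row per key: the 'paid' copy for customer 1 and the only row for customer 2. If two copies could share the same created_at, adding a second column to the ORDER BY would make the pick deterministic.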

About This Interactive Section

This section is part of the Deduplication: Beginner lesson on DataDriven, a free data engineering interview prep platform. SQL queries run against a live PostgreSQL database.
