Loading lesson...
Every record deserves a fingerprint
Every record deserves a fingerprint
Topics covered: The Problem of Identity, Primary Keys: Data Identity, Foreign Keys, Composite Keys, Key Generation Strategies
Without a reliable way to uniquely identify each row, queries produce wrong results. Joins multiply rows. Updates touch the wrong records. Deletes wipe related data. The entire relational model depends on every row being unambiguously identifiable. Keys are how you provide that guarantee. Natural vs Surrogate Keys A natural key uses data that already exists in the business domain: an email address, a product SKU, a government ID number. A surrogate key is a system-generated identifier with no bu
A primary key is a column (or set of columns) that uniquely identifies each row. It is the minimum contract every table must fulfill: no two rows can represent the same entity. The primary key is not just a constraint. It is the mechanism by which every other operation in the relational model is made safe. Joins reference it. Foreign keys point to it. Deduplication logic anchors on it. If you choose it poorly, the damage ripples through every downstream query. What Makes a Good Primary Key Compo
A foreign key is a column in one table that references the primary key of another. It is how relational databases enforce referential integrity: every FK value must either match an existing PK in the parent table, or be NULL. Without FK enforcement, your tables can accumulate orphaned rows that reference customers, products, or orders that no longer exist. Cloud Data Warehouses Do Not Enforce FKs This is one of the most important things to know as a data engineer. Redshift, BigQuery, Snowflake,
A composite key uses multiple columns together to uniquely identify a row. The most common use case is a junction table for a many-to-many relationship. An enrollment table linking students to courses has a composite PK of (student_id, course_id). Neither column is unique alone, but the combination is. Composite PK vs Surrogate PK in Junction Tables Neither approach is universally correct. If the junction table is purely a link (no extra attributes), a composite PK is cleaner. If the junction ta
How you generate surrogate keys has real implications for performance, correctness, and pipeline complexity. There are three main strategies: database sequences, hash-based keys, and UUIDs. Each fits different scenarios. Database Sequences A sequence is a database-maintained counter that increments atomically. Each call to NEXTVAL returns a unique integer. Sequences are fast and produce compact, ordered values that B-tree indexes love. The downside: in distributed insert pipelines (e.g., Spark w