Keys & Identity
The Problem of Identity
Understand why every row needs a unique ID
Natural vs Surrogate Keys
The choice is not 'surrogate keys are always better.' It is situational. For a patient record system, HIPAA-regulated IDs change when patients move regions, so a surrogate key protects downstream joins. For an internal config table with 20 rows, a natural key like config_name is perfectly fine.
- Business-meaningful identifier
- Can change over time
- May be reused by the domain
- Exposes business logic in joins
- Risky for SCD2 history
- System-generated, no meaning
- Immutable by design
- Never reused
- Decouples joins from business rules
- Safe for SCD2 history
The INT Overflow Problem
| Type | Max Value | Storage | Use When |
|---|---|---|---|
| INT | 2,147,483,647 | 4 bytes | Small lookup tables under ~100M rows |
| BIGINT | 9,223,372,036,854,775,807 | 8 bytes | Any high-volume or event table |
| SERIAL | Same as INT | 4 bytes | PostgreSQL auto-increment (INT range) |
| BIGSERIAL | Same as BIGINT | 8 bytes | PostgreSQL auto-increment (BIGINT range) |
Primary Keys: Data Identity
Choose primary keys that never collide
What Makes a Good Primary Key
Composite Primary Keys
Every FK referencing a composite PK must copy all columns. A two-column PK means two-column FK references everywhere downstream. This is the main practical downside of composite keys.
Auto-Increment vs UUID
- Compact, ordered, fast for range scans and B-tree indexes. Weakness: predictable sequence exposes record counts. Requires coordination in distributed writes.
- Globally unique without coordination. No central sequence needed. Weakness: 36 characters, random ordering fragments B-tree indexes, poor for range scans.
- Embeds a timestamp prefix so values are roughly time-ordered. Combines the coordination-free property of UUIDs with the index-friendly ordering of auto-increment. The modern default for distributed systems.
Foreign Keys
Link tables together with references
Cloud Data Warehouses Do Not Enforce FKs
Cascade Behavior
| ON DELETE | What Happens | Use When |
|---|---|---|
| RESTRICT (default) | Block the delete if child rows exist | Most production systems. Forces explicit cleanup. |
| CASCADE | Delete all child rows automatically | Dangerous at scale. One parent delete can cascade to millions of rows. |
| SET NULL | Set the FK column to NULL on child rows | When the child row is valid without the parent (e.g., a customer deletes their account but orders remain for reporting). |
| SET DEFAULT | Set the FK to a default value | Rare. Requires a meaningful default FK value. |
The FK Design Principle
Three things to always get right with FKs: (1) the FK column lives on the many side of a one-to-many relationship, (2) in cloud warehouses, FK enforcement is your pipeline's responsibility, not the database's, and (3) orphaned rows are silent bugs that only surface when a VP asks why the numbers do not match.
Composite Keys
Identify rows using multiple columns
Composite PK vs Surrogate PK in Junction Tables
- Enforces uniqueness of the relationship at the schema level
- No extra column needed
- Self-documenting: the PK IS the relationship
- FKs referencing it must carry both columns
- Single-column FK references downstream
- Easier to reference from other tables
- Required if the junction table gains its own attributes (grade, enrollment_date)
- The composite columns still need a UNIQUE constraint
Neither approach is universally correct. If the junction table is purely a link (no extra attributes), a composite PK is cleaner. If the junction table becomes an entity in its own right (e.g., enrollments with grades and dates), add a surrogate key and keep the composite as a UNIQUE constraint.
Composite Keys in Slowly Changing Dimensions
Partial Key Dependency
Key Generation Strategies
Pick auto-increment, UUID, or natural keys
Database Sequences
Hash-Based Keys
UUIDs
| Version | Structure | Ordering | Best For |
|---|---|---|---|
| v4 | 128-bit random | None (random) | Simple global uniqueness where ordering does not matter |
| v7 | 48-bit timestamp + 80-bit random | Time-ordered | Distributed systems that need both uniqueness and index-friendly ordering |
Choosing a Strategy
- Database sequence (BIGSERIAL). Fast, compact, ordered. No contention with one writer.
- Hash-based keys (MD5 of natural key). Idempotent, no coordination, reprocessing-safe.
- UUID v7. Time-ordered for indexes, globally unique without a central sequence.
- Hash-based keys if idempotency matters (most ETL). Sequences if ordering matters (time-series).
> You are designing a new customer dimension table for your company's data warehouse. The source system uses email as the customer identifier, but customers can change their email.
Every record deserves a fingerprint
- Category
- Data Modeling
- Duration
- 22 minutes
- Challenges
- 12 hands-on challenges
Topics covered: The Problem of Identity, Primary Keys: Data Identity, Foreign Keys, Composite Keys, Key Generation Strategies
Lesson Sections
- The Problem of Identity
Without a reliable way to uniquely identify each row, queries produce wrong results. Joins multiply rows. Updates touch the wrong records. Deletes wipe related data. The entire relational model depends on every row being unambiguously identifiable. Keys are how you provide that guarantee. Natural vs Surrogate Keys A natural key uses data that already exists in the business domain: an email address, a product SKU, a government ID number. A surrogate key is a system-generated identifier with no bu
- Primary Keys: Data Identity (concepts: dmPrimaryKeys)
A primary key is a column (or set of columns) that uniquely identifies each row. It is the minimum contract every table must fulfill: no two rows can represent the same entity. The primary key is not just a constraint. It is the mechanism by which every other operation in the relational model is made safe. Joins reference it. Foreign keys point to it. Deduplication logic anchors on it. If you choose it poorly, the damage ripples through every downstream query. What Makes a Good Primary Key Compo
- Foreign Keys (concepts: dmForeignKeys)
A foreign key is a column in one table that references the primary key of another. It is how relational databases enforce referential integrity: every FK value must either match an existing PK in the parent table, or be NULL. Without FK enforcement, your tables can accumulate orphaned rows that reference customers, products, or orders that no longer exist. Cloud Data Warehouses Do Not Enforce FKs This is one of the most important things to know as a data engineer. Redshift, BigQuery, Snowflake,
- Composite Keys (concepts: dmCompositeKeys)
A composite key uses multiple columns together to uniquely identify a row. The most common use case is a junction table for a many-to-many relationship. An enrollment table linking students to courses has a composite PK of (student_id, course_id). Neither column is unique alone, but the combination is. Composite PK vs Surrogate PK in Junction Tables Neither approach is universally correct. If the junction table is purely a link (no extra attributes), a composite PK is cleaner. If the junction ta
- Key Generation Strategies (concepts: dmKeyGeneration, dmSurrogateKeys)
How you generate surrogate keys has real implications for performance, correctness, and pipeline complexity. There are three main strategies: database sequences, hash-based keys, and UUIDs. Each fits different scenarios. Database Sequences A sequence is a database-maintained counter that increments atomically. Each call to NEXTVAL returns a unique integer. Sequences are fast and produce compact, ordered values that B-tree indexes love. The downside: in distributed insert pipelines (e.g., Spark w