Keys & Identity

Every row in every table needs a fingerprint. Without one, joins multiply rows, updates touch the wrong records, and deletes wipe data you meant to keep. Key design is one of the first decisions you make when modeling a table, and one of the hardest to change later. This lesson covers how to choose the right key type, when to use natural vs surrogate keys, and the failure modes that catch teams who do not think about keys carefully.

The Problem of Identity

Daily Life
Interviews

Understand why every row needs a unique ID

IdentityNatural KeySurrogate Key
Identity
Why Every Row Needs a Fingerprint
Without a unique row identifier, joins multiply rows, updates touch the wrong records, and deletes wipe unintended data.
Natural Key
Business Data as Identifier
Uses existing domain values like email or SKU. Meaningful but fragile when those values change or get reused.
Surrogate Key
System-Generated Identifier
A meaningless integer or UUID created by the system. Stable, immutable, and safe for historical records.
Without a reliable way to uniquely identify each row, queries produce wrong results. Joins multiply rows. Updates touch the wrong records. Deletes wipe related data. The entire relational model depends on every row being unambiguously identifiable. Keys are how you provide that guarantee.

Natural vs Surrogate Keys

A natural key uses data that already exists in the business domain: an email address, a product SKU, a government ID number. A surrogate key is a system-generated identifier with no business meaning. Both have real tradeoffs, and the right choice depends on the specific table and how it will be used downstream.
alert
Natural keys change: a user updates their email, a SKU gets reassigned to a different product.
alert
Natural keys get reused: a discontinued product SKU is recycled for a new item, corrupting deduplication logic.
alert
Natural keys leak internals: exposing sequential IDs to external systems reveals business scale information.
check
Surrogate keys are immutable by design and never reused.
check
Surrogate keys decouple joins from business rules, keeping SCD2 history intact.

The choice is not 'surrogate keys are always better.' It is situational. For a patient record system, HIPAA-regulated IDs change when patients move regions, so a surrogate key protects downstream joins. For an internal config table with 20 rows, a natural key like config_name is perfectly fine.

Natural Key
  • Business-meaningful identifier
  • Can change over time
  • May be reused by the domain
  • Exposes business logic in joins
  • Risky for SCD2 history
Surrogate Key
  • System-generated, no meaning
  • Immutable by design
  • Never reused
  • Decouples joins from business rules
  • Safe for SCD2 history

The INT Overflow Problem

INT in most databases is 32-bit signed, maxing out at 2,147,483,647. High-volume tables like events, clickstreams, and log entries can exhaust an INT primary key in under two years. When the counter overflows, inserts fail. The table stops accepting new rows. Recovery requires ALTER TABLE to change the column type, which locks the table on most engines. Always use BIGINT for any table that could grow past 100 million rows.
TypeMax ValueStorageUse When
INT2,147,483,6474 bytesSmall lookup tables under ~100M rows
BIGINT9,223,372,036,854,775,8078 bytesAny high-volume or event table
SERIALSame as INT4 bytesPostgreSQL auto-increment (INT range)
BIGSERIALSame as BIGINT8 bytesPostgreSQL auto-increment (BIGINT range)
TIP
Default to BIGINT for any table you expect to grow beyond tens of millions of rows. The cost is 4 extra bytes per row. On a 500-million-row table, that is 2 GB. The cost of an INT overflow is an outage and a table-locking migration.

Primary Keys: Data Identity

Daily Life
Interviews

Choose primary keys that never collide

A primary key is a column (or set of columns) that uniquely identifies each row. It is the minimum contract every table must fulfill: no two rows can represent the same entity. The primary key is not just a constraint. It is the mechanism by which every other operation in the relational model is made safe. Joins reference it. Foreign keys point to it. Deduplication logic anchors on it. If you choose it poorly, the damage ripples through every downstream query.

What Makes a Good Primary Key

check
Unique: no two rows share the same PK value.
check
Non-null: the database enforces this at the constraint level.
check
Immutable: the value never changes after the row is created.
check
Concise: small as possible, because every FK reference and index entry copies it.
alert
A 36-character UUID PK on a 500-million-row fact table adds roughly 17 GB of storage for the PK column alone, before indexes.

Composite Primary Keys

A composite PK uses two or more columns together to identify a row. They are the natural fit for junction tables in many-to-many relationships. An enrollment table linking students to courses has a composite PK of (student_id, course_id). The combination is unique even though neither column alone is.
enrollmentPKstudent_idFKcourse_idenrolled_at

Every FK referencing a composite PK must copy all columns. A two-column PK means two-column FK references everywhere downstream. This is the main practical downside of composite keys.

Auto-Increment vs UUID

  • Compact, ordered, fast for range scans and B-tree indexes. Weakness: predictable sequence exposes record counts. Requires coordination in distributed writes.
  • Globally unique without coordination. No central sequence needed. Weakness: 36 characters, random ordering fragments B-tree indexes, poor for range scans.
  • Embeds a timestamp prefix so values are roughly time-ordered. Combines the coordination-free property of UUIDs with the index-friendly ordering of auto-increment. The modern default for distributed systems.
TIP
If you are building a new system and need globally unique IDs without a central sequence, UUID v7 is the best default. It gives you time-ordering (good for indexes) and uniqueness (good for distributed writes) in one format.

Foreign Keys

Daily Life
Interviews

Link tables together with references

Referential IntegrityThe Many Side
Referential Integrity
The FK Promise
Every FK value must either match an existing PK in the referenced table or be NULL. This prevents orphaned rows that reference data that does not exist.
The Many Side
Where the FK Lives
In a one-to-many relationship, the FK always lives on the 'many' side. An order has a customer_id FK because many orders can belong to one customer.
A foreign key is a column in one table that references the primary key of another. It is how relational databases enforce referential integrity: every FK value must either match an existing PK in the parent table, or be NULL. Without FK enforcement, your tables can accumulate orphaned rows that reference customers, products, or orders that no longer exist.
customersPKcustomer_idnameemailordersPKorder_idFKcustomer_idamountorder_date

Cloud Data Warehouses Do Not Enforce FKs

This is one of the most important things to know as a data engineer. Redshift, BigQuery, Snowflake, and Databricks all let you declare FK constraints but do not enforce them at write time. You can insert an order with customer_id = 999 even if no customer 999 exists. The constraint is purely informational, used by the query optimizer for join planning.
alert
Orphaned rows accumulate silently. A fact row referencing a deleted dimension row produces NULL when joined, corrupting aggregations.
alert
FK violations in warehouses are only caught by data quality tests (dbt tests, Great Expectations), not by the database itself.
check
In OLTP databases (PostgreSQL, MySQL), FK constraints ARE enforced. Inserts that violate them fail immediately.
check
Declaring FKs in warehouses is still useful: the optimizer uses them to eliminate unnecessary joins.

Cascade Behavior

ON DELETEWhat HappensUse When
RESTRICT (default)Block the delete if child rows existMost production systems. Forces explicit cleanup.
CASCADEDelete all child rows automaticallyDangerous at scale. One parent delete can cascade to millions of rows.
SET NULLSet the FK column to NULL on child rowsWhen the child row is valid without the parent (e.g., a customer deletes their account but orders remain for reporting).
SET DEFAULTSet the FK to a default valueRare. Requires a meaningful default FK value.
TIP
CASCADE is almost never what you want in a data warehouse. A single DELETE on a parent table can silently wipe millions of child rows. Use RESTRICT and handle deletions explicitly in your pipeline logic.

The FK Design Principle

Three things to always get right with FKs: (1) the FK column lives on the many side of a one-to-many relationship, (2) in cloud warehouses, FK enforcement is your pipeline's responsibility, not the database's, and (3) orphaned rows are silent bugs that only surface when a VP asks why the numbers do not match.

Composite Keys

Daily Life
Interviews

Identify rows using multiple columns

A composite key uses multiple columns together to uniquely identify a row. The most common use case is a junction table for a many-to-many relationship. An enrollment table linking students to courses has a composite PK of (student_id, course_id). Neither column is unique alone, but the combination is.

Composite PK vs Surrogate PK in Junction Tables

Composite PK
  • Enforces uniqueness of the relationship at the schema level
  • No extra column needed
  • Self-documenting: the PK IS the relationship
  • FKs referencing it must carry both columns
Surrogate PK
  • Single-column FK references downstream
  • Easier to reference from other tables
  • Required if the junction table gains its own attributes (grade, enrollment_date)
  • The composite columns still need a UNIQUE constraint

Neither approach is universally correct. If the junction table is purely a link (no extra attributes), a composite PK is cleaner. If the junction table becomes an entity in its own right (e.g., enrollments with grades and dates), add a surrogate key and keep the composite as a UNIQUE constraint.

Composite Keys in Slowly Changing Dimensions

In SCD Type 2, every historical version of a dimension row needs a unique identifier. One approach is a composite key of (natural_key, effective_date). Another is a surrogate key (sk_customer) that is unique per version. The surrogate key approach is standard because fact tables only need to store one column (the SK) instead of two (natural key + effective date).
dim_customerPKcustomer_skcustomer_idnamecityeffective_fromeffective_tois_current

Partial Key Dependency

A partial key dependency occurs when a non-key attribute depends on only part of a composite key. In the enrollment table, if you store the student's major alongside the enrollment record, the major depends only on student_id, not on the full composite (student_id, course_id). This violates Second Normal Form and causes update anomalies: change the major once but it needs updating in every enrollment row.
alert
The fix: move the major to the students table where it depends on the full primary key (student_id).
key
Partial dependencies are the most common normalization mistake in tables with composite keys.
check
Rule of thumb: every non-key column in a composite-PK table must depend on ALL key columns, not just some.

Key Generation Strategies

Daily Life
Interviews

Pick auto-increment, UUID, or natural keys

How you generate surrogate keys has real implications for performance, correctness, and pipeline complexity. There are three main strategies: database sequences, hash-based keys, and UUIDs. Each fits different scenarios.
SequencesHash KeysUUIDs
Sequences
Database-Managed Counters
Fast, compact, ordered. The database guarantees uniqueness. But they require coordination in distributed writes: two parallel Spark executors cannot both call NEXTVAL without a bottleneck.
Hash Keys
Deterministic from Business Data
MD5 or SHA-256 of one or more business columns. Same input always produces the same key. Enables idempotent loading: reprocessing produces the same keys without coordination.
UUIDs
Globally Unique, No Coordination
Generated anywhere without a central sequence. UUID v4 is random (bad for indexes). UUID v7 is time-ordered (good for indexes). The modern default for distributed systems.

Database Sequences

A sequence is a database-maintained counter that increments atomically. Each call to NEXTVAL returns a unique integer. Sequences are fast and produce compact, ordered values that B-tree indexes love. The downside: in distributed insert pipelines (e.g., Spark writing to PostgreSQL), every executor contends for the same sequence, creating a bottleneck.
dim_productPKproduct_sk (BIGSERIAL)product_idnamecategorycreated_at

Hash-Based Keys

A hash-based surrogate key applies a deterministic hash (MD5, SHA-256) to one or more business attributes. The same input always produces the same output. This is what makes hash keys powerful for data pipelines: if you reprocess the same source data, you get the same keys. No sequence coordination needed. The tradeoff: hash collisions are theoretically possible (though astronomically unlikely with MD5's 2^64 collision resistance), and hash values are not human-readable.
check
Idempotent: same input always produces the same key. Reprocessing is safe.
check
Distributed: every worker can compute keys independently. No shared state.
alert
Not ordered: hash values are random, so range scans and time-ordered queries are slow.
alert
Not reversible: you cannot derive the business columns from the hash.

UUIDs

VersionStructureOrderingBest For
v4128-bit randomNone (random)Simple global uniqueness where ordering does not matter
v748-bit timestamp + 80-bit randomTime-orderedDistributed systems that need both uniqueness and index-friendly ordering

Choosing a Strategy

  • Database sequence (BIGSERIAL). Fast, compact, ordered. No contention with one writer.
  • Hash-based keys (MD5 of natural key). Idempotent, no coordination, reprocessing-safe.
  • UUID v7. Time-ordered for indexes, globally unique without a central sequence.
  • Hash-based keys if idempotency matters (most ETL). Sequences if ordering matters (time-series).
PUTTING IT ALL TOGETHER

> You are designing a new customer dimension table for your company's data warehouse. The source system uses email as the customer identifier, but customers can change their email.

You choose a surrogate key (customer_sk as BIGINT) because the natural key (email) is mutable. The surrogate key is immutable and safe for SCD2 history.
You use a hash-based key (MD5 of email + source_system) so that reprocessing the same source data produces the same surrogate keys without a database sequence.
You add a dbt test to check for orphaned fact rows referencing customer_sk values that do not exist in the dimension, because Snowflake will not enforce the FK for you.
KEY TAKEAWAYS
Natural keys fail when they change or get reused: use surrogate keys for stable identity in analytical models
Always use BIGINT for high-volume tables: INT overflows at 2.1 billion rows and causes outages
Cloud warehouses do not enforce FKs: your pipeline owns referential integrity, not the database
Hash-based keys enable idempotent loading: same input always produces the same key, making reprocessing safe
UUID v7 is the modern default: time-ordered for indexes, globally unique without coordination

Every record deserves a fingerprint

Category
Data Modeling
Duration
22 minutes
Challenges
12 hands-on challenges

Topics covered: The Problem of Identity, Primary Keys: Data Identity, Foreign Keys, Composite Keys, Key Generation Strategies

Lesson Sections

  1. The Problem of Identity

    Without a reliable way to uniquely identify each row, queries produce wrong results. Joins multiply rows. Updates touch the wrong records. Deletes wipe related data. The entire relational model depends on every row being unambiguously identifiable. Keys are how you provide that guarantee. Natural vs Surrogate Keys A natural key uses data that already exists in the business domain: an email address, a product SKU, a government ID number. A surrogate key is a system-generated identifier with no bu

  2. Primary Keys: Data Identity (concepts: dmPrimaryKeys)

    A primary key is a column (or set of columns) that uniquely identifies each row. It is the minimum contract every table must fulfill: no two rows can represent the same entity. The primary key is not just a constraint. It is the mechanism by which every other operation in the relational model is made safe. Joins reference it. Foreign keys point to it. Deduplication logic anchors on it. If you choose it poorly, the damage ripples through every downstream query. What Makes a Good Primary Key Compo

  3. Foreign Keys (concepts: dmForeignKeys)

    A foreign key is a column in one table that references the primary key of another. It is how relational databases enforce referential integrity: every FK value must either match an existing PK in the parent table, or be NULL. Without FK enforcement, your tables can accumulate orphaned rows that reference customers, products, or orders that no longer exist. Cloud Data Warehouses Do Not Enforce FKs This is one of the most important things to know as a data engineer. Redshift, BigQuery, Snowflake,

  4. Composite Keys (concepts: dmCompositeKeys)

    A composite key uses multiple columns together to uniquely identify a row. The most common use case is a junction table for a many-to-many relationship. An enrollment table linking students to courses has a composite PK of (student_id, course_id). Neither column is unique alone, but the combination is. Composite PK vs Surrogate PK in Junction Tables Neither approach is universally correct. If the junction table is purely a link (no extra attributes), a composite PK is cleaner. If the junction ta

  5. Key Generation Strategies (concepts: dmKeyGeneration, dmSurrogateKeys)

    How you generate surrogate keys has real implications for performance, correctness, and pipeline complexity. There are three main strategies: database sequences, hash-based keys, and UUIDs. Each fits different scenarios. Database Sequences A sequence is a database-maintained counter that increments atomically. Each call to NEXTVAL returns a unique integer. Sequences are fast and produce compact, ordered values that B-tree indexes love. The downside: in distributed insert pipelines (e.g., Spark w