Keys & Identity

Every row in every table needs a fingerprint. Without one, joins multiply rows, updates touch the wrong records, and deletes wipe data you meant to keep. Key design is one of the first decisions you make when modeling a table, and one of the hardest to change later. This lesson covers how to choose the right key type, when to use natural vs surrogate keys, and the failure modes that catch teams who do not think about keys carefully.

The Problem of Identity

Daily Life

Interviews

Understand why every row needs a unique ID

IdentityNatural KeySurrogate Key

Identity

Why Every Row Needs a Fingerprint

Without a unique row identifier, joins multiply rows, updates touch the wrong records, and deletes wipe unintended data.

Natural Key

Business Data as Identifier

Uses existing domain values like email or SKU. Meaningful but fragile when those values change or get reused.

Surrogate Key

System-Generated Identifier

A meaningless integer or UUID created by the system. Stable, immutable, and safe for historical records.

Without a reliable way to uniquely identify each row, queries produce wrong results. Joins multiply rows. Updates touch the wrong records. Deletes wipe related data. The entire relational model depends on every row being unambiguously identifiable. Keys are how you provide that guarantee.

Natural vs Surrogate Keys

A natural key uses data that already exists in the business domain: an email address, a product SKU, a government ID number. A surrogate key is a system-generated identifier with no business meaning. Both have real tradeoffs, and the right choice depends on the specific table and how it will be used downstream.

Natural keys change: a user updates their email, a SKU gets reassigned to a different product.

Natural keys get reused: a discontinued product SKU is recycled for a new item, corrupting deduplication logic.

Natural keys leak internals: exposing sequential IDs to external systems reveals business scale information.

Surrogate keys are immutable by design and never reused.

Surrogate keys decouple joins from business rules, keeping SCD2 history intact.

The choice is not 'surrogate keys are always better.' It is situational. For a patient record system, HIPAA-regulated IDs change when patients move regions, so a surrogate key protects downstream joins. For an internal config table with 20 rows, a natural key like config_name is perfectly fine.

•Natural Key

Business-meaningful identifier
Can change over time
May be reused by the domain
Exposes business logic in joins
Risky for SCD2 history

•Surrogate Key

System-generated, no meaning
Immutable by design
Never reused
Decouples joins from business rules
Safe for SCD2 history

The INT Overflow Problem

INT in most databases is 32-bit signed, maxing out at 2,147,483,647. High-volume tables like events, clickstreams, and log entries can exhaust an INT primary key in under two years. When the counter overflows, inserts fail. The table stops accepting new rows. Recovery requires ALTER TABLE to change the column type, which locks the table on most engines. Always use BIGINT for any table that could grow past 100 million rows.

Type	Max Value	Storage	Use When
INT	2,147,483,647	4 bytes	Small lookup tables under ~100M rows
BIGINT	9,223,372,036,854,775,807	8 bytes	Any high-volume or event table
SERIAL	Same as INT	4 bytes	PostgreSQL auto-increment (INT range)
BIGSERIAL	Same as BIGINT	8 bytes	PostgreSQL auto-increment (BIGINT range)

TIP

Default to BIGINT for any table you expect to grow beyond tens of millions of rows. The cost is 4 extra bytes per row. On a 500-million-row table, that is 2 GB. The cost of an INT overflow is an outage and a table-locking migration.

Primary Keys: Data Identity

Daily Life

Interviews

Choose primary keys that never collide

A primary key is a column (or set of columns) that uniquely identifies each row. It is the minimum contract every table must fulfill: no two rows can represent the same entity. The primary key is not just a constraint. It is the mechanism by which every other operation in the relational model is made safe. Joins reference it. Foreign keys point to it. Deduplication logic anchors on it. If you choose it poorly, the damage ripples through every downstream query.

What Makes a Good Primary Key

Unique: no two rows share the same PK value.

Non-null: the database enforces this at the constraint level.

Immutable: the value never changes after the row is created.

Concise: small as possible, because every FK reference and index entry copies it.

A 36-character UUID PK on a 500-million-row fact table adds roughly 17 GB of storage for the PK column alone, before indexes.

Composite Primary Keys

A composite PK uses two or more columns together to identify a row. They are the natural fit for junction tables in many-to-many relationships. An enrollment table linking students to courses has a composite PK of (student_id, course_id). The combination is unique even though neither column alone is.

Every FK referencing a composite PK must copy all columns. A two-column PK means two-column FK references everywhere downstream. This is the main practical downside of composite keys.

Auto-Increment vs UUID

Compact, ordered, fast for range scans and B-tree indexes. Weakness: predictable sequence exposes record counts. Requires coordination in distributed writes.
Globally unique without coordination. No central sequence needed. Weakness: 36 characters, random ordering fragments B-tree indexes, poor for range scans.
Embeds a timestamp prefix so values are roughly time-ordered. Combines the coordination-free property of UUIDs with the index-friendly ordering of auto-increment. The modern default for distributed systems.

TIP

If you are building a new system and need globally unique IDs without a central sequence, UUID v7 is the best default. It gives you time-ordering (good for indexes) and uniqueness (good for distributed writes) in one format.

Foreign Keys

Daily Life

Interviews

Link tables together with references

Referential IntegrityThe Many Side

Referential Integrity

The FK Promise

Every FK value must either match an existing PK in the referenced table or be NULL. This prevents orphaned rows that reference data that does not exist.

The Many Side

Where the FK Lives

In a one-to-many relationship, the FK always lives on the 'many' side. An order has a customer_id FK because many orders can belong to one customer.

A foreign key is a column in one table that references the primary key of another. It is how relational databases enforce referential integrity: every FK value must either match an existing PK in the parent table, or be NULL. Without FK enforcement, your tables can accumulate orphaned rows that reference customers, products, or orders that no longer exist.

Cloud Data Warehouses Do Not Enforce FKs

This is one of the most important things to know as a data engineer. Redshift, BigQuery, Snowflake, and Databricks all let you declare FK constraints but do not enforce them at write time. You can insert an order with customer_id = 999 even if no customer 999 exists. The constraint is purely informational, used by the query optimizer for join planning.

Orphaned rows accumulate silently. A fact row referencing a deleted dimension row produces NULL when joined, corrupting aggregations.

FK violations in warehouses are only caught by data quality tests (dbt tests, Great Expectations), not by the database itself.

In OLTP databases (PostgreSQL, MySQL), FK constraints ARE enforced. Inserts that violate them fail immediately.

Declaring FKs in warehouses is still useful: the optimizer uses them to eliminate unnecessary joins.

Cascade Behavior

ON DELETE	What Happens	Use When
RESTRICT (default)	Block the delete if child rows exist	Most production systems. Forces explicit cleanup.
CASCADE	Delete all child rows automatically	Dangerous at scale. One parent delete can cascade to millions of rows.
SET NULL	Set the FK column to NULL on child rows	When the child row is valid without the parent (e.g., a customer deletes their account but orders remain for reporting).
SET DEFAULT	Set the FK to a default value	Rare. Requires a meaningful default FK value.

TIP

CASCADE is almost never what you want in a data warehouse. A single DELETE on a parent table can silently wipe millions of child rows. Use RESTRICT and handle deletions explicitly in your pipeline logic.

The FK Design Principle

Three things to always get right with FKs: (1) the FK column lives on the many side of a one-to-many relationship, (2) in cloud warehouses, FK enforcement is your pipeline's responsibility, not the database's, and (3) orphaned rows are silent bugs that only surface when a VP asks why the numbers do not match.

Composite Keys

Daily Life

Interviews

Identify rows using multiple columns

A composite key uses multiple columns together to uniquely identify a row. The most common use case is a junction table for a many-to-many relationship. An enrollment table linking students to courses has a composite PK of (student_id, course_id). Neither column is unique alone, but the combination is.

Composite PK vs Surrogate PK in Junction Tables

✓Composite PK

Enforces uniqueness of the relationship at the schema level
No extra column needed
Self-documenting: the PK IS the relationship
FKs referencing it must carry both columns

✓Surrogate PK

Single-column FK references downstream
Easier to reference from other tables
Required if the junction table gains its own attributes (grade, enrollment_date)
The composite columns still need a UNIQUE constraint

Neither approach is universally correct. If the junction table is purely a link (no extra attributes), a composite PK is cleaner. If the junction table becomes an entity in its own right (e.g., enrollments with grades and dates), add a surrogate key and keep the composite as a UNIQUE constraint.

Composite Keys in Slowly Changing Dimensions

In SCD Type 2, every historical version of a dimension row needs a unique identifier. One approach is a composite key of (natural_key, effective_date). Another is a surrogate key (sk_customer) that is unique per version. The surrogate key approach is standard because fact tables only need to store one column (the SK) instead of two (natural key + effective date).

Partial Key Dependency

A partial key dependency occurs when a non-key attribute depends on only part of a composite key. In the enrollment table, if you store the student's major alongside the enrollment record, the major depends only on student_id, not on the full composite (student_id, course_id). This violates Second Normal Form and causes update anomalies: change the major once but it needs updating in every enrollment row.

The fix: move the major to the students table where it depends on the full primary key (student_id).

Partial dependencies are the most common normalization mistake in tables with composite keys.

Rule of thumb: every non-key column in a composite-PK table must depend on ALL key columns, not just some.

Key Generation Strategies

Daily Life

Interviews

Pick auto-increment, UUID, or natural keys

How you generate surrogate keys has real implications for performance, correctness, and pipeline complexity. There are three main strategies: database sequences, hash-based keys, and UUIDs. Each fits different scenarios.

SequencesHash KeysUUIDs

Sequences

Database-Managed Counters

Fast, compact, ordered. The database guarantees uniqueness. But they require coordination in distributed writes: two parallel Spark executors cannot both call NEXTVAL without a bottleneck.

Hash Keys

Deterministic from Business Data

MD5 or SHA-256 of one or more business columns. Same input always produces the same key. Enables idempotent loading: reprocessing produces the same keys without coordination.

UUIDs

Globally Unique, No Coordination

Generated anywhere without a central sequence. UUID v4 is random (bad for indexes). UUID v7 is time-ordered (good for indexes). The modern default for distributed systems.

Database Sequences

A sequence is a database-maintained counter that increments atomically. Each call to NEXTVAL returns a unique integer. Sequences are fast and produce compact, ordered values that B-tree indexes love. The downside: in distributed insert pipelines (e.g., Spark writing to PostgreSQL), every executor contends for the same sequence, creating a bottleneck.

Hash-Based Keys

A hash-based surrogate key applies a deterministic hash (MD5, SHA-256) to one or more business attributes. The same input always produces the same output. This is what makes hash keys powerful for data pipelines: if you reprocess the same source data, you get the same keys. No sequence coordination needed. The tradeoff: hash collisions are theoretically possible (though astronomically unlikely with MD5's 2^64 collision resistance), and hash values are not human-readable.

Idempotent: same input always produces the same key. Reprocessing is safe.

Distributed: every worker can compute keys independently. No shared state.

Not ordered: hash values are random, so range scans and time-ordered queries are slow.

Not reversible: you cannot derive the business columns from the hash.

UUIDs

Version	Structure	Ordering	Best For
v4	128-bit random	None (random)	Simple global uniqueness where ordering does not matter
v7	48-bit timestamp + 80-bit random	Time-ordered	Distributed systems that need both uniqueness and index-friendly ordering

Choosing a Strategy

Database sequence (BIGSERIAL). Fast, compact, ordered. No contention with one writer.
Hash-based keys (MD5 of natural key). Idempotent, no coordination, reprocessing-safe.
UUID v7. Time-ordered for indexes, globally unique without a central sequence.
Hash-based keys if idempotency matters (most ETL). Sequences if ordering matters (time-series).

❯❯❯PUTTING IT ALL TOGETHER

> You are designing a new customer dimension table for your company's data warehouse. The source system uses email as the customer identifier, but customers can change their email.

You choose a surrogate key (customer_sk as BIGINT) because the natural key (email) is mutable. The surrogate key is immutable and safe for SCD2 history.

You use a hash-based key (MD5 of email + source_system) so that reprocessing the same source data produces the same surrogate keys without a database sequence.

You add a dbt test to check for orphaned fact rows referencing customer_sk values that do not exist in the dimension, because Snowflake will not enforce the FK for you.

KEY TAKEAWAYS

Natural keys fail when they change or get reused: use surrogate keys for stable identity in analytical models

Always use BIGINT for high-volume tables: INT overflows at 2.1 billion rows and causes outages

Cloud warehouses do not enforce FKs: your pipeline owns referential integrity, not the database

Hash-based keys enable idempotent loading: same input always produces the same key, making reprocessing safe

UUID v7 is the modern default: time-ordered for indexes, globally unique without coordination

Every record deserves a fingerprint

Category: Data Modeling
Duration: 22 minutes
Challenges: 12 hands-on challenges

Topics covered: The Problem of Identity, Primary Keys: Data Identity, Foreign Keys, Composite Keys, Key Generation Strategies

Lesson Sections

The Problem of Identity
Without a reliable way to uniquely identify each row, queries produce wrong results. Joins multiply rows. Updates touch the wrong records. Deletes wipe related data. The entire relational model depends on every row being unambiguously identifiable. Keys are how you provide that guarantee. Natural vs Surrogate Keys A natural key uses data that already exists in the business domain: an email address, a product SKU, a government ID number. A surrogate key is a system-generated identifier with no bu
Primary Keys: Data Identity (concepts: dmPrimaryKeys)
A primary key is a column (or set of columns) that uniquely identifies each row. It is the minimum contract every table must fulfill: no two rows can represent the same entity. The primary key is not just a constraint. It is the mechanism by which every other operation in the relational model is made safe. Joins reference it. Foreign keys point to it. Deduplication logic anchors on it. If you choose it poorly, the damage ripples through every downstream query. What Makes a Good Primary Key Compo
Foreign Keys (concepts: dmForeignKeys)
A foreign key is a column in one table that references the primary key of another. It is how relational databases enforce referential integrity: every FK value must either match an existing PK in the parent table, or be NULL. Without FK enforcement, your tables can accumulate orphaned rows that reference customers, products, or orders that no longer exist. Cloud Data Warehouses Do Not Enforce FKs This is one of the most important things to know as a data engineer. Redshift, BigQuery, Snowflake,
Composite Keys (concepts: dmCompositeKeys)
A composite key uses multiple columns together to uniquely identify a row. The most common use case is a junction table for a many-to-many relationship. An enrollment table linking students to courses has a composite PK of (student_id, course_id). Neither column is unique alone, but the combination is. Composite PK vs Surrogate PK in Junction Tables Neither approach is universally correct. If the junction table is purely a link (no extra attributes), a composite PK is cleaner. If the junction ta
Key Generation Strategies (concepts: dmKeyGeneration, dmSurrogateKeys)
How you generate surrogate keys has real implications for performance, correctness, and pipeline complexity. There are three main strategies: database sequences, hash-based keys, and UUIDs. Each fits different scenarios. Database Sequences A sequence is a database-maintained counter that increments atomically. Each call to NEXTVAL returns a unique integer. Sequences are fast and produce compact, ordered values that B-tree indexes love. The downside: in distributed insert pipelines (e.g., Spark w