Normalization

When the same piece of data lives in two places, it will eventually disagree with itself. A customer's address is stored in the orders table and the customers table. Someone updates one but not the other. Now your system has two truths. Normalization is the discipline of organizing data so that every fact is stored exactly once. This lesson teaches you how to recognize when data is duplicated, why that causes bugs, and how to restructure tables to eliminate the problem.

Data Gets Out of Sync

Daily Life

Interviews

Spot when duplicated data causes bugs

Why Normalization Exists

Normalization exists to prevent a specific class of production failure: data that contradicts itself. When the same fact is stored in multiple rows or multiple tables, any update must touch every copy. Miss one, and your database has two different answers to the same question. This is not a theoretical concern. It is the number one source of 'the numbers do not match' bugs in real systems.

Here is the canonical example: an employee table that looks reasonable until a department name changes.

emp_id	name	department_name	dept_location
1	Alice	Engineering	Floor 3
2	Bob	Engineering	Floor 3
3	Carol	Engineering	Floor 5
4	Dave	Sales	Floor 1

Alice and Bob say Engineering is on Floor 3. Carol says Floor 5. Which is correct? Nobody knows without checking a separate source. This happened because someone updated Carol's row but not Alice's and Bob's. The data contradicts itself.

The fix: store 'Engineering' and its location in exactly one place. Create a departments table with department_name and dept_location. The employees table references it via dept_id. Now updating the department location means changing one row, not hunting through every employee.

departments

PKdept_id

dept_name

location

employees

PKemp_id

name

FKdept_id

TIP

If you can change one fact about the world (a department moves floors) and it requires updating more than one row, you have a normalization problem.

The Three Anomaly Types

Update anomaly: changing Engineering's floor requires updating every employee in Engineering. Miss one and the data contradicts itself.

Insert anomaly: you cannot record a new department until at least one employee is assigned to it (if departments only exist as employee attributes).

Delete anomaly: if Dave is the only Sales employee and you delete him, the Sales department and its location disappear from the database entirely.

All three anomalies are caused by the same root problem: data that should live in one place is duplicated across multiple rows.

All three are fixed the same way: separate the duplicated data into its own table and reference it via a foreign key.

First Normal Form (1NF)

Daily Life

Interviews

Ensure every cell holds a single value

1NF: No Lists in Cells

A table is in first normal form when every cell contains a single atomic value. No comma-separated lists. No arrays stored as strings. No multiple values crammed into one column. This sounds obvious, but 1NF violations are everywhere in real data. A tags column with 'python, sql, spark'. A phone_numbers column with '555-1234, 555-5678'. These break filtering, indexing, and joining.

Here is a 1NF violation:

student_id	name	courses
1	Alice	Math, Physics, Chemistry
2	Bob	Math, Biology

You cannot filter for 'students taking Physics' without string parsing. You cannot count courses per student without splitting on commas. You cannot join to a courses table because the FK is a comma-separated string, not a single value.

The fix: one row per student-course combination.

students

PKstudent_id

name

courses

PKcourse_id

course_name

enrollments

FKstudent_id

FKcourse_id

student_id	course_id
1	MATH101
1	PHYS101
1	CHEM101
2	MATH101
2	BIO101

✗Violates 1NF

Multiple values in one cell
Cannot filter without string parsing
Cannot JOIN to related tables
Cannot enforce referential integrity

✓Satisfies 1NF

One value per cell
Standard WHERE clause filtering
JOIN works with FK relationships
FK constraints enforce valid values

TIP

The most common 1NF violation in modern systems: storing JSON arrays as strings in a VARCHAR column. If you need to query individual elements of that array, either use a native ARRAY/JSON column type or normalize into a separate table.

Second Normal Form (2NF)

Daily Life

Interviews

Remove partial key dependencies

2NF: No Partial Dependencies on Composite Keys

2NF only matters when your primary key spans multiple columns. If your PK is a single column, your table automatically satisfies 2NF. The violation: a non-key column depends on only PART of the composite key, not the whole thing.

Consider a store inventory table with a composite PK of (store_id, product_id). The table includes store_name and product_name alongside the quantity.

store_id	product_id	store_name	product_name	quantity
S1	P1	Downtown	Widget	100
S1	P2	Downtown	Gadget	50
S2	P1	Uptown	Widget	75

store_name depends only on store_id, not on the full key (store_id, product_id). product_name depends only on product_id. These are partial dependencies. The problem: 'Downtown' is stored once per product at that store. Add a third product and you have three copies of 'Downtown'. Rename the store and you must update every row.

The fix: separate tables for stores and products. The inventory table keeps only the composite key and the quantity.

stores

PKstore_id

store_name

products

PKproduct_id

product_name

inventory

FKstore_id

FKproduct_id

quantity

Store names live in stores. Product names live in products. Each fact is stored exactly once. Updating a store name is a single-row update.

store_id	product_id	quantity
S1	P1	100
S1	P2	50
S2	P1	75

TIP

2NF violations only occur with composite keys. If every table in your schema uses a single-column surrogate key, 2NF is automatically satisfied. This is one reason surrogate keys are so popular in practice.

Third Normal Form (3NF)

Daily Life

Interviews

Eliminate columns that depend on non-keys

3NF: No Transitive Dependencies

3NF is the most practically important normal form because it catches the most common data duplication pattern: a column that depends on another non-key column instead of the primary key. This applies to every table, not just those with composite keys.

Consider an orders table where each order has a customer_id, customer_name, and customer_city. customer_name and customer_city depend on customer_id, not on order_id. They are transitively dependent: order_id determines customer_id, which determines customer_name.

order_id	customer_id	customer_name	customer_city	amount
O1	C1	Alice	Seattle	$100
O2	C1	Alice	Seattle	$200
O3	C2	Bob	Portland	$150

Alice's name and city are stored twice. If Alice moves to Portland, you must update every order row. Miss one and the data contradicts itself. The test for any non-key column: does this value depend directly on the primary key, or does it depend on another column that happens to be in this table?

order_id determines customer_id (direct dependency, fine).

customer_id determines customer_name (transitive dependency through customer_id, not order_id).

customer_id determines customer_city (same transitive dependency).

order_id determines amount (direct dependency, fine).

customers

PKcustomer_id

name

city

orders

PKorder_id

FKcustomer_id

amount

Identifying Normal Form

Daily Life

Interviews

Diagnose which normal form a table meets

The Three-Step Check

When you look at any table and need to determine its normal form, apply these three checks in order. Stop at the first failure.

Check 1: Does any cell contain multiple values (comma-separated lists, arrays as strings)? If yes, the table violates 1NF.

Check 2: Is the PK composite? If so, does any non-key column depend on only part of the key? If yes, the table violates 2NF.

Check 3: Does any non-key column depend on another non-key column instead of the PK? If yes, the table violates 3NF.

In practice, 90% of real-world normalization problems are 3NF violations: a customer name stored on an orders table, a department location stored on an employee table, a product category stored on a sales table. These all have the same shape: an attribute that belongs to a related entity is denormalized onto the current table.

When Not to Normalize: OLTP vs OLAP

Normalization is the right choice for write-heavy transactional systems (OLTP). Every fact stored once means updates are safe and consistent. But for read-heavy analytical systems (OLAP), normalization creates excessive joins that slow queries down. A dashboard query that joins 8 normalized tables is slower than one that reads a single denormalized table.

The standard architecture: normalize in OLTP (source systems), denormalize in OLAP (data warehouse). The ETL pipeline transforms normalized source data into denormalized analytical tables optimized for read performance.

OLTPOLAPETL Bridge

OLTP

Normalize for Write Safety

Every fact stored once. Updates are safe. JOINs are acceptable because queries touch few rows at a time. Optimized for INSERT/UPDATE/DELETE.

OLAP

Denormalize for Read Speed

Pre-join dimensions into facts. Queries scan millions of rows. Fewer JOINs means faster queries. Optimized for SELECT with GROUP BY.

ETL Bridge

Transform Between Both

The pipeline reads normalized source data and writes denormalized analytical tables. This is where data engineers spend most of their time.

TIP

Start normalized. Denormalize specific tables only when you have measured a real performance problem. 'It might be slow' is not a reason to denormalize. 'This query takes 45 seconds because it joins 8 tables' is.

Worked Example: Video Watch History

A video streaming platform tracks watches. The raw table has: user_id, user_name, video_title, video_category, watched_at, duration_seconds. Let us check its normal form.

user_id	user_name	video_title	video_category	watched_at	duration_sec
U1	Alice	SQL Deep Dive	Education	2024-03-15 10:00	1800
U1	Alice	Python Tips	Education	2024-03-15 14:00	900
U2	Bob	SQL Deep Dive	Education	2024-03-16 09:00	1800

Check 1: No lists in cells. Passes 1NF. Check 2: If the PK is (user_id, video_title, watched_at), user_name depends only on user_id (partial dependency). Violates 2NF. Check 3: video_category depends on video_title, not on the PK (transitive dependency). Violates 3NF.

The fix: users table (user_id, user_name), videos table (video_id, title, category), watch_history junction table (user_id, video_id, watched_at, duration_sec).

users

PKuser_id

name

videos

PKvideo_id

title

Lesson Sections

Data Gets Out of Sync (concepts: dmThirdNormalForm)
Why Normalization Exists Normalization exists to prevent a specific class of production failure: data that contradicts itself. When the same fact is stored in multiple rows or multiple tables, any update must touch every copy. Miss one, and your database has two different answers to the same question. This is not a theoretical concern. It is the number one source of 'the numbers do not match' bugs in real systems. Here is the canonical example: an employee table that looks reasonable until a dep
First Normal Form (1NF) (concepts: dmFirstNormalForm)
1NF: No Lists in Cells A table is in first normal form when every cell contains a single atomic value. No comma-separated lists. No arrays stored as strings. No multiple values crammed into one column. This sounds obvious, but 1NF violations are everywhere in real data. A tags column with 'python, sql, spark'. A phone_numbers column with '555-1234, 555-5678'. These break filtering, indexing, and joining. Here is a 1NF violation: You cannot filter for 'students taking Physics' without string pars
Second Normal Form (2NF) (concepts: dmSecondNormalForm)
2NF: No Partial Dependencies on Composite Keys 2NF only matters when your primary key spans multiple columns. If your PK is a single column, your table automatically satisfies 2NF. The violation: a non-key column depends on only PART of the composite key, not the whole thing. Consider a store inventory table with a composite PK of (store_id, product_id). The table includes store_name and product_name alongside the quantity. store_name depends only on store_id, not on the full key (store_id, prod
Third Normal Form (3NF) (concepts: dmThirdNormalForm)
3NF: No Transitive Dependencies 3NF is the most practically important normal form because it catches the most common data duplication pattern: a column that depends on another non-key column instead of the primary key. This applies to every table, not just those with composite keys. Consider an orders table where each order has a customer_id, customer_name, and customer_city. customer_name and customer_city depend on customer_id, not on order_id. They are transitively dependent: order_id determi
Identifying Normal Form (concepts: dmThirdNormalForm)
The Three-Step Check When you look at any table and need to determine its normal form, apply these three checks in order. Stop at the first failure. In practice, 90% of real-world normalization problems are 3NF violations: a customer name stored on an orders table, a department location stored on an employee table, a product category stored on a sales table. These all have the same shape: an attribute that belongs to a related entity is denormalized onto the current table. When Not to Normalize: