Normalization

When the same piece of data lives in two places, it will eventually disagree with itself. A customer's address is stored in the orders table and the customers table. Someone updates one but not the other. Now your system has two truths. Normalization is the discipline of organizing data so that every fact is stored exactly once. This lesson teaches you how to recognize when data is duplicated, why that causes bugs, and how to restructure tables to eliminate the problem.

Data Gets Out of Sync

Daily Life
Interviews

Spot when duplicated data causes bugs

Why Normalization Exists

Normalization exists to prevent a specific class of production failure: data that contradicts itself. When the same fact is stored in multiple rows or multiple tables, any update must touch every copy. Miss one, and your database has two different answers to the same question. This is not a theoretical concern. It is the number one source of 'the numbers do not match' bugs in real systems.
Here is the canonical example: an employee table that looks reasonable until a department name changes.
emp_idnamedepartment_namedept_location
1AliceEngineeringFloor 3
2BobEngineeringFloor 3
3CarolEngineeringFloor 5
4DaveSalesFloor 1
Alice and Bob say Engineering is on Floor 3. Carol says Floor 5. Which is correct? Nobody knows without checking a separate source. This happened because someone updated Carol's row but not Alice's and Bob's. The data contradicts itself.
The fix: store 'Engineering' and its location in exactly one place. Create a departments table with department_name and dept_location. The employees table references it via dept_id. Now updating the department location means changing one row, not hunting through every employee.
departmentsPKdept_iddept_namelocationemployeesPKemp_idnameFKdept_id
TIP
If you can change one fact about the world (a department moves floors) and it requires updating more than one row, you have a normalization problem.

The Three Anomaly Types

Update anomaly: changing Engineering's floor requires updating every employee in Engineering. Miss one and the data contradicts itself.
Insert anomaly: you cannot record a new department until at least one employee is assigned to it (if departments only exist as employee attributes).
Delete anomaly: if Dave is the only Sales employee and you delete him, the Sales department and its location disappear from the database entirely.
All three anomalies are caused by the same root problem: data that should live in one place is duplicated across multiple rows.
All three are fixed the same way: separate the duplicated data into its own table and reference it via a foreign key.

First Normal Form (1NF)

Daily Life
Interviews

Ensure every cell holds a single value

1NF: No Lists in Cells

A table is in first normal form when every cell contains a single atomic value. No comma-separated lists. No arrays stored as strings. No multiple values crammed into one column. This sounds obvious, but 1NF violations are everywhere in real data. A tags column with 'python, sql, spark'. A phone_numbers column with '555-1234, 555-5678'. These break filtering, indexing, and joining.
Here is a 1NF violation:
student_idnamecourses
1AliceMath, Physics, Chemistry
2BobMath, Biology
You cannot filter for 'students taking Physics' without string parsing. You cannot count courses per student without splitting on commas. You cannot join to a courses table because the FK is a comma-separated string, not a single value.
The fix: one row per student-course combination.
studentsPKstudent_idnameenrollmentsFKstudent_idFKcourse_idcoursesPKcourse_idcourse_name
student_idcourse_id
1MATH101
1PHYS101
1CHEM101
2MATH101
2BIO101
Violates 1NF
  • Multiple values in one cell
  • Cannot filter without string parsing
  • Cannot JOIN to related tables
  • Cannot enforce referential integrity
Satisfies 1NF
  • One value per cell
  • Standard WHERE clause filtering
  • JOIN works with FK relationships
  • FK constraints enforce valid values
TIP
The most common 1NF violation in modern systems: storing JSON arrays as strings in a VARCHAR column. If you need to query individual elements of that array, either use a native ARRAY/JSON column type or normalize into a separate table.

Second Normal Form (2NF)

Daily Life
Interviews

Remove partial key dependencies

2NF: No Partial Dependencies on Composite Keys

2NF only matters when your primary key spans multiple columns. If your PK is a single column, your table automatically satisfies 2NF. The violation: a non-key column depends on only PART of the composite key, not the whole thing.
Consider a store inventory table with a composite PK of (store_id, product_id). The table includes store_name and product_name alongside the quantity.
store_idproduct_idstore_nameproduct_namequantity
S1P1DowntownWidget100
S1P2DowntownGadget50
S2P1UptownWidget75
store_name depends only on store_id, not on the full key (store_id, product_id). product_name depends only on product_id. These are partial dependencies. The problem: 'Downtown' is stored once per product at that store. Add a third product and you have three copies of 'Downtown'. Rename the store and you must update every row.
The fix: separate tables for stores and products. The inventory table keeps only the composite key and the quantity.
storesPKstore_idstore_nameinventoryFKstore_idFKproduct_idquantityproductsPKproduct_idproduct_name
Store names live in stores. Product names live in products. Each fact is stored exactly once. Updating a store name is a single-row update.
store_idproduct_idquantity
S1P1100
S1P250
S2P175
TIP
2NF violations only occur with composite keys. If every table in your schema uses a single-column surrogate key, 2NF is automatically satisfied. This is one reason surrogate keys are so popular in practice.

Third Normal Form (3NF)

Daily Life
Interviews

Eliminate columns that depend on non-keys

3NF: No Transitive Dependencies

3NF is the most practically important normal form because it catches the most common data duplication pattern: a column that depends on another non-key column instead of the primary key. This applies to every table, not just those with composite keys.
Consider an orders table where each order has a customer_id, customer_name, and customer_city. customer_name and customer_city depend on customer_id, not on order_id. They are transitively dependent: order_id determines customer_id, which determines customer_name.
order_idcustomer_idcustomer_namecustomer_cityamount
O1C1AliceSeattle$100
O2C1AliceSeattle$200
O3C2BobPortland$150
Alice's name and city are stored twice. If Alice moves to Portland, you must update every order row. Miss one and the data contradicts itself. The test for any non-key column: does this value depend directly on the primary key, or does it depend on another column that happens to be in this table?
order_id determines customer_id (direct dependency, fine).
customer_id determines customer_name (transitive dependency through customer_id, not order_id).
customer_id determines customer_city (same transitive dependency).
order_id determines amount (direct dependency, fine).
customersPKcustomer_idnamecityordersPKorder_idFKcustomer_idamount

Identifying Normal Form

Daily Life
Interviews

Diagnose which normal form a table meets

The Three-Step Check

When you look at any table and need to determine its normal form, apply these three checks in order. Stop at the first failure.
Check 1: Does any cell contain multiple values (comma-separated lists, arrays as strings)? If yes, the table violates 1NF.
Check 2: Is the PK composite? If so, does any non-key column depend on only part of the key? If yes, the table violates 2NF.
Check 3: Does any non-key column depend on another non-key column instead of the PK? If yes, the table violates 3NF.
In practice, 90% of real-world normalization problems are 3NF violations: a customer name stored on an orders table, a department location stored on an employee table, a product category stored on a sales table. These all have the same shape: an attribute that belongs to a related entity is denormalized onto the current table.

When Not to Normalize: OLTP vs OLAP

Normalization is the right choice for write-heavy transactional systems (OLTP). Every fact stored once means updates are safe and consistent. But for read-heavy analytical systems (OLAP), normalization creates excessive joins that slow queries down. A dashboard query that joins 8 normalized tables is slower than one that reads a single denormalized table.
The standard architecture: normalize in OLTP (source systems), denormalize in OLAP (data warehouse). The ETL pipeline transforms normalized source data into denormalized analytical tables optimized for read performance.
OLTPOLAPETL Bridge
OLTP
Normalize for Write Safety
Every fact stored once. Updates are safe. JOINs are acceptable because queries touch few rows at a time. Optimized for INSERT/UPDATE/DELETE.
OLAP
Denormalize for Read Speed
Pre-join dimensions into facts. Queries scan millions of rows. Fewer JOINs means faster queries. Optimized for SELECT with GROUP BY.
ETL Bridge
Transform Between Both
The pipeline reads normalized source data and writes denormalized analytical tables. This is where data engineers spend most of their time.
TIP
Start normalized. Denormalize specific tables only when you have measured a real performance problem. 'It might be slow' is not a reason to denormalize. 'This query takes 45 seconds because it joins 8 tables' is.

Worked Example: Video Watch History

A video streaming platform tracks watches. The raw table has: user_id, user_name, video_title, video_category, watched_at, duration_seconds. Let us check its normal form.
user_iduser_namevideo_titlevideo_categorywatched_atduration_sec
U1AliceSQL Deep DiveEducation2024-03-15 10:001800
U1AlicePython TipsEducation2024-03-15 14:00900
U2BobSQL Deep DiveEducation2024-03-16 09:001800
Check 1: No lists in cells. Passes 1NF. Check 2: If the PK is (user_id, video_title, watched_at), user_name depends only on user_id (partial dependency). Violates 2NF. Check 3: video_category depends on video_title, not on the PK (transitive dependency). Violates 3NF.
The fix: users table (user_id, user_name), videos table (video_id, title, category), watch_history junction table (user_id, video_id, watched_at, duration_sec).
usersPKuser_idnamewatch_historyFKuser_idFKvideo_idwatched_atduration_secvideosPKvideo_idtitlecategory
user_name is stored once. Each video title is stored once. The watch_history table contains only keys and measures. Every fact lives in exactly one place.
Normalized (OLTP)
  • 3 tables, each fact stored once
  • Update a user name: 1 row
  • Safe for concurrent writes
  • Requires JOINs for analytical queries
Denormalized (OLAP)
  • 1 wide table, facts duplicated
  • Update a user name: every watch row
  • Fast for read-heavy analytics
  • No JOINs needed, but updates are expensive
PUTTING IT ALL TOGETHER

> You find a sales table with: order_id, product_id, product_name, product_category, customer_id, customer_name, amount.

product_name and product_category depend on product_id, not order_id. Transitive dependency. Violates 3NF. Extract to a products table.
customer_name depends on customer_id, not order_id. Same pattern. Extract to a customers table.
The normalized orders table has: order_id, product_id, customer_id, amount. Each fact stored once. Updates to product names or customer names happen in one place.
KEY TAKEAWAYS
Normalization prevents data contradictions: every fact stored once eliminates update, insert, and delete anomalies
1NF: no lists in cells. Every column holds a single atomic value
2NF: no partial dependencies. Non-key columns depend on the entire composite key, not just part
3NF: no transitive dependencies. Non-key columns depend on the key and nothing but the key
Normalize in OLTP, denormalize in OLAP: the ETL pipeline bridges between write-safe sources and read-fast analytics

Normalization

Why copying data breaks everything

Category
Data Modeling
Duration
15 minutes
Challenges
12 hands-on challenges

Topics covered: Data Gets Out of Sync, First Normal Form (1NF), Second Normal Form (2NF), Third Normal Form (3NF), Identifying Normal Form

Lesson Sections

  1. Data Gets Out of Sync

    Why Normalization Exists Normalization exists to prevent a specific class of production failure: data that contradicts itself. When the same fact is stored in multiple rows or multiple tables, any update must touch every copy. Miss one, and your database has two different answers to the same question. This is not a theoretical concern. It is the number one source of 'the numbers do not match' bugs in real systems. Here is the canonical example: an employee table that looks reasonable until a dep

  2. First Normal Form (1NF) (concepts: dmFirstNormalForm)

    1NF: No Lists in Cells A table is in first normal form when every cell contains a single atomic value. No comma-separated lists. No arrays stored as strings. No multiple values crammed into one column. This sounds obvious, but 1NF violations are everywhere in real data. A tags column with 'python, sql, spark'. A phone_numbers column with '555-1234, 555-5678'. These break filtering, indexing, and joining. Here is a 1NF violation: You cannot filter for 'students taking Physics' without string pars

  3. Second Normal Form (2NF) (concepts: dmSecondNormalForm)

    2NF: No Partial Dependencies on Composite Keys 2NF only matters when your primary key spans multiple columns. If your PK is a single column, your table automatically satisfies 2NF. The violation: a non-key column depends on only PART of the composite key, not the whole thing. Consider a store inventory table with a composite PK of (store_id, product_id). The table includes store_name and product_name alongside the quantity. store_name depends only on store_id, not on the full key (store_id, prod

  4. Third Normal Form (3NF) (concepts: dmThirdNormalForm)

    3NF: No Transitive Dependencies 3NF is the most practically important normal form because it catches the most common data duplication pattern: a column that depends on another non-key column instead of the primary key. This applies to every table, not just those with composite keys. Consider an orders table where each order has a customer_id, customer_name, and customer_city. customer_name and customer_city depend on customer_id, not on order_id. They are transitively dependent: order_id determi

  5. Identifying Normal Form

    The Three-Step Check When you look at any table and need to determine its normal form, apply these three checks in order. Stop at the first failure. In practice, 90% of real-world normalization problems are 3NF violations: a customer name stored on an orders table, a department location stored on an employee table, a product category stored on a sales table. These all have the same shape: an attribute that belongs to a related entity is denormalized onto the current table. When Not to Normalize: