When the same piece of data lives in two places, it will eventually disagree with itself. A customer's address is stored in the orders table and the customers table. Someone updates one but not the other. Now your system has two truths. Normalization is the discipline of organizing data so that every fact is stored exactly once. This lesson teaches you how to recognize when data is duplicated, why that causes bugs, and how to restructure tables to eliminate the problem.
Data Gets Out of Sync
Daily Life
Interviews
Spot when duplicated data causes bugs
Why Normalization Exists
Normalization exists to prevent a specific class of production failure: data that contradicts itself. When the same fact is stored in multiple rows or multiple tables, any update must touch every copy. Miss one, and your database has two different answers to the same question. This is not a theoretical concern. It is the number one source of 'the numbers do not match' bugs in real systems.
Here is the canonical example: an employee table that looks reasonable until a department name changes.
emp_id
name
department_name
dept_location
1
Alice
Engineering
Floor 3
2
Bob
Engineering
Floor 3
3
Carol
Engineering
Floor 5
4
Dave
Sales
Floor 1
Alice and Bob say Engineering is on Floor 3. Carol says Floor 5. Which is correct? Nobody knows without checking a separate source. This happened because someone updated Carol's row but not Alice's and Bob's. The data contradicts itself.
The fix: store 'Engineering' and its location in exactly one place. Create a departments table with department_name and dept_location. The employees table references it via dept_id. Now updating the department location means changing one row, not hunting through every employee.
TIP
If you can change one fact about the world (a department moves floors) and it requires updating more than one row, you have a normalization problem.
The Three Anomaly Types
Update anomaly: changing Engineering's floor requires updating every employee in Engineering. Miss one and the data contradicts itself.
Insert anomaly: you cannot record a new department until at least one employee is assigned to it (if departments only exist as employee attributes).
Delete anomaly: if Dave is the only Sales employee and you delete him, the Sales department and its location disappear from the database entirely.
All three anomalies are caused by the same root problem: data that should live in one place is duplicated across multiple rows.
All three are fixed the same way: separate the duplicated data into its own table and reference it via a foreign key.
First Normal Form (1NF)
Daily Life
Interviews
Ensure every cell holds a single value
1NF: No Lists in Cells
A table is in first normal form when every cell contains a single atomic value. No comma-separated lists. No arrays stored as strings. No multiple values crammed into one column. This sounds obvious, but 1NF violations are everywhere in real data. A tags column with 'python, sql, spark'. A phone_numbers column with '555-1234, 555-5678'. These break filtering, indexing, and joining.
Here is a 1NF violation:
student_id
name
courses
1
Alice
Math, Physics, Chemistry
2
Bob
Math, Biology
You cannot filter for 'students taking Physics' without string parsing. You cannot count courses per student without splitting on commas. You cannot join to a courses table because the FK is a comma-separated string, not a single value.
The fix: one row per student-course combination.
student_id
course_id
1
MATH101
1
PHYS101
1
CHEM101
2
MATH101
2
BIO101
✗Violates 1NF
Multiple values in one cell
Cannot filter without string parsing
Cannot JOIN to related tables
Cannot enforce referential integrity
✓Satisfies 1NF
One value per cell
Standard WHERE clause filtering
JOIN works with FK relationships
FK constraints enforce valid values
TIP
The most common 1NF violation in modern systems: storing JSON arrays as strings in a VARCHAR column. If you need to query individual elements of that array, either use a native ARRAY/JSON column type or normalize into a separate table.
Second Normal Form (2NF)
Daily Life
Interviews
Remove partial key dependencies
2NF: No Partial Dependencies on Composite Keys
2NF only matters when your primary key spans multiple columns. If your PK is a single column, your table automatically satisfies 2NF. The violation: a non-key column depends on only PART of the composite key, not the whole thing.
Consider a store inventory table with a composite PK of (store_id, product_id). The table includes store_name and product_name alongside the quantity.
store_id
product_id
store_name
product_name
quantity
S1
P1
Downtown
Widget
100
S1
P2
Downtown
Gadget
50
S2
P1
Uptown
Widget
75
store_name depends only on store_id, not on the full key (store_id, product_id). product_name depends only on product_id. These are partial dependencies. The problem: 'Downtown' is stored once per product at that store. Add a third product and you have three copies of 'Downtown'. Rename the store and you must update every row.
The fix: separate tables for stores and products. The inventory table keeps only the composite key and the quantity.
Store names live in stores. Product names live in products. Each fact is stored exactly once. Updating a store name is a single-row update.
store_id
product_id
quantity
S1
P1
100
S1
P2
50
S2
P1
75
TIP
2NF violations only occur with composite keys. If every table in your schema uses a single-column surrogate key, 2NF is automatically satisfied. This is one reason surrogate keys are so popular in practice.
Third Normal Form (3NF)
Daily Life
Interviews
Eliminate columns that depend on non-keys
3NF: No Transitive Dependencies
3NF is the most practically important normal form because it catches the most common data duplication pattern: a column that depends on another non-key column instead of the primary key. This applies to every table, not just those with composite keys.
Consider an orders table where each order has a customer_id, customer_name, and customer_city. customer_name and customer_city depend on customer_id, not on order_id. They are transitively dependent: order_id determines customer_id, which determines customer_name.
order_id
customer_id
customer_name
customer_city
amount
O1
C1
Alice
Seattle
$100
O2
C1
Alice
Seattle
$200
O3
C2
Bob
Portland
$150
Alice's name and city are stored twice. If Alice moves to Portland, you must update every order row. Miss one and the data contradicts itself. The test for any non-key column: does this value depend directly on the primary key, or does it depend on another column that happens to be in this table?
When you look at any table and need to determine its normal form, apply these three checks in order. Stop at the first failure.
Check 1: Does any cell contain multiple values (comma-separated lists, arrays as strings)? If yes, the table violates 1NF.
Check 2: Is the PK composite? If so, does any non-key column depend on only part of the key? If yes, the table violates 2NF.
Check 3: Does any non-key column depend on another non-key column instead of the PK? If yes, the table violates 3NF.
In practice, 90% of real-world normalization problems are 3NF violations: a customer name stored on an orders table, a department location stored on an employee table, a product category stored on a sales table. These all have the same shape: an attribute that belongs to a related entity is denormalized onto the current table.
When Not to Normalize: OLTP vs OLAP
Normalization is the right choice for write-heavy transactional systems (OLTP). Every fact stored once means updates are safe and consistent. But for read-heavy analytical systems (OLAP), normalization creates excessive joins that slow queries down. A dashboard query that joins 8 normalized tables is slower than one that reads a single denormalized table.
The standard architecture: normalize in OLTP (source systems), denormalize in OLAP (data warehouse). The ETL pipeline transforms normalized source data into denormalized analytical tables optimized for read performance.
OLTPOLAPETL Bridge
OLTP
Normalize for Write Safety
Every fact stored once. Updates are safe. JOINs are acceptable because queries touch few rows at a time. Optimized for INSERT/UPDATE/DELETE.
OLAP
Denormalize for Read Speed
Pre-join dimensions into facts. Queries scan millions of rows. Fewer JOINs means faster queries. Optimized for SELECT with GROUP BY.
ETL Bridge
Transform Between Both
The pipeline reads normalized source data and writes denormalized analytical tables. This is where data engineers spend most of their time.
TIP
Start normalized. Denormalize specific tables only when you have measured a real performance problem. 'It might be slow' is not a reason to denormalize. 'This query takes 45 seconds because it joins 8 tables' is.
Worked Example: Video Watch History
A video streaming platform tracks watches. The raw table has: user_id, user_name, video_title, video_category, watched_at, duration_seconds. Let us check its normal form.
user_id
user_name
video_title
video_category
watched_at
duration_sec
U1
Alice
SQL Deep Dive
Education
2024-03-15 10:00
1800
U1
Alice
Python Tips
Education
2024-03-15 14:00
900
U2
Bob
SQL Deep Dive
Education
2024-03-16 09:00
1800
Check 1: No lists in cells. Passes 1NF. Check 2: If the PK is (user_id, video_title, watched_at), user_name depends only on user_id (partial dependency). Violates 2NF. Check 3: video_category depends on video_title, not on the PK (transitive dependency). Violates 3NF.
user_name is stored once. Each video title is stored once. The watch_history table contains only keys and measures. Every fact lives in exactly one place.
✓Normalized (OLTP)
3 tables, each fact stored once
Update a user name: 1 row
Safe for concurrent writes
Requires JOINs for analytical queries
✓Denormalized (OLAP)
1 wide table, facts duplicated
Update a user name: every watch row
Fast for read-heavy analytics
No JOINs needed, but updates are expensive
❯❯❯PUTTING IT ALL TOGETHER
> You find a sales table with: order_id, product_id, product_name, product_category, customer_id, customer_name, amount.
product_name and product_category depend on product_id, not order_id. Transitive dependency. Violates 3NF. Extract to a products table.
customer_name depends on customer_id, not order_id. Same pattern. Extract to a customers table.
The normalized orders table has: order_id, product_id, customer_id, amount. Each fact stored once. Updates to product names or customer names happen in one place.
KEY TAKEAWAYS
Normalization prevents data contradictions: every fact stored once eliminates update, insert, and delete anomalies
1NF: no lists in cells. Every column holds a single atomic value
2NF: no partial dependencies. Non-key columns depend on the entire composite key, not just part
3NF: no transitive dependencies. Non-key columns depend on the key and nothing but the key
Normalize in OLTP, denormalize in OLAP: the ETL pipeline bridges between write-safe sources and read-fast analytics
Normalization
Why copying data breaks everything
Category
Data Modeling
Duration
15 minutes
Challenges
12 hands-on challenges
Topics covered: Data Gets Out of Sync, First Normal Form (1NF), Second Normal Form (2NF), Third Normal Form (3NF), Identifying Normal Form
Why Normalization Exists Normalization exists to prevent a specific class of production failure: data that contradicts itself. When the same fact is stored in multiple rows or multiple tables, any update must touch every copy. Miss one, and your database has two different answers to the same question. This is not a theoretical concern. It is the number one source of 'the numbers do not match' bugs in real systems. Here is the canonical example: an employee table that looks reasonable until a dep
1NF: No Lists in Cells A table is in first normal form when every cell contains a single atomic value. No comma-separated lists. No arrays stored as strings. No multiple values crammed into one column. This sounds obvious, but 1NF violations are everywhere in real data. A tags column with 'python, sql, spark'. A phone_numbers column with '555-1234, 555-5678'. These break filtering, indexing, and joining. Here is a 1NF violation: You cannot filter for 'students taking Physics' without string pars
2NF: No Partial Dependencies on Composite Keys 2NF only matters when your primary key spans multiple columns. If your PK is a single column, your table automatically satisfies 2NF. The violation: a non-key column depends on only PART of the composite key, not the whole thing. Consider a store inventory table with a composite PK of (store_id, product_id). The table includes store_name and product_name alongside the quantity. store_name depends only on store_id, not on the full key (store_id, prod
3NF: No Transitive Dependencies 3NF is the most practically important normal form because it catches the most common data duplication pattern: a column that depends on another non-key column instead of the primary key. This applies to every table, not just those with composite keys. Consider an orders table where each order has a customer_id, customer_name, and customer_city. customer_name and customer_city depend on customer_id, not on order_id. They are transitively dependent: order_id determi
The Three-Step Check When you look at any table and need to determine its normal form, apply these three checks in order. Stop at the first failure. In practice, 90% of real-world normalization problems are 3NF violations: a customer name stored on an orders table, a department location stored on an employee table, a product category stored on a sales table. These all have the same shape: an attribute that belongs to a related entity is denormalized onto the current table. When Not to Normalize: