Data Normalization
1NF, 2NF, 3NF, BCNF
What is Data Normalization?
Database normalization is a technique for organizing your database structure in a manner that helps:
Enhance data integrity
Minimize data redundancy
What Is Data Integrity?
"Data integrity" refers to the accuracy and consistency of data over time. For example, if an employee's years_of_service is stored in a database, rather than their start_date, that data becomes incorrect automatically as time passes. It would be better to store the start date and calculate years of service as needed.
What Is Data Redundancy?
"Data redundancy" happens when identical data exists in multiple locations. For instance: saving the same document in several different folders on your computer. Data redundancy creates problems, particularly when updates occur in one location but not others, causing the information to become inconsistent across all copies.
candidate key, composite key, etc.
1 Normal Form (1NF)
To comply with first normal form, a database table must satisfy 2 requirements:
Each row must be uniquely identifiable by a primary key.
Each column must contain atomic values (no lists, sets, or nested structures within a single cell)
Example of NOT 1st Normal Form
jsmith
555-1234
Sales
jsmith
555-1234
Sales
kdavis
555-5678
Marketing
This table does not comply with 1NF. It contains duplicate rows, meaning there's no way to uniquely identify each individual row.
Example of 1st Normal Form
The most straightforward approach (though not the only solution) to achieve first normal form is to introduce a unique id column.
1
jsmith
555-1234
Sales
2
jsmith
555-1234
Sales
3
kdavis
555-5678
Marketing
It's important to note that if you establish a "primary key" by ensuring that multiple columns are always "unique together," that approach is equally valid.
You Should Almost Always Design Tables That Adhere to 1NF
First normal form is simply good practice. I've rarely encountered a well-designed database schema where each table isn't at minimum in first normal form.
Note: The example still shows duplicate data (same user twice), which illustrates that 1NF alone doesn't solve all data quality issues - it just ensures each row can be uniquely identified. Higher normal forms address the redundancy problem.
2 Normal Form (2NF)
A table in second normal form adheres to all the requirements of 1st normal form, plus one additional requirement that specifically applies to composite primary keys:
Every non-key column must depend on the complete primary key, not just a portion of it.
Example of 1st NF, but Not 2nd NF
In this table, the primary key is a composite of product_id + warehouse_id.
101
5
Electronics
50
101
8
Electronics
30
102
5
Furniture
15
This table does not comply with 2NF. The product_category column depends only on product_id, not on the full composite key (product_id + warehouse_id). This creates redundant data since the same product category repeats for every warehouse location.
Example of 2nd Normal Form
One approach to convert the table above to 2NF is to create a separate table that associates each product_id directly with its product_category. This eliminates duplication!
Inventory Table:
101
5
50
101
8
30
102
5
15
Products Table:
101
Electronics
102
Furniture
2NF Is Usually a Good Idea
You should probably default to maintaining your tables in second normal form. However, there are legitimate reasons to deviate from it, particularly for performance considerations. The tradeoff is that retrieving data from multiple tables through joins can take slightly longer.
My guideline is: Prioritize data integrity and eliminating duplication first. If you encounter performance bottlenecks, strategically denormalize as needed.
3 Normal Form (3NF)
A table in 3rd normal form adheres to all the requirements of 2nd normal form, plus one additional requirement:
Every non-key column must depend directly on the primary key, and nothing else.
Notice that this differs only slightly from second normal form. In second normal form we can't have a column dependent on only part of a composite primary key, while in third normal form we can't have a column dependent on any non-key column.
Example of 2nd NF, but Not 3rd NF
In this table, the primary key is simply the employee_id column.
1001
Engineering
Building A
75000
1002
Marketing
Building C
65000
1003
Engineering
Building A
80000
This table is in 2nd normal form because department_location is not dependent on part of the primary key (since the primary key is a single column). However, because department_location is dependent on the department_name column rather than the primary key directly, it doesn't comply with 3rd normal form.
Example of 3rd Normal Form
The approach to convert the table above to 3NF is to create a separate table that associates each department_name directly with its department_location. Notice how this solution parallels the 2NF approach.
Employees Table:
1001
Engineering
75000
1002
Marketing
65000
1003
Engineering
80000
Departments Table:
Engineering
Building A
Marketing
Building C
3NF Is Usually a Good Idea
The same guideline applies to both second and third normal forms.
Prioritize data integrity and eliminating duplication first by adhering to 3NF. If you encounter performance bottlenecks, strategically denormalize as needed.
Boyce-Codd Normal Form (BCNF)
A table in Boyce-Codd normal form (developed by Raymond F. Boyce and Edgar F. Codd) adheres to all the requirements of 3rd normal form, plus one additional requirement:
A column that's part of a candidate key cannot be entirely dependent on a column that's not part of any candidate key.
This situation only arises when there are multiple potential candidate keys that share overlapping columns. Another term for this is "overlapping candidate keys".
Only in uncommon scenarios does a table in third normal form fail to meet the requirements of Boyce-Codd normal form!
Example of 3rd NF, but Not Boyce-Codd NF
1001
CS101
Dr. Smith
A
1002
CS101
Dr. Smith
B
1001
MATH200
Prof. Johnson
A
1003
PHYS150
Dr. Smith
C
The notable aspect here is that there are 2 possible candidate keys:
student_id+course_codestudent_id+instructor_name(assuming each student can only take one course per instructor)
This means that by definition this table is in 2nd and 3rd normal form because those forms only restrict dependencies for columns that are not part of any candidate key.
This table is not in Boyce-Codd normal form because instructor_name (part of a candidate key) is entirely dependent on course_code (assuming each course has only one instructor).
Example of Boyce-Codd Normal Form
The most effective way to resolve the issue in our example is to separate the instructor assignment into its own table.
Enrollments Table:
1001
CS101
A
1002
CS101
B
1001
MATH200
A
1003
PHYS150
C
Courses Table:
CS101
Dr. Smith
MATH200
Prof. Johnson
PHYS150
Dr. Smith
BCNF Is Usually a Good Idea
The same guideline applies to 2nd, 3rd, and Boyce-Codd normal forms. That said, it's uncommon you'll encounter BCNF-specific violations in practice.
Prioritize data integrity and eliminating duplication first by adhering to Boyce-Codd normal form. If you encounter performance bottlenecks, strategically denormalize as needed.
Last updated