Data Normalization

1NF, 2NF, 3NF, BCNF

What is Data Normalization?

Database normalization is a technique for organizing your database structure in a manner that helps:

  • Enhance data integrity

  • Minimize data redundancy

What Is Data Integrity?

"Data integrity" refers to the accuracy and consistency of data over time. For example, if an employee's years_of_service is stored in a database, rather than their start_date, that data becomes incorrect automatically as time passes. It would be better to store the start date and calculate years of service as needed.

What Is Data Redundancy?

"Data redundancy" happens when identical data exists in multiple locations. For instance: saving the same document in several different folders on your computer. Data redundancy creates problems, particularly when updates occur in one location but not others, causing the information to become inconsistent across all copies.


candidate key, composite key, etc.

1 Normal Form (1NF)

To comply with first normal form, a database table must satisfy 2 requirements:

  • Each row must be uniquely identifiable by a primary key.

  • Each column must contain atomic values (no lists, sets, or nested structures within a single cell)

Example of NOT 1st Normal Form

username
phone_number
department

jsmith

555-1234

Sales

jsmith

555-1234

Sales

kdavis

555-5678

Marketing

This table does not comply with 1NF. It contains duplicate rows, meaning there's no way to uniquely identify each individual row.

Example of 1st Normal Form

The most straightforward approach (though not the only solution) to achieve first normal form is to introduce a unique id column.

id
username
phone_number
department

1

jsmith

555-1234

Sales

2

jsmith

555-1234

Sales

3

kdavis

555-5678

Marketing

It's important to note that if you establish a "primary key" by ensuring that multiple columns are always "unique together," that approach is equally valid.

You Should Almost Always Design Tables That Adhere to 1NF

First normal form is simply good practice. I've rarely encountered a well-designed database schema where each table isn't at minimum in first normal form.

Note: The example still shows duplicate data (same user twice), which illustrates that 1NF alone doesn't solve all data quality issues - it just ensures each row can be uniquely identified. Higher normal forms address the redundancy problem.


2 Normal Form (2NF)

A table in second normal form adheres to all the requirements of 1st normal form, plus one additional requirement that specifically applies to composite primary keys:

  • Every non-key column must depend on the complete primary key, not just a portion of it.

Example of 1st NF, but Not 2nd NF

In this table, the primary key is a composite of product_id + warehouse_id.

product_id
warehouse_id
product_category
stock_quantity

101

5

Electronics

50

101

8

Electronics

30

102

5

Furniture

15

This table does not comply with 2NF. The product_category column depends only on product_id, not on the full composite key (product_id + warehouse_id). This creates redundant data since the same product category repeats for every warehouse location.

Example of 2nd Normal Form

One approach to convert the table above to 2NF is to create a separate table that associates each product_id directly with its product_category. This eliminates duplication!

Inventory Table:

product_id
warehouse_id
stock_quantity

101

5

50

101

8

30

102

5

15

Products Table:

product_id
product_category

101

Electronics

102

Furniture

2NF Is Usually a Good Idea

You should probably default to maintaining your tables in second normal form. However, there are legitimate reasons to deviate from it, particularly for performance considerations. The tradeoff is that retrieving data from multiple tables through joins can take slightly longer.

My guideline is: Prioritize data integrity and eliminating duplication first. If you encounter performance bottlenecks, strategically denormalize as needed.


3 Normal Form (3NF)

A table in 3rd normal form adheres to all the requirements of 2nd normal form, plus one additional requirement:

  • Every non-key column must depend directly on the primary key, and nothing else.

Notice that this differs only slightly from second normal form. In second normal form we can't have a column dependent on only part of a composite primary key, while in third normal form we can't have a column dependent on any non-key column.

Example of 2nd NF, but Not 3rd NF

In this table, the primary key is simply the employee_id column.

employee_id
department_name
department_location
salary

1001

Engineering

Building A

75000

1002

Marketing

Building C

65000

1003

Engineering

Building A

80000

This table is in 2nd normal form because department_location is not dependent on part of the primary key (since the primary key is a single column). However, because department_location is dependent on the department_name column rather than the primary key directly, it doesn't comply with 3rd normal form.

Example of 3rd Normal Form

The approach to convert the table above to 3NF is to create a separate table that associates each department_name directly with its department_location. Notice how this solution parallels the 2NF approach.

Employees Table:

employee_id
department_name
salary

1001

Engineering

75000

1002

Marketing

65000

1003

Engineering

80000

Departments Table:

department_name
department_location

Engineering

Building A

Marketing

Building C

3NF Is Usually a Good Idea

The same guideline applies to both second and third normal forms.

Prioritize data integrity and eliminating duplication first by adhering to 3NF. If you encounter performance bottlenecks, strategically denormalize as needed.


Boyce-Codd Normal Form (BCNF)

A table in Boyce-Codd normal form (developed by Raymond F. Boyce and Edgar F. Codd) adheres to all the requirements of 3rd normal form, plus one additional requirement:

  • A column that's part of a candidate key cannot be entirely dependent on a column that's not part of any candidate key.

This situation only arises when there are multiple potential candidate keys that share overlapping columns. Another term for this is "overlapping candidate keys".

Only in uncommon scenarios does a table in third normal form fail to meet the requirements of Boyce-Codd normal form!

Example of 3rd NF, but Not Boyce-Codd NF

student_id
course_code
instructor_name
grade

1001

CS101

Dr. Smith

A

1002

CS101

Dr. Smith

B

1001

MATH200

Prof. Johnson

A

1003

PHYS150

Dr. Smith

C

The notable aspect here is that there are 2 possible candidate keys:

  • student_id + course_code

  • student_id + instructor_name (assuming each student can only take one course per instructor)

This means that by definition this table is in 2nd and 3rd normal form because those forms only restrict dependencies for columns that are not part of any candidate key.

This table is not in Boyce-Codd normal form because instructor_name (part of a candidate key) is entirely dependent on course_code (assuming each course has only one instructor).

Example of Boyce-Codd Normal Form

The most effective way to resolve the issue in our example is to separate the instructor assignment into its own table.

Enrollments Table:

student_id
course_code
grade

1001

CS101

A

1002

CS101

B

1001

MATH200

A

1003

PHYS150

C

Courses Table:

course_code
instructor_name

CS101

Dr. Smith

MATH200

Prof. Johnson

PHYS150

Dr. Smith

BCNF Is Usually a Good Idea

The same guideline applies to 2nd, 3rd, and Boyce-Codd normal forms. That said, it's uncommon you'll encounter BCNF-specific violations in practice.

Prioritize data integrity and eliminating duplication first by adhering to Boyce-Codd normal form. If you encounter performance bottlenecks, strategically denormalize as needed.


Last updated