# Data Normalization

## What is Data Normalization?

Database normalization is a technique for organizing your database structure in a manner that helps:

* Enhance data integrity
* Minimize data redundancy

#### **What Is Data Integrity?**

"Data integrity" refers to the accuracy and consistency of data over time. For example, if an employee's *years\_of\_service* is stored in a database, rather than their *start\_date*, that data becomes incorrect automatically as time passes. It would be better to *store* the start date and *calculate* years of service as needed.

#### **What Is Data Redundancy?**

"Data redundancy" happens when identical data exists in multiple locations. For instance: saving the same document in several different folders on your computer. Data redundancy creates problems, particularly when updates occur in one location but not others, causing the information to become inconsistent across all copies.

***

candidate key, composite key, etc.

## 1 Normal Form (**1NF**)

To comply with first normal form, a database table **must satisfy 2 requirements:**

* Each row must be uniquely identifiable by a primary key.
* Each column must contain atomic values (no lists, sets, or nested structures within a single cell)

#### **Example of NOT 1st Normal Form**

| username | phone\_number | department |
| -------- | ------------- | ---------- |
| jsmith   | 555-1234      | Sales      |
| jsmith   | 555-1234      | Sales      |
| kdavis   | 555-5678      | Marketing  |

This table does *not* comply with 1NF. It contains duplicate rows, meaning there's no way to uniquely identify each individual row.

#### **Example of 1st Normal Form**

The most straightforward approach (though not the only solution) to achieve first normal form is to introduce a unique `id` column.

| id | username | phone\_number | department |
| -- | -------- | ------------- | ---------- |
| 1  | jsmith   | 555-1234      | Sales      |
| 2  | jsmith   | 555-1234      | Sales      |
| 3  | kdavis   | 555-5678      | Marketing  |

```sql
CREATE TABLE employees (
    id INT PRIMARY KEY,
    username VARCHAR(50),
    phone_number VARCHAR(20),
    department VARCHAR(50)
);

```

It's important to note that if you establish a "primary key" by ensuring that multiple columns are always "unique together," that approach is equally valid.

**You Should&#x20;*****Almost*****&#x20;Always Design Tables That Adhere to 1NF**

First normal form is simply *good practice*. I've *rarely* encountered a well-designed database schema where each table isn't *at minimum* in first normal form.

**Note:** The example still shows duplicate data (same user twice), which illustrates that 1NF alone doesn't solve all data quality issues - it just ensures each row can be uniquely identified. Higher normal forms address the redundancy problem.

***

## 2 Normal Form (2NF)

A table in second normal form adheres to all the requirements of **1st normal form, plus one&#x20;*****additional*****&#x20;requirement** that specifically applies to composite primary keys:

* Every non-key column must depend on the *complete* primary key, not just a portion of it.

#### **Example of 1st NF, but Not 2nd NF**

In this table, the primary key is a composite of `product_id` + `warehouse_id`.

| product\_id | warehouse\_id | product\_category | stock\_quantity |
| ----------- | ------------- | ----------------- | --------------- |
| 101         | 5             | Electronics       | 50              |
| 101         | 8             | Electronics       | 30              |
| 102         | 5             | Furniture         | 15              |

```sql
CREATE TABLE inventory (
    product_id INT,
    warehouse_id INT,
    product_category VARCHAR(50),
    stock_quantity INT,
    PRIMARY KEY (product_id, warehouse_id)
);

```

This table does *not* comply with 2NF. The `product_category` column depends *only* on `product_id`, not on the full composite key (`product_id` + `warehouse_id`). This creates *redundant* data since the same product category repeats for every warehouse location.

#### **Example of 2nd Normal Form**

One approach to convert the table above to 2NF is to create a separate table that associates each `product_id` directly with its `product_category`. This eliminates duplication!

**Inventory Table:**

| product\_id | warehouse\_id | stock\_quantity |
| ----------- | ------------- | --------------- |
| 101         | 5             | 50              |
| 101         | 8             | 30              |
| 102         | 5             | 15              |

```sql
CREATE TABLE inventory (
    product_id INT,
    warehouse_id INT,
    stock_quantity INT,
    PRIMARY KEY (product_id, warehouse_id),
    FOREIGN KEY (product_id) REFERENCES products(product_id)
);

```

**Products Table:**

| product\_id | product\_category |
| ----------- | ----------------- |
| 101         | Electronics       |
| 102         | Furniture         |

```sql
CREATE TABLE products (
    product_id INT PRIMARY KEY,
    product_category VARCHAR(50)
);

```

**2NF Is&#x20;*****Usually*****&#x20;a Good Idea**

You should probably *default* to maintaining your tables in second normal form. However, there are legitimate reasons to deviate from it, particularly for *performance* considerations. The tradeoff is that retrieving data from multiple tables through joins can take slightly longer.

My guideline is: Prioritize data integrity and eliminating duplication *first*. If you encounter performance bottlenecks, strategically denormalize as needed.

***

## 3 Normal Form (3NF)

A table in 3rd normal form adheres to all the requirements of 2nd normal form, plus one *additional* requirement:

* Every non-key column must depend *directly* on the primary key, and nothing else.

Notice that this differs only *slightly* from second normal form. In second normal form we can't have a column dependent on only *part* of a composite primary key, while in third normal form we can't have a column dependent on *any non-key column*.

#### **Example of 2nd NF, but Not 3rd NF**

In this table, the primary key is simply the `employee_id` column.

| employee\_id | department\_name | department\_location | salary |
| ------------ | ---------------- | -------------------- | ------ |
| 1001         | Engineering      | Building A           | 75000  |
| 1002         | Marketing        | Building C           | 65000  |
| 1003         | Engineering      | Building A           | 80000  |

```sql
CREATE TABLE employees (
    employee_id INT PRIMARY KEY,
    department_name VARCHAR(50),
    department_location VARCHAR(50),
    salary DECIMAL(10, 2)
);

```

This table *is* in 2nd normal form because `department_location` is *not* dependent on part of the primary key (since the primary key is a single column). However, because `department_location` *is* dependent on the `department_name` column rather than the primary key directly, it doesn't comply with 3rd normal form.

#### **Example of 3rd Normal Form**

The approach to convert the table above to 3NF is to create a separate table that associates each `department_name` directly with its `department_location`. Notice how this solution parallels the 2NF approach.

**Employees Table:**

| employee\_id | department\_name | salary |
| ------------ | ---------------- | ------ |
| 1001         | Engineering      | 75000  |
| 1002         | Marketing        | 65000  |
| 1003         | Engineering      | 80000  |

```sql
CREATE TABLE employees (
    employee_id INT PRIMARY KEY,
    department_name VARCHAR(50),
    salary DECIMAL(10, 2),
    FOREIGN KEY (department_name) REFERENCES departments(department_name)
);

```

**Departments Table:**

| department\_name | department\_location |
| ---------------- | -------------------- |
| Engineering      | Building A           |
| Marketing        | Building C           |

```sql
CREATE TABLE departments (
    department_name VARCHAR(50) PRIMARY KEY,
    department_location VARCHAR(50)
);

```

**3NF Is&#x20;*****Usually*****&#x20;a Good Idea**

The same guideline applies to both second and third normal forms.

Prioritize data integrity and eliminating duplication *first* by adhering to 3NF. If you encounter performance bottlenecks, strategically denormalize as needed.

***

## Boyce-Codd Normal Form (BCNF)

A table in Boyce-Codd normal form (developed by Raymond F. Boyce and Edgar F. Codd) adheres to all the requirements of 3rd normal form, plus one *additional* requirement:

* A column that's part of a candidate key cannot be *entirely* dependent on a column that's *not* part of any candidate key.

This situation only arises when there are multiple potential candidate keys that share overlapping columns. Another term for this is "overlapping candidate keys".

Only in uncommon scenarios does a table in third normal form *fail* to meet the requirements of Boyce-Codd normal form!

#### **Example of 3rd NF, but Not Boyce-Codd NF**

| student\_id | course\_code | instructor\_name | grade |
| ----------- | ------------ | ---------------- | ----- |
| 1001        | CS101        | Dr. Smith        | A     |
| 1002        | CS101        | Dr. Smith        | B     |
| 1001        | MATH200      | Prof. Johnson    | A     |
| 1003        | PHYS150      | Dr. Smith        | C     |

```sql
CREATE TABLE enrollments (
    student_id INT,
    course_code VARCHAR(20),
    instructor_name VARCHAR(100),
    grade CHAR(1)
);

```

The notable aspect here is that there are 2 possible candidate keys:

* `student_id` + `course_code`
* `student_id` + `instructor_name` (assuming each student can only take one course per instructor)

This means that by definition this table *is* in 2nd and 3rd normal form because those forms only restrict dependencies for columns that are *not* part of any candidate key.

This table is *not* in Boyce-Codd normal form because `instructor_name` (part of a candidate key) is entirely dependent on `course_code` (assuming each course has only one instructor).

#### **Example of Boyce-Codd Normal Form**

The most effective way to resolve the issue in our example is to separate the instructor assignment into its own table.

**Enrollments Table:**

| student\_id | course\_code | grade |
| ----------- | ------------ | ----- |
| 1001        | CS101        | A     |
| 1002        | CS101        | B     |
| 1001        | MATH200      | A     |
| 1003        | PHYS150      | C     |

```sql
CREATE TABLE enrollments (
    student_id INT,
    course_code VARCHAR(20),
    grade CHAR(1),
    PRIMARY KEY (student_id, course_code),
    FOREIGN KEY (course_code) REFERENCES courses(course_code)
);

```

**Courses Table:**

| course\_code | instructor\_name |
| ------------ | ---------------- |
| CS101        | Dr. Smith        |
| MATH200      | Prof. Johnson    |
| PHYS150      | Dr. Smith        |

```sql
CREATE TABLE courses (
    course_code VARCHAR(20) PRIMARY KEY,
    instructor_name VARCHAR(100)
);

```

**BCNF Is&#x20;*****Usually*****&#x20;a Good Idea**

The same guideline applies to 2nd, 3rd, and Boyce-Codd normal forms. That said, it's uncommon you'll encounter BCNF-specific violations in practice.

Prioritize data integrity and eliminating duplication *first* by adhering to Boyce-Codd normal form. If you encounter performance bottlenecks, strategically denormalize as needed.

***
