img of Database Theory - Normalization
Database

Database Theory - Normalization

Feb 23, 2025
12 min read

When I first started working with databases, I often heard about “normalization,” but it felt like one of those theoretical concepts that didn’t matter much in real-world applications. However, as I began designing more complex systems, I quickly realized that without proper normalization, databases could become a mess—full of redundant data, inconsistencies, and performance issues.

So, what exactly is database normalization? In simple terms, it’s the process of structuring a database to reduce redundancy and improve data integrity. It follows a set of rules called “normal forms,” which guide how data should be organized. Each normal form builds on the previous one, fixing problems and making the database more efficient.

In this article, I’ll walk you through each level of normalization with practical examples.

First Normal Form (1NF) – Eliminating Duplicates and Ensuring Atomicity

The First Normal Form is all about making sure each column and each row in a table is used correctly. In other words, 1NF requires that each cell in a table holds a single, indivisible value, and that each record (row) is unique. This means no repeating groups of columns and no nested lists in a single column.

For example, if a person has multiple phone numbers, we handle that by creating additional rows or a separate table rather than packing them into a single column. Ensuring 1NF is the foundation of a well-structured database because it fosters clarity and consistency right from the start.

Imagine we have a table to store customer information. Here’s a poor design that violates 1NF:

CustomerIDNamePhoneNumbersAddress
1Alice Lee555-1234, 555-5678123 Elm Street, Springfield
2Bob King555-7890456 Oak Avenue, Maple Town

Notice how the “PhoneNumbers” column contains multiple numbers in the first row separated by a comma. This breaks 1NF because the phone numbers are not atomic values. To fix this, we could create a new table or additional rows. One simple method is to create another row for each additional phone number:

CustomerIDNamePhoneNumberAddress
1Alice Lee555-1234123 Elm Street, Springfield
1Alice Lee555-5678123 Elm Street, Springfield
2Bob King555-7890456 Oak Avenue, Maple Town

Now each field holds a single value, and each row is unique. This satisfies the conditions of 1NF.

Second Normal Form (2NF) – Eliminating Partial Dependencies

After ensuring your table meets 1NF, the next step is the Second Normal Form, but only for tables that have a composite primary key. A composite primary key means the table’s primary key is made up of two or more columns. 2NF requires that all non-key attributes in a table are fully functionally dependent on the entire primary key, not just part of it.

Think of it this way: If a table’s primary key is (A, B), then each column that isn’t part of the key should depend on both A and B for its value, rather than just A or just B. If you find that a non-key column depends on only one part of the composite key, you’ve got a partial dependency, which is what 2NF aims to eliminate. A classic symptom of a partial dependency is data repetition. For example, if you have a table combining “orders” and “products,” you might store product details in the same table. If that product detail depends only on the product ID (one part of the key) and not the order ID, you’re violating 2NF.

Let’s say we have a table called OrderDetails that tries to store all details about an order and the product in the same place. Its primary key might be a combination of OrderID and ProductID:

OrderIDProductIDProductNameQuantityPriceSupplier
10015001Blue Widget53.99WidgetCo
10015002Red Widget24.99WidgetCo
10025001Blue Widget33.99WidgetCo

Notice that ProductName and Supplier really only depend on ProductID, not the combination of (OrderID, ProductID). That means they’re partially dependent on the primary key. To satisfy 2NF, we remove ProductName and Supplier to a separate Products table where ProductID is the primary key. OrderDetails would then store only data specific to the combination of OrderID and ProductID, like Quantity and Price (if price can vary per order or date). The new design might look like this:

Products Table

ProductIDProductNameSupplier
5001Blue WidgetWidgetCo
5002Red WidgetWidgetCo

OrderDetails Table

OrderIDProductIDQuantityPrice
1001500153.99
1001500224.99
1002500133.99

This separation eliminates the partial dependency and ensures that any information about the product is fully dependent on its own primary key in the Products table.

Third Normal Form (3NF) – Eliminating Transitive Dependencies

Once your tables meet 2NF (or 1NF if you don’t have composite keys), it’s time to aim for the Third Normal Form. 3NF states that every non-key column must depend directly on the primary key, and not on any other non-key attribute. This means no transitive dependencies, where a column depends on another column that isn’t the primary key.

I like to think of 3NF as the final push to make sure data is stored only once, in the right table. With 3NF, if an attribute can be derived from another attribute, or if it logically belongs to a different entity, that data should be moved. This helps avoid anomalies where updating one attribute might require updating multiple rows or tables. For example, in an “employees” table, if you store the department name along with the department ID, but the department name also depends on the department ID (not the employee), that’s a transitive dependency. You’d want to store department name in a separate “departments” table to satisfy 3NF.

Suppose we have an Employees table:

EmployeeIDEmployeeNameDepartmentIDDepartmentNameLocation
1Sarah Lopez10AccountingBuilding A
2John Smith20MarketingBuilding C
3Raj Verma10AccountingBuilding A

Here, DepartmentName depends on DepartmentID, not on EmployeeID. This is a transitive dependency because the primary key is EmployeeID, and DepartmentName isn’t directly related to EmployeeID; it’s related to DepartmentID. In 3NF, we move that department information to its own table.

Departments Table

DepartmentIDDepartmentNameLocation
10AccountingBuilding A
20MarketingBuilding C

Employees Table

EmployeeIDEmployeeNameDepartmentID
1Sarah Lopez10
2John Smith20
3Raj Verma10

Now Employees.DepartmentID references the department, and the name and location of the department are stored in Departments. This removes the transitive dependency and satisfies 3NF.

Boyce-Codd Normal Form (BCNF) – Handling More Advanced Anomalies

BCNF, or Boyce-Codd Normal Form, is often considered a stronger version of 3NF. While 3NF focuses on transitive dependencies, BCNF addresses any functional dependency issues that involve candidate keys. In BCNF, every determinant must be a candidate key. A determinant is a column or set of columns that determines another column. So if X → Y is a functional dependency, then X has to be a candidate key in BCNF.

When I learned about BCNF, I found it often comes into play with more complex data relationships, especially when a table has multiple candidate keys. Most tables that are in 3NF will also be in BCNF, but there can be tricky edge cases. This might happen if your table has multiple candidate keys and a functional dependency creeps in that’s not directly tied to the primary key. Ensuring BCNF means carefully analyzing all dependencies in your database schema and making sure they align with the keys.

Let’s consider a scenario with course scheduling at a university, where a table called CourseSchedule might look like this:

CourseIDInstructorRoomTimeSlot
CS101Dr. BrownRoom A9:00 AM
CS101Dr. BrownRoom A10:00 AM
CS202Dr. GreenRoom B9:00 AM

Assume we have the following functional dependencies:

  1. (CourseID, TimeSlot) → Instructor
  2. Instructor → CourseID (Let’s say each instructor can only teach one course at a time)
  3. (Room, TimeSlot) → CourseID (Because a specific room and timeslot can only hold one course)

Here, we might have a situation where (Instructor → CourseID) makes Instructor a determinant, but Instructor may not be a candidate key if we also consider that multiple instructors could theoretically exist with different schedules. If the database is set up such that each instructor can only teach one course, Instructor becomes a candidate key for CourseID. But if it’s not strictly enforced, we could run into a BCNF violation.

To satisfy BCNF, we might reorganize the tables to ensure that every determinant is a candidate key. For example, create separate tables ensuring that instructor-to-course relationships are enforced in a dedicated relationship table, while the schedule table references those constraints. This could look like:

Instructors Table

InstructorIDInstructorNameCourseID
1Dr. BrownCS101
2Dr. GreenCS202

CourseSchedule Table

CourseIDRoomTimeSlot
CS101Room A9:00 AM
CS101Room A10:00 AM
CS202Room B9:00 AM

By breaking out the relationships carefully, we ensure each determinant is indeed a candidate key in its own table, satisfying BCNF.

Fourth Normal Form (4NF) – Eliminating Multivalued Dependencies

4NF deals with multi-valued dependencies. Once you’ve reached BCNF, you likely have eliminated most common anomalies. However, 4NF ensures that there are no non-trivial multi-valued dependencies of a non-key attribute on the primary key. A multi-valued dependency means that for a given key, you can have multiple values of one attribute independent of multiple values of another attribute.

4NF becomes relevant when handling databases that store multiple independent sets of information about an entity. For example, you might have a table that stores a teacher’s multiple certifications and the multiple languages they speak. Each of these sets of values is independent (the languages they speak do not depend on the certifications they hold). When these multi-valued attributes coexist in the same table, you risk duplicating data in complicated ways.

By achieving 4NF, you ensure that these attributes are broken out into separate tables if they’re truly independent. This keeps your main entity table from blowing up with every combination of multi-valued attributes.

Let’s say we have a Teachers table that stores each teacher’s ID, the subjects they can teach, and the sports they can coach. Hypothetically, a teacher can teach multiple subjects and also coach multiple sports. If we put that all in one table:

TeacherIDSubjectsSports
T1Math, ScienceBasketball, Soccer
T2EnglishBaseball, Soccer

We can end up with complex combinations if we try to store them properly. A more normalized design would be to separate these multi-valued attributes into their own tables:

Teachers Table

TeacherIDTeacherName
T1Mr. Davies
T2Ms. Roberts

TeacherSubjects Table

TeacherIDSubject
T1Math
T1Science
T2English

TeacherSports Table

TeacherIDSport
T1Basketball
T1Soccer
T2Baseball
T2Soccer

Now, the multi-valued attributes (subjects and sports) are each in their own table, removing any multi-valued dependency issues and satisfying 4NF.

Fifth Normal Form (5NF) – Resolving Join Dependencies

5NF, also known as Project-Join Normal Form (PJNF), is the highest commonly recognized normal form. It states that every join dependency in a table is implied by the candidate keys. In simpler terms, 5NF ensures that if you break your data down into multiple tables, you can’t lose information when you recombine (join) them, and there’s no unnecessary duplication when you do.

This level of normalization is often about handling complex relationships between three or more tables. The typical scenario is when you have multiple many-to-many relationships that intersect. For example, a table might connect multiple entities (like teachers, students, and courses) all at once. If the relationships among these entities are truly separate, they should be broken into smaller tables so that you don’t end up with partial or redundant data. 5NF is quite advanced, and not every database needs to reach this level. But if you find yourself managing very large, interconnected datasets, 5NF can be a lifesaver in ensuring your data integrity stays intact.

Consider a scenario with three entities: Company, Product, and Distributor. You might have a single table that attempts to store which distributor handles which product for which company:

CompanyIDProductIDDistributorID
C1P1D1
C1P2D2
C2P1D1

Let’s say:

  • A product can be offered by multiple distributors.
  • A distributor can handle products from multiple companies.
  • A company can contract multiple distributors for multiple products.

If these three sets of relationships (Company-Product, Company-Distributor, Product-Distributor) are truly independent, you can decompose the table into three separate tables: CompanyProduct, CompanyDistributor, and ProductDistributor. Each table would store only the relationships between its two entities. For example:

CompanyProduct Table

CompanyIDProductID
C1P1
C1P2
C2P1

CompanyDistributor Table

CompanyIDDistributorID
C1D1
C1D2
C2D1

ProductDistributor Table

ProductIDDistributorID
P1D1
P2D2

When you join these three tables, you can still recover the original relationships. If this decomposition is valid (i.e., you don’t lose any information or create spurious tuples), and each of these relationships is truly independent, then your tables could be in 5NF. This means all join dependencies can be derived from the candidate keys without introducing redundant or contradictory data.


Database normalization can be a game-changer in how you store and manage your data. Each normal form—1NF through 5NF—tackles a different aspect of data structure, from eliminating repeating groups to removing subtle functional and multi-valued dependencies. While real-world databases don’t always need to reach the highest level of normalization, knowing the full range of normal forms helps you make informed decisions about how to organize your data.

© 2025 Programmer Codex

GitHub