Sunday, April 5, 2015

Database 1- Basic Concepts of Normalization

Normalization is an important concept in database design, which helps to simplify the overall design and help to extract, update and delete any data from tables in future. Usually if a table contains too many columns then there is possibility that it will create complexity to maintain data in that database and we need to decompose the table based on the functional dependencies between the columns in table and normal forms. 

In this series of post, I will discuss the basic concepts of normalization and provide an example with explanation. I will use some of the references we were provided with in our LIS490 Introduction to Databases course at GSLIS. I give the credit to Professor John Weible for the reading reference and the sample problem. The solution is completely based on my own understanding of normalization. 

We first start with the qualities of a good database design. There are three basic qualities that should be maintained. 
1. Good Semantics, 
2. Avoiding Redundancy, 
3. Minimum Use of Nulls.

To ensure this, we need to make sure that the non-key fields in our tables cannot have different primary keys or are not closely related to the primary key. And this is the key concept of Normalization and the rules are called normal forms. The available rules are:

1. First Normal Form (1NF)
2. Second Normal Form (2NF)
3. Third Normal Form (3NF)
4. Boyce-Codd Normal Form (BCNF or 3.5 NF)
5. Fourth Normal Form (4NF)
6. Fifth Normal Form (5NF)

Usually it is good enough to understand till 3NF or BCNF and the rules higher than this get more complicated and are generally not implemented. In this post we will discuss till 3NF.

1NF: It means that all values in the tables must consist of same number of fields. In general the fields should be single-valued and not atomic. For example, the first row of Name column and second row of Hobby column is violating first normal form.

Name
Hobby
Nushrat, Khan
Painting
Brishti
Cooking, Painting

2NF: This rule means all the non-key elements should depend on the whole key. It needs to be considered if there is more than one key in the table. The important thing to remember here is, 2NF is violated when a non-key field is a fact about a subset of a key. We will see the example of this rule in our main example. 

3NF: Third Normal Form is all non-key fields will tell fact only about the key. In other words, there should not be any dependency between non-key elements. The important thing to remember here is, 3NF is violated when a non-key field is a fact about another non-key field.

Functional Dependency: Functional dependency between columns is another important concept that needs to be considered. If the functional dependency violates any normal form for an existing table then it needs to be further decomposed. 
The definition is, for two columns X and Y, Y is functionally dependent on X if for any combination of X and Y, the value of X will always be unique and cannot be repeated where Y can be repeated.
The following table violates the rule of functional dependency because if we want to update any information about 'Brishti' then it can cause anomaly in both of the rows.


                                Name
                              Hobby
                               Brishti
                              Painting
                               Brishti
                           Photography

But in the table below Hobby is functionally dependent on Name.

Name
Hobby
Barsha
Singing
Brishti
Painting
Barno
Painting

A classic reference to read more about normalization.
W. Kent, "A Simple Guide to Five Normal Forms in Relational Database Theory", Communications of the ACM 26(2), Feb. 1983, 120-125 (bkent.net/Doc/simple5.htm)

Another source from Microsoft: http://support.microsoft.com/en-us/kb/283878

In the following post I will give an example of a table that violates the normal forms and show how to decompose that based on the concept of functional dependency and normal form.