Why Edgar Codd came up with the Relational Model
During my graduation days we were introduced to the concept of relational databases. The first thing that we learned was the theory proposed by Edgar Codd. The concepts were very clear after I studied them, but I only focused on the HOWs and not the WHYs of them. This article is an attempt to throw some light on both: how and why relational model proposed by Codd came to be well known and especially what motivated him to do so.
How was data stored in the 1960s and 70s?
In the year 1967, IBM developed the ICS: Information Control System. The purpose of this system was to manage the bills of all materials of construction required for the Apollo space mission, to send the first man to land on the moon. The ICS was developed by the year 1967 and was installed in 1968. This ICS was renamed as IMS (Information Management System) and was released for public usage in 1969.The IMS was successful in its purposes.
The IMS used the hierarchical model to store data, where every entity in the data is assumed to have attributes like a tree structure. As an example, a person has an address, and that address has street, city, state and country. We can see a tree forming here.
For storing the “Address” entity in a “Person”, links were used where a link would provide the address data.
Problems with the hierarchical model
The hierarchical model has its own limitations. For example, let’s consider a scenario where a customer purchases multiple products, and one product can be purchased by many customers. In such a case, the hierarchical model stores one record of a product for a customer record. For other customers, if they have purchased the same product, that product’s data will be duplicated in all the other records (this is also called as denormalisation today, where it is done purposely to improve performance in NoSQL systems 😄). The main problem is that we are storing the same data multiple times. As in our example, the product data is duplicated of the same product for multiple customers.
Technically, we say that the hierarchical model was not able to define the many-to-many relationships between the entities properly.
Relational Database Systems: A theory
In the year 1970, just after one year when IMS was public and was the most renowned database management system, Codd released his paper on relational model:
A Relational Model of Data for Large Shared Data Banks
Codd was working for IBM at that time as a researcher.
The paper starts with describing the problems of the hierarchical model in terms of the cohesiveness of current application softwares and data management systems. He mentions that the software must assume an inherent structure is present and may work based on that structure only. What this means is that if we change the way how data is stored then the application software cannot work anymore. Codd was suggesting that there should be no dependencies between how data is stored and the application logic. He mentioned it in the paper as “data independence”
…the problems treated here are those of data independence-the independence of application programs and terminal activities from growth in data types and changes in data representation…
…the variety of data representation characteristics which can be changed without logically impairing some application programs is still quite limited…
Then he goes on to explain the concept of primary keys, foreign keys and non-simple domains with examples.
The problem of non-simple domains:
As discussed in the example above, in many-to-many relationships, such as a customer purchasing multiple products, we need to duplicate product data. This is unavoidable in the hierarchical model. Here, the product is also another complex entity, with its own attributes such as name, price, etc. Codd refers to such entities as non-simple domains.
Here I would like to remind you that this theory was written in 1970. In modern database applications, it is possible to store such complex entities in a single column by using JSON or XML format. In the case of a product we can encode it to JSON to something as follows:
{name: Sneakers, price: 84.99, size: 8}
MySQL allows storing JSON in columns and MSSQL allows storing XML. These data types were not present in the 1970s but were implemented 30 years later.
To tackle this problem in those times Codd presented the idea of normalisation. The basic idea is to identify the relationships between entities with the help of primary keys (a unique identifier of that row). In our example, what we would do is instead of storing the entire product entity in our columns, store it in a separate table with one column of a person’s unique identifier.
In the example, two customers purchased the same product but we are only storing the identity of that product in product_purchased column instead of storing its entire data.
Normalisation was later more well defined by Codd in the coming years. You can read about it here: Database Normalisation
In the rest of the paper, Codd wrote about the possibility of a query language, which applications developers use today to retrieve and store data in the database. Also discussed are the concepts such as joins, projection, permutation and all this is demonstrated with the help of notations.
From theory to practice. How the model was implemented years later
The first relational database was developed around 6 years later, known as the Multics Relational Data Store which was developed for the Multics operating system. This was followed by Oracle in 1979, which had SQL and joins. Many relational databases were developed later, including MySQL in 1995 which is still widely used today.
Codd’s 12 rules:
Due to the gaining popularity of the relational databases in the 1980s, database vendors were repackaging their softwares and claiming it to have relational capabilities. This led Codd to declare thirteen rules (numbered 0 to 12) in 1985 to define the properties and capabilities of a relational database system. This was released as a two part article titled as “Is your DBMS really relational?” and “Does your DBMS run by the rules?”
As an example, let us look at rule 0 and 12
Rule 0 states that
For any system that is advertised as, or claimed to be, a relational data base management system, that system must be able to manage data bases entirely through its relational capabilities.
and rule 12:
If a relational system has a low-level (single-record-at-a-time) language, that low level cannot be used to subvert or bypass the integrity rules and constraints expressed in the higher level relational language (multiple-records-at-a-time)
Appreciations received
Codd received the Turing Award for his contributions to the field of database management, an award equivalent to the Nobel Prize in the computer science field. There has been a lot of evolution in the database management domain since Codd proposed his theory. Still the relational model is one of the most widely implemented model even today.
That’s because relational databases provide functionalities of normalisation and joins at the database level. This removes the burden as an application developer to maintain schema structure. For example, in a NoSQL system, people misunderstand that there is no schema, but in fact the schema is not on the database level, only the schema structure is moved to the application code.
In today’s world, we as application developers have a choice to choose what type of database model we need: Relational, NoSQL, and even Graph Model maybe. That depends on the use case we want to implement. But even with so many options available, the fact that relational models are still widely used proves their reliability. And to think of such a model in the 1970s is what fascinates me.
Hello there! I like to write articles on tech topics which fascinate me. If you enjoyed reading this article, please let me know in the comments 😃 That will motivate me to write more. If you didn’t like something, of find something incorrect in the article, I would love to hear that. I am still exploring so you can guide me 😜