This chapter1 on application data modeling starts getting into the more familiar and technical parts of data modeling. There will be more draft sections coming for this chapter over the next few weeks, including an in-depth look at the relational model, data modeling for NoSQL and streams, and much more. This is a fun chapter to write, since it’s something I wish more software engineers and developers knew more about. As always, kick the tires on anything I write. Your feedback is much appreciated and welcome.
Thanks,
Joe
Tables are everywhere. Whether you use spreadsheets, most garden variety databases, or dataframes, tables are how most of us think about and interact with data. But have you ever stopped to consider the elegance behind these tables and how they interact? In this section, we’ll go over the core components and concepts of the relational model—a model that revolutionized how we interact with data and underpins countless applications we use every day. We'll explore when this model is the perfect tool for the job and when other approaches might be more suitable.
When we talk about the relational model, we’re talking about organizing data into tables consisting of rows and columns. A crucial part of the relational model is that each table should be as free from duplication and redundancy. We’ll discuss how this works and the motivations very shortly.
The relational model is commonly associated with relational database management systems (RDBMS). This doesn’t mean it can’t be used in other types of databases or different use cases outside of databases. However, for historical reasons, the relational model is associated with transactional and operational workloads, often called online transactional processing (OLTP). By convention, this section will focus on the relational model for OLTP workloads.
We need to understand its historical context to truly appreciate why the relational model is a cornerstone of data modeling. Let’s take a brief journey through its history, starting with its development by E.F. Codd.
A Very Brief History of the Relational Model
Let’s look at a very brief history of the relational model. As discussed in Chapter 3, databases are constantly evolving, and many options are available. Today, we have a wide variety of database technologies and vendors. In the 1960s, choices were much slimmer, with databases supplied by a few vendors, usually in two flavors - hierarchical or network. In these examples, we’ll assume the data model in these databases follows from the database itself. The data model was tightly coupled to the underlying database implementation. Let’s first look at the hierarchical database.
Hierarchical Databases
A simple way to think about hierarchical databases is your computer’s file system. You organize files within folders and those folders within other folders. Files were originally how data was stored in the early days of computing. As you learned earlier, databases were developed to simplify data storage. Like a file and folder hierarchy, a hierarchical database’s data model organizes data in a tree-like structure, with a single root node at the top and branches of child nodes extending downwards. Each node contains one or more records. A record is a collection of related data fields, similar to a row in a table. For example, in an employee database, a record might hold an employee's ID, name, and job title. Each parent node can have multiple child nodes, but each child node can only have one parent. This creates a rigid parent-child relationship between data points.
There were some severe downsides to hierarchical databases. First, the rigid parent-child structure was massively inflexible. This made it difficult to represent complex relationships where entities have multiple connections. Querying data across different branches was brutal, involving complicated navigational operations. Also, due to the strict parent-child structure, modeling many-to-many relationships was extremely difficult, requiring redundant and costly workarounds.
Redundancy was also an issue. Data could be duplicated across multiple branches, leading to redundancy, inconsistencies, and inefficient storage. If you needed to locate data, you might query many places. Given the complexity of querying data, this could be very exhausting. The data redundancy would also lead to update and deletion anomalies, where you’ll need to access and update or delete records spread across multiple nodes.
Finally, due to the rigid hierarchy and potential for duplicate data spread across the database, changing the hierarchy structure was complex and required significant changes to applications accessing the database.
IBM IMS has a storied history, built to track the millions of parts needed to make the Saturn V moon rocket as part of the NASA Apollo Space Program. And it’s still survived to this day. Interestingly, as of 2022, IBM estimates2 that around 2000 companies, and 95% of the Fortune 1000, still use IBM IMS. These organizations are mainly verticals like banks, finance, healthcare, and government, where legacy infrastructure is complicated to migrate from, and supporting it is the cheaper option. Despite the prevalence of hierarchical databases, the hierarchical model isn’t widely used outside of this legacy infrastructure, and we won’t cover it further in this book.
Network Databases
Another popular type of database in the 1960s was the network database. It is built on the hierarchical model, allowing child nodes to have multiple parents. You might recall many-to-many relationships, and this is what’s described. This is the significant difference between network and hierarchical data models.
The downsides of network databases were similar to those of hierarchical databases. The significant difference was the additional complexity of managing the many-to-many relationships. This introduced further complications, where accessing data required navigating through a network of pointers, limiting network databases’ query capability. Database records were linked using physical addresses referred to by pointers. The network model would become very brittle and error-prone if these physical addresses became corrupt.
You might still find Integrated Data Store (IDS) in some organizations for network databases. Like IBM IMS, these types of databases exist for legacy reasons. We won’t cover the network database or model further except as a historical artifact.
While hierarchical and network databases were popular in the past, they have largely been replaced by other types. Today, hierarchical structures are better represented as graphs, which we’ll explore later in this book. However, as mentioned above, some legacy systems may still rely on these older databases. We’ve looked at some popular options, such as IMS and IDS. That said, it’s difficult to pinpoint specific organizations using these databases publicly, as companies usually don't disclose their internal database architectures.
A Brief History of the Relational Model
“The network model, on the other hand, has spawned a number of confusions, not the least of which is mistaking the derivation of connections for the derivation of relations.” - Edgar F. Codd (A Relational Model of Data for Large Data Banks, 1970)
“The original objectives in developing the Relational Model were to address each and every one of the shortcomings that plagued those systems that existed at the end of the 1960s decade.” - Edgar F. Codd (Providing OLAP to User-Analysts: An IT Mandate, 1993)
During the late 1960s, businesses became more complex, and their data needs were evolving beyond the capabilities of hierarchical and network models. People pursued a more flexible, efficient, and logical way of managing data. In 1970, Edgar F. Codd introduced the relational model with his revolutionary paper, “A Relational Model of Data for Large Data Banks.” Codd proposed organizing data into relations, which are commonly known as tables. These tables have rows (tuples) and columns (attributes), providing a structured and intuitive way to represent information. Under the hood, Codd took a first-principles approach to the relational model based on a solid set of mathematical principles such as set theory and predicate logic. This allowed for formal reasoning about queries and data, resulting in more consistent and predictable results.
Instead of relying on the rigid and predefined relationships in hierarchical and network data models, especially with many-to-many relationships, the relational model provided more flexibility and expressiveness. The relational model introduced flexibility by using foreign keys to represent relationships. It allowed for easy changes to relationships between data without altering the underlying database structure.
Another key goal of the relational model was to separate the logical structure of data from the physical storage details, which Codd called data independence. Whereas the hierarchical and network models tightly coupled data to physical storage, data independence meant that users and applications could interact with the data without knowing how it was physically stored or organized on disk. This separation of logical and physical data models allowed for changes in storage without affecting applications.
The relational model was initially met with skepticism, but researchers and developers gradually recognized its potential. Early implementations of relational database management systems (RDBMS) like System R (IBM) and Ingres (UC Berkeley) emerged. It also spawned SQL in the early 1970s, which fostered interoperability between different varieties of RDBMS and became the standardized way to query the RDBMS. Fast forward to the 1980s and 1990s, and the relational model effectively replaced older database models like hierarchical and network databases, becoming the dominant approach for data models and databases. Companies like Oracle and IBM dominated this era of RDBMS, and adoption of the RDBMS was widespread across businesses and government. To this day, the RDBMS - and the relational model - are the cornerstone of nearly every database used in the wild. Unarguably, the relational model is a major cornerstone of data modeling. Understand it and you’ll have a firm foundation for data modeling. Ignore it and you miss a vital entire context of what data modeling is about.
Please note that as we dive into the relational model, I’m giving you a perspective on what I think you need to know about it, especially against its historical context. The relational model is extremely intricate, with countless books and articles devoted to the topic. To attempt to cover every nuance of the relational model in a chapter is impossible. CJ Date has probably written tens of thousands of pages about the relational model. So, if you want to pursue your knowledge of the relational model, please read widely, as there’s no shortage of great resources on the topic.
Let’s next look at a high-level view of the relational model.
Updated 12/15/2024
https://community.ibm.com/community/user/ibmz-and-linuxone/blogs/deepak-kohli/2022/06/03/ims-speed-storage
What’s cool about this Joe is the historical perspective - it is so absent in most tech literature. Context is lost without it, and newcomers wont know why things are the way they are.
Wow. PTSD. My first professional programming job out of college - using IMS as the database for internal Order management, Purchasing and Fixed Assets. Before ERP's were prevalent. A bit of a pain to navigate, incentivized me to take on a job at a new employer with DB2 relational database. Technically IMS much more efficient and less DASD ..(or Storage ;) ). P.S. It seems we are back to hierarchical structures with JSON and XML and NOSQL DB's.