This is an excerpt from Chapter 5, Creating a Data Model. Data modeling is often written from a forward engineering perspective, as if everyone creates brand-new data models. That’s sometimes the case, but people are frequently confused about what to do when encountering existing data models, which are much more common in the real world.
Please note that I’m currently writing Chapter 4, Data Model Building Blocks, in tandem with Chapter 5, as there’s a lot of overlap between the two. I aim to have both chapters out by the end of next week. In the meantime, I’ll publish excerpts that I think might be useful for all subscribers (paid and free). As a reminder, full chapters and “special content” are only available to paid subscribers.
Forward Engineering
Forward engineering is the design and creation of a new data model. It starts with creating the conceptual and logical data model before translating it into the physical data model. You first define the conceptual model’s entities, attributes, and relationships. Using this conceptual model, you create the logical data model. Finally, your logical data model is mapped to the data system you plan to store the model physically. Most commonly, this is a database of some sort. Let’s say you’re using a relational database for argument’s sake. Your logical data model is mapped to the physical data model, including tables, columns, data types, constraints, and indexes. The physical data model is optimized for the data system you intend to use. In forward engineering, the goal is to translate the model to the data system.
Reverse Engineering
As you can guess, reverse engineering is the opposite of forward engineering. When you reverse engineer, you start with an existing data system and extract a data model representing its structure. The output is often some form of data model documentation, data catalog, or other artifact that helps others understand the data model.
For example, let’s say the relational database you created in the forward engineering example is now 15 years old. I’ll couple this with an example from a client of mine. In this case, the designer of the data model and maintainer of the database didn’t document the data model. Tragically, this person unexpectedly passed away one day, leaving the IT team scratching their heads on how to work with the database. The IT team decided to reverse engineer the database to understand the data model. First, they connected to the database and analyzed the database schema (tables, columns, data types, and constraints). Next, they created an entity-relationship diagram (ERD) of the database structure. Finally, they documented the database in a data catalog to make the database’s data searchable and understandable by others. This reverse engineering project helped give people the context and understanding of an existing (and cryptic) database.
Other reverse engineering examples include understanding an existing REST API endpoint, the data sources that go into a machine learning model, and countless others. You’ll use reverse engineering to understand existing data sources and systems. In a sense, this is more like data archaeology. You’re trying to decipher a data model created in the past—what it consists of and its purpose.
Which Will You Do?
You might be wondering what you’ll be doing in the real world. It’s probably a mix of forward and reverse engineering, but I’m guessing you’ll do more reverse engineering in practice. The reason is there are seemingly infinite existing data sources with varying degrees of documentation. Consider a situation where you integrate data from your ERP, CRM, and various third-party applications. Your job will be to understand the building blocks of the existing data model(s) and how it might be helpful or integrated with your data model.
If you do forward engineering, take the time to get it right. A solid start to your data model is one of the best investments you can make. You lower the long-term technical debt while giving users a data model that’s easy to understand and enjoyable to use.
The amount of Forward or Reverse Engineering might depend on the project team you are working on, but as well the data management maturity of your organization. If you are on a full-on custom development application, you should be forward engineering most of the time. Reverse engineering if taking in 3rd party data, or other systems you might be interacting with.
As you mentioned, reverse engineering, partly data archeology, is necessary for understanding data relationships for extracting, or making enhancements. Getting information from data catalogs, but importantly enriching the data catalog with the data model (visual) and relationships.
Data modeling, and the resulting DDL, DML, scripts, should be treated like development code, check into source control and deployed to environment.
After reading the excerpt, It is interesting that I never thought of the logical being "mapped" to a physical model. More experience using a data modeling tool, where the physical is "generated" to a physical data model. I guess, behind the scenes, a translation and mapping does occur - attributes, to columns, naming standard is applied, data types are applied to target physical, primary keys syntax, etc. is generated to DDL. The flip side of the coin, reverse engineering with a data tool, the physical and the logical data model can be automatically created from the source database. Your mileage may vary, depending on how well the original model was created. Understanding the book is meant to be tool independent / agnostic - and meant to teach the underlying need and benefits of data modeling as part of software development.
Very clear read and I agree that many times, practitioners will have to start with reverse engineering to forward engineer something new. Specifically, I speak from the perspective of a business/process analyst looking to achieve more valuable BI. Coming into a process that "works" but is not very data-informed allows us the opportunity for a starting point based off of a few "obvious" entities that already exist. However, only by reverse engineering the process, speaking from a lense of "what data is involved" can we discover where the gaps are (missing entities, relationships, wrongfully defined attributes, etc.).