Hi,
As on many other occasions, I can only agree with most of the statements. It’s more of a “this is how things are done and it’s hard to believe people do it otherwise”.
I’d like to add one thought, or a hint maybe: I tend to stick strictly to 3NF when creating an application where I have all the CRUD coded and supported with a UI (with some MVC or Flux architecture pattern). But when I’m in the analytics and reporting world, with code-first (SQL, Python, etc.) access to data plus visualization tools, I tend to allow myself a bit of denormalization, which leads to data redundancy (for example, codes kept in some dictionary: I keep the dictionary but copy the values over to the entity, not just the relationship). That needs more work in case of a change, but it makes querying easier.
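A rough SQL sketch of what I mean (table and column names are just made up for illustration, not from the book):

```sql
-- Strict 3NF: the code value lives only in the dictionary table
CREATE TABLE status_codes (
    status_code  VARCHAR(10) PRIMARY KEY,
    status_name  VARCHAR(50) NOT NULL
);

CREATE TABLE orders (
    order_id     INT PRIMARY KEY,
    status_code  VARCHAR(10) NOT NULL REFERENCES status_codes (status_code)
);

-- Reporting-friendly denormalization: copy the dictionary value onto the entity.
-- Redundant, and it must be refreshed if the dictionary changes,
-- but it saves a join in every reporting query.
CREATE TABLE orders_reporting (
    order_id     INT PRIMARY KEY,
    status_code  VARCHAR(10) NOT NULL,
    status_name  VARCHAR(50) NOT NULL   -- copied over from status_codes
);
```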
The classic “it depends” I guess.
Cheers
Thank you for the write-up and the interesting facts for revisiting the fundamentals of data modelling.
@JOE REIS, are you planning to put down some thoughts on how data modelling stays relevant, or needs to be considered, when the data lakehouse approach is used for analytical workloads? Do we still consider relational modelling for data lake solutions as well?
How about real-time streaming data and the consumption patterns from those data sets?
Funny you mention this, as this is the next chapter I'm currently writing. Well, at least the alternatives to relational modeling as they relate to apps. Analytics comes after that chapter.
Long story short. Yes, relational modeling gives you a framework for reducing or eliminating data redundancies. However, given the data-intensive nature of lakehouses at scale (TBs and PBs of data) or high velocity, the relational model will have problems at the physical level. Cuz data physics.
The order table example can be put into 3NF by simply pulling the Customer info out into its own table. Then you have a 3NF Fact table with keys to the Dimensions: Date, Customer, Product (and OrderID too, as a degenerate dimension). Not denormalized, but reorganized into one table at the Product level. The analytical design is 3NF, just like a design with an OrderID table and a Line table. There is more than one way to 3NF. The fact table at the core of the dimensional model is not denormalized. Just following the line of thought from Kimball and Date.
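A minimal SQL sketch of that shape (all table and column names here are illustrative, not taken from the chapter):

```sql
-- Dimensions, kept minimal here
CREATE TABLE dim_date     (date_key INT PRIMARY KEY, full_date DATE NOT NULL);
CREATE TABLE dim_customer (customer_key INT PRIMARY KEY, customer_name VARCHAR(100) NOT NULL);
CREATE TABLE dim_product  (product_key INT PRIMARY KEY, product_name VARCHAR(100) NOT NULL);

-- Fact table keyed by the dimensions, at the Product (order line) grain.
-- OrderID is carried as a degenerate dimension: a key with no dimension table of its own.
CREATE TABLE fact_order_line (
    date_key      INT NOT NULL REFERENCES dim_date (date_key),
    customer_key  INT NOT NULL REFERENCES dim_customer (customer_key),
    product_key   INT NOT NULL REFERENCES dim_product (product_key),
    order_id      INT NOT NULL,              -- degenerate dimension
    quantity      INT NOT NULL,
    amount        DECIMAL(12,2) NOT NULL,
    PRIMARY KEY (order_id, product_key)
);
```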
Also love the passion, and thanks for reinvigorating the value of digitizing the details of business data via data models!
Thanks Dave! And thanks for being a part of this community
I thought that JavaScript Object Notation (JSON) is a lightweight, text-based format for exchanging data between systems and applications. It is not a data model but an output that can be generated from an LDM, or even a PDM, to represent the JSON structure desired to support the exchange of actual business data. I can show you what I mean via a data modeling tool such as ER/Studio Data Architect, and the options for creating an LDM and generating multiple PDMs for specific physical destinations. Since there is usually no LDM created, but rather a start from a physical endpoint (i.e. physical tables created in, say, Oracle), I sense some believe JSON is the data model. Joe, does this make sense, and can you help me understand where I am off base?
If I indicated JSON is a data model, I apologize. It is a way that data’s commonly organized, stored and passed around. But it certainly isn’t a data model in itself, no more than a table is a data model.
I’ll have to push back on your comment that there’s no LDM created with JSON. I could make a strong argument that JSON IS a way to represent an LDM, assuming you’re working with nested data. It’s a human-readable way of capturing a CDM (assuming you’re capturing that…CDM as a practice feels like a dead art) before it’s physically stored.
I haven’t seen a JSON document with Business Names, relationships, PK/AK, data types, and the other business and technical metadata that is captured in an LDM. I would be interested in seeing an example of a JSON document that does have all this, plus PII and other meta content, and that can be extended and captured in a data modeling tool as an LDM.
Here's another thought: when you say Relational Model, are you thinking of the Relational Model of Data, as the theoretical foundation, OR, a specific relational model for an application? It seems to me both are used depending on the context and it's a very fine distinction most won't notice. Darn semantics.
how are these different?
The first one is the set of principles guiding (a blueprint if you will), while the second is the result of using the first (a practical implementation of the blueprint).
The Relational Model of Data is the theoretical foundation (analogous to a class in programming), whereas a specific relational model for an application is its practical implementation (like an instance of the class).
When I refer to the relational model, it's always in the context of modeling data according to relational theory. Same way Codd and CJ Date describe the relational model.
Use numbering for the ItemX, QuantityX columns in the dumb table - you cannot have multiple columns with the same name :) And you can make the point that X can go up to whatever the physical layer supports, but it's an awful way of organizing data. I did hear of tables with hundreds of columns, in this exact vein.
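Something like this, just to show the shape (hypothetical names, and deliberately awful):

```sql
-- The "dumb" wide table: one row per order, with the repeating group
-- numbered Item1..ItemN. N grows to whatever the platform allows.
CREATE TABLE orders_dumb (
    order_id    INT PRIMARY KEY,
    order_date  DATE,
    customer    VARCHAR(100),
    item1       VARCHAR(100),
    quantity1   INT,
    item2       VARCHAR(100),
    quantity2   INT,
    item3       VARCHAR(100),
    quantity3   INT
    -- ...and so on: add item4/quantity4 every time an order needs one more line
);
```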
I don't remember reading the normalization part - is it out?
yes, normalization is Part 2
You'll see tables like the "dumb" one in the wild, both in databases and spreadsheets. I left it like that on purpose, because it's insanely stupid but actually happens.
Wow - I have learned something and my head hurts. I wouldn't have made the comment had I known.
I can see why in a spreadsheet it won't matter, since calculations will reference it by the column identifier. Wow, just wow.
yeah, and dataframes too. A common workflow is to convert JSON to a dataframe, which might persist in Parquet or something similar. There are some nutty tables out there.
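For example, a rough sketch of that workflow, assuming DuckDB syntax and a hypothetical orders.json file (both are assumptions here, not anything from the book):

```sql
-- Load nested JSON into a table; the schema is inferred,
-- and nested fields land as structs/lists
CREATE TABLE orders_raw AS
SELECT * FROM read_json_auto('orders.json');

-- Persist the result as Parquet for downstream dataframe work
COPY orders_raw TO 'orders.parquet' (FORMAT PARQUET);
```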
Thankfully the vast majority of databases won't allow duplicate names in the same table.
I would include this info, Joe, it's an important lesson for:
- how not to do things when starting from scratch, and
- bringing awareness of how bad it is and the need for restructure.
A terrible practice w/o awareness is just another Tuesday. Granted, the chances of something changing are anyone's guess, but everything has to start somewhere, so why not here?
Data independence 🤌
Joe, you might want to look at Unibase (https://unibase.zenucom.com)
This is a relational database that goes beyond the capabilities of SQL and does not suffer performance issues from large numbers of joins etc.
Over 40+ years of exploring the RDM and building a relational database and language to be as conformant as possible, I have learned a lot and included those learnings in the product.
Some further comments:
- We regularly deal with tables of millions and sometimes 10s of millions of records and queries that involve hundreds of millions of calculations (possibly billions).
- The set-theoretical approach of the RDM is the best basis for data management that I have come across
- SQL - an invention of IBM marketing - is holding back application development, not helping it. At least partly because it is not an RDM
- Using joins is theoretically good, and in mainframe applications probably good enough. However, there are many reasons why joins are a bad way to go for modern interactive applications
- Unibase uses associations which are part of its ontological model of an application and the engine can use this data for many optimisations
- Relational integrity is extremely difficult if not impossible to maintain in large applications
- I made the call some years ago to impose an all-digits primary key on tables (the number of digits being enough to count all the records in the table plus a bit more - and it can be expanded easily if necessary).
- The first advantage of this primary key is that it can be passed around easily over HTTP
- The next advantage is that it doesn't change, eliminating the need for relational integrity
This is a brief and relevant set of thoughts. There is much more in the documentation, particularly the "Data Dictionary".
This better not be vendor promotion...
It could be but I am far more interested in the theory and establishing a better model. My main interests now and for some time include using semantics to improve performance by giving the engine a higher level view of the application.
Some of this is about considering complex calculations as attributes of a table, and I am exploring extending the tensor concept to data and calculations in related tables.
I am also interested in discussing these issues with those that have an interest.
You’ve so far made no effort to be part of the community here or with me, and your first comment is a plug for your database. See how this might be perceived?
PS: my suggestion to look at Unibase is for the ideas, not to become a customer.
For some time. Here’s the thing: I spend 12-14 hours a day researching better ways to build databases and applications, and not as much time as I would like sharing the ideas. When you are self-funded from a very low base, this is the reality.
Yes, but I have been following you and wondering about the best way to engage.
it's fine for now. Please be mindful of being overly "vendor-ish." But you post interesting questions, so I'll let this slide. Thanks.
Regarding this great question: Is the Relational Model Only for Applications?
The Inmon approach uses 3NF to design the data warehouse, or am I wrong?
Kimball also uses 3NF for Fact tables. He relaxes Dimension tables to 2NF, although you can go to 3NF (snowflake) if you like. The real difference is the use of shared dimensions as a means to integrate the data.
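Roughly, the difference looks like this (illustrative names only):

```sql
-- Star-style dimension (2NF-ish): category attributes repeated on each product row
CREATE TABLE dim_product_star (
    product_key    INT PRIMARY KEY,
    product_name   VARCHAR(100),
    category_code  VARCHAR(10),
    category_name  VARCHAR(50)      -- depends on category_code, not the key: repeated
);

-- Snowflaked (3NF) version: the transitive dependency moves to its own table
CREATE TABLE dim_category (
    category_code  VARCHAR(10) PRIMARY KEY,
    category_name  VARCHAR(50)
);

CREATE TABLE dim_product_snowflake (
    product_key    INT PRIMARY KEY,
    product_name   VARCHAR(100),
    category_code  VARCHAR(10) REFERENCES dim_category (category_code)
);
```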
Yeah, it's a bit of a fuzzy area too. Some literature about dimensional modeling says data is not normalized. But as Donald points out, it is (or, I should say, can be...). This is where knowing the difference and being able to see various forms of normalization is key.
The relational model was created for OLTP workloads, which were the first use cases that emerged. Analytical (OLAP) use cases came later and were only possible because there was data stored in OLTP DBs. Folks realized that data had tremendous value for analytical needs. So I like being explicit about the relational model being specific to applications (OLTP).
Of course, analytical modeling (particularly dimensional) draws heavily from the relational model. I would argue that to be a good analytical data modeler, you need to know how to model for applications and thus have a solid understanding of the relational model. This is why the chapters on application modeling are essential, even for those data practitioners who only work on the analytical side. I really like the layout of the book and can’t wait to see the analytical modeling chapter, which can draw from these chapters.
Thanks! And I’ll make oltp more explicit.
Up next - “non relational” modeling for oltp/applications.
Then analytics, which many people seem excited about
Great discussions. Got me thinking when the threads started mentioning that some may not know both transactional and analytical modeling. Especially coming from some of us who started before analytical modeling was prevalent, or decision support systems, or data warehouses, and reports were on green bar paper, not screens ;). So at that point the typical modeling behavior was built into our thinking. Especially because normalized models quickly broke down for large data reads vs. OLTP data writes.
Also, working through creating data models without having data is really where a data modeler earns their stripes. Teasing out business rules, business processes, and data requirements is a great skill and can be an eye-opening whiteboard journey! With analytical modeling, the data already exists in some form or another, and you can analyze a source. Which is hopefully well modeled ;).
I guess my underlying point is that data modeling should be business-driven, not done in a vacuum, and the 'end goal' must be anticipated. The model connects to our enterprise transactional models, with the anticipation and forward thought of harmonizing in our analytical models, be it via shared conformed dimensions or other methods.
As Joe mentioned, "Thinking"... a good data modeler/data architect is thinking not just within the current model, but the enterprise model.
The non-relational part will be a nice segue to the analytics chapter
Completely agree, I think both relational/non-relational OLTP sections need to be covered before analytics/OLAP. This is the type of book we are missing, one that covers the fundamentals of modeling across different use cases, including new approaches not widely (if at all) used when other classic textbooks were released. That way, practitioners have a better understanding of the entire landscape, can appreciate the importance of data models, and can build more appropriate ones for their use cases.
Just as important, it will help us build a common language & set of concepts to elevate DE as a discipline. Right now it’s too scattered & informal, and many of us come from different use cases & domains, so we don’t share the same terminology nor ground truth. I suspect once we do, we’ll find we have more in common than we thought and can more easily build a set of practices tailored to different uses.
Definitely. There's also increasingly a lot of overlap between oltp and olap. As I've started writing this new chapter, there's a ton of gray area. For instance, "real-time analytics." How is that modeled? There's data in streaming systems that is queried either in the stream or in a low-latency olap database. Star schema won't work in most cases cuz joins add latency, so we need another way to model the data. This type of modeling has received VERY little treatment so far.
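For instance, a sketch of the kind of pre-joined wide table that gets used instead (all names hypothetical, and the interval syntax varies by engine):

```sql
-- One denormalized "wide" table: dimension attributes are folded in
-- at ingest time so serving queries never have to join.
CREATE TABLE order_events_wide (
    event_time     TIMESTAMP NOT NULL,
    order_id       BIGINT NOT NULL,
    customer_id    BIGINT NOT NULL,
    customer_tier  VARCHAR(20),     -- copied from the customer dimension
    product_id     BIGINT NOT NULL,
    product_name   VARCHAR(100),    -- copied from the product dimension
    quantity       INT,
    amount         DECIMAL(12,2)
);

-- Serving query: a single-table scan and aggregation, no joins on the hot path
SELECT product_name, SUM(amount) AS revenue_last_hour
FROM order_events_wide
WHERE event_time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR
GROUP BY product_name;
```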
Technically, real-time/streaming analytics is still OLAP. It’s just a different kind of analytics use case, one with very different constraints and thus data modeling approaches. Wherever this goes, having a clear separation/categorization of use cases will be key.
P.S. Checking the table of contents, you have already set it up this way. Having an additional Hybrid chapter would allow you to discuss those gray-area topics that mix OLTP and OLAP.
Yes, that’s definitely the case, and it has contributed to the noise and confusion. Hybrid approaches will only increase as technology evolves and new needs arise. The way around this is simple: first describe both OLTP and OLAP separately, with the necessary level of depth. Once that is understood, it’s easier to mix them and look into overlapping use cases, which you can cover in subsequent chapters. That’s another area this book is well positioned to address. There are books dedicated to developing solutions for each niche, but data modeling isn’t typically a focus, and there’s nothing that covers all data modeling use cases. It’s incredibly valuable as a modeler to be aware of other use cases and other approaches, so this is a huge gap we have. Love the direction the book is taking!
Thank you Donald and Joe for clarification 😊💐🙏