42 Comments
Michał I.

Hi,

As on many other occasions, I can only agree with most of the statements. It's more of a "this is how things are done, and it's hard to believe people do it otherwise."

I'd like to add one thought, or a hint maybe. I tend to be strict about 3NF when creating an application where I have all the CRUD coded and supported with a UI (with some MVC or Flux architecture pattern). But when I'm in the analytics and reporting world, with code-first (SQL, Python, etc.) access to data plus visualization tools, I tend to allow myself a bit of denormalization, which leads to data redundancy (for example, codes kept in some dictionary: I keep the dictionary, but I copy the values over to the entity, not just the relationship). It needs more work in case of a change, but it makes querying easier.
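A rough sketch of what I mean (table and column names are purely illustrative, not from any real schema):

```sql
-- Normalized: the entity holds only the relationship to the dictionary.
CREATE TABLE status_dict (
    status_code VARCHAR(10) PRIMARY KEY,
    status_name VARCHAR(50)
);

CREATE TABLE orders_app (
    order_id    INT PRIMARY KEY,
    status_code VARCHAR(10) REFERENCES status_dict(status_code)
);

-- Denormalized for reporting: the dictionary still exists, but its values are
-- copied onto the entity, so queries need no join, at the cost of redundancy
-- and extra work if a dictionary value ever changes.
CREATE TABLE orders_reporting (
    order_id    INT PRIMARY KEY,
    status_code VARCHAR(10),
    status_name VARCHAR(50)   -- copied over from status_dict
);
```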

The classic “it depends” I guess.

Cheers

Rama krishna

Thank you for the write-up and the interesting facts for revisiting the fundamentals of data modelling.

@JOE REIS, are you planning to share your thoughts on how data modelling remains relevant, or needs to be reconsidered, when the data lakehouse approach is used for analytical workloads? Do we still consider relational modelling for data lake solutions as well?

How about real-time streaming data and the consumption patterns for those datasets?

Joe Reis

Funny you mention this, as it's the next chapter I'm currently writing. Well, at least the alternatives to relational modeling as they relate to apps. Analytics comes after this chapter.

Joe Reis

Long story short: yes, relational modeling gives you a framework for reducing or eliminating data redundancies. However, given the data-intensive nature of lakehouses at scale (TBs and PBs of data) or high velocity, the relational model will have problems at the physical level. Cuz data physics.

Donald Parish

The order table example can be put into 3NF by simply pulling the customer info out into its own table. Then you have a 3NF fact table keyed by the dimensions: Date, Customer, Product (and OrderID as a degenerate dimension). It's not denormalized, just reorganized into one table at the product level. The analytical design is 3NF, just like a design with an Order table and a Line table. There is more than one way to get to 3NF. The fact table at the core of the dimensional model is not denormalized. Just following the line of thought from Kimball and Date.
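Roughly, the shape I mean (names are illustrative, just a sketch):

```sql
CREATE TABLE dim_date     (date_key     INT PRIMARY KEY, full_date     DATE);
CREATE TABLE dim_customer (customer_key INT PRIMARY KEY, customer_name VARCHAR(100));
CREATE TABLE dim_product  (product_key  INT PRIMARY KEY, product_name  VARCHAR(100));

-- Fact table at the order-line (product) grain. It is in 3NF: every non-key
-- column depends on the key, the whole key, and nothing but the key.
CREATE TABLE fact_order_line (
    date_key     INT NOT NULL REFERENCES dim_date(date_key),
    customer_key INT NOT NULL REFERENCES dim_customer(customer_key),
    product_key  INT NOT NULL REFERENCES dim_product(product_key),
    order_id     INT NOT NULL,              -- degenerate dimension: no separate order table
    quantity     INT NOT NULL,
    unit_price   DECIMAL(10,2) NOT NULL,
    PRIMARY KEY (order_id, product_key)
);
```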

Dave

Also love the passion, and thanks for reinvigorating the value of digitizing the details of business data via data models!

Joe Reis

Thanks Dave! And thanks for being a part of this community

Dave

I thought that JavaScript Object Notation (JSON) is a lightweight, text-based format for exchanging data between systems and applications. It is not a data model but an output that can be generated from an LDM, or even a PDM, to represent the JSON structure desired to support the exchange of actual business data. I can show you what I mean using a data modeling tool such as ER/Studio Data Architect, along with the options for taking one LDM to multiple PDM-specific physical destinations. Since there is usually no LDM created, and people start from a physical endpoint (i.e. physical tables created in, say, Oracle), I sense some believe JSON is the data model. Joe, does this make sense, and can you help me understand where I am off base?

Joe Reis

If I indicated JSON is a data model, I apologize. It is a way that data’s commonly organized, stored and passed around. But it certainly isn’t a data model in itself, no more than a table is a data model.

I'll have to push back on your comment that there's no LDM created with JSON. I could make a strong argument that JSON IS a way to represent an LDM, assuming you're working with nested data. It's a human-readable way of capturing a CDM (assuming you're capturing that… the CDM as a practice feels like a dead art) before it's physically stored.

Dave

I haven't seen a JSON document capture business names, relationships, PK/AK, data types, and the other business and technical metadata that are captured in an LDM. I would be interested in seeing an example of a JSON document that has all this, plus PII and other meta content, that can be extended and captured in a data modeling tool as an LDM.

Ramona C Truta

Here's another thought: when you say Relational Model, are you thinking of the Relational Model of Data as the theoretical foundation, OR a specific relational model for an application? It seems to me both are used depending on the context, and it's a very fine distinction most won't notice. Darn semantics.

Joe Reis

how are these different?

Ramona C Truta

The first one is the set of guiding principles (a blueprint, if you will), while the second is the result of using the first (a practical implementation of the blueprint).

The Relational Model of Data is the theoretical foundation (analogous to a class in programming), whereas a specific relational model for an application is its practical implementation (like an instance of the class).

Joe Reis

When I refer to the relational model, it's always in the context of modeling data according to relational theory. Same way Codd and CJ Date describe the relational model.

Ramona C Truta

Use numbering for the ItemX, QuantityX columns in the dumb table; you cannot have multiple columns with the same name :) And you can make the point that X can go up to whatever the physical layer supports, but it's an awful way of organizing data. I did hear of tables with hundreds of columns, in this exact vein.
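Just to make the anti-pattern concrete (purely illustrative, not the article's actual table):

```sql
-- Repeating-group anti-pattern: one row per order, numbered column pairs.
-- A fourth line item means an ALTER TABLE, and most columns sit empty (NULL).
CREATE TABLE dumb_orders (
    order_id  INT PRIMARY KEY,
    customer  VARCHAR(100),
    item1     VARCHAR(100),
    quantity1 INT,
    item2     VARCHAR(100),
    quantity2 INT,
    item3     VARCHAR(100),
    quantity3 INT
    -- ...and so on, up to whatever column limit the physical layer supports
);
```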

I don't remember reading the normalization part - is it out?

Joe Reis

yes, normalization is Part 2

Joe Reis

You'll see tables like the "dumb" one in the wild, both in databases and spreadsheets. I left it like that on purpose, because it's insanely stupid but actually happens.

Ramona C Truta

Wow - I have learned something and my head hurts. I wouldn't have made the comment had I known.

I can see why it won't matter in a spreadsheet, since calculations reference values by the column identifier, not the name. Wow, just wow.

Joe Reis

yeah, and dataframes too. A common workflow is to convert JSON to a dataframe, which might persist in Parquet or something similar. There are some nutty tables out there.

Thankfully the vast majority of databases won't allow duplicate names in the same table.

Ramona C Truta

I would include this info, Joe, it's an important lesson for:

- how not to do things when starting from scratch, and

- bringing awareness of how bad it is and the need for restructure.

A terrible practice without awareness is just another Tuesday. Granted, the chances of something changing are uncertain, but everything has to start somewhere, so why not here?

Eevamaija Virtanen

Data independence 🤌

Rick Marshall

Joe, you might want to look at Unibase (https://unibase.zenucom.com)

This is a relational database that goes beyond the capabilities of SQL and does not suffer performance issues from large numbers of joins etc.

Over 40+ years of exploring the RDM and building a relational database and language to be as conformant as possible, I have learned a lot and included those learnings in the product.

Some further comments:

- We regularly deal with tables of millions and sometimes 10s of millions of records and queries that involve hundreds of millions of calculations (possibly billions).

- The set theoretical approach of RDM is the best basis for data management that I have come across

- SQL - an invention of IBM marketing - is holding back application development, not helping it, at least partly because it is not an RDM.

- Using joins is theoretically good, and in mainframe applications probably good enough. However, there are many reasons why joins are a bad way to go for modern interactive applications.

- Unibase uses associations, which are part of its ontological model of an application, and the engine can use this data for many optimisations.

- Relational integrity is extremely difficult, if not impossible, to maintain in large applications.

- I made the call some years ago to impose an all-digits primary key on tables (with enough digits to count all the records in the table plus a bit more, and it can be expanded easily if necessary).

- The advantage of this primary key is, first, that it can be passed easily using httpd.

- The next advantage is that it doesn't change, eliminating the need for relational integrity
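In generic SQL terms, the shape of the idea (a sketch only; this is not Unibase's syntax or implementation):

```sql
-- A purely numeric, system-assigned surrogate key: enough digits to number every
-- record, never reused or changed once assigned, so references never need updating
-- when business attributes change, and the value passes easily in a URL.
CREATE TABLE customer (
    customer_id BIGINT PRIMARY KEY,   -- digits only, immutable
    email       VARCHAR(255),         -- natural identifiers can change freely
    name        VARCHAR(100)
);

CREATE TABLE invoice (
    invoice_id  BIGINT PRIMARY KEY,
    customer_id BIGINT NOT NULL,      -- points at customer.customer_id; never needs a cascading update
    total       DECIMAL(12,2)
);
```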

This is a brief and relevant set of thoughts. There's much more in the documentation, particularly the "Data Dictionary".

Joe Reis

This better not be vendor promotion...

Rick Marshall

It could be, but I am far more interested in the theory and in establishing a better model. My main interests, now and for some time, include using semantics to improve performance by giving the engine a higher-level view of the application.

Some of this is about treating complex calculations as attributes of a table, and I am exploring extending the tensor concept to data and calculations in related tables.

I am also interested in discussing these issues with those that have an interest.

Joe Reis

You’ve so far made no effort to be part of the community here or with me, and your first comment is a plug for your database. See how this might be perceived?

Rick Marshall

PS: my suggestion to look at Unibase is for the ideas, not to become a customer.

Rick Marshall

For some time. Here's the thing: I spend 12-14 hours a day researching better ways to build databases and applications, and not as much time as I would like sharing the ideas. When you are self-funded from a very low base, this is the reality.

Rick Marshall

Yes, but I have been following you and wondering about the best way to engage.

Joe Reis

It's fine for now. Please be mindful of being overly "vendor-ish." But you post interesting questions, so I'll let this slide. Thanks.

Pipeline to Insights

Regarding this great question, "Is the Relational Model Only for Applications?":

The Inmon approach uses 3NF to design the data warehouse, or am I wrong?

Donald Parish

Kimball also uses 3NF for fact tables. He relaxes dimension tables to 2NF, although you can use 3NF (snowflake) if you like. The real difference is the use of shared dimensions as the means to integrate the data.
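Roughly, the two shapes (illustrative names only):

```sql
-- Denormalized dimension (Kimball's usual preference): category attributes
-- are carried on the product dimension, repeated for every product.
CREATE TABLE dim_product (
    product_key   INT PRIMARY KEY,
    product_name  VARCHAR(100),
    category_name VARCHAR(100),
    category_mgr  VARCHAR(100)
);

-- Snowflaked (3NF) alternative: category pulled out into its own table.
CREATE TABLE dim_category (
    category_key  INT PRIMARY KEY,
    category_name VARCHAR(100),
    category_mgr  VARCHAR(100)
);

CREATE TABLE dim_product_snowflaked (
    product_key  INT PRIMARY KEY,
    product_name VARCHAR(100),
    category_key INT REFERENCES dim_category(category_key)
);
```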

Joe Reis

Yeah, it's a bit of a fuzzy area too. Some literature about dimensional modeling says data is not normalized. But as Donald points out, it is (or, I should say, can be...). This is where knowing the difference and being able to see various forms of normalization is key.

Jaime Villalpando

The relational model was created for OLTP workloads, which were the first use cases that emerged. Analytical (OLAP) use cases came later and were only possible because there was data stored in OLTP DBs. Folks realized that data had tremendous value for analytical needs. So I like being explicit about the relational model being specific to applications (OLTP).

Of course, analytical modeling (particularly dimensional) draws heavily from the relational model. I would argue that to be a good analytical data modeler, you need to know how to model for applications and thus have a solid understanding of the relational model. This is why the chapters on application modeling are essential, even for those data practitioners who only work on the analytical side. I really like the layout of the book and can't wait to see the analytical modeling chapter, which can draw from these chapters.

Joe Reis

Thanks! And I'll make OLTP more explicit.

Up next - "non-relational" modeling for OLTP/applications.

Then analytics, which many people seem excited about.

Maury

Great discussions. It got me thinking when the threads started mentioning that some may not know both transactional and analytical modeling. Especially for some of us who started before analytical modeling was prevalent, or decision support systems, or data warehouses, when reports were on green-bar paper, not screens ;). So at that point the typical modeling behavior was built into our thinking. Especially because normalized models quickly broke down for large data reads vs. OLTP data writes.

Also, enduring through creating data models without having data is really where a data modeler earns their stripes. Teasing out business rules, business processes, and data requirements is a great skill and can be an eye-opening whiteboard journey! With analytical modeling, the data already exists in some form or another, and you can analyze a source, which is hopefully well modeled ;).

I guess my underlying point is that data modeling should be business driven, not done in a vacuum, and the 'end goal' must be anticipated. The model connects to our enterprise transactional models, with forward thought given to harmonizing it with our analytical models, be it through shared conformed dimensions or other methods.

As Joe mentioned, "Thinking"... a good data modeler/data architect is thinking not just within the current model, but the enterprise model.

Joe Reis

The non-relational part will be a nice segue to the analytics chapter

Jaime Villalpando

Completely agree, I think both relational/non-relational OLTP sections need to be covered before analytics/OLAP. This is the type of book we are missing, one that covers the fundamentals of modeling across different use cases, including new approaches not widely (if at all) used when other classic textbooks were released. That way, practitioners have a better understanding of the entire landscape, can appreciate the importance of data models, and can build more appropriate ones for their use cases.

Just as important, it will help us build a common language & set of concepts to elevate DE as a discipline. Right now it’s too scattered & informal, and many of us come from different use cases & domains, so we don’t share the same terminology nor ground truth. I suspect once we do, we’ll find we have more in common than we thought and can more easily build a set of practices tailored to different uses.

Joe Reis

Definitely. There's also increasingly a lot of overlap between OLTP and OLAP. As I've started writing this new chapter, there's a ton of gray area. For instance, "real-time analytics." How is that modeled? There's data in streaming systems that is queried either in the stream or in a low-latency OLAP database. A star schema won't work in most cases cuz joins add latency, so we need another way to model the data. This type of modeling has received VERY little treatment so far.
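As a rough sketch of the kind of shape that tends to show up there (illustrative only, one of several possible approaches): a single wide, pre-joined event table, so queries hit one table and never join.

```sql
-- Illustrative "one big table" design for low-latency analytics: dimension
-- attributes are denormalized onto each event at ingest time, so queries
-- avoid joins entirely.
CREATE TABLE page_view_events (
    event_time    TIMESTAMP NOT NULL,
    event_id      VARCHAR(36) NOT NULL,
    user_id       BIGINT,
    user_country  VARCHAR(2),    -- copied from the user dimension at ingest
    user_plan     VARCHAR(20),   -- copied from the user dimension at ingest
    page_url      VARCHAR(500),
    page_category VARCHAR(50),   -- copied from the page dimension at ingest
    duration_ms   INT
);

-- Typical query shape: a single-table aggregation, no joins.
-- (Time-window syntax varies by engine.)
SELECT user_country, COUNT(*) AS views
FROM page_view_events
WHERE event_time >= CURRENT_TIMESTAMP - INTERVAL '5' MINUTE
GROUP BY user_country;
```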

Jaime Villalpando

Technically, real-time/streaming analytics is still OLAP. It’s just a different kind of analytics use case, one with very different constraints and thus data modeling approaches. Wherever this goes, having a clear separation/categorization of use cases will be key.

P.S. Checking the table of contents, you have already set it up this way. Having an additional Hybrid chapter would allow you to discuss those gray-area topics that mix OLTP and OLAP.

Jaime Villalpando

Yes, that's definitely the case, and it has contributed to the noise and confusion. Hybrid approaches will only increase as technology evolves and new needs arise. The way around this is simple: first describe both OLTP and OLAP separately, with the necessary level of depth. Once that is understood, it's easier to mix them and look into overlapping use cases, which you can cover in subsequent chapters. That's another area this book is well positioned to address. There are books dedicated to developing solutions for each niche, but data modeling isn't typically a focus, and there's nothing that covers all data modeling use cases. It's incredibly valuable as a modeler to be aware of other use cases and other approaches, so this is a huge gap we have. Love the direction the book is taking!

Pipeline to Insights

Thank you Donald and Joe for clarification 😊💐🙏
