This is the draft introduction to Chapter 4, Data Modeling Building Blocks. The goal is to provide the essentials for data modeling, regardless of the use case - applications, analytics, and ML/AI. This chapter is challenging to write, and I’m sure I’ll get feedback about what I should add or remove. It’s hard because I’ve tried to narrow down the things you’d need to know if you thought about data modeling holistically and not just narrowly (say, for analytics only). To my knowledge, this approach hasn’t been tried before, so I’m bound to get some things wrong. But hey, you get to watch me write a book in public, in real time, so I’m sure it will be entertaining either way :)
This section will be accessible to all readers, as more eyes on it are better. The rest of the chapter—additional and complete sections—will be for paid subscribers. This is a long chapter (my draft currently sits around 30 pages). Ultimately, it might be longer or shorter, depending on how it goes.
Thanks,
Joe Reis
I often find that people working with data don’t understand its building blocks. Data is frequently treated as something that’s stored, queried, and cobbled together. Consequently, data models are often an unintentional side effect or artifact created without deliberate thought.
Before we build a data model, we must grasp the critical building blocks forming it. This chapter aims to provide the foundational knowledge that will enable you to ‘see’ data from first principles across various cross-disciplinary use cases. This is a departure from how data modeling has been treated, which often focuses on a particular approach, like relational or dimensional data modeling. In Mixed Model Arts, we need cross-disciplinary building blocks that will work across various forms of data and use cases - applications, analytics, and ML/AI.
These building blocks can be viewed from two angles:
1. What form of data am I working with?
2. What am I trying to represent?
First, what form of data am I working with? Traditional data modeling focuses on tabular data and takes for granted that data neatly conforms to tables, rows, and columns. However, given the integration of many different forms of data, focusing only on tabular data misses the bigger picture of how data is used. Here are the primary forms of raw data you’ll encounter when modeling data. Metadata is included because it’s the glue that forms the basis for a unified data model.
Tabular (aka structured data)
Semistructured
Unstructured
Machine learning/AI artifacts
Metadata
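To make these forms concrete, here is a minimal Python sketch that shows the same hypothetical customer order in each of the five forms. All names and values (`order_id`, `orders_service`, the embedding numbers, and so on) are made up for illustration; this is one possible way to picture the forms, not a prescribed layout.

```python
import json

# Hypothetical example: one customer order, seen in five forms of data.

# Tabular: fixed columns, one row per order.
tabular_columns = ["order_id", "customer_id", "total"]
tabular_row = (1001, 42, 59.90)

# Semistructured: self-describing and nested; fields can vary per record.
semistructured = json.loads(
    '{"order_id": 1001, "customer": {"id": 42},'
    ' "items": [{"sku": "A1", "qty": 2}]}'
)

# Unstructured: free text with no predefined schema.
unstructured = "Customer 42 called to ask when order 1001 will ship."

# ML/AI artifact: for example, an embedding derived from the text above.
# These numbers are invented, not produced by a real model.
embedding = [0.12, -0.48, 0.33]

# Metadata: data about the data, which ties the other forms together.
metadata = {
    "source": "orders_service",  # hypothetical system name
    "columns": tabular_columns,
    "related_forms": ["semistructured", "unstructured", "ml_artifact"],
}

# The same order can be recovered from the tabular and semistructured forms.
tabular_record = dict(zip(tabular_columns, tabular_row))
same_order = tabular_record["order_id"] == semistructured["order_id"]
```

Notice that only the metadata knows all the other forms describe the same order, which is why it is listed here as the glue.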
Second, what am I trying to represent?
Entities - the core concepts or objects you represent in your data model.
Attributes - the characteristics or properties describing those entities.
Relationships - how different entities connect and relate to each other.
Grain - the level of detail of the data.
Subject Areas, Domains, Processes, and More - the subject area, domain, or process you are trying to model.
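The list above can be sketched in a few lines of Python. This is a rough illustration under my own assumptions: the `Customer` and `OrderLine` classes, their fields, and the sample values are hypothetical, and the sales subject area is only an example.

```python
from collections import defaultdict
from dataclasses import dataclass


# Subject area (hypothetical): sales.

@dataclass(frozen=True)
class Customer:        # entity
    customer_id: int   # attribute that identifies the entity
    name: str          # attribute that describes the entity


@dataclass(frozen=True)
class OrderLine:       # entity; grain: one record per product per order
    order_id: int
    customer_id: int   # relationship: each order line belongs to a customer
    product: str
    amount: float


customers = [Customer(1, "Ada"), Customer(2, "Grace")]
order_lines = [
    OrderLine(100, 1, "keyboard", 50.0),
    OrderLine(100, 1, "mouse", 25.0),
    OrderLine(101, 2, "monitor", 200.0),
]


def total_per_order(lines):
    """Roll up from order-line grain to the coarser order grain."""
    totals = defaultdict(float)
    for line in lines:
        totals[line.order_id] += line.amount
    return dict(totals)


order_totals = total_per_order(order_lines)
```

Rolling up changes the grain: three order-line records become two order records, and the product-level detail is no longer available at the coarser grain.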
Some people might nitpick the terminology above. Would it be better to use another term besides “entity,” “domain,” and so on? The data modeling world is full of pedantic arguments over terminology. If you want to go down the rabbit hole of hair-splitting about definitions, there are plenty of resources and discussions out there. That’s not the approach of this book. As much as possible, we’re sticking with conventional terminology - most of which has been around for decades. Given the stickiness of this terminology, I see no point in being clever or reinventing the wheel with terms that only add more vocabulary to the world without adding proportionate value. I will extend existing terminology from the traditional focus on databases to other use cases, like data science, machine learning, and structured and unstructured datasets. These same concepts and approaches apply across any use case for data modeling.
My approach to discussing the building blocks of data modeling differs from past attempts to describe them. Traditional discussions focus almost exclusively on tables. Given the intersection of transactional use cases, analytics, and machine learning, there are more forms of data to consider, each with its nuances. The traditional approach of shoehorning everything into tables is limiting and misses the broader evolution of how data is used today. That said, some commonalities should apply, even roughly, to data models across various use cases and forms of data.
With that, let’s dive into the building blocks of data models.
Updated 9/10/2024
I think one massive gap in modeling is how to model "other stuff".
Techniques for OLTP are pretty well documented (normalization)
Techniques for Cubes/Warehouse are pretty well documented (star schema, dimensions, etc.)
We could use a lot more for Master Data Management
We could use a lot more for "everything else" - data mart structures for analytics, etc.
Is this going to be a book?