43 Comments
Joe Reis:

Thanks all. Extremely helpful and awesome comments. I'll start posting articles next week, and will open this group up for a discussion call if anyone is interested.

Thanks,

Joe

Michelle Currie:

Interested in a discussion call

Wendy Smoak:

Conceptual modeling of the business domain *before* thinking about how it will be represented in code or in a database. I'd be interested in some 'mob modeling' sessions where a bunch of us get together and practice interviewing a stakeholder and iterating on a conceptual model.

Joe Reis:

This

Michelle Currie:

I can help with how to collect business requirements from the business side

Paul Alex Ahlstrom:

I have an idea. The 3 books I have read so far on modeling fail to:

1. Simply and quickly demonstrate the value of modeled data. How is this going to make my life way easier?

2. Simply show how to model a few source tables end to end into a data warehouse star schema. Can you demonstrate the process of modeling so the conversation is intuitive going forward?

A lot of data people know something is wrong with their current way of modeling but can't put their finger on it. Give them an example that makes them go "dang, fact and dim tables make so much sense, I need this." Maybe demonstrate the difficulty of doing a "query-driven model" on transaction tables with a bunch of joins and filtering. Then compare that to a star schema, or an OBT view of the star schema, that makes it a simpler question to answer.

Then you can demonstrate how to model the data. You could say for example "here are 4 tables, video views, project, season and episode metadata" and then model those into a star schema (and show the logical, conceptual, physical, transformational steps). That would be a super practical guide.
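[Editor's note: the four-table exercise suggested above could be sketched roughly as follows, using SQLite. All table and column names (dim_episode, fact_video_view, etc.) are illustrative assumptions, not a prescribed design.]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One denormalized dimension rolling project -> season -> episode together,
# plus a fact table of video views keyed to it.
cur.executescript("""
CREATE TABLE dim_episode (
    episode_key   INTEGER PRIMARY KEY,
    episode_title TEXT,
    season_number INTEGER,
    project_name  TEXT
);
CREATE TABLE fact_video_view (
    view_id         INTEGER PRIMARY KEY,
    episode_key     INTEGER REFERENCES dim_episode(episode_key),
    view_date       TEXT,
    seconds_watched INTEGER
);
""")

cur.executemany("INSERT INTO dim_episode VALUES (?, ?, ?, ?)",
                [(1, "Pilot", 1, "Show A"), (2, "Finale", 1, "Show A")])
cur.executemany("INSERT INTO fact_video_view VALUES (?, ?, ?, ?)",
                [(10, 1, "2024-01-01", 600), (11, 2, "2024-01-02", 900)])

# The "intuitive" analytical question: seconds watched per project
# becomes a single join plus a GROUP BY.
rows = cur.execute("""
    SELECT d.project_name, SUM(f.seconds_watched)
    FROM fact_video_view f
    JOIN dim_episode d USING (episode_key)
    GROUP BY d.project_name
""").fetchall()
print(rows)  # [('Show A', 1500)]
```

The same question against raw source tables would need joins across views, episode, season, and project, which is exactly the contrast the comment proposes demonstrating.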

Marc Reid:

Looking forward to these posts, Joe. I'd be interested to learn more about data modelling with or in the context of AI. Do you see a future where AI will be able to scan data across a range of sources and create a workable data model(s) that a human modeller then 'only' has to quality check? Will the need for clear and meaningful column names reduce (not that it should, of course) due to AI being able to make data source connections based on field values rather than field names (for larger data sets)?

Joe Reis:

This will be a huge and ongoing discussion. So much to unpack here!

Michelle Currie:

I’m working on something similar to what you describe in healthcare, specifically for AI.

Ben Doremus:

Data modeling as presented is often too abstract.

It would be amazing to give real-life examples of when different types of modeling are appropriate. What is the business use case, what does raw data look like, what does modeled data look like, and how does it get used?

Tangible, contextual examples. And a discussion about why other models would be less effective than the one chosen.

paulx.eth:

My starting place has been the Kimball book; and it's certainly overwhelming. Should that be the starting place in 2024? If so, what's the practical way to digest that tome? The book makes recommendations on chapter sequences to read, but wondering, if there's a tl;dr sequence.

Donald Parish:

I think the Kimball dimensional model, aka star schema, is misunderstood as denormalized, when it is “reorganized” into fact and dimension tables for the purpose of making data usable for business reporting. The fact tables are perfectly 3NF normalized. The dimensions are 2NF “denormalized”, and if you “snowflake” them, they are 3NF also. I dislike the implication that the dimensional model of a database is worse just because it isn’t organized the same as a transaction-oriented database.

Jan Kaul:

The data modeling techniques for Analytics lie on a spectrum of how much data normalization they apply. From Inmon to Kimball to OBT less and less normalization is applied.

I would like some practical guidelines on how much normalization to apply for analytical use cases and how to get started. It would be great to have some resources that help you get off the ground quickly, especially since most projects have the problem of not having a data model instead of having the wrong data model.

I'm really looking forward to your articles!

Happy holidays

Joe Reis:

Thanks Jan. Good insights

Ubert, T. (Tanja):

I think an overview of models like ERD, Data Vault, star, snowflake, and anchor is useful. But also why data models matter for data scientists in production (life after the Jupyter notebook).

I have trouble convincing new data scientists of the use of models for accountability etc.

Joe Reis:

Perfect context. Thanks Tanja!

Manoj Agarwal:

How relevant is the star schema data model today, when storage is cheap and distributed and compute is expensive? Does the 'One Big Table' data model make more sense now? With that, is data modeling dead?

Martin Chesbrough:

One Big Table is still a model. So my question is how do you implement Slowly Changing Dimensions (SCD) into the OBT model and get consistent historical results?
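[Editor's note: one common answer to this question is to keep the dimension versioned with Type 2 effective dates and resolve each fact row against the version in effect when the OBT is built. A minimal sketch below; all names and dates are hypothetical.]

```python
from datetime import date

# Dimension with SCD Type 2 versioning: customer 42 changed
# region on 2024-03-01, so two versions exist.
dim_customer = [
    {"customer_id": 42, "region": "EMEA",
     "valid_from": date(2024, 1, 1), "valid_to": date(2024, 2, 29)},
    {"customer_id": 42, "region": "APAC",
     "valid_from": date(2024, 3, 1), "valid_to": date(9999, 12, 31)},
]

facts = [
    {"customer_id": 42, "order_date": date(2024, 2, 15), "amount": 100},
    {"customer_id": 42, "order_date": date(2024, 4, 1), "amount": 200},
]

def build_obt(facts, dim):
    """Join each fact to the dimension version valid on the fact's date."""
    obt = []
    for f in facts:
        for d in dim:
            if (d["customer_id"] == f["customer_id"]
                    and d["valid_from"] <= f["order_date"] <= d["valid_to"]):
                obt.append({**f, "region": d["region"]})
    return obt

obt = build_obt(facts, dim_customer)
# The February order keeps the region that was true at the time.
print([row["region"] for row in obt])  # ['EMEA', 'APAC']
```

Because the attribute is resolved as-of the fact date at build time, rebuilding the OBT later still reproduces the same historical results.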

Michelle Currie:

Hi Joe, I’m a clinical informaticist with over 20 years of experience in HealthIT. I have just started speaking to the importance of conceptual, logical, and physical data modeling in healthcare and its significance in using data to make any real progress in improving patient outcomes, especially with the excitement around AI. I am hosting three webinars starting in a few weeks to talk about the new FHIR standard in healthcare, but want to expand the scope to talk about data modeling in general. I’d appreciate your thoughts on how to accelerate this conversation in the healthcare space. Please let me know if you’d be interested in speaking specifically in this space, or in developing a partnership to do that. I’m a clinician, and having a technical person on the same page would help accelerate the conversation with both business and IT stakeholders. I’m specifically focused on the importance of data modeling and AI’s use in healthcare. I can be reached at michelle@sos4hit.com

Ian Koh:

How data vault and dimensional modelling work together. Thanks for setting up this Substack!

Peter O'Kelly:

Relationships among conceptual data models, metrics stores, and semantic models/layers

Joe Reis:

For sure. Where do you see conceptual data models fitting in with these newer innovations?

NoogaTiger:

I’m a newish analytics engineer and would love a straight 101 and 201 overview - what is data modeling/how do you actually do it/best resources to get started/etc?

Joe Reis:

Very cool. What have you learned about data modeling so far?

NoogaTiger:

Frankly, not a ton other than basic star schema stuff (fact vs. event tables). I inherited a dbt project and Redshift instance from a disinterested software engineer and am unsure of the best way to design/refactor the data structures.

Tsvetelina Petrova:

Hi Joe and everybody!

Great idea to kick off a place to discuss data modeling and everything related - I wish you a lot of success with this!

My suggestion for a topic, and something I’m currently trying to figure out for my project, is the best approach to handling big legacy data models that depend heavily on hardcoded workarounds written over the years without proper documentation of the “why”.

The case is that the business heavily relies on this data and each refactoring effort creates a risk to “break things”. Therefore there is no external motivation for code refactoring, but the snowball effect will inevitably make the logic unmanageable at some point and the risk of wrong data representation is always on the table.

I hope a lot of people will find this topic useful and interesting!

Martin Chesbrough:

Hi Joe, I'm going to suggest another angle - let's see if it resonates.

A couple of years back Nick Tune wrote a post on ddd vs DDD (https://medium.com/nick-tune-tech-strategy-blog/domain-driven-design-ddd-vs-domain-driven-design-ddd-10ec1d5ca6c7), and I think this is relevant for data modeling. The relevance of Eric Evans' DDD is that he built a formal approach to ddd, but Nick Tune's point is that developers usually do some form of ddd anyway, in the sense that they break the problem space into relevant domains and model them.

So I think we do this in the data and AI space - in order to understand a data problem we build a model. Even if it is One Big Table per dashboard, that table is still a form of model (which is the point in your recent "Data Modeling is Dead" talk). And to get to the OBT we need to do some transforms (de-normalise, aggregate, map columns, etc.) - which is also a form of model.

Do we need to do it the Kimball way? Or the Data Vault way? Or the BEAM way? Or using Inmon's approach? Not necessarily, but understanding all these approaches ... and why they exist (specifically the WHY) ... is helpful for understanding what to do next.

So there is Data Modeling and data modeling. We all do some form of it. How much is needed for your business? I guess that depends on how much shared knowledge, re-use, and useful documentation you want.

Sirojiddin:

Data engineering, especially the part related to DevOps/DataOps, as I want to learn that area more deeply.

SEN Labs:

Would be interested in DDD style modeling, graphs, mapping different models and views, synchronization and standardization, semantics

Andy Nelson:

As a student of data modeling I feel like there is a bit of a gap in educational offerings. Yes, you can fill your bookcase with all the data modeling books (and I have), but in a way these are like car manuals: somewhat hard for beginners to apply when they are currently driving a 20-year-old junker. I wish there was more interactive training that didn't cost thousands of dollars.

Joe Reis:

When you say "interactive", what comes to mind?

Andy Nelson:

When a change is made in the DWH (perhaps a change in the data model), there's a long latency before the implications of that change surface in a lot of analytics pipelines. I think the challenge for instructors would be to show these changes more interactively.

Adam Baker:

At the start of my career (>15 years ago) data architecture involved the generation of conceptual, logical, and physical data models. I find myself wanting to create something similar to capture the flow of data via DAGs (Airflow etc.). This would be conceptually different from a physical data model in that only a subset of fields in the data model are used, dependent on the pipeline. I guess what I'm trying to capture is data lineage at a field level. I'm wondering: does this type of thing exist, and does it have a name? I find myself generating these DAG data models to help me trace field lineage and wonder if this is standard practice or not.
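[Editor's note: the idea described here (and usually called column-level or field-level lineage) can be sketched as a plain adjacency map in Python. The field names below are invented for illustration.]

```python
# Each downstream field maps to the upstream fields it is derived from,
# forming a DAG across raw, staging, and mart layers.
lineage = {
    "mart.revenue": ["staging.orders.amount", "staging.fx.rate"],
    "staging.orders.amount": ["raw.orders.amount_cents"],
    "staging.fx.rate": ["raw.fx_feed.rate"],
}

def upstream(field, graph):
    """Return every field a given field transitively depends on."""
    seen = set()
    stack = [field]
    while stack:
        for parent in graph.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

print(sorted(upstream("mart.revenue", lineage)))
# ['raw.fx_feed.rate', 'raw.orders.amount_cents',
#  'staging.fx.rate', 'staging.orders.amount']
```

The same traversal run in the other direction (child lookup instead of parent lookup) gives impact analysis: everything downstream that a field change can break.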

Joe Reis:

Very interesting approach. Do you think DAGs hold some promise for the next generation of conceptual modeling approaches?

Martin Chesbrough:

I'd love to explore operational vs analytical data through the lens of data modeling - this might be more than a quick Substack-style post, but it touches on the idea that operational data (whether from an enterprise app, mobile/web app, IoT feed, or whatever) will eventually become your analytical data. Obviously techniques like Kimball/BEAM/Data Vault tend to dominate in the analytical field, but these models are easier to build if the operational data has already been well modeled, whether 3NF or not. This also gets into the territory that your mate, Bill Inmon, started from - capturing all operational data into a data warehouse before creating the analytical data marts. Bottom line: if your operational data is not modeled, then it makes it a whole heap harder to model the analytical data (IMHO)!

Martin Chesbrough:

If you go back to the "good ol' days" (I'm only saying this to tease others ;-) ), we used to build enterprise data models, which encapsulated the business rules in their data model. This made it easier to build data marts or OLAP cubes because you knew the "rules" that the source system had to obey.

Joe Reis:

Martin - Do you see this (or related approaches) making a comeback?

Martin Chesbrough:

Perhaps... judging by you and many others it feels like data modeling is staging a comeback ... but it won’t come back in the same form ... I think semantic modeling, knowledge graphs etc will start to dominate as they help us model concepts for AI better

Michelle Currie:

This!

Naresh Rohra:

Data modeling with respect to Data products

Curt Lansing:

One big problem I have is the lack of key constraint enforcement in many cloud data environments such as data lakes and Snowflake. I know there are ways to test that the data meets the logical constraints, but this is not always practical or cost-effective with huge data sizes. If you can't trust the data model upstream, it can make for a mess downstream.

Paul Zuradzki:

What books and papers to read on this subject. Before yours comes out :P

(looking forward to it)

Paul Zuradzki:

I’d be interested in:

- Data model and schema evolution.

- Schema integration and strategies (“you just acquired a company with overlapping or conflicting abstractions; now what”)

- How to rescue or change an already deployed model vs. starting from scratch.

- Migration strategies

- Ways to design for flexibility. Avoid designing or coding yourself into a corner.

- Portability of abstractions at the various layers (conceptual, logical, physical)

- ETL patterns and where the transformation to a data model should happen; trade-offs. E.g., source -> staging -> transformed vs. source -> transformed.

- Design decisions that affect key needs like the ability to backfill and do over-time snapshots.

- Elaboration on the Kimball/X “subsystems” and how to evaluate which tools are right for your context / resources.

- Ways to quickly generate ERDs past the conceptual modeling stage. E.g., I like using an ORM + ERD generator over clicky GUIs. Plus you get the SQL schema to play with the design.

- An analysis of how Kimball principles apply in a modern context (cloud DBs, cheaper storage, column stores, etc.)

- Data modeling tools; evergreen or not

- The people aspects of the data modeling process. What’s more important: upskilling domain experts in data modeling, or upskilling the data modeler in the domain? Who should lead the process once we get to the logical and physical stages?
