101 Comments

John Gilmore:

Before I even think about a data solution, I must first understand the data from a business perspective.

This means building a single conceptual data model; that is, a business-focused model of the data.

From here, I can start thinking about data solution design. In effect, this means building logical data models, which typically differ from each other depending on their purpose.

Think about it: a logical model for a dimensional solution will be very different to that for an OLTP solution. Yet they will both reference the same conceptual model.

Again, when we consider implementation, we might have multiple physical models for a particular logical model. These typically differ depending on the chosen platform. Think Oracle versus SQL Server for example.

So I end up with a hierarchy of models, each with a distinct purpose. There will only be one conceptual model, but it will have one or more logical models, each of which will have one or more physical models.
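
To make the hierarchy concrete, here is a rough sketch (purely illustrative, hypothetical names): one conceptual entity fanning out into different logical shapes, each of which could then have platform-specific physical models.

```python
# Purely illustrative: one conceptual entity fanning out into several logical
# models, each of which could then have platform-specific physical models.
conceptual = {"Customer": "places Order"}            # business terms only

logical = {
    "oltp_3nf": {                                    # normalised for transactions
        "customer": ["customer_id PK", "name", "email"],
        "order": ["order_id PK", "customer_id FK", "order_date"],
    },
    "dimensional": {                                 # star schema for analytics
        "dim_customer": ["customer_key PK", "name", "email"],
        "fact_order": ["customer_key FK", "date_key FK", "order_amount"],
    },
}

physical = {
    "oracle": "NUMBER / VARCHAR2 columns, partitioned fact tables",
    "sql_server": "INT / NVARCHAR columns, columnstore index on the fact",
}
```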

My feeling is that too many people jump straight in at the logical level without even thinking about the conceptual. What they’re doing is design, not modelling.

Joe Reis:

"My feeling is that too many people jump straight in at the logical level without even thinking about the conceptual. What they’re doing is design, not modelling."

Truth bomb.

I also see people jump straight to physical modeling, ignoring the conceptual and logical steps. You can imagine what happens from there.

Robert Sanderson:

> What they're doing is design, not modelling.

I need this on a t-shirt! So true, and so well put.

Phil Maddocks:

I do model data.

We typically look towards the Data Vault methodology to combine data from multiple sources around business domains. We'll speak to the business to understand how they see the data, which helps us understand what our hubs may be and how they link to one another (John Giles' book The Elephant in the Fridge is a fantastic read on this approach).

From there we'll create a high-level model connecting these hubs to each other using links. Then we'll look at the source data and bottom-fill this top-down designed model. On top of that warehouse, we'll build reports, facts/dims, and OBTs if required.
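
To give a flavour of the shape (hypothetical names, not our actual tables), a hub and a link look roughly like this:

```python
# Minimal Data Vault sketch with made-up names: hubs hold business keys,
# links connect hubs, satellites (not shown) would hold descriptive attributes.
HUB_CUSTOMER = """
CREATE TABLE hub_customer (
    customer_hk    CHAR(32)    PRIMARY KEY,  -- hash of the business key
    customer_bk    VARCHAR(50) NOT NULL,     -- business key from the source
    load_date      TIMESTAMP   NOT NULL,
    record_source  VARCHAR(50) NOT NULL
);
"""

LINK_CUSTOMER_ORDER = """
CREATE TABLE link_customer_order (
    customer_order_hk CHAR(32) PRIMARY KEY,
    customer_hk       CHAR(32) NOT NULL REFERENCES hub_customer (customer_hk),
    order_hk          CHAR(32) NOT NULL,     -- would reference hub_order
    load_date         TIMESTAMP NOT NULL,
    record_source     VARCHAR(50) NOT NULL
);
"""

print(HUB_CUSTOMER, LINK_CUSTOMER_ORDER)
```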

Joe Reis:

John's books are fantastic! Was actually just chatting with him the other day. Wonderful bloke.

Juha Korpela:

Wholeheartedly agreed with John Giles' book! Probably the most practical book on Data Vault, and it's a fun read!

Mohamed Souare:

We follow exactly the same pattern. I would say that data is first modelled in an initial layer using 3NF before being decomposed using the vault method.

Daniel Rothamel:

Yes! Elephant in the Fridge is fantastic!

Donald Parish:

Star Schema aka dimensional modeling using the Microsoft Power BI stack. Didn't actually study Kimball until about 10 years after starting. Wish I had, since I was missing Slowly Changing Dimensions (SCD) and Accumulating Snapshot Fact tables. I certainly drink the Kimball Kool-Aid. Found this interview last night: https://archive.org/details/sim_dbms_1994-07_7_8/sim_dbms_1994-07_7_8

Joe Reis:

cool, lemme watch this. Thanks Donald!

Donald Parish:

Also mag article from 1995. Interesting to see the thinking before the 1996 book came out. https://sigmodrecord.org/publications/sigmodRecord/9509/pdfs/211990.212023.pdf

Ash Smith:

We do - we loosely follow Kimball, and it's no silver bullet. We do have high-level standards but where we fail is modelling collaboratively across departments.

We try to approach it as a data product (e.g. not use-case specific, with reusability in mind).

Joe Reis:

"We do have high-level standards but where we fail is modelling collaboratively across departments"

Ahh yes...collaboration can be hard.

Rakesh Rikhra:

We found some synergy in collaboration when using concepts from BEAM. Maybe it's already been tried on your side, or it's worth a try.

Michael S. Armentrout:

Same situation here. Anyone done much with Unified Star Schema?

Donald Parish:

Don’t think the USS is going anywhere. Just watched presentations and read excerpts of the book.

Rakesh Rikhra:

Would be interested to know why, because I see it adds a lot of practical flexibility for us, which has been quite useful.

Shagility:

I do model

I use a modified version of Lawrence Corr's BEAM pattern to first create a Business / Conceptual Data Model based on Core Business Concepts and Core Business Events (Core Business Processes).

Once I have agreement that they represent how the business sees its data, I then move to a Physical Data Model based on a modified version of the Data Vault model, hydrate the Concepts and Events, and then add the Details.

That process automatically creates a series of One Big Table models that can be consumed.

I then work to fill the gaps with inferred Details, Conformed/Master Concepts, calculated Measures and Metrics and custom Consume Views.

Joe Reis:

Very cool. I keep seeing BEAM pop up in this thread. Cool to see it being adopted.

Shagility:

It's been one of my go-to patterns since I began blending agile, product and data patterns.

Robert Sanderson:

We model our cultural heritage objects and knowledge using the knowledge graph paradigm, and international standards (with community and local extensions) for cross-institutional interoperability. We use a high level conceptual model, encoded as ontologies and made specific to our use cases and domains with taxonomy/vocabulary. (This is subsequently encoded as JSON-LD, chunked according to graph boundary constraints, and published using REST as a web API)

Our approach is to identify the entities and their functional relationships, then map those entities into the model according to existing patterns. If a use case doesn't match to an existing class, relationship or pattern, we try to abstract it until it does, and then assess the value of being more specific compared to clarity via taxonomy. If the same case is repeated at multiple institutions, we try to collate them into new patterns for the community to adopt in their related work.
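
For anyone unfamiliar with the encoding, a generic JSON-LD record (a toy example using schema.org terms, not our actual Linked Art model) looks roughly like this:

```python
import json

# Toy JSON-LD record using schema.org terms: @context maps terms to ontology
# URIs, @id identifies the entity, @type names its class. Illustrative only.
record = {
    "@context": {
        "name": "http://schema.org/name",
        "creator": {"@id": "http://schema.org/creator", "@type": "@id"},
    },
    "@id": "https://example.org/object/123",
    "@type": "http://schema.org/CreativeWork",
    "name": "Example painting",
    "creator": "https://example.org/person/456",
}

print(json.dumps(record, indent=2))
```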

Agree with Ash Smith on approaching data (or indeed knowledge) as a product, with an engineering mindset and reuse/sustainability always in mind. Perfection and slavish completeness are the enemy of both.

Abhinav Goyal:

Sounds super interesting. Do you have a publicly available example of this? Thanks

Robert Sanderson:

Sure! The knowledge graph data is all publicly accessible here: https://lux.collections.yale.edu/ and the documentation (still in progress) for the model and API is here: https://linked.art/. All of the images are served via IIIF: https://iiif.io/ The code is being cleaned up and will be available as open source in the next month or two. From any view on the graph, you can see the data that was used to construct it (aggregated from about 20 different sources), and collection objects have links back to their origin sites at the different museums and libraries.

An overview of the model and modeling practice: https://www.slideshare.net/azaroth42/understanding-linked-art

And my favorite media piece about it from when we launched last June: https://www.popsci.com/technology/yale-lux/

Abhinav Goyal:

Thanks, Robert. Will examine, and if you don't mind, ask questions. :)

Robert Sanderson:

Absolutely. Happy to discuss if that's okay with Joe -- don't want to hijack the topic :)

Joe Reis:

Discuss away!

Dave:

I data model and document the business data's semantic content, which then allows me to select the physical implementation. As many have commented, you use specific design approaches based on what the objective is (i.e. CDMs to support MDM, LDMs to support transactional needs, BI/DW/data lakehouse needs, etc.). My goal is to digitally capture all business names, definitions, PKs, AKs and relationships (the design view) as well as the physical implementation, which can differ from the design view. All of this is consumed and integrated into the Data Catalog to support increased velocity in serving various needs.

Abhinav Goyal:

What tools do you use for the MDM/Data Catalog?

Dave:

I have used IBM Watson (Infosphere), Informatica EDC and now Atlan. For MDM, I have used Informatica Multi-Domain MDM. For data modeling presently using IDERA ER Studio Data Architect and have used ErWin primarily. Starting work on integrating data models with Stardog KG. So much to think about!

Abhinav Goyal:

Wow. Any words of advice re: strengths and weaknesses of these various tools?

Curt Lansing:

I do model data.

I try to stay general/abstract when it comes to data modeling, instead of modeling too specific to the current version of the application. I find this approach is more flexible and able to adjust to changing application features.

Joe Reis:

How often do your application features evolve?

Curt Lansing:

In my experience, as part of an agile development team, features are changing or added quite regularly. I feel like I'm always working on stories that are part of some epic feature, and when that epic is complete, on to the next.

To be clear, I'm referring to my experiences working with OLTP applications.

Abhinav Goyal:

Which application(s) do you use, Curt?

Curt Lansing:

Hi Abhinav, if you are referring to the applications I mentioned in my comment, I'm referring to custom applications.

Abhinav Goyal:

Got it. Thanks, Curt! :)

Johnny Winter:

Yes, I model data.

I work as an analytics consultant, so predominantly Kimball style modelling, though sometimes with an OBT as a mart on top.

I use a collaborative data modelling framework that encourages business/end users to design a conceptual model based on their needs and understanding of their data

Joe Reis:

"collaborative data modelling framework". Can you elaborate on this? Someone else in this thread mentioned they're having challenges with collaborative data modeling.

Johnny Winter:

Hoping to elaborate on it a lot... Currently planning a book, if you happen to know any publishers that might be interested 😜 - it's ultimately a set of workshops aimed at a non-technical audience to draw out a conceptual model on a whiteboard, based on a specific business event and the "6 honest serving men". It's designed to think about data not from an analytical requirements perspective, but more from a descriptive basis, which helps give a more holistic view... Capturing it all in a short Substack reply will probably be a challenge.

Joe Reis:

Yeah, HMU about the book. Happy to chat

Juha Korpela:

I model data everywhere I work :) For me the most important thing has always been understanding and modeling the "business objects" before I jump to any conclusions about Kimball or Data Vault or OBT or anything. Need to know what the reality looks like before designing a solution! Thus, I always start with Conceptual modeling. This gives an overview of the domain we're dealing with, and it's infinitely reusable. My approach is also to make this As Simple As Possible - no complex methodologies or notations, just entities and relationships in a purely business-language diagram that anyone can understand.

Kimball/DV/etc is then a solution-specific Logical data model, which depending on the project/product might change drastically - even though the Conceptual level model stays more or less the same. The Logical model is thus derived from the Conceptual model, and then, if the chosen technology platform requires some kind of optimization from the model structure's part, the Logical model is turned into a Physical model for that particular technology.

Joe Reis:

Hi Juha. You're definitely the conceptual model master, in my book!

Juha Korpela:

Much obliged - like all of us, I'm standing on the shoulders of giants!

Jason Dexter:

@juha/joe: do either of you have any resources you recommend for learning more about conceptual -> logical -> physical modeling? (I'd love to study some "great" examples of what this looks like, especially the conceptual bit.)

Kenneth Myers:

Would also love to get some good resources here!

Joe Reis:

For some classics, I suggest books from John Giles, Len Silverston, and William Kent.

Jason Dexter:

appreciate the info, will queue one of these up for the new year - getting ready to focus on data modeling for 2025 💪

Zach Loertscher:

How are you choosing between Kimball and Data Vault? I've heard you only need Data Vault in highly regulated environments (e.g. banking/healthcare), but I'm curious how you go about making that decision.

Juha Korpela:

Well, the first question is "what am I building"... Data Vault is a good choice if it's a large-scale Data Warehouse (or Lakehouse or whatever you want to call it - a centralized, integrated, well-modeled and -governed place for enterprise data!). With all the modern Data Vault automation tools, it's not much overhead to go for DV in any industry (and indeed I've seen them in all kinds of places). To put it in very very simplified terms, I think Data Vault is always a good choice if you're doing data warehousing and you have multiple source systems with some degree of complexity.

I don't actually personally believe that Kimball is a solid method for large-scale reusable solutions: the individual "stars" are too use-case-specific. If the scope is smaller, then sure - and in any case I do use star schemas with the Data Mart layer on top of a Data Vault, for example. But this is maybe a matter of personal preference ;)

But not every data product is a data warehouse or a mart! If the use case calls for a certain set of data to be fed into an ML model or something like that, it's going to need something else than a Vault or a Star. It's always like that: figure out the need, figure out what I'm building, and then choose a suitable "shape" - your flavor of logical model for this particular piece of data.

That being said, of course I never *start* with choosing logical model - always with conceptual first ;)

Zach Loertscher:

This is fantastic insight - thanks for your detailed response! Like you said, there are so many automation tools out there to help with getting Data Vault set up. We predominantly use star schemas, but I've run into the brittleness of SCD2s and always having to join fact tables. This feels like a mature approach!

Adrian Sibilla:

Lots of chat about approaches to data modelling, and herein lies the reason for its demise: there is no clear articulation of why we model, and to what end. Data modelling can be a completely different process depending on the outcome you're looking to achieve. Kimball, Vault and OBT mean absolutely nothing to the business (and shouldn't), but they are table stakes for data designers and engineers. A business model that describes what data is available for consumption and what we need to govern will resonate more with the business, but for data engineers and data scientists under the pressure they are under, not so much.

Seems to me that as a community we need to be able to clearly articulate and monitor the business value of all types of modelling via metrics and KPIs - not unlike DORA metrics in DevOps. If we can demonstrate the tangible benefit to the bottom line, it would get all the funding needed.

I have high hopes for machine learning these days. If ChatGPT is able to write a PhD paper in the style of Ozzy Osbourne, why can't we expect it to reverse-engineer a model from a petabyte of data, given an ontology as an input framework? Value.

Joe Reis:

You must be a mind reader. I was literally writing last week about how poor articulation of the value of data modeling is a contributing factor in its demise. It's part of the larger "business value" conversation that's back in vogue these days in the data world. Thanks!

Darren Lynch:

I don’t do any data modelling in my current role. Really excited to see what I can learn from the wealth of experience in this community.

Joe Reis:

Very cool

Dilip:

Yes, we do traditional data modeling (CDM-LDM-PDM) for critical products. We have flavors of Kimball in some areas and OBT in others, depending on what we are trying to achieve and the consumer base. We do data modeling to bring consistency and harmony into the data space.

Joe Reis:

Very cool. What determines a critical product?

Ben Labenski:

Yes, I model data!

At two different companies, a blend of Kimball & entity-centric has worked wonderfully. Probably 80/20, respectively, with basic Facts & Dimensions as the core components, and a few deviations where logical and generally agreed upon (e.g. denormalization).

Joe Reis:

Are there any criteria for choosing certain approaches?

Ben Labenski:

Kind of!

For the core components - having Fact and Dimension coverage for what is analytically valuable (what we'll call the 'Kimball Foundation') - that's pretty cut and dried.

There is some choosiness when going a step further with the entity-centric stuff (which, from what I understand, still tangentially maps to a lot of Kimball methodologies). Here, one starts modelling facts 'around' entities.

Determining what entities one should be modelling around (i.e. the entities that folks are interested in) is where some of the "choosiness" comes into play. Generally, though, I would say that there are some indicators that make these choices fairly easy:

1. What dimension tables do you have already? Those are probably the entities people are most interested in.

2. What foreign keys exist in a given fact table? Those FKs represent entities you could aggregate the fact up to, thereby making it 1:1 joinable to the dimension table(s) from (1).

One could refer to this as almost building a "reverse star schema" (where a dimension table is at the center, and a bunch of aggregated-fact tables to the dimension's grain are the star points).

Longwinded way of saying there's choosiness in terms of what one might want to represent in what we'll call a "reverse star schema", and that choice (though often indicated by what's already in your data warehouse) mostly depends on what the business determines as analytically valuable.
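
To make the "reverse star" point concrete, a rough sketch (hypothetical table names, not a real warehouse): aggregate a fact to a dimension's grain so it joins 1:1 back to that dimension.

```python
# Hypothetical names: aggregate a fact to the customer grain so the result
# joins 1:1 to dim_customer - one "point" of the reverse star.
ORDERS_PER_CUSTOMER = """
SELECT
    f.customer_key,
    COUNT(*)            AS order_count,
    SUM(f.order_amount) AS lifetime_revenue
FROM fact_orders AS f
GROUP BY f.customer_key;  -- one row per customer_key
"""
print(ORDERS_PER_CUSTOMER)
```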

Aditya:

I do model Data.

Loosely stick to Kimball Dimensional modelling approach. Basically understand the business needs, identify measures and attributes, group them into facts and dimensions and connect using star schema.

Joe Reis:

very cool

Kalle Bylin:

I also model data.

Similar to another comment above, I lately also start with a modified and very simplified version of Lawrence Corr's BEAM pattern, which also has a lot of similarities with a Kimball enterprise data warehouse bus matrix.

For me it is critical that I conceptually understand the big picture and the business processes that generate the data. I use whatever artifact works to ensure alignment with business stakeholders and software engineers, and the tools above have resonated quite well instead of a traditional ER conceptual diagram.

Identifying core concepts and processes early, even if they are outside of the initial scope, is very useful to detect potential future architectural breakers.

With this in place, the next step is often a hybrid between a logical and physical model. Drawing things out before coding helps a lot (and is much cheaper to modify), but purely physical models tend to change quite a bit during implementation, so I don't find it worth the effort to draw out every single detail upfront and maintain these detailed diagrams.

In terms of modeling methodology, I'm not married to a single method. I find it more valuable to use the patterns that match the project at hand. For example, in a recent project using a medallion architecture we ended up with something resembling OBTs in the silver layer, with more classical dimensional models in the gold layer. It is worth noting that the OBTs were based on a pre-existing normalized data model used in the application though, not created from scratch.

Joe Reis:

"For me it is critical that I conceptually understand the big picture and the business processes that generate the data."

How do you go about getting this big picture?

Kalle Bylin:

This is something I'm still iterating on, but a few years back I was wearing the data scientist and product manager hats simultaneously at a company I worked for and periodically did customer interviews. I continued using some of the practices as a data engineer to build a conceptual big picture for data modeling.

One of the main practices was user story mapping. I think I felt the same frustration with ER conceptual models as Jeff Patton did with the typical flat backlog of user stories.

Benefits:

- Keeps the focus on user and customer journeys

- It is visual and offers most of the benefits of storytelling, especially context

- Widely adopted by product teams. So in many cases it has already been created or at least it offers a shared framework for us to work with.

Still, user story maps usually have much more information than what is needed for a data model, and each map tends to represent one of many different journeys in the case of an enterprise.

That's where I found Lawrence Corr's framework interesting. He presents data stories as comparable to user stories. So lately I have used the event matrix from the BEAM framework, mapping the user story map to the events (rows of the matrix) and using it to define the dimensions (columns of the matrix) answering the who, why, what, where and how.

Conceptually, this ends up looking like a model of the business and is grounded in the same language and frameworks used by the rest of the product teams, so we have shared understanding. It also makes it easier to identify conformed data concepts across the different processes and journeys.
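
As a toy illustration (invented events and dimensions, not a real client matrix), a couple of event matrix rows might look like this:

```python
# Invented example of a BEAM-style event matrix: business events as rows,
# the "W" questions as columns that become candidate dimensions.
event_matrix = [
    {"event": "customer places order", "who": "customer", "what": "product",
     "where": "sales channel", "when": "order date", "why": "promotion",
     "how": "payment method"},
    {"event": "customer returns product", "who": "customer", "what": "product",
     "where": "store", "when": "return date", "why": "return reason",
     "how": "refund method"},
]

for row in event_matrix:
    print(row["event"], "->", [v for k, v in row.items() if k != "event"])
```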

So, not completely satisfied with the artifact yet, but I do think 1) we need stories to build conceptual understanding of the big picture and 2) the stories should be focused on customers and users (internal and/or external).

Daniel Rothamel:

I work at a data consultancy, and yes, we do model data. We also work hard to educate our clients about the importance of modeling for achieving the analytics goals that they have.

We utilize Data Vault and a business-driven approach, as opposed to a source-driven approach. This enables the business to get immediate value out of the modeling effort and enables the model to grow as necessary.

Joe Reis:

"We also work hard to educate our clients about the importance of modeling for achieving the analytics goals that they have."

How do you educate the client on the importance of modeling? This is one of THE big questions in the field. Thanks.

Daniel Rothamel:

It's definitely been a bit of a moving target that sometimes we hit, and sometimes not so much. lol.

Overall, we make sure to educate the client as to *why* we do what we do, not just how we do it. So that begins during the presales and sales motions. We let them know that our philosophy is to build a cloud data warehouse that can be BI-tool independent. The only way to really achieve that is to model the data *in the warehouse* and not in the BI tool.

One of the most common ways our clients find us is because they're suffering the pain of being caught in the trap of having a ton of business logic in their BI tool or Excel or Google Sheets. Once we alert them to the fact that this is what is causing their pain, they're more open to the remedy of modeling.

We've also recently started using the metaphor of building a farm, not a car. A lot of data pros think of modeling like building the engine of a car. In that case, you have to completely build the engine (the modeling) before starting the car and driving (using the DW for analysis). This creates all kinds of problems, which you well know.

We tell them that we want to approach it like building a farm (or a garden), where we are going to plant some crops over here so that they can begin harvesting, and then we'll add other crops, and we can go back to the first patch and do some weeding, but they'll always be harvesting along the way. And the farm will keep growing and producing along the way.

And on a practical level, we're actually adding a clause in our SOW that specifically addresses how we do the modeling and that we WILL be doing modeling. It is not optional. lol.

Tim Hiebenthal:

We do model data by loosely following Kimball, but instead of delivering facts and dimensions we add the metrics directly to entities. This is similar to what Max Beauchemin introduces as "Entity-Centric" modelling:

https://preset.io/blog/introducing-entity-centric-data-modeling-for-analytics/

We believe that it's a bit easier for business stakeholders to grasp, interpret and use.

Additionally we then build further "flattened" datasets (e.g. adding all the product- and customer- metadata to the orders and potentially pre-aggregate) to simplify analytics
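
As a rough sketch (made-up columns, not our actual model), the idea is that metrics get aggregated onto the entity itself:

```python
# Made-up columns: metrics are aggregated onto the customer entity itself,
# so analysts query one wide table instead of joining facts and dimensions.
CUSTOMER_ENTITY = """
SELECT
    c.customer_id,
    c.signup_date,
    c.country,
    COUNT(o.order_id)   AS order_count_all_time,
    SUM(o.order_amount) AS revenue_all_time,
    MAX(o.order_date)   AS last_order_date
FROM customers AS c
LEFT JOIN orders AS o ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.signup_date, c.country;
"""
print(CUSTOMER_ENTITY)
```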

Zach Loertscher:

That article is fantastic - thank you for sharing!

Joe Reis:

Curious how much adoption the entity-centric model is getting?

Tim Hiebenthal:

We like it a lot and have used it for several years (since well before Max posted the blog post).

A few big benefits I see:

a) you end up with fewer entities/tables in your data model, because you don't need a dedicated fact for every process

b) you can share the conceptual data model with business stakeholders and without much technical knowledge they're able to understand the relations between client <-> subscription <-> invoice etc.

c) it makes e.g. Reverse ETL pretty easy because you already have a lot of metrics in your customer-entity (same goes for Data Science projects as described in Max's blog post)

But as a disclaimer: we work with start-ups and scale-ups, not with huge enterprises.

But for our "scale", I consider it a very good approach

Melinda Hodkiewicz:

We are currently working to reduce the errors in complex equipment selection through modelling equipment, material and process concepts. The data is largely held in international engineering standards. We are aligning our modelling with the Industrial Data Ontology (IDO). This is an OWL DL upper ontology. Currently open and available on the web but on a pathway to ISO standardisation. The group involved includes asset owners, engineering design consultants, and equipment suppliers. The use of an agreed upper ontology will allow us to leverage previous data models built by group members and also be usable across the multiple organisations. The use of an upper ontology constrains modelling decisions. In addition one of the live discussions we are having right now is how to find a balance between having for example, very detailed object properties (good for reasoning) vs more generic object properties (easier to train the software engineers who will have to use these entities and relations). No easy answers here.

Abhinav Goyal:

How are you finding working with OWL? Do you also use RDF reasoners? Also, how’s that going, if you don’t mind sharing. Thanks

Joe Reis:

Got the same question

Wiliam Theisinger:

Yes, I am modeling data all over the place in my role. In cloud data platforms, storage layers, ML pipelines, ML ops processing flows, loading data, event based systems, multi-cloud environments and my favorite (sic), using DMS schema migration to move from engine type 1 to engine type 2.

It may come as no surprise that I approach it based on user requirements, environmental and technical limitations, technology (the second primary focus after user requirements), money, time... I'm sure I'm missing other considerations.

Joe Reis:

👍

Wiliam Theisinger:

A lot of interesting insights on this thread. I feel my primary deviation from leading with the model is that I lead with the technology. Sometimes the tech is not a choice you can make. For example, I would avoid dimensionalization on a column-oriented DB engine, just based on all the testing I have personally done on performance and maintainability. In addition, there are simple principles like pushing structure low and understanding usage on read and write, which informs the data patterns you would need to create for salting, pruning, etc. When someone says "model" to me, that means a ton of things, from schema design and enforcement in event platforms to, yes, ultimately the model that lives in more user-facing platforms. So, from my perspective, "model" is really subjective, depending on the ownership one has over the complete platform design.

Abhinav Goyal:

Answered this in some detail - so just posted the link instead of answering again.

Self-promotion, a tad bit. :)

Cheers

Dave A:

Guilty of going straight to the LDM and not keeping the CDM as a separate artifact - often doing both at the same time that way. By LDM I mean a 3NF (kind of) representation of entities, including their attributes and relationships, often with many-to-many relationships already resolved, which further adds to the complexity and decreases readability, I guess. I think it would be good to keep CDMs separate and then add multiple LDMs for different use cases, contexts and design patterns.
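
For anyone who hasn't seen it, resolving a many-to-many usually just means introducing an associative entity - a minimal, made-up sketch:

```python
# Made-up example: the many-to-many between student and course is resolved
# with an associative entity (enrollment) holding the two foreign keys.
ENROLLMENT = """
CREATE TABLE enrollment (
    student_id  INT  NOT NULL REFERENCES student (student_id),
    course_id   INT  NOT NULL REFERENCES course (course_id),
    enrolled_on DATE NOT NULL,
    PRIMARY KEY (student_id, course_id)
);
"""
print(ENROLLMENT)
```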

Adrian Sibilla:

@Joe Reis - would be good in your travels across data modelling to look into the world of feature tables. They seem to have become the next silver bullet in the ML world, and I'm seeing some conceptual modelling practices being put around this as well.

If anyone in the community knows of some good content around this topic please share 🙏

bryangoodrich:

I do work somewhere, and I model data 🤓

If it’s greenfield, I approach it with a Lean mindset and workshop with my direct stakeholders to build out a rapid prototype MVP to discover if its valuable. Otherwise, workshops are geared towards business understanding to align existing product with need to identify value.

We use a Kimball approach. Some of our data is in a lake, and I’m trying to push more into that space, but I’m really the only developer (data engineer) on the BI team. I’ve used some design features from Vault and temporal tables in the warehouse, and OBT in the lake to make life easier. We’re struggling with trying to handle more real time data in a very 90s style IT world (SAP shop) without actually doing anything streaming. But we have grants to get 1 minute consumption data soon! 😂

Dilip:

The business value the product creates, and its size.

But for all other products we do follow fundamental standards such as naming, data types, keys, definitions, etc. In large data lakes / lakehouses, just the naming and definitions go a long way.

Jordi Puig:

We organize our data warehouse following Kimball dimensional modeling. Usually the data extracted from the source goes directly to a fact or dimension table, with little "modeling" involved in this step. Relation tables are filled manually, roughly once a year. After that, some processes are in place to transform the data to serve specific dashboards. This modeling is designed in an iterative way and is usually overlooked by the team.

I would say we model data, but we don't usually give it enough importance.

paulx.eth:

I try to model the data I'm building pipelines for. I'm inclined to start with Kimball, but have found myself going very bare-bones.

Which is to identify discrete tables for different datasets, identify primary and foreign keys, and get a sense for the granularity of each row. Connect the tables through primary/foreign keys visually in an entity-relationship diagram (Mermaid on Notion). And that's it.

Open to adding more complexity, but this currently works for both devs and analysts alike.

Joe Reis:

interesting. How do you approach the data models for devs and analysts?

paulx.eth:

I find the model approach I described above broadly sufficient for the kind of analysis we're doing (so far).

For devs, it's been more about using SQLAlchemy as an ORM and making changes to database tables that way (instead of mixing raw SQL into Python code). I guess that's more an issue of code maintainability, rather than pure data modeling per se.
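
For context, a minimal SQLAlchemy declarative model of the kind I mean might look like this (table and column names invented):

```python
from sqlalchemy import Column, Integer, String, ForeignKey, create_engine
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

# Invented tables: schema changes are made against these classes
# instead of scattering raw SQL through the Python code.
class Customer(Base):
    __tablename__ = "customers"
    id = Column(Integer, primary_key=True)
    name = Column(String(100), nullable=False)
    orders = relationship("Order", back_populates="customer")

class Order(Base):
    __tablename__ = "orders"
    id = Column(Integer, primary_key=True)
    customer_id = Column(Integer, ForeignKey("customers.id"), nullable=False)
    status = Column(String(20), default="new")
    customer = relationship("Customer", back_populates="orders")

engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)
```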

Mike:

It's a basic technique, but for data visualization I like to start from the dummy of a final model and then work back through the stages of modelling I would need all the way to the raw data.

Joe Reis:

Start with the end in mind. Very smart.

Benjamin Cogrel:

We model using RDF-based virtual knowledge graphs, where relational data sources are mapped to classes and properties using R2RML mappings. Classes and properties are modelled in RDFS or lightweight OWL ontologies.

Joe Reis:

Very cool

Oskar Lindberg:

Speaking generally for Swedish municipalities, which I represent: data modelling is generally not a big topic (yet - the data-driven organisation is slowly and steadily approaching due to several factors such as tougher economies, AI on everyone's agenda and the advent of the smart city), but those that do model generally go for a Data Vault model (I presume for audit reasons and the huge number of steadily changing source systems). A lot of our data is semi-/unstructured though (vast amounts of physical documents and a rapidly increasing amount of IoT data), which needs somewhat different methods to handle.

Joe Reis:

interesting. Are Swedish cities moving toward being "smart cities"?

Oskar Lindberg:

Well, it's in several strategies, so on paper at least ;-) I guess the same thing holds true in other regions as well, due to the hype around IoT, digital twins, data-driven decisions, etc. We have a way to go, and other European cities such as Amsterdam and Barcelona are known to have come further (whatever that means, given the vague definition of what a smart city actually is). Good thing though that it is on the agenda and has also garnered the interest of all kinds of different actors within the city (private, public, etc.), stimulating new kinds of ecosystems and business models. About time the public sector takes center stage in the digital transformation of the world - urgently needed, in my opinion ;-)

Paul Johnson:

Hi Joe,

Here's my basic approach, given I'm an external consultant. I try to have as lightweight an approach as possible when it comes to workshops with my clients; I'm sensitive to how much time pressure people are under and how expensive workshops can be. Although in some orgs people absolutely love a meeting 😂

1. I first determine the type of use case (BI, data science, transactional, etc.); the physical model will be optimized for the most common workloads.

2. In my line of work it’s 90% data warehousing, so I’ll explain that. I use a technique similar to sun modelling to capture the requirements. This acts as my conceptual model although I don’t use that term.

https://www.synvert-tcm.com/blog/capturing-requirements-using-sun-models/

3. I’ll explore the source systems to find where the data comes from, how often we need to extract etc. If there are many systems for the same. This helps me determine if there’s a data vault with star schema or just a star schema.

4. Next I create a lineage diagram that shows how fields flow from source to target model; this actually helps create the physical model, so I do these at the same time.

The lineage diagram is extremely useful for a developer.
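
One lightweight way to capture that source-to-target lineage (field names invented for illustration) is a simple mapping list:

```python
# Invented field names: each entry records where a target field comes from
# and any transformation applied on the way, which doubles as developer docs.
lineage = [
    {"target": "dim_customer.customer_name",
     "source": "crm.customers.full_name",
     "rule": "trim and title-case"},
    {"target": "fact_sales.sales_amount",
     "source": "erp.order_lines.net_amount",
     "rule": "sum per order line, converted to EUR"},
]

for row in lineage:
    print(f"{row['source']:30} -> {row['target']:30} ({row['rule']})")
```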

Anas Tina:

Yes, I do model data.

I try to think of high level design (Concepts & Relationships between them)

I think in terms of: PEOPLE go to PLACES to do THINGS in TIME.

Who is doing what, where, and when?

High-level designs are simple and keep people on the same page, I think.

For the tooling at this stage: I use Entity-Relationships (ER) diagrams.
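
A toy illustration of that framing (entity and relationship names invented):

```python
# Toy high-level model: PEOPLE go to PLACES to do THINGS in TIME.
entities = ["Person", "Place", "Activity", "TimePeriod"]

relationships = [
    ("Person", "visits", "Place"),
    ("Person", "performs", "Activity"),
    ("Activity", "happens_at", "Place"),
    ("Activity", "occurs_in", "TimePeriod"),
]

for subject, verb, obj in relationships:
    print(f"{subject} --{verb}--> {obj}")
```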

Mohan:

We are a small data team, and our approach to modelling since we recently moved to the Lakehouse (Databricks) is driven by the pain points we were facing with on-premise data warehouses.

We expose all three layers Bronze, Silver and Gold to our users and also provide the super users with their own Databricks workspace and data(lake)bases to store their outcomes. The self-serve code and data models are left to the super users for management and governance.

Bronze layer - we use it as a fast lane to onboard new data sources and model it like ODS.

Silver layer - our intention is to build star schemas, however we take practical and user-experience considerations into account for modelling. We emphasise that this is the slow lane and requires strong collaboration with the domain experts.

Gold layer - we pitch the gold layer as the means to achieve specific/complex use cases, summary/composite facts, etc. We don't set modelling expectations in this layer, as the main intention is to achieve the outcome.

We are still on this journey; very few users have the patience to build the silver layer. Some super users get excited about building their own silver/gold layers in their own data(lake)bases, however after many months they realise that it's not an exciting or easy task, especially keeping it running and monitoring for quality.

Mohan:

A few more points from my observations in the industry I work in:

- There are very few people outside of IT who see the value in data modelling

- Data modelling as a term is not understood clearly (thanks to Power BI)

- We buy and deploy most of our corporate and operational applications, and it is hard to find modelers/engineers who have the patience to understand the varying types of source data models and build data models driven by business processes

- Data modelling used to be a skill that bridged business domain knowledge and ETL engineering; however, we all followed Silicon Valley and were made to believe that data modelling is just a nice-to-have skill

- I've been working in the same company for 10 years. Some of the problems I solved early in my career using classic data marts are still a challenge for the modern lakehouse and data engineers. I believe the main reasons are the lack of interest in thinking long term, lack of business analysis skills, and short attention spans.

[Comment deleted, Jan 11, 2024]

Joe Reis:

awesome
