Over the past year, I've traveled the world several times, giving talks and, most importantly, trying to figure out what's happening in tech and data. I've met countless folks in different places, across various roles and titles. Things could be looking better. No matter where I go, I see the same problems popping up. Whether it’s data for software, analytics, or machine learning, the consistent theme is that working with data is hard and getting more complex.
Applications go haywire because of clunky and inefficient data models. This application data is thrown over the wall to data engineers, analysts, and data scientists, who spend too much time making this messy data useful. Analytics is often a Rube Goldberg machine of ad hoc DAG workflows that give conflicting answers depending on which workflow ran which query. Machine learning models also suffer from this lousy data situation, making wrong predictions because of poor training data. Crappy data is everywhere.
No matter the use case, data is often malformed, full of errors, or unusable. This creates insanely high technical, data, and organizational debt. So, what's causing all this chaos? I've thought about it a lot. Is it a problem with the tools and technology we have? We've got some pretty incredible data tools, way better than back in the day. But that's not the main issue. Are people just being careless or malicious with their data? I don't think so. Maybe careless, but probably not malicious. Folks are generally trying their best.
As I thought through the root cause of why data was so hard to work with, data modeling kept coming to mind. I decided to check if my hunch was correct.
Writing a Book in 2023 - Detours, a World Tour, and Learning a LOT
After Fundamentals of Data Engineering went to print in the Summer of 2022, I got many questions from data practitioners wondering why our section on data modeling was so short. I agreed with them. Our coverage gave a sufficient survey of data modeling, but such an important topic needed a far deeper dive. My training in Lean back in the day got me to investigate the upstream applications generating the data that analysts and data scientists depend on. Sure enough, application developers and software engineers also ignore data modeling. The only time to get data right is when it's created. Every downstream use of that data is, at best, an attempt to duct-tape and glue it into something useful. This data must reflect the organization's business processes, rules, vocabulary, and information flows. Somehow, data didn’t do that. What a mess.
In early 2023, I started working on a book about data modeling. Word got around, and many people loved that I was addressing this topic. A few people hated it. Some people felt that data modeling was a solved problem and felt threatened that I was writing a book. “How dare Joe intrude on our sacred turf?” There’s a lot of gatekeeping, and there are plenty of religious wars, in data modeling. Others felt that data modeling was a waste of time. “Why bother with all of this data modeling ceremony and nonsense? Just query the data ad hoc and move on.” Those are…interesting…perspectives, for sure. But I wasn’t satisfied with gatekeeping (I’m rebellious), and if ad hoc queries are the solution, why is the global data landscape such a junk show?
I originally planned to finish my data modeling book in the Summer of 2023. Then two things happened. First, a month before I started my book, ChatGPT came on the scene. As an author, I faced an existential crisis. ChatGPT caused me to pause and understand the nature of LLMs for writing and data modeling (you’ll find my answers in my upcoming book). Second, the awesome folks at deeplearning.ai asked me to spearhead a data engineering specialization with them. Work with Andrew Ng and his world-class crew on what will be a fantastic course? Heck yes. You’d take that opportunity, too. Stay tuned for details on that, by the way.
Book writing is a notoriously solitary, almost black-box affair. All too often, books are written in silence. Then, they are released to the world, which either accepts or rejects the book. I don’t really like this approach. Also, given the somewhat controversial nature of data modeling, I wanted to test the ideas of my new book in public. In 2023, I traveled the world many times over, giving some version of a talk called “Data Modeling is Dead. Long Live Data Modeling!”
As it turns out, data modeling is alive. People loved the message of my talk - bring back data modeling. The tour allowed me to talk with countless amazing people to understand their challenges with data. This world tour was necessary to get a proper, on-the-ground understanding of where data modeling is today. It would be impossible for me to replicate my efforts in 2023, and I’m thankful to everyone who invited me to speak at their event. These speeches and interactions were exactly what I needed to figure out the content of the data modeling book and the path to making it most beneficial to the world.
What is Practical Data Modeling?
Practical Data Modeling is where I’ll drop early-release chapters of my book, videos, podcasts, and articles, and where I’ll foster a community of data modeling enthusiasts and practitioners. My goal for Practical Data Modeling is to be the platform for people to learn and understand data modeling - what it is, why it matters, and the various approaches you can use across different use cases. Along the way, I hope we can make new friends and help grow each other’s skills and knowledge. Practical Data Modeling needs to push us all forward. If we can grow, the industry will move forward.
Why not just write a data modeling book?
First, the nature of a book is different in 2024. LLMs changed that. For example, I see many knockoffs of Fundamentals of Data Engineering on Amazon, all of which read like they were generated in a few hours with ChatGPT. Wildly, Amazon recently instituted a rule that authors can only publish three books per day! Let that sink in. It often takes several months to many years to write a book. Now, anyone (or anything) can crank out several books daily. The bar for a mediocre book is whatever LLMs can produce. There is already a flood of mediocre books, and it will only worsen with the flood of AI-generated ones. Curating and discovering great, helpful, and important books will be both easier and more challenging. Easier because the standards for a book will be so low that it will be easier to rise above the competition. More challenging because people will need to find your book amongst a sea of crap.
Second, on a deeper level, a book is about an audience embracing the author's ideas and motivations. A book on its own is just a book; it means very little until its ideas capture an audience. The author is merely a conduit for ideas the audience discusses, debates, accepts, and tosses away. While we wrote Fundamentals of Data Engineering, Matt Housley and I built the audience around the book's ideas through our weekly podcast, The Monday Morning Data Chat. The audience loved some ideas and trashed others. So it goes.
Eventually, others invited us onto their podcasts and interviews to discuss our upcoming book. Suddenly, several months before our book was published, Fundamentals of Data Engineering was the number one new release on Amazon in several categories. Talk about writing a book under pressure. Based on our audience feedback, the book couldn’t just be good. It had to be excellent. The success of Fundamentals of Data Engineering speaks for itself. It’s still an Amazon best-seller in several categories, one of O’Reilly’s top books, and has captured the love of countless people worldwide, with daily posts from fans of the book. To say that I’m humbled is a vast understatement. Every day, I wake up grateful that my first book was a success. It will never get old. What I learned is that it's key to build the book in public and bring the audience along for the journey.
Lastly, the way we approach the topic of data modeling needs revision. The classic ideas in data modeling (relational, dimensional, Data Vault, etc.) are all relevant, but the last significant idea arrived in the early 2000s with Data Vault. Back then, the internet was still making inroads, and data didn’t impact people’s lives the way it does today. A lot has happened since then - Big Data, streaming and event architectures, NoSQL databases, ML/AI, and more. Data is the central force of today’s world. The growth of AI will only accelerate this many-fold, especially when you account for AI-generated data. We’re moving into a world with vastly different uses for data than even a few years ago. As long as humans use data, we must have a role in its use along the data lifecycle. We will have an amazing future if we can improve our understanding and use of data modeling.
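If terms like “dimensional” are new to you, here’s a minimal sketch of the idea - all of the names (OrderFact, CustomerDim, and so on) are hypothetical and mine, not drawn from any particular book or system. The gist of a dimensional (star schema) model is a central fact record of measurable events surrounded by descriptive dimension records, joined by keys:

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical dimension records: descriptive context, one row per entity.
@dataclass
class CustomerDim:
    customer_key: int      # surrogate key
    name: str
    segment: str

@dataclass
class DateDim:
    date_key: int          # e.g., 20240115
    calendar_date: date

# Hypothetical fact record: a measurable event, referencing dimensions by key.
@dataclass
class OrderFact:
    order_id: str
    customer_key: int      # points to CustomerDim
    date_key: int          # points to DateDim
    amount: float

# A tiny example: one order "joined" to its dimensions by key lookup.
customers = {1: CustomerDim(1, "Acme Corp", "Enterprise")}
dates = {20240115: DateDim(20240115, date(2024, 1, 15))}
order = OrderFact("ord-42", customer_key=1, date_key=20240115, amount=199.99)
print(customers[order.customer_key].name, dates[order.date_key].calendar_date, order.amount)
```

That’s only one lens, of course - relational, Data Vault, and other approaches slice the same business reality differently, which is exactly the territory the book digs into.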
How To Get Started
Practical Data Modeling will be packed with actionable and practical content you can apply in your day-to-day work. You’ve got a lot of options for your attention, and I really want everyone to get tremendous value from Practical Data Modeling. Here’s what you can expect*.
Regular content throughout the week (articles and videos) on a variety of modeling topics.
Early access to chapters from my new data modeling book (coming out in 2024).
Exclusive access to our community of data modelers, where you can ask questions, share your work and projects, and learn from others.
Regular online community discussions and book clubs.
Discounts on online courses and workshops.
Much more on the way!
*Free (for now) and paid tiers with Early Bird pricing coming very soon.
I look forward to connecting with you on your data modeling journey. Sign up today!
Hi, I would like to ask if you plan to address technical debt in modeling. For example, what happens if legacy tables were modeled using a certain method and you want to change it, or if no modeling was used at all? How do you deal with those kinds of changes? Thanks!
Stumbled onto Practical Data Modeling this week and am VERY excited to dive in and come along for the ride - I'm an Analytics Engineer and am planning to double down on my data modeling skills.