This is an excerpt from the draft of Chapter 2 for my upcoming book, Practical Data Modeling. If you’d like to get access to this and other chapters, as well as upcoming exclusive content, please become a paid subscriber.
Thanks,
Joe Reis
There’s No Free Lunch in Data Modeling
The phrase "no free lunch" means that it is impossible to get something for nothing. In other words, everything has a cost, obvious or hidden. The saying often summarizes this: "There is no such thing as a free lunch." Resources are limited, and benefits usually come with some cost or sacrifice. Even if something appears free, there are always associated costs or trade-offs.
There is no free lunch when it comes to data modeling. You have a spectrum of choices. At one extreme, you can ignore intentional data modeling and wing it. This is tempting for projects where you need to ship something fast: you'll move quickly but make a lot of mistakes along the way. At the other extreme, you model your data rigorously and methodically. You'll move slowly, but your data model will be very robust; the tradeoff is whether your organization has the patience and money to invest in the effort. Or you can model data somewhere between these extremes. Each approach has its tradeoffs and incurs debts that must be paid back later.
Types of Debt: Technical, Data, and Organizational
Evaluating these tradeoffs and sacrifices from a cost-and-benefit perspective is critical. Every data modeling decision carries a mix of benefits and debt, and I often see people underestimate the debt. There are three types of debt to consider: technical, data, and organizational. Most people are familiar with technical debt, but data and organizational debt are just as critical.
Technical debt accrues when short-term decisions and quick fixes are made to expedite the delivery of systems, often creating more work in the future. These quick fixes favor more straightforward but less optimal solutions. Engineers and developers usually say, "We'll clear up our technical debt backlog very soon." That day rarely arrives. As the quick fixes pile up, necessary improvements and refactoring are postponed or never addressed.
For data models, high technical debt often produces models that do not accurately represent the underlying data or business processes, resulting in misleading analysis and decisions, and in poorly performing ML models and applications. Paying down this debt means revisiting and improving these areas to ensure long-term quality and maintainability.
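To make this concrete, here's a minimal sketch in Python. The field names and states are hypothetical, invented purely for illustration. It shows how a common quick fix, cramming two business concepts into one free-text status field, bakes technical debt into the model and pushes parsing work onto every downstream consumer.

```python
from dataclasses import dataclass
from enum import Enum

# Quick fix: one free-text field crams order state and payment state together.
# Every consumer must split and interpret the string, and typos slip through.
quick_fix_order = {"id": 1, "status": "shipped|unpaid"}  # hypothetical record
order_state, payment_state = quick_fix_order["status"].split("|")

# Deliberate model: each business concept gets its own explicit, validated type.
class OrderState(Enum):
    PLACED = "placed"
    SHIPPED = "shipped"
    DELIVERED = "delivered"

class PaymentState(Enum):
    UNPAID = "unpaid"
    PAID = "paid"
    REFUNDED = "refunded"

@dataclass
class Order:
    id: int
    order_state: OrderState
    payment_state: PaymentState

order = Order(id=1, order_state=OrderState.SHIPPED, payment_state=PaymentState.UNPAID)
print(order)
```

The quick fix ships sooner, but the string parsing it forces on every consumer, and the invalid values it silently admits, are the interest payments.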
Data debt is the accumulation of defects in data. A subset of technical debt, data debt results from short-term actions taken at the expense of long-term data viability. It encompasses poor data quality, a lack of proper data governance, inadequate documentation, and ineffective or wrong data models. These shortcomings lead to inconsistent, inaccurate, or inaccessible data, making it difficult for applications to run correctly, analysis to be performed, or ML models to be adequately trained. Over time, data debt hampers organizational agility, increases maintenance costs, and undermines decision-making, and it takes significant effort and resources to rectify.
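As a rough illustration, using pure Python and entirely made-up records, here is what accumulated data debt often looks like in practice: the same facts recorded in inconsistent formats, which every analysis must now work around.

```python
from collections import Counter

# Hypothetical customer records accumulated without governance or validation.
customers = [
    {"id": 1, "country": "US",            "signup_date": "2023-01-15"},
    {"id": 2, "country": "USA",           "signup_date": "01/17/2023"},
    {"id": 3, "country": "United States", "signup_date": "2023-1-20"},
    {"id": 4, "country": None,            "signup_date": ""},
]

# A quick profile surfaces the debt: three spellings of one country,
# three date formats, and missing values. All of it is interest to be
# paid before any analysis can be trusted.
print(Counter(row["country"] for row in customers))
# Counter({'US': 1, 'USA': 1, 'United States': 1, None: 1})
```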
Organizational debt is the accumulated inefficiencies and shortcomings in an organization's processes, structures, and systems, often arising from prioritizing short-term gains over long-term sustainability. It can manifest as outdated procedures, insufficient training, poor communication channels, and a lack of proper documentation and governance. Like technical and data debt, organizational debt impedes operational efficiency, reduces agility, and complicates decision-making. Over time, it leads to decreased employee morale, increased operational costs, and a diminished ability to respond to market changes or innovate, and it takes significant effort and strategic planning to correct.
Organizational debt is significant because it's the organization that empowers you to model data, and what is given can be taken away. Think of organizational debt like a punch pass with a limited number of punches. Each short-term workaround or quick fix uses up a punch: a decision to defer a fundamental issue, adding to the accumulated debt. As the punches are used up, the organization becomes increasingly constrained, with fewer opportunities to delay necessary improvements without significant consequences. Eventually, the punch pass runs out, the point at which the accumulated debt must be addressed, often at a high cost and with extensive effort. Ignore organizational debt for too long and you reach a critical situation where the organization can no longer function effectively, like needing more access when the pass is already spent.
Every type of debt has an interest rate and a payback period. In good circumstances, the interest rate is low and the payback period is long. In the worst case, you're left with payday-loan interest rates and very short payback windows.
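The analogy can be made loosely quantitative. Here is a sketch with entirely invented numbers, illustrative arithmetic only and not a real estimate of any project's debt: if a shortcut defers a fixed amount of rework, and that rework gets harder each period it goes unaddressed, the cost compounds like interest.

```python
# Illustrative arithmetic only: the rates and periods are invented.
def rework_cost(principal_hours: float, rate_per_period: float, periods: int) -> float:
    """Hours of rework owed if debt compounds each period it goes unpaid."""
    return principal_hours * (1 + rate_per_period) ** periods

# Low-interest debt paid back over a long horizon: manageable.
print(round(rework_cost(40, 0.05, 4)))   # ~49 hours

# Payday-loan debt: the same shortcut, compounding fast and due soon.
print(round(rework_cost(40, 0.50, 4)))   # ~202 hours
```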
I view these choices on a pendulum swinging between fast and relaxed on one end and slow and rigorous on the other. Let's examine the tradeoffs of each approach.
Moving Fast or Slow
Should you move fast or slow? Or somewhere in between? Each approach has its tradeoffs in the types of success and debt you’ll incur.
Adopting a slow and rigorous data modeling approach presents both advantages and challenges. The main benefit is that the resulting data model will be robust and meticulous, and technical and data debt will likely be low. Yet this method often incurs significant organizational debt. The team responsible for such a model may be viewed unfavorably within the organization: some will appreciate the robustness of a meticulously crafted data model, but if it consumes resources at the expense of projects that could deliver value sooner, it may ultimately be perceived as a waste of time and effort. Moving slowly in data modeling therefore carries a cost: the project may lose support if it doesn't deliver value on time.
On the other end of the spectrum is the fast and relaxed approach, which can be tempting, especially in environments that prioritize quick results over meticulous detail. Taken to the extreme, it means neglecting intentional data modeling entirely. As discussed previously, there are scenarios where this might be acceptable, but it comes with its own drawbacks. As I say in my talks, "The lack of a model is still a model. It's just a crappy model." High-interest technical and data debt come first. Organizational debt follows when people stop trusting the data or the shoddy model impacts critical systems and processes.
The pendulum of data modeling approaches swings widely, reflecting the cyclical nature of the industry's data practices; this dynamic is explored further in the next chapter. The choice of approach largely depends on your situation's constraints and requirements, a theme I'll revisit throughout this book. The ideal strategy is to aim for the best possible data model under the circumstances, where "best" is inherently context-dependent, shaped by the available data and resource constraints. Shortcuts made today become debt you'll need to pay back later, but taking too long and moving at a glacial pace also creates debt.
There's "no free lunch" in data modeling. Everything has a price. Know what you can pay for.