I remember hearing the term “Code Smell” a long time ago, and the term was so visceral that it stuck with me, as I’m guessing it has with anyone it’s encountered. A Code Smell is described as “any characteristic in the source code of a program that possibly indicates a deeper problem.”
Some examples of Code Smells:
There are too many lines of code when you know the code can be simplified to one to two lines. You may work at a company that evaluates your performance on the number of lines of code. If so, I’m sorry.
Writing a large class or method that does WAY TOO MUCH. Forget about writing discrete and modular code.
Including credentials in your code - visible for all to see - in version control.
Shotgun surgery, where a single change requires modifications in multiple files, classes, or methods.
Not including comments in the code.
You get the idea. Code Smells are the visceral reaction when you see anti-patterns and curl your nostrils in shame and occasional horror. You feel emotions for the person who brought this smell into your world, usually a combination of disgust, shame, annoyance, pity, or something else.
While reading this paper on Data Smells, I was reminded that the data world has its own “smells.” The paper discusses specific data quality issues related to data used for AI training.
It got me thinking there are data model smells, too. If you work with data, you have an intuition about what I’m talking about. Here are some examples of data model smells.
Being overly dogmatic about a particular data modeling approach when it’s not the best fit for the situation at hand.
Poor and inconsistent naming conventions. I’ve seen columns with names like “WTF1234.” In this case, it was a legacy database with a limit on the characters that could be used in column names. The column had values, but nobody was quite sure what this name meant because the people who named and maintained the column moved on.
Redundant or duplicate data across many tables. This might be ok if you’re doing insert-only operations (maybe). However, updating and deleting these duplicate records can be a nightmare. There’s a reason why normalization was created. For most things, normalize unless you have a good reason not to. You’ll thank yourself later.
Column redundancy for no good reason. Columns with the same name are repeated with different numbers, representing purchase orders such as PO_1, PO_2, PO_3, etc. What do these columns mean? Are they different purchase orders? Various stages in the purchasing process? Instead of adding another column, why not consolidate your logic into one column?
Fragile design is where models break easily due to changes in upstream source systems, hard-coded dependencies (like server or column names), etc.
Over-reliance on free-text blobs and JSON types. There are better ways to operate if you’re doing text-heavy or deeply-nested data workloads.
Overloaded tables where every conceivable piece of data is stuffed into a single table. Bonus points for data model smells if you cram lots of free-text and deeply nested JSON into this table.
I could go on, but you get the idea. When you encounter these data model smells in the wild, your first reaction is probably, “WTF?!” Next, you hope you don’t have to deal with this data, and hopefully, some poor sucker gets to deal with it instead. The only thing that you can count on is pain and suffering.
What are some other data model smells you’ve encountered?
Ran across a data mart with table and column names that have spaces in the physical table! Nothing like adding quotes on all SQL. Some tools handle it fine, many do not. Smelly tables = smelly code.
Centipede fact tables https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/centipede-fact-table/