One of the ideas we promote is elegance in the core data model in a Data-Centric enterprise. This is harder than it sounds. Look at most application-centric data models: you would think they would be simpler than the enterprise model; after all, each one covers only a small subset of it. Yet we often find individual application data models that are far more complex than the enterprise model that covers them.
You might think that the enterprise model is leaving something out, but that’s not what we’re finding when we load data from these systems. We can generally get all the data and all the fidelity in a simpler model.
It behooves us to ask a pretty broad question:
Where and when should I add new classes to my Data-Centric Ontology?
To answer this, we’re going to dive into four topics:
- The tradeoff of convenience versus overhead
- What is a class, really?
- Where is the proliferation coming from?
- What options do I have?
Convenience and Overhead
In some ways, a class is a shorthand for something (we’ll get a bit more detailed in the next paragraph). As such, putting a label to it can often be a big convenience. I have a very charming book called Thing Explainer – Complicated Stuff in Simple Words,[1] by Randall Munroe (the author of xkcd Comics). The premise of Thing Explainer is that even very complex technical topics, such as dishwashers, plate tectonics, the International Space Station, and the Large Hadron Collider, can all be explained using a vocabulary of just ten hundred words. (To give you an idea of the lengths he goes to, he uses “ten hundred” instead of “one thousand” to save a word in his vocabulary.)
So instead of coining a new word in his abbreviated vocabulary, “dishwasher” becomes “box that cleans food holders” (food holders being bowls and plates). I lived in Papua New Guinea part time for a couple of years, and the national language there, Tok Pisin, has only about 2,000 words. They ended up with similar word salads. I remember the grocery store was at “plas bilong san kamup,” or “place belong sun come up,” which is Tok Pisin for “East.”
It is much easier to refer to “dishwashers” and “East” than their longer equivalents. It’s convenient. And it doesn’t cost us much in everyday conversation.
But let’s look at the convenience / overhead tradeoff in an information system that is not data-centric. Every time you add a new class (or a new attribute) to an information system, you are committing the enterprise to deal with it, potentially for decades to come. The overhead starts with application programming: that new concept has to be referred to by code, and not just a small amount. I’ve done some calculations in my book, Software Wasteland, that suggest each attribute added to a system adds at least 1,000 lines of source code: code to move the item from the database to some API, code to take it from the API and put it in the DOM or something similar, code to display it on a screen, in a report, maybe even in a drop-down list, and code to validate it. Given that it costs money to write and test code, this adds to the cost of a system. The real impact is felt downstream: in application maintenance, especially in the brittle world of systems integration, and by the users. Every new attribute is a new field on a form to puzzle about. Every new class is often a new form. New forms often require changes to process flow. And so, the complexity grows.
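The arithmetic behind that estimate can be sketched in a few lines. This is only an illustration: the function name, the constant name, and the sample count of 50 attributes are assumptions made for the example, and the 1,000-line figure is the lower bound quoted above, not a measured value for any particular system.

```python
# Lower-bound figure quoted in the text: source lines added per attribute.
LINES_PER_ATTRIBUTE = 1_000


def estimated_code_overhead(attribute_count: int,
                            lines_per_attribute: int = LINES_PER_ATTRIBUTE) -> int:
    """Return a rough lower bound on the source lines implied by the attributes."""
    if attribute_count < 0:
        raise ValueError("attribute_count must be non-negative")
    return attribute_count * lines_per_attribute


# Hypothetical example: an application that adds 50 new attributes.
print(estimated_code_overhead(50))  # 50000
```

Even a modest number of new attributes multiplies into tens of thousands of lines that someone has to write, test, and maintain.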
Finally, there is cognitive load. When we have to deal with dozens or hundreds of concepts, we don’t have too much trouble. When we get to thousands, it becomes a real undertaking. Tens of thousands and it’s a career. And yet many individual applications have tens of thousands of concepts. Most large enterprises have millions, which is why becoming data-centric is so appealing.
One of the other big overheads in traditional technology is duplication. When you create a new class, let’s say, “hand tools,” you may have to make sure that the wrench is in the Hand Tools class / table and also in the Inventory table. This reliance on humans and procedures to remember to put things in more than one place is a huge undocumented burden.
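A minimal sketch of that burden, assuming two hypothetical tables held as Python dictionaries keyed by item id (the table contents, ids, and function name are invented for illustration): nothing in the structure itself forces the two tables to agree, so a check like this has to be written and run by someone.

```python
# Hypothetical tables in a traditional system, keyed by item id.
hand_tools = {
    "item-1": {"name": "wrench"},
    "item-2": {"name": "hammer"},
}
inventory = {
    "item-1": {"name": "wrench"},  # the hammer was never added here
}


def missing_from(source: dict, target: dict) -> set:
    """Return the ids that appear in source but are absent from target."""
    return set(source) - set(target)


# Every hand tool is supposed to be in inventory as well.
print(missing_from(hand_tools, inventory))  # {'item-2'}
```

The check only reports the gap after the fact; keeping the two tables in step still depends on people and procedures.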
We want to think long and hard before introducing a new class or even a new attribute.