Many taxonomies, especially well-designed taxonomies with many facets, have dimensions that consist of very few categories, often just two.
It is tempting to give these Boolean-like tags, such as “Yes”/“No”, “Y”/“N”, “True”/“False”, or even near-Booleans like “H, M, L.” In this article I’m going to suggest not doing that, and instead using self-describing, meaningful names for the categories.
Before I do, let me offer a bit of color commentary on the types of situations where this shows up. Recently we were designing a Resolution Planning system in the financial industry. In the course of this design it became tempting to have categories for inter-affiliate services such as “resolution criticality,” “materiality,” or “impact on reputation.” These were not just tempting; they were part of the requirements from the regulators. It was also tempting to have the specific terms within each category be something like “yes”/“no” or “high”/“medium”/“low.” Partly this is because you may want the reports to have columns like “resolution critical” with “yes” or “no” in the rows.
That’s the backdrop. I can speak from experience that it is very tempting to just create two taxonomic categories “Yes” and “No.” There are actually two flavors of this temptation:
- Just create two terms “yes” and “no” and use them in all the places they occur. That is, there is an instance with a URI like :_Yes and an instance with a URI like :_No, with labels “Yes” and “No.”
- Create different “yes” and “no” instances for each of the categories. That is, there is a URI with a name like :_resCrit_Yes, which has the label “Yes,” and elsewhere a URI with a name like :_materiality_Yes.
I’m going to suggest that both are flawed. The first requires us to have a new property for every distinction we make. In other words, we can’t just say “categorizedBy” as we do with other categories, because you would need the name of the property to find out what “yes” means. While at first this seems reasonable, it leads to the type of design we find in legacy systems, with an excessive number of properties that have to be modeled, programmed to, and learned by consumers of the data. The second approach is closer to what we will advocate here, but it doesn’t go far enough, as we’ll see.
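To make the contrast concrete, here is a minimal sketch in Python that uses plain tuples as stand-ins for triples. The service name, the property names (`:resolutionCritical`, `:material`) and the self-describing category URIs are my own illustrative assumptions, not the names from the actual Resolution Planning model.

```python
# Triples are modeled as (subject, property, object) tuples.
# Every URI below is hypothetical and exists only for illustration.

# Flavor 1: one shared :_Yes / :_No, so each distinction needs its own property.
shared_boolean = [
    (":_TradeSettlementService", ":resolutionCritical", ":_Yes"),
    (":_TradeSettlementService", ":material", ":_No"),
]

# Flavor 2: a separate "Yes"/"No" per category. One property suffices,
# but the label on each instance is still just "Yes" or "No".
per_category_boolean = [
    (":_TradeSettlementService", ":categorizedBy", ":_resCrit_Yes"),
    (":_TradeSettlementService", ":categorizedBy", ":_materiality_No"),
]

# Self-describing categories: one property, and the category carries the meaning.
self_describing = [
    (":_TradeSettlementService", ":categorizedBy", ":_EssentialInResolution"),
    (":_TradeSettlementService", ":categorizedBy", ":_NotMaterial"),
]


def properties_used(triples):
    """Return the distinct properties a consumer of the data has to learn."""
    return {prop for _, prop, _ in triples}
```

With the shared Boolean, `properties_used` grows by one for every new distinction; with either of the other two shapes it stays at the single `:categorizedBy`.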
My perspective here is based on two things:
- Years of forensic work, profiling and reverse engineering trying to deduce what existing data in legacy systems actually means, plus
- My commitment to the “Data Centric Revolution,” wherein data becomes the permanent artifact and applications come and go. This is not the way things are now. In virtually all organizations today, when people want new functionality they implement new applications and “convert” their data from the old to the new. Moving to truly data centric enterprises will take some changes to points of view in this area.
I am reminded of a project we did with Sallie Mae, where we were using an ontology as the basis for their Service Oriented Architecture messages. Every day we’d tackle a few new messages and try to divine what the elements and attributes in the legacy systems meant. We would identify the obvious elements and have to send the analysts back to the developers to try to work out the more difficult ones. After several weeks of this I made an observation: the shorter the length of the element, the longer it would take us to figure out what it meant, with Booleans taking the longest.
I’ve been reflecting on this for years, and I think the confluence of our Resolution Planning application and the emergence of the Data Centric approach have led me to what the issue was and is.
“Yes” or even “True” doesn’t mean anything in isolation. It only means something in context. Yes is often the answer to a question, and if you don’t know what the question was, you don’t know what “yes” means. And in an application centric world, the question is in the application. Often it appears in the user interface. Then the reporting subsystem reinterprets it. Usually, due to space restrictions, the reporting interpretation is an abbreviated version of the user interface version. So the user interface might say “Would the unavailability of this service for more than 24 hours impair the ability of a resolution team to complete trades considered essential to the continued operation of the financial system as a whole?” And the report might say “Resolution Critical.” Of course the question could just as well be expressed the other way around: “Could a team function through the resolution period without this service?” (where “Yes” would mean approximately the same as “No” to the previous question).
In either event, Boolean data like this does not speak for itself. The data is inextricably linked to the application, which is what we’re trying to get beyond.
If we step back and reflect on what we’re trying to do, we can address the problem. We are attempting to categorize things. In this case we’re trying to categorize “Inter-affiliate Services.” The categories we are trying to put things in are categories like “Would be Essential in the Event of a Resolution” and “Would not be Essential in the Event of a Resolution.” I recognize that this sounds a lot like “Yes” and “No,” or perhaps the slightly improved “Essential” and “Non-Essential.” Now if you ask the question “Would the unavailability of this service for more than 24 hours impair the ability of a resolution team to complete trades considered essential to the continued operation of the financial system as a whole?” the user’s answer “Yes” would correspond to “Would be Essential in the Event of a Resolution.” If the question were changed to “Could a team function through the resolution period without this service?” we would map “No” to “Would be Essential in the Event of a Resolution.”
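The mapping from question and answer to stored category can be sketched as a small lookup. This is only an illustration in Python: the short question identifiers are hypothetical stand-ins for the two wordings above, and a real application would keep this mapping wherever it keeps its user interface definitions.

```python
ESSENTIAL = "Would be Essential in the Event of a Resolution"
NOT_ESSENTIAL = "Would not be Essential in the Event of a Resolution"

# Hypothetical identifiers for the two ways of asking the question.
# Note that "Yes" lands on opposite categories depending on the wording.
ANSWER_TO_CATEGORY = {
    ("unavailability_impairs_resolution", "Yes"): ESSENTIAL,
    ("unavailability_impairs_resolution", "No"): NOT_ESSENTIAL,
    ("team_can_function_without_service", "Yes"): NOT_ESSENTIAL,
    ("team_can_function_without_service", "No"): ESSENTIAL,
}


def categorize(question_id: str, answer: str) -> str:
    """Translate a user's answer into the category that is actually stored."""
    return ANSWER_TO_CATEGORY[(question_id, answer)]
```

The point of the sketch is that the question wording can change in the application while the stored category, and therefore the meaning of the data, stays the same.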
Consider the implication. With the fully qualified categories, you get several advantages:
- The data does speak for itself. You can review the data and know what it means without having to refer to application code, and without being forever dependent on the application code for interpretation.
- You could write a query and interpret the results, without needing labels from the application or the report.
- You could query for all the essential services. Consider how hard this would be in the Boolean case. You can query for things that are in the Resolution Critical mini-taxonomy with the value of “Yes,” but you don’t really know what “Yes” means. With the fully qualified category you just query for the things that are categorized by “Would be Essential in the Event of a Resolution” and you’ve got it.
- You can confidently create derivative classes. Let’s say you wanted the set of all departments that provided resolution critical services. You would just create a restriction class that related the department to the service with that category. You could do it with the Boolean, but you’d be continually dogged by the question “what did ‘yes’ mean in this context?”
- You can use the data outside the context in which it was originally created. In a world of linked data, it will be far easier to consume and use data that has more fully qualified categories.
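The query and derived-class advantages in the list above can be sketched with the same tuple-as-triple stand-in. All of the URIs, including the `:provides` property and the department and service names, are assumptions made up for this example. The second function is only a rough Python analogue of a restriction class, not an OWL definition.

```python
ESSENTIAL = ":_EssentialInResolution"  # hypothetical category URI

# Hypothetical sample data as (subject, property, object) tuples.
triples = [
    (":_TradeSettlement", ":categorizedBy", ESSENTIAL),
    (":_CafeteriaCatering", ":categorizedBy", ":_NotEssentialInResolution"),
    (":_Operations", ":provides", ":_TradeSettlement"),
    (":_Facilities", ":provides", ":_CafeteriaCatering"),
]


def essential_services(data):
    """Everything categorized by the self-describing 'essential' category."""
    return {s for s, p, o in data if p == ":categorizedBy" and o == ESSENTIAL}


def departments_providing_essential_services(data):
    """Derived set: departments related to at least one essential service."""
    essential = essential_services(data)
    return {s for s, p, o in data if p == ":provides" and o in essential}
```

Neither function needs to know which question was asked or how the report labels its columns; the category name is enough.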
Finally, if you find you really need to put “Yes” on a report, you can always put an alternate display label on the category. That way the data would know what “yes” meant without having to refer to the application.
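As a rough sketch of the alternate-label idea, assuming a hypothetical `report_label` field that sits alongside the self-describing name:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Category:
    uri: str
    label: str         # self-describing name stored with the data
    report_label: str  # hypothetical alternate display label for compact reports


ESSENTIAL = Category(
    uri=":_EssentialInResolution",
    label="Would be Essential in the Event of a Resolution",
    report_label="Yes",
)
NOT_ESSENTIAL = Category(
    uri=":_NotEssentialInResolution",
    label="Would not be Essential in the Event of a Resolution",
    report_label="No",
)


def report_cell(category: Category) -> str:
    """What a space-constrained report column would print."""
    return category.report_label
```

The report still shows “Yes,” but anyone who follows the category back to its `label` finds out what that “Yes” means.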
In conclusion: it is often tempting to introduce Boolean values, or very small taxonomies that function as Booleans, into your ontology design. This leads to long-term problems with coupling between the data and the application, and hampers maintenance and long-term use of the data.
Preparing and using these more qualified categories takes only a bit more up-front design work, and has no downside for implementation or subsequent use.