The Data-Centric Revolution: Best Practices and Schools of Ontology Design

This article originally appeared at The Data-Centric Revolution: Best Practices and Schools of Ontology Design – TDAN.com. Subscribe to TDAN directly for this and other great content!

I was recently asked to present “Enterprise Ontology Design and Implementation Best Practices” to a group of motivated ontologists and wanna-be ontologists. I was flattered to be asked, but I really had to pause for a bit. First, I’m kind of jaded by the term “best practices.” Usually, it’s just a summary of what everyone already does. It’s often sort of a “corporate common sense.” Occasionally, there is some real insight in the observations, and, even more rarely, there are best practices that are not yet mainstream practices. I wanted to shoot for that latter category.

As I reflected on a handful of best practices to present, it occurred to me that intelligent people may differ. We know this because on many of our projects, there are intelligent people and they often do differ. That got me to thinking: “Why do they differ?” What I came to was that there are really several different “schools of ontology design” within our profession. They are much like “schools of architectural design” or “schools of magic.” Each of those has its own tacit agreement as to what constitutes “best practice.”

Armed with that insight, I set out to identify the major schools of ontological design, and outline some of their main characteristics and consensus around “best practices.” The schools are (these are my made-up names, to the best of my knowledge none of them have planted a flag and named themselves — other than the last one):

  • Philosophy School
  • Vocabulary and Taxonomy School
  • Relational School
  • Object-Oriented School
  • Standards School
  • Linked Open Data School
  • NLP/LLM School
  • Data-Centric School

There are a few well-known ontologies that are hybrids of more than one of these schools. For instance, most of the OBO Life Sciences ontologies are a hybrid of the Philosophy and Taxonomy Schools. I think this will make more sense after we describe each school individually.

Philosophy School

The philosophy school aims to ensure that all modeled concepts adhere to strict rules of logic and conform to a small number of well vetted primitive concepts.

Exemplars

The Basic Formal Ontology (BFO), DOLCE and Cyc are the best-known exemplars of this school.  Each has a set of philosophical primitives that all derived classes are meant to descend from.

How to Recognize

It’s pretty easy to spot an ontology that was developed by someone from the philosophy school. The top-level classes will be abstract philosophical terms such as “occurrent” and “continuant.”

Best Practices

All new classes should be based on the philosophical primitives. You can pretty much measure the adherence to the school by counting the number of classes that are not direct descendants of the 30-40 base classes.

Vocabulary and Taxonomy School

The vocabulary and taxonomy school tends to start with a glossary of terms from the domain and establish what they mean (vocabulary school) and how these terms are hierarchically related to each other (taxonomy school). The two schools are more alike than different.

The taxonomy school especially tends to be based on standards that were created before the Web Ontology Language (OWL). These taxonomies often model a domain as hierarchical structures without defining what a link in the hierarchy actually means. As a result, they often mix sub-component and sub-class hierarchies.

Exemplars

Many life sciences ontologies, such as SNOMED, are primarily taxonomy ontologies, and only secondarily philosophy school ontologies. Also, the Suggested Upper Merged Ontology (SUMO) is primarily a vocabulary ontology; it was mostly derived from WordNet, and one of its biggest strengths is its cross-reference to 250,000 words and their many word senses.

How to Recognize

Vast numbers of classes. There are often tens of thousands or hundreds of thousands of classes in these ontologies.

Best Practices

For the vocabulary and taxonomy schools, completeness is the holy grail. A good ontology is one that contains as many of the terms from the domain as possible. The Simple Knowledge Organization System (SKOS) was designed for taxonomies. Thus, even though it is implemented in OWL, it is designed to add semantics to taxonomies that are often less rigorous, using generic predicates such as skos:broader and skos:narrower rather than more precise subclass axioms or object properties such as “part of.” SKOS is a good tool for integrating taxonomies with ontologies.
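
To make the contrast concrete, here is a minimal sketch (using the rdflib Python library and hypothetical example IRIs) showing a SKOS-style taxonomy link next to a formal subclass axiom; the SKOS predicate commits to far less meaning.

```python
# A minimal sketch (rdflib, hypothetical IRIs) contrasting a SKOS taxonomy link
# with a formal RDFS/OWL subclass axiom.
from rdflib import Graph, Namespace, RDF, RDFS
from rdflib.namespace import SKOS

EX = Namespace("https://example.org/")
g = Graph()
g.bind("skos", SKOS)
g.bind("ex", EX)

# Taxonomy style: skos:broader only says one concept is more general than
# another, without committing to subclass or part-of semantics.
g.add((EX.Aspirin, RDF.type, SKOS.Concept))
g.add((EX.Analgesic, RDF.type, SKOS.Concept))
g.add((EX.Aspirin, SKOS.broader, EX.Analgesic))

# Ontology style: a subclass axiom carries full set-membership semantics.
g.add((EX.AspirinTablet, RDFS.subClassOf, EX.AnalgesicDrug))

print(g.serialize(format="turtle"))
```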

Relational School

Most data modelers grew up with relational design, and when they design ontologies, they rely on ways of thinking that served them well in relational.

Exemplars

These are mostly internally created ontologies.

How to Recognize

Relational ontologists tend to be very rigorous about putting specific domains and ranges on all their properties. Properties are almost never reused. All properties will have inverses. Restrictions will be subclass axioms, and you will often see restrictions with “min 0” cardinality, which doesn’t mean anything to an inference engine, but to a relational ontologist it means “optional cardinality.” You will also see “max 1” and “exactly 1” restrictions which almost never imply what the modeler thought, and as a result, it is rare for relational modelers to run a reasoner (they don’t like the implications).
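
As an illustration, here is a minimal sketch (valid Turtle loaded with rdflib; the IRIs are hypothetical) of the pattern described above: a narrowly scoped property that is never reused, plus a “min 0” restriction that a reasoner treats as a tautology even though the modeler meant “optional column.”

```python
# A minimal sketch (rdflib, hypothetical IRIs) of the "relational style" OWL pattern.
from rdflib import Graph

ttl = """
@prefix ex:   <https://example.org/> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

ex:hasMiddleName a owl:DatatypeProperty ;
    rdfs:domain ex:Employee ;     # specific domain and range, property never reused
    rdfs:range  xsd:string .

ex:Employee rdfs:subClassOf [
    a owl:Restriction ;
    owl:onProperty ex:hasMiddleName ;
    owl:minCardinality "0"^^xsd:nonNegativeInteger   # means nothing to a reasoner
] .
"""
g = Graph()
g.parse(data=ttl, format="turtle")
print(len(g), "triples loaded")
```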

Best Practices

For relational ontologists, best practice is to make ontologies that are as similar to existing relational structures as possible. Often, the model is a direct map from an existing relational system.

Modelers in the relational school (as well as the object-oriented school coming up next) tend to bring the “Closed World Assumption” (CWA) with them from their previous experience. CWA takes a mostly implicit attitude that the information in the system is a complete representation of the world. The “Open World Assumption” (OWA) takes the opposite starting point: that the data in the system is a subset of all knowable information on the subject.

CWA was and is more appropriate in narrow scope, bounded applications. When we query your employee master file looking for “Dave McComb” and don’t get a hit, we reasonably assume that he is not an employee of your enterprise. When TSA queries their system and doesn’t get a hit, they don’t assume that he is not a terrorist. They still use the X-ray and metal detectors. This is because they believe that their information is incomplete. They are open worlders. More and more of our systems combine internal and external data in ways that are more likely to be incomplete.

There are techniques for closing the open world, but the relational school tends not to use them because they assume their world is already closed.

Object-Oriented School

Like the relational school, the object-oriented school comes from designers who grew up with object-oriented modeling.

Exemplars

Again, a lot of object-oriented (OO) ontologies are internal client projects, but a few public ones of note include eCl@ss and Schema.org. eCl@ss is a standard for describing electrical products. It has been converted into an ontology. The ontology version has 60,000 classes, which combine taxonomic and OO style modeling. Schema.org is an ontology for tagging web sites that Google promotes to normalize SEO. It started life fairly elegant. It now has 1300 classes, many of which are taxonomic distinctions, rather than real classes.

How to Recognize

One giveaway for the object-oriented school is designing in SHACL. SHACL is a semantic constraint language, which is quite useful as a guard for updates to a triple store. Because SHACL is less concerned with meaning and more concerned with structure, many object-oriented ontologists prefer it to OWL for defining their classes.

Even those who design in OWL have some characteristic tells. OO ontologists tend to use subclassing far more than relational ontologists. They tend to declare which class is a subclass of another, rather than allowing the inference engine to infer subsumption. There is also a tendency to believe that the superclass will constrain subclass membership.
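
A minimal sketch of the SHACL-first style, using the pySHACL library and hypothetical shapes: the shape constrains structure (which properties, how many) rather than defining what membership in the class means.

```python
# A minimal sketch (pySHACL, hypothetical shapes and IRIs) of structural,
# SHACL-first class definition.
from rdflib import Graph
from pyshacl import validate

shapes_ttl = """
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix ex:  <https://example.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ex:EmployeeShape a sh:NodeShape ;
    sh:targetClass ex:Employee ;
    sh:property [ sh:path ex:employeeId ;
                  sh:datatype xsd:string ;
                  sh:minCount 1 ;
                  sh:maxCount 1 ] .
"""

data_ttl = """
@prefix ex: <https://example.org/> .
ex:emp42 a ex:Employee .     # missing ex:employeeId, so it violates the shape
"""

conforms, _, report = validate(
    Graph().parse(data=data_ttl, format="turtle"),
    shacl_graph=Graph().parse(data=shapes_ttl, format="turtle"),
)
print(conforms)   # False
print(report)
```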

Best Practices

OO ontologies tend to co-exist with GraphQL and focus on JSON output. This is because the consuming applications are object-oriented, and this style of ontology and architecture has less impedance mismatch with the consuming applications. The level of detail tends to mirror the kind of detail you find in an application system. Best practices for an OO ontology would never consider the tens of thousands or hundreds of thousands of classes in a taxonomy ontology, nor would they go for the minimalist view of the philosophy or data-centric schools. They tend to make all distinctions at the class level.

Standards School

This is a Janus school, with two faces, one facing up and one facing down. The face looking down is concerned with building ontologies that others can (indeed should) reuse. The face looking up belongs to the enterprise ontologies that import the standard ontologies in order to conform.

Exemplars

Many of the most popular ontology standards are produced and promoted by the W3C. These include DCAT (Data Catalog Vocabulary), the Ontology for Media Resources, PROV-O (an ontology of provenance), the Time Ontology, and Dublin Core (an ontology for metadata, particularly around library science).

How to Recognize

For the down facing standards ontologies, it’s pretty easy. They are endorsed by some standards body. Most common are W3C, OMG, and OASIS. ISO has been a bit late to this party, but we expect to see some soon. (Everyone uses the ISO country and currency codes, and yet there is no ISO ontology of countries or currencies.) There are also many domain-specific standard ontologies that are remakes of their previous message model standards, such as FHIR from HL7 in healthcare and ACORD in insurance.

The upward facing standards ontologies can be spotted by their importing a number of standard ontologies each meant to address an aspect of the problem at hand.

Best Practices

Best practice for downward facing standards ontologies is to be modular, fairly small, complete and standalone. Unfortunately, this best practice tends to result in modular ontologies that redefine (often inconsistently) shared concepts.

Best practice for upward facing standards ontologies is to rely as much as possible on ontologies defined elsewhere. This usually starts off by importing many ontologies and ends up with a number of bridges to the standards when it’s discovered that they are incompatible.

Linked Open Data School

The linked open data school promotes the idea of sharing identifiers across enterprises. Linked data is very focused on instance (individual or ABox) data, and only secondarily on classes.

Exemplars

The poster child for LOD is DBpedia, the LOD knowledge graph derived from the Wikipedia information boxes. The school also includes related projects such as Wikidata and the entire Linked Open Data Cloud.

I would put the Global Legal Entity Identifier Foundation (GLEIF) in this school, as its primary focus is sharing between enterprises, and it is more focused on the ABox (the instances).

How to Recognize

Linked open data ontologies are recognizable by their instances, often millions and in many cases billions of them. The ontology (TBox) side is often very naïve, as it is frequently derived directly from informal classifications made by editors of Wikipedia and its kin.

You will see many ad hoc classes raised to the status of a formal class in LOD ontologies. I just noticed the classes dbo:YearInSpaceFlight and yago:PsychologicalFeature100231001.

Best Practices

The first best practice (honored more in the breach) is to rely on other organizations’ IRIs. This is often clumsy because, historically, each organization invented identifiers for things in the world (their employees and vendors, for instance), and they tend to build their IRIs around these well-known (at least locally) identifiers.

A second best practice is entity resolution and “owl:sameAs.” Entity resolution can determine whether two IRIs represent the same real-world object. Once the match is recognized, one of the organizations can choose to adopt the other’s IRI (the best practice from the previous paragraph) or continue to use its own but record the identity through owl:sameAs (which is mostly motivated by the following best practice).
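
A minimal sketch (rdflib, with hypothetical IRIs and a made-up LEI) of recording the result of entity resolution with owl:sameAs:

```python
# A minimal sketch: after entity resolution decides two IRIs denote the same
# real-world organization, owl:sameAs records the identity. IRIs are hypothetical.
from rdflib import Graph, Namespace
from rdflib.namespace import OWL

ACME = Namespace("https://data.acme.example/id/")     # our own namespace
LEI = Namespace("https://lei.example.org/id/")        # the other party's namespace

g = Graph()
g.bind("owl", OWL)
g.add((ACME.vendor_00173, OWL.sameAs, LEI["5493001EXAMPLE000042"]))
print(g.serialize(format="turtle"))
```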

LOD creates the opportunity for IRI resolution at the instance level. Put the DBpedia IRI for a famous person in your browser address bar and you will be redirected to the DBpedia resolution page for that individual, showing everything DBpedia knows about them. For security reasons, most enterprises don’t yet do this. Because of this, another best practice is to only create triples with subjects whose domain name you control. Anything you state about an IRI in someone else’s namespace will not be available for resolution by the organization that minted the subject URI.

NLP/LLM School

There is a school of ontology design that says turn ontology design over to the machines. It’s too hard anyway.

Exemplars

Most of these are also internal projects. About every two to three years, we see another startup with the premise that ontologies can be built by machines. For most of that history, these were cleverly tailored NLP systems. The original work in this area required large teams of computational linguists.

This year (2023), they are all LLMs. You can ask ChatGPT to build an ontology for [fill in the blank] industry, and it will come up with something surprisingly credible looking.

How to Recognize

For LLMs, the first giveaway is hallucinations. These are hard to spot and require deep domain and ontology experience to pick out. The second clue is humans with six fingers (just kidding). There aren’t many publicly available LLM-generated ontologies (or if there are, they are so good we haven’t detected that they were machine generated).

Best Practices

Get a controlled set of documents that represent the domain you wish to model. This is better than relying on what ChatGPT learned by reading the internet.

And have a human in the loop. This is an approach that shows significant promise, and several researchers have already created prototypes that take it. Think of the NLP/LLM-created artifacts primarily as speed-reading aids or intelligent assistants for the ontologist.

In the broader adoption of LLMs, there is a lot of energy going into using knowledge graphs as “guard rails” against some of LLMs’ excesses, and into the value of keeping a human in the loop. Our immediate concern is that there are advocates of letting generative AI design ontologies on its own, and as such it becomes a school of its own.

Data-Centric School

The data-centric school of ontology design, as promoted by Semantic Arts, focuses on ontologies that can be populated and implemented. In building architecture, they often say “It’s not architecture until it’s built.” The data-centric school says, “It’s not an ontology until it has been populated (with instance level, real world data, not just taxonomic tags).” The feedback loop of loading and querying the data is what validates the model.

Exemplars

Gist, an open-source OWL ontology, is the exemplar data-centric ontology. SchemaApp, Morgan Stanley’s compliance graph, Broadridge’s Data Fabric, Procter & Gamble’s Material Safety graph, Schneider Electric’s product catalog graph, Standard & Poor’s commodity graph, Sallie Mae’s Service-Oriented Architecture, and dozens of small firms’ enterprise ontologies are based on gist.

How to Recognize

Importing gist is a dead giveaway. Other telltale signs include a modest number of classes (fewer than 500 for almost all enterprises) and eschewing inverse and transitive properties (the overhead for these features in a large knowledge graph far outweighs their expressive power). Another giveaway is delegating taxonomic distinctions to instances of subclasses of gist:Category rather than making them classes in their own right.

Best Practices

One best practice is to give non-primitive classes “equivalent class” restrictions that define class membership and are used to infer the class hierarchy. Another best practice is to declare domains and ranges at very high levels of abstraction (and often to omit them completely) in order to promote property reuse and reduce future refactoring.
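
As a sketch of the first practice, here is a non-primitive class defined with an owl:equivalentClass restriction (valid Turtle parsed with rdflib; the IRIs are hypothetical), so a reasoner can infer both membership and the class hierarchy rather than having them asserted by hand.

```python
# A minimal sketch (hypothetical IRIs): an Employee is simply a Person who is
# party to at least one employment agreement.
from rdflib import Graph

ttl = """
@prefix ex:  <https://example.org/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

ex:Employee owl:equivalentClass [
    a owl:Class ;
    owl:intersectionOf (
        ex:Person
        [ a owl:Restriction ;
          owl:onProperty ex:isPartyTo ;
          owl:someValuesFrom ex:EmploymentAgreement ]
    )
] .
"""
print(len(Graph().parse(data=ttl, format="turtle")), "triples")
```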

Another best practice is to load a knowledge graph with data from the domain of discourse to prove that the model is appropriate and at the requisite level of detail.

Summary

One of the difficulties in getting wider adoption of ontologies and knowledge graphs is that if you recruit and/or assemble a group of ontologists, there is a very good chance you will have members from several of the schools described above. There is a good chance they will have conflicting goals, and even different definitions of what “good” is. Often, they will not even realize that their difference of opinion is due to their being members of different tribes.

There isn’t one of these schools that is better than any of the others for all purposes. They each grew up solving different problems and emphasizing different aspects of the problem.

When you look at existing ontologies, especially those that were created by communities, you’ll often find that many are an accidental hybrid of the above schools. This is caused by different members coming to the project from different schools and applying their own best practices to the design project.

Rather than try to pick which school is “best,” you should consider what the objectives of your ontology project are and use that to determine which school is better matched. Select ontologists and other team members who are willing to work to the style of that school. Only then is it appropriate to consider “best practices.”

Acknowledgement

I want to acknowledge Michael Debellis for several pages of input on an early draft of this paper. The bits that didn’t make it into this paper may surface in a subsequent paper.

The Data-Centric Revolution: “RDF is Too Hard”

This article originally appeared at The Data-Centric Revolution: “RDF is Too Hard” – TDAN.com. Subscribe to The Data Administration Newsletter for this and other great content!

By Dave McComb

We hear this a lot. We hear it from very smart people. Just the other day we heard someone say they had tried RDF twice at previous companies and it failed both times. (RDF stands for Resource Description Framework,[1] which is an open standard underlying many graph databases). It’s hard to convince someone like that that they should try again.

That particular refrain was from someone who was a Neo4j user (the leading contender in the LPG (Labeled Property Graph) camp). We hear the same thing from any of three camps: the relational camp, the JSON camp, and the aforementioned LPG camp.

Each has a different reason for believing this RDF stuff is just too hard. Convincing those who’ve encountered setbacks to give RDF another shot is also challenging. In this article, I’ll explore the nuances of RDF, shedding light on challenges and strengths in the context of enterprise integration and application development.

For a lot of problems, the two-dimensional world of relational tables is appealing. Once you know the column headers, you pretty much know how to get to everything. It’s not quite one form per table, but it isn’t wildly off from that. You don’t have to worry about some of the rows having additional columns, you don’t have to worry about some cells being arrays or having additional depth. Everything is just flat, two-dimensional tables. Most reporting is just a few joins away.

JSON is a bit more interesting. At some point you discover, or decree if you’re building it, that your dataset has a structure. Not a two-dimensional structure as in relational, but more of a tree-like structure. More specifically, it’s all about determining if this is an array of dictionaries or a dictionary of arrays. Or a dictionary of dictionaries. Or an array of arrays. Or any deeply nested combination of these simple structures. Are the keys static — that is, can they be known specifically at coding time, or are they derived dynamically from the data itself? Frankly, this can get complex, but at least it’s only locally complex. A lot of JSON programming is about turning someone else’s structure into a structure that suits the problem at hand.
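
A tiny sketch of that structural choice, with made-up data: the same three readings as an array of dictionaries and as a dictionary of arrays.

```python
# Two shapes for the same hypothetical data.
array_of_dicts = [
    {"sensor": "A", "value": 1.2},
    {"sensor": "B", "value": 3.4},
    {"sensor": "C", "value": 5.6},
]
dict_of_arrays = {
    "sensor": ["A", "B", "C"],
    "value":  [1.2, 3.4, 5.6],
}
# Much JSON programming is reshaping one of these into the other.
```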

One way to think of LPG is JSON on top of a graph database. It has a lot of the flexibility of JSON coupled with the flexibility of graph traversals and graph analytics. It solves problems that are difficult to solve with relational or plain JSON, and it has beautiful graphics out of the box.

Each of these approaches can solve a wide range of problems. Indeed, almost all applications use one of those three approaches to structure the data they consume.

And I have to admit, I’ve seen a lot of very impressive Neo4j applications lately. Every once in a while, I question myself and wonder aloud whether we should be using Neo4j. Not because RDF is too hard for us (we’ve mastered it and have many successful implementations running at client sites and internally), but because if it really is easier, maybe we should switch. And maybe it just isn’t worth disagreeing with our prospects.

Enterprise Integration is Hard

Then it struck me. The core question isn’t really “RDF v LPG (or JSON or relational),” it’s “application development v. enterprise integration.”

I’ve heard Jans Aasman, CEO of Franz, the creators of AllegroGraph, make this observation more than once: “Most application developers have dedicated approximately 0 of their neurons contemplating how what they are working on is going to fit in with the rest of their enterprise,  whereas people who are deeply into RDF may spend upwards of half their mental cycles thinking of how the task and data at hand fits into the overall enterprise model.”

That, I think, is the nub of the issue. If you are not concerned with enterprise integration, then maybe those features that scratch the itches that enterprise integration creates are not worth the added hassle.

Let’s take a look at the aspects of enterprise integration that are inherently hard, why RDF might be the right tool for the job, and why it might be overkill for traditional application development.

Complexity Reduction

One of the biggest issues dealing with enterprise integration is complexity. Most mid to large enterprises harbor thousands of applications. Each application has thousands of concepts (tables and columns or classes and attributes or forms and fields) that must be learned to become competent either in using the application and/or in debugging and extending it. No two application data models are alike. Even two applications in the same domain (e.g., two inventory systems) will have comically different terms, structures, and even levels of abstraction.

Each application is at about the complexity horizon that most mere mortals can handle. The combination of all those models is far beyond the ability of individuals to grasp.

Enterprise Resource Planning applications and Enterprise Data Modeling projects have shone a light on how complex it can get to attempt to model all an enterprise’s data. ERP systems now have tens of thousands of tables, and hundreds of thousands of columns. Enterprise Data Modeling fell into the same trap. Most efforts attempted to describe the union of all the application models that were in use. The complexity made them unusable.

What few people who are focused on a point solution realize is that there is a single, simple model at the heart of every enterprise. It is simple enough that motivated analysts and developers can get their heads around it in a finite amount of time. And it can be mapped to the existing complex schemas in a lossless fashion.

The ability to posit these simple models is enabled by RDF (and its bigger brothers OWL and SHACL). RDF doesn’t guarantee you’ll create a simple or understandable model (there are plenty of counterexamples out there) but it at least makes the problem tractable.

Concept Sharing

An RDF based system is mostly structure-free, so we don’t have to be concerned with structural disparities between systems, but we do need a way to share concepts. We need to have a way to know that “employee,” “worker,” “user,” and “operator” are all referring to the same concept.  Or if they aren’t, in what ways they overlap.

In an RDF-based system we spend a great deal of time understanding the concepts that are being used in all the application systems, and then creating a way for both the meaning and the identity of each concept to be easily shared across the enterprise. We also make sure that the mapping between the existing application schema elements and the shared concepts is well known and findable.

One mechanism that helps with this is the idea that concepts have global identifiers (URIs /IRIs) that can be resolved.  You don’t need to know which application defined a concept; the domain name (and therefore the source authority) is right there in the identifier and can be used much like a URL to surface everything known about the concept.  This is an important feature of enterprise integration.

Instance Level Integration

It’s not just the concepts. All the instances referred to in application systems have identifiers.  But often the identifiers are local. That is, “007” refers to James Bond in the Secret Agent table, but it refers to “Ham Sandwich” in the company cafeteria system.

The fact that systems have been creating identity aliases for decades is another problem that needs to be addressed at the enterprise level. The solution is not to attempt, as many have in the past, to jam a “universal identifier” into the thousands of affected systems. It is too much work, and they can’t handle it anyway. Plus, there are many identity problems that were unpredicted at the time their systems were built (who imagined that some of our vendors would also become customers?) and are even harder to resolve.

The solution involves a bit of entity resolution, coupled with a flexible data structure that can accommodate multiple identifiers without getting confused.
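
A minimal sketch (rdflib, hypothetical IRIs) of one such flexible structure: the real-world thing gets one node, and each system's local identifier hangs off it as its own resource, so "007" in two systems never collides.

```python
# A minimal sketch of a multi-identifier pattern, with hypothetical IRIs and data.
from rdflib import Graph

ttl = """
@prefix ex: <https://example.org/> .

ex:_JamesBond a ex:Person ;
    ex:identifiedBy ex:_id1, ex:_id2 .

ex:_id1 a ex:Identifier ;
    ex:value "007" ;
    ex:assignedBy ex:_SecretAgentSystem .

ex:_id2 a ex:Identifier ;
    ex:value "EMP-84112" ;
    ex:assignedBy ex:_HRSystem .
"""
g = Graph().parse(data=ttl, format="turtle")

# Which identifiers does each source system use for this person?
query = """
PREFIX ex: <https://example.org/>
SELECT ?value ?system WHERE {
    ex:_JamesBond ex:identifiedBy ?id .
    ?id ex:value ?value ; ex:assignedBy ?system .
}"""
for value, system in g.query(query):
    print(value, system)
```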

Data Warehouse, Data Lake, and Data Catalog all in One

Three solutions have been mooted over the last three decades to partially solve the enterprise integration problem: data warehouses, lakes, and catalogs.  Data warehouses acknowledged that data has become balkanized.  By conforming it to a shared dimensional model and co-locating the data, we could get combined reporting.  But the data warehouse was lacking on many fronts: it only had a fraction of the enterprise’s data, it was structured in a way that wouldn’t allow transactional updates, and it was completely dependent on the legacy systems that fed it. Plus, it was a lot of work.

The data lake approach said co-location is good, let’s just put everything in one place and let the consumers sort it out. They’re still trying to sort it out.

Finally, the data catalog approach said: don’t try to co-locate the data, just create a catalog of it so consumers can find it when they need it.

The RDF model allows us to mix and match the best of all three approaches. We can conform some of the enterprise data (we usually recommend all the entity data, such as MDM and the like, as well as some of the key transactional data). An RDF catalog, coupled with an R2RML or RML style map, will not only allow a consumer to find data sets of interest; in many cases the data can also be accessed using the same query language as the core graph. This ends up being a great solution for things like IoT, where there are great volumes of data that only need to be accessed on an exception basis.
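
For illustration, a minimal R2RML mapping sketch (held in a Python string; the table and IRIs are hypothetical) showing how rows in a relational source can be surfaced as triples without first copying them into the graph.

```python
# A minimal R2RML sketch: declares how rows of a (hypothetical) IOT_READINGS
# table appear as triples, so the data can be found and queried in place.
r2rml_ttl = """
@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix ex: <https://example.org/> .

ex:SensorReadingMap a rr:TriplesMap ;
    rr:logicalTable [ rr:tableName "IOT_READINGS" ] ;
    rr:subjectMap [
        rr:template "https://example.org/reading/{READING_ID}" ;
        rr:class ex:SensorReading
    ] ;
    rr:predicateObjectMap [
        rr:predicate ex:measuredValue ;
        rr:objectMap [ rr:column "VALUE" ]
    ] .
"""
```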

Query Federation

We hinted at query federation in the above paragraph. The fact that query federation is built into the spec (of SPARQL, which is the query language of choice for RDF, and also doubles as a protocol for federation) allows data to be merged at query time, across different database instances, different vendors and even different types of databases (with real time mapping, relational and document databases can be federated into SPARQL queries).
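
A minimal sketch (hypothetical endpoints) of what that looks like in SPARQL 1.1: the SERVICE keyword pulls in a remote endpoint so the join happens at query time.

```python
# A federated SPARQL query, held as a string. The endpoint URL and IRIs are
# hypothetical; any SPARQL engine that implements federation could execute it.
federated_query = """
PREFIX ex: <https://example.org/>

SELECT ?vendor ?legalName WHERE {
    ?vendor a ex:Vendor .                          # local enterprise graph
    SERVICE <https://partner.example.com/sparql> {
        ?vendor ex:legalName ?legalName .          # remote graph, joined on the shared IRI
    }
}
"""
```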

Where RDF Might Be Overkill

The ability to aid enterprise integration comes at a cost. Making sure you have valid, resolvable identifiers is a lot of work. Harmonizing your data model with someone else’s is also a lot of work. Thinking primarily in graphs is a paradigm shift. Anticipating and dealing with the flexibility of schema-later modeling adds a lot of overhead. Dealing with the oddities of open world reasoning is a major brain breaker.

If you don’t have to deal with the complexities of enterprise integration, and you are consumed by solving the problem at hand, then maybe the added complexity of RDF is not for you.

But before you believe I’ve just given you a free pass consider this: half of all the work in most IT shops is putting back together data that was implemented by people who believed they were solving a standalone problem.

Summary

There are many aspects of the enterprise integration problem that lend themselves to RDF-based solutions. The very features that help at the enterprise integration level may indeed get in the way at the point solution level.

And yes, it would in theory be possible to graft solutions to each of the above problems (and more, including provenance and fine-grained authorization) onto relational, JSON or LPG. But it’s a lot of work and would just be reimplementing the very features that developers in these camps find so difficult.

If you are attempting to tackle enterprise integration issues, we strongly encourage you to consider RDF. There is a bit of a step function to learn it and apply it well, but we think it’s the right tool for the job.

Data-Centric Revolution: Is Knowledge Ontology the Missing Link?

“You would think that after knocking around in semantics and knowledge graphs for over two decades I’d have had a pretty good idea about Knowledge Management, but it turns out I didn’t.

I think in the rare event the term came up I internally conflated it with Knowledge Graphs and moved on. The first tap on the shoulder that I can remember was when we were promoting work on a mega project in Saudi Arabia (we didn’t get it, but this isn’t likely why). We were trying to pitch semantics and knowledge graphs as the unifying fiber for the smart city the Neom Line was to become.

In the process, we came across a short list of Certified Knowledge Management platforms they were considering. Consider my chagrin when I’d never heard of any of them. I can no longer find that list, but I’ve found several more since…”

Read the rest: Data Centric Revolution: Is Knowledge Ontology the Missing Link? – TDAN.com

Interested in joining the discussion? Join the gist Forum.

The Data-Centric Revolution: Zero Copy Integration

I love the term “Zero Copy Integration.” I didn’t come up with it; the Data Collaboration Alliance did. The Data Collaboration Alliance is a Canada-based advocacy group promoting localized control of data along with federated access.

What I like about the term is how evocative it is. Everyone knows that all integration consists of copying and transforming data, whether you do that through an API, through ETL (Extract, Transform, and Load), or through data-lake-style ELT (Extract, Load, and leave it to someone else to maybe eventually Transform). Either way, we know from decades of experience that integration is at its core copying data from a source to a destination.

This is why “copy-less copying” is so evocative.  It forces you to rethink your baseline assumptions.

We like it because it describes what we’ve been doing for years, and never had a name for. In this article, I’m going to drill a bit deeper into the enabling technology (i.e., what do you need to have in place to get Zero Copy Integration to work), then do a case study, and finally wrap up with “do you literally mean zero copy?”

 

Read more at: The Data-Centric Revolution: Zero Copy Integration – TDAN.com

The Data-Centric Revolution: Detour/Shortcut to FAIR

The FAIR principles for data sets are gaining traction, especially in the pharmaceutical industry in Europe. FAIR stands for: Findable, Accessible, Interoperable and Reusable. In a world of exponential data growth and ever-increasing silo-ization, the FAIR principles are needed more than ever. In this article, we will first summarize the FAIR principles and describe the typical roadmap to FAIR. After that we will argue that using a Data-Centric approach is the best route to achieving FAIR principles. This material was recently presented at a FAIR workshop for a European pharmaceutical firm.

FAIR emerged as a response to the fragmentation of data within and among most large firms.  Most of the publicly available FAIR materials tend to focus on awareness and assessment (how FAIR are you?) and leave the way-finding to individual companies. Difficulties in sharing and applying data to problems create a friction on innovation that FAIR is meant to address.

 

Keep reading at The Data-Centric Revolution: Detour / Shortcut to FAIR – TDAN.com

The Data-Centric Revolution: OWL as a Discipline

Many developers pooh-pooh OWL (the dyslexic acronym for the Web Ontology Language). Many decry it as “too hard,” which seems bizarre, given that most developers I know pride themselves on their cleverness (and, as anyone who takes the time to learn OWL knows, it isn’t very hard at all). It does require you to think slightly differently about the problem domain and your design. And I think that’s what developers don’t like. If they continue to think in the way they’ve always thought, and try to express themselves in OWL, yes, they will and do get frustrated.

That frustration might manifest itself in a breakthrough, but more often, it manifests itself in a retreat. A retreat perhaps to SHACL, but more often the retreat is more complete than that: to not doing data modeling at all. By the way, this isn’t an “OWL versus SHACL” discussion; we use SHACL almost every day. This is an “OWL plus SHACL” conversation.

The point I want to make in this article is that it might be more productive to think of OWL not as a programming language, not even as a modeling language, but as a discipline. A discipline akin to normalization.

Keep reading: The Data-Centric Revolution: OWL as a Discipline – TDAN.com

Read more of Dave’s articles: mccomb – TDAN.com

Incremental Stealth Legacy Modernization

I’m reading the book Kill it with Fire by Marianne Bellotti. It is a delightful book. Plenty of pragmatic advice, both on the architectural side (how to think through whether and when to break up that monolith) and the organizational side (how to get and maintain momentum for what are often long, drawn-out projects). So far in my reading she seems to advocate incremental improvement over rip and replace, which is sensible, given the terrible track record with rip and replace. Recommended reading for anyone who deals with legacy systems (which is to say anyone who deals with enterprise systems, because a majority are or will be legacy systems).

But there is a better way to modernize legacy systems. Let me spoil the suspense: it is Data-Centric. We are calling it Incremental Stealth Legacy Modernization because no one is going to get the green light to take this on directly. This article is for those playing the long game.

Legacy Systems

Legacy is the covering concept for a wide range of activities involving aging enterprise systems. I had the misfortune of working in Enterprise IT just as the term “legacy” became pejorative. It was the early 1990s, and we were just completing a long-term strategic plan for Johns Manville. We decided to call it the “Legacy Plan,” as we thought those involved with it would leave a legacy to those who came after. The ink had barely dried when “legacy” acquired a negative connotation. (While writing this I looked it up, and Wikipedia thinks the term had already acquired its negative connotation in the 1980s. It seems to me that if it were in widespread use, someone would have mentioned it before we published that report.)

There are multiple definitions of what makes something a legacy system. Generally, it refers to older technology that is still in place and operating. What tends to keep legacy systems in place are networks of complex dependencies. A simple stand-alone program does not become a legacy system, because when the time comes, it can easily be rewritten and replaced. Legacy systems have hundreds or thousands of external dependencies, that often are not documented. Removing, replacing, or even updating legacy systems runs the risk of violating some of those dependencies. It is the fear of this disruption that keeps most legacy systems in place. And the longer it stays in place the more dependencies it accretes.

If these were the only forces affecting legacy systems, they would stay in place forever. The countervailing forces are obsolescence, dis-economy, and risk. While many parts of the enterprise depend on the legacy system, the legacy system itself has dependencies. The system is dependent on operating systems, programming languages, middleware, and computer hardware. Any of these dependencies can and do become obsolescent and eventually obsolete. Obsolete components are no longer supported and therefore represent a high risk of total failure of the system.

The two main dimensions of dis-economy are operations and change. A modern system can typically run at a small fraction of the operating costs of a legacy system, especially when you tally up all the licenses for application systems, operating systems, and middleware and add in the salary costs of the operators and administrators who support it. The dis-economy of change is well known, coming in the form of integration debt. Legacy systems are complex and brittle, which makes change hard. The cost to make even the smallest change to a legacy system is orders of magnitude more than the cost to make a similar change to a modern, well-designed system. They are often written in obscure languages. One of my first legacy modernization projects involved replacing a payroll system written in assembler language with one that was to be written in “ADPAC.” You can be forgiven for thinking it is insane to have written a payroll system in assembler language, and even more so for replacing it with a system written in a language that no one in the 21st century has heard of, but this was a long time ago, and it is indicative of where legacy systems come from.

Legacy Modernization

Eventually the pressure to change overwhelms the inertia to leave things as they are. This usually does not end well, for several reasons. Legacy modernization is usually long delayed. There is not a compelling need to change, and as a result, for most of the life of a legacy system, resources have been assigned to other projects that get short-term net positive returns. Upgrading the legacy system offers low upside. The new legacy system will do the same thing the old legacy system did, perhaps a bit cheaper or a bit better, but not fundamentally differently. Your old payroll system is paying everyone, and so will a new one.

As a result, the legacy modernization project is delayed as long as possible. When the inevitable precipitating event occurs, the replacement becomes urgent. People are frustrated with the old system. Replacing the legacy system with some more modern system seems like a desirable thing to do. Usually this involves replacing an application system with a package, as this is the easiest project to get approved. These projects were called “Rip and Replace” until the success rate of this approach plummeted. It is remarkable how expensive these projects are and how frequently they fail. Each failure further entrenches the legacy system and raises the stakes for the next project.

As Ms. Bellotti points out in Kill It with Fire, many times the way to go is incremental improvement. By skillfully understanding the dependencies and engineering decoupling techniques, such as APIs and intermediary data sets, it is possible to stave off some of the highest-risk aspects of the legacy system. This is preferable to massive modernization projects that fail but, interestingly, it has its own downsides: major portions of the legacy system continue to persist, and as she points out, few developers want to sign on to this type of work.

We want to outline a third way.

The Lost Opportunity

After a presentation on Data-Centricity, someone in the audience pointed out that data warehousing represented a form of Data-Centricity. Yes, in a way it does. With Data Warehousing, and more recently Data Lakes and Data Lakehouses, you have taken a subset of the data from numerous data silos and put it in one place for easier reporting. Yes, this captures a few of the data-centric tenets.

But what a lost opportunity. Think about it, we have spent the last 30 years setting up ETL pipelines and gone through several generations of data warehouses (from Kimball / Inmon roll your own to Teradata, Netezza to Snowflake and dozens more along the way) but have not gotten one inch closer to replacing any legacy systems. Indeed, the data warehouse is entrenching the legacy systems deeper by being dependent on them for their source of data. The industry has easily spent hundreds of billions of dollars, maybe even trillions of dollars over the last several decades, on warehouses and their ecosystems, but rather than getting us closer to legacy modernization we have gotten further from it.

Why no one will take you seriously

If you propose replacing a legacy system with a Knowledge Graph you will get laughed out of the room. Believe me, I’ve tried. They will point out that the legacy systems are vastly complex (which they are), have unknowable numbers of dependent systems (they do), the enterprise depends on their continued operation for its very existence (it does) and there are few if any reference sites of firms that have done this (also true). Yet, this is exactly what needs to be done, and at this point, it is the only real viable approach to legacy modernization.

So, if no one will take you seriously, and therefore no one will fund you for this route to legacy modernization, what are you to do? Go into stealth mode.

Think about it: if you did manage to get funded for a $100 million legacy replacement project, and it failed, what do you have? The company is out $100 million, and your reputation sinks with the $100 million. If instead you get approval for a $1 Million Knowledge Graph based project that delivers $2 million in value, they will encourage you to keep going. Nobody cares what the end game is, but you.

The answer then, is incremental stealth.

Tacking

At its core, it is much like sailing into the wind. You cannot sail directly into the wind. You must tack, and sail as close into the wind as you can, even though you are not headed directly towards your target. At some point, you will have gone far to the left of the direct line to your target, and you need to tack to starboard (boat speak for “right”). After a long starboard tack, it is time to tack to port.

In our analogy, taking on legacy modernization directly is sailing directly into the wind. It does not work. Incremental stealth is tacking. Keep in mind though, just incremental improvement without a strategy is like sailing with the wind (downwind): it’s fun and easy, but it takes you further from your goal, not closer.

The rest of this article lays out what we think the important tacking strategy should be for a firm that wants to take the Data-Centric route to legacy modernization. We have several clients that are on the second and third tack in this series.

I’m going to use a hypothetical HR / Payroll legacy domain for my examples here, but they apply to any domain.

Leg 1 – ETL to a Graph

The first tack is the simplest. Just extract some data from legacy systems and load it into a graph database. You will not get a lot of resistance to this, as it looks familiar. It looks like yet another data warehouse project. The only trick is getting sponsors to go this route instead of the tried-and-true data warehouse route. The key enablers here are to find problems well suited to graph structures, such as those that rely on graph analytics or shortest-path problems, and to find data that is hard to integrate in a data warehouse; a classic example is integrating structured data with unstructured data, which is nearly impossible in traditional warehouses and merely tricky in graph environments.
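
A minimal sketch of this first tack, with a hypothetical legacy table and IRIs: read rows from the legacy source and emit them as triples for the graph.

```python
# A minimal ETL-to-graph sketch. The database, table, and IRIs are hypothetical.
import sqlite3
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("https://example.org/")
g = Graph()
g.bind("ex", EX)

conn = sqlite3.connect("legacy_hr.db")   # stand-in for the legacy source
for emp_id, name, dept in conn.execute("SELECT EMP_ID, NAME, DEPT FROM EMPLOYEE"):
    emp = EX[f"_Employee_{emp_id}"]
    g.add((emp, RDF.type, EX.Employee))
    g.add((emp, EX.name, Literal(name)))
    g.add((emp, EX.memberOf, EX[f"_Department_{dept}"]))

# Serialize (or load directly into the graph store) on each refresh.
g.serialize("employees.ttl", format="turtle")
```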

The only difficulty is deciding how long to stay on this tack. As long as each project is adding benefit, it is tempting to stay on this tack for a long, long time. We recommend staying this course at least until you have a large subset of the data in at least one domain in the graph while refreshing frequently.

Let’s say that after being on this tack for a long while, you have all the key data on all your employees in the graph, and it is being updated frequently.

Leg 2 – Architecture MVP

On the first leg of the journey there are no updates being made directly to the graph. Just as in a data warehouse: no one makes updates in place in the data warehouse. It is not designed to handle that, and it would mess with everyone’s audit trails.

But a graph database does not have the limitations of a warehouse. It is possible to have ACID transactions directly in the graph. But you need a bit of architecture to do so. The challenge here is creating just enough architecture to get through your next tack. Where you start depends a lot on what you think your next tack will be. You’ll need constraint management to make sure your early projects are not loading invalid data back into your graph. Depending on the next tack, you may need to implement fine-grained security.

Whatever you choose, you will need to build or buy enough architecture to get your first update in place functionality going.
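
One way to picture that "just enough architecture," as a minimal sketch using pySHACL (the function and shapes are hypothetical): validate an incoming update against SHACL shapes before committing it to the graph.

```python
# A minimal update guard sketch: commit the update only if the combined graph
# still conforms to the SHACL shapes.
from rdflib import Graph
from pyshacl import validate

def guarded_commit(store: Graph, update: Graph, shapes: Graph) -> bool:
    """Add the update triples only if the result conforms to the shapes."""
    candidate = store + update                       # set-style union of triples
    conforms, _, report = validate(candidate, shacl_graph=shapes)
    if conforms:
        store += update                              # accept the update
        return True
    print(report)                                    # surface violations to the caller
    return False
```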

Leg 3 — Simple new Functionality in the Graph

In this leg we begin building update-in-place business use cases. We recommend not trying to replace anything yet. Concentrate on net new functionality. Some of the best places to start at the moment are maintaining reference data (common shared data such as country codes, currencies, and taxonomies) and/or some metadata management. Everyone seems to be doing data cataloging projects these days; they could just as well be done in the graph and give you experience working through this new paradigm.

The objective here is to spend enough time on this tack that developers become comfortable with the new development paradigm. Coding directly to graph involves new libraries and new patterns.

Optionally, you may want to stay on this tack long enough to build “model driven development” (low code / no code in Gartner speak) capability into the architecture. The objective of this effort is to drastically reduce the cost of implementing new functionality in future tacks. Comparing before-and-after metrics on code development, code testing, and code defects will make a startling case for this approach. Or you could leave model driven development to a future tack.

Using the payroll / HR example, this leg adds new functionality that depends on HR data but that nothing else depends on yet. Maybe you build a skills database, or a learning management system. It depends on what is not yet in place that can be purely additive. These are good places to start demonstrating business value.

Leg 4 – Understand the Legacy System and its Environment

Eventually you will get good at this and want to replace some legacy functionality. Before you do, it will behoove you to do a bunch of deep research. Many legacy modernization attempts have run aground from not knowing what they did not know.

There are three things that you don’t fully know at this point:

  • What data the legacy system is managing
  • What business logic the legacy system is delivering
  • What systems are dependent on the legacy system, and what the nature of those dependencies is

If you have done the first three tacks well, you will have all the important data from the domain in the graph. But you will not have all the data. In fact, at the metadata level, it will appear that you have the tiniest fraction of the data. In your Knowledge Graph you may have populated a few hundred classes and used a few hundred properties, but your legacy system has tens of thousands of columns. By appearances you are missing a lot. What we have discovered anecdotally, but have not yet proven, is that legacy systems are full of redundancy and emptiness. You will find that you do have most of the data you need, but before you proceed you need to prove this.

We recommend data profiling using software from a company such as GlobalIDs, IoTahoe, or BigID. This software reads all the data in the legacy system and profiles it. It discovers patterns and creates histograms, which reveal where the redundancy is. More importantly, you can find data that is not in the graph and have a conversation about whether it is needed. A lot of the data in legacy systems consists of accumulators (YTD, MTD, etc.) that can easily be replaced by aggregation functions, processing flags that are no longer needed, and vast numbers of fields that are no longer used but that both business and IT are afraid to let go of. This profiling will provide that certainty.

Another source of fear is “business logic” hidden in the legacy system. People fear that we do not know all of what the legacy system is doing and that turning it off will break something. There are millions of lines of code in that legacy system; surely it is doing something useful. Actually, it is not. There is remarkably little essential business logic in most legacy systems. I know, as I’ve built complex ERP systems and implemented many packages. Most of this code is just moving data from the database to an API, or to a transaction to another API, or into a conversational control record, or to the DOM if it is a more modern legacy system, onto the screen and back again. There is a bit of validation sprinkled throughout, which some people call “business logic,” but that is a stretch; it’s just validation. There is some mapping (when the user selects “Male” in the drop-down, put “1” in the gender field). And occasionally there is a bit of bona fide business logic. Calculating economic order quantities, critical paths, or gross-to-net payroll calculations is genuine business logic. But such logic represents far less than 1% of the code base. The value is in being sure you have found it and moved it into the graph.

This is where reverse engineering or legacy understanding software plays a vital role. Ms. Bellotti is 100% correct on this point as well. If you think these reverse engineering systems are going to automate your legacy conversion, you are in for a world of hurt. But what they can do is help you find the genuine business logic and provide some comfort to the sponsors that there isn’t something important that the legacy system is doing that no one knows about.

The final bit of understanding is the dependencies. This is the hardest one to get complete. The profiling software can help. Some of it can detect that the histogram of social security numbers in system A changed and that the next day the same change showed up in system B, and therefore that there must be an interface between them. But beyond this, the best you can do is catalog all the known data feeds and APIs. These are the major mechanisms that other systems use to become dependent on the legacy system. You will need strategies to mimic these dependencies to begin the migration.

This tack is purely research, and therefore does not deliver any perceived immediate gain. You may need to bundle it with some other project that is providing incremental value to get it funded, or you may fund it from a contingency budget.

Leg 5 – Become the System of Record for some subset

Up to this point, data has been flowing into the graph from the legacy system or originating directly in the graph.

Now it is time to begin the reverse flow. We need to find an area where we can begin the flow going in the other direction. We now have enough architecture to build and answer use cases in the graph; it is time to start publishing rather than subscribing.

It is tempting to want to feed all the data back to the legacy system, but the legacy system has lots of data we do not want to source. Furthermore, doing so entrenches us deeper into the legacy system. We need to pick off small areas that could decommission part of the legacy system.

Let’s say there was a certificate management system in the legacy system. We replace this with a better one in the graph and quit using the legacy one. But from our investigation above, we realize that the legacy certificate management system was feeding some points to the compensation management system. We just make sure the new system can feed the compensation system those points.

Leg 6 – Replace the dependencies incrementally

Now the flywheel is starting to turn. Encouraged by the early success of the reverse flow, the next step is to work out the data dependencies in the legacy system and work out a sequence to replace them.

The legacy payroll system is dependent on the benefit elections system. You now have two choices. You could replace the benefits system in the Graph. Now you will need to feed the results of the benefit elections (how much to deduct for the health care options etc.) to the legacy system. This might be the easier of the two options.

But the option with the most impact is the other one: replace the payroll system. You have the benefits data feeding into the legacy system. If you replace the payroll system, there is nothing else (in HR) you need to feed. A feed to the financial system and the government reporting system will be necessary, but you will have taken a much bigger leap in the legacy modernization effort.

Leg 7 – Lather, Rinse, Repeat

Once you have worked through a few of those, you can safely decommission the legacy system a bit at a time. Each time, pick off an area that can be isolated. Replace the functionality and feed the remaining bits of the legacy infrastructure if necessary. Just stop using that portion of the legacy system. The system will gradually atrophy. No need for any big bang replacement. The risk is incremental and can be rolled back and retried at any point.

Conclusion

We do not go into our clients claiming to be doing legacy modernization, but it is our intent to put them in a position where they could realize it over time by applying knowledge graph capabilities.

We all know that at some point all legacy systems will have to be retired. At the moment, the state of the art seems to be either “rip and replace,” usually putting in a packaged application to replace the incumbent legacy system, or incrementally improving the legacy system in place.

We think there is a risk-averse, predictable, and self-funding route to legacy modernization, and it is done through Data-Centric implementation.

SHACL and OWL

There is a meme floating around out in the internet ether these days: “Is OWL necessary, or can you do everything you need to with SHACL?” We use SHACL most days and OWL every day, and we find both quite useful.

It’s a matter of scope. If you limited your scope to replacing individual applications, you could probably get away with just using SHACL. But frankly, if that is your scope, maybe you shouldn’t be in the RDF ecosystem at all. If you are just making a data graph, and not concerned with how it fits into the broader picture, then Neo4j or TigerGraph should give you everything you need, with much less complexity.

If your scope is unifying the information landscape of an enterprise, or an industry / supply chain, and if your scope includes aligning linked open data (LOD), then our experience says OWL is the way to go. At this point you’re making a true knowledge graph.

By separating meaning (OWL) from structure (SHACL) we find it feasible to share meaning without having to share structure. Payroll and HR can share the definition and identity of employees, while sharing very little of their structural representations.
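
A minimal sketch (hypothetical IRIs) of that separation: one shared OWL class carries the meaning and identity, while payroll and HR each attach their own SHACL shape for the structure they need.

```python
# One shared class, two department-specific shapes. IRIs and properties are hypothetical.
shared_meaning_and_local_shapes = """
@prefix ex:  <https://example.org/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ex:Employee a owl:Class .                     # shared meaning and identity

ex:PayrollEmployeeShape a sh:NodeShape ;      # payroll's structural view
    sh:targetClass ex:Employee ;
    sh:property [ sh:path ex:payGrade ; sh:minCount 1 ] .

ex:HREmployeeShape a sh:NodeShape ;           # HR's structural view
    sh:targetClass ex:Employee ;
    sh:property [ sh:path ex:hireDate ; sh:datatype xsd:date ; sh:minCount 1 ] .
"""
```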

Employing formal definitions of classes takes most of the ambiguity out of systems. We have found that leaning into full formal definitions greatly reduces the complexity of the resulting enterprise ontologies.

We have a methodology we call “think big / start small.” The “think big” portion is primarily about getting a first version of an enterprise ontology implemented, and the “start small” portion is about standing up a knowledge graph and conforming a few data sets to the enterprise ontology. As such, the “think big” portion is primarily OWL. The “start small” portion consists of small incremental extensions to the core model (also in OWL), conforming the data sets to the ontology (TARQL, R2RML, SMS, or similar technologies), SHACL to ensure conformance, and SPARQL to prove that it all fits together correctly.

For us, it’s not one tool, or one standard, or one language for all purposes. For us it’s like Mr. Natural says, “get the right tool for the job.”

The Data-Centric Revolution: Avoiding the Hype Cycle

Gartner has put “Knowledge Graphs” at the peak of inflated expectations. If you are a Knowledge Graph software vendor, this might be good news. Companies will be buying knowledge graphs without knowing what they are. I’m reminded of an old cartoon of an executive dictating into a dictation machine: “…and in closing, in the future implementing a relational database will be essential to the competitive survival of all firms. Oh, and Miss Smith, can you find out what a relational database is?” I imagine this playing out now, substituting “knowledge graph” for “relational database” and by-passing the misogynistic secretarial pool.

If you’re in the software side of this ecosystem, put some champagne on ice, dust off your business plan, and get your VCs on speed dial. Happy times are imminent.

Oh no! Gartner has put Knowledge Graphs at the peak of the hype cycle for Artificial Intelligence

Those of you who have been following this column know that our recommendations for data-centric transformations strongly encourage semantic technology and model driven development implemented on a knowledge graph. As we’ve said elsewhere, it is possible to become data-centric without all three legs of this stool, but it’s much harder than it needs to be. We find our fate at least partially tethered to the knowledge graph marketplace. You might think we’d be thrilled by the news that is lifting our software brethren’s boats.

But we know how this movie / roller coaster ends. Once a concept scales this peak, opportunists come out of the woodwork. Consultants will be touting their “Knowledge Graph Solutions,” and application vendors will repackage their Content Management System or ETL Pipeline product as a key step on the way to Knowledge Graph nirvana. Anyone who can spell “Knowledge Graph” will have one to offer.

Some of you will avoid the siren’s song, but many will not. Projects will be launched with great fanfare. Budgets will be blown. What is predictable is that these projects will fail to deliver on their promises. Sponsors will be disappointed. Naysayers will trot out their “I told you so’s.” Gartner will announce Knowledge Graphs are in the Trough of Disillusionment. Opportunists will jump on the next band wagon.

Click here to continue reading on TDAN.com

If you’re interested in Knowledge Graphs, and would like to avoid the trough of disillusionment, contact me: [email protected]

Smart City Ontologies: Lessons Learned from Enterprise Ontologies

For the last 20 years, Semantic Arts has been helping firms design and build enterprise ontologies to get them on the data-centric path. We have learned many lessons from the enterprise that can be applied in the construction of smart city ontologies.

What is similar between the enterprise and smart cities?

  • They both have thousands of application systems. This leads to thousands of arbitrarily different data models, which leads to silos.
  • The enterprise and smart cities want to do agile, BUT agile is a more rapid way to create more silos.

What is different between the enterprise and smart cities?

  • In the enterprise, everyone is on the same team working towards the same objectives.
  • In smart cities there are thousands of different teams working towards different objectives. For example:

      • Utility companies.
      • Sanitation companies.
      • Private and public industry.

  • Large enterprises have data lakes in data warehouses.
  • In smart cities there are little bits of data here and there.

What have we learned?

“Simplicity is the ultimate sophistication.”

20 years of Enterprise Ontology construction and implementation has taught us some lessons that apply to Smart City ontologies. The Smart City Landscape informs us how to apply those lessons, which are:

Think big and start small.

Simplicity is key to integration.

Low code / No code is key for citizen developers.

Semantic Arts gave this talk at the W3C Workshop on Smart Cities, originally recorded on June 25, 2021. Click here to view the talks and interactive sessions recorded for this virtual event. A big thank you to W3C for allowing us to contribute!
