Data-Centric: How Big Things Get Done (in IT)

Dave McComb

I read “How Big Things Get Done” when it first came out about six months ago.[1] I liked it then. But recently, I read another review of it, and another coin dropped. I’ll let you know what the coin was toward the end of this article, but first I need to give you my own review of this highly recommended book.

The prime author, Bent Flyvbjerg, is a professor of “Economic Geography” (whatever that is) and has a great deal of experience with engineering and architecture. Early in his career, he was puzzling over why mass transit projects seemed routinely to go wildly over budget. He examined many in great detail; some of his stories border on the comical, except for the money and disappointment that each new round brought.

He was looking for patterns, for causes. He began building a database of projects. He started with a database of 178 mass transit projects, but gradually branched out.

It turns out there wasn’t anything especially unique about mass transit projects. Lots of large projects go wildly over budget and schedule, but the question was: Why?

It’s not all doom and gloom and naysaying. He has some inspirational chapters about the construction of the Empire State Building, the Hoover Dam, and the Guggenheim Museum in Bilbao. All of these were in the rarified atmosphere of the less than ½ of 1% of projects that came in on time and on budget.

Flyvbjerg contrasted them with a friend’s brownstone renovation, California’s bullet train to nowhere, the Pentagon (it is five-sided because the originally proposed site had roads on five sides), and the Sydney Opera House. The Sydney Opera House was a disaster of such magnitude that the young architect who designed it never got another commission for the rest of his career.

Each of the major projects in his database has key considerations, such as original budget and schedule and final cost and delivery. The database is organized by type of project (nuclear power generation versus road construction, for instance). The current version of the database has 38,000 projects. From this database, he can calculate the average amount projects run over budget by project type.

IT Projects

He eventually discovered IT projects. He finds them to be among the most likely projects to run over budget. According to his database, IT projects run over budget by an average of 73%. This database is probably skewed toward larger projects and more public ones, but this should still be of concern to anyone who sponsors IT projects.

He described some of my favorites in the book, including healthcare.gov. In general, I think he got it mostly right. Reading between the lines, though, he seems to think there is a logical minimum that the software projects should be striving for, and therefore he may be underestimating how bad things really are.

This makes sense from his engineering/architecture background. For instance, the Hoover Dam has 4.3 million cubic yards of concrete. You might imagine a design that could have removed 10 or 20% of that, but any successful dam-building project would involve nearly 4 million cubic yards of concrete. If you can figure out how much that amount of concrete costs and what it would take to get it to the site and installed, you have a pretty good idea of what the logical minimal possible cost of the dam would be.

I think he assumed that early estimates for the cost of large software projects, such as healthcare.gov at $93 million, may have been closer to the logical minimum price, which just escalated from there, to $2.1 billion.

What he didn’t realize, but readers of Software Wasteland[2] as well as users of healthsherpa.com[3] did, was that the actual cost to implement the functionality of healthcare.gov is far less than $2 million; not the $93 million originally proposed, and certainly not the $2.1 billion it eventually cost. He likely reported healthcare.gov as a 2,100% overrun (final budget of $2.1 billion / original estimate of $93 million). This is what I call the “should cost” overrun. But the “could cost” overrun was closer to 100,000% (one hundred thousand percent, which is a thousand-fold excess cost).
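In round numbers, that comparison works out as follows:

    “should cost” overrun:  $2.1 billion / $93 million  ≈  23 times the original estimate  (the roughly 2,100% figure)
    “could cost” overrun:   $2.1 billion / $2 million   ≈  1,000 times what it could have cost  (roughly 100,000%)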

From his database, he finds that IT projects are in the top 20%, but not the worst if you use average overrun as your metric.

He has another interesting metric called the “fat tail.” If you imagine the distribution of project overruns around a mean, there are two tails to the bell curve, one on the left (projects that overrun less than average) and one on the right (projects that overrun more than average). If overruns were normally distributed, you would expect 68% of the projects to be within one standard deviation of the mean and 95% within two standard deviations. But that’s not what you find with IT projects. Once they go over, they have a very good chance of going way over, which means the right side of the bell curve goes kind of horizontal. He calls this a “fat tail.” IT projects have the fattest tails of all the projects in his database.

IT Project Contingency

Most large projects have “contingency budgets.” That is an amount of money set aside in case something goes wrong.

If the average large IT project goes over budget by 73%, you would think that most IT project managers would use a number close to this for their contingency budget. That way, they would hit their budget-with-contingency half the time.

If you were to submit a project plan with a 70% contingency, you would be laughed out of the capital committee. They would think that you have no idea how to manage a project of this magnitude. And they would be right. So instead, you put a 15% contingency (on top of the 15% contingency your systems integrator put in there) and hope for the best. Most of the time, this turns out badly, and half the time, this turns out disastrously (in the “fat tail” where you run over by 447%). As Dave Barry always says, “I am not making this up.”

Legacy Modernization

These days, many of the large IT projects are legacy modernization projects. Legacy modernization means replacing technology that is obsolete with technology that is merely obsolescent, or soon to become so. These days, a legacy modernization project might be replacing Cobol code with Java.

It’s remarkable how many of these there are. Some come about because programming languages become obsolete (really, it just becomes too hard to find programmers to work on code that is no longer padding their resumes). Far more common are vendor-forced migrations: “We will no longer support version 14.4 or earlier; clients will be required to upgrade.” What used to be an idle threat is now effectively mandatory, as staying current is essential to having access to security patches for newly discovered (zero-day) vulnerabilities.

When a vendor-forced upgrade is announced, often the client realizes this won’t be as easy as it sounds (mostly because the large number of modifications, extensions, and configurations they have made to the package over the years are going to be very hard to migrate). Besides, having been held hostage by the vendor for all this time, they are typically ready for a break. And so, they often put it out to bid, and bring in a new vendor.

What is it about these projects that makes them so rife with overruns? Flyvbjerg touches on it in the book. I will elaborate here.

Remember when your company implemented its first payroll system? Of course you don’t, unless you are, like, 90 years old. Trust me, everyone implemented their first automated payroll system in the 1950s and 1960s (so I’m told, I wasn’t there either). They implemented them with some of the worst technology you can imagine. Mainframe Basic Assembler Language and punched cards were state of the art on some of those early projects. These projects typically took dozens of person years (OK, back in those days they really were man years) to complete. This would be $2-5 million at today’s wages.

These days, we have modern programming languages, tools, and hardware that is literally millions of times more powerful than what was available to our ancestors. And yet a payroll system implementation in a major company is now a multi-hundred-million-dollar undertaking. “Wait, Dave, are you saying that the cost of implementing something as bog standard as a payroll system has gone up by a factor of 100, while the technology used to implement it has improved massively?” Yes, that is exactly what I’m saying.

To understand how this could be, consult the diagram below.

This is an actual diagram from a project with a mid-sized (7,000-person) company. Each box represents an application and each line an interface. Some are APIs, some are ETLs, and some are manual. All must be supported through any conversion.

My analogy is with heart transplantation. Any butcher worth their cleaving knife could remove one person’s heart and put in another in a few minutes. That isn’t the hard part. The hard part is keeping the patient alive through the procedure and hooking up all those arteries, veins, nerves, and whatever else needs to be restored. You don’t get to quit when you’re half done.

And so it is with legacy modernization. Think of any of those boxes in the above diagram as a critical organ. Replacing it involves reattaching all those pink lines (plus a bunch more you don’t even know are there).

DIMHRS was the infamous DoD project to upgrade their HR systems. They gave up with north of a billion dollars invested when they realized they likely only had about 20% of the interfaces completed and they weren’t even sure what the final number would be.

Back to Flyvbjerg’s Book

We can learn a lot by looking at the industries where projects run over the most and run over the least. The five types of projects that run over the most are:

  • Nuclear storage
  • Olympic Games
  • Nuclear power
  • Hydroelectric dams
  • IT

To paraphrase Tolstoy, “All happy projects are alike; each unhappy project is unhappy in its own way.”

The unhappiness varies. The Olympics is mostly political. Sponsors know the project is going to run wildly over, but want to do the project anyway, so they lowball the first estimate. Once the city commits, they have little choice but to build all the stadiums and temporary guest accommodations. One thing all of these have in common is they are “all or nothing” projects. When you’ve spent half the budget on a nuclear reactor, you don’t have anything useful. When you have spent 80% of the budget and the vendor tells you you are half done, you have few choices other than to proceed. Your half a nuclear plant is likely more liability than asset.

 

Capital Project Riskiness by Industry [4]

 

And so it is with most IT projects. Half a legacy modernization project is nothing.

Now let’s look at the bottom of Flyvbjerg’s table:

  • Roads
  • Pipelines
  • Wind power
  • Electrical transmission
  • Solar power

Roads. Really? That’s how bad the other 20 categories are.

What do these have in common? Especially wind and solar.

They are modular. Not modular as in made of parts; even nuclear power is modular in some fashion. They are modular in how their value is delivered. If you plan a wind project with 100 turbines, then when you have installed 10, you are generating 10% of the power you hoped the whole project would generate. You can stop at this point if you want (you probably won’t, as you’re coming in on budget and getting results).

In my mind, this is one reason wind and solar are going to outpace most predictions of their growth. It’s not just that they are green, or even that they are more economical (they are); they are also far more predictable and lower risk. People who invest capital like that.

Data-Centric as the Modular Approach to Digital Transformation

That’s when the coin dropped.

What we have done with data-centric is create a modular way to convert an enterprise’s entire data landscape. If we pitched it as one big monolithic project, it would likely be hundreds of millions of dollars, and by the logic above, high risk and very likely to go way over budget.

But instead, we have built a methodology that allows clients to migrate toward data-centric one modest sized project at a time. At the end of each project, the client has something of value they didn’t have before, and they have convinced more people within their organization of the validity of the idea.

Briefly how this works:

  • Design an enterprise ontology. This is the scaffolding that prevents subsequent projects from merely re-platforming existing silos into neo-silos.
  • Load data from several systems into a sandbox knowledge graph (KG) that conforms to the ontology (see the sketch after this list). This is nondestructive. No production systems are touched.
  • Update the load process to be live. This does introduce some redundant interfaces. It does not require any changes to existing systems, only some additions to the spaghetti diagram (this is all for the long-term good).
  • Grow the domain footprint. Each project can add more sources to the knowledge graph. Because of the ontology, the flexibility of the graph and the almost free integration properties of RDF technology, each domain adds more value, through integration, to the whole.
  • Add capability to the KG architecture. At first, this will be view-only capability. Visualizations are a popular first capability. Natural language search is another. Eventually, firms add composable and navigable interfaces, wiki-like. Each capability is its own project and is modular and additive as described above. If any project fails, it doesn’t impact anything else.
  • Add live transaction capture. This is the inflection point. Up to this point, the system has been a richer and more integrated data warehouse, relying on the legacy systems for all of its information, much as a data warehouse does. At this juncture, you implement the ability to build use cases directly on the graph. These use cases are not bound to each other in the way that monolithic legacy system use cases are. They are bound only to the ontology and are therefore extremely modular.
  • Make the KG the system of record. With the use case capability in place, the graph can become the source system and system of record for some data. Any data sourced directly in the graph no longer needs to be fed from the legacy system. People can continue to update it in the legacy system if there are other legacy systems that depend on it, but over time, portions of the legacy system will atrophy.
  • Legacy avoidance. We are beginning to see clients who are far enough down this path that they have broken the cycle of dependence they have been locked into for decades. The cycle is: If we have a business problem, we need to implement another application to solve it. It’s too hard to modify an existing system, so let’s build another. Once a client starts to get to critical mass in some subset of their business, they begin to become less eager to leap into another neo-legacy project.
  • Legacy erosion. As the KG becomes less dependent on the legacy systems, the users can begin partitioning off parts of it and decommissioning them a bit at a time. This takes a bit of study to work through the dependencies, but is definitely worth it.
  • Legacy replacement. When most of the legacy system’s data is already in the graph, and many of the use cases have been built, managers can finally propose a low-risk replacement project. Those pesky interface lines are still there, but there are two strategies that can be used in parallel to deal with them. One is to start furthest downstream, with the legacy systems that are fed by others but do little feeding themselves. The other is to replicate the interface functionality, but from the graph.
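Here is the sketch promised in the “Load data” step above: a single row from a hypothetical legacy HR table re-expressed as triples conforming to a gist-style ontology. The classes gist:Person and gist:Organization are real gist classes; every other IRI and property name is an illustrative assumption, and the namespace IRI should be verified against the gist release you actually load.

    @prefix gist: <https://w3id.org/semanticarts/ns/ontology/gist/> .   # verify against your gist release
    @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
    @prefix ex:   <https://data.example.com/> .

    # Legacy row:  EMP_ID=7342 | NAME='J. Ortiz' | DEPT='PAYROLL' | HIRE_DT=2019-04-01
    ex:_Person_7342 a gist:Person ;
        ex:name      "J. Ortiz" ;                    # illustrative property
        ex:memberOf  ex:_Organization_payroll ;      # illustrative property
        ex:hireDate  "2019-04-01"^^xsd:date .        # illustrative property

    ex:_Organization_payroll a gist:Organization .

Because the load writes only to the sandbox graph, the legacy table and every interface that reads it are untouched.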

We have done dozens of these projects. This approach works. It is modular, predictable, and low-risk.

If you want to talk to someone about getting on a path of modular modernization that really works, look us up.

The New Gist Model for Quantitative Data

Phil Blackwood

Every enterprise can benefit from having a simple, standard way to represent quantitative data. In this blog post, we will provide examples of how to use the new gist model of quantitative data released in gist version 13. After illustrating key concepts, we will look at how all the pieces fit together and provide one concrete end-to-end example.

Let’s examine the following:

  1. How is a measurement represented?
  2. Which units can be used to measure a given characteristic?
  3. How do I convert a value from one unit to another?
  4. How are units defined in terms of the International System of Units?

First, we want to be able to represent a fact like:

“The patio has an area of 144 square feet.”

The area of the patio is represented using this pattern:

… where:

A magnitude is an amount of some measurable characteristic.

An aspect is a measurable characteristic like cost, area, or mass.

A unit of measure is a standard amount used to measure or specify things, like US dollar, meter, or kilogram.
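A minimal Turtle sketch of the pattern for the patio fact: only gist:hasAspect is named in this post; the class names follow the terms defined above, and the remaining property names and instance IRIs are illustrative assumptions rather than official gist 13 terms.

    @prefix gist: <https://w3id.org/semanticarts/ns/ontology/gist/> .   # verify against your gist release
    @prefix ex:   <https://example.com/> .

    ex:_patioArea a gist:Magnitude ;
        ex:numericValue      144 ;               # the amount                          (illustrative property)
        gist:hasAspect       ex:_aspectArea ;    # the measurable characteristic: area
        ex:hasUnitOfMeasure  ex:_squareFoot .    # the standard amount used to measure (illustrative property)

    ex:_aspectArea  a gist:Aspect .
    ex:_squareFoot  a gist:UnitOfMeasure .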

Second, we need to be able to identify which units are applicable for measuring a given aspect. Consider a few simple examples, the aspects distance, energy, and cost:

For every aspect there is a group of applicable units. For example, there is a group of units that measure energy density:

… where:

A unit group is a collection of units that can be used to measure the same aspect.

 

A common scenario is that we want to validate the combination of aspect and unit of measure. All we need to do is check to see if the unit of measure is a member of the unit group for the aspect:
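As a sketch, the validation can be a single ASK query against the graph; apart from gist:hasAspect, the property names below are illustrative assumptions rather than official gist 13 terms.

    PREFIX gist: <https://w3id.org/semanticarts/ns/ontology/gist/>   # verify against your gist release
    PREFIX ex:   <https://example.com/>

    ASK {
      ex:_patioArea  gist:hasAspect       ?aspect ;
                     ex:hasUnitOfMeasure  ?unit .      # illustrative property
      ?unitGroup     ex:isUnitGroupFor    ?aspect ;    # illustrative property
                     ex:hasMember         ?unit .      # illustrative property
    }

The query returns true only when the unit of measure belongs to the unit group for the magnitude’s aspect.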

Next, we want to be able to convert measurements from one unit to another. A conversion like this makes sense only when the two units measure the same aspect. For example, we can convert pounds to kilograms because they both measure mass, but we can’t convert pounds to seconds. When a conversion is possible, the rule is simple:

There is an exception to the rule above for units of measure that do not have a common zero value. For example, 0 degrees Fahrenheit is not the same temperature as 0 degrees Kelvin.

To convert from Kelvin to Fahrenheit, reverse the steps: first divide by the conversion factor and then subtract the offset.

To convert a value from Fahrenheit to Celsius, first use the conversion above to convert to Kelvin, and then convert from Kelvin to Celsius.
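A worked example, using 459.67 as the Fahrenheit offset and 5/9 as its conversion factor (these particular numbers are supplied here for illustration, not quoted from the gist reference data):

    Fahrenheit to Kelvin  (add the offset, then multiply by the conversion factor):
        212 F  ->  (212 + 459.67) * 5/9  =  373.15 K

    Kelvin to Celsius  (Celsius has offset 273.15 and factor 1, so reverse the steps):
        373.15 K  ->  373.15 / 1 - 273.15  =  100 C

    So 212 degrees Fahrenheit converts, via Kelvin, to 100 degrees Celsius.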

 

Next, we will look at how units of measure are related to the International System of Units, which defines a small set of base units (kilogram, meter, second, Kelvin, etc.) and states:

Notice that every expression on the right side is a multiple of kilogram meter² per second³. We can avoid redundancy by “attaching” the exponents of base units to the unit group. That way, when adding a new unit of measure to the unit group for power there is no need to re-enter the data for the exponents.

The example also illustrates the conversion factors; each conversion factor appears as the initial number on the right hand side. In other words:

The conversion factors and exponents allow units of measure to be expressed in terms of the International System of Units, which acts as something of a Rosetta Stone for understanding units of measure.

One additional bit of modeling allows calculations of the form:

(45 miles per hour) x 3 hours = 135 miles

To enable this type of math, we represent miles per hour directly in terms of miles and hours:
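A minimal Turtle sketch of that definition; the multiplier and divisor property names are illustrative assumptions, not necessarily the gist 13 terms.

    @prefix gist: <https://w3id.org/semanticarts/ns/ontology/gist/> .   # verify against your gist release
    @prefix ex:   <https://example.com/> .

    ex:_milePerHour a gist:UnitOfMeasure ;
        ex:hasMultiplier  ex:_mile ;    # illustrative: the unit in the numerator
        ex:hasDivisor     ex:_hour .    # illustrative: the unit in the denominator

    # With this definition, 45 miles per hour times 3 hours cancels the hour and yields 135 miles.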

Putting the pieces together:

Here is the standard representation of a magnitude:

Every aspect has a group of units that can be used to measure it:

Every member of a unit group can be represented as a multiple of the same product of powers of base units of the International System of Units:

where X can be:

  • Ampere
  • Bit
  • Candela
  • Kelvin
  • Kilogram
  • Meter
  • Mole
  • Number
  • Other
  • Radian
  • Second
  • Steradian
  • USDollar

 

Every unit of measure belongs to one or more unit groups, and it can be defined in terms of other units acting as multipliers and divisors:

We’ll end with a concrete example, diastolic blood pressure.

The unit group for blood pressure is a collection of units that measure blood pressure. The unit group is related to the exponents of base units of the International System of Units:

Finally, one member of the unit group for blood pressure is millimeter of mercury. The scope note gives an equation relating the unit of measure to the base units (in this case, kilogram, meter, and second).

The diagrams above were generated using a visualization tool; the same content can also be written out as plain Turtle text.
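A sketch of roughly what that Turtle rendering looks like for the blood pressure example. Apart from gist:hasAspect and the conversion factor of about 133.322 pascals per millimeter of mercury, the class and property names here are illustrative assumptions.

    @prefix gist: <https://w3id.org/semanticarts/ns/ontology/gist/> .   # verify against your gist release
    @prefix skos: <http://www.w3.org/2004/02/skos/core#> .
    @prefix ex:   <https://example.com/> .

    ex:_aspectDiastolicBloodPressure a gist:Aspect .

    ex:_unitGroupBloodPressure a ex:UnitGroup ;                    # illustrative class name
        ex:isUnitGroupFor    ex:_aspectDiastolicBloodPressure ;    # illustrative property
        ex:kilogramExponent   1 ;                                  # pressure = kilogram / (meter second^2)
        ex:meterExponent     -1 ;
        ex:secondExponent    -2 .

    ex:_millimeterOfMercury a gist:UnitOfMeasure ;
        ex:memberOf          ex:_unitGroupBloodPressure ;          # illustrative property
        ex:conversionFactor  133.322 ;                             # 1 mmHg is approximately 133.322 pascal
        skos:scopeNote "1 millimeter of mercury = 133.322 kilogram per (meter second squared)" .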

For more examples and some basic queries, visit the GitHub repository gistReferenceData.

 

In closing, we would like to acknowledge the re-use of concepts from QUDT, namely:

  • every magnitude has an aspect, via the new gist property hasAspect
  • aspects are individuals instead of categories or subclasses of Magnitude as in gist 12
  • exponents are represented explicitly, enabling calculations

The Data-Centric Revolution: Best Practices and Schools of Ontology Design

This article originally appeared at The Data-Centric Revolution: Best Practices and Schools of Ontology Design – TDAN.com. Subscribe to TDAN directly for this and other great content!

I was recently asked to present “Enterprise Ontology Design and Implementation Best Practices” to a group of motivated ontologists and wanna-be ontologists. I was flattered to be asked, but I really had to pause for a bit. First, I’m kind of jaded by the term “best practices.” Usually, it’s just a summary of what everyone already does. It’s often sort of a “corporate common sense.” Occasionally, there is some real insight in the observations, and even more rarely, there are best practices that are not yet mainstream practices. I wanted to shoot for that latter category.

As I reflected on a handful of best practices to present, it occurred to me that intelligent people may differ. We know this because on many of our projects, there are intelligent people and they often do differ. That got me to thinking: “Why do they differ?” What I came to was that there are really several different “schools of ontology design” within our profession. They are much like “schools of architectural design” or “schools of magic.” Each of those has their own tacit agreement as to what constitutes “best practice.”

Armed with that insight, I set out to identify the major schools of ontological design, and outline some of their main characteristics and consensus around “best practices.” The schools are (these are my made-up names, to the best of my knowledge none of them have planted a flag and named themselves — other than the last one):

  • Philosophy School
  • Vocabulary and Taxonomy School
  • Relational School
  • Object-Oriented School
  • Standards School
  • Linked Data School
  • NLP/LLM School
  • Data-Centric School

There are a few well known ontologies that are a hybrid of more than one of these schools. For instance, most of the OBO Life Sciences ontologies are a hybrid of the Philosophy and Taxonomy Schools. I think this will make more sense after we describe each school individually.

Philosophy School

The philosophy school aims to ensure that all modeled concepts adhere to strict rules of logic and conform to a small number of well vetted primitive concepts.

Exemplars

The Basic Formal Ontology (BFO), DOLCE and Cyc are the best-known exemplars of this school.  Each has a set of philosophical primitives that all derived classes are meant to descend from.

How to Recognize

It’s pretty easy to spot an ontology that was developed by someone from the philosophy school. The top-level classes will be abstract philosophical terms such as “occurrent” and “continuant.”

Best Practices

All new classes should be based on the philosophical primitives. You can pretty much measure the adherence to the school by counting the number of classes that are not direct descendants of the 30-40 base classes.

Vocabulary and Taxonomy School

The vocabulary and taxonomy school tends to start with a glossary of terms from the domain and establish what they mean (vocabulary school) and how these terms are hierarchically related to each other (taxonomy school). The two schools are more alike than different.

The taxonomy school especially tends to be based on standards that were created before the Web Ontology Language (OWL). These taxonomies often model a domain as hierarchical structures without defining what a link in the hierarchy actually means. As a result, they often mix sub-component and sub-class hierarchies.

Exemplars

Many life sciences ontologies, such as SNOMED, are primarily taxonomy ontologies, and only secondarily philosophy school ontologies. Also, the Suggested Upper Merged Ontology is primarily a vocabulary ontology; it was mostly derived from WordNet, and one of its biggest strengths is its cross-reference to 250,000 words and their many word senses.

How to Recognize

Vast numbers of classes. There are often tens of thousands or hundreds of thousands of classes in these ontologies.

Best Practices

For the vocabulary and taxonomy schools, completeness is the holy grail. A good ontology is one that contains as many of the terms from the domain as possible. The Simple Knowledge Organization System (SKOS) was designed for taxonomies. Thus, even though it is implemented in OWL, it is designed to add semantics to taxonomies that are often less rigorous, using generic predicates such as skos:broader and skos:narrower rather than more precise subclass axioms or object properties such as “part of.” SKOS is a good tool for integrating taxonomies with ontologies.
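A small Turtle illustration of the point, with made-up concept IRIs: the SKOS link says only that one concept is broader than another, not whether the relationship is subclass, part-of, or something else.

    @prefix skos: <http://www.w3.org/2004/02/skos/core#> .
    @prefix ex:   <https://example.com/taxo/> .

    ex:Engine    a skos:Concept ;
        skos:broader ex:Automobile .   # broader, yes, but is an engine a kind of automobile or a part of one?

    ex:SportsCar a skos:Concept ;
        skos:broader ex:Automobile .   # here the intended meaning is closer to "subclass of"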

Relational School

Most data modelers grew up with relational design, and when they design ontologies, they rely on ways of thinking that served them well in relational.

Exemplars

These are mostly internally created ontologies.

How to Recognize

Relational ontologists tend to be very rigorous about putting specific domains and ranges on all their properties. Properties are almost never reused. All properties will have inverses. Restrictions will be subclass axioms, and you will often see restrictions with “min 0” cardinality, which doesn’t mean anything to an inference engine, but to a relational ontologist it means “optional cardinality.” You will also see “max 1” and “exactly 1” restrictions which almost never imply what the modeler thought, and as a result, it is rare for relational modelers to run a reasoner (they don’t like the implications).
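A hedged Turtle sketch of the style described above; the class and property names are invented for illustration.

    @prefix owl:  <http://www.w3.org/2002/07/owl#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
    @prefix ex:   <https://example.com/> .

    ex:PurchaseOrder rdfs:subClassOf
        [ a owl:Restriction ;
          owl:onProperty     ex:hasOrderLine ;
          owl:minCardinality "0"^^xsd:nonNegativeInteger   # "optional" to a relational modeler; vacuous to a reasoner
        ] ,
        [ a owl:Restriction ;
          owl:onProperty     ex:hasBuyer ;
          owl:cardinality    "1"^^xsd:nonNegativeInteger   # under the open world this does not block a second buyer;
        ] .                                                # it only licenses sameness or inconsistency inferences later

    ex:hasBuyer owl:inverseOf ex:isBuyerFor .              # an inverse declared for every property, relational-style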

Best Practices

For relational ontologists, best practice is to make ontologies that are as similar to existing relational structures as possible. Often, the model is a direct map from an existing relational system.

Modelers in the relational school (as well as the object-oriented school coming up next) tend to bring the “Closed World Assumption” (CWA) with them from their previous experience. CWA takes a mostly implicit attitude that the information in the system is a complete representation of the world. The “Open World Assumption” (OWA) takes the opposite starting point: that the data in the system is a subset of all knowable information on the subject.

CWA was and is more appropriate in narrow scope, bounded applications. When we query your employee master file looking for “Dave McComb” and don’t get a hit, we reasonably assume that he is not an employee of your enterprise. When TSA queries their system and doesn’t get a hit, they don’t assume that he is not a terrorist. They still use the X-ray and metal detectors. This is because they believe that their information is incomplete. They are open worlders. More and more of our systems combine internal and external data in ways that are more likely to be incomplete.

There are techniques for closing the open world, but the relational school tends not to use them because they assume their world is already closed.

Object-Oriented School

Like the relational school, the object-oriented school comes from designers who grew up with object-oriented modeling.

Exemplars

Again, a lot of object-oriented (OO) ontologies are internal client projects, but a few public ones of note include eCl@ss and Schema.org. eCl@ss is a standard for describing electrical products. It has been converted into an ontology. The ontology version has 60,000 classes, which combine taxonomic and OO style modeling. Schema.org is an ontology for tagging websites that Google promotes to normalize SEO. It started life fairly elegant; it now has 1,300 classes, many of which are taxonomic distinctions rather than real classes.

How to Recognize

One giveaway for the object-oriented school is designing in SHACL. SHACL is a semantic constraint language, which is quite useful as a guard for updates to a triple store. Because SHACL is less concerned with meaning and more concerned with structure, many object-oriented ontologists prefer it to OWL for defining their classes.
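For example, an object-oriented ontologist might capture a “class definition” as a SHACL shape that constrains structure rather than defining meaning (names are illustrative):

    @prefix sh:  <http://www.w3.org/ns/shacl#> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
    @prefix ex:  <https://example.com/> .

    ex:CustomerShape a sh:NodeShape ;
        sh:targetClass ex:Customer ;
        sh:property [
            sh:path     ex:hasEmailAddress ;
            sh:minCount 1 ;                     # enforced as a guard when data is written,
            sh:datatype xsd:string              # not a statement about what a Customer is
        ] .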

Even those who design in OWL have some characteristic tells. OO ontologists tend to use subclassing far more than relational ontologists. They tend to declare which class is a subclass of another, rather than allowing the inference engine to infer subsumption. There is also a tendency to believe that the superclass will constrain subclass membership.

Best Practices

OO ontologies tend to co-exist with GraphQL and focus on JSON output. This is because the consuming applications are object oriented, and this style of ontology and architecture has less impedance mismatch with the consuming applications. The level of detail tends to mirror the kind of detail you find in an application system. Best practices for an OO ontology would never consider the tens of thousands or hundreds of thousands of classes in a taxonomy ontology, nor would they go for the minimalist view of the philosophy or data-centric schools. They tend to make all distinctions at the class level.

Standards School

This is a Janus school, with two faces, one facing up and one facing down. The one facing down is concerned with building ontologies that others can (indeed should) reuse. The one facing up is the enterprise ontologies that import the standard ontologies in order to conform.

Exemplars

Many of the most popular ontology standards are produced and promoted by the W3C. These include DCAT (Data Catalog Vocabulary), the Ontology for Media Resources, PROV-O (an ontology of provenance), the Time Ontology, and Dublin Core (an ontology for metadata, particularly around library science).

How to Recognize

For the down facing standards ontology, it’s pretty easy. They are endorsed by some standards body. Most common are W3C, OMG and OASIS. ISO has been a bit late to this party, but we expect to see some soon. (Everyone uses the ISO country and currency codes, and yet there is no ISO ontology of countries or currencies.) There are also many domain-specific standard ontologies that are remakes of their previous message model standards, such as FHIR from HL7 in healthcare and ACORD in insurance.

The upward facing standards ontologies can be spotted by their importing a number of standard ontologies each meant to address an aspect of the problem at hand.

Best Practices

Best practice for downward facing standards ontologies is to be modular, fairly small, complete and standalone. Unfortunately, this best practice tends to result in modular ontologies that redefine (often inconsistently) shared concepts.

Best practice for upward facing standards ontologies is to rely as much as possible on ontologies defined elsewhere. This usually starts off by importing many ontologies and ends up with a number of bridges to the standards when it’s discovered that they are incompatible.

Linked Open Data School

The linked open data school promotes the idea of sharing identifiers across enterprises. Linked data is very focused on instance (individual or ABox) data, and only secondarily on classes.

Exemplars

The poster child for LOD is DBpedia, the LOD knowledge graph derived from the Wikipedia information boxes. The school also includes Wikidata and the rest of the Linked Open Data Cloud.

I would put the Global Legal Entity Identifier Foundation (GLEIF) in this school as well, since its primary focus is sharing between enterprises and it is more focused on the ABox (the instances).

How to Recognize

Linked open data ontologies are recognizable by their instances: often millions, and in many cases billions, of them. The ontology (TBox) is often very naïve, as it is typically derived directly from informal classifications made by the people editing Wikipedia and its kin.

You will see many ad hoc classes raised to the status of a formal class in LOD ontologies. I just noticed the classes dbo:YearInSpaceFlight and yago:PsychologicalFeature100231001.

Best Practices

The first best practice (recognized more in the breach) is to rely on other organizations’ IRIs. This is often clumsy because, historically, each organization invented identifiers for things in the world (their employees and vendors, for instance), and they tend to build their IRIs around these well-known (at least locally) identifiers.

A second best practice is entity resolution and “owl:sameAs.” Entity resolution can determine whether two IRIs represent the same real-world object. Once that is recognized, one of the organizations can choose to adopt the other’s IRI (the previous best practice) or continue to use its own but record the identity through owl:sameAs (which is mostly motivated by the following best practice).
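A one-line sketch of that second practice, with both IRIs invented for illustration:

    @prefix owl: <http://www.w3.org/2002/07/owl#> .

    <https://data.example.com/id/vendor/00417>
        owl:sameAs <https://other-enterprise.example.org/party/ACME-Ltd> .   # entity resolution concluded these denote the same company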

LOD creates the opportunity for IRI resolution at the instance level. Put the DBpedia IRI for a famous person in your browser address bar and you will be redirected to the DBpedia resolution page for that individual, showing everything DBpedia knows about them. For security reasons, most enterprises don’t yet do this. Because of this, another best practice is to only create triples with subjects whose domain name you control. Anything you state about an IRI in someone else’s namespace will not be available for resolution by the organization that minted the subject IRI.

NLP/LLM School

There is a school of ontology design that says turn ontology design over to the machines. It’s too hard anyway.

Exemplars

Most of these are also internal projects. About every two to three years, we see another startup with the premise that ontologies can be built by machines. For most of history, these were cleverly tailored NLP systems. The original works in this area took large teams of computational linguists to master.

This year (2023), they are all LLMs. You can ask ChatGPT to build an ontology for [fill in the blank] industry, and it will come up with something surprisingly credible looking.

How to Recognize

For LLMs, the first giveaway is hallucinations. These are hard to spot and require deep domain and ontology experience to pick out. The second clue is humans with six fingers (just kidding). There aren’t many publicly available LLM-generated ontologies (or if there are, they are so good we haven’t detected that they were machine generated).

Best Practices

Get a controlled set of documents that represent the domain you wish to model. This is better than relying on what ChatGPT learned by reading the internet.

And have a human in the loop. This is an approach that shows significant promise, and several researchers have already created prototypes that utilize it. Treat the NLP/LLM-created artifacts primarily as speed-reading aids or intelligent assistants for the ontologist.

In the broader adoption of LLMs, there is a lot of energy going into ways to use knowledge graphs as “guard rails” against some of LLMs’ excesses, and into the value of keeping a human in the loop. Our immediate concern is that there are advocates of letting generative AI design ontologies outright, and as such it becomes a school of its own.

Data-Centric School

The data-centric school of ontology design, as promoted by Semantic Arts, focuses on ontologies that can be populated and implemented. In building architecture, they often say “It’s not architecture until it’s built.” The data-centric school says, “It’s not an ontology until it has been populated (with instance level, real world data, not just taxonomic tags).” The feedback loop of loading and querying the data is what validates the model.

Exemplars

Gist, an open-source OWL ontology, is the exemplar data-centric ontology. SchemaApp, Morgan Stanley’s compliance graph, Broadridge’s Data Fabric, Procter & Gamble’s Material Safety graph, Schneider-Electric’s product catalog graph, Standard & Poor’s commodity graph, Sallie Mae’s Service Oriented Architecture and dozens of small firms’ enterprise ontologies are based on gist.

How to Recognize

Importing gist is a dead giveaway. Other telltale signs include a modest number of classes (fewer than 500 for almost all enterprises) and eschewing inverse and transitive properties (the overhead of these features in a large knowledge graph far outweighs their expressive power). Another giveaway is delegating taxonomic distinctions to instances of subclasses of gist:Category rather than making them classes in their own right.

Best Practices

One best practice is to give non-primitive classes “equivalent class” restrictions that define class membership and are used to infer the class hierarchy. Another best practice is to place domains and ranges at very high levels of abstraction (or to omit them completely) in order to promote property reuse and reduce future refactoring.
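A hedged Turtle sketch of those two practices: gist:Person is a real gist class; everything else here is an illustrative assumption.

    @prefix owl:  <http://www.w3.org/2002/07/owl#> .
    @prefix gist: <https://w3id.org/semanticarts/ns/ontology/gist/> .   # verify against your gist release
    @prefix ex:   <https://example.com/> .

    # A non-primitive class defined by an equivalent-class restriction; the reasoner
    # infers both membership and the subclass hierarchy from the definition.
    ex:Employee owl:equivalentClass [
        a owl:Class ;
        owl:intersectionOf (
            gist:Person
            [ a owl:Restriction ;
              owl:onProperty      ex:isPartyTo ;
              owl:someValuesFrom  ex:EmploymentAgreement ]
        )
    ] .

    # A property kept deliberately broad (no narrow domain or range) so it can be
    # reused for any kind of agreement or event without refactoring.
    ex:isPartyTo a owl:ObjectProperty .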

Another best practice is to load a knowledge graph with data from the domain of discourse to prove that the model is appropriate and at the requisite level of detail.

Summary

One of the difficulties in getting wider spread adoption of ontologies and knowledge graphs is that if you recruit and/or assemble a group of ontologists, there is a very good chance you will have members from multiple of the above-described schools. There is a good chance they will have conflicting goals, and even a different definition of what “good” is. Often, they will not even realize that their difference of opinion is due to their being members of a different tribe.

There isn’t one of these schools that is better than any of the others for all purposes. They each grew up solving different problems and emphasizing different aspects of the problem.

When you look at existing ontologies, especially those that were created by communities, you’ll often find that many are an accidental hybrid of the above schools. This is caused by different members coming to the project from different schools and applying their own best practices to the design project.

Rather than try to pick which school is “best,” you should consider what the objectives of your ontology project are and use that to determine which school is better matched. Select ontologists and other team members who are willing to work to the style of that school. Only then is it appropriate to consider “best practices.”

Acknowledgement

I want to acknowledge Michael Debellis for several pages of input on an early draft of this paper. The bits that didn’t make it into this paper may surface in a subsequent paper.

DCA Forum Recap: Forrest Hare, Summit Knowledge Solutions

A knowledge model for explainable military AI

Forrest Hare, Founder of Summit Knowledge Solutions, is a retired US Air Force targeting and information operations officer who now works with the Defense Intelligence Agency (DIA). His experience includes integrating intelligence from different types of communications, signals, imagery, open source, telemetry, and other sources into a cohesive and actionable whole.

Hare became aware of semantic technology while at SAIC and is currently focused on building a space + time ontology called the DIA Knowledge Model so that Defense Department intelligence could use it to contextualize these multi-source inputs.

The question becomes, how do you bring objects that don’t move and objects that do move into the same information frame with a unified context? The information is currently organized by collectors and producers.

The object-based intelligence that does exist involves things that don’t move at all.  Facilities, for example, or humans using phones that are present on a communications network are more or less static. But what about the things in between such as trucks that are only intermittently present?

Only sparse information is available about these. How do you know the truck that was there yesterday in an image is the same truck that is there today? Not to mention the potential hostile forces who own the truck that have a strong incentive to hide it.

Objects in object-based intelligence not only include these kinds of assets, but also events and locations that you want to collect information about. In an entity-relationship sense, objects are entities.

Hare’s DIA Knowledge Model uses the ISO standard Basic Formal Ontology (BFO) to unify domains so that the information from different sources is logically connected and therefore makes sense as part of a larger whole. BFO’s maintainers (Director Barry Smith and his team at the National Center for Ontological Research (NCOR) at the University at Buffalo) keep the ontology strictly limited to 30 or so classes.

The spatial-temporal regions of the Knowledge Model are what’s essential to do the kinds of dynamic, unfolding object tracking that’s been missing from object-based intelligence. Hare gave the example of a “site” (an immaterial entity) from a BFO perspective. A strict geolocational definition of “site” makes it possible for both humans and machines to make sense of the data about sites. Otherwise, Hare says, “The computer has no idea how to understand what’s in our databases, and that’s why it’s a dumpster fire.”

This kind of mutual human and machine understanding is a major rationale behind explainable AI. A commander briefed by an intelligence team must know why the team came to the conclusions it did. The stakes are obviously high. “From a national security perspective, it’s extremely important for AI to be explainable,” Hare reminded the audience. Black boxes such as ChatGPT as currently designed can’t effectively answer the commander’s question on how the intel team arrived at the conclusions it did.

Finally, the explainability of knowledge models like the DIA’s becomes even more critical as information flows into the Joint Intelligence Operations Center (JIOC). Furthermore, the various branches of the US Armed Forces must supply and continually update a Common Intelligence Picture that’s actionable by the US President, who’s the Commander in Chief for the military as a whole.

Without this conceptual and spatial-temporal alignment across all service branches, joint operations can’t proceed as efficiently and effectively as they should.  Certainly, the risk of failure looms much larger as a result.

Contributed by Alan Morrison

Financial Data Transparency Act “PitchFest”

The Data Foundation hosted a PitchFest (Data Foundation PitchFest) on “Unlocking the vision of the Financial Data Transparency Act” a few days ago. Selected speakers were given 10 minutes to bring their best ideas on how to use the improved financial regulatory information and data.

The Financial Data Transparency Act is a new piece of legislation directly affecting the financial services industry. In short, it directs financial regulators to harmonize data collections and move to machine (and people) readable forms. The goal is to reduce the burdens of compliance on regulated industries, increase the ability to analyze data, and to enhance overall transparency.

Two members of our team, Michael Atkin and Dalia Dahleh, were given the opportunity to present. Below is the text from Michael Atkin’s pitch:

  1. Background – Just to set the stage. I’ve been fortunate to have been in the position as scribe, analyst, advocate and organizer for data management since 1985.  I’ve always been a neutral facilitator – allowing me to sit on all sides of the data management issue all over the world – from data provider to data consumer to market authority to regulator.  I’ve helped create maturity models outlining best practice – performed benchmarking to measure progress – documented the business case – and created and taught the Principles of Data Management at Columbia University.  I’ve also served on the SEC’s Market Data Advisory Committee, the CFTC’s Technical Advisory Committee and as the Chair of the Data Subcommittee of the OFR’s Financial Research Advisory activity during the financial crisis of 2008.  So, I have some perspective on the challenges the regulators face and the value of the FDTA.
  2. Conclusion (slide 2) – My conclusions after all that exposure are simple. There is a real data dilemma for many entities. The dilemma is caused by fragmentation of technology. It’s nobody’s fault. We have business and operational silos. They are created using proprietary software. The same things are modeled differently based on the whim of the architects, the focus of the applications and the nuances of the technical solution. This fragmentation creates “data incongruence” – where the meaning of data from one repository doesn’t match other repositories. We have the same words, with different meanings. We have the same meaning using different words. And we have nuances that get lost in translation. As a result, we spend countless effort and money moving data around, reconciling meaning and doing mapping. As one of my banking clients said … “My projects end up as expensive death marches of data cleansing and manipulation just to make the software work.” And we do this over and over ad infinitum. Not only do we suffer from data incongruence – we suffer from the limitations of relational technology that still dominates our way of processing data. For the record, relational technology is over 50 years old. It was (and is) great for computation and structured data. It’s not good for ad hoc inquiry and scenario-based analysis. The truth is that data has become isolated and mismatched across repositories due to technology fragmentation and the rigidity of the relational paradigm. Enterprises (including government enterprises) often have thousands of business and data silos – each based on proprietary data models that are hard to identify and even harder to change. I refer to this as the bad data tax. It costs most organizations somewhere around 40-60% of their IT budget to address. So, let’s recognize that this is a real liability. One that diverts resources from business goals, extends time-to-value for analysts, and leads to knowledge worker frustration. The new task before FSOC leadership and the FDTA is now about fixing the data itself.
  3. Solution (slide 3) – The good news is that the solution to this data dilemma is actually quite simple and twofold in nature. First – adopt the principles of good data hygiene. And on that front, there appears to be good progress thanks to efforts around the Federal Data Strategy and things related to BCBS 239 and the Open Government Data Act. But governance alone will not solve the data dilemma. The second thing that is required is to adopt data standards that were specifically designed to address the problems of technology fragmentation. And these open data web-based standards are quite mature. They include the Internationalized Resource Identifier (or IRI) for identity resolution. The use of ontologies – that enable us to model simple facts and relationship facts. And the expression of these things in standards like RDF for ontologies, OWL for inferencing and SHACL for business rules. From these standards you get a bunch of capabilities. You get quality by math (because the ontology ensures precision of meaning). You get reusability (which eliminates the problem of hard coded assumptions and the problem of doing the same thing in slightly different ways). You get access control (because the rules are embedded into the data and not constrained by systems or administrative complexity). You get lineage traceability (because everything is linked to a single identifier so that data can be traced as it flows across systems). And you get good governance (since these standards use resolvable identity, precise meaning and lineage traceability to shift governance from people-intensive data reconciliation to more automated data applications).
  4. FDTA (slide 4) – Another important component is that this is happening at the right time. I see the FDTA as the next step in a line of initiatives seeking to modernize regulatory reporting and reduce risk. I’ve witnessed the efforts to move to T+1 (to address the clearing and settlement challenge). I’ve seen the recognition of global interdependencies (with the fallout from Long Term Capital, Enron and the problems of derivatives in Orange County). We’ve seen the problems of identity resolution that led to KYC and AML requirements. And I was actively involved in understanding the data challenges of systemic risk with the credit crisis of 2008. The problem with all these regulatory activities is that most of them are not about fixing the data. Yes, we did get LEI and data governance. Those are great things, but far from what is required to address the data dilemma. I also applaud the adoption of XBRL (and the concept of data tagging). I like the XBRL taxonomies (as well as the Eurofiling regulatory taxonomies) – but they are designed vertically report-by-report with a limited capability for linking things together. Not only that, most entities are just extracting XBRL into their relational environments, which does little to address the problem of structural rigidity. The good news is that all the work that has gone into the adoption of XBRL is able to be leveraged. XML is good for data transfer. Taxonomies are good for unraveling concepts and tagging. And the shift from XML to RDF is straightforward and would not affect those who are currently reporting using XBRL. One final note before I make our pitch. Let’s recognize that XBRL is not the way the banks are managing their internal data infrastructures. They suffer from the same dilemmas as the regulators, and almost every G-SIB and D-SIB I know is moving toward semantic standards. Because even though FDTA is about the FSOC agencies – it will ultimately affect the financial institutions. I see this as an opportunity for collaboration between regulators and the regulated, in building the infrastructure for the digital world.
  5. Proposal (slide 5) – Semantic Arts is proposing a pilot project to implement the foundational infrastructure of precise data about financial instruments (including identification, classification, descriptive elements and corporate actions), legal entities (including entity types as well as information about ownership and control), obligations (associated with issuance, trading, clearing and settlement), and holdings about the portfolios of the regulated entities. These are the building blocks of linked risk analysis. To implement this initiative, we are proposing you start with a single simple model of the information from one of the covered agencies. The initial project would focus on defining the enterprise model and conforming two to three key data sets to the model. The resulting model would be hosted on a graph database. Subsequent projects would involve expanding the footprint of data domains to be added to the graph, and gradually building functionality to begin to reverse the legacy creation process. We would initiate things by leveraging the open standard upper ontology (GIST) from Semantic Arts as well as the work of the Financial Industry Business Ontology (from the EDM Council) and any other vetted ontology like the one OFR is building for CFI. Semantic Arts has a philosophy of “think big” (like cross-agency interoperability) but “start small” (like a business domain of one of the agencies). The value of adopting semantic standards is threefold – and can be measured using the “three C’s” of metrics. The first C is cost containment, starting with data integration and including areas focused on business process automation and consolidation of redundant systems (best known as technical agility). The second C is capability enhancement for analysis of the degrees of interconnectedness, the nature of transitive relationships, state contingent cash flow, collateral flow, guarantee and transmission of risk. The final C is implementation of the control environment focused on tracking data flow, protecting sensitive information, preventing unwanted outcomes, managing access and ensuring privacy.
  6. Final Word (contact) – Just a final word to leave you with. Adopting these semantic standards can be accomplished at a fraction of the cost of what you spend each year supporting the vast cottage industry of data integration workarounds.  The pathway forward doesn’t require ripping everything out but instead building a semantic “graph” layer across data to connect the dots and restore context.  This is what we do.  Thank you.

Link to Slide Deck

DCA Forum Recap: Jans Aasman, Franz

How a “user” knowledge graph can help change data culture

Identity and Access Management (IAM) has had the same problem since Fernando Corbató of MIT first dreamed up the idea of digital passwords in 1960: opacity. Identity in the physical world is rich and well-articulated, with a wealth of different ways to verify information on individual humans and devices. By contrast, the digital realm has been identity-data impoverished, cryptic, and inflexible for over 60 years now.

Jans Aasman, CEO of Franz, provider of the entity-event knowledge graph solution AllegroGraph, envisions a “user” knowledge graph as a flexible and more manageable data-centric solution to the IAM challenge. He presented on the topic at this past summer’s Data-Centric Architecture Forum, which Semantic Arts hosted near its headquarters in Fort Collins, Colorado.

Consider the specificity of a semantic graph and how it could facilitate secure access control. Knowledge graphs constructed of subject-predicate-object triples make it possible to set rules and filters in an articulated and yet straightforward manner.  Information about individuals that’s been collected for other HR purposes could enable this more precise filtering.

For example, Jans could disallow others’ access to a triple that connects “Jans” and “salary”. Or he could disallow access to certain predicates.

Identity and access management vendors call this method Attribute-Based Access Control (ABAC). Attributes include many different characteristics of users and what they interact with, which is inherently more flexible than role-based access control (RBAC).
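As a sketch of how such a rule might be expressed if the policy itself were data in the graph; every name below is an illustrative assumption, not AllegroGraph's actual access-control syntax.

    @prefix ex: <https://example.com/iam/> .

    ex:_salaryRule a ex:AccessRule ;
        ex:restrictsPredicate  ex:hasSalary ;     # hide any triple whose predicate is ex:hasSalary ...
        ex:permitsRole         ex:HRPartner ;     # ... except from users holding this attribute
        ex:permitsSelfAccess   true .             # ... and except when the triple is about the requesting user

Because users, their attributes, and the policy live in the same graph as the data they protect, the HR information mentioned above can drive the filtering directly.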

Cell-level control is also possible, but as Forrest Hare of Summit Knowledge Solutions points out, such security doesn’t make a lot of sense, given how much meaning is absent in cells controlled in isolation. “What’s the classification of the number 7?” he asked. Without more context, it seems silly to control cells that are just storing numbers or individual letters, for example.

Simplifying identity management with a knowledge graph approach

Graph databases can simplify various aspects of the process of identity management. Let’s take Lightweight Directory Access Protocol, or LDAP, for example.

This vendor-agnostic protocol has been around for 30 years, but it’s still popular with enterprises. It’s a pre-web, post-internet hierarchical directory service and authentication protocol.

“Think of LDAP as a gigantic, virtual telephone book,” suggests access control management vendor Foxpass. Foxpass offers a dashboard-based LDAP management product which it claims is much easier to manage than OpenLDAP.

If companies don’t use LDAP, they often use Microsoft’s Active Directory instead, which is a broader, database-oriented identity and access management product that covers more of the same bases. Microsoft bundles AD with its Server and Exchange products, a means of lock-in that has been quite effective. Lock-in, obviously, inhibits innovation in general.

Consider the whole of identity management as it exists today and how limiting it has been. How could enterprises embark on the journey of using a graph database-oriented approach as an alternative to application-centric IAM software? The first step involves the creation of a “user” knowledge graph.

Access control data duplication and fragmentation

Semantic Arts CEO Dave McComb in his book Software Wasteland estimated that 90 percent of data is duplicated. Application-centric architectures in use since the days of mainframes have led to user data sprawl. Part of the reason there is such a duplication of user data is that authentication, authorization, and access control (AAA) methods require more bits of personally identifiable information (PII) be shared with central repositories for AAA purposes.

B2C companies have lately been particularly prone to hoovering up these additional bits of PII and storing that sensitive information in centralized repositories, which become one-stop shops for identity thieves. Customers who want to pay online have to enter bank routing numbers and personal account numbers, so there’s even more duplicate PII sprawl.

One of the reasons a “user” knowledge graph (and a knowledge graph enterprise foundation) could be innovative is that enterprises that adopt such an approach can move closer to zero-copy integration architectures. Model-driven development of the type that knowledge graphs enable assumes and encourages shared data and logic.

A “user” graph coupled with project management data could reuse the same enabling entities and relationships repeatedly for different purposes. The model-driven development approach thus incentivizes organic data management.

The challenge of harnessing relationship-rich data

Jans points out that enterprises, for example, run massive email systems that could be tapped to analyze project data for optimization purposes. And disambiguation by unique email address across the enterprise can be a starting point for all sorts of useful applications.

Most enterprises don’t apply unique email address disambiguation, but Franz has a pharma company client that does, an exception that proves the rule. Email remains an untapped resource in many organizations, even though it’s a treasure trove of relationship data.

Problematic data farming realities: A social media example

Relationship data involving humans is sensitive by definition, but the reuse potential of sensitive data is too important to ignore. Organizations do need to interact with individuals online, and vice versa.

Former US Federal Bureau of Investigation (FBI) counterintelligence agent Peter Strzok said this on Deadline: White House, an MSNBC program in the US, on August 16:

“I’ve served I don’t know how many search warrants on Twitter (now known as X) over the years in investigations. We need to put our investigator’s hat on and talk about tradecraft a little bit. Twitter gathers a lot of information. They don’t just have your tweets. They have your draft tweets. In some cases, they have deleted tweets. They have DMs that people have sent you, which are not encrypted. They have your draft DMs, the IP address from which you logged on to the account at the time, sometimes the location at which you accessed the account and other applications that are associated with your Twitter account, amongst other data.”

X and most other social media platforms, not to mention law enforcement agencies such as the FBI, obviously care a whole lot about data. Collecting, saving, and allowing access to data from hundreds of millions of users in such a broad, comprehensive fashion is essential for X. At least from a data utilization perspective, what they’ve done makes sense.

Contrast these social media platforms with the way enterprises collect and handle their own data. That collection and management effort is function- rather than human-centric. With social media, the human is the product.

So why is a social media platform’s culture different? Because with public social media, broad, relationship-rich data sharing had to come first. Users learned first-hand what the privacy tradeoffs were, and that kind of sharing capability was designed into the architecture. The ability to share and reuse social media data for many purposes implies the need to manage the data and its accessibility in an elaborate way. Email, by contrast, is a much older technology that was not originally intended for multi-purpose reuse.

Why can organizations like the FBI successfully serve search warrants on data from data farming companies? Because social media started with a broad data sharing assumption and forced a change in the data sharing culture. Then came adoption. Then law enforcement stepped in and argued effectively for its own access.

Broadly reused and shared, web data about users is clearly more useful than siloed data. Shared data is why X can have the advertising-driven business model it does. One-way social media contracts with users require agreement with provider terms. The users have one choice: Use the platform, or don’t.

The key enterprise opportunity: A zero-copy user PII graph that respects users

It’s clear that enterprises should do more to tap the value of the kinds of user data that email, for example, generates. One way to sidestep the sensitivity issues associated with reusing that sort of data would be to treat the most sensitive user data separately.

Self-sovereign identity (SSI) advocate Phil Windley has pointed out that agent-managed, hashed messaging and decentralized identifiers could make it unnecessary to duplicate identifiers that correlate. If a bartender just needs to confirm that a patron at the bar is old enough to drink, the bartender could just ping the DMV to confirm the fact. The DMV could then ping the user’s phone to verify the patron’s claimed adult status.

Given such a scheme, each user could manage and control access to their own most sensitive PII. In this scenario, the PII could stay in place, stored and encrypted on the user’s phone.
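Here is a deliberately simplified sketch of the minimal-disclosure idea. Real self-sovereign identity stacks rely on decentralized identifiers and verifiable credentials with public-key signatures; the shared-secret signature below merely stands in for that machinery, and every name in it is made up.

    # A simplified sketch of a minimal-disclosure age check: the verifier sees a
    # signed "over 21" claim, never the birthdate. The shared-secret HMAC stands
    # in for a real issuer signature in a verifiable-credential system.
    import hmac, hashlib, json

    DMV_SIGNING_KEY = b"issuer-secret"   # placeholder for the DMV's signing key

    def issue_claim(subject_id: str) -> dict:
        """The DMV issues a claim asserting only 'over 21'; no birthdate leaves the issuer."""
        payload = json.dumps({"subject": subject_id, "claim": "over_21", "value": True})
        signature = hmac.new(DMV_SIGNING_KEY, payload.encode(), hashlib.sha256).hexdigest()
        return {"payload": payload, "signature": signature}

    def verify_claim(credential: dict) -> bool:
        """The bartender checks the issuer's signature; no PII is copied or stored."""
        expected = hmac.new(DMV_SIGNING_KEY, credential["payload"].encode(),
                            hashlib.sha256).hexdigest()
        return hmac.compare_digest(expected, credential["signature"])

    credential = issue_claim("did:example:patron-123")   # lives on the patron's phone
    print(verify_claim(credential))                      # True, and that is all the bar learns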

Knowledge graphs lend themselves to this less centralized, yet more fine-grained and transparent, approach to data management. By supporting self-sovereign identity and a data-centric architecture, a Chief Data Officer could help the Chief Risk Officer mitigate the enterprise risk associated with duplicated personally identifiable information: a true win-win.

 

Contributed by Alan Morrison

How to Take Back 40-60% of Your IT Spend by Fixing Your Data

Creating a semantic graph foundation helps your organization become data-driven while significantly reducing IT spend

Organizations that quickly adapt to changing market conditions have a competitive advantage over their peers. Achieving this advantage is dependent on their ability to capture, connect, integrate, and convert data into insight for business decisions and processes. This is the goal of a “data-driven” organization. However, in the race to become data-driven, most efforts have resulted in a tangled web of data integrations and reconciliations across a sea of data silos that add up to between 40% and 60% of an enterprise’s annual technology spend. We call this the “Bad Data Tax”. Not only is this expensive, but the results often don’t translate into the key insights needed to deliver better business decisions or more efficient processes.

This is partly because integrating and moving data is not the only problem. The data itself is stored in a way that is not optimal for extracting insight. Unlocking additional value from data requires context, relationships, and structure, none of which are present in the way most organizations store their data today.

Solution to the Data Dilemma

The good news is that the solution to this data dilemma is actually quite simple, and it can be accomplished at a fraction of what organizations spend each year supporting the vast industry of data integration workarounds. The path forward doesn’t require ripping everything out; it requires building a semantic “graph” layer across data to connect the dots and restore context. It will, however, take effort to formalize a shared semantic model that can be mapped to data assets and to turn unstructured data into a format that can be mined for insight. This is the future of modern data and analytics and a critical enabler of getting more value and insight out of your data.

This shift from a relational to a graph approach has been well documented by Gartner, which advises that “using graph techniques at scale will form the foundation of modern data and analytics” and that “graph technologies will be used in 80% of data and analytics innovations by 2025.” Most of the leading market research firms consider graph technologies to be a “critical enabler.” And while there is a great deal of experimentation underway, most organizations have only scratched the surface, proceeding use case by use case. While this may yield great benefits for the specific use case, it doesn’t fix the causes behind the “Bad Data Tax” that organizations are facing. Until executives take a more strategic approach to graph technologies, they will continue to struggle to deliver the insights that would give them a competitive edge.

Modernizing Your Data Environment

Most organizations have come of age in a world dominated by technology. There have been multiple technology revolutions that have necessitated the creation of big organizational departments to make it all work. In spite of all the activity, the data paradigm hasn’t evolved much. Organizations are still managing data using relational technology invented in the 1970s. While relational databases are a good fit for managing structured data workloads, they are not good for ad hoc inquiry and scenario-based analysis.

Data has become isolated and mismatched across repositories and silos due to technology fragmentation and the rigidity of the relational paradigm. Enterprises often have thousands of business and data silos, each based on proprietary data models that are hard to identify and even harder to change. This has become a liability that diverts resources from business goals, extends time-to-value for analysts, and leads to business frustration. The new task before leadership is now about fixing the data itself.

Fixing the data is possible with graph technologies and web standards that share data across federated environments and between interdependent systems. This approach has evolved to ensure data precision, flexibility, and quality. Because these open standards are based on granular concepts, they become reusable building blocks for a solid data foundation. Adopting them removes ambiguity, facilitates automation, and reduces the need for data reconciliation.

Data Bill of Rights

Organizations need to remind themselves that data is simply a representation of real things (customers, products, people, and processes) where precision, context, semantics, and nuance matter as much as the data itself. For those who are tasked with extracting insight from data, there are several expectations that should be honored: the data should be available and accessible when needed, stored in a format that is flexible and accurate, faithful to the context and intent of the original data, and traceable as it flows through the organization.

This is what we call the “Data Bill of Rights”. Providing this Data Bill of Rights is achievable right now without a huge investment in technology or massive disruption to the way the organization operates.

Strategic Graph Deployment

Many organizations are already leveraging graph technologies and semantic standards for their ability to traverse relationships and connect the dots across data silos. These organizations are often doing so on a case-by-case basis covering one business area and focusing on an isolated application, such as fraud detection or supply chain analytics. While this can result in faster time-to-value for a singular use case, without addressing the foundational data layers, it results in another silo without gaining the key benefit of reusability.

The key to adopting a more strategic approach to semantic standards and knowledge graphs starts at the top with buy-in across the C-suite. Without this senior sponsorship, the program will face an uphill battle against organizational inertia, with little chance of broad success. With this level of support, however, the likelihood of getting sufficient buy-in across all the stakeholders involved in managing an organization’s data infrastructure increases dramatically.

While starting as an innovation project can be useful, forming a Graph Center of Excellence will have an even greater impact. It gives the organization a dedicated team to evangelize and execute the strategy, score incremental wins that demonstrate value, and leverage best practices and economies of scale along the way. The team would be tasked with both building the foundation and prioritizing graph use cases against organizational priorities.

One key benefit from this approach is the ability to start small, deliver quick wins, and expand as value is demonstrated. There is no getting around the mandate to initially deliver something practical and useful. A framework for building a Graph Center of Excellence will be published in the coming weeks.

Scope of Investment Required

Knowledge graph advocates admit that a long tail of investment is necessary to realize its full potential. Enterprises need basic operational information including an inventory of the technology landscape and the roadmap of data and systems to be merged, consolidated, eliminated, or migrated. They need to have a clear vision of the systems of record, data flows, transformations, and provisioning points. They need to be aware of the costs associated with the acquisition of platforms, triplestore databases, pipeline tools, and other components needed to build the foundational layer of the knowledge graph.

In addition to the plumbing, organizations also need to understand the underlying content that supports business functionality. This includes reference data about business entities, agents, and people, as well as the taxonomies and data models covering contract terms and parties, the meaning of ownership and control, notions of parties and roles, and so on. These concepts are the foundation of the semantic approach. They might not be exciting, but they are critical: they are the scaffolding for everything else.

Initial Approach

When thinking about the scope of investment, expect the first graph-enabled application to take anywhere from 6 to 12 months from conception to production. Much of the time needs to be invested in getting data teams aligned and mobilized, which underscores the essential nature of leadership and the importance of starting with the right set of use cases. The initial use case needs to be operationally viable, solve a real business problem, and matter to the business.

With the right strategic approach in place, the first delivery is infrastructure plus pipeline management and people. This gets the organization to an MVP, including an incremental project plan and rollout. The second delivery should consist of the foundational building blocks for workflow and reusability, proving the viability of the approach.

Building Use Cases Incrementally

The next series of use cases should be based on matching functionality to capitalize on concept reusability. This will enable teams to shift their effort from building the technical components to adding incremental functionality. This translates to 30% of the original cost and a rollout that could be three times faster. These costs will continue to decrease as the enterprise expands reusable components – achieving full value around the third year.

The strategic play is not the $3-$5 million for the first few domains, but the core infrastructure required to run the organization moving forward. It is absolutely possible to continue to add use cases on an incremental level, but not necessarily the best way to capitalize on the digital future. The long-term cost efficiency of a foundational enterprise knowledge graph (EKG) should be compared to the costs of managing thousands of silos. For a big enterprise, this can be measured in hundreds of millions of dollars – before factoring in the value proposition of enhanced capabilities for data science and complying with regulatory obligations to manage risks.

Business Case Summary

Organizations are paying a “Bad Data Tax” of 40% to 60% of their annual IT spend on the tangled web of integrations across their data silos. To make matters worse, following this course does not help an organization achieve its goal of being data-driven. The data itself has a problem: it is traditionally stored in rows, columns, and tables that lack the context, relationships, and structure needed to extract insight.

Adding a semantic graph layer is a simple, non-intrusive solution to connect the dots, restore context, and provide what is needed for data teams to succeed. While the Bad Data Tax alone quantifiably justifies the cost of solving the problem, it scarcely scratches the surface of the full value delivered. The opportunity cost side, though more difficult to quantify, is no less significant with the graph enabling a host of new data and insight capabilities (better AI and data science outcomes, increased personalization and recommendations for driving increased revenue, more holistic views through data fabrics, high fidelity digital twins of assets, processes, and systems for what-if analysis, and more).

While most organizations have begun deploying graph technologies in isolated use cases, they have not yet applied them foundationally to solving the Bad Data Tax and fixing their underlying data problem. Success will require buy-in and sponsorship across the C-suite to overcome organizational inertia. For best outcomes, create a Graph Center of Excellence focused on strategically deploying both a semantic graph foundation and high-priority use cases. The key will be in starting small, delivering quick wins with incremental value and effectively communicating this across all stakeholders.

While initial investments can start small, expect initial projects to take from 6 to 12 months. To cover the first couple of projects, a budget of $1.5 to $3 million should be sufficient. The outcomes will justify further investment in graph-based projects throughout the organization, each deploying around 30% faster and cheaper than the early projects by leveraging best practices and economies of scale.

Conclusion

The business case is compelling – the cost to develop a foundational graph capability is a fraction of the amount wasted each year on the Bad Data Tax alone. Addressing this problem is both easier and more urgent than ever. Failing to develop the data capabilities that graph technologies offer can put organizations at a significant disadvantage, especially in a world where AI capabilities are accelerating and critical insight is being delivered in near real time. The opportunity cost is significant. The solution is simple. Now is the time to act.

 

This article originally appeared at How to Take Back 40-60% of Your IT Spend by Fixing Your Data – Ontotext, and was reposted here.

 

DCA Forum Recap: US Homeland Security

How US Homeland Security plans to use knowledge graphs in its border patrol efforts

During this summer’s Data Centric Architecture Forum, Ryan Riccucci, Division Chief for U.S. Border Patrol – Tucson (AZ) Sector, and his colleague Eugene Yockey gave a glimpse of what the data environment is like within the US Department of Homeland Security (DHS), as well as how transforming that data environment has been evolving.

The DHS recently celebrated its 20-year anniversary. The federal department’s data challenges are substantial, considering the need to collect, store, retrieve, and manage information associated with 500,000 border crossings, 160,000 vehicles, and $8 billion in imported goods processed daily by 65,000 personnel.

Riccucci is leading an ontology development effort within the Customs and Border Patrol (CBP) agency and the Department of Homeland Security more generally to support scalable, enterprise-wide data integration and knowledge sharing. It’s significant to note that a Division Chief has tackled the organization’s data integration challenge. Riccucci doesn’t let leading-edge, transformational technology and fundamental data architecture change intimidate him.

Riccucci described a typical use case for the transformed, integrated data sharing environment that DHS and its predecessor organizations have envisioned for decades.

The CBP has various sensor nets that monitor air traffic close to or crossing the borders between Mexico and the US, and Canada and the US. One such challenge on the Mexican border is Fentanyl smuggling into the US via drones. Fentanyl can be 50 times as powerful as morphine. Fentanyl overdoses caused 110,000 deaths in the US in 2022.

On the border with Canada, a major concern is gun smuggling via drone from the US to Canada. Though legal in the US, Glock pistols, for instance, are illegal and in high demand in Canada.

The challenge in either case is to intercept the smugglers retrieving the drug or weapon drops while they are in the act. Drones may only be active for seven to fifteen minutes at a time, so the window of opportunity to detect and respond effectively is a narrow one.

Field agents ideally need to see enough real-time, mapped airspace information from the activated sensor to move quickly and directly to the location. Specifics are important; verbally relayed information, by contrast, is often less specific, causing confusion or misunderstanding.

The CBP’s successful proof of concept involved basic Resource Description Framework (RDF) triples, capturing semantically just this kind of information:

Sensor → Act of sensing → drone (SUAS, SUAV, vehicle, etc.)

In a recent test scenario, CBP collected 17,000 records that met specified time/space requirements for a qualified drone interdiction over a 30-day period.
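As a rough illustration (not CBP's actual model), here is how detections of that shape could be recorded as triples and queried for a time window, sketched in Python with rdflib; the namespace, properties, and data are invented.

    # A sketch of recording sensor detections as RDF triples and querying a time window.
    # Namespace, properties, and data are illustrative, not CBP's real schema.
    from rdflib import Graph, Namespace, Literal, RDF
    from rdflib.namespace import XSD

    EX = Namespace("https://example.org/cbp/")

    g = Graph()
    g.add((EX.detection42, RDF.type, EX.ActOfSensing))
    g.add((EX.detection42, EX.detectedBy, EX.sensor7))
    g.add((EX.detection42, EX.detectedObject, EX.drone13))
    g.add((EX.detection42, EX.atTime,
           Literal("2023-06-15T02:14:00Z", datatype=XSD.dateTime)))

    # Find every detection in a one-hour window, with its sensor and detected object.
    results = g.query("""
        PREFIX ex: <https://example.org/cbp/>
        PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
        SELECT ?detection ?sensor ?object WHERE {
            ?detection a ex:ActOfSensing ;
                       ex:detectedBy ?sensor ;
                       ex:detectedObject ?object ;
                       ex:atTime ?t .
            FILTER (?t >= "2023-06-15T02:00:00Z"^^xsd:dateTime &&
                    ?t <  "2023-06-15T03:00:00Z"^^xsd:dateTime)
        }
    """)
    for row in results:
        print(row.detection, row.sensor, row.object)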

The overall impression that Riccucci and Yockey conveyed was that DHS has both the budget and the commitment to tackle this and many other use cases using a transformed data-centric architecture. By capturing information in an interoperable format, the DHS has been apprehending the bad guys with greater frequency and precision.

Contributed by Alan Morrison

HR Tech and The Kitchen Junk Drawer

I often joke that when I started with Semantic Arts nearly two years ago, I had no idea a solution existed to a certain problem that I well understood. I had experienced many of the challenges and frustrations of an application-centric world but had always assumed it was just a reality of doing business. As an HR professional, I’ve heard over the years about companies having to pick the “best of the worst” technologies. Discussion boards are full of people dissatisfied with current solutions – and when they try new ones, they are usually dissatisfied with those too!

The more I have come to understand the data-centric paradigm, the more I have discovered its potential value in all areas of business, but especially in human resources. It came as no surprise to me when a recent podcast by Josh Bersin revealed that the average large company is using 80 to 100 different HR Technology systems (link). Depending on whom you ask, HR comprises twelve to fifteen key functions – meaning that we have an average of six applications for each key function. Even more ridiculously, many HR leaders would admit that there are probably even more applications in use that they don’t know about. Looking beyond HR at all core business processes, larger companies are using more than two hundred applications, and the number is growing by 10% per year, according to research by Okta from earlier this year (link). From what we at Semantic Arts have seen, the problem is actually much greater than this research indicates.

Why Is This a Problem?

Most everyone has experienced the headaches of such application sprawl. Employees often have to crawl through multiple systems, wasting time and resources, either to find data they need or to recreate the analytics required for reporting. As more systems come online to try to address gaps, employees are growing weary of learning yet another system that carries big promises but usually fails to deliver (link). Let’s not forget the enormous amount of time spent by HR Tech and other IT resources to ensure everything is updated, patched and working properly. Then, there is the near daily barrage of emails and calls from yet another vendor promising some incremental improvement or ROI that you can’t afford to miss (“Can I have just 15 minutes of your time?”).

Bersin’s podcast used a great analogy for this: the kitchen drawer problem. We go out and procure some solution, but it gets thrown into the drawer with all the other legacy junk. When it comes time to look in the drawer, either it’s so disorganized or we are in such a hurry that it seems more worthwhile to just buy another app than to actually take the time to sort through the mess.

Traditional Solutions

When it comes to legacy applications, companies don’t even know where to start. We don’t know who is even using which system, so we don’t dare shut off or replace anything. We end up with a mess of piecemeal integrations that may solve the immediate issue but just kick the technical debt down the road. Sure, there are a few ETL and other integration tools out there that can be helpful, but without a unified data model and a broad plan, these initiatives usually end up in the drawer with all the other “flavor of the month” solutions.

Another route is to simply put a nice interface over the top of everything, such as ServiceNow or other similar solutions. This can enhance the employee experience by providing a “one stop shop” for information, but it does nothing to address the underlying issues. These systems have gotten quite expensive, and can run $50,000-$100,000 per year (link). The systems begin to look like ERPs in terms of price and upkeep, and eventually they become legacy systems themselves.

Others go out and acquire a “core” solution such as SAP, Oracle, or another ERP system. They hope that these solutions, together with the available extensions, will provide the same interface benefits. A company can then buy or build apps that integrate. Ultimately, these solutions are also expensive and become “black boxes” where data and its related insights are not visible to the user due to the complexity of the system. (Intentional? You decide…). So now you go out and either pay experts in the system to help you manipulate it or settle for whatever off-the-shelf capabilities and reporting you can find. (For one example of how this can go, see link).

A Better Path Forward

Many of the purveyors of these “solutions” would have you believe there is no better way forward, but those familiar with data-centricity know better. To be clear, I’m not a practitioner or technologist. I joined Semantic Arts in an HR role, and the ensuing two years have reshaped the way I see HR, and especially HR information systems. I’ll give you a snapshot as I understand it, along with an offer: if you’re interested in the ins and outs of these things, I’d be happy to introduce you to someone who can answer your questions in greater detail.

Fundamentally, a true solution requires a mindset shift away from application silos and integration, towards a single, simple model that defines the core elements of the business, together with a few key applications that are bound to that core and speak the same language. This can be built incrementally, starting with specific use cases and expanding as it makes sense. This approach means you don’t need to have it “all figured out” from the start. With the adoption of an existing ontology, this is made even easier … but more on that later.

Once a core model is established, an organization can begin to deal methodically with legacy applications. You will find that over time many organizations go from legacy avoidance to legacy erosion, and eventually to legacy replacement. (See post on Incremental Stealth Legacy Modernization). This allows a business to slowly clean out that junk drawer and avoid filling it back up in the future (and what’s more satisfying than a clean junk drawer?).

Is this harder in the short term than traditional solutions? It may appear so on the surface, but really it isn’t. When a decision is made to start slowly, companies discover that the flexibility of semantic knowledge graphs allows for quick gains. Application development is less expensive, and applications are more easily modified as requirements change. Early steps help pay for future steps, and company buy-in becomes easier as stakeholders see their data come to life and find key business insights with ease.

For those who may be unfamiliar with semantic knowledge graphs, let me try to give a brief introduction. A graph database is a fundamental shift away from the traditional relational structure. When combined with formal semantics, a knowledge graph provides a method of storing and querying information that is more flexible and functional (more detail at link or link). Starting from scratch would be rather difficult, but luckily there are starter models (ontologies) available, including one we’ve developed in-house called gist, which is both free and freely available. By building on an established structure, you can avoid re-inventing the wheel.
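For a flavor of what building on an established ontology looks like, here is a minimal sketch in Python with rdflib. The gist namespace URI is quoted from memory and should be checked against the release you download; the instance data and the worksFor property are invented.

    # A sketch of starting from an existing upper ontology (gist) instead of
    # modeling from scratch. Confirm the namespace URI against your gist release.
    from rdflib import Graph, Namespace, Literal, RDF, RDFS

    GIST = Namespace("https://w3id.org/semanticarts/ns/ontology/gist/")  # assumed URI
    EX = Namespace("https://example.com/hr/")                            # your enterprise namespace

    g = Graph()
    g.bind("gist", GIST)
    g.bind("ex", EX)

    # Instances typed against shared upper-ontology classes...
    g.add((EX.JaneDoe, RDF.type, GIST.Person))
    g.add((EX.SemanticArts, RDF.type, GIST.Organization))

    # ...connected by an enterprise-specific property layered on top.
    g.add((EX.worksFor, RDFS.label, Literal("works for")))
    g.add((EX.JaneDoe, EX.worksFor, EX.SemanticArts))

    print(g.serialize(format="turtle"))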

HR departments looking to leverage AI and large language models in the future will find this data-centric transformation even more essential, but that’s a topic for another time.

Conclusion

HR departments face unique challenges. They deal with large amounts of information and must justify their spending as non-revenue-producing departments. The proliferation of systems and applications is a drain on employee morale and productivity and a major source of budget waste.

By adopting data-centric principles and applying them intentionally in future purchasing and application development, HR departments can realize greater strategic insights while saving money and providing a richer employee experience.

Taken all the way to completion, adoption of these technologies and principles would mean business data stored in a single, secured location. Small apps or dashboards can be rapidly built and deployed as the business evolves. No more legacy systems, no more hidden data, no more frustration with systems that simply don’t work.

Maybe, just maybe, this model will provide a success story that leads the rest of the organization to adopt similar principles.

 

JT Metcalf is the Chief Administrative Officer at Semantic Arts, managing HR functions along with many other hats.

The Data-Centric Revolution: “RDF is Too Hard”

This article originally appeared at The Data-Centric Revolution: “RDF is Too Hard” – TDAN.com. Subscribe to The Data Administration Newsletter for this and other great content!


By Dave McComb

We hear this a lot. We hear it from very smart people. Just the other day we heard someone say they had tried RDF twice at previous companies and it failed both times. (RDF stands for Resource Description Framework,[1] which is an open standard underlying many graph databases). It’s hard to convince someone like that that they should try again.

That particular refrain was from someone who was a Neo4j user (the leading contender in the LPG (Labeled Property Graph) camp). We hear the same thing from any of three camps: the relational camp, the JSON camp, and the aforementioned LPG camp.

Each has a different reason for believing this RDF stuff is just too hard. In this article, I’ll explore the nuances of RDF, shedding light on its challenges and strengths in the context of enterprise integration and application development.

For a lot of problems, the two-dimensional world of relational tables is appealing. Once you know the column headers, you pretty much know how to get to everything. It’s not quite one form per table, but it isn’t wildly off from that. You don’t have to worry about some of the rows having additional columns, you don’t have to worry about some cells being arrays or having additional depth. Everything is just flat, two-dimensional tables. Most reporting is just a few joins away.

JSON is a bit more interesting. At some point you discover, or decree if you’re building it, that your dataset has a structure. Not a two-dimensional structure as in relational, but more of a tree-like structure. More specifically, it’s all about determining if this is an array of dictionaries or a dictionary of arrays. Or a dictionary of dictionaries. Or an array of arrays. Or any deeply nested combination of these simple structures. Are the keys static — that is, can they be known specifically at coding time, or are they derived dynamically from the data itself? Frankly, this can get complex, but at least it’s only locally complex. A lot of JSON programming is about turning someone else’s structure into a structure that suits the problem at hand.
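To make the structural point concrete, here is a tiny Python sketch, with purely illustrative data, of two of those shapes and the reshaping work they imply.

    # The same tiny dataset as an array of dictionaries and as a dictionary of arrays.
    array_of_dicts = [
        {"name": "Ada", "dept": "Engineering"},
        {"name": "Grace", "dept": "Research"},
    ]

    dict_of_arrays = {
        "name": ["Ada", "Grace"],
        "dept": ["Engineering", "Research"],
    }

    # Much JSON programming is reshaping one form into the other:
    reshaped = {key: [row[key] for row in array_of_dicts] for key in array_of_dicts[0]}
    assert reshaped == dict_of_arrays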

One way to think of LPG is as JSON on top of a graph database. It has a lot of the flexibility of JSON coupled with the flexibility of graph traversals and graph analytics. It solves problems that are difficult to solve with relational or plain JSON, and it has beautiful graphics out of the box.

Each of these approaches can solve a wide range of problems. Indeed, almost all applications use one of those three approaches to structure the data they consume.

And I have to admit, I’ve seen a lot of very impressive Neo4j applications lately. Every once in a while, I question myself and wonder aloud if we should be using Neo4j. Not because RDF is too hard for us; we’ve mastered it and have many successful implementations running at client sites and internally. But maybe, if Neo4j really is easier, we should switch. And maybe it just isn’t worth disagreeing with our prospects.

Enterprise Integration is Hard

Then it struck me. The core question isn’t really “RDF v LPG (or JSON or relational),” it’s “application development v. enterprise integration.”

I’ve heard Jans Aasman, CEO of Franz, the creators of AllegroGraph, make this observation more than once: “Most application developers have dedicated approximately 0 of their neurons contemplating how what they are working on is going to fit in with the rest of their enterprise,  whereas people who are deeply into RDF may spend upwards of half their mental cycles thinking of how the task and data at hand fits into the overall enterprise model.”

That, I think, is the nub of the issue. If you are not concerned with enterprise integration, then maybe those features that scratch the itches that enterprise integration creates are not worth the added hassle.

Let’s take a look at the aspects of enterprise integration that are inherently hard, why RDF might be the right tool for the job, and why it might be overkill for traditional application development.

Complexity Reduction

One of the biggest issues dealing with enterprise integration is complexity. Most mid to large enterprises harbor thousands of applications. Each application has thousands of concepts (tables and columns or classes and attributes or forms and fields) that must be learned to become competent either in using the application and/or in debugging and extending it. No two application data models are alike. Even two applications in the same domain (e.g., two inventory systems) will have comically different terms, structures, and even levels of abstraction.

Each application is at about the complexity horizon that most mere mortals can handle. The combination of all those models is far beyond the ability of individuals to grasp.

Enterprise Resource Planning applications and Enterprise Data Modeling projects have shone a light on how complex it can get to attempt to model all an enterprise’s data. ERP systems now have tens of thousands of tables, and hundreds of thousands of columns. Enterprise Data Modeling fell into the same trap. Most efforts attempted to describe the union of all the application models that were in use. The complexity made them unusable.

What few of those focused on point solutions are aware of is that there is a single, simple model at the heart of every enterprise. It is simple enough that motivated analysts and developers can get their heads around it in a finite amount of time, and it can be mapped to the existing complex schemas in a lossless fashion.

The ability to posit these simple models is enabled by RDF (and its bigger brothers OWL and SHACL). RDF doesn’t guarantee you’ll create a simple or understandable model (there are plenty of counterexamples out there) but it at least makes the problem tractable.

Concept Sharing

An RDF-based system is mostly structure-free, so we don’t have to be concerned with structural disparities between systems, but we do need a way to share concepts. We need a way to know that “employee,” “worker,” “user,” and “operator” all refer to the same concept, or, if they don’t, in what ways they overlap.

In an RDF-based system, we spend a great deal of time understanding the concepts used in all the application systems and then creating a way for both the meaning and the identity of each concept to be easily shared across the enterprise. We also make sure the mapping between existing application schema elements and the shared concepts is well known and findable.

One mechanism that helps with this is the idea that concepts have global identifiers (URIs/IRIs) that can be resolved. You don’t need to know which application defined a concept; the domain name (and therefore the source authority) is right there in the identifier and can be used much like a URL to surface everything known about the concept. This is an important feature of enterprise integration.
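Here is a minimal sketch, in Python with rdflib, of what that concept sharing can look like; the URIs and the choice of SKOS mappings are illustrative assumptions, not a prescription.

    # A sketch of mapping local application terms to one shared, resolvable
    # enterprise concept. URIs and mappings are illustrative.
    from rdflib import Graph, Namespace, Literal
    from rdflib.namespace import SKOS

    ENT = Namespace("https://ontology.example.com/core/")     # shared enterprise concepts
    HR  = Namespace("https://apps.example.com/hr/schema/")    # one application's local schema
    OPS = Namespace("https://apps.example.com/ops/schema/")   # another application's schema

    g = Graph()
    g.add((ENT.Employee, SKOS.prefLabel, Literal("Employee")))

    # "worker" in the HR system means exactly the shared concept...
    g.add((HR.worker, SKOS.exactMatch, ENT.Employee))
    # ...while "operator" in the ops system is a narrower notion of it.
    g.add((OPS.operator, SKOS.broadMatch, ENT.Employee))

    # The source authority is visible in each identifier's domain name, and an
    # HTTP GET on ENT.Employee can return everything known about the concept.
    print(g.serialize(format="turtle"))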

Instance Level Integration

It’s not just the concepts. All the instances referred to in application systems have identifiers.  But often the identifiers are local. That is, “007” refers to James Bond in the Secret Agent table, but it refers to “Ham Sandwich” in the company cafeteria system.

The fact that systems have been creating identity aliases for decades is another problem that needs to be addressed at the enterprise level. The solution is not to attempt, as many have in the past, to jam a “universal identifier” into the thousands of affected systems. It is too much work, and they can’t handle it anyway. Plus, there are many identity problems that were unpredicted at the time their systems were built (who imagined that some of our vendors would also become customers?) and are even harder to resolve.

The solution involves a bit of entity resolution, coupled with a flexible data structure that can accommodate multiple identifiers without getting confused.
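A minimal sketch of that kind of flexible structure, again in Python with rdflib; the property names and identifiers are made up.

    # One enterprise-level node carrying several local, system-specific identifiers.
    from rdflib import Graph, Namespace, Literal
    from rdflib.namespace import OWL

    EX = Namespace("https://example.com/id/")

    g = Graph()
    person = EX.person_8f3a          # one enterprise URI for the real-world person

    # The same person is "007" in one system and "EMP-4411" in another; both
    # aliases are kept as data rather than forced into a single universal key.
    g.add((person, EX.idInSecretAgentSystem, Literal("007")))
    g.add((person, EX.idInPayrollSystem, Literal("EMP-4411")))

    # Entity resolution can later assert that another node denotes the same individual.
    g.add((person, OWL.sameAs, EX.person_77b2))

    print(g.serialize(format="turtle"))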

Data Warehouse, Data Lake, and Data Catalog all in One

Three solutions have been mooted over the last three decades to partially solve the enterprise integration problem: data warehouses, data lakes, and data catalogs. Data warehouses acknowledged that data had become balkanized. By conforming it to a shared dimensional model and co-locating the data, we could get combined reporting. But the data warehouse was lacking on many fronts: it held only a fraction of the enterprise’s data, it was structured in a way that wouldn’t allow transactional updates, and it was completely dependent on the legacy systems that fed it. Plus, it was a lot of work.

The data lake approach said co-location is good, let’s just put everything in one place and let the consumers sort it out. They’re still trying to sort it out.

Finally, the data catalog approach said: don’t try to co-locate the data, just create a catalog of it so consumers can find it when they need it.

The RDF model allows us to mix and match the best of all three approaches. We can conform some of the enterprise data (we usually recommend all the entity data, such as MDM and the like, as well as some of the key transactional data). An RDF catalog, coupled with an R2RML- or RML-style map, will not only allow a consumer to find data sets of interest; in many cases those data sets can be accessed using the same query language as the core graph. This ends up being a great solution for things like IoT, where there are great volumes of data that only need to be accessed on an exception basis.

Query Federation

We hinted at query federation in the above paragraph. The fact that query federation is built into the spec (of SPARQL, which is the query language of choice for RDF, and also doubles as a protocol for federation) allows data to be merged at query time, across different database instances, different vendors and even different types of databases (with real time mapping, relational and document databases can be federated into SPARQL queries).
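Here is a minimal sketch of what federation looks like from the query side. The endpoint URLs, prefixes, and properties are placeholders, and the SPARQLWrapper client is just one convenient way to send the query.

    # A federated SPARQL query: part of the pattern is matched at a second
    # endpoint via the SERVICE keyword. Endpoints and properties are placeholders.
    from SPARQLWrapper import SPARQLWrapper, JSON

    query = """
    PREFIX ex: <https://ontology.example.com/core/>
    SELECT ?customer ?name ?openInvoice WHERE {
        ?customer a ex:Customer ;
                  ex:name ?name .
        SERVICE <https://finance.example.com/sparql> {
            ?openInvoice ex:billedTo ?customer ;
                         ex:status   ex:Open .
        }
    }
    """

    sparql = SPARQLWrapper("https://mdm.example.com/sparql")  # the core graph's endpoint
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()

    for row in results["results"]["bindings"]:
        print(row["name"]["value"], row["openInvoice"]["value"])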

Where RDF Might Be Overkill

The ability to aid enterprise integration comes at a cost. Making sure you have valid, resolvable identifiers is a lot of work. Harmonizing your data model with someone else’s is also a lot of work. Thinking primarily in graphs is a paradigm shift. Anticipating and dealing with the flexibility of schema-later modeling adds a lot of overhead. Dealing with the oddities of open world reasoning is a major brain breaker.

If you don’t have to deal with the complexities of enterprise integration, and you are consumed by solving the problem at hand, then maybe the added complexity of RDF is not for you.

But before you believe I’ve just given you a free pass, consider this: half of all the work in most IT shops is putting back together data that was implemented by people who believed they were solving a standalone problem.

Summary

There are many aspects of the enterprise integration problem that lend themselves to RDF-based solutions. The very features that help at the enterprise integration level may indeed get in the way at the point solution level.

And yes, it would in theory be possible to graft solutions to each of the above problems (and more, including provenance and fine-grained authorization) onto relational, JSON or LPG. But it’s a lot of work and would just be reimplementing the very features that developers in these camps find so difficult.

If you are attempting to tackle enterprise integration issues, we strongly encourage you to consider RDF. There is a bit of a step function to learn it and apply it well, but we think it’s the right tool for the job.
