The Data-Centric Revolution: Fighting Class Proliferation

One of the ideas we promote is elegance in the core data model in a Data-Centric enterprise.  This is harder than it sounds.  Look at most application-centric data models: you would think they would be simpler than the enterprise model, after all, they are a small subset of it.  Yet we often find individual application data models that are far more complex than the enterprise model that covers them.

You might think that the enterprise model is leaving something out, but that’s not what we’re finding when we load data from these systems. We can generally get all the data and all the fidelity in a simpler model.

It behooves us to ask a pretty broad question:

Where and when should I add new classes to my Data-Centric Ontology?

To answer this, we’re going to dive into four topics:

  1. The tradeoff of convenience versus overhead
  2. What is a class, really?
  3. Where is the proliferation coming from?
  4. What options do I have?

Convenience and Overhead

In some ways, a class is a shorthand for something (we’ll get into a bit more detail in the next paragraph). As such, putting a label to it can often be a big convenience. I have a very charming book called Thing Explainer – Complicated Stuff in Simple Words,[1] by Randall Munroe (the author of xkcd Comics). The premise of Thing Explainer is that even very complex technical topics, such as dishwashers, plate tectonics, the International Space Station, and the Large Hadron Collider, can all be explained using a vocabulary of just ten hundred words. (To give you an idea of the lengths he goes to, he uses “ten hundred” instead of “one thousand” to save a word in his vocabulary.)

So instead of coining a new word in his abbreviated vocabulary, “dishwasher” becomes “box that cleans food holders” (food holders being bowls and plates). I lived in Papua New Guinea part time for a couple of years, and the national language there, Tok Pisin, has only about 2,000 words. They ended up with similar word salads. I remember the grocery store was at “plas bilong san kamup,” or “place belong sun come up,” which is Tok Pisin for “East.”

It is much easier to refer to “dishwashers” and “East” than their longer equivalents. It’s convenient. And it doesn’t cost us much in everyday conversation.

But let’s look at the convenience / overhead tradeoff in an information system that is not data-centric. Every time you add a new class (or a new attribute) to an information system, you are committing the enterprise to deal with it, potentially for decades to come. The overhead starts with application programming: that new concept has to be referred to by code, and not just a small amount of it. I’ve done some calculations in my book, Software Wasteland, that suggest each attribute added to a system adds at least 1,000 lines of source code: code to move the item from the database to some API, code to take it from the API and put it in the DOM or something similar, code to display it on a screen, in a report, maybe even in a drop-down list, and code to validate it. Given that it costs money to write and test code, this adds to the cost of a system. The real impact is felt downstream: in application maintenance, especially in the brittle world of systems integration, and by the users. Every new attribute is a new field on a form to puzzle about. Every new class is often a new form. New forms often require changes to process flow. And so the complexity grows.

Finally, there is cognitive load. When we have to deal with dozens or hundreds of concepts, we don’t have too much trouble. When we get to thousands it becomes a real undertaking. Tens of thousands and it’s a career. And yet many individual applications have tens of thousands of concepts. Most large enterprises have millions, which is why becoming data-centric is so appealing.

One of the other big overheads in traditional technology is duplication. When you create a new class, let’s say, “hand tools,” you may have to make sure that the wrench is in the Hand Tools class / table and also in the Inventory table. Relying on humans and procedures to remember to put things in more than one place is a huge undocumented burden.
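
To make the contrast concrete, here is a minimal sketch using Python's rdflib (my choice of library, not something the article prescribes) of how a single subclass assertion lets a system derive the wrench's membership in the broader inventory class, instead of relying on people to file it in two places. All names are invented for illustration.

```python
# Hypothetical illustration: "the wrench is a hand tool" is asserted once;
# membership in the broader inventory class is derived from the class
# hierarchy rather than maintained by hand in a second table.
from rdflib import Graph, Namespace, RDF, RDFS

EX = Namespace("https://example.com/")  # invented namespace
g = Graph()

g.add((EX.HandTool, RDFS.subClassOf, EX.InventoryItem))  # schema: one assertion
g.add((EX.wrench42, RDF.type, EX.HandTool))              # data: filed exactly once

# Naive RDFS-style expansion: propagate members one level up the subclass hierarchy.
for cls, _, supercls in list(g.triples((None, RDFS.subClassOf, None))):
    for item in list(g.subjects(RDF.type, cls)):
        g.add((item, RDF.type, supercls))

print((EX.wrench42, RDF.type, EX.InventoryItem) in g)  # True -- no duplicate entry needed
```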

We want to think long and hard before introducing a new class or even a new attribute.

Read more on TDAN.com

The Data-Centric Revolution: The Role of SemOps (Part 1)

We’ve been working on something we call “SemOps” (like DevOps, but for Semantic Technology + IT Operations). The basic idea is to create a pipeline that moves proposed enterprise ontology or taxonomy enhancements into production as frictionlessly as possible.

As so often happens, when we shine the Semantic Light on a topic area, we see things anew.  In this very circuitous way, we’ve come to some observations and benefits that we think will be of interest even to those who aren’t on the Semantic path.

DevOps for Data People

If you’re completely on the data side, you may not be aware of what developers are doing these days.  Most mature development teams have deployed some version of DevOps (Software Development + IT Operations) along with CI/CD (Continuous Integration / Continuous Deployment).

To understand what they are doing, it helps to harken back to what preceded DevOps and CI/CD. Once upon a time, software was delivered via the waterfall methodology. Months or occasionally years would be spent getting the requirements for a project “just right.” The belief was that if you didn’t get the requirements right up front, adding even a single new feature later would cost 40 times what it would have cost had the requirement been identified up front. It turns out there was some good data behind this cost factor, and it still casts its shadow any time you try to modify a packaged enterprise application: 40x is a reasonable benchmark compared to what it would cost to implement that feature outside the package. This, as a side note, is the economics that creates the vast number of “satellite systems” that seem to spring up alongside large packaged applications.

Once the requirements were signed off on, the design began (more months or years), then coding (more months or years), and finally systems testing (more months or years). Then came the big conversion weekend: the system goes into production, tee shirts are handed out to the survivors, and the system becomes IT Operations’ problem.

There really was only ever one “move to production,” and few thought it worthwhile to invest the energy in making it more efficient. Most sane people, once they’d stayed up all night on a conversion weekend, were loath to sign up for another, and it certainly didn’t occur to them to work out a way to make it better.

Then agile came along. One of the tenets of agile was that you always had a working version that you could, in theory, push to production. In the early days people weren’t pushing to production on any frequent schedule, but the fact that you always could was a good discipline for avoiding technical debt and straying off into building hypothetical components.

Over time, the idea that you could push to production became the idea that you should. As people invested more and more in their unit testing and regression testing, and in pipelines to move from dev to QA to production, they became used to the idea of pushing small incremental changes into production systems. That was the birth of DevOps and CI/CD. In mature organizations like Google and Amazon, new versions of their software are being pushed to production many times per day (some reports say many times per second, but this may be hyperbole).

The reason I bring it up is that there are some things in there that we expect to duplicate with SemOps, and some that we already have with data. (As I was writing this sentence, I was tempted to write “DataOps” and I thought: “is there such a thing?”) A nanosecond of googling later, and I found this extremely well-written article on the topic from our friends at DataKitchen. They are focusing more on the data analytics part of the enterprise, which is a hugely important area. The points I was going to make were more focused on the data source end of the pipeline, but the two ideas tie together nicely.

Click here to read more on TDAN.com

Sharing Ontologies Globally To Speed Science And Healthcare Solutions

The COVID-19 pandemic is a clear example of how healthcare practitioners require swift access to enormous amounts of diverse information to efficaciously treat patients. They must synthesize individual data (vital signs, clinical history, demographics, and more) with rapidly evolving knowledge about COVID-19 and make decisions relevant to the conditions from which specific patients suffer. Practitioners rely on point-of-care decision support systems to accelerate patient-care analysis and to scale treatments for the intake quantities of global pandemics. These systems analyze a plethora of inputs to produce tailored treatment recommendations, in near real time, which significantly enhance the quality of treatment.

Ontologies Create The Foundation For Complex Data Analysis

The underlying utility of these systems is largely based on the vast quantities of healthcare knowledge analyzed. Such knowledge must be uniformly represented (at scale) with rich, contextualized descriptions of the full scope of clinical trials, pharmaceutical information, and research germane to the biomedical field, which expands daily with each published paper and new finding. This knowledge should be rapidly accessible, reusable, and a sturdy foundation on which to base present and future research in this field, encompassing everything from long-standing maladies like peanut allergies to emergent ones like COVID-19.

Ontologies (evolving conceptual data models with standardized concepts) uniquely fulfill each of these requirements to fuel healthcare research and point-of-care decision support systems, helping save lives when they need saving most.

International Ontology Sharing Is Becoming A Reality

A consortium of researchers recently formed an organization dedicated to standardizing how scientists define their ontologies, which are essential for retrieving datasets as well as understanding and reproducing research. The group, called the OntoPortal Alliance, is creating a public repository of internationally shared domain-specific ontologies. All the repositories will be managed with a common OntoPortal appliance that has been tested with AllegroGraph Semantic Knowledge Graph software. This enables any OntoPortal adopter to get all the power, features, maintainability, and support benefits that come from using a widely adopted, state-of-the-art semantic knowledge graph database.

The first set of ontology repositories making up the OntoPortal Alliance includes BioPortal (biomedical and other ontologies used internationally), SIFR (biomedical ontologies in the French language), BMICC MedPortal (biomedical ontologies focused on Chinese users), AgroPortal (ontologies focused on agronomy and related sciences), and EcoPortal (ontologies focused on environmental science). The OntoPortal Alliance will be adding more ontology repositories and is open to working with researchers in other domains who want to offer ontologies publicly.

Click here to read the full article at HealthITOutcomes.com

The Data-Centric Revolution: Data-Centric vs. Centralization

We just finished a conversation with a client who was justifiably proud of having centralized what had previously been a very decentralized business function (in this case, it was HR, but it could have been any of a number of functions). They had seemingly achieved many of the benefits of becoming data-centric through centralization: all their data in one place, a single schema (data model) to describe the data, and dozens of decommissioned legacy systems.

We decided to explore whether this was data-centric and the desirable endgame for all their business functions.

A quick review. This is what a typical application looks like:

The metadata is the key. The application, the business logic and the UI are coded to the metadata (Schema), and the data is accessed through and understood by the metadata. What happens in every large enterprise (and most small ones) is that different departments or divisions implement their own applications.


Many of the applications were purchased, and today, some are SaaS (Software as a Service) or built in-house. What they all fail to share is a common schema. The metadata is arbitrarily different and, as such, the code base on top of the metadata is different, so there is no possibility of sharing between departments. Systems integrators try to work out what the data means and piece it together behind the scenes. This is where silos come from. Most large firms don’t have just four silos, they have thousands of them.

One response to this is “centralization.” If you discover that you have implemented, let’s say, dozens of HR systems, you may think it’s time to replace them with one single centralized HR system. And you might think this will make you Data-Centric. And you would be, at least, partially right.

Recall one of the litmus tests for Data-Centricity:

Let’s take a deeper look at the centralization example.


Centralization replaces a lot of siloed systems with one centralized one. This achieves several things. It gets all the data in one place, which makes querying easier. All the data conforms to the same schema (a single shared model). Typically, if this is done with traditional technology, this is not a simple model, nor is it extensible or federate-able, though there is some progress.

The downside is that everyone now must use the same UI and conform to the same model, and that’s the tradeoff.


The tradeoff works pretty well for business domains where the functional variety from division to division is slight, or where the benefit to integration exceeds the loss due to local variation.  For many companies, centralization will work for back office functions like HR, Legal, and some aspects of Accounting.

However, in areas where the local differences are what drives effectiveness and efficiency (sales, production, customer service, or supply chain management) centralization may be too high a price to pay for lack of flexibility.

Let’s look at how Data-Centricity changes the tradeoffs.

Click here to read more on TDAN.com

The Data-Centric Revolution: The Sky is Falling (Let’s Make Lemonade)

Recently, IDC predicted that IT spending will drop by 5% due to the COVID-19 pandemic.[1] Last week, Gartner went further, predicting that IT spending would drop by 8%, or $300 billion.[2] (Expect a prediction bidding war.) Both were consistent: the hardest-hit areas would be devices, followed by IT services and enterprise software.

The predicted $100 billion drop in those last two categories should send chills through those of us who make our living in them. And keep in mind, this drop will occur in the latter half of this year. To date, there have been very few cuts.

But I’m seeing the glass half full here. Half full of lemonade.[3]

Here is my thought process:

  • For at least five years, we have been advocating abandoning the senseless implementation of application after application. (You know: the silo-making industry.) We have made a strong case for avoiding the application-centric quagmire in Software Wasteland.[4]
  • And yet spending on implementing application systems has continued unabated since 2015.
  • With the need to slash budgets in the latter half of 2020, the large application implementation projects will be the easiest section to target.
  • Indeed, the IDC article says that “IT services spending will also decline, mostly due to delays in large projects.”
  • Furthermore, “some firms will cut capital spending and others will either delay new projects or seek to cut costs in other ways.”
  • Gartner reported that “some companies are cutting big IT projects altogether; others are ploughing ahead but delaying some elements of their plans to save money.”
  • Hershey has halted sections of a new ERP system and will drop IT capital spending from the budgeted $500 million to between $400 million and $450 million.
  • Gartner also stated that “health care systems [are] pushing out projects to create digital health records by six months or more.”

This would be a terrible time to be an application software vendor or a systems integrator. The yearly 7% reductions in both categories are still in front of us. Any contract not yet signed will be put on hold. Even contracts in progress may get cancelled.

Click here to read more on TDAN.com

The Data-Centric Revolution: The Role of SemOps (Part 2)

In our previous installment of this two-part series, we introduced a few ideas.

First, data governance may be more similar to DevOps than first meets the eye.

Second, the rise of Knowledge Graphs, Semantics, and Data-Centric development will bring with it the need for something similar, which we are calling “SemOps” (Semantic Operations).

Third, when we peel back what people are doing in DevOps and Data Governance, we get down to five key activities that will be very instructive on our SemOps journey:

  1. Quality
  2. Allowing/ “Permission-ing”
  3. Predicting Side Effects
  4. Constructive
  5. Traceability

We’ll take up each in turn and compare and contrast how each activity is performed in DevOps and Data Governance to inform our choices in SemOps.

But before we do, I want to cover one more difference: how the artifacts scale under management.

Code

There isn’t any obvious hierarchy to code, from abstract to concrete or general to specific, as there is in data and semantics.  It’s pretty much just a bunch of code, partitioned by silos. Some of it you bought, some you built, and some you rent through SaaS (Software as a Service).

Each of these silos often represents a lot of code. Something as simple as QuickBooks is 10 million lines of code. SAP is hundreds of millions. Most in-house software is not as bloated as most packages or software services; still, it isn’t unusual to have millions of lines of code in an in-house developed project (much of it in libraries that were copied in, but it still represents complexity to be managed). The typical large enterprise is managing billions of lines of code.

The only thing that makes this remotely manageable is, paradoxically, the thing that makes it so problematic: isolating each codebase in its own silo.  Within a silo, the developer’s job is to not introduce something that will break the silo and to not introduce something that will break the often fragile “integration” with the other silos.

Data and Metadata

There is a hierarchy to data that we can leverage for its governance.  The main distinction is between data and metadata.


There is almost always more data than metadata. More rows than columns. But in many large enterprises there is far, far more metadata than anyone could possibly guess. We were privy to a project to inventory the metadata for a large company, which shall go nameless. At the end of the profiling, it was discovered that there were 200 million columns under management across the firm. That is columns, not rows. No doubt there were billions of rows in all their data.

There are also other levels that people often introduce to help with the management of this pyramid.  People often separate Reference data (e.g., codes and geographies) and Master data (slower changing data about customers, vendors, employees and products).

These distinctions help, but even as the data governance people are trying to get their arms around this, the data scientists show up with “Big Data.” Think of big data as sitting below the bottom of this pyramid. Typically, it is even more voluminous, and it usually has only the most ad hoc metadata (the “keys” in the “key/value pairs” in deeply nested JSON data structures are metadata, sort of, but you are left guessing what those short cryptic labels actually mean).

Click here to read more on TDAN.com

Property Graphs: Training Wheels on the way to Knowledge Graphs

I’m at a graph conference. The general sense is that property graphs are much easier to get started with than Knowledge Graphs. I wanted to explore why that is, and whether it is a good thing.

It’s a bit of a puzzle to us: we’ve been using RDF and the Semantic Web stack for almost two decades, and it seems intuitive, but talking to people new to graph databases, there is a strong preference for property graphs (at this point primarily Neo4J and TigerGraph, but there are others). – Dave McComb

Property Graphs

A knowledge graph is a database that stores information as a directed graph (digraph): a set of edges, each of which is just a link between two nodes.


The nodes self-assemble (when they have the same value) into a more complete and more interesting graph.


What makes a graph a “property graph” (also called a “labeled property graph”) is the ability to have values on the edges.

Either type of graph can have values on the nodes; in a Knowledge Graph these are handled with a special kind of edge called a “datatype property.”
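
As a concrete (and hypothetical) illustration of the two kinds of edges, here is a small Python sketch using rdflib; the organization, people, and property names are all invented.

```python
# A knowledge graph is a set of directed edges (triples).
from rdflib import Graph, Literal, Namespace, XSD

EX = Namespace("https://example.com/")  # invented namespace
g = Graph()

# Object property: an edge from one node to another node.
g.add((EX.acme, EX.employs, EX.jane))

# Datatype property: an edge from a node to a literal value.
g.add((EX.jane, EX.hireDate, Literal("2019-04-01", datatype=XSD.date)))

for subject, predicate, obj in g:
    print(subject, predicate, obj)
```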


Here is an example of one of the typical uses for values on the edges (the date the edge was established). As it turns out, this canonical example isn’t a very good one: in most databases, graph or otherwise, a purchase would be a node with many other complex relationships.

The better use of dates on the edges in property graphs is where there is what we call a “durable temporal relation.” Some relationships exist for a long time, but not forever, and depending on the domain they are often modeled as edges with effective start and end dates (ownership, residence, and membership are examples of durable temporal relations that map well to dates on the edges).

There is one other big use case for values on the edges, which we’ll cover below.

The Appeal of Property Graphs

Talking to people and reading white papers, it seems the appeal of Property Graph databases lies in these areas:

  • Closer to what programmers are used to
  • Easy to get started
  • Cool Graphics out of the box
  • Attributes on the edges
  • Network Analytics

Property Graphs are Closer to What Programmers are Used to

The primary interfaces to Property Graphs are JSON-style APIs, which developers are comfortable with and find easy to adapt to.


Easy to Get Started

Neo4J in particular has done a very good job of getting people set up, running, and productive in short order. There are free versions to get started with, and well-exercised data sets to get up and going rapidly. This is very satisfying for people getting started.


Cool Graphics Out of the Box

One of the striking things about Neo4J is its beautiful graphics.


You can rapidly produce graphics of a kind never seen in traditional systems, and this draws the attention of sponsors.

Property Graphs have Attributes on the Edges

Perhaps the main distinction between Property Graphs and RDF Graphs is the ability to add attributes to the edges in the network.  In this case the attribute is a rating (this isn’t a great example, but it was the best one I could find easily).


One of the primary use cases for attributes on the edges would be weights that are used in the evaluation of network analytics. For instance, a network representation of how to get from one town to another might include a number of alternate sub-routes through different towns or intersections. Each edge would represent a segment of a possible journey. By putting weights on each edge that represent distance, a network algorithm could calculate the shortest path between two towns. By putting weights on the edges that represent average travel time, a network algorithm could calculate the route that would take the least time.
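
A sketch of that routing example, using the networkx library (an assumed tool choice, not one named in the article); the towns, distances, and times are made up.

```python
# Edge attributes carry both distance and average travel time; the algorithm
# minimizes whichever weight you name.
import networkx as nx

G = nx.Graph()
G.add_edge("TownA", "TownB", distance=10, minutes=15)
G.add_edge("TownB", "TownC", distance=5, minutes=20)
G.add_edge("TownA", "TownC", distance=18, minutes=18)

print(nx.shortest_path(G, "TownA", "TownC", weight="distance"))  # ['TownA', 'TownB', 'TownC']
print(nx.shortest_path(G, "TownA", "TownC", weight="minutes"))   # ['TownA', 'TownC']
```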

Other use cases for attributes on the edges include temporal information (when did this edge become true, and when was it no longer true), certainty (you can rate the degree of confidence you have in a given link and, in some cases, only consider links that are above some certainty value), and popularity (you could implement the PageRank algorithm with weights on the edges, but I think it might be more appropriate to put the weights on the nodes).

Network Analytics

There is a wide range of network analytics that come out of the box and are enabled in the property graph. Many do not require attributes on the edges; for instance, the “clustering” and “strength of weak ties” analyses suggested in this graphic can be done without attributes on the edges.


However, many of the network analytics algorithms can take advantage of weights on the edges.

Property Graphs: What’s Not to Like

That is a lot of pluses on the Property Graph side, and it explains their meteoric rise in popularity.

Our contention is that when you get beyond the initial analytic use case, you will find yourself needing to reinvent a great body of work that already exists and has long been standardized. At that point, if you have overcommitted to Property Graphs, you will find yourself in a quandary, whereas if you positioned Property Graphs as a stepping stone on the way to Knowledge Graphs, you will save yourself a lot of unnecessary work.

Property Graphs, What’s the Alternative?

The primary alternative is an RDF Knowledge Graph.  This is a graph database using the W3C’s standards stack including RDF (resource description framework) as well as many other standards that will be described below as they are introduced.

The singular difference is that the RDF Knowledge Graph standards were designed for interoperability at web scale. As such, all identifiers are globally unique, and potentially discoverable and resolvable. This is a gigantic advantage when using knowledge graphs as an integration platform, as we will cover below.

Where You’ll Hit the Wall with Property Graphs

There are a number of capabilities we assume you’ll eventually want to add to your Property Graph stack, such as:

  • Schema
  • Globally Unique Identifiers
  • Resolvable identifiers
  • Federation
  • Constraint Management
  • Inference
  • Provenance

Our contention is that you could, in principle, add all of this to a property graph, and over time you will indeed be tempted to do so. However, doing so is a tremendous amount of work and high risk, and even if you succeed, you will have a proprietary, home-grown version of capabilities that already exist, are standardized, and have been proven in large-scale production systems.

As we introduce each of these capabilities that you will likely want to add to your Property Graph stack, we will describe the open standards approach that already covers it.

Schema

Property Graphs do not have a schema. While big data lauded the idea of “schema-less” computing, the truth is that completely removing schema means that a number of functions previously performed by the schema have moved somewhere else, usually code. In the case of Property Graphs, the nearest equivalent to a schema is the “label” in “Labeled Property Graph.” But as the name suggests, this is just a label, essentially like putting a tag on something. So you can label a node “Person,” but that tells you nothing more about the node. It’s easier to see how limited this is when you label a node “Vanilla Swap” or “Miniature Circuit Breaker.”

Knowledge Graphs have very rich and standardized schema. They give you the best of both worlds: unlike relational databases, they do not require all schema to be present before any data can be persisted, and at the same time, when you are ready to add schema to your graph, you can do so with a high degree of rigor and go into as much or as little detail as necessary.
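
A minimal sketch of that “best of both worlds” point, again using rdflib with invented names: instance data can be persisted before any schema exists, and schema statements are just more triples added later.

```python
from rdflib import Graph, Literal, Namespace, RDF, RDFS

EX = Namespace("https://example.com/")  # invented namespace
g = Graph()

# Day one: data only -- nothing like a CREATE TABLE has to exist first.
g.add((EX.breaker17, RDF.type, EX.MiniatureCircuitBreaker))

# Later: add as much or as little schema as the project needs, into the same graph.
g.add((EX.MiniatureCircuitBreaker, RDF.type, RDFS.Class))
g.add((EX.MiniatureCircuitBreaker, RDFS.subClassOf, EX.CircuitBreaker))
g.add((EX.MiniatureCircuitBreaker, RDFS.label, Literal("Miniature Circuit Breaker")))

print(g.serialize(format="turtle"))
```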

Globally Unique Identifiers

The identifiers in Property Graphs are strictly local. They don’t mean anything outside the context of the immediate database. This is a huge limitation when looking to integrate information across many systems, and especially when looking to combine third-party data.

Knowledge Graphs are based on URIs (really IRIs). Uniform Resource Identifiers (and their Unicode superset, Internationalized Resource Identifiers) are a lot like URLs, but instead of identifying a web location or page, they identify a “thing.” In best practice (which is to say 99% of all the extant URIs and IRIs out there), the URI/IRI is based on a domain name. This delegation of identifier assignment to the organizations that own the domain names allows relatively simple identifiers that are not in danger of being mistakenly duplicated.

Every node in a knowledge graph is assigned a URI/IRI, including the schema or metadata. This makes discovering what something means as simple as “following your nose” (see the next section).
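
For illustration, here is a sketch of what minting identifiers under a domain you control looks like; the domain and property names below are invented.

```python
from rdflib import Namespace

# Identifiers minted under a domain the organization controls are globally
# unique without any central registry.
ACME_DATA = Namespace("https://data.acme.example/id/")   # invented data namespace
ACME_ONT = Namespace("https://data.acme.example/ont/")   # invented schema namespace

wrench = ACME_DATA["equipment/wrench-42"]  # a data node
has_serial = ACME_ONT.hasSerialNumber      # a schema (metadata) node -- also a URI

print(wrench)      # https://data.acme.example/id/equipment/wrench-42
print(has_serial)  # https://data.acme.example/ont/hasSerialNumber
```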

Resolvable Identifiers

Because URI/IRIs are so similar to URLs, and indeed in many situations are URLs, it is easy to resolve any item. Clicking on a URI/IRI can redirect to a server in the URI/IRI’s domain, which can then render a page that represents the resource. In the case of a schema/metadata URI/IRI, the page might describe what the metadata means. This typically includes both the “informal” definition (comments and other annotations) and the “formal” definition (described below).

For a data URI/IRI, the resolution might display what is known about the item (typically the outgoing links), subject to security restrictions implemented by the owner of the domain. This style of exploring a body of data by clicking on links is called “following your nose,” and it is a very effective way of learning a complex body of knowledge, because unlike traditional systems, you do not need to know the whole schema in order to get started.
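
Here is a hedged sketch of “following your nose” in code: dereference a URI with content negotiation and parse whatever RDF comes back. DBpedia is used only because it is a well-known public example of resolvable URIs; whether a given publisher honors a particular Accept header is something to verify case by case.

```python
import requests
from rdflib import Graph

uri = "http://dbpedia.org/resource/Berlin"  # a well-known resolvable URI
response = requests.get(uri, headers={"Accept": "text/turtle"}, timeout=30)

g = Graph()
g.parse(data=response.text, format="turtle")
print(len(g), "triples learned about", uri)
```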

Property Graphs have no standard way of doing this.  Anything that is implemented is custom for the application at hand.

Federation

Federation refers to the ability to query across multiple databases to get a single comprehensive result set. This is almost impossible to do with relational databases. No major relational database vendor will execute queries across multiple databases and combine the results (the results generally wouldn’t make any sense anyway, as the schemas are never the same). The closest thing in traditional systems is the Virtual Data P***, which allows some limited aggregation of harmonized databases.

Property Graphs also have no mechanism for federation beyond a single in-memory graph.

Federation is built into SPARQL (the W3C standard for querying “triple stores,” or RDF-based graph databases). You can point a SPARQL query at a number of databases (including relational databases that have been mapped to RDF through another W3C standard, R2RML).
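
A sketch of what a federated query looks like: the SERVICE clause ships part of the pattern to a remote endpoint and the results are joined with the local graph. The endpoint URL, file name, and vocabulary are placeholders, and whether your local engine evaluates SERVICE (recent rdflib versions do, for example) is something to verify for your stack.

```python
from rdflib import Graph

# Placeholder endpoint, file, and vocabulary -- not recommendations.
FEDERATED_QUERY = """
PREFIX ex: <https://example.com/>
SELECT ?product ?externalLabel
WHERE {
  ?product a ex:Product .                      # matched against the local graph
  SERVICE <https://sparql.partner.example/> {  # evaluated at the remote endpoint
    ?product ex:label ?externalLabel .
  }
}
"""

local = Graph()
local.parse("products.ttl", format="turtle")  # assumed local data file

for row in local.query(FEDERATED_QUERY):
    print(row.product, row.externalLabel)
```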

Constraint Management

One of the things needed in a system that hosts transactional updates is the ability to enforce constraints on incoming transactions. Suffice it to say, Property Graphs have no transaction mechanism and no constraint management capability.

Knowledge Graphs have a W3C standard, SHACL (the SHApes Constraint Language), for specifying constraints in a model-driven fashion.
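
A sketch of model-driven constraint checking using the pySHACL package (an assumed tool choice): the shape below says every ex:Person must have exactly one ex:hasEmail, and the data deliberately violates it. All names are invented.

```python
from rdflib import Graph
from pyshacl import validate

shapes = Graph().parse(data="""
    @prefix sh: <http://www.w3.org/ns/shacl#> .
    @prefix ex: <https://example.com/> .
    ex:PersonShape a sh:NodeShape ;
        sh:targetClass ex:Person ;
        sh:property [ sh:path ex:hasEmail ; sh:minCount 1 ; sh:maxCount 1 ] .
""", format="turtle")

data = Graph().parse(data="""
    @prefix ex: <https://example.com/> .
    ex:jane a ex:Person .    # no email -- should fail validation
""", format="turtle")

conforms, _report_graph, report_text = validate(data, shacl_graph=shapes)
print(conforms)     # False
print(report_text)  # human-readable violation report
```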

Inference

Inference is the creation of new information from existing information. A Property Graph produces a number of “insights,” which are a form of inference, but these live only in the heads of the people running the analytics and interpreting what the insight means.

Knowledge Graphs have several inference capabilities. What they all share is that the result of the inference is rendered as another triple (the inferred information is another fact, which can be expressed as a triple). In principle, almost any fact that can be asserted in a Knowledge Graph can also be inferred, given the right contextual information. For instance, we can infer that a class is a subclass of another class, that a node has a given property, or that two nodes represent the same real-world item, and each of these inferences can be “materialized” (written) back to the database. This makes any inferred fact available to any human reviewing the graph and to any process that acts on the graph, including queries.

Two of the prime creators of inferred knowledge are RDFS and OWL, the W3C standards for schema. RDFS provides the simple sort of inference that people familiar with Object-Oriented programming will recognize, primarily the ability to infer that a node that is a member of a class is also a member of any of its superclasses. A bit newer to many people is the idea that properties can have superproperties, and that leads to inference at the instance level. If you assert that you have a mother (property :hasMother) Beth, and then declare :hasParent to be a superproperty of :hasMother, the system will infer that you :hasParent Beth, and this process can be repeated by making :hasAncestor a superproperty of :hasParent. The system can infer and persist this information.
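
Here is a sketch of exactly that :hasMother / :hasParent / :hasAncestor example. A production system would hand this to a reasoner; to keep the sketch self-contained, it hand-rolls just the rdfs:subPropertyOf entailment and materializes the inferred triples back into the graph.

```python
from rdflib import Graph, Namespace, RDFS

EX = Namespace("https://example.com/")  # invented namespace
g = Graph()

# Schema: hasMother is a subproperty of hasParent, which is a subproperty of hasAncestor.
g.add((EX.hasMother, RDFS.subPropertyOf, EX.hasParent))
g.add((EX.hasParent, RDFS.subPropertyOf, EX.hasAncestor))

# Data: a single asserted fact.
g.add((EX.you, EX.hasMother, EX.beth))

# Materialize inferred triples until nothing new appears.
changed = True
while changed:
    changed = False
    for sub_prop, _, super_prop in list(g.triples((None, RDFS.subPropertyOf, None))):
        for s, _, o in list(g.triples((None, sub_prop, None))):
            if (s, super_prop, o) not in g:
                g.add((s, super_prop, o))
                changed = True

print((EX.you, EX.hasParent, EX.beth) in g)    # True (inferred)
print((EX.you, EX.hasAncestor, EX.beth) in g)  # True (inferred)
```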

OWL (the Web Ontology Language for dyslexics) allows for much more complex schema definitions. OWL lets you create class definitions from Boolean combinations of other classes, and allows the formal definition of classes through membership conditions based on what properties are attached to nodes.

If RDFS and OWL don’t provide sufficient rigor and/or flexibility, there are two other options, both rule languages, and both render their inferences as triples that can be returned to the triple store. RIF (the Rule Interchange Format) allows inference rules defined in terms of “if/then” logic. SPARQL, the above-mentioned query language, can also be used to create new triples that can be written back to the triple store.
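
And a sketch of the SPARQL route: a CONSTRUCT query acts as an “if/then” rule whose conclusions are new triples that can be written back to the store. The grandparent rule and the data are invented.

```python
from rdflib import Graph

g = Graph()
g.parse(data="""
    @prefix ex: <https://example.com/> .
    ex:you ex:hasParent ex:beth .
    ex:beth ex:hasParent ex:rose .
""", format="turtle")

# If ?x has parent ?y and ?y has parent ?z, then ?x has grandparent ?z.
RULE = """
PREFIX ex: <https://example.com/>
CONSTRUCT { ?x ex:hasGrandparent ?z }
WHERE     { ?x ex:hasParent ?y . ?y ex:hasParent ?z . }
"""

inferred = g.query(RULE).graph  # the constructed triples as a graph
g += inferred                   # materialize them back into the store
print(len(g))                   # 3: two asserted triples plus one inferred
```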

Provenance

Provenance is the ability to know where any atom of data came from. There are two provenance mechanisms in Knowledge Graphs. For inferences generated from RDFS or OWL definitions, there is an “explain” mechanism, which is described in the standards as “proof.” In the same spirit as a mathematical proof, the system can reel out the assertions, both schema-based definitions and data-level assertions, that led to the provable conclusion of the inference.

For data that did not come from inference (data that was input by a user, purchased, or created through some batch process), there is a W3C standard called PROV-O (the Provenance Ontology) that outlines a standard way to describe where a data set, or even an individual atom of data, came from.
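
A small sketch of PROV-O in use, describing where a derived dataset came from; the dataset, source, and team URIs are invented.

```python
from rdflib import Graph, Literal, Namespace, RDF, XSD

PROV = Namespace("http://www.w3.org/ns/prov#")  # the W3C provenance ontology
EX = Namespace("https://example.com/")          # invented namespace

g = Graph()
g.add((EX.cleanedCustomers, RDF.type, PROV.Entity))
g.add((EX.cleanedCustomers, PROV.wasDerivedFrom, EX.rawCustomerExtract))
g.add((EX.cleanedCustomers, PROV.wasAttributedTo, EX.dataEngineeringTeam))
g.add((EX.cleanedCustomers, PROV.generatedAtTime,
       Literal("2021-06-01T12:00:00Z", datatype=XSD.dateTime)))

print(g.serialize(format="turtle"))
```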

Property Graphs have nothing similar.

Convergence

The W3C held a conference to bring together the labeled property graph camp with the RDF knowledge graph camp in Berlin in March of 2019.

One of our consultants attended and has been tracking the aftermath. One promising path is RDF*, which is being mooted as a potential candidate to unify the two camps. There are already several commercial implementations supporting RDF*, even though the standard hasn’t yet begun its journey through the approval process. We will cover RDF* in a subsequent white paper.

Summary

Property Graphs are easy to get started with. People think RDF-based Knowledge Graphs are hard to understand, complex, and hard to get started with. There is some truth to that characterization.

The reason we made the analogy to “training wheels” (or “stepping stones” in the middle of the article) is to acknowledge that riding a bike is difficult.  You may want to start with training wheels.  However, as you become proficient with the training wheels, you may consider discarding them rather than enhancing them.

Most of our clients start directly with Knowledge Graphs, but we recognize that that isn’t the only path. Our contention is that a bit of strategic planning up front, outlining where this is likely to lead, gives you a lot more runway. You may choose to do your first graph project using a property graph, but we suspect that sooner or later you will want to get beyond the first few projects and will want to adopt an RDF / Semantic Knowledge Graph based system.

Toss Out Metadata That Does Not Bring Joy

As I write this, I can almost hear you wail “No, no, we don’t have too much metadata, we don’t have nearly enough!  We have several projects in flight to expand our use of metadata.”

Sorry, I’m going to have to disagree with you there. You are on a fool’s errand that will just provide busy work and will have no real impact on your firm’s ability to make use of the data it has.

Let me tell you what I have seen in the last half dozen or so very large firms I’ve been involved with, and you can tell me if this rings true for you.  If you are in a mid-sized or even small firm you may want to divide these numbers by an appropriate denominator, but I think the end result will remain the same.

Most large firms have thousands of application systems. Each of these systems has a data model that consists of hundreds of tables and many thousands of columns. Complex applications, such as SAP, explode these numbers (a typical SAP install has populated 90,000 tables and half a million columns).

Even as we speak, every vice president with a credit card is unknowingly expanding their firm’s data footprint by implementing suites of SaaS (Software as a Service) applications.  And let’s not even get started on your Data Scientists.  They are rabidly vacuuming up every dataset they can get their hands on, in the pursuit of “insights.”

Naturally you are running out of space, and especially system admin bandwidth in your data centers, so you turn to the cloud.  “Storage is cheap.”

This is where the Marie Kondo analogy kicks in.  As you start your migration to the cloud (or to your Data Lake, which may or may not be in the cloud), you decide “this would be a good time to catalog all this stuff.”  You launch into a project with the zeal of a Property and Evidence Technician at a crime scene. “Let’s carefully identify and tag every piece of evidence.”  The advantage that they have, and you don’t, is that their world is finite.  You are faced with cataloging billions of pieces of metadata.  You know you can’t do it alone, so you implore the people who are putting the data in the Data Swamp (er, Lake).  You mandate that anything that goes into the lake must have a complete catalog.  Pretty soon you notice that the people putting the data in don’t know what it is either.  And they know most of it is crap, but there are a few good nuggets in there.  If you require them to have descriptions of each data element, they will copy the column heading and call it a description.

Let’s just say, hypothetically, you succeeded in getting a complete and decent catalog for all the datasets in use in your enterprise.  Now what?

Click here to read more on TDAN.com

The Data-Centric Revolution: Lawyers, Guns and Money

My book “The Data-Centric Revolution” will be out this summer.  I will also be presenting at Dataversity’s Data Architecture Summit coming up in a few months.  Both exercises reminded me that Data-Centric is not a simple technology upgrade.  It’s going to take a great deal more to shift the status quo.

Let’s start with Lawyers, Guns and Money, and then see what else we need.

A quick recap for those who just dropped in: The Data-Centric Revolution is the recognition that maintaining the status quo on enterprise information system implementation is a tragic downward spiral.  Almost every ERP, Legacy Modernization, MDM, or you name it project is coming in at ever higher costs and making the overall situation worse.

We call the status quo the “application-centric quagmire.”  The application-centric aspect stems from the observation that many business problems turn into IT projects, most of which end up building, buying, or renting (as Software as a Service) a new application system.  Each new application system comes with its own, arbitrarily different data model, which adds to the pile of existing application data models, further compounding the complexity, upping the integration tax, and inadvertently entrenching the legacy systems.

The alternative we call “data-centric.”  It is not a technology fix.  It is not something you can buy.  We hope for this reason that it will avoid the fate of the Gartner hype cycle.  It is a discipline and culture issue.  We call it a revolution because it is not something you add to your existing environment; it is something you do with the intention of gradually replacing your existing environment (recognizing that this will take time).

Seems like most good revolutions would benefit from the Warren Zevon refrain: “Send lawyers, guns, and money.”  Let’s look at how this will play out in the data-centric revolution.

Click here to read more on TDAN.com

The 1st Annual Data-Centric Architecture Forum: Re-Cap

In the past few weeks, Semantic Arts hosted a new Data-Centric Architecture Forum.  One of the conclusions reached by the participants was that it wasn’t like a traditional conference.  This wasn’t marching from room to room to sit through another talking-head, PowerPoint-led presentation. There were a few PowerPoint slides that served as anchors, but it was much more a continual co-creation of a shared artifact.

The consensus was:

  • Yes, let’s do it again next year.
  • Let’s call it a forum, rather than a conference.
  • Let’s focus on implementation next year.
  • Let’s make it a bit more vendor-friendly next year.

So retrospectively, last week was the first annual Data-Centric Architecture Forum.

What follows are my notes and conclusions from the forum.

Shared DCA Vision

I think we came away with a great deal of commonality and more specifics on what a DCA needs to look like and what it needs to consist of. The straw-man (see appendix A) came through with just a few revisions (coming soon).  More importantly, it grounded everyone on what was needed and gave a common vocabulary about the pieces.

Uniqueness

With all the brain power in the room, and given that people have been looking for this for a while, I think that if anyone had known of a platform or set of tools that provided all of this out of the box, they would have said so once we had described what such a solution entailed.

I think we have outlined a platform that does not yet exist and needs to.  With a bit of perseverance, next year we may have a few partial (maybe even more than partial) implementations.

Completeness

After working through this for 2 ½ days, I think if there were anything major missing, we would have caught it.  Therefore, this seems to be a pretty complete stack. All the components, and at least a first cut as to how they are related, seem to be in place.

Doable-ness

While there are a lot of parts in the architecture, most of the people in the room thought that most of the parts were well-known and doable.

This isn’t a DARPA challenge to design some state-of-the-art thing; this is more a matter of putting together pieces that we already understand.

Vision v. Reference Architecture

As noted right at the end, this is a vision for an architecture, not a specific architecture or a reference architecture.

Notes From Specific Sessions

DCA Strawman

Most of this was already covered above.  I think we eventually suggested that “Analytics” might deserve its own layer.  You could say that analytics is a “behavior,” but that seems to be burying the lead.

I also thought it might be helpful to list some of the specific key APIs that are suggested by the architecture. It also looks like we need to split the MDM style of identity management from user identity management, both for clarity and for positioning in the stack.

State of the Industry

There is a strong case to be made that knowledge graph-driven enterprises are eating the economy.  Part of this may be because network-effect companies are sympathetic to network data structures.  But we think the case can be made that the flexibility inherent in KGs applies to companies in any industry.

According to research that Alan provided, the average enterprise now runs 1,100 different SaaS services.  This is fragmenting the data landscape even faster than legacy systems did.

Business Case

A lot of the resistance isn’t technical, but instead tribal.

Even within the AI community there are tribes with little cross-fertilization:

  • Symbolists
  • Bayesians
  • Statisticians
  • Connectionists
  • Evolutionaries
  • Analogizers

On the integration front, the tribes are:

  • Relational DB Linkers
  • Application-Centric ESB Advocates
  • Application-Centric RESTful developers
  • Data-centric Knowledge Graphers

Click here to read more on TDAN.com
