DCA Forum Recap: Forrest Hare, Summit Knowledge Solutions

A knowledge model for explainable military AI

Forrest Hare, Founder of Summit Knowledge Solutions, is a retired US Air Force targeting and information operations officer who now works with the Defense Intelligence Agency (DIA). His experience includes integrating intelligence from many types of sources (communications, signals, imagery, open source, telemetry, and others) into a cohesive and actionable whole.

Hare became aware of semantic technology while at SAIC and is currently focused on building a space-and-time ontology called the DIA Knowledge Model, so that Defense Department intelligence organizations can use it to contextualize these multi-source inputs.

The question becomes, how do you bring objects that don’t move and objects that do move into the same information frame with a unified context? The information is currently organized by collectors and producers.

The object-based intelligence that does exist involves things that don’t move at all. Facilities, for example, or humans using phones that are present on a communications network, are more or less static. But what about the things in between, such as trucks that are only intermittently present?

Only sparse information is available about these. How do you know the truck that was there yesterday in an image is the same truck that is there today? Not to mention that the hostile forces who own the truck have a strong incentive to hide it.

Objects in object-based intelligence not only include these kinds of assets, but also events and locations that you want to collect information about. In an entity-relationship sense, objects are entities.

Hare’s DIA Knowledge Model uses the ISO-standard Basic Formal Ontology (BFO) to unify domains so that the information from different sources is logically connected and therefore makes sense as part of a larger whole. BFO’s maintainers (Barry Smith, director of the National Center for Ontological Research (NCOR) at the University at Buffalo, and his team) keep the ontology strictly limited to 30 or so classes.

The spatial-temporal regions of the Knowledge Model are what’s essential to do the kinds of dynamic, unfolding object tracking that’s been missing from object-based intelligence. Hare gave the example of a “site” (an immaterial entity) from a BFO perspective. A strict geolocational definition of “site” makes it possible for both humans and machines to make sense of the data about sites. Otherwise, Hare says, “The computer has no idea how to understand what’s in our databases, and that’s why it’s a dumpster fire.”

This kind of mutual human and machine understanding is a major rationale behind explainable AI. A commander briefed by an intelligence team must know why the team came to the conclusions it did. The stakes are obviously high. “From a national security perspective, it’s extremely important for AI to be explainable,” Hare reminded the audience. Black boxes such as ChatGPT as currently designed can’t effectively answer the commander’s question on how the intel team arrived at the conclusions it did.

Finally, the explainability that knowledge models like the DIA’s provide becomes even more critical as information flows into the Joint Intelligence Operations Center (JIOC). Furthermore, the various branches of the US Armed Forces must supply and continually update a Common Intelligence Picture that’s actionable by the US President, who is the Commander in Chief of the military as a whole.

Without this conceptual and spatial-temporal alignment across all service branches, joint operations can’t proceed as efficiently and effectively as they should.  Certainly, the risk of failure looms much larger as a result.

Contributed by Alan Morrison

DCA Forum Recap: Jans Aasman, Franz

How a “user” knowledge graph can help change data culture

Identity and Access Management (IAM) has had the same problem since Fernando Corbató of MIT first dreamed up the idea of digital passwords in 1960: opacity. Identity in the physical world is rich and well-articulated, with a wealth of different ways to verify information on individual humans and devices. By contrast, the digital realm has been identity-data impoverished, cryptic, and inflexible for over 60 years now.

Jans Aasman, CEO of Franz, provider of the entity-event knowledge graph solution AllegroGraph, envisions a “user” knowledge graph as a flexible and more manageable data-centric solution to the IAM challenge. He presented on the topic at this past summer’s Data-Centric Architecture Forum, which Semantic Arts hosted near its headquarters in Fort Collins, Colorado.

Consider the specificity of a semantic graph and how it could facilitate secure access control. Knowledge graphs constructed of subject-predicate-object triples make it possible to set rules and filters in an articulated and yet straightforward manner.  Information about individuals that’s been collected for other HR purposes could enable this more precise filtering.

For example, Jans could disallow others’ access to a triple that connects “Jans” and “salary”. Or he could disallow access to certain predicates.
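To make the idea concrete, here is a minimal sketch in RDF Turtle of the kind of triple such a rule would protect. The ex: names and the salary figure are purely illustrative and are not Franz’s actual schema; the point is that the sensitive fact is a single subject-predicate-object statement, so a filter can target either that exact triple or every triple that uses the salary predicate.

@prefix ex:  <https://example.com/hr/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# The sensitive fact is one triple ...
ex:_Jans  ex:salary  "150000"^^xsd:decimal .

# ... so an access rule can be scoped to this exact triple
# (subject ex:_Jans, predicate ex:salary), or more broadly to
# any triple whose predicate is ex:salary.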

Identity and access management vendors call this method Attribute-Based Access Control (ABAC). Attributes include many different characteristics of users and what they interact with, which is inherently more flexible than role-based access control (RBAC).

Cell-level control is also possible, but as Forrest Hare of Summit Knowledge Solutions points out, such security doesn’t make a lot of sense, given how much meaning is absent in cells controlled in isolation. “What’s the classification of the number 7?” he asked. Without more context, it seems silly to control cells that are just storing numbers or individual letters, for example.

Simplifying identity management with a knowledge graph approach

Graph databases can simplify various aspects of the process of identity management. Let’s take Lightweight Directory Access Protocol, or LDAP, for example.

This vendor-agnostic protocol has been around for 30 years, but it’s still popular with enterprises. It’s a pre-web, post-internet hierarchical directory service and authentication protocol.

“Think of LDAP as a gigantic, virtual telephone book,” suggests access control management vendor Foxpass. Foxpass offers a dashboard-based LDAP management product which it claims is much easier to manage than OpenLDAP.

If companies don’t use LDAP, they likely use Microsoft’s Active Directory instead, a broader, database-oriented identity and access management product that covers many of the same bases. Microsoft bundles AD with its Server and Exchange products, a means of lock-in that has been quite effective. Lock-in, obviously, inhibits innovation in general.

Consider the whole of identity management as it exists today and how limiting it has been. How could enterprises embark on the journey of using a graph database-oriented approach as an alternative to application-centric IAM software? The first step involves the creation of a “user” knowledge graph.

Access control data duplication and fragmentation

Semantic Arts CEO Dave McComb, in his book Software Wasteland, estimated that 90 percent of data is duplicated. Application-centric architectures in use since the days of mainframes have led to user data sprawl. Part of the reason there is so much duplication of user data is that authentication, authorization, and access control (AAA) methods require that more bits of personally identifiable information (PII) be shared with central repositories.

B2C companies are particularly prone to hoovering up these additional bits of PII lately and storing that sensitive info in centralized repositories. Those repositories become one-stop shops for identity thieves. Customers who want to pay online have to enter bank routing numbers and personal account numbers. As a result, there’s even more duplicate PII sprawl.

One of the reasons a “user” knowledge graph (and a knowledge graph enterprise foundation) could be innovative is that enterprises that adopt such an approach can move closer to zero-copy integration architectures. Model-driven development of the type that knowledge graphs enable assumes and encourages shared data and logic.

A “user” graph coupled with project management data could reuse the same enabling entities and relationships repeatedly for different purposes. The model-driven development approach thus incentivizes organic data management.

The challenge of harnessing relationship-rich data

Jans points out that enterprises, for example, run massive email systems that could be tapped to analyze project data for optimization purposes. And disambiguation by unique email address across the enterprise can be a starting point for all sorts of useful applications.

Most enterprises don’t apply unique email address disambiguation, though Franz has a pharma company client that does, an exception that proves the rule. Email remains an untapped resource in many organizations, even though it’s a treasure trove of relationship data.

Problematic data farming realities: A social media example

Relationship data involving humans is sensitive by definition, but the reuse potential of sensitive data is too important to ignore. Organizations do need to interact with individuals online, and vice versa.

Former US Federal Bureau of Investigation (FBI) counterintelligence agent Peter Strzok, appearing on Deadline: White House, an MSNBC program aired in the US on August 16, observed:

“I’ve served I don’t know how many search warrants on Twitter (now known as X) over the years in investigations. We need to put our investigator’s hat on and talk about tradecraft a little bit. Twitter gathers a lot of information. They don’t just have your tweets. They have your draft tweets. In some cases, they have deleted tweets. They have DMs that people have sent you, which are not encrypted. They have your draft DMs, the IP address from which you logged on to the account at the time, sometimes the location at which you accessed the account and other applications that are associated with your Twitter account, amongst other data.”

X and most other social media platforms, not to mention law enforcement agencies such as the FBI, obviously care a whole lot about data. Collecting, saving, and allowing access to data from hundreds of millions of users in such a broad, comprehensive fashion is essential for X. At least from a data utilization perspective, what they’ve done makes sense.

Contrast these social media platforms with the way enterprises collect and handle their own data. That collection and management effort is function- rather than human-centric. With social media, the human is the product.

So why is a social media platform’s culture different? Because with public social media, broad, relationship-rich data sharing had to come first. Users learned first-hand what the privacy tradeoffs were, and that kind of sharing capability was designed into the architecture. The ability to share and reuse social media data for many purposes implies the need to manage the data and its accessibility in an elaborate way. Email, by contrast, is a much older technology that was not originally intended for multi-purpose reuse.

Why can organizations like the FBI successfully serve search warrants on data from data farming companies? Because social media started with a broad data sharing assumption and forced a change in the data sharing culture. Then came adoption. Then law enforcement stepped in and argued effectively for its own access.

Broadly reused and shared, web data about users is clearly more useful than siloed data. Shared data is why X can have the advertising-driven business model it does. One-way social media contracts with users require agreement with provider terms. The users have one choice: Use the platform, or don’t.

The key enterprise opportunity: A zero-copy user PII graph that respects users

It’s clear that enterprises should do more to tap the value of the kinds of user data that email, for example, generates. One way to sidestep the sensitivity issues associated with reusing that sort of data would be to treat the most sensitive user data separately.

Self-sovereign identity (SSI) advocate Phil Windley has pointed out that agent-managed, hashed messaging and decentralized identifiers could make it unnecessary to duplicate identifiers that correlate. If a bartender just needs to confirm that a patron at the bar is old enough to drink, the bartender could just ping the DMV to confirm the fact. The DMV could then ping the user’s phone to verify the patron’s claimed adult status.

Given such a scheme, each user could manage and control their access to their own most sensitive PII. In this scenario, the PII could stay in place, stored, and encrypted on a user’s phone.

Knowledge graphs lend themselves to this less centralized, yet more fine-grained and transparent, approach to data management. By supporting self-sovereign identity and a data-centric architecture, a Chief Data Officer could help the Chief Risk Officer mitigate the enterprise risk associated with the duplication of personally identifiable information—a true win-win.

 

Contributed by Alan Morrison

DCA Forum Recap: US Homeland Security

How US Homeland Security plans to use knowledge graphs in its border patrol efforts

During this summer’s Data Centric Architecture Forum, Ryan Riccucci, Division Chief for U.S. Border Patrol – Tucson (AZ) Sector, and his colleague Eugene Yockey gave a glimpse of the data environment within the US Department of Homeland Security (DHS), as well as how the effort to transform that data environment has been evolving.

The DHS celebrated its 20-year anniversary recently. The Federal department’s data challenges are substantial, considering the need to collect, store, retrieve, and manage information associated with 500,000 border crossings, 160,000 vehicles, and $8 billion in imported goods processed daily by 65,000 personnel.

Riccucci is leading an ontology development effort within the Customs and Border Patrol (CBP) agency and the Department of Homeland Security more generally to support scalable, enterprise-wide data integration and knowledge sharing. It’s significant to note that a Division Chief has tackled the organization’s data integration challenge. Riccucci doesn’t let leading-edge, transformational technology and fundamental data architecture change intimidate him.

Riccucci described a typical use case for the transformed, integrated data sharing environment that DHS and its predecessor organizations have envisioned for decades.

The CBP has various sensor nets that monitor air traffic close to or crossing the borders between Mexico and the US, and Canada and the US. One such challenge on the Mexican border is Fentanyl smuggling into the US via drones. Fentanyl can be 50 times as powerful as morphine. Fentanyl overdoses caused 110,000 deaths in the US in 2022.

On the border with Canada, a major concern is gun smuggling via drone from the US to Canada. Though legal in the US, Glock pistols, for instance, are illegal and in high demand in Canada.

The challenge in either case is to intercept the smugglers retrieving the drug or weapon drops while they are in the act. Drones may only be active for seven to 15 minutes at a time, so the opportunity window to detect and respond effectively is a narrow one.

Field agents ideally need to see real-time, mapped airspace information when a sensor is activated, allowing them to move quickly and directly to the location. Specifics are important; verbally relayed information, by contrast, is often less specific, causing confusion or misunderstanding.

The CBP’s successful proof of concept used basic Resource Description Framework (RDF) triples to capture just this kind of information:

Sensor → Act of sensing → drone (SUAS, SUAV, vehicle, etc.)
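Spelled out as triples, that pattern might look something like the sketch below. The class names, timestamp, and location are invented for illustration and are not the CBP’s actual model.

@prefix ex:  <https://example.com/cbp/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ex:_Sensor42  a  ex:Sensor .

ex:_SensingEvent1017  a  ex:ActOfSensing ;
    ex:performedBy  ex:_Sensor42 ;
    ex:detected     ex:_Drone88 ;        # could equally be a vehicle, etc.
    ex:atDateTime   "2023-06-14T02:17:00Z"^^xsd:dateTime ;
    ex:occursAt     ex:_GridCell31 .

ex:_Drone88  a  ex:SmallUnmannedAircraftSystem .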

In a recent test scenario, CBP collected 17,000 records that met specified time/space requirements for a qualified drone interdiction over a 30-day period.

The overall impression that Riccucci and Yockey conveyed was that DHS has both the budget and the commitment to tackle this and many other use cases using a transformed, data-centric architecture. By capturing information in an interoperable format, the DHS has been apprehending the bad guys with greater frequency and precision.

Contributed by Alan Morrison

Extending an upper-level ontology (like GIST)

Michael Sullivan is a Principal Cloud Solutions Architect at Oracle.  Article reprinted with permission (original is here)

If you have been following my blogs over the past year or so, then you will know I am a big fan of adopting an upper-level ontology to help bootstrap your own bespoke ontology project. Of the available upper-level ontologies I happen to like gist as it embraces a “less is more” philosophy.

Given that this is 3rd party software with its own lifecycle, how does one “merge” such an upper ontology with your own? Like most things in life, there are two primary ways.

CLONE MODEL

This approach is straightforward: simply clone the upper ontology and then modify/extend it directly as if it were your own (being sure to retain any copyright notice). The assumption here is that you will change the “gist” domain into something else like “mydomain”. The benefit is that you don’t have to risk any 3rd party updates affecting your project down the road. The downside is that you lose out on the latest enhancements/improvements over time; if you wish to adopt them, you would have to manually re-factor them into your own ontology.

Because the inventors of gist have many dozens of person-years of hands-on experience developing and implementing ontologies for dozens of enterprise customers, and keep improving gist accordingly, cloning is not an approach I would recommend for most projects.

EXTEND MODEL

Just as when you extend any 3rd party software library you do so in your own namespace, you should also extend an upper-level ontology in your own namespace. This involves just a couple of simple steps:

First, declare your own namespace as an owl ontology, then import the 3rd party upper-level ontology (e.g. gist) into that ontology. Something along the lines of this:

@prefix owl: <http://www.w3.org/2002/07/owl#> .

<https://ont.mydomain.com/core> 
    a owl:Ontology ;
    owl:imports <https://ontologies.semanticarts.com/o/gistCore11.0.0> ;
    .

Second, define your “extended” classes and properties, referencing the appropriate gist classes and properties in your subclass, subproperty, domain, and/or range assertions as needed. A few samples are shown below (where “my” is the prefix for your ontology domain):

@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix gist: <https://ontologies.semanticarts.com/gist/> .
@prefix my:   <https://ont.mydomain.com/core/> .

my:isFriendOf 
     a owl:ObjectProperty ;
     rdfs:domain gist:Person ;
     .
my:Parent 
    a owl:Class ;
    rdfs:subClassOf gist:Person ;
    .
my:firstName 
    a owl:DatatypeProperty ;
    rdfs:subPropertyOf gist:name ;
    .

The above definitions would allow you to update to new versions of the upper-level ontology* without losing any of your extensions. Simple, right?

*When a 3rd party upgrades the upper-level ontology to a new major version — defined as non-backward compatible — you may find changes that need to be made to your extension ontology; as a hypothetical example, if Semantic Arts decided to remove the class gist:Person, the assertions made above would no longer be compatible. Fortunately, when it comes to major updates Semantic Arts has consistently provided a set of migration scripts which assist with updating your extended ontology as well as your instance data. Other 3rd parties may or may not follow suit.

Thanks to Rebecca Younes of Semantic Arts for providing insight and clarity into this.

Knowledge Graph Modeling: Time series micro-pattern using GIST

Michael Sullivan is a Principal Cloud Solutions Architect at Oracle.  Article reprinted with permission (original is here)

For any enterprise, being able to model time series is more than just important; in many cases it is critical. There are many examples, but some trivial ones include “Person is employed By Employer” (Employment date-range), “Business has Business Address” (Established Location date-range), “Manager supervises Member Of Staff” (Supervision date-range), and so on. But many developers who dabble in RDF graph modeling end up scratching their heads: how can one pull that off if one can’t add attributes to an edge? While it is true that one can always model things using either reification or RDF Quads (see my previous blog, semantic rdf properties), now might be a good time to take a step back and explore how the semantic gurus at Semantic Arts have neatly solved time-series modeling, starting with version 11 of gist, their free upper-level ontology (link below).

First a little history.  The core concept of RDF is to “connect” entities via predicates (a.k.a. “triples”) as shown below. Note that either predicate could be inferred from the other, bearing in mind that you need to maintain at least one explicit predicate between the two, as there is no such thing in RDF as a subject without a predicate/object. Querying such data is also super simple.

Typical entity to entity relationships in RDF

So far so good. In fact, this is about as simple as it gets. But what if we wanted to later enrich the above simple semantic relationship with time-series? After all, it is common to want to know WHEN Mark supervised Emma. With out-of-the-box RDF you can’t just hang attributes on the predicates (I’d argue that this simplistic way of thinking is why property graphs tend to be much more comforting to developers). Further, we don’t want to throw out our existing model and go through the onerous task of re-modeling everything in the knowledge graph. Instead, what if we elevated the specific “supervises” relationship between Mark and Emma to become a first-class citizen? What would that look like? I would suggest that a “relation” entity that becomes a placeholder for the “Mark Supervises Emma” relationship would fit the bill. This entity would in turn reference Mark via a “supervision by” predicate while referencing Emma via a “supervision of” predicate.

OK, now that we have a first-class relation entity, we are ready to add additional time attributes (i.e. triples), right? Well, not so fast! The key insight is that in gist, the “actual end date” and “actual start date” predicates as used here specify the precision of the data property (rather than letting the data value specify the precision), which in our particular use case we want to be the overall date, not any specific time. Hence our use of gist:actualStartDate and gist:actualEndDate here instead of something more time-precise.

The rest is straightforward as depicted in the micro-pattern diagram shown immediately below. Note that in this case, BOTH the previous “supervised by” and “supervises” predicates connecting Mark to Emma directly can be — and probably should be — inferred! This will allow time-series to evolve and change over time while enabling queryable (inferred) predicates to always be up-to-date and in-sync. It also means that previous queries using the old model will continue to work. A win-win.
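A minimal Turtle sketch of the micro-pattern, under a few assumptions, might look like this. The ex: names and the supervisionBy/supervisionOf predicates are illustrative; the two date properties are the gist ones discussed above, and the gist prefix IRI should match the version you import.

@prefix ex:   <https://example.com/> .
@prefix gist: <https://ontologies.semanticarts.com/gist/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

ex:_MarkSupervisesEmma  a  ex:Supervision ;
    ex:supervisionBy      ex:_Mark ;
    ex:supervisionOf      ex:_Emma ;
    gist:actualStartDate  "2021-03-01"^^xsd:date ;
    gist:actualEndDate    "2023-06-30"^^xsd:date .

# The direct triples (ex:_Mark ex:supervises ex:_Emma, and its inverse)
# can now be inferred from this relation entity rather than asserted.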

Time series micro-pattern using GIST

A clever ontological detail not shown here: a temporal relation such as “Mark supervises Emma” must be connected via gist:isConnectedTo to a minimum of two objects — this cardinality is defined in the gist ontology itself and is thus inherited. The result is data integrity managed by the semantic database itself! Additionally, you can see the richness of the gist “at date time” data properties most clearly in the expression of the hierarchical model in the latest v11 ontology (see Protégé screenshot below). This allows the modeler to specify the precision of the start and end date times as well as distinguishing something that is “planned” vs. “actual”. Overall a very flexible and extensible upper ontology that will meet most enterprises’ requirements.

"at date time" data property hierarchy as defined in GIST v11

Further, this overall micro-pattern, wherein we elevate relationships to first-class status, is infinitely re-purposable in a whole host of other governance and provenance modeling use-cases that enterprises typically require. I urge you to explore and expand upon this simple yet powerful pattern and leverage it for things other than time-series!

One more thing…

Given that with this micro-pattern we’ve essentially elevated relations to be first class citizens — just like in classic Object Role Modeling (ORM) — we might want to consider also updating the namespaces of the subject/predicate/object domains to better reflect the objects and roles. After all, this type of notation is much more familiar to developers. For example, the common notation object.instance is much more intuitive than owner.instance. As such, I propose that the traditional/generic use of “ex:” as used previously should be replaced with self-descriptive prefixes that can represent both the owner as well as the object type. This is good for readability and is self-documenting. And ultimately doing so may help developers become more comfortable with RDF/SPARQL over time. For example:

  • ex:_MarkSupervisesEmma becomes rel:_MarkSupervisesEmma
  • ex:supervisionBy becomes role:supervisionBy
  • ex:_Mark becomes pers:_Mark

Where:

@prefix rel:  <https://www.example.com/relation/> .
@prefix role: <https://www.example.com/role/> .
@prefix pers: <https://www.example.com/person/> .
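Using the prefixes just declared, the relation instance from the micro-pattern would then read as follows (a sketch, reusing the hypothetical role predicates from earlier):

rel:_MarkSupervisesEmma
    role:supervisionBy  pers:_Mark ;
    role:supervisionOf  pers:_Emma .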


Alan Morrison: Zero-Copy Integration and Radical Simplification

Dave McComb’s book Software Wasteland underscored a fundamental problem: Enterprise software sometimes costs 1,000 times more than it ought to. The poster child for cost overruns highlighted in the book was Healthcare.gov, a public registration system for the US Affordable Care Act, enacted in 2010. By 2018, the US Federal government had spent $2.1 billion to build and implement the system. Most of that money was wasted. The government ended up adopting many of the design principles embodied in an equivalent system called HealthSherpa, which cost $1 million to build and implement.

In an era where the data-centric architecture Semantic Arts advocates should be the norm, application-centric architecture still predominates. But data-centric architecture doesn’t just reduce the cost of applications. It also attacks the data duplication problem attributable to poor software design. This article explores how expensive data duplication has become, and how data-centric, zero-copy integration can put enterprises on a course to simplification.

Data sprawl and storage volumes

In 2021, Seagate became the first company to ship three zettabytes worth of hard disks. It took them 36 years to ship the first zettabyte, six years to ship the second, and only one additional year to ship the third.

The company’s first product, the ST-506, was released in 1980. The ST-506 hard disk, when formatted, stored five megabytes (a megabyte being 1000² bytes). By comparison, an IBM RAMAC 305, introduced in 1956, stored five to ten megabytes. The RAMAC 305 weighed 10 US tons (the equivalent of nine metric tonnes). By contrast, the Seagate ST-506, 24 years later, weighed five US pounds (or 2.27 kilograms).

A zettabyte is the equivalent of 7.3 trillion MP3 files or 30 billion 4K movies, according to Seagate. When considering zettabytes:

  • 1 zettabyte equals 1,000 exabytes.
  • 1 exabyte equals 1,000 petabytes.
  • 1 petabyte equals 1,000 terabytes.

IDC predicts that the world will generate 178 zettabytes of data by 2025. At that pace, “The Yottabyte Era” would succeed The Zettabyte Era by 2030, if not earlier.

The cost of copying

The question becomes, how much of the data generated will be “disposable” or unnecessary data? In other words, how much data do we actually need to generate, and how much do we really need to store? Aren’t we wasting energy and other resources by storing more than we need to?

Let’s put it this way: If we didn’t have to duplicate any data whatsoever, the world would only have to generate 11 percent of the data it currently does. In 2021 terms, we’d only need to generate 8.7 zettabytes of data, compared with the 78 zettabytes we actually generated worldwide over the course of that year.

Moreover, Statista estimates that the ratio of unique to replicated data stored worldwide will decline to 1:10 from 1:9 by 2024. In other words, the trend is toward more duplication, rather than less.

The cost of storing oodles of data is substantial. Computer hardware guru Nick Evanson, quoted by Gerry McGovern in CMSwire, estimated in 2020 that storing two yottabytes would cost $58 trillion. If the cost per byte stored stayed constant, 40 percent of the world’s economic output would be consumed in 2035 by just storing data.

Clearly, we should be incentivizing what graph platform Cinchy calls “zero-copy integration”–a way of radically reducing unnecessary data duplication; the one thing we don’t have is “zero-cost” storage. More on the solution side and zero-copy integration later. But first, let’s finish the cost story.

The cost of training and inferencing large language models

Model development and usage expenses are just as concerning. The cost of training machines to learn with the help of curated datasets is one thing, but the cost of inferencing–the use of the resulting model to make predictions using live data–is another. 

“Machine learning is on track to consume all the energy being supplied, a model that is costly, inefficient, and unsustainable,” Brian Bailey in Semiconductor Engineering pointed out in 2022. AI model training expense has increased with the size of the datasets used, but more importantly, as the number of parameters increases by a factor of four, the amount of energy consumed in the process increases by a factor of 18,000. Some AI models included as many as 150 billion parameters in 2022. Training for the more recent ChatGPT LLM involved 180 billion parameters. Training can often be a continuous activity to keep models up to date.

But the applied-model aspect of inferencing can be enormously costly. Consider the AI functions in self-driving cars, for example. Major car makers sell millions of cars a year, and each one they sell uses the same carmaker’s model in a unique way. As much as 70 percent of the energy consumed in self-driving car applications could be due to inference, says Godwin Maben, a scientist at electronic design automation (EDA) provider Synopsys.

Data Quality by Design

Transfer learning is a machine learning term that refers to how machines can be taught to generalize better. It’s a form of knowledge transfer. Semantic knowledge graphs can be a valuable means of knowledge transfer because they describe contexts and causality well with the help of relationships. 

Well-described knowledge graphs provide the context in contextual computing. Contextual computing, according to the US Defense Advanced Research Projects Agency (DARPA), is essential to artificial general intelligence.

A substantial percentage of the training set data used in large language models is more or less duplicate data, precisely because poorly described context leads to a lack of generalization ability. That is a major reason why the only AI we have is narrow AI, and why large language models are so inefficient.

But what about the storage cost problem associated with data duplication? Knowledge graphs can help with that problem also, by serving as a means for logic sharing. As Dave has pointed out, knowledge graphs facilitate model-driven development when applications are written to use the description or relationship logic the graph describes. Ontologies provide the logical connections that allow reuse and thereby reduce the need for duplication.

FAIR data and Zero-Copy Integration

How do you get others who are concerned about data duplication on board with semantics and knowledge graphs? By encouraging data and coding discipline that’s guided by FAIR principles. As Dave pointed out in a December 2022 blog post, semantic graphs and FAIR principles go hand in hand: https://www.semanticarts.com/the-data-centric-revolution-detour-shortcut-to-fair/

Adhering to the FAIR principles, formulated by a group of scientists in 2016, promotes reusability by “enhancing the ability of machines to automatically find and use the data, in addition to supporting its reuse by individuals.”  When it comes to data, FAIR stands for Findable, Accessible, Interoperable, and Reusable. FAIR data is easily found, easily shared, easily reused quality data, in other words. 

FAIR data implies the data quality needed to do zero-copy integration.

Bottom line: When companies move to contextual computing by using knowledge graphs to create FAIR data and do model-driven development, it’s a win-win. More reusable data and logic means less duplication, less energy, less labor waste, and lower cost. The term “zero-copy integration” underscores those benefits.

 Alan Morrison is an independent consultant and freelance writer on data tech and enterprise transformation. He is a contributor to Data Science Central and TechTarget sites with over 35 years of experience as an analyst, researcher, writer, editor and technology trends forecaster, including 20 years in emerging tech R&D at PwC.

A Data Engineer’s Guide to Semantic Modelling

While on her semantic modelling journey and as a Data Engineer herself, Ilaria Maresi encountered a range of challenges. There was not one definitive source where she could quickly look things up: many of the resources were extremely technical and geared towards a more experienced audience, while others were too wishy-washy. Therefore, she decided to compose this 50-page document in which she explains semantic modelling and her most important lessons learned, all in an engaging and down-to-earth writing style.

She starts off with the basics: what is a semantic model and why should you consider building one? Obviously, this is best explained by using a famous rock band as an example. In this way, you learn to draw the basic elements of a semantic model and some fun facts about Led Zeppelin at the same time!

For your model to actually work, it is essential that machines can also understand these fun facts. This might sound challenging if you are not a computer scientist, but this guide will walk you through it step by step – it even has pictures of baby animals! You will learn how to structure your model in Resource Description Framework (RDF) and give it meaning with the vocabulary extension that wins the prize for cutest acronym: Web Ontology Language (OWL).
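To give a flavour of what those basic elements look like in practice, here is a tiny Turtle sketch of the kind of triples the guide builds up; the names below are illustrative and are not taken from the guide itself.

@prefix ex:   <https://example.org/music/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

ex:LedZeppelin  a  ex:RockBand ;
    rdfs:label    "Led Zeppelin" ;
    ex:hasMember  ex:JimmyPage , ex:RobertPlant .

ex:JimmyPage  a  ex:Person ;
    ex:playsInstrument  ex:Guitar .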

All other important aspects of semantic modelling will be discussed. For example, how to make sure we all talk about the same Led Zeppelin by using Uniform Resource Identifiers (URIs). Moreover, you are not the first one thinking and learning about knowledge representation: many domain experts have spent serious time and effort defining the major concepts of their field in what are called ontologies. To prevent you from re-inventing the wheel, we list the most important resources and explain their origin.

Are you a Data Engineer that has just started with semantic modelling? Want to refresh your memory? Maybe you have no experience with semantic modelling yet but feel it might come in handy? Well, this guide is for you!

Click here to access a data engineer’s guide to semantic modelling

Written by Tess Korthout

A Brief Introduction to the gist Semantic Model

Phil Blackwood, Ph.D.

It’s no secret that most companies have silos of data and continue to create new silos.  Data that has the same meaning is often represented hundreds or thousands of different ways as new data models are introduced with every new software application, resulting in a high cost of integration.

By contrast, the data-centric approach starts with the common meaning of the data to address the root cause of data silos:

An enterprise is data-centric to the extent that all application functionality is based on a single, simple, extensible, federate-able data model.

An early step along the way to becoming data-centric is to establish a semantic model of the common concepts used across your business.  This might sound like a huge undertaking, and perhaps it will be if you start from scratch.  A better option is to adopt an existing core semantic model that has been designed for businesses and has a track record of success, such as gist.


Gist is an open source semantic model created by Semantic Arts.  It is the result of more than a decade of refinement based on data-centric projects done with major corporations in a variety of lines of business.  Semantic Arts describes gist as “… designed to have the maximum coverage of typical business ontology concepts with the fewest number of primitives and the least amount of ambiguity.”  The Wikipedia entry for upper ontologies compares gist to other ontologies, and gives a sense of why it is a match for corporate data management.

 

This blog post introduces gist by examining how some of the major Classes and Properties can be used.  We will not go into much detail; just enough to convey the general idea.

Everyone in your company would probably agree that running the business involves products, services, agreements, and events like payments and deliveries.  In turn, agreements and events involve “who, what, where, when, and why”, all of which are included in the gist model.  Gist includes about 150 Classes (types of things), and different parts of the business can often be modeled by adding sub-classes.  Here are a few of the major Classes in gist:

Gist also includes about 100 standard ways things can be related to each other (Object Properties), such as:

  • owns
  • produces
  • governs
  • requires, prevents, or allows
  • based on
  • categorized by
  • part of
  • triggered by
  • occurs at (some place)
  • start time, end time
  • has physical location
  • has party (e.g. party to an agreement)

For example, the data representing a contract between a person and your company might include things like:

In gist, a Contract is a legally binding Agreement, and an Agreement is a Commitment involving two or more parties.  It’s clear and simple.  It’s also expressed in a way that is machine-readable to support automated inferences, Machine Learning, and Artificial Intelligence.

The items and relationships of the contract can be loaded into a knowledge graph, where each “thing” is a node and each relationship is an edge.  Existing data can be mapped to this standard representation to make it possible to view all of your contracts through a single lens of terminology.  The knowledge graph for an individual contract as sketched out above would look like:
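As a rough textual sketch of that graph (the class and property names below approximate the relationships listed above, so check the exact IRIs in the gist release you load, and the ex: instances are invented):

@prefix gist: <https://ontologies.semanticarts.com/gist/> .
@prefix ex:   <https://data.example.com/> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

ex:_Contract_1001  a  gist:Contract ;
    gist:hasParty         ex:_Person_Joan , ex:_OurCompany ;   # who
    gist:isBasedOn        ex:_CatalogItem_42 ;                 # what
    gist:actualStartDate  "2023-01-15"^^xsd:date .             # when

ex:_Person_Joan  a  gist:Person ;
    gist:isIdentifiedBy  ex:_ID_DriversLicense_98765 .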

Note that this example is just a starting point.  In practice, every node in the diagram would have additional properties (arrows out) providing more detail.  For example, the ID would link to a text string and to the party that allocated the ID (e.g. the state government that allocated a driver’s license ID).  The CatalogItem would be a detailed Product or Service Specification.

In the knowledge graph, there would be a single Person entry representing a given individual, and if two entries were later discovered to represent the same person, they could be linked with a sameAs relationship.

Relationships in gist (Properties) are first class citizens that have a meaning independent of the things they link, making them highly re-usable.  For example, identifiedBy is not limited to contracts, but can be used anywhere something has an ID.  Note that the Properties in gist are used to define relationships between instances rather than Classes; there are also a few standard relationships between Classes such as subClassOf and equivalentTo.

The categorizedBy relationship is a powerful one, because it allows the meaning of an item to be specified by linking to a taxonomy rather than by creating new Classes.  This pattern contributes to extensibility; adding new characteristics becomes comparable to adding valid values to a reference list rather than adding new attributes to the model.
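For instance, a sketch of categorizing the contract above by linking it to a taxonomy node rather than minting a new subclass (property and category names again illustrative):

@prefix gist: <https://ontologies.semanticarts.com/gist/> .
@prefix ex:   <https://data.example.com/> .

ex:_Contract_1001  gist:isCategorizedBy  ex:_PremiumTierCategory .

ex:_PremiumTierCategory  a  gist:Category ;
    gist:name  "Premium tier" .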

Unlike traditional data models, the gist semantic model can be loaded into a knowledge graph and then the data is loaded into the same knowledge graph as an extension to the model.  There is no separation between the conceptual, logical, and physical models.  Similar queries can be used to discover the model or to view the data.

Gist uses the W3C OWL standard (Web Ontology Language), and you will need to understand OWL to get the most value out of gist.  To get started with OWL for corporate data management, check out the book Demystifying OWL for the Enterprise, by Michael Uschold.  There’s also a brief introduction to OWL and the way it uses set theory here.

The technology stack that supports OWL is well-established and has minimal vendor lock-in because of the simple standard data representation.  A semantic model created in one knowledge graph (triple store) can generally be ported to another tool without too much trouble.

To explore gist in more detail, you can download an ontology editor such as Protégé and then select File > Open From URL and enter: https://ontologies.semanticarts.com/o/gistCore9.4.0. Once you have the gist model loaded, select Entities and then review the descriptions of Classes, Object Properties (relationships between things), and Data Properties (which point to string or numeric values with no additional properties).  If you want to investigate gist in an orderly sequence, I’d suggest viewing items in groups of “who, what, when, where, and how.”

Take a look at gist.  It’s worth your time, because having a standard set of common terms like gist is a significant step toward reversing the trend toward more and more expensive data silos.

Click here to learn more about gist.

A Mathematician and an Ontologist walk into a bar…

The Ontologist and Mathematician should be able to find common ground because Cantor introduced set theory into the foundation of mathematics, and W3C OWL uses set theory as a foundation for its ontology language.  Let’s listen in as they mash up Cantor and OWL …

Ontologist: What would you like to talk about?

Mathematician: Anything.

Ontologist: Pick a thing. Any. Thing. You. Like.

Mathematician: [looks across the street]


Ontologist: Sure, why not?  Wells Fargo it is.  If we wanted to create an ontology for banking, we might need to have a concept of a company being a bank to differentiate it from other types of companies.  We would also want to generalize a bit and include the concept of Organization.

Mathematician: That’s simple in the world of sets.


Ontologist: In my world, every item in your diagram is related to every other item.  For example, Wells Fargo is not only a Bank, but it is also an Organization.  Relationships to “Thing” are applied automatically by my ontology editor.  When we build our ontology, we would first enter the relationships in the diagram below (read it from the bottom to the top):


Then we would run a reasoner to infer other relationships.  The result would look like this:


Mathematician: My picture has “Banks” and yours has “Bank”.  You took off the “s”.

Ontologist: Well, yes, I changed all the set names to make them singular because that’s the convention for Class names.  Sorry.  But now that you mention it … whenever I create a new Class I use a singular name just like everyone else does, but I also check to see if the plural is a good name for the set of things in the Class.  If the plural doesn’t sound like a set, I rethink it.  Try that with “Tom’s Stamp Collection” and see what you get.

Mathematician: I’d say you would have to rethink that Class name if you wanted the members of the Class to be stamps.  Otherwise, people using your model might not understand your intent.  Is a Class more like a set, or more like a template?

Ontologist: Definitely not a template, unlike object-oriented programming.  More like a set where the membership can change over time.

Mathematician: OK.  S or no S, I think we are mostly talking about the same thing.  In fact, your picture showing the Classes separated out instead of nested reminds me of what Georg Cantor said: “A set is a Many that allows itself to be thought of as a One.”

Ontologist: Yes.  You can think of a Class as a set of real world instances of a concept that is used to describe a subject like Banking.  Typically, we can re-use more general Classes and only need to create a subclass to differentiate its members from the other members of the existing Class (like Bank is a special kind of Company).  We create or re-use a Class when we want to give the Things in it meaning and context by relating them to other things.

Mathematician: Like this?


Ontologist: Exactly.  Now we know more about Joan, and we know more about Wells Fargo.  We call that a triple.

Mathematician: A triple.  How clever.

Ontologist: Actually, that’s the way we store all our data.  The triples form a knowledge graph.

Mathematician: Oh, now that’s interesting …  nice idea. Simple and elegant.  I think I like it.

Ontologist: Good.  Now back to your triple with Joan and Wells Fargo.  How would you generalize it in the world of sets?

Mathematician: Simple.  I call this next diagram a mapping, with Domain defined as the things I’m mapping from and Range defined as the things I’m mapping to.


Ontologist: I call worksFor an Object Property.  For today only, I’m going to shorten that to just “Property”.  But.  Wait, wait, wait.  Domain and Range?


In my world, I need to be careful about what I include in the Domain and Range, because any time I use worksFor, my reasoner will conclude that the thing on the left is in the Domain and the thing on the right is in the Range.

Ontologist continues: Imagine if I set the Domain to Person and the Range to Company, and then assert that Sparkplug the horse worksFor Tom the farmer.  The reasoner will tell me Sparkplug is a Person and Tom is a Company.  That’s why Domain and Range always raise a big CAUTION sign for me.  I always ask myself if there is anything else that might possibly be in the Domain or Range, ever, especially if the Property gets re-used by someone else.  I need to define the Domain and Range broadly enough for future uses so I won’t end up trying to find the Social Security number of a horse.
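Here is roughly what that trap looks like in Turtle, with made-up names; the comments show what the reasoner would conclude.

@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix my:   <https://example.com/ont/> .

my:worksFor  a  owl:ObjectProperty ;
    rdfs:domain  my:Person ;
    rdfs:range   my:Company .

my:_Sparkplug  my:worksFor  my:_Tom .

# With that domain and range, a reasoner infers:
#   my:_Sparkplug  a  my:Person .
#   my:_Tom        a  my:Company .
# ... exactly the unintended conclusion described above.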

Mathematician: Bummer.  Good luck with that.

Ontologist: Oh, thank you.  Now back to your “mapping”.  I suppose you think of it as a set of arrows and you can have subsets of them.

Mathematician: Yes, pretty much.  If I wanted to be more precise, I would say a mapping is a set of ordered pairs.  I’m going to use an arrow to show what order the things are in; and voila, here is your set diagram for the concept:


You will notice that there are two different relationships:


The pair (Joan, Wells Fargo) is in both sets, so it is in both mappings.  Does that make sense to you?

Ontologist: Yes, I think it makes sense.  In my world, if I cared about both of these types of relationships, I would make isAManagerAt a subProperty of worksFor, and enter the assertion that Joan is a manager at Wells Fargo.  My reasoner would add the inferred relationship that Joan worksFor Wells Fargo.
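In Turtle, that arrangement might be sketched like this (names again illustrative, prefixes as before):

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix my:   <https://example.com/ont/> .

my:isAManagerAt  rdfs:subPropertyOf  my:worksFor .

my:_Joan  my:isAManagerAt  my:_WellsFargo .

# The reasoner then adds the inferred triple:
#   my:_Joan  my:worksFor  my:_WellsFargo .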

Mathematician: Of course!  I think I’ve got the basic idea now.  Let me show you what else I can do with sets.  I’ll even throw in some of your terminology.

Ontologist: Oh, by all means. [O is silently thinking, “I bet this is all in OWL, but hey, the OWL specs don’t have pictures of sets.”]

Mathematician: [takes a deep breath so he can go on and on … ]

Let’s start with two sets:


The intersection is a subset of each set, and each of the sets is a subset of the union.  If we want to use the intersection as a Class, we should be able to infer:

And if we want to use the union as a Class, then each original Class is a Sub Class of the union:


If two Classes A and B have no members in common (disjoint), then every Sub Class of A is disjoint from every sub class of B:

A mapping where there is at most one arrow out from each starting point is called a function.

A mapping where there is at most one arrow into each ending point is called inverse-functional.


You get the inverse of a mapping by reversing the direction of all the arrows in it.  As the name implies, if a mapping is inverse-functional, it means the inverse is a function.

Sometimes the inverse mapping ends up looking just like the original (called symmetric), and sometimes it is “totally different” (disjoint or asymmetric).

Sometimes a mapping is transitive, like our diagram of inferences with subClassOf, where a subclass of a subclass is a subclass.  I don’t have a nice simple set diagram for that, but our Class diagram is an easy way to visualize it.  Take two hops using the same relationship and you get another instance of the relationship:


Sets can be defined by combining other sets and mappings, such as the set of all people who work for some bank (any bank).

Ontologist: Not bad.  Here’s what I would add:

Sometimes I define a set by a phrase like you mentioned (worksFor some Bank), and in OWL I can plug that phrase into any expression where a Class name would make sense.  If I want to turn the set into a named Class, I can say the Class is equivalent to the phrase that defines it.  Like this:

BankEmployee is equivalentTo (worksFor some Bank).
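In OWL’s Turtle syntax, that phrase becomes an anonymous restriction class; a minimal sketch, assuming the same hypothetical my: namespace:

@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix my:  <https://example.com/ont/> .

my:BankEmployee  a  owl:Class ;
    owl:equivalentClass [
        a  owl:Restriction ;
        owl:onProperty      my:worksFor ;
        owl:someValuesFrom  my:Bank
    ] .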

The reasoner can often use the phrase to infer things into the Class BankEmployee, or use membership in the Class to infer the conditions in the phrase are true.  A lot of meaning can be added to data this way.  Just as in a dictionary, we define things in terms of other things.

When two Classes are disjoint, it means they have very distinct and separate meanings.  It’s a really good thing, especially at more general levels.  When we record disjointness in the ontology, the reasoner can use it to detect errors.

Whenever I create a Property, I always check to see if it is a function.  If so, I record the fact that it is a function in the ontology because it sharpens the meaning.

We never really talked about Data Properties.  Maybe next time.  They’re for simple attributes like “the building is 5 stories tall”.

A lot of times, a high level Property can be used instead of creating a new subProperty.  Whenever I consider creating a new subProperty, I ask myself if my triples will be just as meaningful if I use the original Property.  A lot of times, the answer is yes and I can keep my model simple by not creating a new Property.

An ontology is defined in terms of sets of things in the real world, but our data base usually does not have a complete set of records for everything defined in the ontology.  So, we should not try to infer too much from the data that is present.  That kind of logic is built in to reasoners.

On the flip side, the data can include multiple instances for the same thing, especially when we are linking multiple data sets together.  We can use the sameAs Property to link records that refer to the same real-world thing, or even to link together independently-created graphs.

The OWL ontology language is explained well at: https://www.w3.org/TR/owl-primer/

However, even if we understand the theory, there are many choices to be made when creating an ontology.  If you are creating an ontology for a business, a great book that covers the practical aspects is Demystifying OWL for the Enterprise by Michael Uschold.

Mathematician: I want the last word.

Ontologist: OK.

Mathematician:

Ontologist: I agree, but that wasn’t a word.  🙂

Mathematician: OK.  I think I’m starting to see what you are doing with ontologies.  Here’s what it looks like to me: since it is based on set logic and triples, the OWL ontology language has a rock-solid foundation.

Written By: Phil Blackwood, Ph.D.

The Data-Centric Hospital

Why software change is not as graceful

The Graceful Adaptation of St Frances Xavier Cabrini Hospital since 1948.


This post has been updated to reflect the current coronavirus crisis. Joe Pine and Kim Korn, authors of Infinite Possibility: Creating Customer Value on the Digital Frontier, say the coronacrisis will change healthcare for the better. Kim points out that, although it is 20 years late, it is good to see healthcare dipping a toe into the 21st century. However, I think we have a long way to go.

Dave McComb, in his book ‘The Data-Centric Revolution: Restoring Sanity to Enterprise Information Systems’, suggests that buildings change more gracefully than software does. He uses Stewart Brand’s model, which shows how buildings can change after they are built.

Graceful Change

I experienced this graceful change during the 19 years that I worked at Cabrini Hospital Malvern. The buildings changed whilst still serving their customers. To outsiders, it may have appeared seamless. To insiders charged with changing the environment, it took constant planning and endless negotiation with internal stakeholders and the surrounding community.

Geoff Fazakerley, the director of buildings, engineering, and diagnostic services orchestrated most of the activities following the guiding principles of the late Sister Irma Lunghi MSC, “In all that we must first keep our eyes on our patients’ needs.”

Geoff took endless rounds with his team to assess where space could be allocated for temporary offices and medical suites. He then negotiated with those who were impacted to move to temporary locations. On several occasions, space for non-patient services had to be found outside the hospital so that patients would not be inconvenienced. The building as it now stands bears testament to how Geoff’s team and the architects managed the graceful change.

Why software change is not as graceful

Most enterprise software changes are difficult. In ‘Software Wasteland: How the Application-Centric Mindset is Hobbling our Enterprises’, McComb explains that focusing on the application as the foundation or starting point, rather than taking a data-centric approach, hobbles agile adoption, extension, and innovation. When the program comes first, it creates a new data structure. With this approach, each business problem requires yet another application system. Each application creates and manages yet another data model. Over time this approach leads to many more applications and many additional complex data models. Rather than improving access to information, this approach steadily erodes it.

McComb says that every year the amount of data available to enterprises doubles, while the ability to effectively use it decreases. Executives lament their legacy systems, but their projects to replace them rarely succeed. This is not a technology problem; it is more about mindset.

From personal experience, we know that some buildings adapt well and some don’t. Those of us that work in large organisations also know that this is the same with enterprise software. However, one difference between buildings and software is important. Buildings are physical. They are, to use a phrase, ‘set in stone’. They are situated on a physical site. You either have sufficient space or you need to acquire more. You tend to know these things even before adaption and extension begin. With software it is different, there are infinite possibilities.

Software is different

The boundaries cannot be seen. Software is made of bits, not atoms. Hence, we experience software differently. As Joe Pine and Kim Korn explain in their book, ‘Infinite Possibility: Creating Customer Value on the Digital Frontier’, software exists beyond the physical limitation of time, space, and matter. With no boundaries, the software provides the opportunity for infinite possibilities. But as James Gilmore says in the foreword of the book, most enterprises treat digital technology as an incremental adjunct to their existing processes. As a result, the experience is far from enriching. Instead of making the real world experience better, software often worsens it. In hospitals, software forces clinicians to take their eyes off the patient to input data into screens and to focus on the content of their screens.

More generally, there appears to be a gap between what the end-users of enterprise software expect, the champions of the software, and the expectations of software vendors themselves. The blog post by Virtual Stacks makes our thinking about software sound like a war of expectations.

The war of expectations

People that sell software, the people that buy software, and the people that eventually use the software view things very differently:

  1. Executives in the C-Suite implement an ERP system to gain operational excellence and cost savings. They often put someone in charge of the project who doesn’t know enough about ERP systems and how to manage the changes that the software demands in work practice.
  2. Buyers of an ERP system expect that it will fulfil their needs straight out-of-the-box. Sellers expect some local re-configuration.
  3. Reconfiguring enterprise software calls for a flexible budget. The budget must provide for consultants that may have to be called in. It needs to provide for additional training and more likely than not major change management initiatives.
  4. End-users have to be provided with training even before the software is launched. This is especially necessary when they have to learn new skills that are not related to their primary tasks. In hospitals, clinicians find that their workload blows out. They see themselves as working for the software rather than the other way around.

The organisational mindset

Shaun Snapp, in his blog post for Brightwork Research, points out that what keeps the application-centric paradigm alive is how IT vendors and services are configured and incentivised. There is a tremendous amount of money to be made when building, implementing and integrating applications in organisations. Or as McComb says, ‘the real problem surrounds business culture’.

Can enterprise software adapt well?

The short answer is yes. However, McComb’s key point is not that some software adapts well and some don’t, it is that legacy software doesn’t. Or as the quotation in Snapp’s post suggests:

‘The zero-legacy startups have a 100:1 cost and flexibility advantage over their established rivals. Witness the speed and agility of Pinterest, Instagram, Facebook and Google. What do they have that their more established competitors don’t? It’s more instructive to ask what they don’t have: They don’t have their information fractured into thousands of silos that must be continually integrated at great cost.’

Snapp goes on to say that ‘in large enterprises, simple changes that any competent developer could make in a week typically take months to implement. Often the change requests get relegated to the “shadow backlog” where they are ignored until the requesting department does the one thing that is guaranteed to make the situation worse: launch another application project.’

Adapting well

Werner Vogels, VP & CTO of Amazon, provides a good example of successful change. Use this link to read the full post.

Perhaps, as Joe Pine and Kim Korn say, the coronacrisis will change healthcare for the better. I have written about using a data-centric model to query hospital ERP systems to track and trace materials. My eBook, ‘Hidden Hospital Hazards: Saving Lives and Improving Margins’, can be purchased from Amazon, or you may obtain a free PDF version here.

WRITTEN BY
Len Kennedy
Author of ‘Hidden Hospital Hazards’
