White Paper: The Value of Using Knowledge Graphs in Some Common Use Cases

We’ve been asked to comment on the applicability of Knowledge Graphs and Semantic Technology in service of a couple of common use cases.  We will draw on our own experience with client projects as well as some examples we have come across through networking with our peers.

The two use cases are:

  • Customer 360 View
  • Compliance

We’ll organize this with a brief review of why these two use cases are difficult for traditional technologies, then a very brief summary of some of the capabilities that these new technologies bring to bear, and finally a discussion of some case studies that have successfully used graph and semantic technology to address these areas.

Why is This Hard?

In general, traditional technologies encourage complexity, and they encourage it through the ad-hoc introduction of new data structures.  When you are solving the immediate problem at hand, introducing a new data structure (a new set of tables, a new JSON data structure, a new message, a new API, whatever) seems expedient.  What is rarely noticed is the accumulated effect of many, many small decisions taken this way.  We were at a healthcare client who admitted (they were almost bragging about it) that they had patient data in 4,000 tables across their various systems.  This pretty much guarantees you have no hope of getting a complete picture of a patient’s health and circumstances. No human could write a 4,000-table join, and no system could process it even if it could be written.

This shows up everywhere we look.  Every enterprise application we have looked at in detail is 10-100 times more complex than it needs to be to solve the problem at hand.  Systems of systems (that is, the sum total of the thousands of application systems managed by a firm) are 100-10,000 times more complex than they need to be.  This complexity shows up for users who have to consume information (so many systems to interrogate, each arbitrarily different) and for developers and integrators who fight a rearguard action to keep the whole at least partially integrated.

Two other factors contribute to the problem:

  • Acquisition – acquiring new companies inevitably brings another ecosystem of applications that must be dealt with.
  • Unstructured information – a vast amount of important information is still represented in unstructured (text) or semi-structured forms (XML, JSON, HTML). Up until now it has been virtually impossible to meaningfully combine this knowledge with the structured information businesses run on.

Let’s look at how these play out in the customer 360 view and compliance.

Customer 360

Eventually, most firms decide that it would be of great strategic value to provide a view of everything that is known about their customers. There are several reasons this is harder than it looks.  We summarize a few here:

  • Customer data is all over the place. Every system that places an order, or provides service, has its own, often locally persisted set of data about “customers.”
  • Customer data is multi-formatted. Email and customer support calls represent some of the richest interactions most companies have with their clients; however, these companies find data from such calls difficult to combine with the transactional data about customers.
  • Customers are identified differently in different systems. Every system that deals with customers assigns them some sort of customer ID. Some of the systems share these identifiers.  Many do not.  Eventually someone proposes a “universal identifier” so that each customer has exactly one ID.  This almost never works.  In 40 years of consulting I’ve never seen one of these projects succeed.  It is too easy to underestimate how hard it will be to change all the legacy systems that are maintaining customer data.  And as the next bullet suggests, it may not be logically possible.
  • The very concept of “customer” varies widely from system to system. In some systems the customer is an individual contact; in others, a firm; in another, a role; in yet another, a household. For some it is a bank account (I know how weird that sounds, but we’ve seen it).
  • Each system needs to keep different data about customers in order to achieve their specific function. Centralizing this puts a burden of gathering a great deal of data at customer on-boarding time that may not be used by anyone.

Compliance

The primary reason that compliance related systems are complex is that what you are complying with is a vast network of laws and regulations written exclusively in text and spanning a vast array of overlapping jurisdictions.  These laws and regulations are changing constantly and are always being re-interpreted through findings, audits, and court cases.

The general approach is to carve off some small scope, read up as much as you can, and build bespoke systems to support it. The first difficulty is that there are humans in the loop throughout the process.  All documents need to be interpreted, and for that interpretation to be operationalized it generally has to be through a hand-crafted system.

A Brief Word on Knowledge Graphs and Semantic Technology

Knowledge Graphs and Graph Databases have gained a lot of mind share recently as it has become known that most of the very valuable digital native firms have a knowledge graph at their core:

  • Google – the Google Knowledge Graph is what has made their answering capability so much better than the keyword search that launched their first offering. It also powers their targeted ad placement.
  • LinkedIn, Facebook, Twitter – all are able to scale and flex because they are built on graph databases.
  • Most Large Financial Institutions – almost all major financial institutions have some form of Knowledge Graph or Graph Database initiative in the works.

Graph Databases

A graph database expresses all its information in a single, simple relationship structure: two “nodes” are connected by an “edge.”

A node is some identifiable thing.  It could be a person or a place or an email or a transaction.  An “edge” is the relationship between two nodes.  It could represent where someone lives, that they sent or received an email, or that they were a party to a transaction.

A graph database does not need to have the equivalent of a relational table structure set up before any data can be stored, and you don’t need to know the whole structure of the database and all its metadata to use a graph database.  You can just add new edges and nodes to existing nodes as soon as you discover them.  The network (the graph) grows organically.

The most common use cases for graph databases are analytic.  There is a whole class of analytics that makes use of network properties (e.g., how closely x is connected to y, or what the shortest route is from a to b).
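
To make the node/edge idea concrete, here is a minimal sketch using the Python rdflib library. The entities and relationship names (ex:sent, ex:partyTo, and so on) are hypothetical, invented purely for illustration; the point is that edges can be added as they are discovered, with no up-front schema, and then queried as a network.

# Minimal sketch of nodes and edges in a graph, using the rdflib library.
# All names under the "ex:" namespace are hypothetical.
from rdflib import Graph, Namespace

EX = Namespace("http://example.com/")
g = Graph()
g.bind("ex", EX)

# Add edges (triples) as soon as we discover them -- no table design required.
g.add((EX.Alice, EX.livesIn, EX.Denver))
g.add((EX.Alice, EX.sent, EX.Email42))
g.add((EX.Email42, EX.receivedBy, EX.Bob))
g.add((EX.Bob, EX.partyTo, EX.Transaction7))

# A simple network question: what can be reached from Alice in one or more hops?
q = """
SELECT ?thing WHERE {
  ex:Alice (ex:livesIn|ex:sent|ex:receivedBy|ex:partyTo)+ ?thing .
}
"""
for row in g.query(q, initNs={"ex": EX}):
    print(row.thing)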

Knowledge Graphs

Most graph databases focus on low-level data: transactions, communications, and the like. If you add a knowledge layer on top of this, most people refer to the result as a knowledge graph.  The domain of medical knowledge (diseases, symptoms, drug/drug interactions, and even the entire human genome) has been converted to knowledge graphs to better understand and explore the interconnected nature of health and disease.

Often the knowledge in a knowledge graph has been harvested from documents and converted to the graph structure.  When you combine a knowledge graph with specific data in a graph database the combination is very powerful.

Semantic Technology

Semantic Technology is the open standards approach to knowledge graphs and graph databases.  (Google, Facebook, LinkedIn and Twitter all started with open source approaches, but have built their own proprietary versions of these technologies.)  For most firms we recommend going with open standards.  There are many open source and vendor supported products at every level of the stack, and a great deal of accumulated knowledge as to how to solve problems with these technologies.

Semantic technologies implement an alphabet soup of standards, including: RDF, RDFS, OWL, SPARQL, SHACL, R2RML, JSON-LD, and PROV-O.  If you’re unfamiliar with these, they sound like a bunch of techno-babble. The rap against semantic technology has been that it is complicated.  It is, especially if you have to embrace and understand it all at once.  But we have been using this technology for almost 20 years and have figured out how to help people adapt by using carefully curated subsets of each of the standards and leading by example, which drastically reduces the learning curve.

While there is still some residual complexity, we think it is well worth the investment in time.  The semantic technologies stack has solved a large number of problems that graph databases and knowledge graphs have to solve on their own, on a piecemeal basis.  Some of these capabilities are:

  • Schema – graph databases and even knowledge graphs have no standard schema, and if you wish to introduce one you have to implement the capability yourself. The semantic technologies have a very rich schema language that allows you to define classes based on what they mean in the real world.  We have found that disciplined use of this formal schema language creates enterprise models that are understandable, simple, and yet cover all the requisite detail.
  • Global Identifiers – semantic technology uses URIs (the Unicode version of which is called an IRI) to identify all nodes and edges. A URI looks a lot like a URL, and best practice is to build them based on a domain name you own.  It is these global identifiers that allow the graphs to “self-assemble” (there is no writing of joins in semantic technology; the data is already joined by the system).
  • Identity Management – semantic technology has several approaches that make it manageable to live with the fact that you have assigned multiple identifiers to the same person or product or place. One of the main ones is called “sameAs”; it allows the system to know that ‘n’ different URIs (which were produced from data in ‘n’ different systems, with ‘n’ different local IDs) all represent the same real-world item, and all information attached to any of those URIs is available to all consumers of the data (subject to security, of course).  A small sketch of this appears just after this list.
  • Resource Resolution – some systems have globally unique identifiers (you’ve seen those long strings of numbers and letters that come with software licenses, and the like), but these are not very useful unless you have a special means for finding out what any of them are or mean. Because semantic technology best practice says to base your URIs on a domain name that you own, you have the option of providing a means for people to find out what the URI “means” and what it is connected to.
  • Inference – with semantic technology you do not have to express everything explicitly as you do in traditional systems. A great deal of information can be inferred from the formal definitions in the semantic schema combined with the detailed data assertions.
  • Constraint Management – most graph databases and knowledge graphs were not built for online interactive end user update access. Because of their flexibility it is hard to enforce integrity management. Semantic technology has a model driven constraint manager that can ensure the integrity of a database is maintained.
  • Provenance – one key use case in semantic technology is combining data from many different sources. This creates a new requirement: when looking at data that has come from many sources, you often need to know where a particular bit of data came from.  Semantic technologies have solved this in a general way that can go down to individual data assertions.
  • Relational and Big Data Integration – you won’t be storing all of your data in a graph database (semantic, or otherwise). Often you will want to combine data in your graph with data in your existing systems.  Semantic technology has provided standards, and there are vendors that have implemented these standards, such that you can write a query that combines information in the graph with that in a relational database or a big data store.
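
Here is the identity-management capability in miniature, using rdflib plus the owlrl rule engine (an open source OWL RL reasoner). The two “system” namespaces, the local IDs, and the properties are hypothetical; the point is that once sameAs is asserted and inference is run, data attached to either identifier is visible under both.

# A minimal sketch of owl:sameAs identity management, using rdflib and owlrl.
# The crm: and bill: namespaces and their properties are hypothetical.
from rdflib import Graph
from owlrl import DeductiveClosure, OWLRL_Semantics

data = """
@prefix crm:  <http://example.com/crm/> .
@prefix bill: <http://example.com/billing/> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .

# The same real-world customer, loaded from two systems with two local IDs.
crm:Customer_1234  crm:hasEmail    "jane@example.com" .
bill:Party_98765   bill:hasBalance 120.50 .

# The identity assertion that stitches the graph together.
crm:Customer_1234  owl:sameAs  bill:Party_98765 .
"""

g = Graph()
g.parse(data=data, format="turtle")

# Apply the OWL RL rules; sameAs propagates data across the two identifiers.
DeductiveClosure(OWLRL_Semantics).expand(g)

for p, o in g.query("""
    PREFIX crm: <http://example.com/crm/>
    SELECT ?p ?o WHERE { crm:Customer_1234 ?p ?o . }
"""):
    print(p, o)   # now includes bill:hasBalance 120.50 as well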

It is hard to cover a topic as broad as this in a page, but hopefully this establishes some of what the approach provides.

Applying Graph Technology

So how do these technologies bring their capabilities to bear on some common business problems?

Customer 360

We worked with a bank that was migrating to the cloud.  As part of the migration they wanted to unify their view of their customers.  They brought together a task force from all the divisions to create a single definition of a customer.  This was essentially an impossible task.  For some divisions (Investment Banking) a customer was a company; for others (Credit Card processing) it was usually a person.  Not only were there differences in type, all the data that they wanted and were required to have in these different contexts was different.  Further, one group (Corporate) espoused a very broad definition of customer that included anyone they could potentially contact.  Needless to say, the “Know Your Customer” group couldn’t abide this definition, since every new customer obligates them to perform a prescribed set of activities.

What we have discovered time and again is that if you start with a term (say, “Customer”) and try to define it, you will be deeply disappointed.  On the other hand, if you start with formal definitions (one of which, for “Customer,” might be “a Person who is an owner or beneficiary on a financial account,” where “financial account” is itself formally defined), it is not hard to get agreement on what the concept means and what the set of people in this case would be.  From there it is not hard to get to an agreed name for each concept.

In this case we ended up creating a set of formal, semantic definitions for all the customer-related concepts.  At first blush it might sound like we had just capitulated to letting everyone have their own definition of what a “Customer” was.  While there are multiple definitions of “Customer” in the model, they are completely integrated in a way that allows any individual to be automatically categorized under multiple definitions of “Customer” simultaneously (which is usually the case).
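
To give a flavor of what such a formal definition looks like, here is a minimal sketch in OWL (via Turtle), classified with the owlrl reasoner. The class and property names are hypothetical stand-ins; the client’s actual model was considerably richer.

# A sketch of one formal definition of "Customer," with hypothetical names.
# "FinancialCustomer" is defined as anything that is party to a FinancialAccount.
from rdflib import Graph
from owlrl import DeductiveClosure, OWLRL_Semantics

model_and_data = """
@prefix :    <http://example.com/model/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

:FinancialCustomer  owl:equivalentClass  [
    a                   owl:Restriction ;
    owl:onProperty      :isPartyTo ;
    owl:someValuesFrom  :FinancialAccount
] .

# Detail data: Jane holds a checking account; Joe only a free credit-score account.
:Account_001  a  :FinancialAccount .
:Jane  :isPartyTo  :Account_001 .
:Joe   :isPartyTo  :CreditScoreAccount_77 .
"""

g = Graph()
g.parse(data=model_and_data, format="turtle")
DeductiveClosure(OWLRL_Semantics).expand(g)

# Jane is inferred to be a FinancialCustomer; Joe is not.
print(g.query("PREFIX : <http://example.com/model/> "
              "ASK { :Jane a :FinancialCustomer }").askAnswer)

Because membership is computed from the definition rather than asserted by hand, the same individual can simultaneously satisfy several differently scoped definitions of “Customer” without any conflict.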

The picture shown below, which mercifully omits a lot of the implementation detail, captures the essence of the idea. Each oval represents a definition of “Customer.”

In the lower right is the set of people who have signed up for a free credit rating service.  These are people who have an “Account” (the credit reporting account), but it is an account without financial obligation (there is no balance, you cannot draw against it, etc.).  The Know Your Customer (KYC) requirements only kick in for people with Financial Accounts.  The overlap suggests some people have both financial and non-financial accounts.  The blue star represents a financial customer who also falls under the guidelines of KYC.  Finally, the tall oval at the top represents the set of people and organizations that are not to be customers, the so-called “Sanctions lists.”  You might think that these two ovals should not overlap, but with the sanctions continually changing and our knowledge of customer relations constantly changing, it is quite possible that we discover after the fact that a current customer is on the sanctions list.  We’ve represented this as a brown star that is simultaneously a financial customer and someone who should not be a customer.

We think this approach uniquely deals with the complexity inherent in large companies’ relationships with their customers.

In another engagement we used a similar approach to find customers who were also vendors, which is often of interest, and typically hard to detect consistently.

Compliance

Compliance is also a natural fit for Knowledge Graphs.

Next Angles

Mphasis’ project “Next Angles” converts regulatory text into triples conforming to an ontology, which they can then use to evaluate particular situations (we’ve worked with them in the past on a semantic project).  In this white paper they outline how it has been used to streamline the process of detecting money laundering: http://ceur-ws.org/Vol-1963/paper498.pdf.

Legal and Regulatory Information Provider

Another similar project that we worked on was with a major provider of legal and regulatory information.  The firm ingests several million documents a day, mostly court proceedings but also all changes to laws and regulations.  For many years these documents were tagged by a combination of scripts and offshore human taggers.  Gradually the relevance and accuracy of their tagging began to fall behind that of their rivals.

They employed us to help them develop an ontology and knowledge graph; they employed the firm netOWL to perform the computational linguistics to extract data from documents and conform it to the ontology.  We have heard from third parties that the relevance of their concept-based search is now considerably ahead of their competitors.

They recently contacted us as they are beginning work on a next generation system, one that takes this base question to the next level: Is it possible to infer new information in search by leveraging the knowledge graph they have plus a deeper modeling of meaning?

Investment Bank

We are working in the Legal and Compliance Division of a major investment bank.  Our initial remit was to help with compliance with records retention laws. There is complexity at both ends of this domain.  On one end there are hundreds of jurisdictions promulgating and changing laws and regulations continually.  On the other end are the billions of documents and databases that must be classified consistently before they can be managed properly.

We built a knowledge graph that captured all the contextual information surrounding a document or repository.  This included who authored it, who put it there, what department they were in, what cost code they charged, and so on.  Each bit of this contextual data had associated text available.  We were able to add some simple natural language processing that allowed them to accurately classify about 25% of the data under management.  While 25% is hardly a complete solution, it compares to the ½ of 1% that had been classified correctly up to that point.  Starting from this, they have launched a project with more sophisticated NLP and Machine Learning to create an end user “classification wizard” that can be used by all repository managers.

We have moved on to other related compliance issues, including managing legal holds, operational risk, and a more comprehensive approach to compliance overall.

Summary: Knowledge Graphs & Semantic Technology

Knowledge Graphs and Semantic Technology are the preferred approach to complex business problems, especially those that require the deep integration of information that was previously hard to align, such as customer-related and compliance-related data.


Field Report from the First Annual Data-Centric Architecture Conference

Our Data-Centric Architecture conference a couple of weeks ago was pretty incredible. I don’t think I’ve ever participated in a single intense, productive conversation with 20 people that lasted 2 1/2 days, with hardly a letup. Great energy, very balanced participation.

And I echo Mark Wallace’s succinct summary on LinkedIn.

I think one thing all the participants agreed on was that it wasn’t a conference, or at least not a conference in the usual sense. I think going forward we will call it the Data-centric Architecture Forum. Seems more fitting.

My summary takeaway was:

  1. This is an essential pursuit.
  2. There is nothing that anyone in the group (and this is a group with a lot of coverage) knows of that does what a Data-Centric Architecture has to do, out of the box.
  3. We think we have identified the key components. Some of them are difficult and have many design options that are still open, but no aspect of this is beyond the reach of competent developers, and none of the components are even that big or difficult.
  4. The straw-man held up pretty well and worked as a communication device. We have a few proposed changes.
  5. We all learned a great deal in the process.

A couple of immediate next steps:

  1. Hold the date, and save some money: We’re doing this again next year Feb 3-5, $225 if you register by April 15th: http://dcc.semanticarts.com.
  2. The theme of next year’s forum will be experience reports on attempting to implement portions of the architecture.
  3. We are going to pull together a summary of points made and changes to the straw-man.
  4. I am going to begin in earnest on a book covering the material covered.

Field Report by Dave McComb

Join us next year!

What will we talk about at the Data-Centric Conference?

“The knowledge graph is the only currently implementable and sustainable way for businesses to move to the higher level of integration needed to make data truly useful for a business.”

You may be wondering what some of our Data-Centric Conference panel topics will actually look like, what the discussion will entail. This article from Forbes is an interesting take on knowledge graphs and is just the kind of thing we’ll be discussing at the Data-Centric Conference.

When we ask Siri, Alexa or Google Home a question, we often get alarmingly relevant answers. Why? And more importantly, why don’t we get the same quality of answers and smooth experience in our businesses where the stakes are so much higher?

The answer is that these services are all powered by extensive knowledge graphs that allow the questions to be mapped to an organized set of information that can often provide the answer we want.

Is it impossible for anyone but the big tech companies to organize information and deliver a pleasing experience? In my view, the answer is no. The technology to collect and integrate data so we can know more about our businesses is being delivered in different ways by a number of products. Only a few use constructs similar to a knowledge graph.

But one company I have been studying this year, Cambridge Semantics, stands out because it is focused primarily on solving the problems related to creating knowledge graphs that work in businesses. Cambridge Semantics’ technology is powered by AnzoGraph, its highly scalable graph database, and uses semantic standards, but the most interesting thing to me is how the company has assembled all the elements needed to create a knowledge graph factory, because in business we are going to need many knowledge graphs that can be maintained and evolved in an orderly manner.

Read more here: Is The Enterprise Knowledge Graph Finally Going To Make All Data Usable?

Register for the conference here.

P.S. The Early Bird Special for Data-Centric Conference registration runs out 12/31/18.

 

The Data-Centric Revolution: Implementing a Data-Centric Architecture

Dave McComb returns to The Data Administration Newsletter, this time on roll-your-own data-centric architecture stacks. He introduces what the early adopters of data-centric architectures will need in order to undertake the data-centric revolution and make this necessary transition.

Find his answers in The Data-Centric Revolution: Implementing a Data-Centric Architecture.

Click here to read a free chapter of Dave McComb’s book, “The Data-Centric Revolution”.

The Data-Centric Revolution: Implementing a Data-Centric Architecture

At some point, there will be full stack data-centric architectures available to buy, to use as a service or as an open source project.  At the moment, as far as we know, there isn’t a full stack data-centric architecture available to direct implementation.  What this means is that early adopters will have to roll their own.

This is what the early adopters I’m covering in my next book have done and—I expect for the next year or two at least— what the current crop of early adopters will need to do.

I am writing a book that will describe in much greater detail the considerations that will go into each layer in the architecture.

This paper will outline what needs to be considered to give people an idea of the scope of such an undertaking.  You might have some of these layers already covered.

Simplicity

There are many layers to this architecture, and at first glance it may appear complex.  I think the layers are a pretty good separation of concerns, and rather than adding complexity, I believe they reduce it.

As you review the layers, do so through the prism of the two driving APIs.  There will be more than just these two APIs and we will get into the additional ones, as appropriate, but this is not going to be the usual Swiss army knife of a whole lot of APIs, with each one doing just a little bit.  The APIs are of course RESTful.

The core is composed of two APIs (with our working titles):

  • ExecuteNamedQuery—This API assumes a SPARQL query has been stored in the triple store and given a name. In addition, the query is associated with a set of substitutable parameters.  At run time, the name of the query is forwarded to the server with the parameter names and values.  The back end fetches the query, rewrites it with the parameter values in place, executes it, and returns the results to the client.  Note that if the front end did not know the names of the available queries, it could issue another named query that returns all the available named queries (with their parameters).  Note also that this implies the existence of an API that will get the queries into the database, but we’ll cover that in the appropriate layer when we get to it.
  • DeltaTriples—This API accepts two arrays of triples as its payload. One is the “adds” array, which lists the new triples that the server needs to create, and the other is “deletes,” which lists the triples to be removed.  This puts a burden on the client.  The client will be constructing a UI from the triples it receives in a request, allowing a user to change data interactively, and then evaluating what changed.  This part isn’t as hard as it sounds when you consider that order is unimportant with triples.  There will be quite a lot going on with this API as we descend down the stack, but the essential idea is that this API is the single route through which all updates pass, and it will ultimately result in an ACID-compliant transaction being applied to the triple store.  (A compressed sketch of both APIs follows this list.)
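
The sketch below shows how the two APIs might behave over a single rdflib graph. The naming convention for stored queries, the use of SPARQL initial bindings in place of textual rewriting, and the in-memory store are all working assumptions for illustration, not the actual design described above.

# Illustrative sketch of the two core APIs over an rdflib store.
# The q: namespace, the queryText property, and the binding scheme are assumptions.
from rdflib import Graph, Namespace, URIRef, Literal

Q = Namespace("http://example.com/queries/")
store = Graph()

# A named query stored as data in the graph itself.
store.add((Q.CustomerByEmail, Q.queryText, Literal(
    "SELECT ?c WHERE { ?c <http://example.com/model/hasEmail> ?email . }")))


def execute_named_query(name, params):
    """Fetch the stored SPARQL text by name, bind the parameters, run it."""
    text = store.value(Q[name], Q.queryText)
    bindings = {k: Literal(v) for k, v in params.items()}
    return list(store.query(str(text), initBindings=bindings))


def delta_triples(adds, deletes):
    """Apply one change set: remove the 'deletes,' then insert the 'adds.'"""
    for t in deletes:
        store.remove(t)
    for t in adds:
        store.add(t)
    store.commit()   # a real back end would wrap this in an ACID transaction


# Usage: the client sends only triples to add/delete, or a query name plus values.
jane = URIRef("http://example.com/model/Jane")
email = URIRef("http://example.com/model/hasEmail")
delta_triples(adds=[(jane, email, Literal("jane@example.com"))], deletes=[])
print(execute_named_query("CustomerByEmail", {"email": "jane@example.com"}))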

I’m going to proceed from the bottom (center) of the architecture up, with consideration for how these two key APIs will be influenced by each of the layers.

A graphic that ties this all together appears at the end of this article.

Data Layer

At the center of this architecture is the data.  It would be embarrassing if something else were at the center of the data-centric architecture.  The grapefruit wedges here are each meant to represent a different repository. There will be more than one repository in the architecture.

The darker yellow ones on the right are meant to represent repositories that are more highly curated.  The lighter ones on the left represent those less curated (perhaps data sets retrieved from the web).  The white wedge is a virtual repository.  The architecture knows where the data is but resolves it at query time. Finally, the cross hatching represents provenance data.  In most cases, the provenance data will be in each repository, so this is just a visual clue.

The two primary APIs bottom out here, and become queries and updates.

Federation Layer

One layer up is the ability to federate a query over multiple repositories.  At this time, we do not believe it will be feasible or desirable to spread an update over more than one repository (this would require the semantic equivalent of a two-phase commit).  In most implementations this will be a combination of the native abilities of a triple store, support for standards-based federation, and bespoke capability.  The federation layer will be interpreting the ExecuteNamedQuery requests.
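
For the standards-based part, SPARQL 1.1 federation already lets one query span repositories via the SERVICE keyword. The sketch below, using the SPARQLWrapper library, shows the shape of such a query; both endpoint URLs and the model terms are hypothetical.

# A sketch of standards-based federation with the SPARQL 1.1 SERVICE keyword.
# The endpoint URLs and the ex: model terms are hypothetical.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("http://graph.example.com/curated/sparql")
endpoint.setQuery("""
PREFIX ex: <http://example.com/model/>
SELECT ?customer ?amount WHERE {
  ?customer a ex:FinancialCustomer .
  SERVICE <http://graph.example.com/transactions/sparql> {
    ?txn ex:forCustomer ?customer ;
         ex:amount      ?amount .
  }
}
""")
endpoint.setReturnFormat(JSON)
results = endpoint.query().convert()

# Standard SPARQL JSON results: one row per customer/amount pair.
for row in results["results"]["bindings"]:
    print(row["customer"]["value"], row["amount"]["value"])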

Click here to read more on TDAN.com

Are You Spending Way Too Much on Software?

Alan Morrison, senior research fellow at PwC’s Center for Technology and Innovation, interviews Dave McComb for strategy+business about why IT systems and software continue to cost more, but still under-deliver. McComb argues that legacy processes, excess code, and a mind-set that accepts high price tags as the norm have kept many companies from making the most of their data.

Global spending on enterprise IT could reach US$3.7 trillion in 2018, according to Gartner. The scale of this investment is surprising, given the evolution of the IT sector. Basic computing, storage, and networking have become commodities, and ostensibly cheaper cloud offerings such as infrastructure-as-a-service and software-as-a-service are increasingly well established. Open source software is popular and readily available, and custom app development has become fairly straightforward.

Why, then, do IT costs continue to rise? Longtime IT consultant Dave McComb attributes the growth in spending largely to layers of complexity left over from legacy processes. Redundancy and application code sprawl are rampant in enterprise IT systems. He also points to a myopic view in many organizations that enterprise software is supposed to be expensive because that’s the way it’s always been.

McComb, president of the information systems consultancy Semantic Arts, explores these themes in his new book, Software Wasteland: How the Application-Centric Mindset Is Hobbling Our Enterprises. He has seen firsthand how well-intentioned efforts to collect data and translate it into efficiencies end up at best underdelivering — and at worst perpetuating silos and fragmentation. McComb recently sat down with s+b and described how companies can focus on the standard models that will ultimately create an efficient, integrated foundation for richer analytics.

Click here to read the Question & Answer session.

The gist Namespace Delimiter: Hash to Slash

The change in gist:

We recently changed the namespace for gist:

  • Old: http://ontologies.semanticarts.com/gist#
  • New: http://ontologies.semanticarts.com/gist/

What you need to do:

This change is backwards-incompatible with existing versions of gist. The good news is that the changes needed are straightforward. To migrate to the new gist will require changing all uses of gist URIs to use the new namespace. This will include the following:

  1. any ontology that imports gist
  2. any ontology that does not import gist, but that refers to some gist URIs
  3. any data set of triples that uses gist URIs

For 1 and 2, you need only change the namespace prefix and carry on as usual.  For files of triples that use namespaces, you need to first change the namespaces and then reload the triples into any triple stores the old files were loaded into.  If the triples use prefixed terms, then you need only change the prefixes. If the triples use full URIs, then you will need to do a global replace, swapping out the old namespace for the new one.
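
For the full-URI case, the swap can be scripted. Here is one way to do it with rdflib; the input and output file names are hypothetical, and for prefixed Turtle files editing the @prefix line is all that is needed.

# One way to migrate a file of triples that uses full gist URIs.
# The file names are hypothetical.
from rdflib import Graph, URIRef

OLD = "http://ontologies.semanticarts.com/gist#"
NEW = "http://ontologies.semanticarts.com/gist/"


def migrate(term):
    """Rewrite a single node if it falls in the old gist namespace."""
    if isinstance(term, URIRef) and str(term).startswith(OLD):
        return URIRef(NEW + str(term)[len(OLD):])
    return term


g_in = Graph().parse("my_data.ttl", format="turtle")
g_out = Graph()
for s, p, o in g_in:
    g_out.add((migrate(s), migrate(p), migrate(o)))
g_out.serialize(destination="my_data_migrated.ttl", format="turtle")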

The rationale for making this change:

We think that other ontologists and semantic technologists may be interested in the reasons for this change. To that end, we re-trace the thought process and discussions we had internally as we debated the pros and cons of this change.

There are three key aspects of URIs that we are primarily interested in:

  • Global Uniqueness – the ability of triple stores to self-assemble graphs without resorting to metadata relies on the fact that URIs are globally unique
  • Human readability – we avoid traditional GUIDs because we prefer URIs that humans can read and understand.
  • Resolvability – we are interested in URIs that identify resources that could be located and resolved on the web (subject to security constraints).

The move from hash to slash was motivated by the third concern; the first two are not affected.

In the early days the web was a web of documents.  For efficiency reasons, the standards (including and especially RFC 3986[1]) declared that the hash designated a “same-document reference”; that is, everything after the hash was assumed to be in the document represented by the string up to the hash.  Therefore, the resolution was done in the browser and not on the server. This was a good match for standards, and for small (single-document) ontologies.  As such, for many years, most ontologies used the hash convention, including OWL, RDF, SKOS, VoID, vCard, UMBEL, and GoodRelations.

Anyone with large ontologies or large datasets that were hosted in databases rather than documents adopted the slash convention, including DBpedia, Schema.org, SNOMED, Facebook, FOAF, Freebase, OpenCyc, and the New York Times.

The essential tradeoff concerns resolving the URI.  If you can be reasonably sure that everything you would want to provide to the user at resolution time would be in a relatively small document, then the hash convention is fine.

If you wish your resolution to return additional data that may not be in the original document (say, where-used information that isn’t in the defining document), you need to do the resolution on the server.  Because of the standards, the server does not see anything after the hash, so if you use the hash convention, rather than resolving the URI from the URL address bar, you must programmatically call a server with the URI as an argument in the API call.
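
This client-side behavior is easy to see with any URL-handling library. The short illustration below (not part of the original article’s tooling) shows the fragment being split off before anything is sent to a server.

# The fragment after '#' is separated out client-side and never reaches the server.
from urllib.parse import urldefrag

uri = "http://ontologies.semanticarts.com/gist#Person"
base, fragment = urldefrag(uri)
print(base)       # http://ontologies.semanticarts.com/gist  <- what a server would see
print(fragment)   # Person                                   <- resolved on the client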

With the slash convention you have the choice of putting the URI in the URL bar and getting it resolved, or calling an API similar to the hash option above.

If you commit to API calls then there is a slight advantage to hash as it is slightly easier to parse on the back end.  In our opinion this slight advantage does not compare to the flexibility of being able to resolve through the URL bar as well as still having the option of using an API call for resolution.

The DBpedia SPARQL endpoint (http://dbpedia.org/sparql) has thoughtfully prepopulated 240 of the most common namespaces in their SPARQL editor.  At the time of this writing, 59 of the 240 use the hash delimiter.  Nearly 100 of the namespaces come from DBpedia’s decision to have a different namespace for each language; when these are excluded the slash advantage isn’t nearly as pronounced (90 slashes versus 59 hashes), but slash still predominates.

We are committed to providing, in the future, a resolution service to make it easy to resolve our concepts through a URL address bar.  For the present the slash is just as good for all other purposes.  We have decided to eat the small migration cost now rather than later.

[1] https://www.rfc-editor.org/info/rfc3986

Data-Centric vs. Application-Centric

Dave McComb’s new book “Software Wasteland: How the Application-Centric Mindset is Hobbling our Enterprises” has just been released.

In it, I make the case that the opposite of Data-Centric is Application-Centric, and that our preoccupation with Application-Centric approaches over the last several decades has caused the cost and complexity of our information systems to be at least 10 times what they should be, and in most cases we’ve examined, 100 times what they should be.

This article is a summary of how diametrically opposed these two world views are, and how the application-centric mindset is draining our corporate coffers.

An information system is essentially data and behavior.

On the surface, you wouldn’t think it would make much difference which one you start with, if you need both and they feed off each other.  But it turns out it does make a difference.  A very substantial difference.

What does it do?

The application-centric approach starts with “what does this system need to do?” Often this is framed in terms of business process and/or work flow.  In the days before automation, information systems were work flow systems.  Humans executed tasks or procedures.  Most tasks had prerequisite data input and generated data output.  The classic “input / process / output” mantra described how work was organized.

Information in the pre-computer era was centered around “forms.”  Forms were a way to gather some prerequisite data, which could then be processed.  Sometimes the processing was calculation.  The form might be hours spent and pay rate, and the calculation might be determining gross pay.

These forms also often were filed, and the process might be to retrieve the corresponding form, in the corresponding (paper) file folder and augment it as needed.

While this sounds like ancient history, it persists.  If you’ve been to the doctor recently, you might have noticed that despite decades of “Electronic Medical Records,” the intake is weirdly like it always has been: paper-form based.

This idea that information systems are the automation of manual work flow tasks continues.  In the Financial Service industry, it is called RPA (Robotic Process Automation) despite the fact that there are no robots.  What is being automated are the myriad of tasks that have evolved to keep a Financial Services firm going.

When we automate a task in this way, we buy into a couple of interesting ideas, without necessarily noticing that we have done so.  The first is that automating the task is the main thing.  The second is that the task defines how it would like to see the input and how it will organize the output.  This is why there are so many forms in companies and especially in the government.

The process automation essentially exports the problem of getting the input assembled and organized into the form the process wants.  In far too many cases this falls on the user of the system to input the data, yet again, despite the fact that you know you have told this firm this information dozens of times before.

In the cases where the automation does not rely on a human to recreate the input, something almost as bad is occurring: developers are doing “systems integration” to get the data from wherever it is to the input structures and then aligning the names, codes and categories to satisfy the input requirements.

Most large firms have thousands of these processes.  They have implemented thousands of application systems, each of which automates anywhere between a handful and dozens of these processes.  The “modern” equivalent of the form is the document data structure.  A document data structure is not a document in the same way that Microsoft Word creates a document. Instead, a document data structure is a particular way to organize a semi-structured data structure.  The most popular now is JSON (JavaScript Object Notation).

A typical json document looks like:

{"Patient": {"id": "12345", "meds": ["2345", "3344", "9876"]}}

JSON relies on two primary structures: lists and dictionaries.  Lists are shown inside square brackets (the list following the word “meds” in the above example).  Dictionaries are key/value pairs and sit inside the curly brackets.  In the above, “id” is a key and “12345” is the value, “meds” is a key and the list is the value, and “Patient” is a key and the complex structure (a dictionary that contains both simple values and lists) is the value.  These can be arbitrarily nested.
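
A two-line illustration of that nesting, loading the example document above and pulling values out of the inner dictionary and list:

# Loading the example document and walking its nested structure.
import json

doc = json.loads('{"Patient": {"id": "12345", "meds": ["2345", "3344", "9876"]}}')

print(doc["Patient"]["id"])        # "12345" -- a value in the inner dictionary
print(doc["Patient"]["meds"][0])   # "2345"  -- the first entry in the list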

Squint very closely and you will see the document data structure is our current embodiment of the form.

The important parallels are:

  • The process created the data structure to be convenient to what the process needed to do.
  • There is no mechanism here for coordinating or normalizing these keys and values.

Process-centric is very focused on what something does.  It is all about function.

Click here to continue reading on TDAN.com

A Tale of Two Projects

If someone has a $100 million project, the last thing that would occur to them would be to launch a second project in parallel using different methods to see which method works better. That would seem to be insane, almost asking for the price to be doubled. Besides, most sponsors of projects believe they know the best way to run such a project.

However, setting up and running such a competition would establish once and for all what processes work best for large scale application implementations. There would be some logistical issues to be sure, but well worth it. To the best of my knowledge, though, this hasn’t happened.

Thankfully, the next best thing has happened: we have recently encountered a “natural experiment” in the world of enterprise application development and deployment. We are going to mine this natural experiment for as much as we can.

President Barack Obama signed the Affordable Care Act into law on March 23, 2010. The project to build the federal exchange, Healthcare.gov, was awarded to CGI Federal, a division of the Canadian company CGI, for $93.7 million. I’m always amused at the spurious precision the extra $0.7 million implies. It sort of signals that somebody knows exactly how much this project is going to cost. It is just the end product of some byzantine negotiating process. It was slated to go live in October 2013. (I was blissfully unaware of this for the entire three years the project was in development.)

One day in October 2013, one of my developers came into my office and told me he had just heard of an application system comprising over 500,000,000 lines of code. He couldn’t fathom what you would need 500,000,000 lines of code to do. He was a recent college graduate, had been working for us for several years, and had written a few thousand lines of elegant architectural code. We were running major parts of our company on these few thousand lines of code so he was understandably puzzled at what this could be.

We sat down at my monitor and said, “Let’s see if we can work out what they are doing.”

This was the original, much maligned rollout of Healthcare.gov. We were one of the few that first week who managed to log in and try our luck (99% of the people who tried to access healthcare.gov in its first two weeks were unable to complete a session).

As each screen came up, I’d say “what do you think this screen is doing behind the scenes?” and we would postulate, guess a bit as to what else it might be doing, and jot down notes on the effort to recreate this. For instance, on the screen when we entered our fake address (our first run was aborted when we entered a Colorado address as Colorado was doing a state exchange) we said, “What would it take to write address validation software?” This was easy, as he had just built an address validation routine for our software.

After we completed the very torturous process, we compiled our list of how much code would be needed to recreate something similar. We settled on perhaps tens of thousands of lines of code (if we were especially verbose). But no way in the world was there any evidence in the functionality of the system that there was a need for 500,000,000 lines of code.

Meanwhile news was leaking that the original $93 million project had now ballooned to $500 million.

In the following month, I had a chance encounter with the CEO of Top Coder, a firm that organizes the equivalent of X prizes for difficult computer programming challenges. We discussed Healthcare.gov. My contention was that this was not the half-billion dollar project that it had already become, but was likely closer to the coding challenges that Top Coder specialized in. We agreed that this would make for a good Top Coder project and began looking for a sponsor.

Life imitates art, and shortly after this exchange, we came across HealthSherpa.com. The Health Sherpa User Experience was a joy compared to Healthcare.gov. I was more interested in the small team that had rebuilt the equivalent for a fraction (a tiny fraction) of the cost.

From what I could tell from a few published papers, a small team of three to four in two to three months had built equivalent functionality to that which hundreds of professionals had spent years laboring over. This isn’t exactly equivalent. It was much better in some ways, and fell a bit short in a few others.

In the years since, I’ve used this as a case study of what is possible in the world of enterprise (or larger) applications, and over the course of four years I’ve been tracking both sides of this natural experiment from afar.

I looked on in horror as the train wreck of the early rollout of Healthcare.gov ballooned from $1/2 billion to $1 billion (many firms have declared victory in “fixing” the failed install for a mere incremental $1/2 billion), and more recently to $2.1 billion. By the 2015 enrollment period, Healthcare.gov had adopted the HealthSherpa user experience, which they now call “Marketplace lite.” Meanwhile HealthSherpa persists, having enrolled over 800,000 members, and at times handles 5% of the traffic for the ACA.

The writing of Software Wasteland prompted me to research deeper, in order to crisp up this natural experiment playing out in front of us. I interviewed George Kalogeropoulos, CEO of HealthSherpa, several times in 2017, and have reviewed all the available public documentation for Healthcare.gov and HealthSherpa.

The natural experiment that has played out here is around the hypothesis that there are application development and deployment processes that can change the resource consumption and costs by a factor of 1,000. As with the Korean Peninsula, you can nominate either side to be the control group. In the Korea example, we could say that communism was the control group and market democracy the experiment. The hypothesis would be that the experiment would lead to increased prosperity. Alternatively, you could pose it the other way around: market democracy is the control and dictatorial communism is the experiment that leads to reduced prosperity.

If we say that spending a billion dollars for a simple system is the norm (which it often is these days) then that becomes the control group, and agile development becomes the experiment. The hypothesis is that adopting agile principles can improve productivity by many orders of magnitude. In many settings, the agile solution is not the total solution, but in this one (as we will see), it was sufficient.

This is not an isolated example – it is just one of the best side-by-side comparisons. What follows is more evidence that application development and implementation are far from waste-free.


Do you want to read more? Click here.

Software Wasteland

Software Wasteland: Know what’s causing application development waste so you can turn the tide.

Software Wasteland is the book your Systems Integrator and your Application Software vendor don’t want you to read. Enterprise IT (Information Technology) is a $3.8 trillion per year industry worldwide. Most of it is waste.

We’ve grown used to projects costing tens of millions or even billions of dollars, and routinely running over budget and schedule many times over. These overages in both time and money are almost all wasted resources. However, the waste is hard to see, because it is so marbled through all the products, processes, and guiding principles. That is what this book is about. We must see, understand, and agree about the problem before we can take coordinated action to address it.

Take the dive and check out Software Wasteland here.
