Does your Data Spark Joy? Part 1

Why is Marie Kondo so popular for home organization?

Marie Kondo released her book, “The Life-Changing Magic of Tidying Up,” almost ten years ago and has since gained renown for motivating millions of people to de-clutter their homes, offices, and lives. Some people are literally buried in their possessions, with no clear way to get from room to room.  Others simply struggle to get out the door in the morning because their keys, wallet, and phone play a daily game of hide-and-seek. Whatever the underlying cause of this overwhelm, Marie Kondo offers a simple, clear method for getting stuff under control. Not only that, but she promises that tidying up will clear the spaces in our lives, leaving room for peace and joy.

Why does this method apply to Data-Centric Architecture?

You might be wondering what this has to do with data-centric architecture.  In many ways, the Marie Kondo method extrapolates easily from the realm of physical possessions to virtual things: bits of data, documents, data storage containers, and so on, so it’s not surprising that people have seen parallels between belongings and data.  That said, it’s not enough to say that new applications, storage methods, or business processes will solve the problems of information overload, data silos, or dirty data.  Instead, it’s important to examine your company’s data and the business that data serves.

Overarching Data-Centric Principles

For most businesses and agencies, data is essential to function and is ensconced in legal requirements and data lifecycle policy.  It simply isn’t realistic to say, “Throw it all out!”  Instead, the principles behind acquiring, using, storing, and eventually discarding things must be understood.  And in the virtual space, we can understand “things” to be data, metadata, and systems.

Her Method Starts with “Why?”

In her book, Marie Kondo says, “Before you start, visualize your destination.”  And she expands on this, asking readers to think deeply about the question and visualize the outcome of having a tidy space: “Think in concrete terms so that you can vividly picture what it would be like to live in a clutter-free space.” Our clients will often engage us with some ideal data situation in mind.  It might be expressed in terms of requirements or use cases, but it often has to do with being able to harmonize and align data, do large-scale systems integration, or add more sophisticated querying capabilities to existing databases or tools.  In fact, the first steps of our client engagements have to do with developing these questions into statements of work.

Also, we encourage clients to envision their data and what it can tell them independently of applications, systems, and capabilities precisely to avoid the pitfall of thinking in terms of using new tools to solve undefined problems.  It’s uncanny that this method of interrogation into underlying motivations is common between data-centric development and spark-joy tidying up.

Her Method is About the Psychology of Belongings.

It is important to understand how organizations come to have their data.  In the US Government, entire programs are devoted to managing acquisition. In finance, manufacturing, and other industries, the process of acquiring systems and data is often a business unto itself. It’s not uncommon to hear people working with data to refer to “data silos” when talking about partitioned and disconnected collections of data.  Sometimes this data is shuffled into classified folders and proprietary systems unnecessarily, simply because someone wants to retain control of it. In my work at the Federal Government, I found that the process of determining the system of record to be intensely political and time-consuming.  It’s not a trivial process and not simple, but it is essential to the effort of tidying your data-centric environment.

Sort your Data by Category.

Marie Kondo recommends going categorically for a reason.  In her book, she describes evaluating her belongings by location, drawer by drawer, room by room, and discovering that she was organizing multiple drawers with the same things repeatedly.  She tells us, “The root of the problem lies in the fact that people often store the same type of item in more than one place.  When we tidy each place separately, we fail to see that we’re repeating the same work in many locations and become locked into a vicious circle of tidying.” If this doesn’t sound familiar, you probably aren’t working with data.

For me, this principle became clear when I gathered all my office supplies in one place. I was astounded by the small mountain of binder clips (and Sharpies) that seemed to materialize out of nowhere. I always seem to be looking for them, so I was shocked by how many I had.

I can think of no closer parallel than the proliferation of siloed systems that appear in each department within an agency.  When I worked for a government agency, I was part of a team whose job was to survey the offices and find out who was using flight data.  There were several billion-dollar systems in development and in maintenance that held flight data. Over the course of a few years, I heard quotes for the agency-wide number of flight data systems go from 15 systems, to 20, to 30, and beyond.  It literally became an inside joke with leadership. And at times, we would hear rumors about some small branch office that kept its own Microsoft Access database to track its own data, because it couldn’t get what it needed from the systems of record.  Systems are the binder clips of enterprise data, except that this kind of proliferation is as easy as making a copy. You don’t even need to make a trip to the office supply store to end up with a pile of duplicates.  If you want to understand how much data redundancy you have, search for specific categories of data across all systems.
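The binder-clip audit translates directly to data: gather one category at a time and count how many places it lives. A minimal sketch, with made-up systems and categories purely for illustration:

```python
# Illustrative sketch: given a (hypothetical) inventory mapping each system
# to the categories of data it holds, count how many systems hold each
# category -- a quick, rough measure of redundancy.
from collections import Counter

inventory = {
    "SystemA": {"flight data", "personnel", "maintenance"},
    "SystemB": {"flight data", "weather"},
    "BranchAccessDB": {"flight data", "personnel"},
}

category_counts = Counter(
    category for categories in inventory.values() for category in categories
)

# Categories held by more than one system are candidates for consolidation.
redundant = {c: n for c, n in category_counts.items() if n > 1}
print(sorted(redundant.items()))  # [('flight data', 3), ('personnel', 2)]
```

Anything with a count above one is the start of a conversation about consolidation or a system of record.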

Does it Spark Joy? What does joy mean in the context of systems and data?

How do you know what sparks joy?  First, consider how the principle is applied.  Presumably, you are in your line of business because on some level it brings you joy – joy that derives from fulfilling a purpose.  Remember the first step, understanding why you are embarking on a transformative process, and go back to what you envisioned.  Another way to look at joy is whether your space and the things in it allow that spark to happen.  Ideally, you remove the items from your space that hinder that spark, after acknowledging the lessons they’ve taught you.  Do you feel that spark of joy when you grab your keys in the morning on the way out the door? If you’ve ever tried to find misplaced keys while in a rush, you know the antithesis of joy. Doing the work of creating a space where your keys are easy to find is a way of facilitating joy in your morning routine.

One of the supposed failures of the Marie Kondo method as applied to data clutter is that it is impossible to physically hold, or even look at, every single piece of data in your system.  Again, rely on the principle behind her method: be thorough and aim for an environment that facilitates ease and joy.  Don’t say, “We can’t delete any personnel data!” and quit.  Commit to taking an inventory of your personnel systems and the systems that use personnel data. If that process reveals ten different personnel systems and personnel data scattered across several others, you must take a closer look at your data environment.  At one point in my physical de-cluttering, I found a tin full of paper clips.  I didn’t handle each shiny paper clip individually; rather, I acknowledged that the paper clips had served me back when I still printed documents, and since I no longer had a printer, I tossed them into the recycling bin.

Remember why you’re considering a solution to data problems in the first place and commit to doing the work of determining your real data needs. Purpose is key, because the way data sparks joy is by enabling you to fulfil that purpose.  This can be difficult when the work you do is abstract and removed from easily understood business activity.  However, the critical point in knowing whether the data in front of you serves its purpose in your business is to fully understand your business.

Discard and Delete your Data

Take a wardrobe full of clothing, for example.  Many of Marie Kondo’s clients are surprised when they start organizing their wardrobes: by the amount of clothing that is unserviceable, the number of items that still have tags on them, the number of hand-me-downs or gifts that don’t suit them, and so on. These items are sometimes difficult to discard for several reasons:

  • It’s kept out of obligation to the giver.
  • It cost a lot of money to buy it.
  • It’s still in good repair.
  • It might be the perfect thing to wear at an unspecified event in the future.
  • It reminds you of the lovely event at which you wore it.
  • It reminds you of the person who left it with you.

It may seem far-fetched to apply these reasons to data storage, but a quick glance through failed data projects will show you otherwise.  Consider the proprietary data locked in a system owned by a vendor whose license has lapsed; the system coded in an outdated language, whose expert programmers must be called out of retirement to access it; or the directory of data that doesn’t quite match the fields in your database, but that you requested through a complex data-sharing agreement with another agency.  If you can’t think of an example of a system that has been paid for but never used, just consider that the terms shelfware and vaporware exist. It’s easy to be cynical about data precisely because of the overlaps between why we keep things in our closets and garages and why we keep systems and data in our repositories. When you consider these parallels and understand the principles behind evaluating the items you keep in the hope that they will make your life better, sparking joy becomes easier.

Storage experts are hoarders.

Marie Kondo says you don’t need more storage.  That new cloud service that can take all your old databases and make them accessible is not going to solve your problem. Data storage is expensive, and you do not need a new data storage solution.  What you need is to understand your business processes, the business need for the data you believe you have, and a disposition plan for everything else.

How do you start?

In summary, if you’re looking for smart data-centric solutions to help you manage an overwhelming amount of data, or you’re looking for ways to access your vast stores of data in a way that enables smarter business solutions, your bigger issue might be data hoarding.  Looking at your business needs, closely examining the data you have, and coming up with strategies for aligning your data to a manageable data lifecycle can seem overwhelming.  Using a data-centric approach will bring that dream into focus. Keep an eye out for part two of this series to learn how to get your data to spark joy for you.


The Data-Centric Hospital

Why software change is not as graceful

The graceful adaptation of St Frances Xavier Cabrini Hospital since 1948.


This post has been updated to reflect the current coronavirus crisis. Joe Pine and Kim Korn, authors of ‘Infinite Possibility: Creating Customer Value on the Digital Frontier’, say the coronacrisis will change healthcare for the better. Kim points out that, although it is 20 years late, it is good to see healthcare dipping a toe into the 21st century. However, I think we have a long way to go.

Dave McComb, in his book ‘The Data-Centric Revolution: Restoring Sanity to Enterprise Information Systems’, suggests that buildings change more gracefully than software does. He uses Stewart Brand’s model, which shows how buildings can change after they are built.

Graceful Change

I experienced this graceful change during the 19 years that I worked at Cabrini Hospital Malvern. The buildings changed whilst still serving their customers. To outsiders, it may have appeared seamless. To insiders charged with changing the environment, it took constant planning and endless negotiation with internal stakeholders and the surrounding community.

Geoff Fazakerley, the director of buildings, engineering, and diagnostic services, orchestrated most of the activities following the guiding principle of the late Sister Irma Lunghi MSC: “In all that we do, we must first keep our eyes on our patients’ needs.”

Geoff made endless rounds with his team to assess where space could be allocated for temporary offices and medical suites. He then negotiated with those who were impacted to move to temporary locations. On several occasions, space for non-patient services had to be found outside the hospital so that patients would not be inconvenienced. The building as it now stands bears testament to how Geoff’s team and the architects managed the graceful change.

Why software change is not as graceful

Most enterprise software changes are difficult. In ‘Software Wasteland: How the Application-Centric Mindset is Hobbling our Enterprises’, McComb explains that focusing on the application as the foundation or starting point, instead of taking a data-centric approach, hobbles agile adoption, extension, and innovation. When a program is built first, it creates a new data structure. With this approach, each business problem requires yet another application system, and each application creates and manages yet another data model. Over time, this leads to many more applications and many more complex data models. Rather than improving access to information, this approach steadily erodes it.

McComb says that every year the amount of data available to enterprises doubles, while the ability to use it effectively decreases. Executives lament their legacy systems, but their projects to replace them rarely succeed. This is not a technology problem; it is a mindset problem.

From personal experience, we know that some buildings adapt well and some don’t. Those of us who work in large organisations also know that this is true of enterprise software. However, one difference between buildings and software is important. Buildings are physical. They are, to use a phrase, ‘set in stone’. They are situated on a physical site. You either have sufficient space or you need to acquire more, and you tend to know these things even before adaptation and extension begin. With software it is different: there are infinite possibilities.

Software is different

The boundaries cannot be seen. Software is made of bits, not atoms, so we experience it differently. As Joe Pine and Kim Korn explain in their book, ‘Infinite Possibility: Creating Customer Value on the Digital Frontier’, software exists beyond the physical limitations of time, space, and matter. With no boundaries, software offers infinite possibilities. But as James Gilmore says in the book’s foreword, most enterprises treat digital technology as an incremental adjunct to their existing processes. As a result, the experience is far from enriching. Instead of making the real-world experience better, software often worsens it. In hospitals, software forces clinicians to take their eyes off the patient and focus on the content of their screens while inputting data.

More generally, there appears to be a gap between what end-users of enterprise software expect, what its champions expect, and what the software vendors themselves expect. The blog post by Virtual Stacks makes our thinking about software sound like a war of expectations.

The war of expectations

People that sell software, the people that buy software, and the people that eventually use the software view things very differently:

  1. Executives in the C-Suite implement an ERP system to gain operational excellence and cost savings. They often put someone in charge of the project who doesn’t know enough about ERP systems or how to manage the changes the software demands in work practice.
  2. Buyers of an ERP system expect that it will fulfil their needs straight out-of-the-box. Sellers expect some local re-configuration.
  3. Reconfiguring enterprise software calls for a flexible budget. The budget must provide for consultants who may have to be called in, for additional training, and, more likely than not, for major change management initiatives.
  4. End-users have to be provided with training even before the software is launched. This is especially necessary when they have to learn new skills that are not related to their primary tasks. In hospitals, clinicians find that their workload blows out. They see themselves as working for the software rather than the other way around.

The organisational mindset

Shaun Snapp, in his blog post for Brightwork Research, points out that what keeps the application-centric paradigm alive is how IT vendors and services are configured and incentivised. There is a tremendous amount of money to be made building, implementing, and integrating applications in organisations. Or as McComb says, ‘the real problem surrounds business culture’.

Can enterprise software adapt well?

The short answer is yes. However, McComb’s key point is not that some software adapts well and some doesn’t; it is that legacy software doesn’t. Or as the quotation in Snapp’s post suggests:

‘The zero-legacy startups have a 100:1 cost and flexibility advantage over their established rivals. Witness the speed and agility of Pinterest, Instagram, Facebook and Google. What do they have that their more established competitors don’t? It’s more instructive to ask what they don’t have: They don’t have their information fractured into thousands of silos that must be continually integrated at great cost.’

Snapp goes on to say that ‘in large enterprises, simple changes that any competent developer could make in a week typically take months to implement. Often the change requests get relegated to the “shadow backlog” where they are ignored until the requesting department does the one thing that is guaranteed to make the situation worse: launch another application project.’

Adapting well

Werner Vogels, VP & CTO of Amazon, provides a good example of successful change in his post on the subject.

Perhaps as Joe Pine and Kim Korn say the coronacrisis will change healthcare for the better. I have written about using a data-centric model to query hospital ERP systems to track and trace materials. My eBook, ‘Hidden Hospital Hazards: Saving Lives and Improving Margins’ can be purchased from Amazon, or you may obtain a free PDF version here.

WRITTEN BY
Len Kennedy
Author of ‘Hidden Hospital Hazards’

Structure-First Data Modeling: The Losing Battle of Perfect Descriptions

In my last article I described Meaning-First data modeling. It’s time to dig into its predecessor and antithesis, which I call Structure-First data modeling, specifically looking at how two assumptions drive our actions. Assumptions are quite useful since they leverage experience without having to re-learn what is already known. It is a real time-saver.

Until it isn’t.

For nearly the last half century, the eventual implementation of data management systems has consisted of various incarnations of tables-with-columns and the supporting infrastructure that weaves them into a solution. The brilliant works of Steve Hoberman, Len Silverston, David Hay, and many others in developing data modeling strategies and patterns are notable and admirable. They pushed the art and science of data modeling forward. As strong as those contributions are, they are still description-focused and assume a Structure-First implementation.

Structure-First data modeling is based on two assumptions. The first assumption is that the solution will always be physically articulated in a tables-with-columns structure. The second is that proceeding requires developing complete descriptions of subject matter. This second assumption is also on the path of either/or thinking; either the description is complete, or it is not. If it is not, then tables-with-columns (and a great deal of complexity) are added until it is complete. Our analysis, building on these assumptions, is focused on the table structures and how they are joined to create a complete attribute inventory.

The focus on structure is required because no data can be captured until the descriptive attribute structure exists. This inflexibility makes the system both brittle and complex. All the descriptive attribution stuffed into tables-with-columns is a parts list for the concept, but there is no succinct definition of the whole. These first steps of the data management journey set a course toward complexity, and because they rest on rarely articulated assumptions, the path is never questioned. The complete Structure-First model must accommodate every possible descriptive attribute that could be useful. We have studied E. F. Codd’s levels of data normalization and drive towards structural normalization. As a result, our analysis focuses on avoiding repeating columns, multiple values in a single column, and so on, rather than on what the data means.

Yet with all the attention paid to capturing all the descriptive attributes, new ones constantly appear. We know this is inevitable for any system having even a modest lifespan. For example, thanks to COVID-19, educational institutions that have never offered online courses are suddenly faced with moving exclusively to online offerings, at least temporarily. Buildings and rooms are not relevant for those offerings, but web addresses and enabling software are. Experience demonstrates how costly it is in both time and resources to add a new descriptive attribute after the system has been put into production. Inevitably something needs to be added. This happens either because something was missed or a new requirement was added. It also happens because buried in the long parts list of descriptive attributes, the same thing has been described several times in different ways. The brittle nature of tables-with-columns results in every change requiring very expensive modeling, refactoring, and regression testing to get the change into production.

Neither the tables-with-columns assumption nor the complete-descriptions assumption applies when developing semantic knowledge graph solutions using a Meaning-First data modeling approach.  Why am I convinced Meaning-First will advance the data management discipline? Because Meaning-First is definitional, follows the path of both/and thinking, and rests on a single structure, the triple, for virtually everything. The World Wide Web Consortium (W3C) defined the standard RDF (Resource Description Framework) triple to enable linking data on the open web and in private organizations. The definition, articulated in RDF triples, captures the essence to which new facts are linked. Semantic technologies provide a solid, machine-interpretable definition and the standard RDF triple as the structure. Since there is no need to build new structures, new information can be added instantly: drop it into the database and it links to existing data right away.
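As a rough illustration of why a single triple structure makes additions instant, here is a toy sketch in plain Python. It is not a real triplestore, and the subjects and predicates are invented, but it shows that a brand-new attribute needs no schema change:

```python
# Toy sketch (not a real triplestore): a set of (subject, predicate, object)
# triples is the only structure there is.
triples = {
    ("Course101", "offeredIn", "Building7"),
    ("Course101", "taughtBy", "DrSmith"),
}

# A new requirement appears (online offerings): just add new triples.
# No table alteration, no migration -- the predicate "webAddress" did
# not exist a moment ago.
triples.add(("Course101", "webAddress", "https://example.edu/course101"))

def match(s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return {
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    }

print(match(p="webAddress"))
```

The new fact is queryable the moment it is added, with the same pattern language as every other fact.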

While meaning and structure are separate concepts, we have been conflating them for decades, resulting in unnecessary complexity. Humankind has been formalizing the study of meaning since Aristotle and has been making significant progress along the way. Philosophy’s formal logics are semantics’ Meaning-First cornerstone. Formal logics define the nature of whatever is being studied such that when something matches the formal definition, it can be proved that it is necessarily in the defined set. Semantic technology has enabled machine-readable assembly using formal logics. An example might make it easier to understand.

Consider a requirement to know which teams have won the Super Bowl. How would each approach solve it? The required data is:

  • Super Bowls played
  • Teams that played in each Super Bowl
  • Final scores

Data will need to be acquired in both cases and is virtually the same, so this example skips those mechanics to focus on the differences.

A Structure-First approach might look something like this. First, create a conceptual model with the table structures and their columns to contain all the relevant team, Super Bowl, and score data. Second, create a logical model from the conceptual model that identifies the logical designs that will allow the data to be connected and used. This requires primary and foreign key designs, logical data types and sizes, as well as join structures for assembling data from multiple tables. Third, create a physical model from the logical model to define the storage strategy and incorporate vendor-specific implementation details.
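The three modeling steps above culminate in DDL, and the gate they create can be sketched with Python’s built-in sqlite3 module. The table and column names here are illustrative, not taken from any actual model:

```python
# Sketch of the Structure-First endpoint: the physical model becomes DDL,
# and no data can land until the structure exists.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE team (
        team_id INTEGER PRIMARY KEY,
        name    TEXT NOT NULL
    )
""")
conn.execute("""
    CREATE TABLE super_bowl (
        bowl_id      INTEGER PRIMARY KEY,
        winner_id    INTEGER REFERENCES team(team_id),
        winner_score INTEGER,
        loser_id     INTEGER REFERENCES team(team_id),
        loser_score  INTEGER
    )
""")

# Only now can data be inserted (Super Bowl LVII: Chiefs 38, Eagles 35).
conn.execute("INSERT INTO team VALUES (1, 'Chiefs'), (2, 'Eagles')")
conn.execute("INSERT INTO super_bowl VALUES (57, 1, 38, 2, 35)")

# Getting data back out requires knowing the join structure in advance.
row = conn.execute("""
    SELECT t.name FROM super_bowl sb
    JOIN team t ON t.team_id = sb.winner_id
""").fetchone()
print(row[0])  # Chiefs
```

Note that the SELECT must name the join path (winner_id to team_id); without knowing the structure, the data cannot come back out.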

Only at this point can the data be entered into the Structure-First system. This is because until the structure has been built, there is no place for the data to land. Then, unless you (the human user) know the structure, there is no way to get data back out. However, this isn’t true when using Meaning-First semantic technology.

A Meaning-First approach can start either by acquiring well-formed triples or building the model as the first step. The model can then define the meaning of “Super Bowl winner” as the team with the highest score for each Super Bowl occurrence. Semantic technology captures the meaning using formal logics, and the data that match that meaning self-assemble into the result set. Formal logics can also be used to infer which teams might have won the Super Bowl using the logic “in order to win, the team must have played in the Super Bowl,” and not all NFL teams have.
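A plain-Python analogue of that rule, “the winner is the team with the highest score for each Super Bowl occurrence,” might look like the following. Real semantic systems express the rule in SPARQL or OWL; the data here is just two recent games, for illustration:

```python
# Each record states a fact: (occurrence, team, score). The "winner" set
# is not stored anywhere -- it is derived from the definition.
results = [
    ("SB57", "Chiefs", 38), ("SB57", "Eagles", 35),
    ("SB58", "Chiefs", 25), ("SB58", "49ers", 22),
]

def winners(results):
    """Apply the definition: highest score per Super Bowl occurrence."""
    best = {}
    for bowl, team, score in results:
        if bowl not in best or score > best[bowl][1]:
            best[bowl] = (team, score)
    return {bowl: team for bowl, (team, _) in best.items()}

print(winners(results))  # {'SB57': 'Chiefs', 'SB58': 'Chiefs'}
```

Add a new game’s facts and the winner set updates itself; nothing about the “structure” changes, only the data that matches the definition.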

The key is that in the Meaning-First example, members of the set called Super Bowl winners can be returned without identifying the structure in the request. The Structure-First example required understanding and navigating the structure before even starting to formulate the question. It’s not so hard in this simple example, but in enterprise data systems with hundreds, or more likely thousands, of tables, understanding the structure is extremely challenging.

Semantic Meaning-First databases, known as triplestores, are not a collection of tables-with-columns. They are composed of RDF triples used for both the definitions (the schema, in the form of an ontology) and the content (the data). As a result, you can write queries against an RDF data set you have never seen and get meaningful answers. Queries can return what sets have been defined, and then find where a set is used as the subject or the object of a statement. Semantic queries simply walk across the formal logic that defines the graph, letting the graph itself inform you about possible next steps. This isn’t an option in Structure-First environments, because they are not based in formal logic and the schema is encapsulated in a different language from the data.
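Because the schema and the data share the one triple structure, “what sets exist?” is just another pattern query. A toy sketch, with RDF-style names abbreviated for illustration:

```python
# Toy graph: schema statements and data statements live in the same
# structure, so one pattern language queries both.
graph = {
    ("SuperBowlWinner", "rdf:type", "owl:Class"),
    ("Chiefs", "rdf:type", "SuperBowlWinner"),
    ("Chiefs", "playedIn", "SB57"),
}

# "What sets have been defined?" -- queried with no prior knowledge
# of any structure beyond the triple itself.
classes = {s for (s, p, o) in graph if p == "rdf:type" and o == "owl:Class"}
print(classes)  # {'SuperBowlWinner'}

# "Where is that set used?" -- walk the graph from what the last
# query returned, letting the graph suggest the next step.
members = {s for (s, p, o) in graph if p == "rdf:type" and o in classes}
print(members)  # {'Chiefs'}
```

The second query is built entirely from the first query’s answer, which is the “walking across the graph” the paragraph describes.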

Traditional Structure-First databases are made up of tens to hundreds, often thousands, of tables. Each table is arbitrarily made up and named by the modeler, with the goal of containing all attributes of a specific concept. Within each table are columns that are also made up, again hopefully with rigor, but made up nonetheless. You can prove this to yourself by looking at the lack of standard definitions around simple concepts like address. Some modelers will leverage modeling patterns, some will leverage standards like those of the USPS, but the variability between systems is great and arbitrary.

Semantic technology has enabled the Meaning-First approach with machine-readable definitions to which new attribution can be added in production. At the same time that this clarity is added to the data management toolkit, semantic technology sweeps away the nearly infinite collection of complex table-with-column structures with the single, standards-based RDF triple structure. Changing from descriptive to definitional is orders of magnitude clearer. Replacing tables and columns with triples is orders of magnitude simpler. Combining them into a single Meaning-First semantic solution is truly a game changer.

Graph Database Superpowers: Unraveling the back-story of your favorite graph databases

The graph database market is very exciting, as the long list of vendors continues to grow. You may not know that there are huge differences in the origin story of the dozens of graph databases on the market today. It’s this origin story that greatly impacts the superpowers and weaknesses of the various offerings.

While Superman is great at flying and stopping locomotives, you shouldn’t rely on him around strange glowing metal. Batman is great at catching small-time hoods and maniacal characters, but deep down he has no superpowers, other than a lot of funding for special vehicles and handy tool belts. Superman’s origin story is very different from Batman’s, and therefore the impact each has on the criminal world is very different.

This is also the case with graph databases. The origin story absolutely makes a difference when it comes to strengths and weaknesses. Let’s look at how the origin story of various graph databases can make all the difference in the world when it comes to use cases for the solutions.

Graph Database Superhero: RDF and SPARQL databases

Examples: Ontotext, AllegroGraph, Virtuoso and many others

Origin Story: Short for Resource Description Framework, RDF is a decades-old data model originating with Tim Berners-Lee. The idea behind RDF was to provide a data model that allows the sharing of data, similar to how we share information on the internet. Technically, this is the classic triple store with subject-predicate-object.

Superpower: Semantic modeling. A basic understanding of concepts and the relationships between those concepts. Enhanced context through the use of ontologies. Sharing data and concepts on the web. These databases often support OWL and SHACL, which help with describing what the data should look like and with sharing data the way we share web pages.

Kryptonite: The original RDF specification did not account for properties on predicates very well. So, for example, if I wanted to specify WHEN Sue became a friend of Mary, or to record that Sue is a friend of Mary according to Facebook, handling that provenance and time could be cumbersome. Many RDF databases added quad-store options where users could handle provenance or time, and several are adding the new RDF* specification to overcome these shortcomings. More on this in a minute.
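The Sue-and-Mary example can be sketched with the standard RDF reification vocabulary to show the cost: attaching a date or a source to a single friendship requires turning the statement itself into a node first. The identifiers here are illustrative:

```python
# The bare fact is a single triple:
fact = ("Sue", "friendOf", "Mary")

# Standard RDF reification: the statement becomes a node ("stmt1")
# described by four triples...
reified = [
    ("stmt1", "rdf:type", "rdf:Statement"),
    ("stmt1", "rdf:subject", "Sue"),
    ("stmt1", "rdf:predicate", "friendOf"),
    ("stmt1", "rdf:object", "Mary"),
    # ...and only then can time and provenance attach to it.
    ("stmt1", "since", "2020-01-15"),
    ("stmt1", "source", "Facebook"),
]

print(len(reified))  # 6
```

Six triples to qualify one relationship is exactly the cumbersomeness that quad stores and RDF* set out to reduce.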

Many of the early RDF stores were built on transactional architectures, so they scaled somewhat to handle transactions but had size restrictions when performing analytics across many triples.

It is in this category that the vendors have had some time to mature. While the origins may be in semantic web and sharing data, many have stretched their superpowers with labeled properties and other useful features.

Graph Database Superhero: Labeled Property graph with Cypher

Example: Neo4j

Origin Story: LPG is short for labeled property graph, and the premier player in the LPG world was and is Neo4j. According to podcasts and interviews with the founder, the original idea was more about managing content on a web site, where taxonomies gave birth to many-to-many relationships. Neo4j developed its new type of system to support its enterprise content management team. So when you needed to search across your website for certain content, for example when a company changes its logo, the LPG kept track of how those assets were connected. This is offered as an alternative to the JOIN table in an RDBMS, which holds foreign keys from both participating tables and is extremely costly in traditional databases.
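The contrast with the RDBMS JOIN table can be sketched in a few lines of Python; the pages and assets are invented for illustration:

```python
# RDBMS-style: pages and assets are linked indirectly through a join
# table of foreign keys, which must be scanned to answer the question.
pages  = {1: "Home", 2: "About"}
assets = {10: "logo.png"}
page_asset = [(1, 10), (2, 10)]  # join table: (page_id, asset_id)

pages_using_logo = [pages[p] for (p, a) in page_asset if a == 10]

# Graph-style: the asset node holds its relationships directly, so
# "where is the logo used?" is a direct lookup, not a scan and join.
graph = {"logo.png": {"usedOn": ["Home", "About"]}}
print(graph["logo.png"]["usedOn"] == pages_using_logo)  # True
```

Both answer the same question; the property graph simply stores the relationship where the traversal starts.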

Superpower: Although the origin story is about website content taxonomies, it turns out that these databases were also pretty good for 360-degree customer view applications and for understanding multiple supply chain systems. Cypher, although not a W3C or ISO standard, has become a de facto standard language as the Cypher community has grown with Neo4j's efforts. Neo4j has also been an advocate of the upcoming GQL standard, which may result in a more capable Cypher language.

Kryptonite: Neo4j has built its own system from the ground up on a transactional architecture. Although some scaling features have recently been added in Neo4j version 4, the approach is more about federating queries than an MPP approach. In version 4, the developers added manual sharding and a new way to query sharded clusters, which requires extra work when sharding and writing your queries. This is similar to transactional RDF stores, where SPARQL 1.1 supports federated queries through a SERVICE clause. In other words, you may still encounter limits when trying to scale and perform analytics. Time will tell if the latest federated approach is scalable.

Ontologies and inferencing are not standard features with a property graph, although some capability is offered here with add-ons. If you’re expecting to manage semantics in a property graph, it’s probably the wrong choice.

Graph Database Superhero: Proprietary Graph

Example: TigerGraph

Origin Story: According to their website, when the founders of TigerGraph decided to write a database, one of them was working at Twitter on a project that needed larger-scale graph algorithms than Neo4j could offer. TigerGraph devised a completely new architecture for the data model and storage, even creating its own graph language.

Superpowers: Through TigerGraph, the market came to appreciate that graph databases could run on a cluster. Although certainly not the first to run on a cluster, TigerGraph's focus was on real power in running user-supplied graph algorithms on a lot of data.

Kryptonite: The database decidedly went its own way with regard to standards. Shortcomings are apparent in the simplicity of leveraging ontologies, performing inferencing, and making use of people who already know SPARQL or Cypher. By far the biggest disadvantage of this proprietary graph is that you have to think more about schema and JOINs before loading data. The schema model is more reminiscent of a traditional database than any of the other solutions on the market. While it may be a solid solution for running graph algorithms, if you're creating a knowledge graph by integrating multiple sources and you want to run BI-style analytics on that knowledge graph, you may have an easier time with a different solution.

It is interesting to note that although TigerGraph's initial aim was to beat Neo4j at proprietary graph algorithms, TigerGraph has teamed up with the Neo4j folks and is in the early stages of making its proprietary language a standard via ISO, alongside SQL. Although TigerGraph releases many benchmarks, I have yet to see it release benchmarks for TPC-H or TPC-DS, the standard BI-style analytics benchmarks. Also, due to the non-standard data model, harmonizing data from multiple sources requires some extra legwork and thought about how the engine will execute analytics.

Graph Database Superhero: RDF Analytical DB with labeled properties

Example: AnzoGraph DB

Origin Story: AnzoGraph DB was the brainchild of former Netezza and ParAccel engineers who designed MPP platforms like Netezza, ParAccel, and Redshift. They became interested in graph databases, recognizing a gap in perhaps the biggest category of data: data warehouse-style data and analytics. Although companies making transactional graph databases covered a lot of ground in the market, there were very few analytical graph databases that could follow standards, perform graph analytics, and leverage ontologies/inferencing for improved analytics.

Superpowers: Cambridge Semantics designed a triple store that both follows standards and scales like a data warehouse. In fact, it was the first OLAP MPP platform for graph, capable of analytics on a lot of triples. It turns out that this is the perfect platform for creating a knowledge graph, facilitating analytics built from a collection of structured and unstructured data. The data model lets users load almost any data at any time.

Because of its schemaless nature, the data can be sparsely populated. It supports very fast in-memory transformations, so data can be loaded first and cleansed later (ELT). Because metadata and instance data live together in the same graph without any special effort, all those ELT queries become much more flexible, iterative, and powerful. With an OLAP graph like AnzoGraph DB, you can add any subject-predicate-object property at any time without having to plan for it.

In traditional OLAP databases, you can have views. In this new type of database, you can have multi-graphs that can be queried as one graph when needed.

Kryptonite: Although ACID compliant, other solutions on the market might support faster transactions, given the OLAP nature of this database's design. Ingesting massive numbers of transactions might require additional technologies, like Apache Kafka, to run smoothly in high-transaction environments. Like many warehouse-style technologies, bulk data loading is very fast, so batch loads are quick. Pairing an analytical database with a transactional database is also sometimes a solution for companies that have both high transaction volumes and deep analytics to perform.

Other types of “Graph Databases”

A few other types of graph databases also have some graph superpowers. Traditional database vendors have recognized that graph can be powerful and now offer a graph model in addition to their native model. For example, Oracle has two offerings: an add-on package that provides geospatial and graph, and an in-memory graph that is separate from traditional Oracle.

You can get graph database capabilities in an Apache Hadoop stack through GraphFrames, which works on top of Apache Spark. Given Spark's capability to handle big data, scaling is a superpower. However, since your requirements might lead you to layering technologies, tuning a combination of Spark, HDFS, YARN, and GraphFrames could be the challenge.

These other solutions give you a nice taste of graph functionality in a product you probably already have. The kryptonite here is usually performance when scaling to billions or trillions of triples and then trying to run analytics on them.

The Industry is full of Ironmen

Ironman Tony Stark built his first suit out of scrap parts when he was captured by terrorists and forced to live in a cave. It had many vulnerabilities, but it served its one purpose: to get the hero to safety. Later, the Ironman suit evolved to be more powerful, deploy more easily, and think on its own. The industry is full of Tony Starks who will evolve the graph database.

However, while evolution happens, remember that graph databases aren’t one thing.

"Graph database" is a generic term that simply doesn't give you the level of detail you need to understand which problem a given product solves. The industry has developed various methods of doing the critical tasks that drive value in this category we call graph databases. Whether it's harmonizing diverse data sets, performing graph analytics, or performing inferencing and leveraging ontologies, you really have to think about what you'd like to get out of the graph before you choose a solution.

WRITTEN BY

Steve Sarsfield

VP Product, AnzoGraph (AnzoGraph.com). Formerly of IBM, Talend, and Vertica. Author of the book The Data Governance Imperative.

DCAF 2020: Second Annual Data-Centric Architecture Forum Re-Cap

Last year, we decided to call the first annual Data-Centric Architecture Conference a Forum, which resulted in DCAF. It didn't take long for attendees to start calling the event "decaf," but they were equally quick to point out that the forum was anything but. We had a great blend of presentations, ranging from discussions about emerging best practices in the applied semantics profession to mind-blowing vendor demos. Our stretch goals from last year included growing the number of attendees, seeing more data-centric vendors, and exploring security and privacy. These were met and exceeded, and we're on track to set even loftier stretch goals for next year.

Throughout the Data-Centric Architecture Forum presentations, we were particularly impressed by the blockchain data security presentation by Brian Platz at https://flur.ee/. Semantic tech is an obvious choice for organizations wishing to become data centric, but we often have to rely on security frameworks that work for legacy platforms. It was exciting to see a platform that addresses security in a way that is highly compatible with semantics. They also provide a solid architecture that is consistent with the goals of the DCA, regardless of whether their clients choose to go with more traditional relational configurations, or semantic configurations.

We welcomed returning attendees from Lymba, showcasing some of the project work they’ve done while partnering with Semantic Arts. Mark Van Berkel from Schema App built an architecture based on outcomes from last year’s Data Centric Architecture Conference. It’s amazing what a small team can do in a short amount of time when they’re operating free from corporate constraints.

One of our concerns with growing the number of participants was that we would lose the energy of the room and the level of comfort in sharing ideas and networking across unspoken professional barriers (devs vs. product? Not here!). Everyone was set up to learn from these presentations. The group was intimate enough that presenters could engage directly with the audience, which included developers, other vendors, and practitioners in the field of semantics. We made every effort to keep presentations on target and to keep audience participation smoothly moderated, so coffee breaks were fertile ground for discussions and networking. So much of this conversation grew organically that we at Semantic Arts decided to open virtual forums to continue the discussions.

You can join us on these channels at:
LinkedIn group
Estes Park Group

While we’re on the topic of goals, here’s what we envision for next year’s Data-Centric Architecture Forum:
• Continuing with our mindset of growth – we want to see vendors bring the clients who showcase the best the tools and products have to offer. Success stories and challenges welcome.
• Academic interests – not that this is going to be a job fair, but Fort Collins IS a college town, just sayin’. Also, to that point, how do we recruit? What does it take to be a DCAF professional? What are you (vendors and clients) looking for when you want to build teams that can work on transformative tech?
• Continuing with our mindset of transparency, learning, and vulnerability. We still have to really solve the issue of security and privacy; how do we do that when we’re all about sharing data? What are our blind-spots as a profession?

Meaning-First Data Modeling, A Radical Return to Simplicity

Person uses language. Person speaks language. Person learns language. We spend the early years of life learning vocabulary and grammar in order to generate and consume meaning. As a result of constantly engaging in semantic generation and consumption, most of us are semantic savants. This Meaning-First approach is our default until we are faced with capturing meaning in databases. We then revert to the Structure-First approach that has been beaten into our heads since Codd invented the relational model in 1970. This blog post presents Meaning-First data modeling for semantic knowledge graphs as a replacement for Structure-First modeling. The relational model was a great start for data management, but it is time to embrace a radical return to simplicity: Meaning-First data modeling.

This is a semantic exchange, with me as writer and you as reader. The semantic mechanism by which it all works is the subject-predicate-object construct. The subject is a noun to which the statement's meaning is applied. The predicate is the verb, the action part of the statement. The object is generally also a noun, the focus of the action. These three parts are the semantic building blocks of language and the focus of this post: semantic knowledge graphs.

In Meaning-First semantic data models the subject-predicate-object construct  is called a triple, the foundational structure upon which semantic technology is built. Simple facts are stated with these three elements, each of which is commonly surrounded by angle brackets. The first sentence in this post is an example triple. <Person> <uses> <language>. People will generally get the same meaning from it. Through life experience, people have assembled a working knowledge that allows us to both understand the subject-predicate-object pattern as well as what people and language are. Since computers don’t have life experience, we must fill in some details to allow this same understanding to be reached. Fortunately, a great deal of this work has been done by the World Wide Web Consortium (W3C) and we can simply leverage those standards.
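A triple can be sketched as a plain 3-tuple; the snippet below is only an illustration of the subject-predicate-object pattern (the names are invented for the example, and real RDF uses IRIs rather than bare strings):

```python
# A triple is an ordered (subject, predicate, object) statement.
# Names here are illustrative only; real RDF identifies resources with IRIs.
triples = {
    ("Person", "uses", "Language"),
    ("Mark", "uses", "English"),
}

def objects_of(subject, predicate, graph):
    """Return every object asserted for a given subject-predicate pair."""
    return {o for s, p, o in graph if s == subject and p == predicate}

print(objects_of("Mark", "uses", triples))  # {'English'}
```

Even this toy version shows the key property of triples: every fact has the same three-part shape, so new facts can be added without changing any structure.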

Modeling the triple "Person uses Language" with arrows and ovals, as in Figure 1, is a good start. Tightening the model by adding formal definitions makes it more robust and less ambiguous. These definitions come from gist, Semantic Arts' minimalist upper-level ontology.

Figure 1, Triple diagram

The subject, <Person>, is defined as "A Living Thing that is the offspring of some Person and that has a name." The object, <Language>, is defined as "A recognized, organized set of symbols and grammar". The predicate, <uses>, isn't defined in gist, but could be defined as something like "Engages with purpose". It is the action linking <Person> to <Language> to create the assertion about Person. Formal definitions for subjects and objects are useful because they are mathematically precise. They can be used by semantic technologies to reach the same conclusions as a person with working knowledge of these terms.

 

Surprise! This single triple is (almost) an ontology: it contains formal definitions and is in the form of a triple. It is almost certainly the world's smallest ontology, and it is missing a few technical components, but it is a good start all the same. The missing components come from standards published by the W3C, which won't be covered in detail here. To make certain the progression is clear, a quick checkpoint is in order. These are the assertions so far:

  • A triple is made up of a <Subject>, a <Predicate>, and an <Object>.
  • <Subjects> are always Things, e.g. something with independent existence including ideas.
  • <Predicates> create assertions that
    • Connect things when both the Subject and Object are things, or
    • Make assertions about things when the Object is a literal
  • <Objects> can be either
    • Things or
    • Literals, e.g. a number or a string

These assertions summarize the Resource Description Framework (RDF) model. RDF is a language for representing information about resources in the World Wide Web. Resource refers to anything that can be returned in a browser. More generally, RDF enables Linked Data (LD) that can operate on the public internet or privately within an organization. It is the simple elegance embodied in RDF that enables Meaning-First Data Modeling’s radically powerful capabilities. It is also virtually identical to the linguistic building blocks that enabled cultural evolution: subject, predicate, object.

Where RDF defines the framework for the triple, Resource Description Framework Schema (RDFS) provides a data-modeling vocabulary for building with RDF triples. RDFS is an extension of the basic RDF vocabulary and is leveraged by higher-level languages such as the Web Ontology Language (OWL) and the Dublin Core Metadata Initiative terms (dcterms). RDFS supports constructs for declaring that resources, such as Living Thing and Person, are classes. It also enables establishing subclass relationships between classes so the computer can make sense of the formal Person definition, "A Living Thing that is the offspring of some Person and that has a name."
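To make the subclass idea concrete, here is a rough sketch of what rdfs:subClassOf buys the computer: if Person is a subclass of Living Thing, every member of Person can be inferred to also be a member of Living Thing. (The tiny upward walk below is illustrative only, not a real RDFS reasoner.)

```python
# Illustrative only: infer class membership by walking subClassOf links upward.
subclass_of = {"Person": "LivingThing"}  # child class -> parent class
types = {"Mark": "Person"}               # instance -> asserted class

def all_classes(instance):
    """Return every class an instance belongs to, including inferred ones."""
    classes = []
    cls = types.get(instance)
    while cls is not None:
        classes.append(cls)
        cls = subclass_of.get(cls)  # climb to the parent class, if any
    return classes

print(all_classes("Mark"))  # ['Person', 'LivingThing']
```

Nothing about Mark was restructured to reach the conclusion that Mark is a Living Thing; it follows purely from the meaning of the subClassOf assertion.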

Here is a portion of the schema supporting the opening statement in this post, "Person uses Language". For simplicity, the 'has name' portion of the definition has been omitted from this diagram, but it will show up later.

Figure 2, RDFS subclass property

Figure 2 shows the RDFS subClassOf property as a named arrow connecting two ovals. This model is correct in that it shows the subClassOf property, yet it isn't quite satisfying. Perhaps it is even a bit ambiguous, because through the lens of traditional, Structure-First data modeling, it appears to show two tables with a connecting relationship.

 

Nothing could be further from the truth.

There are two meanings here, and they are not connected structures. The Venn diagram in Figure 3 shows more clearly that the Person set is wholly contained within the set of all Living Things, so a Person is also a Living Thing.

Figure 3, RDFS subClassOf Venn diagram

There is no structure separating them. They are in fact both in one single structure: a triple store. They are differentiated only by the meaning found in their formal definitions, which create the membership criteria of two different sets. The first set is all Living Things. The second set, wholly embedded within the first, is the set of all Living Things that are also the offspring of some Person and that have a name. Person is a more specific set whose criteria cause a Living Thing to be a member of the Person set while still remaining a member of the Living Things set.

Rather than Structure-First modeling, this is Meaning-First modeling built upon the triple defined by RDF with the schema articulated in RDFS. There is virtually no structure beyond the triple. All the triples, content and schema, commingle in one space called a triple store.

Figure 4, Complete schema

Here is some informal data along with the simple ontology’s model:

Schema:

  • <Person> <uses> <Language>

Content:

  • <Mark> <uses> <English>
  • <Boris> <uses> <Russian>
  • <Rebecca> <uses> <Java>
  • <Andrea> <uses> <OWL>

Contained within this sample data lies a demonstration of the radical simplicity of Meaning-First data modeling. There are two subclasses in the data content not currently modeled in the schema, yet they don't violate the schema.

Figure 5, Updated Language Venn diagram

Figure 5 shows the subclasses added to the schema after they were discovered in the data. This can be done in a live, production setting without breaking anything! In a Structure-First system, new tables and joins would need to be added to accommodate this type of change, at great expense and over a long period of time. This example just scratches the surface of Meaning-First data modeling's radical simplicity.
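A rough sketch of that live schema change: new subClassOf triples are simply added alongside the existing content, and a query that goes through the superclass picks them up immediately. (The mini triple store, the predicate names, and the one-level inference below are illustrative, not a production reasoner.)

```python
# Content and schema commingle as triples in one store.
store = {
    ("Person", "uses", "Language"),  # schema
    ("Mark", "uses", "English"),     # content
    ("Rebecca", "uses", "Java"),     # content
}

# Subclasses discovered in the data are added to the schema at runtime,
# right next to the content, with no table changes.
store.add(("NaturalLanguage", "subClassOf", "Language"))
store.add(("ComputerLanguage", "subClassOf", "Language"))
store.add(("English", "type", "NaturalLanguage"))
store.add(("Java", "type", "ComputerLanguage"))

def instances_of(cls):
    """Instances typed as cls or as any direct subclass of cls."""
    subs = {s for s, p, o in store if p == "subClassOf" and o == cls} | {cls}
    return {s for s, p, o in store if p == "type" and o in subs}

print(sorted(instances_of("Language")))  # ['English', 'Java']
```

The query for Language instances did not have to change when the two subclasses appeared; the new triples refine the meaning without touching any structure.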

 

 

Stay tuned for the next installment and a deeper dive into Meaning-First vs Structure-First data modeling!

Time to Rethink Master and Reference Data


Every company contends with data quality, and in its pursuit many commit substantial resources to managing their master and reference data. Remarkably, quite a bit of confusion exists around exactly what these are and how they differ. And since they provide context to business activity, this confusion can undermine any data quality initiative.

Here are amalgams of the prevailing definitions, which seem meaningful at first glance:


Sound familiar? In this article, I will discuss some tools and techniques for naming and defining terms that explain how these definitions actually create confusion. Although there is no perfect solution, I will share the terms and definitions that have helped me guide data initiatives, processes, technologies, and governance over the course of my career.

What’s in a Name?

Unique and self-explanatory names save time and promote common understanding. Naming, however, is nuanced in that words are often overloaded with multiple meanings. The word “customer,” for instance, often means very different things to people in the finance, sales, or product departments. There are also conventions that, while not exactly precise, have accumulated common understanding over time. The term “men’s room,” for example, is understood to mean something more specific than a room (it has toilets); yet something less specific than men’s (it’s also available to boys).

They’re both “master”

The term “master” data derives from the notion that each individually identifiable thing has a corresponding, comprehensive and authoritative record in the system. The verb to master means to gain control of something. The word causes confusion, however, when used to distinguish master data from reference data. If anything, reference data is the master of master data, as it categorizes and supplies context to master data. The dependency graph below demonstrates that master data may refer to and thus depend on reference data (red arrow), but not the other way around:
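One way to picture that one-way dependency is as a small validation over labeled dependency edges; the entity names and the check below are purely illustrative:

```python
# Illustrative check that dependencies only flow master -> reference.
reference = {"country:US", "uom:each"}     # reference data values
master = {"customer:123", "product:9"}     # master data records

# (from, to) dependency edges: master records point at reference values.
edges = {("customer:123", "country:US"), ("product:9", "uom:each")}

def violations(edges):
    """Edges where reference data depends on master data (should be none)."""
    return {(a, b) for a, b in edges if a in reference and b in master}

print(violations(edges))  # set()
```

An empty result confirms the direction of the red arrow: master data may depend on reference data, never the reverse.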

They’re both “reference”

The name “reference data” also makes sense in isolation. It evokes reference works like dictionaries, which are highly curated by experts and typically used to look up individual terms rather than being read from beginning to end. But reference can also mean the act of referring, and in practice, master data has just as many references to it as reference data.  

So without some additional context, these terms are problematic in relation to each other.

It is what it is

Although we could probably conjure better terms, "Master Data" and "Reference Data" have become universal standards with innumerable citations. Any clarification provided by new names would be offset by their incompatibility with this consensus.

Pluralizations R Us

Whenever possible, it’s best to express terms in the singular rather than the plural since the singular form refers to the thing itself, while the plural form denotes a set. That’s why dictionaries always define the singular form and provide the plural forms as an annotation.  Consider the following singular and plural terms and definitions:

* Note that entity is used in the entity-relationship sense, where it denotes a type of thing rather than an identifiable instance of a thing.

The singular term “entity” works better for our purposes since the job at hand is to classify each entity as reference or master, rather than some amorphous concept of data. In our case, classifying each individual entity informs its materialized design in a database, its quality controls, and its integration process. The singular also makes it more natural to articulate relationships between things, as demonstrated by these awkward counterexamples:

“One bushels contains many apples.”

“Each data contains one or more entities.”

Good Things Come in Threes

Trying to describe the subject area with just two terms, master and reference, falls short because the relationship between the two cannot be fully understood without also defining the class that includes them both.  For example, some existing definitions specify a “disjoint” relationship in which an entity can belong to either reference or master data, but not both. This can be represented as a diagram or tree:

The conception is incomplete because the class that contains both reference and master data is missing.  Are master data and reference data equal siblings among other data categories, as demonstrated below?

That’s not pragmatic, since it falsely implies that master and reference data have no more potential for common governance and technology than, say, weblogs and image metadata. We can remedy that by subsuming master and reference data within an intermediate class, which must still be named, defined, and assigned the common characteristics shared by master and reference data.

Some definitions posit an inclusion or containment relationship in which reference data is a subset of master data, rather than a disjoint peer. This approach, however, omits the complement: the master data that is not reference data.

Any vocabulary that doesn’t specify the combination of master and reference data will be incomplete and potentially confusing.

It’s Just Semantics

Generally speaking, there are two broad categories of definitions: extensional and intensional.  

Extensional Definitions

An extensional definition simply defines an entity by listing all of its instances, as in the following example:

This is out of the question for defining reference or master data, as each has too many entities and regularly occurring additions. Imagine how unhelpful and immediately obsolete the following definition would be:

A variation of this approach, ostensive definition, uses partial lists as examples.  These are often used for “type” entities that nominally classify other things:

Ostensive definitions, unlike extensional definitions, can withstand the addition of new instances. They do not, however, explain why their examples satisfy the term. In fact, ostensive definitions are used primarily in situations where it's hard to formulate a definition that can stand on its own. Therefore both extensional and ostensive definitions are inadequate, since they fail to provide a rationale to distinguish reference from master data.

Intensional Definitions 

Intensional definitions, on the other hand, define things by their intrinsic properties and do not require lists of instances.  The following definition of mineral, for example, does not list any actual minerals:

With that definition, we can examine the properties of quartz, for example, and determine that it meets the necessary and sufficient conditions to be deemed a mineral.  Now we’re getting somewhere, and existing definitions have naturally used this approach.  

Unfortunately, the conditions put forth in the existing definitions of master and reference data can describe either, rather than one or the other. The following table shows that every condition in the intensional definitions of master and reference data applies to both terms:

How can you categorize the product entity, for example, when it adheres to both definitions? It definitely conforms to the definition of master: a core thing shared across an enterprise. But it also conforms to reference, as it's often reasonably stable and simply structured, used to categorize other things (sales), provides a list of permissible values (order forms), and corresponds to external databases (vendor part lists). I could make the same case for almost any entity categorized as master or reference, and this is where the definitions fail.

Master data and reference data: use intensional definitions

Celebrate Diversity

Although they share the same intrinsic qualities, master and reference data truly are different and require separate terms and definitions. Their flow through a system and their respective quality control processes, for instance, are quite distinct.  

Reference data is centrally administered and stored. It is curated by an authoritative party before becoming available in its system of record, and only then is it copied to application databases or the edge. An organization, for instance, would never let a user casually add a new unit of measure or a new country.

Master data, on the other hand, is often regularly added and modified in various distributed systems. New users register online, sales systems acquire new customers, organizations hire and fire employees, etc. The data comes in from the edge during the normal course of business, and quality is enforced as it is merged into the systems of record.

Master data and reference data change and merge

Companies must distinguish between master and reference data to ensure their quality and proper integration.


Turn The Beat Around

It’s entirely reasonable and common to define things by their intrinsic qualities and then use those definitions to inform their use and handling. Intuition tells us that once we understand the characteristics of a class of data, we can assess how best to manage it. But since the characteristics of master and reference data overlap, we need to approach their definitions differently.

 

In software architecture and design, there’s a technique called Inversion of Control that reverses the relationship between a master module and the process it controls. It essentially makes the module subservient to the process. We can apply a similar concept here by basing our definitions on the processes required by the data, rather than trying to base the processes on insufficiently differentiated definitions. This allows us to pragmatically define terms that abide by the conclusions described above:

  1. Continue to use the industry-standard terms “master data” and “reference data.”
  2. Define terms in the singular form.
  3. Define a third concept that encompasses both categories.
  4. Eschew extensional and ostensive definitions, and use intensional definitions that truly distinguish the concepts.

With all that out of the way, here are the definitions that have brought clarity and utility to my work with master and reference data. I’ve promoted the term “core” from an adjective of master data to a first-class concept that expresses the superclass encompassing both master and reference entities.

With core defined, we can use a form of intensional definition called genus-differentia for reference and master data. Genus-differentia definitions have two parts. The first, the genus, refers to a previously defined class to which the concept belongs (core entity, in our case). The rest of the definition, the differentia, describes what sets it apart from others in its class. We can now leverage our definition of core entity as the genus, allowing the data flow to provide the differentia. This truly distinguishes reference from master.
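The genus-differentia split maps naturally onto a small class hierarchy, where the genus is the shared superclass and the differentia is the flow each subclass adds. A hedged sketch (the class and attribute names are invented for the example):

```python
from dataclasses import dataclass

@dataclass
class CoreEntity:
    """Genus: the qualities shared by master and reference entities."""
    name: str

@dataclass
class ReferenceEntity(CoreEntity):
    """Differentia: curated at the system of record, then copied to the edge."""
    flow: str = "curated centrally, copied out to applications"

@dataclass
class MasterEntity(CoreEntity):
    """Differentia: created at the edge, merged into the system of record."""
    flow: str = "created at the edge, merged into the system of record"

country = ReferenceEntity("country")
customer = MasterEntity("customer")
print(customer.flow)
```

Both subclasses inherit everything intrinsic from the genus; only the flow, the differentia, tells them apart, which mirrors the definitions above.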

We can base the plural terms on the singular ones:

Conclusion

This article has revealed several factors that have handicapped our understanding of master and reference data:

  • The names and prevailing definitions insufficiently distinguish the concepts because they apply to both.
  • The plural form of a given concept obscures its definition.
  • Master data and reference data are incompletely described without a third class that contains both. 

Although convention dictates retention of the terms “master” and “reference,” we achieve clarity by using genus differentia to demonstrate that while they are both classified as core entities, they are truly distinguished by their flow and quality requirements rather than any intrinsic qualities or purpose.

By Alan Freedman

Connect with the Author

Want to learn more about what we do at Semantic Arts? Contact us!

Facet Math: Trim Ontology Fat with Occam’s Razor

At Semantic Arts we often come across ontologies whose developers seem to take pride in the number of classes they have created, giving the impression that more classes equate to a better ontology. We disagree with this perspective and, as evidence, point to Occam's Razor, a problem-solving principle that states, "Entities should not be multiplied without necessity." More is not always better. This post introduces Facet Math and demonstrates how to contain runaway class creation during ontology design.

Semantic technology is suited to making complex information intellectually manageable, and huge class counts are counterproductive. Enterprise data management is complex enough already; adding unnecessary classes only makes it harder to manage. Fortunately, the solution comes in the form of a simple modeling change.

Facet Math leverages core concepts and pushes fine-grained distinctions to the edges of the data model. This reduces class counts and complexity without losing any informational fidelity. Here is a scenario that demonstrates spurious class creation in the literature domain. Since literature can be sliced many ways, it is easy to justify building in complexity as data structures are designed. This example demonstrates a typical approach and then pivots to a more elegant Facet Math solution.

A taxonomy is a natural choice for the literature domain. To reach each leaf, the whole path must be modeled, adding a multiplier with each additional level in the taxonomy. This multiplicative effect would produce a tree with 1,000 leaves (10 × 10 × 10) in a taxonomy that had:
10 languages
10 genres
10 time periods

Taxonomies are typically not that regular, though they do chart a path from the topmost concept down to each leaf. Modelers tend to model the whole path, which multiplies the result set. Having to navigate taxonomy paths makes working with the information more difficult: the path must be disassembled to work with the components it has aggregated.

This temptation to model taxonomy paths into classes and/or class hierarchies creates a great deal of complexity. The languages, genres, and time periods in the example are really literature categories. This is where Facet Math kicks in, taking an additive approach by designing them as distinct categories. Using those categories for faceted search and dataset assembly returns all the required data. Here is how it works.
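The additive, query-time approach can be sketched as follows; the titles and facet values are hypothetical sample data:

```python
# Sketch: facets as independent categories, combined at query time
# instead of being baked into taxonomy-path classes.
works = [
    {"title": "Les Fleurs du mal", "language": "French",  "genre": "Poetry", "period": "19th century"},
    {"title": "Moby-Dick",         "language": "English", "genre": "Novel",  "period": "19th century"},
    {"title": "Beowulf",           "language": "English", "genre": "Poetry", "period": "Medieval"},
]

def facet_search(items, **facets):
    """Return the items that match every requested facet value."""
    return [w for w in items if all(w.get(k) == v for k, v in facets.items())]

# Intersecting two facets assembles the dataset without any path classes.
print([w["title"] for w in facet_search(works, language="English", genre="Poetry")])
# → ['Beowulf']
```

Each facet is modeled once; combinations exist only at query time, not in the class hierarchy.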


To apply Facet Math, remove the category duplication from the original taxonomy by refactoring them as category facets. The facets enable exactly the same data representation:
10 languages
10 genres
10 time periods

By applying Facet Math principles, the concept count drops dramatically: where the paths multiplied to produce 1,000 concepts, the facets only add, leaving just 30.
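The arithmetic behind the reduction can be sketched directly; paths multiply while facets add:

```python
from math import prod

# The example's three ten-value categories.
facet_sizes = {"languages": 10, "genres": 10, "time_periods": 10}

path_classes = prod(facet_sizes.values())   # one class per full taxonomy path
facet_classes = sum(facet_sizes.values())   # one concept per facet value

print(path_classes, facet_classes)  # → 1000 30
```

Adding a fourth ten-value category would multiply the path count to 10,000 but add only ten more facet concepts.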

Sure, this is a simple example. Looking at a published ontology might be more enlightening.

SNOMED (Systematized Nomenclature of Medicine—Clinical Terms) ontology is a real-world example.

Since the thesis here is fat reduction, here is the class hierarchy in SNOMED from the topmost class down to Gastric Bypass.

Notice that Procedure appears at four levels, and Anastomosis and Stomach each appear at two. This hierarchy is a path containing paths.

SNOMED’s maximum class hierarchy depth is twenty-seven. Given the multiplicative effect shown above in the first example, SNOMED having 357,533 classes, while disappointing, is not surprising. The medical domain is highly complex but applying Facet Math to SNOMED would surely generate some serious weight reduction. We know this is possible because we have done it with clients. In one case Semantic Arts produced a reduction from over one hundred fifty thousand concepts to several hundred without any loss in data fidelity.

Bloated ontologies contain far more complexity than is necessary. Humans cannot possibly memorize a hundred thousand concepts, but several hundred are intellectually manageable. Computers also benefit from reduced class counts. Machine Learning and Artificial Intelligence applications have fewer, more focused concepts to work with so they can move through large datasets more quickly and effectively.

It is time to apply Occam’s Razor and avoid creating unnecessary classes. It is time to design ontologies using Facet Math.

When is a Brick not a Brick?

They say good things come in threes and my journey to data-centricity started with three revelations.

The first was connected to a project I was working on for a university college with a problem that might sound familiar to some of you. The department I worked in was taking four months to clean, consolidate and reconcile our quarterly reports to the college executive. We simply did not have the resources to integrate incoming data from multiple applications into a coherent set of reports in a timely way.

The second came in the form of a lateral thinking challenge worthy of Edward de Bono: ‘How many different uses for a brick can you think of?’

The third revelation happened when I was on a consulting assignment at a multinational software company in Houston, Texas. As part of a content management initiative we were hired to work with their technical documentation team to install a large ECM application. What intrigued me the most, though, were the challenges the company experienced at the interface between the technology and the ‘multiple of multiples’ with respect to business language.

Revelation #1: Application Data Without the Application is Easy to Work With

The college where I had my first taste of data-centricity had the usual array of applications supporting its day-to-day operations. There were Student systems, HR systems, Finance systems, Facility systems, Faculty systems and even a separate Continuing Education System that replicated all those disciplines (with their own twists, of course) under one umbrella.

The department I worked in was responsible for generating executive quarterly reports for all activities on the academic side, plus semi-annual faculty workload reports and annual graduation and financial performance reports. In the beginning we did this piecemeal, as IT resources became available. One day, we decided to write a set of specifications for the data we needed: to what level of granularity, in what sequence, and how frequently it should be extracted from various sources.

We called the process ‘data liquefication’ because once the data landed on our shared drive the only way we could tell what application it came from was by the file name. Of course, the contents and structure of the individual extracts were different, but they were completely pliable. Detached from the source application, we had complete freedom to do almost anything we wanted with it. And we did. The only data modeling decision we had to make (actually, we only ever thought about it once) was which ‘unit of production’ to use as the ‘center’ of our new reporting universe. To those of you working with education systems today, the answer will come as no surprise: we used ‘seat’.

Figure 1: A Global Candidate for Academic Analytics

Once that decision was taken, and we put feedback loops in to correct data quality at source, several interesting patterns emerged:

  • The collections named Student, Faculty, Administrator and Support Staff were not as mutually exclusive as we originally thought. Several individuals occupied multiple roles in one semester.
  • The Finance categories were set up to reflect the fact that some expenses applied to all Departments; some were unique to individual Departments; and, some were unique to Programs.
  • Each application seemed to use a different code or name or structure to identify the same Person, Program or Facility.

From these patterns we were able to produce quarterly reports in half the time. We also introduced ‘what-if’ reporting for the first time, and since we used the granular concept of ‘seat’ as our unit of production we added Cost per Seat; Revenue per Seat; Overhead per Seat; Cross-Faculty Registration per Seat; and, Longitudinal Program Costs, Revenues, Graduation Rates and Employment Patterns to our mix of offerings as well.

Revelation #2: A Brick is Always a Brick. How It Is Used Is a Separate Question

When we separate what a thing “is” from how it is used, some interesting data patterns show up. I won’t take up much space in this article to enumerate them, but the same principle that can take ‘one thing’ like an individual brick and use it in multiple ways (paper weight, door stop, wheel chock, pendulum weight, etc.) puts the whole data classification thing in a new light.

The string “John Smith” can appear, for example, as the name of a doctor, a patient, a student, an administrator and/or an instructor. This is a similar pattern to the one that popped up at the university college. As it turns out, that same string can serve as an entity name, an attribute, metadata, reference data, and a few other popular ‘sub-classes’ of data. They are not separate collections of ‘things’ so much as separate functions of the same thing.

Figure 2: What some ‘thing’ is and how it is used are two separate things

The implication for me was to classify ‘things’ first and foremost as what they refer to or in fact what they are. So, “John Smith” refers to an individual, and in my model surrounding data-centricity “is-a”(member of the set named) Person. On the other side of the equation, words like ‘Student’, ‘Patient’, and ‘Administrator’ for example are Roles. In my declarations, Student “is-a”(member of the set named) Role.
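That separation can be sketched with a small, hypothetical model in which a person is one thing and roles are attached to it; all names here are illustrative:

```python
from dataclasses import dataclass, field

# Sketch: classify the thing by what it is (a Person) and model how it
# is used (Roles) separately.
@dataclass
class Person:
    name: str
    roles: set = field(default_factory=set)

john = Person("John Smith", roles={"Student", "Instructor"})
john.roles.add("Administrator")  # the same individual can occupy several roles

print(sorted(john.roles))  # → ['Administrator', 'Instructor', 'Student']
```

There is one “John Smith” regardless of how many roles he plays, which keeps the identity sets mutually exclusive while the role assignments vary freely.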

One of the things this allowed me to do was to create a very small (n = 19) number of mutually exclusive and exhaustive sets in any collection. This development also supported the creation of semantically interoperable interfaces and views into broadly related data stores.

Revelation #3: Shape and Semantics Must be Managed Separately and on Purpose

The theme of separation came up again while working on a technical publications project in Houston, Texas. Briefly, the objective was to render application user support topics into their smallest reusable chunks and make it possible for technical writers to create document maps ranging from individual Help files in four different formats to full-blown, multi-chapter user guides and technical references. What really made the project challenging was what we came to call the “multiple of multiples” problem. This turned out to be the exact opposite of the reuse challenge in Revelation #1:

  • Multiple customer platforms
  • Multiple versions of customer platforms
  • Multiple product families (Mainframe, Distributed and Hybrid)
  • Multiple product platforms
  • Multiple versions of product platforms
  • Multiple versions of products (three prior, one current, and one work-in-progress)
  • Multiple versions of content topics
  • Multiple versions of content assemblies (guides, references, specification sheets, for example)
  • Multiple customer locales (United States, Japan, France, Germany, China, etc.)
  • Multiple customer languages (English (two ‘flavours’), Japanese, German, Chinese, etc.)
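A rough sense of why this is a ‘factorial mess’: the number of possible variants is the product of the multiples. In this sketch, the product-family and product-version counts come from the list above, while the other counts are invented for illustration:

```python
from math import prod

# Hypothetical counts for some of the dimensions above; only
# product_families (3) and product_versions (5) come from the list.
multiples = {
    "customer_platforms": 4,
    "customer_platform_versions": 3,
    "product_families": 3,
    "product_versions": 5,
    "locales": 5,
    "languages": 5,
}
print(prod(multiples.values()))  # → 4500 possible variants
```

Even with modest counts per dimension, the combinations run into the thousands, which is exactly why no single taxonomy of variants was workable.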

The solution to this ‘factorial mess’ was not found in an existing technology (including the ECM software we were installing). It came about by removing all architectural and technical considerations (as we did in Revelation #1) and asking what it means to say: “The content is the same” or “The content is different.”

In the process of comparing two components found in the ‘multiple of multiples’ list, we discovered three factors for consideration:

  1. The visual ‘shape’ of the components. ‘Stop’ and ‘stop’ look the same.
  2. The digital signatures of the components. We used MD5 Hash to do this.
  3. The semantics of the components. We used translators and/or a dictionary.
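The three comparisons can be sketched as follows; the case-folded check and the synonym table are simplified stand-ins for the visual comparison and the translators/dictionary actually used:

```python
import hashlib

def compare(a: str, b: str, synonyms=None) -> dict:
    """Compare two content components on shape, signature and semantics."""
    synonyms = synonyms or {}
    return {
        # 1. Visual shape: 'Stop' and 'stop' look the same (case-folded form).
        "shape": a.casefold() == b.casefold(),
        # 2. Digital signature: exact bytes, via an MD5 hash.
        "signature": hashlib.md5(a.encode()).hexdigest()
                     == hashlib.md5(b.encode()).hexdigest(),
        # 3. Semantics: same meaning, per a translation/synonym lookup.
        "semantics": a.casefold() == b.casefold() or synonyms.get(a) == b,
    }

print(compare("Stop", "stop"))  # shape and semantics match; the signature does not
```

The interesting cases are the disagreements: two components can share a signature but differ semantically in another locale, or mean the same thing while hashing differently.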

Figure 3 shows the matrix we used to demonstrate the tendency of each topic to be reused (or not) in one of the multiples.

Figure 3: Shape, Signal and Semantics for Content Component Comparison

It turns out that content can vary as a result of time (a version), place (a locale with different requirements for the same feature, for example) people (different languages) and/or format (saving a .docx file as a pdf). In addition to changes in individual components, assemblies of components can have their own identities.

This last point is especially important. Some content was common to all products the company sold. Other content varied along product lines, client platform, target market and audience. Finally, the last group of content elements was unique to a particular combination of parameters.

Take-Aways

Separating data from its controlling applications presents an opportunity to look at it in a new way. Removed from its physical and logical constraints, the data begins to look a lot like the language of business. While the prospect of liberating data this way might horrify many application developers and data modelers out there, those of us trying to get the business closer to the information it needs to accomplish its goals see the beginning of a more naturally integrated way of doing that.

The Way Forward with Data-Centricity

Data-centricity in architecture is going to take a while to get used to. I hope this post has given readers a sense of what the levers to making it work might look like and how they could be put to good use.

Click here to read a free chapter of Dave McComb’s book, “The Data-Centric Revolution”

Article by John O’Gorman

Connect with the Author

My Path Towards Becoming A Data-Centric Revolution Practitioner

In 1986 I started down a path that, in 2019, has made me a fledgling Data-Centric revolution practitioner. It began with my wife and me founding two micro-businesses in the music and micro-manufacturing industries. In 1998 I put the music business, EARTHTUNES, on hold and sold the other; then I started my Information Technology career. For the last 21 years I’ve covered hardware, software, networks, administration, data architecture and development. I’ve mastered relational and dimensional design, working in small and large environments. But my EARTHTUNES work in 1994 powerfully steered me toward the Data-Centric revolution.

In early 1994 I was working on my eighth, ninth and tenth nature sound albums for my record label EARTHTUNES. (See album cover photos below.) The year before, I had done 7 months’ camping and recording in the Great Smoky Mountains National Park to capture the raw materials for my three albums. (To hear six minutes of my recording from October 24, 1993 at 11:34am, right-click here and select open link in new tab, to download the MP3 and PDF files—my gift to you for your personal use. You may listen while you finish reading below, or anytime you like.)

In my 1993 field work I generated 268 hours of field recordings with 134 field logs. (See below for my hand-written notes from the field log.)

Now, in 1994, I was trying to organize the audio recordings’ metadata so that I could select the best recordings and sequence them according to a story-line across the three albums. So I made an album part subtake form for each take (each few-minutes’ recording) that I thought worthy of going on one of the albums. (See the image of my Album Part Subtake Form, below.)

I organized all the album part subtake forms—all my database metadata entries—and, after months of work, had my mix-down plan for the three albums. In early summer I completed the mix and Macaulay Library of Nature Sound prepared to publish the “Great Smoky Mountains National Park” series: “Winter & Spring;” “Summer & Fall;” and “Storms in the Smokies.”

The act of creating those album part subtake forms was a tipping point towards my becoming a Data-Centric revolution practitioner. In 1994 I started to understand many of the principles defined here and in chapter 2 of Dave McComb’s “The Data-Centric Revolution: Restoring Sanity to Enterprise Information Systems” . Since then I have internalized and started walking them out. The words below are my understandings of the principles, adapted from the Manifesto and McComb’s book.

  • All the many different types of data needed to be included: structured, semi-structured, network-structured and unstructured. Audio recordings and their artifacts, business and reference data, and other associated data together formed my invaluable, curated, inter-generational asset: the only foundation for future work.
  • I knew that I needed to organize my data in an industry-standard, archival, human-readable and machine-readable format so that I could use it across all my future projects, integrate it with external data, and export it into many different formats. Each new project and whatever applications I made or used would depend completely upon this first class-citizen, this curated data store. In contrast, apps, computing devices and networks would be, relative to the curated data, ephemeral second-class citizens.
  • Any information system I built or acquired had to be evolve-able and specialize-able: it needed a reasonable cost of change as my business evolved, and the integration of my data needed to be nearly free.
  • My data was an open resource that must be shareable and that needed to far outlive the initial database application I made. (I knew that a hundred or so years in the future, climate change would alter the flora and fauna of the habitats I had recorded in; this would change the way those habitats sounded. I was convinced that my field observation data, with recordings, needed to be perpetually accessible as a benchmark of how the world had changed.) Whatever systems I used, the data must have its integrity and quality preserved.
  • This meant that my data needed to have its meaning precisely defined in the context of long-living semantic disciplines and technologies. This would enable successive generations (using different applications and systems) to understand and use my lifework, enshrined in the data legacy I left behind.
  • I needed to use low-code/no-code as much as possible; to enable this I wanted the semantic model to be the genesis of the data structures, constraints and presentation layer, being used to generate all or most data structures and app components/apps (model-driven everything). I needed to use established, well-fitting-with-my-domain ontologies, adding only what wasn’t available and allowing local variety in the context of standardization (specialize-able and single but federated). (Same with the apps.)

From 1994 to the present I’ve been seeking the discipline and technology stacks that a handful of architects and developers could use to create this legacy. I think that I have finally found them in the Data-Centric revolution. My remaining path is to develop full competence in the appropriate semantic disciplines and technology stacks, build my business and community and complete my information system artifacts: passing my work to my heirs over the next few decades.

Article By Jonathon R. Storm

Jonathon works as a data architect helping to maintain and improve a Data-Centric information system that is used to build enterprise databases and application code in a Data-Centric company. On weekends, Jonathon continues to record the music of the wilderness; in the next year he plans to get his first EARTHTUNES website online to sell his nature sound recordings. You can email him at [email protected] to order now.
