Taxonomies, Ontologies And Machine Learning: The Future Of Knowledge Management


Kurt Cagle 2019

As an ontologist, I'm often asked about the distinctions between taxonomies and ontologies, and whether ontologies are replacing taxonomies.  The second question is easy to answer: "No." Both taxonomies and ontologies serve vital, and often complementary, roles ... if they are used right.

A taxonomy is, to put it simply, a categorization scheme. Most readers should be familiar with a few critical taxonomies, such as the Linnaean taxonomy used to represent how animals are related to one another, and the Dewey Decimal System for libraries, which represents subject areas of interest. Others are more subtle. For instance, the Pantone Matching System (PMS) is a commercial system used to identify specific swatches of color with a name and a code. Military organizations have ranks with names and designations that indicate not only experience but also authority, such as a Colonel (O-6 in the US Army or Air Force) or a Chief Petty Officer (E-7 in the US Navy or Coast Guard).

Taxonomies, in this case, identify specific names, definitions and code designations, but often also have a (usually implied) ordering system. Frequently, such ordering is rubric (or subject-matter) oriented, so that everything is contained within a hierarchy that becomes more specialized as you move toward the leaves and more generalized as you move toward the root. Hierarchies are also contextually inclusive: if you identify a given resource as being associated with a term in a hierarchy, this also implies that the resource is part of the broader categories. For example, if my cat "Bright Eyes" is identified as a cat, it is also considered to be a member of the cat family, a mammal, a chordate (it has a backbone) and an animal, respectively.

Most taxonomies attempt to ensure that for any given resource there is one and only one bucket (classification) that the resource can fit into at the leaf level. A cat, for instance, cannot also be a dog. That does not mean that, at a more general level, both don't share a common rubric. Cats and dogs, for instance, are both carnivores in the Linnaean system, which hints that categorization is usually not quite as cut and dried as it would seem at first glance. Indeed, that particular system actually has different strata (ranks) of comparison, and as such represents several different but interrelated classification vocabularies. Both cat and dog (Felis domesticus and Canis familiaris, respectively) are species, while Carnivora, to which they both belong, is an order. The idea that the same resource can have two categories apply to it holds because of a set of relationships:

<Bright Eyes> is a member of the species Felis domesticus (cat).

<Bright Eyes> is a member of the Order Carnivora.

Felis domesticus is a subClass of Carnivora.

or

Felis domesticus is a narrower concept than Carnivora.
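These relationships are straightforward to mechanize. The sketch below (plain Python, with names following the example above; "Mammalia" is added for depth) treats the assertions as (subject, predicate, object) triples and checks membership transitively, which is exactly the contextual inclusion described earlier:

```python
# Minimal sketch: the taxonomy assertions above as (subject, predicate, object)
# triples. The names mirror the article's example; "Mammalia" is added for depth.
triples = {
    ("Bright Eyes", "memberOf", "Felis domesticus"),
    ("Felis domesticus", "subClassOf", "Carnivora"),
    ("Carnivora", "subClassOf", "Mammalia"),
}

def member_of(individual, category, triples):
    """True if `individual` belongs to `category`, directly or via subClassOf."""
    frontier = {o for s, p, o in triples
                if s == individual and p == "memberOf"}
    seen = set()
    while frontier:
        cls = frontier.pop()
        if cls == category:
            return True
        seen.add(cls)
        frontier |= {o for s, p, o in triples
                     if s == cls and p == "subClassOf"} - seen
    return False

# Contextual inclusion: Bright Eyes is a cat, hence also a carnivore.
print(member_of("Bright Eyes", "Carnivora", triples))  # True
```

A triple store such as an RDF graph does the same walk with rdf:type and rdfs:subClassOf; the point here is only that hierarchy membership is a transitive closure.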

Now, when Linnaeus first created his system (which has since changed dramatically, by the way) in the eighteenth century, he was looking for a way of creating a better understanding of biology by grouping animals with related traits. For instance, one trait that he used was whether an animal ate meat primarily, ate plant matter primarily, or ate both. Another trait was the primary shape of the set of teeth, while another related to whether they were warm-blooded (endothermic) or cold-blooded (ectothermic). These traits were primarily phenotype expressions, and because evolution was still a century in the future when Linnaeus created his taxonomy, he didn't have the language to talk about convergent or divergent evolution.

In an exclusionary taxonomy, traits are essentially inherited, and the deeper the rank, the more distinct traits (or facets) are inherited. In principle, this means that it should be possible to specify any location (any context) in a taxonomy by setting each facet either to a specific enumerated instance from a set or to a flag indicating that the facet is unspecified. For instance, if an entity is an animal, it consumes oxygen and produces carbon dioxide. There is no indication about an animal being ambulatory (able to move), having fur, or liking rock music. Certainly, an animal may have any of those characteristics, but from the standpoint of the classification system, the facet categories of movement, skin covering or musical taste are simply not relevant to the Animal category.

If you specify an animal that has a spinal cord (a chordate), that significantly reduces the number of classifications that are potentially relevant but have not yet been specified (i.e., it eliminates insects, shellfish, arthropods and so forth). Every time that you specify a facet value for a given facet, you are eliminating everything that isn't that facet value. From a hierarchical standpoint, you've moved your context so that you only consider those items that are at that node or branchward, while the path back to the root is clearly fixed. Note that this introduces an interesting characteristic of tree paths, however - the path back to the root is not only fixed in terms of the facets being constrained but is also fixed in the order in which these facets are traversed. Put another way, the taxonomy is intrinsically directional.
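This narrowing process can be sketched in a few lines. The facet table below is entirely illustrative (invented facet names and values), but it shows how each specified facet value eliminates every candidate that doesn't carry it:

```python
# Hypothetical facet table: each candidate class described by facet values.
candidates = {
    "cat":    {"body_plan": "chordate", "diet": "carnivore", "skin": "fur"},
    "dog":    {"body_plan": "chordate", "diet": "carnivore", "skin": "fur"},
    "snake":  {"body_plan": "chordate", "diet": "carnivore", "skin": "scales"},
    "beetle": {"body_plan": "arthropod", "diet": "herbivore", "skin": "chitin"},
}

def narrow(candidates, **facets):
    """Keep only the candidates whose facet values match every specified facet."""
    return {name for name, traits in candidates.items()
            if all(traits.get(f) == v for f, v in facets.items())}

print(narrow(candidates, body_plan="chordate"))              # drops beetle
print(narrow(candidates, body_plan="chordate", skin="fur"))  # drops snake too
```

Each additional keyword argument plays the role of specifying one more facet on the path from root to leaf.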

Suppose, however, that instead you asked a group of a couple of hundred (or thousand, or however many) people to list between five and ten characteristics that describe a given animal, choosing their own terms and adding new terms when a term didn't otherwise exist. This is what's called an open taxonomy, although it's probably better known under the term folksonomy. A cat may be tagged as:

furry, small, teeth, white, gray, orange, blue, striped, tortoiseshell, tail, cute, smart, solitary, warm, purr, meat eater, meow, mammal, lion, puma, domesticated, hiss  ...

A dog may be tagged as:

furry, fuzzy, small, medium, big, teeth, white, brown, spots, yip, bark, meat eater, yelp, tongue, bite, cute, scary, pack animal, man's best friend, warm, wolf, domesticated, playful, dangerous, nose, bright, dumb ...

This may seem like an odd way to model, but bear with me. It really is not that different from what Linnaeus came up with, and for the most part, such folksonomy terms lend themselves to facet groupings:

skin-covering: furry, fuzzy

size: small, medium, big

color: white, brown, gray, orange, blue

color-pattern: striped, tortoiseshell

vocalization: purr, meow, yelp, yip, bark, hiss

notable-anatomy: tail, teeth, tongue, nose

intelligence: smart, bright, dumb

related-animals: lion, puma, wolf

food-preference: meat-eater

behavior: solitary, playful, dangerous, man's best friend, pack animal, bite

evocation: cute, scary

temperature: warm

known-classifications: mammal

anthropocentrism: domesticated

What's perhaps most notable about these facets is that they are more or less orthogonal to one another. Put another way, such facets generally don't overlap, though a given facet (such as size) may have more than one term that is applicable (small, medium, large), depending upon the particular instance involved. Note also that there is a certain subjectiveness to the facet terms - a tarantula is large for a spider but is far smaller than a cat.

The key here is that the facets have qualified the terms. What that means is that you can make an assertion: my pet has the facet term vocalization:hiss. That one facet term, within the limits of the current taxonomy, would by itself suggest that the entity it describes is a cat. However, a snake also has a vocalization:hiss, a notable-anatomy:tail and a food-preference:meat-eater. On the other hand, it also has skin-covering:scales and temperature:cold.

In this regard, this kind of folksonomy "modeling" can be thought of as metaphoric - the more two entities overlap in terms of their facets, the more metaphorically similar they are. We also, in our natural language, tend to place a higher emphasis on some facets than on others. For instance, anthropocentrism actually figures fairly large in most linguistic models: both cats and dogs are domesticated. Dogs and wolves are biologically much closer than dogs and cats, but we seldom talk about cats and wolves (though of course, there's always lions and tigers and bears, oh my!).
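That notion of metaphoric similarity can be made concrete with a set-overlap measure such as the Jaccard index. The tag sets below are abbreviated from the lists above, and the snake tags are illustrative:

```python
# Abbreviated folksonomy tag sets; snake's tags are illustrative additions.
cat = {"furry", "small", "teeth", "tail", "meat eater", "warm",
       "domesticated", "purr"}
dog = {"furry", "small", "teeth", "meat eater", "warm",
       "domesticated", "bark"}
snake = {"scales", "cold", "teeth", "tail", "meat eater", "hiss"}

def jaccard(a, b):
    """Overlap of two tag sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

# Cats overlap dogs far more than they overlap snakes.
print(jaccard(cat, dog))    # 6 shared of 9 total terms
print(jaccard(cat, snake))  # 3 shared of 11 total terms
```

The more coincident tags two entities carry, the higher the score, which is exactly the "metaphoric closeness" the paragraph above describes.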

My argument is that the same thing that applies to descriptions about entities also applies to discussions of attributes. A car most famously has a year, make, model, and trim (or variant). Each of these (including the year) can be thought of as facets, with associated facet terms:

year: 2019

make: Subaru

model: Forester

trim: LLB

Let's say that I wanted to identify model as a facet, using a folksonomy. The definition may look something like:

brand, line, product line, production line, vehicle, make, plant, trim ...

Let's say that I had a different database that talked about a line rather than a model. It likely could be identified similarly:

brand, model, product line, production line, vehicle, make, plant, trim ...

Metaphorically, you can reasonably assert that the likelihood that model and line are in fact the same property increases as the number of coincident facet terms rises and the number of exclusive facet terms falls. Indeed, this is also a way that you can think about semantics - line in the context of a product is very different from line from a geometry standpoint, and this would be very evident from the fact that, even factoring in synonyms, there is likely no overlap between the set of terms defining one as compared to the other.
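A minimal sketch of that matching rule, using the descriptor sets above (the geometric-line descriptors are invented for contrast): coincident terms raise the score, exclusive terms lower it.

```python
# Descriptor sets for three properties. "model" and "line" follow the article's
# example; the geometric "line" descriptors are illustrative.
model_auto = {"brand", "line", "product line", "production line",
              "vehicle", "make", "plant", "trim"}
line_auto = {"brand", "model", "product line", "production line",
             "vehicle", "make", "plant", "trim"}
line_geom = {"segment", "point", "slope", "length", "curve", "axis"}

def match_score(a, b):
    """Coincident terms (a & b) raise the score; exclusive terms (a ^ b) lower it.
    Ranges from 1.0 (identical sets) down to -1.0 (no overlap at all)."""
    return (len(a & b) - len(a ^ b)) / len(a | b)

print(match_score(model_auto, line_auto))  # positive: likely the same property
print(match_score(line_auto, line_geom))   # -1.0: a different sense of "line"
```

The two senses of "line" score at opposite ends of the scale, which is the semantic disambiguation the paragraph describes.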

There are two schools of thought about whether it is better to have a limited and closed taxonomy or an open taxonomy as the basis for modeling. The closed-model approach, in essence, is a fully reductionist approach - every concept can be broken down into a (small and manageable) set of universal core terms. More complex concepts can then be modeled either by adding additional orthogonal facets, by constraining more terms within open facet term sets, or by the composition of two or more existing concepts. As mentioned before, this is fully decomposable - if you have compositions of multiple terms, each of these can be broken down and duplicates can then be eliminated.

There are advantages to this approach. You can apply inheritance more readily, and computationally, it becomes easier to determine the metaphorical similarity between two different entities or attributes (you can even argue that in this particular way of thinking, you don't really need attributes at all).

The disadvantages to this approach, however, are also worth noting. The biggest is that it places a significant burden on the curators to use only those primitives and to describe everything in terms of those decomposable terms. This can lead to incredible contortions when it comes to describing things and requires that every query or analysis is preceded by some kind of decompositional analysis.

The second school of thought is to go with an open taxonomy and a larger number of taxonomists working with a suggestive rather than required ontology, but needing considerably less effort to do the classifications. Entity and attribute tagging can be accomplished at ingestion time, but it can also be enhanced by others working with the metadata. This is, in fact, the approach that most CMS systems currently employ for their content, requiring that an author or editor add enough tags to highlight key article concepts. This can also be enhanced by entity extraction algorithms (such as those employed by Smartlogic) that find the most relevant tags either from the text itself or through inference against a given lexicon of concepts.

This often requires the use of the taxonomist in a different role, one where they work with the folksonomy itself to identify groups of tags that together act as facets. It should also be noted that this can (and likely will) be accomplished through clustering algorithms that identify correlated groups of terms that can be decomposed into non-overlapping dimensions (which is what a facet really is). To get technical, each facet is a functional eigenstate that represents an orthogonal dimension of analysis.
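One simple (and admittedly noisy) heuristic for that clustering step: terms that each occur in the corpus but never co-occur on the same item - like small, medium and big - are candidates for a single mutually exclusive facet. The item tag sets below are illustrative:

```python
from collections import Counter
from itertools import combinations

# Toy sketch of facet discovery. Each item is one tagged entity; the tag sets
# are invented for illustration, not drawn from a real folksonomy corpus.
items = [
    {"furry", "small", "purr"},
    {"furry", "medium", "bark"},
    {"fuzzy", "big", "bark"},
    {"furry", "small", "meow"},
]

# Count how often each pair of tags appears together on the same item.
co = Counter()
for tags in items:
    for pair in combinations(sorted(tags), 2):
        co[pair] += 1

# Pairs that never co-occur are candidates for the same mutually exclusive
# facet (an entity is small OR medium OR big, not several at once).
vocab = sorted(set().union(*items))
same_facet_candidates = {
    frozenset(p) for p in combinations(vocab, 2) if co[p] == 0
}
```

Real clustering pipelines would weigh frequencies and filter out the many spurious never-together pairs a small corpus produces; this sketch only shows the mutual-exclusion signal a taxonomist (or an algorithm) looks for.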

For those like me whose math classes were a long time ago in a galaxy far, far away, the role of the taxonomist when dealing with an open taxonomy is to ensure that synonyms are identified to keep the number of terms manageable, that terms are organized into facet groups, and that constraint modeling (like saying that the model of a car is constrained by the make of that car) takes place. Machine learning can help to reduce the overall workload there considerably, but curating the taxonomy still requires a certain human hand even so, albeit far less than would be required with a closed taxonomy model.

One additional advantage that comes with the open taxonomy model is that it is easier to create consistent breakdowns for the properties that define most complex composites. For instance, consider the following breakdown of a property pulled from a data stream containing a column about estimated revenue in the third quarter of 2018:

[Figure: facet-term breakdown of the estimated Q3 2018 revenue property. Kurt Cagle 2019]

This contains a huge amount of metadata - the time period on which the particular column was focused, what currency units were used, whether the numbers were confirmed or only estimated - as well as composite concepts made from the arbitrary decomposition of simple concepts. Various combinations of terms can also provide higher-order concepts that can enrich the set, with each term better able to clarify its context. In theory, this should be decomposable to a small subset, but the benefits to be achieved by that decomposition are dubious, at best.

Having said that, neither approach is superior in all cases. Working from a closed core ontology usually gives a more consistent mechanism for matching, but it also requires more discipline (and the right tools) to build a more expansive set of concepts. Using a folksonomy, on the other hand, is less precise and controlled, but works better when you have fewer (or less experienced) taxonomists working with the data (or when you're using machine learning). Folksonomies also tend to surface relationships that aren't necessarily what you're expecting, though the flip side to that is that this approach can also challenge your assumptions about the data that's being modeled, letting the results emerge as a more organic ontology.

One final question should be clarified here: How do these taxonomies fit into the broader question of ontologies? Categorization (faceting) makes up a huge percentage of the total number of attributes in a model. Any time you have text that repeats in a column, you are likely looking at a category that could be expanded as nodes in a network, and the argument can be made that even dates and other vectors can be normalized as buckets (this is precisely the point where semantics meets machine learning). For instance, most cars have the concept of seating, which is a numeric value that really indicates the number of "seat belt sets" available. Semantically, six seats expressed as:

"6"^^unit:_SeatingCapacity

is effectively a facet term. It could be replaced with a labeled term (such as "six seats"), is bounded, and is reasonably finite. In the broader picture, this also highlights something that often gets lost with semantics: literals are also objects - they have a specific type, can be bound as simple types, and can even appear as a subject in their own right.

This is all well and good, but what does it have to do with taxonomies? Many schema languages (such as XML's XSD language) have the notion of an enumeration - a sequence of labels that describe different states for a given facet. Some of those are roles (classifications of medical specialists, such as Pediatrician or Oncologist), some are types (Technology vs. Administration vs. Marketing), some may be geographic regions (Seattle has the neighborhoods of Capitol Hill, Wallingford, Green Lake, the International District and so forth). The role of a taxonomist is to determine the conceptual buckets used in that classification process, in essence by defining these enumerations. A taxonomist is, effectively, putting together a dictionary. The ontologist establishes the relevant form that the dictionary entries take (and how they connect to one another), but the taxonomist is the one who determines the buckets. In practice, an ontologist almost always does a certain amount of taxonomy work and a taxonomist often works out models, so the two roles do overlap to a significant degree, but an ontologist is usually someone with a stronger computer science orientation.
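An enumeration of this kind is easy to sketch. The example below uses Python's Enum with the Seattle neighborhoods mentioned above; the classify helper is hypothetical, standing in for whatever lookup a real schema or pipeline would use:

```python
from enum import Enum

# A taxonomist's "dictionary" for one facet, sketched as an enumeration.
# The neighborhood names follow the article's Seattle example.
class SeattleNeighborhood(Enum):
    CAPITOL_HILL = "Capitol Hill"
    WALLINGFORD = "Wallingford"
    GREEN_LAKE = "Green Lake"
    INTERNATIONAL_DISTRICT = "International District"
    OTHER = "Other"  # the bucket a good taxonomy keeps as small as possible

def classify(label):
    """Map a free-text label into the closed set of buckets (hypothetical helper)."""
    for member in SeattleNeighborhood:
        if member.value.lower() == label.strip().lower():
            return member
    return SeattleNeighborhood.OTHER

print(classify("wallingford"))  # SeattleNeighborhood.WALLINGFORD
print(classify("Ballard"))      # SeattleNeighborhood.OTHER
```

The enumeration is the taxonomist's contribution; how instances carrying these values relate to other entities is the ontologist's.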

One challenge that the taxonomist faces comes in identifying what makes for the best buckets - or, put another way, how to define categories so that three objectives are met: you have the fewest number of category terms necessary to define a given category, the categories overlap as little as possible (preferably not at all, a condition known as orthogonality), and you have as few items as possible in the bucket labeled "other". If this sounds like a mathematical problem, it is, and it is one of the reasons why machine learning techniques are beginning to be used as an integral part of semantics.

Machine Learning is something of a catch-all term for a number of different but related mathematical techniques pulled from data science. Classification, in general, is fuzzy, especially in the realms of perception, biology, psychology, and similar fields. Biologists face this problem all the time, for instance, when dealing with species. Distinguishing a dog from a cat is easy. Distinguishing a dog from a wolf or a coyote is considerably harder. Indeed, in the Northeast United States and Canada, there have been a number of families of animals found in the wild that genetically have feral dog, wolf and coyote in them, despite the fact that each of these is often treated as a distinct species.

With the increasing use of genomics to determine biological categorization, the decisions increasingly come down to the use of clustering, often in a higher-dimensional space. If that sounds scary, well, it is. However, a good way of thinking about such clusters is that related species are likely to share a number of genes in common (where "number" here can be well into the millions). These traits, though, exist within a sparse matrix. This means that while there may be a large number of potential configurations of genes, in practice only a few of those configurations actually have something in them. This kind of statistical analysis makes it possible to ascertain these clusters, and from there to determine the "dimensions" that make up the broader categories.
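A toy version of that clustering, with tiny trait sets standing in for the (vastly larger, sparse) gene matrices, might look like this greedy single-link grouping; the organisms, trait labels and threshold are all illustrative:

```python
# Illustrative sketch: group organisms whose shared-trait overlap exceeds a
# threshold. The trait sets (g1, g2, ...) stand in for sparse gene matrices.
organisms = {
    "dog":    {"g1", "g2", "g3", "g4"},
    "wolf":   {"g1", "g2", "g3", "g5"},
    "coyote": {"g1", "g2", "g4", "g5"},
    "cat":    {"g1", "g6", "g7", "g8"},
}

def cluster(organisms, threshold=0.4):
    """Greedy single-link clustering on Jaccard overlap of trait sets."""
    clusters = []
    for name, traits in organisms.items():
        for group in clusters:
            # Join the first existing cluster containing a close-enough member.
            if any(len(traits & organisms[m]) / len(traits | organisms[m])
                   >= threshold for m in group):
                group.append(name)
                break
        else:
            clusters.append([name])  # no close neighbor: start a new cluster
    return clusters

print(cluster(organisms))  # dog/wolf/coyote group together; cat stands apart
```

Real genomic clustering uses far more robust methods (and dimensionality reduction to find the underlying "dimensions"), but the shape of the computation is the same: similarity in a sparse trait space drives the grouping.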

Summary

What this hints at is that machine learning, and the kind of data analysis that machine learning draws on for recommendation engines, is becoming yet another tool in both the ontologist's and the taxonomist's toolbox. This doesn't mean that ML will replace semantics (or vice versa), but it does mean that organizations that depend only upon machine learning are missing out on the potential to use the fruits of that learning to better organize their information space. Over time, the distinctions between machine learning and semantics should end up disappearing - they are both simply tools for managing the metadata associated with the data that flows through every organization and domain.

 
