Knowledge Graph Schema Design Patterns and Principles

Gavin
March 14, 2023
Graph Database, Technical

Graphs are extremely expressive, which can actually be a bit of a problem if we create them without a well-defined schema. A schema constrains the sorts of links that are allowed, acts as documentation with both human- and machine-readable semantics, and ensures that software receives data in the shape it expects.

While good schema design is very important for relational databases, it’s perhaps even more important and central for knowledge graphs.

Unfortunately, there isn’t a lot of guidance out there on how to do relatively mundane tasks. I’d like to provide at least a bit of assistance based on my experience.

We’re going to take a look at schema design patterns and principles using TerminusCMS, although many of these ideas can be used elsewhere.

Documents

Atoms of data in TerminusDB are represented by fields with a data type. This could be a string, an integer, or a date. These are, in turn, woven into a molecule of data, known as a document.

Let’s look at a Person document so we get a clearer picture.

				
{ "@type" : "Class",
  "@id" : "Person",
  "first_name" : "xsd:string",
  "family_name" : "xsd:string",
  "date_of_birth" : "xsd:dateTime"
}

This person carries a name and a date of birth. In fact, it is, as written, very close to a row record in an RDBMS or a CSV file.
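
To make the row-like shape concrete, here is a minimal sketch in Python of an instance that would fit this class, together with a toy check that every declared property is present. The sample values and the `missing_fields` helper are my own assumptions for illustration; they are not part of TerminusDB, which does its own schema checking.

```python
# The Person class from above, written as a Python dict.
person_schema = {
    "@type": "Class",
    "@id": "Person",
    "first_name": "xsd:string",
    "family_name": "xsd:string",
    "date_of_birth": "xsd:dateTime",
}

# A hypothetical instance document; the values are made up.
jane = {
    "@type": "Person",
    "first_name": "Jane",
    "family_name": "Doe",
    "date_of_birth": "1980-05-17T00:00:00Z",
}


def missing_fields(schema: dict, doc: dict) -> list[str]:
    """Return declared properties (keys not starting with '@') absent from doc."""
    return [k for k in schema if not k.startswith("@") and k not in doc]


print(missing_fields(person_schema, jane))  # prints []
```

Each property of the class lines up with one column of the equivalent table row, which is why a flat document like this feels so familiar.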

To make things interesting, however, we can add some additional links.

				
{ "@type" : "Class",
  "@id" : "Person",
  "first_name" : "xsd:string",
  "family_name" : "xsd:string",
  "date_of_birth" : "xsd:dateTime",
  "friends" : { "@type" : "Set", "@class" : "Person" }
}

Now we can also add links to friends of the person in question. This is the sort of data structure you might use for a social network, or even perhaps a Rolodex-type application.

This is the simplest sort of modeling that we can do – where you have a number of data properties, and a number of links to other documents, all bundled conveniently in your document.
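
As a rough sketch of what links buy us, the Python below keeps a few Person documents in a plain dict and follows the friends links two hops out. The ids (`Person/alice` and so on) and the `friends_of_friends` helper are hypothetical, chosen only for illustration; a real query would run against the database rather than an in-memory dict.

```python
# Hypothetical instance documents keyed by id. "friends" holds the ids of
# other Person documents, mirroring the Set of links in the schema.
people = {
    "Person/alice": {"first_name": "Alice", "friends": ["Person/bob"]},
    "Person/bob": {"first_name": "Bob", "friends": ["Person/alice", "Person/carol"]},
    "Person/carol": {"first_name": "Carol", "friends": []},
}


def friends_of_friends(docs: dict, person_id: str) -> set[str]:
    """Ids two friend hops away that are neither the person nor a direct friend."""
    direct = set(docs[person_id]["friends"])
    second = {f2 for f in direct for f2 in docs[f]["friends"]}
    return second - direct - {person_id}


print(friends_of_friends(people, "Person/alice"))  # prints {'Person/carol'}
```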

Subdocuments

However, sometimes you want an internal structure in your document that is not just an atom of data, but is intrinsically related to this specific object rather than being a link to another object.

The most common variety of this type of object is data which is somehow annotated with additional structure. For instance, we might want to have a data point that is time scoped, has a specific source, or perhaps has a unit.

				
{ "@type" : "Enum",
  "@id" : "Unit",
  "@value" : [ "meters", "kilograms" ] }

{ "@type" : "Class",
  "@id" : "UnitValue",
  "@subdocument" : [],
  "value" : "xsd:decimal",
  "unit" : "Unit" }

It doesn’t make sense for this value to just float around by itself, but it might be useful in the context of a specific object, for instance, the height of a person.

The "@subdocument" : [] specifies that this class is a subdocument class. It will be entirely owned by the containing class; nobody else will be allowed to point to it, and it will always come back as a fully expanded JSON document when we search for the containing document.

				
{ "@type" : "Class",
  "@id" : "Person",
  "first_name" : "xsd:string",
  "family_name" : "xsd:string",
  "date_of_birth" : "xsd:dateTime",
  "friends" : { "@type" : "Set", "@class" : "Person" },
  "height" : "UnitValue",
  "weight" : "UnitValue"
}

You might notice that height and weight are both UnitValue, but are not necessarily of the right unit! We are currently in the process of adding restrictions that will allow such constraints also to be described, but that’s for another blog post! 😀
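
Until such restrictions exist in the schema, one option is to check units in application code. The Python below is only a sketch of that idea: the `EXPECTED_UNITS` mapping, the `unit_errors` helper, and the sample values are all assumptions of mine, not TerminusDB features.

```python
# Which unit each UnitValue property is expected to carry. This mapping is
# an application-level assumption; the schema above does not enforce it.
EXPECTED_UNITS = {"height": "meters", "weight": "kilograms"}

# A hypothetical Person instance with its subdocuments expanded inline.
person = {
    "@type": "Person",
    "first_name": "Jane",
    "family_name": "Doe",
    "height": {"@type": "UnitValue", "value": 1.70, "unit": "meters"},
    "weight": {"@type": "UnitValue", "value": 65, "unit": "meters"},  # wrong unit
}


def unit_errors(doc: dict) -> list[str]:
    """List the properties whose unit differs from the expected one."""
    return [
        prop
        for prop, unit in EXPECTED_UNITS.items()
        if prop in doc and doc[prop]["unit"] != unit
    ]


print(unit_errors(person))  # prints ['weight']
```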

Relationships

Not all relationships can be reduced to a simple link. However, it is often possible to represent them with a subdocument, adding the auxiliary information in a way similar to the way we adorned the base-type xsd:decimal with a unit.

If you have a complex relationship, it often makes sense to lift it up as a first-class object itself.

For instance, suppose we want to represent a shareholding relationship. We can do this as follows:

				
{ "@type" : "Class",
  "@id" : "Shareholder",
  "name" : "xsd:string" }
{ "@type" : "Class",
  "@id" : "Company",
  "@inherits" : "Shareholder" }
{ "@type" : "Class",
  "@id" : "Person",
  "@inherits" : "Shareholder" }
{ "@type" : "Class",
  "@id" : "Shareholding",
  "quantity" : "xsd:decimal",
  "shares_in" : "Company",
  "held_by" : "Shareholder",
  "from" : "xsd:date",
  "to" : { "@type" : "Optional", "@class" : "xsd:date" }}

Our Shareholding relationship has two different links, one of which is the company in which shares are held, and the other is the shareholder, which could be either a person or a company. We have also adorned the object with a quantity and the period over which the shares were held.
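
Because the relationship is an object in its own right, it can be grouped and aggregated like any other document. Here is a small Python sketch under that assumption; the ids and quantities are made up, the from and to fields are left out for brevity, and `holdings_by_company` is a hypothetical helper rather than anything TerminusDB provides.

```python
from collections import defaultdict

# Hypothetical Shareholding instances. The holder can be a Person or a
# Company, since both inherit from Shareholder.
shareholdings = [
    {"@type": "Shareholding", "held_by": "Person/alice",
     "shares_in": "Company/acme", "quantity": 100},
    {"@type": "Shareholding", "held_by": "Company/holdco",
     "shares_in": "Company/acme", "quantity": 250},
    {"@type": "Shareholding", "held_by": "Person/alice",
     "shares_in": "Company/holdco", "quantity": 40},
]


def holdings_by_company(records: list[dict]) -> dict:
    """Group quantities as {company: {shareholder: total quantity}}."""
    result = defaultdict(dict)
    for r in records:
        held = result[r["shares_in"]]
        held[r["held_by"]] = held.get(r["held_by"], 0) + r["quantity"]
    return dict(result)


print(holdings_by_company(shareholdings)["Company/acme"])
```

A plain edge from shareholder to company could not carry the quantity at all, which is the practical reason for promoting the relationship to a document.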

This sort of first-class relationship link can be expanded to deal also with hypergraphs where there are more than two objects in the relationship (a receivership is such a relationship).

Mixins: Multiple Inheritance for Aspect

Multiple inheritance is a very powerful tool in programming languages, but in data, it arguably works even better. And mixins (some traits which are mixed-in) are one of the ways that you can get re-use out of your data modeling.

A few examples of cross-cutting aspects of data modeling have come up repeatedly in my modeling experience. These include space, time, provenance, and units.

Temporal Scope

The Shareholding example above used a temporal component, but this could also be pulled out as a mixin which can be used elsewhere.

				
{ "@type" : "Class",
  "@id" : "TemporalScope",
  "from" : "xsd:date",
  "to" : { "@type" : "Optional", "@class" : "xsd:date" }
}

The from date is given as required for something temporally scoped, but the to date is left as optional, in order to model scoping which has not yet ceased. Of course, you might not always want this, but it’s often a very useful approach.
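
The open-ended to date is easy to handle in code. The following Python sketch tests whether a temporally scoped object is in force on a given day; the `is_active` name is hypothetical, and treating the to date as inclusive is my assumption, since the schema itself does not say.

```python
from datetime import date
from typing import Optional


def is_active(start: date, end: Optional[date], on: date) -> bool:
    """True if `on` lies in [start, end]; a missing end means still ongoing."""
    return start <= on and (end is None or on <= end)


# A scope that ended, and one that has not yet ceased.
print(is_active(date(2020, 1, 1), date(2021, 6, 30), date(2022, 1, 1)))  # False
print(is_active(date(2021, 1, 1), None, date(2022, 1, 1)))               # True
```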

We might also have an event that simply happens at a time:

				
{ "@type" : "Class",
  "@id" : "Event",
  "at" : "xsd:date"
}

Spatial Scope

We can also refer to a geometry to add spatial scope to our objects by way of inheritance. The mixin for spatial scope might look like this:

				
{ "@type" : "Class",
  "@id" : "GeographicScope",
  "geometry" : "Geometry" }

Here, Geometry refers to the Geometry class from GeoJSON.

Provenance

It’s very common to have a resource that has a source that needs to be recorded to understand how we have come to know something. This is typical when we obtain a resource from, for instance, a website.

In this case, we might have an object that inherits from both Event and Source:

				
{ "@type" : "Class",
  "@id" : "Source",
  "source" : "xsd:anyURI"
}
{ "@type" : "Class",
  "@id" : "WebScrape",
  "@inherits" : ["Event", "Source"],
  "page" : "xsd:string"
}
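
If you are more used to mixins in code than in data, the same shape can be sketched with Python dataclasses. This is only an analogy to the schema above, not how TerminusDB represents classes internally, and the sample values are made up.

```python
from dataclasses import dataclass, fields
from datetime import date


@dataclass
class Event:
    at: date


@dataclass
class Source:
    source: str


@dataclass
class WebScrape(Event, Source):
    # Inherits `at` from Event and `source` from Source, then adds its own field.
    page: str


scrape = WebScrape(at=date(2023, 3, 14), source="https://example.com/", page="<html></html>")
print(sorted(f.name for f in fields(WebScrape)))  # prints ['at', 'page', 'source']
```

Code written against Event or Source alone keeps working when handed a WebScrape, which is the re-use that mixins are meant to deliver.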

Collections

Collections in a graph can be modeled in many different ways. TerminusDB implements three different methods to try and simplify things for modeling, yet it’s important to understand the distinctions between these three methods: Set, List, and Array.

Set

The Set is the simplest of the three, as it has no order, and is really just an edge with a multiplicity greater than one. In the graph, a set with three elements looks as follows:

				
    ∘
  ↗
∘ → ∘
  ↘
    ∘
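
Read as triples, this picture is just the same subject and predicate repeated with different objects. The Python sketch below makes that explicit; the triples and the `set_values` helper are illustrative assumptions and not a description of TerminusDB's storage format.

```python
# Hypothetical triples of the form (subject, predicate, object).
triples = [
    ("Person/alice", "friends", "Person/bob"),
    ("Person/alice", "friends", "Person/carol"),
    ("Person/alice", "friends", "Person/dave"),
    ("Person/alice", "first_name", "Alice"),
]


def set_values(graph: list[tuple], subject: str, predicate: str) -> set:
    """Collect every object reachable from subject over the given predicate."""
    return {o for s, p, o in graph if s == subject and p == predicate}


print(len(set_values(triples, "Person/alice", "friends")))  # prints 3
```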

Array

The Array is a more complicated object, which encodes an index, giving order, and enables a few additional features which differentiate it from Sets and Lists.

				
       v0
 value↗
    ∘ idx→ 0
   ╱
  ╱    v1
 ╱    ↗
∘ → ∘ idx→ 1
  ↘
    ∘ → v2
   idx↘
        2

Each value element of the array has an additional (hidden) indirection object with an index (or multiple indexes for multidimensional arrays).

This allows us to have not only order, but multiple dimensions, and we can represent gaps. When returning the values in JSON, we will get back a multidimensional array with null fields for regions that are not filled. However, the gaps themselves are not represented at all in the database.
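
To illustrate how gaps can show up as nulls without being stored, here is a one-dimensional Python sketch. The node layout (`index` and `value` keys) and the `decode_array` helper are assumptions for illustration, not TerminusDB's actual internal field names.

```python
# Each entry stands for one indirection node: an index plus a value.
# Index 2 is deliberately absent to show a gap.
array_nodes = [
    {"index": 0, "value": "v0"},
    {"index": 3, "value": "v3"},
    {"index": 1, "value": "v1"},
]


def decode_array(nodes: list[dict]) -> list:
    """Rebuild a dense list from index/value nodes, using None for gaps."""
    if not nodes:
        return []
    out = [None] * (max(n["index"] for n in nodes) + 1)
    for n in nodes:
        out[n["index"]] = n["value"]
    return out


print(decode_array(array_nodes))  # prints ['v0', 'v1', None, 'v3']
```

Note that the order of the stored nodes does not matter, since each one carries its own index.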

List

The List is lifted directly from rdf:List and uses the same fields as described in RDF, namely rdf:first and rdf:rest.

The list structure for a three-element list looks as follows:

				
∘ → ∘ rest→ ∘ rest→ ∘ rest→ rdf:nil
    ↓ first ↓ first ↓ first
    v0      v1      v2

The linked-list style structure has potential technical advantages. You can insert anywhere in the list without having to reindex everything after the given element. However, you also have to traverse long chains in the graph to decode a list.
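
Both sides of that trade-off fit in a few lines of Python. In this sketch the cons cells live in a dict keyed by made-up node ids, and `decode_list` and `insert_after` are hypothetical helpers; the point is only that an insert touches one existing rest pointer, while decoding has to walk the whole chain.

```python
NIL = "rdf:nil"

# Hypothetical cons cells keyed by node id, mirroring rdf:first and rdf:rest.
cells = {
    "c0": {"first": "v0", "rest": "c1"},
    "c1": {"first": "v1", "rest": "c2"},
    "c2": {"first": "v2", "rest": NIL},
}


def decode_list(cells: dict, head: str) -> list:
    """Walk the rest pointers from head until rdf:nil, collecting first values."""
    out = []
    node = head
    while node != NIL:
        out.append(cells[node]["first"])
        node = cells[node]["rest"]
    return out


def insert_after(cells: dict, node: str, new_id: str, value) -> None:
    """Splice a new cell after `node`; only one existing rest pointer changes."""
    cells[new_id] = {"first": value, "rest": cells[node]["rest"]}
    cells[node]["rest"] = new_id


insert_after(cells, "c0", "cx", "v0.5")
print(decode_list(cells, "c0"))  # prints ['v0', 'v0.5', 'v1', 'v2']
```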

Conclusion

I’m always on the lookout for patterns and approaches that can make modeling a more pleasant experience, and even more importantly, make it easy to manipulate and discover data once it is modeled. If you have other interesting ideas, join our Discord and give us a shout!