Blank Nodes in RDF

Getting around the issue of blank nodes in RDF

This article focuses on getting around the issue of blank nodes in RDF. It is an accidental blog and is the work of TerminusDB Discord community member Somethingelseentirely.

For context, we’ll tell you how this article came about. If you want to jump to the main content, click here. It all started with this innocuous question from @Speller

				
					How is TerminusDB better than rest of the dbs?
				
			

To which, TerminusDB CTO Gavin replied…

				
					Different DBs do different things. TerminusDB is the only immutable graph database using a JSON document interface and with a git-like model allowing time-travel, branch merge etc.
				
			

This is where Somethingelseentirely got involved…

				
					Seeing it written out like that, makes me really wonder, how do you canonicalise the triple to create a unique hash for each commit? The whole "RDF is only the triples without a primary serialization format" is really not helpful when it comes to bit exact repeatability
				
			

The conversation continued…

Gavin: ntriple serialisation can be used to produce a rolling hash.

Somthingelseentirely: and then label blank nodes numercially ascending?

Gavin: Skolemize blank nodes

Somethingelseentirely: So no “true” blank nodes in terminus internally? That would be/is super awesome :D!The amount of discussions I had with ontologists and logicians alike, it’s odd that we haven’t seen half a dozen “blank nodes are RDFs 1million dollar mistake” blog posts*😄

Gavin: Haha, I totally concur @somethingelseentirely !
Since I first started using RDF I’ve taken the view that they need to be immediately skolemized
I’ve run into so much awkwardness from them. It was a very bad idea.

Somethingelseentirely: Yeah! The whole idea of “we’ll just solve graph isomorphism on the fly over and over again” is really insane 😂

This is when another community member commented about the topic of skolemization and the fact that there is so much to learn.

Here Somethingelseentirely contributed real value.

Blank Nodes in RDF by Somethingelseentirely

It’s best not to bother with blank nodes if you can. Of course, there are cases where you need to ingest data that contains them, RDF Turtle for example uses them automatically whenever you use anonymous objects with the [ ] syntax (relatively obvious).

				
					@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix ex: <http://example.org/stuff/1.0/> .

<http://www.w3.org/TR/rdf-syntax-grammar>
  dc:title "RDF/XML Syntax Specification (Revised)" ;
  ex:editor [
    ex:fullname "Dave Beckett";
    ex:homePage <http://purl.org/net/dajobe/>
  ] .
				
			

Or whenever you create a collection like an rdf:List with the ( ) syntax (definitely a lot less obvious).

				
					PREFIX : <http://example.org/stuff/1.0/>
:a :b ( "apple" "banana" ) .
				
			

Since RDF collections are essentially linked lists, with additional constraints (e.g. set semantics) you get a blank node for every list node.

If possible I simply forego the whole process of skolemization and use UUIDs directly, which can be embedded into the URI namespace like this urn:uuid:f81d4fae-7dec-11d0-a765-00a0c91e6bf6 as per https://datatracker.ietf.org/doc/html/rfc4122 .

If you want to learn more about blank nodes, and why they cough require special attention cough:

TerminusDB makes this a lot easier with the latest release and provides capabilities to automatically derive subject/entity IDs based on either UUIDs, human-readable auto-incrementing values, or a content-aware hash.

Which is really really nice and straightforward.

My inner purist says that UUIDs are the go-to solution, but the content-aware stuff is a nice touch (if you absolutely know that you’re not gonna expand on an entity in the future).

I personally avoid anything human-readable for entity/subject and attribute/property identifiers, as it only encourages bike shedding amongst ontologists/developers, about the naming of stuff (much better to have an arbitrary constant that everybody can name whatever they want in their codebase), and makes schema migrations harder (can’t reuse a human-readable name, while creating a new random ID is dirt cheap). But that’s just my personal taste and opinion 😄.