Blank Nodes in RDF

Getting around the issue of blank nodes in RDF

This article focuses on getting around the issue of blank nodes in RDF. It is an accidental blog and is the work of TerminusDB Discord community member Somethingelseentirely.

For context, here is how this article came about. It all started with this innocuous question from @Speller:

How is TerminusDB better than rest of the dbs?


To which, TerminusDB CTO Gavin replied…

Different DBs do different things. TerminusDB is the only immutable graph database using a JSON document interface and with a git-like model allowing time-travel, branch merge etc.


This is where Somethingelseentirely got involved…

Seeing it written out like that, makes me really wonder, how do you canonicalise the triple to create a unique hash for each commit? The whole "RDF is only the triples without a primary serialization format" is really not helpful when it comes to bit exact repeatability


The conversation continued…

Gavin: ntriple serialisation can be used to produce a rolling hash.

Somethingelseentirely: and then label blank nodes numerically ascending?

Gavin: Skolemize blank nodes

Somethingelseentirely: So no “true” blank nodes in terminus internally? That would be/is super awesome :D! The amount of discussions I had with ontologists and logicians alike, it’s odd that we haven’t seen half a dozen “blank nodes are RDF’s 1 million dollar mistake” blog posts 😄

Gavin: Haha, I totally concur @somethingelseentirely !
Since I first started using RDF I’ve taken the view that they need to be immediately skolemized
I’ve run into so much awkwardness from them. It was a very bad idea.

Somethingelseentirely: Yeah! The whole idea of “we’ll just solve graph isomorphism on the fly over and over again” is really insane 😂
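
To make the exchange above a bit more concrete, here is a minimal Python sketch of the idea. It is not TerminusDB's implementation: it uses a plain SHA-256 over sorted N-Triples lines rather than a rolling hash, and the graph_hash and skolemize helpers, the example.org IRIs, and the blank node labels are all made up for illustration. It shows that two parses of the same graph hash differently when only their blank node labels differ, and hash identically once each blank node is replaced by a fixed IRI.

```python
import hashlib


def graph_hash(ntriples_lines):
    """SHA-256 over the sorted N-Triples lines, so input order does not matter."""
    digest = hashlib.sha256()
    for line in sorted(ntriples_lines):
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def skolemize(ntriples_lines, mapping):
    """Replace blank node labels with fixed IRIs (naive string replace, sketch only)."""
    result = []
    for line in ntriples_lines:
        for label, iri in mapping.items():
            line = line.replace(label + " ", "<" + iri + "> ")
        result.append(line)
    return result


# The same graph parsed twice; only the parser-chosen blank node label differs.
parse_1 = [
    '<http://example.org/doc> <http://example.org/editor> _:b0 .',
    '_:b0 <http://example.org/fullname> "Dave Beckett" .',
]
parse_2 = [
    '<http://example.org/doc> <http://example.org/editor> _:genid42 .',
    '_:genid42 <http://example.org/fullname> "Dave Beckett" .',
]

# Hypothetical skolem IRI; any globally unique IRI would do.
EDITOR_IRI = "http://example.org/.well-known/genid/editor-1"

skolem_1 = skolemize(parse_1, {"_:b0": EDITOR_IRI})
skolem_2 = skolemize(parse_2, {"_:genid42": EDITOR_IRI})

print(graph_hash(parse_1) == graph_hash(parse_2))    # labels differ, hashes differ
print(graph_hash(skolem_1) == graph_hash(skolem_2))  # skolemized, hashes agree
```

Without skolemization, deciding that parse_1 and parse_2 are the same graph means matching up blank nodes, which is the graph isomorphism problem mentioned above.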

This is when another community member commented about the topic of skolemization and the fact that there is so much to learn.

Here Somethingelseentirely contributed real value.

Blank Nodes in RDF by Somethingelseentirely

It’s best not to bother with blank nodes if you can. Of course, there are cases where you need to ingest data that contains them. RDF Turtle, for example, creates them automatically whenever you use anonymous objects with the [ ] syntax (relatively obvious).

				
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix ex: <http://example.org/stuff/1.0/> .

<http://www.w3.org/TR/rdf-syntax-grammar>
  dc:title "RDF/XML Syntax Specification (Revised)" ;
  ex:editor [
    ex:fullname "Dave Beckett";
    ex:homePage <http://purl.org/net/dajobe/>
  ] .
				
			

Or whenever you create a collection like an rdf:List with the ( ) syntax (definitely a lot less obvious).

				
PREFIX : <http://example.org/stuff/1.0/>
:a :b ( "apple" "banana" ) .
				
			

Since RDF collections are essentially linked lists with additional constraints (e.g. set semantics), you get a blank node for every list node.
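
As a rough illustration of that linked-list structure, here is a small Python sketch that expands the ( "apple" "banana" ) collection from above into the triples a Turtle parser would produce. The expand_collection helper and the _:b0, _:b1 labels are assumptions made for this example; a real parser picks its own labels.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def expand_collection(subject, predicate, items):
    """Expand a Turtle ( ... ) collection into N-Triples-style lines.

    Every item gets its own list node (a blank node) carrying an
    rdf:first triple (the value) and an rdf:rest triple (the next node).
    """
    if not items:
        # An empty collection is just rdf:nil, with no blank nodes at all.
        return [f"{subject} {predicate} <{RDF}nil> ."]
    nodes = [f"_:b{i}" for i in range(len(items))]
    triples = [f"{subject} {predicate} {nodes[0]} ."]
    for i, item in enumerate(items):
        rest = nodes[i + 1] if i + 1 < len(nodes) else f"<{RDF}nil>"
        triples.append(f"{nodes[i]} <{RDF}first> {item} .")
        triples.append(f"{nodes[i]} <{RDF}rest> {rest} .")
    return triples


triples = expand_collection(
    "<http://example.org/stuff/1.0/a>",
    "<http://example.org/stuff/1.0/b>",
    ['"apple"', '"banana"'],
)
for line in triples:
    print(line)
```

One innocent-looking line of Turtle turns into five triples and two blank nodes, one per list element.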

If possible, I simply forgo the whole process of skolemization and use UUIDs directly, which can be embedded into the URI namespace like this: urn:uuid:f81d4fae-7dec-11d0-a765-00a0c91e6bf6, as per https://datatracker.ietf.org/doc/html/rfc4122 .
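
For what it's worth, minting such identifiers takes only a couple of lines with Python's standard uuid module. The sketch below is only an illustration: mint_id and replace_blank_labels are hypothetical helper names, and the blank node labels are made up.

```python
import uuid


def mint_id():
    """Mint a fresh random (version 4) UUID in urn:uuid: form."""
    return uuid.uuid4().urn


def replace_blank_labels(labels):
    """Give every blank node label its own freshly minted UUID URN."""
    return {label: mint_id() for label in labels}


# The UUID from the text, rendered in its URN form.
example_urn = uuid.UUID("f81d4fae-7dec-11d0-a765-00a0c91e6bf6").urn
print(example_urn)

# Each blank node label gets a permanent, globally unique name.
mapping = replace_blank_labels(["_:b0", "_:b1"])
for label, urn in mapping.items():
    print(label, "->", urn)
```

The generated values are random, so they differ on every run; the point is that once minted they are stored with the data and never need to be matched up again.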

If you want to learn more about blank nodes, and why they cough require special attention cough:

TerminusDB makes this a lot easier with the latest release and provides capabilities to automatically derive subject/entity IDs based on either UUIDs, human-readable auto-incrementing values, or a content-aware hash.

Which is really really nice and straightforward.

My inner purist says that UUIDs are the go-to solution, but the content-aware stuff is a nice touch (if you absolutely know that you’re not gonna expand on an entity in the future).
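
To see why the content-aware option only works if the entity never grows, here is a minimal Python sketch of a content-derived ID. It is not how TerminusDB computes its hashes; the content_id helper, the urn:example: prefix, and the sample entity are assumptions made purely for illustration.

```python
import hashlib
import json


def content_id(entity, prefix="urn:example:sha256:"):
    """Derive an ID from an entity's content (sketch only).

    The entity is serialised as JSON with sorted keys, so the same
    content always yields the same ID regardless of key order.
    """
    canonical = json.dumps(
        entity, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return prefix + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


editor = {"fullname": "Dave Beckett", "homePage": "http://purl.org/net/dajobe/"}
same_editor = {"homePage": "http://purl.org/net/dajobe/", "fullname": "Dave Beckett"}
expanded_editor = dict(editor, email="dave@example.org")

print(content_id(editor) == content_id(same_editor))      # same content, same ID
print(content_id(editor) == content_id(expanded_editor))  # new field, new ID
```

Adding a single field changes the ID, so anything that pointed at the old ID no longer points at the expanded entity. A UUID stays the same no matter how much the entity grows.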

I personally avoid anything human-readable for entity/subject and attribute/property identifiers, as it only encourages bikeshedding amongst ontologists/developers about the naming of stuff (much better to have an arbitrary constant that everybody can name whatever they want in their codebase), and makes schema migrations harder (you can’t reuse a human-readable name, while creating a new random ID is dirt cheap). But that’s just my personal taste and opinion 😄.
