TerminusDB: A Technical History

Luke
April 13, 2020
Case Studies, TerminusDB

TerminusDB is an open source (GPLv3) fully featured in-memory graph database management system with a rich query language: WOQL (the Web Object Query Language).

TerminusDB originated in Trinity College Dublin in Ireland in 2015 when we started working on the information architecture for ‘Seshat: the Global Historical Databank’, an ambitious project to store lots of information about every society in human history. We needed a solution that could enable collaboration among a highly distributed team on a shared database whose primary function was the curation of high-quality datasets with a very rich structure, storing information about everything from religious practices to geographic extent.

The historical databank was a very challenging data storage problem. While the scale of data was not particularly large, the complexity was very high. First there were a large number of different types of facts that needed to be stored. Information about populations, carrying capacity, religious rituals, varieties of livestock etc. In addition to this each fact had to be scoped with the period over which the fact was likely to be true. This meant we needed to have ranges with uncertainty bars on each of the endpoints. Then the reasoning for the data point had to be apparent on inspection. If the value for population was deduced from the carrying capacity, it is critical for analysis to understand that this is not a “ground fact”. But even ground facts require provenance. We needed to store the information about which method of measurement gave rise to the fact, and who had undertaken it. And if that wasn’t bad enough, we needed to be able to store disagreement – literally the same fact might have two different values as argued by two different sources, potentially using different methods.

On top of this we needed to be able to allow data entry by graduate students who may or may not be reliable in their transcription of information. So additional provenance information was required about who put the fact into the database.

Of course, all of this is possible in an RDBMS but it would be a difficult modeling task to say the least. The richness of data and the extensive taxonomic information makes using a knowledge graph look more appropriate. Given that Trinity College Dublin had some linked data specialists we opted to try a linked data approach to solving the problem.

Unfortunately, the linked-data and RDF tool-chains were severely lacking. We evaluated several tools in an attempt to architect a solution, including Virtuoso and the most mature technology StarDog, but found the tools were not really up to the task. While StarDog enforced the knowledge graph or ontology as a constraint on the data, it did not provide us with usable refutation witnesses. That is, when something was said to be wrong, insufficient information was given to attempt automated resolution strategies. In addition, the tools were set up to facilitate batch approaches to processing data, rather than a live transactional approach which is the standard for RDBMSs. Our first prototype was simply a prolog program using the prolog database and temporary predicates which could be used within a transaction such that running updates could have constraints tested without disrupting the reader view of the database. (We thought that over time the use of prolog would be a hindrance to adoption as logic programming is not particularly popular – though there are now several green shoots!).

Not long after this we were asked to attempt the integration of a large commercial intelligence database into our system. The data was stored in a database in the 10s of gigabytes keeping information about all companies, boards of directors and various relationships between companies and important people over the entire course of the Polish economy since the 1990s. This, unsurprisingly, brought our prototype database to its knees. The problem gave us new design constraints. We needed a method of storing the graph that would allow us to find chains quickly (one of the problems they needed to solve), would be small enough that we didn’t run out of main memory and which would also allow the transaction processing to which we had grown accustomed. The natural solution was to reuse the idea of a temporary transaction layer, but to keep them around longer – and to make these layers very small and fast to query. We set about looking for alternatives. We spent some time trying to write a graph storage engine in postgres. Ultimately, we found this solution to be too large and too slow for multi-hop joins on the commercial intelligence use-case. Eventually we found HDT (Header-Dictionary-Triples). On evaluation this solution seemed to work very well to represent a given layer in our transactions as well as performing much better and so we built a prototype utilizing the library.

Unfortunately, the HDT library exhibited a number of problems. First it was not really designed to allow programmatic access which was required during transactions and so we found it quite awkward. Second, we had a lot of problems with re-entrance leading to segfaults. This was a serious problem on the Polish commercial use case which needed multi-threading to make search and update feasible. Managing all of the layering information in prolog was obviously less than ideal – this should clearly be built into the solution at the low level. We either had to fix HDT and extend it. Or build something else.

We opted to build something else and we opted to build it in Rust. Our specific use cases didn’t require a lot of the code that was in HDT. We were not as concerned with using it as an interchange format (one of the design principles for HDT) and in addition to the layering system, we had plans for other aspects to change fairly drastically as well. For instance, we needed to be able to conduct range queries and we wanted to segment information by type. HDT was standardized for a very different purpose and it was going to be hard to fit it into that shoe. The choice of Rust was partly one borne of the pain of tracking down segfaults in HDT (written in C++). We were willing to pay some upfront cost in development time not to search, oft times fruitlessly, for segfaults.

At this point our transaction processing system had linear histories. The possibility of tree histories, i.e. of having something with branching was now obvious and relatively simple to implement. It went on to the pile of things which would be “nice to have” (a very large pile). It wasn’t until we started using the product in a production environment for a Machine Learning pipeline that this “nice to have” became more and more obviously a “need to have”.

As with any technical product of this nature there are many different paths we could have followed, after spinning out from university and attempting to pursue a strategy of implementing large scale enterprise graph systems with some limited success, we decided to shift to open source and become TerminusDB (https://github.com/terminusdb/terminus-server). Potentially a long road, but a happier path for us. We also decided to double down on the elements in TerminusDB that we had found useful in project implementation and felt were weakest in existing information architectures. These are the ability to have very rich schemas combined with very fine-grained data-aware revision control enabling the types of Continuous Integration / Continuous Deployment (CI/CD) used extensively in software engineering to be used with data. We think that DataOps, basically DevOps for data or the ability to pipeline and reduce the cycle time of data, is going to become more and more important for data intensive teams.

The name ‘TerminusDB’ comes from two sources. The Roman God of boundaries is named Terminus and good databases need boundaries. Terminus is also the home planet of the Foundation in the Issac Asimov series of novels. As our origin is in building the technical architecture for the Global History Databank, the parallels with Hari Seldon’s psychohistory are compelling.

To take advantage of this DataOps/’Git for data’ use case we implement a graph database with a strong schema so as to retain both simplicity and generality of design. Second, we implement this graph using succinct immutable data structures. Prudent use of memory reduces cache contention while write-once, read-many data structures simplify parallel access significantly. Third, we adopted a delta encoding approach to updates as is used in source control systems such as git. This provides transaction processing and updates using immutable database data structures, recovering standard database management features while also providing the whole suite of revision control features: branch, merge, squash, rollback, blame, and time-travel facilitating CI/CD approaches on data. Some of these features are implemented elsewhere, the immutable data structures of both Fluree and Datomic for example, but it is the combination that makes TerminusDB a radical departure from historical architectures.

In pursuing this path, we see that innovations in CI/CD approaches to software design have left RDBMSs firmly planted in the past. Flexible distributed revision control systems have revolutionized the software development process. The maintenance of histories, of records of modification, of distribution (push/pull) and the ability to roll back, branch or merge enables engineers to have confidence in making modifications collaboratively. TerminusDB builds these operations into the core of our design with every choice about transaction processing designed to ease the efficient implementation of these operations.

Current databases are shared knowledge resources, but they can also be areas of shared destruction. The ability to modify the database for tests, for upgrades etc is hindered by having only the single current view. Any change is forced on all. The lack of fear of change is perhaps the greatest innovation that revision control has given us in code. Now it is time to have it for data. Want to reorganize the structure of the database without breaking all of the applications which are using it? Branch first, make sure it works, then you can rebase master in confidence. While we are satisfied with progress, databases are hard and typically take a number of years to harden. It is a difficult road as you need to have something which works while keeping your options for progress open. How do you eat an elephant? One bite at a time.

We hope that others will see the value in the project and contribute to the code base. We are committed to delivering all features as open source and will not have an enterprise version. Our monetization strategy is to roll out decentralized data sharing and a central hub (if TerminusDB is Git for data, TerminusHub will be the ‘GitHub for data’) in the near future. Any cash that we generate by charging for seats will subsidize feature development for the core open source database. We are not sure if this is the best strategy, but we are going to see how far it takes us.