Delta Rollups

Vivek
September 22, 2020

Hi there! Matthijs here.

I’m one of the developers of TerminusDB, where I work on the backend. I’m the primary author of terminusdb-store, our storage backend, as well as a contributor to the database server. Unless things go terribly wrong, my work is largely invisible to people actually using TerminusDB, which is a shame, because it is fascinating stuff! Therefore, I have decided to start writing blog posts to highlight some of these things.

Today’s post will be about a new feature that we’re working on called ‘delta rollups’. Delta rollups will hopefully make queries much faster in the future, especially when done on databases that have a lot of commits. We’ll probably roll out this feature in an upcoming minor release, so stay tuned!

To describe what delta rollups are, we first need to discuss the problem they try to solve. As you probably know, TerminusDB is append-only. This means that when you delete some data, that data is not actually thrown away. Instead, the deletions are saved in a new data layer on top of your old data. When querying, the query engine looks through all those layers to reconstruct your data from the additions and deletions they contain. The big advantage of doing things this way, rather than modifying the data directly, is that we can actually reason about the changes made to a database over time. Among other things, this is what allows TerminusDB to be used collaboratively: it is what lets us do branching, rebasing, pushing and pulling of data.
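To make the layer model concrete, here is a minimal Python sketch of an append-only layer stack. The `Layer` class and `triple_exists` function are hypothetical names invented for illustration; they are not the terminusdb-store API, which works quite differently under the hood.

```python
from dataclasses import dataclass, field
from typing import Optional, Set, Tuple

Triple = Tuple[str, str, str]


@dataclass
class Layer:
    """Hypothetical layer: only the changes made by one commit."""
    additions: Set[Triple] = field(default_factory=set)
    removals: Set[Triple] = field(default_factory=set)
    parent: Optional["Layer"] = None


def triple_exists(layer: Optional[Layer], triple: Triple) -> bool:
    """Walk from the newest layer down to the oldest.

    The first layer that mentions the triple decides the answer.
    """
    current = layer
    while current is not None:
        if triple in current.additions:
            return True
        if triple in current.removals:
            return False
        current = current.parent
    return False


base = Layer(additions={("alice", "knows", "bob"), ("bob", "knows", "carol")})
# Deleting a triple does not touch `base`; it adds a new layer on top.
top = Layer(removals={("bob", "knows", "carol")}, parent=base)
```

Querying `top` no longer finds the deleted triple, while querying `base` still does, which is what makes it possible to look at older states of the data.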

There is, however, a big disadvantage: the more commits in a database, the more layers queries need to search through to find data. Each layer adds some cost to the query. This is fine for a small number of layers, but when you have hundreds of commits, it really starts to be noticeable. Roughly, query time scales linearly with the number of layers that need to be searched, so to keep query times low, we would rather not have so many layers.

A straightforward way to reduce the number of layers would be to squash them all together into one giant layer. This is an approach we actually already take for some of our own internal graphs, such as the system database. However, doing this throws away history. After doing a squash, it is no longer clear what sequence of operations led a database to be in a certain state. This is no problem for some internal databases where we don’t care about the history, but it is definitely a problem for any normal database. For example, if you were to squash your own database, and then tried to pull in changes from somebody else who worked off the original, the system would no longer be able to recognize the common history, and would reject the change.
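As a rough illustration of why a squash loses history, the sketch below squashes two different commit histories and ends up with the same result. The `squash` function and the tuple representation of a layer are assumptions made for this example, not how TerminusDB implements squashing.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]
# A layer is modelled here as a pair (additions, removals).
LayerDelta = Tuple[Set[Triple], Set[Triple]]


def squash(layers: List[LayerDelta]) -> Set[Triple]:
    """Replay layers oldest first and return the triples visible at the top."""
    visible: Set[Triple] = set()
    for additions, removals in layers:
        visible -= removals
        visible |= additions
    return visible


t1 = ("alice", "knows", "bob")
t2 = ("bob", "knows", "carol")

# History A: add two triples, then remove one of them in a second commit.
history_a = [({t1, t2}, set()), (set(), {t2})]
# History B: only ever add the first triple.
history_b = [({t1}, set())]

squashed_a = squash(history_a)
squashed_b = squash(history_b)
```

Both histories squash to the same single layer, so after the squash there is no way to tell which sequence of commits produced it.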

What we need is a way to squash layers without throwing away the history. This is what delta rollups aim to achieve. A delta rollup is a squashed layer which, to the rest of the system, appears to be exactly the same as the layer stack that it replaces. It allows queries to be much faster, while still allowing all the collaboration features to work.

Let’s talk about what is required for delta rollups to deliver on this goal. First, it is important that the original changes aren’t thrown away. Delta rollups should be considered an optimization that is fully transparent. We don’t want to rewrite history. So rather than rewiring layer stacks to point at a new delta rollup layer, we’ll instead update the existing layer to point at its delta rollup. When loading the layer, the query engine will be able to detect this information and load the rollup version instead, all the while pretending to the rest of the system that it is looking at the original layer.
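The idea of pointing an existing layer at its rollup can be sketched as follows. Everything here is hypothetical: the `StoredLayer` class, the `rollup` field and the `visible_triples` function are made up for illustration, and the sketch assumes the rollup covers the whole stack below the layer.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set, Tuple

Triple = Tuple[str, str, str]


@dataclass
class StoredLayer:
    additions: Set[Triple]
    removals: Set[Triple]
    parent: Optional[str] = None   # name of the parent layer, if any
    rollup: Optional[str] = None   # name of a rollup layer, if one was built


def visible_triples(store: Dict[str, StoredLayer], name: str) -> Set[Triple]:
    """Return the triples visible at layer `name`."""
    layer = store[name]
    if layer.rollup is not None:
        # Fast path: a single squashed layer stands in for the whole stack.
        return set(store[layer.rollup].additions)
    inherited = visible_triples(store, layer.parent) if layer.parent else set()
    return (inherited - layer.removals) | layer.additions


store = {
    "c1": StoredLayer({("a", "p", "b"), ("b", "p", "c")}, set()),
    "c2": StoredLayer({("c", "p", "d")}, {("b", "p", "c")}, parent="c1"),
}
before = visible_triples(store, "c2")

# Build the rollup as a new layer and point the existing layer at it.
# The original layers, and the parent link of "c2", are left untouched.
store["c2-rollup"] = StoredLayer(set(before), set())
store["c2"].rollup = "c2-rollup"
after = visible_triples(store, "c2")
```

Callers keep asking for "c2" and get the same answer as before; only the way that answer is computed changes.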

The second important feature for delta rollups to work properly is id stability. In terminusdb-store, each node, predicate or value string is assigned an id. Each layer stores newly added strings in dictionary files, where ids are assigned to them in lexicographical order within that layer. Each layer then takes up an id range adjacent to that of its parent layer. So for example, the first layer takes up ids 1 through 10, the second layer 11 through 20, and so on. This means that when you merge a bunch of dictionaries, the ids may change. Continuing our example, you’d now have a single layer with ids ranging from 1 to 20, and unless every string that originally came from layer 2 sorts lexicographically after every string from layer 1, those strings will now be interspersed, completely changing their ids.
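A small worked example shows the effect. The `assign_ids` helper below is a hypothetical stand-in for dictionary construction; it only assumes what the paragraph above states, namely that ids follow lexicographical order within a layer and that each layer starts where its parent ends.

```python
from typing import Dict, Iterable


def assign_ids(strings: Iterable[str], first_id: int) -> Dict[str, int]:
    """Assign consecutive ids in lexicographical order, starting at first_id."""
    return {s: first_id + i for i, s in enumerate(sorted(strings))}


layer1_strings = ["banana", "cherry"]
layer2_strings = ["apple", "date"]

# Stacked layers: layer 1 takes ids 1-2, layer 2 takes ids 3-4.
stacked = {**assign_ids(layer1_strings, 1), **assign_ids(layer2_strings, 3)}

# Merged dictionary: all four strings are sorted together.
merged = assign_ids(layer1_strings + layer2_strings, 1)

changed = sorted(s for s in stacked if stacked[s] != merged[s])
```

Here "apple" moves from id 3 to id 1 and pushes "banana" and "cherry" up by one, while "date" happens to keep id 4.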

We can’t just change the ids of strings. Suppose someone did some work based on the original layer. They may have done insertions and deletions involving those strings, which are stored under the original ids. If we were to change ids in the delta rollup, such a change could no longer be pulled in cleanly. Therefore, delta rollups will store an id remap table, mapping each internal id to an external one and back.
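A remap table of this kind could look roughly like the sketch below. The dictionaries and the two translation functions are illustrative assumptions; the actual on-disk format of the remap table is not described in this post.

```python
from typing import Dict, Tuple

# External ids: what the original layer stack assigned (hypothetical values).
external_ids: Dict[str, int] = {"banana": 1, "cherry": 2, "apple": 3, "date": 4}
# Internal ids: what the merged rollup dictionary assigns after re-sorting.
internal_ids: Dict[str, int] = {"apple": 1, "banana": 2, "cherry": 3, "date": 4}

# The remap table, in both directions.
internal_to_external = {internal_ids[s]: external_ids[s] for s in external_ids}
external_to_internal = {ext: int_ for int_, ext in internal_to_external.items()}

IdTriple = Tuple[int, int, int]


def to_internal(triple: IdTriple) -> IdTriple:
    """Translate a triple of external ids into the rollup's internal ids."""
    s, p, o = triple
    return (external_to_internal[s], external_to_internal[p], external_to_internal[o])


def to_external(triple: IdTriple) -> IdTriple:
    """Translate a triple of internal ids back to the ids callers expect."""
    s, p, o = triple
    return (internal_to_external[s], internal_to_external[p], internal_to_external[o])


# A collaborator's change refers to strings by their original (external) ids.
incoming = (3, 1, 4)  # apple, banana, date
stored = to_internal(incoming)
```

Because the translation is a bijection, work done against the original ids can be applied to the rollup and read back without anyone outside the layer noticing the renumbering.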

The third important bit in making delta rollups work properly is dealing with queries that investigate the additions and removals of a layer. We can’t just query the delta rollup layer for these, since the delta rollup layer has a different set of additions and removals than the layer it replaces. In order to support these operations, we are introducing a way to load in the original layer on demand, without loading in the entire layer stack. Addition and removal queries can then transparently use the original layer’s delta, rather than the delta rollup layer.
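One way to picture this is a layer object that answers ordinary queries from the rollup but loads the original delta lazily when asked about additions or removals. The `RolledUpLayer` class and its loader callback are hypothetical and only sketch the behaviour described above.

```python
from typing import Callable, Optional, Set, Tuple

Triple = Tuple[str, str, str]
Delta = Tuple[Set[Triple], Set[Triple]]  # (additions, removals)


class RolledUpLayer:
    """Hypothetical layer view backed by a rollup plus a lazily loaded delta."""

    def __init__(self, rollup_triples: Set[Triple],
                 load_original_delta: Callable[[], Delta]) -> None:
        self._rollup_triples = rollup_triples
        self._load_original_delta = load_original_delta
        self._delta: Optional[Delta] = None

    def triples(self) -> Set[Triple]:
        # Ordinary queries are answered from the rollup alone.
        return self._rollup_triples

    def _original_delta(self) -> Delta:
        # Load the original layer's delta only on first use, then cache it.
        if self._delta is None:
            self._delta = self._load_original_delta()
        return self._delta

    def additions(self) -> Set[Triple]:
        return self._original_delta()[0]

    def removals(self) -> Set[Triple]:
        return self._original_delta()[1]


load_count = 0


def load_delta() -> Delta:
    """Stand-in for reading just the original layer from storage."""
    global load_count
    load_count += 1
    return ({("c", "p", "d")}, {("b", "p", "c")})


layer = RolledUpLayer({("a", "p", "b"), ("c", "p", "d")}, load_delta)
all_triples = layer.triples()      # does not touch the original layer
loads_after_query = load_count
added = layer.additions()          # triggers the on-demand load
removed = layer.removals()         # reuses the cached delta
```

Note that the rollup reports two visible triples while the original layer only added one of them, which is why addition and removal queries cannot be answered from the rollup itself.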

In order to support all this, we’ve been hard at work over the past few weeks to revamp a lot of our internals. A lot of the logic dealing with building and querying of layers has been refactored to be much more modular so that it can be recombined in a different way to support the building and querying of delta rollup layers. As a side-effect, these changes should also bring in general query speedups, so everybody wins!

So that’s it! Delta rollups, coming soon to a TerminusDB near you! May it grace your data with ever-improving query speeds!
