Are you a data scientist, business analyst, developer, or project manager? No matter what your job title is, the chances are, you or your team will have to handle data every day in this data-centric world. Powerful software nowadays works like magic thanks to the amount of data we have access to. User data, transaction data, analytic data…all of these help us to make better decisions and applications.
However, things can get annoying quickly when the amount of data you and your team handle becomes too large. As an ex-data scientist, my past experiences did consist of times when handling data coherently with my colleagues provided challenges. Many times, it ended in a scenario with multiple versions of data sitting around with no good explanation of which one is meant to be used for a certain experiment.
Good data governance is what we are all looking for. Not just in the sense that data will be more secure (which is very important), but on the other side we want to streamline the process for the right people to gain access to the data to get their job done.
Is Git a solution for data versioning and governance?
Software engineers are problem solvers. Revision control is not a new concept in software development. When a team of developers works together to write into a single code source, revision control tools are put in place to resolve problems that could arise working asynchronously.
Among these, there’s Git, a wildly popular revision control tool. With Git, it allows a wide range of operations that keep track of what someone has done and can easily combine a team’s work. Here are a few noticeable operations:
Branches
With branches, each one of the contributors can work on their version of the code. Branches are made to have a separate history from the point where this branch originated from. Branches are not the same as making a copy of the project (repo). It is a separate version of the original that has a different history. This way of storing different versions is much more efficient if the project is huge but the change is minor.
Rollback
Since Git stores the history of the repo, it allows going back to an earlier version if something goes wrong. Checkpoints are made using commits, these store the minimal difference between the current state and the state of the last checkpoint. Each commit will have a message, which can be useful to describe what has been changed and why it was changed. It also stores who made the commit.
Diff
Being able to compare two different states of the repo and find the minimal changes is not as straightforward as you think. However, it is an important part of allowing magic-like collaboration operations. Diff is also useful to understand and approve changes by highlighting the elements that have been changed.
Merge
As diff provides the minimal changes between two stages, it makes comparing and combining two pieces of work manageable. If the minimal changes between two branches are not at the same place, there's no conflict and the work can be combined. However, if the same place has two different changes, it can still be resolved easily by manually comparing what has been altered.
Git gives control to text, but what can control data structures?
From my experience, Git does provide a good tool for collaboration and managing changes in a project. Data governance can benefit by adopting the same principles as Git. However, Git is a tool designed to compare text, which is how code is written down, it will not work for data structures that are not text.
Text is a list structure and Git compares changes line by line. On the other hand, data can be in tabular format, which consists of not just rows but columns. Or data can be in JSON format, which is a key-value pair structure. Both of them do not work well with Git. Imagine if you delete a column in a CSV, git will think you changed the whole table as changes are scattered across all the lines. Or imagine scrambling up the order of objects in a JSON file, nothing changes for the key-value pairs but Git will think you changed the whole file.
Better data governance requires a database with revision control
Git is not designed to store and manage data. What we need is a database with revision control functionalities. There are many (and more coming) databases that are trying to achieve that.
One among them would be TermiunsDB, which is a knowledge graph database by nature. Being a graph database, TerminusDB is flexible in the way it stores data and with the proper schema, it already provides the foundation to build a “commit-tree” that allows branch and rollback.
Moving forward, to achieve Git-like operations for structured data to provide better data governance, we need more than the ability to branch and rollback. We need an efficient way to compare differences and combine data. Right now, diff and patch are available in TerminusDB and as an open API. It aims to provide a better solution for comparing data in JSON format and gives users the ability to perform a number of patch operations. Table diffs have also been developed, which proved to be incredibly difficult, and will be integrated into our products. With efficient data diffs, merge capabilities are right at the doorstep.
This is an exciting time for technology. We have more application and AI power with our ability to make use of large amounts of data. What we need at this time is a way to govern data with better workflows that provide efficient and secure ways to collaborate and share data. Revision control has proven its benefits in software development, especially in open-source projects, it may be the right tool that we need in this data-driven world of technology.