Are you a data scientist, business analyst, developer, or project manager? No matter what your job title is, the chances are, you or your team will have to handle data every day in this data-centric world. Powerful software nowadays works like magic thanks to the amount of data we have access to. User data, transaction data, analytic data…all of these help us to make better decisions and applications.
However, things can get annoying quickly when the amount of data you and your team handle becomes too large. As an ex-data scientist, my past experiences did consist of times when handling data coherently with my colleagues provided challenges. Many times, it ended in a scenario with multiple versions of data sitting around with no good explanation of which one is meant to be used for a certain experiment.
Good data governance is what we are all looking for. Not just in the sense that data will be more secure (which is very important), but on the other side we want to streamline the process for the right people to gain access to the data to get their job done.
Is Git a solution for data versioning and governance?
Software engineers are problem solvers. Revision control is not a new concept in software development. When a team of developers works together to write into a single code source, revision control tools are put in place to resolve problems that could arise working asynchronously.
Among these, there’s Git, a wildly popular revision control tool. With Git, it allows a wide range of operations that keep track of what someone has done and can easily combine a team’s work. Here are a few noticeable operations:
Git gives control to text, but what can control data structures?
From my experience, Git does provide a good tool for collaboration and managing changes in a project. Data governance can benefit by adopting the same principles as Git. However, Git is a tool designed to compare text, which is how code is written down, it will not work for data structures that are not text.
Text is a list structure and Git compares changes line by line. On the other hand, data can be in tabular format, which consists of not just rows but columns. Or data can be in JSON format, which is a key-value pair structure. Both of them do not work well with Git. Imagine if you delete a column in a CSV, git will think you changed the whole table as changes are scattered across all the lines. Or imagine scrambling up the order of objects in a JSON file, nothing changes for the key-value pairs but Git will think you changed the whole file.
Better data governance requires a database with revision control
Git is not designed to store and manage data. What we need is a database with revision control functionalities. There are many (and more coming) databases that are trying to achieve that.
One among them would be TermiunsDB, which is a knowledge graph database by nature. Being a graph database, TerminusDB is flexible in the way it stores data and with the proper schema, it already provides the foundation to build a “commit-tree” that allows branch and rollback.
Moving forward, to achieve Git-like operations for structured data to provide better data governance, we need more than the ability to branch and rollback. We need an efficient way to compare differences and combine data. Right now, diff and patch are available in TerminusDB and as an open API. It aims to provide a better solution for comparing data in JSON format and gives users the ability to perform a number of patch operations. Table diffs have also been developed, which proved to be incredibly difficult, and will be integrated into our products. With efficient data diffs, merge capabilities are right at the doorstep.
This is an exciting time for technology. We have more application and AI power with our ability to make use of large amounts of data. What we need at this time is a way to govern data with better workflows that provide efficient and secure ways to collaborate and share data. Revision control has proven its benefits in software development, especially in open-source projects, it may be the right tool that we need in this data-driven world of technology.