Version control for data science and machine learning

This article looks at version control for data science and machine learning. It was written following an interview with our DevRel Lead and ex-data scientist Cheuk Ting Ho. During a TerminusDB discovery session, Cheuk mentioned versioned machine learning and it piqued my interest, so I decided to pick her considerable brain to find out more. This is what I learned.

Who is Cheuk?

I would be remiss to begin this article without a proper introduction to Cheuk.

Before working in Developer Relations, Cheuk was a data scientist with experience in the finance, AI, publishing, and e-commerce sectors. The roles demanded strong numerical and programming skills, especially in Python. Having been introduced to TerminusDB via a meet-up, Cheuk changed direction to embrace her passion for the tech community and became our DevRel Lead, maintaining the TerminusDB Python client and engaging with TerminusDB’s user community.

Away from work, Cheuk enjoys talking about Python on Twitch and via podcasts. Cheuk is a regular speaker at universities and conferences and also organizes events for developers, including EuroPython, of which she is a board member, PyData Global, and Pyjamas Conf. Believing in tech diversity and inclusion, Cheuk organizes workshops and mentored sprints for minority groups. In 2021, Cheuk became a Python Software Foundation fellow.

What are the typical tasks a data scientist undertakes to implement machine learning?

A data scientist implementing machine learning will typically go through these tasks:

  • Understanding the problem.
  • Collecting data.
  • Preparing data.
    • Understanding it.
    • Normalizing it by eliminating duplicates and making error corrections.
  • Choosing a model based on the data and requirements.
  • Training the machine learning model.
  • Evaluating the results.
  • Fine-tuning the parameters.
  • Predictions – Once the model and results have been settled, it’s time to deploy the machine learning model and hand it over to the relevant domain.
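
To make the middle of that list concrete, here is a minimal sketch of the prepare, train, and evaluate steps using only the Python standard library. It is an illustration rather than a recommended workflow: the function names, the toy data, and the choice of a one-variable least-squares fit are all assumptions made for this example.

```python
from statistics import mean

def prepare(rows):
    """Drop exact duplicates and rows with missing values, keeping order."""
    seen = set()
    clean = []
    for x, y in rows:
        if x is None or y is None or (x, y) in seen:
            continue
        seen.add((x, y))
        clean.append((x, y))
    return clean

def train(rows):
    """Fit y = slope * x + intercept by ordinary least squares."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in rows)
             / sum((x - x_bar) ** 2 for x in xs))
    return slope, y_bar - slope * x_bar

def evaluate(model, rows):
    """Mean absolute error of the model on the given rows."""
    slope, intercept = model
    return mean(abs(slope * x + intercept - y) for x, y in rows)

# Toy data (made up): one duplicate row and one row with a missing value.
raw = [(1, 2.0), (2, 4.0), (2, 4.0), (3, None), (3, 6.0), (4, 8.0)]

data = prepare(raw)              # preparing data
model = train(data[:3])          # training on the first three rows
error = evaluate(model, data[3:])  # evaluating on the held-out row
print(model, error)
```

In a real project each of these steps would be far larger, which is exactly why keeping track of which data and parameters produced which result becomes hard.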

What are the biggest challenges for data scientists implementing machine learning?

The biggest challenges for Cheuk during her work as a data scientist included:

  • Access to data – This can be a long process, especially for those working for agencies, where sensitive data can involve a lot of paperwork.
  • Understanding the data – Challenges include abstract field names, big CSV dumps, and working out how the data was collected. For example, the collection process may have changed if the data spans a long period of time.
  • Organization – With so many data sets, experiments, and different parameters and processes, being organized is difficult.
  • Repetition – Normalized data needs to be stored in the database for experiments, and when running multiple experiments the database has to be set up each time.
  • Collaborative working – Because of the complicated nature of machine learning and the many facets involved, collaborative working is time-consuming due to the need to pick through and understand the various Jupyter Notebooks, files, and processes.
  • Inflexible databases – Many data scientists use relational databases for their work, and the rows-and-columns format of these provides an inflexible way to combine data and their relationships.
  • Reworking models – Touching upon the organization aspect again, reworking and rebuilding experiments is difficult because the various files and databases make it hard for data scientists to recreate their models.

Version control for data science and machine learning, TerminusDB & its benefits

Version control for data science and machine learning takes the best bits of Git and applies them to the database and procedures data scientists undertake to implement machine learning. Git is a distributed version control system for source code where you can branch, clone, and merge to collaboratively develop software with control and safety. 

TerminusDB is an open-source document graph database. The schema language enables documents and their relationships to be specified using a simple JSON syntax. It combines the ease of working with JSON documents, the powerful query capabilities of graph databases, and the structure that a schema provides. It is also designed as a distributed database with a collaboration model, namely version control.
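
As a rough illustration of what specifying documents and their relationships in JSON can look like, the sketch below writes two classes as Python dictionaries in the style of TerminusDB's schema syntax. The class names, field names, and the small `missing_fields` helper are hypothetical and exist only for this example; check the TerminusDB documentation for the exact schema keywords before relying on them.

```python
import json

# Hypothetical classes, written in the style of TerminusDB's JSON schema syntax.
dataset_class = {
    "@type": "Class",
    "@id": "Dataset",
    "name": "xsd:string",
    "source": "xsd:string",
}
experiment_class = {
    "@type": "Class",
    "@id": "Experiment",
    "name": "xsd:string",
    "learning_rate": "xsd:decimal",
    "dataset": "Dataset",  # a link to another document: the "graph" part
}

def missing_fields(schema_class, document):
    """Toy check (not a TerminusDB call): schema fields the document lacks."""
    return sorted(
        key for key in schema_class
        if not key.startswith("@") and key not in document
    )

# An experiment document that forgot to say which dataset it used.
experiment = {"@type": "Experiment", "name": "baseline", "learning_rate": 0.01}

print(json.dumps([dataset_class, experiment_class], indent=2))
print(missing_fields(experiment_class, experiment))
```

The point of the sketch is the shape: each document is plain JSON, and a field whose type is another class name is what turns a pile of documents into a graph.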

Taking the same approach as Git, but for data, means that data scientists can perform the same branch, clone, and merge operations with their databases. Version control for data science and machine learning, together with TerminusDB’s functionality, is beneficial because of:

  • Version control:
    • Saves time and gets results faster:
      • Data scientists can branch models and evolve them in different ways at the same time and perform multiple experiments from the same base to reduce repetition and workload, and add a layer of standardization.
      • Roll back to a particular commit and tweak parameters rather than having to rebuild from scratch.
    • Facilitates collaboration:
      • Give your team access to your database to branch, tweak, and merge changes.
  • Document graph database:
    • Improves organization, governance, and time efficiency:
      • TerminusDB is very flexible so data scientists can store not only the data for machine learning, but also metadata, data preparation procedures, hyperparameters, and process data. This makes it easier for yourself, and others, to understand how the results were achieved.
      • A flexible schema means that data can be modeled prior to insertion into the database. It is also another way to add governance and meaning to your models.
    • Better collaboration:
      • Because the database can store all of the data involved in the machine learning model, sharing with coworkers is easier. They have all of the information needed to understand how the results were achieved and the ability to branch and clone to make their own changes.
    • Faster deployment:
      • Once the data science team has settled on a final model, handing it over to an engineer or developer is easier and faster to explain and implement, because the machine learning model is bundled with all the relevant data.
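
The branching and rollback ideas in the list above can be sketched with a toy in-memory store. This is not the TerminusDB client API: `VersionedStore` and its methods are hypothetical names invented for this example, every commit is a naive full snapshot, and the scores are placeholder values rather than real results.

```python
import copy

class VersionedStore:
    """Toy in-memory store: every commit is a full snapshot of the documents."""

    def __init__(self):
        self.commits = [{}]          # commit id = index into this list
        self.branches = {"main": 0}  # branch name -> commit id at its head

    def commit(self, branch, changes):
        """Apply `changes` on top of the branch head; return the new commit id."""
        snapshot = copy.deepcopy(self.commits[self.branches[branch]])
        snapshot.update(changes)
        self.commits.append(snapshot)
        self.branches[branch] = len(self.commits) - 1
        return self.branches[branch]

    def branch(self, new_branch, source="main"):
        """Start a new branch at the current head of `source`."""
        self.branches[new_branch] = self.branches[source]

    def reset(self, branch, commit_id):
        """Roll a branch back to an earlier commit."""
        self.branches[branch] = commit_id

    def read(self, branch):
        """Return a copy of the documents at the head of a branch."""
        return copy.deepcopy(self.commits[self.branches[branch]])

store = VersionedStore()

# Store the prepared data reference and hyperparameters together, once.
base = store.commit("main", {"dataset": "cleaned-v1",
                             "params": {"learning_rate": 0.1}})

# Two experiments start from the same base, so nothing is set up twice.
store.branch("low-lr")
store.branch("high-lr")
store.commit("low-lr", {"params": {"learning_rate": 0.01}, "score": 0.91})
store.commit("high-lr", {"params": {"learning_rate": 0.5}, "score": 0.74})

# A tweak on main that did not work out: roll back instead of rebuilding.
store.commit("main", {"params": {"learning_rate": 0.3}})
store.reset("main", base)

print(store.read("main"))
print(store.read("low-lr"))
```

Because each branch carries its parameters and results alongside the data reference, a coworker reading `low-lr` can see how its score was produced without hunting through separate files.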

Get started with TerminusDB for your machine learning experiments

If you’d like to try out versioned machine learning with TerminusDB, we’d love to help you get started. TerminusDB is open source, and you can find out how to install it via our documentation. There’s also a range of tutorials, including a getting started guide for the TerminusDB Python client, to help you figure out the intricacies.