This article will teach you how to set up a Machine Learning project with DAGsHub, a new tool designed to ease collaboration on Data Science projects that require data versioning and model building.
Today, we will talk about:
Problems with traditional Git and DVC setup for Machine Learning projects and why this approach does not fully work.
What is DAGsHub, and how does it solve the problems mentioned above?
And finally, we will learn how to set up your first project using DAGsHub and DAGsHub Storage.
The Problem with Traditional Git and DVC Setup
Most collaboration on Machine Learning projects is done through Git. This is a traditional setup inherited from software engineering projects.
However, there is a big difference between Software Engineering and Machine Learning projects. The latter usually involves massive data files that often need to be updated. Git does not handle this well. Even though many Machine Learning engineers have forced it to work initially, it was never the best solution.
Machine learning engineers have therefore started using DVC (Data Version Control). This approach still uses Git, but the data files are stored in remote cloud storage such as S3, GDrive, or Azure.
This is a step towards easier collaboration on Big Data projects, but it still requires a fairly complicated initial setup in which you need to configure the external storage to work with DVC.
What is DAGsHub and DAGsHub Storage?
DAGsHub is a free-to-use tool for machine learning projects that avoids the problems mentioned above. It is built on top of Git and DVC and allows easy versioning of data and models as well as experiment tracking. The DAGsHub team has recently released a new product, DAGsHub Storage, that makes setting up machine learning projects even easier by removing the need for heavy-duty configuration.
This means no more need to purchase cloud storage such as AWS, GCS, Azure, or GDrive to host big machine learning data sets. And no more complicated setup to configure those to work with DVC. With DAGsHub Storage, you can effortlessly access all your code, data, models, and experiments from the same place.
Sounds interesting, doesn’t it? In the next section, I will walk you through how to set up a machine learning project with DAGsHub and DAGsHub Storage in just a few simple steps.
Set up your first project using DagsHub
The first step in the setup process is creating a DAGsHub account. You can do this on the DAGsHub website.
You can sign up with a GitHub account, which is what I did. Once you have created your account, you will land in your own DAGsHub space, where you can create a new repository.
You will fill in information about the repo. Just add a name and a short description. You can leave the rest of the settings as defaults. Once you’ve created a repo, you can create and initialize the project from the command line:
git init
git remote add origin https://dagshub.com/<username>/<repo-name>.git
The project does not have any structure yet. Let’s add some data and setup files for your first project.
We will first create empty folders for storing data and outputs from the command line:
mkdir data
mkdir outputs
We will use data from an example machine learning project in the official DAGsHub tutorial. This means you need to download the project's requirements.txt file and add it to your main directory.
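Once you have done that, create and activate a new virtual environment with the following commands:
python3 -m venv .venv
echo .venv/ >> .gitignore
echo __pycache__/ >> .gitignore
# Activate the environment (on Windows, run .venv\Scripts\activate instead)
source .venv/bin/activate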
Now you can install all the requirements in your new virtual environment.
pip install -r requirements.txt
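If you are unsure whether the packages went into the virtual environment or into the system Python, a quick check like the one below can help. It is only a sketch: venv_status is a hypothetical helper name, and it relies on the fact that inside a virtual environment Python's sys.prefix differs from sys.base_prefix.

```shell
# venv_status is a hypothetical helper, not part of the tutorial project.
# It reports whether python3 currently resolves to a virtual environment.
venv_status() {
    python3 -c 'import sys; print("venv active" if sys.prefix != sys.base_prefix else "no venv active")'
}

venv_status
```

If it reports "no venv active", activate the environment with source .venv/bin/activate and run the pip install command again.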
Finally, you are ready to add training data to the data folder you prepared earlier.
echo /data/ >> .gitignore
echo /outputs/ >> .gitignore
wget https://dagshub-public.s3.us-east-2.amazonaws.com/tutorials/stackexchange/CrossValidated-Questions-Nov-2020.csv -O data/CrossValidated-Questions.csv
The wget command downloads the data for your project, and the two .gitignore entries ensure that you will not add data or output files to Git.
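The effect of the two .gitignore entries can be checked with git check-ignore. The sketch below does this in a throwaway repository so that it does not touch your real project; the scratch directory and the file names example.csv and model.joblib are made up for the illustration.

```shell
# Scratch repository, so the check does not touch your real project.
scratch="$(mktemp -d)"
git init -q "$scratch"
echo /data/ >> "$scratch/.gitignore"
echo /outputs/ >> "$scratch/.gitignore"
mkdir "$scratch/data" "$scratch/outputs"
touch "$scratch/data/example.csv" "$scratch/outputs/model.joblib" "$scratch/main.py"

# git check-ignore exits with 0 for ignored paths and 1 for paths Git would track.
for path in data/example.csv outputs/model.joblib main.py; do
    if git -C "$scratch" check-ignore -q "$path"; then
        echo "ignored: $path"
    else
        echo "not ignored: $path"
    fi
done
```

The files under data and outputs should be reported as ignored, while main.py should not. Running git check-ignore -v on a path inside your own project shows which .gitignore line is responsible.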
The last part of the initial setup is committing all the changes to Git.
git add .
git commit -m "Initialized my first DAGs Hub project"
git push -u origin master
Now your repo should look like this:
Note that it contains only the .gitignore and requirements.txt files. Great!
We will take care of the data folders using DVC. First, we will add a simple script that can be used to train our model. Copy this file (main.py) into your project directory, and add it to the remote repo as well.
git add main.py
git commit -m "adding training file"
git push -u origin master
Your project should look like this now.
This main.py file can either split the data into train and test sets or train the Machine Learning models. Let’s split the data first (make sure your virtual environment is activated).
python main.py split
And then let's train a model.
python main.py train
You should see similar output:
After splitting the data, you should have some new zip files in the data folder, and after training, you should have .joblib files in the outputs folder.
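A quick way to confirm that both steps produced files is to count them. This is a minimal sketch that assumes you run it from the project root; count_files is a hypothetical helper, and the exact file names produced by main.py are not assumed.

```shell
# count_files is a hypothetical helper: it counts the files in a folder
# (first argument) whose names match a glob pattern (second argument).
count_files() {
    find "$1" -type f -name "$2" 2>/dev/null | wc -l | tr -d ' '
}

echo "zip files in data: $(count_files data '*.zip')"
echo "joblib files in outputs: $(count_files outputs '*.joblib')"
```

A count of 0 for either folder suggests that the corresponding step did not run or wrote its files somewhere else.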
All of those need to be added to DVC. Let’s set it up by running:
dvc init
Now you can add data and outputs to DVC tracking.
dvc add data
dvc add outputs
And then add corresponding information to Git.
git add .dvc data.dvc outputs.dvc
git commit -m "Added data and outputs to DVC"
Now we will need to set up DAGsHub Storage as our DVC remote. You can use the following code with your credentials.
dvc remote add origin "https://dagshub.com/<DAGsHub_username>/<repo_name>.dvc"
dvc remote default origin --local
dvc remote modify origin --local user <DAGsHub_username>
dvc remote modify origin --local auth basic
dvc remote modify origin --local password <DAGsHub_password>
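Typing a password directly into a command leaves it in your shell history. One possible way around that, sketched below, is to keep the credentials in environment variables and wrap the same three dvc remote modify commands in a function. The names configure_dagshub_remote, DAGSHUB_USER, and DAGSHUB_PASSWORD are made up for this example; the dvc commands themselves are the ones shown above.

```shell
# Hypothetical helper: applies credentials from the environment to the
# "origin" DVC remote. --local keeps them in the local DVC config, which is
# not meant to be committed, rather than in the shared .dvc/config.
configure_dagshub_remote() {
    if [ -z "${DAGSHUB_USER:-}" ] || [ -z "${DAGSHUB_PASSWORD:-}" ]; then
        echo "Set DAGSHUB_USER and DAGSHUB_PASSWORD first" >&2
        return 1
    fi
    dvc remote modify origin --local user "$DAGSHUB_USER"
    dvc remote modify origin --local auth basic
    dvc remote modify origin --local password "$DAGSHUB_PASSWORD"
}
```

In bash, you could fill the password variable with read -rs DAGSHUB_PASSWORD, which reads a line without echoing it, and then call configure_dagshub_remote.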
Now we need to add the setup changes to Git.
git add .dvc/config
git commit -m "Configured the DVC remote"
And let's push all of this to the remote repo.
git push -u origin master
dvc push --all-commits
After this, all the files should be pushed to DAGsHub Storage.
As you can see, all the components, code, data files, and models are versioned, stored, and accessible in one place.
You have just set up your first Data Science project using DAGsHub and DAGsHub Storage!
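For a collaborator, getting the same code, data, and models could look roughly like the sketch below. fetch_project is a hypothetical helper, and the sketch assumes the repository is readable by the collaborator; for a private repository, the credential commands shown earlier would be needed before pulling. Because the default remote was set with --local in this walkthrough, a fresh clone has no default DVC remote, which is why the remote is named explicitly with -r origin.

```shell
# fetch_project is a hypothetical helper: it clones the Git repository and
# then pulls the DVC-tracked files from the remote called "origin".
fetch_project() {
    user="${1:-}"
    repo="${2:-}"
    if [ -z "$user" ] || [ -z "$repo" ]; then
        echo "usage: fetch_project <username> <repo-name>" >&2
        return 1
    fi
    git clone "https://dagshub.com/$user/$repo.git" || return 1
    # Run dvc pull in a subshell so the caller's working directory is unchanged.
    ( cd "$repo" && dvc pull -r origin )
}
```

It would be called as fetch_project <username> <repo-name>, with the same placeholders used in the commands above.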
You may be wondering: wait a minute, where are the steps for setting up DAGsHub Storage? There was actually no separate storage to create or configure: we only ran a few commands to point DVC at the repository's DAGsHub remote. Your data and models are then stored by DAGsHub Storage with no further configuration. This is what makes it simpler to get started with than setting up storage from providers such as Amazon or Google.
Additionally, data stored in DAGsHub Storage is easily viewable and searchable. To see this feature, head to the data folder in your DAGsHub Storage project view and inspect the CrossValidated-Questions.csv file.
CrossValidated-Questions.csv is the file we used to train the machine learning algorithm in the previous steps. Once you open it via DAGsHub Storage, you can inspect the whole file very easily. It looks much like a searchable pandas DataFrame.
DAGsHub Storage has many productive and useful features that aren’t mentioned here. They can elevate your machine learning project to the next level. It’s also a great collaboration tool, so your entire team can work together seamlessly.
In this article, you have learned how to set up your first project using DAGsHub and DAGsHub Storage, which allow an easy setup for Machine Learning projects. It is a new solution that makes collaboration easier than more traditional setups using Git, DVC, and additional remote storage providers.
I hope you enjoyed it. Happy coding!