Magdalena Konkiewicz

How to successfully add large data sets to Google Drive and use them in Google Colab


Image by Speedy McVroom from Pixabay

Introduction


In this post, I will explain how to add large data sets to Google Drive so they can be accessed from Google Colab for processing and modeling.


Uploading a single file is easy with Google Drive's drag-and-drop interface, but it becomes difficult with a large number of files. Dragging a whole folder containing 1GB of files simply fails and freezes Google Drive. The alternative is to drag a zipped folder. This upload usually succeeds and does not take long (a couple of minutes for a 1GB file), but the problem comes when unzipping the file in Google Drive itself, which results in random files going missing.


I spent three days trying to upload photo data for CNN training on a free GPU, and that led me to this alternative. Here I describe, step by step, how to upload big data sets successfully so they can be processed in Google Colab on the VMs it provides.


Steps


1. Zip the folder with the files. In my case, I had a folder called 'train' with 70257 .jpg files that took up around 1GB.
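If you would rather script this step than use your operating system's archive tool, a minimal sketch with Python's standard library could look like the following. The helper name `zip_folder` and its `out_dir` parameter are hypothetical and only for illustration; the folder name 'train' comes from my example above.

```python
import shutil
from pathlib import Path


def zip_folder(folder, out_dir="."):
    """Zip `folder` into `out_dir`, keeping the folder itself as the
    top-level entry of the archive (e.g. train/xyz.jpg).

    Hypothetical helper for illustration. Returns the archive path.
    """
    src = Path(folder).resolve()
    if not src.is_dir():
        raise NotADirectoryError(f"{src} is not a directory")
    # make_archive appends the ".zip" extension to base_name itself.
    base_name = str(Path(out_dir).resolve() / src.name)
    return shutil.make_archive(
        base_name,
        "zip",
        root_dir=str(src.parent),  # archive paths are relative to this
        base_dir=src.name,         # only this folder is archived
    )


# Example (run on your own machine, next to the 'train' folder):
# zip_folder("train")  # creates train.zip in the current directory
```

Keeping the folder as the top-level entry means that unzipping later recreates a 'train' folder rather than scattering the images into the working directory.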


2. Upload the zipped file using the Google Drive interface. I uploaded mine to an empty directory called data.



3. Open a new Google Colab notebook and mount Google Drive in it so you can access the zip file. If you have not set up Colab yet, you can follow this article that explains how to do it.


The command below will start the mounting process.


from google.colab import drive
drive.mount('/content/gdrive')

You will be asked to authorize access to Google Drive.

Follow the instructions to authorize access by copying and pasting the code, and your Drive should be mounted.




4. Now extract files to the local environment with the following command.


!unzip gdrive/My\ Drive/data/train.zip

Note that the 'data' folder containing my train.zip file is located in the Google Drive root directory. You will need to modify the path according to where your file is located.
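If you are not sure what the path to your archive is, one option is to search for it under the mount point. This is only a sketch: `find_archive` is a hypothetical helper name, and it assumes Drive is mounted at /content/gdrive as in step 3. Walking a large Drive this way can be slow, so it is probably best used once to discover the path.

```python
from pathlib import Path


def find_archive(mount_root, filename):
    """Return every path under `mount_root` whose file name is `filename`.

    Hypothetical helper for illustration; the result is a sorted list of
    path strings, empty if nothing matches.
    """
    root = Path(mount_root)
    if not root.is_dir():
        raise FileNotFoundError(f"{root} is not mounted or does not exist")
    # rglob searches the whole directory tree below the mount point.
    return sorted(str(p) for p in root.rglob(filename) if p.is_file())


# Example (in Colab, after mounting Drive):
# find_archive("/content/gdrive", "train.zip")
```

Whatever path it reports is the one to pass to unzip, with any spaces escaped or the whole path quoted.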


When the unzip command runs, you should see the files being extracted.



It takes less than 1 minute to unzip 1GB, so you should not have to wait long. Once you are comfortable that this command works, you can use the variation below, which suppresses the output.


!unzip gdrive/My\ Drive/data/train.zip > /dev/null


Once the cell has executed, you can see that the files have appeared in the local train folder. You can find it on the left-hand side of the Colab interface.
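Since silently missing files were exactly what went wrong when unzipping inside Google Drive, it may be worth confirming that the local extraction is complete. A minimal sketch, assuming the archive was extracted into the current working directory: it lists every file recorded in the archive and reports those that are absent on disk. `missing_files` is a hypothetical helper name.

```python
import zipfile
from pathlib import Path


def missing_files(zip_path, extract_dir="."):
    """Return the archive members that do not exist under `extract_dir`.

    Hypothetical helper for illustration; an empty list means every file
    listed in the archive was found on disk.
    """
    with zipfile.ZipFile(zip_path) as zf:
        # Skip directory entries; only real files are checked.
        expected = [info.filename for info in zf.infolist() if not info.is_dir()]
    root = Path(extract_dir)
    return [name for name in expected if not (root / name).is_file()]


# Example (in Colab, after the unzip cell has finished):
# missing = missing_files("gdrive/My Drive/data/train.zip")
# print(len(missing), "files missing")
```

This only checks that each file exists, not that its contents are intact, which is enough to catch the kind of dropped files described in this post.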


5. You can now use the files for anything. Just access them from the new 'train' folder. In my case, I can display the first image using the following code.



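import tensorflow as tf

# Load one of the extracted images from the local 'train' folder.
img = tf.keras.preprocessing.image.load_img('train/abs_000.jpg')
# As the last expression in the cell, the image is displayed inline.
img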



You can now use your data for anything you wish. In my case, it was training a CNN using a free GPU.


Summary


The process I described above is what worked best for me, so please feel free to copy it.


On the other hand, it would be much better to unzip the files in Google Drive, so they just stay there. This, however, was 'mission impossible' for me for three days in a row. Every time I unzipped the files and saved them in Google Drive, some photos were missing in the end.


I tried to do it programmatically with Colab as well as with Zip Extractor connected to Google Drive. Both methods resulted in missing files with no warning. I also tried two different Google accounts, and the problem persisted.


After losing three days to trial and error, I came up with the process described above, and it works smoothly for me every time I rerun the code.


I hope it helps others who are trying to load large data sets to Google Drive in order to process them in Colab. If you have found a better method, or were able to unzip large files in Google Drive, I would be curious to hear how you did it.


Happy coding!



