Train On More Photos In Less Space

how to train a deep learning model on more photos without using up all your memory
data science
deep learning

Jerad Acosta


August 20, 2022

More Efficient Training

[Free] Cloud Notebook Resources

I love using Kaggle when creating and training deep learning models. Its free, cloud-hosted Jupyter notebooks are a great resource for data science, machine learning, and deep learning enthusiasts like me.

If you also use Kaggle, Paperspace, or another free cloud-hosted notebook environment, you are probably used to keeping an eye on your allotted computing resources.

CPU Bound

One resource to look out for when training deep learning models on image data is the CPU. When a process is being held up by the CPU, it is considered CPU Bound. Because most neural network and machine learning training relies on parallelization through the GPU, we should be concerned when we notice our process being held up by the CPU instead.

In fact, anytime a process is bottlenecked on a single resource we should take note - this usually signals an opportunity for optimization.
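One rough way to check whether a step is CPU Bound is to compare the CPU time it consumes with the wall-clock time it takes. The sketch below is only an illustration using Python's standard time module; cpu_fraction, busy_work, and waiting_work are hypothetical names (not part of Kaggle or fastai), and the busy loop merely stands in for real work such as decoding images.

```python
import time

def cpu_fraction(task):
    """Return CPU time divided by wall-clock time for one call of `task`."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    task()
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    return cpu_used / wall_used if wall_used > 0 else 0.0

def busy_work():
    # Hypothetical stand-in for CPU-heavy work such as decoding images.
    return sum(i * i for i in range(500_000))

def waiting_work():
    # Hypothetical stand-in for work that mostly waits (disk, network, sleep).
    time.sleep(0.2)

print(f"busy:    {cpu_fraction(busy_work):.2f}")
print(f"waiting: {cpu_fraction(waiting_work):.2f}")
```

A ratio near 1 suggests the step keeps a core busy the whole time, while a ratio near 0 suggests it is mostly waiting on something else. Keep in mind that time.process_time() only counts the current process, so work done in separate worker processes will not show up here.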

Fortunately, Kaggle displays our CPU, GPU, and HDD (Hard Disk Drive) usage in the top right corner of the notebook. It is here we can identify when we are running low on resources or have used up our allotted amount of free computing power - if you are using their free tier access.

Believe it or not, decoding JPEG files is a slow and cumbersome process. Iterating through tens, hundreds, thousands, or even more images becomes quite a massive job - and it all happens on the CPU!

This is an example of the CPU Bound process mentioned earlier.

A CPU Bound Kaggle Notebook

FAST.AI and resize_images()

Let’s check out a pretty straightforward way of helping out our CPU: reducing the size of our images, so that each one takes less work and less memory to process.

Here we are going to use resize_images() from the fastai library.

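from fastai.vision.all import *  # provides Path and resize_images
# `path` is assumed to point at the dataset root, defined elsewhere in the notebook
trn_path = Path('sml')
resize_images(path/'train_images', dest=trn_path, max_size=256, recurse=True)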

Here we:

  1. Create a new path for the training images, trn_path, pointing to ./sml/.

  2. Then, using resize_images() we:

    • Grab each file in the directory path/'train_images' (the images)

    • Resize each file to a max_size of 256

    • Drop the new images in the directory given by dest=trn_path

    • Repeat this for every image in the directory, including subfolders (recurse=True)

We have taken a collection of 1024px images and shrunk them 4x along their longest side, which works out to roughly 16x fewer pixels per image. Additionally, resize_images() keeps the directory structure intact, writing the new images to the dest directory in the same layout as the originals.
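To make the arithmetic and the mirrored folder layout concrete, here is a minimal sketch in plain Python. It is not fastai's actual implementation (rounding and edge cases may differ); scaled_size and mirrored_path are hypothetical helper names, and the 1024x768 image and the a/img1.jpg file are made-up examples.

```python
from pathlib import Path

def scaled_size(width, height, max_size):
    """Shrink (width, height) so the longer side is at most max_size,
    keeping the aspect ratio."""
    longest = max(width, height)
    if longest <= max_size:
        return width, height  # already small enough, leave untouched
    ratio = max_size / longest
    return max(1, round(width * ratio)), max(1, round(height * ratio))

def mirrored_path(src_file, src_root, dest_root):
    """Return the location under dest_root that matches src_file's
    relative location under src_root."""
    return Path(dest_root) / Path(src_file).relative_to(src_root)

# Hypothetical 1024x768 photo capped at max_size=256.
new_w, new_h = scaled_size(1024, 768, 256)
print(new_w, new_h)                    # 256 192
print((1024 * 768) / (new_w * new_h))  # 16.0

# Hypothetical file inside a subfolder of the training images.
print(mirrored_path("train_images/a/img1.jpg", "train_images", "sml"))
```

Each side shrinks by a factor of 4, so the pixel count drops by 4 x 4 = 16, and the subfolder a/ reappears under sml/ just as it did under train_images/.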

Wrapping Up

  1. Fastai has a great function, resize_images(), that allows us to significantly reduce our CPU overhead by reducing the size of the images we are working with
  2. If a process in your notebook is bogging you down and you find yourself CPU Bound, this is a great opportunity for optimization
    • Look for ways to reduce the size of the files or data you are working with
    • Consider forms of parallel computing, or perhaps employing the GPU for that share of the work using something like PyTorch