So you have heard a lot about Deep Learning and Convolutional Neural Networks, and you want to quickly try them out. But before you dive into the theory you want to get your hands dirty. And you don’t want to write a line of code. You also want to monitor the progress of your training process from your smartphone. All I can say is that I respect your laziness! Let’s get started.
In this post we will learn how to set up a Deep Learning framework ( NVIDIA DIGITS + Caffe / Torch ) on an Amazon EC2 instance. This setup will enable you to schedule training tasks, monitor progress, and visualize results using a web interface.
What is NVIDIA DIGITS ?
DIGITS stands for Deep Learning GPU Training System. It is a web / browser based graphical user interface that allows you to prepare data, set training parameters, choose from some popular neural net architectures (or use your own) and train a deep neural net. It is a perfect tool to get started if you know very little about Deep Learning. Under the hood DIGITS uses Caffe — the popular open source deep learning framework. Support for Torch — a deep learning framework backed by Facebook — is in beta, but you can try it out.
GPUs on EC2
One big obstacle in immediately starting with Deep Learning is access to a good GPU. You may not have an NVIDIA card on your laptop and even if you do it may not be very powerful. Sometimes training a deep neural net takes hours and it makes no sense to use your primary computer for the task.
Without a GPU, deep learning is painfully slow. In fact, one of the contributions of the 2012 paper that firmly established Deep Learning as the undisputed king of image classification algorithms was its clever use of two GPUs.
Fortunately we live in amazing times. We have access to near infinite compute power at our fingertips. All you need to do is register for Amazon Web Services ( AWS ).
This will give you access to Amazon’s Elastic Compute Cloud (EC2) and its virtually unlimited compute resources ( for a price of course ). The web interface allows you to start a virtual server called an “instance”. We are interested in the two GPU enabled instance types that have the following specifications.
| Model | GPUs | vCPU | Mem (GiB) | SSD Storage (GB) |
|---|---|---|---|---|
| g2.2xlarge | 1 | 8 | 15 | 1 x 60 |
| g2.8xlarge | 4 | 32 | 60 | 2 x 120 |
In this tutorial we are going to use g2.2xlarge because it is less expensive ( $0.6 / hour ) and is sufficient for this tutorial. g2.8xlarge comes with 4 GPUs and you can use them all in parallel if you are using DIGITS with Caffe.
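To get a feel for what a training run might cost, multiply the hours you expect to keep the instance running by the hourly rate. Here is a tiny sketch; `estimate_cost` is a made-up helper name, and the default rate is simply the $0.6 / hour figure quoted above, which may differ from what AWS charges when you read this.

```shell
# Rough cost estimate: hours * hourly rate, printed with two decimals.
# The default rate (0.6) is the g2.2xlarge price quoted in this post,
# not a live price; pass your own rate as the second argument.
estimate_cost() {
    awk -v h="$1" -v r="${2:-0.6}" 'BEGIN { printf "%.2f\n", h * r }'
}
```

For example, `estimate_cost 10` prints `6.00`, so leaving the instance up for ten hours costs about six dollars at that rate. Remember to stop or terminate the instance when you are done.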
Install NVIDIA DIGITS using Amazon Web Services
I am going to assume that you have created an account on Amazon AWS and are logged in. Follow the steps below to set up an EC2 GPU instance. If you are already familiar with the process skip to the next section.
Set up EC2 GPU Instance
- Go to EC2 Management Console : On AWS Management Console click on EC2. This will bring you to EC2 Management Console.
- Launch instance : On EC2 Management Console go to Instances and click on the Launch Instance button.
- Choose Operating System : From the list of Operating Systems choose Ubuntu 14.04. Then click Next.
- Choose instance type : From the list of instance types choose g2.2xlarge. Then click on the Configure Instance Details button at the bottom of the page.
- Configure instance details : Make sure the number of instances is one. Pick a Subnet. It does not matter which one you pick, but make a note of it: if you later decide to attach a Volume ( storage space ) to your instance you will need to know the Subnet. Click the Next button.
- Add storage : I recommend you add 50GB at least. Click Next.
Note : This storage is NOT permanent. You will lose all data when you terminate your EC2 instance. If you are doing serious work, you should add an EC2 Volume to your instance.
- Tag Instance : Pick a name — any name is fine. Then click Next.
- Configure security group : Pick the “Create a new security group” option and give your security group a descriptive name. We want two ways to access the server. First, we want to be able to log on to the machine via ssh ( port 22 ). Second, we want to open port 80 to run the DIGITS web server. Notice that I made these two services available from my IP address only. You may choose another custom IP. I do not recommend making them accessible from any IP address.
- Review & launch
- Download Key : You need a key pair to ssh into this machine. Create a new key if you don’t have one. Choose a descriptive name. The downloaded file will have a .pem extension.
- Verify instance : To verify your instance is running, go to the EC2 Management Console, and then click on “Instances”. Copy the public IP address to your clipboard.
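If you prefer a terminal to clicking through the console, the same steps can be roughly expressed with the AWS command line tool. The sketch below only prints the commands so you can review them before running anything. The helper names are mine, and the AMI ID, key name, security group ID and IP address are placeholders you would have to supply yourself. It also leaves out the 50GB storage setting.

```shell
# Sketch: print (do not run) AWS CLI commands that mirror the console steps.
# All arguments are placeholders supplied by the caller.

# Turn a single IPv4 address into a /32 CIDR block, so only that address is allowed.
to_cidr() {
    printf '%s/32\n' "$1"
}

# Print the two ingress rules (ssh on port 22, DIGITS on port 80)
# for one security group, restricted to one IP address.
print_ingress_rules() {
    group_id=$1
    cidr=$(to_cidr "$2")
    for port in 22 80; do
        echo "aws ec2 authorize-security-group-ingress --group-id $group_id --protocol tcp --port $port --cidr $cidr"
    done
}

# Print the launch command for a single g2.2xlarge instance.
# $1 = AMI ID of an Ubuntu 14.04 image, $2 = key pair name, $3 = security group ID.
print_launch() {
    echo "aws ec2 run-instances --image-id $1 --instance-type g2.2xlarge --count 1 --key-name $2 --security-group-ids $3"
}
```

Calling `print_ingress_rules sg-EXAMPLE 203.0.113.7` followed by `print_launch ami-EXAMPLE my-key sg-EXAMPLE` gives you three commands to inspect. Restricting the rules to a /32 keeps the server reachable from your own address only, which matches the security advice above.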
Install NVIDIA DIGITS on EC2 GPU Instance
We are now ready to install NVIDIA DIGITS on the GPU instance we created in the last step.
- SSH into EC2 Instance : Open a terminal ( on OS X or Linux ) or use an ssh client on Windows to log onto the machine. Type the following commands with the full path to the .pem file you downloaded and the public IP address of your machine.
# Change permission of your ssh key file.
chmod 600 your-pemfile.pem
# SSH into machine.
ssh -Y -i your-pemfile.pem ubuntu@your-public-ip
If you do not change the permission of your ssh key file, ssh will refuse to use the key and you may see the following warning.
WARNING: UNPROTECTED PRIVATE KEY FILE!
Permissions 0644 for 'yourpem.pem' are too open.
It is recommended that your private key files are NOT accessible by others.
This private key will be ignored.
bad permissions: ignore key: sentiment.pem
Permission denied (publickey).
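You can catch this problem before ssh does. The sketch below checks that group and others have no access to a key file, which is the condition ssh complains about. `file_mode` and `key_is_private` are hypothetical helper names; the `stat` fallback is there because GNU ( Linux ) and BSD ( OS X ) versions of `stat` take different options.

```shell
# Print the octal permission bits of a file.
# Tries GNU stat first, then falls back to the BSD / OS X syntax.
file_mode() {
    stat -c '%a' "$1" 2>/dev/null || stat -f '%Lp' "$1"
}

# Succeed only if group and others have no access to the key,
# i.e. the mode ends in "00" (for example 600 or 400).
key_is_private() {
    case "$(file_mode "$1")" in
        *00) return 0 ;;
        *)   return 1 ;;
    esac
}
```

A typical use would be `key_is_private your-pemfile.pem || chmod 600 your-pemfile.pem` right before the ssh command.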
- Update and upgrade packages with apt-get : Assuming you were able to log in and are now on the server, run the following.
sudo apt-get update && sudo apt-get -y upgrade
- Install linux-image-extra : The base linux kernel package that comes with the Ubuntu 14.04 instance on Amazon has some drivers missing. This is done to slim down the size of the linux image. So we need to install the drivers left out of the base package.
sudo apt-get install -y linux-image-extra-`uname -r`
- Install NVIDIA drivers
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update
sudo apt-get install nvidia-352 nvidia-settings
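Once the driver is installed, NVIDIA's `nvidia-smi` utility should be on your PATH and should list the GPU. The sketch below is one way to check for it before moving on; `have_cmd` and `driver_status` are names I made up for this example.

```shell
# Return success if the named command can be found on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Report whether a tool (nvidia-smi by default) is available.
# Returns non-zero when it is missing so it can be used in && chains.
driver_status() {
    tool=${1:-nvidia-smi}
    if have_cmd "$tool"; then
        echo "$tool: found"
    else
        echo "$tool: missing"
        return 1
    fi
}
```

Running `driver_status && nvidia-smi` prints the GPU table only when the utility is actually there. If it is reported missing, the driver installation step is the first thing to revisit.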
- Get CUDA and NVIDIA’s machine learning repos
CUDA_REPO_PKG=cuda-repo-ubuntu1404_7.5-18_amd64.deb && wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1404/x86_64/$CUDA_REPO_PKG && sudo dpkg -i $CUDA_REPO_PKG
ML_REPO_PKG=nvidia-machine-learning-repo_4.0-2_amd64.deb && wget http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1404/x86_64/$ML_REPO_PKG && sudo dpkg -i $ML_REPO_PKG
The machine learning repo above gives access to digits, caffe-nv, torch, libcudnn4.
- Install DIGITS
sudo apt-get update
sudo apt-get install digits
If everything went well, open your instance's public IP address in a browser, and you will see the DIGITS home page.
Woohoo! We are all set up! BTW, if you relax your security requirements, you can actually view this page, and therefore monitor the progress of your training process, from your smartphone!
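If the page does not load, it helps to check from a terminal whether the web server is answering at all. Here is a small sketch; `digits_url` is a hypothetical helper that builds the address for a host and an optional port ( 80 by default, the port we opened in the security group ), and 203.0.113.7 is a placeholder IP address.

```shell
# Build the URL of the DIGITS web interface.
# $1 = public IP or host name, $2 = port (defaults to 80).
digits_url() {
    host=$1
    port=${2:-80}
    if [ "$port" = 80 ]; then
        printf 'http://%s/\n' "$host"
    else
        printf 'http://%s:%s/\n' "$host" "$port"
    fi
}

# Example (placeholder IP): print only the HTTP status code of the home page.
#   curl -s -o /dev/null -w '%{http_code}\n' "$(digits_url 203.0.113.7)"
```

A status code of 200 suggests the server is up and the problem is on the browser side. No answer at all usually points to the security group or to the server not running.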
Getting Started with NVIDIA DIGITS
NVIDIA DIGITS Configuration FAQ
- How can you configure DIGITS to run on a different port ?
You can configure DIGITS to run on a different port using the following command.
sudo dpkg-reconfigure digits
- Where does DIGITS store the datasets and trained models ?
DIGITS stores all data inside /usr/share/digits/digits/jobs.
There are two kinds of jobs directories:
1) Dataset job : contains information about a dataset created using DIGITS.
2) Training job : contains information about a model trained using DIGITS.
You can tell a jobs directory contains a dataset if it contains labels.txt, mean.binaryproto, train_db, train.txt, val_db, val.txt etc. E.g.
# Here 20160208-182427-0f82 is a Dataset job
$ ls -1 /usr/share/digits/digits/jobs/20160208-182427-0f82
create_train_db.log
create_val_db.log
labels.txt
mean.binaryproto
mean.jpg
status.pickle
train_db
train.txt
val_db
val.txt
On the other hand if it contains a trained model, you will see files named deploy.prototxt, solver.prototxt, train_val.prototxt, snapshot_iter_*.caffemodel etc. E.g.
# Here 20160209-011941-7953 is a Training job
$ ls -1 /usr/share/digits/digits/jobs/20160209-011941-7953
caffe_output.log
deploy.prototxt
snapshot_iter_104.caffemodel
.
.
snapshot_iter_960.caffemodel
snapshot_iter_960.solverstate
solver.prototxt
status.pickle
train_val.prototxt
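Since the job directory names are just timestamps, telling the two kinds apart by eye gets tedious. The sketch below applies the rule of thumb described above: a directory with train_db and labels.txt is treated as a dataset job, and one with solver.prototxt or deploy.prototxt as a training job. `job_kind` is a name I made up, and the check is a heuristic based on the listings in this post, not something DIGITS itself provides.

```shell
# Guess what kind of DIGITS job a directory holds, based on marker files.
# Prints "dataset", "training" or "unknown".
job_kind() {
    dir=$1
    if [ -e "$dir/train_db" ] && [ -f "$dir/labels.txt" ]; then
        echo dataset
    elif [ -f "$dir/solver.prototxt" ] || [ -f "$dir/deploy.prototxt" ]; then
        echo training
    else
        echo unknown
    fi
}

# Example: label every job in the default jobs directory.
#   for d in /usr/share/digits/digits/jobs/*/; do
#       echo "$(job_kind "$d") $d"
#   done
```

The marker files were picked because they show up in one listing but not the other; status.pickle, for instance, appears in both and would not help.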
- How to start / stop / restart DIGITS server ?
- How to change the default jobs directory in NVIDIA DIGITS ?
As mentioned above, by default DIGITS stores all data inside /usr/share/digits/digits/jobs/ . You probably want a different location for your data. For example, you may want all the DIGITS jobs to be stored on an attached volume. You can do so using the following commands.
NOTE: The new jobs directory you choose should be writable by www-data.
sudo chown -R www-data path_to_new_jobs_dir
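After changing the owner it is worth confirming that the directory really is writable. The sketch below checks a directory either as the current user or, when you pass a user name, as that user via sudo. `dir_is_writable` is a hypothetical helper, and the sudo branch assumes your account is allowed to run commands as the other user.

```shell
# Succeed if the directory exists and is writable.
# $1 = directory, $2 = optional user to run the check as (uses sudo).
dir_is_writable() {
    dir=$1
    user=${2:-}
    if [ -n "$user" ]; then
        sudo -u "$user" sh -c 'test -d "$1" && test -w "$1"' sh "$dir"
    else
        [ -d "$dir" ] && [ -w "$dir" ]
    fi
}
```

For the case above you would run `dir_is_writable path_to_new_jobs_dir www-data && echo OK`. If nothing is printed, the permissions on the directory, or on one of its parent directories, are the likely cause.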
- How to change configurations in NVIDIA DIGITS ?
The following commands will allow you to change all configurations in DIGITS. The configurations include the jobs directory, the GPUs to use, the log file location, the log level, server name, location of caffe installation and the location of Torch installation.