So you have heard a lot about Deep Learning and Convolutional Neural Networks, and you want to quickly try them out. But before you dive into the theory you want to get your hands dirty. And you don’t want to write a line of code. You also want to monitor the progress of your training process from your smartphone. All I can say is that I respect your laziness! Let’s get started.
In this post we will learn how to set up a Deep Learning framework ( NVIDIA DIGITS + Caffe / Torch ) on an Amazon EC2 instance. This setup will enable you to schedule training tasks, monitor progress, and visualize results using a web interface.
What is NVIDIA DIGITS ?
DIGITS stands for Deep Learning GPU Training System. It is a web / browser based graphical user interface that allows you to prepare data, set training parameters, choose from some popular neural net architectures (or use your own) and train a deep neural net. It is a perfect tool to get started if you know very little about Deep Learning. Under the hood DIGITS uses Caffe — the popular open source deep learning framework. Support for Torch — a deep learning framework backed by Facebook — is in beta, but you can try it out.
GPUs on EC2
One big obstacle in immediately starting with Deep Learning is access to a good GPU. You may not have an NVIDIA card on your laptop and even if you do it may not be very powerful. Sometimes training a deep neural net takes hours and it makes no sense to use your primary computer for the task.
Without a GPU, deep learning is painfully slow. In fact, one of the contributions of the 2012 paper that firmly established Deep Learning as the undisputed king of image classification algorithms was its clever use of two GPUs.
Fortunately we live in amazing times. We have access to near infinite compute power at our fingertips. All you need to do is register for Amazon Web Services ( AWS ).
This will give you access to Amazon’s Elastic Compute Cloud (EC2) and its virtually unlimited compute resources ( for a price of course ). The web interface allows you to start a virtual server called an “instance”. We are interested in the two GPU enabled instance types that have the following specifications.
Model | GPUs | vCPU | Mem (GiB) | SSD Storage (GB) |
g2.2xlarge | 1 | 8 | 15 | 1 x 60 |
g2.8xlarge | 4 | 32 | 60 | 2 x 120 |
In this tutorial we are going to use g2.2xlarge because it is less expensive ( $0.6 / hour ) and is sufficient for this tutorial. g2.8xlarge comes with 4 GPUs and you can use them all in parallel if you are using DIGITS with Caffe.
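As a rough guide to what a training session costs at the rate quoted above, the arithmetic can be sketched in shell. The function name `estimate_cost` is made up for this example, and the $0.6 hourly rate is simply the figure from this post; actual AWS pricing varies by region and over time.

```shell
# Hypothetical helper: estimate the on-demand cost of running one instance.
# HOURLY_RATE defaults to the g2.2xlarge price quoted in this post (an
# assumption; check current AWS pricing before relying on it).
HOURLY_RATE="${HOURLY_RATE:-0.6}"

estimate_cost() {
    hours="$1"
    # awk handles the floating-point multiplication that plain shell cannot.
    awk -v h="$hours" -v r="$HOURLY_RATE" 'BEGIN { printf "%.2f\n", h * r }'
}

# Example: a five-hour training session.
echo "5 hours cost about \$$(estimate_cost 5)"
```

Remember that the meter keeps running for as long as the instance is running, whether or not a training job is active.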
Install NVIDIA DIGITS using Amazon Web Services
I am going to assume that you have created an account on Amazon AWS and are logged in. Follow the steps below to set up an EC2 GPU instance. If you are already familiar with the process skip to the next section.
Set up EC2 GPU Instance
- Go to EC2 Management Console : On AWS Management Console click on EC2. This will bring you to EC2 Management Console.
- Launch instance : On EC2 Management Console go to Instances and click on Launch Instance button
- Choose Operating System : From the list of Operating Systems choose Ubuntu 14.04. Then click Next.
- Choose instance type : From the list of instance types choose g2.2xlarge. Then click on the Configure Instance Details button at the bottom of the page.
- Configure instance details : Make sure the number of instances is one. Pick a Subnet. It does not matter which one you pick, but make a note of it: if you later decide to attach a Volume ( storage space ) to your instance you will need to know the Subnet. Click the Next button.
- Add storage : I recommend you add 50GB at least. Click Next.
Note : This storage is NOT permanent. You will lose all data when you terminate your EC2 instance. If you are doing serious work, you should add an EC2 Volume to your instance.
- Tag Instance : Pick a name; any name is fine. Then click Next.
- Configure security group : Pick the “Create a new security group” option and give your security group a descriptive name. We want two ways to access the server. First, we want to be able to log on to the machine via ssh ( port 22 ). Second, we want to open port 80 to run the DIGITS web server. Notice that in my setup these two services are available from my IP address only. You may choose another custom IP. I do not recommend making the server accessible from any IP address.
- Review & launch
- Download Key : You need a key pair to ssh into this machine. Create a new key if you don’t have one. Choose a descriptive name. The downloaded file will have a .pem extension.
- Verify instance : To verify your instance is running, go to the EC2 Management Console, and then click on “Instances”. Copy the public ip address into your clipboard.
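For readers who prefer the command line, the security group rules from the list above (ssh on port 22 and the DIGITS web server on port 80, both limited to a single address) could be sketched with the AWS CLI. This is an illustration rather than part of the original walkthrough: the group ID and IP address are placeholders, the helper name `ingress_cmd` is invented, and the script only prints the commands so you can review them before running anything.

```shell
# Placeholders (assumptions): substitute your own security group ID and IP.
GROUP_ID="${GROUP_ID:-sg-xxxxxxxx}"
MY_IP="${MY_IP:-203.0.113.10}"

# Hypothetical helper: print the AWS CLI command that opens one TCP port
# to a single IP address (/32 means exactly that one address).
ingress_cmd() {
    port="$1"
    echo "aws ec2 authorize-security-group-ingress" \
         "--group-id $GROUP_ID --protocol tcp --port $port --cidr $MY_IP/32"
}

ingress_cmd 22   # ssh
ingress_cmd 80   # DIGITS web server
```

Once the output looks right, the printed lines can be pasted into a shell that has the AWS CLI installed and credentials configured.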
Install NVIDIA DIGITS on EC2 GPU Instance
We are now ready to install NVIDIA DIGITS on the GPU instance we created in the last step.
- SSH into EC2 Instance : Open a terminal ( on OSX or Linux ) or use an ssh client on Windows to log onto the machine. Type the following command with the full path to the .pem file you had downloaded and the public IP address of your machine.
# Change permission of your ssh key file.
chmod 600 your-pemfile.pem
# SSH into machine.
ssh -Y -i your-pemfile.pem ubuntu@your-public-ip
If you do not change the permission of your ssh key file you may receive the following warning.
WARNING: UNPROTECTED PRIVATE KEY FILE!
Permissions 0644 for 'yourpem.pem' are too open.
It is recommended that your private key files are NOT accessible by others.
This private key will be ignored.
bad permissions: ignore key: sentiment.pem
Permission denied (publickey).
- Update and upgrade packages with apt-get : Once you have logged in to the server, run the following command.
sudo apt-get update && sudo apt-get -y upgrade
- Install linux-image-extra : The base linux kernel package that comes with Ubuntu 14.04 instance on Amazon has some drivers missing. This is done to slim down the size of the linux image. So we need to install the drivers left out of the base package.
sudo apt-get install -y linux-image-extra-`uname -r`
- Install NVIDIA drivers
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update
sudo apt-get install nvidia-352 nvidia-settings
- Get CUDA and NVIDIA’s machine learning repos
CUDA_REPO_PKG=cuda-repo-ubuntu1404_7.5-18_amd64.deb && wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1404/x86_64/$CUDA_REPO_PKG && sudo dpkg -i $CUDA_REPO_PKG
ML_REPO_PKG=nvidia-machine-learning-repo_4.0-2_amd64.deb && wget http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1404/x86_64/$ML_REPO_PKG && sudo dpkg -i $ML_REPO_PKG
The machine learning repo above gives access to digits, caffe-nv, torch, libcudnn4.
- Install DIGITS
sudo apt-get update
sudo apt-get install digits
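Before opening the browser, it may be worth confirming that the driver can actually see the GPU, because the packages can install without complaint even when no GPU is usable. The sketch below is not from the original steps; `gpu_check` is a made-up helper name, and it relies only on `nvidia-smi -L`, which lists the GPUs visible to the installed driver. A reboot may be needed after installing the driver before the GPU shows up.

```shell
# Hypothetical helper: report whether a GPU is visible to the NVIDIA driver.
# The tool name is a parameter so the check is easy to exercise; by default
# it uses nvidia-smi, which ships with the NVIDIA driver.
gpu_check() {
    tool="${1:-nvidia-smi}"
    if ! command -v "$tool" >/dev/null 2>&1; then
        echo "no-driver-tool"   # driver utilities not installed or not on PATH
        return 1
    fi
    # "-L" prints one line per detected GPU; a failure means none is usable.
    if "$tool" -L >/dev/null 2>&1; then
        echo "gpu-visible"
    else
        echo "no-gpu-detected"
        return 2
    fi
}

gpu_check || echo "GPU not ready; check the driver install before training."
```

If this reports anything other than `gpu-visible`, training jobs in DIGITS are likely to fail with a CUDA error even though the web page loads.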
If everything went well, go to your public IP on the browser, and you will see this screen.
Woohoo! We are all set up! BTW if you relax your security requirements, you can actually view this page, and therefore monitor the progress of your training process, from your smartphone!
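Reaching the DIGITS page can also be checked from a terminal instead of a browser. The following sketch assumes `curl` is available on your local machine; `digits_url` and `digits_status` are invented helper names, and the IP address is a documentation placeholder.

```shell
# Hypothetical helper: build the URL of the DIGITS home page.
digits_url() {
    host="$1"
    port="${2:-80}"
    echo "http://$host:$port/"
}

# Hypothetical helper: print the HTTP status code returned by the server
# (curl prints 000 when no connection could be made).
digits_status() {
    curl -s -o /dev/null --max-time 5 -w '%{http_code}' "$(digits_url "$@")"
}

# Placeholder address; replace it with your instance's public IP, then run:
#   digits_status 203.0.113.10
digits_url 203.0.113.10
```

A 200 suggests the server is up. A timeout or 000 usually points at the security group rules or at the DIGITS server process rather than at the browser.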
Getting Started with NVIDIA DIGITS
The GitHub page for DIGITS provides an example of creating a dataset and training a model. Click here to get started.
NVIDIA DIGITS Configuration FAQ
- How can you configure DIGITS to run on a different port ?
You can configure DIGITS to run on a different port using the following command.
sudo dpkg-reconfigure digits
- Where does DIGITS store the datasets and trained models ?
DIGITS stores all data inside /usr/share/digits/digits/jobs.
ls /usr/share/digits/digits/jobs
There are two kinds of job directories: 1) a Dataset job, which contains information about a dataset created using DIGITS, and 2) a Training job, which contains information about a model trained using DIGITS. You can tell a job directory holds a dataset if it contains labels.txt, mean.binaryproto, train_db, train.txt, val_db, val.txt, etc. For example:
# Here 20160208-182427-0f82 is a Dataset job
$ ls -1 /usr/share/digits/digits/jobs/20160208-182427-0f82
create_train_db.log
create_val_db.log
labels.txt
mean.binaryproto
mean.jpg
status.pickle
train_db
train.txt
val_db
val.txt
On the other hand if it contains a trained model, you will see files named deploy.prototxt, solver.prototxt, train_val.prototxt, snapshot_iter_*.caffemodel etc. E.g.
# Here 20160209-011941-7953 is a Training job
$ ls -1 /usr/share/digits/digits/jobs/20160209-011941-7953
caffe_output.log
deploy.prototxt
snapshot_iter_104.caffemodel
.
.
snapshot_iter_960.caffemodel
snapshot_iter_960.solverstate
solver.prototxt
status.pickle
train_val.prototxt
- How to start / stop / restart DIGITS server ?
cd /usr/share/digits
# set new config
sudo python -m digits.config.edit -v
# restart server
sudo stop nvidia-digits-server
sudo start nvidia-digits-server
- How to change the default jobs directory in NVIDIA DIGITS ?
As mentioned above, by default DIGITS stores all data inside /usr/share/digits/digits/jobs/ . You probably want a different location for your data. For example, you may want all the DIGITS jobs to be stored on an attached volume. You can do so using the following commands.
cd /usr/share/digits
# set new config
sudo python -m digits.config.edit -v
# restart server
sudo stop nvidia-digits-server
sudo start nvidia-digits-server
NOTE: The new jobs directory you choose should be writable by www-data.
sudo chown -R www-data path_to_new_jobs_dir
- How to change configurations in NVIDIA DIGITS ?
The following commands will allow you to change all configurations in DIGITS. The configurations include the jobs directory, the GPUs to use, the log file location, the log level, the server name, the location of the Caffe installation and the location of the Torch installation.
cd /usr/share/digits
# set new config
sudo python -m digits.config.edit -v
# restart server
sudo stop nvidia-digits-server
sudo start nvidia-digits-server
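The dataset-versus-training-job distinction described earlier in this FAQ can be checked with a few lines of shell. This is a rough sketch based only on the file names listed above; `job_kind` is an invented helper, and other DIGITS versions or frameworks may lay out their job directories differently.

```shell
# Hypothetical helper: guess what a DIGITS job directory contains, using
# the marker files mentioned in this post.
job_kind() {
    dir="$1"
    if [ -e "$dir/train_db" ] && [ -e "$dir/labels.txt" ]; then
        echo "dataset"
    elif [ -e "$dir/solver.prototxt" ] && [ -e "$dir/train_val.prototxt" ]; then
        echo "training"
    else
        echo "unknown"
    fi
}

# Default jobs location from this post; override JOBS_DIR if you moved it.
JOBS_DIR="${JOBS_DIR:-/usr/share/digits/digits/jobs}"
for job in "$JOBS_DIR"/*/; do
    [ -d "$job" ] || continue          # skip when the glob matches nothing
    echo "$(basename "$job"): $(job_kind "$job")"
done
```

Run on the server, this prints one line per job directory, which can help when deciding which jobs to back up before terminating an instance.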
I’m new to Caffe. How much does it cost to use this kind of service?
If you use a GPU instance on Amazon, it will cost you $0.6 / hour. But you can install DIGITS on your own Linux box too.
Great…. Congrats, I really like your website…
I’m using caffe in my PC, but I think I need more hardware or I am not configuring well the solver and the layers to find a good solution in my problem…
Thanks Lucas. There are so many small things ( other than the hardware ) that can result in not so good results.
Yes you are right, I got some interesting results, but I need to learn more 🙂
Are you thinking to prepare some text like “how to use Caffe” or something like that?
I had not thought about it. Are Caffe’s online examples not useful ? If so, could you tell me what they are lacking ?
Thanks
Satya
Satya,
I haven’t tried this yet but that is awesome. Did you happen to make an AMI that I can copy?
I do 🙂 . I will share it shortly.
Hi Samir,
The AMI id is ami-d949afb9 . It is available in US West ( Oregon ) region. The name is bigvision-digits. Please let me know if it works for you. Make sure you allocate about 40 GB of space. I have also put a dataset called 17flowers in the data directory. This should help you get started.
Satya,
Thank you very very much for doing this. I found the AMI and I was able to access it.
It has no GUI so I assume that I have to run DIGITS from a local machine with the ip of the ec2 server to see the screenshot below?
I need to get digits on my machine before I proceed.
I am assuming you created an instance using that AMI. You have to find the public IP address of that instance. In this post search for “Verify instance” and you will see how to find the IP address. In your browser, simply go to that IP address. If the above web page does not show up, you will have to restart the DIGITS server. For this you have to log on to your instance and run
sudo stop nvidia-digits-server
sudo start nvidia-digits-server
Satya,
I don’t want to burden you with this and I appreciate your help. I have done that but I see nothing on my browsers.
Satya,
Sorry to bother with this. I think I did all of this but when I put the IP address in my browser, it doesn’t work.
Satya,
I was able to run the server.
1) I had an error in my custom IP setting: Under your step 9: “Review & launch”, you have a 5000 port although in your step 8 you had the correct port of 80. I now have port 80 and now it works.
2) I was able to train and test the flowers database although the labels are not correct. It seems that the folder names are not correct (but the training worked fine) and the validation worked great.
I want to thank you very very much for your help.
Thank you so much for the feedback. You are right, the new digits 3 runs on port 80. I will also check into the flower dataset. I had put it together quickly without checking so that you have something to try :).
Everything worked perfect so thank you again.
BTW I have a new post based on the discussion we had here. I have fixed the flower classes and created a video that explains how to use the AMI for people who are not familiar with Amazon EC2.
https://learnopencv.com/deep-learning-example-using-nvidia-digits-3-on-ec2/
Thanks for your feed Satya. DIGITS is indeed a great tool to get started with. TensorFlow also has a visualization board called TensorBoard, but the framework is a bit slower compared to other deep learning frameworks (this should change in the next release).
Therefore, I think DIGITS is the best choice for training image classification models so far.
Another great thing would be to afterwards use the DNN module (that supports GoogleNet !) to perform a forward pass with the trained model. But unfortunately, the DNN forward pass [1] does not generate the same results as the DIGITS standalone image classification test (“classify one image”, which is more accurate). So maybe you could write something about that ?
Cheers,
[1] http://docs.opencv.org/trunk/d5/de7/tutorial_dnn_googlenet.html#gsc.tab=0
Thanks for the feedback, Djebril. I also have a suspicion that DIGITS will soon have support for TensorFlow. Here is a recent quote from NVIDIA’s CEO Jen-Hsun Huang.
“TensorFlow will democratize deep learning” Jen-Hsun says. “That’s a huge contribution to humanity.”
Hi Satya, I have a server running DIGITS. Do you know how to password protect the page, or force some sort of login to access DIGITS? Thanks!
Hi Chris,
Authentication was added to DIGITS a few months back.
https://github.com/NVIDIA/DIGITS/pull/463
However, the apt-get package is not updated for minor release versions as noted here
https://github.com/NVIDIA/DIGITS/blob/master/docs/UbuntuInstall.md
Now the easiest option is to restrict access based on IP address while setting up AWS as I have described in this followup video
https://youtu.be/QZaAcl_F9R0?t=3m48s
There is a second more complicated option. Since DIGITS is a flask application you can use Flask’s basic auth to restrict access. I have not tried this.
http://flask.pocoo.org/snippets/8/
The final option is to compile digits from source using github.
https://github.com/NVIDIA/DIGITS
If many people have this need, I will change my AMI so that it uses digits compiled from source. But that will have to wait.
Hope this helps.
Satya
Thanks Satya for your quick reply! That’s exactly what I want!
Hi Satya! Thanks for your great tutorial. It helps a lot!
Unfortunately if I run an image classification, I get the following error:
Creating layer data
Check failed: error == cudaSuccess (38 vs. 0) no CUDA-capable device is detected
I set up the machine twice and did not see any errors while installing the packages. So do you have any clue what has gone wrong?
Same problem here. Any solutions?
@spmallick could you please check your tutorial to see if it still works?
I will look into it on Tuesday and let you know. Caught up in something until then.
Until then can you try using the AMI I have published
https://learnopencv.com/deep-learning-example-using-nvidia-digits-3-on-ec2/
Thanks @spmallick, it works perfectly
Same problem here. It’s not recognizing the presence of a GPU anywhere.
Just posted the problem to Stackoverflow. Hopefully someone will provide a solution:
http://stackoverflow.com/questions/37647530/installing-nvidia-digits-on-amazon-ec2-error-cudasuccess-38-vs-0-no-cuda
Satya, really appreciate the tutorial but it seems as if something may have changed. The current default AMI GPU instance following all of your steps yields an install of DIGITS but no support for GPU (which was kind of the point, LOL). I’m trying to manually install the CUDA libraries and drivers, etc. using a different tutorial. Any help at all would be appreciated. I’m not the only one (see other comments in this thread).
Thanks.
Hi Jay,
I will look into it on Tuesday and let you know.
Satya
As a followup this may be related to a bug in the deployment package that NVIDIA just discovered after I and a few others were having problems.
Thanks man. I have been scratching my head over this but have not found a solution. Do you have a link to the bug you mentioned ?
https://github.com/NVIDIA/DIGITS/issues/801
Thanks. I am watching the conversation now and will try again after they confirm a good fix.
Did you see the new p2 instance type? 🙂