GPU Computer: How much faster is it for Deep Learning?
Here I will show the performance of the GPULAB, the custom-built computer used by the ATLAS group at NBI for deep learning. Today I am just using 2 classic machine learning examples to show the performance of the machine compared to other NBI computers without a GPU (actually, without CUDA support). A nice package to benchmark your own machine can be found here, by Max Woolf.
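Before running anything, it is worth checking that the framework actually sees a CUDA device. A minimal check, assuming a TensorFlow backend:

```python
# List the devices TensorFlow can use; on a CUDA-enabled machine the
# output should include a GPU entry alongside the CPU.
from tensorflow.python.client import device_lib

devices = device_lib.list_local_devices()
print([d.name for d in devices])  # expect something like ['/cpu:0', '/gpu:0']
```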
Spoiler:
Using one GPU reduces the time per epoch by a factor of 5 to 44 with respect to CPU-only usage, equivalent to a factor of 20 to 176 with respect to a single thread.
CIFAR10: This is a dataset of 60,000 images of 32x32 pixels, divided into 10 classes with 6,000 images per category. The task is a classification problem with 10 mutually exclusive outputs, and the model implemented is a set of 3 convolutional neural networks (a minimal sketch is given after the timing list below).
- GPU Computer using GPUs: Training time: 10min 32s, time per epoch: 25s, performance: loss: 0.8645 - acc: 0.6986 - val_loss: 0.7276 - val_acc: 0.7494.
- GPU Computer using CPU: Training time: 42min 46s, time per epoch: 130s, performance: loss: 0.8779 - acc: 0.6955 - val_loss: 0.7469 - val_acc: 0.7466.
- 2012 iMac (i7): Training time: 76min 48s, time per epoch: 230s, performance: loss: 0.8793 - acc: 0.6935 - val_loss: 0.7753 - val_acc: 0.7297.
- 2015 MacBook Pro 13 (i7): Training time: 92min 18s, time per epoch: 272s, performance: loss: 0.8818 - acc: 0.6946 - val_loss: 0.7849 - val_acc: 0.7300.
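For reference, here is a minimal Keras sketch of this kind of model, with three convolutional layers feeding a 10-way softmax. The layer sizes, optimizer and batch size are my assumptions, not necessarily the benchmarked configuration:

```python
# Minimal CIFAR10 classifier sketch: 3 convolutional layers + 10-way softmax.
from keras.datasets import cifar10
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from keras.utils import to_categorical

(x_train, y_train), (x_test, y_test) = cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0            # scale pixels to [0, 1]
y_train, y_test = to_categorical(y_train, 10), to_categorical(y_test, 10)

model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    Flatten(),
    Dense(10, activation='softmax'),                          # 10 mutually exclusive classes
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=25, batch_size=32,
          validation_data=(x_test, y_test))
```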
MNIST: This is a large database of handwritten digits (0-9), and the task consists of identifying the number. This can be approached as a multi-class classification problem. Several papers show that the most accurate solutions are achieved with Convolutional Neural Networks (CNNs). The solution tested contains 2 convolutional layers, one dense ReLU (rectified linear unit) layer and an output given by a softmax layer (see the sketch after the timing list below). The MNIST database contains 60,000 training and 10,000 test images.
- GPU Computer using GPUs: Training time: 2min 13s, time per epoch: 4s, performance: loss: 0.0366 - acc: 0.9888 - val_loss: 0.0273 - val_acc: 0.9908.
- GPU Computer using CPU: Training time: 15min 28s, time per epoch: 75s, performance: loss: 0.0365 - acc: 0.9894 - val_loss: 0.0265 - val_acc: 0.9911.
- 2015 MacBook Pro 13 (i7): Training time: 36min 59s, time per epoch: 177s, performance: loss: 0.0391 - acc: 0.9884 - val_loss: 0.0283 - val_acc: 0.9900.
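Again for reference, a minimal Keras sketch matching this description (two convolutional layers, one dense ReLU layer, a softmax output); the hyperparameters are illustrative, not the exact benchmarked values:

```python
# Minimal MNIST classifier sketch: 2 conv layers, dense ReLU, softmax output.
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from keras.utils import to_categorical

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(-1, 28, 28, 1) / 255.0             # add channel axis, scale
x_test = x_test.reshape(-1, 28, 28, 1) / 255.0
y_train, y_test = to_categorical(y_train, 10), to_categorical(y_test, 10)

model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(128, activation='relu'),                           # dense ReLU layer
    Dense(10, activation='softmax'),                         # one output per digit
])
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=12, batch_size=128,
          validation_data=(x_test, y_test))
```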
The use of GPUs dramatically reduces the time per epoch: a factor of 5 to 18 with respect to the same machine using only the CPU (a 4-core / 8-thread Xeon, with all threads used for the CPU tests). Compared to a desktop computer (2012 iMac with a 4-core / 8-thread i7), the improvement is a factor of 9 to 44. In the case of a laptop (2015 MacBook Pro 13 with a 2-core / 4-thread i7), the improvement is a factor of 11 to 44.
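These factors come directly from the per-epoch times quoted above; a quick sanity check:

```python
# Per-epoch speedup factors computed from the timings quoted in the lists above.
epoch_seconds = {
    'CIFAR10': {'GPU': 25, 'same-machine CPU': 130, '2012 iMac': 230, '2015 MacBook Pro': 272},
    'MNIST':   {'GPU': 4,  'same-machine CPU': 75,  '2015 MacBook Pro': 177},
}
for task, times in epoch_seconds.items():
    gpu = times['GPU']
    for machine, t in times.items():
        if machine != 'GPU':
            print(f"{task}: GPU is {t / gpu:.1f}x faster than {machine}")
```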
When looking at millions of high-energy collisions, the training time on standard computers can span several days or weeks. The use of GPUs allows us to run much faster, getting results in hours rather than days or weeks. This is fundamental, as deep learning always requires a few iterations to fine-tune the architecture and training of the network: the same network has to be re-trained several times with small changes/improvements until the best performance is obtained. Something that in the past was very tedious can now be done in a reasonable time.
These are just standard machine learning tests, but in the coming weeks I will publish results and improvements for real particle physics problems.
P.S.:
A large fraction of the particle physics community uses Apple computers. For GPU computing, TensorFlow only supports CUDA, so NVIDIA GPUs are needed, but Apple currently ships only Intel or ATI GPUs. Theano supports OpenCL, but I did not succeed in making its GPU mode work on my iMac. It would be so great if Apple added the option to configure NVIDIA GPUs in their computers, as they did in the past. Having an NVIDIA 1080 Ti in my iMac would be just perfect for daily work.
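For the curious, this is roughly the kind of configuration I tried; the device name follows the libgpuarray convention of the newer Theano backend:

```python
# Hypothetical setup: point Theano at the first OpenCL device before
# importing it. On my iMac this never ran cleanly.
import os
os.environ['THEANO_FLAGS'] = 'device=opencl0:0,floatX=float32'
import theano  # should report the device it mapped, or fall back to CPU
```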