DIY: Deep Learning DevBox
Recently I assembled and set up our development machine dedicated to deep learning tasks (which mostly means training deep neural networks). I had some experience with PC assembly and software installation before, so I thought it would be an easy task. However, during the process I learned several non-trivial things (mostly related to setting up a RAID array) that I would like to share with you.
Motivation
The original Nvidia Digits DevBox costs around $15,000 and ships within 4-8 weeks, which is ridiculous from my point of view. In a nutshell, it is just a PC with a published list of hardware components available to anybody. With the total price of the components around $9,000, I can't see the point in purchasing it from Nvidia. Running a few steps ahead, I can tell that the reason is to save your time and to spare your nerves :) But I believe that equipped with a passion for hardware and this tutorial, you will be able to do it faster and with less effort than I did.
Hardware
The list of the original DevBox components is published in the User Guide. I decided to follow the original list just because it is probably the best hardware one can purchase for a workstation. There are only two changes I introduced: an 8-core CPU instead of the 6-core one, and only two GPUs. So, my list looks like the following:
- Motherboard: Asus X99-E WS - $856. Excellent quality from Asus; impressive passive cooling; 4 PCI-E x16 slots and everything else you might need.
- CPU: Intel Core i7-5960X Haswell-E 8-Core 3.0GHz LGA 2011-v3 - $1,118. This CPU gives you 16 execution threads. Another option is the less expensive 6-core i7-5930K. Be aware: although the architecture supports a 46-bit physical address space, these consumer processors are specified for a maximum of 64GB of memory. If you want more, Intel Xeon is your choice.
- RAM: Corsair Vengeance 64GB DDR4 - $330. Don't forget to remove protective covers before installation :)
- GPU: 2x Nvidia Titan X from Gigabyte - 2x $1,362 = $2,724. Asus and EVGA are other options. This is the best consumer GPU from Nvidia for now: the most recent Maxwell architecture, 12GB of memory, 7 TFLOPS of processing power. You might also consider the Tesla series (e.g. Tesla M40), but those are server solutions. If you do not need ECC (error-correcting) memory and you are not going to run GPU computations 24/7 in production, there is no need to spend 3-4 times more money on it.
- Storage for OS: Samsung 850 EVO 1TB - $307. This one is much bigger than the 250GB SSD from the original list. More storage is always better.
- Storage for data: Let's say, 3x Western Digital SE 3TB - 3x $174 = $522. Just pick good and reliable HDDs of the same model. They will later be combined into a 6TB RAID5 array (RAID5 over N disks yields (N-1) disks' worth of capacity: (3-1) x 3TB = 6TB).
- RAID cache: Samsung 950 PRO M.2 SSD 512 GB - $327.
- Power supply: EVGA Supernova 1600W - $352. Our setup requires a lot of power, especially if you go for 4 GPUs. This one is a real monster with high-quality cables and an ECO mode (the fan stays off until load exceeds 50%).
- Case: Corsair Carbide Air 540 - $120. Just an appropriate case for the monster we are building here. Very beautiful. A lot of cooling power with 5 fans installed. Out of the box, the fans are placed to form a constant air flow from the front panel to the back. The power supply and HDDs are mounted on the other side of the motherboard tray, which leaves more space for airflow.
- CPU cooling: Corsair H110 liquid cooling system with 2x120 mm fans - $120.
- Icy Box IB-543SSK 2x 5.25" backplane for 3x 3.5" HDDs (or similar) - around $100. Our case has only two bays for 3.5" HDDs, so we need a place to install them. Additional cooling is provided.
- Don't forget about thermal paste.
The total price is around $4,150 for the base workstation + $2,724 for GPUs = ~$6,900. With 4 GPUs it is around $9,600 - you save over $5,000 compared to Nvidia's version. Please note that Nvidia also adds a custom rear support bracket to ensure proper GPU spacing. The original DevBox also uses a special version of the GeForce GTX Titan X that includes a modified bracket which removes unused display I/O ports to maximize airflow (see the Digits DevBox design guide). Another difference is Nvidia's brand shield on the case :)
Putting hardware together
Assembly of the hardware is pretty straightforward and should not pose any substantial problems for an experienced user. There are only a few notes I would like to mention:
- You can install the CPU cooling radiator either on the front or on the top of the case. I prefer to leave the case's native fans on the front panel untouched and to install the radiator with its fans on the top. This way, warm air is effectively extracted, and there are no obstacles for the cool air that comes from the front and blows over the GPUs. Pay attention to the correct placement of the CPU radiator fans: they must blow upwards, so install them with the Corsair logo on top.
- The M.2 SSD is installed in the corresponding slot on the motherboard.
- Don't forget to attach the additional power cables to the graphics cards.
- It seems that SLI is undesirable for our purposes: see the Nvidia forum post and the CUDA documentation.
Having put all the parts together, you can power your new workstation on. Don't be scared if you get no signal on the monitor and an 'A2' diagnostic code on the motherboard. A2 simply means a bootable device was found (no surprise :). Try toggling your monitor off and on and entering the BIOS setup (F1, F2 or DEL).
Notes on BIOS RAID
The BIOS is well documented by Asus in the motherboard's user guide, and I did not need any further tuning. There is one thing that you might find tempting to use (like I did): the BIOS RAID setup, aka the Intel Rapid Storage Technology Option ROM Utility. TL;DR: just ignore it.
At first, I thought that after creating a RAID array in the BIOS, Ubuntu would see it as a single volume device. Not at all. Here is what the official white paper tells us:
The primary benefit of using Intel RST is in the presence of an Intel RST option ROM where the system can boot directly from any Intel RST RAID volume type instead of creating a dedicated partition or using a RAID superblock partition to store the bootloader.
Another quote:
There is an option ROM (OROM) component in the BIOS that can create Intel RST RAID volumes and serves as the interface to the Intel RST RAID volumes in the pre-boot environment. Before the BIOS passes control to the system bootloader, the OROM leaves a copy of the features it supports, such as RAID 5, in system memory. This data can be read by mdadm to determine what features can be used when creating an Intel RST volume.
Thus, the OROM component is not really useful for us (since we use RAID for data storage only) and is just a source of headache. You have to set up and manage your RAID via mdadm anyway. BIOS/firmware RAID is also known as FakeRAID and is not recommended by the Linux community. I personally got the feeling that the OROM is only needed if you want to install Windows on a RAID volume and boot from it later.
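Incidentally, if you are curious what the OROM actually advertises to the OS, mdadm can read that table directly (run as root; the output varies by platform):

mdadm --detail-platform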
Software
Our operating system is Ubuntu 14.04 LTS aka Trusty Tahr.
Software RAID
Right after installation I created our RAID5 storage. Here you can find a lot of useful information concerning software RAID on Linux.
In the partition types chapter, the authors discuss which type of block device to use for RAID:
Arrays are built on top of block devices; typically disks. This leads to 2 frequent questions: 1) Should I use entire device or a partition? 2) What partition type?
There is no right answer - you can choose. Your editor prefers to use partitions that are slightly smaller than a device. This allows: 1) the device has a partition table - no 'partition table not found'; 2) replacement disks even of the same model are often slightly smaller and making the partition 100Mb smaller than the device allows some tolerance; 3) no performance impact
Neil, the md/mdadm author, uses whole disks.
There is a right answer for us, though: partitions. Initially, I created the array on whole disks, but after rebooting I could not find it anymore. After googling, I found this post on StackOverflow and tried again, now with partitions. Everything was fine this time. So, the recipe for our RAID5 is the following:
- Create partitions on the target disks using parted (a sketch follows this list).
- Create a RAID5 volume over the three partitions with a chunk size of 128KB:
mdadm --create /dev/md/raid --chunk=128 --level=5 --raid-devices=3 /dev/sdb1 /dev/sdc1 /dev/sdd1
Pay attention that we do not use the IMSM metadata format here; it only introduces execution errors related to write-intent bitmaps. The bitmaps themselves are also left disabled: I found them to reduce disk performance in benchmarks.
- Format the RAID volume, aligning ext4 to the array geometry (stride = chunk size / block size = 128KB / 4KB = 32; stripe-width = stride x 2 data disks = 64):
mkfs.ext4 -v -m .1 -b 4096 -E stride=32,stripe-width=64 /dev/md/raid
- Don't forget to save the configuration:
mdadm -E -s >> /etc/mdadm/mdadm.conf
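For the first step above, here is a minimal parted sketch, assuming /dev/sdb (repeat for /dev/sdc and /dev/sdd). The device names and the 100MiB safety margin are illustrative, following the wiki's advice to keep partitions slightly smaller than the disk:

# GPT label, one partition ending 100MiB short of the disk's end,
# flagged as a Linux RAID member.
parted /dev/sdb mklabel gpt
parted -a optimal /dev/sdb mkpart primary 1MiB -100MiB
parted /dev/sdb set 1 raid on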
Now, check the status: cat /proc/mdstat. You will probably see that the array is being recovered (or something similar). Wait until the process ends, which took about 6 hours in my case. The resulting volume is 6TB.
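If you want to keep an eye on the initial sync, the standard commands are enough (purely optional):

# Refresh the sync progress every minute.
watch -n 60 cat /proc/mdstat
# A more detailed view of the array state.
mdadm --detail /dev/md/raid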
RAID cache with SSD
Using an SSD as a random-access cache can greatly improve the read and write speed of spinning HDDs. In our hardware list we have a 512GB SSD dedicated to exactly this, so let's make use of it. We need additional software that implements the caching, and Flashcache has no real alternatives here. Clone it and follow these steps:
- First, let's build and install it:
make
make install
- Second, make it available as a dynamic kernel module via DKMS:
make -f Makefile.dkms boot_conf
modprobe flashcache
echo "flashcache" >> /etc/modules
- Check whether it was initialized successfully:
dmesg | tail
- If everything is fine, you can continue with creating a cached block device:
flashcache_create -v -p back home_cached /dev/nvme0n1 /dev/md/raid
Here we use the write-back mode of operation; /dev/nvme0n1 is my SSD device. Keep in mind that in write-back mode dirty blocks live on the SSD until they are flushed, so an SSD failure can cost you data.
- Update the initial RAM disk to include the Flashcache modules and helper apps:
update-initramfs -k all -u
- If everything was executed fine, you should see the home_cached device in /dev/mapper. Check out its performance in benchmarks (a simple write test is sketched below): in my case the write speed rose from 100 MB/s up to 400 MB/s! However, the possible improvement strongly depends on the number of cache hits, which in turn depends on how you use the storage.
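One crude way to measure sequential write speed before and after caching (the mount point and file name are illustrative; oflag=direct bypasses the page cache so you measure the device stack, not RAM):

# Temporarily mount the cached device and write 4GB to it.
mount /dev/mapper/home_cached /mnt
dd if=/dev/zero of=/mnt/ddtest bs=1M count=4096 oflag=direct
rm /mnt/ddtest
umount /mnt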
Finally, you can mount the cached drive and check whether it works flawlessly. If yes, why not mount it at /home at boot time? Copy your /home content to it (see the rsync sketch below) and add this line to your /etc/fstab:
/dev/mapper/home_cached /home ext4 defaults,noatime,nodiratime 0 0
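For copying /home over, something like this should do (the temporary mount point is illustrative; the rsync flags preserve permissions, ownership, xattrs and sparse files):

mkdir -p /mnt/home_cached
mount /dev/mapper/home_cached /mnt/home_cached
rsync -aXS /home/ /mnt/home_cached/
umount /mnt/home_cached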
If you have trouble at boot time because the device is not ready yet, first check the correctness of the steps above. I had made a typo when editing fstab (mapped instead of mapper) and had an unforgettable time trying to figure out what the problem was. You might also want to ensure the flashcache module is loaded at boot time. Use the sysv-rc-conf tool for this (check levels 2-5 for flashcache).
If you still experience boot problems, save this startup script as /etc/init.d/flashcache, make it executable, and run update-rc.d flashcache defaults.
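I won't duplicate the linked script here, but its essence boils down to loading the module and re-attaching the existing write-back cache. A hypothetical minimal version (device names as above; the real script adds error handling):

#!/bin/sh
### BEGIN INIT INFO
# Provides:          flashcache
# Required-Start:    $local_fs
# Required-Stop:
# Default-Start:     S
# Default-Stop:
# Short-Description: Re-attach the flashcache volume at boot
### END INIT INFO
# Load the kernel module and re-attach the write-back cache created earlier.
modprobe flashcache
flashcache_load /dev/nvme0n1 home_cached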
Disabling X server
I use the workstation as a remote host via ssh, so running an X server makes no sense and needlessly occupies one of the GPUs. To disable the X server on startup, edit /etc/default/grub. Find the line:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
and change it to:
GRUB_CMDLINE_LINUX_DEFAULT="text"
Don't forget to run update-grub afterwards to apply the change. This prevents the GUI from loading at startup, but it also disables the automatic loading of the Nvidia kernel module, which makes it impossible to use CUDA. We have to load the kernel module and create the corresponding devices manually. Put one more startup script into /etc/init.d/nvidia, make it executable, and add this line to /etc/rc.local to run it on startup:
bash /etc/init.d/nvidia
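The script follows the pattern from Nvidia's CUDA installation guide: load the driver modules and create the device nodes that the X server would otherwise create. A sketch, as an illustration rather than a drop-in replacement for the linked script (the nvidia-uvm part is needed for CUDA 6 and later):

#!/bin/bash
# Load the Nvidia driver module.
/sbin/modprobe nvidia || exit 1
# Create one /dev/nvidiaN node per GPU (character major 195), plus the control node.
N=$(lspci | grep -i nvidia | grep -ci -e "VGA compatible controller" -e "3D controller")
for i in $(seq 0 $((N - 1))); do
  mknod -m 666 /dev/nvidia$i c 195 $i
done
mknod -m 666 /dev/nvidiactl c 195 255
# CUDA also needs the unified memory module and its node (dynamic major number).
/sbin/modprobe nvidia-uvm || exit 1
D=$(grep nvidia-uvm /proc/devices | awk '{print $1}')
mknod -m 666 /dev/nvidia-uvm c "$D" 0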
This is it, enjoy :)
Comments and discussion are welcome.