Build an Environment for Deep Learning

Recently I bought an NVIDIA GPU card, a GTX 1080 Ti, which is a very powerful card, and I'm very happy to finally have my own card for deep learning. To host it I bought a used HP Z420 workstation; though the Z420 is second-hand, it is more than capable of driving a GTX 1080 Ti, and buying used saved me a lot of money on my limited budget. The moment I got the card, I was eager to put everything together and make it functional. Now that I have a working environment for deep learning, I want to share the process of setting everything up, in the hope that it benefits others. Good luck!

My operating system is the recently released Ubuntu LTS version, namely Ubuntu 18.04, so the following steps were carried out in that setting.

Install the NVIDIA driver

There are three methods to install the NVIDIA driver for your card, which I will introduce below. But first, an important preparatory step is disabling nouveau.

1. Disable nouveau

To make your card work properly, one important thing to do is disable the open-source nouveau driver. You can do this by editing the GRUB config file (/boot/grub/grub.cfg), searching for the line containing quiet splash, and appending acpi_osi=linux nomodeset to the end of that line.
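Note that /boot/grub/grub.cfg is regenerated whenever GRUB is updated, so to make the change persistent you can instead set the parameters in /etc/default/grub. A sketch of the relevant line (the existing options on your system may differ):

```shell
# /etc/default/grub -- append the parameters to the default kernel command line
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash acpi_osi=linux nomodeset"
```

After saving, run sudo update-grub and reboot so the new kernel command line takes effect.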

2. Learn about your card

You can check the type of your card with the following command:

 ubuntu-drivers devices

The above command shows all devices which need drivers, and which packages apply to them. The output is as follows on my computer:

== /sys/devices/pci0000:00/0000:00:02.0/0000:05:00.0 ==
modalias : pci:v000010DEd00001B06sv00001458sd0000376Abc03sc00i00
vendor   : NVIDIA Corporation
model    : GP102 [GeForce GTX 1080 Ti]
manual_install: True
driver   : nvidia-driver-390 - distro non-free recommended
driver   : xserver-xorg-video-nouveau - distro free builtin

We can conclude from the output above that the machine has a GPU card made by NVIDIA Corporation, that its model is GP102 (GeForce GTX 1080 Ti), and that the driver package nvidia-driver-390 is recommended for installation.

3. Install the driver

To install the driver, one easy way is the command ubuntu-drivers autoinstall, which installs the drivers that are appropriate for automatic installation. A second way is adding a PPA repository and installing from it; related resources are easy to find. Here I mainly describe the third method, the one I used: installing from the executable .run file downloaded from the official NVIDIA website. The officially provided installer gives you the most recent driver and potentially better performance.
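For completeness, the PPA route is roughly the following sketch (the graphics-drivers PPA is a widely used community repository; install whichever driver version ubuntu-drivers recommends for your card):

```shell
# Add the community graphics-drivers PPA, refresh the package index,
# and install the recommended driver package
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt update
sudo apt install nvidia-driver-390   # or the version ubuntu-drivers recommends
```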

Step 1: download the installer file

You can visit the NVIDIA drivers site to download the driver installer file that matches your card, operating system, and language preference. In my case, I selected the options matching my setup and downloaded the most recent version.

Finally, I got the installer (.run) file in my Downloads folder.

Step 2: install the driver

To install the driver, change to the Downloads directory and run the following commands in the terminal:

sudo telinit 3
sudo sh NVIDIA-Linux-x86_64-<version>.run

The first command disables the GUI and takes you to a text-mode login. All you have to do is log in, run the second command (substituting the name of the file you downloaded), and follow the installer's instructions.

Note: You might need to run sudo apt install build-essential before carrying out the driver installation, because the installer needs a compiler toolchain (cc etc.) to build the kernel module.

When the installation completes, reboot your machine to make the driver work. Check the installation with the following commands:

  • Run lspci | grep -i nvidia and make sure the card shows up in the output.
  • An application named NVIDIA X Server Settings is installed; you can open it and check the settings.

4. Install cuda and cudnn

In this part, we install the CUDA toolkit and the cuDNN library. An important step is choosing the right CUDA version: I want to install tensorflow-gpu, which (at the time of writing) only supports CUDA 9.0, so this version is the only choice I can take.

P.S.: I tried CUDA 10.0 and CUDA 9.2; neither works with tensorflow-gpu. Knowing this will save you a lot of time.

With the version determined, go to the CUDA toolkit downloads site. That page offers the current CUDA toolkit 10.0 download, so you need to go to the legacy releases to get older versions.

On the downloads page, find and click Legacy Releases.

After you click Legacy Releases, you are taken to the CUDA toolkit archive, where you need to select CUDA Toolkit 9.0. The remaining steps are choosing the operating system, architecture, distribution, version, and installer type; in my case these were Linux, x86_64, Ubuntu, 16.04, and runfile (local). This version also requires four additional patches, so download those as well.


After all necessary files are downloaded, install each one with sudo sh <runfile>, in order: the main runfile first, then patch 1, patch 2, patch 3, and finally patch 4. When all are done, you have CUDA toolkit 9.0 installed.
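The sequence looks roughly like this (the filenames below are placeholders; substitute the exact names of the files you downloaded). One pitfall worth knowing: the base installer offers to install its bundled display driver, which you should decline if you already installed a newer driver as described above.

```shell
# Order matters: base toolkit first, then the four patches.
# Filenames are placeholders -- use the ones you actually downloaded.
sudo sh cuda_9.0_base_linux.run      # main installer; say "no" to the bundled driver
sudo sh cuda_9.0_patch1_linux.run
sudo sh cuda_9.0_patch2_linux.run
sudo sh cuda_9.0_patch3_linux.run
sudo sh cuda_9.0_patch4_linux.run
```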

Note: Add /usr/local/cuda-9.0/bin to your PATH and /usr/local/cuda-9.0/lib64 to your LD_LIBRARY_PATH.
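For example, these lines could go at the end of ~/.bashrc (a sketch; adjust the paths if you installed the toolkit elsewhere):

```shell
# Make the CUDA 9.0 binaries (e.g. nvcc) and shared libraries visible
export PATH=/usr/local/cuda-9.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
```

Reload with source ~/.bashrc; afterwards nvcc --version should report release 9.0.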

To install the cuDNN library, you need an NVIDIA developer account to download it. The download is an archive file; decompress it and you get a folder named cuda. All you need to do is copy the files in that folder into the corresponding locations of the CUDA toolkit installed above; see the following for details:

sudo cp cuda/include/cudnn.h /usr/local/cuda/include/
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64/
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*

Congrats! At this point you have CUDA toolkit 9.0 and the cuDNN library installed.

5. Install tensorflow-gpu

To install TensorFlow, please refer to Install TensorFlow with pip. Following the guide, I created a Python virtual environment and installed tensorflow-gpu with pip.
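In outline, the steps from that guide look like this (a sketch; at the time of writing the tensorflow-gpu package on PyPI was built against CUDA 9.0):

```shell
# Create and activate an isolated Python environment, then install TensorFlow
sudo apt install python3-venv        # if the venv module is not yet available
python3 -m venv ~/venv
source ~/venv/bin/activate
pip install --upgrade pip
pip install tensorflow-gpu           # the GPU-enabled build
```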

6. Test tensorflow

Now that we have TensorFlow installed, let's check it with a simple tutorial.

import tensorflow as tf
mnist = tf.keras.datasets.mnist

# Load the MNIST digits and scale the pixel values to [0, 1]
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# A simple fully connected classifier
model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(512, activation=tf.nn.relu),
  tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)

Running the code above produces the following output:

(venv) duliqiang@hp-z420-workstation:~/ml$ python 
Epoch 1/5
2018-11-03 14:39:13.246295: I tensorflow/core/common_runtime/gpu/] Found device 0 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.721
pciBusID: 0000:05:00.0
totalMemory: 10.92GiB freeMemory: 10.17GiB
2018-11-03 14:39:13.246620: I tensorflow/core/common_runtime/gpu/] Adding visible gpu devices: 0
2018-11-03 14:39:17.931728: I tensorflow/core/common_runtime/gpu/] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-03 14:39:17.931773: I tensorflow/core/common_runtime/gpu/]      0 
2018-11-03 14:39:17.931782: I tensorflow/core/common_runtime/gpu/] 0:   N 
2018-11-03 14:39:17.932021: I tensorflow/core/common_runtime/gpu/] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9827 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:05:00.0, compute capability: 6.1)
60000/60000 [==============================] - 14s 237us/step - loss: 0.2017 - acc: 0.9402
Epoch 2/5
60000/60000 [==============================] - 7s 116us/step - loss: 0.0813 - acc: 0.9747
Epoch 3/5
60000/60000 [==============================] - 7s 120us/step - loss: 0.0532 - acc: 0.9836
Epoch 4/5
60000/60000 [==============================] - 7s 119us/step - loss: 0.0370 - acc: 0.9880
Epoch 5/5
60000/60000 [==============================] - 7s 117us/step - loss: 0.0271 - acc: 0.9917
10000/10000 [==============================] - 1s 60us/step

Nice work! The GPU card is working as expected.