Only_Dead_Fish_Go_With_The_Flow: Deep Learning in Ubuntu: nvidia drivers and cuda

A Deep Learning environment in Ubuntu 18.04 Bionic Beaver:
Nvidia drivers and CUDA libraries

Hardware environment:
Lenovo ThinkStation P300, equipped with a 6-core Intel i7 CPU, 32 GB of RAM, an Nvidia Quadro P2000 GPU, a 512 GB NVMe SSD, and an 8 TB SATA HDD.

I am setting up the system as dual boot: Windows 10 Pro and Ubuntu Linux.
To get the dual boot environment working, I first ran the Windows 10 media creation tool from within the preinstalled Windows 10 Pro system, to prepare a USB key for reinstalling the system.

After that I reinstalled Windows from the USB key, repartitioning the NVMe drive so as to leave about 300 GB free alongside the Windows partitions.

After the Windows reinstall and the subsequent updates, I prepared an Ubuntu Linux installer USB key with the pendrivelinux tool, writing a current Ubuntu 18.04 image to the key.

Be sure to connect the screen to the first DisplayPort on the Nvidia board (the port closest to the mainboard).

I installed Ubuntu on the NVMe drive, specifying a 64 GB swap partition and devoting the rest of the available space to a Linux ext4 root partition mounted on "/".

After checking that both Windows and Linux boot correctly, I proceeded to set up the nvidia drivers on the Linux system. This is a bit tricky.

#lsb_release -a

Ubuntu 18.04.2 LTS

#apt-get update

#apt-get install nvidia-driver-390

Since the nvidia driver is a third-party kernel module containing proprietary (non open source) code, the installer asks you to define a password during installation when UEFI Secure Boot is enabled. This password has to be entered at the next reboot, at a firmware-level prompt for MOK (Machine Owner Key) enrollment.
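Whether the password prompt will appear depends on the Secure Boot state, which `mokutil --sb-state` reports. The following is only a small sketch: `sb_hint` is a hypothetical helper name, and it assumes the tool prints a line containing "SecureBoot enabled" or "SecureBoot disabled".

```shell
# Map the text printed by `mokutil --sb-state` to a short hint.
# sb_hint is a hypothetical helper, not part of mokutil.
sb_hint() {
  case "$1" in
    *"SecureBoot enabled"*)  echo "module signing needed: expect the MOK password prompt" ;;
    *"SecureBoot disabled"*) echo "no MOK enrollment expected" ;;
    *)                       echo "unknown Secure Boot state" ;;
  esac
}

# On the real machine (as root):
#   sb_hint "$(mokutil --sb-state 2>&1)"
```

Running it before installing the driver tells you in advance whether to expect the extra prompt at reboot.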

nvidia-smi shows the nvidia driver status

root@tsp330:~# nvidia-smi
Sat Jun 29 01:46:02 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.116                Driver Version: 390.116                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro P2000        Off  | 00000000:01:00.0  On |                  N/A |
| 49%   42C    P8     6W /  75W |    282MiB /  5056MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1207      G   /usr/lib/xorg/Xorg                            39MiB |
|    0      1279      G   /usr/bin/gnome-shell                          52MiB |
|    0      1466      G   /usr/lib/xorg/Xorg                           112MiB |
|    0      1599      G   /usr/bin/gnome-shell                          73MiB |
|    0      2692      G   /usr/lib/firefox/firefox                       1MiB |
+-----------------------------------------------------------------------------+
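When the driver is switched several times, as happens later in this post, it is handy to pull just the driver version out of the banner shown above. This is a minimal sketch; `driver_version` is a hypothetical helper name, and it only assumes the banner contains the text "Driver Version: " followed by the number.

```shell
# Extract the driver version from nvidia-smi's banner, read from stdin.
# driver_version is a hypothetical helper, not an nvidia tool.
driver_version() {
  sed -n 's/.*Driver Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# On a machine with a working driver:
#   nvidia-smi | driver_version
# nvidia-smi can also print selected fields directly:
#   nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader
```

The `--query-gpu` form is usually the better choice in scripts, since it does not depend on the layout of the banner.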

In order to use the GPU cores from code, two main things are needed: a GPU driver and a set of libraries (CUDA).
Both of these come from nvidia, but there are significant dependency issues.

Nvidia driver setup
We need to add some specific software repositories to install current nvidia drivers.

see https://launchpad.net/~graphics-drivers/+archive/ubuntu/ppa

#add-apt-repository ppa:graphics-drivers/ppa

#apt-get update

#apt-get install ssh screen mc

#ip a l

We also install some other tools, mainly to access the system from the network via ssh. Take note of the IP address of the system (shown by ip a l) and check that it is reachable from another machine: this will be needed if something goes wrong with the graphical system.
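To note down the address quickly, the one-line-per-address output of `ip -4 -o addr show` can be filtered. This is a sketch under the assumption that each line has the form "index: interface inet address/prefix ..."; `list_ipv4` is a hypothetical helper name.

```shell
# Print "interface address" pairs from `ip -4 -o addr show` output on stdin,
# skipping the loopback interface and dropping the /prefix length.
# list_ipv4 is a hypothetical helper name.
list_ipv4() {
  awk '$2 != "lo" && $3 == "inet" { sub(/\/.*/, "", $4); print $2, $4 }'
}

# On the workstation:
#   ip -4 -o addr show | list_ipv4
# Then, from another machine, check reachability with:
#   ping -c 3 <address printed above>
```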

Now open the graphical tool "Software & Updates" (use the search function to locate it: its icon is a cardboard box with the planet Earth).

You will see several driver options under the "Additional Drivers" tab.

I first selected the 410 version, since it is labeled as a "long lived branch" on the nvidia website. Apply the changes and then reboot.
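The same list of candidate drivers can also be read from the command line with `ubuntu-drivers devices`, which is useful when the graphical system is not available. The sketch below assumes its output contains lines of the form "driver : package-name - ... recommended"; `recommended_driver` is a hypothetical helper name.

```shell
# Print the package that `ubuntu-drivers devices` marks as recommended.
# Reads the command's output from stdin.
# recommended_driver is a hypothetical helper name, and the line format
# "driver : <package> - ... recommended" is an assumption.
recommended_driver() {
  awk '$1 == "driver" && /recommended/ { print $3; exit }'
}

# On the workstation:
#   ubuntu-drivers devices | recommended_driver
```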

At the reboot the workstation performed a BIOS reflash (quite scary), and after that it no longer loaded the graphical system. I only got a text console, and it responded badly.

I logged in over the network with ssh and executed apt-get update and apt-get upgrade. Running nvidia-smi, I got this error:

“NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.”
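One quick thing to check from the ssh session in this situation is whether the nvidia kernel module is loaded at all. This is a minimal sketch that reads `lsmod` output; `nvidia_module_loaded` is a hypothetical helper name.

```shell
# Succeed (exit status 0) if lsmod-style output on stdin lists a module
# named exactly "nvidia", fail otherwise.
# nvidia_module_loaded is a hypothetical helper name.
nvidia_module_loaded() {
  awk '$1 == "nvidia" { found = 1 } END { exit !found }'
}

# On the workstation:
#   if lsmod | nvidia_module_loaded; then echo "loaded"; else echo "not loaded"; fi
```

If the module is missing, `dkms status` shows whether it was built for the running kernel.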

I then reverted to the old driver.

#apt-get install nvidia-driver-390

After entering the password at the BIOS (MOK) prompt, I was able to boot in graphical mode.

I then went back to the "Software & Updates" graphical tool and selected nvidia-driver-430 (the newest one). After a reboot everything was fine.

nvidia-smi now shows driver 430.26 and CUDA version 10.2. uname -a shows kernel 4.18.0-25.

I assume that the nvidia drivers are ok now.

-------------

Cuda setup:

#apt-get install linux-headers-$(uname -r)

Executing #apt-get install nvidia-cuda-toolkit, you can see that apt offers CUDA version 9.1. On the nvidia site, I see that CUDA version 10.1 is available from the nvidia repositories.
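The version apt would install can be checked without starting an installation, using `apt-cache policy`. A small sketch, assuming the usual "Candidate: version" line in its output; `candidate_version` is a hypothetical helper name.

```shell
# Print the candidate version from `apt-cache policy <package>` output
# read on stdin.
# candidate_version is a hypothetical helper name.
candidate_version() {
  awk '$1 == "Candidate:" { print $2; exit }'
}

# On the workstation:
#   apt-cache policy nvidia-cuda-toolkit | candidate_version
#   apt-cache policy cuda-10-1 | candidate_version
```

Running it again after adding the nvidia repositories shows whether the newer packages have become visible to apt.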

See:

https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
https://askubuntu.com/questions/1077061/how-do-i-install-nvidia-and-cuda-drivers-into-ubuntu

I then executed the following commands as root (over 4 GB of packages are downloaded):

#sudo apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub

#sudo bash -c 'echo "deb http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 /" > /etc/apt/sources.list.d/cuda.list'

#sudo bash -c 'echo "deb http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 /" > /etc/apt/sources.list.d/cuda_learn.list'

#apt-get update

#apt-get install cuda-10-1

#apt-get update

#apt-get upgrade

#apt-get install libcudnn7

After this, some lines have to be added to the .bashrc file in the home directory of every user who needs the CUDA environment.

# set PATH for cuda 10.1 installation
if [ -d "/usr/local/cuda-10.1/bin/" ]; then
    export PATH=/usr/local/cuda-10.1/bin${PATH:+:${PATH}}
    export LD_LIBRARY_PATH=/usr/local/cuda-10.1/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
fi
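Because .bashrc is read by every new interactive shell, including nested ones, the lines above can add the same directory to PATH more than once. The following is one possible variant, not the method used in this post: `prepend_path` and `CUDA_HOME` are hypothetical names introduced only for illustration.

```shell
# Prepend directory $1 to the colon-separated list $2, unless it is
# already present. Prints the resulting list.
# prepend_path is a hypothetical helper name.
prepend_path() {
  case ":$2:" in
    *":$1:"*) printf '%s\n' "$2" ;;
    *)        printf '%s\n' "$1${2:+:$2}" ;;
  esac
}

# Possible use in .bashrc (CUDA_HOME is an assumed variable name):
#   CUDA_HOME=/usr/local/cuda-10.1
#   if [ -d "$CUDA_HOME/bin" ]; then
#     export PATH="$(prepend_path "$CUDA_HOME/bin" "$PATH")"
#     export LD_LIBRARY_PATH="$(prepend_path "$CUDA_HOME/lib64" "$LD_LIBRARY_PATH")"
#   fi
```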

Then I performed another reboot, and now nvidia-smi shows driver version 418.67 and CUDA version 10.1. nvcc --version shows CUDA version 10.1 too.
Going back to Software & Updates, the current driver appears to be 418. I selected the newest 430 again and rebooted.
It now works in graphical mode, with nvidia-smi showing driver version 430.26 and CUDA version 10.2, while nvcc --version reports CUDA 10.1.
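The two numbers do not contradict each other: nvidia-smi reports the highest CUDA version the installed driver supports, while nvcc reports the version of the installed toolkit. The sketch below puts them side by side; `nvcc_release` and `smi_cuda_version` are hypothetical helper names, and the parsing assumes the "release X.Y," and "CUDA Version: X.Y" texts that these tools print.

```shell
# Toolkit version, parsed from `nvcc --version` output on stdin.
# nvcc_release is a hypothetical helper name.
nvcc_release() {
  sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p' | head -n 1
}

# Highest CUDA version supported by the driver, parsed from
# `nvidia-smi` output on stdin. smi_cuda_version is a hypothetical name.
smi_cuda_version() {
  sed -n 's/.*CUDA Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# On the workstation:
#   echo "toolkit:               $(nvcc --version | nvcc_release)"
#   echo "driver supports up to: $(nvidia-smi | smi_cuda_version)"
```

As long as the driver's number is not lower than the toolkit's, code built with that toolkit should be able to run.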
I now install the nvidia profiler and another required library.

#apt-get install nvidia-profiler

#apt-get install libaccinj64-9.1

To test the compiler and the profiler, create a text file with the following code and save it as add.cu:

#include <iostream>
#include <math.h>

// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
    y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<20;
  float *x, *y;

  // Allocate Unified Memory - accessible from CPU or GPU
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Run kernel on 1M elements on the GPU (one block, one thread)
  add<<<1, 1>>>(N, x, y);

  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  cudaFree(x);
  cudaFree(y);

  return 0;
}

After preparing this file, we can compile it:

$nvcc add.cu -o add_cuda

This produces the add_cuda executable, which works.
We can then profile its execution as root:

root@TSP339:/home/coder/cuda# nvprof ./add_cuda
==6205== NVPROF is profiling process 6205, command: ./add_cuda
Max error: 0
==6205== Profiling application: ./add_cuda
==6205== Profiling result:
            Type Time(%)      Time     Calls       Avg       Min       Max Name
GPU activities: 100.00% 248.23ms         1 248.23ms 248.23ms 248.23ms add(int, float*, float*)
      API calls:   67.05% 248.29ms         1 248.29ms 248.29ms 248.29ms cudaDeviceSynchronize
                   32.74% 121.24ms         2 60.621ms 72.403us 121.17ms cudaMallocManaged
                    0.11% 410.24us         2 205.12us 199.94us 210.31us cudaFree
                    0.06% 214.62us        97 2.2120us      89ns 147.25us cuDeviceGetAttribute
                    0.02% 87.711us         1 87.711us 87.711us 87.711us cuDeviceTotalMem
                    0.02% 57.659us         1 57.659us 57.659us 57.659us cudaLaunchKernel
                    0.01% 20.944us         1 20.944us 20.944us 20.944us cuDeviceGetName
                    0.00% 2.3990us         1 2.3990us 2.3990us 2.3990us cuDeviceGetPCIBusId
                    0.00% 1.6290us         3     543ns      86ns 1.1260us cuDeviceGetCount
                    0.00%     541ns         2     270ns     132ns     409ns cuDeviceGet
                    0.00%     173ns         1     173ns     173ns     173ns cuDeviceGetUuid

==6205== Unified Memory profiling result:
Device "Quadro P2000 (0)"
   Count Avg Size Min Size Max Size Total Size Total Time Name
      48 170.67KB 4.0000KB 0.9961MB 8.000000MB 735.5200us Host To Device
      24 170.67KB 4.0000KB 0.9961MB 4.000000MB 360.0960us Device To Host
      12         -         -         -           - 2.480928ms Gpu page fault groups
Total CPU Page faults: 36
root@TSP339:/home/coder/cuda#
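To compare runs, for example after changing the kernel launch configuration, it helps to extract just the kernel time from the profiler report above. A minimal sketch, assuming the "GPU activities:" line keeps the column order shown; `kernel_time` is a hypothetical helper name.

```shell
# Print the total time of the first kernel listed under "GPU activities:"
# in nvprof's report, read from stdin.
# kernel_time is a hypothetical helper name.
kernel_time() {
  awk '/GPU activities:/ { print $4; exit }'
}

# nvprof writes its report to stderr, so redirect it into the pipe:
#   nvprof ./add_cuda 2>&1 | kernel_time
```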

This completes the CUDA setup.


Sunday, June 30, 2019
