I. System Environment

Ubuntu 18.04, kernel 4.15.0-159-generic

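II. Installing the GPU Driver

Method 1: Query the GPU model and the available drivers

For some cards this method does not report a model name; in that case use Method 2.

# Run the command

ubuntu-drivers devices

# Output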


WARNING:root:_pkg_get_support nvidia-driver-390: package has invalid Support Legacyheader, cannot determine support level
== /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0 ==
modalias : pci:v000010DEd00001C20sv00001D05sd00001045bc03sc00i00
vendor   : NVIDIA Corporation
model    : GP106M [GeForce GTX 1060 Mobile]
manual_install: True
driver   : nvidia-driver-460-server - distro non-free
driver   : nvidia-driver-450-server - distro non-free
driver   : nvidia-driver-470 - distro non-free recommended
driver   : nvidia-driver-470-server - distro non-free
driver   : nvidia-driver-460 - distro non-free
driver   : nvidia-driver-495 - distro non-free
driver   : nvidia-driver-390 - distro non-free
driver   : nvidia-driver-418-server - distro non-free
driver   : xserver-xorg-video-nouveau - distro free builtin
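The listing above marks one package as "recommended". As a minimal sketch, that package name can be picked out of the output with awk. The function name `recommended_driver` is hypothetical and not part of ubuntu-drivers; it only assumes the line layout shown above.

```shell
# Print the package name that `ubuntu-drivers devices` marks as "recommended".
# Reads the command's output on stdin; prints nothing if no line is marked.
recommended_driver() {
    awk '$1 == "driver" && /recommended/ { print $3; exit }'
}

# Typical use on a real machine:
#   ubuntu-drivers devices 2>/dev/null | recommended_driver
```

The result is only a hint. This guide installs the .run package from the NVIDIA website rather than the distribution package.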


Method 2: Look up the PCI device ID

lspci | grep -i vga
04:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 41)
af:00.0 VGA compatible controller: NVIDIA Corporation Device 1f02 (rev a1)
d8:00.0 VGA compatible controller: NVIDIA Corporation Device 1f02 (rev a1)

Note the "1f02" on the NVIDIA lines. It is the hexadecimal PCI device ID (not a decimal number), and it identifies the card model.

Look up the hexadecimal ID to find the corresponding card model at http://pci-ids.ucw.cz/mods/PC/10de?action=jump?origin=help
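When a machine has several cards, it can help to pull the device IDs out of the lspci output before looking them up. The following is a rough sketch that assumes the "NVIDIA Corporation Device xxxx" wording shown above. The function name `nvidia_device_ids` is made up for illustration.

```shell
# Extract the 4-digit hexadecimal device IDs that lspci prints for NVIDIA
# devices it cannot name. Reads lspci output on stdin, prints each ID once.
nvidia_device_ids() {
    sed -n 's/.*NVIDIA Corporation Device \([0-9a-fA-F]\{4\}\).*/\1/p' | sort -u
}

# Typical use on a real machine:
#   lspci | grep -i vga | nvidia_device_ids
```

Alternatively, `lspci -nn` prints the numeric vendor and device IDs in brackets next to each entry.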


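2. Download the driver from the NVIDIA website

NVIDIA Driver Downloads

This guide uses NVIDIA-Linux-x86_64-470.94.run from the production branch; the production branch is the recommended choice.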

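3. Install the driver

# Before installing, disable the bundled nouveau driver by blacklisting it
# (the redirection in "sudo echo ... >> file" runs without root privileges, so tee is used instead)

echo "blacklist nouveau" | sudo tee -a /etc/modprobe.d/blacklist.conf

# Regenerate the initramfs so that the blacklist takes effect at boot

sudo update-initramfs -u

# Reboot the server

sudo reboot

# After the reboot, check the result; no output means nouveau is disabled

lsmod | grep nouveau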
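The blacklist step appends a line every time it is run, so repeating the procedure leaves duplicate entries in the file. One possible way to make it repeatable is sketched below. `ensure_line` is a hypothetical helper, not a standard tool, and on a real system it would have to run as root to write under /etc/modprobe.d.

```shell
# Append a line to a file only if that exact line is not already present.
# $1 = file path, $2 = line to add. A missing file is created.
ensure_line() {
    file=$1
    line=$2
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Typical use in a root shell:
#   ensure_line /etc/modprobe.d/blacklist.conf "blacklist nouveau"
```

Calling it twice with the same arguments leaves a single copy of the line in the file.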

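# Stop the graphical display manager

sudo service lightdm stop

# Install the build dependencies

sudo apt install gcc make dkms -y

# Run the installer; it asks several questions along the way, so follow the prompts
# (cd is a shell builtin and cannot be run through sudo)

cd /opt/download

sudo chmod 755 NVIDIA-Linux-x86_64-470.94.run

sudo ./NVIDIA-Linux-x86_64-470.94.run --no-x-check --no-nouveau-check --no-opengl-files

# To uninstall the driver later, run

sudo /usr/bin/nvidia-uninstall

# Check that the installation works

sudo nvidia-smi

# Output like the following means the driver is installed correctly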


Tue Jan 11 11:06:57 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.94       Driver Version: 470.94       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   51C    P0    24W /  N/A |      0MiB /  6078MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
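For a scripted check, the driver version can be read from the header line of this table. The sketch below assumes the "Driver Version: x.y" wording shown above. The function name `driver_version` is illustrative only.

```shell
# Print the driver version found in nvidia-smi's table header.
# Reads nvidia-smi output on stdin and prints the first match.
driver_version() {
    sed -n 's/.*Driver Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Typical use on a real machine:
#   nvidia-smi | driver_version
```

nvidia-smi can also report the value directly with `nvidia-smi --query-gpu=driver_version --format=csv,noheader`, which avoids parsing the table.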

III. Installing Docker

1. Remove old versions

sudo apt-get remove docker docker-engine docker.io containerd runc

2. Install Docker

sudo apt-get update
# Install the prerequisite packages
sudo apt-get install apt-transport-https ca-certificates curl gnupg-agent software-properties-common
# Add Docker's official GPG key
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
# Verify that the key with this fingerprint is now present
sudo apt-key fingerprint 0EBFCD88
# Set up the stable repository
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
# Refresh the package index
sudo apt-get update
# Install the latest docker-ce
sudo apt-get install docker-ce
# Enable the service at boot and start it
sudo systemctl enable docker
sudo systemctl start docker

# Check the Docker version; the --gpus option used later requires Docker 19.03 or newer

docker --version
Docker version 20.10.12, build e91ed57
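The version requirement can be checked in a script rather than by eye. The following is a minimal sketch that assumes GNU `sort -V` is available (it is on Ubuntu) and that the version line has the format shown above. `docker_supports_gpus` is a hypothetical name.

```shell
# Succeed if the given `docker --version` line reports version 19.03 or newer,
# the first release that accepts the --gpus option.
docker_supports_gpus() {
    ver=$(printf '%s\n' "$1" | sed -n 's/^Docker version \([0-9][0-9.]*\).*/\1/p')
    [ -n "$ver" ] || return 1
    # sort -V orders version strings; the smaller of the two comes first.
    lowest=$(printf '%s\n' "19.03" "$ver" | sort -V | head -n 1)
    [ "$lowest" = "19.03" ]
}

# Typical use on a real machine:
#   docker_supports_gpus "$(docker --version)" && echo "--gpus is available"
```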

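3. Install the NVIDIA Container Runtime

# Create a script with the following content; it registers the NVIDIA package repository

vim nvidia-container-runtime-script.sh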

sudo curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | \
sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
sudo curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.list | \
sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
sudo apt-get update

# Run the script to add the repository

sh nvidia-container-runtime-script.sh
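The script builds the repository URL from `$ID$VERSION_ID`, which it reads from /etc/os-release. To see what that string looks like before running the script, a sketch along these lines may help. The function name `distribution_id` and its optional file argument are illustrative additions.

```shell
# Build the "$ID$VERSION_ID" string from an os-release style file.
# $1 = file to read (defaults to /etc/os-release). The body runs in a
# subshell, so the sourced variables do not leak into the calling shell.
distribution_id() (
    . "${1:-/etc/os-release}"
    printf '%s%s\n' "$ID" "$VERSION_ID"
)

# Typical use on a real machine:
#   distribution_id
```

On the Ubuntu 18.04 system described in this guide, the string should come out as ubuntu18.04.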

# Install the package

sudo apt-get install nvidia-container-runtime

# Restart Docker

sudo systemctl restart docker

# Test GPU access from inside a container

sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi

# The output matches what nvidia-smi prints directly on the host

Tue Jan 11 03:12:03 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.94       Driver Version: 470.94       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   47C    P0    24W /  N/A |      0MiB /  6078MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

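4. Running Docker as a non-root user

# Create the docker group; it normally already exists once Docker is installed

sudo groupadd docker

# Add the login user to the docker group

sudo gpasswd -a <username> docker

# Refresh the group membership in the current shell

newgrp docker

# Restart the Docker service

sudo systemctl restart docker

# Test: switch to the regular user and run the following command; correct nvidia-smi output means the setup works.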

docker run -u $(id -u):$(id -g) --gpus all nvidia/cuda:11.0-base nvidia-smi
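If the test fails with a permission error, the usual cause is that the current shell has not yet picked up the docker group. A small check is sketched below. `has_group` is a hypothetical helper that only inspects a space-separated list of group names.

```shell
# Succeed if the space-separated group list in $1 contains the group named $2.
has_group() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Typical use as the regular user:
#   has_group "$(id -nG)" docker || echo "log out and back in, or run: newgrp docker"
```

`id -nG` lists the groups of the current process, so the check reflects what the shell actually has and not just what is recorded in /etc/group.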

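With the environment in place, you can pull images that others have published on GitHub, or build your own images to meet the company's different development needs, without deploying this complex environment directly on the host.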
