---
title: "66. Running TensorFlow on Kubernetes"
date: 2020-12-19T17:00:00+08:00
draft: false
tags: ["tensorflow"]
categories: ["cloudnative"]
author: "springrain"
---

## 1. Install the NVIDIA graphics driver

### Check the graphics card

```shell
## Install dependencies
yum -y install pciutils gcc gcc-c++ wget kernel-devel kernel-headers

## Check for NVIDIA hardware
lspci | grep VGA
# A result like the following should appear; if the server has no NVIDIA card, stop here
#04:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
```

### Check the kernel version

```shell
uname -sr
#Linux 3.10.0-1160.6.1.el7.x86_64

ls /boot | grep vmlinuz
#vmlinuz-3.10.0-1160.6.1.el7.x86_64

rpm -qa | grep kernel-devel
#kernel-devel-3.10.0-1160.6.1.el7.x86_64
```

#### Note: the kernel-devel version must match the running kernel, otherwise the driver build fails!

If they do not match, download and install the kernel-devel package that corresponds to your kernel version, for example from:

http://rpmfind.net/linux/rpm2html/search.php?query=kernel-devel

### Download the driver

Download the driver from https://www.nvidia.com/Download/index.aspx : select the matching card model and download the installer script.

![](/public/66/1.jpg)

### Disable the built-in nouveau driver

```shell
vi /lib/modprobe.d/dist-blacklist.conf

#blacklist nvidiafb
blacklist nouveau
options nouveau modeset=0
```

![](/public/66/2.jpg)

```shell
## Rebuild the initramfs image
mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
dracut /boot/initramfs-$(uname -r).img $(uname -r)

## Switch to text mode
systemctl set-default multi-user.target

## Reboot
init 6

## After the reboot, check that nouveau is disabled; no output means it is
lsmod | grep nouveau
```

### Install the driver

```shell
## Run the downloaded installer
./NVIDIA-Linux-x86_64-455.45.01.run

## If the installer cannot find the kernel source, point it there explicitly
#./NVIDIA-Linux-x86_64-455.45.01.run --kernel-source-path=/usr/src/kernels/3.10.0-1160.6.1.el7.x86_64 -k $(uname -r)

## The installer unpacks the driver and walks through the installation; some warnings may appear along the way, but they are harmless
```

![](/public/66/3.png)
![](/public/66/4.png)
![](/public/66/8.png)
![](/public/66/5.png)
![](/public/66/6.png)

Installation complete!
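The kernel/kernel-devel match above is the step most likely to go wrong, so a pre-flight check before running the installer can save a failed build. A minimal sketch (the `check_kernel_match` helper is a made-up name, not an NVIDIA tool; assumes an RPM-based system):

```shell
# Compare the running kernel with the newest installed kernel-devel package.
# check_kernel_match is a hypothetical helper, defined here for illustration.
check_kernel_match() {
  # $1: running kernel (uname -r), $2: kernel-devel version-release.arch
  if [ "$1" = "$2" ]; then
    echo "match"
  else
    echo "mismatch"
  fi
}

running="$(uname -r)"
devel="$(rpm -q --qf '%{VERSION}-%{RELEASE}.%{ARCH}\n' kernel-devel 2>/dev/null | tail -n 1)"
check_kernel_match "$running" "$devel"   # "match" means it is safe to run the installer
```

If it prints `mismatch`, install the matching kernel-devel package (or reboot into the matching kernel) before running the `.run` installer.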
### Check the driver

```shell
# Show the GPU status
nvidia-smi
# If the installed card is listed, the driver works.
# If the command is not found, or nothing is shown, the installation failed;
# uninstall the driver and install it again.
```

![](/public/66/7.jpg)

## 2. Install nvidia-docker2

```shell
## Remove the old nvidia-docker, then install nvidia-docker2
##docker volume ls -q -f driver=nvidia-docker | xargs -r -I{} -n1 docker ps -q -a -f volume={} | xargs -r docker rm -f
sudo yum remove nvidia-docker

distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | \
  sudo tee /etc/yum.repos.d/nvidia-docker.repo
sudo yum install -y nvidia-docker2
sudo pkill -SIGHUP dockerd

## Test the NVIDIA runtime through Docker
docker run --rm --gpus all nvidia/cuda:11.1-devel-centos7 nvidia-smi
```

## 3. Install k8s-device-plugin

Reference: https://github.com/NVIDIA/k8s-device-plugin#enabling-gpu-support-in-kubernetes

```shell
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.7.2/nvidia-device-plugin.yml
```

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      # This annotation is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      # This toleration is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      - key: CriticalAddonsOnly
        operator: Exists
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
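      # Note (added): the device plugin registers with the kubelet over a
      # Unix socket under /var/lib/kubelet/device-plugins (the hostPath
      # mounted below); after registration the kubelet advertises the
      # node's nvidia.com/gpu capacity to the scheduler.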
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: nvidia/k8s-device-plugin:v0.7.2
        name: nvidia-device-plugin-ctr
        args: ["--fail-on-init-error=false"]
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
```

Create gpu-pod.yaml to test:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:11.1-devel-centos7
    resources:
      limits:
        nvidia.com/gpu: 1   # request a GPU; without this the pod is not guaranteed one
  - name: digits-container
    image: nvidia/digits:6.0
    resources:
      limits:
        nvidia.com/gpu: 1
```

Run ```kubectl create -f gpu-pod.yaml```. If the pod is scheduled onto the GPU server and runs to completion, GPU scheduling works.

## 4. Run TensorFlow on Kubernetes

Reference: https://blog.csdn.net/vah101/article/details/108098827

Create tensorflow.yaml:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensorflow-gpu-jupyter
  labels:
    app: tensorflow-gpu-jupyter
spec:
  replicas: 1
  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: tensorflow-gpu-jupyter
  template: # define the pod specification
    metadata:
      labels:
        app: tensorflow-gpu-jupyter
    spec:
      containers:
      - name: tensorflow-gpu-jupyter
        image: tensorflow/tensorflow:latest-gpu-jupyter
        #resources:
        #  limits:
        #    nvidia.com/gpu: 1
---
apiVersion: v1
kind: Service
metadata:
  name: tensorflow-gpu-jupyter
  labels:
    app: tensorflow-gpu-jupyter
spec:
  type: NodePort
  ports:
  - port: 8888
    targetPort: 8888
    nodePort: 30999
  selector:
    app: tensorflow-gpu-jupyter
```

Run ```kubectl create -f tensorflow.yaml```. Once the service is up, the Jupyter service is reachable on NodePort 30999.
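Once the Deployment is up, it is worth confirming that TensorFlow inside the pod actually sees the GPU. A minimal sketch (assumes `kubectl` access to the cluster; the resource names match the manifest above, and the `jupyter_url` helper is a made-up name for illustration):

```shell
# Cluster-side checks (shown as comments; run them against a real cluster):
#
#   kubectl rollout status deploy/tensorflow-gpu-jupyter
#   kubectl exec deploy/tensorflow-gpu-jupyter -- \
#     python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
#   kubectl logs deploy/tensorflow-gpu-jupyter   # Jupyter prints its login token here

# Hypothetical helper: build the browser URL from a node IP and the
# nodePort declared in tensorflow.yaml (30999).
jupyter_url() {
  echo "http://$1:30999"
}

jupyter_url 192.168.1.10   # prints http://192.168.1.10:30999
```

Open that URL in a browser and paste the token from the pod log to reach the notebook.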