Title: 离线安装Ubuntu显卡机 CreateTime: 2026-07-04 14:51:33 UpdateTime: 2026-07-04 17:46:20 CategoryName: Web --- # 说明 部分英伟达显卡机是不能联网的,需要离线安装服务器环境,这里使用[ubuntu-22.04.5-live-server-amd64.iso](https://mirrors.tuna.tsinghua.edu.cn/ubuntu-releases/22.04/ubuntu-22.04.5-live-server-amd64.iso) , 需要有一台联网服务器(或者虚拟机)下载需要的deb包,拷贝到显卡机,进行安装 **注意:使用初始化干净的Ubuntu系统,已经安装的deb包,`apt-get install -y --download-only`不会再次下载** # 基础环境 ```shell ## 使用国内的apt源,这里使用 https://mirrors.aliyun.com sed -i 's/http:\/\/cn.archive.ubuntu.com/https:\/\/mirrors.aliyun.com/g' /etc/apt/sources.list sed -i 's/http:\/\/security.ubuntu.com/https:\/\/mirrors.aliyun.com/g' /etc/apt/sources.list ## 配置docker仓库 # 卸载旧版本 apt remove docker docker-engine docker.io containerd runc # 更新软件源 sudo apt update # 安装所需依赖 sudo apt -y install apt-transport-https ca-certificates curl software-properties-common # 安装 Docker GPG 证书 curl -fsSL https://mirrors.aliyun.com/docker-ce/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker-aliyun.gpg # 新增 Docker 软件源信息 echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker-aliyun.gpg] https://mirrors.aliyun.com/docker-ce/linux/ubuntu $(. /etc/os-release && echo "${UBUNTU_CODENAME:-$VERSION_CODENAME}") stable" | tee /etc/apt/sources.list.d/docker.list > /dev/null # 安装 nvidia-container-runtime GPG 证书 curl -fsSL https://mirrors.ustc.edu.cn/nvidia-container-runtime/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg # 新增 nvidia-container-runtime 软件源信息 echo "deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://mirrors.ustc.edu.cn/nvidia-container-runtime/stable/deb/$(dpkg --print-architecture) /" | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list ``` ## 下载依赖 ```shell apt clean apt update ## 下载docker依赖包 apt-get install -y --download-only docker-ce docker-ce-cli containerd.io docker-compose-plugin ## 下载nvidia-container-runtime依赖包 apt-get install -y --download-only nvidia-container-runtime ## 安装显卡驱动的依赖,linux-modules-extra-5.15.0-119-generic $(uname -sr)需要匹配实际的内核版本 apt-get install -y --download-only build-essential libboost-program-options-dev cmake zip unzip rdma-core infiniband-diags ibverbs-providers libibverbs-dev dpkg-dev perl linux-modules-extra-5.15.0-119-generic ###下载的deb包都在 /var/cache/apt/archives mkdir -p ./deps/ cp -rf /var/cache/apt/archives/*.deb ./deps/ ## 下载并安装以下包 wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/nvidia-fabricmanager_580.167.08-1ubuntu1_amd64.deb wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/libnvidia-nscq_580.167.08-1ubuntu1_amd64.deb ##B300需要 wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/nvlsm_2025.10.14-1_amd64.deb sudo depmod -a sudo modprobe ib_umad #sudo modprobe rdma_ucm #sudo modprobe ib_uverbs #sudo modprobe ib_verbs #sudo modprobe ib_core lsmod | grep ib # nvidia-fabricmanager 服务 systemctl start nvidia-fabricmanager systemctl enable nvidia-fabricmanager ## 下载显卡驱动,可能需要梯子使用浏览器下载 wget https://us.download.nvidia.com/tesla/580.167.08/NVIDIA-Linux-x86_64-580.167.08.run wget https://developer.download.nvidia.com/compute/cuda/13.0.3/local_installers/cuda_13.0.3_580.126.20_linux.run ## 压测工具 #gpu-burn: https://github.com/wilicc/gpu-burn #p2pBandwidthLatencyTest: https://github.com/NVIDIA/cuda-samples #nvbandwidth: https://github.com/NVIDIA/nvbandwidth ``` # sglang运行GLM5.2-NVFP4 ## docker-compose.yaml ```yaml services: sglang-glm52-nvfp4: image: lmsysorg/sglang:dev-cu13-glm52-nvfp4 container_name: sglang-glm52-nvfp4 restart: unless-stopped network_mode: host privileged: true runtime: nvidia ipc: host shm_size: 128g #ports: # - "8000:8000" ulimits: memlock: soft: -1 hard: -1 stack: soft: 67108864 hard: 67108864 deploy: resources: reservations: devices: - driver: nvidia count: all capabilities: [gpu] volumes: # 宿主机模型路径改成你实际存放GLM5.2-NVFP4的目录 - /data/.cache/huggingface/hub/models--nvidia--GLM-5.2-NVFP4/snapshots/aec724e8c7b8ee9db3b48c01c320f63f9cdaf8aa:/app/models/GLM-5.2-NVFP4 command: > sglang serve --model-path /app/models/GLM-5.2-NVFP4 --served-model-name GLM5.2 --tensor-parallel-size 8 --quantization modelopt_fp4 --tool-call-parser glm47 --reasoning-parser glm45 --trust-remote-code --chunked-prefill-size 65536 --mem-fraction-static 0.85 --host 0.0.0.0 --port 8000 --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 ``` - `--model-path`: **模型本地路径**,指定加载大模型权重文件所在目录 - `--served-model-name`: **对外服务模型名**,接口请求时填写的模型名称,自定义别名 - `--tensor-parallel-size 8`: **张量并行数**,拆分模型权重分到 8 张 GPU 运行,多卡均分负载 - `--quantization modelopt_fp4`: **量化方案**,使用 NVIDIA ModelOpt FP4 4 比特权重量化,大幅省显存 - `--tool-call-parser glm47`: **工具调用解析器**,适配 GLM47 系列格式解析函数调用、插件调用逻辑 - `--reasoning-parser glm45`: **思维链推理解析器**,适配 GLM45 格式解析模型内部思考 / 推理内容 - `--trust-remote-code`: **信任远程代码**,自动执行模型仓库内自定义建模代码,加载非标准架构模型必备 - `--chunked-prefill-size 65536`: **分块预填充长度**,超长上下文分块编码,支持更大输入文本,数值越大支持越长 prompt - `--mem-fraction-static 0.85`: **静态显存占用比例**,限制模型权重最多占用单卡 85% 显存,预留显存给推理 / 缓存 - `--host 0.0.0.0`: **监听地址**,允许局域网 / 外网所有 IP 访问服务 - `--port 8000`: **服务端口**,API 接口默认监听 8000 端口 - `--speculative-algorithm NEXTN`: **投机解码算法**,选用 NextN 极速投机推理算法,加速生成速度 - `--speculative-num-steps 3`: **投机迭代步数**,单次推理执行 3 轮投机校验 - `--speculative-eagle-topk 1`: **Eagle 采样候选数**,仅取 Top1 最优候选 token,提升稳定性 - `--speculative-num-draft-tokens 4`: **预生成草稿 token 数**,一次提前预推 4 个预测 token,显著提升生成吞吐 ## 压测 ```shell # 设置 huggingface 国内镜像 export HF_ENDPOINT=https://hf-mirror.com # --max-concurrency 1 单用户压测 python3 -m sglang.bench_serving --backend sglang --model nvidia/GLM-5.2-NVFP4 --dataset-name random --random-input-len 307680 --random-output-len 2048 --num-prompts 10 --max-concurrency 1 --request-rate inf --host 192.168.0.12 --port 8000 --served-model-name GLM5.2 --random-range-ratio 1.0 ```