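Creating an Ubuntu VM on PVE

Overview: prepare the ISO, install Ubuntu, and add a large-capacity data disk.

Prerequisite: an LVM-Thin pool named vmdata already exists (it must be selected as the storage for the system disk).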

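I. Prepare the Ubuntu 22.04 ISO image

  1. Download Ubuntu 22.04 LTS on your own computer

  2. Upload the downloaded ISO to the PVE local storage:

    • In the left pane of the PVE web UI, click Datacenter → node pve → select local (pve)
    • Open the ISO Images tab → click Upload in the upper-left corner
    • Select File → choose the downloaded ubuntu-22.04.5-desktop-amd64.iso
    • Click Upload and wait for the upload to finish
    • Once it finishes, the ISO appears in the ISO Images list
    • Download from URL also works, but since the ISO is already on your computer, Upload is the direct route
    • ISO files are stored under /var/lib/vz/template/iso by default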

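II. Create the Ubuntu 22.04 VM

Requirement: the system disk lives on vmdata.

  • In the Server View tree on the left, select node pve
  • Click Create VM in the upper-right corner

1. General tab

  • Node: defaults to pve
  • VM ID: defaults to 101 here (it must not clash with an existing ID)
  • Name: for example AI-122
    • AI indicates the purpose
    • 122 is the last octet of the IP that will be assigned internally
  • Leave everything else at the defaults and click Next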

2. OS tab

  • Select Use CD/DVD disc image file (iso)
    • Storage: local (the storage the ISO was just uploaded to)
    • ISO image: ubuntu-22.04.5-desktop-amd64.iso
    • Guest OS
      • Type: Linux
      • Version: 6.x - 2.6 Kernel (or Ubuntu, if that option is offered)
  • Next

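3. System tab

This tab mainly covers the boot method and the controllers.

  • Graphic card: leave at Default
  • Machine: select q35
    • q35 emulates the newer Intel Q35 chipset; it handles GPU passthrough, NVMe passthrough, and more complex PCIe device topologies better
    • i440fx has good compatibility but no native PCIe topology, which makes passthrough and modern devices more awkward
  • BIOS
    • OVMF (UEFI) is recommended, so the VM boots via UEFI from the start
    • SeaBIOS can be kept if you have a specific need for it
  • SCSI Controller: select VirtIO SCSI single (good performance, and the officially recommended choice)
  • Tick QEMU Agent (strongly recommended; it later lets PVE show the VM's IP, shut it down gracefully, and so on)
    • The agent may still need to be installed inside Ubuntu afterwards:
      sh
      sudo apt update
      sudo apt install qemu-guest-agent
      sudo systemctl enable --now qemu-guest-agent
  • Next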

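4. Disks tab

Key point: put the system disk on vmdata (the 3.84 TB SSD).

  • Bus/Device: select SCSI
  • Storage: select vmdata
    • The virtual disk is then created in the LVM-Thin pool on /dev/nvme1n1
  • Disk size: set as needed, for example 2048 (the field is in GiB)
    • The disk can be grown later, but not shrunk
    • Growing it is somewhat tedious (the partition and filesystem inside the guest must be extended too), so allocate enough the first time
  • Tick Discard: enables TRIM/UNMAP, so that when files are deleted inside the VM, PVE can actually return the freed space to the underlying LVM-Thin pool
  • Cache: leave at the default No cache, which is stable and safe
  • Expand Advanced
    • Tick SSD emulation: the guest then sees the disk as an SSD, which lets it apply SSD-specific optimizations
  • Next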

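5. CPU tab

  • Sockets: defaults to 1
    • For Linux guests, 1 socket with multiple cores is enough in the vast majority of cases
  • Cores
    • Check the number of logical CPUs on the host with:
    sh
    grep -c '^processor' /proc/cpuinfo
    • If the command prints 104, for example, 32 is a reasonable value to assign
  • Type
    • host is recommended: it performs best, particularly on a single-node PVE setup with a reasonably recent guest such as Ubuntu 22.04
    • If you later add nodes and need live migration, x86-64-v2-AES offers better compatibility
  • Expand Advanced
    • vCPUs: leave at the default (if the value is cleared, it is computed automatically as Sockets × Cores)
    • NUMA: leave off; there is no need to introduce the extra complexity
  • Next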

6. Memory tab

  • Memory (MiB): for example 1048576 (1 TiB)
  • Expand Advanced
    • Ballooning: ticked by default
    • Allow KSM: ticked by default
  • Next
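
The Memory field takes MiB, so sizes quoted in GiB or TiB need converting first. The sketch below shows only that arithmetic; gib_to_mib is a hypothetical helper name, not a PVE tool.

```shell
# Convert a size in GiB to the MiB value expected by the "Memory (MiB)" field.
# gib_to_mib is an illustrative helper, not part of PVE.
gib_to_mib() {
    echo $(( $1 * 1024 ))
}

gib_to_mib 1024   # 1 TiB expressed in GiB
gib_to_mib 128
```

The same conversion applies when the memory is later reduced to a smaller value such as 128 GiB or 64 GiB.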

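7. Network tab

  • Bridge: keep the default
    • The default is an already-configured Linux bridge; the localnetwork (pve) entry in the tree on the left usually corresponds to vmbr0
  • Model: select VirtIO (paravirtualized) for better performance
  • Next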

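8. Confirm tab

  • Check the summary, in particular:
    • ide2: local:iso/ubuntu-22.04.5-desktop-amd64.iso,media=cdrom
    • efidisk0: vmdata
  • If everything looks right, click Finish to create the VM

The VM now exists, but no operating system is installed yet.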

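III. Install Ubuntu 22.04 in the VM

  • In the tree on the left, select the newly created VM (for example 101 (AI-122))
  • Click Start at the top
  • Open the Console drop-down and choose noVNC to open the console
  • The VM boots from the ISO into the Ubuntu installer:
    • If your screen resolution is too low, use the scrollbar on the right to reach the options further down
    • Choose English as the language and click Install Ubuntu
    • Choose the keyboard layout English (US)
    • Choose Normal installation
    • Untick Download updates while installing Ubuntu (updates can be applied manually later)
    • For partitioning, choose Erase disk and install Ubuntu
      • The "disk" shown here is the virtual disk just created on vmdata, not a physical disk of the host, so it is safe to select
    • Choose the time zone: Shanghai
    • Set the hostname, user, password, and so on
      • Your Name
      • Your computer's name: ai122
      • Pick a username
      • Choose a password
      • Confirm your password
      • Tick Log in automatically (for convenience)
    • Optional: tick Install OpenSSH Server (makes SSH login easier later)
    • Wait for the installation to finish, then click Restart Now
    • The installer then prompts: Please remove the installation medium, then press ENTER
  • Remove the ISO and adjust the boot order (so the VM does not boot from the CD drive again)
    • In PVE, select the VM on the left → open the Hardware tab
    • Select CD/DVD Drive → Edit
      • Choose Do not use any media and click OK
    • Then open the Options tab → Boot Order → Edit
      • Make sure scsi0 (the system disk) is in the first row
    • Return to the prompt in the Console and press Enter; the VM should now boot straight into the installed system
  • Log in, click Activities → search for Terminal to open a terminal, and add it to Favorites for quick access
  • Click the power icon in the upper-right corner and choose Settings
    • Appearance → choose the Dark theme
    • Power → Power Saving Options
      • Set Screen Blank to Never
      • Set Automatic suspend to Off (so the VM does not suspend itself)
    • Network → Wired → click the settings icon next to Connected → IPv4
      • IPv4 Method: Manual
      • Addresses
        • Address: 192.168.31.122
        • Netmask: 255.255.255.0
        • Gateway: 192.168.31.1
      • DNS: untick Automatic and enter 192.168.31.1
      • Click Apply to save
      • Reboot so the static IP configuration takes effect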
    • Alternatively, configure the network from the command line (see: Setting a static IP on Ubuntu)
      sh
      # List the connection names
      nmcli connection show
      
      # Suppose the output shows NAME "Wired connection 2" (DEVICE enp10s18)
      CONN_NAME="Wired connection 2"
      
      # Set the static IP
      sudo nmcli connection modify "$CONN_NAME" ipv4.addresses 192.168.31.122/24 ipv4.gateway 192.168.31.1 ipv4.dns 192.168.31.1 ipv4.method manual
      
      # Restart the connection so the settings take effect
      sudo nmcli connection down "$CONN_NAME" && sudo nmcli connection up "$CONN_NAME"
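
The GUI asks for a dotted netmask, while nmcli takes a CIDR prefix (255.255.255.0 corresponds to /24). The sketch below does that conversion, assuming a well-formed, contiguous netmask; netmask_to_prefix is an illustrative name, not an nmcli feature.

```shell
# Convert a dotted IPv4 netmask into a CIDR prefix length.
# Illustrative helper; it does not verify that the mask bits are contiguous.
netmask_to_prefix() {
    prefix=0
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the netmask into its four octets
    IFS=$old_ifs
    for octet in "$@"; do
        case "$octet" in
            255) prefix=$((prefix + 8)) ;;
            254) prefix=$((prefix + 7)) ;;
            252) prefix=$((prefix + 6)) ;;
            248) prefix=$((prefix + 5)) ;;
            240) prefix=$((prefix + 4)) ;;
            224) prefix=$((prefix + 3)) ;;
            192) prefix=$((prefix + 2)) ;;
            128) prefix=$((prefix + 1)) ;;
            0) ;;
            *) echo "invalid netmask octet: $octet" >&2; return 1 ;;
        esac
    done
    echo "$prefix"
}

echo "192.168.31.122/$(netmask_to_prefix 255.255.255.0)"
```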

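IIIa. Software environment setup

Switching the APT mirror

See: Switching Ubuntu to a mainland China mirror

sh
sudo sed -i 's@//.*archive.ubuntu.com@//mirrors.ustc.edu.cn@g' /etc/apt/sources.list

# Optional, and generally not recommended: also switch the security source.
# Mirrors sync with a delay, so production systems may not get the latest security updates promptly.
sudo sed -i 's/security.ubuntu.com/mirrors.ustc.edu.cn/g' /etc/apt/sources.list

# Use HTTPS to avoid ISP cache hijacking
sudo sed -i 's/http:/https:/g' /etc/apt/sources.list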
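
Before running sed -i on the real file, the substitutions can be previewed on text from stdin. The sketch below uses a hypothetical helper, preview_sources, that applies the archive rewrite. As an assumption that differs slightly from the commands above, it switches to HTTPS only on lines that already point at the mirror and leaves the security source alone.

```shell
# Preview the mirror rewrite without modifying any file.
# preview_sources is an illustrative helper; it reads sources.list content on stdin.
preview_sources() {
    sed -e 's@//.*archive.ubuntu.com@//mirrors.ustc.edu.cn@g' \
        -e '/mirrors\.ustc\.edu\.cn/ s/http:/https:/g'
}

printf '%s\n' \
    'deb http://cn.archive.ubuntu.com/ubuntu/ jammy main restricted' \
    'deb http://security.ubuntu.com/ubuntu jammy-security main' \
    | preview_sources
```

Running preview_sources < /etc/apt/sources.list prints the rewritten content and leaves the file untouched.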

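Refresh the package lists:

sh
sudo apt update

Enabling SSH

See: Enabling the SSH service on Ubuntu

Install:

sh
sudo apt install openssh-server

Start and enable:

sh
sudo systemctl enable ssh --now

Check the service status:

sh
sudo systemctl status ssh

The VM can now be reached over SSH.

Installing tailscale

See: Networking with Tailscale

The VM can then be reached remotely over Tailscale.

Installing tmux

See: Installing tmux

Installing zsh

See: Installing zsh

Installing v2ray

See: Installing v2ray

Installing conda + Python

See: Installing conda, Python dependency management

Installing git

See: Installing git

Installing docker

See: Installing docker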
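IV. Install the QEMU Guest Agent inside Ubuntu (recommended)

If QEMU Guest Agent was ticked when the VM was created, only the software inside the VM remains to be installed.

  1. Confirm the option in PVE:
    • Select the VM (AI-122) → Options → QEMU Guest Agent
    • Make sure it shows Enabled; if not, click Edit and tick it
  2. Run inside Ubuntu (via the Console or SSH):
    bash
    sudo apt update
    sudo apt install qemu-guest-agent
    sudo systemctl enable --now qemu-guest-agent
  3. After a few seconds, the Summary tab of the VM (for example ai122) in PVE shows the IP addresses and related information automatically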

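V. Enable GPU passthrough and assign the GPUs to the VM

Enabling IOMMU on PVE 9

sh
nano /etc/kernel/cmdline

Append to the end of the existing single line:

sh
intel_iommu=on iommu=pt

Then run:

sh
proxmox-boot-tool refresh

/etc/kernel/cmdline applies to hosts that boot via systemd-boot (managed by proxmox-boot-tool). On a host that boots via GRUB, the same parameters go in GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, followed by update-grub.

Reboot PVE:

sh
reboot

Check the IOMMU status:

sh
dmesg | grep -e DMAR -e IOMMU -e AMD-Vi | grep -i iommu

If the output contains something like IOMMU enabled, IOMMU has been enabled successfully.

The output looks like: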

sh
[    9.191140] DMAR-IR: IOAPIC id 12 under DRHD base  0xc5ffc000 IOMMU 6
[    9.191142] DMAR-IR: IOAPIC id 11 under DRHD base  0xb87fc000 IOMMU 5
[    9.191144] DMAR-IR: IOAPIC id 10 under DRHD base  0xaaffc000 IOMMU 4
[    9.191146] DMAR-IR: IOAPIC id 18 under DRHD base  0xfbffc000 IOMMU 3
[    9.191147] DMAR-IR: IOAPIC id 17 under DRHD base  0xee7fc000 IOMMU 2
[    9.191149] DMAR-IR: IOAPIC id 16 under DRHD base  0xe0ffc000 IOMMU 1
[    9.191151] DMAR-IR: IOAPIC id 15 under DRHD base  0xd37fc000 IOMMU 0
[    9.191152] DMAR-IR: IOAPIC id  8 under DRHD base  0x9d7fc000 IOMMU 7
[    9.191154] DMAR-IR: IOAPIC id  9 under DRHD base  0x9d7fc000 IOMMU 7
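
Independently of dmesg, the parameters the running kernel actually booted with can be read from /proc/cmdline. The sketch below uses a hypothetical helper, cmdline_has_param, that does a whole-word match on a command-line string.

```shell
# Return success if parameter $1 appears as a whole word in command line $2.
# cmdline_has_param is an illustrative helper.
cmdline_has_param() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Check the kernel that is currently running (an empty string if /proc/cmdline is unreadable).
current="$(cat /proc/cmdline 2>/dev/null)"
for param in intel_iommu=on iommu=pt; do
    if cmdline_has_param "$param" "$current"; then
        echo "present: $param"
    else
        echo "missing: $param"
    fi
done
```

A missing parameter after a reboot suggests the edit went into a file the bootloader in use does not read.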

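Loading the VFIO modules

sh
nano /etc/modules

Append at the end (on kernel 6.2 and later, vfio_virqfd has been merged into vfio, so that line can be left out):

sh
vfio
vfio_pci
vfio_iommu_type1
vfio_virqfd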

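Blacklisting the GPU drivers on the host

WARNING

If the host still needs one of the cards for its own display output, do not blacklist every driver for that card. Ideally the host uses the motherboard's iGPU or IPMI, and all 8 discrete GPUs go to the VM.

For NVIDIA cards:

sh
echo "blacklist nouveau"  > /etc/modprobe.d/blacklist-nouveau.conf
echo "blacklist nvidia"   > /etc/modprobe.d/blacklist-nvidia.conf
echo "blacklist nvidiafb" > /etc/modprobe.d/blacklist-nvidiafb.conf

Binding all GPUs to vfio-pci

List the GPUs:

sh
lspci -nn | grep -E "VGA|3D|Display"

The output looks like: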

sh
03:00.0 VGA compatible controller [0300]: ASPEED Technology, Inc. ASPEED Graphics Family [1a03:2000] (rev 41)
1a:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA102 [GeForce RTX 3080] [10de:2206] (rev a1)
1b:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA102 [GeForce RTX 3080] [10de:2206] (rev a1)
3d:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA102 [GeForce RTX 3080] [10de:2206] (rev a1)
3e:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA102 [GeForce RTX 3080] [10de:2206] (rev a1)
88:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA102 [GeForce RTX 3080] [10de:2206] (rev a1)
89:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA102 [GeForce RTX 3080] [10de:2206] (rev a1)
b1:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA102 [GeForce RTX 3080] [10de:2206] (rev a1)
b2:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA102 [GeForce RTX 3080] [10de:2206] (rev a1)
d8:00.0 Non-Volatile memory controller [0108]: Intel Corporation NVMe DC SSD [3DNAND, Sentinel Rock Controller] [8086:0b60]
d9:00.0 Non-Volatile memory controller [0108]: Intel Corporation NVMe DC SSD [3DNAND, Sentinel Rock Controller] [8086:0b60]
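  • The 10de:2206 in square brackets is the vendor:device ID.
  • For 8 cards of the same model, this ID is usually identical.
sh
# Many GPUs also expose a separate audio function
lspci -nn | grep -i audio

The output looks like: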

sh
1a:00.1 Audio device [0403]: NVIDIA Corporation GA102 High Definition Audio Controller [10de:1aef] (rev a1)
1b:00.1 Audio device [0403]: NVIDIA Corporation GA102 High Definition Audio Controller [10de:1aef] (rev a1)
3d:00.1 Audio device [0403]: NVIDIA Corporation GA102 High Definition Audio Controller [10de:1aef] (rev a1)
3e:00.1 Audio device [0403]: NVIDIA Corporation GA102 High Definition Audio Controller [10de:1aef] (rev a1)
88:00.1 Audio device [0403]: NVIDIA Corporation GA102 High Definition Audio Controller [10de:1aef] (rev a1)
89:00.1 Audio device [0403]: NVIDIA Corporation GA102 High Definition Audio Controller [10de:1aef] (rev a1)
b1:00.1 Audio device [0403]: NVIDIA Corporation GA102 High Definition Audio Controller [10de:1aef] (rev a1)
b2:00.1 Audio device [0403]: NVIDIA Corporation GA102 High Definition Audio Controller [10de:1aef] (rev a1)
  • Here 10de:1aef is the vendor:device ID of the audio function.
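
The vendor:device IDs needed for the vfio-pci ids= option can be pulled out of lspci -nn output mechanically. The sketch below assumes every NVIDIA function on the host should be handed to vfio-pci; vfio_ids_from_lspci is an illustrative name.

```shell
# Print a comma-separated list of the unique vendor:device IDs of all NVIDIA
# functions found in `lspci -nn` output read from stdin.
# vfio_ids_from_lspci is an illustrative helper.
vfio_ids_from_lspci() {
    grep -i 'nvidia' \
        | grep -oE '\[[0-9a-f]{4}:[0-9a-f]{4}\]' \
        | sed 's/[][]//g' \
        | sort -u \
        | paste -sd, -
}

# Typical use on the host:
#   lspci -nn | vfio_ids_from_lspci
```

Class codes such as [0300] contain no colon and are therefore not picked up. Only the bracketed vendor:device pairs on lines mentioning NVIDIA are kept.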

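List the IOMMU groups:

sh
find /sys/kernel/iommu_groups/ -type l
  • Confirm that the GPUs do not share a group with other important devices
  • Because all 8 cards go to the same VM, several cards sharing one IOMMU group is not much of a problem
  • The group just must not also contain devices the host depends on, such as a SATA controller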
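
The find command lists only the raw symlinks. A more readable group-to-device listing can be produced by walking the same sysfs tree. This is a sketch with a hypothetical function name; the optional argument exists only so the function can be pointed at a different root directory.

```shell
# Print one "group N: <PCI address>" line per device in each IOMMU group.
# list_iommu_groups is an illustrative helper; $1 optionally overrides the sysfs root.
list_iommu_groups() {
    root="${1:-/sys/kernel/iommu_groups}"
    for dev in "$root"/*/devices/*; do
        [ -e "$dev" ] || continue        # no groups present (IOMMU off or path missing)
        group="${dev%/devices/*}"        # strip the trailing /devices/<addr>
        group="${group##*/}"             # keep only the group number
        echo "group ${group}: ${dev##*/}"
    done
}

list_iommu_groups
```

Groups are listed in glob order, so group 10 appears before group 2; piping the output through sort -V gives numeric order.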

Bind all of these cards to vfio-pci:

sh
nano /etc/modprobe.d/vfio.conf

Add:

sh
options vfio-pci ids=10de:2206,10de:1aef

Regenerate the initramfs:

sh
update-initramfs -u

Reboot:

sh
reboot

After the reboot, confirm that every GPU is bound to vfio-pci:

sh
# lspci -nnk | grep -A3 -E "VGA|3D|Display"
lspci -nnk | grep -A3 -E "NVIDIA" | grep -i kernel

The output looks like:

sh
Kernel driver in use: vfio-pci
# Kernel modules: nvidiafb, nouveau
# Kernel modules: snd_hda_intel

This shows that the host has released the GPUs.

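Passing all GPUs through to the VM

The following steps are done in the PVE web UI.

In the tree on the left, select the VM (for example ai122) → Hardware tab:

  • Confirm BIOS: OVMF (UEFI)
  • Confirm Machine: q35
  • Open the Add drop-down → choose PCI Device
  • Choose Raw Device, open the device list, and sort it by IOMMU Group in ascending order (GPUs usually sit in the lower-numbered groups)
  • Select the first GPU in the list
    • A single GPU usually exposes one VGA function plus one Audio function
    • Tick All Functions so PVE passes through every function of the card together
    • Leave Primary GPU unticked
      • For a pure compute card accessed over SSH, it is not needed
      • To use a card as the VM's display output (with a monitor attached), tick it on one of the cards
    • Expand Advanced
      • Tick PCI-Express (q35 + modern GPU)
    • If everything looks right, click Add
    • Afterwards the hardware list shows a new PCI Device entry, similar to 0000:3d:00,pcie=1
  • Repeat the previous step to add the remaining 7 GPUs the same way:
    • Each time: Add → PCI Device, then pick a different GPU / IOMMU group.
    • If several cards share one IOMMU group, PVE forces the whole group to be passed through, which is fine when everything goes to ai122
    • It is reasonable to add 1 card first, confirm it works, and then add the rest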

One-shot passthrough from the command line

See: Mapping between SLOT and GPU

Clear the old hostpci0 to hostpci7 entries:

sh
for i in {0..7}; do qm set 101 -delete hostpci$i; done

Set the new hostpci0 to hostpci7 entries:

sh
buses=(88 89 b1 b2 3d 3e 1a 1b); args=()
for i in "${!buses[@]}"; do args+=("-hostpci$i" "0000:${buses[$i]}:00,pcie=1"); done
qm set 101 "${args[@]}"

List the VM's current PCI devices:

sh
qm config 101 | grep -E '^hostpci'
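
To review the hostpci assignment before touching the VM config, the same arguments can be built and printed instead of executed. This is a dry-run sketch: hostpci_args is a hypothetical helper, and the bus list is the one used above.

```shell
# Build the "-hostpciN 0000:<bus>:00,pcie=1" argument list for the given PCI buses.
# hostpci_args is an illustrative helper; it only prints and never calls qm.
hostpci_args() {
    i=0
    out=""
    for bus in "$@"; do
        out="${out}${out:+ }-hostpci${i} 0000:${bus}:00,pcie=1"
        i=$((i + 1))
    done
    echo "$out"
}

# Print the command that would be run for VM 101.
echo "qm set 101 $(hostpci_args 88 89 b1 b2 3d 3e 1a 1b)"
```

Dropping a bus from the list renumbers the remaining entries, which is convenient when one card has to be left out temporarily.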

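Common problem: 0 <= irq_num && irq_num < PCI_NUM_PINS

Error details:

sh
kvm: ../hw/pci/pci.c:1815: pci_irq_handler: Assertion `0 <= irq_num && irq_num < PCI_NUM_PINS' failed.
TASK ERROR: start failed: QEMU exited with code 1

The cause is usually a GPU that has dropped off the PCIe bus. Temporary workaround:

Fix (currently ineffective)

Fix: disable power saving on the upstream ports.

sh
nano /etc/kernel/cmdline

Add the following:

sh
pcie_port_pm=off pcie_aspm=off vfio-pci.disable_idle_d3=1
  • pcie_port_pm=off: disables runtime PM for PCIe ports (commonly used against "device inaccessible" errors)
  • pcie_aspm=off: turns off link ASPM (setups with many PLX chips, switches, or risers often need it for stability)
  • vfio-pci.disable_idle_d3=1: keeps VFIO-managed devices from entering D3 when idle (avoids failed D3hot/D3cold → D0 transitions)

After the change, the line reads: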

sh
intel_iommu=on iommu=pt pcie_port_pm=off pcie_aspm=off vfio-pci.disable_idle_d3=1
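
Editing the single-line cmdline by hand makes it easy to add a parameter twice. The sketch below shows an idempotent append that works on a string rather than on the file itself; append_cmdline_params is a hypothetical helper.

```shell
# Append each given parameter to a kernel command line string unless it is
# already present as a whole word. append_cmdline_params is an illustrative helper.
append_cmdline_params() {
    line="$1"
    shift
    for param in "$@"; do
        case " $line " in
            *" $param "*) ;;                          # already present, keep as-is
            *) line="${line:+$line }$param" ;;        # append with a single space
        esac
    done
    echo "$line"
}

append_cmdline_params "intel_iommu=on iommu=pt" \
    pcie_port_pm=off pcie_aspm=off vfio-pci.disable_idle_d3=1
```

Running the function again on its own output returns the line unchanged.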

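Then run:

sh
# Apply the edited /etc/kernel/cmdline to the boot entries; without this step the new parameters are not used
proxmox-boot-tool refresh
# update-initramfs -u -k all
reboot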

Verifying passthrough inside the VM

In PVE, select AI-122 and click Start.

Inside the VM, run:

sh
lspci -nn | grep -E "VGA|3D|Display"

The output looks like:

sh
00:01.0 VGA compatible controller [0300]: Device [1234:1111] (rev 02)
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA102 [GeForce RTX 3080] [10de:2206] (rev a1)
sh
lspci -nn | grep -i audio

The output looks like:

sh
00:1b.0 Audio device [0403]: Intel Corporation 82801I (ICH9 Family) HD Audio Controller [8086:293e] (rev 03)
01:00.1 Audio device [0403]: NVIDIA Corporation GA102 High Definition Audio Controller [10de:1aef] (rev a1)

This confirms that passthrough works.

To add another GPU, shut the VM down first, then repeat the GPU passthrough steps above.

Installing the NVIDIA driver and NVCC + CUDA

See: Installing the NVIDIA driver and CUDA (NVCC) on Ubuntu

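Va. Common problems

Startup takes too long

For example, more than 5 minutes.

[Unresolved] Slow startup after enabling IOMMU / passthrough appears to be a known issue

[Windows] One fix for extremely slow Windows boot under PVE passthrough

Extremely slow VM startup when IOMMU/Passthrough is enabled

[Verified] The slow startup appears to come from the large amount of memory assigned to the VM: with PCI passthrough the guest's memory has to be allocated and pinned up front, which takes a long time at this size

Try reducing the VM's memory, for example from 1 TiB (1048576) to 128 GiB (131072) or 64 GiB (65536).

Startup fails

[Verified] Try removing a few GPUs.

Vb. Smart startup script

Automatically diagnoses and excludes faulty GPUs when starting the VM

start_vm101.sh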
sh
#!/usr/bin/env bash

set -Eeuo pipefail

VMID="${1:-101}"
CONF="/etc/pve/qemu-server/${VMID}.conf"
BACKUP_DIR="/root/.vm-start-backups"
LOCK_FILE="/run/lock/start-vm${VMID}.lock"
LSPCI_TIMEOUT="${LSPCI_TIMEOUT:-2}"
QM_TIMEOUT="${QM_TIMEOUT:-20}"
PROBE_START_TIMEOUT="${PROBE_START_TIMEOUT:-45}"
FINAL_START_TIMEOUT="${FINAL_START_TIMEOUT:-90}"
STOP_WAIT_SECONDS="${STOP_WAIT_SECONDS:-20}"

declare -a hostpci_keys=()
declare -a candidate_keys=()
declare -a good_keys=()
declare -a bad_keys=()
declare -A hostpci_values=()
declare -A hostpci_slots=()
declare -A hostpci_desc=()

work_dir=""
active_conf=""
backup_file=""
restore_original_on_failure=1

log() {
	printf '[%s] %s\n' "$(date '+%F %T')" "$*"
}

warn() {
	printf '[%s] WARN: %s\n' "$(date '+%F %T')" "$*" >&2
}

die() {
	printf '[%s] ERROR: %s\n' "$(date '+%F %T')" "$*" >&2
	exit 1
}

require_cmd() {
	command -v "$1" >/dev/null 2>&1 || die "missing required command: $1"
}

vm_status() {
	local output

	output="$(timeout "$QM_TIMEOUT" qm status "$VMID" 2>/dev/null)" || return 1
	awk '{print $2}' <<<"$output"
}

ensure_vm_stopped() {
	local status
	local second

	status="$(vm_status || true)"
	case "$status" in
		stopped|'')
			return 0
			;;
		*)
			log "Stopping VM ${VMID}"
			timeout "$QM_TIMEOUT" qm stop "$VMID" --skiplock 1 >/dev/null 2>&1 || true
			;;
	esac

	for ((second = 0; second < STOP_WAIT_SECONDS; second++)); do
		status="$(vm_status || true)"
		[[ "$status" == "stopped" || -z "$status" ]] && return 0
		sleep 1
	done

	die "VM ${VMID} did not stop within ${STOP_WAIT_SECONDS}s"
}

normalize_slot() {
	local slot

	slot="${1%%,*}"
	slot="${slot,,}"

	if [[ "$slot" =~ ^[0-9a-f]{2}:[0-9a-f]{2}(\.[0-7])?$ ]]; then
		printf '0000:%s\n' "$slot"
		return 0
	fi

	if [[ "$slot" =~ ^[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}(\.[0-7])?$ ]]; then
		printf '%s\n' "$slot"
		return 0
	fi

	return 1
}

describe_slot() {
	local slot="$1"
	local output

	if [[ -z "$slot" ]]; then
		printf 'unavailable\n'
		return 0
	fi

	output="$(timeout "$LSPCI_TIMEOUT" lspci -Dnn -s "$slot" 2>/dev/null || true)"
	if [[ -n "$output" ]]; then
		printf '%s\n' "$output" | paste -sd '; ' -
	else
		printf 'unavailable\n'
	fi
}

is_gpu_candidate() {
	local slot="$1"
	local output
	local rc=0

	[[ -z "$slot" ]] && return 0

	output="$(timeout "$LSPCI_TIMEOUT" lspci -Dnn -s "$slot" 2>/dev/null)" || rc=$?

	if [[ $rc -eq 124 || $rc -eq 137 ]]; then
		warn "lspci timed out for ${slot}; treating it as a GPU candidate"
		return 0
	fi

	if [[ -z "$output" ]]; then
		return 0
	fi

	grep -Eq 'VGA compatible controller|3D controller|Display controller' <<<"$output"
}

delete_hostpci_key() {
	local key="$1"

	timeout "$QM_TIMEOUT" qm set "$VMID" -delete "$key" >/dev/null
}

set_hostpci_key() {
	local key="$1"
	local value="$2"

	timeout "$QM_TIMEOUT" qm set "$VMID" "-$key" "$value" >/dev/null
}

restore_original_config() {
	local key

	log "Restoring original hostpci configuration for VM ${VMID}"
	ensure_vm_stopped || true

	for key in "${candidate_keys[@]}"; do
		timeout "$QM_TIMEOUT" qm set "$VMID" -delete "$key" >/dev/null 2>&1 || true
	done

	for key in "${candidate_keys[@]}"; do
		timeout "$QM_TIMEOUT" qm set "$VMID" "-$key" "${hostpci_values[$key]}" >/dev/null 2>&1 || true
	done
}

cleanup() {
	local exit_code=$?

	if [[ $exit_code -ne 0 && $restore_original_on_failure -eq 1 ]]; then
		warn "Script failed before reaching a stable final config; restoring original hostpci entries"
		restore_original_config
	fi

	[[ -n "$work_dir" ]] && rm -rf "$work_dir"
}

qm_start_looks_successful() {
	local output_file="$1"
	local status

	if grep -Eq 'start failed:|TASK ERROR:|QEMU exited with code [1-9]' "$output_file"; then
		return 1
	fi

	status="$(vm_status || true)"
	[[ "$status" == "running" ]]
}

probe_current_config() {
	local label="$1"
	local output_file
	local rc=0

	output_file="$(mktemp "${work_dir}/start-${VMID}.XXXXXX.log")"

	timeout "$PROBE_START_TIMEOUT" qm start "$VMID" >"$output_file" 2>&1 || rc=$?

	if [[ $rc -eq 0 ]] && qm_start_looks_successful "$output_file"; then
		log "Probe succeeded: ${label}"
		ensure_vm_stopped
		rm -f "$output_file"
		return 0
	fi

	[[ $rc -eq 0 ]] && rc=1
	warn "Probe failed: ${label}"
	sed 's/^/  /' "$output_file" >&2 || true
	rm -f "$output_file"
	ensure_vm_stopped || true
	return "$rc"
}

start_final_config() {
	local output_file
	local rc=0

	output_file="$(mktemp "${work_dir}/final-${VMID}.XXXXXX.log")"

	timeout "$FINAL_START_TIMEOUT" qm start "$VMID" >"$output_file" 2>&1 || rc=$?

	if [[ $rc -eq 0 ]] && qm_start_looks_successful "$output_file"; then
		log "VM ${VMID} started successfully"
		rm -f "$output_file"
		return 0
	fi

	warn "Final start failed"
	sed 's/^/  /' "$output_file" >&2 || true
	rm -f "$output_file"
	return 1
}

print_summary() {
	local key

	log "Healthy passthrough GPUs retained: ${#good_keys[@]}"
	for key in "${good_keys[@]}"; do
		log "  keep ${key}: ${hostpci_values[$key]} (${hostpci_desc[$key]})"
	done

	log "Problematic passthrough GPUs removed: ${#bad_keys[@]}"
	for key in "${bad_keys[@]}"; do
		log "  drop ${key}: ${hostpci_values[$key]} (${hostpci_desc[$key]})"
	done
}

main() {
	local key
	local value
	local slot
	local status

	[[ $EUID -eq 0 ]] || die "this script must run as root"
	require_cmd qm
	require_cmd lspci
	require_cmd timeout
	require_cmd awk
	require_cmd grep
	require_cmd flock

	[[ -f "$CONF" ]] || die "VM config not found: $CONF"

	mkdir -p "$BACKUP_DIR"
	exec 9>"$LOCK_FILE"
	flock -n 9 || die "another start/probe job is already running for VM ${VMID}"

	work_dir="$(mktemp -d "/tmp/start-vm${VMID}.XXXXXX")"
	active_conf="${work_dir}/active.conf"
	backup_file="${BACKUP_DIR}/vm${VMID}-$(date '+%F-%H%M%S').conf"
	trap cleanup EXIT

	cp -a "$CONF" "$backup_file"
	awk '/^\[/ {exit} {print}' "$CONF" >"$active_conf"

	while IFS= read -r line; do
		[[ "$line" =~ ^(hostpci[0-9]+):[[:space:]]*(.+)$ ]] || continue
		key="${BASH_REMATCH[1]}"
		value="${BASH_REMATCH[2]}"
		slot="$(normalize_slot "$value" || true)"

		hostpci_keys+=("$key")
		hostpci_values["$key"]="$value"
		hostpci_slots["$key"]="$slot"
		hostpci_desc["$key"]="$(describe_slot "$slot")"
	done <"$active_conf"

	status="$(vm_status || true)"
	if [[ "$status" == "running" ]]; then
		log "VM ${VMID} is already running; nothing to do"
		restore_original_on_failure=0
		return 0
	fi

	if [[ ${#hostpci_keys[@]} -eq 0 ]]; then
		log "VM ${VMID} has no hostpci devices configured; starting directly"
		restore_original_on_failure=0
		start_final_config
		return 0
	fi

	for key in "${hostpci_keys[@]}"; do
		slot="${hostpci_slots[$key]}"
		if [[ -z "$slot" ]] || is_gpu_candidate "$slot"; then
			candidate_keys+=("$key")
		fi
	done

	if [[ ${#candidate_keys[@]} -eq 0 ]]; then
		candidate_keys=("${hostpci_keys[@]}")
	fi

	log "VM ${VMID} active config backup: ${backup_file}"
	log "GPU candidates to probe: ${#candidate_keys[@]}"
	for key in "${candidate_keys[@]}"; do
		log "  candidate ${key}: ${hostpci_values[$key]} (${hostpci_desc[$key]})"
	done

	ensure_vm_stopped

	for key in "${candidate_keys[@]}"; do
		log "Temporarily removing ${key}: ${hostpci_values[$key]}"
		delete_hostpci_key "$key"
	done

	if ! probe_current_config "baseline without passthrough GPUs"; then
		die "baseline start without passthrough GPUs failed; refusing to classify GPUs blindly"
	fi

	for key in "${candidate_keys[@]}"; do
		log "Probing ${key}: ${hostpci_values[$key]} (${hostpci_desc[$key]})"
		set_hostpci_key "$key" "${hostpci_values[$key]}"

		if probe_current_config "with ${key}=${hostpci_values[$key]}"; then
			good_keys+=("$key")
		else
			bad_keys+=("$key")
			log "Removing problematic device ${key}"
			delete_hostpci_key "$key"
		fi
	done

	print_summary

	restore_original_on_failure=0
	if ! start_final_config; then
		die "VM ${VMID} still failed to start after removing problematic passthrough GPUs; current config keeps only the validated subset"
	fi
}

main "$@"

VI. Set up the 20TB HDD as bulk data storage and attach it to the VM

Inspecting the disks in PVE

sh
lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT,MODEL,SERIAL
txt
NAME                            SIZE TYPE FSTYPE      MOUNTPOINT MODEL               SERIAL
sda                            18.2T disk                        WUH722020CLE604     PP****8P
nvme0n1                       894.3G disk                        INTEL SSDPF2KX960HZ PHA************QGN
├─nvme0n1p1                    1007K part
├─nvme0n1p2                       1G part vfat        /boot/efi
└─nvme0n1p3                     893G part LVM2_member
nvme1n1                         3.5T disk LVM2_member            INTEL SSDPF2KX038TZ PHA************AGN
├─vmdata-vmdata_tmeta          15.9G lvm
└─vmdata-vmdata_tdata           3.5T lvm

/dev/sda here is the 20TB HDD.

Show the details:

sh
fdisk -l /dev/sda
txt
Disk /dev/sda: 18.19 TiB, 20000588955648 bytes, 39063650304 sectors
Disk model: WUH722020CLE604
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes

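Wiping the disk

This step can be skipped for a brand-new empty disk. wipefs -a removes every filesystem and partition-table signature on the device, so double-check that /dev/sda is the intended disk.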

sh
wipefs -a /dev/sda

Creating a GPT label and a single large partition

For a disk this large, GPT partitioning plus a Directory storage is a good fit.

sh
# apt install -y parted
parted -a optimal /dev/sda --script mklabel gpt
parted -a optimal /dev/sda --script mkpart primary ext4 1MiB 100%
partprobe /dev/sda

The new partition /dev/sda1 should now be visible:

sh
lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT /dev/sda
txt
NAME    SIZE TYPE FSTYPE MOUNTPOINT
sda    18.2T disk
└─sda1 18.2T part

Formatting as ext4

sh
mkfs.ext4 -L hdd20t /dev/sda1

Wait for it to finish, then check:

sh
lsblk --fs /dev/sda
txt
NAME   FSTYPE FSVER LABEL  UUID                                 FSAVAIL FSUSE% MOUNTPOINTS
sda
└─sda1 ext4   1.0   hdd20t 9e******-****-****-****-**********7b

Alternatively:

sh
blkid /dev/sda1
txt
/dev/sda1: LABEL="hdd20t" UUID="9e******-****-****-****-**********7b" BLOCK_SIZE="4096" TYPE="ext4" PARTLABEL="primary" PARTUUID="e6******-****-****-****-**********5e"

This UUID is the identifier used for mounting below.

Mounting

Create the mount point:

sh
mkdir -p /mnt/pve/hdd20t

Mount automatically at boot:

sh
nano /etc/fstab

Add one line:

sh
UUID=9e******-****-****-****-**********7b /mnt/pve/hdd20t ext4 defaults,nofail 0 2

Reload the configuration and mount:

sh
systemctl daemon-reload
mount -a

Check the mount:

sh
findmnt /mnt/pve/hdd20t
txt
TARGET          SOURCE    FSTYPE OPTIONS
/mnt/pve/hdd20t /dev/sda1 ext4   rw,relatime
sh
df -h /mnt/pve/hdd20t
txt
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        19T  2.1M   18T   1% /mnt/pve/hdd20t

Registering it as a Directory storage

sh
pvesm add dir hdd20t --path /mnt/pve/hdd20t --content images,backup,iso,vztmpl,rootdir
  • images: VM disks
  • rootdir: LXC containers
  • backup: backups
  • iso: ISO images
  • vztmpl: container templates

Check the status:

sh
pvesm status
txt
Name             Type     Status     Total (KiB)      Used (KiB) Available (KiB)        %
hdd20t            dir     active     19453053208            2096     18476443576    0.00%
local             dir     active        98497780        53327176        40121056   54.14%
local-lvm     lvmthin     active       794337280               0       794337280    0.00%
vmdata        lvmthin     active      3717050368       405901900      3311148467   10.92%
sh
cat /etc/pve/storage.cfg
txt
...
dir: hdd20t
        path /mnt/pve/hdd20t
        content iso,vztmpl,backup,rootdir,images

Attaching the storage to the VM

Look up the slot information in the VM config:

sh
qm config 101
...
parent: AI122-2025-1204-0606
scsi0: vmdata:vm-101-disk-1,discard=on,iothread=1,size=2T,ssd=1
scsihw: virtio-scsi-single
smbios1: uuid=f9******-****-****-****-**********bf
sockets: 1
vmgenid: ac******-****-****-****-**********f5

This shows:

  • scsi0 is the system disk and already lives on vmdata
  • scsihw is virtio-scsi-single
  • the SCSI slot scsi1 is still free

So the new disk can be attached at scsi1:

sh
qm set 101 --scsi1 hdd20t:4096,format=raw,iothread=1
  • For VM 101
  • add a new disk at scsi1
  • stored on hdd20t
  • with a size of 4096 GiB, about 4 TiB (adjust as needed; resizing is covered further down)
  • in raw format
  • with iothread=1 enabled
txt
update VM 101: -scsi1 hdd20t:4096,format=raw,iothread=1
Formatting '/mnt/pve/hdd20t/images/101/vm-101-disk-0.raw', fmt=raw size=4398046511104 preallocation=off
scsi1: successfully created disk 'hdd20t:101/vm-101-disk-0.raw,iothread=1,size=4T'

Check the configuration again:

sh
qm config 101 | grep scsi
txt
...
scsi1: hdd20t:101/vm-101-disk-0.raw,iothread=1,size=4T

The new disk has been added successfully.

Optimizing disk usage

Lower the ext4 reserved-block percentage to 1% to make more space usable:

sh
tune2fs -m 1 /dev/sda1
tune2fs 1.47.2 (1-Jan-2025)
Setting reserved blocks percentage to 1% (48829557 blocks)

Show the current reserved block count:

sh
tune2fs -l /dev/sda1 | grep -E 'Reserved block count|Block size'
Reserved block count:     48829557
Block size:               4096
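
The reserved block count is easier to judge as a size. The sketch below does the conversion with the two values printed above; blocks_to_gib is an illustrative name.

```shell
# Convert a block count and a block size in bytes into whole GiB (rounded down).
# blocks_to_gib is an illustrative helper.
blocks_to_gib() {
    echo $(( $1 * $2 / 1024 / 1024 / 1024 ))
}

blocks_to_gib 48829557 4096   # prints 186
```

So even at 1%, roughly 186 GiB of this disk stays reserved for root.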

Show the current disk usage:

sh
df -h /mnt/pve/hdd20t
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        19T  2.1M   18T   1% /mnt/pve/hdd20t

Reserve 100 GiB for the host and as filesystem headroom, and give the rest to the VM:

sh
avail_gib=$(df --output=avail -BG /mnt/pve/hdd20t | tail -1 | tr -dc '0-9')
target_gib=$((avail_gib - 100))
echo "$target_gib"
18266

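With ext4 and the standard 4 KiB block size, a single file can be at most 16 TiB. The full ~18 TiB therefore cannot go to the VM as one raw image; trying to do so can fail with: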

txt
# qm resize 101 scsi1 ${target_gib}G
VM 101 qmp command 'block_resize' failed - Could not resize file: File too large

So the VM gets just under 16 TiB:

sh
qm resize 101 scsi1 16380G

Check the configuration:

sh
qm config 101 | grep scsi1
scsi1: hdd20t:101/vm-101-disk-0.raw,iothread=1,size=16380G
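
The sizing rule used here (leave 100 GiB free, but stay under the 16 TiB ext4 file limit) can be written as one small function. This is a sketch: vm_disk_gib and its default values are assumptions taken from the numbers in this section.

```shell
# Compute the VM disk size in GiB: available space minus a reserve,
# capped below the ext4 single-file limit (16 TiB = 16384 GiB).
# vm_disk_gib is an illustrative helper; defaults mirror the values used above.
vm_disk_gib() {
    avail_gib="$1"
    reserve_gib="${2:-100}"
    cap_gib="${3:-16380}"
    target=$((avail_gib - reserve_gib))
    if [ "$target" -gt "$cap_gib" ]; then
        target="$cap_gib"
    fi
    echo "$target"
}

vm_disk_gib 18366   # 18366 - 100 = 18266, which exceeds the cap
```

On a smaller disk the cap never triggers and the result is simply the available space minus the reserve.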

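Adding the disk inside the VM

The commands above all ran on the PVE host. The commands below run inside the VM.

Many of them will look familiar, but the two sets must be kept apart.

The work above formatted hdd20t, the host-side storage pool, in PVE; the work below formats scsi1, the virtual disk, inside the VM.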

sh
lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT,MODEL
...
sda        2T disk                                              QEMU HARDDISK
├─sda1   512M part vfat     /boot/efi
└─sda2     2T part ext4     /
sdb       16T disk                                              QEMU HARDDISK

sdb here is the newly added 16TB disk.

Creating a GPT label and a single large partition in the VM

sh
sudo parted -a optimal /dev/sdb --script mklabel gpt
sudo parted -a optimal /dev/sdb --script mkpart primary ext4 1MiB 100%
sudo partprobe /dev/sdb
sh
lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT /dev/sdb
NAME   SIZE TYPE FSTYPE MOUNTPOINT
sdb     16T disk
└─sdb1  16T part

Creating the filesystem in the VM

sh
sudo mkfs.ext4 -L data /dev/sdb1

Wait a moment for it to finish, then check:

sh
lsblk --fs /dev/sdb
NAME   FSTYPE FSVER LABEL UUID                                 FSAVAIL FSUSE% MOUNTPOINTS
sdb
└─sdb1 ext4   1.0   data  a6******-****-****-****-**********b2

Mounting in the VM

sh
sudo mkdir -p /media/data
sudo mount /dev/sdb1 /media/data

Check the mount:

sh
df -h /media/data
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1        16T   28K   16T   1% /media/data

Mounting automatically at boot

Look up the UUID:

sh
sudo blkid /dev/sdb1
/dev/sdb1: LABEL="data" UUID="a6******-****-****-****-**********b2" BLOCK_SIZE="4096" TYPE="ext4" PARTLABEL="primary" PARTUUID="fb******-****-****-****-**********1c"

Run sudo nano /etc/fstab and add one line:

sh
UUID=a6******-****-****-****-**********b2 /media/data ext4 defaults,nofail 0 2

Mount:

sh
sudo mount -a

Check the mount:

sh
df -h /media/data
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb1        16T   28K   16T   1% /media/data

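VII. Using the automation scripts and configuration

This section documents a reusable set of automation scripts that create an Ubuntu VM on a PVE host and then, inside the VM, set up the base environment, Tailscale, v2ray, the HDD mount, the NVIDIA driver, CUDA/NVCC, zsh/tmux, .gd.sh, conda, Python tooling, Docker, and the NVIDIA Container Toolkit.

The script templates are stored in: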

txt
docs/notes/scripts/pve-ubuntu/
├── pve_ubuntu.sh
├── pve_ubuntu.yaml
├── setup_ubuntu.sh
└── setup_ubuntu.yaml

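.chats/pve-ubuntu/ holds the complete scripts and configuration used for the actual run. They may contain real passwords, IPs, disk by-id names, GPU PCI addresses, and similar details, and are not committed with the docs. docs/notes/scripts/pve-ubuntu/ holds sanitized template copies, meant as a reference to adapt for other PVE hosts or VMs.

pve_ubuntu.sh runs on the PVE host. It creates the VM, prepares cloud-init, configures GPU passthrough, formats and mounts the HDD, creates the PVE Directory Storage, and attaches the HDD-backed data disk to the VM. setup_ubuntu.sh is written into the VM and runs inside Ubuntu. It installs packages, starts SSH/Tailscale/v2ray, mounts the data disk, installs the NVIDIA driver and CUDA Toolkit, and generates the zsh/tmux/conda/Docker configuration following the conventions used on xeon. The templates contain none of the private environment variables from xeon.

The generic v2ray install script and client configuration are not duplicated in this directory; they are reused directly:

  • docs/notes/scripts/v2ray-install-release.sh
  • docs/notes/configs/v2ray-client-config.json
  • For details, see: Installing v2ray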

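Configuration to prepare before running

These files need to be prepared and edited before running:

  • pve_ubuntu.yaml: PVE host-side configuration, covering the source host, VM parameters, network, GPUs, and HDD.
  • setup_ubuntu.yaml: configuration for inside the Ubuntu VM, covering the user, packages, Tailscale, v2ray, HDD, NVIDIA/CUDA, Git, dotfiles, conda, Python tooling, Docker, and the NVIDIA Container Toolkit.
  • pve_ubuntu.sh: the script that runs on the PVE side.
  • setup_ubuntu.sh: the script that runs on the Ubuntu side; pve_ubuntu.sh writes it into cloud-init.

Real IPs, passwords, disk serials, GPU PCI addresses, and Git details in these templates have all been replaced with placeholders. xeon, pve, qve, ai122, and bj123 are only labels that distinguish hosts or VMs; keep or change them to suit your setup.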

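Placeholder reference

Common placeholders in pve_ubuntu.yaml:

Placeholder | Meaning
<SOURCE_PVE_LAN_IP> | LAN IP of the source PVE host that the ISO and v2ray configuration are copied from, for example the LAN address of pve.
<SOURCE_PVE_ROOT_PASSWORD> | root password of the source PVE host; the script copies files via sshpass/scp.
<SOURCE_UBUNTU_ISO_PATH> | Path of an existing Ubuntu ISO on the source PVE host.
<VM_ID> | PVE ID of the new VM, for example 123.
<VM_LAN_IP> | Static LAN IP of the new Ubuntu VM.
<LAN_GATEWAY_IP> | LAN gateway.
<LAN_DNS_IP> / <PUBLIC_DNS_IP> | DNS servers.
<UBUNTU_USER> / <UBUNTU_USER_PASSWORD> / <UBUNTU_FULL_NAME> | Details of the Ubuntu user created inside the VM.
<GPU_PCI_ADDRESS_*> | PCI address of a GPU to pass through to the VM, for example as listed by lspci -Dnn.
<GPU_VENDOR_DEVICE_ID> | vendor/device ID of the GPU's display function, for example 10de:2206.
<GPU_AUDIO_VENDOR_DEVICE_ID> | vendor/device ID of the GPU's HDMI/DP audio function, for example 10de:1aef.
<GPU_PCI_ADDRESS_TO_EXCLUDE> | A GPU known to make the VM fail to start, or one not passed through for now; can be kept as a record.
<HDD_DISK_BY_ID> | Stable /dev/disk/by-id/ device name of the HDD; must be re-confirmed on the target PVE host.
<PVE_HDD_LABEL> | Filesystem label of the HDD partition on the PVE host.
<PVE_HDD_MOUNTPOINT> | Mount point of the HDD on the PVE host, for example /mnt/hdd-data.
<PVE_HDD_STORAGE_NAME> | Name of the PVE Directory Storage.
<VM_DATA_DISK_SIZE_GB> | Size of the data disk attached to the VM, in GB, for example 7000.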

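Common placeholders in setup_ubuntu.yaml:

Placeholder | Meaning
tailscale.auth_key | Optional Tailscale auth key; by default it is left empty and tailscale.up is set to false, with tailscale up run manually later.
<GIT_USER_NAME> / <GIT_USER_EMAIL> | Global Git configuration inside the VM.
<VM_HDD_LABEL> | Filesystem label of the data-disk partition inside the Ubuntu VM.
<VM_HDD_MOUNTPOINT> | Mount point of the data disk inside the Ubuntu VM, for example /media/data.
dotfiles.gd_url | Download URL of .gd.sh; if cloud-init has already written dotfiles.gd_source, the local file takes precedence.
conda.installer_url | URL of the Miniconda installer script; the template defaults to the TUNA mirror.
conda.env_name / conda.python_version | Name and Python version of the conda environment that is created automatically.
docker.http_proxy / docker.https_proxy | Docker daemon proxy. If the v2ray new.json already exposes port 11119, http://127.0.0.1:11119 can be used.

No separate v2ray configuration template is kept here. pve_ubuntu.sh copies config.json and new.json from source.v2ray_config_dir on the source host and writes them into the VM through cloud-init. To generate a client configuration from scratch, see Installing v2ray and docs/notes/configs/v2ray-client-config.json.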

Stage switches

Both YAML files support per-stage switches:

yaml
global:
  mode: auto

stages:
  gpu_passthrough:
    mode: auto

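Allowed values:

  • auto: run the stage automatically.
  • manual / skip: skip the stage.
  • confirm: ask at run time whether to run it.

A stage that has already completed can be set to skip. To re-run a single stage, set the others to skip and the target stage to auto.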
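
How a script might interpret these values can be sketched as follows. This is not taken from pve_ubuntu.sh: resolve_stage_mode is a hypothetical helper, and the rule that a stage-level mode overrides global.mode is an assumption.

```shell
# Map a stage mode to an action: run, skip, or ask.
# $1: stage-level mode (may be empty), $2: global mode used as the fallback.
# resolve_stage_mode is an illustrative helper, not part of the real scripts.
resolve_stage_mode() {
    mode="${1:-$2}"
    case "$mode" in
        auto) echo run ;;
        manual|skip) echo skip ;;
        confirm) echo ask ;;
        *) echo "unknown mode: $mode" >&2; return 1 ;;
    esac
}

resolve_stage_mode "" auto        # stage inherits the global mode
resolve_stage_mode skip auto      # stage-level value wins
```

Treating an unknown value as an error rather than silently running the stage is the safer default when some stages format disks.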

Main stages on the PVE side:

txt
install_host_packages
copy_iso
copy_v2ray_config
create_vm
hdd_storage
attach_hdd
gpu_passthrough
start_vm

Main stages on the Ubuntu side:

txt
apt_sources
base_packages
qemu_guest_agent
ssh
tailscale
v2ray
hdd_mount
nvidia_driver
cuda
git
dotfiles
conda
python_tools
docker
nvidia_container
zsh
desktop

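The NVIDIA-related stages divide responsibilities as follows:

  • nvidia_driver: installs only the NVIDIA kernel driver inside the Ubuntu VM; the goal is a working nvidia-smi.
  • cuda: installs only the CUDA Toolkit / nvcc; it does not handle the Docker runtime.
  • docker: installs only Docker Engine, Compose, the registry mirror, the daemon proxy, and the user group.
  • nvidia_container: runs only once both Docker and nvidia-smi work; it installs the NVIDIA Container Toolkit and merges the Docker runtime configuration via nvidia-ctk runtime configure --runtime=docker.

Running on the PVE host

First copy the templates to the target PVE host, for example qve: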

sh
scp -r docs/notes/scripts/pve-ubuntu root@qve:/root/pve-ubuntu
scp docs/notes/scripts/v2ray-install-release.sh root@qve:/root/pve-ubuntu/

Log in to the target PVE host:

sh
ssh root@qve
cd /root/pve-ubuntu

Edit the configuration for the target machine:

sh
nano pve_ubuntu.yaml
nano setup_ubuntu.yaml
chmod +x v2ray-install-release.sh

At a minimum, confirm:

  • source.ip, source.password, source.iso_path
  • vm.id, vm.name, vm.hostname
  • vm.vga, which can be set to none for SSH-only, compute-card setups
  • network.ipv4, network.gateway4, network.dns
  • ubuntu.user, ubuntu.password
  • gpu_passthrough.pci_addresses, gpu_passthrough.vfio_ids
  • hdd.disk_by_id, hdd.partition
  • hdd.wipe_existing
  • hdd.vm_disk.size
  • source.v2ray_config_dir, and whether config.json and new.json exist on the source host
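
The template YAML files use <UPPER_CASE> placeholders for most of these fields, so a quick scan for leftovers catches values that were never filled in. This is a minimal sketch; the helper name unfilled_placeholders is made up for illustration, and it assumes every placeholder follows the <NAME> pattern used in the templates.

```shell
#!/bin/sh
# unfilled_placeholders FILE
# Illustrative helper: prints "line:text" for every <UPPER_CASE> placeholder
# left in FILE. Exit status follows grep: 0 if any remain, 1 if none.
unfilled_placeholders() {
  grep -nE '<[A-Z][A-Z0-9_]*>' "$1"
}

for cfg in pve_ubuntu.yaml setup_ubuntu.yaml; do
  [ -f "$cfg" ] || continue
  if unfilled_placeholders "$cfg"; then
    echo "$cfg still contains placeholders" >&2
  fi
done
```

Run it from /root/pve-ubuntu after editing. No output means no placeholder of that form is left.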

If a GPU is the host's boot VGA device but should still be passed through to the VM as a compute card, check first:

sh
cat /sys/bus/pci/devices/<GPU_PCI_ADDRESS>.0/boot_vga

A value of 1 means the card is the host's boot display adapter. For compute-only use, append rombar=0 to that card's entry in gpu_passthrough.pci_addresses and set vm.vga to none:

yaml
vm:
  vga: none

gpu_passthrough:
  pci_addresses:
    - "0000:02:00"
    - "0000:03:00"
    - "0000:04:00,rombar=0"

With this setup the PVE graphical console is expected to be unavailable or black, but SSH, the guest agent, and NVIDIA/CUDA compute are unaffected.
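
Each pci_addresses entry becomes one hostpciN option on the VM. The configure_gpu_passthrough function in pve_ubuntu.sh appends pcie=1 unless the entry already sets it. The sketch below mirrors that normalization in plain shell. The function name hostpci_spec is illustrative, and the addresses are the example values from the YAML above.

```shell
#!/bin/sh
# hostpci_spec ENTRY
# Illustrative helper: append pcie=1 unless the entry already contains pcie=.
hostpci_spec() {
  case "$1" in
    *pcie=*) printf '%s\n' "$1" ;;
    *)       printf '%s\n' "$1,pcie=1" ;;
  esac
}

idx=0
for entry in "0000:02:00" "0000:03:00" "0000:04:00,rombar=0"; do
  echo "hostpci${idx}: $(hostpci_spec "$entry")"
  idx=$((idx + 1))
done
# Prints:
#   hostpci0: 0000:02:00,pcie=1
#   hostpci1: 0000:03:00,pcie=1
#   hostpci2: 0000:04:00,rombar=0,pcie=1
```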

WARNING

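hdd.wipe_existing: true formats the target HDD. Before moving to another machine, use the commands below to confirm that the device really is the data disk to be wiped: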

sh
lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT,MODEL,SERIAL
ls -l /dev/disk/by-id/
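
Because hdd.disk_by_id is a symlink, it also helps to print the kernel device it currently resolves to and compare that with the lsblk output. A minimal sketch, assuming GNU readlink is available (it is on PVE); the helper describe_target is hypothetical.

```shell
#!/bin/sh
# describe_target LINK
# Hypothetical helper: prints "LINK -> real device"; fails if LINK is missing.
describe_target() {
  link="$1"
  if [ ! -e "$link" ]; then
    echo "not found: $link" >&2
    return 1
  fi
  printf '%s -> %s\n' "$link" "$(readlink -f "$link")"
}

# <HDD_DISK_BY_ID> is the same placeholder used in pve_ubuntu.yaml.
describe_target "/dev/disk/by-id/<HDD_DISK_BY_ID>" \
  || echo "fill in the real by-id name first"
```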

Run the PVE-side script:

sh
bash pve_ubuntu.sh pve_ubuntu.yaml

If the script configured GPU passthrough and reports that a reboot is required, run:

sh
reboot

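After the host is back online, run the same command again. Stages that are already complete are skipped based on the configuration and the current state: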

sh
cd /root/pve-ubuntu
bash pve_ubuntu.sh pve_ubuntu.yaml
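
pve_ubuntu.sh records a pending reboot by creating /run/pve_ubuntu_reboot_required, and its start_vm stage refuses to start the VM while that file exists. Since /run is a tmpfs, the marker disappears on reboot. A small check, with a hypothetical helper name reboot_pending:

```shell
#!/bin/sh
# reboot_pending [MARKER]
# Returns 0 while the marker file exists. The default path is the one that
# pve_ubuntu.sh touches; the optional argument exists only for testing.
reboot_pending() {
  [ -f "${1:-/run/pve_ubuntu_reboot_required}" ]
}

if reboot_pending; then
  echo "reboot the host, then rerun pve_ubuntu.sh"
else
  echo "no reboot marker found"
fi
```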

Running or rerunning inside the Ubuntu VM

Normally setup_ubuntu.sh and setup_ubuntu.yaml are written to /opt/bj123-setup/ in the VM by cloud-init and run automatically on first boot.

To rerun manually:

sh
ssh <UBUNTU_USER>@<VM_LAN_IP>
sudo bash /opt/bj123-setup/setup_ubuntu.sh /opt/bj123-setup/setup_ubuntu.yaml

If no Tailscale auth key is configured, run manually:

sh
sudo tailscale up --hostname=bj123

The command prints an authorization link. Open the link in a browser and approve it, and the VM joins the tailnet.

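Ubuntu 22.04 kernel version

Ubuntu 22.04.5 LTS may run the GA kernel 5.15 or the HWE kernel 6.8. This is a difference in kernel branch, not in the Ubuntu release number shown in /etc/os-release.

For an SSH-only, compute-card VM, there is no need to upgrade the kernel just to match other machines as long as all of the following work:

  • nvidia-smi shows every GPU.
  • nvcc --version works.
  • nvidia-container-cli info works.
  • SSH, Tailscale, the qemu guest agent, and Docker work.

Upgrading to the HWE kernel triggers an NVIDIA DKMS rebuild and requires a reboot, which adds variables rather than removing them. Consider upgrading only when a newer kernel feature is clearly needed or when the current driver or hardware has a kernel-related problem.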
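
To see which branch a VM is on, the running kernel release can be mapped to the two series named above. This is a rough sketch: kernel_branch is a made-up helper, and it only knows the 5.15 and 6.8 series, so any other version (for example an older HWE series) is reported as other.

```shell
#!/bin/sh
# kernel_branch RELEASE
# Made-up helper: maps a `uname -r` string to the branch names used above.
kernel_branch() {
  case "$1" in
    5.15.*) echo "GA" ;;
    6.8.*)  echo "HWE" ;;
    *)      echo "other" ;;
  esac
}

echo "running kernel $(uname -r): $(kernel_branch "$(uname -r)")"
```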

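Content generated after a run

The following is created or modified on the PVE host:

  • /root/pve-ubuntu/: working directory with the scripts, configuration, and the copied v2ray configuration. The ISO is copied to vm.iso_storage_dir.
  • /root/pve-ubuntu/seed-<VM_ID>/: cloud-init user-data, meta-data, and network-config.
  • /var/lib/vz/template/iso/<VM_HOSTNAME>-cidata.iso: cloud-init seed ISO.
  • /etc/default/grub or /etc/kernel/cmdline: IOMMU/VFIO kernel parameters.
  • /etc/modules-load.d/vfio.conf
  • /etc/modprobe.d/vfio.conf
  • /etc/modprobe.d/blacklist-nvidia-passthrough.conf
  • /etc/fstab: automatic HDD mount.
  • PVE storage: for example <PVE_HDD_STORAGE_NAME>.
  • VM configuration: scsi0 system disk, scsi1 HDD-backed data disk, hostpci* GPU passthrough, efidisk0.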

The following is created or modified inside the Ubuntu VM:

  • /opt/bj123-setup/: scripts and configuration inside the VM.
  • /var/log/setup_ubuntu.log
  • /usr/local/etc/v2ray/config.json
  • /usr/local/etc/v2ray/new.json
  • v2ray.service
  • v2ray@new.service
  • /etc/fstab: automatic data-disk mount.
  • <VM_HDD_MOUNTPOINT>: data-disk mount point.
  • /etc/profile.d/cuda.sh
  • /usr/local/cuda
  • ~/.zshrc, ~/.zshenv: zsh prompt, common aliases, .gd.sh, conda, CUDA, Hugging Face mirror, and data-disk environment variables.
  • ~/.tmux.conf: tmux configuration and plugin entry point, modeled on the general-purpose configuration from xeon.
  • ~/.gd.sh
  • ~/.pip/pip.conf
  • ~/.condarc
  • ~/miniconda3/ and a conda env, for example ai.
  • /etc/docker/daemon.json
  • /etc/systemd/system/docker.service.d/proxy.conf
  • /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
  • /etc/apt/sources.list.d/nvidia-container-toolkit.list
  • Tailscale state and hostname.

Verification commands

Check on the PVE host:

sh
cat /proc/cmdline
dmesg | grep -m1 -E "DMAR: IOMMU enabled|IOMMU enabled"
lspci -Dnnk | awk '/NVIDIA/{print; n=1; next} n && /Kernel driver in use/{print; n=0}'
qm config <VM_ID>
pvesm status
findmnt <PVE_HDD_MOUNTPOINT>

Check inside the Ubuntu VM:

sh
hostname
tailscale ip -4
systemctl is-active qemu-guest-agent ssh tailscaled v2ray v2ray@new
ss -ltnp | grep -E ":1111(0|1|8|9)"
findmnt <VM_HDD_MOUNTPOINT>
df -h <VM_HDD_MOUNTPOINT>
nvidia-smi
nvcc --version
nvidia-container-cli info
docker info | grep -i runtime
zsh -ic 'echo $CONDA_DEFAULT_ENV; python --version; command -v gd gpustat pipreqs'
mokutil --sb-state

Sample scripts and configuration

pve_ubuntu.sh
sh
#!/usr/bin/env bash
set -euo pipefail

CONFIG="${1:-$(dirname "$0")/pve_ubuntu.yaml}"
BASE_DIR="$(cd "$(dirname "$0")" && pwd)"
LOG_FILE="/var/log/pve_ubuntu.log"

exec > >(tee -a "$LOG_FILE") 2>&1

log() {
  printf '[%s] %s\n' "$(date '+%F %T')" "$*"
}

need_root() {
  if [[ "${EUID}" -ne 0 ]]; then
    echo "Run as root on the PVE host." >&2
    exit 1
  fi
}

ensure_yaml() {
  if python3 - <<'PY' >/dev/null 2>&1
import yaml
PY
  then
    return
  fi
  apt-get update
  DEBIAN_FRONTEND=noninteractive apt-get install -y python3-yaml
}

yaml_get() {
  local path="$1"
  local default="${2:-}"
  python3 - "$CONFIG" "$path" "$default" <<'PY'
import sys, yaml
cfg_path, key_path, default = sys.argv[1:4]
with open(cfg_path, "r", encoding="utf-8") as f:
    data = yaml.safe_load(f) or {}
cur = data
for part in key_path.split("."):
    if isinstance(cur, dict) and part in cur:
        cur = cur[part]
    else:
        print(default)
        sys.exit(0)
if cur is None:
    print(default)
elif isinstance(cur, bool):
    print("true" if cur else "false")
elif isinstance(cur, list):
    print("\n".join(str(x) for x in cur))
else:
    print(cur)
PY
}

stage_mode() {
  local stage="$1"
  local mode
  mode="$(yaml_get "stages.${stage}.mode" "")"
  if [[ -z "$mode" ]]; then
    mode="$(yaml_get "global.mode" "confirm")"
  fi
  printf '%s' "$mode"
}

run_stage() {
  local stage="$1"
  local mode
  mode="$(stage_mode "$stage")"
  case "$mode" in
    auto) return 0 ;;
    manual|skip) log "skip stage ${stage} (mode=${mode})"; return 1 ;;
    confirm)
      read -r -p "Run stage ${stage}? [y/N] " answer
      [[ "${answer,,}" == y* ]]
      ;;
    *) log "skip stage ${stage} (unknown mode=${mode})"; return 1 ;;
  esac
}

install_host_packages() {
  DEBIAN_FRONTEND=noninteractive apt-get update
  DEBIAN_FRONTEND=noninteractive apt-get install -y python3-yaml sshpass wget curl genisoimage parted e2fsprogs
}

copy_iso_from_source() {
  local src_ip src_user src_pass src_iso iso_dir dst_iso
  src_ip="$(yaml_get source.ip)"
  src_user="$(yaml_get source.user root)"
  src_pass="$(yaml_get source.password)"
  src_iso="$(yaml_get source.iso_path)"
  iso_dir="$(yaml_get vm.iso_storage_dir /var/lib/vz/template/iso)"
  dst_iso="${iso_dir}/$(basename "$src_iso")"
  mkdir -p "$iso_dir"
  if [[ -f "$dst_iso" ]]; then
    log "ISO already exists: ${dst_iso}"
    return
  fi
  log "copy ISO from ${src_user}@${src_ip}:${src_iso} to ${dst_iso}"
  sshpass -p "$src_pass" scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null "${src_user}@${src_ip}:${src_iso}" "$dst_iso"
}

copy_v2ray_config_from_source() {
  local src_ip src_user src_pass src_dir workdir dst_dir
  src_ip="$(yaml_get source.ip)"
  src_user="$(yaml_get source.user root)"
  src_pass="$(yaml_get source.password)"
  src_dir="$(yaml_get source.v2ray_config_dir /usr/local/etc/v2ray)"
  workdir="$(yaml_get global.workdir /root/pve-ubuntu)"
  dst_dir="${workdir}/v2ray"
  mkdir -p "$dst_dir"
  log "copy v2ray config from ${src_user}@${src_ip}:${src_dir}"
  sshpass -p "$src_pass" scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null "${src_user}@${src_ip}:${src_dir}/"* "$dst_dir/" || true
}

download_cloud_image() {
  local url img
  url="$(yaml_get vm.cloud_image_url)"
  img="$(yaml_get vm.cloud_image_path)"
  mkdir -p "$(dirname "$img")"
  if [[ -s "$img" ]]; then
    log "cloud image already exists: ${img}"
    return
  fi
  log "download cloud image: ${url}"
  wget -O "${img}.tmp" "$url"
  mv "${img}.tmp" "$img"
}

password_hash() {
  local password="$1"
  openssl passwd -6 "$password"
}

write_b64_file_entry() {
  local path="$1"
  local dst="$2"
  local perms="${3:-0644}"
  # A bare `return` would propagate the failed test (status 1) and abort under `set -e`.
  [[ -f "$path" ]] || return 0
  {
    echo "  - path: ${dst}"
    echo "    permissions: '${perms}'"
    echo "    encoding: b64"
    echo "    content: $(base64 -w0 "$path")"
  } >>"$USER_DATA"
}

create_cloud_init_seed() {
  local workdir seed_dir user_data meta_data network_config cidata_iso vmid hostname username password full_name timezone ssh_auth ip prefix gateway dns password_hash_value
  workdir="$(yaml_get global.workdir /root/pve-ubuntu)"
  vmid="$(yaml_get vm.id)"
  hostname="$(yaml_get vm.hostname)"
  username="$(yaml_get ubuntu.user)"
  password="$(yaml_get ubuntu.password)"
  full_name="$(yaml_get ubuntu.full_name "$username")"
  timezone="$(yaml_get ubuntu.timezone Asia/Shanghai)"
  ssh_auth="$(yaml_get ubuntu.ssh_password_auth true)"
  ip="$(yaml_get network.ipv4)"
  prefix="$(yaml_get network.prefix 24)"
  gateway="$(yaml_get network.gateway4)"
  mapfile -t dns < <(yaml_get network.dns "192.168.31.1")
  password_hash_value="$(password_hash "$password")"
  seed_dir="${workdir}/seed-${vmid}"
  cidata_iso="$(yaml_get vm.iso_storage_dir /var/lib/vz/template/iso)/${hostname}-cidata.iso"
  mkdir -p "$seed_dir" "$(dirname "$cidata_iso")"

  USER_DATA="${seed_dir}/user-data"
  meta_data="${seed_dir}/meta-data"
  network_config="${seed_dir}/network-config"

  cat >"$USER_DATA" <<EOF
#cloud-config
hostname: ${hostname}
manage_etc_hosts: true
timezone: ${timezone}
locale: $(yaml_get ubuntu.locale en_US.UTF-8)
ssh_pwauth: ${ssh_auth}
disable_root: false
users:
  - default
  - name: ${username}
    gecos: ${full_name}
    shell: /bin/bash
    lock_passwd: false
    passwd: '${password_hash_value}'
    groups: [adm, cdrom, dip, lxd, plugdev, sudo]
    sudo: ['ALL=(ALL) ALL']
package_update: true
packages:
  - openssh-server
  - qemu-guest-agent
  - python3-yaml
write_files:
EOF
  write_b64_file_entry "${BASE_DIR}/setup_ubuntu.sh" "/opt/bj123-setup/setup_ubuntu.sh" "0755"
  write_b64_file_entry "${BASE_DIR}/setup_ubuntu.yaml" "/opt/bj123-setup/setup_ubuntu.yaml" "0644"
  write_b64_file_entry "${BASE_DIR}/v2ray-install-release.sh" "/opt/bj123-setup/v2ray-install-release.sh" "0755"
  if [[ -f "${BASE_DIR}/dotfiles/.gd.sh" ]]; then
    write_b64_file_entry "${BASE_DIR}/dotfiles/.gd.sh" "/opt/bj123-setup/dotfiles/.gd.sh" "0644"
  fi
  write_b64_file_entry "${workdir}/v2ray/config.json" "/opt/bj123-setup/v2ray/config.json" "0644"
  write_b64_file_entry "${workdir}/v2ray/new.json" "/opt/bj123-setup/v2ray/new.json" "0644"
  cat >>"$USER_DATA" <<EOF
runcmd:
  - systemctl enable --now ssh
  - systemctl enable --now qemu-guest-agent
  - chmod +x /opt/bj123-setup/setup_ubuntu.sh
  - [ bash, /opt/bj123-setup/setup_ubuntu.sh, /opt/bj123-setup/setup_ubuntu.yaml ]
EOF

  cat >"$meta_data" <<EOF
instance-id: ${hostname}-${vmid}
local-hostname: ${hostname}
EOF

  cat >"$network_config" <<EOF
version: 2
ethernets:
  lan0:
    match:
      name: "en*"
    dhcp4: false
    addresses:
      - ${ip}/${prefix}
    routes:
      - to: default
        via: ${gateway}
    nameservers:
      addresses: [$(printf '%s,' "${dns[@]}" | sed 's/,$//')]
EOF
  genisoimage -output "$cidata_iso" -volid cidata -joliet -rock "$USER_DATA" "$meta_data" "$network_config"
  log "cloud-init seed created: ${cidata_iso}"
}

create_vm() {
  local vmid name storage bridge memory cores sockets cpu machine bios ostype vga disk_size img overwrite cidata_iso
  vmid="$(yaml_get vm.id)"
  name="$(yaml_get vm.name)"
  storage="$(yaml_get vm.storage local-lvm)"
  bridge="$(yaml_get vm.bridge vmbr0)"
  memory="$(yaml_get vm.memory_mib 65536)"
  cores="$(yaml_get vm.cores 16)"
  sockets="$(yaml_get vm.sockets 1)"
  cpu="$(yaml_get vm.cpu host)"
  machine="$(yaml_get vm.machine q35)"
  bios="$(yaml_get vm.bios ovmf)"
  ostype="$(yaml_get vm.ostype l26)"
  vga="$(yaml_get vm.vga std)"
  disk_size="$(yaml_get vm.disk_size 256G)"
  img="$(yaml_get vm.cloud_image_path)"
  overwrite="$(yaml_get vm.overwrite_existing false)"
  cidata_iso="$(yaml_get vm.iso_storage_dir /var/lib/vz/template/iso)/$(yaml_get vm.hostname)-cidata.iso"

  if qm status "$vmid" >/dev/null 2>&1; then
    if [[ "$overwrite" != "true" ]]; then
      log "VM ${vmid} already exists; overwrite_existing=false"
      return
    fi
    qm stop "$vmid" --skiplock 1 || true
    qm destroy "$vmid" --purge 1 --destroy-unreferenced-disks 1
  fi

  qm create "$vmid" \
    --name "$name" \
    --memory "$memory" \
    --cores "$cores" \
    --sockets "$sockets" \
    --cpu "$cpu" \
    --machine "$machine" \
    --bios "$bios" \
    --ostype "$ostype" \
    --agent enabled=1 \
    --scsihw virtio-scsi-single \
    --net0 "virtio,bridge=${bridge}"

  qm importdisk "$vmid" "$img" "$storage"
  qm set "$vmid" --vga "$vga"
  qm set "$vmid" --scsi0 "${storage}:vm-${vmid}-disk-0,discard=on,ssd=1,iothread=1"
  qm set "$vmid" --efidisk0 "${storage}:0,efitype=4m,pre-enrolled-keys=0"
  qm set "$vmid" --ide2 "local:iso/$(basename "$cidata_iso"),media=cdrom"
  qm set "$vmid" --boot "order=scsi0;ide2;net0"
  qm set "$vmid" --serial0 socket
  qm resize "$vmid" scsi0 "$disk_size"
  if [[ "$(yaml_get vm.start_on_boot false)" == "true" ]]; then
    qm set "$vmid" --onboot 1
  fi
  log "VM ${vmid}/${name} created"
}

configure_gpu_passthrough() {
  if [[ "$(yaml_get gpu_passthrough.enabled false)" != "true" ]]; then
    log "gpu_passthrough.enabled=false; skip host VFIO changes"
    return
  fi
  local ids vmid idx changed arg current grub_line spec
  vmid="$(yaml_get vm.id)"
  ids="$(yaml_get gpu_passthrough.vfio_ids "" | paste -sd, -)"
  changed=0
  if [[ -f /etc/kernel/cmdline ]]; then
    current="$(cat /etc/kernel/cmdline)"
    while read -r arg; do
      [[ -z "$arg" ]] && continue
      if ! grep -qw -- "$arg" <<<"$current"; then
        sed -i "s/$/ ${arg}/" /etc/kernel/cmdline
        current="${current} ${arg}"
        changed=1
      fi
    done < <(yaml_get gpu_passthrough.kernel_args "intel_iommu=on"$'\n'"iommu=pt")
  elif [[ -f /etc/default/grub ]]; then
    grub_line="$(grep -E '^GRUB_CMDLINE_LINUX_DEFAULT=' /etc/default/grub || true)"
    current="${grub_line#*=}"
    current="${current%\"}"
    current="${current#\"}"
    while read -r arg; do
      [[ -z "$arg" ]] && continue
      if ! grep -qw -- "$arg" <<<"$current"; then
        current="${current} ${arg}"
        changed=1
      fi
    done < <(yaml_get gpu_passthrough.kernel_args "intel_iommu=on"$'\n'"iommu=pt")
    if [[ $changed -eq 1 ]]; then
      if grep -qE '^GRUB_CMDLINE_LINUX_DEFAULT=' /etc/default/grub; then
        sed -i "s|^GRUB_CMDLINE_LINUX_DEFAULT=.*|GRUB_CMDLINE_LINUX_DEFAULT=\"${current# }\"|" /etc/default/grub
      else
        echo "GRUB_CMDLINE_LINUX_DEFAULT=\"${current# }\"" >>/etc/default/grub
      fi
      update-grub
    fi
  else
    echo "Neither /etc/kernel/cmdline nor /etc/default/grub exists; cannot set IOMMU kernel args" >&2
    exit 1
  fi
  cat >/etc/modules-load.d/vfio.conf <<'EOF'
vfio
vfio_pci
vfio_iommu_type1
EOF
  cat >/etc/modprobe.d/blacklist-nvidia-passthrough.conf <<'EOF'
blacklist nouveau
blacklist nvidia
blacklist nvidiafb
EOF
  if [[ -n "$ids" ]]; then
    {
      echo "options vfio-pci ids=${ids}"
      echo "softdep snd_hda_intel pre: vfio-pci"
    } >/etc/modprobe.d/vfio.conf
  fi
  proxmox-boot-tool refresh || true
  update-initramfs -u -k all
  qm stop "$vmid" --skiplock 1 >/dev/null 2>&1 || true
  qm set "$vmid" --vga "$(yaml_get vm.vga std)"
  idx=0
  while read -r pci; do
    [[ -z "$pci" ]] && continue
    if [[ "$pci" == *,* ]]; then
      spec="$pci"
      [[ "$spec" != *pcie=* ]] && spec="${spec},pcie=1"
    else
      spec="${pci},pcie=1"
    fi
    qm set "$vmid" "--hostpci${idx}" "$spec"
    idx=$((idx + 1))
  done < <(yaml_get gpu_passthrough.pci_addresses "")
  if qm config "$vmid" | grep -q '^efidisk0: .*pre-enrolled-keys=1'; then
    log "VM ${vmid} still has OVMF secure boot keys enrolled; recreate efidisk0 manually if NVIDIA module signing blocks driver loading"
  fi
  if [[ $changed -eq 1 || "$(lspci -Dnnk | awk '/NVIDIA/{n=1} n&&/Kernel driver in use/{print; n=0}' | grep -c vfio-pci || true)" -eq 0 ]]; then
    touch /run/pve_ubuntu_reboot_required
    log "GPU passthrough host configuration changed; reboot required"
  fi
  log "GPU passthrough configured; reboot $(hostname) before starting GPU workload"
}

configure_hdd_storage() {
  [[ "$(yaml_get hdd.enabled false)" == "true" ]] || return 0
  local disk part fs label mountpoint storage uuid existing_fs existing_label
  disk="$(yaml_get hdd.disk_by_id)"
  part="$(yaml_get hdd.partition)"
  fs="$(yaml_get hdd.filesystem ext4)"
  label="$(yaml_get hdd.label hdd8t)"
  mountpoint="$(yaml_get hdd.mountpoint /mnt/hdd8t)"
  storage="$(yaml_get hdd.storage_name hdd8t)"

  if mountpoint -q "$mountpoint" && pvesm status | awk '{print $1}' | grep -qx "$storage"; then
    log "HDD storage ${storage} already mounted at ${mountpoint}"
    return
  fi

  if [[ ! -b "$disk" ]]; then
    echo "HDD disk not found: $disk" >&2
    exit 1
  fi
  if [[ "$(yaml_get hdd.wipe_existing false)" != "true" ]] && [[ ! -b "$part" ]]; then
    echo "HDD partition missing and hdd.wipe_existing=false: $part" >&2
    exit 1
  fi

  existing_fs="$(blkid -s TYPE -o value "$part" 2>/dev/null || true)"
  existing_label="$(blkid -s LABEL -o value "$part" 2>/dev/null || true)"
  if [[ "$existing_fs" == "$fs" && "$existing_label" == "$label" ]]; then
    log "HDD partition ${part} already formatted as ${fs} with label ${label}; skip format"
  elif [[ "$(yaml_get hdd.wipe_existing false)" == "true" ]]; then
    log "Formatting HDD ${disk} as ${fs}; existing data will be destroyed"
    umount "$part" >/dev/null 2>&1 || true
    wipefs -a "$disk"
    parted -s "$disk" mklabel gpt
    parted -s "$disk" mkpart primary "$fs" 0% 100%
    partprobe "$disk" || true
    udevadm settle
    mkfs -t "$fs" -F -L "$label" "$part"
  fi

  mkdir -p "$mountpoint"
  uuid="$(blkid -s UUID -o value "$part")"
  grep -q " ${mountpoint} " /etc/fstab || echo "UUID=${uuid} ${mountpoint} ${fs} defaults,nofail 0 2" >>/etc/fstab
  mountpoint -q "$mountpoint" || mount "$mountpoint"
  if ! pvesm status | awk '{print $1}' | grep -qx "$storage"; then
    pvesm add dir "$storage" --path "$mountpoint" --content "$(yaml_get hdd.storage_content images,backup,iso)" --is_mountpoint 1
  fi
  log "HDD storage ${storage} ready at ${mountpoint}"
}

attach_hdd_to_vm() {
  [[ "$(yaml_get hdd.vm_disk.enabled false)" == "true" ]] || return 0
  local vmid storage bus size opts
  vmid="$(yaml_get vm.id)"
  storage="$(yaml_get hdd.storage_name hdd8t)"
  bus="$(yaml_get hdd.vm_disk.bus scsi1)"
  size="$(yaml_get hdd.vm_disk.size 7000)"
  if qm config "$vmid" | grep -q "^${bus}:"; then
    log "VM ${vmid} already has ${bus}; skip HDD attach"
    return
  fi
  opts="${storage}:${size},format=$(yaml_get hdd.vm_disk.format raw)"
  [[ "$(yaml_get hdd.vm_disk.iothread true)" == "true" ]] && opts="${opts},iothread=1"
  [[ "$(yaml_get hdd.vm_disk.discard false)" == "true" ]] && opts="${opts},discard=on"
  [[ "$(yaml_get hdd.vm_disk.ssd false)" == "true" ]] && opts="${opts},ssd=1"
  qm set "$vmid" "--${bus}" "$opts"
  log "Attached HDD-backed disk to VM ${vmid}: ${bus}=${opts}"
}

start_vm() {
  local vmid
  vmid="$(yaml_get vm.id)"
  if [[ -f /run/pve_ubuntu_reboot_required ]]; then
    log "Host reboot is required before starting VM ${vmid}; skip start"
    return
  fi
  if qm status "$vmid" | grep -q 'status: running'; then
    log "VM ${vmid} already running"
    return
  fi
  qm start "$vmid"
  log "VM ${vmid} started"
}

main() {
  need_root
  ensure_yaml
  log "pve_ubuntu started with config=${CONFIG}"
  if run_stage install_host_packages; then install_host_packages; fi
  if run_stage copy_iso; then copy_iso_from_source; fi
  if run_stage copy_v2ray_config; then copy_v2ray_config_from_source; fi
  download_cloud_image
  create_cloud_init_seed
  if run_stage create_vm; then create_vm; fi
  if run_stage hdd_storage; then configure_hdd_storage; fi
  if run_stage attach_hdd; then attach_hdd_to_vm; fi
  if run_stage gpu_passthrough; then configure_gpu_passthrough; fi
  if run_stage start_vm; then start_vm; fi
  log "pve_ubuntu finished"
}

main "$@"
pve_ubuntu.yaml
yaml
global:
  mode: auto
  workdir: /root/pve-ubuntu

source:
  host: pve
  ip: <SOURCE_PVE_LAN_IP>
  user: root
  password: <SOURCE_PVE_ROOT_PASSWORD>
  iso_path: <SOURCE_UBUNTU_ISO_PATH>
  v2ray_config_dir: /usr/local/etc/v2ray

stages:
  install_host_packages:
    mode: auto
  copy_iso:
    mode: auto
  copy_v2ray_config:
    mode: auto
  create_vm:
    mode: auto
  hdd_storage:
    mode: auto
  attach_hdd:
    mode: auto
  gpu_passthrough:
    mode: auto
  start_vm:
    mode: auto

vm:
  id: <VM_ID>
  name: bj123
  hostname: bj123
  bridge: vmbr0
  storage: local-lvm
  iso_storage_dir: /var/lib/vz/template/iso
  snippets_storage_dir: /var/lib/vz/snippets
  cloud_image_url: https://mirrors.tuna.tsinghua.edu.cn/ubuntu-cloud-images/jammy/current/jammy-server-cloudimg-amd64.img
  cloud_image_path: /var/lib/vz/template/cache/jammy-server-cloudimg-amd64.img
  disk_size: 256G
  memory_mib: 65536
  cores: 16
  sockets: 1
  cpu: host
  machine: q35
  bios: ovmf
  vga: none
  ostype: l26
  agent: true
  start_on_boot: false
  overwrite_existing: false

network:
  ipv4: <VM_LAN_IP>
  prefix: 24
  gateway4: <LAN_GATEWAY_IP>
  dns:
    - <LAN_DNS_IP>
    - <PUBLIC_DNS_IP>

ubuntu:
  user: <UBUNTU_USER>
  password: <UBUNTU_USER_PASSWORD>
  full_name: <UBUNTU_FULL_NAME>
  timezone: Asia/Shanghai
  locale: en_US.UTF-8
  ssh_password_auth: true

guest_setup:
  config: setup_ubuntu.yaml
  script: setup_ubuntu.sh
  run_on_first_boot: true

gpu_passthrough:
  enabled: true
  auto_detect_nvidia: true
  pci_addresses:
    - "<GPU_PCI_ADDRESS_1>"
    - "<GPU_PCI_ADDRESS_2>"
    - "<BOOT_VGA_GPU_PCI_ADDRESS>,rombar=0"
  excluded_pci_addresses:
    - "<GPU_PCI_ADDRESS_TO_EXCLUDE>"  # optional; keep problematic GPUs out of the VM
  vfio_ids:
    - "<GPU_VENDOR_DEVICE_ID>"
    - "<GPU_AUDIO_VENDOR_DEVICE_ID>"
  kernel_args:
    - intel_iommu=on
    - iommu=pt
    - pcie_port_pm=off
    - pcie_aspm=off
    - vfio-pci.disable_idle_d3=1
  reboot_after_config: true

hdd:
  enabled: true
  disk_by_id: /dev/disk/by-id/<HDD_DISK_BY_ID>
  partition: /dev/disk/by-id/<HDD_DISK_BY_ID>-part1
  filesystem: ext4
  label: <PVE_HDD_LABEL>
  mountpoint: <PVE_HDD_MOUNTPOINT>
  storage_name: <PVE_HDD_STORAGE_NAME>
  storage_content: images,backup,iso
  wipe_existing: true
  vm_disk:
    enabled: true
    bus: scsi1
    size: <VM_DATA_DISK_SIZE_GB>
    format: raw
    discard: false
    ssd: false
    iothread: true
setup_ubuntu.sh
sh
#!/usr/bin/env bash
set -euo pipefail

CONFIG="${1:-/opt/bj123-setup/setup_ubuntu.yaml}"
LOG_FILE="/var/log/setup_ubuntu.log"

exec > >(tee -a "$LOG_FILE") 2>&1

log() {
  printf '[%s] %s\n' "$(date '+%F %T')" "$*"
}

ensure_yaml() {
  if python3 - <<'PY' >/dev/null 2>&1
import yaml
PY
  then
    return
  fi
  apt-get update
  DEBIAN_FRONTEND=noninteractive apt-get install -y python3-yaml
}

yaml_get() {
  local path="$1"
  local default="${2:-}"
  python3 - "$CONFIG" "$path" "$default" <<'PY'
import sys, yaml
cfg_path, key_path, default = sys.argv[1:4]
with open(cfg_path, "r", encoding="utf-8") as f:
    data = yaml.safe_load(f) or {}
cur = data
for part in key_path.split("."):
    if isinstance(cur, dict) and part in cur:
        cur = cur[part]
    else:
        print(default)
        sys.exit(0)
if cur is None:
    print(default)
elif isinstance(cur, bool):
    print("true" if cur else "false")
elif isinstance(cur, list):
    print("\n".join(str(x) for x in cur))
else:
    print(cur)
PY
}

stage_mode() {
  local stage="$1"
  local mode
  mode="$(yaml_get "stages.${stage}.mode" "")"
  if [[ -z "$mode" ]]; then
    mode="$(yaml_get "global.mode" "confirm")"
  fi
  printf '%s' "$mode"
}

run_stage() {
  local stage="$1"
  local mode
  mode="$(stage_mode "$stage")"
  case "$mode" in
    auto) return 0 ;;
    manual|skip) log "skip stage ${stage} (mode=${mode})"; return 1 ;;
    confirm)
      read -r -p "Run stage ${stage}? [y/N] " answer
      [[ "${answer,,}" == y* ]]
      ;;
    *) log "skip stage ${stage} (unknown mode=${mode})"; return 1 ;;
  esac
}

install_packages() {
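  # Drop empty lines so an unset packages.base yields an empty array, not one "" element.
  mapfile -t pkgs < <(yaml_get packages.base "" | sed '/^$/d')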
  if [[ "${#pkgs[@]}" -gt 0 ]]; then
    DEBIAN_FRONTEND=noninteractive apt-get install -y "${pkgs[@]}"
  fi
}

configure_apt_sources() {
  local mirror
  mirror="$(yaml_get system.apt_mirror "")"
  [[ -z "$mirror" ]] && return
  if [[ -f /etc/apt/sources.list ]]; then
    sed -i "s@http://.*archive.ubuntu.com@${mirror}@g; s@https://.*archive.ubuntu.com@${mirror}@g; s@http://security.ubuntu.com@${mirror}@g; s@https://security.ubuntu.com@${mirror}@g" /etc/apt/sources.list
    sed -i 's@http://@https://@g' /etc/apt/sources.list
  fi
}

configure_git() {
  local user email http_proxy https_proxy target_user home_dir
  target_user="$(yaml_get user.name ubuntu)"
  home_dir="$(getent passwd "$target_user" | cut -d: -f6)"
  user="$(yaml_get git.user_name "")"
  email="$(yaml_get git.user_email "")"
  http_proxy="$(yaml_get git.http_proxy "")"
  https_proxy="$(yaml_get git.https_proxy "")"
  [[ -z "$home_dir" ]] && return
  # Only set identity values that are actually configured.
  [[ -n "$user" ]] && sudo -u "$target_user" git config --global user.name "$user"
  [[ -n "$email" ]] && sudo -u "$target_user" git config --global user.email "$email"
  [[ -n "$http_proxy" ]] && sudo -u "$target_user" git config --global http.proxy "$http_proxy"
  [[ -n "$https_proxy" ]] && sudo -u "$target_user" git config --global https.proxy "$https_proxy"
  git lfs install --system || true
}

configure_zsh() {
  local target_user shell_path home_dir
  target_user="$(yaml_get user.name ubuntu)"
  home_dir="$(getent passwd "$target_user" | cut -d: -f6)"
  shell_path="$(yaml_get user.shell /usr/bin/zsh)"
  if [[ -x "$shell_path" && -n "$home_dir" ]] && id "$target_user" >/dev/null 2>&1; then
    chsh -s "$shell_path" "$target_user" || true
    sudo -u "$target_user" mkdir -p "$home_dir/.zsh"
    if [[ ! -d "$home_dir/.zsh/zsh-autocomplete/.git" ]]; then
      sudo -u "$target_user" git clone --depth 1 https://github.com/marlonrichert/zsh-autocomplete.git "$home_dir/.zsh/zsh-autocomplete" || true
    fi
    if [[ ! -f "$home_dir/.zsh/zsh-autosuggestions.zsh" ]]; then
      if [[ -d "$home_dir/.zsh/zsh-autosuggestions/.git" ]]; then
        cp "$home_dir/.zsh/zsh-autosuggestions/zsh-autosuggestions.zsh" "$home_dir/.zsh/zsh-autosuggestions.zsh" || true
      else
        timeout 60 wget -q https://raw.staticdn.net/zsh-users/zsh-autosuggestions/master/zsh-autosuggestions.zsh -O "$home_dir/.zsh/zsh-autosuggestions.zsh" || true
      fi
    fi
    cat >"$home_dir/.zshrc" <<'EOF'
autoload -Uz promptinit
promptinit
PROMPT='%F{yellow}%~ # %f'

setopt histignorealldups sharehistory
bindkey -e
HISTSIZE=1000
SAVEHIST=1000
HISTFILE=~/.zsh_history

zstyle ':completion:*' auto-description 'specify: %d'
zstyle ':completion:*' completer _expand _complete _correct _approximate
zstyle ':completion:*' format 'Completing %d'
zstyle ':completion:*' group-name ''
zstyle ':completion:*' menu select=2
eval "$(dircolors -b)"
zstyle ':completion:*:default' list-colors ${(s.:.)LS_COLORS}
zstyle ':completion:*' list-colors ''
zstyle ':completion:*' list-prompt %SAt %p: Hit TAB for more, or the character to insert%s
zstyle ':completion:*' matcher-list '' 'm:{a-z}={A-Z}' 'm:{a-zA-Z}={A-Za-z}' 'r:|[._-]=* r:|=* l:|=*'
zstyle ':completion:*' menu select=long
zstyle ':completion:*' select-prompt %SScrolling active: current selection at %p%s
zstyle ':completion:*' use-compctl false
zstyle ':completion:*' verbose true
zstyle ':completion:*:*:kill:*:processes' list-colors '=(#b) #([0-9]#)*=0=01;31'
zstyle ':completion:*:kill:*' command 'ps -u $USER -o pid,%cpu,tty,cputime,cmd'

alias ls="ls --color"
alias gs="git status"
alias gb="git rev-parse --abbrev-ref HEAD"
alias gba="git -P branch"
alias gdp="git -P diff"
alias gdh="git diff HEAD^ HEAD"
alias gl="git log"
alias gn="git --no-pager log --pretty='format:%Cgreen[%h] %Cblue[%ai] %Creset[%an]%C(Red)%d %n  %Creset%s %n' -n5"
alias ga="git add"
alias gas="git add . && git status"
alias gc="git commit"
alias gk="git checkout"
alias gau="git add -u"
alias gcm="git commit -m"
alias gcan="git commit --amend --no-edit"
alias gp="git push"
alias gpf="git push -f"
alias gacp="git add -u && git commit --amend --no-edit && git push -f"
[[ -f ~/.gd.sh ]] && source ~/.gd.sh

alias ta="tmux a"
alias td="tmux detach"
alias tn="tmux new -s x"
alias tl="tmux ls"
alias ts="tmux select-pane -T"
alias tm="top -o %MEM -d 2 -c"
alias tc="top -o %CPU -d 2 -c"
alias k9="kill -9"
alias lt="ls -lt"
alias hi="hostname -i"

bindkey "^[[1;5C" forward-word
bindkey "^[[1;3C" forward-word
bindkey "^[[1;5D" backward-word
bindkey "^[[1;3D" backward-word
bindkey "^[[1~"   beginning-of-line
bindkey "^[[4~"   end-of-line
bindkey "^[[3~"   delete-char
bindkey "^[^[[3~" delete-word

if [[ -f ~/.zsh/zsh-autosuggestions.zsh ]]; then
  ZSH_AUTOSUGGEST_HIGHLIGHT_STYLE="fg=#ff00ff"
  source ~/.zsh/zsh-autosuggestions.zsh
fi
if [[ -f ~/.zsh/zsh-autocomplete/zsh-autocomplete.plugin.zsh ]]; then
  source ~/.zsh/zsh-autocomplete/zsh-autocomplete.plugin.zsh 2>/dev/null
  zstyle ':completion:*' list-colors '=*=96'
fi

if [[ -f "$HOME/miniconda3/etc/profile.d/conda.sh" ]]; then
  . "$HOME/miniconda3/etc/profile.d/conda.sh"
elif [[ -x "$HOME/miniconda3/bin/conda" ]]; then
  export PATH="$HOME/miniconda3/bin:$PATH"
fi
alias cda="conda activate ai"
alias cdd="conda deactivate"
if command -v conda >/dev/null 2>&1 && conda env list | awk '{print $1}' | grep -qx ai; then
  conda activate ai
fi

alias nu="gpustat -cpu -i -F -P"
alias nsd="nvidia-smi | grep Default"
export HF_ENDPOINT=https://hf-mirror.com
export REPOS=$HOME/repos
export DATA=/media/data1
export PATH=/usr/local/cuda/bin:$HOME/.local/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH:-}

ulimit -n 1048576 2>/dev/null || true
fpath+=~/.zfunc
autoload -Uz compinit
compinit
EOF
    cat >"$home_dir/.zshenv" <<'EOF'
skip_global_compinit=1
EOF
    chown "$target_user:$target_user" "$home_dir/.zshrc" "$home_dir/.zshenv"
  fi
}

configure_tmux() {
  local target_user home_dir
  target_user="$(yaml_get user.name ubuntu)"
  home_dir="$(getent passwd "$target_user" | cut -d: -f6)"
  [[ -z "$home_dir" ]] && return
  sudo -u "$target_user" mkdir -p "$home_dir/.tmux/plugins" "$home_dir/.config/systemd/user"
  if [[ ! -d "$home_dir/.tmux/plugins/tpm/.git" ]]; then
    sudo -u "$target_user" git clone --depth 1 https://github.com/tmux-plugins/tpm "$home_dir/.tmux/plugins/tpm" || true
  fi
  if [[ ! -d "$home_dir/.tmux/plugins/tmux-resurrect/.git" ]]; then
    sudo -u "$target_user" git clone --depth 1 https://github.com/tmux-plugins/tmux-resurrect "$home_dir/.tmux/plugins/tmux-resurrect" || true
  fi
  cat >"$home_dir/.tmux.conf" <<'EOF'
unbind C-b
set -g prefix M-z
bind M-z send-prefix
bind r source-file ~/.tmux.conf \; display ".tmux.conf reloaded!"
set -g mouse on
set -g status-interval 1
set-option -g status-position bottom
set-option -g status-style bg=default
set-option -g status-left ""
set-option -g window-status-format ""
set-option -g window-status-separator ""
set -g window-status-current-format "#[fg=cyan] #{pane_title}: [#{pane_current_path}]"
set-option -g status-right "#[fg=cyan,bold] [ww%V.%w] %m-%d %H:%M:%S"
set -g pane-border-status top
set -g pane-border-lines heavy
set -g pane-border-style bg=default,fg=cyan
set -g pane-active-border-style bg=cyan,fg=black
setw -g pane-border-format ' #{pane_index}: [#{pane_current_path}] '
unbind -n a
unbind-key -T root MouseDrag1Pane
unbind-key -T copy-mode-vi MouseDrag1Pane
unbind-key -T copy-mode MouseDrag1Pane
set-option -g default-shell /usr/bin/zsh
set-option -g history-limit 100000
set -g @plugin 'tmux-plugins/tpm'
set -g @plugin 'tmux-plugins/tmux-resurrect'
set -g @resurrect-hook-pre-restore-pane-processes 'tmux kill-session -t=0 2>/dev/null || true'
set -g @resurrect-processes '\
    ssh mongosh \
    "~npx->npx *" \
    "~npm->npm *" \
    "~python->python *" \
    "~docker->docker *" \
    "~gpustat->gpustat *" \
'
run '~/.tmux/plugins/tpm/tpm'
EOF
  chown "$target_user:$target_user" "$home_dir/.tmux.conf"
}

install_dotfiles() {
  local target_user home_dir gd_src
  target_user="$(yaml_get user.name ubuntu)"
  home_dir="$(getent passwd "$target_user" | cut -d: -f6)"
  [[ -z "$home_dir" ]] && return
  gd_src="$(yaml_get dotfiles.gd_source /opt/bj123-setup/dotfiles/.gd.sh)"
  if [[ -f "$gd_src" ]]; then
    install -m 0644 -o "$target_user" -g "$target_user" "$gd_src" "$home_dir/.gd.sh"
  elif [[ ! -f "$home_dir/.gd.sh" ]]; then
    timeout 60 wget -q "$(yaml_get dotfiles.gd_url https://raw.staticdn.net/Hansimov/blog/main/docs/notes/scripts/.gd.sh)" -O "$home_dir/.gd.sh" || true
    chown "$target_user:$target_user" "$home_dir/.gd.sh" 2>/dev/null || true
  fi
  sudo -u "$target_user" mkdir -p "$home_dir/.pip"
  cat >"$home_dir/.pip/pip.conf" <<'EOF'
[global]
index-url = https://mirrors.ustc.edu.cn/pypi/simple

[install]
trusted-host = mirrors.ustc.edu.cn
EOF
  cat >"$home_dir/.condarc" <<'EOF'
channels:
  - conda-forge
  - bioconda
  - nodefaults
custom_channels:
  conda-forge: https://mirrors.ustc.edu.cn/anaconda/cloud
  bioconda: https://mirrors.ustc.edu.cn/anaconda/cloud
show_channel_urls: true
EOF
  chown -R "$target_user:$target_user" "$home_dir/.pip" "$home_dir/.condarc"
  configure_zsh
  configure_tmux
}

install_conda() {
  [[ "$(yaml_get conda.install false)" == "true" ]] || return
  local target_user home_dir installer url python_version env_name
  target_user="$(yaml_get user.name ubuntu)"
  home_dir="$(getent passwd "$target_user" | cut -d: -f6)"
  [[ -z "$home_dir" ]] && return
  url="$(yaml_get conda.installer_url https://mirrors.tuna.tsinghua.edu.cn/anaconda/miniconda/Miniconda3-latest-Linux-x86_64.sh)"
  installer="/tmp/miniconda.sh"
  if [[ ! -x "$home_dir/miniconda3/bin/conda" ]]; then
    wget -O "$installer" "$url"
    sudo -u "$target_user" bash "$installer" -b -u -p "$home_dir/miniconda3"
  fi
  cat >"$home_dir/.condarc" <<'EOF'
channels:
  - conda-forge
  - bioconda
  - nodefaults
custom_channels:
  conda-forge: https://mirrors.ustc.edu.cn/anaconda/cloud
  bioconda: https://mirrors.ustc.edu.cn/anaconda/cloud
show_channel_urls: true
EOF
  chown "$target_user:$target_user" "$home_dir/.condarc"
  sudo -u "$target_user" "$home_dir/miniconda3/bin/conda" config --set show_channel_urls true || true
  env_name="$(yaml_get conda.env_name ai)"
  python_version="$(yaml_get conda.python_version 3.13)"
  if [[ "$(yaml_get conda.create_env true)" == "true" ]]; then
    if ! sudo -u "$target_user" "$home_dir/miniconda3/bin/conda" env list | awk '{print $1}' | grep -qx "$env_name"; then
      sudo -u "$target_user" "$home_dir/miniconda3/bin/conda" create -y -n "$env_name" "python=${python_version}" --override-channels -c https://mirrors.ustc.edu.cn/anaconda/cloud/conda-forge || true
    fi
  fi
  configure_zsh
}

install_python_tools() {
  [[ "$(yaml_get python_tools.install true)" == "true" ]] || return
  local target_user home_dir pip_bin env_name
  target_user="$(yaml_get user.name ubuntu)"
  home_dir="$(getent passwd "$target_user" | cut -d: -f6)"
  [[ -z "$home_dir" ]] && return
  DEBIAN_FRONTEND=noninteractive apt-get install -y python3-pip python3-venv
  sudo -u "$target_user" python3 -m pip install --user -U pip pipreqs gpustat || true
  env_name="$(yaml_get conda.env_name ai)"
  if [[ -x "$home_dir/miniconda3/envs/${env_name}/bin/pip" ]]; then
    pip_bin="$home_dir/miniconda3/envs/${env_name}/bin/pip"
    sudo -u "$target_user" "$pip_bin" install -U pip pipreqs gpustat || true
  fi
}

install_docker() {
  [[ "$(yaml_get docker.install false)" == "true" ]] || return
  local target_user mirror http_proxy https_proxy no_proxy
  target_user="$(yaml_get user.name ubuntu)"
  mirror="$(yaml_get docker.repo_mirror https://mirrors.ustc.edu.cn/docker-ce)"
  DEBIAN_FRONTEND=noninteractive apt-get install -y ca-certificates curl gnupg
  install -m 0755 -d /etc/apt/keyrings
  rm -f /etc/apt/keyrings/docker.gpg
  curl -fsSL "${mirror}/linux/ubuntu/gpg" | gpg --dearmor -o /etc/apt/keyrings/docker.gpg
  chmod a+r /etc/apt/keyrings/docker.gpg
  echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] ${mirror}/linux/ubuntu $(. /etc/os-release && echo "$VERSION_CODENAME") stable" >/etc/apt/sources.list.d/docker.list
  apt-get update
  DEBIAN_FRONTEND=noninteractive apt-get install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
  # Let the target user run docker without sudo (takes effect at next login).
  usermod -aG docker "$target_user" || true
  mkdir -p /etc/docker
  python3 - <<'PY'
import json, pathlib
path = pathlib.Path("/etc/docker/daemon.json")
data = {}
if path.exists():
    try:
        data = json.loads(path.read_text())
    except Exception:
        data = {}
data.setdefault("registry-mirrors", [
    "https://docker.1ms.run",
    "https://docker.1panel.live",
    "https://docker.m.daocloud.io",
])
path.write_text(json.dumps(data, indent=2, ensure_ascii=False) + "\n")
PY
  http_proxy="$(yaml_get docker.http_proxy "")"
  https_proxy="$(yaml_get docker.https_proxy "$http_proxy")"
  no_proxy="$(yaml_get docker.no_proxy localhost,127.0.0.1)"
  if [[ -n "$http_proxy" ]]; then
    mkdir -p /etc/systemd/system/docker.service.d
    cat >/etc/systemd/system/docker.service.d/proxy.conf <<EOF
[Service]
Environment="HTTP_PROXY=${http_proxy}"
Environment="HTTPS_PROXY=${https_proxy}"
Environment="NO_PROXY=${no_proxy}"
EOF
  fi
  systemctl daemon-reload
  systemctl enable --now docker
  systemctl restart docker
}

install_nvidia_container() {
  [[ "$(yaml_get nvidia_container.install false)" == "true" ]] || return
  local base_url
  command -v docker >/dev/null 2>&1 || install_docker
  if ! command -v docker >/dev/null 2>&1; then
    log "Docker is not installed; skip NVIDIA Container Toolkit"
    return
  fi
  if ! command -v nvidia-smi >/dev/null 2>&1 || ! nvidia-smi >/dev/null 2>&1; then
    log "NVIDIA driver is not ready; skip NVIDIA Container Toolkit"
    return
  fi
  if [[ "$(yaml_get nvidia_container.use_ustc_mirror true)" == "true" ]]; then
    base_url="https://mirrors.ustc.edu.cn/libnvidia-container"
  else
    base_url="https://nvidia.github.io/libnvidia-container"
  fi
  rm -f /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
  curl -fsSL "${base_url}/gpgkey" | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
  curl -fsSL "${base_url}/stable/deb/nvidia-container-toolkit.list" | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' >/etc/apt/sources.list.d/nvidia-container-toolkit.list
  if [[ "$(yaml_get nvidia_container.use_ustc_mirror true)" == "true" ]]; then
    sed -i 's#nvidia.github.io#mirrors.ustc.edu.cn#g' /etc/apt/sources.list.d/nvidia-container-toolkit.list
  fi
  apt-get update
  DEBIAN_FRONTEND=noninteractive apt-get install -y nvidia-container-toolkit
  if command -v nvidia-ctk >/dev/null 2>&1; then
    nvidia-ctk runtime configure --runtime=docker
  fi
  systemctl daemon-reload
  systemctl restart docker
}

install_tailscale() {
  [[ "$(yaml_get tailscale.install false)" != "true" ]] && return
  if ! command -v tailscale >/dev/null 2>&1; then
    curl -fsSL https://tailscale.com/install.sh | sh
  fi
  systemctl enable --now tailscaled
  local auth_key
  auth_key="$(yaml_get tailscale.auth_key "")"
  if [[ "$(yaml_get tailscale.up false)" == "true" ]]; then
    if [[ -n "$auth_key" ]]; then
      tailscale up --auth-key "$auth_key"
    else
      tailscale up
    fi
  fi
}

install_v2ray() {
  [[ "$(yaml_get v2ray.install true)" != "true" ]] && return
  local script config_src config_dst
  script="$(yaml_get v2ray.install_script /opt/bj123-setup/v2ray-install-release.sh)"
  config_src="$(yaml_get v2ray.config_src /opt/bj123-setup/v2ray/config.json)"
  config_dst="$(yaml_get v2ray.config_dst /usr/local/etc/v2ray/config.json)"
  if [[ -x "$script" ]]; then
    "$script" || true
  fi
  if [[ "$(yaml_get v2ray.install_dat true)" == "true" ]]; then
    mkdir -p /usr/local/share/v2ray
    timeout 60 wget -q https://githubfast.com/v2fly/geoip/releases/latest/download/geoip.dat -O /usr/local/share/v2ray/geoip.dat || true
    timeout 60 wget -q https://githubfast.com/v2fly/domain-list-community/releases/latest/download/dlc.dat -O /usr/local/share/v2ray/geosite.dat || true
  fi
  if [[ -f "$config_src" ]]; then
    mkdir -p "$(dirname "$config_dst")"
    install -m 0644 "$config_src" "$config_dst"
  fi
  while IFS=$'\t' read -r name src dst service; do
    [[ -z "$name" ]] && continue
    if [[ -f "$src" ]]; then
      mkdir -p "$(dirname "$dst")"
      install -m 0644 "$src" "$dst"
      systemctl enable --now "$service" || true
    fi
  done < <(python3 - "$CONFIG" <<'PY'
import sys, yaml
with open(sys.argv[1], "r", encoding="utf-8") as f:
    data = yaml.safe_load(f) or {}
for item in (((data.get("v2ray") or {}).get("extra_configs")) or []):
    name = str(item.get("name", "") or "")
    if not name:
        continue
    src = str(item.get("src", f"/opt/bj123-setup/v2ray/{name}.json"))
    dst = str(item.get("dst", f"/usr/local/etc/v2ray/{name}.json"))
    service = str(item.get("service", f"v2ray@{name}"))
    print("\t".join([name, src, dst, service]))
PY
  )
  if [[ "$(yaml_get v2ray.enable_service true)" == "true" ]]; then
    systemctl enable --now v2ray || true
  fi
}

mount_hdd() {
  [[ "$(yaml_get hdd.enabled false)" == "true" ]] || return
  local dev part fs label mountpoint uuid existing_fs existing_label
  dev="$(yaml_get hdd.device /dev/sdb)"
  part="$(yaml_get hdd.partition /dev/sdb1)"
  fs="$(yaml_get hdd.filesystem ext4)"
  label="$(yaml_get hdd.label data1)"
  mountpoint="$(yaml_get hdd.mountpoint /media/data1)"
  if mountpoint -q "$mountpoint"; then
    log "HDD already mounted at ${mountpoint}"
    return
  fi
  if [[ ! -b "$dev" ]]; then
    log "HDD device ${dev} is not present; skip guest HDD mount"
    return
  fi
  if [[ "$(findmnt -no SOURCE / 2>/dev/null)" == "$dev"* ]]; then
    log "Refusing to format root disk ${dev}"
    return 1
  fi
  existing_fs="$(blkid -s TYPE -o value "$part" 2>/dev/null || true)"
  existing_label="$(blkid -s LABEL -o value "$part" 2>/dev/null || true)"
  if [[ "$existing_fs" == "$fs" && "$existing_label" == "$label" ]]; then
    log "HDD partition ${part} already formatted as ${fs} with label ${label}; skip format"
  elif [[ "$(yaml_get hdd.wipe_existing false)" == "true" || ! -b "$part" ]]; then
    # Destructive: clears every signature on the whole disk, then repartitions and reformats it.
    umount "$part" >/dev/null 2>&1 || true
    wipefs -a "$dev"
    parted -s "$dev" mklabel gpt
    parted -s "$dev" mkpart primary "$fs" 0% 100%
    partprobe "$dev" || true
    udevadm settle
    mkfs -t "$fs" -F -L "$label" "$part"
  fi
  mkdir -p "$mountpoint"
  uuid="$(blkid -s UUID -o value "$part")"
  grep -q " ${mountpoint} " /etc/fstab || echo "UUID=${uuid} ${mountpoint} ${fs} defaults,nofail 0 2" >>/etc/fstab
  mountpoint -q "$mountpoint" || mount "$mountpoint"
  log "HDD mounted at ${mountpoint}"
}

install_nvidia_driver() {
  [[ "$(yaml_get nvidia.install_driver false)" == "true" ]] || return
  if ! lspci -nn | grep -Eq 'NVIDIA.*(VGA|3D|Display)|VGA.*NVIDIA|3D.*NVIDIA|Display.*NVIDIA'; then
    log "No NVIDIA GPU visible in guest; skip NVIDIA driver"
    return
  fi
  if command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi >/dev/null 2>&1; then
    log "NVIDIA driver already works"
    return
  fi
  DEBIAN_FRONTEND=noninteractive apt-get install -y ubuntu-drivers-common
  local pkg
  pkg="$(yaml_get nvidia.driver_package auto)"
  if [[ "$pkg" == "auto" || -z "$pkg" ]]; then
    # Inside single quotes the shell keeps backslashes literally, so the sed group must be \( \) and \1.
    pkg="$(ubuntu-drivers devices 2>/dev/null | sed -n 's/.*driver *: *\([^ ]*\).*recommended.*/\1/p' | head -1)"
  fi
  [[ -z "$pkg" ]] && pkg="nvidia-driver-535"
  log "Installing NVIDIA driver package: ${pkg}"
  DEBIAN_FRONTEND=noninteractive apt-get install -y "$pkg"
}

install_cuda() {
  [[ "$(yaml_get nvidia.install_cuda false)" == "true" ]] || return
  if command -v nvcc >/dev/null 2>&1; then
    log "CUDA nvcc already installed: $(command -v nvcc)"
    return
  fi
  if ! lspci -nn | grep -Eq 'NVIDIA.*(VGA|3D|Display)|VGA.*NVIDIA|3D.*NVIDIA|Display.*NVIDIA'; then
    log "No NVIDIA GPU visible in guest; skip CUDA"
    return
  fi
  local method package keyring_url tmpdeb
  method="$(yaml_get nvidia.cuda_method nvidia_repo)"
  package="$(yaml_get nvidia.cuda_package cuda-toolkit-13-0)"
  if [[ "$method" == "apt" ]]; then
    DEBIAN_FRONTEND=noninteractive apt-get install -y nvidia-cuda-toolkit
  else
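    keyring_url="$(yaml_get nvidia.cuda_keyring_url https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb)"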
    tmpdeb="/tmp/cuda-keyring.deb"
    if [[ ! -f /etc/apt/sources.list.d/cuda-ubuntu2204-x86_64.list ]]; then
      wget -O "$tmpdeb" "$keyring_url"
      dpkg -i "$tmpdeb"
      apt-get update
    fi
    DEBIAN_FRONTEND=noninteractive apt-get install -y "$package"
  fi
  cat >/etc/profile.d/cuda.sh <<'EOF'
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH:-}
EOF
}

install_desktop() {
  [[ "$(yaml_get desktop.install false)" != "true" ]] && return
  local package
  package="$(yaml_get desktop.package ubuntu-desktop-minimal)"
  DEBIAN_FRONTEND=noninteractive apt-get install -y "$package"
}

main() {
  log "setup_ubuntu started with config=${CONFIG}"
  ensure_yaml
  if run_stage apt_sources; then
    configure_apt_sources
  fi
  apt-get update
  if run_stage base_packages; then
    install_packages
  fi
  if run_stage qemu_guest_agent; then
    systemctl enable --now qemu-guest-agent
  fi
  if run_stage ssh; then
    systemctl enable --now ssh
  fi
  if run_stage tailscale; then
    install_tailscale
  fi
  if run_stage v2ray; then
    install_v2ray
  fi
  if run_stage hdd_mount; then
    mount_hdd
  fi
  if run_stage nvidia_driver; then
    install_nvidia_driver
  fi
  if run_stage cuda; then
    install_cuda
  fi
  if run_stage git; then
    configure_git
  fi
  if run_stage dotfiles; then
    install_dotfiles
  fi
  if run_stage conda; then
    install_conda
  fi
  if run_stage python_tools; then
    install_python_tools
  fi
  if run_stage docker; then
    install_docker
  fi
  if run_stage nvidia_container; then
    install_nvidia_container
  fi
  if run_stage zsh; then
    configure_zsh
  fi
  if run_stage desktop; then
    install_desktop
  fi
  log "setup_ubuntu finished"
}

main "$@"
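
The `auto` driver selection in `install_nvidia_driver` depends on pulling the package name out of the line that `ubuntu-drivers devices` marks as `recommended`. The following is a minimal sketch of that parsing step, runnable on a machine without a GPU. The helper name `pick_recommended_driver` and the sample text are assumptions made for illustration; real output may differ between Ubuntu releases.

```sh
#!/usr/bin/env bash
# Hypothetical helper: reads `ubuntu-drivers devices` output on stdin and
# prints the package name from the first line marked "recommended".
pick_recommended_driver() {
  sed -n 's/.*driver *: *\([^ ]*\).*recommended.*/\1/p' | head -1
}

# Sample text shaped like `ubuntu-drivers devices` output (illustrative only).
sample_output='== /sys/devices/pci0000:00/0000:00:10.0 ==
vendor   : NVIDIA Corporation
driver   : nvidia-driver-470 - distro non-free
driver   : nvidia-driver-535 - distro non-free recommended
driver   : xserver-xorg-video-nouveau - distro free builtin'

printf '%s\n' "$sample_output" | pick_recommended_driver
```

If the helper prints nothing, the script falls back to `nvidia-driver-535`, so an empty result here is worth checking before relying on `driver_package: auto`.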
setup_ubuntu.yaml
yaml
global:
  mode: auto

stages:
  apt_sources:
    mode: auto
  base_packages:
    mode: auto
  qemu_guest_agent:
    mode: auto
  ssh:
    mode: auto
  tailscale:
    mode: auto
  v2ray:
    mode: auto
  hdd_mount:
    mode: auto
  nvidia_driver:
    mode: auto
  cuda:
    mode: auto
  git:
    mode: auto
  dotfiles:
    mode: auto
  conda:
    mode: auto
  python_tools:
    mode: auto
  docker:
    mode: auto
  nvidia_container:
    mode: auto
  zsh:
    mode: auto
  desktop:
    mode: manual

system:
  hostname: bj123
  timezone: Asia/Shanghai
  apt_mirror: https://mirrors.ustc.edu.cn

user:
  name: <UBUNTU_USER>
  password: <UBUNTU_USER_PASSWORD>
  shell: /usr/bin/zsh

packages:
  base:
    - ca-certificates
    - curl
    - wget
    - gnupg
    - lsb-release
    - software-properties-common
    - build-essential
    - net-tools
    - pciutils
    - htop
    - tmux
    - unzip
    - qemu-guest-agent
    - openssh-server
    - git
    - git-lfs
    - zsh
    - python3-pip
    - python3-venv
    - lm-sensors

dotfiles:
  gd_source: /opt/bj123-setup/dotfiles/.gd.sh
  gd_url: https://raw.staticdn.net/Hansimov/blog/main/docs/notes/scripts/.gd.sh

conda:
  install: true
  installer_url: https://mirrors.tuna.tsinghua.edu.cn/anaconda/miniconda/Miniconda3-latest-Linux-x86_64.sh
  env_name: ai
  python_version: "3.13"
  create_env: true

python_tools:
  install: true

docker:
  install: true
  repo_mirror: https://mirrors.ustc.edu.cn/docker-ce
  http_proxy: http://127.0.0.1:11119
  https_proxy: http://127.0.0.1:11119
  no_proxy: localhost,127.0.0.1

nvidia_container:
  install: true
  use_ustc_mirror: true

tailscale:
  install: true
  up: false
  auth_key: ""

v2ray:
  install: true
  install_script: /opt/bj123-setup/v2ray-install-release.sh
  config_src: /opt/bj123-setup/v2ray/config.json
  config_dst: /usr/local/etc/v2ray/config.json
  extra_configs:
    - name: new
      src: /opt/bj123-setup/v2ray/new.json
      dst: /usr/local/etc/v2ray/new.json
      service: v2ray@new
  install_dat: true
  enable_service: true

hdd:
  enabled: true
  device: /dev/sdb
  partition: /dev/sdb1
  filesystem: ext4
  label: <VM_HDD_LABEL>
  mountpoint: <VM_HDD_MOUNTPOINT>
  wipe_existing: true

nvidia:
  install_driver: true
  driver_package: auto
  install_cuda: true
  cuda_method: nvidia_repo
  cuda_package: cuda-toolkit-13-0
  cuda_keyring_url: https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb

git:
  user_name: <GIT_USER_NAME>
  user_email: <GIT_USER_EMAIL>
  http_proxy: http://127.0.0.1:11119
  https_proxy: http://127.0.0.1:11119

desktop:
  install: false
  package: ubuntu-desktop-minimal
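
The config above still carries placeholders such as `<UBUNTU_USER>` and `<VM_HDD_MOUNTPOINT>` that have to be filled in before the script runs. A small pre-flight check could look like the sketch below. It assumes every placeholder has the form `<UPPER_CASE>`; the helper name `find_placeholders` is hypothetical and not part of the original script.

```sh
#!/usr/bin/env bash
# Hypothetical pre-flight check: list config lines that still contain an
# unfilled <UPPER_CASE> placeholder. Prints "line:content" for each hit
# and prints nothing when the file is clean.
find_placeholders() {
  grep -nE '<[A-Z][A-Z0-9_]*>' "$1" || true
}

# Demo on a throwaway file rather than the real config.
demo_cfg="$(mktemp)"
cat >"$demo_cfg" <<'EOF'
user:
  name: <UBUNTU_USER>
  shell: /usr/bin/zsh
hdd:
  label: data1
  mountpoint: <VM_HDD_MOUNTPOINT>
EOF

find_placeholders "$demo_cfg"
rm -f "$demo_cfg"
```

Running it against the real `setup_ubuntu.yaml` and refusing to continue while it prints anything would catch a forgotten value before stages such as `hdd_mount` act on it.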