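Priority	File	Purpose
1	reports_h100_acceptance_current_status_20260523.md	Overall current status: items tested, failures, blockers, next steps
2	reports_multinode_nccl_latest_index_20260523.md	Index of the multi-node NCCL reports
3	reports_multinode_nccl_handoff_plan_20260523.md	Plan for the next owner to rerun the tests and continue root-cause analysis
4	reports_test_all_latest_summary_cn_20260523.md	Raw Chinese-language summary of the single-node `test all` run
5	reports_rdma_cross_node_mlx5_0_20260523.md	Bidirectional cross-node RDMA results on `mlx5_0`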

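Current main blockers:

Single-node test all: both nodes are at 6/10 PASS; Compute, NCCL, Stress, and RDMA did not pass.
Cross-node RDMA: mlx5_0 write bandwidth is close to or at the threshold, but read bandwidth and both read and write latency did not pass.
Multi-node NCCL: 2x8 allreduce and 2x8 alltoall miss the thresholds in the acceptance PDF; NCCL wrong_count=0, so the failures are performance shortfalls rather than correctness errors.
Environment differences: the 400G IB rails currently usable are mainly mlx5_0,mlx5_1,mlx5_6,mlx5_7; no external NCCL net plugin / SHARP / HCOLL was found.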

H100 rerun entry points

The default remote path is /root/test_gpu_scripts. Launching multi-node tests from nccl-gpu-1 is recommended.

# Full single-node acceptance; run on each machine separately
bash scripts/run_h100_single_node_all.sh

# Multi-node NCCL PDF matrix: allreduce/alltoall x 2x1/2x2/2x4/2x8
bash scripts/run_multinode_nccl_pdf_matrix.sh

# Six multi-node NCCL collectives: 2 nodes x 8 GPUs
bash scripts/run_multinode_nccl_all_collectives.sh

# Multi-node NCCL deep diagnostics and environment evidence capture
bash scripts/multinode_nccl_deep_diagnose.sh preflight
bash scripts/multinode_nccl_deep_diagnose.sh all

Project structure

test_gpu_scripts/
├── gpu_tester.py                               # Main entry point: CLI + interactive menu
├── install_deps.sh                             # One-step installer for third-party tools
├── configs/
│   ├── default.yaml                            # Default configuration
│   ├── multinode_nccl_nccl227_pdf_matrix.yaml  # H100 multi-node PDF matrix configuration
│   └── multinode_nccl_nccl227_all_collectives_2x8.yaml
├── modules/
│   ├── gpu_specs.py                            # GPU spec database
│   ├── gpu_info.py                             # GPU detection and information
│   ├── health_check.py                         # Health diagnostics
│   ├── benchmark.py                            # Memory bandwidth + compute throughput
│   ├── nccl_test.py                            # NCCL multi-GPU / multi-node communication
│   ├── stress_test.py                          # GPU stress / stability
│   ├── rdma_test.py                            # RDMA/InfiniBand
│   ├── training_sim.py                         # Training simulation
│   └── report.py                               # Report generation
├── scripts/
│   ├── run_h100_single_node_all.sh             # H100 single-node full rerun
│   ├── run_multinode_nccl_pdf_matrix.sh        # Multi-node NCCL PDF matrix rerun
│   ├── run_multinode_nccl_all_collectives.sh   # Multi-node NCCL six-collective rerun
│   └── multinode_nccl_deep_diagnose.sh         # Multi-node NCCL deep diagnostics
├── docs/                                       # Metric notes and runbooks
├── reports_*20260523*.md                       # Current H100 acceptance evidence and summary reports
└── requirements.txt

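Requirements

Minimum (basic diagnostics)

Item	Requirement
OS	Ubuntu 22.04 / RHEL 8+ / Rocky 8+
Python	3.10+
NVIDIA Driver	≥ 470 (Ampere) / ≥ 535 (Hopper) / ≥ 550 (Blackwell)
CUDA	≥ 12.1
nvidia-smi	Must be available
pip packages	rich, pyyaml

Full test suite (recommended)

Item	Requirement
GPU	≥ 1 NVIDIA data center GPU (A100/A800/H100/H200/B200/B300 SXM)
MPI	OpenMPI ≥ 4.1
RDMA	Mellanox ConnectX-7 / BlueField
nvbandwidth	Built and installed from source
nccl-tests	Built and installed from source
gpu-burn	Built and installed from source
PyTorch	≥ 2.1 (with CUDA support)
transformers	≥ 4.30 (optional, for the training simulation)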

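Quick start

# 1. Clone the project onto the server
git clone git@github.com:qinyusen/test_gpu_scripts.git
cd test_gpu_scripts

# 2. Install dependencies (requires root)
sudo bash install_deps.sh

# 3. Run the interactive tests (GPU model is auto-detected)
python3 gpu_tester.py

# 4. Or run the full test suite in one command
python3 gpu_tester.py --test all

# 5. Specify the GPU model manually (skips auto-detection)
python3 gpu_tester.py --gpu-type b200 --test all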

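Installing dependencies

One-step install (recommended)

sudo bash install_deps.sh

The script automatically:

Installs system packages (build-essential, openmpi, infiniband-diags, perftest)
Builds nvbandwidth from source → $INSTALL_DIR/nvbandwidth/
Builds nccl-tests from source → $INSTALL_DIR/nccl-tests/build/
Builds gpu-burn from source → $INSTALL_DIR/gpu-burn/
Installs the Python packages (rich, pyyaml)
Checks the status of DCGM and the RDMA tools

The default install directory is /opt/gpu-test-tools; it can be changed through the GPU_TOOLS_DIR environment variable.

Custom install directory

sudo GPU_TOOLS_DIR=/data/tools bash install_deps.sh

Installing individual tools manually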

TOOLS=/opt/gpu-test-tools

# nvbandwidth
git clone https://github.com/NVIDIA/nvbandwidth.git $TOOLS/nvbandwidth
cd $TOOLS/nvbandwidth && mkdir build && cd build
cmake .. && make -j$(nproc)

# nccl-tests
git clone https://github.com/NVIDIA/nccl-tests.git $TOOLS/nccl-tests
cd $TOOLS/nccl-tests
make MPI=1 MPI_HOME=/usr CUDA_HOME=/usr/local/cuda -j$(nproc)

# gpu-burn
git clone https://github.com/wilicc/gpu-burn.git $TOOLS/gpu-burn
cd $TOOLS/gpu-burn && make

Usage

Interactive menu (default mode)

python3 gpu_tester.py

Shows a numbered test menu; enter a number to pick a test:

 [1]  GPU Information
 [2]  Health Check
 [3]  Memory Benchmark (nvbandwidth)
 [4]  Compute Benchmark
 [5]  NCCL Multi-GPU Test
 [6]  GPU Stress Test (PyTorch/gpu-burn)
 [7]  RDMA/IB Test
 [8]  Training Simulation
 [9]  Full Test Suite (All Tests)
 [0]  Generate Report
 [q]  Quit

Command-line mode (scripted / batch)

# Individual tests
python3 gpu_tester.py --test gpu-info
python3 gpu_tester.py --test health
python3 gpu_tester.py --test benchmark --type memory
python3 gpu_tester.py --test benchmark --type compute --dtype bf16
python3 gpu_tester.py --test nccl
python3 gpu_tester.py --test stress
python3 gpu_tester.py --test rdma
python3 gpu_tester.py --test training

# Full test suite
python3 gpu_tester.py --test all

# GPU model control
python3 gpu_tester.py --gpu-type auto --test all       # Auto-detect (default)
python3 gpu_tester.py --gpu-type h200 --test all        # Force H200
python3 gpu_tester.py --gpu-type b300 --test benchmark  # Force B300

# Use a custom configuration
python3 gpu_tester.py --config /path/to/config.yaml --test all

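GPU auto-detection

At startup the tool runs nvidia-smi --query-gpu=name to detect the GPU model, using these matching rules:

GPU name keyword	Detected as	Specs used
`A100`	A100 SXM	Ampere, 80GB HBM2e, 2.0 TB/s
`A800`	A800 SXM	Ampere, 80GB HBM2e, 2.0 TB/s
`H100`	H100 SXM5	Hopper, 80GB HBM3, 3.4 TB/s
`H200`	H200 SXM	Hopper, 141GB HBM3e, 4.8 TB/s
`B200`	B200 SXM	Blackwell, 180GB HBM3e, 8 TB/s
`B300`	B300 SXM	Blackwell Ultra, 288GB HBM3e, 8 TB/s

After detection, the matching values are selected automatically:

Peak TFLOPS (for the compute throughput efficiency percentage)
Peak memory bandwidth (for the bandwidth efficiency percentage)
TDP (for the health check power threshold)
NVLink bandwidth (for the minimum bandwidth threshold in the NCCL tests)

If detection fails or nothing matches, all peak values are shown as N/A and the tests still run normally.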
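
The keyword matching above can be sketched as follows. This is a minimal illustration, not the code in modules/gpu_info.py: the names `GPU_KEYWORDS`, `match_gpu_type`, and `detect_gpu_type` are hypothetical, and the `--format=csv,noheader` argument is an assumption added so that the nvidia-smi query returns one bare name per line.

```python
import subprocess

# Keywords from the matching table above (hypothetical constant name).
GPU_KEYWORDS = ["A100", "A800", "H100", "H200", "B200", "B300"]


def match_gpu_type(name):
    """Return the lowercase model key for a GPU name, or None if nothing matches."""
    upper = name.upper()
    for keyword in GPU_KEYWORDS:
        if keyword in upper:
            return keyword.lower()
    return None


def detect_gpu_type():
    """Ask nvidia-smi for the first GPU name; return None if the query fails."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True, text=True, check=True, timeout=10,
        ).stdout
    except (OSError, subprocess.SubprocessError):
        return None  # matches the "peaks shown as N/A" fallback described above
    names = [line.strip() for line in out.splitlines() if line.strip()]
    return match_gpu_type(names[0]) if names else None
```

Returning None rather than raising keeps the documented behaviour: an unknown GPU only loses its reference peaks, and the tests can still run.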

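Test modules in detail

1. GPU Information

Reports the hardware specs and runtime state of every GPU.

Metric	Description
Model	Auto-detects and confirms the GPU model (A100/A800/H100/H200/B200/B300)
VRAM	Total / used / free
Temperature	Live temperature
Power	Live power draw / power limit
Clocks	SM clock / memory clock
PCIe	Link generation and width (Ampere: Gen4, Hopper/Blackwell: Gen5)
Persistence Mode	Should be enabled
ECC errors	Single-bit / double-bit counts
NVLink topology	Shows the `nvidia-smi topo -m` output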

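2. Health Check

Checks GPU and system health comprehensively and reports a PASS/WARN/FAIL rating. The power limit is set automatically from the GPU model.

Check	Criteria
Temperature	< 80°C PASS, < 90°C WARN, ≥ 90°C FAIL
Power	≤ power limit × 1.05 PASS (limit auto-matched to the GPU TDP)
ECC single-bit	≤ 100 WARN, > 100 WARN
ECC double-bit	= 0 PASS, > 0 FAIL
PCIe link	≥ Gen4 x8 PASS
Clocks	> 0 PASS
Throttling	No active throttle reasons PASS
Persistence Mode	Enabled PASS
Hugepages	Configured, otherwise WARN
Swap	Disabled PASS
File descriptors	soft ≥ 65536, otherwise WARN
InfiniBand	Device present, otherwise WARN
NCCL environment variables	Lists the variables that are set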
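
As an illustration of how two of these rows turn into ratings, here is a minimal sketch. The function names are hypothetical, the thresholds are hard-coded from the table for readability (a real run would take them from the configuration), and rolling the rows up to the worst rating is an assumption rather than documented behaviour.

```python
# Ordering used to pick the worst rating (assumed aggregation rule).
SEVERITY = {"PASS": 0, "WARN": 1, "FAIL": 2}


def rate_temperature(temp_c):
    """Temperature row: < 80 PASS, < 90 WARN, otherwise FAIL."""
    if temp_c < 80:
        return "PASS"
    if temp_c < 90:
        return "WARN"
    return "FAIL"


def rate_ecc_double_bit(count):
    """ECC double-bit row: any error is a FAIL."""
    return "PASS" if count == 0 else "FAIL"


def overall_rating(ratings):
    """Assumption: the overall result is the worst individual rating."""
    return max(ratings, key=SEVERITY.__getitem__)
```

For example, `overall_rating([rate_temperature(83), rate_ecc_double_bit(0)])` yields `"WARN"` because the temperature falls in the 80-90°C band.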

3. Memory Benchmark (memory bandwidth)

Prefers NVIDIA's official nvbandwidth and falls back to PyTorch when it is unavailable.

Tests run in nvbandwidth mode:

host_to_device_memcpy_read_ce: H2D bandwidth (PCIe)
device_to_host_memcpy_write_ce: D2H bandwidth (PCIe)
device_to_device_memcpy_write_ce: D2D bandwidth (NVLink)
device_to_device_memcpy_read_ce: D2D read bandwidth
device_to_device_bidirectional_sm: D2D bidirectional bandwidth

GPU reference values (peak D2D bandwidth): A100/A800: 2,039 GB/s | H100: 3,400 GB/s | H200: 4,800 GB/s | B200/B300: 8,000 GB/s

Efficiency rating: ≥ 80% green, 50-80% yellow, < 50% red
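
The efficiency percentage and its colour band amount to a small calculation. The sketch below uses the reference peaks listed above; the names `PEAK_D2D_GB_S`, `bandwidth_efficiency_pct`, and `rating_color` are hypothetical and only illustrate the arithmetic.

```python
# Peak D2D bandwidth in GB/s, copied from the reference values above.
PEAK_D2D_GB_S = {
    "a100": 2039, "a800": 2039, "h100": 3400,
    "h200": 4800, "b200": 8000, "b300": 8000,
}


def bandwidth_efficiency_pct(measured_gb_s, gpu_type):
    """Measured bandwidth as a percentage of the model's peak; None if unknown."""
    peak = PEAK_D2D_GB_S.get(gpu_type)
    if peak is None:
        return None
    return 100 * measured_gb_s / peak


def rating_color(efficiency_pct):
    """Map an efficiency percentage to the colour bands listed above."""
    if efficiency_pct is None:
        return "n/a"
    if efficiency_pct >= 80:
        return "green"
    if efficiency_pct >= 50:
        return "yellow"
    return "red"
```

For instance, 2,720 GB/s measured on an H100 is 80% of the 3,400 GB/s reference and lands exactly on the green boundary.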

4. Compute Benchmark (compute throughput)

Measures GEMM throughput at each precision with PyTorch matmul. Peak TFLOPS are matched automatically to the GPU model.

Precision	A100/A800 peak	H100/H200 peak	B200 peak	B300 peak
FP32	19.5 TFLOPS	67 TFLOPS	90 TFLOPS	125 TFLOPS
TF32	156 TFLOPS	495 TFLOPS	1,125 TFLOPS	1,750 TFLOPS
FP16	312 TFLOPS	990 TFLOPS	2,250 TFLOPS	3,500 TFLOPS
BF16	312 TFLOPS	990 TFLOPS	2,250 TFLOPS	3,500 TFLOPS
FP8	N/A	1,979 TFLOPS	4,500 TFLOPS	7,000 TFLOPS
FP64	9.7 TFLOPS	67 TFLOPS	TBD	TBD
INT8	624 TOPS	1,979 TOPS	TBD	TBD

Defaults: 8192×8192 matrices, 50 warmup iterations, 500 timed iterations. Each GPU runs FP32/TF32/FP16/BF16/FP8/FP64/INT8, and consistency across GPUs is judged per dtype by range/mean.
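
Two pieces of arithmetic sit behind these numbers: converting a timed square GEMM into TFLOPS, and the range/mean spread used for the consistency check. The sketch below shows both under the usual convention that an n×n by n×n matmul costs 2·n³ floating-point operations; the function names are hypothetical, and no pass/fail limit is applied because the text above does not state one.

```python
def gemm_tflops(n, seconds_per_matmul):
    """TFLOPS of one n x n by n x n matmul, counted as 2 * n**3 operations."""
    return 2 * n ** 3 / seconds_per_matmul / 1e12


def spread_pct(per_gpu_tflops):
    """Range divided by mean, in percent, across GPUs for one dtype."""
    mean = sum(per_gpu_tflops) / len(per_gpu_tflops)
    return 100 * (max(per_gpu_tflops) - min(per_gpu_tflops)) / mean
```

With the default 8192×8192 size, one matmul is about 1.1 × 10¹² operations, so a kernel that takes 2 ms per iteration would be running at roughly 550 TFLOPS.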

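5. NCCL Multi-GPU Test (multi-GPU communication)

Prefers the official nccl-tests (launched through mpirun) and parses the real bus BW; if only the torchrun fallback can run, the acceptance result is marked FAIL.

Operation	Description
AllReduce	The most commonly used collective
AllToAll	Key operation for model parallelism
Broadcast	Parameter synchronization
ReduceScatter	Required
AllGather	Required
SendRecv	Required

By default, following the acceptance PDF, three sizes are tested (1MB, 256MB, 2GB) and each op is repeated 3 times. The worst bus BW and the standard deviation are reported; a standard deviation above 3% is a FAIL.

NVLink reference bandwidth: A100/A800 ≥ 240 GB/s | H100/H200 ≥ 360 GB/s | B200/B300 ≥ 720 GB/s (40% of NVLink peak)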
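
The reference thresholds follow from the NVLink peaks in the spec table (600, 900, and 1,800 GB/s), and the repeat logic reduces to a worst-case value plus a relative standard deviation. The sketch below is one way to express that; the names are hypothetical, and the use of the population standard deviation relative to the mean is an assumption, since the text does not say which estimator is used.

```python
import statistics

# NVLink peak bandwidth in GB/s per model, from the spec table.
NVLINK_PEAK_GB_S = {
    "a100": 600, "a800": 600, "h100": 900,
    "h200": 900, "b200": 1800, "b300": 1800,
}


def min_bus_bw_gb_s(gpu_type, fraction_pct=40):
    """Minimum acceptable bus bandwidth: a percentage of the NVLink peak."""
    return NVLINK_PEAK_GB_S[gpu_type] * fraction_pct / 100


def judge_repeats(bus_bw_runs, min_bw_gb_s, max_stddev_pct=3.0):
    """Return (worst, stddev_pct, passed) for repeated runs of one op and size."""
    worst = min(bus_bw_runs)
    mean = statistics.mean(bus_bw_runs)
    # Assumption: population stddev relative to the mean, in percent.
    stddev_pct = 100 * statistics.pstdev(bus_bw_runs) / mean
    passed = worst >= min_bw_gb_s and stddev_pct <= max_stddev_pct
    return worst, stddev_pct, passed
```

Judging on the worst run rather than the mean means a single slow repeat is enough to fail the op, which matches the "worst bus BW" wording above.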

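6. GPU Stress Test

By default runs a long, high-power, full-load test with PyTorch BF16/FP16 GEMM; gpu-burn can be enabled in the configuration instead. During the test it samples temperature, power, throttle state, and XID events, and computes steady-state power, temperature spread, and TFLOPS jitter.

Parameter	Default	Description
duration_sec	1800	Test duration (seconds)
use_tensor_cores	true	Use Tensor Cores
memory_pct	90	Memory usage percentage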

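7. RDMA/IB Test (network test)

Detects InfiniBand devices and measures bandwidth and latency.

Test	Tool
Write bandwidth	ib_write_bw
Read bandwidth	ib_read_bw
Write latency	ib_write_lat
Read latency	ib_read_lat

Reference thresholds: port ACTIVE at ≥ 400 Gbps; 4MB write/read bandwidth ≥ 47 GB/s; 8B write latency ≤ 2 μs and read latency ≤ 3.5 μs; PFC/ECN/CNP/congestion counters at 0.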
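
To put the bandwidth threshold in context: a 400 Gbps port carries at most 400 / 8 = 50 GB/s, so 47 GB/s is 94% of line rate. The sketch below encodes the bandwidth and latency thresholds as a simple check; the function names and the result keys (`write_bw_gb_s` and so on) are hypothetical and do not reflect the tool's actual report schema.

```python
def line_rate_gb_per_s(port_rate_gbps):
    """Convert a port rate in Gbit/s to GB/s (8 bits per byte)."""
    return port_rate_gbps / 8


def rdma_failures(result, min_bw_gb_s=47.0,
                  max_write_lat_us=2.0, max_read_lat_us=3.5):
    """Return the names of the metrics that miss the reference thresholds."""
    failures = []
    for key in ("write_bw_gb_s", "read_bw_gb_s"):
        if result[key] < min_bw_gb_s:
            failures.append(key)
    if result["write_lat_us"] > max_write_lat_us:
        failures.append("write_lat_us")
    if result["read_lat_us"] > max_read_lat_us:
        failures.append("read_lat_us")
    return failures
```

An empty list means all four metrics are within the thresholds; port state and the PFC/ECN/CNP counters would need separate checks.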

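8. Training Simulation

By default runs an 8-GPU DDP training simulation of a synthetic 1.5B Transformer.

Mode	Description
DDP synthetic model	About 1.5B parameters, 8-GPU torchrun
Single-process fallback	For debugging only; counts as FAIL for production acceptance

Output: tokens/sec, step time, step jitter after warmup, peak memory, and final loss; the loss is also checked for NaN/Inf.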
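
The reported metrics can be derived from per-step timings roughly as follows. This is a sketch under stated assumptions, not the logic in modules/training_sim.py: it assumes batch_size is per rank, defines jitter as the population standard deviation over the mean of the post-warmup steps, and uses hypothetical function names.

```python
import math
import statistics


def tokens_per_sec(batch_size, seq_length, world_size, step_time_s):
    """Throughput for one step, assuming batch_size is per rank."""
    return batch_size * seq_length * world_size / step_time_s


def step_jitter_pct(step_times_s, warmup_steps=5):
    """Assumed definition: stddev / mean of step time after warmup, in percent."""
    steady = step_times_s[warmup_steps:]
    return 100 * statistics.pstdev(steady) / statistics.mean(steady)


def loss_is_valid(loss):
    """False for NaN or +/-Inf, True for any finite loss."""
    return math.isfinite(loss)
```

With the default configuration (batch_size 8, seq_length 2048, 8 ranks), one step covers 131,072 tokens under the per-rank assumption.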

Configuration

Configuration file path: configs/default.yaml

# GPU type: auto-detect or override to a100/a800/h100/h200/b200/b300
gpu_type: auto

tools:
  install_dir: /opt/gpu-test-tools    # Install directory for third-party tools

benchmark:
  memory:
    nvbandwidth_buffer_mb: 512          # nvbandwidth buffer size
    nvbandwidth_samples: 3              # nvbandwidth sample count
  compute:
    dtypes: [fp32, tf32, fp16, bf16, fp8, fp64, int8]
    matrix_size: 8192                   # GEMM matrix dimension
    warmup: 50
    iterations: 500

health:
  temp_warning: 75                      # Temperature warning threshold, °C
  temp_critical: 85                     # Temperature critical threshold, °C
  power_limit: null                     # null = auto-match the GPU TDP

nccl:
  min_bandwidth_gbps: null              # null = 40% of the GPU's NVLink peak
  test_allreduce: true
  test_alltoall: true
  test_broadcast: true
  test_reduce_scatter: true
  test_allgather: true
  test_sendrecv: true
  message_sizes: [1M, 256M, 2G]
  repeats: 3
  max_stddev_pct: 3

multinode_nccl:
  enabled: false                        # when true, included in --test all
  hosts:
    - {name: nccl-gpu-1, addr: 172.72.8.12, slots: 8}
    - {name: nccl-gpu-2, addr: 172.72.8.16, slots: 8}
  tests: [all_reduce_perf, alltoall_perf]
  topologies:
    - {nodes: 2, gpus_per_node: 8}
  mpirun_path: /usr/mpi/gcc/openmpi-4.1.9a1/bin/mpirun
  extra_ld_library_path:                # MPI/NCCL/CUDA library paths passed to remote ranks
    - /usr/mpi/gcc/openmpi-4.1.9a1/lib
    - /root/gpu-test-venv/lib/python3.10/site-packages/nvidia/nccl/lib
    - /usr/local/cuda-12.4/targets/x86_64-linux/lib
  begin_size: 1k
  end_size: 16g
  step_factor: 2
  warmup_iters: 10
  socket_ifname: bond0
  ib_gid_index: 3
  ib_hca: mlx5_0,mlx5_1,mlx5_6,mlx5_7

stress:
  duration_sec: 1800                   # Stress test duration (seconds)
  use_gpu_burn: false                  # Defaults to the PyTorch GEMM stress test
  dtype: bf16
  matrix_size: 24576
  telemetry_interval_sec: 1
  min_power_watts: 630
  max_tflops_jitter_pct: 5
  require_tflops_jitter: true
  use_tensor_cores: true

rdma:
  min_bandwidth_gbps: 47              # Minimum acceptable RDMA bandwidth (47 GB/s per the reference threshold)
  min_port_rate_gbps: 400             # Minimum IB port rate
  max_write_latency_us: 2.0
  max_read_latency_us: 3.5
  msg_size: 4194304                   # 4MB message for bandwidth tests
  latency_msg_size: 8                 # 8B message for latency tests
  server_addr: null                   # Peer IP for perftest in client mode
  ibping_target: null                 # ibping peer LID/GID, not an IP
  role: auto                          # auto / server / client
  pfc_ecn_counters: true

nvlink:
  expected_links_per_gpu: 18
  expected_link_speed_gbps: 25
  require_zero_errors: true

dcgm:
  diag_level: 3
  timeout_sec: 3600
  expected_num_gpus: 8
  json_output: true
  require_subtests: true

training:
  model: synthetic_1.5b                # 8-GPU synthetic Transformer
  batch_size: 8
  seq_length: 2048
  num_steps: 50
  warmup_steps: 5
  dtype: bf16
  mode: ddp
  min_tokens_per_sec: 45000
  max_step_jitter_pct: 3

report:
  output_dir: ./reports
  format: json                         # json / html / md

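Test SOPs (standard operating procedures)

SOP-1: Acceptance of newly delivered servers

When to use: A GPU server is racked for the first time and the hardware must be confirmed complete and usable. Supports A100/A800/H100/H200/B200/B300.

Step 1: Prepare the environment
├── Confirm the OS is installed (Ubuntu 22.04 recommended)
├── Confirm the NVIDIA driver is installed (nvidia-smi works)
├── Run: sudo bash install_deps.sh
└── Confirm every tool installed successfully

Step 2: Verify GPU information
├── python3 gpu_tester.py --test gpu-info
├── Confirm: the GPU model is auto-detected
├── Verify: the GPU count matches the purchase specification
├── Verify: the model matches expectations (A100/A800/H100/H200/B200/B300)
├── Verify: total VRAM matches the spec (A100: 80GB, A800: 80GB, H100: 80GB, H200: 141GB, B200: 180GB, B300: 288GB)
├── Verify: the PCIe link is normal (Ampere Gen4 x16, Hopper/Blackwell Gen5 x16)
└── Verify: the NVLink topology is displayed correctly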

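Step 3: Health diagnostics
├── python3 gpu_tester.py --test health
├── Confirm: every check is PASS
├── Watch closely: ECC double-bit errors = 0
├── Watch closely: temperature < 80°C (idle)
├── Watch closely: no throttle reasons
└── On any WARN/FAIL: record the issue and contact the vendor

Step 4: Memory bandwidth baseline
├── python3 gpu_tester.py --test benchmark --type memory
├── Confirm: D2D bandwidth efficiency ≥ 90% (compared automatically against the GPU peak)
└── Below 80%: check cooling / ECC / firmware version

Step 5: Compute throughput baseline
├── python3 gpu_tester.py --test benchmark --type compute
├── Confirm: TFLOPS at each precision ≥ 80% of peak (compared automatically against the GPU spec)
└── Abnormally low: check the power limit, clocks, and driver version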

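Step 6: NCCL multi-GPU communication
├── python3 gpu_tester.py --test nccl
├── Confirm: AllReduce/AllToAll bus bandwidth ≥ the minimum threshold (computed automatically from the NVLink bandwidth)
└── Abnormally low: check the NVLink connections and NVSwitch status

Step 7: Stress and stability
├── Edit configs/default.yaml: stress.duration_sec = 600 (10 minutes)
├── python3 gpu_tester.py --test stress
├── Confirm: every GPU is PASS
├── Observe during the test: temperature stays at or below 90°C
└── Observe during the test: no growth in ECC errors

Step 8: Generate the acceptance report
├── python3 gpu_tester.py --test all
├── Check the report files under reports/
└── Keep the report as acceptance evidence

Acceptance criteria:

No FAIL in any of the 8 tests
Memory bandwidth efficiency ≥ 90% (compared automatically against the detected GPU's peak)
Compute throughput efficiency ≥ 80%
NCCL bandwidth ≥ the minimum threshold (computed automatically)
10-minute stress test with no errors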

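SOP-2: Routine inspection

When to use: Periodic health checks on servers already in production.

Frequency: once a week, or during maintenance windows

Steps:
1. python3 gpu_tester.py --test health
2. Watch closely:
   - Whether ECC errors are growing
   - Whether temperature is abnormally high
   - Whether the PCIe link has degraded
   - Whether throttling appears
3. Handling anomalies:
   - ECC double-bit errors > 0: isolate the GPU immediately and contact NVIDIA
   - Sustained throttling: check cooling (fans / liquid cooling)
   - PCIe degradation: reseat or replace the riser cable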

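SOP-3: Multi-node cluster acceptance

When to use: Several GPU servers form a training cluster and inter-node communication must be verified.

Prerequisite: every node has passed SOP-1 on its own

Step 1: Single-node acceptance
├── Run SOP-1 on every node
└── Make sure all single-node tests pass

Step 2: RDMA network test
├── python3 gpu_tester.py --test rdma
├── Confirm: IB devices are detected
├── Confirm: port state is ACTIVE at ≥ 400 Gbps
├── Confirm: 4MB write/read bandwidth ≥ 47 GB/s
├── Confirm: 8B write latency ≤ 2 μs and read latency ≤ 3.5 μs
├── Confirm: ibping works in both directions
├── Confirm: PFC/ECN/CNP/congestion counters are 0
└── On anomalies: check the IB cables, switch configuration, and subnet manager

Step 3: Multi-node NCCL test
├── On the launch node, confirm that mpirun, nccl-tests, and cross-node root SSH all work
├── Set multinode_nccl.hosts and the IB parameters in configs/default.yaml
├── Run the PDF-style sweep:
│   python3 gpu_tester.py --test multinode-nccl --report --format md
├── Default command form:
│   mpirun -H <node1>:8,<node2>:8 --map-by ppr:8:node -np 16 \
│     all_reduce_perf/alltoall_perf -b 1k -e 16g -f 2 -g 1 -w 10
└── Confirm: Peak Bus BW, Peak Size, and wrong_count are normal

Step 4: Training validation
├── python3 gpu_tester.py --test training
├── Optional: load a larger model (for example a llama model)
└── Confirm: the training loss decreases normally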
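
The default mpirun command in step 3 can be assembled from a multinode_nccl.hosts-style list as sketched below. The helper name `build_mpirun_cmd` is hypothetical, the sketch assumes every host exposes the same number of slots, and it only reproduces the flags shown in step 3; it does not add the environment or library-path handling a real launch would need.

```python
def build_mpirun_cmd(hosts, binary, begin="1k", end="16g", factor=2, warmup=10):
    """Build the argv list for one nccl-tests binary across all hosts."""
    slots = {h["slots"] for h in hosts}
    if len(slots) != 1:
        raise ValueError("this sketch assumes the same slot count on every host")
    per_node = slots.pop()
    host_arg = ",".join(f"{h['addr']}:{h['slots']}" for h in hosts)
    return [
        "mpirun", "-H", host_arg,
        "--map-by", f"ppr:{per_node}:node",
        "-np", str(per_node * len(hosts)),
        binary,
        "-b", begin, "-e", end, "-f", str(factor), "-g", "1", "-w", str(warmup),
    ]


hosts = [
    {"name": "nccl-gpu-1", "addr": "172.72.8.12", "slots": 8},
    {"name": "nccl-gpu-2", "addr": "172.72.8.16", "slots": 8},
]
print(" ".join(build_mpirun_cmd(hosts, "all_reduce_perf")))
```

Returning a list rather than a shell string keeps the command safe to pass to `subprocess.run` without quoting concerns.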

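Multi-node NCCL deep diagnostics

When the multi-node NCCL results from SOP-3 disagree with the acceptance PDF, run the deep diagnostics script on the launch node to reproduce the counter capture, the GRAPH/TUNING logs, and the PXN-disabled sweep:

bash scripts/multinode_nccl_deep_diagnose.sh preflight
bash scripts/multinode_nccl_deep_diagnose.sh all

For detailed parameters, output directories, and how to read the results, see docs/multinode_nccl_deep_diagnose_runbook.md.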

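SOP-4: Fault diagnosis

When to use: Anomalies appear during training (loss spikes, GPUs dropping off, OOM).

Step 1: Quick diagnosis
├── python3 gpu_tester.py --test health
├── python3 gpu_tester.py --test gpu-info
└── Record every WARN/FAIL item

Step 2: Locate the faulty GPU
├── Inspect the nvidia-smi output
├── Look for: GPUs with abnormal temperature, ECC, or power
└── On the faulty GPU, run:
    python3 gpu_tester.py --test stress
    (set stress.gpus to the faulty GPU's index)

Step 3: Communication checks
├── python3 gpu_tester.py --test nccl
├── If AllReduce bandwidth is abnormally low:
│   - Check the NVLink connections: nvidia-smi nvlink -s
│   - Check NVSwitch: nvidia-smi nvswitch -a
│   - Reset the GPU: nvidia-smi -i <id> -r
└── If the multi-node case is abnormal:
    python3 gpu_tester.py --test rdma

Step 4: Firmware / driver checks
├── nvidia-smi -q | head -20  (shows the driver/CUDA versions)
├── Confirm the driver version meets the requirement (Ampere ≥ 470, Hopper ≥ 535, Blackwell ≥ 550)
├── Confirm the firmware version matches the rest of the cluster
└── Update if necessary: apt upgrade nvidia-driver-*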

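SOP-5: Periodic benchmark regression

When to use: After a firmware or driver upgrade, to confirm performance has not regressed.

Frequency: after every change, or once a month

Steps:
1. Before the change, run the full test suite and save the baseline report:
   python3 gpu_tester.py --test all

2. Apply the change (driver upgrade, firmware update, etc.)

3. After the change, run again:
   python3 gpu_tester.py --test all

4. Compare the two reports:
   - Memory bandwidth deviation < 5%
   - Compute throughput deviation < 5%
   - NCCL bandwidth deviation < 10%

5. If performance has regressed:
   - Check whether the power limit changed
   - Check whether clocks dropped
   - Roll back the driver to verify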
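
The comparison in step 4 can be automated along these lines. The metric keys (`memory_bandwidth` and so on) are hypothetical placeholders, since the JSON report schema is not described here; only the 5% / 5% / 10% tolerances come from the steps above.

```python
# Tolerances in percent, from step 4 above. Metric keys are hypothetical.
TOLERANCE_PCT = {
    "memory_bandwidth": 5.0,
    "compute_throughput": 5.0,
    "nccl_bandwidth": 10.0,
}


def deviation_pct(baseline, current):
    """Absolute deviation of current from baseline, in percent of baseline."""
    return 100 * abs(current - baseline) / baseline


def find_regressions(baseline, current):
    """Return {metric: deviation_pct} for metrics at or beyond their tolerance."""
    flagged = {}
    for metric, limit in TOLERANCE_PCT.items():
        dev = deviation_pct(baseline[metric], current[metric])
        if dev >= limit:
            flagged[metric] = round(dev, 2)
    return flagged
```

Because the deviation is absolute, an unexpected improvement beyond the tolerance is flagged as well, which is usually worth a look after a driver or firmware change.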

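Report output

Test results are saved automatically under reports/.

JSON format

python3 gpu_tester.py --test all
# Report location: ./reports/gpu_report_<timestamp>.json

Contains the complete data for every test and can be used for automated analysis.

HTML format

python3 gpu_tester.py --test all --format html --output report.html

Generates a dark-themed visual report that includes:

GPU spec overview
Health check PASS/FAIL status
Memory bandwidth efficiency chart
Compute throughput comparison (each precision vs. peak)
Training simulation metrics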

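Troubleshooting

Problem	Cause	Solution
`nvidia-smi not found`	Driver not installed	Install the NVIDIA driver (Ampere ≥ 470, Hopper ≥ 535, Blackwell ≥ 550)
`nvbandwidth not found`	Not built or installed	Run `install_deps.sh` or build it manually
`nccl-tests not found`	Not built or installed	Run `install_deps.sh` and confirm CUDA_HOME is correct
`mpirun not found`	MPI not installed	`apt install openmpi-bin libopenmpi-dev`
`gpu_burn not found`	Not built or installed	Run `install_deps.sh` or run `make` manually
NCCL bandwidth abnormally low	NVLink/NVSwitch problem	Check `nvidia-smi nvlink -s`; reseat the hardware
Memory bandwidth below expectations	ECC / cooling problem	Check temperature, confirm ECC is enabled, update firmware
Training simulation OOM	Not enough VRAM	Reduce batch_size or seq_length
RDMA test times out	IB not configured	Check `ibstat` and confirm the subnet manager (SM) is running
PyTorch import fails	torch not installed	`pip install torch --index-url https://download.pytorch.org/whl/cu121`
DCGM not detected	Not installed	`apt install datacenter-gpu-manager`
CUDA_HOME error	Environment variable not set	`export CUDA_HOME=/usr/local/cuda`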

Key GPU spec reference

The GPU model is auto-detected; the reference specs for each model follow (dense TFLOPS):

Parameter	A100 SXM	A800 SXM	H100 SXM5	H200 SXM	B200 SXM	B300 SXM
Architecture	Ampere	Ampere	Hopper	Hopper	Blackwell	Blackwell Ultra
Compute capability	8.0	8.0	9.0	9.0	10.0	10.0
HBM capacity	80 GB (HBM2e)	80 GB (HBM2e)	80 GB (HBM3)	141 GB (HBM3e)	180 GB (HBM3e)	288 GB (HBM3e)
Memory bandwidth	2,039 GB/s	2,039 GB/s	3,400 GB/s	4,800 GB/s	8,000 GB/s	8,000 GB/s
TDP	400W	400W	700W	700W	1,000W	1,200W
FP32	19.5 TFLOPS	19.5 TFLOPS	67 TFLOPS	67 TFLOPS	90 TFLOPS	125 TFLOPS
TF32 (dense)	156 TFLOPS	156 TFLOPS	495 TFLOPS	495 TFLOPS	1,125 TFLOPS	1,750 TFLOPS
FP16/BF16 (dense)	312 TFLOPS	312 TFLOPS	990 TFLOPS	990 TFLOPS	2,250 TFLOPS	3,500 TFLOPS
FP8 (dense)	N/A	N/A	1,979 TFLOPS	1,979 TFLOPS	4,500 TFLOPS	7,000 TFLOPS
NVLink	Gen 3, 600 GB/s	Gen 3, 600 GB/s	Gen 4, 900 GB/s	Gen 4, 900 GB/s	Gen 5, 1,800 GB/s	Gen 5, 1,800 GB/s
PCIe	Gen4 x16	Gen4 x16	Gen5 x16	Gen5 x16	Gen5 x16	Gen5 x16
Minimum driver	470	470	535	535	550	550
Minimum CUDA	11.0	11.0	12.1	12.1	12.4	12.4


GPU Training Server Test Suite

Current H100 acceptance entry points