Add H100 acceptance test coverage and reports

cs 2026-05-23 10:41:09 +08:00
parent dd77a882f1
commit 86f15544d7
44 changed files with 6938 additions and 190 deletions

.gitignore (vendored, 1 change)

@ -15,3 +15,4 @@ reports/
venv/ venv/
.qoder/* .qoder/*
.claude/settings.local.json .claude/settings.local.json
.omx/


@ -0,0 +1,85 @@
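# H100 PDF Acceptance Items vs. Current `test all` Coverage

Compared:

- PDF: `/Users/d-robotics/Downloads/H100_production_acceptance.pdf`
- Current script: `python gpu_tester.py --config configs/default.yaml --test all --report --format md`
- Scope: a single node with 8 H100 GPUs. Cross-node NCCL/RDMA is not included in this round.

## Conclusion

The current `test all` has grown from a "functional inspection" into a single-node suite that is close to production acceptance: GPU health, NVLink/NVSwitch, HBM/PCIe/NVLink bandwidth, compute, NCCL, stress, local RDMA ports, DCGM, and training simulation all run under the same `all`.

The latest stress smoke test confirms that the PyTorch BF16 GEMM load can push both machines into the power range the PDF requires:

- `aikubeworker0012`: 45-second smoke test, steady-state average power about `697-698 W/GPU`, TFLOPS jitter `4.07%`, XID `0`, but temperature delta `12C` and `clocks_throttle_reasons.active=0x4`. Strict FAIL per the PDF.
- `aikubeworker0016`: 45-second smoke test, steady-state average power about `697-699 W/GPU`, TFLOPS jitter `3.77%`, XID `0`, but temperature delta `8C` and `clocks_throttle_reasons.active=0x4`. Strict FAIL per the PDF.

In other words, the blocker is no longer that the script cannot saturate the H100. It is that under full-power stress the machines miss two strict PDF gates: temperature delta `<=5C`, and `Throttle Reasons` equal to `0x0` for the whole run.

For a final acceptance strictly per the PDF, the following is still missing:

1. 24-hour metrics are not covered: the PDF requires the 24h SBE growth rate and long steady-state observation. The current `all` is a single snapshot plus a 30-minute stress run, which is not equivalent to a 24-hour burn-in.
2. Cross-node items are intentionally skipped this round: the PDF's IB/RDMA production acceptance normally needs two-ended `ib_write_bw/read_bw/lat` and `ibping`. Per your request this round covers single-node first, so cross-node is not included.
3. PFC/ECN/AER coverage depends on the system counters the machine exposes: the script reads whatever sysfs counters it can find plus dmesg, but if switch-side PFC/ECN is not exposed on the host, the network side still has to supply evidence.
4. The NCCL 1MB tier fails under the strict threshold: measured 1MB AllReduce bus BW is about 23 GB/s, while 256MB AllReduce, verified with `nccl-tests`, is about 421 GB/s. If the PDF requires 1MB to reach 405 GB/s as well, this item is not "untested"; it will be judged FAIL.
5. Stress already meets the power and jitter requirements, but the short test has exposed strict FAILs on temperature delta and throttle. A full 1800-second run only produces more formal evidence; it does not change the criterion by itself.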
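
The strict stress gates discussed above can be sketched as a small check. This is only an illustration, not the logic in `gpu_tester.py`: the `StressSummary` class and `strict_stress_failures` function are hypothetical names, and the thresholds (630 W, 5% jitter, 5C delta, throttle `0x0`, XID 0) are taken from the figures quoted in this document.

```python
from dataclasses import dataclass


@dataclass
class StressSummary:
    """Hypothetical container for steady-state stress results."""
    avg_power_w: float        # steady-state average power per GPU
    tflops_jitter_pct: float  # steady-state TFLOPS jitter
    temp_delta_c: float       # hottest GPU minus coolest GPU
    throttle_active: int      # OR of all sampled throttle-reason masks
    xid_count: int            # new XID events during the run


def strict_stress_failures(s, min_power_w=630.0, max_jitter_pct=5.0,
                           max_temp_delta_c=5.0):
    """Return the names of the strict gates that the run misses."""
    failures = []
    if s.avg_power_w < min_power_w:
        failures.append("power")
    if s.tflops_jitter_pct > max_jitter_pct:
        failures.append("jitter")
    if s.temp_delta_c > max_temp_delta_c:
        failures.append("temp_delta")
    if s.throttle_active != 0x0:
        failures.append("throttle")
    if s.xid_count != 0:
        failures.append("xid")
    return failures


# Smoke-test figures for aikubeworker0012 as quoted in this document.
worker0012 = StressSummary(697.0, 4.07, 12.0, 0x4, 0)
print(strict_stress_failures(worker0012))
```

With the quoted smoke-test numbers, power and jitter pass while the temperature delta and throttle gates fail, which matches the strict FAIL described above.
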
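## Coverage table

| PDF acceptance item | Current `test all` status | What is still missing |
|---|---:|---|
| GPU basic info, Driver/CUDA | Covered | Nothing; driver, CUDA, and GPU model are recorded |
| Temperature thresholds: steady state ≤75C, peak ≤85C | Covered by the health snapshot; the stress item covers ≤80C | The 24h steady-state curve is not part of a single `all` run |
| Idle power ≤100W/card | Partially covered | Health samples power, but the idle criterion is not yet a standalone acceptance item |
| Stress power ≥630W/card | Covered; short tests on both machines show about 697-699 W/GPU | Full 1800-second run still to be done |
| Throttle reasons active=0x0 | Covered; short tests on both machines show 0x4 | Strict FAIL per the PDF; this is not an item the script skips |
| DBE/SBE/retired pages | Partially covered | Retired pages and kernel errors are checked; the 24h SBE growth rate is not |
| PCIe Gen5 x16 | Partially covered | Visible in GPU info/topology; Replay/AER depends on dmesg/sysfs and may need extra motherboard-side evidence |
| Fabric Manager active with no ERROR | Covered | Nothing; health checks systemd and the journal |
| NVLink: 18 links/GPU, 25GB/s/link, zero errors | Covered | Nothing; a new `nvlink` item was added |
| D2D/H2D/D2H bandwidth | Covered | Depends on `nvbandwidth`, which both machines have |
| 8x8 P2P matrix off-diagonal mean/min/deviation | Covered | Nothing; parsed from the nvbandwidth JSON |
| Compute FP32/TF32/FP16/BF16/FP8/FP64/INT8 | Covered | INT8 uses the PyTorch `_int_mm` path; a vendor-standard INT8 kernel would need a different implementation |
| NCCL AllReduce/AllGather/ReduceScatter/Broadcast/SendRecv/AllToAll | Covered | Nothing; `nccl-tests` is built on both machines |
| NCCL 1MB/256MB/2GB, repeat 3, stddev ≤3% | Covered | Under the strict PDF thresholds the 1MB tier will most likely FAIL; 256MB AllReduce measured about 421GB/s with `nccl-tests` on both machines |
| Stress ≥30min, BF16/FP16 GEMM 8192, 1s telemetry | Covered; default is BF16 GEMM `24576`, 1s telemetry, steady-state verdict after warmup | Full 1800-second run pending; the short test already exposed temperature-delta/throttle FAILs |
| DCGM `dcgmi diag -r 3` | Covered; DCGM 4.5.3 is installed and the service is enabled | The full `-r 3` run PASSed on both machines; logs are at `/root/test_gpu_scripts/reports/dcgm_r3_*_20260522_17010*.log` |
| RDMA port ACTIVE, 400Gbps | Partially covered | Ports can be checked on a single node; strict two-ended throughput/latency is not run this round |
| RDMA write/read bw ≥47GB/s, latency ≤2/3.5us | Partially covered | Single-machine localhost/perftest is not equivalent to cross-node line-rate acceptance |
| PFC/ECN errors=0, ibping OK in both directions | Partially covered | Counters readable on the host are checked; switch-side counters and cross-node ibping are not covered |
| 1.5B synthetic Transformer, BF16, 8 GPUs, ≥45k tokens/s | DDP path covered | The 8-process DDP smoke test passed; the full 50-step run is pending |
| Any sub-item FAIL makes the overall acceptance FAIL | Covered | `all` now exits non-zero according to the strict verdict |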
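
The last row of the table, "any sub-item FAIL fails the whole acceptance and `all` exits non-zero", can be sketched as follows. The function names and the dictionary shape are hypothetical, and treating every status other than `PASS` as a failure is an assumption made for this sketch rather than confirmed behavior of `gpu_tester.py`.

```python
def overall_verdict(results):
    """results maps a sub-item name to its status string.

    Assumption for this sketch: only an explicit "PASS" counts as passing,
    so "FAIL", "WARN", or missing evidence all fail the whole machine.
    """
    failed = sorted(name for name, status in results.items() if status != "PASS")
    return ("PASS" if not failed else "FAIL"), failed


def exit_code(results):
    """Process exit code a strict runner could return: 0 only on overall PASS."""
    return 0 if overall_verdict(results)[0] == "PASS" else 1


example = {"health": "PASS", "nvlink": "PASS", "nccl": "FAIL", "stress": "FAIL"}
print(overall_verdict(example), exit_code(example))
```
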
## If you run `all` right now

Recommended command:
```bash
cd /root/test_gpu_scripts
/root/gpu-test-venv/bin/python gpu_tester.py --config configs/default.yaml --test all --report --format json --output reports/h100_all_$(hostname)_$(date +%Y%m%d_%H%M%S).json
```
To generate the Chinese-language Markdown report directly, use:
```bash
cd /root/test_gpu_scripts
/root/gpu-test-venv/bin/python gpu_tester.py --config configs/default.yaml --test all --report --format md --output reports/h100_all_$(hostname)_$(date +%Y%m%d_%H%M%S).md
```
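Expected behavior:

- Runs the full single-node suite. Stress defaults to 1800 seconds, uses the PyTorch BF16 GEMM load by default, and samples telemetry/XID every second.
- The default stress matrix size is `24576`, chosen to push the H100 to ≥630W/GPU. The PDF only requires `matrix_size >=8192`; the larger size is there to meet the power gate.
- NCCL runs 6 ops × 3 message sizes × 3 repeats.
- DCGM runs `dcgmi diag -r 3 -n gpu:8 -j`. The DCGM toolchain is installed and running; `diag -r 1` and the standalone `r3` long runs on both machines have PASSed.
- The NCCL 1MB tier will also fail against the 405GB/s threshold; 256MB AllReduce has been verified through `nccl-tests` at about 421GB/s on both machines.
- Stress is expected to FAIL under the strict PDF criteria: the short-test evidence shows a temperature delta above 5C and throttle active at `0x4`.
- Cross-node RDMA/NCCL is not part of this single-node `all`.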
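
To make the size of the NCCL matrix concrete, the sketch below enumerates 6 ops × 3 message sizes × 3 repeats. The op and size labels follow this document; the `nccl_run_plan` helper and the byte values attached to each label are illustrative assumptions, not the script's actual data structures.

```python
from itertools import product

# Op names and size labels as listed in this document.
OPS = ["allreduce", "allgather", "reducescatter", "broadcast", "sendrecv", "alltoall"]
# Assumed byte values for the labels (binary megabytes/gigabytes).
MESSAGE_SIZES = {"1M": 1 << 20, "256M": 256 << 20, "2G": 2 << 30}
REPEATS = 3


def nccl_run_plan():
    """Return every (op, size_label, repeat_index) combination to run."""
    return [(op, size, rep)
            for op, size, rep in product(OPS, MESSAGE_SIZES, range(1, REPEATS + 1))]


plan = nccl_run_plan()
print(len(plan), plan[0], plan[-1])
```

The plan has 54 entries, so a slow or unstable op/size pair shows up as three repeated samples rather than a single number.
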
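## Current minimal gap list

1. For strict RDMA production acceptance, run two-ended server/client tests across the two machines in the next round.
2. Run the full 1.5B DDP 50-step training acceptance and archive tokens/s, jitter, GPU memory, and loss.
3. Run the full 1800-second stress test and archive 1-second telemetry, XID, throttle, power, and temperature. The current expectation is a FAIL on temperature delta/throttle.
4. For 24-hour acceptance, add a 24h monitor mode that records SBE growth rate, XID, temperature, power, and clock-throttling curves.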


@ -0,0 +1,100 @@
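# H100 Production Acceptance Standard vs. Current `gpu_tester.py --test all` Coverage Gaps

Reference file: `/Users/d-robotics/Downloads/H100_production_acceptance.pdf`

Compared against: the current repository running `python gpu_tester.py --test all --report --format md/json`

## Conclusion

The repository's `test all` covers the broad categories of the acceptance document, but it is not yet a complete H100 production acceptance.

It runs 8 modules:

1. GPU Information
2. Health Check
3. Memory Benchmark
4. Compute Benchmark
5. NCCL Test
6. GPU Stress Test
7. RDMA/IB Test
8. Training Simulation

Measured against the PDF's production acceptance standard, these key items are still missing:

- Per-link acceptance of active state, speed, and error counters for the 18 NVLink links on each GPU
- DCGM `dcgmi diag -r 3`
- 30-60 minute burn-in with 1-second sampling of temperature/power/throttle/XID
- Performance acceptance with the official `nccl-tests`, including the three message sizes 1MB/256MB/2GB, 3 repeats taking the worst value, and standard deviation
- RDMA at production criteria: 4MB bandwidth, 8B latency, PFC/ECN errors, bidirectional ibping
- Per-GPU compute consistency across the 8 GPUs, requiring range/mean <= 3% for the same dtype
- FP64 and INT8 compute items
- The training item should be an 8-GPU 1.5B synthetic Transformer, judged on 45k tokens/s, step jitter, GPU memory, and loss health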
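## Coverage matrix

| PDF acceptance item | Covered by `test all`? | Current coverage | Main gaps |
| --- | --- | --- | --- |
| 1. Health check | Partially | Temperature, power, ECC, PCIe, clocks, throttle, persistence, IB devices | Idle power <=100W not judged separately; stress power >=630W not judged; retired pages not checked; 24h SBE growth rate not checked; AER/Replay errors not checked; fabricmanager service and ERROR logs not checked |
| 2. NVLink topology and links | Partially | GPU info saves `nvidia-smi topo -m` | `nvidia-smi nvlink -s/-c/-e` not run; 18 NVLinks per GPU not verified; 25GB/s per link not verified; CRC/Replay/Recovery error = 0 not verified |
| 3. Memory Bandwidth | Partially | Uses nvbandwidth to measure H2D, D2H, D2D write/read/bidir | Full 8x8 P2P matrix not output; off-diagonal mean >=360GB/s, min >=320GB/s, and deviation from mean <=±5% not checked; the D2D criteria are not yet fully aligned with the PDF's single-GPU/P2P acceptance criteria |
| 4. Compute Throughput | Mostly | Default config is already matrix_size=8192, warmup=50, iterations=500, use_compile=true; absolute H100 TFLOPS thresholds are in `gpu_specs.py` | Results are currently overall/single-process; the 8 GPUs are not measured one by one to get range/mean; FP64 and INT8 not tested |
| 5. NCCL Multi-GPU | Partially, tool-dependent | Code supports nccl-tests; if the binary is missing it falls back to a torchrun functional connectivity test | nccl-tests is not properly installed on the remote machines, so the run degrades to a functional test that fails or has no performance data; only allreduce/alltoall/broadcast enabled by default, allgather/reducescatter/sendrecv not enabled; message sizes are not the three points 1MB/256MB/2GB; no 3 repeats taking the worst; no standard deviation |
| 6. Stress/Burn-in | Partially | Runs stress, 60 seconds by default; uses the PyTorch fallback when gpu-burn is absent | The PDF requires >=30min (60min recommended), large FP16/BF16 GEMM with matrix >=8192, per-minute TFLOPS jitter, temperature <=80, inter-GPU temperature delta <=5, power >=630W, throttle=0, XID=0; the current PyTorch fallback allocates only about 64MB per GPU, which is not enough load |
| 7. DCGM diagnostics | Not covered | None | `dcgmi diag -r 3` is not run, and the Software/Deployment/Hardware/Integration/Stress/Power sub-items are not parsed |
| 8. RDMA/IB | Partially | Discovers IB devices and runs ib_write_bw/read_bw/write_lat/read_lat | The script uses `localhost`, not cross-node; msg_size is 64KB, not 4MB; latency does not specify 8B; thresholds are 50GB/s and 10us, not the PDF's write/read >=47GB/s, write_lat <=2us, read_lat <=3.5us; PFC/ECN and bidirectional ibping not checked |
| 9. Training Simulation | Partially | Runs GPT-2 or a synthetic transformer and outputs tokens/s, step time, GPU memory, loss | The synthetic model has about 1.47B parameters but runs single-process with `.cuda()`, not as 8-GPU distributed training; no hard verdict on 45k tokens/s, step jitter <=±3%, peak <=70GB/GPU, NaN/Inf |
| 10. Overall verdict | Partially | Report has a summary | The current `all` pass/fail logic leans toward "did the module error out", not the PDF rule that any sub-item FAIL bars the whole machine from production |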
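## What running `test all` right now would give you

You would get a "single-node comprehensive checkup/benchmark report" containing:

- Basic info for the 8 H100s, driver/CUDA, PCIe, GPU memory, temperature, power
- Health check results
- Summary H2D/D2H/D2D bandwidth from nvbandwidth
- FP32/TF32/FP16/BF16/FP8 compute throughput
- NCCL test results; if nccl-tests is missing the run degrades to the torchrun fallback
- 60-second stress results
- Local localhost RDMA/IB results
- Training simulation results

This report can serve as a "quick smoke test plus single-machine pre-screening". It cannot be used directly as a "production acceptance pass report" under the PDF standard.

## Pre-run state of the two machines

Confirmed:

- `nvbandwidth` is installed and callable by the project scripts
- The PyTorch CUDA environment is installed
- The RDMA perftest tools are present
- `nccl-tests` and `gpu-burn` are not yet prepared to the PDF's production acceptance criteria

Also, regarding the `test all` run I accidentally triggered a moment ago:

- `aikubeworker0016` is already running the single-node `test all` and has reached Training Simulation
- `aikubeworker0012` did not start successfully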
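## Minimal list to reach the PDF acceptance criteria

1. Install/fix `nccl-tests` so that real bus BW is reported instead of the torchrun fallback.
2. Install/fix `gpu-burn`, or turn the PyTorch stress into a genuinely high-utilization FP16/BF16 GEMM that supports 30/60 minutes.
3. Add a dedicated NVLink test: `nvidia-smi nvlink -s/-c/-e`, judged on 18 links per GPU, 25GB/s, error=0.
4. Add a dedicated DCGM test: `dcgmi diag -r 3`, parsing sub-item PASS/FAIL.
5. Add telemetry sampling: every 1 second during stress, sample temperature, power, throttle, and XID; compute steady-state power, temperature delta, and jitter.
6. Change RDMA: support specifying server/client, 4MB bandwidth, 8B latency, bidirectional ibping, and PFC/ECN counters.
7. Change the NCCL config: enable all ops, use the three sizes 1MB/256MB/2GB, repeat 3 times, and take the worst value and standard deviation.
8. Change Compute: run each GPU separately and compute range/mean per dtype; add FP64 and INT8.
9. Change Training Simulation: make it explicitly 8-GPU 1.5B synthetic distributed training, with PASS/FAIL on tokens/s, step jitter, GPU memory, and loss NaN/Inf.
10. Change the final verdict: per the PDF rule, any sub-item FAIL fails the whole machine.

## Suggested execution strategy

Running this now: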
```bash
/root/gpu-test-venv/bin/python gpu_tester.py --test all --report --format md --output reports_all/test_all.md
```
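produces a "coverage report for the current repository's `all`".

To use it for production acceptance, the gaps above have to be closed first, especially `nccl-tests`, `gpu-burn`, NVLink, DCGM, long burn-in, and cross-node RDMA.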


@ -159,7 +159,7 @@ python3 gpu_tester.py
[3] Memory Benchmark (nvbandwidth) [3] Memory Benchmark (nvbandwidth)
[4] Compute Benchmark [4] Compute Benchmark
[5] NCCL Multi-GPU Test [5] NCCL Multi-GPU Test
[6] GPU Stress Test (gpu-burn) [6] GPU Stress Test (PyTorch/gpu-burn)
[7] RDMA/IB Test [7] RDMA/IB Test
[8] Training Simulation [8] Training Simulation
[9] Full Test Suite (All Tests) [9] Full Test Suite (All Tests)
@ -279,33 +279,35 @@ python3 gpu_tester.py --config /path/to/config.yaml --test all
| FP16 | 312 TFLOPS | 990 TFLOPS | 2,250 TFLOPS | 3,500 TFLOPS | | FP16 | 312 TFLOPS | 990 TFLOPS | 2,250 TFLOPS | 3,500 TFLOPS |
| BF16 | 312 TFLOPS | 990 TFLOPS | 2,250 TFLOPS | 3,500 TFLOPS | | BF16 | 312 TFLOPS | 990 TFLOPS | 2,250 TFLOPS | 3,500 TFLOPS |
| FP8 | N/A | 1,979 TFLOPS | 4,500 TFLOPS | 7,000 TFLOPS | | FP8 | N/A | 1,979 TFLOPS | 4,500 TFLOPS | 7,000 TFLOPS |
| FP64 | 9.7 TFLOPS | 67 TFLOPS | TBD | TBD |
| INT8 | 624 TOPS | 1,979 TOPS | TBD | TBD |
默认配置4096×4096 矩阵10 次 warmup100 次迭代。 默认配置:8192×8192 矩阵50 次 warmup500 次迭代;逐 GPU 跑 FP32/TF32/FP16/BF16/FP8/FP64/INT8并按同 dtype 的极差/均值判断一致性
### 5. NCCL Multi-GPU Test多卡通信 ### 5. NCCL Multi-GPU Test多卡通信
优先使用官方 nccl-tests通过 mpirun 调用),不可用时 torchrun fallback 优先使用官方 nccl-tests通过 mpirun 调用)并解析真实 bus BW如果只能走 torchrun fallback验收结果会标记 FAIL
| 操作 | 说明 | | 操作 | 说明 |
|---|---| |---|---|
| AllReduce | 最常用的集合通信 | | AllReduce | 最常用的集合通信 |
| AllToAll | 模型并行关键操作 | | AllToAll | 模型并行关键操作 |
| Broadcast | 参数同步 | | Broadcast | 参数同步 |
| ReduceScatter | 可选 | | ReduceScatter | 必测 |
| AllGather | 可选 | | AllGather | 必测 |
| SendRecv | 可选 | | SendRecv | 必测 |
默认测试数据量范围 8B ~ 256MB5 次 warmup20 次迭代 默认按 PDF 口径测试 1MB、256MB、2GB 三个 size每个 op 重复 3 次,取 worst bus BW 和标准差;标准差超过 3% 判 FAIL
**NVLink 参考带宽:** A100/A800 ≥ 240 GB/s | H100/H200 ≥ 360 GB/s | B200/B300 ≥ 720 GB/s40% NVLink 峰值) **NVLink 参考带宽:** A100/A800 ≥ 240 GB/s | H100/H200 ≥ 360 GB/s | B200/B300 ≥ 720 GB/s40% NVLink 峰值)
### 6. GPU Stress Test压力测试 ### 6. GPU Stress Test压力测试
使用 gpu-burn 进行长时满载测试,验证热稳定性和内存正确性 默认使用 PyTorch BF16/FP16 GEMM 进行长时高功耗满载测试;也可在配置中启用 gpu-burn。测试期间采集温度、功耗、throttle、XID并计算稳态功耗、温差和 TFLOPS 抖动
| 参数 | 默认值 | 说明 | | 参数 | 默认值 | 说明 |
|---|---|---| |---|---|---|
| duration_sec | 60 | 测试时长(秒) | | duration_sec | 1800 | 测试时长(秒) |
| use_tensor_cores | true | 使用 Tensor Core | | use_tensor_cores | true | 使用 Tensor Core |
| memory_pct | 90 | 内存占用比例 | | memory_pct | 90 | 内存占用比例 |
@ -320,18 +322,18 @@ python3 gpu_tester.py --config /path/to/config.yaml --test all
| 写延迟 | ib_write_lat | | 写延迟 | ib_write_lat |
| 读延迟 | ib_read_lat | | 读延迟 | ib_read_lat |
**参考阈值:** 带宽 ≥ 50 GB/s, 延迟 ≤ 10 μs **参考阈值:** 端口 ACTIVE 且 ≥400Gbps4MB 写/读带宽 ≥47GB/s8B 写延迟 ≤2μs、读延迟 ≤3.5μsPFC/ECN/CNP/congestion 计数为 0。
### 8. Training Simulation训练模拟 ### 8. Training Simulation训练模拟
使用真实或合成模型模拟训练负载 默认跑 8 卡 DDP synthetic 1.5B Transformer 训练模拟
| 模式 | 说明 | | 模式 | 说明 |
|---|---| |---|---|
| 真实模型 | 加载 HuggingFace GPT-2需安装 transformers | | DDP 合成模型 | 约 1.5B 参数8 卡 torchrun |
| 合成模型 | 6 层 Transformer无需额外依赖 | | 单进程 fallback | 仅用于调试;生产验收按 FAIL |
输出tokens/sec、步时、峰值显存、最终 loss。 输出tokens/sec、步时、warmup 后 step 抖动、峰值显存、最终 loss,并检查 loss 是否 NaN/Inf
--- ---
@ -351,14 +353,14 @@ benchmark:
nvbandwidth_buffer_mb: 512 # nvbandwidth 缓冲区大小 nvbandwidth_buffer_mb: 512 # nvbandwidth 缓冲区大小
nvbandwidth_samples: 3 # nvbandwidth 采样次数 nvbandwidth_samples: 3 # nvbandwidth 采样次数
compute: compute:
dtypes: [fp32, tf32, fp16, bf16, fp8] dtypes: [fp32, tf32, fp16, bf16, fp8, fp64, int8]
matrix_size: 4096 # GEMM 矩阵维度 matrix_size: 8192 # GEMM 矩阵维度
warmup: 10 warmup: 50
iterations: 100 iterations: 500
health: health:
temp_warning: 80 # 温度警告阈值 °C temp_warning: 75 # 温度警告阈值 °C
temp_critical: 90 # 温度严重阈值 °C temp_critical: 85 # 温度严重阈值 °C
power_limit: null # null = 自动匹配 GPU TDP power_limit: null # null = 自动匹配 GPU TDP
nccl: nccl:
@ -366,26 +368,62 @@ nccl:
test_allreduce: true test_allreduce: true
test_alltoall: true test_alltoall: true
test_broadcast: true test_broadcast: true
test_reduce_scatter: true
test_allgather: true
test_sendrecv: true
message_sizes: [1M, 256M, 2G]
repeats: 3
max_stddev_pct: 3
stress: stress:
duration_sec: 60 # 压力测试时长 duration_sec: 1800 # 压力测试时长
use_gpu_burn: false # 默认走 PyTorch GEMM stress
dtype: bf16
matrix_size: 24576
telemetry_interval_sec: 1
min_power_watts: 630
max_tflops_jitter_pct: 5
require_tflops_jitter: true
use_tensor_cores: true use_tensor_cores: true
rdma: rdma:
min_bandwidth_gbps: 50 # RDMA 最低可接受带宽 min_bandwidth_gbps: 47 # RDMA 最低可接受带宽
max_latency_us: 10 # RDMA 最大可接受延迟 min_port_rate_gbps: 400 # IB 端口最低速率
msg_size: 65536 # 测试消息大小 max_write_latency_us: 2.0
max_read_latency_us: 3.5
msg_size: 4194304 # 4MB 带宽测试消息
latency_msg_size: 8 # 8B 延迟测试消息
server_addr: null # client 模式 perftest 对端 IP
ibping_target: null # ibping 对端 LID/GID不是 IP
role: auto # auto / server / client
pfc_ecn_counters: true
nvlink:
expected_links_per_gpu: 18
expected_link_speed_gbps: 25
require_zero_errors: true
dcgm:
diag_level: 3
timeout_sec: 3600
expected_num_gpus: 8
json_output: true
require_subtests: true
training: training:
model: gpt2 # HuggingFace 模型名 model: synthetic_1.5b # 8 卡 synthetic Transformer
batch_size: 8 batch_size: 8
seq_length: 2048 seq_length: 2048
num_steps: 50 num_steps: 50
warmup_steps: 5
dtype: bf16 dtype: bf16
mode: ddp
min_tokens_per_sec: 45000
max_step_jitter_pct: 3
report: report:
output_dir: ./reports output_dir: ./reports
format: json # json 或 html format: json # json / html / md
``` ```
--- ---
@ -493,9 +531,11 @@ report:
步骤 2: RDMA 网络测试 步骤 2: RDMA 网络测试
├── python3 gpu_tester.py --test rdma ├── python3 gpu_tester.py --test rdma
├── 确认: IB 设备被识别 ├── 确认: IB 设备被识别
├── 确认: 端口状态 Active ├── 确认: 端口状态 ACTIVE 且 ≥400Gbps
├── 确认: 写带宽 ≥ 50 GB/s ├── 确认: 4MB 写/读带宽 ≥47 GB/s
├── 确认: 延迟 ≤ 10 μs ├── 确认: 8B 写延迟 ≤2 μs、读延迟 ≤3.5 μs
├── 确认: ibping 双向连通
├── 确认: PFC/ECN/CNP/congestion 计数为 0
└── 异常: 检查 IB 线缆、交换机配置、子网管理器 └── 异常: 检查 IB 线缆、交换机配置、子网管理器
步骤 3: 多节点 NCCL 测试 步骤 3: 多节点 NCCL 测试


@ -0,0 +1,255 @@
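# H100 `test all` Metric Guide

This document explains what each metric in the `gpu_tester.py --test all` report means, what it represents for acceptance, and what to investigate first when it is abnormal.

Applicable reports:

- `reports_test_all_latest_aikubeworker0012_20260522_203246.md`
- `reports_test_all_latest_aikubeworker0016_20260522_203447.md`
- `reports_test_all_latest_summary_cn_20260523.md`

## Overall verdict

| Metric | Meaning | How to read it |
|---|---|---|
| `Overall Acceptance Verdict` | Whole-machine acceptance conclusion | Under the PDF production acceptance rule, any required sub-item FAIL makes the whole machine FAIL |
| `Suite complete: x/10 tests passed` | How many of the 10 test modules passed | A quick view of overall health; the final word is `Overall Acceptance Verdict` |
| `PASS` | Meets the currently configured threshold | The metric passes under the current test criteria |
| `FAIL` | Below the currently configured threshold, or insufficient evidence | The item cannot serve as passing evidence for production acceptance |
| `WARN` | Old reports or non-mandatory warning criteria | Under the current PDF production acceptance, a key performance shortfall should be treated as FAIL |

## GPU Info

GPU Info is the basic inventory item. It confirms that the machine's hardware, driver, and CUDA environment match expectations.

| Metric | Meaning | Impact when abnormal |
|---|---|---|
| GPU count | Number of GPUs the system currently detects | On an 8-GPU H100 machine, anything other than 8 makes all later multi-GPU tests untrustworthy |
| GPU model | GPU model, e.g. H100 | A wrong model makes the thresholds, peaks, and acceptance criteria all wrong |
| Driver version | NVIDIA driver version | An outdated version can affect CUDA, NCCL, DCGM, and the NVLink tools |
| CUDA version | CUDA runtime version, or the version the driver supports | A CUDA mismatch can break PyTorch, nccl-tests, or the build tools |
| GPU UUID / PCI bus id | Unique GPU identifier and PCIe topology position | Used to locate the specific faulty card and its slot and link |

This item usually says nothing directly about performance; it confirms that what is being tested is the target machine, the target GPUs, and the target software stack.

## Health Check

Health Check is a baseline health check in the idle or lightly loaded state.

| Metric | Meaning | How to read it |
|---|---|---|
| Temperature | Current GPU temperature | A high idle temperature may point to cooling, airflow, or ambient-temperature problems |
| Power | Current power draw | Abnormally high idle power may point to leftover processes or an abnormal power state |
| ECC errors | GPU memory error-correction errors | Many single-bit errors, or any double-bit error, usually call for close attention to hardware stability |
| PCIe | PCIe generation and width, e.g. Gen5 x16 | Reduced speed or width affects CPU-GPU, RDMA, and some data-movement performance |
| Throttle | Whether clock throttling is currently active | A non-idle throttle reason in the idle state is abnormal and may affect later performance |
| XID / NVRM events | Driver or GPU error events | A new XID usually indicates a hardware, driver, power-delivery, or kernel-mode problem |

A Health PASS only shows that the baseline state is normal; it does not guarantee that full-load performance meets the bar.

## Memory Bandwidth

Memory Bandwidth measures data-movement capability: CPU to GPU, GPU to CPU, and GPU to GPU.

| Metric | Meaning | What it reflects |
|---|---|---|
| H2D | Host to Device: bandwidth from CPU memory to GPU memory | Affected by PCIe, NUMA, CPU memory, and the driver |
| D2H | Device to Host: bandwidth from GPU memory to CPU memory | Affected by PCIe, NUMA, CPU memory, and the driver |
| D2D | Device to Device: GPU-to-GPU bandwidth | On a single multi-GPU node, mainly determined by NVLink/NVSwitch |
| Efficiency | Measured value as a fraction of the theoretical value or configured threshold | A quick check of whether the expected bandwidth is reached |

H2D/D2H mainly show whether PCIe and the CPU-side path are healthy. D2D is closer to the underlying capability that multi-GPU training, NCCL, and P2P communication rely on.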
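## Compute Throughput

Compute Throughput measures the GPU's matrix-compute throughput in different numeric formats, usually in TFLOPS.

| Metric | Meaning | Typical use |
|---|---|---|
| FP32 | 32-bit floating-point performance | Traditional scientific computing, some model training and validation |
| TF32 | TensorFloat-32 Tensor Core performance | The common FP32 acceleration path on NVIDIA Ampere/Hopper |
| FP16 | 16-bit floating-point Tensor Core performance | Common in deep-learning training and inference |
| BF16 | bfloat16 Tensor Core performance | Common in large-model training; its wider numeric range makes it more stable than FP16 |
| FP8 | 8-bit floating-point Tensor Core performance | Low-precision training/inference acceleration on newer GPUs |
| FP64 | 64-bit double-precision performance | HPC, scientific computing, simulation |
| INT8 | 8-bit integer performance | Inference, quantized models |
| Achieved | Measured throughput | The closer to peak, the better |
| Peak | Theoretical or spec-sheet peak | Used to compute efficiency |
| Threshold | Current acceptance threshold | Below the threshold is FAIL |
| Efficiency | `Achieved / Peak` | Measures realized utilization |

### Compute Consistency

Consistency checks whether performance is balanced across GPUs for the same dtype.

| Metric | Meaning | What an anomaly implies |
|---|---|---|
| Min | Measured value of the slowest of the 8 GPUs | Finds the card that drags the rest down |
| Mean | Average across the 8 GPUs | Shows the overall level |
| Max | Measured value of the fastest of the 8 GPUs | Used with Min to compute dispersion |
| Spread | `(Max - Min) / Mean` | Reflects performance differences between cards |

A Spread above the threshold usually means some cards are affected by temperature, power, PCIe, background load, clock policy, or hardware state. Even when average performance is acceptable, a large spread between cards slows distributed training down.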
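
As a worked instance of the Spread formula, the sketch below computes `(Max - Min) / Mean` as a percentage over per-GPU results. The function name and the sample TFLOPS values are made up for illustration; the 3% limit is the range/mean figure the PDF comparison in this commit mentions.

```python
from statistics import mean


def spread_pct(per_gpu_values):
    """(Max - Min) / Mean, expressed in percent."""
    lo, hi = min(per_gpu_values), max(per_gpu_values)
    return 100.0 * (hi - lo) / mean(per_gpu_values)


def consistency_ok(per_gpu_values, max_spread_pct=3.0):
    """True when the per-dtype spread stays within the limit."""
    return spread_pct(per_gpu_values) <= max_spread_pct


# Hypothetical BF16 TFLOPS for three GPUs, chosen so the arithmetic is easy to follow.
sample = [98, 100, 102]
print(spread_pct(sample), consistency_ok(sample))
```

Here the spread is 4%, so the sample would miss a 3% limit even though its mean looks healthy.
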
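## NVLink / NVSwitch

The NVLink/NVSwitch test confirms that the high-speed GPU interconnect is complete, runs at the right speed, and has clean error counters.

| Metric | Meaning | How to read it |
|---|---|---|
| Active Links | Number of active NVLinks on each GPU | On an 8-GPU H100 SXM system the usual expectation is 18 per GPU |
| Expected Links | Configured expected link count | Even one missing link can affect topology and NCCL performance |
| Link speed | Per-link speed | A wrong speed indicates link degradation or misdetection |
| Error counters | NVLink error counters, e.g. CRC/replay/recovery | Non-zero values may indicate link-quality or hardware problems |

An NVLink PASS means the link state looks normal, but NCCL can still miss its thresholds because of algorithm, topology, message size, NCCL parameters, or system noise.

## DCGM Diagnostic

DCGM is NVIDIA's official diagnostic tool. `dcgmi diag -r 3` is a fairly complete production diagnostic level.

| Sub-item | Meaning |
|---|---|
| Deployment/software | Checks drivers, libraries, and system software dependencies |
| Hardware/memory | GPU memory health check |
| Hardware/diagnostic | Basic GPU hardware diagnostics |
| Hardware/nvbandwidth | GPU/NVLink/NVSwitch bandwidth diagnostics |
| Integration/pcie | PCIe integration and link checks |
| Stress/targeted_stress | DCGM's built-in targeted stress test |
| Stress/targeted_power | DCGM's built-in targeted power stress test |
| summary | Summary result for the category |

A DCGM PASS is strong evidence that the official diagnostics found no obvious hardware fault. It does not replace the project's NCCL, RDMA, long-running telemetry, and training-simulation acceptance.

## NCCL Multi-GPU

The NCCL test measures collective-communication capability across the GPUs of a single node. It directly affects multi-GPU training efficiency.

| Metric | Meaning | Why it matters |
|---|---|---|
| source | Test source | Must be `nccl-tests` for real bus BW; `torchrun_fallback` only shows functional connectivity and is not performance acceptance |
| bus BW | Bus-equivalent bandwidth reported by NCCL | Measures whether communication saturates NVLink/NVSwitch |
| message size | Message size, e.g. 1M, 256M, 2G | Small messages reflect latency and scheduling; medium and large messages reflect bandwidth |
| repeats | Number of repeats | Reduces chance fluctuations; currently 3 samples |
| worst bus BW | Worst value across the repeats | Production acceptance cares more about the worst case |
| mean bus BW | Average across the repeats | Reflects the stable level |
| stddev | Standard deviation or fluctuation | Large fluctuation means communication is not stable enough |

### NCCL op meanings

| Op | Meaning | Typical scenario |
|---|---|---|
| allreduce | Every GPU holds data; after the reduction every GPU gets the result | Data-parallel gradient synchronization, the most common case |
| allgather | Every GPU collects the data shards of all GPUs | Model parallelism, tensor parallelism, parameter/activation gathering |
| reducescatter | Reduce first, then split the result across the GPUs | ZeRO, optimizer-state sharding, common in distributed training |
| broadcast | One GPU broadcasts data to the others | Parameter sync, distributing initial weights |
| sendrecv | Point-to-point send and receive | Pipeline, custom communication, topology validation |
| alltoall | Every GPU exchanges different data with every other GPU | MoE, expert parallelism, shuffle-style communication |

Small-message NCCL failures usually come from latency, scheduling, or a strict threshold definition; large-message failures point more to link bandwidth, topology, NCCL parameters, or NVSwitch/PCIe/NUMA configuration.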
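
The repeat-based metrics (worst bus BW, mean bus BW, stddev) can be illustrated with a short sketch. Two things here are assumptions, not facts from the source: the standard deviation is expressed as a percentage of the mean using the population formula, and the sample values are invented around the figures quoted in this commit. `summarize_busbw` is a hypothetical helper name.

```python
from statistics import mean, pstdev


def summarize_busbw(samples_gbytes, min_busbw_gbytes, max_stddev_pct=3.0):
    """Aggregate repeated bus BW samples (GB/s) for one op and message size."""
    worst = min(samples_gbytes)
    avg = mean(samples_gbytes)
    # Assumption: stddev is judged relative to the mean, in percent.
    stddev_pct = 100.0 * pstdev(samples_gbytes) / avg
    passed = worst >= min_busbw_gbytes and stddev_pct <= max_stddev_pct
    return {"worst": worst, "mean": avg, "stddev_pct": stddev_pct, "passed": passed}


# Illustrative samples only: a large-message case and a small-message case.
large = summarize_busbw([420.0, 421.0, 422.0], min_busbw_gbytes=405.0)
small = summarize_busbw([22.5, 23.0, 23.5], min_busbw_gbytes=405.0)
print(large["passed"], small["passed"])
```

Judging on the worst sample rather than the mean is what makes a single slow repeat enough to fail the op/size pair.
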
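## Stress Test

Stress Test is a long-running high-load stability test. It is not only about whether the run finishes; temperature, power, throttling, and error events during full load matter too.

| Metric | Meaning | How to read it |
|---|---|---|
| duration | Actual stress duration | Production acceptance usually needs 30/60 minutes |
| source | Load source, e.g. `pytorch` or `gpu-burn` | Shows what load is stressing the GPUs |
| dtype | Data type of the stress computation, e.g. BF16 | Affects Tensor Core use, power, and temperature |
| matrix_size | GEMM matrix side length | Larger sizes sustain high utilization more easily |
| memory_pct | Target GPU memory occupancy | Avoids testing only a tiny load |
| Avg steady power | Steady-state average power | Shows whether the cards are really loaded |
| Max steady temp | Steady-state maximum temperature | Shows the cooling ceiling |
| Temp delta | Difference between the hottest and coolest of the 8 GPUs | A large delta indicates uneven airflow, cooling, or slot position |
| TFLOPS jitter | Steady-state throughput fluctuation | Large jitter means unstable performance |
| Throttle events | Number of throttle events | Non-idle throttling affects performance stability |
| XID events | New XID errors during the stress run | An XID is usually a serious risk |

### Common throttle meanings

The bit values below follow the NVML clocks-throttle-reason constants.

| Code | Common meaning | Explanation |
|---|---|---|
| `0x1` | idle throttle | Throttled because the GPU is idle; usually not a real problem |
| `0x4` | `sw_power_cap` | The software power cap was reached; performance may be limited by the power wall |
| `0x8` | hardware slowdown | Hardware-triggered slowdown |
| `0x10` | sync boost | Clocks held to match other GPUs in a sync-boost group |
| `0x20` | software thermal slowdown | Slowdown triggered by the software temperature policy |
| `0x40` | hardware thermal slowdown | Slowdown triggered by hardware temperature protection |
| `0x80` | hardware power brake | External power delivery or hardware power protection |

The `sw_power_cap` in the current reports shows that the load really did push the cards near the power wall, but the acceptance criteria count any non-idle throttle as a failure reason because it affects sustained output over long runs.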
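
A throttle mask such as `0x4` is a bit field, so several reasons can be active at once. The sketch below decodes a mask; the bit values are the NVML clocks-throttle-reason constants as I understand them, and the short names and helper functions are illustrative rather than taken from `gpu_tester.py`.

```python
# Bit values per the NVML clocks-throttle-reason constants (short names are informal).
THROTTLE_REASONS = {
    0x1: "gpu_idle",
    0x2: "applications_clocks_setting",
    0x4: "sw_power_cap",
    0x8: "hw_slowdown",
    0x10: "sync_boost",
    0x20: "sw_thermal_slowdown",
    0x40: "hw_thermal_slowdown",
    0x80: "hw_power_brake_slowdown",
    0x100: "display_clock_setting",
}


def decode_throttle(mask):
    """Return the names of all reasons whose bit is set in the mask."""
    return [name for bit, name in sorted(THROTTLE_REASONS.items()) if mask & bit]


def strict_clean(mask):
    """Strict criterion quoted in this commit: the mask must be exactly 0x0."""
    return mask == 0x0


print(decode_throttle(0x4), strict_clean(0x4))
print(decode_throttle(0x5))
```

Decoding every 1-second sample and OR-ing the masks over the run gives a single value that can be compared against `0x0`.
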
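## RDMA / InfiniBand

The RDMA test measures the performance of the IB NICs and network links. Single-node loopback and cross-node server/client are two different kinds of evidence and must not be mixed.

| Metric | Meaning | How to read it |
|---|---|---|
| Device | IB device name, e.g. `mlx5_0` | Maps to a specific HCA/port |
| Port | Port number | Usually port 1 |
| State | Port state, e.g. ACTIVE/DOWN | Only ACTIVE counts as a usable link |
| Rate | Port rate, e.g. 400 Gb/sec | A rate below expectation indicates link degradation or a connection to the wrong network |
| GID/LID | IB addressing information | Used by `ibping` and for cross-node location |
| ib_write_bw | RDMA write bandwidth | Throughput of the client writing data to the remote side |
| ib_read_bw | RDMA read bandwidth | Throughput of the client reading data from the remote side |
| ib_write_lat | RDMA write latency | Small-message write latency |
| ib_read_lat | RDMA read latency | Small-message read latency |
| ibping | IB-layer connectivity test | Shows whether the LID/GID layer is reachable |
| PFC/ECN/CNP counters | Congestion and flow-control counters | Non-zero or growing values may indicate congestion, packet loss, or flow-control problems |

### Single-node vs. cross-node

| Criteria | Meaning | What it can prove | What it cannot prove |
|---|---|---|---|
| `local_loopback` | perftest server/client started locally on the same machine | Tools, devices, and single-machine ports basically work | Cannot prove that RDMA between two machines meets the bar |
| server/client cross-node | One machine is the server, the other the client | Proves actual cross-node RDMA bandwidth/latency | Requires explicit server_addr, ib_device, ib_port, ibping_target |

RDMA read bandwidth below write bandwidth is common, but production acceptance sets separate thresholds for read and write. When read misses the bar, check HCA firmware, BIOS, PCIe, NUMA, RoCE/IB configuration, switches, PFC/ECN, cables, and port rate.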
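
The separate read/write gates and the loopback caveat can be combined in one small check. This is a sketch under stated assumptions: the function name, argument names, and the `scope` labels are hypothetical, and the limits (47 GB/s, 2 us, 3.5 us) are the PDF figures quoted in this commit.

```python
def rdma_failures(write_bw_gbytes, read_bw_gbytes, write_lat_us, read_lat_us,
                  scope, min_bw_gbytes=47.0, max_write_lat_us=2.0,
                  max_read_lat_us=3.5):
    """Return the names of the RDMA gates that are not met.

    scope is "cross_node" or "local_loopback" (labels assumed for this sketch);
    loopback results are never accepted as cross-node evidence.
    """
    failures = []
    if scope != "cross_node":
        failures.append("scope")
    if write_bw_gbytes < min_bw_gbytes:
        failures.append("write_bw")
    if read_bw_gbytes < min_bw_gbytes:
        failures.append("read_bw")
    if write_lat_us > max_write_lat_us:
        failures.append("write_lat")
    if read_lat_us > max_read_lat_us:
        failures.append("read_lat")
    return failures


# Made-up measurements: read bandwidth is below the bar, everything else passes.
print(rdma_failures(48.5, 44.0, 1.8, 3.2, scope="cross_node"))
```
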
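## Training Simulation

Training Simulation uses a synthetic 1.5B Transformer training load to verify that 8-GPU distributed training runs stably.

| Metric | Meaning | How to read it |
|---|---|---|
| Model | Model type | Currently synthetic 1.5B; it does not depend on a real dataset |
| Parameters | Parameter count | Confirms the load size is as expected |
| GPU Count | Number of GPUs taking part in training | Production criteria require 8-GPU DDP |
| DType | Training numeric format, e.g. BF16 | BF16 is common for large-model training |
| Batch Size | Batch size per step | Affects throughput and GPU memory |
| Seq Length | Sequence length | Affects compute and GPU memory |
| Steps | Training steps counted in the statistics | Too few steps make the statistics unstable |
| Warmup Steps | Warmup steps | Keeps CUDA initialization, compilation, and cold caches out of the performance figures |
| Avg Step Time | Average time per step | Lower is better |
| Throughput | tokens/sec | The core training-throughput metric |
| Samples/sec | Samples per second | A secondary measure of data-processing speed |
| Peak Memory | Peak GPU memory | Shows whether the run is close to OOM or underusing memory |
| Final Loss | Last loss value | Confirms the value is finite, with no NaN/Inf |
| Step Jitter | Step-time jitter | Large jitter means unstable training |
| Distributed Mode | Distributed mode | Must be `ddp` to satisfy the 8-GPU distributed criteria |

A Training PASS shows that the 8-GPU DDP training path, NCCL functional connectivity, PyTorch CUDA, and basic numerical stability are all fine. It cannot replace the NCCL performance test, because the training load may not exercise every communication pattern and message size.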
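
How the throughput, jitter, and loss metrics relate can be sketched as below. Several points are assumptions for illustration only: that Batch Size is per GPU, that step jitter is the largest deviation from the mean step time in percent, and all function names. The 45,000 tokens/s and 3% limits are the figures quoted in this commit.

```python
import math
from statistics import mean


def tokens_per_sec(batch_size_per_gpu, seq_length, num_gpus, avg_step_time_s):
    """Tokens processed per second, assuming batch size is per GPU."""
    return batch_size_per_gpu * seq_length * num_gpus / avg_step_time_s


def step_jitter_pct(step_times_s):
    """Assumed definition: max deviation from the mean step time, in percent."""
    avg = mean(step_times_s)
    return 100.0 * max(abs(t - avg) for t in step_times_s) / avg


def training_failures(tps, jitter_pct, final_loss, mode,
                      min_tps=45000.0, max_jitter_pct=3.0):
    """Return the names of the training gates that are not met."""
    failures = []
    if tps < min_tps:
        failures.append("throughput")
    if jitter_pct > max_jitter_pct:
        failures.append("jitter")
    if not math.isfinite(final_loss):
        failures.append("loss")
    if mode != "ddp":
        failures.append("mode")
    return failures


# Hypothetical run: batch 8 per GPU, sequence length 2048, 8 GPUs, 2.0 s per step.
tps = tokens_per_sec(8, 2048, 8, 2.0)
print(tps, training_failures(tps, 1.0, 2.3, "ddp"))
```
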
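## Common misreadings

1. `DCGM PASS` does not mean whole-machine acceptance PASS. DCGM is one part of the official diagnostics and does not cover every workload performance gate.
2. `Training PASS` does not mean NCCL performance PASS. Training that runs only shows that the functional path works; NCCL bus BW can still miss the bar.
3. `NVLink PASS` does not mean NCCL PASS. Normal link counts and error counters do not mean every NCCL op/size reaches its threshold.
4. `ibping PASS` does not mean RDMA bandwidth PASS. `ibping` only proves connectivity, not throughput or latency.
5. `local_loopback` cannot be used as cross-node RDMA evidence. Cross-node acceptance needs evidence from both the server and client ends.
6. A stress run that completes 30 minutes is not automatically a PASS. Temperature delta, power, throttle, XID, and jitter all have to be considered together.
7. Low small-message NCCL numbers do not necessarily mean a broken link; latency, algorithm, launch overhead, or the threshold definition can cause them. Production acceptance still judges by the threshold.

## Suggested troubleshooting priorities

| Failed item | Check first |
|---|---|
| Compute FAIL | GPU clocks, power policy, MIG/MPS, background processes, PyTorch/CUDA versions, whether the benchmark algorithm uses the intended Tensor Core path |
| NCCL FAIL | `NCCL_DEBUG=INFO`, topology, NVSwitch/NVLink, NCCL algorithm, message size, PCIe/NUMA, process CPU pinning |
| Stress FAIL | Chassis airflow, fans, ambient temperature, power cap, `nvidia-smi -q -d POWER,CLOCK,TEMPERATURE` |
| RDMA FAIL | Port rate, HCA firmware, cables, switches, PFC/ECN, NUMA, BIOS, cross-node server/client configuration |
| Training FAIL | torchrun, NCCL environment variables, CUDA OOM, loss NaN/Inf, DDP initialization, network/shared memory |

## In one sentence

This report set does not just check whether the GPUs light up and training runs. It simultaneously validates hardware detection, baseline health, memory and interconnect bandwidth, compute throughput, multi-GPU communication, long full-load stability, the IB/RDMA network, the official DCGM diagnostics, and the 8-GPU training path. A FAIL on any key item should fail the whole machine under production acceptance.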


@ -0,0 +1,362 @@
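# Multi-Node Multi-GPU NCCL Testing: Concepts

This document covers concepts first and does not touch script changes. The goal is to understand which layers to verify, step by step, when two 8-GPU H100 servers run multi-node multi-GPU communication tests, and what each layer actually proves.

Example machines:

| Alias | Hostname | Internal IP | GPU |
|---|---|---|---|
| nccl-gpu-1 | aikubeworker0012 | 172.72.8.12 | 8 x H100 |
| nccl-gpu-2 | aikubeworker0016 | 172.72.8.16 | 8 x H100 |

Together the two machines have 16 GPUs. The core question of multi-node NCCL testing is whether these 16 GPUs can complete collective communication efficiently and correctly over the right GPU, NVLink, PCIe, and IB/RDMA network paths.

## 1. Overall approach

Multi-node multi-GPU communication testing is a bottom-up process. Lower layers are closer to hardware and links; upper layers are closer to real training workloads.

```mermaid
flowchart TD
L0["0. Physical and basic connectivity<br/>Power / GPU / NIC / Cables / Switch / SSH"] --> L1["1. System detection layer<br/>nvidia-smi / lspci / ibstat / ibdev2netdev"]
L1 --> L2["2. Single-node GPU health layer<br/>Temperature / Power / ECC / PCIe / Throttling / NVLink Topo"]
L2 --> L3["3. Single-node GPU performance layer<br/>HBM bandwidth / H2D-D2H / FP32-TF32-FP16-BF16-FP8 compute"]
L3 --> L4["4. Single-node multi-GPU communication layer<br/>Single-node 8-GPU NCCL over NVLink/NVSwitch"]
L4 --> L5["5. Cross-node network and RDMA layer<br/>IP connectivity / IB Active / RDMA bandwidth / RDMA latency"]
L5 --> L6["6. Cross-node NCCL layer<br/>2 nodes, 16 GPUs: AllReduce / AllGather / ReduceScatter / Broadcast / AllToAll"]
L6 --> L7["7. Training workload layer<br/>torchrun / Megatron / DeepSpeed / production training stress"]
```

The most important principle:

**A failure at an upper layer is not necessarily a problem in that layer.**

For example, when a two-node `all_reduce_perf` run fails, the cause may be NCCL, but it may also be SSH, MPI, IB, GID, NIC selection, driver version, CUDA version, NCCL version, or GPU Direct RDMA.

So the troubleshooting order should be:

```text
Basic connectivity -> single-node health -> single-node performance -> single-node NCCL -> cross-node RDMA -> cross-node NCCL -> training workload
```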
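
The bottom-up order above implies a simple rule: investigate the lowest layer that fails, not the highest. A minimal sketch of that rule follows; the layer labels mirror the chain in this document, while the `first_failing_layer` helper and the result dictionary are hypothetical.

```python
# Layers in bottom-up order, mirroring the troubleshooting chain above.
LAYERS = [
    "basic connectivity",
    "single-node health",
    "single-node performance",
    "single-node NCCL",
    "cross-node RDMA",
    "cross-node NCCL",
    "training workload",
]


def first_failing_layer(passed_by_layer):
    """Return the lowest layer that did not pass, or None if all passed.

    A layer that is missing from the results is treated as not passed.
    """
    for layer in LAYERS:
        if not passed_by_layer.get(layer, False):
            return layer
    return None


# Example: cross-node NCCL fails, but so does the RDMA layer beneath it.
results = {layer: True for layer in LAYERS}
results["cross-node RDMA"] = False
results["cross-node NCCL"] = False
print(first_failing_layer(results))
```

In this example the answer is the RDMA layer, which is where the investigation should start even though the visible symptom is the NCCL failure.
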
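## 2. Communication paths for two nodes and 16 GPUs

Within a node, traffic mainly uses NVLink/NVSwitch. Across machines, data has to pass through the GPU, PCIe/NVLink, the NIC, the switch, and the peer NIC.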
```mermaid
flowchart LR
subgraph A["aikubeworker0012 / 172.72.8.12"]
A0["GPU0"] --- ASW["NVSwitch / NVLink"]
A1["GPU1"] --- ASW
A2["..."] --- ASW
A7["GPU7"] --- ASW
ASW --> ANIC["IB/RDMA NIC(s)"]
end
subgraph NET["InfiniBand / RoCE Fabric"]
SW["IB Switch"]
end
subgraph B["aikubeworker0016 / 172.72.8.16"]
BNIC["IB/RDMA NIC(s)"] --> BSW["NVSwitch / NVLink"]
B0["GPU0"] --- BSW
B1["GPU1"] --- BSW
B2["..."] --- BSW
B7["GPU7"] --- BSW
end
ANIC <--> SW
SW <--> BNIC
```
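There are two different communication domains here:

| Communication domain | Typical path | Main tests |
|---|---|---|
| 8 GPUs within a node | GPU -> NVLink/NVSwitch -> GPU | Single-node NCCL, NVLink topo, D2D |
| 16 GPUs across machines | GPU -> NIC -> IB/RDMA network -> NIC -> GPU | RDMA, cross-node NCCL |

The performance thresholds of the two domains must not be mixed. Single-node NVSwitch is very fast; cross-node RDMA is generally slower, and the bottleneck of cross-node NCCL is usually the IB/RDMA network.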
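## 3. What to test at each layer

### 3.1 Basic connectivity layer

This layer only proves that the machines are reachable and that identities and addresses are correct.

Confirm:

| Check | Purpose |
|---|---|
| SSH between nodes | Multi-node MPI/NCCL launch depends on starting processes on the remote side |
| Correct hostname | Avoids logging in to the wrong machine |
| Correct IP | Confirms that the training network, or the network that corresponds to IB/RDMA, is in use |
| Time synchronization | Makes long-training logs and timeout troubleshooting more reliable |

This layer proves nothing about GPU or RDMA performance, only that "the machines can find each other".

### 3.2 System detection layer

This layer proves that the system can see the GPUs and NICs.

Common information:

| Tool | What to look at |
|---|---|
| `nvidia-smi` | GPU count, model, driver, CUDA, temperature, power |
| `nvidia-smi topo -m` | GPU, NIC, CPU NUMA, NVLink/NVSwitch topology |
| `ibstat` | IB devices, port state, link rate |
| `ibdev2netdev` | Mapping between mlx5 devices and network interfaces |
| `/sys/class/infiniband` | Port state, link layer, rate, GID |

This layer matters because NCCL often ends up on TCP or on the wrong interface after selecting the wrong NIC.

### 3.3 Single-node GPU health layer

This layer proves that each machine is healthy by itself.

```mermaid
flowchart LR
H["Single-node health check"] --> T["Temperature"]
H --> P["Power"]
H --> E["ECC errors"]
H --> PCIE["PCIe Gen/Width"]
H --> C["SM/Mem Clock"]
H --> TH["Throttling"]
H --> PM["Persistence Mode"]
```

If a card runs too hot, has ECC double-bit errors, a degraded PCIe link, or throttling, later NCCL results are not trustworthy even if the tests run.

### 3.4 Single-node GPU performance layer

This layer proves that each machine's GPUs perform normally.

| Test | What it proves |
|---|---|
| HBM/D2D bandwidth | GPU memory and device-to-device copy capability |
| H2D/D2H bandwidth | The PCIe path between the CPU/host and the GPU |
| FP32/TF32 | Basic matrix-compute capability |
| FP16/BF16/FP8 | Tensor Core capability commonly used in training |

This step is single-machine acceptance. It cannot prove that communication between the two machines works, but it rules out abnormal GPU compute or bandwidth on one of the machines.

### 3.5 Single-node multi-GPU NCCL layer

This layer verifies collective communication among the 8 GPUs of one machine.

```mermaid
flowchart TD
S["Single-node 8-GPU NCCL"] --> AR["AllReduce"]
S --> AG["AllGather"]
S --> RS["ReduceScatter"]
S --> BC["Broadcast"]
S --> AT["AllToAll"]
```

Single-node NCCL mainly checks whether the NVLink/NVSwitch communication path works. Common metrics:

| Metric | Meaning |
|---|---|
| `algbw` | Effective bandwidth from the algorithm's point of view |
| `busbw` | Bandwidth from the bus's point of view; better suited to comparing link utilization |
| `#wrong` | Number of incorrect results; must be 0 |

Passing the single-node test only shows that 8-GPU communication inside one server works.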
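
The relationship between `algbw` and `busbw` can be made concrete with the correction factors that the nccl-tests performance notes describe, to the best of my knowledge: `2(n-1)/n` for AllReduce, `(n-1)/n` for AllGather and ReduceScatter, and `1` for Broadcast and Reduce. The helper below is an illustrative sketch, not code from this repository, and the bandwidth values are made up.

```python
# busbw = algbw * factor(n), where n is the number of ranks.
BUSBW_FACTOR = {
    "allreduce": lambda n: 2 * (n - 1) / n,
    "allgather": lambda n: (n - 1) / n,
    "reducescatter": lambda n: (n - 1) / n,
    "broadcast": lambda n: 1.0,
    "reduce": lambda n: 1.0,
}


def busbw(op, algbw_gbytes, nranks):
    """Convert algorithm bandwidth (GB/s) to bus bandwidth (GB/s)."""
    return algbw_gbytes * BUSBW_FACTOR[op](nranks)


# With 8 ranks the AllReduce factor is 2 * 7 / 8 = 1.75.
print(busbw("allreduce", 240.0, 8))
print(busbw("allgather", 80.0, 8))
```

Because the factor depends on the op and rank count, `busbw` is the number that stays comparable across ops, which is why the thresholds in this document are stated in bus BW.
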
### 3.6 Cross-node RDMA layer

This layer verifies the network and RDMA capability between the two machines; NCCL is not involved.
```mermaid
sequenceDiagram
participant N1 as aikubeworker0012
participant FAB as IB/RDMA Fabric
participant N2 as aikubeworker0016
N1->>N2: ping / ssh
N1->>FAB: ib_write_bw client
FAB->>N2: ib_write_bw server
N1->>FAB: ib_read_bw client
FAB->>N2: ib_read_bw server
N1->>N2: ib_write_lat / ib_read_lat
```
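This layer has to answer:

| Question | Notes |
|---|---|
| Is the IB port Active? | If not, there is no point running NCCL |
| Does RDMA bandwidth meet the bar? | Proves that the network data plane works |
| Is RDMA latency normal? | High latency hurts small messages and training synchronization |
| Is it InfiniBand or RoCE? | The two differ in environment variables and troubleshooting points |

If the RDMA layer fails, cross-node NCCL will most likely fail too, or fall back to TCP.

### 3.7 Cross-node NCCL layer

This layer is the real multi-node multi-GPU NCCL test.

Two 8-GPU machines typically give:

```text
2 nodes x 8 GPUs = 16 ranks
Each rank is bound to 1 GPU
```

Conceptually: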
```mermaid
flowchart LR
subgraph N1["Node 1: 172.72.8.12"]
R0["rank 0 / GPU0"]
R1["rank 1 / GPU1"]
R2["..."]
R7["rank 7 / GPU7"]
end
subgraph N2["Node 2: 172.72.8.16"]
R8["rank 8 / GPU0"]
R9["rank 9 / GPU1"]
R10["..."]
R15["rank 15 / GPU7"]
end
R0 <--> R8
R1 <--> R9
R7 <--> R15
N1 <--> N2
```
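
The rank layout in the diagram can be expressed as a small mapping. This sketch assumes ranks are placed node by node (ranks 0-7 on the first node, 8-15 on the second), as the diagram shows; the `locate` helper is a hypothetical name, and real launchers can be configured to place ranks differently.

```python
GPUS_PER_NODE = 8
# Node addresses from the example machines in this document.
NODES = ["172.72.8.12", "172.72.8.16"]


def locate(rank):
    """Map a global rank to (node address, local GPU index).

    Assumes block placement: consecutive ranks fill one node before the next.
    """
    node_index, local_gpu = divmod(rank, GPUS_PER_NODE)
    return NODES[node_index], local_gpu


for rank in (0, 7, 8, 15):
    print(rank, locate(rank))
```
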
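Typical test items:

| NCCL test | What it maps to in training |
|---|---|
| AllReduce | Data-parallel gradient synchronization |
| ReduceScatter | ZeRO/FSDP gradient sharding |
| AllGather | ZeRO/FSDP parameter gathering |
| Broadcast | Parameter broadcast, initialization |
| AllToAll | MoE, expert parallelism, some parallel strategies |
| SendRecv | Point-to-point communication, pipeline parallel |

For cross-node NCCL, look at:

| Metric | Verdict |
|---|---|
| Whether all 16 ranks start | Whether MPI/SSH/paths/environment are fine |
| `#wrong == 0` | Correctness must pass |
| `busbw` | Cross-node link utilization |
| Whether IB/RDMA is used | Confirm from `NCCL_DEBUG=INFO` |
| Whether it falls back to TCP | If it does, performance is noticeably lower |

## 4. Why NCCL is split into single-node and cross-node

The bottlenecks of single-node 8-GPU communication and cross-node 16-GPU communication differ.

```mermaid
flowchart TD
A["NCCL performance result"] --> B{"Test scope"}
B --> C["Single node, 8 GPUs"]
B --> D["Cross-node, 16 GPUs"]
C --> C1["Main bottleneck: NVLink / NVSwitch"]
C --> C2["Threshold can reference GPU NVLink capability"]
D --> D1["Main bottleneck: IB/RDMA network"]
D --> D2["Threshold should reference NIC count, rate, topology, and rail count"]
```

So the single-node NVLink threshold cannot be used directly to judge cross-node NCCL. Cross-node thresholds should be set from the real network capability, for example:

| Network configuration | How to read the theoretical ceiling |
|---|---|
| One 400G NIC | About 50 GB/s raw unidirectional bandwidth |
| Eight 400G NICs | About 400 GB/s raw aggregate bandwidth |
| Measured NCCL busbw | Affected by topology, GDR, rails, NUMA, switches, and the NCCL algorithm |

In a real acceptance, first find out how many IB/RDMA NICs each machine has, the rate of each, and the GPU-to-NIC topology, and only then set the cross-node NCCL threshold.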
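
The ceilings in the table come from a bits-to-bytes conversion: a 400 Gb/s port carries at most 400 / 8 = 50 GB/s of raw data in one direction. A minimal sketch of that arithmetic follows; the function names are hypothetical and the measured value in the example is invented.

```python
def raw_ceiling_gbytes(num_nics, rate_gbits=400.0):
    """Raw unidirectional ceiling in GB/s for a set of equal-rate NICs."""
    return num_nics * rate_gbits / 8.0


def link_efficiency(measured_gbytes, num_nics, rate_gbits=400.0):
    """Measured bandwidth as a fraction of the raw ceiling."""
    return measured_gbytes / raw_ceiling_gbytes(num_nics, rate_gbits)


print(raw_ceiling_gbytes(1))           # one 400G NIC
print(raw_ceiling_gbytes(8))           # eight 400G NICs
print(link_efficiency(47.0, 1))        # a 47 GB/s measurement on one NIC
```

A 47 GB/s measurement on one 400G NIC is 94% of the raw ceiling, which shows how little headroom a line-rate threshold leaves for protocol overhead.
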
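## 5. Common failure points

```mermaid
flowchart TD
F["Cross-node NCCL failure"] --> A["Fails to start"]
F --> B["Starts but is slow"]
F --> C["Timeout during run"]
F --> D["Result #wrong is non-zero"]
A --> A1["SSH unreachable"]
A --> A2["Remote path does not exist"]
A --> A3["Inconsistent MPI environment"]
A --> A4["Running as root not allowed"]
B --> B1["Wrong NCCL_SOCKET_IFNAME"]
B --> B2["Not using IB/RDMA, fell back to TCP"]
B --> B3["NCCL_IB_HCA not set correctly"]
B --> B4["GPU Direct RDMA not in effect"]
C --> C1["Unstable IB port"]
C --> C2["Switch/PFC/ECN problem"]
C --> C3["NCCL timeout setting"]
C --> C4["Incompatible driver/CUDA/NCCL versions"]
D --> D1["Communication correctness failure"]
D --> D2["Must FAIL, bandwidth alone is not enough"]
```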
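## 6. Recommended acceptance order

The recommended order for two 8-GPU machines is:

```mermaid
flowchart TD
A["Step 1: Basic info for both machines"] --> B["Step 2: Single-node GPU health on both machines"]
B --> C["Step 3: Single-node benchmark on both machines"]
C --> D["Step 4: Single-node 8-GPU NCCL on each machine"]
D --> E["Step 5: RDMA bandwidth/latency between the two machines"]
E --> F["Step 6: 2-node 16-GPU NCCL correctness"]
F --> G["Step 7: 2-node 16-GPU NCCL performance"]
G --> H["Step 8: 2-node training demo or production stress test"]
```

Purpose of each step:

| Step | Purpose |
|---|---|
| Step 1 | Confirm you are on the right machines and that the basic network and environment exist |
| Step 2 | Rule out GPU health problems |
| Step 3 | Rule out single-GPU/single-node performance problems |
| Step 4 | Rule out single-node NVLink/NVSwitch/NCCL problems |
| Step 5 | Rule out cross-node RDMA problems |
| Step 6 | Prove NCCL correctness first |
| Step 7 | Then prove NCCL performance |
| Step 8 | Finally validate stability with a real training shape |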
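## 7. Mapping to the current script

How the script's existing modules relate to the layers above:

| Current module | Layer covered | Notes |
|---|---|---|
| `gpu_info` | System detection layer | Single node |
| `health` | Single-node GPU health layer | Single node |
| `benchmark` | Single-node GPU performance layer | Single node |
| `nccl` | Single-node multi-GPU communication layer | Currently mainly single node |
| `rdma` | RDMA check | Currently a mostly local check, not a two-machine test |
| `stress` | Stability | Single node |
| `training` | Training workload layer | Currently mostly single node |
| Suggested new `multi_node_nccl` | Cross-node NCCL layer | Dedicated handling of hostfile, mpirun, multi-node environment, and result parsing |

If the script is extended in the future, the natural direction is to add a separate multi-node module rather than cram all the logic into the existing `nccl` module.

## 8. Minimal mental model

This is the part to remember:

```text
Single-node NCCL validates NVLink/NVSwitch between GPUs.
Cross-node RDMA validates the network between machines.
Cross-node NCCL validates that NCCL can combine GPUs and network to provide efficient communication for real training.
```

So multi-node multi-GPU testing is not a single command but a chain of validations.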


@ -5,6 +5,7 @@ import argparse
import json import json
import os import os
import signal import signal
import socket
import sys import sys
import time import time
from datetime import datetime from datetime import datetime
@ -25,6 +26,8 @@ from modules.nccl_test import NCCLTest
from modules.training_sim import TrainingSim from modules.training_sim import TrainingSim
from modules.stress_test import StressTest from modules.stress_test import StressTest
from modules.rdma_test import RDMATest from modules.rdma_test import RDMATest
from modules.nvlink_test import NVLinkTest
from modules.dcgm_test import DCGMTest
from modules.report import ReportGenerator from modules.report import ReportGenerator
from modules.gpu_specs import detect_gpu_type, get_gpu_specs, get_gpu_label, get_supported_gpus, validate_driver_compatibility from modules.gpu_specs import detect_gpu_type, get_gpu_specs, get_gpu_label, get_supported_gpus, validate_driver_compatibility
@@ -32,43 +35,87 @@ DEFAULT_CONFIG = {
     "benchmark": {
         "memory": {"size_mb": 4096, "iterations": 10, "nvbandwidth_buffer_mb": 512, "nvbandwidth_samples": 3},
         "compute": {
-            "dtypes": ["fp32", "tf32", "fp16", "bf16", "fp8"],
-            "matrix_size": 4096,
-            "warmup": 10,
-            "iterations": 100,
+            "dtypes": ["fp32", "tf32", "fp16", "bf16", "fp8", "fp64", "int8"],
+            "matrix_size": 8192,
+            "warmup": 50,
+            "iterations": 500,
+            "use_compile": True,
         },
     },
-    "health": {"temp_warning": 80, "temp_critical": 90, "power_limit": None},
+    "health": {"temp_warning": 75, "temp_critical": 85, "power_limit": None},
     "nccl": {
         "min_bandwidth_gbps": None,
         "test_allreduce": True,
         "test_alltoall": True,
         "test_broadcast": True,
-        "test_reduce_scatter": False,
-        "test_allgather": False,
-        "test_sendrecv": False,
+        "test_reduce_scatter": True,
+        "test_allgather": True,
+        "test_sendrecv": True,
+        "message_sizes": ["1M", "256M", "2G"],
+        "repeats": 3,
+        "max_stddev_pct": 3,
     },
     "stress": {
-        "duration_sec": 60,
+        "duration_sec": 1800,
+        "production_duration_sec": 1800,
+        "use_gpu_burn": False,
         "use_doubles": False,
         "use_tensor_cores": True,
         "memory_pct": 90,
         "gpus": "all",
+        "dtype": "bf16",
+        "matrix_size": 24576,
+        "telemetry_interval_sec": 1,
+        "warmup_sec": 60,
+        "min_steady_samples": 10,
+        "max_temp_c": 80,
+        "max_temp_delta_c": 5,
+        "min_power_watts": 630,
+        "max_tflops_jitter_pct": 5,
+        "require_tflops_jitter": True,
     },
     "rdma": {
-        "min_bandwidth_gbps": 50,
-        "max_latency_us": 10,
+        "min_bandwidth_gbps": 47,
+        "min_port_rate_gbps": 400,
+        "max_latency_us": 3.5,
+        "max_write_latency_us": 2.0,
+        "max_read_latency_us": 3.5,
         "ib_iterations": 1000,
-        "msg_size": 65536,
+        "msg_size": 4194304,
+        "latency_msg_size": 8,
         "ib_device": None,
         "ib_port": 1,
+        "server_addr": None,
+        "ibping_target": None,
+        "ibping_count": 5,
+        "role": "auto",
+        "pfc_ecn_counters": True,
+    },
+    "nvlink": {
+        "expected_links_per_gpu": 18,
+        "expected_link_speed_gbps": 25,
+        "require_zero_errors": True,
+    },
+    "dcgm": {
+        "diag_level": 3,
+        "timeout_sec": 1200,
+        "expected_num_gpus": 8,
+        "json_output": True,
+        "require_subtests": True,
     },
     "training": {
-        "model": "gpt2",
+        "model": "synthetic_1.5b",
         "batch_size": 8,
         "seq_length": 2048,
         "num_steps": 50,
+        "warmup_steps": 5,
         "dtype": "bf16",
+        "mode": "ddp",
+        "synthetic_params_b": 1.5,
+        "min_tokens_per_sec": 45000,
+        "max_step_jitter_pct": 3,
+        "max_peak_memory_gb": 70,
+        "require_distributed": True,
     },
     "report": {"output_dir": "./reports", "format": "json"},
     "tools": {"install_dir": "/opt/gpu-test-tools"},
@@ -131,7 +178,7 @@ def interactive_menu(config: dict):
     if not check_prerequisites(console):
         return
-    results_store: dict = {"timestamp": datetime.now().isoformat(), "tests": {}}
+    results_store: dict = {"timestamp": datetime.now().isoformat(), "hostname": socket.gethostname(), "tests": {}}
     menu_items = [
         ("1", "GPU Information", "gpu_info"),
@@ -139,10 +186,12 @@ def interactive_menu(config: dict):
         ("3", "Memory Benchmark (nvbandwidth)", "memory_bench"),
         ("4", "Compute Benchmark", "compute_bench"),
         ("5", "NCCL Multi-GPU Test", "nccl"),
-        ("6", "GPU Stress Test (gpu-burn)", "stress"),
+        ("6", "GPU Stress Test (PyTorch/gpu-burn)", "stress"),
         ("7", "RDMA/IB Test", "rdma"),
-        ("8", "Training Simulation", "training"),
-        ("9", "Full Test Suite (All Tests)", "all"),
+        ("8", "NVLink/NVSwitch Test", "nvlink"),
+        ("9", "DCGM Diagnostic", "dcgm"),
+        ("10", "Training Simulation", "training"),
+        ("11", "Full Test Suite (All Tests)", "all"),
         ("0", "Generate Report", "report"),
     ]
@@ -164,8 +213,10 @@ def interactive_menu(config: dict):
         "memory_bench": "HBM bandwidth via nvbandwidth",
         "compute_bench": "GEMM TFLOPS across FP32/TF32/FP16/BF16/FP8",
         "nccl": "AllReduce, AllToAll, Broadcast via nccl-tests",
-        "stress": "Long-running GPU stress via gpu-burn",
+        "stress": "Long-running high-power GEMM stress with telemetry",
         "rdma": "InfiniBand bandwidth & latency (ib_write_bw)",
+        "nvlink": "NVLink links, speed, and error counters",
+        "dcgm": "DCGM diag -r 3 production diagnostic",
         "training": "Simulate LLM training with PyTorch",
         "all": "Run all tests sequentially",
         "report": "Export results to JSON/HTML",
@@ -257,6 +308,18 @@ def _run_test(test_name: str, config: dict, console: Console) -> dict:
         m.print_results(result)
         return result
+    elif test_name == "nvlink":
+        m = NVLinkTest(config)
+        result = m.run()
+        m.print_results(result)
+        return result
+    elif test_name == "dcgm":
+        m = DCGMTest(config)
+        result = m.run()
+        m.print_results(result)
+        return result
     elif test_name == "training":
         m = TrainingSim(config)
         result = m.run()
@@ -280,15 +343,17 @@ def _run_test(test_name: str, config: dict, console: Console) -> dict:
 def _run_full_suite(config: dict, console: Console) -> dict:
     """Run all tests sequentially."""
     console.print(Panel("[bold cyan]Running Full Test Suite[/bold cyan]", box=box.DOUBLE))
-    all_results: dict = {"timestamp": datetime.now().isoformat()}
+    all_results: dict = {"timestamp": datetime.now().isoformat(), "hostname": socket.gethostname()}
     tests = [
         ("gpu_info", "GPU Information", GPUInfo),
         ("health", "Health Check", HealthCheck),
         ("memory_bench", "Memory Benchmark", lambda c: Benchmark(c)),
         ("compute_bench", "Compute Benchmark", lambda c: Benchmark(c)),
+        ("nvlink", "NVLink/NVSwitch Test", NVLinkTest),
         ("nccl", "NCCL Test", NCCLTest),
         ("stress", "GPU Stress Test", StressTest),
         ("rdma", "RDMA/IB Test", RDMATest),
+        ("dcgm", "DCGM Diagnostic", DCGMTest),
         ("training", "Training Simulation", TrainingSim),
     ]
@@ -313,14 +378,49 @@ def _run_full_suite(config: dict, console: Console) -> dict:
     # Summary
     console.print("\n" + "=" * 60)
     # Only count test results, exclude metadata like timestamp
-    test_results = {k: v for k, v in all_results.items() if k != "timestamp"}
-    passed = sum(1 for v in test_results.values() if not isinstance(v, dict) or "error" not in v)
+    test_results = {k: v for k, v in all_results.items() if k not in ("timestamp", "hostname")}
+    passed = sum(1 for v in test_results.values() if _test_result_passed(v))
     total = len(test_results)
     color = "green" if passed == total else ("yellow" if passed > 0 else "red")
     console.print(f"[bold {color}]Suite complete: {passed}/{total} tests passed[/bold {color}]")
     return all_results
+def _test_result_passed(result) -> bool:
+    """Strict production verdict helper for full-suite exit status."""
+    if not isinstance(result, dict):
+        return True
+    if result.get("error"):
+        return False
+    if result.get("skipped") or result.get("status") == "SKIP":
+        return False
+    if result.get("source") == "torchrun_fallback":
+        return False
+    if "passed" in result:
+        return bool(result.get("passed"))
+    if "memory" in result:
+        mem = result["memory"]
+        if isinstance(mem, dict) and "passed" in mem:
+            return bool(mem.get("passed"))
+        if mem.get("error") or mem.get("source") == "pytorch":
+            return False
+        eff = mem.get("d2d_efficiency_pct") or mem.get("efficiency_pct") or 0
+        return eff >= 80
+    if "compute" in result:
+        comp = result["compute"]
+        if isinstance(comp, dict) and "passed" in comp:
+            return bool(comp.get("passed"))
+        thresholds = comp.get("pass_thresholds_tflops", {}) or {}
+        per_dtype = comp.get("per_dtype_tflops", {})
+        for dt, threshold in thresholds.items():
+            val = per_dtype.get(dt)
+            if not isinstance(val, (int, float)) or val < threshold:
+                return False
+        consistency = comp.get("consistency", {})
+        return not any(not c.get("passed", False) for c in consistency.values())
+    return True
 def main():
     gpu_list_str = " / ".join(g.upper() for g in get_supported_gpus())
     parser = argparse.ArgumentParser(
@@ -335,15 +435,17 @@ Examples:
   python gpu_tester.py --test benchmark --type memory
   python gpu_tester.py --test benchmark --type compute --dtype fp16
   python gpu_tester.py --test nccl # NCCL test
+  python gpu_tester.py --test nvlink # NVLink/NVSwitch test
+  python gpu_tester.py --test dcgm # DCGM diagnostic
   python gpu_tester.py --test training # Training sim
   python gpu_tester.py --test all # Full suite
   python gpu_tester.py --report --format json --output report.json
 """,
     )
-    parser.add_argument("--test", choices=["gpu-info", "health", "benchmark", "nccl", "stress", "rdma", "training", "all"],
+    parser.add_argument("--test", choices=["gpu-info", "health", "benchmark", "nccl", "stress", "rdma", "nvlink", "dcgm", "training", "all"],
                         help="Run a specific test")
     parser.add_argument("--type", choices=["memory", "compute"], help="Benchmark type (with --test benchmark)")
-    parser.add_argument("--dtype", choices=["fp32", "tf32", "fp16", "bf16", "fp8"],
+    parser.add_argument("--dtype", choices=["fp32", "tf32", "fp16", "bf16", "fp8", "fp64", "int8"],
                         help="Compute benchmark dtype (with --test benchmark --type compute)")
     parser.add_argument("--interactive", action="store_true", help="Force interactive mode")
     parser.add_argument("--report", action="store_true", help="Generate report from last results")
@@ -399,6 +501,8 @@ Examples:
         "nccl": "nccl",
         "stress": "stress",
         "rdma": "rdma",
+        "nvlink": "nvlink",
+        "dcgm": "dcgm",
         "training": "training",
         "all": "all",
     }
@@ -415,19 +519,30 @@ Examples:
             result = bench.run()
             Benchmark.print_results(result)
             if args.report:
-                ReportGenerator(config).generate({"benchmark": result, "timestamp": datetime.now().isoformat()},
+                ReportGenerator(config).generate({
+                    "benchmark": result,
+                    "timestamp": datetime.now().isoformat(),
+                    "hostname": socket.gethostname(),
+                },
                                                  fmt=args.format, output=args.output)
+            sys.exit(0 if _test_result_passed(result) else 1)
         elif args.test == "all":
             results = _run_full_suite(config, console)
             if args.report:
                 ReportGenerator(config).generate(results, fmt=args.format, output=args.output)
-            has_errors = any("error" in v for v in results.values() if isinstance(v, dict))
-            sys.exit(1 if has_errors else 0)
+            failed = any(not _test_result_passed(v) for k, v in results.items() if k not in ("timestamp", "hostname"))
+            sys.exit(1 if failed else 0)
         else:
             result = _run_test(test_map[args.test], config, console)
             if args.report and result:
-                ReportGenerator(config).generate({args.test: result, "timestamp": datetime.now().isoformat()},
+                report_key = test_map[args.test] or args.test
+                ReportGenerator(config).generate({
+                    report_key: result,
+                    "timestamp": datetime.now().isoformat(),
+                    "hostname": socket.gethostname(),
+                },
                                                  fmt=args.format, output=args.output)
+            sys.exit(0 if _test_result_passed(result) else 1)
 if __name__ == "__main__":

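The strict exit-status rule introduced in this diff (a skipped test, a recorded error, or a torchrun-only NCCL fallback each count as a failure, and metadata keys are ignored) can be illustrated with a simplified stand-in. The names `strict_passed` and `suite_exit_code` are hypothetical. The sketch deliberately omits the memory/compute benchmark branches of `_test_result_passed`, so it is not a drop-in replacement.

```python
def strict_passed(result) -> bool:
    """Simplified stand-in for the strict verdict rules (illustrative only)."""
    if not isinstance(result, dict):
        return True
    if result.get("error"):
        return False
    if result.get("skipped") or result.get("status") == "SKIP":
        return False
    if result.get("source") == "torchrun_fallback":
        return False
    return bool(result.get("passed", True))


def suite_exit_code(results: dict) -> int:
    """Return 1 if any non-metadata entry fails the strict check, else 0."""
    metadata = ("timestamp", "hostname")
    failed = any(not strict_passed(v) for k, v in results.items() if k not in metadata)
    return 1 if failed else 0


if __name__ == "__main__":
    sample = {
        "timestamp": "2026-05-23T10:41:09",
        "hostname": "node-a",  # made-up host name
        "health": {"passed": True},
        "rdma": {"status": "SKIP", "error": "ib_write_bw not found"},
    }
    print(suite_exit_code(sample))
```

Treating SKIP as a failure is the conservative choice for acceptance: a missing tool then cannot silently turn into a green suite.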
modules/dcgm_test.py (new file, 231 lines added)

@@ -0,0 +1,231 @@
"""DCGM diagnostic acceptance wrapper."""

import json
import os
import re
import shutil
import signal
import subprocess
from datetime import datetime
from typing import Optional

from rich.console import Console
from rich.table import Table


class DCGMTest:
    def __init__(self, config: dict):
        self.config = config
        self.console = Console()
        self.cfg = config.get("dcgm", {})

    def run(self) -> dict:
        dcgmi = shutil.which("dcgmi")
        if not dcgmi:
            return {
                "passed": False,
                "error": "dcgmi not found",
                "timestamp": datetime.now().isoformat(),
            }

        level = str(self.cfg.get("diag_level", 3))
        timeout = int(self.cfg.get("timeout_sec", 1200))
        cmd = [dcgmi, "diag", "-r", level]
        expected_gpus = self.cfg.get("expected_num_gpus")
        if expected_gpus:
            cmd.extend(["-n", f"gpu:{int(expected_gpus)}"])
        if self.cfg.get("json_output", True):
            cmd.append("-j")

        try:
            r = self._run_with_process_group_timeout(cmd, timeout)
        except subprocess.TimeoutExpired as e:
            output = ((e.output or "") + "\n" + (e.stderr or "")).strip()
            return {
                "passed": False,
                "error": f"dcgmi diag -r {level} timeout after {timeout}s",
                "command": cmd,
                "raw_output_tail": output[-8000:],
                "timestamp": datetime.now().isoformat(),
            }

        output = r.stdout + "\n" + r.stderr
        subtests = self._parse_json_output(output) or self._parse_output(output)
        # Strict acceptance: anything other than PASS (WARN, SKIP, ERROR, FAIL) fails.
        strict_statuses = {"PASS"}
        failed = [s for s in subtests if s["status"] not in strict_statuses]
        require_subtests = bool(self.cfg.get("require_subtests", True))
        passed = r.returncode == 0 and not failed and (bool(subtests) or not require_subtests)
        return {
            "passed": passed,
            "returncode": r.returncode,
            "level": int(level),
            "command": cmd,
            "expected_num_gpus": int(expected_gpus) if expected_gpus else None,
            "subtests": subtests,
            "raw_output_tail": output[-8000:],
            "timestamp": datetime.now().isoformat(),
        }

    @staticmethod
    def _run_with_process_group_timeout(cmd: list[str], timeout: int) -> subprocess.CompletedProcess:
        # Run in a new session so the whole process group can be signalled on timeout.
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True,
            start_new_session=True,
        )
        try:
            stdout, stderr = proc.communicate(timeout=timeout)
        except subprocess.TimeoutExpired as e:
            try:
                os.killpg(proc.pid, signal.SIGTERM)
                stdout, stderr = proc.communicate(timeout=10)
            except subprocess.TimeoutExpired:
                # SIGTERM was not enough; escalate to SIGKILL.
                os.killpg(proc.pid, signal.SIGKILL)
                stdout, stderr = proc.communicate(timeout=10)
            raise subprocess.TimeoutExpired(cmd, timeout, output=stdout, stderr=stderr) from e
        return subprocess.CompletedProcess(cmd, proc.returncode, stdout, stderr)

    @classmethod
    def _parse_json_output(cls, output: str) -> list[dict]:
        text = output.strip()
        if not text:
            return []
        try:
            payload = json.loads(text)
        except json.JSONDecodeError:
            # Fall back to the outermost {...} span when extra text surrounds the JSON.
            m = re.search(r"(\{.*\})", text, re.S)
            if not m:
                return []
            try:
                payload = json.loads(m.group(1))
            except json.JSONDecodeError:
                return []

        dcgm_payload = payload.get("DCGM Diagnostic") if isinstance(payload, dict) else None
        if isinstance(dcgm_payload, dict):
            parsed = cls._parse_dcgm_diagnostic_json(dcgm_payload)
            if parsed:
                return parsed

        subtests = []

        def walk(node, path: list[str]):
            if isinstance(node, dict):
                node_name = (
                    node.get("name")
                    or node.get("testName")
                    or node.get("test_name")
                    or node.get("category")
                    or node.get("category_name")
                )
                child_path = [*path, str(node_name)] if node_name else path
                status = node.get("status") or node.get("result") or node.get("Result")
                if isinstance(status, str):
                    name = (
                        node_name
                        or " / ".join(path[-3:])
                    )
                    normalized = cls._normalize_status(status)
                    if normalized:
                        subtests.append({
                            "name": str(name)[:160],
                            "status": normalized,
                            "raw": json.dumps(node, default=str)[:1000],
                        })
                for key, value in node.items():
                    walk(value, [*child_path, str(key)])
            elif isinstance(node, list):
                for idx, item in enumerate(node):
                    walk(item, [*path, str(idx)])

        walk(payload, [])
        return subtests

    @classmethod
    def _parse_dcgm_diagnostic_json(cls, payload: dict) -> list[dict]:
        subtests = []
        for category in payload.get("test_categories", []) or []:
            category_name = str(category.get("category") or "DCGM")
            for test in category.get("tests", []) or []:
                test_name = str(test.get("name") or "unnamed")
                for result in test.get("results", []) or []:
                    status = cls._normalize_status(str(result.get("status", "")))
                    if not status:
                        continue
                    entity_group = result.get("entity_group") or "entity"
                    entity_id = result.get("entity_id", "unknown")
                    name = f"{category_name}/{test_name}/{entity_group}{entity_id}"
                    subtests.append({
                        "name": name[:160],
                        "status": status,
                        "raw": json.dumps(result, default=str)[:1000],
                    })
                summary = test.get("test_summary") or {}
                status = cls._normalize_status(str(summary.get("status", "")))
                if status:
                    subtests.append({
                        "name": f"{category_name}/{test_name}/summary"[:160],
                        "status": status,
                        "raw": json.dumps(summary, default=str)[:1000],
                    })
        return subtests

    @staticmethod
    def _normalize_status(status: str) -> str:
        s = status.strip().upper()
        aliases = {
            "PASS": "PASS",
            "PASSED": "PASS",
            "OK": "PASS",
            "FAIL": "FAIL",
            "FAILED": "FAIL",
            "ERROR": "ERROR",
            "WARN": "WARN",
            "WARNING": "WARN",
            "SKIP": "SKIP",
            "SKIPPED": "SKIP",
            "NOT_RUN": "SKIP",
            "NOT RUN": "SKIP",
        }
        return aliases.get(s, s if s in {"PASS", "FAIL", "ERROR", "WARN", "SKIP"} else "")

    @staticmethod
    def _parse_output(output: str) -> list[dict]:
        subtests = []
        for line in output.splitlines():
            stripped = line.strip()
            if not stripped:
                continue
            # Preferred form: "<name> : <STATUS>"; fallback form: "<STATUS> - <name>".
            m = re.search(r"(.+?)\s*[:|]\s*(PASS|FAIL|WARN|ERROR|SKIP)\b", stripped, re.I)
            if not m:
                m = re.search(r"\b(PASS|FAIL|WARN|ERROR|SKIP)\b\s*[-:|]\s*(.+)", stripped, re.I)
                if m:
                    status = DCGMTest._normalize_status(m.group(1))
                    name = m.group(2).strip()
                else:
                    continue
            else:
                name = m.group(1).strip(" .|-")
                status = DCGMTest._normalize_status(m.group(2))
            if name and len(name) < 160:
                subtests.append({"name": name, "status": status, "raw": stripped})
        return subtests

    @staticmethod
    def print_results(results: dict, console: Optional[Console] = None):
        c = console or Console()
        if results.get("error"):
            c.print(f"[bold red]DCGM error: {results['error']}[/bold red]")
            return
        passed = results.get("passed", False)
        c.print("[bold green]✓ DCGM diag PASSED[/bold green]" if passed else "[bold red]✗ DCGM diag FAILED[/bold red]")
        subtests = results.get("subtests", [])
        if subtests:
            table = Table(box=None, padding=(0, 1))
            table.add_column("Subtest")
            table.add_column("Status", style="bold")
            for s in subtests:
                table.add_row(s.get("name", ""), s.get("status", ""))
            c.print(table)

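The pass rule in `DCGMTest.run` combines three conditions: a zero return code, no subtest with a status other than PASS, and (by default) at least one parsed subtest. The sketch below restates that rule as a standalone function so the edge cases are easy to see. `dcgm_verdict` and the subtest names in the usage lines are hypothetical.

```python
def dcgm_verdict(returncode: int, subtests: list, require_subtests: bool = True) -> bool:
    """Strict verdict: rc == 0, every subtest is PASS, and evidence was parsed."""
    failed = [s for s in subtests if s["status"] != "PASS"]
    have_evidence = bool(subtests) or not require_subtests
    return returncode == 0 and not failed and have_evidence


if __name__ == "__main__":
    ok = [{"name": "example/subtest-a", "status": "PASS"}]
    warn = ok + [{"name": "example/subtest-b", "status": "WARN"}]
    print(dcgm_verdict(0, ok))    # all PASS with rc 0
    print(dcgm_verdict(0, warn))  # WARN is not accepted under the strict rule
    print(dcgm_verdict(0, []))    # nothing parsed, so nothing was proven
```

Requiring at least one parsed subtest guards against output that could not be parsed being counted as a pass just because the process exited with 0.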

@@ -171,6 +171,10 @@ class HealthCheck:
             gpu_health.append({"index": i, "status": worst, "checks": checks})
         system_health = self._check_system()
+        for key in ("fabricmanager", "retired_pages", "kernel_errors"):
+            item = system_health.get(key, {})
+            if isinstance(item, dict) and item.get("status") == "FAIL":
+                overall_pass = False
         return {
             "passed": overall_pass,
@@ -228,6 +232,9 @@ class HealthCheck:
             rdma_devs = os.listdir("/sys/class/infiniband_verbs")
         nccl_env = {k: v for k, v in os.environ.items() if k.startswith("NCCL_")}
+        fabric = self._check_fabricmanager()
+        retired = self._check_retired_pages()
+        kernel_errors = self._check_kernel_errors()
         return {
             "nvidia_persistenced": {"installed": persistd, "running": persistd_running},
@@ -238,6 +245,41 @@ class HealthCheck:
             "infiniband_devices": ib_devs,
             "rdma_devices": rdma_devs,
             "nccl_env_vars": nccl_env,
+            "fabricmanager": fabric,
+            "retired_pages": retired,
+            "kernel_errors": kernel_errors,
+        }
+    def _check_fabricmanager(self) -> dict:
+        r = self._run_cmd(["systemctl", "is-active", "nvidia-fabricmanager"], timeout=5)
+        active = r == "active"
+        logs = self._run_cmd(["journalctl", "-u", "nvidia-fabricmanager", "-n", "200", "--no-pager"], timeout=10) or ""
+        has_error = "ERROR" in logs.upper() or "FAILED" in logs.upper()
+        return {
+            "active": active,
+            "has_error_logs": has_error,
+            "status": "PASS" if active and not has_error else "FAIL",
+        }
+    def _check_retired_pages(self) -> dict:
+        import re  # local import; a module-level `import re` is not visible in this hunk
+        raw = self._run_cmd(["nvidia-smi", "-q", "-d", "PAGE_RETIREMENT"], timeout=30) or ""
+        nums = [int(x) for x in re.findall(r"Retired Pages.*?:\s*(\d+)", raw, flags=re.I)]
+        # Require "Yes" on the same line as the field, not anywhere in the output.
+        pending = bool(re.search(r"Pending Page Blacklist\s*:\s*Yes", raw, flags=re.I))
+        total = sum(nums)
+        return {
+            "retired_pages": total,
+            "pending_blacklist": pending,
+            "status": "PASS" if total == 0 and not pending else "FAIL",
+        }
+    def _check_kernel_errors(self) -> dict:
+        raw = self._run_cmd(["dmesg", "--ctime", "--level=err,crit,alert,emerg"], timeout=10) or ""
+        hits = [line for line in raw.splitlines() if any(k in line.upper() for k in ("XID", "AER", "PCIE", "NVRM"))]
+        return {
+            "count": len(hits),
+            "tail": hits[-20:],
+            "status": "PASS" if not hits else "FAIL",
         }
     @staticmethod


@@ -5,6 +5,8 @@ import os
 import re
 import shutil
 import subprocess
+import statistics
+import sys
 from datetime import datetime
 from typing import Optional
@@ -70,6 +72,38 @@ class NCCLTest:
                 return p
         return None
+    def _message_sizes(self) -> list[str]:
+        return list(self.nccl_cfg.get("message_sizes") or ["1M", "256M", "2G"])
+    def _repeats(self) -> int:
+        return int(self.nccl_cfg.get("repeats", 3))
+    def _max_stddev_pct(self) -> float:
+        return float(self.nccl_cfg.get("max_stddev_pct", 3))
+    def _runtime_env(self) -> dict:
+        env = {**os.environ, "NCCL_DEBUG": "WARN"}
+        lib_dirs = []
+        nccl_home = env.get("NCCL_HOME") or self.nccl_cfg.get("nccl_home")
+        if nccl_home:
+            lib_dirs.append(os.path.join(str(nccl_home), "lib"))
+        for path in sys.path:
+            lib_dirs.append(os.path.join(path, "nvidia", "nccl", "lib"))
+        venv_root = os.path.dirname(os.path.dirname(sys.executable))
+        lib_dirs.extend(glob.glob(os.path.join(venv_root, "lib", "python*", "site-packages", "nvidia", "nccl", "lib")))
+        existing = env.get("LD_LIBRARY_PATH", "")
+        valid_dirs = []
+        for d in lib_dirs:
+            if d and os.path.isdir(d) and d not in valid_dirs:
+                valid_dirs.append(d)
+        if valid_dirs:
+            env["LD_LIBRARY_PATH"] = ":".join(valid_dirs + ([existing] if existing else []))
+        return env
     def run(self) -> dict:
         gpu_count = 0
         if TORCH_AVAILABLE:
@@ -89,7 +123,7 @@ class NCCLTest:
         if self.nccl_cfg.get("test_reduce_scatter", False):
             tests.append(("reduce_scatter_perf", "ReduceScatter"))
         if self.nccl_cfg.get("test_allgather", False):
-            tests.append(("allgather_perf", "AllGather"))
+            tests.append(("all_gather_perf", "AllGather"))
         if self.nccl_cfg.get("test_sendrecv", False):
             tests.append(("sendrecv_perf", "SendRecv"))
@@ -170,39 +204,7 @@ class NCCLTest:
         if not binary:
             return {"status": "SKIP", "error": f"{binary_name} not found"}
-        cmd = [
-            binary,
-            "-b", "8M",
-            "-e", "8G",
-            "-f", "2",
-            "-g", str(gpu_count),
-            "-w", "5",
-            "-n", "20",
-        ]
-        try:
-            env = os.environ.copy()
-            env["NCCL_DEBUG"] = "WARN"
-            r = subprocess.run(cmd, capture_output=True, text=True, timeout=180, env=env)
-            combined = r.stdout + r.stderr
-            # Check for NCCL/CUDA compatibility errors
-            if "CUDA driver version is insufficient" in combined or \
-               "Test NCCL failure" in combined:
-                error_msg = "NCCL/CUDA driver version mismatch" \
-                    if "CUDA driver version" in combined \
-                    else "NCCL test failure (library incompatibility)"
-                return {"status": "FAIL", "error": error_msg}
-            if r.returncode != 0:
-                return {"status": "FAIL", "error": r.stderr[:300]}
-            return self._parse_nccl_output(r.stdout, min_bw)
-        except subprocess.TimeoutExpired:
-            return {"status": "FAIL", "error": "timeout"}
-        except Exception as e:
-            return {"status": "FAIL", "error": str(e)}
+        return self._run_nccl_matrix([binary, "-g", str(gpu_count)], min_bw)
     def _run_one_nccl_test_mpirun(self, binary_name: str, label: str,
                                   gpu_count: int, mpirun: str, min_bw: float) -> dict:
@@ -218,37 +220,64 @@ class NCCLTest:
             "-x", "NCCL_DEBUG=WARN",
             "-x", "CUDA_VISIBLE_DEVICES=" + ",".join(str(i) for i in range(gpu_count)),
             binary,
-            "-b", "8",
-            "-e", "256M",
-            "-f", "2",
             "-g", "1",
-            "-w", "5",
-            "-n", "20",
         ]
+        return self._run_nccl_matrix(cmd, min_bw)
+    def _run_nccl_matrix(self, base_cmd: list[str], min_bw: float) -> dict:
+        size_results = []
+        failures = []
+        env = self._runtime_env()
         try:
-            env = os.environ.copy()
-            env["NCCL_DEBUG"] = "WARN"
-            r = subprocess.run(cmd, capture_output=True, text=True, timeout=180, env=env)
-            combined = r.stdout + r.stderr
-            if "CUDA driver version is insufficient" in combined or \
-               "Test NCCL failure" in combined:
-                error_msg = "NCCL/CUDA driver version mismatch" \
-                    if "CUDA driver version" in combined \
-                    else "NCCL test failure (library incompatibility)"
-                return {"status": "FAIL", "error": error_msg}
-            if r.returncode != 0:
-                return {"status": "FAIL", "error": r.stderr[:300]}
-            return self._parse_nccl_output(r.stdout, min_bw)
+            for size in self._message_sizes():
+                runs = []
+                for _ in range(self._repeats()):
+                    cmd = [*base_cmd, "-b", size, "-e", size, "-f", "2", "-w", "5", "-n", "20"]
+                    r = subprocess.run(cmd, capture_output=True, text=True, timeout=300, env=env)
+                    combined = r.stdout + r.stderr
+                    if "CUDA driver version is insufficient" in combined or "Test NCCL failure" in combined:
+                        failures.append({"size": size, "error": "NCCL/CUDA/library failure"})
+                        continue
+                    if r.returncode != 0:
+                        failures.append({"size": size, "error": r.stderr[:300]})
+                        continue
+                    parsed = self._parse_nccl_output(r.stdout, min_bw)
+                    runs.append(parsed.get("best_busbw_gbps", 0))
+                if runs:
+                    worst = min(runs)
+                    mean = sum(runs) / len(runs)
+                    std_pct = (statistics.pstdev(runs) / mean * 100) if len(runs) > 1 and mean else 0
+                    size_results.append({
+                        "size": size,
+                        "runs_busbw_gbps": [round(v, 1) for v in runs],
+                        "worst_busbw_gbps": round(worst, 1),
+                        "mean_busbw_gbps": round(mean, 1),
+                        "stddev_pct": round(std_pct, 2),
+                        "status": "PASS" if worst >= min_bw and std_pct <= self._max_stddev_pct() else "FAIL",
+                    })
+                else:
+                    size_results.append({"size": size, "status": "FAIL", "runs_busbw_gbps": []})
         except subprocess.TimeoutExpired:
             return {"status": "FAIL", "error": "timeout"}
         except Exception as e:
             return {"status": "FAIL", "error": str(e)}
+        best_bus = max((r.get("mean_busbw_gbps", 0) for r in size_results), default=0)
+        worst_bus = min((r.get("worst_busbw_gbps", 0) for r in size_results if r.get("runs_busbw_gbps")), default=0)
+        passed = bool(size_results) and all(r.get("status") == "PASS" for r in size_results) and not failures
+        return {
+            "status": "PASS" if passed else "FAIL",
+            "best_busbw_gbps": round(best_bus, 1),
+            "worst_busbw_gbps": round(worst_bus, 1),
+            "min_required_gbps": min_bw,
+            "max_stddev_pct": self._max_stddev_pct(),
+            "by_size": size_results,
+            "failures": failures,
+        }
     @staticmethod
     def _parse_nccl_output(stdout: str, min_bw: float) -> dict:
         """Parse nccl-tests tabular output and extract bandwidth results."""
@@ -363,7 +392,7 @@ dist.destroy_process_group()
             r = subprocess.run(
                 [torchrun_cmd, f"--nproc_per_node={gpu_count}", tmp.name],
                 capture_output=True, text=True, timeout=120,
-                env={**os.environ, "NCCL_DEBUG": "WARN"},
+                env=self._runtime_env(),
             )
             os.unlink(tmp.name)
@@ -390,10 +419,15 @@ dist.destroy_process_group()
             }
             return {
-                "passed": all_passed,
+                # torchrun fallback is a functional smoke only. It never proves
+                # production bus bandwidth, so it must not satisfy acceptance.
+                "passed": False,
+                "functional_passed": all_passed,
                 "source": "torchrun_fallback",
                 "tests": tests,
                 "gpu_count": gpu_count,
+                "error": None if all_passed else "torchrun functional NCCL smoke failed",
+                "acceptance_gap": "nccl-tests bus bandwidth was not measured",
             }
         except Exception as e:
             return {"passed": False, "source": "torchrun_fallback", "error": str(e)}
@@ -410,7 +444,8 @@ dist.destroy_process_group()
         if source == "torchrun_fallback":
             # Connectivity check mode
-            verdict = "[bold green]✓ NCCL Connectivity OK[/bold green]" if passed else "[bold red]✗ NCCL Connectivity FAILED[/bold red]"
+            functional = results.get("functional_passed", passed)
+            verdict = "[bold yellow]⚠ NCCL bus BW NOT VERIFIED[/bold yellow]" if functional else "[bold red]✗ NCCL Connectivity FAILED[/bold red]"
             c.print(f"{verdict} [dim](basic check via torchrun)[/dim]")
             tests = results.get("tests", {})
@@ -427,7 +462,7 @@ dist.destroy_process_group()
                 else:
                     c.print(f" [{s_color}]{op_name}[/{s_color}]")
-            c.print("\n[yellow]Note: functional connectivity test only (no performance data)[/yellow]")
+            c.print("\n[yellow]Note: functional connectivity test only (no bus bandwidth data; acceptance FAIL)[/yellow]")
         else:
             # nccl-tests mode
             verdict = "[bold green]✓ NCCL tests PASSED[/bold green]" if passed else "[bold yellow]⚠ NCCL tests WARNING[/bold yellow]"
@ -448,12 +483,16 @@ dist.destroy_process_group()
if by_size: if by_size:
t = Table(box=None, padding=(0, 1)) t = Table(box=None, padding=(0, 1))
t.add_column("Size", style="bold", justify="right") t.add_column("Size", style="bold", justify="right")
t.add_column("Time (us)", justify="right") t.add_column("Worst Bus BW", justify="right")
t.add_column("Alg BW (GB/s)", justify="right") t.add_column("Mean Bus BW", justify="right")
t.add_column("Bus BW (GB/s)", justify="right") t.add_column("StdDev", justify="right")
t.add_column("Status", justify="right")
for r in by_size: for r in by_size:
sz = r.get("size", 0) t.add_row(
sz_str = f"{sz/1024:.0f}K" if sz < 1048576 else f"{sz/1048576:.0f}M" str(r.get("size", "")),
t.add_row(sz_str, f"{r.get('time_us',0):.1f}", f"{r.get('worst_busbw_gbps', 0):.1f}",
f"{r.get('algbw_gbps',0):.1f}", f"{r.get('busbw_gbps',0):.1f}") f"{r.get('mean_busbw_gbps', 0):.1f}",
f"{r.get('stddev_pct', 0):.2f}%",
r.get("status", "?"),
)
c.print(t) c.print(t)
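
The reworked by-size table reads `worst_busbw_gbps`, `mean_busbw_gbps`, `stddev_pct`, and `status` for each message size. The code that fills those fields is not part of this hunk, so the following is only a sketch of one plausible derivation from repeated runs. The helper name `summarize_size`, the use of the population standard deviation, and the 405 GB/s threshold in the usage line are assumptions for illustration, not the commit's implementation.

```python
import statistics


def summarize_size(size, runs_busbw_gbps, min_required_gbps):
    """Aggregate repeated bus-bandwidth runs for one message size (hypothetical helper)."""
    worst = min(runs_busbw_gbps)
    mean = statistics.mean(runs_busbw_gbps)
    # Relative spread of the runs as a percentage of the mean (assumed definition).
    stddev_pct = statistics.pstdev(runs_busbw_gbps) / mean * 100 if mean else 0.0
    return {
        "size": size,
        "runs_busbw_gbps": list(runs_busbw_gbps),
        "worst_busbw_gbps": worst,
        "mean_busbw_gbps": mean,
        "stddev_pct": stddev_pct,
        # Strict acceptance: the slowest run has to clear the threshold.
        "status": "PASS" if worst >= min_required_gbps else "FAIL",
    }


row = summarize_size("256M", [420.0, 422.0, 418.0], 405.0)
print(
    f"{row['size']}: worst={row['worst_busbw_gbps']:.1f} "
    f"mean={row['mean_busbw_gbps']:.1f} stddev={row['stddev_pct']:.2f}% {row['status']}"
)
```

Judging each size by its worst run rather than its best is what lets a small message size fail even when large sizes pass comfortably.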

188
modules/nvlink_test.py Normal file

@@ -0,0 +1,188 @@
+"""NVLink / NVSwitch production acceptance checks."""
+
+import re
+import shutil
+import subprocess
+from datetime import datetime
+from typing import Optional
+
+from rich.console import Console
+from rich.table import Table
+
+
+class NVLinkTest:
+    def __init__(self, config: dict):
+        self.config = config
+        self.console = Console()
+        self.cfg = config.get("nvlink", {})
+
+    def _run(self, args: list[str], timeout: int = 60) -> tuple[int, str, str]:
+        if not shutil.which("nvidia-smi"):
+            return 127, "", "nvidia-smi not found"
+        r = subprocess.run(["nvidia-smi", *args], capture_output=True, text=True, timeout=timeout)
+        return r.returncode, r.stdout, r.stderr
+
+    def run(self) -> dict:
+        expected_links = int(self.cfg.get("expected_links_per_gpu", 18))
+        expected_speed = float(self.cfg.get("expected_link_speed_gbps", 25))
+        require_zero_errors = bool(self.cfg.get("require_zero_errors", True))
+
+        rc_s, out_s, err_s = self._run(["nvlink", "-s"])
+        rc_c, out_c, err_c = self._run(["nvlink", "-c"])
+        rc_e, out_e, err_e = self._run(["nvlink", "-e"])
+        if rc_s != 0:
+            return {
+                "passed": False,
+                "error": (err_s or out_s or "nvidia-smi nvlink -s failed")[:1000],
+                "timestamp": datetime.now().isoformat(),
+            }
+
+        links = self._parse_status(out_s)
+        if not links:
+            return {
+                "passed": False,
+                "error": "no NVLink status entries parsed from nvidia-smi nvlink -s",
+                "raw_status": out_s[-4000:],
+                "timestamp": datetime.now().isoformat(),
+            }
+
+        speeds = self._parse_speeds(out_c) if rc_c == 0 else {}
+        status_speeds = self._parse_speeds(out_s)
+        for gpu, gpu_speeds in status_speeds.items():
+            speeds.setdefault(gpu, {}).update({k: v for k, v in gpu_speeds.items() if k not in speeds.get(gpu, {})})
+        errors = self._parse_errors(out_e) if rc_e == 0 else {}
+
+        gpu_results = []
+        overall = True
+        for gpu, gpu_links in sorted(links.items(), key=lambda x: int(x[0])):
+            active = sum(1 for l in gpu_links.values() if l.get("active"))
+            inactive = [lid for lid, l in gpu_links.items() if not l.get("active")]
+            speed_bad = []
+            for lid in gpu_links:
+                speed = speeds.get(gpu, {}).get(lid)
+                if speed is not None and speed < expected_speed:
+                    speed_bad.append({"link": lid, "speed_gbps": speed})
+            err_bad = []
+            if require_zero_errors:
+                for lid, counters in errors.get(gpu, {}).items():
+                    total = sum(v for v in counters.values() if isinstance(v, int))
+                    if total:
+                        err_bad.append({"link": lid, "counters": counters})
+
+            passed = active == expected_links and not inactive and not speed_bad and not err_bad
+            if not passed:
+                overall = False
+            gpu_results.append({
+                "gpu": int(gpu),
+                "active_links": active,
+                "expected_links": expected_links,
+                "inactive_links": inactive,
+                "speed_issues": speed_bad,
+                "error_issues": err_bad,
+                "passed": passed,
+            })
+
+        return {
+            "passed": overall,
+            "expected_links_per_gpu": expected_links,
+            "expected_link_speed_gbps": expected_speed,
+            "require_zero_errors": require_zero_errors,
+            "gpus": gpu_results,
+            "raw_status": out_s[-4000:],
+            "raw_speed": out_c[-4000:] if out_c else "",
+            "raw_errors": out_e[-4000:] if out_e else "",
+            "timestamp": datetime.now().isoformat(),
+        }
+
+    @staticmethod
+    def _parse_status(text: str) -> dict[str, dict[str, dict]]:
+        result: dict[str, dict[str, dict]] = {}
+        gpu = None
+        for line in text.splitlines():
+            m_gpu = re.search(r"GPU\s+(\d+)", line, re.I)
+            if m_gpu:
+                gpu = m_gpu.group(1)
+                result.setdefault(gpu, {})
+                continue
+            if gpu is None:
+                continue
+            m_link = re.search(r"Link\s+(\d+).*?(Active|Inactive|Disabled|Off|Down)", line, re.I)
+            if m_link:
+                state = m_link.group(2)
+                result[gpu][m_link.group(1)] = {
+                    "state": state,
+                    "active": state.lower() == "active",
+                    "raw": line.strip(),
+                }
+                continue
+            m_speed = re.search(r"Link\s+(\d+).*?([0-9.]+)\s*GB/s", line, re.I)
+            if m_speed:
+                result[gpu][m_speed.group(1)] = {
+                    "state": "Active",
+                    "active": True,
+                    "raw": line.strip(),
+                }
+        return result
+
+    @staticmethod
+    def _parse_speeds(text: str) -> dict[str, dict[str, float]]:
+        result: dict[str, dict[str, float]] = {}
+        gpu = None
+        for line in text.splitlines():
+            m_gpu = re.search(r"GPU\s+(\d+)", line, re.I)
+            if m_gpu:
+                gpu = m_gpu.group(1)
+                result.setdefault(gpu, {})
+                continue
+            if gpu is None:
+                continue
+            m_link = re.search(r"Link\s+(\d+).*?([0-9.]+)\s*GB/s", line, re.I)
+            if m_link:
+                result[gpu][m_link.group(1)] = float(m_link.group(2))
+        return result
+
+    @staticmethod
+    def _parse_errors(text: str) -> dict[str, dict[str, dict[str, int]]]:
+        result: dict[str, dict[str, dict[str, int]]] = {}
+        gpu = None
+        link = None
+        for line in text.splitlines():
+            m_gpu = re.search(r"GPU\s+(\d+)", line, re.I)
+            if m_gpu:
+                gpu = m_gpu.group(1)
+                result.setdefault(gpu, {})
+                continue
+            m_link = re.search(r"Link\s+(\d+)", line, re.I)
+            if m_link and gpu is not None:
+                link = m_link.group(1)
+                result[gpu].setdefault(link, {})
+            if gpu is None or link is None:
+                continue
+            for name in ("CRC", "Replay", "Recovery"):
+                m = re.search(rf"{name}[^0-9]*(\d+)", line, re.I)
+                if m:
+                    result[gpu][link][name.lower()] = int(m.group(1))
+        return result
+
+    @staticmethod
+    def print_results(results: dict, console: Optional[Console] = None):
+        c = console or Console()
+        if results.get("error"):
+            c.print(f"[bold red]NVLink error: {results['error']}[/bold red]")
+            return
+        passed = results.get("passed", False)
+        c.print("[bold green]✓ NVLink PASSED[/bold green]" if passed else "[bold red]✗ NVLink FAILED[/bold red]")
+        table = Table(box=None, padding=(0, 1))
+        table.add_column("GPU", style="bold")
+        table.add_column("Active Links", justify="right")
+        table.add_column("Issues")
+        for g in results.get("gpus", []):
+            issues = []
+            if g.get("inactive_links"):
+                issues.append("inactive=" + ",".join(g["inactive_links"]))
+            if g.get("speed_issues"):
+                issues.append(f"speed={len(g['speed_issues'])}")
+            if g.get("error_issues"):
+                issues.append(f"errors={len(g['error_issues'])}")
+            table.add_row(str(g["gpu"]), f"{g['active_links']}/{g['expected_links']}", "; ".join(issues) or "OK")
+        c.print(table)
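
The parsers in `modules/nvlink_test.py` scan `nvidia-smi nvlink` output line by line, tracking the current GPU and reading per-link speeds with regular expressions. The standalone sketch below mirrors that approach on a small sample. The sample text is an assumption about the output layout (real output can differ between driver versions), and `parse_links` and `slow_or_inactive` are hypothetical helper names, not functions from the commit.

```python
import re

# Assumed sample of `nvidia-smi nvlink -s` output; real output may differ by driver version.
SAMPLE = """\
GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-xxxx)
\t Link 0: 26.562 GB/s
\t Link 1: <inactive>
GPU 1: NVIDIA H100 80GB HBM3 (UUID: GPU-yyyy)
\t Link 0: 26.562 GB/s
"""


def parse_links(text):
    """Return {gpu: {link: speed in GB/s, or None when no speed is reported}}."""
    result = {}
    gpu = None
    for line in text.splitlines():
        m_link = re.search(r"Link\s+(\d+)\s*:\s*(.*)", line, re.I)
        if m_link and gpu is not None:
            m_speed = re.search(r"([0-9.]+)\s*GB/s", m_link.group(2), re.I)
            result[gpu][m_link.group(1)] = float(m_speed.group(1)) if m_speed else None
            continue
        m_gpu = re.search(r"GPU\s+(\d+)", line, re.I)
        if m_gpu:
            gpu = m_gpu.group(1)
            result.setdefault(gpu, {})
    return result


def slow_or_inactive(links, expected_gbps):
    """List link ids per GPU that report no speed or a speed below the expectation."""
    bad = {}
    for gpu, gpu_links in links.items():
        ids = [lid for lid, speed in gpu_links.items() if speed is None or speed < expected_gbps]
        if ids:
            bad[gpu] = ids
    return bad


links = parse_links(SAMPLE)
print(slow_or_inactive(links, 25.0))
```

GPU and link ids stay as strings here, as in the module, so callers that sort by GPU need an explicit `int()` conversion.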


@@ -93,8 +93,8 @@ class ReportGenerator:
 def _generate_html(self, results: dict, output: str) -> str:
 import socket
-hostname = socket.gethostname()
-timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
+hostname = results.get("hostname") or socket.gethostname()
+timestamp = results.get("timestamp") or datetime.now().strftime("%Y-%m-%d %H:%M:%S")
 sections = []
@@ -178,8 +178,8 @@ class ReportGenerator:
 def _generate_markdown(self, results: dict, output: str) -> str:
 import socket
-hostname = socket.gethostname()
-timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
+hostname = results.get("hostname") or socket.gethostname()
+timestamp = results.get("timestamp") or datetime.now().strftime("%Y-%m-%d %H:%M:%S")
 lines: list[str] = []
@@ -201,6 +201,21 @@ class ReportGenerator:
 # --- Summary table ---
 summary_items = self._build_summary(results)
 if summary_items:
+verdict, failures, missing = self._overall_acceptance_verdict(summary_items)
+lines.append("## Overall Acceptance Verdict\n")
+lines.append(f"**Result: {verdict}**")
+lines.append("")
+if failures:
+lines.append("Failed or unverified items:")
+for name, status in failures:
+lines.append(f"- {name}: {status}")
+lines.append("")
+if missing:
+lines.append("Missing required evidence:")
+for name in missing:
+lines.append(f"- {name}")
+lines.append("")
 lines.append("## Summary\n")
 lines.append("| Test | Result |")
 lines.append("|------|--------|")
@@ -319,8 +334,6 @@ class ReportGenerator:
 if use_abs and thr:
 if val >= thr:
 status = "PASS"
-elif val >= thr * 0.9:
-status = "WARN"
 else:
 status = "FAIL"
 lines.append(f"| {dt.upper()} | {val:.1f} | {pk:.0f} | >= {thr} | {status} |")
@@ -331,29 +344,122 @@ class ReportGenerator:
 overall_status = status
 lines.append("")
 if use_abs:
+if any(not row.get("passed", False) for row in (comp_data.get("consistency", {}) or {}).values()):
+overall_status = "FAIL"
 lines.append(f"**Verdict: {overall_status}** (absolute TFLOPS thresholds; worst efficiency {worst_eff:.1f}%)\n")
 else:
 overall_status = "PASS" if worst_eff >= 80 else ("WARN" if worst_eff >= 50 else "FAIL")
 lines.append(f"**Verdict: {overall_status}** (worst efficiency {worst_eff:.1f}%)\n")
+consistency = comp_data.get("consistency", {}) or {}
+if consistency:
+lines.append("### Compute Consistency\n")
+lines.append("| DType | Min | Mean | Max | Spread | Limit | Status |")
+lines.append("|-------|-----|------|-----|--------|-------|--------|")
+for dt, row in consistency.items():
+status = "PASS" if row.get("passed") else "FAIL"
+lines.append(
+f"| {dt.upper()} | {row.get('min_tflops', 0):.1f} | "
+f"{row.get('mean_tflops', 0):.1f} | {row.get('max_tflops', 0):.1f} | "
+f"{row.get('spread_pct', 0):.2f}% | <= {row.get('max_allowed_pct', 3)}% | {status} |"
+)
+lines.append("")
+per_gpu = comp_data.get("per_gpu", []) or []
+dtype_order = [dt for dt in per_dtype.keys() if not isinstance(per_dtype.get(dt), str)]
+if per_gpu and dtype_order:
+lines.append("### Compute Per-GPU TFLOPS\n")
+headers = ["GPU", *[dt.upper() for dt in dtype_order]]
+lines.append("| " + " | ".join(headers) + " |")
+lines.append("|" + "|".join(["---"] * len(headers)) + "|")
+for row in per_gpu:
+cells = [str(row.get("index", ""))]
+for dt in dtype_order:
+val = row.get(dt, "")
+cells.append(f"{val:.1f}" if isinstance(val, (int, float)) else str(val))
+lines.append("| " + " | ".join(cells) + " |")
+lines.append("")
+# --- NVLink/NVSwitch ---
+nvlink = results.get("nvlink")
+if nvlink and not nvlink.get("error"):
+lines.append("## NVLink/NVSwitch\n")
+lines.append(f"**Overall: {'PASS' if nvlink.get('passed') else 'FAIL'}**\n")
+lines.append("| GPU | Active Links | Issues |")
+lines.append("|-----|--------------|--------|")
+for g in nvlink.get("gpus", []):
+issues = []
+if g.get("inactive_links"):
+issues.append("inactive=" + ",".join(g["inactive_links"]))
+if g.get("speed_issues"):
+issues.append(f"speed issues={len(g['speed_issues'])}")
+if g.get("error_issues"):
+issues.append(f"errors={len(g['error_issues'])}")
+lines.append(f"| {g.get('gpu')} | {g.get('active_links')}/{g.get('expected_links')} | {', '.join(issues) or 'OK'} |")
+lines.append("")
+elif nvlink and nvlink.get("error"):
+lines.append("## NVLink/NVSwitch\n")
+lines.append(f"**Overall: FAIL** ({nvlink.get('error')})\n")
+dcgm = results.get("dcgm")
+if dcgm and not dcgm.get("error"):
+lines.append("## DCGM Diagnostic\n")
+lines.append(f"**Overall: {'PASS' if dcgm.get('passed') else 'FAIL'}**\n")
+if dcgm.get("subtests"):
+lines.append("| Subtest | Status |")
+lines.append("|---------|--------|")
+for s in dcgm.get("subtests", []):
+lines.append(f"| {s.get('name', '')} | {s.get('status', '')} |")
+lines.append("")
+elif dcgm and dcgm.get("error"):
+lines.append("## DCGM Diagnostic\n")
+lines.append(f"**Overall: FAIL** ({dcgm.get('error')})\n")
 # --- NCCL ---
 nccl = results.get("nccl")
 if nccl and not nccl.get("error"):
 lines.append("## NCCL Multi-GPU\n")
 lines.append(f"Source: {nccl.get('source', 'unknown')} | "
 f"GPUs: {nccl.get('gpu_count', '?')}\n")
+if nccl.get("source") == "torchrun_fallback":
+lines.append("> Functional NCCL smoke only: nccl-tests bus bandwidth was not measured, so this does not satisfy production acceptance.\n")
 tests = nccl.get("tests", {})
 if tests:
-lines.append("| Operation | Bus BW (GB/s) | Threshold | Status |")
-lines.append("|-----------|---------------|-----------|--------|")
+lines.append("> Summary reports the best Bus BW observed for each operation. PASS/FAIL is evaluated across every tested message size and repeat run shown in the detail table below.\n")
+lines.append("| Operation | Best Bus BW (GB/s) | Failed Sizes | Threshold | Status |")
+lines.append("|-----------|--------------------|--------------|-----------|--------|")
 for op, data in tests.items():
 if isinstance(data, dict) and not data.get("error"):
 bw = data.get("best_busbw_gbps", 0)
 req = data.get("min_required_gbps", 0)
 status = data.get("status", "?")
-lines.append(f"| {op} | {bw:.1f} | >= {req:.0f} | {status} |")
+failed_sizes = [
+str(row.get("size", "?"))
+for row in data.get("by_size", [])
+if row.get("status") != "PASS"
+]
+failed_sizes_text = ", ".join(failed_sizes) if failed_sizes else "-"
+lines.append(f"| {op} | {bw:.1f} | {failed_sizes_text} | >= {req:.0f} | {status} |")
 elif isinstance(data, dict) and data.get("error"):
-lines.append(f"| {op} | - | - | ERROR: {data['error']} |")
+lines.append(f"| {op} | - | - | - | ERROR: {data['error']} |")
+lines.append("")
+for op, data in tests.items():
+by_size = data.get("by_size", []) if isinstance(data, dict) else []
+if not by_size:
+continue
+lines.append(f"### NCCL {op} by size\n")
+lines.append("| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |")
+lines.append("|------|---------------------|-------|------|--------|-----------|--------|")
+for row in by_size:
+runs = ", ".join(str(v) for v in row.get("runs_busbw_gbps", []))
+lines.append(
+f"| {row.get('size', '')} | {runs} | "
+f"{row.get('worst_busbw_gbps', 0):.1f} | "
+f"{row.get('mean_busbw_gbps', 0):.1f} | "
+f"{row.get('stddev_pct', 0):.2f}% | "
+f">= {data.get('min_required_gbps', 0):.0f} | "
+f"{row.get('status', '?')} |"
+)
 lines.append("")
 passed = nccl.get("passed", False)
 lines.append(f"**Overall: {'PASS' if passed else 'FAIL'}**\n")
@@ -368,6 +474,21 @@ class ReportGenerator:
 source = stress.get("source", "unknown")
 lines.append(f"- **Source:** {source}")
 lines.append(f"- **Duration:** {elapsed:.0f}s (requested {duration}s)")
+telemetry = stress.get("telemetry") or {}
+if telemetry:
+lines.append(f"- **Telemetry samples:** {telemetry.get('samples', 0)}")
+lines.append(f"- **Max temp:** {telemetry.get('max_temp_c', {})}")
+lines.append(f"- **Avg power:** {telemetry.get('avg_power_w', {})}")
+lines.append(f"- **Temp delta:** {telemetry.get('temp_delta_c', 'N/A')} C")
+lines.append(f"- **TFLOPS jitter:** {telemetry.get('tflops_jitter_pct', 'N/A')}%")
+lines.append(f"- **Steady TFLOPS samples:** {telemetry.get('steady_tflops_samples', 0)}")
+lines.append(f"- **Throttle events:** {telemetry.get('throttle_event_count', len(telemetry.get('throttle_events', [])))}")
+lines.append(f"- **XID events:** {len(telemetry.get('xid_events', []))}")
+failures = telemetry.get("failures") or []
+if failures:
+lines.append("- **Failure reasons:**")
+for reason in failures:
+lines.append(f" - {reason}")
 lines.append(f"- **Result: {'PASS' if passed else 'FAIL'}**")
 lines.append("")
@@ -378,26 +499,70 @@ class ReportGenerator:
 lines.append(f"**Overall: SKIP** [{rdma.get('reason', 'no IB hardware detected')}]\n")
 elif rdma and not rdma.get("error"):
 lines.append("## RDMA/InfiniBand\n")
+rdma_legacy_note = self._rdma_legacy_note(rdma)
+if rdma_legacy_note:
+lines.append(f"> {rdma_legacy_note}\n")
+port_checks = rdma.get("port_checks", [])
+if port_checks:
+lines.append("### RDMA Port Checks\n")
+lines.append("| Device | Port | State | Rate | Required | Status |")
+lines.append("|--------|------|-------|------|----------|--------|")
+for p in port_checks:
+lines.append(
+f"| {p.get('device', '')} | {p.get('port', '')} | "
+f"{p.get('state', '')} | {p.get('rate', '')} | "
+f">= {p.get('min_rate_gbps', 400):.0f}Gbps ACTIVE | {p.get('status', '?')} |"
+)
+lines.append("")
 bw_tests = rdma.get("bandwidth_tests", [])
 lat_tests = rdma.get("latency_tests", [])
-if bw_tests or lat_tests:
+ibping_tests = rdma.get("ibping_tests", [])
+if bw_tests or lat_tests or ibping_tests:
 lines.append("| Test | Value | Threshold | Status |")
 lines.append("|------|-------|-----------|--------|")
 for bt in bw_tests:
-if not bt.get("error"):
+if bt.get("error"):
+lines.append(f"| {bt.get('test', 'ib_bw')} | {bt.get('error')} | required runnable test | {bt.get('status', 'FAIL')} |")
+else:
+threshold, status = self._rdma_bandwidth_verdict(bt)
 lines.append(f"| {bt['test']} | {bt.get('bandwidth_gbps', 0):.1f} GB/s | "
-f">= {bt.get('min_required_gbps', 0)} GB/s | {bt.get('status', '?')} |")
+f">= {threshold:g} GB/s | {status} |")
 for lt in lat_tests:
-if not lt.get("error"):
+if lt.get("error"):
+lines.append(f"| {lt.get('test', 'ib_lat')} | {lt.get('error')} | required runnable test | {lt.get('status', 'FAIL')} |")
+else:
+threshold, status = self._rdma_latency_verdict(lt)
 lines.append(f"| {lt['test']} | {lt.get('latency_us', 0):.2f} us | "
-f"<= {lt.get('max_allowed_us', 0)} us | {lt.get('status', '?')} |")
+f"<= {threshold:g} us | {status} |")
+for it in ibping_tests:
+direction = it.get("direction") or it.get("role", "N/A")
+if it.get("error"):
+lines.append(f"| {it.get('test', 'ibping')} | {it.get('error')} | bidirectional peer evidence | {it.get('status', 'FAIL')} |")
+else:
+lines.append(f"| {it['test']} | {direction} target={it.get('target', 'N/A')} count={it.get('count', 'N/A')} | "
+f"0% packet loss | {it.get('status', '?')} |")
 lines.append("")
+fabric = rdma.get("fabric_counters") or {}
+if fabric:
+counters = fabric.get("counters", {})
+lines.append(f"- **PFC/ECN/CNP/congestion counters checked:** {len(counters)}")
+lines.append(f"- **PFC/ECN/CNP/congestion non-zero:** {'yes' if fabric.get('failed') else 'no'}")
+if not counters:
+lines.append("- **PFC/ECN/CNP/congestion evidence:** missing")
+failures = rdma.get("failures") or []
+if not failures:
+failures = self._rdma_failure_reasons(rdma)
+if failures:
+lines.append("- **Failure reasons:**")
+for reason in failures:
+lines.append(f" - {reason}")
 passed = rdma.get("passed", False)
 lines.append(f"**Overall: {'PASS' if passed else 'FAIL'}**\n")
 # --- Training ---
 training = results.get("training")
 if training and not training.get("error"):
+training_status, training_detail, training_missing = self._training_verdict(training)
 lines.append("## Training Simulation\n")
 lines.append("| Metric | Value |")
 lines.append("|--------|-------|")
@@ -405,8 +570,14 @@ class ReportGenerator:
 lines.append(f"| Params | {training.get('total_params_m', 0):.1f}M |")
 lines.append(f"| Throughput | {training.get('throughput_tokens_per_sec', 0):.0f} tokens/sec |")
 lines.append(f"| Avg Step Time | {training.get('avg_step_time_ms', 0):.1f} ms |")
+lines.append(f"| Warmup Steps | {training.get('warmup_steps', 'N/A')} |")
 lines.append(f"| Peak Memory | {training.get('peak_memory_gb', 0):.1f} GB |")
 lines.append(f"| Final Loss | {training.get('final_loss', 'N/A')} |")
+lines.append(f"| Step Jitter | {training.get('step_jitter_pct', 'N/A')}% |")
+lines.append(f"| Distributed Mode | {training.get('distributed_mode', 'N/A')} |")
+if training_missing:
+lines.append(f"| Acceptance Gaps | missing {', '.join(training_missing)} |")
+lines.append(f"| Verdict | {training_status} ({training_detail}) |")
 lines.append("")
 # --- Footer ---
@@ -441,6 +612,101 @@ class ReportGenerator:
 return bench["compute"]
 return {}
+@staticmethod
+def _training_verdict(training: dict) -> tuple[str, str, list[str]]:
+"""Return report status for both current and legacy training result schemas."""
+tps = float(training.get("throughput_tokens_per_sec", 0) or 0)
+if "passed" in training:
+status = "PASS" if training.get("passed") else "FAIL"
+return status, f"{tps:.0f} tokens/sec", []
+required = ["passed", "step_jitter_pct", "distributed_mode", "loss_finite"]
+missing = [k for k in required if k not in training]
+return "UNVERIFIED", f"{tps:.0f} tokens/sec; legacy result lacks explicit acceptance verdict", missing
+def _rdma_cfg_value(self, key: str, default: float) -> float:
+try:
+return float((self.config.get("rdma", {}) or {}).get(key, default))
+except (TypeError, ValueError):
+return default
+def _rdma_bandwidth_verdict(self, row: dict) -> tuple[float, str]:
+threshold = self._rdma_cfg_value("min_bandwidth_gbps", 47.0)
+value = float(row.get("bandwidth_gbps", 0) or 0)
+return threshold, "PASS" if value >= threshold else "FAIL"
+def _rdma_latency_verdict(self, row: dict) -> tuple[float, str]:
+name = row.get("test", "")
+if name == "ib_write_lat":
+threshold = self._rdma_cfg_value("max_write_latency_us", 2.0)
+elif name == "ib_read_lat":
+threshold = self._rdma_cfg_value("max_read_latency_us", 3.5)
+else:
+threshold = self._rdma_cfg_value("max_latency_us", 3.5)
+value = float(row.get("latency_us", 0) or 0)
+return threshold, "PASS" if 0 < value <= threshold else "FAIL"
+def _rdma_legacy_note(self, rdma: dict) -> str:
+"""Flag old RDMA result schemas whose embedded thresholds were looser."""
+for row in rdma.get("bandwidth_tests", []) or []:
+if row.get("min_required_gbps") != self._rdma_cfg_value("min_bandwidth_gbps", 47.0):
+return (
+"Legacy RDMA result re-evaluated with current PDF acceptance thresholds; "
+"old WARN statuses and old 50GB/s/10us limits are not used for verdict."
+)
+for row in rdma.get("latency_tests", []) or []:
+threshold, _ = self._rdma_latency_verdict(row)
+if row.get("max_allowed_us") != threshold:
+return (
+"Legacy RDMA result re-evaluated with current PDF acceptance thresholds; "
+"old WARN statuses and old 50GB/s/10us limits are not used for verdict."
+)
+return ""
+def _rdma_failure_reasons(self, rdma: dict) -> list[str]:
+failures = []
+for row in rdma.get("bandwidth_tests", []) or []:
+threshold, status = self._rdma_bandwidth_verdict(row)
+if status != "PASS":
+failures.append(
+f"{row.get('test')} bandwidth {row.get('bandwidth_gbps', 0)}GB/s < {threshold:g}GB/s"
+)
+for row in rdma.get("latency_tests", []) or []:
+threshold, status = self._rdma_latency_verdict(row)
+if status != "PASS":
+failures.append(
+f"{row.get('test')} latency {row.get('latency_us', 0)}us > {threshold:g}us"
+)
+for row in rdma.get("ibping_tests", []) or []:
+if row.get("status") != "PASS":
+failures.append(f"{row.get('test')} failed")
+return failures
+@staticmethod
+def _overall_acceptance_verdict(summary_items: list[tuple[str, str]]) -> tuple[str, list[tuple[str, str]], list[str]]:
+"""PDF-style machine verdict: every required item must be present and PASS."""
+required = [
+"GPU Info",
+"Health Check",
+"Memory Bandwidth",
+"Compute Throughput",
+"NVLink/NVSwitch",
+"NCCL",
+"Stress Test",
+"RDMA",
+"DCGM",
+"Training",
+]
+status_by_name = dict(summary_items)
+missing = [name for name in required if name not in status_by_name]
+failures = [
+(name, status)
+for name, status in summary_items
+if name in required and not str(status).startswith("PASS")
+]
+verdict = "PASS" if not missing and not failures else "FAIL"
+return verdict, failures, missing
 def _build_summary(self, results: dict) -> list[tuple[str, str]]:
 """Build summary verdict list from results."""
 items = []
@@ -473,7 +739,7 @@ class ReportGenerator:
 d2d = mem.get("d2d_bandwidth_gbps") or 0
 items.append(("Memory Bandwidth", f"WARN ({d2d:.0f} GB/s via PyTorch fallback)"))
 else:
-eff = mem.get("efficiency_pct") or 0
+eff = mem.get("d2d_efficiency_pct") or mem.get("efficiency_pct") or 0
 verdict = "PASS" if eff >= 80 else ("WARN" if eff >= 60 else "FAIL")
 items.append(("Memory Bandwidth", f"{verdict} ({eff:.1f}%)"))
@@ -491,20 +757,38 @@ class ReportGenerator:
 rank = {"PASS": 0, "WARN": 1, "FAIL": 2}
 worst_status = "PASS"
 worst_dt = None
+lowest_margin = None
 for dt, thr in pass_thresholds.items():
 val = per_dtype.get(dt)
 if not isinstance(val, (int, float)):
 continue
 if val >= thr:
 st = "PASS"
-elif val >= thr * 0.9:
-st = "WARN"
 else:
 st = "FAIL"
+margin = val / thr if thr else 0
+if lowest_margin is None or margin < lowest_margin:
+lowest_margin = margin
+worst_dt = dt
 if rank[st] > rank[worst_status]:
 worst_status = st
-worst_dt = dt
 if worst_dt:
+consistency = comp.get("consistency", {}) or {}
+failed_consistency = [
+(dt, row)
+for dt, row in consistency.items()
+if not row.get("passed", False)
+]
+if failed_consistency:
+worst_status = "FAIL"
+fail_dt, fail_row = failed_consistency[0]
+items.append((
+"Compute Throughput",
+f"FAIL ({fail_dt.upper()} spread "
+f"{fail_row.get('spread_pct', 0):.2f}% > "
+f"{fail_row.get('max_allowed_pct', 3)}%)"
+))
+else:
 items.append((
 "Compute Throughput",
 f"{worst_status} (worst {worst_dt.upper()} "
@@ -521,11 +805,32 @@ class ReportGenerator:
 else:
 items.append(("Compute Throughput", "N/A"))
+# NVLink/NVSwitch
+if "nvlink" in results:
+nvl = results["nvlink"]
+if nvl.get("error"):
+items.append(("NVLink/NVSwitch", f"ERROR: {nvl['error']}"))
+elif nvl.get("passed"):
+items.append(("NVLink/NVSwitch", "PASS"))
+else:
+items.append(("NVLink/NVSwitch", "FAIL"))
+if "dcgm" in results:
+d = results["dcgm"]
+if d.get("error"):
+items.append(("DCGM", f"ERROR: {d['error']}"))
+elif d.get("passed"):
+items.append(("DCGM", "PASS"))
+else:
+items.append(("DCGM", "FAIL"))
 # NCCL
 if "nccl" in results:
 n = results["nccl"]
 if n.get("error"):
 items.append(("NCCL", f"ERROR: {n['error']}"))
+elif n.get("source") == "torchrun_fallback":
+items.append(("NCCL", "FAIL (no nccl-tests bus BW)"))
 elif n.get("passed"):
 items.append(("NCCL", "PASS"))
 else:
@@ -559,7 +864,7 @@ class ReportGenerator:
 if t.get("error"):
 items.append(("Training", f"ERROR: {t['error']}"))
 else:
-tps = t.get("throughput_tokens_per_sec", 0)
-items.append(("Training", f"PASS ({tps:.0f} tokens/sec)"))
+status, detail, _missing = self._training_verdict(t)
+items.append(("Training", f"{status} ({detail})"))
 return items
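
The overall verdict added in this file is strict: every required item must be present in the summary and its status must start with `PASS`, so a `WARN`, `ERROR`, or `UNVERIFIED` entry counts as a failure and an absent entry counts as missing evidence. The standalone sketch below restates that rule on a shortened required list. The three-item `REQUIRED` list and the sample statuses are assumptions for illustration; the commit's own list has ten entries.

```python
# Shortened required list for illustration only; the commit requires ten items.
REQUIRED = ["NVLink/NVSwitch", "NCCL", "Stress Test"]


def overall_verdict(summary_items, required=REQUIRED):
    """Return (verdict, failures, missing) for a list of (name, status) pairs."""
    status_by_name = dict(summary_items)
    missing = [name for name in required if name not in status_by_name]
    failures = [
        (name, status)
        for name, status in summary_items
        if name in required and not str(status).startswith("PASS")
    ]
    verdict = "PASS" if not missing and not failures else "FAIL"
    return verdict, failures, missing


summary = [("NVLink/NVSwitch", "PASS"), ("NCCL", "FAIL (no nccl-tests bus BW)")]
verdict, failures, missing = overall_verdict(summary)
print(verdict, failures, missing)
```

Treating missing evidence as a failure means a partial run, such as `--test nccl` alone, can never produce an overall PASS by omission.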


@@ -1,9 +1,10 @@
-"""GPU stress test module — wraps gpu-burn for long-running stability tests."""
+"""GPU stress test module — gpu-burn or PyTorch GEMM with telemetry."""
 import glob
 import os
 import shutil
 import subprocess
+import threading
 import time
 from datetime import datetime
@@ -46,7 +47,7 @@ class StressTest:
 memory_pct = cfg.get("memory_pct", 90)
 target_gpus = cfg.get("gpus", "all")
-gpu_burn = self._find_gpu_burn()
+gpu_burn = self._find_gpu_burn() if cfg.get("use_gpu_burn", False) else ""
 if gpu_burn:
 # Try gpu-burn first
@@ -60,7 +61,7 @@ class StressTest:
 return result
-self.console.print("[yellow]gpu_burn not found, using PyTorch stress test[/yellow]")
+self.console.print("[yellow]Using PyTorch stress test[/yellow]")
 return self._run_pytorch_stress(duration_sec, memory_pct)
 def _run_gpu_burn(self, gpu_burn: str, duration: int,
@@ -77,12 +78,26 @@ class StressTest:
 cmd.append(str(duration))
 t0 = time.time()
+xid_before = self._collect_xid_events()
+interval = int(self.stress_cfg.get("telemetry_interval_sec", 1))
+telemetry = []
+stop_sampling = threading.Event()
+sampler = threading.Thread(
+target=self._sample_telemetry,
+args=(telemetry, stop_sampling, interval),
+daemon=True,
+)
+sampler.start()
 try:
 r = subprocess.run(cmd, capture_output=True, text=True, timeout=duration + 120)
 elapsed = round(time.time() - t0, 1)
+stop_sampling.set()
+sampler.join(timeout=interval + 1)
 output = r.stdout + r.stderr
-passed = r.returncode == 0
+xid_events = self._new_xid_events(xid_before, self._collect_xid_events())
+telemetry_summary = self._evaluate_telemetry(telemetry, [], xid_events)
+passed = r.returncode == 0 and telemetry_summary.get("passed", False)
 gpu_results = []
 for line in output.split("\n"):
@@ -96,25 +111,36 @@ class StressTest:
 "duration_sec": duration,
 "elapsed_sec": elapsed,
 "gpu_results": gpu_results,
+"telemetry": telemetry_summary,
 "raw_output_tail": output[-500:] if output else "",
 "timestamp": datetime.now().isoformat(),
 }
 except subprocess.TimeoutExpired:
+stop_sampling.set()
 return {
 "source": "gpu-burn",
 "passed": False,
 "duration_sec": duration,
 "error": "timeout",
+"telemetry": self._evaluate_telemetry(
+telemetry, [], self._new_xid_events(xid_before, self._collect_xid_events())
+),
 "timestamp": datetime.now().isoformat(),
 }
 except Exception as e:
+stop_sampling.set()
 return {
 "source": "gpu-burn",
 "passed": False,
 "error": str(e),
+"telemetry": self._evaluate_telemetry(
+telemetry, [], self._new_xid_events(xid_before, self._collect_xid_events())
+),
 "timestamp": datetime.now().isoformat(),
 }
+finally:
+stop_sampling.set()
 def _run_pytorch_stress(self, duration: int, memory_pct: int = 90) -> dict:
 try:
@ -127,58 +153,79 @@ class StressTest:
gpu_count = torch.cuda.device_count() gpu_count = torch.cuda.device_count()
self.console.print(f"[cyan]PyTorch Stress Test ({duration}s, {gpu_count} GPUs, target {memory_pct}% memory)[/cyan]") self.console.print(f"[cyan]PyTorch Stress Test ({duration}s, {gpu_count} GPUs, target {memory_pct}% memory)[/cyan]")
dtype_name = self.stress_cfg.get("dtype", "bf16")
matrix_size = int(self.stress_cfg.get("matrix_size", 8192))
interval = int(self.stress_cfg.get("telemetry_interval_sec", 1))
dtype_map = {"fp16": torch.float16, "bf16": torch.bfloat16, "fp32": torch.float32}
dtype = dtype_map.get(dtype_name, torch.bfloat16)
gpu_status = {} gpu_status = {}
telemetry = []
stop_sampling = threading.Event()
t0 = time.time() t0 = time.time()
xid_before = self._collect_xid_events()
try: try:
sampler = threading.Thread(
target=self._sample_telemetry,
args=(telemetry, stop_sampling, interval),
daemon=True,
)
sampler.start()
tensors = {} tensors = {}
ballast = {}
pass_tflops = []
for i in range(gpu_count): for i in range(gpu_count):
with torch.cuda.device(i): with torch.cuda.device(i):
# Get actual free memory (accounting for other processes)
free_mem, total_mem = torch.cuda.mem_get_info(i) free_mem, total_mem = torch.cuda.mem_get_info(i)
side = matrix_size
# Calculate allocation from configured memory_pct elem = torch.tensor([], dtype=dtype).element_size()
target_mem = int(total_mem * memory_pct / 100) compute_bytes = side * side * elem * 3
target_mem = min(int(total_mem * memory_pct / 100), int(free_mem * 0.90))
# Cap at actual free memory with 5% safety margin ballast_bytes = max(0, target_mem - compute_bytes)
alloc_bytes = min(target_mem, int(free_mem * 0.95)) if ballast_bytes:
ballast_elems = ballast_bytes // 2
# matmul(A, A.T) needs 2x input memory (input + output) ballast[i] = torch.empty(ballast_elems, device=f"cuda:{i}", dtype=torch.float16)
mem_side = int((alloc_bytes / 4 / 2) ** 0.5) actual_mem_mb = (compute_bytes + ballast_bytes) / 1024 / 1024
# Cap compute matrix so a single matmul completes in ~2s on H100/H200
# (FP32 ≈ 67 TFLOPS → 2*4096³/67e12 ≈ 2s). Without this cap, a 141GB
# HBM yields side ≈ 131K → single matmul ~68s × 8 GPUs serial → loop
# overshoots a 60s duration request by 10×+.
MAX_COMPUTE_SIDE = 4096
side = min(mem_side, MAX_COMPUTE_SIDE)
actual_mem_mb = side * side * 4 / 1024 / 1024
total_mem_mb = total_mem / 1024 / 1024 total_mem_mb = total_mem / 1024 / 1024
free_mem_mb = free_mem / 1024 / 1024 free_mem_mb = free_mem / 1024 / 1024
self.console.print( self.console.print(
f" [dim]GPU {i}: total {total_mem_mb:.0f}MB, free {free_mem_mb:.0f}MB, " f" [dim]GPU {i}: total {total_mem_mb:.0f}MB, free {free_mem_mb:.0f}MB, "
f"alloc {actual_mem_mb:.0f}MB ({actual_mem_mb/total_mem_mb*100:.0f}%) - " f"alloc {actual_mem_mb:.0f}MB ({actual_mem_mb/total_mem_mb*100:.0f}%) - "
f"matrix {side}x{side}[/dim]" f"{dtype_name} matrix {side}x{side}[/dim]"
)
tensors[i] = (
torch.randn(side, side, device=f"cuda:{i}", dtype=dtype),
torch.randn(side, side, device=f"cuda:{i}", dtype=dtype),
torch.empty(side, side, device=f"cuda:{i}", dtype=dtype),
) )
tensors[i] = torch.randn(side, side, device=f"cuda:{i}", dtype=torch.float32)
self.console.print(f"\n[cyan]Starting stress test for {duration} seconds...[/cyan]") self.console.print(f"\n[cyan]Starting stress test for {duration} seconds...[/cyan]")
elapsed_check = 0 elapsed_check = 0
while time.time() - t0 < duration: while time.time() - t0 < duration:
loop_start = time.perf_counter()
# Dispatch matmul on all GPUs in parallel — do NOT synchronize between # Dispatch matmul on all GPUs in parallel — do NOT synchronize between
# GPUs, otherwise the 8 GPUs run serially and overshoot the duration. # GPUs, otherwise the 8 GPUs run serially and overshoot the duration.
for i in range(gpu_count): for i in range(gpu_count):
with torch.cuda.device(i): with torch.cuda.device(i):
tensors[i] = torch.matmul(tensors[i], tensors[i].T) a, b, out = tensors[i]
torch.matmul(a, b, out=out)
# Single sync per pass — waits for all 8 streams concurrently # Single sync per pass — waits for all 8 streams concurrently
for i in range(gpu_count): for i in range(gpu_count):
with torch.cuda.device(i): with torch.cuda.device(i):
torch.cuda.synchronize() torch.cuda.synchronize()
loop_elapsed = time.perf_counter() - loop_start
current_elapsed = time.time() - t0
if loop_elapsed > 0:
flops = gpu_count * 2 * (matrix_size ** 3)
pass_tflops.append({
"elapsed_sec": current_elapsed,
"tflops": flops / loop_elapsed / 1e12,
})
# Show progress every 10 seconds # Show progress every 10 seconds
current_elapsed = time.time() - t0
if int(current_elapsed) != int(elapsed_check) and int(current_elapsed) % 10 == 0: if int(current_elapsed) != int(elapsed_check) and int(current_elapsed) % 10 == 0:
self.console.print(f" [dim]Running {int(current_elapsed)}s / {duration}s[/dim]") self.console.print(f" [dim]Running {int(current_elapsed)}s / {duration}s[/dim]")
elapsed_check = current_elapsed elapsed_check = current_elapsed
@ -198,21 +245,196 @@ class StressTest:
"duration_sec": duration, "duration_sec": duration,
"error": error_msg, "error": error_msg,
"gpu_status": gpu_status, "gpu_status": gpu_status,
"telemetry": self._evaluate_telemetry(
telemetry, pass_tflops if "pass_tflops" in locals() else [],
self._new_xid_events(xid_before, self._collect_xid_events()),
),
} }
finally: finally:
stop_sampling.set()
tensors.clear() tensors.clear()
ballast.clear()
torch.cuda.empty_cache() torch.cuda.empty_cache()
elapsed = round(time.time() - t0, 1) elapsed = round(time.time() - t0, 1)
xid_events = self._new_xid_events(xid_before, self._collect_xid_events())
telemetry_summary = self._evaluate_telemetry(telemetry, pass_tflops, xid_events)
passed = all(v == "PASS" for v in gpu_status.values()) and telemetry_summary.get("passed", False)
return { return {
"source": "pytorch", "source": "pytorch",
"passed": True, "passed": passed,
"duration_sec": duration, "duration_sec": duration,
"elapsed_sec": elapsed, "elapsed_sec": elapsed,
"gpu_status": gpu_status, "gpu_status": gpu_status,
"telemetry": telemetry_summary,
"timestamp": datetime.now().isoformat(), "timestamp": datetime.now().isoformat(),
} }
def _sample_telemetry(self, telemetry: list, stop_event: threading.Event, interval: int):
query = "index,temperature.gpu,power.draw,clocks_throttle_reasons.active"
while not stop_event.is_set():
try:
r = subprocess.run(
["nvidia-smi", f"--query-gpu={query}", "--format=csv,noheader,nounits"],
capture_output=True, text=True, timeout=10,
)
if r.returncode == 0:
sample = {"time": time.time(), "gpus": []}
for line in r.stdout.splitlines():
parts = [p.strip() for p in line.split(",")]
if len(parts) >= 4:
sample["gpus"].append({
"index": int(parts[0]),
"temp_c": float(parts[1]),
"power_w": float(parts[2]),
"throttle": parts[3],
})
telemetry.append(sample)
except Exception:
pass
stop_event.wait(interval)
def _collect_xid_events(self) -> list[str]:
try:
r = subprocess.run(
["dmesg", "--color=never"],
capture_output=True, text=True, timeout=10,
)
if r.returncode != 0:
return []
return [
line.strip()
for line in r.stdout.splitlines()
if any(token in line.upper() for token in ("XID", "NVRM: XID"))
]
except Exception:
return []
@staticmethod
def _new_xid_events(before: list[str], after: list[str]) -> list[str]:
seen = set(before)
return [line for line in after if line not in seen]
def _evaluate_telemetry(self, telemetry: list, pass_tflops: list, xid_events: list[str] | None = None) -> dict:
cfg = self.stress_cfg
max_temp = float(cfg.get("max_temp_c", 80))
max_delta = float(cfg.get("max_temp_delta_c", 5))
min_power = float(cfg.get("min_power_watts", 630))
max_jitter = float(cfg.get("max_tflops_jitter_pct", 5))
require_jitter = bool(cfg.get("require_tflops_jitter", True))
duration = float(cfg.get("duration_sec", 60))
requested_warmup = float(cfg.get("warmup_sec", 60))
warmup_sec = min(requested_warmup, max(0.0, duration * 0.2))
min_steady_samples = int(cfg.get("min_steady_samples", 10))
temps = {}
powers = {}
throttle_bad = []
xid_events = xid_events or []
steady_telemetry = [
sample for sample in telemetry
if sample.get("time", 0) - telemetry[0].get("time", 0) >= warmup_sec
] if telemetry else []
evaluation_samples = steady_telemetry if len(steady_telemetry) >= min_steady_samples else telemetry
for sample in evaluation_samples:
for g in sample.get("gpus", []):
idx = g["index"]
temps.setdefault(idx, []).append(g["temp_c"])
powers.setdefault(idx, []).append(g["power_w"])
try:
bitmask = int(str(g["throttle"]), 16)
except ValueError:
bitmask = 0
real_throttle = bitmask & ~0x1
if real_throttle:
throttle_bad.append({
"gpu": idx,
"throttle": g["throttle"],
"real_throttle": f"0x{real_throttle:x}",
})
max_temps = {idx: max(vals) for idx, vals in temps.items() if vals}
avg_powers = {idx: sum(vals) / len(vals) for idx, vals in powers.items() if vals}
temp_delta = (max(max_temps.values()) - min(max_temps.values())) if len(max_temps) >= 2 else 0
jitter = 0
steady_tflops = []
for item in pass_tflops:
if isinstance(item, dict):
if float(item.get("elapsed_sec", 0)) >= warmup_sec:
steady_tflops.append(float(item.get("tflops", 0)))
else:
steady_tflops.append(float(item))
if len(steady_tflops) < 2 and pass_tflops:
steady_tflops = [
float(item.get("tflops", 0)) if isinstance(item, dict) else float(item)
for item in pass_tflops
]
if steady_tflops:
mean = sum(steady_tflops) / len(steady_tflops)
jitter = max(abs(v - mean) / mean * 100 for v in steady_tflops) if mean else 0
failures = []
temp_failures = {idx: v for idx, v in max_temps.items() if v > max_temp}
power_failures = {idx: v for idx, v in avg_powers.items() if v < min_power}
if not evaluation_samples:
failures.append("no telemetry samples available for evaluation")
if temp_failures:
failures.append(
"max temperature above threshold: "
+ ", ".join(f"GPU {idx} {val:.1f}C" for idx, val in sorted(temp_failures.items()))
)
if temp_delta > max_delta:
failures.append(f"GPU temperature delta {temp_delta:.1f}C exceeds {max_delta:.1f}C")
if power_failures:
failures.append(
"average steady-state power below threshold: "
+ ", ".join(f"GPU {idx} {val:.1f}W" for idx, val in sorted(power_failures.items()))
)
if throttle_bad:
failures.append(
f"non-idle throttle reasons observed in {len(throttle_bad)} samples "
f"(first: GPU {throttle_bad[0]['gpu']} {throttle_bad[0]['real_throttle']})"
)
if xid_events:
failures.append(f"{len(xid_events)} new XID/NVRM XID events observed")
if require_jitter and len(steady_tflops) < 2:
failures.append(
f"insufficient steady TFLOPS samples for jitter evaluation: {len(steady_tflops)} < 2"
)
if jitter > max_jitter:
failures.append(f"TFLOPS jitter {jitter:.2f}% exceeds {max_jitter:.2f}%")
passed = (
bool(evaluation_samples)
and all(v <= max_temp for v in max_temps.values())
and temp_delta <= max_delta
and all(v >= min_power for v in avg_powers.values())
and not throttle_bad
and not xid_events
and (not require_jitter or len(steady_tflops) >= 2)
and jitter <= max_jitter
)
return {
"passed": passed,
"samples": len(telemetry),
"steady_samples": len(evaluation_samples),
"warmup_sec": round(warmup_sec, 1),
"max_temp_c": {k: round(v, 1) for k, v in max_temps.items()},
"avg_power_w": {k: round(v, 1) for k, v in avg_powers.items()},
"temp_delta_c": round(temp_delta, 1),
"throttle_events": throttle_bad[:20],
"throttle_event_count": len(throttle_bad),
"xid_events": xid_events[-20:],
"tflops_jitter_pct": round(jitter, 2),
"steady_tflops_samples": len(steady_tflops),
"failures": failures,
"thresholds": {
"max_temp_c": max_temp,
"max_temp_delta_c": max_delta,
"min_power_w": min_power,
"max_tflops_jitter_pct": max_jitter,
"require_tflops_jitter": require_jitter,
"warmup_sec": requested_warmup,
"min_steady_samples": min_steady_samples,
},
}
@staticmethod @staticmethod
def print_results(results: dict, console: Console = None): def print_results(results: dict, console: Console = None):
c = console or Console() c = console or Console()
@ -245,5 +467,21 @@ class StressTest:
color = "green" if status == "PASS" else "red" color = "green" if status == "PASS" else "red"
c.print(f" GPU {gid}: [{color}]{status}[/{color}]") c.print(f" GPU {gid}: [{color}]{status}[/{color}]")
telemetry = results.get("telemetry") or {}
if telemetry:
c.print("\n Telemetry:")
c.print(f" Samples: {telemetry.get('samples', 0)} total, {telemetry.get('steady_samples', 0)} evaluated after {telemetry.get('warmup_sec', 0)}s warmup")
c.print(f" Avg steady power: {telemetry.get('avg_power_w', {})}")
c.print(f" Max steady temp: {telemetry.get('max_temp_c', {})}")
c.print(f" Temp delta: {telemetry.get('temp_delta_c', 'N/A')} C")
c.print(f" TFLOPS jitter: {telemetry.get('tflops_jitter_pct', 'N/A')}%")
c.print(f" Throttle events: {telemetry.get('throttle_event_count', len(telemetry.get('throttle_events', [])))}")
c.print(f" XID events: {len(telemetry.get('xid_events', []))}")
failures = telemetry.get("failures", [])
if failures:
c.print(" [red]Failure reasons:[/red]")
for reason in failures:
c.print(f" [red]- {reason}[/red]")
if results.get("error"): if results.get("error"):
c.print(f" [red]Error: {results['error']}[/red]") c.print(f" [red]Error: {results['error']}[/red]")
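
Two rules in `_evaluate_telemetry` are easy to misread in the diff: the idle bit (`0x1`) is masked off `clocks_throttle_reasons.active` before a sample counts as throttled, and the warmup window is capped at 20% of the configured duration. The sketch below restates both under stated assumptions. The bit names follow the NVML clocks-throttle-reason constants; `decode_throttle` and `effective_warmup` are hypothetical helper names that do not exist in the commit, and bits outside the table are simply not listed.

```python
# Hypothetical helpers restating two rules from _evaluate_telemetry.
# They are illustrative only and are not part of the commit.

# Bit values follow the NVML clocks-throttle-reason constants.
THROTTLE_BITS = {
    0x1: "gpu_idle",
    0x2: "applications_clocks_setting",
    0x4: "sw_power_cap",
    0x8: "hw_slowdown",
    0x10: "sync_boost",
    0x20: "sw_thermal_slowdown",
    0x40: "hw_thermal_slowdown",
    0x80: "hw_power_brake_slowdown",
    0x100: "display_clock_setting",
}
IDLE_BIT = 0x1


def decode_throttle(raw: str) -> list[str]:
    """Return the names of non-idle throttle reasons set in a hex bitmask string."""
    try:
        bitmask = int(raw.strip(), 16)
    except ValueError:
        # Same choice as the commit: an unparseable value counts as "no throttle".
        return []
    real = bitmask & ~IDLE_BIT  # drop the idle bit, keep everything else
    return [name for bit, name in sorted(THROTTLE_BITS.items()) if real & bit]


def effective_warmup(requested_warmup_sec: float, duration_sec: float) -> float:
    """Cap the warmup window at 20% of the run duration, never below zero."""
    return min(requested_warmup_sec, max(0.0, duration_sec * 0.2))


if __name__ == "__main__":
    print(decode_throttle("0x0000000000000004"))
    print(decode_throttle("0x0000000000000001"))
    print(effective_warmup(60, 45))
```

Under these assumptions, a reading of `0x4` decodes to the software power cap, an idle-only reading of `0x1` is ignored, and a 45-second run with the default `warmup_sec` of 60 is evaluated after roughly 9 seconds of warmup.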


@ -1,8 +1,13 @@
"""Training simulation module - LLM training workload with PyTorch.""" """Training simulation module - LLM training workload with PyTorch."""
import json
import os
import sys
import tempfile
import time import time
import subprocess import subprocess
import shutil import shutil
import math
from datetime import datetime from datetime import datetime
from typing import Optional from typing import Optional
@ -36,6 +41,7 @@ class TrainingSim:
batch_size = self.train_cfg.get("batch_size", 8) batch_size = self.train_cfg.get("batch_size", 8)
seq_length = self.train_cfg.get("seq_length", 2048) seq_length = self.train_cfg.get("seq_length", 2048)
num_steps = self.train_cfg.get("num_steps", 50) num_steps = self.train_cfg.get("num_steps", 50)
warmup_steps = int(self.train_cfg.get("warmup_steps", 5))
dtype_str = self.train_cfg.get("dtype", "bf16") dtype_str = self.train_cfg.get("dtype", "bf16")
dtype_map = { dtype_map = {
@ -47,7 +53,13 @@ class TrainingSim:
self.console.print(f"[cyan]Training Simulation[/cyan]") self.console.print(f"[cyan]Training Simulation[/cyan]")
self.console.print(f" Model: {model_name} | Batch: {batch_size} | Seq: {seq_length} | " self.console.print(f" Model: {model_name} | Batch: {batch_size} | Seq: {seq_length} | "
f"DType: {dtype_str} | Steps: {num_steps} | GPUs: {gpu_count}") f"DType: {dtype_str} | Steps: {num_steps} | Warmup: {warmup_steps} | GPUs: {gpu_count}")
if self.train_cfg.get("mode", "ddp") == "ddp" and gpu_count > 1:
ddp_result = self._run_synthetic_ddp(gpu_count, batch_size, seq_length, num_steps, dtype_str)
if ddp_result.get("passed") or not self.train_cfg.get("allow_fallback", False):
return ddp_result
self.console.print("[yellow]DDP synthetic training failed, falling back to single-process synthetic path[/yellow]")
try: try:
from transformers import AutoModelForCausalLM, AutoTokenizer from transformers import AutoModelForCausalLM, AutoTokenizer
@ -87,9 +99,10 @@ class TrainingSim:
BarColumn(), TextColumn("{task.completed}/{task.total}"), BarColumn(), TextColumn("{task.completed}/{task.total}"),
TimeElapsedColumn(), console=self.console, TimeElapsedColumn(), console=self.console,
) as progress: ) as progress:
task = progress.add_task("Training steps...", total=num_steps) total_steps = num_steps + warmup_steps
task = progress.add_task("Training steps...", total=total_steps)
for step in range(num_steps): for step in range(total_steps):
torch.cuda.synchronize() torch.cuda.synchronize()
t0 = time.perf_counter() t0 = time.perf_counter()
@ -119,8 +132,15 @@ class TrainingSim:
progress.advance(task) progress.advance(task)
avg_step_time = sum(step_times) / len(step_times) measured_steps = step_times[warmup_steps:] if len(step_times) > warmup_steps else step_times
avg_step_time = sum(measured_steps) / len(measured_steps)
throughput = batch_size * seq_length / avg_step_time throughput = batch_size * seq_length / avg_step_time
jitter = self._jitter_pct(measured_steps)
peak_mem = round(max(mem_usage) if mem_usage else 0, 2)
final_loss = float(loss.item()) if hasattr(loss, "item") else float("nan")
passed = self._acceptance_pass(throughput, jitter, peak_mem, final_loss)
if self.train_cfg.get("require_distributed", True):
passed = False
return { return {
"model": model_name, "model": model_name,
@ -130,11 +150,18 @@ class TrainingSim:
"batch_size": batch_size, "batch_size": batch_size,
"seq_length": seq_length, "seq_length": seq_length,
"num_steps": num_steps, "num_steps": num_steps,
"warmup_steps": warmup_steps,
"total_steps": total_steps,
"avg_step_time_ms": round(avg_step_time * 1000, 1), "avg_step_time_ms": round(avg_step_time * 1000, 1),
"throughput_tokens_per_sec": round(throughput, 0), "throughput_tokens_per_sec": round(throughput, 0),
"throughput_samples_per_sec": round(batch_size / avg_step_time, 2), "throughput_samples_per_sec": round(batch_size / avg_step_time, 2),
"peak_memory_gb": round(max(mem_usage) if mem_usage else 0, 2), "peak_memory_gb": peak_mem,
"final_loss": round(loss.item(), 4) if hasattr(loss, 'item') else None, "final_loss": round(final_loss, 4),
"step_jitter_pct": round(jitter, 2),
"distributed_mode": "device_map",
"loss_finite": math.isfinite(final_loss),
"passed": passed,
"acceptance_gap": "8-GPU DDP was not used" if self.train_cfg.get("require_distributed", True) else "",
"timestamp": datetime.now().isoformat(), "timestamp": datetime.now().isoformat(),
} }
@ -142,6 +169,196 @@ class TrainingSim:
self.console.print(f"[yellow]Model loading failed: {e}[/yellow]") self.console.print(f"[yellow]Model loading failed: {e}[/yellow]")
return self._run_synthetic(gpu_count, batch_size, seq_length, num_steps, dtype) return self._run_synthetic(gpu_count, batch_size, seq_length, num_steps, dtype)
def _run_synthetic_ddp(self, gpu_count: int, batch_size: int, seq_length: int,
num_steps: int, dtype_str: str) -> dict:
"""Run the 1.5B synthetic Transformer with one process per GPU."""
torchrun = os.path.join(os.path.dirname(sys.executable), "torchrun")
if not os.path.isfile(torchrun):
torchrun = shutil.which("torchrun") or ""
if not torchrun:
return {
"model": "synthetic_transformer_1.5b",
"gpu_count": gpu_count,
"distributed_mode": "ddp",
"passed": False,
"error": "torchrun not found",
"timestamp": datetime.now().isoformat(),
}
script = r'''
import json
import math
import os
import time
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
def main():
local_rank = int(os.environ["LOCAL_RANK"])
world_size = int(os.environ["WORLD_SIZE"])
torch.cuda.set_device(local_rank)
dist.init_process_group("nccl")
global_batch = int(os.environ["TRAIN_BATCH_SIZE"])
local_batch = max(1, global_batch // world_size)
seq_length = int(os.environ["TRAIN_SEQ_LENGTH"])
num_steps = int(os.environ["TRAIN_NUM_STEPS"])
warmup_steps = int(os.environ.get("TRAIN_WARMUP_STEPS", "5"))
total_steps = num_steps + warmup_steps
dtype_name = os.environ.get("TRAIN_DTYPE", "bf16")
dtype = {"fp16": torch.float16, "bf16": torch.bfloat16, "fp32": torch.float32}.get(dtype_name, torch.bfloat16)
hidden_size = 4096
num_layers = 6
num_heads = 32
vocab_size = 32000
class SyntheticTransformer(torch.nn.Module):
def __init__(self):
super().__init__()
self.embed = torch.nn.Embedding(vocab_size, hidden_size)
self.layers = torch.nn.ModuleList([
torch.nn.TransformerEncoderLayer(
d_model=hidden_size,
nhead=num_heads,
dim_feedforward=hidden_size * 4,
batch_first=True,
dtype=dtype,
) for _ in range(num_layers)
])
self.head = torch.nn.Linear(hidden_size, vocab_size, dtype=dtype)
def forward(self, x):
h = self.embed(x).to(dtype)
for layer in self.layers:
h = layer(h)
return self.head(h)
model = SyntheticTransformer().cuda()
total_params = sum(p.numel() for p in model.parameters())
model = DDP(model, device_ids=[local_rank], output_device=local_rank)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
input_ids = torch.randint(0, vocab_size, (local_batch, seq_length), device="cuda")
step_times = []
last_loss = torch.tensor(float("nan"), device="cuda")
torch.cuda.reset_peak_memory_stats(local_rank)
for _ in range(total_steps):
torch.cuda.synchronize()
t0 = time.perf_counter()
with torch.amp.autocast("cuda", dtype=dtype, enabled=dtype in (torch.float16, torch.bfloat16)):
logits = model(input_ids)
loss = torch.nn.functional.cross_entropy(logits.reshape(-1, vocab_size), input_ids.reshape(-1))
loss.backward()
optimizer.step()
optimizer.zero_grad(set_to_none=True)
torch.cuda.synchronize()
step_times.append(time.perf_counter() - t0)
last_loss = loss.detach()
peak_mem = torch.tensor(torch.cuda.max_memory_allocated(local_rank) / 1024**3, device="cuda")
dist.all_reduce(peak_mem, op=dist.ReduceOp.MAX)
finite = torch.tensor(1 if math.isfinite(float(last_loss.item())) else 0, device="cuda")
dist.all_reduce(finite, op=dist.ReduceOp.MIN)
if dist.get_rank() == 0:
measured_steps = step_times[warmup_steps:] if len(step_times) > warmup_steps else step_times
avg_step = sum(measured_steps) / len(measured_steps)
mean = avg_step
jitter = max(abs(v - mean) / mean * 100 for v in measured_steps) if mean else 0.0
throughput = global_batch * seq_length / avg_step if avg_step else 0.0
print("TRAINING_DDP_JSON=" + json.dumps({
"model": "synthetic_transformer_1.5b",
"total_params_m": round(total_params / 1e6, 1),
"num_layers": num_layers,
"hidden_size": hidden_size,
"gpu_count": world_size,
"dtype": dtype_name,
"batch_size": global_batch,
"local_batch_size": local_batch,
"seq_length": seq_length,
"num_steps": num_steps,
"warmup_steps": warmup_steps,
"total_steps": total_steps,
"avg_step_time_ms": round(avg_step * 1000, 1),
"throughput_tokens_per_sec": round(throughput, 0),
"throughput_samples_per_sec": round(global_batch / avg_step, 2) if avg_step else 0,
"peak_memory_gb": round(float(peak_mem.item()), 2),
"final_loss": round(float(last_loss.item()), 4),
"step_jitter_pct": round(jitter, 2),
"distributed_mode": "ddp",
"loss_finite": bool(int(finite.item())),
}), flush=True)
dist.destroy_process_group()
if __name__ == "__main__":
main()
'''
tmp = tempfile.NamedTemporaryFile("w", suffix="_training_ddp.py", delete=False)
tmp.write(script)
tmp.close()
env = {
**os.environ,
"TRAIN_BATCH_SIZE": str(batch_size),
"TRAIN_SEQ_LENGTH": str(seq_length),
"TRAIN_NUM_STEPS": str(num_steps),
"TRAIN_WARMUP_STEPS": str(int(self.train_cfg.get("warmup_steps", 5))),
"TRAIN_DTYPE": dtype_str,
"NCCL_DEBUG": os.environ.get("NCCL_DEBUG", "WARN"),
}
cmd = [torchrun, f"--nproc_per_node={gpu_count}", tmp.name]
self.console.print(f" Running synthetic 1.5B DDP via torchrun ({gpu_count} processes)...")
try:
timeout = int(self.train_cfg.get("timeout_sec", max(600, num_steps * 180)))
r = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout, env=env)
except subprocess.TimeoutExpired:
os.unlink(tmp.name)
return {
"model": "synthetic_transformer_1.5b",
"gpu_count": gpu_count,
"distributed_mode": "ddp",
"passed": False,
"error": "training_ddp_timeout",
"timestamp": datetime.now().isoformat(),
}
finally:
if os.path.exists(tmp.name):
try:
os.unlink(tmp.name)
except OSError:
pass
marker = "TRAINING_DDP_JSON="
payload = None
for line in (r.stdout + "\n" + r.stderr).splitlines():
if marker in line:
payload = line.split(marker, 1)[1].strip()
if r.returncode != 0 or not payload:
return {
"model": "synthetic_transformer_1.5b",
"gpu_count": gpu_count,
"distributed_mode": "ddp",
"passed": False,
"error": (r.stderr or r.stdout or "training_ddp_failed")[-1000:],
"timestamp": datetime.now().isoformat(),
}
result = json.loads(payload)
loss_value = float(result.get("final_loss", "nan"))
passed = self._acceptance_pass(
float(result.get("throughput_tokens_per_sec", 0)),
float(result.get("step_jitter_pct", 999)),
float(result.get("peak_memory_gb", 999)),
loss_value,
) and bool(result.get("loss_finite", False)) and result.get("gpu_count") == gpu_count
result.update({
"passed": passed,
"timestamp": datetime.now().isoformat(),
})
return result
def _run_synthetic(self, gpu_count, batch_size, seq_length, num_steps, dtype) -> dict: def _run_synthetic(self, gpu_count, batch_size, seq_length, num_steps, dtype) -> dict:
self.console.print(" Running synthetic training benchmark...") self.console.print(" Running synthetic training benchmark...")
@ -170,11 +387,17 @@ class TrainingSim:
h = layer(h) h = layer(h)
return self.head(h) return self.head(h)
model = SyntheticTransformer().cuda() model = SyntheticTransformer()
total_params = sum(p.numel() for p in model.parameters()) total_params = sum(p.numel() for p in model.parameters())
self.console.print(f" Synthetic params: {total_params / 1e6:.1f}M") self.console.print(f" Synthetic params: {total_params / 1e6:.1f}M")
distributed_mode = "single_gpu"
if gpu_count > 1:
model = torch.nn.DataParallel(model).cuda()
distributed_mode = "data_parallel"
else:
model = model.cuda()
model.train() model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4) optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
@ -183,14 +406,17 @@ class TrainingSim:
step_times = [] step_times = []
mem_usage = [] mem_usage = []
warmup_steps = int(self.train_cfg.get("warmup_steps", 5))
total_steps = num_steps + warmup_steps
with Progress( with Progress(
SpinnerColumn(), TextColumn("[progress.description]{task.description}"), SpinnerColumn(), TextColumn("[progress.description]{task.description}"),
BarColumn(), TextColumn("{task.completed}/{task.total}"), BarColumn(), TextColumn("{task.completed}/{task.total}"),
TimeElapsedColumn(), console=self.console, TimeElapsedColumn(), console=self.console,
) as progress: ) as progress:
task = progress.add_task("Synthetic training...", total=num_steps) task = progress.add_task("Synthetic training...", total=total_steps)
for step in range(num_steps): for step in range(total_steps):
torch.cuda.synchronize() torch.cuda.synchronize()
t0 = time.perf_counter() t0 = time.perf_counter()
@ -206,14 +432,22 @@ class TrainingSim:
elapsed = time.perf_counter() - t0 elapsed = time.perf_counter() - t0
step_times.append(elapsed) step_times.append(elapsed)
mem_used = torch.cuda.max_memory_allocated() / 1024**3 mem_used = max(torch.cuda.max_memory_allocated(i) for i in range(gpu_count)) / 1024**3
mem_usage.append(mem_used) mem_usage.append(mem_used)
torch.cuda.reset_peak_memory_stats() for i in range(gpu_count):
torch.cuda.reset_peak_memory_stats(i)
progress.advance(task) progress.advance(task)
avg_step_time = sum(step_times) / len(step_times) measured_steps = step_times[warmup_steps:] if len(step_times) > warmup_steps else step_times
avg_step_time = sum(measured_steps) / len(measured_steps)
throughput = batch_size * seq_length / avg_step_time throughput = batch_size * seq_length / avg_step_time
jitter = self._jitter_pct(measured_steps)
peak_mem = round(max(mem_usage) if mem_usage else 0, 2)
final_loss = float(loss.item())
passed = self._acceptance_pass(throughput, jitter, peak_mem, final_loss)
if self.train_cfg.get("require_distributed", True):
passed = False
return { return {
"model": "synthetic_transformer", "model": "synthetic_transformer",
@ -225,14 +459,36 @@ class TrainingSim:
"batch_size": batch_size, "batch_size": batch_size,
"seq_length": seq_length, "seq_length": seq_length,
"num_steps": num_steps, "num_steps": num_steps,
"warmup_steps": warmup_steps,
"total_steps": total_steps,
"avg_step_time_ms": round(avg_step_time * 1000, 1), "avg_step_time_ms": round(avg_step_time * 1000, 1),
"throughput_tokens_per_sec": round(throughput, 0), "throughput_tokens_per_sec": round(throughput, 0),
"throughput_samples_per_sec": round(batch_size / avg_step_time, 2), "throughput_samples_per_sec": round(batch_size / avg_step_time, 2),
"peak_memory_gb": round(max(mem_usage) if mem_usage else 0, 2), "peak_memory_gb": peak_mem,
"final_loss": round(loss.item(), 4), "final_loss": round(final_loss, 4),
"step_jitter_pct": round(jitter, 2),
"distributed_mode": distributed_mode,
"loss_finite": math.isfinite(final_loss),
"passed": passed,
"acceptance_gap": "8-GPU DDP was not used" if self.train_cfg.get("require_distributed", True) else "",
"timestamp": datetime.now().isoformat(), "timestamp": datetime.now().isoformat(),
} }
@staticmethod
def _jitter_pct(step_times: list[float]) -> float:
if not step_times:
return 0.0
mean = sum(step_times) / len(step_times)
return max(abs(v - mean) / mean * 100 for v in step_times) if mean else 0.0
def _acceptance_pass(self, throughput: float, jitter: float, peak_mem: float, loss_value: float) -> bool:
return (
throughput >= float(self.train_cfg.get("min_tokens_per_sec", 45000))
and jitter <= float(self.train_cfg.get("max_step_jitter_pct", 3))
and peak_mem <= float(self.train_cfg.get("max_peak_memory_gb", 70))
and math.isfinite(loss_value)
)
@staticmethod @staticmethod
def print_results(results: dict, console: Console = None): def print_results(results: dict, console: Console = None):
c = console or Console() c = console or Console()
@ -254,11 +510,15 @@ class TrainingSim:
("Batch Size", str(results.get("batch_size", "N/A"))), ("Batch Size", str(results.get("batch_size", "N/A"))),
("Seq Length", str(results.get("seq_length", "N/A"))), ("Seq Length", str(results.get("seq_length", "N/A"))),
("Steps", str(results.get("num_steps", "N/A"))), ("Steps", str(results.get("num_steps", "N/A"))),
("Warmup Steps", str(results.get("warmup_steps", "N/A"))),
("Avg Step Time", f"{results.get('avg_step_time_ms', 'N/A')} ms"), ("Avg Step Time", f"{results.get('avg_step_time_ms', 'N/A')} ms"),
("Throughput", f"{results.get('throughput_tokens_per_sec', 'N/A')} tokens/s"), ("Throughput", f"{results.get('throughput_tokens_per_sec', 'N/A')} tokens/s"),
("Samples/sec", f"{results.get('throughput_samples_per_sec', 'N/A')}"), ("Samples/sec", f"{results.get('throughput_samples_per_sec', 'N/A')}"),
("Peak Memory", f"{results.get('peak_memory_gb', 'N/A')} GB"), ("Peak Memory", f"{results.get('peak_memory_gb', 'N/A')} GB"),
("Final Loss", str(results.get("final_loss", "N/A"))), ("Final Loss", str(results.get("final_loss", "N/A"))),
("Step Jitter", f"{results.get('step_jitter_pct', 'N/A')}%"),
("Distributed Mode", results.get("distributed_mode", "N/A")),
("Verdict", "PASS" if results.get("passed") else "FAIL"),
] ]
for label, val in metrics: for label, val in metrics:
table.add_row(label, str(val)) table.add_row(label, str(val))
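
The training verdict above depends on how `step_jitter_pct` is computed: warmup steps are dropped first, and jitter is the largest absolute deviation from the mean step time, expressed as a percentage (not a standard deviation). A minimal sketch of that calculation follows. `measured_window` and the sample step times are hypothetical and chosen only for illustration; the 3% limit and the batch and sequence sizes are the defaults visible in the diff.

```python
# Illustrative restatement of the jitter rule used by TrainingSim.
# The sample step times below are made up; they are not measurements.


def measured_window(step_times: list[float], warmup_steps: int) -> list[float]:
    """Drop warmup steps, unless that would leave nothing to measure."""
    return step_times[warmup_steps:] if len(step_times) > warmup_steps else step_times


def jitter_pct(step_times: list[float]) -> float:
    """Largest absolute deviation from the mean step time, in percent."""
    if not step_times:
        return 0.0
    mean = sum(step_times) / len(step_times)
    return max(abs(v - mean) / mean * 100 for v in step_times) if mean else 0.0


if __name__ == "__main__":
    raw_steps = [2.0, 1.5, 1.0, 1.0, 1.0, 1.1]  # seconds, first two are warmup
    steps = measured_window(raw_steps, warmup_steps=2)
    avg = sum(steps) / len(steps)
    jitter = jitter_pct(steps)
    tokens_per_sec = 8 * 2048 / avg  # batch_size * seq_length / avg_step_time
    max_jitter = 3.0  # default max_step_jitter_pct in the diff
    print(f"avg={avg:.3f}s jitter={jitter:.2f}% tokens/s={tokens_per_sec:.0f}")
    print("jitter gate:", "PASS" if jitter <= max_jitter else "FAIL")
```

Because the metric is a maximum rather than an average, a single slow step is enough to fail the gate: in this made-up sample, one 1.1 s step among 1.0 s steps already pushes jitter above 3%.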


@ -0,0 +1,921 @@
{
"timestamp": "2026-05-22T15:49:02.368516",
"gpu_info": {
"driver_version": "580.159.03",
"cuda_version": "13.0",
"gpu_count": 8,
"gpus": [
{
"index": 0,
"name": "NVIDIA H100 80GB HBM3",
"uuid": "GPU-dfbc9513-255d-4fe7-2b77-7b1ec3972e75",
"pci_bus_id": "00000000:18:00.0",
"pcie_link_gen": 5,
"pcie_link_width": 16,
"vram_total_mb": 81559,
"vram_used_mb": 4,
"vram_free_mb": 81076,
"power_draw": 69.98,
"power_limit": 700.0,
"clock_sm": 345,
"clock_mem": 2619,
"temperature": 21,
"fan_speed": 0,
"persistence_mode": false,
"compute_mode": "Default",
"serial_number": "1651924016120",
"ecc_errors_single": 0,
"ecc_errors_double": 0
},
{
"index": 1,
"name": "NVIDIA H100 80GB HBM3",
"uuid": "GPU-bb845ef7-d7b5-f011-9395-ea74274e2282",
"pci_bus_id": "00000000:2A:00.0",
"pcie_link_gen": 5,
"pcie_link_width": 16,
"vram_total_mb": 81559,
"vram_used_mb": 4,
"vram_free_mb": 81076,
"power_draw": 67.54,
"power_limit": 700.0,
"clock_sm": 345,
"clock_mem": 2619,
"temperature": 21,
"fan_speed": 0,
"persistence_mode": false,
"compute_mode": "Default",
"serial_number": "1651924015483",
"ecc_errors_single": 0,
"ecc_errors_double": 0
},
{
"index": 2,
"name": "NVIDIA H100 80GB HBM3",
"uuid": "GPU-3720cf13-2a34-be38-27be-0a7adc4addc4",
"pci_bus_id": "00000000:3A:00.0",
"pcie_link_gen": 5,
"pcie_link_width": 16,
"vram_total_mb": 81559,
"vram_used_mb": 4,
"vram_free_mb": 81076,
"power_draw": 66.82,
"power_limit": 700.0,
"clock_sm": 345,
"clock_mem": 2619,
"temperature": 22,
"fan_speed": 0,
"persistence_mode": false,
"compute_mode": "Default",
"serial_number": "1651924025595",
"ecc_errors_single": 0,
"ecc_errors_double": 0
},
{
"index": 3,
"name": "NVIDIA H100 80GB HBM3",
"uuid": "GPU-87080b2d-ac43-be0d-d574-c193078850ae",
"pci_bus_id": "00000000:5D:00.0",
"pcie_link_gen": 5,
"pcie_link_width": 16,
"vram_total_mb": 81559,
"vram_used_mb": 4,
"vram_free_mb": 81076,
"power_draw": 67.02,
"power_limit": 700.0,
"clock_sm": 345,
"clock_mem": 2619,
"temperature": 21,
"fan_speed": 0,
"persistence_mode": false,
"compute_mode": "Default",
"serial_number": "1651924016862",
"ecc_errors_single": 0,
"ecc_errors_double": 0
},
{
"index": 4,
"name": "NVIDIA H100 80GB HBM3",
"uuid": "GPU-599bd883-cc5c-a5dd-6c33-c15f7049da48",
"pci_bus_id": "00000000:9A:00.0",
"pcie_link_gen": 5,
"pcie_link_width": 16,
"vram_total_mb": 81559,
"vram_used_mb": 4,
"vram_free_mb": 81076,
"power_draw": 67.24,
"power_limit": 700.0,
"clock_sm": 345,
"clock_mem": 2619,
"temperature": 21,
"fan_speed": 0,
"persistence_mode": false,
"compute_mode": "Default",
"serial_number": "1651924025670",
"ecc_errors_single": 0,
"ecc_errors_double": 0
},
{
"index": 5,
"name": "NVIDIA H100 80GB HBM3",
"uuid": "GPU-a1c6bba4-61b0-e623-06c9-9c88635e26fe",
"pci_bus_id": "00000000:AB:00.0",
"pcie_link_gen": 5,
"pcie_link_width": 16,
"vram_total_mb": 81559,
"vram_used_mb": 4,
"vram_free_mb": 81076,
"power_draw": 69.31,
"power_limit": 700.0,
"clock_sm": 345,
"clock_mem": 2619,
"temperature": 23,
"fan_speed": 0,
"persistence_mode": false,
"compute_mode": "Default",
"serial_number": "1651924027166",
"ecc_errors_single": 0,
"ecc_errors_double": 0
},
{
"index": 6,
"name": "NVIDIA H100 80GB HBM3",
"uuid": "GPU-98745a0c-39bd-3e56-d6ca-54ba3647ab6d",
"pci_bus_id": "00000000:BA:00.0",
"pcie_link_gen": 5,
"pcie_link_width": 16,
"vram_total_mb": 81559,
"vram_used_mb": 4,
"vram_free_mb": 81076,
"power_draw": 67.84,
"power_limit": 700.0,
"clock_sm": 345,
"clock_mem": 2619,
"temperature": 21,
"fan_speed": 0,
"persistence_mode": false,
"compute_mode": "Default",
"serial_number": "1651924026234",
"ecc_errors_single": 0,
"ecc_errors_double": 0
},
{
"index": 7,
"name": "NVIDIA H100 80GB HBM3",
"uuid": "GPU-8c73bd8b-666b-357e-ac5d-c75ac7a759db",
"pci_bus_id": "00000000:DB:00.0",
"pcie_link_gen": 5,
"pcie_link_width": 16,
"vram_total_mb": 81559,
"vram_used_mb": 4,
"vram_free_mb": 81076,
"power_draw": 66.21,
"power_limit": 700.0,
"clock_sm": 345,
"clock_mem": 2619,
"temperature": 21,
"fan_speed": 0,
"persistence_mode": false,
"compute_mode": "Default",
"serial_number": "1651924027255",
"ecc_errors_single": 0,
"ecc_errors_double": 0
}
],
"topology": "\t\u001b[4mGPU0\tGPU1\tGPU2\tGPU3\tGPU4\tGPU5\tGPU6\tGPU7\tNIC0\tNIC1\tNIC2\tNIC3\tNIC4\tNIC5\tNIC6\tNIC7\tNIC8\tNIC9\tCPU Affinity\tNUMA Affinity\tGPU NUMA ID\u001b[0m\nGPU0\t X \tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\tPIX\tNODE\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t0-55,112-167\t0\t\tN/A\nGPU1\tNV18\t X \tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\tNODE\tPIX\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t0-55,112-167\t0\t\tN/A\nGPU2\tNV18\tNV18\t X \tNV18\tNV18\tNV18\tNV18\tNV18\tNODE\tNODE\tPIX\tPIX\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t0-55,112-167\t0\t\tN/A\nGPU3\tNV18\tNV18\tNV18\t X \tNV18\tNV18\tNV18\tNV18\tNODE\tNODE\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t0-55,112-167\t0\t\tN/A\nGPU4\tNV18\tNV18\tNV18\tNV18\t X \tNV18\tNV18\tNV18\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tPIX\tNODE\tNODE\tNODE\t56-111,168-223\t1\t\tN/A\nGPU5\tNV18\tNV18\tNV18\tNV18\tNV18\t X \tNV18\tNV18\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\tPIX\tNODE\tNODE\t56-111,168-223\t1\t\tN/A\nGPU6\tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\t X \tNV18\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tPIX\tPIX\t56-111,168-223\t1\t\tN/A\nGPU7\tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\t X \tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tNODE\tNODE\t56-111,168-223\t1\t\tN/A\nNIC0\tPIX\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t X \tNODE\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC1\tNODE\tPIX\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\tNODE\t X \tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC2\tNODE\tNODE\tPIX\tNODE\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\t X \tPIX\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC3\tNODE\tNODE\tPIX\tNODE\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tPIX\t X \tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC4\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tNODE\tNODE\t X \tPIX\tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC5\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tNODE\tNODE\tPIX\t X \tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC6\tSYS\tSYS\tSYS\tSYS\tPIX\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\t X 
\tNODE\tNODE\tNODE\t\t\t\t\nNIC7\tSYS\tSYS\tSYS\tSYS\tNODE\tPIX\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\t X \tNODE\tNODE\t\t\t\t\nNIC8\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tPIX\tNODE\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\t X \tPIX\t\t\t\t\nNIC9\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tPIX\tNODE\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tPIX\t X \t\t\t\t\n\nLegend:\n\n X = Self\n SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)\n NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node\n PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)\n PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)\n PIX = Connection traversing at most a single PCIe bridge\n NV# = Connection traversing a bonded set of # NVLinks\n\nNIC Legend:\n\n NIC0: mlx5_0\n NIC1: mlx5_1\n NIC2: mlx5_2\n NIC3: mlx5_3\n NIC4: mlx5_4\n NIC5: mlx5_5\n NIC6: mlx5_6\n NIC7: mlx5_7\n NIC8: mlx5_8\n NIC9: mlx5_9\n\n",
"timestamp": "2026-05-22T15:49:09.197459",
"detected_gpu_type": "h100",
"gpu_label": "H100 SXM5"
},
"health": {
"passed": true,
"gpu_health": [
{
"index": 0,
"status": "WARN",
"checks": {
"temperature": {
"value": 21,
"status": "PASS",
"threshold": 75
},
"power": {
"value": 69.86,
"limit": 700.0,
"status": "PASS"
},
"ecc_errors": {
"single": 0,
"double": 0,
"status": "PASS"
},
"memory_errors": {
"status": "PASS"
},
"pcie_link": {
"gen": 5,
"width": 16,
"status": "PASS"
},
"clock_speed": {
"sm": 345,
"mem": 2619,
"status": "PASS"
},
"throttling": {
"status": "PASS",
"reasons": []
},
"persistence_mode": {
"enabled": false,
"status": "WARN"
}
}
},
{
"index": 1,
"status": "WARN",
"checks": {
"temperature": {
"value": 21,
"status": "PASS",
"threshold": 75
},
"power": {
"value": 67.48,
"limit": 700.0,
"status": "PASS"
},
"ecc_errors": {
"single": 0,
"double": 0,
"status": "PASS"
},
"memory_errors": {
"status": "PASS"
},
"pcie_link": {
"gen": 5,
"width": 16,
"status": "PASS"
},
"clock_speed": {
"sm": 345,
"mem": 2619,
"status": "PASS"
},
"throttling": {
"status": "PASS",
"reasons": []
},
"persistence_mode": {
"enabled": false,
"status": "WARN"
}
}
},
{
"index": 2,
"status": "WARN",
"checks": {
"temperature": {
"value": 22,
"status": "PASS",
"threshold": 75
},
"power": {
"value": 66.76,
"limit": 700.0,
"status": "PASS"
},
"ecc_errors": {
"single": 0,
"double": 0,
"status": "PASS"
},
"memory_errors": {
"status": "PASS"
},
"pcie_link": {
"gen": 5,
"width": 16,
"status": "PASS"
},
"clock_speed": {
"sm": 345,
"mem": 2619,
"status": "PASS"
},
"throttling": {
"status": "PASS",
"reasons": []
},
"persistence_mode": {
"enabled": false,
"status": "WARN"
}
}
},
{
"index": 3,
"status": "WARN",
"checks": {
"temperature": {
"value": 21,
"status": "PASS",
"threshold": 75
},
"power": {
"value": 67.06,
"limit": 700.0,
"status": "PASS"
},
"ecc_errors": {
"single": 0,
"double": 0,
"status": "PASS"
},
"memory_errors": {
"status": "PASS"
},
"pcie_link": {
"gen": 5,
"width": 16,
"status": "PASS"
},
"clock_speed": {
"sm": 345,
"mem": 2619,
"status": "PASS"
},
"throttling": {
"status": "PASS",
"reasons": []
},
"persistence_mode": {
"enabled": false,
"status": "WARN"
}
}
},
{
"index": 4,
"status": "WARN",
"checks": {
"temperature": {
"value": 21,
"status": "PASS",
"threshold": 75
},
"power": {
"value": 67.23,
"limit": 700.0,
"status": "PASS"
},
"ecc_errors": {
"single": 0,
"double": 0,
"status": "PASS"
},
"memory_errors": {
"status": "PASS"
},
"pcie_link": {
"gen": 5,
"width": 16,
"status": "PASS"
},
"clock_speed": {
"sm": 345,
"mem": 2619,
"status": "PASS"
},
"throttling": {
"status": "PASS",
"reasons": []
},
"persistence_mode": {
"enabled": false,
"status": "WARN"
}
}
},
{
"index": 5,
"status": "WARN",
"checks": {
"temperature": {
"value": 23,
"status": "PASS",
"threshold": 75
},
"power": {
"value": 69.27,
"limit": 700.0,
"status": "PASS"
},
"ecc_errors": {
"single": 0,
"double": 0,
"status": "PASS"
},
"memory_errors": {
"status": "PASS"
},
"pcie_link": {
"gen": 5,
"width": 16,
"status": "PASS"
},
"clock_speed": {
"sm": 345,
"mem": 2619,
"status": "PASS"
},
"throttling": {
"status": "PASS",
"reasons": []
},
"persistence_mode": {
"enabled": false,
"status": "WARN"
}
}
},
{
"index": 6,
"status": "WARN",
"checks": {
"temperature": {
"value": 21,
"status": "PASS",
"threshold": 75
},
"power": {
"value": 67.81,
"limit": 700.0,
"status": "PASS"
},
"ecc_errors": {
"single": 0,
"double": 0,
"status": "PASS"
},
"memory_errors": {
"status": "PASS"
},
"pcie_link": {
"gen": 5,
"width": 16,
"status": "PASS"
},
"clock_speed": {
"sm": 345,
"mem": 2619,
"status": "PASS"
},
"throttling": {
"status": "PASS",
"reasons": []
},
"persistence_mode": {
"enabled": false,
"status": "WARN"
}
}
},
{
"index": 7,
"status": "WARN",
"checks": {
"temperature": {
"value": 21,
"status": "PASS",
"threshold": 75
},
"power": {
"value": 66.3,
"limit": 700.0,
"status": "PASS"
},
"ecc_errors": {
"single": 0,
"double": 0,
"status": "PASS"
},
"memory_errors": {
"status": "PASS"
},
"pcie_link": {
"gen": 5,
"width": 16,
"status": "PASS"
},
"clock_speed": {
"sm": 345,
"mem": 2619,
"status": "PASS"
},
"throttling": {
"status": "PASS",
"reasons": []
},
"persistence_mode": {
"enabled": false,
"status": "WARN"
}
}
}
],
"system_health": {
"nvidia_persistenced": {
"installed": true,
"running": false
},
"hugepages": {
"configured": false,
"count": 0
},
"swap": {
"enabled": true
},
"transparent_hugepage": "madvise",
"file_descriptors": {
"soft": 1024,
"max": 1048576
},
"infiniband_devices": [
"mlx5_4",
"mlx5_2",
"mlx5_0",
"mlx5_9",
"mlx5_7",
"mlx5_5",
"mlx5_3",
"mlx5_1",
"mlx5_8",
"mlx5_6"
],
"rdma_devices": [
"abi_version",
"uverbs4",
"uverbs2",
"uverbs0",
"uverbs9",
"uverbs7",
"uverbs5",
"uverbs3",
"uverbs1",
"uverbs8",
"uverbs6"
],
"nccl_env_vars": {}
},
"timestamp": "2026-05-22T15:49:11.294816",
"detected_gpu_type": "h100"
},
"memory_bench": {
"memory": {
"source": "nvbandwidth",
"h2d_bandwidth_gbps": 55.5,
"d2h_bandwidth_gbps": 55.3,
"d2d_bandwidth_gbps": 486.5,
"h2d_peak_gbps": 64,
"d2h_peak_gbps": 64,
"d2d_peak_gbps": 450.0,
"h2d_efficiency_pct": 86.7,
"d2h_efficiency_pct": 86.4,
"d2d_efficiency_pct": 108.1,
"peak_bandwidth_gbps": 3400,
"efficiency_pct": 108.1,
"results_by_test": {
"h2d": 55.5,
"d2h": 55.3,
"d2d_write": 397.4,
"d2d_read": 395.1,
"d2d_bidir": 486.5
},
"per_gpu": []
}
},
"compute_bench": {
"compute": {
"per_dtype_tflops": {
"fp32": 51.9,
"tf32": 357.0,
"fp16": 664.0,
"bf16": 700.1,
"fp8": 1116.2
},
"peak_tflops": {
"fp32": 67,
"tf32": 495,
"fp16": 990,
"bf16": 990,
"fp8": 1979
},
"efficiency_pct": {
"fp32": 77.5,
"tf32": 72.1,
"fp16": 67.1,
"bf16": 70.7,
"fp8": 56.4
},
"pass_thresholds_tflops": {
"fp32": 54,
"tf32": 444,
"fp16": 734,
"bf16": 745,
"fp8": 1400
},
"per_gpu": [
{
"index": 0,
"fp32": 51.9,
"tf32": 357.0,
"fp16": 664.0,
"bf16": 700.1,
"fp8": 1116.2
},
{
"index": 1,
"fp32": 51.9,
"tf32": 357.0,
"fp16": 664.0,
"bf16": 700.1,
"fp8": 1116.2
},
{
"index": 2,
"fp32": 51.9,
"tf32": 357.0,
"fp16": 664.0,
"bf16": 700.1,
"fp8": 1116.2
},
{
"index": 3,
"fp32": 51.9,
"tf32": 357.0,
"fp16": 664.0,
"bf16": 700.1,
"fp8": 1116.2
},
{
"index": 4,
"fp32": 51.9,
"tf32": 357.0,
"fp16": 664.0,
"bf16": 700.1,
"fp8": 1116.2
},
{
"index": 5,
"fp32": 51.9,
"tf32": 357.0,
"fp16": 664.0,
"bf16": 700.1,
"fp8": 1116.2
},
{
"index": 6,
"fp32": 51.9,
"tf32": 357.0,
"fp16": 664.0,
"bf16": 700.1,
"fp8": 1116.2
},
{
"index": 7,
"fp32": 51.9,
"tf32": 357.0,
"fp16": 664.0,
"bf16": 700.1,
"fp8": 1116.2
}
],
"matrix_size": 8192,
"warmup": 50,
"iterations": 500
}
},
"nccl": {
"passed": false,
"source": "torchrun_fallback",
"tests": {
"NCCL version 2.21.5+cuda12.4": {
"status": "FAIL",
"error": null
},
"allreduce": {
"status": "PASS",
"error": null
},
"broadcast": {
"status": "PASS",
"error": null
},
"allgather": {
"status": "PASS",
"error": null
},
"reducescatter": {
"status": "PASS",
"error": null
},
"alltoall": {
"status": "PASS",
"error": null
}
},
"gpu_count": 8
},
"stress": {
"source": "pytorch",
"passed": true,
"duration_sec": 60,
"elapsed_sec": 60.0,
"gpu_status": {
"0": "PASS",
"1": "PASS",
"2": "PASS",
"3": "PASS",
"4": "PASS",
"5": "PASS",
"6": "PASS",
"7": "PASS"
},
"timestamp": "2026-05-22T15:51:56.803540"
},
"rdma": {
"passed": false,
"devices": [
{
"name": "mlx5_0",
"ports": [
{
"port": "1",
"rate": "400 Gb/sec (4X NDR)",
"state": "4: ACTIVE",
"phys_state": "5: LinkUp",
"gid": "fe80:0000:0000:0000:58a2:e103:0088:81e0"
}
]
},
{
"name": "mlx5_1",
"ports": [
{
"port": "1",
"rate": "400 Gb/sec (4X NDR)",
"state": "4: ACTIVE",
"phys_state": "5: LinkUp",
"gid": "fe80:0000:0000:0000:9c63:c003:0054:e00a"
}
]
},
{
"name": "mlx5_2",
"ports": [
{
"port": "1",
"rate": "25 Gb/sec (1X EDR)",
"state": "4: ACTIVE",
"phys_state": "5: LinkUp",
"gid": "fe80:0000:0000:0000:a02d:75ff:feae:2bcf"
}
]
},
{
"name": "mlx5_3",
"ports": [
{
"port": "1",
"rate": "25 Gb/sec (1X EDR)",
"state": "1: DOWN",
"phys_state": "3: Disabled",
"gid": "fe80:0000:0000:0000:c670:bdff:fefd:5bd9"
}
]
},
{
"name": "mlx5_4",
"ports": [
{
"port": "1",
"rate": "100 Gb/sec (2X HDR)",
"state": "4: ACTIVE",
"phys_state": "5: LinkUp",
"gid": "fe80:0000:0000:0000:9c63:c003:005f:58ec"
}
]
},
{
"name": "mlx5_5",
"ports": [
{
"port": "1",
"rate": "100 Gb/sec (2X HDR)",
"state": "4: ACTIVE",
"phys_state": "5: LinkUp",
"gid": "fe80:0000:0000:0000:9c63:c003:005f:58ed"
}
]
},
{
"name": "mlx5_6",
"ports": [
{
"port": "1",
"rate": "400 Gb/sec (4X NDR)",
"state": "4: ACTIVE",
"phys_state": "5: LinkUp",
"gid": "fe80:0000:0000:0000:9c63:c003:0055:0e56"
}
]
},
{
"name": "mlx5_7",
"ports": [
{
"port": "1",
"rate": "400 Gb/sec (4X NDR)",
"state": "4: ACTIVE",
"phys_state": "5: LinkUp",
"gid": "fe80:0000:0000:0000:a088:c203:00f0:286c"
}
]
},
{
"name": "mlx5_8",
"ports": [
{
"port": "1",
"rate": "25 Gb/sec (1X EDR)",
"state": "4: ACTIVE",
"phys_state": "5: LinkUp",
"gid": "fe80:0000:0000:0000:a02d:75ff:feae:2bcf"
}
]
},
{
"name": "mlx5_9",
"ports": [
{
"port": "1",
"rate": "25 Gb/sec (1X EDR)",
"state": "1: DOWN",
"phys_state": "3: Disabled",
"gid": "fe80:0000:0000:0000:c670:bdff:fefd:569d"
}
]
}
],
"bandwidth_tests": [
{
"test": "ib_write_bw",
"status": "WARN",
"bandwidth_gbps": 0.13,
"min_required_gbps": 50
},
{
"test": "ib_read_bw",
"status": "WARN",
"bandwidth_gbps": 0.13,
"min_required_gbps": 50
}
],
"latency_tests": [
{
"test": "ib_write_lat",
"status": "PASS",
"latency_us": 4.1,
"max_allowed_us": 10
},
{
"test": "ib_read_lat",
"status": "WARN",
"latency_us": 16.0,
"max_allowed_us": 10
}
],
"timestamp": "2026-05-22T15:52:03.507540"
},
"training": {
"model": "synthetic_transformer",
"total_params_m": 1470.5,
"num_layers": 6,
"hidden_size": 4096,
"gpu_count": 8,
"dtype": "bfloat16",
"batch_size": 8,
"seq_length": 2048,
"num_steps": 50,
"avg_step_time_ms": 312.3,
"throughput_tokens_per_sec": 52471.0,
"throughput_samples_per_sec": 25.62,
"peak_memory_gb": 27.31,
"final_loss": 0.0041,
"timestamp": "2026-05-22T15:52:32.650522"
}
}


@ -0,0 +1,157 @@
# GPU Test Report
- **Date:** 2026-05-22T15:49:02.368516
- **Host:** aikubeworker0016
- **GPU:** NVIDIA H100 80GB HBM3 x8
- **Driver:** 580.159.03 | **CUDA:** 13.0
## Overall Acceptance Verdict
**Result: FAIL**
Failed or unverified items:
- Compute Throughput: FAIL (worst FP32 52 vs >= 54)
- NCCL: FAIL (no nccl-tests bus BW)
- RDMA: FAIL
- Training: UNVERIFIED (52471 tokens/sec; legacy result lacks explicit acceptance verdict)
Missing required evidence:
- NVLink/NVSwitch
- DCGM
## Summary
| Test | Result |
|------|--------|
| GPU Info | PASS (8 GPUs detected) |
| Health Check | PASS |
| Memory Bandwidth | PASS (108.1%) |
| Compute Throughput | FAIL (worst FP32 52 vs >= 54) |
| NCCL | FAIL (no nccl-tests bus BW) |
| Stress Test | PASS |
| RDMA | FAIL |
| Training | UNVERIFIED (52471 tokens/sec; legacy result lacks explicit acceptance verdict) |
## GPU Information
| GPU | Model | VRAM | Temp | Power | SM Clock |
|-----|-------|------|------|-------|----------|
| 0 | NVIDIA H100 80GB HBM3 | 81559 MB | 21C | 70/700W | 345 MHz |
| 1 | NVIDIA H100 80GB HBM3 | 81559 MB | 21C | 68/700W | 345 MHz |
| 2 | NVIDIA H100 80GB HBM3 | 81559 MB | 22C | 67/700W | 345 MHz |
| 3 | NVIDIA H100 80GB HBM3 | 81559 MB | 21C | 67/700W | 345 MHz |
| 4 | NVIDIA H100 80GB HBM3 | 81559 MB | 21C | 67/700W | 345 MHz |
| 5 | NVIDIA H100 80GB HBM3 | 81559 MB | 23C | 69/700W | 345 MHz |
| 6 | NVIDIA H100 80GB HBM3 | 81559 MB | 21C | 68/700W | 345 MHz |
| 7 | NVIDIA H100 80GB HBM3 | 81559 MB | 21C | 66/700W | 345 MHz |
## Health Check
**Overall: PASS**
| GPU | Temp | Power | ECC | PCIe | Throttle | Status |
|-----|------|-------|-----|------|----------|--------|
| 0 | 21C PASS | 70W PASS | S:0 D:0 | Gen5x16 | PASS | **WARN** |
| 1 | 21C PASS | 67W PASS | S:0 D:0 | Gen5x16 | PASS | **WARN** |
| 2 | 22C PASS | 67W PASS | S:0 D:0 | Gen5x16 | PASS | **WARN** |
| 3 | 21C PASS | 67W PASS | S:0 D:0 | Gen5x16 | PASS | **WARN** |
| 4 | 21C PASS | 67W PASS | S:0 D:0 | Gen5x16 | PASS | **WARN** |
| 5 | 23C PASS | 69W PASS | S:0 D:0 | Gen5x16 | PASS | **WARN** |
| 6 | 21C PASS | 68W PASS | S:0 D:0 | Gen5x16 | PASS | **WARN** |
| 7 | 21C PASS | 66W PASS | S:0 D:0 | Gen5x16 | PASS | **WARN** |
## Memory Bandwidth
Source: nvbandwidth
| Metric | Value | Peak | Efficiency |
|--------|-------|------|------------|
| H2D (PCIe) | 55.5 GB/s | 64 GB/s | 86.7% |
| D2H (PCIe) | 55.3 GB/s | 64 GB/s | 86.4% |
| D2D (NVLink) | 486.5 GB/s | 450 GB/s | 108.1% |
**Verdict: PASS** (D2D efficiency 108.1%)
## Compute Throughput
| DType | Achieved (TFLOPS) | Peak | Threshold | Status |
|-------|-------------------|------|------------|--------|
| FP32 | 51.9 | 67 | >= 54 | FAIL |
| TF32 | 357.0 | 495 | >= 444 | FAIL |
| FP16 | 664.0 | 990 | >= 734 | FAIL |
| BF16 | 700.1 | 990 | >= 745 | FAIL |
| FP8 | 1116.2 | 1979 | >= 1400 | FAIL |
**Verdict: FAIL** (absolute TFLOPS thresholds; worst efficiency 56.4%)
### Compute Per-GPU TFLOPS
| GPU | FP32 | TF32 | FP16 | BF16 | FP8 |
|---|---|---|---|---|---|
| 0 | 51.9 | 357.0 | 664.0 | 700.1 | 1116.2 |
| 1 | 51.9 | 357.0 | 664.0 | 700.1 | 1116.2 |
| 2 | 51.9 | 357.0 | 664.0 | 700.1 | 1116.2 |
| 3 | 51.9 | 357.0 | 664.0 | 700.1 | 1116.2 |
| 4 | 51.9 | 357.0 | 664.0 | 700.1 | 1116.2 |
| 5 | 51.9 | 357.0 | 664.0 | 700.1 | 1116.2 |
| 6 | 51.9 | 357.0 | 664.0 | 700.1 | 1116.2 |
| 7 | 51.9 | 357.0 | 664.0 | 700.1 | 1116.2 |
## NCCL Multi-GPU
Source: torchrun_fallback | GPUs: 8
> Functional NCCL smoke only: nccl-tests bus bandwidth was not measured, so this does not satisfy production acceptance.
| Operation | Bus BW (GB/s) | Threshold | Status |
|-----------|---------------|-----------|--------|
| NCCL version 2.21.5+cuda12.4 | 0.0 | >= 0 | FAIL |
| allreduce | 0.0 | >= 0 | PASS |
| broadcast | 0.0 | >= 0 | PASS |
| allgather | 0.0 | >= 0 | PASS |
| reducescatter | 0.0 | >= 0 | PASS |
| alltoall | 0.0 | >= 0 | PASS |
**Overall: FAIL**
## Stress Test
- **Source:** pytorch
- **Duration:** 60s (requested 60s)
- **Result: PASS**
## RDMA/InfiniBand
> Legacy RDMA result re-evaluated with current PDF acceptance thresholds; old WARN statuses and old 50GB/s/10us limits are not used for verdict.
| Test | Value | Threshold | Status |
|------|-------|-----------|--------|
| ib_write_bw | 0.1 GB/s | >= 47 GB/s | FAIL |
| ib_read_bw | 0.1 GB/s | >= 47 GB/s | FAIL |
| ib_write_lat | 4.10 us | <= 2 us | FAIL |
| ib_read_lat | 16.00 us | <= 3.5 us | FAIL |
- **Failure reasons:**
- ib_write_bw bandwidth 0.13GB/s < 47GB/s
- ib_read_bw bandwidth 0.13GB/s < 47GB/s
- ib_write_lat latency 4.1us > 2us
- ib_read_lat latency 16.0us > 3.5us
**Overall: FAIL**
## Training Simulation
| Metric | Value |
|--------|-------|
| Model | synthetic_transformer |
| Params | 1470.5M |
| Throughput | 52471 tokens/sec |
| Avg Step Time | 312.3 ms |
| Peak Memory | 27.3 GB |
| Final Loss | 0.0041 |
| Step Jitter | N/A% |
| Distributed Mode | N/A |
| Acceptance Gaps | missing passed, step_jitter_pct, distributed_mode, loss_finite |
| Verdict | UNVERIFIED (52471 tokens/sec; legacy result lacks explicit acceptance verdict) |
---
*Generated by GPU Test Suite v0.2.0*
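The Compute Throughput table in this report can be reproduced from the `compute_bench` JSON values. The following is a minimal sketch, not the suite's actual implementation: the function name `evaluate_compute` is hypothetical, and it assumes a strict rule in which any dtype below its absolute TFLOPS threshold is a FAIL (the earlier benchmark-only reports in this commit show WARN for some misses, so the real rule may differ by version).

```python
# Values copied from the compute_bench section of the JSON result above.
achieved = {"fp32": 51.9, "tf32": 357.0, "fp16": 664.0, "bf16": 700.1, "fp8": 1116.2}
peak = {"fp32": 67, "tf32": 495, "fp16": 990, "bf16": 990, "fp8": 1979}
threshold = {"fp32": 54, "tf32": 444, "fp16": 734, "bf16": 745, "fp8": 1400}


def evaluate_compute(achieved, peak, threshold):
    """Hypothetical helper: per-dtype efficiency and strict PASS/FAIL status."""
    rows = {}
    for dtype, tflops in achieved.items():
        rows[dtype] = {
            # Efficiency is achieved TFLOPS as a percentage of theoretical peak.
            "efficiency_pct": round(tflops / peak[dtype] * 100, 1),
            # Assumed strict rule: below the absolute threshold means FAIL.
            "status": "PASS" if tflops >= threshold[dtype] else "FAIL",
        }
    return rows


rows = evaluate_compute(achieved, peak, threshold)
verdict = "PASS" if all(r["status"] == "PASS" for r in rows.values()) else "FAIL"
worst_efficiency = min(r["efficiency_pct"] for r in rows.values())
print(verdict, worst_efficiency)
```

With these inputs every dtype falls below its threshold, and the lowest efficiency is the FP8 figure, which matches the verdict line in the report.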


@ -0,0 +1,65 @@
# GPU Test Report
- **Date:** 2026-05-22T20:26:56.947796
- **Host:** aikubeworker0012
## Overall Acceptance Verdict
**Result: FAIL**
Missing required evidence:
- GPU Info
- Health Check
- Memory Bandwidth
- Compute Throughput
- NVLink/NVSwitch
- NCCL
- Stress Test
- RDMA
- Training
## Summary
| Test | Result |
|------|--------|
| DCGM | PASS |
## DCGM Diagnostic
**Overall: PASS**
| Subtest | Status |
|---------|--------|
| Hardware/nvbandwidth/GPU6 | PASS |
| Hardware/nvbandwidth/GPU7 | PASS |
| Hardware/nvbandwidth/summary | PASS |
| Integration/pcie/GPU0 | PASS |
| Integration/pcie/GPU1 | PASS |
| Integration/pcie/GPU2 | PASS |
| Integration/pcie/GPU3 | PASS |
| Integration/pcie/GPU4 | PASS |
| Integration/pcie/GPU5 | PASS |
| Integration/pcie/GPU6 | PASS |
| Integration/pcie/GPU7 | PASS |
| Integration/pcie/summary | PASS |
| Stress/targeted_stress/GPU0 | PASS |
| Stress/targeted_stress/GPU1 | PASS |
| Stress/targeted_stress/GPU2 | PASS |
| Stress/targeted_stress/GPU3 | PASS |
| Stress/targeted_stress/GPU4 | PASS |
| Stress/targeted_stress/GPU5 | PASS |
| Stress/targeted_stress/GPU6 | PASS |
| Stress/targeted_stress/GPU7 | PASS |
| Stress/targeted_stress/summary | PASS |
| Stress/targeted_power/GPU0 | PASS |
| Stress/targeted_power/GPU1 | PASS |
| Stress/targeted_power/GPU2 | PASS |
| Stress/targeted_power/GPU3 | PASS |
| Stress/targeted_power/GPU4 | PASS |
| Stress/targeted_power/GPU5 | PASS |
| Stress/targeted_power/GPU6 | PASS |
| Stress/targeted_power/GPU7 | PASS |
| Stress/targeted_power/summary | PASS |
---
*Generated by GPU Test Suite v0.2.0*
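This report fails overall even though its only test passed, because the other required items have no evidence. A minimal sketch of that aggregation rule follows. It is an illustration under stated assumptions: `REQUIRED_ITEMS` is inferred from the item names listed in the reports in this commit, and `overall_verdict` is a hypothetical name, not the suite's own function.

```python
# Assumed list of required acceptance items, taken from the names that
# appear in the "Missing required evidence" and "Summary" sections.
REQUIRED_ITEMS = [
    "GPU Info", "Health Check", "Memory Bandwidth", "Compute Throughput",
    "NVLink/NVSwitch", "NCCL", "Stress Test", "RDMA", "DCGM", "Training",
]


def overall_verdict(results):
    """Hypothetical helper: FAIL if anything is missing or not PASS."""
    missing = [item for item in REQUIRED_ITEMS if item not in results]
    failed = [item for item, status in results.items() if status != "PASS"]
    verdict = "PASS" if not missing and not failed else "FAIL"
    return verdict, failed, missing


# A DCGM-only run, as in the report above.
verdict, failed, missing = overall_verdict({"DCGM": "PASS"})
print(verdict, failed, missing)
```

Treating missing evidence as a failure keeps a partial run from being mistaken for a completed acceptance.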


@ -0,0 +1,65 @@
# GPU Test Report
- **Date:** 2026-05-22T20:28:58.716266
- **Host:** aikubeworker0016
## Overall Acceptance Verdict
**Result: FAIL**
Missing required evidence:
- GPU Info
- Health Check
- Memory Bandwidth
- Compute Throughput
- NVLink/NVSwitch
- NCCL
- Stress Test
- RDMA
- Training
## Summary
| Test | Result |
|------|--------|
| DCGM | PASS |
## DCGM Diagnostic
**Overall: PASS**
| Subtest | Status |
|---------|--------|
| Hardware/nvbandwidth/GPU6 | PASS |
| Hardware/nvbandwidth/GPU7 | PASS |
| Hardware/nvbandwidth/summary | PASS |
| Integration/pcie/GPU0 | PASS |
| Integration/pcie/GPU1 | PASS |
| Integration/pcie/GPU2 | PASS |
| Integration/pcie/GPU3 | PASS |
| Integration/pcie/GPU4 | PASS |
| Integration/pcie/GPU5 | PASS |
| Integration/pcie/GPU6 | PASS |
| Integration/pcie/GPU7 | PASS |
| Integration/pcie/summary | PASS |
| Stress/targeted_stress/GPU0 | PASS |
| Stress/targeted_stress/GPU1 | PASS |
| Stress/targeted_stress/GPU2 | PASS |
| Stress/targeted_stress/GPU3 | PASS |
| Stress/targeted_stress/GPU4 | PASS |
| Stress/targeted_stress/GPU5 | PASS |
| Stress/targeted_stress/GPU6 | PASS |
| Stress/targeted_stress/GPU7 | PASS |
| Stress/targeted_stress/summary | PASS |
| Stress/targeted_power/GPU0 | PASS |
| Stress/targeted_power/GPU1 | PASS |
| Stress/targeted_power/GPU2 | PASS |
| Stress/targeted_power/GPU3 | PASS |
| Stress/targeted_power/GPU4 | PASS |
| Stress/targeted_power/GPU5 | PASS |
| Stress/targeted_power/GPU6 | PASS |
| Stress/targeted_power/GPU7 | PASS |
| Stress/targeted_power/summary | PASS |
---
*Generated by GPU Test Suite v0.2.0*


@ -0,0 +1,70 @@
{
"benchmark": {
"memory": {
"source": "nvbandwidth",
"h2d_bandwidth_gbps": 55.5,
"d2h_bandwidth_gbps": 54.8,
"d2d_bandwidth_gbps": 0.0,
"h2d_peak_gbps": 64,
"d2h_peak_gbps": 64,
"d2d_peak_gbps": 450.0,
"h2d_efficiency_pct": 86.7,
"d2h_efficiency_pct": 85.6,
"d2d_efficiency_pct": null,
"peak_bandwidth_gbps": 3400,
"efficiency_pct": null,
"results_by_test": {
"h2d": 55.5,
"d2h": 54.8,
"d2d_write": 0.0,
"d2d_read": 0.0,
"d2d_bidir": 0.0
},
"per_gpu": []
},
"compute": {
"per_dtype_tflops": {
"fp32": 52.2,
"tf32": 360.7,
"fp16": 680.0,
"bf16": 707.6,
"fp8": 1142.4
},
"peak_tflops": {
"fp32": 67,
"tf32": 495,
"fp16": 990,
"bf16": 990,
"fp8": 1979
},
"efficiency_pct": {
"fp32": 77.9,
"tf32": 72.9,
"fp16": 68.7,
"bf16": 71.5,
"fp8": 57.7
},
"pass_thresholds_tflops": {
"fp32": 54,
"tf32": 444,
"fp16": 734,
"bf16": 745,
"fp8": 1400
},
"per_gpu": [
{
"index": 0,
"fp32": 52.2,
"tf32": 360.7,
"fp16": 680.0,
"bf16": 707.6,
"fp8": 1142.4
}
],
"matrix_size": 8192,
"warmup": 50,
"iterations": 500
}
},
"timestamp": "2026-05-22T15:35:16.675924"
}


@ -0,0 +1,38 @@
# GPU Test Report
- **Date:** 2026-05-22 15:37:12
- **Host:** aikubeworker0012
## Summary
| Test | Result |
|------|--------|
| Memory Bandwidth | FAIL (0.0%) |
| Compute Throughput | FAIL (worst TF32 361 vs >= 444) |
## Memory Bandwidth
Source: nvbandwidth
| Metric | Value | Peak | Efficiency |
|--------|-------|------|------------|
| H2D (PCIe) | 55.5 GB/s | 64 GB/s | 86.7% |
| D2H (PCIe) | 54.8 GB/s | 64 GB/s | 85.6% |
| D2D (NVLink) | 0.0 GB/s | 450 GB/s | 0.0% |
**Verdict: FAIL** (D2D efficiency 0.0%)
## Compute Throughput
| DType | Achieved (TFLOPS) | Peak | Threshold | Status |
|-------|-------------------|------|------------|--------|
| FP32 | 52.2 | 67 | >= 54 | WARN |
| TF32 | 360.7 | 495 | >= 444 | FAIL |
| FP16 | 680.0 | 990 | >= 734 | WARN |
| BF16 | 707.6 | 990 | >= 745 | WARN |
| FP8 | 1142.4 | 1979 | >= 1400 | FAIL |
**Verdict: FAIL** (absolute TFLOPS thresholds; worst efficiency 57.7%)
---
*Generated by GPU Test Suite v0.2.0*
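In the JSON behind this report, `d2d_bandwidth_gbps` is `0.0` and `d2d_efficiency_pct` is `null`, while the Markdown shows `0.0%`. A minimal sketch of how the two cases could be kept apart is below. The helper names `efficiency_pct` and `format_efficiency` are hypothetical, and the assumption that a zero reading means "not measured" rather than a real result is mine, not something the source states.

```python
def efficiency_pct(measured_gbps, peak_gbps):
    """Return measured/peak as a percentage, or None when there is no usable reading."""
    # Assumption: 0.0 or None means the benchmark produced no D2D number.
    if not measured_gbps or not peak_gbps:
        return None
    return round(measured_gbps / peak_gbps * 100, 1)


def format_efficiency(value):
    """Render None explicitly instead of collapsing it to 0.0%."""
    return "N/A (not measured)" if value is None else f"{value}%"


h2d = efficiency_pct(55.5, 64)    # H2D value and peak from the report
d2d = efficiency_pct(0.0, 450.0)  # D2D value and peak from the report
print(format_efficiency(h2d), format_efficiency(d2d))
```

The verdict would still be FAIL in either rendering; the difference is only whether the reader sees a measured zero or an absent measurement.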


@ -0,0 +1,70 @@
{
"benchmark": {
"memory": {
"source": "nvbandwidth",
"h2d_bandwidth_gbps": 55.5,
"d2h_bandwidth_gbps": 55.0,
"d2d_bandwidth_gbps": 0.0,
"h2d_peak_gbps": 64,
"d2h_peak_gbps": 64,
"d2d_peak_gbps": 450.0,
"h2d_efficiency_pct": 86.7,
"d2h_efficiency_pct": 85.9,
"d2d_efficiency_pct": null,
"peak_bandwidth_gbps": 3400,
"efficiency_pct": null,
"results_by_test": {
"h2d": 55.5,
"d2h": 55.0,
"d2d_write": 0.0,
"d2d_read": 0.0,
"d2d_bidir": 0.0
},
"per_gpu": []
},
"compute": {
"per_dtype_tflops": {
"fp32": 52.2,
"tf32": 357.5,
"fp16": 665.3,
"bf16": 697.1,
"fp8": 1138.8
},
"peak_tflops": {
"fp32": 67,
"tf32": 495,
"fp16": 990,
"bf16": 990,
"fp8": 1979
},
"efficiency_pct": {
"fp32": 77.9,
"tf32": 72.2,
"fp16": 67.2,
"bf16": 70.4,
"fp8": 57.5
},
"pass_thresholds_tflops": {
"fp32": 54,
"tf32": 444,
"fp16": 734,
"bf16": 745,
"fp8": 1400
},
"per_gpu": [
{
"index": 0,
"fp32": 52.2,
"tf32": 357.5,
"fp16": 665.3,
"bf16": 697.1,
"fp8": 1138.8
}
],
"matrix_size": 8192,
"warmup": 50,
"iterations": 500
}
},
"timestamp": "2026-05-22T15:35:19.219299"
}


@ -0,0 +1,38 @@
# GPU Test Report
- **Date:** 2026-05-22 15:37:18
- **Host:** aikubeworker0016
## Summary
| Test | Result |
|------|--------|
| Memory Bandwidth | FAIL (0.0%) |
| Compute Throughput | FAIL (worst TF32 358 vs >= 444) |
## Memory Bandwidth
Source: nvbandwidth
| Metric | Value | Peak | Efficiency |
|--------|-------|------|------------|
| H2D (PCIe) | 55.5 GB/s | 64 GB/s | 86.7% |
| D2H (PCIe) | 55.0 GB/s | 64 GB/s | 85.9% |
| D2D (NVLink) | 0.0 GB/s | 450 GB/s | 0.0% |
**Verdict: FAIL** (D2D efficiency 0.0%)
## Compute Throughput
| DType | Achieved (TFLOPS) | Peak | Threshold | Status |
|-------|-------------------|------|------------|--------|
| FP32 | 52.2 | 67 | >= 54 | WARN |
| TF32 | 357.5 | 495 | >= 444 | FAIL |
| FP16 | 665.3 | 990 | >= 734 | WARN |
| BF16 | 697.1 | 990 | >= 745 | WARN |
| FP8 | 1138.8 | 1979 | >= 1400 | FAIL |
**Verdict: FAIL** (absolute TFLOPS thresholds; worst efficiency 57.5%)
---
*Generated by GPU Test Suite v0.2.0*


@ -0,0 +1,157 @@
{
"rdma": {
"passed": false,
"devices": [
{
"name": "mlx5_0",
"ports": [
{
"port": "1",
"rate": "400 Gb/sec (4X NDR)",
"state": "4: ACTIVE",
"phys_state": "5: LinkUp",
"gid": "fe80:0000:0000:0000:58a2:e103:0093:3898"
}
]
},
{
"name": "mlx5_1",
"ports": [
{
"port": "1",
"rate": "400 Gb/sec (4X NDR)",
"state": "4: ACTIVE",
"phys_state": "5: LinkUp",
"gid": "fe80:0000:0000:0000:58a2:e103:0093:3db0"
}
]
},
{
"name": "mlx5_2",
"ports": [
{
"port": "1",
"rate": "25 Gb/sec (1X EDR)",
"state": "4: ACTIVE",
"phys_state": "5: LinkUp",
"gid": "fe80:0000:0000:0000:5c3f:b8ff:fe5e:7832"
}
]
},
{
"name": "mlx5_3",
"ports": [
{
"port": "1",
"rate": "25 Gb/sec (1X EDR)",
"state": "1: DOWN",
"phys_state": "3: Disabled",
"gid": "fe80:0000:0000:0000:5e25:73ff:fe4e:eac1"
}
]
},
{
"name": "mlx5_4",
"ports": [
{
"port": "1",
"rate": "100 Gb/sec (2X HDR)",
"state": "4: ACTIVE",
"phys_state": "5: LinkUp",
"gid": "fe80:0000:0000:0000:9c63:c003:005f:63cc"
}
]
},
{
"name": "mlx5_5",
"ports": [
{
"port": "1",
"rate": "100 Gb/sec (2X HDR)",
"state": "4: ACTIVE",
"phys_state": "5: LinkUp",
"gid": "fe80:0000:0000:0000:9c63:c003:005f:63cd"
}
]
},
{
"name": "mlx5_6",
"ports": [
{
"port": "1",
"rate": "400 Gb/sec (4X NDR)",
"state": "4: ACTIVE",
"phys_state": "5: LinkUp",
"gid": "fe80:0000:0000:0000:58a2:e103:0093:3bf4"
}
]
},
{
"name": "mlx5_7",
"ports": [
{
"port": "1",
"rate": "400 Gb/sec (4X NDR)",
"state": "4: ACTIVE",
"phys_state": "5: LinkUp",
"gid": "fe80:0000:0000:0000:58a2:e103:0093:3e28"
}
]
},
{
"name": "mlx5_8",
"ports": [
{
"port": "1",
"rate": "25 Gb/sec (1X EDR)",
"state": "4: ACTIVE",
"phys_state": "5: LinkUp",
"gid": "fe80:0000:0000:0000:5c3f:b8ff:fe5e:7832"
}
]
},
{
"name": "mlx5_9",
"ports": [
{
"port": "1",
"rate": "25 Gb/sec (1X EDR)",
"state": "1: DOWN",
"phys_state": "3: Disabled",
"gid": "fe80:0000:0000:0000:5e25:73ff:fe63:1717"
}
]
}
],
"bandwidth_tests": [
{
"test": "ib_write_bw",
"status": "WARN",
"bandwidth_gbps": 0.13,
"min_required_gbps": 50
},
{
"test": "ib_read_bw",
"status": "WARN",
"bandwidth_gbps": 0.13,
"min_required_gbps": 50
}
],
"latency_tests": [
{
"test": "ib_write_lat",
"status": "PASS",
"latency_us": 4.53,
"max_allowed_us": 10
},
{
"test": "ib_read_lat",
"status": "WARN",
"latency_us": 16.0,
"max_allowed_us": 10
}
],
"timestamp": "2026-05-22T15:41:20.534115"
},
"timestamp": "2026-05-22T15:41:20.544589"
}


@ -0,0 +1,157 @@
{
"rdma": {
"passed": false,
"devices": [
{
"name": "mlx5_0",
"ports": [
{
"port": "1",
"rate": "400 Gb/sec (4X NDR)",
"state": "4: ACTIVE",
"phys_state": "5: LinkUp",
"gid": "fe80:0000:0000:0000:58a2:e103:0088:81e0"
}
]
},
{
"name": "mlx5_1",
"ports": [
{
"port": "1",
"rate": "400 Gb/sec (4X NDR)",
"state": "4: ACTIVE",
"phys_state": "5: LinkUp",
"gid": "fe80:0000:0000:0000:9c63:c003:0054:e00a"
}
]
},
{
"name": "mlx5_2",
"ports": [
{
"port": "1",
"rate": "25 Gb/sec (1X EDR)",
"state": "4: ACTIVE",
"phys_state": "5: LinkUp",
"gid": "fe80:0000:0000:0000:a02d:75ff:feae:2bcf"
}
]
},
{
"name": "mlx5_3",
"ports": [
{
"port": "1",
"rate": "25 Gb/sec (1X EDR)",
"state": "1: DOWN",
"phys_state": "3: Disabled",
"gid": "fe80:0000:0000:0000:c670:bdff:fefd:5bd9"
}
]
},
{
"name": "mlx5_4",
"ports": [
{
"port": "1",
"rate": "100 Gb/sec (2X HDR)",
"state": "4: ACTIVE",
"phys_state": "5: LinkUp",
"gid": "fe80:0000:0000:0000:9c63:c003:005f:58ec"
}
]
},
{
"name": "mlx5_5",
"ports": [
{
"port": "1",
"rate": "100 Gb/sec (2X HDR)",
"state": "4: ACTIVE",
"phys_state": "5: LinkUp",
"gid": "fe80:0000:0000:0000:9c63:c003:005f:58ed"
}
]
},
{
"name": "mlx5_6",
"ports": [
{
"port": "1",
"rate": "400 Gb/sec (4X NDR)",
"state": "4: ACTIVE",
"phys_state": "5: LinkUp",
"gid": "fe80:0000:0000:0000:9c63:c003:0055:0e56"
}
]
},
{
"name": "mlx5_7",
"ports": [
{
"port": "1",
"rate": "400 Gb/sec (4X NDR)",
"state": "4: ACTIVE",
"phys_state": "5: LinkUp",
"gid": "fe80:0000:0000:0000:a088:c203:00f0:286c"
}
]
},
{
"name": "mlx5_8",
"ports": [
{
"port": "1",
"rate": "25 Gb/sec (1X EDR)",
"state": "4: ACTIVE",
"phys_state": "5: LinkUp",
"gid": "fe80:0000:0000:0000:a02d:75ff:feae:2bcf"
}
]
},
{
"name": "mlx5_9",
"ports": [
{
"port": "1",
"rate": "25 Gb/sec (1X EDR)",
"state": "1: DOWN",
"phys_state": "3: Disabled",
"gid": "fe80:0000:0000:0000:c670:bdff:fefd:569d"
}
]
}
],
"bandwidth_tests": [
{
"test": "ib_write_bw",
"status": "WARN",
"bandwidth_gbps": 0.13,
"min_required_gbps": 50
},
{
"test": "ib_read_bw",
"status": "WARN",
"bandwidth_gbps": 0.13,
"min_required_gbps": 50
}
],
"latency_tests": [
{
"test": "ib_write_lat",
"status": "PASS",
"latency_us": 4.22,
"max_allowed_us": 10
},
{
"test": "ib_read_lat",
"status": "WARN",
"latency_us": 16.0,
"max_allowed_us": 10
}
],
"timestamp": "2026-05-22T15:41:07.851101"
},
"timestamp": "2026-05-22T15:41:07.861558"
}

View File

@ -0,0 +1,62 @@
# GPU Test Report
- **Date:** 2026-05-22T19:48:26.622179
- **Host:** aikubeworker0012
## Overall Acceptance Verdict
**Result: FAIL**
Failed or unverified items:
- RDMA: FAIL
Missing required evidence:
- GPU Info
- Health Check
- Memory Bandwidth
- Compute Throughput
- NVLink/NVSwitch
- NCCL
- Stress Test
- DCGM
- Training
## Summary
| Test | Result |
|------|--------|
| RDMA | FAIL |
## RDMA/InfiniBand
### RDMA Port Checks
| Device | Port | State | Rate | Required | Status |
|--------|------|-------|------|----------|--------|
| mlx5_0 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
| mlx5_1 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
| mlx5_4 | 1 | 4: ACTIVE | 100 Gb/sec (2X HDR) | >= 400Gbps ACTIVE | FAIL |
| mlx5_5 | 1 | 4: ACTIVE | 100 Gb/sec (2X HDR) | >= 400Gbps ACTIVE | FAIL |
| mlx5_6 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
| mlx5_7 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
| Test | Value | Threshold | Status |
|------|-------|-----------|--------|
| ib_write_bw | 49.3 GB/s | >= 47 GB/s | PASS |
| ib_read_bw | 39.2 GB/s | >= 47 GB/s | FAIL |
| ib_write_lat | 4.49 us | <= 2 us | FAIL |
| ib_read_lat | 16.00 us | <= 3.5 us | FAIL |
| ibping | target=0x58 count=5 | 0% packet loss | PASS |
- **PFC/ECN/CNP/congestion counters checked:** 146
- **PFC/ECN/CNP/congestion non-zero:** no
- **Failure reasons:**
- mlx5_4 port 1 state/rate failed (4: ACTIVE, 100 Gb/sec (2X HDR); required >= 400.0Gbps ACTIVE)
- mlx5_5 port 1 state/rate failed (4: ACTIVE, 100 Gb/sec (2X HDR); required >= 400.0Gbps ACTIVE)
- ib_read_bw bandwidth 39.21GB/s < 47GB/s
- ib_write_lat latency 4.49us > 2.0us
- ib_read_lat latency 16.0us > 3.5us
**Overall: FAIL**
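The port table above can be reproduced with a small check. This is a minimal sketch, not the logic in `modules/rdma_test.py`: the helper names `parse_rate_gbps` and `check_port` are hypothetical, and the 400 Gb/s requirement is taken from the `Required` column. The report lists only six of the ten devices; how the 25 Gb/sec and down ports are excluded is not modeled here.

```python
import re


def parse_rate_gbps(rate: str) -> float:
    """Extract the numeric value from a string such as '400 Gb/sec (4X NDR)'."""
    match = re.match(r"\s*([\d.]+)\s*Gb/sec", rate)
    if match is None:
        raise ValueError(f"unrecognized rate string: {rate!r}")
    return float(match.group(1))


def check_port(state: str, rate: str, min_gbps: float = 400.0) -> str:
    """Return PASS only when the port is ACTIVE and meets the required rate."""
    # State strings look like '4: ACTIVE'; keep the part after the colon.
    active = state.split(":", 1)[-1].strip() == "ACTIVE"
    return "PASS" if active and parse_rate_gbps(rate) >= min_gbps else "FAIL"


# Two rows copied from the port table above.
ports = {
    "mlx5_0": ("4: ACTIVE", "400 Gb/sec (4X NDR)"),
    "mlx5_4": ("4: ACTIVE", "100 Gb/sec (2X HDR)"),
}
results = {name: check_port(state, rate) for name, (state, rate) in ports.items()}
print(results)
```

Under this check the two 100 Gb/sec (2X HDR) ports fail on rate alone, even though both are ACTIVE.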
---
*Generated by GPU Test Suite v0.2.0*

View File

@ -0,0 +1,62 @@
# GPU Test Report
- **Date:** 2026-05-22T19:48:45.899570
- **Host:** aikubeworker0016
## Overall Acceptance Verdict
**Result: FAIL**
Failed or unverified items:
- RDMA: FAIL
Missing required evidence:
- GPU Info
- Health Check
- Memory Bandwidth
- Compute Throughput
- NVLink/NVSwitch
- NCCL
- Stress Test
- DCGM
- Training
## Summary
| Test | Result |
|------|--------|
| RDMA | FAIL |
## RDMA/InfiniBand
### RDMA Port Checks
| Device | Port | State | Rate | Required | Status |
|--------|------|-------|------|----------|--------|
| mlx5_0 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
| mlx5_1 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
| mlx5_4 | 1 | 4: ACTIVE | 100 Gb/sec (2X HDR) | >= 400Gbps ACTIVE | FAIL |
| mlx5_5 | 1 | 4: ACTIVE | 100 Gb/sec (2X HDR) | >= 400Gbps ACTIVE | FAIL |
| mlx5_6 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
| mlx5_7 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
| Test | Value | Threshold | Status |
|------|-------|-----------|--------|
| ib_write_bw | 48.1 GB/s | >= 47 GB/s | PASS |
| ib_read_bw | 40.3 GB/s | >= 47 GB/s | FAIL |
| ib_write_lat | 4.28 us | <= 2 us | FAIL |
| ib_read_lat | 16.00 us | <= 3.5 us | FAIL |
| ibping | target=0x4b count=5 | 0% packet loss | PASS |
- **PFC/ECN/CNP/congestion counters checked:** 146
- **PFC/ECN/CNP/congestion non-zero:** no
- **Failure reasons:**
- mlx5_4 port 1 state/rate failed (4: ACTIVE, 100 Gb/sec (2X HDR); required >= 400.0Gbps ACTIVE)
- mlx5_5 port 1 state/rate failed (4: ACTIVE, 100 Gb/sec (2X HDR); required >= 400.0Gbps ACTIVE)
- ib_read_bw bandwidth 40.3GB/s < 47GB/s
- ib_write_lat latency 4.28us > 2.0us
- ib_read_lat latency 16.0us > 3.5us
**Overall: FAIL**
---
*Generated by GPU Test Suite v0.2.0*

View File

@ -0,0 +1,50 @@
# RDMA Cross-node Evidence Report
- **Date:** 2026-05-23 Asia/Shanghai
- **Scope:** `aikubeworker0012` <-> `aikubeworker0016`, single rail `mlx5_0`, port 1
- **Client/server bootstrap IPs:** `172.72.8.12` and `172.72.8.16`
- **Bandwidth message size:** 4MB
- **Latency message size:** 8B
- **Iterations:** 1000
## Port Evidence
| Host | Device | State | Rate | Link | LID |
|---|---|---|---|---|---|
| aikubeworker0012 | mlx5_0/1 | ACTIVE | 400 Gb/sec (4X NDR) | InfiniBand | 0x58 |
| aikubeworker0016 | mlx5_0/1 | ACTIVE | 400 Gb/sec (4X NDR) | InfiniBand | 0x4b |
## Cross-node Perftest Results
| Direction | Test | Value | PDF Threshold | Status |
|---|---|---:|---:|---|
| 0016 -> 0012 | ib_write_bw | 49.35 GB/s | >= 47 GB/s | PASS |
| 0016 -> 0012 | ib_read_bw | 44.36 GB/s | >= 47 GB/s | FAIL |
| 0016 -> 0012 | ib_write_lat avg | 2.17 us | <= 2.0 us | FAIL |
| 0016 -> 0012 | ib_read_lat avg | 4.05 us | <= 3.5 us | FAIL |
| 0012 -> 0016 | ib_write_bw | 48.38 GB/s | >= 47 GB/s | PASS |
| 0012 -> 0016 | ib_read_bw | 44.37 GB/s | >= 47 GB/s | FAIL |
| 0012 -> 0016 | ib_write_lat avg | 2.13 us | <= 2.0 us | FAIL |
| 0012 -> 0016 | ib_read_lat avg | 4.08 us | <= 3.5 us | FAIL |
## Bidirectional ibping
| Direction | Target LID | Result |
|---|---|---|
| 0016 -> 0012 | 0x58 | 5 transmitted, 5 received, 0% packet loss; avg 0.005 ms |
| 0012 -> 0016 | 0x4b | 5 transmitted, 5 received, 0% packet loss; avg 0.005 ms |
## Fabric Counters
| Host | PFC/ECN/CNP/congestion Counters Checked | Non-zero Counters | Status |
|---|---:|---:|---|
| aikubeworker0012 | 146 | 0 | PASS |
| aikubeworker0016 | 146 | 0 | PASS |
## Verdict
**RDMA cross-node verdict: FAIL**
Reason: bidirectional connectivity is good, PFC/ECN/CNP/congestion counters are clean, and write bandwidth passes in both directions. However, read bandwidth is below 47 GB/s in both directions (44.36 and 44.37 GB/s), write latency is slightly above 2.0 us (2.17 and 2.13 us), and read latency is above 3.5 us (4.05 and 4.08 us).
Note: `modules/rdma_test.py` was corrected on 2026-05-23 to parse the `t_avg[usec]` column of `ib_write_lat` / `ib_read_lat` output rather than the 99.9th-percentile column. Older reports that show `ib_read_lat` around 16 us were produced before this fix and do not reflect the current parser output.
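Selecting the latency column by header name rather than by position avoids picking up a percentile column by mistake. The following is a minimal sketch under stated assumptions, not the code in `modules/rdma_test.py`: `parse_latency` and `COLUMN_RE` are hypothetical names, and `SAMPLE` is synthetic text laid out like perftest latency output, with placeholder numbers that are not measurements from these hosts.

```python
import re

# Synthetic sample shaped like perftest latency output (placeholder values).
SAMPLE = """\
 #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]    t_avg[usec]    t_stdev[usec]   99% percentile[usec]   99.9% percentile[usec]
 8       1000          1.00           9.00         2.00               2.50           0.50            4.00                   8.00
"""

# Column names: '#bytes', 't_avg[usec]', or 'NN% percentile[usec]' (contains a space).
COLUMN_RE = re.compile(r"#\w+|(?:[\d.]+% percentile|\w+)\[usec\]")


def parse_latency(output: str, column: str = "t_avg[usec]") -> float:
    """Return the value under `column` from the first result row after the header."""
    lines = output.splitlines()
    for i, line in enumerate(lines):
        names = COLUMN_RE.findall(line)
        if column not in names:
            continue
        for row in lines[i + 1:]:
            fields = row.split()
            # A result row has exactly one field per header column.
            if len(fields) == len(names):
                return float(fields[names.index(column)])
    raise ValueError(f"column {column!r} not found in output")


print(parse_latency(SAMPLE))
print(parse_latency(SAMPLE, "99.9% percentile[usec]"))
```

On this sample the average and the 99.9th-percentile values differ by more than a factor of three, which is the kind of gap that separates the older 16 us figures from the roughly 4 us averages reported above.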

View File

@ -0,0 +1,73 @@
# Single-node RDMA/IB Report
Generated: 2026-05-22 23:41 Asia/Shanghai
Scope: project CLI `gpu_tester.py --test rdma --report --format json`, run separately on each host.
Important note: the current repository RDMA test is single-node only. In `modules/rdma_test.py`, the perftest client connects to `localhost`, so this report validates local IB device discovery and local perftest behavior. It does not validate cross-node RDMA bandwidth between `aikubeworker0012` and `aikubeworker0016`.
## Summary
| Host | Devices Found | Active 400G Ports | Active 100G Ports | Down Ports | Overall |
| --- | ---: | --- | --- | --- | --- |
| aikubeworker0012 / 172.72.8.12 | 10 | mlx5_0, mlx5_1, mlx5_6, mlx5_7 | mlx5_4, mlx5_5 | mlx5_3, mlx5_9 | WARN |
| aikubeworker0016 / 172.72.8.16 | 10 | mlx5_0, mlx5_1, mlx5_6, mlx5_7 | mlx5_4, mlx5_5 | mlx5_3, mlx5_9 | WARN |
## Bandwidth
The bandwidth numbers below are from the repo's local `localhost` RDMA perftest path.
| Host | ib_write_bw | Threshold | Status | ib_read_bw | Threshold | Status |
| --- | ---: | ---: | --- | ---: | ---: | --- |
| aikubeworker0012 | 0.13 GB/s | 50 GB/s | WARN | 0.13 GB/s | 50 GB/s | WARN |
| aikubeworker0016 | 0.13 GB/s | 50 GB/s | WARN | 0.13 GB/s | 50 GB/s | WARN |
## Latency
| Host | ib_write_lat | Limit | Status | ib_read_lat | Limit | Status |
| --- | ---: | ---: | --- | ---: | ---: | --- |
| aikubeworker0012 | 4.53 us | 10 us | PASS | 16.00 us | 10 us | WARN |
| aikubeworker0016 | 4.22 us | 10 us | PASS | 16.00 us | 10 us | WARN |
## Device Inventory
### aikubeworker0012
| Device | Port | State | Physical State | Rate |
| --- | --- | --- | --- | --- |
| mlx5_0 | 1 | ACTIVE | LinkUp | 400 Gb/sec (4X NDR) |
| mlx5_1 | 1 | ACTIVE | LinkUp | 400 Gb/sec (4X NDR) |
| mlx5_2 | 1 | ACTIVE | LinkUp | 25 Gb/sec (1X EDR) |
| mlx5_3 | 1 | DOWN | Disabled | 25 Gb/sec (1X EDR) |
| mlx5_4 | 1 | ACTIVE | LinkUp | 100 Gb/sec (2X HDR) |
| mlx5_5 | 1 | ACTIVE | LinkUp | 100 Gb/sec (2X HDR) |
| mlx5_6 | 1 | ACTIVE | LinkUp | 400 Gb/sec (4X NDR) |
| mlx5_7 | 1 | ACTIVE | LinkUp | 400 Gb/sec (4X NDR) |
| mlx5_8 | 1 | ACTIVE | LinkUp | 25 Gb/sec (1X EDR) |
| mlx5_9 | 1 | DOWN | Disabled | 25 Gb/sec (1X EDR) |
### aikubeworker0016
| Device | Port | State | Physical State | Rate |
| --- | --- | --- | --- | --- |
| mlx5_0 | 1 | ACTIVE | LinkUp | 400 Gb/sec (4X NDR) |
| mlx5_1 | 1 | ACTIVE | LinkUp | 400 Gb/sec (4X NDR) |
| mlx5_2 | 1 | ACTIVE | LinkUp | 25 Gb/sec (1X EDR) |
| mlx5_3 | 1 | DOWN | Disabled | 25 Gb/sec (1X EDR) |
| mlx5_4 | 1 | ACTIVE | LinkUp | 100 Gb/sec (2X HDR) |
| mlx5_5 | 1 | ACTIVE | LinkUp | 100 Gb/sec (2X HDR) |
| mlx5_6 | 1 | ACTIVE | LinkUp | 400 Gb/sec (4X NDR) |
| mlx5_7 | 1 | ACTIVE | LinkUp | 400 Gb/sec (4X NDR) |
| mlx5_8 | 1 | ACTIVE | LinkUp | 25 Gb/sec (1X EDR) |
| mlx5_9 | 1 | DOWN | Disabled | 25 Gb/sec (1X EDR) |
## Files
Raw JSON:
- `reports_rdma_aikubeworker0012.json`
- `reports_rdma_aikubeworker0016.json`
Markdown summary:
- `reports_rdma_single_node_summary.md`

View File

@ -0,0 +1,292 @@
{
"timestamp": "2026-05-22T15:26:26.973586",
"gpu_info": {
"driver_version": "580.159.03",
"cuda_version": "13.0",
"gpu_count": 8,
"gpus": [
{
"index": 0,
"name": "NVIDIA H100 80GB HBM3",
"uuid": "GPU-7658c03c-7659-9886-041e-545c21d53e12",
"pci_bus_id": "00000000:18:00.0",
"pcie_link_gen": 5,
"pcie_link_width": 16,
"vram_total_mb": 81559,
"vram_used_mb": 4,
"vram_free_mb": 81076,
"power_draw": 69.72,
"power_limit": 700.0,
"clock_sm": 345,
"clock_mem": 2619,
"temperature": 25,
"fan_speed": 0,
"persistence_mode": false,
"compute_mode": "Default",
"serial_number": "1654923030411",
"ecc_errors_single": 0,
"ecc_errors_double": 0
},
{
"index": 1,
"name": "NVIDIA H100 80GB HBM3",
"uuid": "GPU-6392d40b-893b-9fc2-4284-a3f1d8c4d7f1",
"pci_bus_id": "00000000:2A:00.0",
"pcie_link_gen": 5,
"pcie_link_width": 16,
"vram_total_mb": 81559,
"vram_used_mb": 0,
"vram_free_mb": 81079,
"power_draw": 73.17,
"power_limit": 700.0,
"clock_sm": 345,
"clock_mem": 2619,
"temperature": 25,
"fan_speed": 0,
"persistence_mode": false,
"compute_mode": "Default",
"serial_number": "1654724063165",
"ecc_errors_single": 0,
"ecc_errors_double": 0
},
{
"index": 2,
"name": "NVIDIA H100 80GB HBM3",
"uuid": "GPU-2ae38735-10de-fb0b-fb20-9d1b5b434558",
"pci_bus_id": "00000000:3A:00.0",
"pcie_link_gen": 5,
"pcie_link_width": 16,
"vram_total_mb": 81559,
"vram_used_mb": 0,
"vram_free_mb": 81079,
"power_draw": 68.71,
"power_limit": 700.0,
"clock_sm": 345,
"clock_mem": 2619,
"temperature": 26,
"fan_speed": 0,
"persistence_mode": false,
"compute_mode": "Default",
"serial_number": "1654823036530",
"ecc_errors_single": 0,
"ecc_errors_double": 0
},
{
"index": 3,
"name": "NVIDIA H100 80GB HBM3",
"uuid": "GPU-ec62123f-0c48-6dbd-49e4-8b231b3fed0e",
"pci_bus_id": "00000000:5D:00.0",
"pcie_link_gen": 5,
"pcie_link_width": 16,
"vram_total_mb": 81559,
"vram_used_mb": 0,
"vram_free_mb": 81079,
"power_draw": 69.73,
"power_limit": 700.0,
"clock_sm": 345,
"clock_mem": 2619,
"temperature": 25,
"fan_speed": 0,
"persistence_mode": false,
"compute_mode": "Default",
"serial_number": "1654923021638",
"ecc_errors_single": 0,
"ecc_errors_double": 0
},
{
"index": 4,
"name": "NVIDIA H100 80GB HBM3",
"uuid": "GPU-b64fc270-109e-1543-fb0c-be7feecf14f1",
"pci_bus_id": "00000000:9A:00.0",
"pcie_link_gen": 5,
"pcie_link_width": 16,
"vram_total_mb": 81559,
"vram_used_mb": 0,
"vram_free_mb": 81079,
"power_draw": 68.84,
"power_limit": 700.0,
"clock_sm": 345,
"clock_mem": 2619,
"temperature": 24,
"fan_speed": 0,
"persistence_mode": false,
"compute_mode": "Default",
"serial_number": "1655023033179",
"ecc_errors_single": 0,
"ecc_errors_double": 0
},
{
"index": 5,
"name": "NVIDIA H100 80GB HBM3",
"uuid": "GPU-15ab7baf-9010-7cf3-5462-eeb09f8dbe65",
"pci_bus_id": "00000000:AB:00.0",
"pcie_link_gen": 5,
"pcie_link_width": 16,
"vram_total_mb": 81559,
"vram_used_mb": 0,
"vram_free_mb": 81079,
"power_draw": 69.94,
"power_limit": 700.0,
"clock_sm": 345,
"clock_mem": 2619,
"temperature": 27,
"fan_speed": 0,
"persistence_mode": false,
"compute_mode": "Default",
"serial_number": "1655023034225",
"ecc_errors_single": 0,
"ecc_errors_double": 0
},
{
"index": 6,
"name": "NVIDIA H100 80GB HBM3",
"uuid": "GPU-225f6f3c-6fef-d1e2-5428-d90f665fb3d3",
"pci_bus_id": "00000000:BA:00.0",
"pcie_link_gen": 5,
"pcie_link_width": 16,
"vram_total_mb": 81559,
"vram_used_mb": 0,
"vram_free_mb": 81079,
"power_draw": 70.46,
"power_limit": 700.0,
"clock_sm": 345,
"clock_mem": 2619,
"temperature": 25,
"fan_speed": 0,
"persistence_mode": false,
"compute_mode": "Default",
"serial_number": "1654923078278",
"ecc_errors_single": 0,
"ecc_errors_double": 0
},
{
"index": 7,
"name": "NVIDIA H100 80GB HBM3",
"uuid": "GPU-79aeb6a8-c00c-6edb-956f-779ef56950a3",
"pci_bus_id": "00000000:DB:00.0",
"pcie_link_gen": 5,
"pcie_link_width": 16,
"vram_total_mb": 81559,
"vram_used_mb": 0,
"vram_free_mb": 81079,
"power_draw": 71.76,
"power_limit": 700.0,
"clock_sm": 345,
"clock_mem": 2619,
"temperature": 24,
"fan_speed": 0,
"persistence_mode": false,
"compute_mode": "Default",
"serial_number": "1654024031464",
"ecc_errors_single": 0,
"ecc_errors_double": 0
}
],
"topology": "\t\u001b[4mGPU0\tGPU1\tGPU2\tGPU3\tGPU4\tGPU5\tGPU6\tGPU7\tNIC0\tNIC1\tNIC2\tNIC3\tNIC4\tNIC5\tNIC6\tNIC7\tNIC8\tNIC9\tCPU Affinity\tNUMA Affinity\tGPU NUMA ID\u001b[0m\nGPU0\t X \tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\tPIX\tNODE\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t0-55,112-167\t0\t\tN/A\nGPU1\tNV18\t X \tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\tNODE\tPIX\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t0-55,112-167\t0\t\tN/A\nGPU2\tNV18\tNV18\t X \tNV18\tNV18\tNV18\tNV18\tNV18\tNODE\tNODE\tPIX\tPIX\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t0-55,112-167\t0\t\tN/A\nGPU3\tNV18\tNV18\tNV18\t X \tNV18\tNV18\tNV18\tNV18\tNODE\tNODE\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t0-55,112-167\t0\t\tN/A\nGPU4\tNV18\tNV18\tNV18\tNV18\t X \tNV18\tNV18\tNV18\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tPIX\tNODE\tNODE\tNODE\t56-111,168-223\t1\t\tN/A\nGPU5\tNV18\tNV18\tNV18\tNV18\tNV18\t X \tNV18\tNV18\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\tPIX\tNODE\tNODE\t56-111,168-223\t1\t\tN/A\nGPU6\tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\t X \tNV18\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tPIX\tPIX\t56-111,168-223\t1\t\tN/A\nGPU7\tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\t X \tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tNODE\tNODE\t56-111,168-223\t1\t\tN/A\nNIC0\tPIX\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t X \tNODE\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC1\tNODE\tPIX\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\tNODE\t X \tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC2\tNODE\tNODE\tPIX\tNODE\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\t X \tPIX\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC3\tNODE\tNODE\tPIX\tNODE\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tPIX\t X \tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC4\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tNODE\tNODE\t X \tPIX\tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC5\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tNODE\tNODE\tPIX\t X \tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC6\tSYS\tSYS\tSYS\tSYS\tPIX\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\t X 
\tNODE\tNODE\tNODE\t\t\t\t\nNIC7\tSYS\tSYS\tSYS\tSYS\tNODE\tPIX\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\t X \tNODE\tNODE\t\t\t\t\nNIC8\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tPIX\tNODE\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\t X \tPIX\t\t\t\t\nNIC9\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tPIX\tNODE\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tPIX\t X \t\t\t\t\n\nLegend:\n\n X = Self\n SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)\n NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node\n PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)\n PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)\n PIX = Connection traversing at most a single PCIe bridge\n NV# = Connection traversing a bonded set of # NVLinks\n\nNIC Legend:\n\n NIC0: mlx5_0\n NIC1: mlx5_1\n NIC2: mlx5_2\n NIC3: mlx5_3\n NIC4: mlx5_4\n NIC5: mlx5_5\n NIC6: mlx5_6\n NIC7: mlx5_7\n NIC8: mlx5_8\n NIC9: mlx5_9\n\n",
"timestamp": "2026-05-22T15:26:34.187409",
"detected_gpu_type": "h100",
"gpu_label": "H100 SXM5"
},
"memory_bench": {
"memory": {
"source": "pytorch",
"h2d_bandwidth_gbps": 11.8,
"d2h_bandwidth_gbps": 9.9,
"d2d_bandwidth_gbps": 829.1,
"peak_bandwidth_gbps": 3400,
"efficiency_pct": 24.4,
"test_sizes_mb": [
1,
4,
16,
64,
256,
1024,
4096
],
"bandwidth_by_size": {
"1": {
"h2d_gbps": 3.8,
"d2h_gbps": 1.4,
"d2d_gbps": 40.6
},
"4": {
"h2d_gbps": 7.6,
"d2h_gbps": 9.9,
"d2d_gbps": 141.5
},
"16": {
"h2d_gbps": 11.0,
"d2h_gbps": 1.9,
"d2d_gbps": 450.3
},
"64": {
"h2d_gbps": 11.8,
"d2h_gbps": 1.4,
"d2d_gbps": 726.5
},
"256": {
"h2d_gbps": 9.0,
"d2h_gbps": 1.4,
"d2d_gbps": 793.8
},
"1024": {
"h2d_gbps": 5.5,
"d2h_gbps": 1.4,
"d2d_gbps": 821.2
},
"4096": {
"h2d_gbps": 5.9,
"d2h_gbps": 1.4,
"d2d_gbps": 829.1
}
},
"per_gpu": []
}
},
"compute_bench": {
"compute": {
"per_dtype_tflops": {
"fp32": 52.0,
"tf32": 362.3,
"fp16": 691.0,
"bf16": 713.0,
"fp8": 1148.8
},
"peak_tflops": {
"fp32": 67,
"tf32": 495,
"fp16": 990,
"bf16": 990,
"fp8": 1979
},
"efficiency_pct": {
"fp32": 77.6,
"tf32": 73.2,
"fp16": 69.8,
"bf16": 72.0,
"fp8": 58.0
},
"pass_thresholds_tflops": {
"fp32": 54,
"tf32": 444,
"fp16": 734,
"bf16": 745,
"fp8": 1400
},
"per_gpu": [
{
"index": 0,
"fp32": 52.0,
"tf32": 362.3,
"fp16": 691.0,
"bf16": 713.0,
"fp8": 1148.8
}
],
"matrix_size": 8192,
"warmup": 50,
"iterations": 500
}
}
}

View File

@ -0,0 +1,54 @@
# GPU Test Report
- **Date:** 2026-05-22 15:27:51
- **Host:** aikubeworker0012
- **GPU:** NVIDIA H100 80GB HBM3 x8
- **Driver:** 580.159.03 | **CUDA:** 13.0
## Summary
| Test | Result |
|------|--------|
| GPU Info | PASS (8 GPUs detected) |
| Memory Bandwidth | WARN (829 GB/s via PyTorch fallback) |
| Compute Throughput | FAIL (worst TF32 362 vs >= 444) |
## GPU Information
| GPU | Model | VRAM | Temp | Power | SM Clock |
|-----|-------|------|------|-------|----------|
| 0 | NVIDIA H100 80GB HBM3 | 81559 MB | 25C | 70/700W | 345 MHz |
| 1 | NVIDIA H100 80GB HBM3 | 81559 MB | 25C | 73/700W | 345 MHz |
| 2 | NVIDIA H100 80GB HBM3 | 81559 MB | 26C | 69/700W | 345 MHz |
| 3 | NVIDIA H100 80GB HBM3 | 81559 MB | 25C | 70/700W | 345 MHz |
| 4 | NVIDIA H100 80GB HBM3 | 81559 MB | 24C | 69/700W | 345 MHz |
| 5 | NVIDIA H100 80GB HBM3 | 81559 MB | 27C | 70/700W | 345 MHz |
| 6 | NVIDIA H100 80GB HBM3 | 81559 MB | 25C | 70/700W | 345 MHz |
| 7 | NVIDIA H100 80GB HBM3 | 81559 MB | 24C | 72/700W | 345 MHz |
## Memory Bandwidth
Source: pytorch
| Metric | Value | Peak | Efficiency |
|--------|-------|------|------------|
| H2D (PCIe) | 11.8 GB/s | N/A | N/A |
| D2H (PCIe) | 9.9 GB/s | N/A | N/A |
| D2D (on-device HBM) | 829.1 GB/s | 3400 GB/s | 24.4% |
**Verdict: WARN** (D2D 829.1 GB/s via PyTorch fallback; nvbandwidth unavailable — figure is indicative only, not a true HBM peak)
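The headline figures in the table above match the per-size maxima in the accompanying JSON (`bandwidth_by_size`), and the efficiency matches the D2D maximum divided by `peak_bandwidth_gbps`. The sketch below reproduces that arithmetic; whether the suite computes the headline this way is an assumption, and the variable names are illustrative.

```python
# Per-size results copied from the JSON report for aikubeworker0012 (sizes in MB).
bandwidth_by_size = {
    "1": {"h2d_gbps": 3.8, "d2h_gbps": 1.4, "d2d_gbps": 40.6},
    "4": {"h2d_gbps": 7.6, "d2h_gbps": 9.9, "d2d_gbps": 141.5},
    "16": {"h2d_gbps": 11.0, "d2h_gbps": 1.9, "d2d_gbps": 450.3},
    "64": {"h2d_gbps": 11.8, "d2h_gbps": 1.4, "d2d_gbps": 726.5},
    "256": {"h2d_gbps": 9.0, "d2h_gbps": 1.4, "d2d_gbps": 793.8},
    "1024": {"h2d_gbps": 5.5, "d2h_gbps": 1.4, "d2d_gbps": 821.2},
    "4096": {"h2d_gbps": 5.9, "d2h_gbps": 1.4, "d2d_gbps": 829.1},
}
peak_bandwidth_gbps = 3400

# Assumed rule: the headline for each direction is the best result over all sizes.
headline = {
    key: max(row[key] for row in bandwidth_by_size.values())
    for key in ("h2d_gbps", "d2h_gbps", "d2d_gbps")
}
efficiency_pct = round(headline["d2d_gbps"] / peak_bandwidth_gbps * 100, 1)

print(headline)
print(efficiency_pct)
```

Note that the D2H headline of 9.9 GB/s comes from the 4 MB size alone; every other size reports between 1.4 and 1.9 GB/s, so the maximum may overstate sustained D2H throughput.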
## Compute Throughput
| DType | Achieved (TFLOPS) | Peak | Threshold | Status |
|-------|-------------------|------|------------|--------|
| FP32 | 52.0 | 67 | >= 54 | WARN |
| TF32 | 362.3 | 495 | >= 444 | FAIL |
| FP16 | 691.0 | 990 | >= 734 | WARN |
| BF16 | 713.0 | 990 | >= 745 | WARN |
| FP8 | 1148.8 | 1979 | >= 1400 | FAIL |
**Verdict: FAIL** (absolute TFLOPS thresholds; worst efficiency 58.0%)
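The efficiency and status columns can be re-derived from the JSON values. The sketch below is illustrative only: efficiency is achieved divided by peak, and the WARN/FAIL split is modeled with a hypothetical `warn_margin` of 0.9 (WARN when within 90% of the threshold). That margin is an assumption that happens to reproduce the statuses in this table; the report does not state the actual rule.

```python
# Values copied from the compute_bench JSON for aikubeworker0012.
achieved = {"fp32": 52.0, "tf32": 362.3, "fp16": 691.0, "bf16": 713.0, "fp8": 1148.8}
peak = {"fp32": 67, "tf32": 495, "fp16": 990, "bf16": 990, "fp8": 1979}
threshold = {"fp32": 54, "tf32": 444, "fp16": 734, "bf16": 745, "fp8": 1400}


def classify(value: float, required: float, warn_margin: float = 0.9) -> str:
    """PASS at or above the threshold; WARN within the assumed margin; else FAIL."""
    if value >= required:
        return "PASS"
    if value >= required * warn_margin:
        return "WARN"
    return "FAIL"


efficiency_pct = {k: round(achieved[k] / peak[k] * 100, 1) for k in achieved}
status = {k: classify(achieved[k], threshold[k]) for k in achieved}

print(efficiency_pct)
print(status)
```

None of the five data types reaches its absolute threshold, so the overall verdict is FAIL regardless of how the WARN band is drawn.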
---
*Generated by GPU Test Suite v0.2.0*

View File

@ -0,0 +1,292 @@
{
"timestamp": "2026-05-22T15:26:29.511252",
"gpu_info": {
"driver_version": "580.159.03",
"cuda_version": "13.0",
"gpu_count": 8,
"gpus": [
{
"index": 0,
"name": "NVIDIA H100 80GB HBM3",
"uuid": "GPU-dfbc9513-255d-4fe7-2b77-7b1ec3972e75",
"pci_bus_id": "00000000:18:00.0",
"pcie_link_gen": 5,
"pcie_link_width": 16,
"vram_total_mb": 81559,
"vram_used_mb": 4,
"vram_free_mb": 81076,
"power_draw": 69.81,
"power_limit": 700.0,
"clock_sm": 345,
"clock_mem": 2619,
"temperature": 20,
"fan_speed": 0,
"persistence_mode": false,
"compute_mode": "Default",
"serial_number": "1651924016120",
"ecc_errors_single": 0,
"ecc_errors_double": 0
},
{
"index": 1,
"name": "NVIDIA H100 80GB HBM3",
"uuid": "GPU-bb845ef7-d7b5-f011-9395-ea74274e2282",
"pci_bus_id": "00000000:2A:00.0",
"pcie_link_gen": 5,
"pcie_link_width": 16,
"vram_total_mb": 81559,
"vram_used_mb": 0,
"vram_free_mb": 81079,
"power_draw": 67.45,
"power_limit": 700.0,
"clock_sm": 345,
"clock_mem": 2619,
"temperature": 20,
"fan_speed": 0,
"persistence_mode": false,
"compute_mode": "Default",
"serial_number": "1651924015483",
"ecc_errors_single": 0,
"ecc_errors_double": 0
},
{
"index": 2,
"name": "NVIDIA H100 80GB HBM3",
"uuid": "GPU-3720cf13-2a34-be38-27be-0a7adc4addc4",
"pci_bus_id": "00000000:3A:00.0",
"pcie_link_gen": 5,
"pcie_link_width": 16,
"vram_total_mb": 81559,
"vram_used_mb": 0,
"vram_free_mb": 81079,
"power_draw": 66.69,
"power_limit": 700.0,
"clock_sm": 345,
"clock_mem": 2619,
"temperature": 21,
"fan_speed": 0,
"persistence_mode": false,
"compute_mode": "Default",
"serial_number": "1651924025595",
"ecc_errors_single": 0,
"ecc_errors_double": 0
},
{
"index": 3,
"name": "NVIDIA H100 80GB HBM3",
"uuid": "GPU-87080b2d-ac43-be0d-d574-c193078850ae",
"pci_bus_id": "00000000:5D:00.0",
"pcie_link_gen": 5,
"pcie_link_width": 16,
"vram_total_mb": 81559,
"vram_used_mb": 0,
"vram_free_mb": 81079,
"power_draw": 66.86,
"power_limit": 700.0,
"clock_sm": 345,
"clock_mem": 2619,
"temperature": 20,
"fan_speed": 0,
"persistence_mode": false,
"compute_mode": "Default",
"serial_number": "1651924016862",
"ecc_errors_single": 0,
"ecc_errors_double": 0
},
{
"index": 4,
"name": "NVIDIA H100 80GB HBM3",
"uuid": "GPU-599bd883-cc5c-a5dd-6c33-c15f7049da48",
"pci_bus_id": "00000000:9A:00.0",
"pcie_link_gen": 5,
"pcie_link_width": 16,
"vram_total_mb": 81559,
"vram_used_mb": 0,
"vram_free_mb": 81079,
"power_draw": 67.07,
"power_limit": 700.0,
"clock_sm": 345,
"clock_mem": 2619,
"temperature": 20,
"fan_speed": 0,
"persistence_mode": false,
"compute_mode": "Default",
"serial_number": "1651924025670",
"ecc_errors_single": 0,
"ecc_errors_double": 0
},
{
"index": 5,
"name": "NVIDIA H100 80GB HBM3",
"uuid": "GPU-a1c6bba4-61b0-e623-06c9-9c88635e26fe",
"pci_bus_id": "00000000:AB:00.0",
"pcie_link_gen": 5,
"pcie_link_width": 16,
"vram_total_mb": 81559,
"vram_used_mb": 0,
"vram_free_mb": 81079,
"power_draw": 69.12,
"power_limit": 700.0,
"clock_sm": 345,
"clock_mem": 2619,
"temperature": 22,
"fan_speed": 0,
"persistence_mode": false,
"compute_mode": "Default",
"serial_number": "1651924027166",
"ecc_errors_single": 0,
"ecc_errors_double": 0
},
{
"index": 6,
"name": "NVIDIA H100 80GB HBM3",
"uuid": "GPU-98745a0c-39bd-3e56-d6ca-54ba3647ab6d",
"pci_bus_id": "00000000:BA:00.0",
"pcie_link_gen": 5,
"pcie_link_width": 16,
"vram_total_mb": 81559,
"vram_used_mb": 0,
"vram_free_mb": 81079,
"power_draw": 67.61,
"power_limit": 700.0,
"clock_sm": 345,
"clock_mem": 2619,
"temperature": 20,
"fan_speed": 0,
"persistence_mode": false,
"compute_mode": "Default",
"serial_number": "1651924026234",
"ecc_errors_single": 0,
"ecc_errors_double": 0
},
{
"index": 7,
"name": "NVIDIA H100 80GB HBM3",
"uuid": "GPU-8c73bd8b-666b-357e-ac5d-c75ac7a759db",
"pci_bus_id": "00000000:DB:00.0",
"pcie_link_gen": 5,
"pcie_link_width": 16,
"vram_total_mb": 81559,
"vram_used_mb": 0,
"vram_free_mb": 81079,
"power_draw": 66.19,
"power_limit": 700.0,
"clock_sm": 345,
"clock_mem": 2619,
"temperature": 20,
"fan_speed": 0,
"persistence_mode": false,
"compute_mode": "Default",
"serial_number": "1651924027255",
"ecc_errors_single": 0,
"ecc_errors_double": 0
}
],
"topology": "\t\u001b[4mGPU0\tGPU1\tGPU2\tGPU3\tGPU4\tGPU5\tGPU6\tGPU7\tNIC0\tNIC1\tNIC2\tNIC3\tNIC4\tNIC5\tNIC6\tNIC7\tNIC8\tNIC9\tCPU Affinity\tNUMA Affinity\tGPU NUMA ID\u001b[0m\nGPU0\t X \tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\tPIX\tNODE\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t0-55,112-167\t0\t\tN/A\nGPU1\tNV18\t X \tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\tNODE\tPIX\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t0-55,112-167\t0\t\tN/A\nGPU2\tNV18\tNV18\t X \tNV18\tNV18\tNV18\tNV18\tNV18\tNODE\tNODE\tPIX\tPIX\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t0-55,112-167\t0\t\tN/A\nGPU3\tNV18\tNV18\tNV18\t X \tNV18\tNV18\tNV18\tNV18\tNODE\tNODE\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t0-55,112-167\t0\t\tN/A\nGPU4\tNV18\tNV18\tNV18\tNV18\t X \tNV18\tNV18\tNV18\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tPIX\tNODE\tNODE\tNODE\t56-111,168-223\t1\t\tN/A\nGPU5\tNV18\tNV18\tNV18\tNV18\tNV18\t X \tNV18\tNV18\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\tPIX\tNODE\tNODE\t56-111,168-223\t1\t\tN/A\nGPU6\tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\t X \tNV18\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tPIX\tPIX\t56-111,168-223\t1\t\tN/A\nGPU7\tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\t X \tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tNODE\tNODE\t56-111,168-223\t1\t\tN/A\nNIC0\tPIX\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t X \tNODE\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC1\tNODE\tPIX\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\tNODE\t X \tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC2\tNODE\tNODE\tPIX\tNODE\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\t X \tPIX\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC3\tNODE\tNODE\tPIX\tNODE\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tPIX\t X \tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC4\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tNODE\tNODE\t X \tPIX\tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC5\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tNODE\tNODE\tPIX\t X \tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC6\tSYS\tSYS\tSYS\tSYS\tPIX\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\t X 
\tNODE\tNODE\tNODE\t\t\t\t\nNIC7\tSYS\tSYS\tSYS\tSYS\tNODE\tPIX\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\t X \tNODE\tNODE\t\t\t\t\nNIC8\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tPIX\tNODE\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\t X \tPIX\t\t\t\t\nNIC9\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tPIX\tNODE\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tPIX\t X \t\t\t\t\n\nLegend:\n\n X = Self\n SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)\n NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node\n PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)\n PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)\n PIX = Connection traversing at most a single PCIe bridge\n NV# = Connection traversing a bonded set of # NVLinks\n\nNIC Legend:\n\n NIC0: mlx5_0\n NIC1: mlx5_1\n NIC2: mlx5_2\n NIC3: mlx5_3\n NIC4: mlx5_4\n NIC5: mlx5_5\n NIC6: mlx5_6\n NIC7: mlx5_7\n NIC8: mlx5_8\n NIC9: mlx5_9\n\n",
"timestamp": "2026-05-22T15:26:36.627805",
"detected_gpu_type": "h100",
"gpu_label": "H100 SXM5"
},
"memory_bench": {
"memory": {
"source": "pytorch",
"h2d_bandwidth_gbps": 11.8,
"d2h_bandwidth_gbps": 10.1,
"d2d_bandwidth_gbps": 829.0,
"peak_bandwidth_gbps": 3400,
"efficiency_pct": 24.4,
"test_sizes_mb": [
1,
4,
16,
64,
256,
1024,
4096
],
"bandwidth_by_size": {
"1": {
"h2d_gbps": 3.6,
"d2h_gbps": 1.4,
"d2d_gbps": 40.3
},
"4": {
"h2d_gbps": 7.7,
"d2h_gbps": 10.1,
"d2d_gbps": 159.5
},
"16": {
"h2d_gbps": 10.9,
"d2h_gbps": 1.9,
"d2d_gbps": 439.5
},
"64": {
"h2d_gbps": 11.8,
"d2h_gbps": 1.4,
"d2d_gbps": 740.5
},
"256": {
"h2d_gbps": 9.0,
"d2h_gbps": 1.4,
"d2d_gbps": 792.1
},
"1024": {
"h2d_gbps": 8.4,
"d2h_gbps": 1.4,
"d2d_gbps": 818.9
},
"4096": {
"h2d_gbps": 6.1,
"d2h_gbps": 1.4,
"d2d_gbps": 829.0
}
},
"per_gpu": []
}
},
"compute_bench": {
"compute": {
"per_dtype_tflops": {
"fp32": 51.9,
"tf32": 357.8,
"fp16": 667.2,
"bf16": 699.1,
"fp8": 1146.2
},
"peak_tflops": {
"fp32": 67,
"tf32": 495,
"fp16": 990,
"bf16": 990,
"fp8": 1979
},
"efficiency_pct": {
"fp32": 77.5,
"tf32": 72.3,
"fp16": 67.4,
"bf16": 70.6,
"fp8": 57.9
},
"pass_thresholds_tflops": {
"fp32": 54,
"tf32": 444,
"fp16": 734,
"bf16": 745,
"fp8": 1400
},
"per_gpu": [
{
"index": 0,
"fp32": 51.9,
"tf32": 357.8,
"fp16": 667.2,
"bf16": 699.1,
"fp8": 1146.2
}
],
"matrix_size": 8192,
"warmup": 50,
"iterations": 500
}
}
}

View File

@ -0,0 +1,54 @@
# GPU Test Report
- **Date:** 2026-05-22 15:27:53
- **Host:** aikubeworker0016
- **GPU:** NVIDIA H100 80GB HBM3 x8
- **Driver:** 580.159.03 | **CUDA:** 13.0
## Summary
| Test | Result |
|------|--------|
| GPU Info | PASS (8 GPUs detected) |
| Memory Bandwidth | WARN (829 GB/s via PyTorch fallback) |
| Compute Throughput | FAIL (worst TF32 358 vs >= 444) |
## GPU Information
| GPU | Model | VRAM | Temp | Power | SM Clock |
|-----|-------|------|------|-------|----------|
| 0 | NVIDIA H100 80GB HBM3 | 81559 MB | 20C | 70/700W | 345 MHz |
| 1 | NVIDIA H100 80GB HBM3 | 81559 MB | 20C | 67/700W | 345 MHz |
| 2 | NVIDIA H100 80GB HBM3 | 81559 MB | 21C | 67/700W | 345 MHz |
| 3 | NVIDIA H100 80GB HBM3 | 81559 MB | 20C | 67/700W | 345 MHz |
| 4 | NVIDIA H100 80GB HBM3 | 81559 MB | 20C | 67/700W | 345 MHz |
| 5 | NVIDIA H100 80GB HBM3 | 81559 MB | 22C | 69/700W | 345 MHz |
| 6 | NVIDIA H100 80GB HBM3 | 81559 MB | 20C | 68/700W | 345 MHz |
| 7 | NVIDIA H100 80GB HBM3 | 81559 MB | 20C | 66/700W | 345 MHz |
## Memory Bandwidth
Source: pytorch
| Metric | Value | Peak | Efficiency |
|--------|-------|------|------------|
| H2D (PCIe) | 11.8 GB/s | N/A | N/A |
| D2H (PCIe) | 10.1 GB/s | N/A | N/A |
| D2D (on-device HBM) | 829.0 GB/s | 3400 GB/s | 24.4% |
**Verdict: WARN** (D2D 829.0 GB/s via PyTorch fallback; nvbandwidth unavailable — figure is indicative only, not a true HBM peak)
## Compute Throughput
| DType | Achieved (TFLOPS) | Peak | Threshold | Status |
|-------|-------------------|------|------------|--------|
| FP32 | 51.9 | 67 | >= 54 | WARN |
| TF32 | 357.8 | 495 | >= 444 | FAIL |
| FP16 | 667.2 | 990 | >= 734 | WARN |
| BF16 | 699.1 | 990 | >= 745 | WARN |
| FP8 | 1146.2 | 1979 | >= 1400 | FAIL |
**Verdict: FAIL** (absolute TFLOPS thresholds; worst efficiency 57.9%)
---
*Generated by GPU Test Suite v0.2.0*

View File

@ -0,0 +1,165 @@
{
"stress": {
"source": "pytorch",
"passed": false,
"duration_sec": 45,
"elapsed_sec": 45.4,
"gpu_status": {
"0": "PASS",
"1": "PASS",
"2": "PASS",
"3": "PASS",
"4": "PASS",
"5": "PASS",
"6": "PASS",
"7": "PASS"
},
"telemetry": {
"passed": false,
"samples": 39,
"steady_samples": 31,
"warmup_sec": 9.0,
"max_temp_c": {
"0": 59.0,
"1": 58.0,
"2": 65.0,
"3": 54.0,
"4": 59.0,
"5": 66.0,
"6": 62.0,
"7": 55.0
},
"avg_power_w": {
"0": 697.0,
"1": 697.4,
"2": 697.9,
"3": 698.0,
"4": 697.8,
"5": 697.6,
"6": 697.9,
"7": 698.2
},
"temp_delta_c": 12.0,
"throttle_events": [
{
"gpu": 0,
"throttle": "0x0000000000000004",
"real_throttle": "0x4"
},
{
"gpu": 1,
"throttle": "0x0000000000000004",
"real_throttle": "0x4"
},
{
"gpu": 2,
"throttle": "0x0000000000000004",
"real_throttle": "0x4"
},
{
"gpu": 3,
"throttle": "0x0000000000000004",
"real_throttle": "0x4"
},
{
"gpu": 4,
"throttle": "0x0000000000000004",
"real_throttle": "0x4"
},
{
"gpu": 5,
"throttle": "0x0000000000000004",
"real_throttle": "0x4"
},
{
"gpu": 6,
"throttle": "0x0000000000000004",
"real_throttle": "0x4"
},
{
"gpu": 7,
"throttle": "0x0000000000000004",
"real_throttle": "0x4"
},
{
"gpu": 0,
"throttle": "0x0000000000000004",
"real_throttle": "0x4"
},
{
"gpu": 1,
"throttle": "0x0000000000000004",
"real_throttle": "0x4"
},
{
"gpu": 2,
"throttle": "0x0000000000000004",
"real_throttle": "0x4"
},
{
"gpu": 3,
"throttle": "0x0000000000000004",
"real_throttle": "0x4"
},
{
"gpu": 4,
"throttle": "0x0000000000000004",
"real_throttle": "0x4"
},
{
"gpu": 5,
"throttle": "0x0000000000000004",
"real_throttle": "0x4"
},
{
"gpu": 6,
"throttle": "0x0000000000000004",
"real_throttle": "0x4"
},
{
"gpu": 7,
"throttle": "0x0000000000000004",
"real_throttle": "0x4"
},
{
"gpu": 0,
"throttle": "0x0000000000000004",
"real_throttle": "0x4"
},
{
"gpu": 1,
"throttle": "0x0000000000000004",
"real_throttle": "0x4"
},
{
"gpu": 2,
"throttle": "0x0000000000000004",
"real_throttle": "0x4"
},
{
"gpu": 3,
"throttle": "0x0000000000000004",
"real_throttle": "0x4"
}
],
"throttle_event_count": 248,
"xid_events": [],
"tflops_jitter_pct": 4.07,
"steady_tflops_samples": 781,
"failures": [
"GPU temperature delta 12.0C exceeds 5.0C",
"non-idle throttle reasons observed in 248 samples (first: GPU 0 0x4)"
],
"thresholds": {
"max_temp_c": 80.0,
"max_temp_delta_c": 5.0,
"min_power_w": 630.0,
"max_tflops_jitter_pct": 5.0,
"warmup_sec": 10.0,
"min_steady_samples": 10
}
},
"timestamp": "2026-05-22T17:52:09.074859"
},
"timestamp": "2026-05-22T17:52:09.082873"
}


@ -0,0 +1,29 @@
# GPU Test Report
- **Date:** 2026-05-22T17:52:09.082873
- **Host:** aikubeworker0012
## Summary
| Test | Result |
|------|--------|
| Stress Test | FAIL |
## Stress Test
- **Source:** pytorch
- **Duration:** 45s (requested 45s)
- **Telemetry samples:** 39
- **Max temp:** {'0': 59.0, '1': 58.0, '2': 65.0, '3': 54.0, '4': 59.0, '5': 66.0, '6': 62.0, '7': 55.0}
- **Avg power:** {'0': 697.0, '1': 697.4, '2': 697.9, '3': 698.0, '4': 697.8, '5': 697.6, '6': 697.9, '7': 698.2}
- **Temp delta:** 12.0 C
- **TFLOPS jitter:** 4.07%
- **Throttle events:** 248
- **XID events:** 0
- **Failure reasons:**
- GPU temperature delta 12.0C exceeds 5.0C
- non-idle throttle reasons observed in 248 samples (first: GPU 0 0x4)
- **Result: FAIL**
---
*Generated by GPU Test Suite v0.2.0*


@ -0,0 +1,165 @@
{
"stress": {
"source": "pytorch",
"passed": false,
"duration_sec": 45,
"elapsed_sec": 45.4,
"gpu_status": {
"0": "PASS",
"1": "PASS",
"2": "PASS",
"3": "PASS",
"4": "PASS",
"5": "PASS",
"6": "PASS",
"7": "PASS"
},
"telemetry": {
"passed": false,
"samples": 39,
"steady_samples": 31,
"warmup_sec": 9.0,
"max_temp_c": {
"0": 50.0,
"1": 56.0,
"2": 57.0,
"3": 52.0,
"4": 51.0,
"5": 58.0,
"6": 53.0,
"7": 51.0
},
"avg_power_w": {
"0": 698.3,
"1": 698.5,
"2": 697.6,
"3": 697.9,
"4": 697.8,
"5": 698.0,
"6": 697.5,
"7": 698.0
},
"temp_delta_c": 8.0,
"throttle_events": [
{
"gpu": 0,
"throttle": "0x0000000000000004",
"real_throttle": "0x4"
},
{
"gpu": 1,
"throttle": "0x0000000000000004",
"real_throttle": "0x4"
},
{
"gpu": 2,
"throttle": "0x0000000000000004",
"real_throttle": "0x4"
},
{
"gpu": 3,
"throttle": "0x0000000000000004",
"real_throttle": "0x4"
},
{
"gpu": 4,
"throttle": "0x0000000000000004",
"real_throttle": "0x4"
},
{
"gpu": 5,
"throttle": "0x0000000000000004",
"real_throttle": "0x4"
},
{
"gpu": 6,
"throttle": "0x0000000000000004",
"real_throttle": "0x4"
},
{
"gpu": 7,
"throttle": "0x0000000000000004",
"real_throttle": "0x4"
},
{
"gpu": 0,
"throttle": "0x0000000000000004",
"real_throttle": "0x4"
},
{
"gpu": 1,
"throttle": "0x0000000000000004",
"real_throttle": "0x4"
},
{
"gpu": 2,
"throttle": "0x0000000000000004",
"real_throttle": "0x4"
},
{
"gpu": 3,
"throttle": "0x0000000000000004",
"real_throttle": "0x4"
},
{
"gpu": 4,
"throttle": "0x0000000000000004",
"real_throttle": "0x4"
},
{
"gpu": 5,
"throttle": "0x0000000000000004",
"real_throttle": "0x4"
},
{
"gpu": 6,
"throttle": "0x0000000000000004",
"real_throttle": "0x4"
},
{
"gpu": 7,
"throttle": "0x0000000000000004",
"real_throttle": "0x4"
},
{
"gpu": 0,
"throttle": "0x0000000000000004",
"real_throttle": "0x4"
},
{
"gpu": 1,
"throttle": "0x0000000000000004",
"real_throttle": "0x4"
},
{
"gpu": 2,
"throttle": "0x0000000000000004",
"real_throttle": "0x4"
},
{
"gpu": 3,
"throttle": "0x0000000000000004",
"real_throttle": "0x4"
}
],
"throttle_event_count": 248,
"xid_events": [],
"tflops_jitter_pct": 3.77,
"steady_tflops_samples": 787,
"failures": [
"GPU temperature delta 8.0C exceeds 5.0C",
"non-idle throttle reasons observed in 248 samples (first: GPU 0 0x4)"
],
"thresholds": {
"max_temp_c": 80.0,
"max_temp_delta_c": 5.0,
"min_power_w": 630.0,
"max_tflops_jitter_pct": 5.0,
"warmup_sec": 10.0,
"min_steady_samples": 10
}
},
"timestamp": "2026-05-22T17:53:02.058687"
},
"timestamp": "2026-05-22T17:53:02.066792"
}


@ -0,0 +1,29 @@
# GPU Test Report
- **Date:** 2026-05-22T17:53:02.066792
- **Host:** aikubeworker0016
## Summary
| Test | Result |
|------|--------|
| Stress Test | FAIL |
## Stress Test
- **Source:** pytorch
- **Duration:** 45s (requested 45s)
- **Telemetry samples:** 39
- **Max temp:** {'0': 50.0, '1': 56.0, '2': 57.0, '3': 52.0, '4': 51.0, '5': 58.0, '6': 53.0, '7': 51.0}
- **Avg power:** {'0': 698.3, '1': 698.5, '2': 697.6, '3': 697.9, '4': 697.8, '5': 698.0, '6': 697.5, '7': 698.0}
- **Temp delta:** 8.0 C
- **TFLOPS jitter:** 3.77%
- **Throttle events:** 248
- **XID events:** 0
- **Failure reasons:**
- GPU temperature delta 8.0C exceeds 5.0C
- non-idle throttle reasons observed in 248 samples (first: GPU 0 0x4)
- **Result: FAIL**
---
*Generated by GPU Test Suite v0.2.0*


@ -0,0 +1,322 @@
# GPU Test Report
- **Date:** 2026-05-22T20:32:51.687830
- **Host:** aikubeworker0012
- **GPU:** NVIDIA H100 80GB HBM3 x8
- **Driver:** 580.159.03 | **CUDA:** 13.0
## Overall Acceptance Verdict
**Result: FAIL**
Failed or unverified items:
- Compute Throughput: FAIL (FP16 spread 3.04% > 3%)
- NCCL: FAIL
- Stress Test: FAIL
- RDMA: FAIL
## Summary
| Test | Result |
|------|--------|
| GPU Info | PASS (8 GPUs detected) |
| Health Check | PASS |
| Memory Bandwidth | PASS (108.1%) |
| Compute Throughput | FAIL (FP16 spread 3.04% > 3%) |
| NVLink/NVSwitch | PASS |
| DCGM | PASS |
| NCCL | FAIL |
| Stress Test | FAIL |
| RDMA | FAIL |
| Training | PASS (216498 tokens/sec) |
## GPU Information
| GPU | Model | VRAM | Temp | Power | SM Clock |
|-----|-------|------|------|-------|----------|
| 0 | NVIDIA H100 80GB HBM3 | 81559 MB | 25C | 69/700W | 345 MHz |
| 1 | NVIDIA H100 80GB HBM3 | 81559 MB | 25C | 73/700W | 345 MHz |
| 2 | NVIDIA H100 80GB HBM3 | 81559 MB | 26C | 69/700W | 345 MHz |
| 3 | NVIDIA H100 80GB HBM3 | 81559 MB | 24C | 69/700W | 345 MHz |
| 4 | NVIDIA H100 80GB HBM3 | 81559 MB | 24C | 69/700W | 345 MHz |
| 5 | NVIDIA H100 80GB HBM3 | 81559 MB | 27C | 70/700W | 345 MHz |
| 6 | NVIDIA H100 80GB HBM3 | 81559 MB | 25C | 70/700W | 345 MHz |
| 7 | NVIDIA H100 80GB HBM3 | 81559 MB | 24C | 71/700W | 345 MHz |
## Health Check
**Overall: PASS**
| GPU | Temp | Power | ECC | PCIe | Throttle | Status |
|-----|------|-------|-----|------|----------|--------|
| 0 | 25C PASS | 69W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
| 1 | 25C PASS | 73W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
| 2 | 26C PASS | 69W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
| 3 | 24C PASS | 70W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
| 4 | 24C PASS | 69W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
| 5 | 27C PASS | 70W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
| 6 | 25C PASS | 70W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
| 7 | 24C PASS | 71W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
## Memory Bandwidth
Source: nvbandwidth
| Metric | Value | Peak | Efficiency |
|--------|-------|------|------------|
| H2D (PCIe) | 55.4 GB/s | 64 GB/s | 86.6% |
| D2H (PCIe) | 54.0 GB/s | 64 GB/s | 84.4% |
| D2D (NVLink) | 486.5 GB/s | 450 GB/s | 108.1% |
**Verdict: PASS** (D2D efficiency 108.1%)
## Compute Throughput
| DType | Achieved (TFLOPS) | Peak | Threshold | Status |
|-------|-------------------|------|------------|--------|
| FP32 | 51.9 | 67 | >= 54 | FAIL |
| TF32 | 364.9 | 495 | >= 444 | FAIL |
| FP16 | 680.0 | 990 | >= 734 | FAIL |
| BF16 | 713.2 | 990 | >= 745 | FAIL |
| FP8 | 1170.4 | 1979 | >= 1400 | FAIL |
| FP64 | 46.9 | 67 | >= 63 | FAIL |
| INT8 | 100.4 | 1979 | >= 1536 | FAIL |
**Verdict: FAIL** (absolute TFLOPS thresholds; worst efficiency 5.1%)
### Compute Consistency
| DType | Min | Mean | Max | Spread | Limit | Status |
|-------|-----|------|-----|--------|-------|--------|
| FP32 | 51.9 | 52.0 | 52.1 | 0.38% | <= 3% | PASS |
| TF32 | 361.0 | 364.9 | 369.0 | 2.19% | <= 3% | PASS |
| FP16 | 667.3 | 680.0 | 688.0 | 3.04% | <= 3% | FAIL |
| BF16 | 703.0 | 713.3 | 735.7 | 4.58% | <= 3% | FAIL |
| FP8 | 1156.9 | 1170.5 | 1186.1 | 2.49% | <= 3% | PASS |
| FP64 | 45.9 | 46.9 | 47.5 | 3.41% | <= 3% | FAIL |
| INT8 | 100.4 | 100.4 | 100.4 | 0.00% | <= 3% | PASS |
### Compute Per-GPU TFLOPS
| GPU | FP32 | TF32 | FP16 | BF16 | FP8 | FP64 | INT8 |
|---|---|---|---|---|---|---|---|
| 0 | 52.0 | 369.0 | 688.0 | 735.7 | 1186.1 | 47.5 | 100.4 |
| 1 | 51.9 | 365.6 | 675.3 | 711.6 | 1171.0 | 47.0 | 100.4 |
| 2 | 51.9 | 364.9 | 685.7 | 715.3 | 1175.3 | 47.1 | 100.4 |
| 3 | 51.9 | 364.0 | 679.9 | 704.0 | 1167.6 | 47.4 | 100.4 |
| 4 | 51.9 | 367.7 | 681.2 | 719.0 | 1178.0 | 46.6 | 100.4 |
| 5 | 52.0 | 364.3 | 680.8 | 712.3 | 1165.5 | 46.8 | 100.4 |
| 6 | 52.1 | 362.9 | 681.8 | 703.0 | 1156.9 | 46.9 | 100.4 |
| 7 | 51.9 | 361.0 | 667.3 | 705.3 | 1163.2 | 45.9 | 100.4 |
## NVLink/NVSwitch
**Overall: PASS**
| GPU | Active Links | Issues |
|-----|--------------|--------|
| 0 | 18/18 | OK |
| 1 | 18/18 | OK |
| 2 | 18/18 | OK |
| 3 | 18/18 | OK |
| 4 | 18/18 | OK |
| 5 | 18/18 | OK |
| 6 | 18/18 | OK |
| 7 | 18/18 | OK |
## DCGM Diagnostic
**Overall: PASS**
| Subtest | Status |
|---------|--------|
| Deployment/software/GPU0 | PASS |
| Deployment/software/GPU1 | PASS |
| Deployment/software/GPU2 | PASS |
| Deployment/software/GPU3 | PASS |
| Deployment/software/GPU4 | PASS |
| Deployment/software/GPU5 | PASS |
| Deployment/software/GPU6 | PASS |
| Deployment/software/GPU7 | PASS |
| Deployment/software/summary | PASS |
| Hardware/memory/GPU0 | PASS |
| Hardware/memory/GPU1 | PASS |
| Hardware/memory/GPU2 | PASS |
| Hardware/memory/GPU3 | PASS |
| Hardware/memory/GPU4 | PASS |
| Hardware/memory/GPU5 | PASS |
| Hardware/memory/GPU6 | PASS |
| Hardware/memory/GPU7 | PASS |
| Hardware/memory/summary | PASS |
| Hardware/diagnostic/GPU0 | PASS |
| Hardware/diagnostic/GPU1 | PASS |
| Hardware/diagnostic/GPU2 | PASS |
| Hardware/diagnostic/GPU3 | PASS |
| Hardware/diagnostic/GPU4 | PASS |
| Hardware/diagnostic/GPU5 | PASS |
| Hardware/diagnostic/GPU6 | PASS |
| Hardware/diagnostic/GPU7 | PASS |
| Hardware/diagnostic/summary | PASS |
| Hardware/nvbandwidth/GPU0 | PASS |
| Hardware/nvbandwidth/GPU1 | PASS |
| Hardware/nvbandwidth/GPU2 | PASS |
| Hardware/nvbandwidth/GPU3 | PASS |
| Hardware/nvbandwidth/GPU4 | PASS |
| Hardware/nvbandwidth/GPU5 | PASS |
| Hardware/nvbandwidth/GPU6 | PASS |
| Hardware/nvbandwidth/GPU7 | PASS |
| Hardware/nvbandwidth/summary | PASS |
| Integration/pcie/GPU0 | PASS |
| Integration/pcie/GPU1 | PASS |
| Integration/pcie/GPU2 | PASS |
| Integration/pcie/GPU3 | PASS |
| Integration/pcie/GPU4 | PASS |
| Integration/pcie/GPU5 | PASS |
| Integration/pcie/GPU6 | PASS |
| Integration/pcie/GPU7 | PASS |
| Integration/pcie/summary | PASS |
| Stress/targeted_stress/GPU0 | PASS |
| Stress/targeted_stress/GPU1 | PASS |
| Stress/targeted_stress/GPU2 | PASS |
| Stress/targeted_stress/GPU3 | PASS |
| Stress/targeted_stress/GPU4 | PASS |
| Stress/targeted_stress/GPU5 | PASS |
| Stress/targeted_stress/GPU6 | PASS |
| Stress/targeted_stress/GPU7 | PASS |
| Stress/targeted_stress/summary | PASS |
| Stress/targeted_power/GPU0 | PASS |
| Stress/targeted_power/GPU1 | PASS |
| Stress/targeted_power/GPU2 | PASS |
| Stress/targeted_power/GPU3 | PASS |
| Stress/targeted_power/GPU4 | PASS |
| Stress/targeted_power/GPU5 | PASS |
| Stress/targeted_power/GPU6 | PASS |
| Stress/targeted_power/GPU7 | PASS |
| Stress/targeted_power/summary | PASS |
## NCCL Multi-GPU
Source: nccl-tests | GPUs: 8
| Operation | Bus BW at best size (GB/s) | Threshold | Status (all sizes) |
|-----------|---------------|-----------|--------|
| allreduce | 472.3 | >= 405 | FAIL |
| alltoall | 343.3 | >= 315 | FAIL |
| broadcast | 364.1 | >= 360 | FAIL |
| reducescatter | 352.8 | >= 405 | FAIL |
| allgather | 366.4 | >= 405 | FAIL |
| sendrecv | 369.0 | >= 360 | FAIL |
### NCCL allreduce by size
| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
|------|---------------------|-------|------|--------|-----------|--------|
| 1M | 24.9, 25.0, 24.7 | 24.7 | 24.9 | 0.50% | >= 405 | FAIL |
| 256M | 421.6, 421.8, 421.6 | 421.6 | 421.7 | 0.02% | >= 405 | PASS |
| 2G | 472.8, 472.7, 471.5 | 471.5 | 472.3 | 0.13% | >= 405 | PASS |
### NCCL alltoall by size
| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
|------|---------------------|-------|------|--------|-----------|--------|
| 1M | 8.1, 8.0, 8.0 | 8.0 | 8.0 | 0.59% | >= 315 | FAIL |
| 256M | 305.3, 314.9, 313.1 | 305.3 | 311.1 | 1.34% | >= 315 | FAIL |
| 2G | 342.1, 342.5, 345.4 | 342.1 | 343.3 | 0.43% | >= 315 | PASS |
### NCCL broadcast by size
| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
|------|---------------------|-------|------|--------|-----------|--------|
| 1M | 14.5, 14.6, 14.2 | 14.2 | 14.4 | 1.18% | >= 360 | FAIL |
| 256M | 344.2, 345.9, 344.6 | 344.2 | 344.9 | 0.21% | >= 360 | FAIL |
| 2G | 364.2, 364.0, 364.1 | 364.0 | 364.1 | 0.02% | >= 360 | PASS |
### NCCL reducescatter by size
| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
|------|---------------------|-------|------|--------|-----------|--------|
| 1M | 14.1, 13.8, 14.2 | 13.8 | 14.0 | 1.21% | >= 405 | FAIL |
| 256M | 328.6, 328.3, 328.2 | 328.2 | 328.4 | 0.05% | >= 405 | FAIL |
| 2G | 352.6, 352.4, 353.3 | 352.4 | 352.8 | 0.11% | >= 405 | FAIL |
### NCCL allgather by size
| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
|------|---------------------|-------|------|--------|-----------|--------|
| 1M | 14.6, 14.3, 14.4 | 14.3 | 14.4 | 0.86% | >= 405 | FAIL |
| 256M | 350.5, 350.4, 349.9 | 349.9 | 350.3 | 0.07% | >= 405 | FAIL |
| 2G | 366.3, 366.6, 366.2 | 366.2 | 366.4 | 0.05% | >= 405 | FAIL |
### NCCL sendrecv by size
| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
|------|---------------------|-------|------|--------|-----------|--------|
| 1M | 18.4, 18.4, 18.4 | 18.4 | 18.4 | 0.00% | >= 360 | FAIL |
| 256M | 350.9, 351.6, 351.4 | 350.9 | 351.3 | 0.08% | >= 360 | FAIL |
| 2G | 368.9, 369.1, 368.9 | 368.9 | 369.0 | 0.03% | >= 360 | PASS |
**Overall: FAIL**
## Stress Test
- **Source:** pytorch
- **Duration:** 1800s (requested 1800s)
- **Telemetry samples:** 1266
- **Max temp:** {0: 60.0, 1: 60.0, 2: 68.0, 3: 56.0, 4: 60.0, 5: 68.0, 6: 64.0, 7: 56.0}
- **Avg power:** {0: 697.7, 1: 697.5, 2: 697.1, 3: 697.8, 4: 697.8, 5: 697.9, 6: 697.7, 7: 698.3}
- **Temp delta:** 12.0 C
- **TFLOPS jitter:** 4.37%
- **Steady TFLOPS samples:** 37672
- **Throttle events:** 9712
- **XID events:** 0
- **Failure reasons:**
- GPU temperature delta 12.0C exceeds 5.0C
- non-idle throttle reasons observed in 9712 samples (first: GPU 0 0x4)
- **Result: FAIL**
## RDMA/InfiniBand
### RDMA Port Checks
| Device | Port | State | Rate | Required | Status |
|--------|------|-------|------|----------|--------|
| mlx5_0 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
| mlx5_1 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
| mlx5_4 | 1 | 4: ACTIVE | 100 Gb/sec (2X HDR) | >= 400Gbps ACTIVE | FAIL |
| mlx5_5 | 1 | 4: ACTIVE | 100 Gb/sec (2X HDR) | >= 400Gbps ACTIVE | FAIL |
| mlx5_6 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
| mlx5_7 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
| Test | Value | Threshold | Status |
|------|-------|-----------|--------|
| ib_write_bw | 49.5 GB/s | >= 47 GB/s | PASS |
| ib_read_bw | 39.1 GB/s | >= 47 GB/s | FAIL |
| ib_write_lat | 1.25 us | <= 2 us | PASS |
| ib_read_lat | 2.60 us | <= 3.5 us | PASS |
| ibping | local_loopback target=0x58 count=5 | 0% packet loss | PASS |
- **PFC/ECN/CNP/congestion counters checked:** 146
- **PFC/ECN/CNP/congestion non-zero:** no
- **Failure reasons:**
- mlx5_4 port 1 state/rate failed (4: ACTIVE, 100 Gb/sec (2X HDR); required >= 400.0Gbps ACTIVE)
- mlx5_5 port 1 state/rate failed (4: ACTIVE, 100 Gb/sec (2X HDR); required >= 400.0Gbps ACTIVE)
- ib_read_bw bandwidth 39.12GB/s < 47GB/s
**Overall: FAIL**
## Training Simulation
| Metric | Value |
|--------|-------|
| Model | synthetic_transformer_1.5b |
| Params | 1470.5M |
| Throughput | 216498 tokens/sec |
| Avg Step Time | 75.7 ms |
| Warmup Steps | 5 |
| Peak Memory | 18.1 GB |
| Final Loss | 0.0039 |
| Step Jitter | 1.89% |
| Distributed Mode | ddp |
| Verdict | PASS (216498 tokens/sec) |
---
*Generated by GPU Test Suite v0.2.0*


@ -0,0 +1,322 @@
# GPU Test Report
- **Date:** 2026-05-22T20:34:52.129246
- **Host:** aikubeworker0016
- **GPU:** NVIDIA H100 80GB HBM3 x8
- **Driver:** 580.159.03 | **CUDA:** 13.0
## Overall Acceptance Verdict
**Result: FAIL**
Failed or unverified items:
- Compute Throughput: FAIL (BF16 spread 3.44% > 3%)
- NCCL: FAIL
- Stress Test: FAIL
- RDMA: FAIL
## Summary
| Test | Result |
|------|--------|
| GPU Info | PASS (8 GPUs detected) |
| Health Check | PASS |
| Memory Bandwidth | PASS (108.1%) |
| Compute Throughput | FAIL (BF16 spread 3.44% > 3%) |
| NVLink/NVSwitch | PASS |
| DCGM | PASS |
| NCCL | FAIL |
| Stress Test | FAIL |
| RDMA | FAIL |
| Training | PASS (216683 tokens/sec) |
## GPU Information
| GPU | Model | VRAM | Temp | Power | SM Clock |
|-----|-------|------|------|-------|----------|
| 0 | NVIDIA H100 80GB HBM3 | 81559 MB | 20C | 70/700W | 345 MHz |
| 1 | NVIDIA H100 80GB HBM3 | 81559 MB | 21C | 68/700W | 345 MHz |
| 2 | NVIDIA H100 80GB HBM3 | 81559 MB | 21C | 67/700W | 345 MHz |
| 3 | NVIDIA H100 80GB HBM3 | 81559 MB | 20C | 67/700W | 345 MHz |
| 4 | NVIDIA H100 80GB HBM3 | 81559 MB | 20C | 68/700W | 345 MHz |
| 5 | NVIDIA H100 80GB HBM3 | 81559 MB | 22C | 69/700W | 345 MHz |
| 6 | NVIDIA H100 80GB HBM3 | 81559 MB | 20C | 68/700W | 345 MHz |
| 7 | NVIDIA H100 80GB HBM3 | 81559 MB | 20C | 66/700W | 345 MHz |
## Health Check
**Overall: PASS**
| GPU | Temp | Power | ECC | PCIe | Throttle | Status |
|-----|------|-------|-----|------|----------|--------|
| 0 | 20C PASS | 70W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
| 1 | 21C PASS | 68W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
| 2 | 21C PASS | 67W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
| 3 | 20C PASS | 67W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
| 4 | 20C PASS | 68W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
| 5 | 22C PASS | 69W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
| 6 | 20C PASS | 68W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
| 7 | 20C PASS | 66W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
## Memory Bandwidth
Source: nvbandwidth
| Metric | Value | Peak | Efficiency |
|--------|-------|------|------------|
| H2D (PCIe) | 55.4 GB/s | 64 GB/s | 86.6% |
| D2H (PCIe) | 54.4 GB/s | 64 GB/s | 85.0% |
| D2D (NVLink) | 486.6 GB/s | 450 GB/s | 108.1% |
**Verdict: PASS** (D2D efficiency 108.1%)
## Compute Throughput
| DType | Achieved (TFLOPS) | Peak | Threshold | Status |
|-------|-------------------|------|------------|--------|
| FP32 | 52.1 | 67 | >= 54 | FAIL |
| TF32 | 366.7 | 495 | >= 444 | FAIL |
| FP16 | 682.7 | 990 | >= 734 | FAIL |
| BF16 | 717.3 | 990 | >= 745 | FAIL |
| FP8 | 1173.5 | 1979 | >= 1400 | FAIL |
| FP64 | 47.4 | 67 | >= 63 | FAIL |
| INT8 | 100.4 | 1979 | >= 1536 | FAIL |
**Verdict: FAIL** (absolute TFLOPS thresholds; worst efficiency 5.1%)
### Compute Consistency
| DType | Min | Mean | Max | Spread | Limit | Status |
|-------|-----|------|-----|--------|-------|--------|
| FP32 | 51.9 | 52.1 | 52.2 | 0.58% | <= 3% | PASS |
| TF32 | 362.3 | 366.7 | 369.2 | 1.88% | <= 3% | PASS |
| FP16 | 674.4 | 682.7 | 693.1 | 2.74% | <= 3% | PASS |
| BF16 | 705.3 | 717.2 | 730.0 | 3.44% | <= 3% | FAIL |
| FP8 | 1155.2 | 1173.5 | 1186.2 | 2.64% | <= 3% | PASS |
| FP64 | 46.3 | 47.4 | 48.5 | 4.64% | <= 3% | FAIL |
| INT8 | 100.4 | 100.4 | 100.4 | 0.00% | <= 3% | PASS |
### Compute Per-GPU TFLOPS
| GPU | FP32 | TF32 | FP16 | BF16 | FP8 | FP64 | INT8 |
|---|---|---|---|---|---|---|---|
| 0 | 52.2 | 362.3 | 674.4 | 714.3 | 1159.0 | 46.3 | 100.4 |
| 1 | 51.9 | 366.5 | 674.7 | 721.4 | 1185.4 | 47.7 | 100.4 |
| 2 | 52.2 | 367.4 | 693.1 | 730.0 | 1185.7 | 48.5 | 100.4 |
| 3 | 52.2 | 367.8 | 682.2 | 708.2 | 1163.4 | 47.4 | 100.4 |
| 4 | 52.0 | 366.4 | 686.9 | 714.1 | 1186.2 | 47.3 | 100.4 |
| 5 | 52.0 | 369.2 | 679.9 | 721.1 | 1155.2 | 47.3 | 100.4 |
| 6 | 51.9 | 365.1 | 677.7 | 705.3 | 1169.0 | 47.0 | 100.4 |
| 7 | 52.2 | 369.0 | 692.8 | 723.5 | 1184.3 | 47.6 | 100.4 |
## NVLink/NVSwitch
**Overall: PASS**
| GPU | Active Links | Issues |
|-----|--------------|--------|
| 0 | 18/18 | OK |
| 1 | 18/18 | OK |
| 2 | 18/18 | OK |
| 3 | 18/18 | OK |
| 4 | 18/18 | OK |
| 5 | 18/18 | OK |
| 6 | 18/18 | OK |
| 7 | 18/18 | OK |
## DCGM Diagnostic
**Overall: PASS**
| Subtest | Status |
|---------|--------|
| Deployment/software/GPU0 | PASS |
| Deployment/software/GPU1 | PASS |
| Deployment/software/GPU2 | PASS |
| Deployment/software/GPU3 | PASS |
| Deployment/software/GPU4 | PASS |
| Deployment/software/GPU5 | PASS |
| Deployment/software/GPU6 | PASS |
| Deployment/software/GPU7 | PASS |
| Deployment/software/summary | PASS |
| Hardware/memory/GPU0 | PASS |
| Hardware/memory/GPU1 | PASS |
| Hardware/memory/GPU2 | PASS |
| Hardware/memory/GPU3 | PASS |
| Hardware/memory/GPU4 | PASS |
| Hardware/memory/GPU5 | PASS |
| Hardware/memory/GPU6 | PASS |
| Hardware/memory/GPU7 | PASS |
| Hardware/memory/summary | PASS |
| Hardware/diagnostic/GPU0 | PASS |
| Hardware/diagnostic/GPU1 | PASS |
| Hardware/diagnostic/GPU2 | PASS |
| Hardware/diagnostic/GPU3 | PASS |
| Hardware/diagnostic/GPU4 | PASS |
| Hardware/diagnostic/GPU5 | PASS |
| Hardware/diagnostic/GPU6 | PASS |
| Hardware/diagnostic/GPU7 | PASS |
| Hardware/diagnostic/summary | PASS |
| Hardware/nvbandwidth/GPU0 | PASS |
| Hardware/nvbandwidth/GPU1 | PASS |
| Hardware/nvbandwidth/GPU2 | PASS |
| Hardware/nvbandwidth/GPU3 | PASS |
| Hardware/nvbandwidth/GPU4 | PASS |
| Hardware/nvbandwidth/GPU5 | PASS |
| Hardware/nvbandwidth/GPU6 | PASS |
| Hardware/nvbandwidth/GPU7 | PASS |
| Hardware/nvbandwidth/summary | PASS |
| Integration/pcie/GPU0 | PASS |
| Integration/pcie/GPU1 | PASS |
| Integration/pcie/GPU2 | PASS |
| Integration/pcie/GPU3 | PASS |
| Integration/pcie/GPU4 | PASS |
| Integration/pcie/GPU5 | PASS |
| Integration/pcie/GPU6 | PASS |
| Integration/pcie/GPU7 | PASS |
| Integration/pcie/summary | PASS |
| Stress/targeted_stress/GPU0 | PASS |
| Stress/targeted_stress/GPU1 | PASS |
| Stress/targeted_stress/GPU2 | PASS |
| Stress/targeted_stress/GPU3 | PASS |
| Stress/targeted_stress/GPU4 | PASS |
| Stress/targeted_stress/GPU5 | PASS |
| Stress/targeted_stress/GPU6 | PASS |
| Stress/targeted_stress/GPU7 | PASS |
| Stress/targeted_stress/summary | PASS |
| Stress/targeted_power/GPU0 | PASS |
| Stress/targeted_power/GPU1 | PASS |
| Stress/targeted_power/GPU2 | PASS |
| Stress/targeted_power/GPU3 | PASS |
| Stress/targeted_power/GPU4 | PASS |
| Stress/targeted_power/GPU5 | PASS |
| Stress/targeted_power/GPU6 | PASS |
| Stress/targeted_power/GPU7 | PASS |
| Stress/targeted_power/summary | PASS |
## NCCL Multi-GPU
Source: nccl-tests | GPUs: 8
| Operation | Bus BW at best size (GB/s) | Threshold | Status (all sizes) |
|-----------|---------------|-----------|--------|
| allreduce | 472.4 | >= 405 | FAIL |
| alltoall | 344.3 | >= 315 | FAIL |
| broadcast | 363.6 | >= 360 | FAIL |
| reducescatter | 353.1 | >= 405 | FAIL |
| allgather | 366.4 | >= 405 | FAIL |
| sendrecv | 368.9 | >= 360 | FAIL |
### NCCL allreduce by size
| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
|------|---------------------|-------|------|--------|-----------|--------|
| 1M | 24.9, 24.4, 24.9 | 24.4 | 24.7 | 0.95% | >= 405 | FAIL |
| 256M | 421.9, 421.1, 421.9 | 421.1 | 421.6 | 0.09% | >= 405 | PASS |
| 2G | 472.6, 472.0, 472.5 | 472.0 | 472.4 | 0.06% | >= 405 | PASS |
### NCCL alltoall by size
| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
|------|---------------------|-------|------|--------|-----------|--------|
| 1M | 7.9, 7.8, 8.1 | 7.8 | 7.9 | 1.57% | >= 315 | FAIL |
| 256M | 298.7, 312.7, 303.2 | 298.7 | 304.9 | 1.91% | >= 315 | FAIL |
| 2G | 342.2, 345.4, 345.2 | 342.2 | 344.3 | 0.43% | >= 315 | PASS |
### NCCL broadcast by size
| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
|------|---------------------|-------|------|--------|-----------|--------|
| 1M | 14.5, 14.3, 14.4 | 14.3 | 14.4 | 0.57% | >= 360 | FAIL |
| 256M | 344.1, 344.3, 344.8 | 344.1 | 344.4 | 0.09% | >= 360 | FAIL |
| 2G | 364.0, 363.6, 363.3 | 363.3 | 363.6 | 0.08% | >= 360 | PASS |
### NCCL reducescatter by size
| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
|------|---------------------|-------|------|--------|-----------|--------|
| 1M | 14.0, 14.2, 14.3 | 14.0 | 14.2 | 0.88% | >= 405 | FAIL |
| 256M | 328.8, 328.7, 328.4 | 328.4 | 328.6 | 0.05% | >= 405 | FAIL |
| 2G | 351.9, 353.8, 353.6 | 351.9 | 353.1 | 0.24% | >= 405 | FAIL |
### NCCL allgather by size
| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
|------|---------------------|-------|------|--------|-----------|--------|
| 1M | 14.4, 13.9, 14.0 | 13.9 | 14.1 | 1.53% | >= 405 | FAIL |
| 256M | 350.2, 350.4, 350.7 | 350.2 | 350.4 | 0.06% | >= 405 | FAIL |
| 2G | 366.9, 366.4, 366.0 | 366.0 | 366.4 | 0.10% | >= 405 | FAIL |
### NCCL sendrecv by size
| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
|------|---------------------|-------|------|--------|-----------|--------|
| 1M | 18.4, 18.3, 18.5 | 18.3 | 18.4 | 0.44% | >= 360 | FAIL |
| 256M | 351.1, 351.4, 351.3 | 351.1 | 351.3 | 0.04% | >= 360 | FAIL |
| 2G | 368.9, 368.8, 368.9 | 368.8 | 368.9 | 0.01% | >= 360 | PASS |
**Overall: FAIL**
## Stress Test
- **Source:** pytorch
- **Duration:** 1800s (requested 1800s)
- **Telemetry samples:** 1295
- **Max temp:** {0: 51.0, 1: 59.0, 2: 61.0, 3: 53.0, 4: 53.0, 5: 62.0, 6: 56.0, 7: 52.0}
- **Avg power:** {0: 698.8, 1: 697.8, 2: 698.1, 3: 697.9, 4: 697.9, 5: 698.2, 6: 698.0, 7: 697.8}
- **Temp delta:** 11.0 C
- **TFLOPS jitter:** 3.4%
- **Steady TFLOPS samples:** 37874
- **Throttle events:** 9944
- **XID events:** 0
- **Failure reasons:**
- GPU temperature delta 11.0C exceeds 5.0C
- non-idle throttle reasons observed in 9944 samples (first: GPU 0 0x4)
- **Result: FAIL**
## RDMA/InfiniBand
### RDMA Port Checks
| Device | Port | State | Rate | Required | Status |
|--------|------|-------|------|----------|--------|
| mlx5_0 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
| mlx5_1 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
| mlx5_4 | 1 | 4: ACTIVE | 100 Gb/sec (2X HDR) | >= 400Gbps ACTIVE | FAIL |
| mlx5_5 | 1 | 4: ACTIVE | 100 Gb/sec (2X HDR) | >= 400Gbps ACTIVE | FAIL |
| mlx5_6 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
| mlx5_7 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
| Test | Value | Threshold | Status |
|------|-------|-----------|--------|
| ib_write_bw | 48.6 GB/s | >= 47 GB/s | PASS |
| ib_read_bw | 40.3 GB/s | >= 47 GB/s | FAIL |
| ib_write_lat | 1.29 us | <= 2 us | PASS |
| ib_read_lat | 2.59 us | <= 3.5 us | PASS |
| ibping | local_loopback target=0x4b count=5 | 0% packet loss | PASS |
- **PFC/ECN/CNP/congestion counters checked:** 146
- **PFC/ECN/CNP/congestion non-zero:** no
- **Failure reasons:**
- mlx5_4 port 1 state/rate failed (4: ACTIVE, 100 Gb/sec (2X HDR); required >= 400.0Gbps ACTIVE)
- mlx5_5 port 1 state/rate failed (4: ACTIVE, 100 Gb/sec (2X HDR); required >= 400.0Gbps ACTIVE)
- ib_read_bw bandwidth 40.29GB/s < 47GB/s
**Overall: FAIL**
## Training Simulation
| Metric | Value |
|--------|-------|
| Model | synthetic_transformer_1.5b |
| Params | 1470.5M |
| Throughput | 216683 tokens/sec |
| Avg Step Time | 75.6 ms |
| Warmup Steps | 5 |
| Peak Memory | 18.1 GB |
| Final Loss | 0.0039 |
| Step Jitter | 1.2% |
| Distributed Mode | ddp |
| Verdict | PASS (216683 tokens/sec) |
---
*Generated by GPU Test Suite v0.2.0*


@ -0,0 +1,101 @@
# H100 single-node test all summary
Generated: 2026-05-23
Scope: single-node `python gpu_tester.py --test all --report --format md` on `aikubeworker0012` and `aikubeworker0016`
Raw reports:
- `reports_test_all_latest_aikubeworker0012_20260522_203246.md`
- `reports_test_all_latest_aikubeworker0016_20260522_203447.md`
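## Overall conclusion
| Host | Suite | PDF acceptance verdict | Main failed items |
|---|---:|---|---|
| aikubeworker0012 | 6/10 PASS | FAIL | Compute, NCCL, Stress, RDMA |
| aikubeworker0016 | 6/10 PASS | FAIL | Compute, NCCL, Stress, RDMA |
Under the PDF criteria, a FAIL on any required sub-item fails the whole node. Neither machine therefore passes production acceptance at this point.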
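The any-FAIL rule can be sketched as follows. This is a minimal illustration, not the suite's actual code: the function name `overall_verdict` and the dictionary layout are assumptions, and the per-suite results are copied from the aikubeworker0012 report summary.

```python
from typing import Dict


def overall_verdict(results: Dict[str, str]) -> str:
    """Return "PASS" only if every required item reports exactly PASS.

    Hypothetical helper: anything else (FAIL, ERROR, or no results at all)
    fails the node, mirroring the any-FAIL rule described in the text.
    """
    if results and all(status == "PASS" for status in results.values()):
        return "PASS"
    return "FAIL"


# Suite-level results as listed in the aikubeworker0012 report summary.
node_0012 = {
    "GPU Info": "PASS",
    "Health Check": "PASS",
    "Memory Bandwidth": "PASS",
    "Compute Throughput": "FAIL",
    "NVLink/NVSwitch": "PASS",
    "DCGM": "PASS",
    "NCCL": "FAIL",
    "Stress Test": "FAIL",
    "RDMA": "FAIL",
    "Training": "PASS",
}

passed = sum(1 for status in node_0012.values() if status == "PASS")
print(f"{passed}/{len(node_0012)} PASS -> {overall_verdict(node_0012)}")
```

Treating an empty result set as FAIL is a design choice in this sketch, so that a run that produced no data cannot be read as a pass.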
## Passing items
| Item | aikubeworker0012 | aikubeworker0016 | Notes |
|---|---|---|---|
| GPU Info | PASS | PASS | 8 H100 GPUs |
| Health | PASS | PASS | Temperature, idle power, ECC, PCIe, and idle throttle are normal |
| Memory Bandwidth | PASS | PASS | D2D efficiency about 108.1% on both |
| NVLink/NVSwitch | PASS | PASS | 18/18 links on all 8 GPUs |
| DCGM diag -r 3 | PASS | PASS | software, memory, diagnostic, nvbandwidth, pcie, targeted stress/power all PASS |
| Training Simulation | PASS | PASS | 8-GPU DDP, synthetic 1.5B model, loss finite |
Training results:
| Host | Throughput | Step jitter | Peak memory | Verdict |
|---|---:|---:|---:|---|
| aikubeworker0012 | 216498 tokens/s | 1.89% | 18.08 GB | PASS |
| aikubeworker0016 | 216683 tokens/s | 1.20% | 18.08 GB | PASS |
## Failing items
### Compute
Neither machine reaches the currently configured absolute H100 TFLOPS thresholds, and the cross-GPU spread exceeds 3% for some dtypes.
| Host | Representative failures |
|---|---|
| aikubeworker0012 | FP16 spread 3.04%; BF16 spread 4.58%; FP64 spread 3.41%; FP32/TF32/FP16/BF16/FP8/FP64/INT8 all FAIL the absolute thresholds |
| aikubeworker0016 | BF16 spread 3.44%; FP64 spread 4.64%; FP32/TF32/FP16/BF16/FP8/FP64/INT8 all FAIL the absolute thresholds |
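The spread figures are consistent with `(max - min) / mean` over the per-GPU TFLOPS values. The sketch below assumes that formula (it is inferred from the reported numbers, not taken from the suite's source) and uses the FP16 per-GPU values from the aikubeworker0012 report; `spread_pct` and `SPREAD_LIMIT_PCT` are illustrative names.

```python
from statistics import mean


def spread_pct(per_gpu_tflops):
    """Cross-GPU spread as (max - min) / mean, in percent (assumed formula)."""
    return (max(per_gpu_tflops) - min(per_gpu_tflops)) / mean(per_gpu_tflops) * 100.0


# FP16 TFLOPS for GPU 0..7 on aikubeworker0012, from the per-GPU table.
fp16_0012 = [688.0, 675.3, 685.7, 679.9, 681.2, 680.8, 681.8, 667.3]

SPREAD_LIMIT_PCT = 3.0  # limit shown in the Compute Consistency table

spread = spread_pct(fp16_0012)
status = "PASS" if spread <= SPREAD_LIMIT_PCT else "FAIL"
print(f"FP16 spread {spread:.2f}% (limit {SPREAD_LIMIT_PCT}%) -> {status}")
```

With these inputs the spread lands just above the 3% limit, which is why FP16 fails on aikubeworker0012 even though every GPU is within about 20 TFLOPS of the others.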
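### NCCL
NCCL now uses real `nccl-tests` bus bandwidth, not the torchrun fallback. The failures come mainly from the small message size, plus several ops that miss the threshold at 256M or 2G.
Best bus BW per op, in GB/s:
| Host | allreduce best | alltoall best | broadcast best | reducescatter best | allgather best | sendrecv best | Verdict |
|---|---:|---:|---:|---:|---:|---:|---|
| aikubeworker0012 | 472.3 | 343.3 | 364.1 | 352.8 | 366.4 | 369.0 | FAIL |
| aikubeworker0016 | 472.4 | 344.3 | 363.6 | 353.1 | 366.4 | 368.9 | FAIL |
Key reasons:
- The `1M` size is far below the threshold for every op.
- `reducescatter` and `allgather` are below the 405 GB/s threshold even at 2G.
- `broadcast` and `sendrecv` are below the 360 GB/s threshold at 256M.
- `alltoall` is below the 315 GB/s threshold at 256M.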
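A per-size check of this kind can be sketched as below. The assumptions are stated plainly: the status is judged on the worst run, the StdDev column is treated as population standard deviation divided by the mean, and an op passes only if every size passes. The names `summarize_runs` and `allreduce_0012` are hypothetical; the run values come from the aikubeworker0012 allreduce table.

```python
from statistics import mean, pstdev


def summarize_runs(runs_gbps, threshold_gbps):
    """Summarize repeated bus-BW runs for one message size.

    Assumptions: PASS/FAIL is judged on the worst run, and variability is
    reported as population stddev relative to the mean, in percent.
    """
    worst = min(runs_gbps)
    avg = mean(runs_gbps)
    return {
        "worst": worst,
        "mean": avg,
        "stddev_pct": pstdev(runs_gbps) / avg * 100.0,
        "status": "PASS" if worst >= threshold_gbps else "FAIL",
    }


# allreduce bus BW runs (GB/s) on aikubeworker0012, keyed by message size.
allreduce_0012 = {
    "1M": [24.9, 25.0, 24.7],
    "256M": [421.6, 421.8, 421.6],
    "2G": [472.8, 472.7, 471.5],
}

per_size = {size: summarize_runs(runs, 405.0) for size, runs in allreduce_0012.items()}
# The op passes only if every message size passes.
op_status = "PASS" if all(r["status"] == "PASS" for r in per_size.values()) else "FAIL"

for size, r in per_size.items():
    print(f"{size}: worst {r['worst']:.1f}, mean {r['mean']:.1f}, "
          f"stddev {r['stddev_pct']:.2f}% -> {r['status']}")
print(f"allreduce overall -> {op_status}")
```

This also shows why allreduce is marked FAIL despite a best figure above 405 GB/s: the 1M size alone fails the op.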
### Stress
The 1800-second PyTorch BF16 GEMM stress test ran for its full duration on both machines, but the telemetry verdict is FAIL.
| Host | Avg steady-state power | Max temp range | Temp delta | TFLOPS jitter | Throttle events | XID | Verdict |
|---|---|---|---:|---:|---:|---:|---|
| aikubeworker0012 | about 697-698 W/GPU | 56-68C | 12C | 4.37% | 9712 | 0 | FAIL |
| aikubeworker0016 | about 698 W/GPU | 51-62C | 11C | 3.40% | 9944 | 0 | FAIL |
Failure reasons:
- The temperature delta across GPUs exceeds the 5C threshold.
- Many non-idle throttle samples were observed; the first reason is `0x4`, i.e. `sw_power_cap`.
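The two telemetry checks can be illustrated with a short sketch. The bit values follow NVML's `nvmlClocksThrottleReason*` constants; the helper names and the idle-bit masking are assumptions about how such a check might be written, not the suite's actual implementation. Temperatures are the per-GPU maxima from the aikubeworker0012 1800-second run.

```python
# Throttle-reason bits as defined by NVML (nvmlClocksThrottleReason*).
THROTTLE_BITS = {
    0x01: "gpu_idle",
    0x02: "applications_clocks_setting",
    0x04: "sw_power_cap",
    0x08: "hw_slowdown",
    0x10: "sync_boost",
    0x20: "sw_thermal_slowdown",
    0x40: "hw_thermal_slowdown",
    0x80: "hw_power_brake_slowdown",
}
IDLE_BIT = 0x01


def decode_throttle(mask_hex):
    """Translate a hex bitmask string into the list of active reason names."""
    mask = int(mask_hex, 16)
    return [name for bit, name in THROTTLE_BITS.items() if mask & bit]


def is_non_idle_throttle(mask_hex):
    """True if any bit other than gpu_idle is set (assumed strict criterion)."""
    return (int(mask_hex, 16) & ~IDLE_BIT) != 0


def temp_delta_c(max_temp_c):
    """Hottest minus coolest GPU, using each GPU's maximum temperature."""
    temps = list(max_temp_c.values())
    return max(temps) - min(temps)


max_temp_0012 = {0: 60.0, 1: 60.0, 2: 68.0, 3: 56.0, 4: 60.0, 5: 68.0, 6: 64.0, 7: 56.0}

print(decode_throttle("0x0000000000000004"))       # ['sw_power_cap']
print(is_non_idle_throttle("0x0000000000000004"))  # True
print(temp_delta_c(max_temp_0012))                 # 12.0
```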
### RDMA/InfiniBand
This `test all` round exercises the single-node RDMA path, and `ibping` reports `local_loopback`. The result cannot stand in for cross-node RDMA acceptance, but it does show that single-node perftest read bandwidth misses the threshold.
| Host | ib_write_bw | ib_read_bw | ib_write_lat | ib_read_lat | Verdict |
|---|---:|---:|---:|---:|---|
| aikubeworker0012 | 49.5 GB/s PASS | 39.1 GB/s FAIL | 1.25 us PASS | 2.60 us PASS | FAIL |
| aikubeworker0016 | 48.6 GB/s PASS | 40.3 GB/s FAIL | 1.29 us PASS | 2.59 us PASS | FAIL |
In addition, on both machines `mlx5_4` and `mlx5_5` are ACTIVE but run at 100 Gb/sec, below the current 400G port threshold, so the RDMA port check also FAILs.
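The port check can be sketched by parsing the state and rate strings exactly as they appear in the report tables. The function name `port_ok` and the regular expression are illustrative assumptions; the 400 Gb/sec minimum is the threshold shown in the report.

```python
import re


def port_ok(state, rate, min_gbps=400.0):
    """Check one port given strings such as "4: ACTIVE" and "400 Gb/sec (4X NDR)".

    Hypothetical helper: a port passes only if it is ACTIVE and its
    advertised rate is at least min_gbps.
    """
    is_active = state.split(":", 1)[-1].strip() == "ACTIVE"
    match = re.match(r"\s*([\d.]+)\s*Gb/sec", rate)
    gbps = float(match.group(1)) if match else 0.0  # unparsable rate counts as 0
    return is_active and gbps >= min_gbps


# Port state and rate per device, as listed in the RDMA Port Checks table.
ports = {
    "mlx5_0": ("4: ACTIVE", "400 Gb/sec (4X NDR)"),
    "mlx5_1": ("4: ACTIVE", "400 Gb/sec (4X NDR)"),
    "mlx5_4": ("4: ACTIVE", "100 Gb/sec (2X HDR)"),
    "mlx5_5": ("4: ACTIVE", "100 Gb/sec (2X HDR)"),
    "mlx5_6": ("4: ACTIVE", "400 Gb/sec (4X NDR)"),
    "mlx5_7": ("4: ACTIVE", "400 Gb/sec (4X NDR)"),
}

failed = [dev for dev, (state, rate) in ports.items() if not port_ok(state, rate)]
print("failed ports:", failed)
```

If the 100 Gb/sec ports are intentionally slower (for example, a separate storage or management fabric), excluding them from the check by device name would be a configuration decision rather than a code change; the report does not say which applies.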
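## Current blockers
1. The compute thresholds are strict: measured absolute TFLOPS misses the configured threshold for every dtype, and the INT8 path in particular reaches only about 100 TFLOPS.
2. Real NCCL bus BW can now be measured, but several op/size combinations miss the PDF thresholds.
3. The stress load runs for the full 30 minutes, but the temperature delta and `sw_power_cap` throttling cause a FAIL.
4. Single-node RDMA read bandwidth misses 47 GB/s, and some IB ports run below 400G.
5. Cross-node RDMA still needs separate server/client reports; this round's `local_loopback` result must not be treated as cross-node acceptance.
## Status assessment
The script's capabilities now largely match the PDF acceptance criteria: real nccl-tests, 30-minute stress telemetry, NVLink, DCGM r3, RDMA perftest/ibping/counters, per-GPU compute, 8-GPU DDP training, and the final any-FAIL-fails-the-node rule all run end to end.
The remaining problems are mainly not gaps in script coverage; the measured acceptance data on both machines misses several thresholds.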


@ -0,0 +1,259 @@
# GPU Test Report
- **Date:** 2026-05-22T18:27:01.103760
- **Host:** aikubeworker0012
- **GPU:** NVIDIA H100 80GB HBM3 x8
- **Driver:** 580.159.03 | **CUDA:** 13.0
## Overall Acceptance Verdict
**Result: FAIL**
Failed or unverified items:
- Compute Throughput: FAIL (worst FP32 52 vs >= 54)
- DCGM: ERROR: dcgmi diag -r 3 timeout after 1200s
- NCCL: FAIL
- Stress Test: FAIL
- RDMA: FAIL
- Training: FAIL (188741 tokens/sec)
## Summary
| Test | Result |
|------|--------|
| GPU Info | PASS (8 GPUs detected) |
| Health Check | PASS |
| Memory Bandwidth | PASS (108.1%) |
| Compute Throughput | FAIL (worst FP32 52 vs >= 54) |
| NVLink/NVSwitch | PASS |
| DCGM | ERROR: dcgmi diag -r 3 timeout after 1200s |
| NCCL | FAIL |
| Stress Test | FAIL |
| RDMA | FAIL |
| Training | FAIL (188741 tokens/sec) |
## GPU Information
| GPU | Model | VRAM | Temp | Power | SM Clock |
|-----|-------|------|------|-------|----------|
| 0 | NVIDIA H100 80GB HBM3 | 81559 MB | 25C | 70/700W | 345 MHz |
| 1 | NVIDIA H100 80GB HBM3 | 81559 MB | 25C | 73/700W | 345 MHz |
| 2 | NVIDIA H100 80GB HBM3 | 81559 MB | 26C | 69/700W | 345 MHz |
| 3 | NVIDIA H100 80GB HBM3 | 81559 MB | 24C | 70/700W | 345 MHz |
| 4 | NVIDIA H100 80GB HBM3 | 81559 MB | 24C | 69/700W | 345 MHz |
| 5 | NVIDIA H100 80GB HBM3 | 81559 MB | 27C | 70/700W | 345 MHz |
| 6 | NVIDIA H100 80GB HBM3 | 81559 MB | 25C | 71/700W | 345 MHz |
| 7 | NVIDIA H100 80GB HBM3 | 81559 MB | 24C | 72/700W | 345 MHz |
## Health Check
**Overall: PASS**
| GPU | Temp | Power | ECC | PCIe | Throttle | Status |
|-----|------|-------|-----|------|----------|--------|
| 0 | 25C PASS | 70W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
| 1 | 25C PASS | 73W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
| 2 | 26C PASS | 69W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
| 3 | 24C PASS | 70W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
| 4 | 24C PASS | 69W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
| 5 | 27C PASS | 70W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
| 6 | 25C PASS | 71W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
| 7 | 24C PASS | 72W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
## Memory Bandwidth
Source: nvbandwidth
| Metric | Value | Peak | Efficiency |
|--------|-------|------|------------|
| H2D (PCIe) | 55.5 GB/s | 64 GB/s | 86.7% |
| D2H (PCIe) | 54.3 GB/s | 64 GB/s | 84.8% |
| D2D (NVLink) | 486.6 GB/s | 450 GB/s | 108.1% |
**Verdict: PASS** (D2D efficiency 108.1%)
## Compute Throughput
| DType | Achieved (TFLOPS) | Peak | Threshold | Status |
|-------|-------------------|------|------------|--------|
| FP32 | 52.0 | 67 | >= 54 | FAIL |
| TF32 | 364.8 | 495 | >= 444 | FAIL |
| FP16 | 685.0 | 990 | >= 734 | FAIL |
| BF16 | 715.9 | 990 | >= 745 | FAIL |
| FP8 | 1166.6 | 1979 | >= 1400 | FAIL |
| FP64 | 46.9 | 0 | >= 63 | FAIL |
| INT8 | 100.4 | 0 | >= 1536 | FAIL |
**Verdict: FAIL** (absolute TFLOPS thresholds; worst efficiency 58.9%)
### Compute Consistency
| DType | Min | Mean | Max | Spread | Limit | Status |
|-------|-----|------|-----|--------|-------|--------|
| FP32 | 51.9 | 52.0 | 52.2 | 0.58% | <= 3% | PASS |
| TF32 | 360.9 | 364.9 | 368.2 | 2.00% | <= 3% | PASS |
| FP16 | 676.0 | 685.0 | 689.9 | 2.03% | <= 3% | PASS |
| BF16 | 697.3 | 715.9 | 730.2 | 4.60% | <= 3% | FAIL |
| FP8 | 1141.8 | 1166.6 | 1180.3 | 3.30% | <= 3% | FAIL |
| FP64 | 45.8 | 46.9 | 47.7 | 4.05% | <= 3% | FAIL |
| INT8 | 100.4 | 100.4 | 100.4 | 0.00% | <= 3% | PASS |
### Compute Per-GPU TFLOPS
| GPU | FP32 | TF32 | FP16 | BF16 | FP8 | FP64 | INT8 |
|---|---|---|---|---|---|---|---|
| 0 | 51.9 | 368.2 | 689.5 | 730.2 | 1180.3 | 47.1 | 100.4 |
| 1 | 51.9 | 366.8 | 688.7 | 721.6 | 1170.1 | 47.7 | 100.4 |
| 2 | 51.9 | 366.3 | 689.9 | 711.3 | 1167.8 | 47.2 | 100.4 |
| 3 | 51.9 | 363.0 | 677.6 | 699.2 | 1176.3 | 46.6 | 100.4 |
| 4 | 52.2 | 365.3 | 685.0 | 725.4 | 1163.0 | 46.8 | 100.4 |
| 5 | 52.1 | 363.9 | 684.2 | 725.0 | 1172.1 | 46.9 | 100.4 |
| 6 | 51.9 | 364.4 | 688.8 | 717.3 | 1161.2 | 46.9 | 100.4 |
| 7 | 51.9 | 360.9 | 676.0 | 697.3 | 1141.8 | 45.8 | 100.4 |
## NVLink/NVSwitch
**Overall: PASS**
| GPU | Active Links | Issues |
|-----|--------------|--------|
| 0 | 18/18 | OK |
| 1 | 18/18 | OK |
| 2 | 18/18 | OK |
| 3 | 18/18 | OK |
| 4 | 18/18 | OK |
| 5 | 18/18 | OK |
| 6 | 18/18 | OK |
| 7 | 18/18 | OK |
## DCGM Diagnostic
**Overall: FAIL** (dcgmi diag -r 3 timeout after 1200s)
## NCCL Multi-GPU
Source: nccl-tests | GPUs: 8
| Operation | Bus BW (GB/s) | Threshold | Status |
|-----------|---------------|-----------|--------|
| allreduce | 472.4 | >= 405 | FAIL |
| alltoall | 344.4 | >= 315 | FAIL |
| broadcast | 363.8 | >= 360 | FAIL |
| reducescatter | 353.0 | >= 405 | FAIL |
| allgather | 366.4 | >= 405 | FAIL |
| sendrecv | 368.9 | >= 360 | FAIL |
### NCCL allreduce by size
| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
|------|---------------------|-------|------|--------|-----------|--------|
| 1M | 24.0, 24.9, 24.7 | 24.0 | 24.5 | 1.57% | >= 405 | FAIL |
| 256M | 421.4, 421.7, 421.4 | 421.4 | 421.5 | 0.03% | >= 405 | PASS |
| 2G | 471.8, 473.0, 472.3 | 471.8 | 472.4 | 0.10% | >= 405 | PASS |
### NCCL alltoall by size
| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
|------|---------------------|-------|------|--------|-----------|--------|
| 1M | 8.1, 8.0, 8.0 | 8.0 | 8.0 | 0.59% | >= 315 | FAIL |
| 256M | 312.3, 310.9, 319.2 | 310.9 | 314.1 | 1.15% | >= 315 | FAIL |
| 2G | 343.1, 346.2, 344.0 | 343.1 | 344.4 | 0.38% | >= 315 | PASS |
### NCCL broadcast by size
| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
|------|---------------------|-------|------|--------|-----------|--------|
| 1M | 14.6, 13.6, 14.5 | 13.6 | 14.2 | 3.16% | >= 360 | FAIL |
| 256M | 343.8, 344.2, 344.5 | 343.8 | 344.2 | 0.08% | >= 360 | FAIL |
| 2G | 363.5, 363.3, 364.7 | 363.3 | 363.8 | 0.17% | >= 360 | PASS |
### NCCL reducescatter by size
| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
|------|---------------------|-------|------|--------|-----------|--------|
| 1M | 14.1, 14.3, 14.3 | 14.1 | 14.2 | 0.66% | >= 405 | FAIL |
| 256M | 328.1, 328.3, 328.3 | 328.1 | 328.2 | 0.03% | >= 405 | FAIL |
| 2G | 354.0, 352.6, 352.3 | 352.3 | 353.0 | 0.21% | >= 405 | FAIL |
### NCCL allgather by size
| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
|------|---------------------|-------|------|--------|-----------|--------|
| 1M | 14.5, 14.5, 14.3 | 14.3 | 14.4 | 0.65% | >= 405 | FAIL |
| 256M | 350.7, 350.7, 350.5 | 350.5 | 350.6 | 0.03% | >= 405 | FAIL |
| 2G | 366.6, 366.3, 366.3 | 366.3 | 366.4 | 0.04% | >= 405 | FAIL |
### NCCL sendrecv by size
| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
|------|---------------------|-------|------|--------|-----------|--------|
| 1M | 18.5, 18.4, 18.1 | 18.1 | 18.3 | 0.93% | >= 360 | FAIL |
| 256M | 352.3, 350.6, 350.5 | 350.5 | 351.1 | 0.24% | >= 360 | FAIL |
| 2G | 368.8, 369.0, 368.8 | 368.8 | 368.9 | 0.03% | >= 360 | PASS |
**Overall: FAIL**
## Stress Test
- **Source:** pytorch
- **Duration:** 1800s (requested 1800s)
- **Telemetry samples:** 1541
- **Max temp:** {0: 60.0, 1: 60.0, 2: 68.0, 3: 56.0, 4: 60.0, 5: 68.0, 6: 65.0, 7: 56.0}
- **Avg power:** {0: 697.7, 1: 697.4, 2: 697.2, 3: 697.7, 4: 697.5, 5: 698.0, 6: 697.8, 7: 698.4}
- **Temp delta:** 12.0 C
- **TFLOPS jitter:** 3.16%
- **Steady TFLOPS samples:** 37676
- **Throttle events:** 11912
- **XID events:** 0
- **Failure reasons:**
- GPU temperature delta 12.0C exceeds 5.0C
- non-idle throttle reasons observed in 11912 samples (first: GPU 0 0x4)
- **Result: FAIL**
## RDMA/InfiniBand
### RDMA Port Checks
| Device | Port | State | Rate | Required | Status |
|--------|------|-------|------|----------|--------|
| mlx5_0 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
| mlx5_1 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
| mlx5_4 | 1 | 4: ACTIVE | 100 Gb/sec (2X HDR) | >= 400Gbps ACTIVE | FAIL |
| mlx5_5 | 1 | 4: ACTIVE | 100 Gb/sec (2X HDR) | >= 400Gbps ACTIVE | FAIL |
| mlx5_6 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
| mlx5_7 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
| Test | Value | Threshold | Status |
|------|-------|-----------|--------|
| ib_write_bw | 49.2 GB/s | >= 47 GB/s | PASS |
| ib_read_bw | 39.1 GB/s | >= 47 GB/s | FAIL |
| ib_write_lat | 5.68 us | <= 2 us | FAIL |
| ib_read_lat | 16.00 us | <= 3.5 us | FAIL |
| ibping | target=0x58 count=5 | 0% packet loss | PASS |
- **PFC/ECN/CNP/congestion counters checked:** 0
- **PFC/ECN/CNP/congestion non-zero:** no
- **Failure reasons:**
- mlx5_4 port 1 state/rate failed (4: ACTIVE, 100 Gb/sec (2X HDR); required >= 400.0Gbps ACTIVE)
- mlx5_5 port 1 state/rate failed (4: ACTIVE, 100 Gb/sec (2X HDR); required >= 400.0Gbps ACTIVE)
- ib_read_bw bandwidth 39.11GB/s < 47GB/s
- ib_write_lat latency 5.68us > 2.0us
- ib_read_lat latency 16.0us > 3.5us
**Overall: FAIL**
## Training Simulation
| Metric | Value |
|--------|-------|
| Model | synthetic_transformer_1.5b |
| Params | 1470.5M |
| Throughput | 188741 tokens/sec |
| Avg Step Time | 86.8 ms |
| Peak Memory | 18.1 GB |
| Final Loss | 0.0041 |
| Step Jitter | 626.74% |
| Distributed Mode | ddp |
| Verdict | FAIL (188741 tokens/sec) |
---
*Generated by GPU Test Suite v0.2.0*


@ -0,0 +1,259 @@
# GPU Test Report
- **Date:** 2026-05-22T18:29:01.245683
- **Host:** aikubeworker0016
- **GPU:** NVIDIA H100 80GB HBM3 x8
- **Driver:** 580.159.03 | **CUDA:** 13.0
## Overall Acceptance Verdict
**Result: FAIL**
Failed or unverified items:
- Compute Throughput: FAIL (worst FP32 52 vs >= 54)
- DCGM: ERROR: dcgmi diag -r 3 timeout after 1200s
- NCCL: FAIL
- Stress Test: FAIL
- RDMA: FAIL
- Training: FAIL (193836 tokens/sec)
## Summary
| Test | Result |
|------|--------|
| GPU Info | PASS (8 GPUs detected) |
| Health Check | PASS |
| Memory Bandwidth | PASS (108.1%) |
| Compute Throughput | FAIL (worst FP32 52 vs >= 54) |
| NVLink/NVSwitch | PASS |
| DCGM | ERROR: dcgmi diag -r 3 timeout after 1200s |
| NCCL | FAIL |
| Stress Test | FAIL |
| RDMA | FAIL |
| Training | FAIL (193836 tokens/sec) |
## GPU Information
| GPU | Model | VRAM | Temp | Power | SM Clock |
|-----|-------|------|------|-------|----------|
| 0 | NVIDIA H100 80GB HBM3 | 81559 MB | 19C | 70/700W | 345 MHz |
| 1 | NVIDIA H100 80GB HBM3 | 81559 MB | 20C | 67/700W | 345 MHz |
| 2 | NVIDIA H100 80GB HBM3 | 81559 MB | 20C | 67/700W | 345 MHz |
| 3 | NVIDIA H100 80GB HBM3 | 81559 MB | 19C | 67/700W | 345 MHz |
| 4 | NVIDIA H100 80GB HBM3 | 81559 MB | 19C | 67/700W | 345 MHz |
| 5 | NVIDIA H100 80GB HBM3 | 81559 MB | 21C | 69/700W | 345 MHz |
| 6 | NVIDIA H100 80GB HBM3 | 81559 MB | 19C | 68/700W | 345 MHz |
| 7 | NVIDIA H100 80GB HBM3 | 81559 MB | 19C | 66/700W | 345 MHz |
## Health Check
**Overall: PASS**
| GPU | Temp | Power | ECC | PCIe | Throttle | Status |
|-----|------|-------|-----|------|----------|--------|
| 0 | 19C PASS | 70W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
| 1 | 20C PASS | 67W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
| 2 | 20C PASS | 67W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
| 3 | 19C PASS | 67W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
| 4 | 19C PASS | 67W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
| 5 | 21C PASS | 69W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
| 6 | 19C PASS | 68W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
| 7 | 19C PASS | 66W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
## Memory Bandwidth
Source: nvbandwidth
| Metric | Value | Peak | Efficiency |
|--------|-------|------|------------|
| H2D (PCIe) | 55.5 GB/s | 64 GB/s | 86.7% |
| D2H (PCIe) | 54.7 GB/s | 64 GB/s | 85.5% |
| D2D (NVLink) | 486.6 GB/s | 450 GB/s | 108.1% |
**Verdict: PASS** (D2D efficiency 108.1%)
## Compute Throughput
| DType | Achieved (TFLOPS) | Peak | Threshold | Status |
|-------|-------------------|------|------------|--------|
| FP32 | 52.0 | 67 | >= 54 | FAIL |
| TF32 | 366.2 | 495 | >= 444 | FAIL |
| FP16 | 684.8 | 990 | >= 734 | FAIL |
| BF16 | 720.7 | 990 | >= 745 | FAIL |
| FP8 | 1180.3 | 1979 | >= 1400 | FAIL |
| FP64 | 47.3 | 0 | >= 63 | FAIL |
| INT8 | 100.5 | 0 | >= 1536 | FAIL |
**Verdict: FAIL** (absolute TFLOPS thresholds; worst efficiency 59.6%)
### Compute Consistency
| DType | Min | Mean | Max | Spread | Limit | Status |
|-------|-----|------|-----|--------|-------|--------|
| FP32 | 51.9 | 52.0 | 52.2 | 0.58% | <= 3% | PASS |
| TF32 | 361.1 | 366.2 | 368.9 | 2.13% | <= 3% | PASS |
| FP16 | 672.6 | 684.8 | 695.0 | 3.27% | <= 3% | FAIL |
| BF16 | 703.6 | 720.7 | 734.2 | 4.25% | <= 3% | FAIL |
| FP8 | 1158.6 | 1180.3 | 1241.8 | 7.05% | <= 3% | FAIL |
| FP64 | 46.7 | 47.3 | 48.0 | 2.75% | <= 3% | PASS |
| INT8 | 100.4 | 100.5 | 101.1 | 0.70% | <= 3% | PASS |
### Compute Per-GPU TFLOPS
| GPU | FP32 | TF32 | FP16 | BF16 | FP8 | FP64 | INT8 |
|---|---|---|---|---|---|---|---|
| 0 | 51.9 | 361.1 | 673.3 | 703.6 | 1158.6 | 46.7 | 100.4 |
| 1 | 52.0 | 367.0 | 684.0 | 725.7 | 1184.3 | 47.3 | 100.4 |
| 2 | 52.2 | 368.7 | 695.0 | 734.2 | 1197.7 | 48.0 | 100.4 |
| 3 | 51.9 | 367.8 | 688.0 | 708.1 | 1174.8 | 47.3 | 100.4 |
| 4 | 52.0 | 365.2 | 688.4 | 718.2 | 1160.5 | 47.0 | 101.1 |
| 5 | 52.1 | 368.9 | 684.2 | 733.7 | 1160.5 | 47.3 | 100.4 |
| 6 | 51.9 | 364.0 | 672.6 | 715.6 | 1164.4 | 47.1 | 100.4 |
| 7 | 51.9 | 367.0 | 692.5 | 726.5 | 1241.8 | 47.6 | 100.4 |
## NVLink/NVSwitch
**Overall: PASS**
| GPU | Active Links | Issues |
|-----|--------------|--------|
| 0 | 18/18 | OK |
| 1 | 18/18 | OK |
| 2 | 18/18 | OK |
| 3 | 18/18 | OK |
| 4 | 18/18 | OK |
| 5 | 18/18 | OK |
| 6 | 18/18 | OK |
| 7 | 18/18 | OK |
## DCGM Diagnostic
**Overall: FAIL** (dcgmi diag -r 3 timeout after 1200s)
## NCCL Multi-GPU
Source: nccl-tests | GPUs: 8
| Operation | Bus BW (GB/s) | Threshold | Status |
|-----------|---------------|-----------|--------|
| allreduce | 472.5 | >= 405 | FAIL |
| alltoall | 344.2 | >= 315 | FAIL |
| broadcast | 363.8 | >= 360 | FAIL |
| reducescatter | 352.5 | >= 405 | FAIL |
| allgather | 366.8 | >= 405 | FAIL |
| sendrecv | 369.0 | >= 360 | FAIL |
### NCCL allreduce by size
| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
|------|---------------------|-------|------|--------|-----------|--------|
| 1M | 24.7, 24.1, 24.5 | 24.1 | 24.4 | 1.02% | >= 405 | FAIL |
| 256M | 421.8, 422.1, 421.4 | 421.4 | 421.8 | 0.07% | >= 405 | PASS |
| 2G | 472.8, 472.2, 472.6 | 472.2 | 472.5 | 0.05% | >= 405 | PASS |
### NCCL alltoall by size
| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
|------|---------------------|-------|------|--------|-----------|--------|
| 1M | 8.0, 8.0, 7.9 | 7.9 | 8.0 | 0.59% | >= 315 | FAIL |
| 256M | 326.8, 315.4, 315.8 | 315.4 | 319.3 | 1.65% | >= 315 | PASS |
| 2G | 344.2, 343.8, 344.6 | 343.8 | 344.2 | 0.09% | >= 315 | PASS |
### NCCL broadcast by size
| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
|------|---------------------|-------|------|--------|-----------|--------|
| 1M | 14.4, 14.2, 14.1 | 14.1 | 14.2 | 0.88% | >= 360 | FAIL |
| 256M | 345.3, 344.9, 344.4 | 344.4 | 344.9 | 0.11% | >= 360 | FAIL |
| 2G | 363.6, 363.9, 363.8 | 363.6 | 363.8 | 0.03% | >= 360 | PASS |
### NCCL reducescatter by size
| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
|------|---------------------|-------|------|--------|-----------|--------|
| 1M | 14.3, 14.1, 14.1 | 14.1 | 14.2 | 0.67% | >= 405 | FAIL |
| 256M | 328.2, 328.3, 328.4 | 328.2 | 328.3 | 0.02% | >= 405 | FAIL |
| 2G | 352.2, 352.7, 352.6 | 352.2 | 352.5 | 0.06% | >= 405 | FAIL |
### NCCL allgather by size
| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
|------|---------------------|-------|------|--------|-----------|--------|
| 1M | 14.2, 14.5, 14.3 | 14.2 | 14.3 | 0.87% | >= 405 | FAIL |
| 256M | 350.6, 350.6, 350.5 | 350.5 | 350.6 | 0.01% | >= 405 | FAIL |
| 2G | 367.0, 366.8, 366.5 | 366.5 | 366.8 | 0.06% | >= 405 | FAIL |
### NCCL sendrecv by size
| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
|------|---------------------|-------|------|--------|-----------|--------|
| 1M | 18.4, 18.2, 18.6 | 18.2 | 18.4 | 0.89% | >= 360 | FAIL |
| 256M | 350.7, 350.8, 351.1 | 350.7 | 350.9 | 0.05% | >= 360 | FAIL |
| 2G | 369.0, 369.0, 368.9 | 368.9 | 369.0 | 0.01% | >= 360 | PASS |
**Overall: FAIL**
## Stress Test
- **Source:** pytorch
- **Duration:** 1800s (requested 1800s)
- **Telemetry samples:** 1541
- **Max temp:** {0: 51.0, 1: 59.0, 2: 62.0, 3: 53.0, 4: 53.0, 5: 62.0, 6: 57.0, 7: 53.0}
- **Avg power:** {0: 698.7, 1: 698.0, 2: 698.1, 3: 697.9, 4: 697.7, 5: 698.2, 6: 698.0, 7: 697.7}
- **Temp delta:** 11.0 C
- **TFLOPS jitter:** 3.05%
- **Steady TFLOPS samples:** 37841
- **Throttle events:** 11912
- **XID events:** 0
- **Failure reasons:**
- GPU temperature delta 11.0C exceeds 5.0C
- non-idle throttle reasons observed in 11912 samples (first: GPU 0 0x4)
- **Result: FAIL**
## RDMA/InfiniBand
### RDMA Port Checks
| Device | Port | State | Rate | Required | Status |
|--------|------|-------|------|----------|--------|
| mlx5_0 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
| mlx5_1 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
| mlx5_4 | 1 | 4: ACTIVE | 100 Gb/sec (2X HDR) | >= 400Gbps ACTIVE | FAIL |
| mlx5_5 | 1 | 4: ACTIVE | 100 Gb/sec (2X HDR) | >= 400Gbps ACTIVE | FAIL |
| mlx5_6 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
| mlx5_7 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
| Test | Value | Threshold | Status |
|------|-------|-----------|--------|
| ib_write_bw | 48.4 GB/s | >= 47 GB/s | PASS |
| ib_read_bw | 40.3 GB/s | >= 47 GB/s | FAIL |
| ib_write_lat | 2.44 us | <= 2 us | FAIL |
| ib_read_lat | 16.00 us | <= 3.5 us | FAIL |
| ibping | target=0x4b count=5 | 0% packet loss | PASS |
- **PFC/ECN/CNP/congestion counters checked:** 0
- **PFC/ECN/CNP/congestion non-zero:** no
- **Failure reasons:**
- mlx5_4 port 1 state/rate failed (4: ACTIVE, 100 Gb/sec (2X HDR); required >= 400.0Gbps ACTIVE)
- mlx5_5 port 1 state/rate failed (4: ACTIVE, 100 Gb/sec (2X HDR); required >= 400.0Gbps ACTIVE)
- ib_read_bw bandwidth 40.29GB/s < 47GB/s
- ib_write_lat latency 2.44us > 2.0us
- ib_read_lat latency 16.0us > 3.5us
**Overall: FAIL**
## Training Simulation
| Metric | Value |
|--------|-------|
| Model | synthetic_transformer_1.5b |
| Params | 1470.5M |
| Throughput | 193836 tokens/sec |
| Avg Step Time | 84.5 ms |
| Peak Memory | 18.1 GB |
| Final Loss | 0.004 |
| Step Jitter | 521.24% |
| Distributed Mode | ddp |
| Verdict | FAIL (193836 tokens/sec) |
---
*Generated by GPU Test Suite v0.2.0*


@ -0,0 +1,43 @@
# GPU Test Report
- **Date:** 2026-05-22T19:46:07.450315
- **Host:** aikubeworker0012
## Overall Acceptance Verdict
**Result: FAIL**
Missing required evidence:
- GPU Info
- Health Check
- Memory Bandwidth
- Compute Throughput
- NVLink/NVSwitch
- NCCL
- Stress Test
- RDMA
- DCGM
## Summary
| Test | Result |
|------|--------|
| Training | PASS (216654 tokens/sec) |
## Training Simulation
| Metric | Value |
|--------|-------|
| Model | synthetic_transformer_1.5b |
| Params | 1470.5M |
| Throughput | 216654 tokens/sec |
| Avg Step Time | 75.6 ms |
| Warmup Steps | 5 |
| Peak Memory | 18.1 GB |
| Final Loss | 0.0039 |
| Step Jitter | 0.87% |
| Distributed Mode | ddp |
| Verdict | PASS (216654 tokens/sec) |
---
*Generated by GPU Test Suite v0.2.0*


@ -0,0 +1,43 @@
# GPU Test Report
- **Date:** 2026-05-22T19:46:48.023650
- **Host:** aikubeworker0016
## Overall Acceptance Verdict
**Result: FAIL**
Missing required evidence:
- GPU Info
- Health Check
- Memory Bandwidth
- Compute Throughput
- NVLink/NVSwitch
- NCCL
- Stress Test
- RDMA
- DCGM
## Summary
| Test | Result |
|------|--------|
| Training | PASS (217236 tokens/sec) |
## Training Simulation
| Metric | Value |
|--------|-------|
| Model | synthetic_transformer_1.5b |
| Params | 1470.5M |
| Throughput | 217236 tokens/sec |
| Avg Step Time | 75.4 ms |
| Warmup Steps | 5 |
| Peak Memory | 18.1 GB |
| Final Loss | 0.0039 |
| Step Jitter | 1.23% |
| Distributed Mode | ddp |
| Verdict | PASS (217236 tokens/sec) |
---
*Generated by GPU Test Suite v0.2.0*


@ -0,0 +1,73 @@
# aikubeworker0016 `test all` Results and H100 Acceptance Gaps
Test command:
```bash
/root/gpu-test-venv/bin/python gpu_tester.py --test all --report --format json --output reports_all/test_all.json
```
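Test machine: `aikubeworker0016 / 172.72.8.16`
Raw results: `reports_all_aikubeworker0016.json`
## Conclusion First
The project output ends with `Suite complete: 8/8 tests passed`, but that line cannot be taken as a production acceptance PASS.
The reason is that the current `all` summary logic mainly checks whether a module raised an `error`; it does not treat `nccl.passed=false` or `rdma.passed=false` as a suite failure. By the PDF's production acceptance criteria, this machine therefore cannot yet be considered fully accepted.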
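
A stricter aggregation rule could look like the sketch below. It is a hedged illustration, not the repository's code: the shape of the per-module result dictionary (`passed`, `error` keys), the module names, and `overall_verdict` itself are all assumptions. The rule it implements is that a module counts as failed when it is missing, reports an error, or reports anything other than `passed=True`.

```python
def overall_verdict(results, required):
    """Return ("PASS" | "FAIL", problems) for a set of module results.

    `results` maps module name -> dict with optional "passed" and "error"
    keys (a hypothetical shape chosen for this sketch).
    """
    problems = []
    for name in required:
        module = results.get(name)
        if module is None:
            problems.append(f"{name}: missing evidence")
        elif module.get("error"):
            problems.append(f"{name}: ERROR: {module['error']}")
        elif module.get("passed") is not True:
            problems.append(f"{name}: FAIL")
    return ("PASS" if not problems else "FAIL"), problems


# Hypothetical module list and results mirroring the situation described above:
# no module raised an error, but NCCL and RDMA reported passed=false.
required_modules = ["health", "memory", "compute", "nccl", "stress", "rdma", "training"]
results = {
    "health": {"passed": True},
    "memory": {"passed": True},
    "compute": {"passed": True},
    "nccl": {"passed": False},
    "stress": {"passed": True},
    "rdma": {"passed": False},
    "training": {"passed": True},
}

verdict, problems = overall_verdict(results, required_modules)
print(verdict)
for line in problems:
    print(" -", line)
```

Checking `passed is not True` rather than `passed is False` means a module that forgets to set the flag is treated as unverified instead of silently passing.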
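## Actual Results of This `test all` Run
| Module | Current result | Key data | Against PDF acceptance |
| --- | --- | --- | --- |
| GPU info | Covered | 8x H100, Driver 580.159.03, CUDA 13.0 | Basic info OK, but the dedicated NVLink link check is insufficient |
| Health check | PASS | health.passed=true | Basic health OK, but missing retired pages, AER/Replay, fabricmanager logs, and sampling during stress |
| Memory | Has results | H2D 55.5 GB/s, D2H 55.3 GB/s, D2D 486.5 GB/s | Individual numbers look good, but the 8x8 P2P matrix acceptance is missing |
| Compute | Has results | FP32 51.9, TF32 357.0, FP16 664.0, BF16 700.1, FP8 1116.2 TFLOPS | Does not pass all of the PDF's absolute thresholds |
| NCCL | Effectively failing | source=torchrun_fallback, `nccl.passed=false`, no bus BW performance data | Does not meet the PDF NCCL performance acceptance |
| Stress | PASS | PyTorch fallback, 60 s, all 8 GPUs PASS | Does not meet the PDF's 30/60-minute burn-in; the load is only about 64MB per GPU, which is clearly too light |
| RDMA/IB | Effectively failing | ib_write_bw/read_bw 0.13 GB/s WARN, write_lat 4.10us PASS, read_lat 16us WARN | Currently a localhost single-node setup; does not meet the PDF RDMA production acceptance |
| Training | Has results | synthetic 1.47B, 52471 tokens/s, peak 27.31GB, loss 0.0041 | tokens/s clears the bar, but the code is not actually an 8-GPU distributed training acceptance |
## Compute Against the PDF Thresholds
PDF H100 PASS thresholds:
| DType | This run | PDF PASS threshold | Verdict |
| --- | ---: | ---: | --- |
| FP32 | 51.9 TFLOPS | >= 54 | WARN |
| TF32 | 357.0 TFLOPS | >= 444 | FAIL |
| FP16 | 664.0 TFLOPS | >= 734 | WARN |
| BF16 | 700.1 TFLOPS | >= 745 | WARN |
| FP8 | 1116.2 TFLOPS | >= 1400 | FAIL |
| FP64 | Not tested | >= 63 | Missing |
| INT8 | Not tested | >= 1536 | Missing |
Note: in the PDF, the WARN band is 90%-100% of the PASS threshold. TF32 and FP8 are below 90% of the threshold, so per the PDF they are FAIL.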
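
The three-way classification can be written out directly. This is a minimal sketch assuming the WARN band is the half-open range from 90% up to (but not including) 100% of the PASS threshold; how the PDF treats a value at exactly 90% is not stated here, so the inclusive lower bound is an assumption, as are the names `classify` and `warn_fraction`.

```python
def classify(achieved, pass_threshold, warn_fraction=0.90):
    """Classify a measured value against a PASS threshold with a WARN band."""
    if achieved >= pass_threshold:
        return "PASS"
    if achieved >= pass_threshold * warn_fraction:
        return "WARN"
    return "FAIL"


# (measured TFLOPS, PDF PASS threshold) taken from the table above.
measurements = {
    "FP32": (51.9, 54),
    "TF32": (357.0, 444),
    "FP16": (664.0, 734),
    "BF16": (700.1, 745),
    "FP8": (1116.2, 1400),
}

verdicts = {dtype: classify(value, thr) for dtype, (value, thr) in measurements.items()}
for dtype, (value, thr) in measurements.items():
    print(f"{dtype}: {value / thr:.1%} of threshold -> {verdicts[dtype]}")
```

For example, TF32 at 357.0 TFLOPS is about 80% of the 444 TFLOPS threshold, which is why it lands in FAIL rather than WARN.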
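## What Is Missing If Only the Current Repository's `test all` Is Run
1. No dedicated NVLink acceptance: no per-GPU check of the 18 links, the 25GB/s rate, or CRC/Replay/Recovery error = 0.
2. No DCGM diagnostic: `dcgmi diag -r 3` is not run.
3. No long burn-in: currently 60 seconds, not 30/60 minutes.
4. No 1-second sampling during stress: temperature, power, throttle, XID, and TFLOPS jitter are not collected as the PDF requires.
5. No real NCCL performance: the run currently degrades to the torchrun fallback, with no `nccl-tests` bus BW.
6. No full NCCL op set or three message sizes: the PDF requires AllReduce/AllGather/ReduceScatter/Broadcast/SendRecv/AllToAll, each passing at 1MB/256MB/2GB.
7. No NCCL 3x repetition that takes the worst value and requires standard deviation <=3%.
8. No full 8x8 P2P matrix: no off-diagonal mean, minimum, or deviation check.
9. No per-GPU compute consistency: the 8 GPUs are not measured separately to check that range/mean for the same dtype is <=3%.
10. No FP64 or INT8.
11. No production RDMA parameters: currently `localhost`, 64KB messages, and a 10us threshold; the PDF requires 4MB for BW, 8B for latency, write/read >=47GB/s, write_lat <=2us, and read_lat <=3.5us.
12. No PFC/ECN error counters and no bidirectional ibping.
13. No real 8-GPU distributed Training Simulation acceptance.
14. No strict final verdict: the current code counts modules with `passed=false` as "passed", which is a hole in the acceptance logic.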
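## Recommendations
`test all` can continue to serve as a quick first-pass screen, but if the goal is to align with `H100_production_acceptance.pdf`, it needs to be upgraded into a "production acceptance mode". In priority order:
1. Fix the summary verdict first: any submodule with `passed=false` must make the whole machine FAIL.
2. Install `nccl-tests` and `gpu-burn` first; otherwise neither NCCL nor Stress is production-grade.
3. Add NVLink, DCGM, long-duration telemetry, and the P2P matrix.
4. Switch RDMA to production parameters, with cross-node support.
5. Change compute/training to per-GPU and 8-GPU distributed acceptance.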