Add H100 acceptance test coverage and reports

2026-05-23 10:41:09 +08:00 · 2026-05-23 10:41:09 +08:00 · 86f15544d7
commit 86f15544d7
parent dd77a882f1
44 changed files with 6938 additions and 190 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -15,3 +15,4 @@ reports/
 venv/
 .qoder/*
 .claude/settings.local.json
+.omx/
--- a/H100_test_all_vs_PDF_覆盖对比.md
+++ b/H100_test_all_vs_PDF_覆盖对比.md
@@ -0,0 +1,85 @@
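+# H100 PDF acceptance items vs. current `test all` coverage
+
+Compared:
+
+- PDF: `/Users/d-robotics/Downloads/H100_production_acceptance.pdf`
+- Current script: `python gpu_tester.py --config configs/default.yaml --test all --report --format md`
+- Scope: a single node with 8 H100 GPUs. Cross-node NCCL/RDMA is out of scope for this round.
+
+## Conclusion
+
+The current `test all` has grown from a "functional inspection" into a single-node suite that is "close to production acceptance": GPU health, NVLink/NVSwitch, HBM/PCIe/NVLink bandwidth, compute, NCCL, stress, local RDMA ports, DCGM, and training simulation all run under the same `all`.
+
+The latest stress smoke runs confirm that the PyTorch BF16 GEMM load drives both machines into the power range the PDF requires:
+
+- `aikubeworker0012`: 45-second smoke run, steady-state average power about `697-698W/GPU`, TFLOPS jitter `4.07%`, XID `0`, but temperature delta `12C` and `clocks_throttle_reasons.active=0x4`, which is a strict FAIL under the PDF.
+- `aikubeworker0016`: 45-second smoke run, steady-state average power about `697-699W/GPU`, TFLOPS jitter `3.77%`, XID `0`, but temperature delta `8C` and `clocks_throttle_reasons.active=0x4`, which is a strict FAIL under the PDF.
+
+In other words, the blocker is no longer "the script cannot saturate the H100". Under full-power stress the machines miss two strict PDF gates: `temperature delta <=5C` and `Throttle Reasons 0x0 for the whole run`.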
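The gates discussed above can be sketched as a small steady-state check. This is a minimal illustration, not the logic in `gpu_tester.py`: the function name `evaluate_stress`, the sample tuple layout, the use of per-GPU mean temperature, and the jitter definition `(max - min) / mean` are all assumptions. The thresholds (630W, 5C, 5%, throttle 0x0) are the ones quoted in this commit.

```python
from collections import defaultdict
from statistics import mean


def evaluate_stress(samples, tflops_readings,
                    min_power_w=630.0, max_temp_delta_c=5.0, max_jitter_pct=5.0):
    """Judge steady-state telemetry against the strict gates.

    samples: iterable of (gpu_index, power_w, temp_c, throttle_mask) tuples,
             one per GPU per telemetry tick (hypothetical layout).
    tflops_readings: steady-state TFLOPS readings (hypothetical layout).
    """
    power = defaultdict(list)
    temp = defaultdict(list)
    throttle_seen = set()
    for gpu, power_w, temp_c, mask in samples:
        power[gpu].append(power_w)
        temp[gpu].append(temp_c)
        if mask != 0:  # strict reading: any non-zero mask counts
            throttle_seen.add(mask)

    avg_power = {gpu: mean(vals) for gpu, vals in power.items()}
    avg_temp = {gpu: mean(vals) for gpu, vals in temp.items()}
    temp_delta = max(avg_temp.values()) - min(avg_temp.values())
    avg_tflops = mean(tflops_readings)
    # Assumed jitter definition: range relative to the mean, in percent.
    jitter_pct = (max(tflops_readings) - min(tflops_readings)) / avg_tflops * 100.0

    failures = []
    if min(avg_power.values()) < min_power_w:
        failures.append("power")
    if temp_delta > max_temp_delta_c:
        failures.append("temp_delta")
    if throttle_seen:
        failures.append("throttle")
    if jitter_pct > max_jitter_pct:
        failures.append("jitter")

    details = {
        "min_avg_power_w": min(avg_power.values()),
        "temp_delta_c": temp_delta,
        "jitter_pct": jitter_pct,
        "throttle_masks": sorted(throttle_seen),
        "failures": failures,
    }
    return ("FAIL" if failures else "PASS"), details
```

Fed with numbers shaped like the smoke runs (about 697W, a 12C delta, mask `0x4`), this returns `FAIL` with `temp_delta` and `throttle` as the failing gates, even though power and jitter pass.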
+
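+For a final acceptance strictly per the PDF, the following are still missing:
+
+1. 24-hour metrics are not covered: the PDF requires the 24h SBE growth rate and long steady-state observation; the current `all` is a single snapshot plus a 30-minute stress run, which is not a 24-hour burn-in.
+2. Cross-node items are deliberately not tested this round: production IB/RDMA acceptance under the PDF normally needs two-ended `ib_write_bw/read_bw/lat` and `ibping`; as you requested, this round is single-node only, so cross-node is not included.
+3. PFC/ECN/AER coverage depends on the system counters the machine exposes: the script reads whatever sysfs counters and dmesg entries it can find, but if switch-side PFC/ECN is not exposed on the host, the network side still has to supply evidence.
+4. The NCCL 1MB tier fails under the strict threshold: measured 1MB AllReduce bus BW is about 23 GB/s, while 256MB AllReduce has been verified with `nccl-tests` at about 421 GB/s; if the PDF requires 405 GB/s at 1MB as well, this item is not "untested", it will be judged FAIL.
+5. Stress already meets the power and jitter requirements, but the short runs have exposed strict FAILs on temperature delta and throttle; the full 1800-second run only yields more formal evidence and does not change the criterion by itself.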
+
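+## Coverage table
+
+| PDF acceptance item | Current `test all` status | What is still missing |
+|---|---:|---|
+| GPU basic info, Driver/CUDA | Covered | Nothing; driver, CUDA, and GPU model are recorded |
+| Temperature thresholds: steady ≤75C, peak ≤85C | Health snapshot covered; stress item covers ≤80C | The 24h steady-state curve is not part of a single `all` run |
+| idle power ≤100W/card | Partially covered | health samples power, but the idle criterion is not yet a standalone acceptance item |
+| stress power ≥630W/card | Covered; short runs on both machines at about 697-699W/GPU | The full 1800-second run is still pending |
+| throttle reasons active=0x0 | Covered; short runs on both machines show 0x4 | Strict FAIL under the PDF; not an item the script skips |
+| DBE/SBE/retired pages | Partially covered | Retired pages and kernel errors are checked; the 24h SBE growth rate is not covered |
+| PCIe Gen5 x16 | Partially covered | Visible in GPU info/topology; Replay/AER depend on dmesg/sysfs and may need extra motherboard-side evidence |
+| Fabric Manager active with no ERROR | Covered | Nothing; health checks systemd and the journal |
+| NVLink: 18 links/GPU, 25GB/s/link, zero errors | Covered | Nothing; a new `nvlink` item was added |
+| D2D/H2D/D2H bandwidth | Covered | Depends on `nvbandwidth`, which both machines have |
+| 8x8 P2P matrix off-diagonal mean/min/deviation | Covered | Nothing; parsed from the nvbandwidth JSON |
+| Compute FP32/TF32/FP16/BF16/FP8/FP64/INT8 | Covered | INT8 uses the PyTorch `_int_mm` path; a vendor-standard INT8 kernel would need a different implementation |
+| NCCL AllReduce/AllGather/ReduceScatter/Broadcast/SendRecv/AllToAll | Covered | Nothing; `nccl-tests` is built on both machines |
+| NCCL 1MB/256MB/2GB, repeat 3, stddev ≤3% | Covered | Under the strict PDF thresholds the 1MB tier will very likely FAIL; 256MB AllReduce measured about 421GB/s with `nccl-tests` on both machines |
+| Stress ≥30min, BF16/FP16 GEMM 8192, 1s telemetry | Covered; default BF16 GEMM `24576`, 1s telemetry, steady-state verdict after warmup | The full 1800-second run is pending; short runs already expose temperature-delta/throttle FAILs |
+| DCGM `dcgmi diag -r 3` | Covered; DCGM 4.5.3 installed, service enabled | The full `-r 3` has PASSed on both machines; logs at `/root/test_gpu_scripts/reports/dcgm_r3_*_20260522_17010*.log` |
+| RDMA port ACTIVE, 400Gbps | Partially covered | Ports can be checked on a single node; strict two-ended throughput/latency is not run this round |
+| RDMA write/read bw ≥47GB/s, latency ≤2/3.5us | Partially covered | Single-host localhost/perftest is not equivalent to cross-node line-rate acceptance |
+| PFC/ECN errors=0, bidirectional ibping OK | Partially covered | Counters readable on the host are checked; switch-side counters and cross-node ibping are not covered |
+| 1.5B synthetic Transformer BF16, 8 GPUs, ≥45k tokens/s | DDP path covered | The 8-process DDP smoke run has passed; the full 50-step run is pending |
+| Any sub-item FAIL makes the overall acceptance FAIL | Covered | `all` now exits non-zero according to the strict verdict |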
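The last row ("any sub-item FAIL makes the overall acceptance FAIL", with a non-zero exit) can be pictured with the sketch below. It is illustrative only: `overall_verdict` and `exit_code` are hypothetical names, and treating every status other than `PASS` (for example `WARN` or a missing result) as a failure is an assumption made here, not something this diff shows.

```python
def overall_verdict(results):
    """Return ("PASS" | "FAIL", sorted list of failing sub-items).

    results: dict mapping sub-item name -> status string.
    Assumption: anything that is not exactly "PASS" blocks acceptance.
    """
    failed = sorted(name for name, status in results.items() if status != "PASS")
    return ("FAIL" if failed else "PASS"), failed


def exit_code(results):
    """Map the strict verdict to a process exit code (0 = accepted)."""
    verdict, _ = overall_verdict(results)
    return 0 if verdict == "PASS" else 1


if __name__ == "__main__":
    demo = {"health": "PASS", "nccl_1M_allreduce": "FAIL", "stress": "FAIL"}
    print(overall_verdict(demo), exit_code(demo))
```

A caller would typically pass the result of `exit_code(...)` to `sys.exit` so that automation can gate on the run.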
+
+## If you run `all` right now
+
+Recommended command:
+
+```bash
+cd /root/test_gpu_scripts
+/root/gpu-test-venv/bin/python gpu_tester.py --config configs/default.yaml --test all --report --format json --output reports/h100_all_$(hostname)_$(date +%Y%m%d_%H%M%S).json
+```
+
+To generate the Chinese Markdown report directly, use:
+
+```bash
+cd /root/test_gpu_scripts
+/root/gpu-test-venv/bin/python gpu_tester.py --config configs/default.yaml --test all --report --format md --output reports/h100_all_$(hostname)_$(date +%Y%m%d_%H%M%S).md
+```
+
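+Expected behavior:
+
+- Runs the full single-node suite; stress defaults to 1800 seconds of PyTorch BF16 GEMM load with 1-second telemetry/XID sampling.
+- The default stress matrix is `24576`, chosen to push the H100 to ≥630W/GPU; the PDF only requires `matrix_size >=8192`, and the larger size exists to meet the power gate.
+- NCCL runs 6 ops × 3 message sizes × 3 repeats.
+- DCGM runs `dcgmi diag -r 3 -n gpu:8 -j`; the DCGM toolchain is installed and running, and both `diag -r 1` and the standalone long `r3` runs on both machines have PASSed.
+- The NCCL 1MB tier will also fail against the 405GB/s threshold; 256MB AllReduce has been verified to go through `nccl-tests`, at about 421GB/s on both machines.
+- Stress is expected to FAIL under the strict PDF criteria: the short-run evidence shows a temperature delta above 5C and throttle active at `0x4`.
+- Cross-node RDMA/NCCL is not part of this single-node `all`.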
+
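+## Minimal list of remaining work
+
+1. For strict RDMA production acceptance, run a two-ended server/client test across the two machines in the next round.
+2. Run the full 1.5B DDP 50-step training acceptance and archive tokens/s, jitter, memory, and loss.
+3. Run the full 1800-second stress test and archive 1-second telemetry, XID, throttle, power, and temperature; it is currently expected to FAIL on temperature delta/throttle.
+4. For 24-hour acceptance, add a 24h monitor mode that records the SBE growth rate, XID, temperature, power, and clock-throttling curves.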
--- a/H100验收_vs_test_all_差距分析.md
+++ b/H100验收_vs_test_all_差距分析.md
@@ -0,0 +1,100 @@
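+# H100 production acceptance standard vs. current `gpu_tester.py --test all` coverage gaps
+
+Reference file: `/Users/d-robotics/Downloads/H100_production_acceptance.pdf`
+
+Compared against: running `python gpu_tester.py --test all --report --format md/json` in the current repository
+
+## Conclusion
+
+The repository's `test all` covers the broad categories in the acceptance document, but it is not yet a complete H100 production acceptance.
+
+It runs 8 modules:
+
+1. GPU Information
+2. Health Check
+3. Memory Benchmark
+4. Compute Benchmark
+5. NCCL Test
+6. GPU Stress Test
+7. RDMA/IB Test
+8. Training Simulation
+
+Measured against the PDF's production acceptance standard, these key items are still missing:
+
+- Per-link acceptance of active state, speed, and error counters for the 18 NVLinks on each GPU
+- DCGM `dcgmi diag -r 3`
+- A 30-60 minute burn-in with 1-second sampling of temperature/power/throttle/XID
+- Performance acceptance with the official `nccl-tests`, including the three message sizes 1MB/256MB/2GB, 3 repeats taking the worst value, and the standard deviation
+- RDMA at production methodology: 4MB bandwidth, 8B latency, PFC/ECN errors, bidirectional ibping
+- Per-GPU compute consistency across 8 GPUs, requiring range/mean <= 3% within each dtype
+- FP64 and INT8 compute items
+- The training item should be an 8-GPU 1.5B synthetic Transformer, judged on 45k tokens/s, step jitter, memory, and loss health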
+
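+## Coverage matrix
+
+| PDF acceptance item | Covered by `test all`? | Current coverage | Main gaps |
+| --- | --- | --- | --- |
+| 1. Health check | Partially covered | Temperature, power, ECC, PCIe, clocks, throttle, persistence, IB devices | Idle power <=100W is not judged separately; stress power >=630W is not judged; retired pages are not checked; the 24h SBE growth rate is not checked; AER/Replay errors are not checked; the fabricmanager service and ERROR logs are not checked |
+| 2. NVLink topology and links | Partially covered | GPU info saves `nvidia-smi topo -m` | `nvidia-smi nvlink -s/-c/-e` is not run; 18 NVLinks per GPU is not verified; 25GB/s per link is not verified; CRC/Replay/Recovery error = 0 is not verified |
+| 3. Memory Bandwidth | Partially covered | nvbandwidth measures H2D, D2H, and D2D write/read/bidir | The full 8x8 P2P matrix is not output; off-diagonal mean >=360GB/s, minimum >=320GB/s, and deviation from the mean <=±5% are not checked; the D2D methodology is not yet fully aligned with the PDF's single-GPU/P2P acceptance methodology |
+| 4. Compute Throughput | Mostly covered | The default config already uses matrix_size=8192, warmup=50, iterations=500, use_compile=true; absolute H100 TFLOPS thresholds live in `gpu_specs.py` | Results are currently aggregate/single-process; the 8 GPUs are not measured individually, so no cross-GPU range/mean is available; FP64 and INT8 are not tested |
+| 5. NCCL Multi-GPU | Partially covered, tool-dependent | The code supports nccl-tests; if the binary is missing it falls back to a torchrun functional connectivity check | nccl-tests is not properly installed on the remote hosts, so the run degrades to a functional test that fails or yields no performance data; only allreduce/alltoall/broadcast are enabled by default, not allgather/reducescatter/sendrecv; message sizes are not the three points 1MB/256MB/2GB; there is no 3x repeat taking the worst value; no standard deviation is computed |
+| 6. Stress/Burn-in | Partially covered | Runs stress, 60 seconds by default; uses the PyTorch fallback when gpu-burn is absent | The PDF requires >=30min (60min recommended); a large FP16/BF16 GEMM with matrix >=8192; per-minute TFLOPS jitter, temperature <=80, inter-GPU temperature delta <=5, power >=630W, throttle=0, XID=0; the current PyTorch fallback allocates only about 64MB per GPU, which is not enough load |
+| 7. DCGM diagnostics | Not covered | None | `dcgmi diag -r 3` is not executed, and the Software/Deployment/Hardware/Integration/Stress/Power sub-items are not parsed |
+| 8. RDMA/IB | Partially covered | Discovers IB devices and runs ib_write_bw/read_bw/write_lat/read_lat | The script uses `localhost`, not a cross-node peer; msg_size is 64KB, not 4MB; latency does not specify 8B; thresholds are 50GB/s and 10us rather than the PDF's write/read >=47GB/s, write_lat <=2us, read_lat <=3.5us; PFC/ECN and bidirectional ibping are not checked |
+| 9. Training Simulation | Partially covered | Runs GPT-2 or a synthetic transformer and reports tokens/s, step time, memory, and loss | The synthetic model has about 1.47B parameters but actually runs as a single process with `.cuda()`, not as 8-GPU distributed training; there is no hard verdict on 45k tokens/s, step jitter <=±3%, peak <=70GB/GPU, or NaN/Inf |
+| 10. Overall Verdict | Partially covered | The report has a summary | The current `all` pass/fail logic leans toward "did the module error out", not the PDF's rule that any sub-item FAIL bars the whole machine from production |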
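The 8x8 P2P criteria in row 3 (off-diagonal mean >=360GB/s, minimum >=320GB/s, deviation from the mean within ±5%) could be checked as sketched below. This is a hedged illustration: `check_p2p_matrix` is a hypothetical helper, the matrix is assumed to be already parsed into a list of lists of GB/s values, and "deviation" is read here as the largest relative distance of any off-diagonal entry from the off-diagonal mean.

```python
def check_p2p_matrix(matrix, min_mean=360.0, min_value=320.0, max_dev_pct=5.0):
    """Evaluate a square GPU-to-GPU bandwidth matrix (GB/s).

    Diagonal entries (a GPU copying to itself) are ignored.
    Returns (verdict, stats).
    """
    n = len(matrix)
    off_diag = [matrix[i][j] for i in range(n) for j in range(n) if i != j]
    avg = sum(off_diag) / len(off_diag)
    lowest = min(off_diag)
    # Assumed reading of "deviation": worst relative distance from the mean.
    worst_dev_pct = max(abs(v - avg) for v in off_diag) / avg * 100.0

    failures = []
    if avg < min_mean:
        failures.append("mean")
    if lowest < min_value:
        failures.append("min")
    if worst_dev_pct > max_dev_pct:
        failures.append("deviation")

    stats = {"mean": avg, "min": lowest, "worst_dev_pct": worst_dev_pct,
             "failures": failures}
    return ("FAIL" if failures else "PASS"), stats


if __name__ == "__main__":
    # Toy 3x3 example; a real run would use the 8x8 matrix.
    toy = [[0.0, 370.0, 370.0],
           [370.0, 0.0, 330.0],
           [370.0, 370.0, 0.0]]
    print(check_p2p_matrix(toy))
```

In the toy matrix the mean and minimum both pass, yet the single 330 GB/s pair sits about 9% below the mean, so the deviation gate fails. That is the kind of one-slow-pair case a mean-only check would hide.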
+
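+## What you get if you run `test all` right now
+
+You get a "single-node general check-up / benchmark report" containing:
+
+- Basic information for the 8 H100s: driver/CUDA, PCIe, memory, temperature, power
+- Health check results
+- nvbandwidth summary bandwidth for H2D/D2H/D2D
+- FP32/TF32/FP16/BF16/FP8 compute throughput
+- NCCL test results; if nccl-tests is missing, the run degrades to the torchrun fallback
+- 60-second stress results
+- Local localhost RDMA/IB results
+- Training simulation results
+
+This report works as a "quick smoke test plus single-machine screening". It cannot serve directly as a "production acceptance pass report" under the PDF standard.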
+
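+## Pre-run state of the two machines
+
+Confirmed so far:
+
+- `nvbandwidth` is installed and callable from the project scripts
+- The PyTorch CUDA environment is installed
+- The RDMA perftest tools are present
+- `nccl-tests` and `gpu-burn` are not yet prepared to the PDF's production acceptance methodology
+
+In addition, regarding the `test all` I triggered by mistake just now:
+
+- `aikubeworker0016` is already running single-node `test all` and has reached Training Simulation
+- `aikubeworker0012` did not start successfully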
+
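+## Minimal list of additions needed to reach the PDF acceptance methodology
+
+1. Install/repair `nccl-tests` so that real bus BW is reported instead of the torchrun fallback.
+2. Install/repair `gpu-burn`, or turn the PyTorch stress into a genuinely high-occupancy FP16/BF16 GEMM that supports 30/60 minutes.
+3. Add a dedicated NVLink check: `nvidia-smi nvlink -s/-c/-e`, judged on 18 links/GPU, 25GB/s, error=0.
+4. Add a dedicated DCGM check: `dcgmi diag -r 3`, parsing sub-item PASS/FAIL.
+5. Add telemetry sampling: during stress, sample temperature, power, throttle, and XID every 1 second; compute steady-state power, temperature delta, and jitter.
+6. Change RDMA: support explicit server/client, 4MB bandwidth, 8B latency, bidirectional ibping, and PFC/ECN counters.
+7. Change the NCCL config: enable all ops, use the three sizes 1MB/256MB/2GB, repeat 3 times, and report the worst value and standard deviation.
+8. Change Compute: run each GPU separately and compute range/mean within each dtype; add FP64 and INT8.
+9. Change Training Simulation: make it an explicit 8-GPU 1.5B synthetic distributed run with PASS/FAIL on tokens/s, step jitter, memory, and loss NaN/Inf.
+10. Change the final verdict: per the PDF rule, any sub-item FAIL fails the whole machine.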
+
+## Suggested execution strategy
+
+Running this right now:
+
+```bash
+/root/gpu-test-venv/bin/python gpu_tester.py --test all --report --format md --output reports_all/test_all.md
+```
+
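+produces a "report of what the current repository's `all` covers".
+
+To use it for production acceptance, the gaps above must be closed first, especially `nccl-tests`, `gpu-burn`, NVLink, DCGM, long burn-in, and cross-node RDMA.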
--- a/README.md
+++ b/README.md
@@ -159,7 +159,7 @@ python3 gpu_tester.py
 [3]  Memory Benchmark (nvbandwidth)
 [4]  Compute Benchmark
 [5]  NCCL Multi-GPU Test
- [6]  GPU Stress Test (gpu-burn)
+ [6]  GPU Stress Test (PyTorch/gpu-burn)
 [7]  RDMA/IB Test
 [8]  Training Simulation
 [9]  Full Test Suite (All Tests)
@@ -279,33 +279,35 @@ python3 gpu_tester.py --config /path/to/config.yaml --test all
 | FP16 | 312 TFLOPS | 990 TFLOPS | 2,250 TFLOPS | 3,500 TFLOPS |
 | BF16 | 312 TFLOPS | 990 TFLOPS | 2,250 TFLOPS | 3,500 TFLOPS |
 | FP8 | N/A | 1,979 TFLOPS | 4,500 TFLOPS | 7,000 TFLOPS |
+| FP64 | 9.7 TFLOPS | 67 TFLOPS | TBD | TBD |
+| INT8 | 624 TOPS | 1,979 TOPS | TBD | TBD |

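-Default configuration: 4096×4096 matrix, 10 warmup iterations, 100 iterations.
+Default configuration: 8192×8192 matrix, 50 warmup iterations, 500 iterations; FP32/TF32/FP16/BF16/FP8/FP64/INT8 run on each GPU separately, and consistency is judged by range/mean within each dtype.

 ### 5. NCCL Multi-GPU Test (multi-GPU communication)

-Prefers the official nccl-tests (invoked via mpirun), with a torchrun fallback when unavailable.
+Prefers the official nccl-tests (invoked via mpirun) and parses the real bus BW; if only the torchrun fallback can run, the acceptance result is marked FAIL.

 | Operation | Description |
 |---|---|
 | AllReduce | The most common collective |
 | AllToAll | Key operation for model parallelism |
 | Broadcast | Parameter synchronization |
-| ReduceScatter | Optional |
-| AllGather | Optional |
-| SendRecv | Optional |
+| ReduceScatter | Required |
+| AllGather | Required |
+| SendRecv | Required |

-Default data sizes range from 8B to 256MB, with 5 warmup iterations and 20 iterations.
+By default, following the PDF, three sizes are tested (1MB, 256MB, 2GB), each op is repeated 3 times, and the worst bus BW and standard deviation are taken; a standard deviation above 3% is a FAIL.

 **NVLink reference bandwidth:** A100/A800 ≥ 240 GB/s | H100/H200 ≥ 360 GB/s | B200/B300 ≥ 720 GB/s (40% of NVLink peak)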

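 ### 6. GPU Stress Test (stress test)

-Uses gpu-burn for a long full-load run to verify thermal stability and memory correctness.
+Uses PyTorch BF16/FP16 GEMM by default for a long, high-power full-load run; gpu-burn can also be enabled in the config. During the test, temperature, power, throttle, and XID are sampled, and steady-state power, temperature delta, and TFLOPS jitter are computed.

 | Parameter | Default | Description |
 |---|---|---|
-| duration_sec | 60 | Test duration (seconds) |
+| duration_sec | 1800 | Test duration (seconds) |
 | use_tensor_cores | true | Use Tensor Cores |
 | memory_pct | 90 | Memory usage percentage |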

@@ -320,18 +322,18 @@ python3 gpu_tester.py --config /path/to/config.yaml --test all
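 | Write latency | ib_write_lat |
 | Read latency | ib_read_lat |

-**Reference thresholds:** bandwidth ≥ 50 GB/s, latency ≤ 10 μs
+**Reference thresholds:** port ACTIVE and ≥400Gbps; 4MB write/read bandwidth ≥47GB/s; 8B write latency ≤2μs, read latency ≤3.5μs; PFC/ECN/CNP/congestion counters at 0.

 ### 8. Training Simulation (training simulation)

-Simulates a training load with a real or synthetic model.
+Runs an 8-GPU DDP synthetic 1.5B Transformer training simulation by default.

 | Mode | Description |
 |---|---|
-| Real model | Loads HuggingFace GPT-2 (requires transformers) |
-| Synthetic model | 6-layer Transformer (no extra dependencies) |
+| DDP synthetic model | About 1.5B parameters, 8 GPUs via torchrun |
+| Single-process fallback | For debugging only; counts as FAIL for production acceptance |

-Output: tokens/sec, step time, peak memory, final loss.
+Output: tokens/sec, step time, post-warmup step jitter, peak memory, and final loss, plus a check that the loss is not NaN/Inf.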

 ---

@@ -351,14 +353,14 @@ benchmark:
    nvbandwidth_buffer_mb: 512          # nvbandwidth buffer size
    nvbandwidth_samples: 3              # nvbandwidth sample count
  compute:
-    dtypes: [fp32, tf32, fp16, bf16, fp8]
-    matrix_size: 4096                   # GEMM matrix dimension
-    warmup: 10
-    iterations: 100
+    dtypes: [fp32, tf32, fp16, bf16, fp8, fp64, int8]
+    matrix_size: 8192                   # GEMM matrix dimension
+    warmup: 50
+    iterations: 500

 health:
-  temp_warning: 80                      # temperature warning threshold, °C
-  temp_critical: 90                     # temperature critical threshold, °C
+  temp_warning: 75                      # temperature warning threshold, °C
+  temp_critical: 85                     # temperature critical threshold, °C
   power_limit: null                     # null = auto-match the GPU TDP

 nccl:
@@ -366,26 +368,62 @@ nccl:
  test_allreduce: true
  test_alltoall: true
  test_broadcast: true
+  test_reduce_scatter: true
+  test_allgather: true
+  test_sendrecv: true
+  message_sizes: [1M, 256M, 2G]
+  repeats: 3
+  max_stddev_pct: 3

 stress:
-  duration_sec: 60                     # stress test duration (seconds)
+  duration_sec: 1800                   # stress test duration (seconds)
+  use_gpu_burn: false                  # defaults to the PyTorch GEMM stress path
+  dtype: bf16
+  matrix_size: 24576
+  telemetry_interval_sec: 1
+  min_power_watts: 630
+  max_tflops_jitter_pct: 5
+  require_tflops_jitter: true
  use_tensor_cores: true

 rdma:
-  min_bandwidth_gbps: 50              # minimum acceptable RDMA bandwidth (GB/s)
-  max_latency_us: 10                  # maximum acceptable RDMA latency
-  msg_size: 65536                     # test message size
+  min_bandwidth_gbps: 47              # minimum acceptable RDMA bandwidth (GB/s)
+  min_port_rate_gbps: 400             # minimum IB port rate
+  max_write_latency_us: 2.0
+  max_read_latency_us: 3.5
+  msg_size: 4194304                   # 4MB message for the bandwidth test
+  latency_msg_size: 8                 # 8B message for the latency test
+  server_addr: null                   # peer IP for perftest in client mode
+  ibping_target: null                 # ibping peer LID/GID, not an IP
+  role: auto                          # auto / server / client
+  pfc_ecn_counters: true
+
+nvlink:
+  expected_links_per_gpu: 18
+  expected_link_speed_gbps: 25
+  require_zero_errors: true
+
+dcgm:
+  diag_level: 3
+  timeout_sec: 3600
+  expected_num_gpus: 8
+  json_output: true
+  require_subtests: true

 training:
-  model: gpt2                          # HuggingFace model name
+  model: synthetic_1.5b                # 8-GPU synthetic Transformer
  batch_size: 8
  seq_length: 2048
  num_steps: 50
+  warmup_steps: 5
  dtype: bf16
+  mode: ddp
+  min_tokens_per_sec: 45000
+  max_step_jitter_pct: 3

 report:
  output_dir: ./reports
-  format: json                         # json or html
+  format: json                         # json / html / md
 ```

 ---
@@ -493,9 +531,11 @@ report:
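 Step 2: RDMA network test
 ├── python3 gpu_tester.py --test rdma
 ├── Confirm: IB device is detected
-├── Confirm: port state Active
-├── Confirm: write bandwidth ≥ 50 GB/s
-├── Confirm: latency ≤ 10 μs
+├── Confirm: port state ACTIVE and ≥400Gbps
+├── Confirm: 4MB write/read bandwidth ≥47 GB/s
+├── Confirm: 8B write latency ≤2 μs, read latency ≤3.5 μs
+├── Confirm: ibping works in both directions
+├── Confirm: PFC/ECN/CNP/congestion counters are 0
 └── On failure: check IB cables, switch configuration, subnet manager

 Step 3: Multi-node NCCL test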
--- a/docs/h100_test_all_metrics_guide_cn.md
+++ b/docs/h100_test_all_metrics_guide_cn.md
@@ -0,0 +1,255 @@
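+# H100 `test all` metrics guide
+
+This document explains what each metric in the `gpu_tester.py --test all` report means, what it stands for in acceptance, and what to investigate first when it is abnormal.
+
+Applicable reports:
+
+- `reports_test_all_latest_aikubeworker0012_20260522_203246.md`
+- `reports_test_all_latest_aikubeworker0016_20260522_203447.md`
+- `reports_test_all_latest_summary_cn_20260523.md`
+
+## Overall verdict
+
+| Metric | Meaning | How to read it |
+|---|---|---|
+| `Overall Acceptance Verdict` | Acceptance conclusion for the whole machine | Under the PDF production acceptance rules, any required sub-item FAIL makes the whole machine FAIL |
+| `Suite complete: x/10 tests passed` | How many of the 10 test modules passed | A quick view of overall health; the final word is `Overall Acceptance Verdict` |
+| `PASS` | Meets the currently configured threshold | The metric passes under the current test methodology |
+| `FAIL` | Misses the currently configured threshold, or evidence is insufficient | The item cannot be used as evidence of passing production acceptance |
+| `WARN` | Legacy reports or non-mandatory warnings | Under the current PDF production acceptance, a key performance miss should be treated as FAIL |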
+
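+## GPU Info
+
+GPU Info is the basic inventory item. It confirms that the machine's hardware, driver, and CUDA environment match expectations.
+
+| Metric | Meaning | Impact when abnormal |
+|---|---|---|
+| GPU count | Number of GPUs the system currently detects | On an 8-GPU H100 machine, anything other than 8 makes all later multi-GPU tests untrustworthy |
+| GPU model | GPU model, for example H100 | A wrong model invalidates thresholds, peaks, and acceptance criteria |
+| Driver version | NVIDIA driver version | An old version can affect CUDA, NCCL, DCGM, and NVLink tooling |
+| CUDA version | CUDA runtime or driver-supported version | A CUDA mismatch can break PyTorch, nccl-tests, or build tools |
+| GPU UUID / PCI bus id | Unique GPU identifier and PCIe topology position | Used to locate the specific faulty card, slot, and link |
+
+This item usually says nothing about performance by itself. It confirms that "the thing being tested is the target machine, the target GPU, and the target software stack".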
+
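+## Health Check
+
+Health Check is the baseline health check under idle or light load.
+
+| Metric | Meaning | How to read it |
+|---|---|---|
+| Temperature | Current GPU temperature | A high idle temperature may indicate cooling, airflow, or ambient temperature problems |
+| Power | Current power draw | Abnormally high idle power may indicate leftover processes or an abnormal power state |
+| ECC errors | Memory error-correction events | Many single-bit errors, or any double-bit error, usually call for a close look at hardware stability |
+| PCIe | PCIe generation and width, for example Gen5 x16 | Reduced speed or width hurts CPU-GPU, RDMA, and some data movement performance |
+| Throttle | Whether clock throttling is currently active | Non-idle throttle while idle is abnormal and may affect later performance |
+| XID / NVRM events | Driver or GPU error events | A new XID usually indicates a hardware, driver, power delivery, or kernel-level fault |
+
+Health PASS only says the baseline state is normal. It does not mean full-load performance will meet the bar.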
+
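+## Memory Bandwidth
+
+Memory Bandwidth measures data movement capability: CPU to GPU, GPU to CPU, and GPU to GPU.
+
+| Metric | Meaning | What it represents |
+|---|---|---|
+| H2D | Host to Device, bandwidth from CPU memory to GPU memory | Affected by PCIe, NUMA, CPU memory, and the driver |
+| D2H | Device to Host, bandwidth from GPU memory to CPU memory | Affected by PCIe, NUMA, CPU memory, and the driver |
+| D2D | Device to Device, GPU-to-GPU bandwidth | On a single multi-GPU node, mostly determined by NVLink/NVSwitch |
+| Efficiency | Measured value relative to the theoretical or configured threshold | A quick check of whether the expected bandwidth is reached |
+
+H2D/D2H mainly show whether PCIe and the CPU-side path are healthy. D2D is closer to the foundation that multi-GPU training, NCCL, and P2P communication rely on.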
+
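+## Compute Throughput
+
+Compute Throughput measures GPU matrix-compute throughput in different numeric formats, usually in TFLOPS.
+
+| Metric | Meaning | Typical use |
+|---|---|---|
+| FP32 | 32-bit floating-point performance | Traditional scientific computing, some model training and validation |
+| TF32 | TensorFloat-32 Tensor Core performance | The common FP32 acceleration path on NVIDIA Ampere/Hopper |
+| FP16 | 16-bit floating-point Tensor Core performance | Common in deep learning training and inference |
+| BF16 | bfloat16 Tensor Core performance | Common in large-model training; wider numeric range than FP16, hence more stable |
+| FP8 | 8-bit floating-point Tensor Core performance | Newer low-precision training/inference acceleration |
+| FP64 | 64-bit double-precision performance | HPC, scientific computing, simulation |
+| INT8 | 8-bit integer performance | Inference, quantized models |
+| Achieved | Measured throughput | The closer to peak, the better |
+| Peak | Theoretical or spec-sheet peak | Used to compute efficiency |
+| Threshold | Current acceptance threshold | Below the threshold is a FAIL |
+| Efficiency | `Achieved / Peak` | Measures achieved utilization |
+
+### Compute Consistency
+
+Consistency checks whether performance is balanced across GPUs for the same dtype.
+
+| Metric | Meaning | What an anomaly means |
+|---|---|---|
+| Min | Measured value of the slowest of the 8 GPUs | Finds the card that drags the group down |
+| Mean | Average across the 8 GPUs | Shows the overall level |
+| Max | Measured value of the fastest of the 8 GPUs | Combined with Min to compute dispersion |
+| Spread | `(Max - Min) / Mean` | Reflects the performance difference between cards |
+
+A Spread above the threshold usually means some cards are affected by temperature, power, PCIe, background load, clock policy, or hardware state. Even when the average looks fine, a large gap between cards slows down distributed training.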
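As a worked example of the Spread formula `(Max - Min) / Mean`, the sketch below computes it for one dtype across 8 GPUs. `spread_pct` and `consistency_verdict` are hypothetical helper names and the TFLOPS values are invented for illustration; the 3% limit is the "range/mean <= 3%" figure quoted elsewhere in this commit.

```python
from statistics import mean


def spread_pct(per_gpu_tflops):
    """Spread = (Max - Min) / Mean, expressed in percent."""
    return (max(per_gpu_tflops) - min(per_gpu_tflops)) / mean(per_gpu_tflops) * 100.0


def consistency_verdict(per_gpu_tflops, max_spread_pct=3.0):
    """PASS when the per-dtype spread across GPUs stays within the limit."""
    return "PASS" if spread_pct(per_gpu_tflops) <= max_spread_pct else "FAIL"


if __name__ == "__main__":
    # Invented BF16 readings for 8 GPUs, in TFLOPS.
    bf16 = [700.0, 710.0, 705.0, 695.0, 702.0, 708.0, 699.0, 701.0]
    print(round(spread_pct(bf16), 3), consistency_verdict(bf16))
```

With these readings Max is 710, Min is 695, and Mean is 702.5, so the spread is 15 / 702.5, roughly 2.14%, which stays inside the 3% limit.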
+
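+## NVLink / NVSwitch
+
+The NVLink/NVSwitch test confirms that the high-speed GPU interconnect is complete, runs at the right speed, and has clean error counters.
+
+| Metric | Meaning | How to read it |
+|---|---|---|
+| Active Links | Number of NVLinks currently active on each GPU | On an 8-GPU H100 SXM system the usual expectation is 18 per GPU |
+| Expected Links | Configured expected link count | Even one missing link can affect topology and NCCL performance |
+| Link speed | Per-link speed | A wrong speed indicates link degradation or a detection problem |
+| Error counters | NVLink error counters, for example CRC/replay/recovery | Non-zero values may indicate link quality or hardware problems |
+
+NVLink PASS means the link state looks normal, but NCCL can still miss its thresholds because of algorithm, topology, message size, NCCL parameters, or system noise.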
+
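+## DCGM Diagnostic
+
+DCGM is NVIDIA's official diagnostic tool. `dcgmi diag -r 3` is a fairly complete production diagnostic level.
+
+| Sub-item | Meaning |
+|---|---|
+| Deployment/software | Checks of driver, libraries, and system software dependencies |
+| Hardware/memory | GPU memory health check |
+| Hardware/diagnostic | Basic GPU hardware diagnostic |
+| Hardware/nvbandwidth | GPU/NVLink/NVSwitch bandwidth diagnostic |
+| Integration/pcie | PCIe integration and link checks |
+| Stress/targeted_stress | DCGM's built-in targeted stress test |
+| Stress/targeted_power | DCGM's built-in targeted power stress test |
+| summary | Summary result for the category |
+
+DCGM PASS is strong evidence that the official diagnostics found no obvious hardware fault. It does not replace this project's NCCL, RDMA, long-run telemetry, and training simulation acceptance.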
+
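+## NCCL Multi-GPU
+
+The NCCL test measures collective communication across multiple GPUs on a single node. It bears directly on multi-GPU training efficiency.
+
+| Metric | Meaning | Why it matters |
+|---|---|---|
+| source | Where the result came from | Only `nccl-tests` gives a real bus BW; `torchrun_fallback` shows functional connectivity only and is not performance acceptance |
+| bus BW | Bus-equivalent bandwidth reported by NCCL | Measures whether communication saturates NVLink/NVSwitch |
+| message size | Message size, for example 1M, 256M, 2G | Small messages reflect latency and scheduling; medium and large ones reflect bandwidth |
+| repeats | Number of repetitions | Reduces random fluctuation; currently 3 samples |
+| worst bus BW | The worst value across repetitions | Production acceptance cares most about the worst case |
+| mean bus BW | Average across repetitions | Reflects the stable level |
+| stddev | Standard deviation or fluctuation | Large fluctuation means communication stability is insufficient |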
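The `worst bus BW`, `mean bus BW`, and `stddev` rows can be derived from the repeated runs roughly as below. This is a sketch under stated assumptions: `summarize_repeats` is a hypothetical helper, the standard deviation is taken as the sample standard deviation relative to the mean (the commit does not spell out the exact formula), and the 405 GB/s threshold and 3% limit are the values quoted in this commit.

```python
from statistics import mean, stdev


def summarize_repeats(busbw_gbps, min_busbw=405.0, max_stddev_pct=3.0):
    """Summarize repeated bus BW measurements for one (op, size) pair.

    busbw_gbps: bus bandwidth of each repeat in GB/s (at least 2 values).
    The verdict uses the worst repeat, not the mean.
    """
    worst = min(busbw_gbps)
    avg = mean(busbw_gbps)
    # Assumed definition: sample standard deviation as a percentage of the mean.
    stddev_pct = stdev(busbw_gbps) / avg * 100.0
    ok = worst >= min_busbw and stddev_pct <= max_stddev_pct
    return {
        "worst": worst,
        "mean": avg,
        "stddev_pct": stddev_pct,
        "verdict": "PASS" if ok else "FAIL",
    }


if __name__ == "__main__":
    print(summarize_repeats([421.0, 420.0, 422.0]))  # large-message shape
    print(summarize_repeats([23.0, 23.5, 22.5]))     # small-message shape
```

Judging on the worst repeat means one slow run is enough to fail the pair, which matches the "production acceptance cares most about the worst case" reading in the table.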
+
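+### What each NCCL op means
+
+| Op | Meaning | Typical scenario |
+|---|---|---|
+| allreduce | Every GPU holds data; after the reduction every GPU has the result | Gradient synchronization in data parallelism, the most common case |
+| allgather | Every GPU collects the data shards of all GPUs | Model parallelism, tensor parallelism, parameter/activation gathering |
+| reducescatter | Reduce first, then split the result across GPUs | ZeRO, sharded optimizer state, common in distributed training |
+| broadcast | One GPU broadcasts data to the others | Parameter synchronization, initial weight distribution |
+| sendrecv | Point-to-point send and receive | Pipeline parallelism, custom communication, topology verification |
+| alltoall | Every GPU exchanges different data with every other GPU | MoE, expert parallelism, shuffle-style communication |
+
+Small-message NCCL failures usually come from latency, scheduling, or a strict threshold definition; large-message failures point more toward link bandwidth, topology, NCCL parameters, or NVSwitch/PCIe/NUMA configuration.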
+
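+## Stress Test
+
+Stress Test is a long, high-load stability test. It is not only about "does it finish"; temperature, power, throttling, and error events under full load all count.
+
+| Metric | Meaning | How to read it |
+|---|---|---|
+| duration | Actual stress duration | Production acceptance usually needs 30/60 minutes |
+| source | Stress source, for example `pytorch` or `gpu-burn` | States which load is used to stress the GPU |
+| dtype | Data type of the stress computation, for example BF16 | Affects Tensor Core usage, power, and temperature |
+| matrix_size | GEMM matrix edge length | Larger sizes sustain high occupancy more easily |
+| memory_pct | Target memory usage percentage | Avoids testing only a tiny load |
+| Avg steady power | Steady-state average power | Shows whether the card is really being loaded |
+| Max steady temp | Steady-state maximum temperature | Shows the cooling ceiling |
+| Temp delta | Difference between the hottest and coolest of the 8 GPUs | A large gap indicates uneven airflow, cooling, or slot position |
+| TFLOPS jitter | Steady-state throughput fluctuation | Large fluctuation means unstable performance |
+| Throttle events | Number of throttling events | Non-idle throttle hurts performance stability |
+| XID events | New XID errors during the stress run | An XID is usually a serious risk |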
+
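+### Common throttle meanings
+
+| Code | Common meaning | Explanation |
+|---|---|---|
+| `0x1` | idle throttle | Clocks lowered while idle; usually not a real problem |
+| `0x4` | `sw_power_cap` | The software power cap is reached; performance may be limited by the power wall |
+| `0x8` | hardware slowdown | Hardware-triggered slowdown |
+| `0x10` | sync boost | Clocks held to match other GPUs in a sync boost group |
+| `0x20` | software thermal slowdown | Slowdown triggered by the software thermal policy |
+| `0x40` | hardware thermal slowdown | Slowdown triggered by temperature |
+| `0x80` | hardware power brake | External power delivery or hardware power protection |
+
+The `sw_power_cap` in the current reports shows that the load really did push the cards up to the power wall, but the acceptance criteria count non-idle throttle as a failure reason because it affects sustained output over long runs.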
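A throttle mask such as `0x4` is a bit field, so several reasons can be set at once. The sketch below decodes a mask into reason names; the bit values follow NVML's `nvmlClocksThrottleReason*` constants, while the short names and the helpers `decode_throttle` and `non_idle_reasons` are hypothetical and not taken from the project.

```python
# Bit values follow NVML's clocks-throttle-reason constants;
# the short names are labels chosen for this sketch.
THROTTLE_REASONS = {
    0x1: "gpu_idle",
    0x2: "applications_clocks_setting",
    0x4: "sw_power_cap",
    0x8: "hw_slowdown",
    0x10: "sync_boost",
    0x20: "sw_thermal_slowdown",
    0x40: "hw_thermal_slowdown",
    0x80: "hw_power_brake_slowdown",
    0x100: "display_clock_setting",
}


def decode_throttle(mask):
    """Return the names of every reason bit set in the mask."""
    return [name for bit, name in THROTTLE_REASONS.items() if mask & bit]


def non_idle_reasons(mask):
    """Reasons other than idle, i.e. the ones the guide treats as failures."""
    return [name for name in decode_throttle(mask) if name != "gpu_idle"]


if __name__ == "__main__":
    # nvidia-smi prints the mask as a hex string; int(..., 16) accepts the 0x prefix.
    mask = int("0x0000000000000004", 16)
    print(decode_throttle(mask), non_idle_reasons(mask))
```

Under the strictest reading quoted in this commit (the mask must stay `0x0` for the whole run), even `gpu_idle` would count; `non_idle_reasons` reflects the looser "non-idle throttle" reading.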
+
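+## RDMA / InfiniBand
+
+The RDMA test measures IB NIC and network link performance. Single-node loopback and cross-node server/client are two different kinds of evidence and must not be mixed.
+
+| Metric | Meaning | How to read it |
+|---|---|---|
+| Device | IB device name, for example `mlx5_0` | Maps to a specific HCA/port |
+| Port | Port number | Usually port 1 |
+| State | Port state, for example ACTIVE/DOWN | Only ACTIVE counts as a usable link |
+| Rate | Port rate, for example 400 Gb/sec | Below expectation means link degradation or the wrong network |
+| GID/LID | IB addressing information | Used by `ibping` and for cross-node troubleshooting |
+| ib_write_bw | RDMA write bandwidth | Throughput of the client writing to the remote side |
+| ib_read_bw | RDMA read bandwidth | Throughput of the client reading from the remote side |
+| ib_write_lat | RDMA write latency | Small-message write latency |
+| ib_read_lat | RDMA read latency | Small-message read latency |
+| ibping | IB-layer connectivity test | Shows whether the LID/GID layer is reachable |
+| PFC/ECN/CNP counters | Congestion and flow-control counters | Non-zero or growing values may indicate congestion, packet loss, or flow-control problems |
+
+### Single-node vs. cross-node
+
+| Methodology | Meaning | What it can prove | What it cannot prove |
+|---|---|---|---|
+| `local_loopback` | perftest server/client both started on the same machine | Tools, devices, and the local port basically work | That the RDMA network between two machines meets the bar |
+| Cross-node server/client | One machine is the server, the other the client | Real cross-node RDMA bandwidth/latency | Requires explicit server_addr, ib_device, ib_port, ibping_target |
+
+RDMA read bandwidth below write bandwidth is common, but production acceptance sets separate thresholds for read and write. When read misses its threshold, check HCA firmware, BIOS, PCIe, NUMA, RoCE/IB configuration, switches, PFC/ECN, cables, and port rate.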
+
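+## Training Simulation
+
+Training Simulation uses a synthetic 1.5B Transformer training load to verify that 8-GPU distributed training runs stably.
+
+| Metric | Meaning | How to read it |
+|---|---|---|
+| Model | Model type | Currently synthetic 1.5B; no real dataset required |
+| Parameters | Parameter count | Confirms the load reaches the expected scale |
+| GPU Count | Number of GPUs in the run | The production methodology requires 8-GPU DDP |
+| DType | Training numeric format, for example BF16 | BF16 is common for large-model training |
+| Batch Size | Batch size per step | Affects throughput and memory |
+| Seq Length | Sequence length | Affects compute and memory |
+| Steps | Training steps counted in the statistics | Too few steps make the statistics unstable |
+| Warmup Steps | Warmup step count | Keeps CUDA initialization, compilation, and cold caches out of the numbers |
+| Avg Step Time | Average time per step | Lower is better |
+| Throughput | tokens/sec | The core training throughput metric |
+| Samples/sec | Samples per second | A secondary measure of processing speed |
+| Peak Memory | Peak GPU memory | Shows whether the run is near OOM or under-using memory |
+| Final Loss | Last loss value | Confirms the value is finite, with no NaN/Inf |
+| Step Jitter | Step time jitter | Large jitter means unstable training |
+| Distributed Mode | Distributed mode | Must be `ddp` to satisfy the 8-GPU distributed methodology |
+
+Training PASS means the 8-GPU DDP training path, NCCL functional connectivity, PyTorch CUDA, and basic numerical stability are all fine. It cannot replace the NCCL performance test, because a training load may not exercise every communication pattern and message size.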
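How Throughput, Step Jitter, and the Final Loss check combine into a verdict can be sketched as follows. Several things here are assumptions rather than facts from the commit: that `batch_size` is per GPU, that tokens/sec is `batch_size * seq_length * num_gpus / avg_step_time`, and that jitter is the largest relative deviation of a post-warmup step from the mean. `training_verdict` is a hypothetical name; the defaults (batch 8, sequence 2048, 8 GPUs, 5 warmup steps, 45000 tokens/s, 3%) mirror the README config in this commit.

```python
import math
from statistics import mean


def training_verdict(step_times_s, final_loss, batch_size_per_gpu=8, seq_length=2048,
                     num_gpus=8, warmup_steps=5, min_tokens_per_sec=45000.0,
                     max_step_jitter_pct=3.0):
    """Judge a training run from its per-step wall times and final loss."""
    steady = step_times_s[warmup_steps:]  # drop warmup steps from the statistics
    avg_step = mean(steady)
    # Assumption: batch size is per GPU, so global tokens scale with num_gpus.
    tokens_per_step = batch_size_per_gpu * seq_length * num_gpus
    tokens_per_sec = tokens_per_step / avg_step
    # Assumed jitter definition: worst relative deviation from the mean step time.
    jitter_pct = max(abs(t - avg_step) for t in steady) / avg_step * 100.0

    ok = (tokens_per_sec >= min_tokens_per_sec
          and jitter_pct <= max_step_jitter_pct
          and math.isfinite(final_loss))  # rejects NaN and +/-Inf
    return {
        "tokens_per_sec": tokens_per_sec,
        "step_jitter_pct": jitter_pct,
        "loss_finite": math.isfinite(final_loss),
        "verdict": "PASS" if ok else "FAIL",
    }


if __name__ == "__main__":
    times = [5.0] * 5 + [2.0, 2.02, 1.98, 2.0]  # invented step times, seconds
    print(training_verdict(times, final_loss=3.2))
```

Under these assumptions one step processes 8 x 2048 x 8 = 131072 tokens, so a 2-second average step gives about 65.5k tokens/s, and the step time would have to stay under roughly 2.9 seconds to clear the 45k gate.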
+
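+## Common misreadings
+
+1. `DCGM PASS` does not equal whole-machine acceptance PASS. DCGM is part of the official diagnostics and does not cover every business performance gate.
+2. `Training PASS` does not equal NCCL performance PASS. A training run that completes only shows the functional path works; NCCL bus BW can still miss the bar.
+3. `NVLink PASS` does not equal NCCL PASS. Correct link counts and clean error counters do not mean every NCCL op/size reaches its threshold.
+4. `ibping PASS` does not equal RDMA bandwidth PASS. `ibping` proves connectivity only, not throughput or latency.
+5. `local_loopback` cannot be used as cross-node RDMA evidence. Cross-node acceptance needs evidence from both the server and client ends.
+6. A stress run that lasts the full 30 minutes does not equal PASS. Temperature delta, power, throttle, XID, and jitter all have to be checked together.
+7. Low small-message NCCL numbers do not necessarily mean a broken link; latency, algorithm, launch overhead, or the threshold definition may be the cause. Production acceptance still judges by the threshold.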
+
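+## Suggested troubleshooting priorities
+
+| Failed item | What to check first |
+|---|---|
+| Compute FAIL | GPU clocks, power policy, MIG/MPS, background processes, PyTorch/CUDA versions, whether the benchmark algorithm hits the intended Tensor Core path |
+| NCCL FAIL | `NCCL_DEBUG=INFO`, topology, NVSwitch/NVLink, NCCL algorithm, message size, PCIe/NUMA, process CPU pinning |
+| Stress FAIL | Chassis airflow, fans, ambient temperature, power cap, `nvidia-smi -q -d POWER,CLOCK,TEMPERATURE` |
+| RDMA FAIL | Port rate, HCA firmware, cables, switches, PFC/ECN, NUMA, BIOS, cross-node server/client configuration |
+| Training FAIL | torchrun, NCCL environment variables, CUDA OOM, loss NaN/Inf, DDP initialization, network/shared memory |
+
+## In one sentence
+
+This report does not just check that the GPUs light up and training runs. It verifies, together: hardware detection, baseline health, memory and interconnect bandwidth, compute throughput, multi-GPU communication, long full-load stability, the IB/RDMA network, the official DCGM diagnostics, and the 8-GPU training path. Any key item FAIL means the whole machine fails production acceptance.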
--- a/docs/multinode_nccl_concepts.md
+++ b/docs/multinode_nccl_concepts.md
@@ -0,0 +1,362 @@
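+# Multi-node, multi-GPU NCCL testing: concepts
+
+This document covers concepts first and does not touch script changes. The goal is to understand, for two 8-GPU H100 servers, which layers a multi-node communication test should verify step by step, and what each layer actually proves.
+
+Example machines:
+
+| Alias | Hostname | Internal IP | GPU |
+|---|---|---|---|
+| nccl-gpu-1 | aikubeworker0012 | 172.72.8.12 | 8 x H100 |
+| nccl-gpu-2 | aikubeworker0016 | 172.72.8.16 | 8 x H100 |
+
+Together the two machines have 16 GPUs. The core question of a multi-node NCCL test is whether these 16 GPUs can complete collective communication efficiently and correctly over the right GPU, NVLink, PCIe, and IB/RDMA network paths.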
+
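+## 1. Overall approach
+
+Multi-node, multi-GPU communication testing is a bottom-up process. Lower layers are closer to hardware and links; upper layers are closer to real training workloads.
+
+```mermaid
+flowchart TD
+    L0["0. Physical and basic connectivity<br/>Power / GPU / NIC / cables / switch / SSH"] --> L1["1. System recognition layer<br/>nvidia-smi / lspci / ibstat / ibdev2netdev"]
+    L1 --> L2["2. Single-node GPU health layer<br/>Temperature / power / ECC / PCIe / Throttling / NVLink Topo"]
+    L2 --> L3["3. Single-node GPU performance layer<br/>HBM bandwidth / H2D-D2H / FP32-TF32-FP16-BF16-FP8 compute"]
+    L3 --> L4["4. Single-node multi-GPU communication layer<br/>Single-node 8-GPU NCCL over NVLink/NVSwitch"]
+    L4 --> L5["5. Cross-node network and RDMA layer<br/>IP connectivity / IB Active / RDMA bandwidth / RDMA latency"]
+    L5 --> L6["6. Cross-node NCCL layer<br/>Two-node 16-GPU AllReduce / AllGather / ReduceScatter / Broadcast / AllToAll"]
+    L6 --> L7["7. Training workload layer<br/>torchrun / Megatron / DeepSpeed / business training load test"]
+```
+
+The most important principle:
+
+**A failure at an upper layer is not necessarily an upper-layer problem.**
+
+For example, if the two-node `all_reduce_perf` fails, the cause may be NCCL, but it may equally be SSH, MPI, IB, GID, NIC selection, driver version, CUDA version, NCCL version, or GPU Direct RDMA.
+
+So the troubleshooting order should be:
+
+```text
+basic connectivity -> single-node health -> single-node performance -> single-node NCCL -> cross-node RDMA -> cross-node NCCL -> training workload
+```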
+
+## 2. Communication paths for two nodes and 16 GPUs
+
+Inside a node, traffic mostly uses NVLink/NVSwitch; across nodes, data must pass through the GPU, PCIe/NVLink, the NIC, the switch, and the peer NIC.
+
+```mermaid
+flowchart LR
+    subgraph A["aikubeworker0012 / 172.72.8.12"]
+        A0["GPU0"] --- ASW["NVSwitch / NVLink"]
+        A1["GPU1"] --- ASW
+        A2["..."] --- ASW
+        A7["GPU7"] --- ASW
+        ASW --> ANIC["IB/RDMA NIC(s)"]
+    end
+
+    subgraph NET["InfiniBand / RoCE Fabric"]
+        SW["IB Switch"]
+    end
+
+    subgraph B["aikubeworker0016 / 172.72.8.16"]
+        BNIC["IB/RDMA NIC(s)"] --> BSW["NVSwitch / NVLink"]
+        B0["GPU0"] --- BSW
+        B1["GPU1"] --- BSW
+        B2["..."] --- BSW
+        B7["GPU7"] --- BSW
+    end
+
+    ANIC <--> SW
+    SW <--> BNIC
+```
+
+There are two distinct communication domains here:
+
+| Domain | Typical path | Main tests |
+|---|---|---|
+| 8 GPUs within a node | GPU -> NVLink/NVSwitch -> GPU | Single-node NCCL, NVLink topo, D2D |
+| 16 GPUs across nodes | GPU -> NIC -> IB/RDMA network -> NIC -> GPU | RDMA, cross-node NCCL |
+
+Performance thresholds for the two domains must not be mixed. Single-node NVSwitch is very fast, cross-node RDMA is generally slower, and the cross-node NCCL bottleneck is usually the IB/RDMA network.
+
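+## 3. What to test at each layer
+
+### 3.1 Basic connectivity layer
+
+This layer only proves that the machines are reachable and that identities and addresses are correct.
+
+Confirm:
+
+| Check | Purpose |
+|---|---|
+| SSH in both directions | Multi-node MPI/NCCL launch depends on starting processes on the remote side |
+| Correct hostname | Avoids logging into the wrong machine |
+| Correct IP | Confirms the training network, or the network matching IB/RDMA, is in use |
+| Time synchronization | Makes long-run training logs and timeout debugging more reliable |
+
+This layer proves nothing about GPU or RDMA performance, only that "the machines can find each other".
+
+### 3.2 System recognition layer
+
+This layer proves that the system can see the GPUs and NICs.
+
+Common information:
+
+| Tool | What to look at |
+|---|---|
+| `nvidia-smi` | GPU count, model, driver, CUDA, temperature, power |
+| `nvidia-smi topo -m` | GPU, NIC, CPU NUMA, NVLink/NVSwitch topology |
+| `ibstat` | IB devices, port state, link rate |
+| `ibdev2netdev` | Mapping between mlx5 devices and network interfaces |
+| `/sys/class/infiniband` | Port state, link layer, rate, GID |
+
+This layer is critical, because NCCL often ends up on TCP or on the wrong interface after picking the wrong NIC.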
+
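+### 3.3 Single-node GPU health layer
+
+This layer proves that each machine is healthy by itself.
+
+```mermaid
+flowchart LR
+    H["Single-node health check"] --> T["Temperature"]
+    H --> P["Power"]
+    H --> E["ECC errors"]
+    H --> PCIE["PCIe Gen/Width"]
+    H --> C["SM/Mem Clock"]
+    H --> TH["Throttling"]
+    H --> PM["Persistence Mode"]
+```
+
+If any card runs too hot, reports ECC double-bit errors, has a degraded PCIe link, or is throttling, later NCCL results cannot be trusted even if the tests run.
+
+### 3.4 Single-node GPU performance layer
+
+This layer proves that the GPUs in each machine perform normally.
+
+| Test | What it proves |
+|---|---|
+| HBM/D2D bandwidth | GPU memory and device-to-device copy capability |
+| H2D/D2H bandwidth | The PCIe path between CPU/host and GPU |
+| FP32/TF32 | Basic matrix compute capability |
+| FP16/BF16/FP8 | Tensor Core capability commonly used in training |
+
+This step is single-node acceptance. It cannot prove that communication between the two machines works, but it rules out "one machine's own GPU compute or bandwidth is abnormal".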
+
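+### 3.5 Single-node multi-GPU NCCL layer
+
+This layer verifies collective communication among the 8 GPUs of one machine.
+
+```mermaid
+flowchart TD
+    S["Single-node 8-GPU NCCL"] --> AR["AllReduce"]
+    S --> AG["AllGather"]
+    S --> RS["ReduceScatter"]
+    S --> BC["Broadcast"]
+    S --> AT["AllToAll"]
+```
+
+Single-node NCCL mainly shows whether the NVLink/NVSwitch communication path is healthy. Common metrics:
+
+| Metric | Meaning |
+|---|---|
+| `algbw` | Effective bandwidth from the algorithm's point of view |
+| `busbw` | Bandwidth from the bus's point of view, better suited to comparing link utilization |
+| `#wrong` | Number of incorrect results; must be 0 |
+
+Passing the single-node test only shows that 8-GPU communication inside one server works.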
+
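+### 3.6 Cross-node RDMA layer
+
+This layer verifies the network and RDMA capability between the two machines, without involving NCCL.
+
+```mermaid
+sequenceDiagram
+    participant N1 as aikubeworker0012
+    participant FAB as IB/RDMA Fabric
+    participant N2 as aikubeworker0016
+
+    N1->>N2: ping / ssh
+    N1->>FAB: ib_write_bw client
+    FAB->>N2: ib_write_bw server
+    N1->>FAB: ib_read_bw client
+    FAB->>N2: ib_read_bw server
+    N1->>N2: ib_write_lat / ib_read_lat
+```
+
+This layer answers:
+
+| Question | Notes |
+|---|---|
+| Is the IB port Active? | If not, there is no point running NCCL |
+| Does RDMA bandwidth meet the bar? | Proves the network data plane works |
+| Is RDMA latency normal? | High latency hurts small messages and training synchronization |
+| Is it InfiniBand or RoCE? | The two differ in environment variables and troubleshooting points |
+
+If the RDMA layer fails, cross-node NCCL will most likely fail too, or fall back to TCP.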
+
+### 3.7 Cross-node NCCL layer
+
+This layer is the actual multi-node, multi-GPU NCCL test.
+
+Two 8-GPU machines usually give:
+
+```text
+2 nodes x 8 GPUs = 16 ranks
+each rank is bound to 1 GPU
+```
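The "2 nodes x 8 GPUs = 16 ranks" layout can be written out explicitly. The sketch assumes a block mapping in which ranks are numbered contiguously within each node (ranks 0-7 on the first node, 8-15 on the second); the actual mapping depends on how the launcher and hostfile are set up, and `rank_layout` is a hypothetical helper.

```python
def rank_layout(num_nodes=2, gpus_per_node=8):
    """List every rank with its node index and local GPU index.

    Assumption: block mapping, i.e. ranks are contiguous within a node.
    """
    world_size = num_nodes * gpus_per_node
    return [
        {"rank": rank,
         "node": rank // gpus_per_node,        # which server the rank runs on
         "local_gpu": rank % gpus_per_node}    # which GPU it binds to on that server
        for rank in range(world_size)
    ]


if __name__ == "__main__":
    for entry in rank_layout():
        print(entry)
```

With this mapping, rank 8 is GPU0 on the second node and rank 15 is GPU7 on the second node.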
+
+Conceptually:
+
+```mermaid
+flowchart LR
+    subgraph N1["Node 1: 172.72.8.12"]
+        R0["rank 0 / GPU0"]
+        R1["rank 1 / GPU1"]
+        R2["..."]
+        R7["rank 7 / GPU7"]
+    end
+
+    subgraph N2["Node 2: 172.72.8.16"]
+        R8["rank 8 / GPU0"]
+        R9["rank 9 / GPU1"]
+        R10["..."]
+        R15["rank 15 / GPU7"]
+    end
+
+    R0 <--> R8
+    R1 <--> R9
+    R7 <--> R15
+    N1 <--> N2
+```
+
+Typical test items:
+
+| NCCL test | What it maps to in training |
+|---|---|
+| AllReduce | Gradient synchronization in data parallelism |
+| ReduceScatter | ZeRO/FSDP gradient sharding |
+| AllGather | ZeRO/FSDP parameter gathering |
+| Broadcast | Parameter broadcast, initialization |
+| AllToAll | MoE, expert parallelism, some parallel strategies |
+| SendRecv | Point-to-point communication, pipeline parallel |
+
+For cross-node NCCL, check:
+
+| Metric | Verdict |
+|---|---|
+| Did all 16 ranks start? | Whether MPI/SSH/paths/environment are in order |
+| `#wrong == 0` | Correctness must pass |
+| `busbw` | Utilization of the cross-node communication link |
+| Is traffic on IB/RDMA? | Must be confirmed from `NCCL_DEBUG=INFO` |
+| Did it fall back to TCP? | If so, performance will be noticeably low |
+
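+## 4. Why NCCL is split into single-node and cross-node
+
+Single-node 8-GPU communication and cross-node 16-GPU communication have different bottlenecks.
+
+```mermaid
+flowchart TD
+    A["NCCL performance result"] --> B{"Test scope"}
+    B --> C["Single node, 8 GPUs"]
+    B --> D["Cross-node, 16 GPUs"]
+
+    C --> C1["Main bottleneck: NVLink / NVSwitch"]
+    C --> C2["Threshold can follow the GPU's NVLink capability"]
+
+    D --> D1["Main bottleneck: IB/RDMA network"]
+    D --> D2["Threshold should follow NIC count, rate, topology, and rail count"]
+```
+
+So a single-node NVLink threshold cannot be applied directly to cross-node NCCL. Cross-node thresholds must be set from the real network capability, for example:
+
+| Network configuration | How to think about the theoretical ceiling |
+|---|---|
+| One 400G NIC | About 50 GB/s raw unidirectional bandwidth |
+| Eight 400G NICs | About 400 GB/s raw aggregate bandwidth |
+| Measured NCCL busbw | Affected by topology, GDR, rail, NUMA, switches, and the NCCL algorithm |
+
+For real acceptance, first find out how many IB/RDMA NICs each machine has, the rate of each, and the GPU-to-NIC topology, and only then set the cross-node NCCL threshold.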
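The raw-bandwidth figures in the table come from a simple bits-to-bytes conversion: 400 Gb/s divided by 8 is 50 GB/s per NIC, and 8 NICs give 400 GB/s. The sketch below reproduces that arithmetic; `raw_gbytes_per_sec` and `fraction_of_line_rate` are hypothetical helper names, and protocol overhead is deliberately ignored, so these are upper bounds rather than expected results.

```python
def raw_gbytes_per_sec(link_gbps, num_nics=1):
    """Convert a per-NIC line rate in Gb/s to raw aggregate GB/s (8 bits per byte)."""
    return link_gbps / 8.0 * num_nics


def fraction_of_line_rate(measured_gbytes_per_sec, link_gbps, num_nics=1):
    """How close a measured bandwidth is to the raw ceiling (0.0 to 1.0)."""
    return measured_gbytes_per_sec / raw_gbytes_per_sec(link_gbps, num_nics)


if __name__ == "__main__":
    print(raw_gbytes_per_sec(400))               # one 400G NIC
    print(raw_gbytes_per_sec(400, num_nics=8))   # eight 400G NICs
    print(fraction_of_line_rate(47.0, 400))      # the 47 GB/s RDMA gate vs. one NIC
```

Seen this way, the 47 GB/s RDMA threshold quoted in this commit corresponds to 94% of a single 400G NIC's raw rate.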
+
+## 5. Common failure points
+
+```mermaid
+flowchart TD
+    F["Cross-node NCCL failure"] --> A["Fails to start"]
+    F --> B["Starts but is slow"]
+    F --> C["Times out while running"]
+    F --> D["Result #wrong is non-zero"]
+
+    A --> A1["SSH unreachable"]
+    A --> A2["Remote path does not exist"]
+    A --> A3["Inconsistent MPI environment"]
+    A --> A4["Running as root not allowed"]
+
+    B --> B1["Wrong NCCL_SOCKET_IFNAME"]
+    B --> B2["Not on IB/RDMA, fell back to TCP"]
+    B --> B3["Wrong NCCL_IB_HCA selection"]
+    B --> B4["GPU Direct RDMA not in effect"]
+
+    C --> C1["Unstable IB port"]
+    C --> C2["Switch/PFC/ECN problem"]
+    C --> C3["NCCL timeout setting"]
+    C --> C4["Driver/CUDA/NCCL version incompatibility"]
+
+    D --> D1["Communication correctness failure"]
+    D --> D2["Must FAIL; bandwidth alone is not enough"]
+```
+
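+## 6. Recommended acceptance order
+
+The recommended order for two 8-GPU machines:
+
+```mermaid
+flowchart TD
+    A["Step 1: Basic info on both machines"] --> B["Step 2: Single-node GPU health on both machines"]
+    B --> C["Step 3: Single-node benchmark on both machines"]
+    C --> D["Step 4: Single-node 8-GPU NCCL on each machine"]
+    D --> E["Step 5: RDMA bandwidth/latency between the two machines"]
+    E --> F["Step 6: Two-node 16-GPU NCCL correctness"]
+    F --> G["Step 7: Two-node 16-GPU NCCL performance"]
+    G --> H["Step 8: Two-node training demo or business load test"]
+```
+
+What each step is for:
+
+| Step | Purpose |
+|---|---|
+| Step 1 | Confirm you are on the right machines and that the basic network and environment exist |
+| Step 2 | Rule out GPU health problems |
+| Step 3 | Rule out single-GPU/single-node performance problems |
+| Step 4 | Rule out single-node NVLink/NVSwitch/NCCL problems |
+| Step 5 | Rule out cross-node RDMA problems |
+| Step 6 | Prove NCCL correctness first |
+| Step 7 | Then prove NCCL performance |
+| Step 8 | Finally verify stability with a real training shape |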
+
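+## 7. Mapping to the current scripts
+
+How the existing modules relate to the layers above:
+
+| Current module | Layer covered | Notes |
+|---|---|---|
+| `gpu_info` | System recognition layer | Single node |
+| `health` | Single-node GPU health layer | Single node |
+| `benchmark` | Single-node GPU performance layer | Single node |
+| `nccl` | Single-node multi-GPU communication layer | Currently mostly single node |
+| `rdma` | RDMA check | Currently leans toward local checks, not two-machine tests |
+| `stress` | Stability | Single node |
+| `training` | Training workload layer | Currently leans toward single node |
+| Proposed new `multi_node_nccl` | Cross-node NCCL layer | Dedicated to hostfile, mpirun, multi-node environment, and result parsing |
+
+If the scripts are extended later, the natural direction is a new multi-node module rather than cramming all the logic into the existing `nccl` module.
+
+## 8. Minimal mental model
+
+This is the one thing to remember:
+
+```text
+Single-node NCCL verifies NVLink/NVSwitch between GPUs.
+Cross-node RDMA verifies the network between machines.
+Cross-node NCCL verifies that NCCL can combine GPUs and network into efficient communication for real training.
+```
+
+Multi-node, multi-GPU testing is therefore not a single command but a chain of verification.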
+
--- a/gpu_tester.py
+++ b/gpu_tester.py
@ -5,6 +5,7 @@ import argparse
 import json
 import os
 import signal
+import socket
 import sys
 import time
 from datetime import datetime
@ -25,6 +26,8 @@ from modules.nccl_test import NCCLTest
 from modules.training_sim import TrainingSim
 from modules.stress_test import StressTest
 from modules.rdma_test import RDMATest
+from modules.nvlink_test import NVLinkTest
+from modules.dcgm_test import DCGMTest
 from modules.report import ReportGenerator
 from modules.gpu_specs import detect_gpu_type, get_gpu_specs, get_gpu_label, get_supported_gpus, validate_driver_compatibility

@ -32,43 +35,87 @@ DEFAULT_CONFIG = {
    "benchmark": {
        "memory": {"size_mb": 4096, "iterations": 10, "nvbandwidth_buffer_mb": 512, "nvbandwidth_samples": 3},
        "compute": {
-            "dtypes": ["fp32", "tf32", "fp16", "bf16", "fp8"],
-            "matrix_size": 4096,
-            "warmup": 10,
-            "iterations": 100,
+            "dtypes": ["fp32", "tf32", "fp16", "bf16", "fp8", "fp64", "int8"],
+            "matrix_size": 8192,
+            "warmup": 50,
+            "iterations": 500,
+            "use_compile": True,
        },
    },
-    "health": {"temp_warning": 80, "temp_critical": 90, "power_limit": None},
+    "health": {"temp_warning": 75, "temp_critical": 85, "power_limit": None},
    "nccl": {
        "min_bandwidth_gbps": None,
        "test_allreduce": True,
        "test_alltoall": True,
        "test_broadcast": True,
-        "test_reduce_scatter": False,
-        "test_allgather": False,
-        "test_sendrecv": False,
+        "test_reduce_scatter": True,
+        "test_allgather": True,
+        "test_sendrecv": True,
+        "message_sizes": ["1M", "256M", "2G"],
+        "repeats": 3,
+        "max_stddev_pct": 3,
    },
    "stress": {
-        "duration_sec": 60,
+        "duration_sec": 1800,
+        "production_duration_sec": 1800,
+        "use_gpu_burn": False,
        "use_doubles": False,
        "use_tensor_cores": True,
        "memory_pct": 90,
        "gpus": "all",
+        "dtype": "bf16",
+        "matrix_size": 24576,
+        "telemetry_interval_sec": 1,
+        "warmup_sec": 60,
+        "min_steady_samples": 10,
+        "max_temp_c": 80,
+        "max_temp_delta_c": 5,
+        "min_power_watts": 630,
+        "max_tflops_jitter_pct": 5,
+        "require_tflops_jitter": True,
    },
    "rdma": {
-        "min_bandwidth_gbps": 50,
-        "max_latency_us": 10,
+        "min_bandwidth_gbps": 47,
+        "min_port_rate_gbps": 400,
+        "max_latency_us": 3.5,
+        "max_write_latency_us": 2.0,
+        "max_read_latency_us": 3.5,
        "ib_iterations": 1000,
-        "msg_size": 65536,
+        "msg_size": 4194304,
+        "latency_msg_size": 8,
        "ib_device": None,
        "ib_port": 1,
+        "server_addr": None,
+        "ibping_target": None,
+        "ibping_count": 5,
+        "role": "auto",
+        "pfc_ecn_counters": True,
+    },
+    "nvlink": {
+        "expected_links_per_gpu": 18,
+        "expected_link_speed_gbps": 25,
+        "require_zero_errors": True,
+    },
+    "dcgm": {
+        "diag_level": 3,
+        "timeout_sec": 1200,
+        "expected_num_gpus": 8,
+        "json_output": True,
+        "require_subtests": True,
    },
    "training": {
-        "model": "gpt2",
+        "model": "synthetic_1.5b",
        "batch_size": 8,
        "seq_length": 2048,
        "num_steps": 50,
+        "warmup_steps": 5,
        "dtype": "bf16",
+        "mode": "ddp",
+        "synthetic_params_b": 1.5,
+        "min_tokens_per_sec": 45000,
+        "max_step_jitter_pct": 3,
+        "max_peak_memory_gb": 70,
+        "require_distributed": True,
    },
    "report": {"output_dir": "./reports", "format": "json"},
    "tools": {"install_dir": "/opt/gpu-test-tools"},
@ -131,7 +178,7 @@ def interactive_menu(config: dict):
    if not check_prerequisites(console):
        return

-    results_store: dict = {"timestamp": datetime.now().isoformat(), "tests": {}}
+    results_store: dict = {"timestamp": datetime.now().isoformat(), "hostname": socket.gethostname(), "tests": {}}

    menu_items = [
        ("1", "GPU Information", "gpu_info"),
@ -139,10 +186,12 @@ def interactive_menu(config: dict):
        ("3", "Memory Benchmark (nvbandwidth)", "memory_bench"),
        ("4", "Compute Benchmark", "compute_bench"),
        ("5", "NCCL Multi-GPU Test", "nccl"),
-        ("6", "GPU Stress Test (gpu-burn)", "stress"),
+        ("6", "GPU Stress Test (PyTorch/gpu-burn)", "stress"),
        ("7", "RDMA/IB Test", "rdma"),
-        ("8", "Training Simulation", "training"),
-        ("9", "Full Test Suite (All Tests)", "all"),
+        ("8", "NVLink/NVSwitch Test", "nvlink"),
+        ("9", "DCGM Diagnostic", "dcgm"),
+        ("10", "Training Simulation", "training"),
+        ("11", "Full Test Suite (All Tests)", "all"),
        ("0", "Generate Report", "report"),
    ]

@ -164,8 +213,10 @@ def interactive_menu(config: dict):
            "memory_bench": "HBM bandwidth via nvbandwidth",
            "compute_bench": "GEMM TFLOPS across FP32/TF32/FP16/BF16/FP8",
            "nccl": "AllReduce, AllToAll, Broadcast via nccl-tests",
-            "stress": "Long-running GPU stress via gpu-burn",
+            "stress": "Long-running high-power GEMM stress with telemetry",
            "rdma": "InfiniBand bandwidth & latency (ib_write_bw)",
+            "nvlink": "NVLink links, speed, and error counters",
+            "dcgm": "DCGM diag -r 3 production diagnostic",
            "training": "Simulate LLM training with PyTorch",
            "all": "Run all tests sequentially",
            "report": "Export results to JSON/HTML",
@ -257,6 +308,18 @@ def _run_test(test_name: str, config: dict, console: Console) -> dict:
            m.print_results(result)
            return result

+        elif test_name == "nvlink":
+            m = NVLinkTest(config)
+            result = m.run()
+            m.print_results(result)
+            return result
+
+        elif test_name == "dcgm":
+            m = DCGMTest(config)
+            result = m.run()
+            m.print_results(result)
+            return result
+
        elif test_name == "training":
            m = TrainingSim(config)
            result = m.run()
@ -280,15 +343,17 @@ def _run_test(test_name: str, config: dict, console: Console) -> dict:
 def _run_full_suite(config: dict, console: Console) -> dict:
    """Run all tests sequentially."""
    console.print(Panel("[bold cyan]Running Full Test Suite[/bold cyan]", box=box.DOUBLE))
-    all_results: dict = {"timestamp": datetime.now().isoformat()}
+    all_results: dict = {"timestamp": datetime.now().isoformat(), "hostname": socket.gethostname()}
    tests = [
        ("gpu_info", "GPU Information", GPUInfo),
        ("health", "Health Check", HealthCheck),
        ("memory_bench", "Memory Benchmark", lambda c: Benchmark(c)),
        ("compute_bench", "Compute Benchmark", lambda c: Benchmark(c)),
+        ("nvlink", "NVLink/NVSwitch Test", NVLinkTest),
        ("nccl", "NCCL Test", NCCLTest),
        ("stress", "GPU Stress Test", StressTest),
        ("rdma", "RDMA/IB Test", RDMATest),
+        ("dcgm", "DCGM Diagnostic", DCGMTest),
        ("training", "Training Simulation", TrainingSim),
    ]

@ -313,14 +378,49 @@ def _run_full_suite(config: dict, console: Console) -> dict:
    # Summary
    console.print("\n" + "=" * 60)
    # Only count test results, exclude metadata like timestamp
-    test_results = {k: v for k, v in all_results.items() if k != "timestamp"}
-    passed = sum(1 for v in test_results.values() if not isinstance(v, dict) or "error" not in v)
+    test_results = {k: v for k, v in all_results.items() if k not in ("timestamp", "hostname")}
+    passed = sum(1 for v in test_results.values() if _test_result_passed(v))
    total = len(test_results)
    color = "green" if passed == total else ("yellow" if passed > 0 else "red")
    console.print(f"[bold {color}]Suite complete: {passed}/{total} tests passed[/bold {color}]")
    return all_results


+def _test_result_passed(result) -> bool:
+    """Strict production verdict helper for full-suite exit status."""
+    if not isinstance(result, dict):
+        return True
+    if result.get("error"):
+        return False
+    if result.get("skipped") or result.get("status") == "SKIP":
+        return False
+    if result.get("source") == "torchrun_fallback":
+        return False
+    if "passed" in result:
+        return bool(result.get("passed"))
+    if "memory" in result:
+        mem = result["memory"]
+        # A non-dict payload cannot be judged, so treat it as a failure instead of crashing.
+        if not isinstance(mem, dict):
+            return False
+        if "passed" in mem:
+            return bool(mem.get("passed"))
+        if mem.get("error") or mem.get("source") == "pytorch":
+            return False
+        eff = mem.get("d2d_efficiency_pct") or mem.get("efficiency_pct") or 0
+        return eff >= 80
+    if "compute" in result:
+        comp = result["compute"]
+        if not isinstance(comp, dict):
+            return False
+        if "passed" in comp:
+            return bool(comp.get("passed"))
+        thresholds = comp.get("pass_thresholds_tflops", {}) or {}
+        per_dtype = comp.get("per_dtype_tflops", {}) or {}
+        for dt, threshold in thresholds.items():
+            val = per_dtype.get(dt)
+            if not isinstance(val, (int, float)) or val < threshold:
+                return False
+        consistency = comp.get("consistency", {}) or {}
+        return all(c.get("passed", False) for c in consistency.values())
+    return True
+
+
 def main():
    gpu_list_str = " / ".join(g.upper() for g in get_supported_gpus())
    parser = argparse.ArgumentParser(
@ -335,15 +435,17 @@ Examples:
   python gpu_tester.py --test benchmark --type memory
   python gpu_tester.py --test benchmark --type compute --dtype fp16
   python gpu_tester.py --test nccl            # NCCL test
+   python gpu_tester.py --test nvlink          # NVLink/NVSwitch test
+   python gpu_tester.py --test dcgm            # DCGM diagnostic
   python gpu_tester.py --test training        # Training sim
   python gpu_tester.py --test all             # Full suite
   python gpu_tester.py --report --format json --output report.json
        """,
    )
-    parser.add_argument("--test", choices=["gpu-info", "health", "benchmark", "nccl", "stress", "rdma", "training", "all"],
+    parser.add_argument("--test", choices=["gpu-info", "health", "benchmark", "nccl", "stress", "rdma", "nvlink", "dcgm", "training", "all"],
                        help="Run a specific test")
    parser.add_argument("--type", choices=["memory", "compute"], help="Benchmark type (with --test benchmark)")
-    parser.add_argument("--dtype", choices=["fp32", "tf32", "fp16", "bf16", "fp8"],
+    parser.add_argument("--dtype", choices=["fp32", "tf32", "fp16", "bf16", "fp8", "fp64", "int8"],
                        help="Compute benchmark dtype (with --test benchmark --type compute)")
    parser.add_argument("--interactive", action="store_true", help="Force interactive mode")
    parser.add_argument("--report", action="store_true", help="Generate report from last results")
@ -399,6 +501,8 @@ Examples:
        "nccl": "nccl",
        "stress": "stress",
        "rdma": "rdma",
+        "nvlink": "nvlink",
+        "dcgm": "dcgm",
        "training": "training",
        "all": "all",
    }
@ -415,19 +519,30 @@ Examples:
            result = bench.run()
            Benchmark.print_results(result)
        if args.report:
-            ReportGenerator(config).generate({"benchmark": result, "timestamp": datetime.now().isoformat()},
+            ReportGenerator(config).generate({
+                "benchmark": result,
+                "timestamp": datetime.now().isoformat(),
+                "hostname": socket.gethostname(),
+            },
                                             fmt=args.format, output=args.output)
+        sys.exit(0 if _test_result_passed(result) else 1)
    elif args.test == "all":
        results = _run_full_suite(config, console)
        if args.report:
            ReportGenerator(config).generate(results, fmt=args.format, output=args.output)
-        has_errors = any("error" in v for v in results.values() if isinstance(v, dict))
-        sys.exit(1 if has_errors else 0)
+        failed = any(not _test_result_passed(v) for k, v in results.items() if k not in ("timestamp", "hostname"))
+        sys.exit(1 if failed else 0)
    else:
        result = _run_test(test_map[args.test], config, console)
        if args.report and result:
-            ReportGenerator(config).generate({args.test: result, "timestamp": datetime.now().isoformat()},
+            report_key = test_map[args.test] or args.test
+            ReportGenerator(config).generate({
+                report_key: result,
+                "timestamp": datetime.now().isoformat(),
+                "hostname": socket.gethostname(),
+            },
                                             fmt=args.format, output=args.output)
+        sys.exit(0 if _test_result_passed(result) else 1)


 if __name__ == "__main__":
--- a/modules/dcgm_test.py
+++ b/modules/dcgm_test.py
@ -0,0 +1,231 @@
+"""DCGM diagnostic acceptance wrapper."""
+
+import json
+import os
+import re
+import shutil
+import signal
+import subprocess
+from datetime import datetime
+from typing import Optional
+
+from rich.console import Console
+from rich.table import Table
+
+
+class DCGMTest:
+    def __init__(self, config: dict):
+        self.config = config
+        self.console = Console()
+        self.cfg = config.get("dcgm", {})
+
+    def run(self) -> dict:
+        dcgmi = shutil.which("dcgmi")
+        if not dcgmi:
+            return {
+                "passed": False,
+                "error": "dcgmi not found",
+                "timestamp": datetime.now().isoformat(),
+            }
+
+        level = str(self.cfg.get("diag_level", 3))
+        timeout = int(self.cfg.get("timeout_sec", 1200))
+        cmd = [dcgmi, "diag", "-r", level]
+        expected_gpus = self.cfg.get("expected_num_gpus")
+        if expected_gpus:
+            cmd.extend(["-n", f"gpu:{int(expected_gpus)}"])
+        if self.cfg.get("json_output", True):
+            cmd.append("-j")
+
+        try:
+            r = self._run_with_process_group_timeout(cmd, timeout)
+        except subprocess.TimeoutExpired as e:
+            output = ((e.output or "") + "\n" + (e.stderr or "")).strip()
+            return {
+                "passed": False,
+                "error": f"dcgmi diag -r {level} timeout after {timeout}s",
+                "command": cmd,
+                "raw_output_tail": output[-8000:],
+                "timestamp": datetime.now().isoformat(),
+            }
+
+        output = r.stdout + "\n" + r.stderr
+        subtests = self._parse_json_output(output) or self._parse_output(output)
+        strict_statuses = {"PASS"}
+        failed = [s for s in subtests if s["status"] not in strict_statuses]
+        require_subtests = bool(self.cfg.get("require_subtests", True))
+        passed = r.returncode == 0 and not failed and (bool(subtests) or not require_subtests)
+        return {
+            "passed": passed,
+            "returncode": r.returncode,
+            "level": int(level),
+            "command": cmd,
+            "expected_num_gpus": int(expected_gpus) if expected_gpus else None,
+            "subtests": subtests,
+            "raw_output_tail": output[-8000:],
+            "timestamp": datetime.now().isoformat(),
+        }
+
+    @staticmethod
+    def _run_with_process_group_timeout(cmd: list[str], timeout: int) -> subprocess.CompletedProcess:
+        proc = subprocess.Popen(
+            cmd,
+            stdout=subprocess.PIPE,
+            stderr=subprocess.PIPE,
+            text=True,
+            start_new_session=True,
+        )
+        try:
+            stdout, stderr = proc.communicate(timeout=timeout)
+        except subprocess.TimeoutExpired as e:
+            try:
+                os.killpg(proc.pid, signal.SIGTERM)
+                stdout, stderr = proc.communicate(timeout=10)
+            except subprocess.TimeoutExpired:
+                os.killpg(proc.pid, signal.SIGKILL)
+                stdout, stderr = proc.communicate(timeout=10)
+            raise subprocess.TimeoutExpired(cmd, timeout, output=stdout, stderr=stderr) from e
+        return subprocess.CompletedProcess(cmd, proc.returncode, stdout, stderr)
+
+    @classmethod
+    def _parse_json_output(cls, output: str) -> list[dict]:
+        text = output.strip()
+        if not text:
+            return []
+        try:
+            payload = json.loads(text)
+        except json.JSONDecodeError:
+            m = re.search(r"(\{.*\})", text, re.S)
+            if not m:
+                return []
+            try:
+                payload = json.loads(m.group(1))
+            except json.JSONDecodeError:
+                return []
+
+        dcgm_payload = payload.get("DCGM Diagnostic") if isinstance(payload, dict) else None
+        if isinstance(dcgm_payload, dict):
+            parsed = cls._parse_dcgm_diagnostic_json(dcgm_payload)
+            if parsed:
+                return parsed
+
+        subtests = []
+
+        def walk(node, path: list[str]):
+            if isinstance(node, dict):
+                node_name = (
+                    node.get("name")
+                    or node.get("testName")
+                    or node.get("test_name")
+                    or node.get("category")
+                    or node.get("category_name")
+                )
+                child_path = [*path, str(node_name)] if node_name else path
+                status = node.get("status") or node.get("result") or node.get("Result")
+                if isinstance(status, str):
+                    name = (
+                        node_name
+                        or " / ".join(path[-3:])
+                    )
+                    normalized = cls._normalize_status(status)
+                    if normalized:
+                        subtests.append({
+                            "name": str(name)[:160],
+                            "status": normalized,
+                            "raw": json.dumps(node, default=str)[:1000],
+                        })
+                for key, value in node.items():
+                    walk(value, [*child_path, str(key)])
+            elif isinstance(node, list):
+                for idx, item in enumerate(node):
+                    walk(item, [*path, str(idx)])
+
+        walk(payload, [])
+        return subtests
+
+    @classmethod
+    def _parse_dcgm_diagnostic_json(cls, payload: dict) -> list[dict]:
+        subtests = []
+        for category in payload.get("test_categories", []) or []:
+            category_name = str(category.get("category") or "DCGM")
+            for test in category.get("tests", []) or []:
+                test_name = str(test.get("name") or "unnamed")
+                for result in test.get("results", []) or []:
+                    status = cls._normalize_status(str(result.get("status", "")))
+                    if not status:
+                        continue
+                    entity_group = result.get("entity_group") or "entity"
+                    entity_id = result.get("entity_id", "unknown")
+                    name = f"{category_name}/{test_name}/{entity_group}{entity_id}"
+                    subtests.append({
+                        "name": name[:160],
+                        "status": status,
+                        "raw": json.dumps(result, default=str)[:1000],
+                    })
+                summary = test.get("test_summary") or {}
+                status = cls._normalize_status(str(summary.get("status", "")))
+                if status:
+                    subtests.append({
+                        "name": f"{category_name}/{test_name}/summary"[:160],
+                        "status": status,
+                        "raw": json.dumps(summary, default=str)[:1000],
+                    })
+        return subtests
+
+    @staticmethod
+    def _normalize_status(status: str) -> str:
+        s = status.strip().upper()
+        aliases = {
+            "PASS": "PASS",
+            "PASSED": "PASS",
+            "OK": "PASS",
+            "FAIL": "FAIL",
+            "FAILED": "FAIL",
+            "ERROR": "ERROR",
+            "WARN": "WARN",
+            "WARNING": "WARN",
+            "SKIP": "SKIP",
+            "SKIPPED": "SKIP",
+            "NOT_RUN": "SKIP",
+            "NOT RUN": "SKIP",
+        }
+        return aliases.get(s, s if s in {"PASS", "FAIL", "ERROR", "WARN", "SKIP"} else "")
+
+    @staticmethod
+    def _parse_output(output: str) -> list[dict]:
+        subtests = []
+        for line in output.splitlines():
+            stripped = line.strip()
+            if not stripped:
+                continue
+            m = re.search(r"(.+?)\s*[:|]\s*(PASS|FAIL|WARN|ERROR|SKIP)\b", stripped, re.I)
+            if not m:
+                m = re.search(r"\b(PASS|FAIL|WARN|ERROR|SKIP)\b\s*[-:|]\s*(.+)", stripped, re.I)
+                if m:
+                    status = DCGMTest._normalize_status(m.group(1))
+                    name = m.group(2).strip()
+                else:
+                    continue
+            else:
+                name = m.group(1).strip(" .|-")
+                status = DCGMTest._normalize_status(m.group(2))
+            if name and len(name) < 160:
+                subtests.append({"name": name, "status": status, "raw": stripped})
+        return subtests
+
+    @staticmethod
+    def print_results(results: dict, console: Optional[Console] = None):
+        c = console or Console()
+        if results.get("error"):
+            c.print(f"[bold red]DCGM error: {results['error']}[/bold red]")
+            return
+        passed = results.get("passed", False)
+        c.print("[bold green]✓ DCGM diag PASSED[/bold green]" if passed else "[bold red]✗ DCGM diag FAILED[/bold red]")
+        subtests = results.get("subtests", [])
+        if subtests:
+            table = Table(box=None, padding=(0, 1))
+            table.add_column("Subtest")
+            table.add_column("Status", style="bold")
+            for s in subtests:
+                table.add_row(s.get("name", ""), s.get("status", ""))
+            c.print(table)
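
Reviewer sketch of the strict verdict rule in `DCGMTest.run()`, reduced to a standalone function so the edge cases are easy to see. The function name `strict_verdict` and the sample subtest names are illustrative only. The rule itself mirrors the diff: any status other than `PASS` fails, and an empty subtest list fails unless `require_subtests` is turned off.

```python
def strict_verdict(returncode: int, subtests: list, require_subtests: bool = True) -> bool:
    """Return True only if dcgmi exited 0 and every parsed subtest is PASS."""
    failed = [s for s in subtests if s["status"] != "PASS"]
    has_evidence = bool(subtests) or not require_subtests
    return returncode == 0 and not failed and has_evidence


if __name__ == "__main__":
    all_pass = [{"name": "Hardware/memory/summary", "status": "PASS"}]
    with_warn = all_pass + [{"name": "Stress/pcie/summary", "status": "WARN"}]

    print(strict_verdict(0, all_pass))   # clean run
    print(strict_verdict(0, with_warn))  # WARN is treated as a failure
    print(strict_verdict(0, []))         # exit 0 but nothing parsed
```

The third case is the one that matters for acceptance: an exit code of 0 with no parseable subtests is not accepted as evidence under the default configuration.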
--- a/modules/health_check.py
+++ b/modules/health_check.py
@ -171,6 +171,10 @@ class HealthCheck:
            gpu_health.append({"index": i, "status": worst, "checks": checks})

        system_health = self._check_system()
+        for key in ("fabricmanager", "retired_pages", "kernel_errors"):
+            item = system_health.get(key, {})
+            if isinstance(item, dict) and item.get("status") == "FAIL":
+                overall_pass = False

        return {
            "passed": overall_pass,
@ -228,6 +232,9 @@ class HealthCheck:
            rdma_devs = os.listdir("/sys/class/infiniband_verbs")

        nccl_env = {k: v for k, v in os.environ.items() if k.startswith("NCCL_")}
+        fabric = self._check_fabricmanager()
+        retired = self._check_retired_pages()
+        kernel_errors = self._check_kernel_errors()

        return {
            "nvidia_persistenced": {"installed": persistd, "running": persistd_running},
@ -238,6 +245,41 @@ class HealthCheck:
            "infiniband_devices": ib_devs,
            "rdma_devices": rdma_devs,
            "nccl_env_vars": nccl_env,
+            "fabricmanager": fabric,
+            "retired_pages": retired,
+            "kernel_errors": kernel_errors,
+        }
+
+    def _check_fabricmanager(self) -> dict:
+        r = self._run_cmd(["systemctl", "is-active", "nvidia-fabricmanager"], timeout=5)
+        active = r == "active"
+        logs = self._run_cmd(["journalctl", "-u", "nvidia-fabricmanager", "-n", "200", "--no-pager"], timeout=10) or ""
+        has_error = "ERROR" in logs.upper() or "FAILED" in logs.upper()
+        return {
+            "active": active,
+            "has_error_logs": has_error,
+            "status": "PASS" if active and not has_error else "FAIL",
+        }
+
+    def _check_retired_pages(self) -> dict:
+        import re
+        raw = self._run_cmd(["nvidia-smi", "-q", "-d", "PAGE_RETIREMENT"], timeout=30) or ""
+        nums = [int(x) for x in re.findall(r"Retired Pages.*?:\s*(\d+)", raw, flags=re.I)]
+        # Match "Yes" on the Pending Page Blacklist line itself, not anywhere in the output.
+        pending = bool(re.search(r"Pending Page Blacklist\s*:\s*Yes", raw, flags=re.I))
+        total = sum(nums)
+        return {
+            "retired_pages": total,
+            "pending_blacklist": pending,
+            "status": "PASS" if total == 0 and not pending else "FAIL",
+        }
+
+    def _check_kernel_errors(self) -> dict:
+        raw = self._run_cmd(["dmesg", "--ctime", "--level=err,crit,alert,emerg"], timeout=10) or ""
+        hits = [line for line in raw.splitlines() if any(k in line.upper() for k in ("XID", "AER", "PCIE", "NVRM"))]
+        return {
+            "count": len(hits),
+            "tail": hits[-20:],
+            "status": "PASS" if not hits else "FAIL",
        }

    @staticmethod
--- a/modules/nccl_test.py
+++ b/modules/nccl_test.py
@ -5,6 +5,8 @@ import os
 import re
 import shutil
 import subprocess
+import statistics
+import sys
 from datetime import datetime
 from typing import Optional

@ -70,6 +72,38 @@ class NCCLTest:
                return p
        return None

+    def _message_sizes(self) -> list[str]:
+        return list(self.nccl_cfg.get("message_sizes") or ["1M", "256M", "2G"])
+
+    def _repeats(self) -> int:
+        return int(self.nccl_cfg.get("repeats", 3))
+
+    def _max_stddev_pct(self) -> float:
+        return float(self.nccl_cfg.get("max_stddev_pct", 3))
+
+    def _runtime_env(self) -> dict:
+        env = {**os.environ, "NCCL_DEBUG": "WARN"}
+        lib_dirs = []
+
+        nccl_home = env.get("NCCL_HOME") or self.nccl_cfg.get("nccl_home")
+        if nccl_home:
+            lib_dirs.append(os.path.join(str(nccl_home), "lib"))
+
+        for path in sys.path:
+            lib_dirs.append(os.path.join(path, "nvidia", "nccl", "lib"))
+
+        venv_root = os.path.dirname(os.path.dirname(sys.executable))
+        lib_dirs.extend(glob.glob(os.path.join(venv_root, "lib", "python*", "site-packages", "nvidia", "nccl", "lib")))
+
+        existing = env.get("LD_LIBRARY_PATH", "")
+        valid_dirs = []
+        for d in lib_dirs:
+            if d and os.path.isdir(d) and d not in valid_dirs:
+                valid_dirs.append(d)
+        if valid_dirs:
+            env["LD_LIBRARY_PATH"] = ":".join(valid_dirs + ([existing] if existing else []))
+        return env
+
    def run(self) -> dict:
        gpu_count = 0
        if TORCH_AVAILABLE:
@ -89,7 +123,7 @@ class NCCLTest:
        if self.nccl_cfg.get("test_reduce_scatter", False):
            tests.append(("reduce_scatter_perf", "ReduceScatter"))
        if self.nccl_cfg.get("test_allgather", False):
-            tests.append(("allgather_perf", "AllGather"))
+            tests.append(("all_gather_perf", "AllGather"))
        if self.nccl_cfg.get("test_sendrecv", False):
            tests.append(("sendrecv_perf", "SendRecv"))

@ -170,39 +204,7 @@ class NCCLTest:
        if not binary:
            return {"status": "SKIP", "error": f"{binary_name} not found"}

-        cmd = [
-            binary,
-            "-b", "8M",
-            "-e", "8G",
-            "-f", "2",
-            "-g", str(gpu_count),
-            "-w", "5",
-            "-n", "20",
-        ]
-
-        try:
-            env = os.environ.copy()
-            env["NCCL_DEBUG"] = "WARN"
-            r = subprocess.run(cmd, capture_output=True, text=True, timeout=180, env=env)
-
-            combined = r.stdout + r.stderr
-            # Check for NCCL/CUDA compatibility errors
-            if "CUDA driver version is insufficient" in combined or \
-               "Test NCCL failure" in combined:
-                error_msg = "NCCL/CUDA driver version mismatch" \
-                    if "CUDA driver version" in combined \
-                    else "NCCL test failure (library incompatibility)"
-                return {"status": "FAIL", "error": error_msg}
-
-            if r.returncode != 0:
-                return {"status": "FAIL", "error": r.stderr[:300]}
-
-            return self._parse_nccl_output(r.stdout, min_bw)
-
-        except subprocess.TimeoutExpired:
-            return {"status": "FAIL", "error": "timeout"}
-        except Exception as e:
-            return {"status": "FAIL", "error": str(e)}
+        return self._run_nccl_matrix([binary, "-g", str(gpu_count)], min_bw)

    def _run_one_nccl_test_mpirun(self, binary_name: str, label: str,
                                   gpu_count: int, mpirun: str, min_bw: float) -> dict:
@ -218,37 +220,64 @@ class NCCLTest:
            "-x", "NCCL_DEBUG=WARN",
            "-x", "CUDA_VISIBLE_DEVICES=" + ",".join(str(i) for i in range(gpu_count)),
            binary,
-            "-b", "8",
-            "-e", "256M",
-            "-f", "2",
            "-g", "1",
-            "-w", "5",
-            "-n", "20",
        ]

+        return self._run_nccl_matrix(cmd, min_bw)
+
+    def _run_nccl_matrix(self, base_cmd: list[str], min_bw: float) -> dict:
+        size_results = []
+        failures = []
+        env = self._runtime_env()
+
        try:
-            env = os.environ.copy()
-            env["NCCL_DEBUG"] = "WARN"
-            r = subprocess.run(cmd, capture_output=True, text=True, timeout=180, env=env)
-
-            combined = r.stdout + r.stderr
-            if "CUDA driver version is insufficient" in combined or \
-               "Test NCCL failure" in combined:
-                error_msg = "NCCL/CUDA driver version mismatch" \
-                    if "CUDA driver version" in combined \
-                    else "NCCL test failure (library incompatibility)"
-                return {"status": "FAIL", "error": error_msg}
-
-            if r.returncode != 0:
-                return {"status": "FAIL", "error": r.stderr[:300]}
-
-            return self._parse_nccl_output(r.stdout, min_bw)
+            for size in self._message_sizes():
+                runs = []
+                for _ in range(self._repeats()):
+                    cmd = [*base_cmd, "-b", size, "-e", size, "-f", "2", "-w", "5", "-n", "20"]
+                    r = subprocess.run(cmd, capture_output=True, text=True, timeout=300, env=env)
+                    combined = r.stdout + r.stderr
+                    if "CUDA driver version is insufficient" in combined or "Test NCCL failure" in combined:
+                        failures.append({"size": size, "error": "NCCL/CUDA/library failure"})
+                        continue
+                    if r.returncode != 0:
+                        failures.append({"size": size, "error": r.stderr[:300]})
+                        continue
+                    parsed = self._parse_nccl_output(r.stdout, min_bw)
+                    runs.append(parsed.get("best_busbw_gbps", 0))
+                if runs:
+                    worst = min(runs)
+                    mean = sum(runs) / len(runs)
+                    std_pct = (statistics.pstdev(runs) / mean * 100) if len(runs) > 1 and mean else 0
+                    size_results.append({
+                        "size": size,
+                        "runs_busbw_gbps": [round(v, 1) for v in runs],
+                        "worst_busbw_gbps": round(worst, 1),
+                        "mean_busbw_gbps": round(mean, 1),
+                        "stddev_pct": round(std_pct, 2),
+                        "status": "PASS" if worst >= min_bw and std_pct <= self._max_stddev_pct() else "FAIL",
+                    })
+                else:
+                    size_results.append({"size": size, "status": "FAIL", "runs_busbw_gbps": []})

        except subprocess.TimeoutExpired:
            return {"status": "FAIL", "error": "timeout"}
        except Exception as e:
            return {"status": "FAIL", "error": str(e)}

+        best_bus = max((r.get("mean_busbw_gbps", 0) for r in size_results), default=0)
+        worst_bus = min((r.get("worst_busbw_gbps", 0) for r in size_results if r.get("runs_busbw_gbps")), default=0)
+        passed = bool(size_results) and all(r.get("status") == "PASS" for r in size_results) and not failures
+        return {
+            "status": "PASS" if passed else "FAIL",
+            "best_busbw_gbps": round(best_bus, 1),
+            "worst_busbw_gbps": round(worst_bus, 1),
+            "min_required_gbps": min_bw,
+            "max_stddev_pct": self._max_stddev_pct(),
+            "by_size": size_results,
+            "failures": failures,
+        }
+
    @staticmethod
    def _parse_nccl_output(stdout: str, min_bw: float) -> dict:
        """Parse nccl-tests tabular output and extract bandwidth results."""
@ -363,7 +392,7 @@ dist.destroy_process_group()
            r = subprocess.run(
                [torchrun_cmd, f"--nproc_per_node={gpu_count}", tmp.name],
                capture_output=True, text=True, timeout=120,
-                env={**os.environ, "NCCL_DEBUG": "WARN"},
+                env=self._runtime_env(),
            )
            os.unlink(tmp.name)
            
@ -390,10 +419,15 @@ dist.destroy_process_group()
                }
            
            return {
-                "passed": all_passed,
+                # torchrun fallback is a functional smoke only. It never proves
+                # production bus bandwidth, so it must not satisfy acceptance.
+                "passed": False,
+                "functional_passed": all_passed,
                "source": "torchrun_fallback",
                "tests": tests,
                "gpu_count": gpu_count,
+                "error": None if all_passed else "torchrun functional NCCL smoke failed",
+                "acceptance_gap": "nccl-tests bus bandwidth was not measured",
            }
        except Exception as e:
            return {"passed": False, "source": "torchrun_fallback", "error": str(e)}
@ -410,7 +444,8 @@ dist.destroy_process_group()
        
        if source == "torchrun_fallback":
            # Connectivity check mode
-            verdict = "[bold green]✓ NCCL Connectivity OK[/bold green]" if passed else "[bold red]✗ NCCL Connectivity FAILED[/bold red]"
+            functional = results.get("functional_passed", passed)
+            verdict = "[bold yellow]⚠ NCCL bus BW NOT VERIFIED[/bold yellow]" if functional else "[bold red]✗ NCCL Connectivity FAILED[/bold red]"
            c.print(f"{verdict} [dim](basic check via torchrun)[/dim]")
            
            tests = results.get("tests", {})
@ -427,7 +462,7 @@ dist.destroy_process_group()
                    else:
                        c.print(f"  [{s_color}]{op_name}[/{s_color}]")
            
-            c.print("\n[yellow]Note: functional connectivity test only (no performance data)[/yellow]")
+            c.print("\n[yellow]Note: functional connectivity test only (no bus bandwidth data; acceptance FAIL)[/yellow]")
        else:
            # nccl-tests mode
            verdict = "[bold green]✓ NCCL tests PASSED[/bold green]" if passed else "[bold yellow]⚠ NCCL tests WARNING[/bold yellow]"
@ -448,12 +483,16 @@ dist.destroy_process_group()
                if by_size:
                    t = Table(box=None, padding=(0, 1))
                    t.add_column("Size", style="bold", justify="right")
-                    t.add_column("Time (us)", justify="right")
-                    t.add_column("Alg BW (GB/s)", justify="right")
-                    t.add_column("Bus BW (GB/s)", justify="right")
+                    t.add_column("Worst Bus BW", justify="right")
+                    t.add_column("Mean Bus BW", justify="right")
+                    t.add_column("StdDev", justify="right")
+                    t.add_column("Status", justify="right")
                    for r in by_size:
-                        sz = r.get("size", 0)
-                        sz_str = f"{sz/1024:.0f}K" if sz < 1048576 else f"{sz/1048576:.0f}M"
-                        t.add_row(sz_str, f"{r.get('time_us',0):.1f}",
-                                  f"{r.get('algbw_gbps',0):.1f}", f"{r.get('busbw_gbps',0):.1f}")
+                        t.add_row(
+                            str(r.get("size", "")),
+                            f"{r.get('worst_busbw_gbps', 0):.1f}",
+                            f"{r.get('mean_busbw_gbps', 0):.1f}",
+                            f"{r.get('stddev_pct', 0):.2f}%",
+                            r.get("status", "?"),
+                        )
                    c.print(t)
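
The per-size rule added in the `nccl_test.py` hunks above can be sketched in isolation: the worst run must meet the bus-bandwidth threshold, and the run-to-run population standard deviation, as a percentage of the mean, must stay under a cap. This is a minimal sketch, not the module's code. The helper name `size_verdict` is hypothetical, and the `max_stddev_pct` value of 2.0 used in the checks is an assumed placeholder for whatever `_max_stddev_pct()` returns.

```python
import statistics


def size_verdict(runs, min_bw, max_stddev_pct):
    """Hypothetical helper mirroring the per-size PASS/FAIL rule in the diff."""
    if not runs:
        # No successful run for this message size counts as a failure.
        return {"status": "FAIL", "runs_busbw_gbps": []}
    worst = min(runs)
    mean = sum(runs) / len(runs)
    # Population stddev as a percentage of the mean; 0 for a single run.
    std_pct = statistics.pstdev(runs) / mean * 100 if len(runs) > 1 and mean else 0.0
    ok = worst >= min_bw and std_pct <= max_stddev_pct
    return {
        "runs_busbw_gbps": [round(v, 1) for v in runs],
        "worst_busbw_gbps": round(worst, 1),
        "mean_busbw_gbps": round(mean, 1),
        "stddev_pct": round(std_pct, 2),
        "status": "PASS" if ok else "FAIL",
    }


# Illustrative inputs only: bandwidths near the 256MB and 1MB AllReduce figures
# quoted in the coverage notes, against a 405 GB/s threshold.
large = size_verdict([420.0, 422.0, 421.0], min_bw=405.0, max_stddev_pct=2.0)
small = size_verdict([23.0, 23.5], min_bw=405.0, max_stddev_pct=2.0)
```

Judging on the worst run rather than the mean means a single slow repeat fails the size, which matches the strict-acceptance intent of the change.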
--- a/modules/nvlink_test.py
+++ b/modules/nvlink_test.py
@ -0,0 +1,188 @@
+"""NVLink / NVSwitch production acceptance checks."""
+
+import re
+import shutil
+import subprocess
+from datetime import datetime
+from typing import Optional
+
+from rich.console import Console
+from rich.table import Table
+
+
+class NVLinkTest:
+    def __init__(self, config: dict):
+        self.config = config
+        self.console = Console()
+        self.cfg = config.get("nvlink", {})
+
+    def _run(self, args: list[str], timeout: int = 60) -> tuple[int, str, str]:
+        if not shutil.which("nvidia-smi"):
+            return 127, "", "nvidia-smi not found"
+        r = subprocess.run(["nvidia-smi", *args], capture_output=True, text=True, timeout=timeout)
+        return r.returncode, r.stdout, r.stderr
+
+    def run(self) -> dict:
+        expected_links = int(self.cfg.get("expected_links_per_gpu", 18))
+        expected_speed = float(self.cfg.get("expected_link_speed_gbps", 25))
+        require_zero_errors = bool(self.cfg.get("require_zero_errors", True))
+
+        rc_s, out_s, err_s = self._run(["nvlink", "-s"])
+        rc_c, out_c, err_c = self._run(["nvlink", "-c"])
+        rc_e, out_e, err_e = self._run(["nvlink", "-e"])
+
+        if rc_s != 0:
+            return {
+                "passed": False,
+                "error": (err_s or out_s or "nvidia-smi nvlink -s failed")[:1000],
+                "timestamp": datetime.now().isoformat(),
+            }
+
+        links = self._parse_status(out_s)
+        if not links:
+            return {
+                "passed": False,
+                "error": "no NVLink status entries parsed from nvidia-smi nvlink -s",
+                "raw_status": out_s[-4000:],
+                "timestamp": datetime.now().isoformat(),
+            }
+        speeds = self._parse_speeds(out_c) if rc_c == 0 else {}
+        status_speeds = self._parse_speeds(out_s)
+        for gpu, gpu_speeds in status_speeds.items():
+            speeds.setdefault(gpu, {}).update({k: v for k, v in gpu_speeds.items() if k not in speeds.get(gpu, {})})
+        errors = self._parse_errors(out_e) if rc_e == 0 else {}
+
+        gpu_results = []
+        overall = True
+        for gpu, gpu_links in sorted(links.items(), key=lambda x: int(x[0])):
+            active = sum(1 for link in gpu_links.values() if link.get("active"))
+            inactive = [lid for lid, link in gpu_links.items() if not link.get("active")]
+            speed_bad = []
+            for lid in gpu_links:
+                speed = speeds.get(gpu, {}).get(lid)
+                if speed is not None and speed < expected_speed:
+                    speed_bad.append({"link": lid, "speed_gbps": speed})
+            err_bad = []
+            if require_zero_errors:
+                for lid, counters in errors.get(gpu, {}).items():
+                    total = sum(v for v in counters.values() if isinstance(v, int))
+                    if total:
+                        err_bad.append({"link": lid, "counters": counters})
+
+            passed = active == expected_links and not inactive and not speed_bad and not err_bad
+            if not passed:
+                overall = False
+            gpu_results.append({
+                "gpu": int(gpu),
+                "active_links": active,
+                "expected_links": expected_links,
+                "inactive_links": inactive,
+                "speed_issues": speed_bad,
+                "error_issues": err_bad,
+                "passed": passed,
+            })
+
+        return {
+            "passed": overall,
+            "expected_links_per_gpu": expected_links,
+            "expected_link_speed_gbps": expected_speed,
+            "require_zero_errors": require_zero_errors,
+            "gpus": gpu_results,
+            "raw_status": out_s[-4000:],
+            "raw_speed": out_c[-4000:] if out_c else "",
+            "raw_errors": out_e[-4000:] if out_e else "",
+            "timestamp": datetime.now().isoformat(),
+        }
+
+    @staticmethod
+    def _parse_status(text: str) -> dict[str, dict[str, dict]]:
+        result: dict[str, dict[str, dict]] = {}
+        gpu = None
+        for line in text.splitlines():
+            m_gpu = re.search(r"GPU\s+(\d+)", line, re.I)
+            if m_gpu:
+                gpu = m_gpu.group(1)
+                result.setdefault(gpu, {})
+                continue
+            if gpu is None:
+                continue
+            m_link = re.search(r"Link\s+(\d+).*?(Active|Inactive|Disabled|Off|Down)", line, re.I)
+            if m_link:
+                state = m_link.group(2)
+                result[gpu][m_link.group(1)] = {
+                    "state": state,
+                    "active": state.lower() == "active",
+                    "raw": line.strip(),
+                }
+                continue
+            m_speed = re.search(r"Link\s+(\d+).*?([0-9.]+)\s*GB/s", line, re.I)
+            if m_speed:
+                result[gpu][m_speed.group(1)] = {
+                    "state": "Active",
+                    "active": True,
+                    "raw": line.strip(),
+                }
+        return result
+
+    @staticmethod
+    def _parse_speeds(text: str) -> dict[str, dict[str, float]]:
+        result: dict[str, dict[str, float]] = {}
+        gpu = None
+        for line in text.splitlines():
+            m_gpu = re.search(r"GPU\s+(\d+)", line, re.I)
+            if m_gpu:
+                gpu = m_gpu.group(1)
+                result.setdefault(gpu, {})
+                continue
+            if gpu is None:
+                continue
+            m_link = re.search(r"Link\s+(\d+).*?([0-9.]+)\s*GB/s", line, re.I)
+            if m_link:
+                result[gpu][m_link.group(1)] = float(m_link.group(2))
+        return result
+
+    @staticmethod
+    def _parse_errors(text: str) -> dict[str, dict[str, dict[str, int]]]:
+        result: dict[str, dict[str, dict[str, int]]] = {}
+        gpu = None
+        link = None
+        for line in text.splitlines():
+            m_gpu = re.search(r"GPU\s+(\d+)", line, re.I)
+            if m_gpu:
+                gpu = m_gpu.group(1)
+                result.setdefault(gpu, {})
+                continue
+            m_link = re.search(r"Link\s+(\d+)", line, re.I)
+            if m_link and gpu is not None:
+                link = m_link.group(1)
+                result[gpu].setdefault(link, {})
+            if gpu is None or link is None:
+                continue
+            for name in ("CRC", "Replay", "Recovery"):
+                m = re.search(rf"{name}[^0-9]*(\d+)", line, re.I)
+                if m:
+                    result[gpu][link][name.lower()] = int(m.group(1))
+        return result
+
+    @staticmethod
+    def print_results(results: dict, console: Optional[Console] = None):
+        c = console or Console()
+        if results.get("error"):
+            c.print(f"[bold red]NVLink error: {results['error']}[/bold red]")
+            return
+        passed = results.get("passed", False)
+        c.print("[bold green]✓ NVLink PASSED[/bold green]" if passed else "[bold red]✗ NVLink FAILED[/bold red]")
+        table = Table(box=None, padding=(0, 1))
+        table.add_column("GPU", style="bold")
+        table.add_column("Active Links", justify="right")
+        table.add_column("Issues")
+        for g in results.get("gpus", []):
+            issues = []
+            if g.get("inactive_links"):
+                issues.append("inactive=" + ",".join(g["inactive_links"]))
+            if g.get("speed_issues"):
+                issues.append(f"speed={len(g['speed_issues'])}")
+            if g.get("error_issues"):
+                issues.append(f"errors={len(g['error_issues'])}")
+            table.add_row(str(g["gpu"]), f"{g['active_links']}/{g['expected_links']}", "; ".join(issues) or "OK")
+        c.print(table)
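
The line-oriented parsing used by `NVLinkTest` (a `GPU <n>` header line, then one `Link <n>` line per link) can be exercised without hardware. The sketch below is a simplified stand-in rather than the module's parser. The function name `count_active_links` is hypothetical, and the sample text is a hand-written assumption about the layout, not captured `nvidia-smi nvlink -s` output.

```python
import re

# Hand-written sample; the exact layout printed by nvidia-smi is assumed here.
SAMPLE = """\
GPU 0: EXAMPLE-GPU
    Link 0: 26.562 GB/s
    Link 1: <inactive>
GPU 1: EXAMPLE-GPU
    Link 0: 26.562 GB/s
"""


def count_active_links(text):
    """Count links that report a GB/s rate, grouped by GPU index."""
    active = {}
    gpu = None
    for line in text.splitlines():
        m_gpu = re.search(r"GPU\s+(\d+)", line, re.I)
        if m_gpu:
            gpu = int(m_gpu.group(1))
            active.setdefault(gpu, 0)
            continue
        if gpu is None:
            continue  # ignore anything before the first GPU header
        if re.search(r"Link\s+(\d+):\s*([0-9.]+)\s*GB/s", line, re.I):
            active[gpu] += 1
    return active


counts = count_active_links(SAMPLE)
# A GPU passes the link-count check only if every expected link is active.
expected_links = 2  # illustrative; the module defaults to 18 per GPU
under_provisioned = [g for g, n in counts.items() if n != expected_links]
```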
--- a/modules/report.py
+++ b/modules/report.py
@ -93,8 +93,8 @@ class ReportGenerator:

    def _generate_html(self, results: dict, output: str) -> str:
        import socket
-        hostname = socket.gethostname()
-        timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
+        hostname = results.get("hostname") or socket.gethostname()
+        timestamp = results.get("timestamp") or datetime.now().strftime("%Y-%m-%d %H:%M:%S")

        sections = []

@ -178,8 +178,8 @@ class ReportGenerator:

    def _generate_markdown(self, results: dict, output: str) -> str:
        import socket
-        hostname = socket.gethostname()
-        timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
+        hostname = results.get("hostname") or socket.gethostname()
+        timestamp = results.get("timestamp") or datetime.now().strftime("%Y-%m-%d %H:%M:%S")

        lines: list[str] = []

@ -201,6 +201,21 @@ class ReportGenerator:
        # --- Summary table ---
        summary_items = self._build_summary(results)
        if summary_items:
+            verdict, failures, missing = self._overall_acceptance_verdict(summary_items)
+            lines.append("## Overall Acceptance Verdict\n")
+            lines.append(f"**Result: {verdict}**")
+            lines.append("")
+            if failures:
+                lines.append("Failed or unverified items:")
+                for name, status in failures:
+                    lines.append(f"- {name}: {status}")
+                lines.append("")
+            if missing:
+                lines.append("Missing required evidence:")
+                for name in missing:
+                    lines.append(f"- {name}")
+                lines.append("")
+
            lines.append("## Summary\n")
            lines.append("| Test | Result |")
            lines.append("|------|--------|")
@ -319,8 +334,6 @@ class ReportGenerator:
                    if use_abs and thr:
                        if val >= thr:
                            status = "PASS"
-                        elif val >= thr * 0.9:
-                            status = "WARN"
                        else:
                            status = "FAIL"
                        lines.append(f"| {dt.upper()} | {val:.1f} | {pk:.0f} | >= {thr} | {status} |")
@ -331,30 +344,123 @@ class ReportGenerator:
                        overall_status = status
            lines.append("")
            if use_abs:
+                if any(not row.get("passed", False) for row in (comp_data.get("consistency", {}) or {}).values()):
+                    overall_status = "FAIL"
                lines.append(f"**Verdict: {overall_status}** (absolute TFLOPS thresholds; worst efficiency {worst_eff:.1f}%)\n")
            else:
                overall_status = "PASS" if worst_eff >= 80 else ("WARN" if worst_eff >= 50 else "FAIL")
                lines.append(f"**Verdict: {overall_status}** (worst efficiency {worst_eff:.1f}%)\n")

+            consistency = comp_data.get("consistency", {}) or {}
+            if consistency:
+                lines.append("### Compute Consistency\n")
+                lines.append("| DType | Min | Mean | Max | Spread | Limit | Status |")
+                lines.append("|-------|-----|------|-----|--------|-------|--------|")
+                for dt, row in consistency.items():
+                    status = "PASS" if row.get("passed") else "FAIL"
+                    lines.append(
+                        f"| {dt.upper()} | {row.get('min_tflops', 0):.1f} | "
+                        f"{row.get('mean_tflops', 0):.1f} | {row.get('max_tflops', 0):.1f} | "
+                        f"{row.get('spread_pct', 0):.2f}% | <= {row.get('max_allowed_pct', 3)}% | {status} |"
+                    )
+                lines.append("")
+
+            per_gpu = comp_data.get("per_gpu", []) or []
+            dtype_order = [dt for dt in per_dtype.keys() if not isinstance(per_dtype.get(dt), str)]
+            if per_gpu and dtype_order:
+                lines.append("### Compute Per-GPU TFLOPS\n")
+                headers = ["GPU", *[dt.upper() for dt in dtype_order]]
+                lines.append("| " + " | ".join(headers) + " |")
+                lines.append("|" + "|".join(["---"] * len(headers)) + "|")
+                for row in per_gpu:
+                    cells = [str(row.get("index", ""))]
+                    for dt in dtype_order:
+                        val = row.get(dt, "")
+                        cells.append(f"{val:.1f}" if isinstance(val, (int, float)) else str(val))
+                    lines.append("| " + " | ".join(cells) + " |")
+                lines.append("")
+
+        # --- NVLink/NVSwitch ---
+        nvlink = results.get("nvlink")
+        if nvlink and not nvlink.get("error"):
+            lines.append("## NVLink/NVSwitch\n")
+            lines.append(f"**Overall: {'PASS' if nvlink.get('passed') else 'FAIL'}**\n")
+            lines.append("| GPU | Active Links | Issues |")
+            lines.append("|-----|--------------|--------|")
+            for g in nvlink.get("gpus", []):
+                issues = []
+                if g.get("inactive_links"):
+                    issues.append("inactive=" + ",".join(g["inactive_links"]))
+                if g.get("speed_issues"):
+                    issues.append(f"speed issues={len(g['speed_issues'])}")
+                if g.get("error_issues"):
+                    issues.append(f"errors={len(g['error_issues'])}")
+                lines.append(f"| {g.get('gpu')} | {g.get('active_links')}/{g.get('expected_links')} | {', '.join(issues) or 'OK'} |")
+            lines.append("")
+        elif nvlink and nvlink.get("error"):
+            lines.append("## NVLink/NVSwitch\n")
+            lines.append(f"**Overall: FAIL** ({nvlink.get('error')})\n")
+
+        dcgm = results.get("dcgm")
+        if dcgm and not dcgm.get("error"):
+            lines.append("## DCGM Diagnostic\n")
+            lines.append(f"**Overall: {'PASS' if dcgm.get('passed') else 'FAIL'}**\n")
+            if dcgm.get("subtests"):
+                lines.append("| Subtest | Status |")
+                lines.append("|---------|--------|")
+                for s in dcgm.get("subtests", []):
+                    lines.append(f"| {s.get('name', '')} | {s.get('status', '')} |")
+                lines.append("")
+        elif dcgm and dcgm.get("error"):
+            lines.append("## DCGM Diagnostic\n")
+            lines.append(f"**Overall: FAIL** ({dcgm.get('error')})\n")
+
        # --- NCCL ---
        nccl = results.get("nccl")
        if nccl and not nccl.get("error"):
            lines.append("## NCCL Multi-GPU\n")
            lines.append(f"Source: {nccl.get('source', 'unknown')} | "
                         f"GPUs: {nccl.get('gpu_count', '?')}\n")
+            if nccl.get("source") == "torchrun_fallback":
+                lines.append("> Functional NCCL smoke only: nccl-tests bus bandwidth was not measured, so this does not satisfy production acceptance.\n")
            tests = nccl.get("tests", {})
            if tests:
-                lines.append("| Operation | Bus BW (GB/s) | Threshold | Status |")
-                lines.append("|-----------|---------------|-----------|--------|")
+                lines.append("> Summary reports the best Bus BW observed for each operation. PASS/FAIL is evaluated across every tested message size and repeat run shown in the detail table below.\n")
+                lines.append("| Operation | Best Bus BW (GB/s) | Failed Sizes | Threshold | Status |")
+                lines.append("|-----------|--------------------|--------------|-----------|--------|")
                for op, data in tests.items():
                    if isinstance(data, dict) and not data.get("error"):
                        bw = data.get("best_busbw_gbps", 0)
                        req = data.get("min_required_gbps", 0)
                        status = data.get("status", "?")
-                        lines.append(f"| {op} | {bw:.1f} | >= {req:.0f} | {status} |")
+                        failed_sizes = [
+                            str(row.get("size", "?"))
+                            for row in data.get("by_size", [])
+                            if row.get("status") != "PASS"
+                        ]
+                        failed_sizes_text = ", ".join(failed_sizes) if failed_sizes else "-"
+                        lines.append(f"| {op} | {bw:.1f} | {failed_sizes_text} | >= {req:.0f} | {status} |")
                    elif isinstance(data, dict) and data.get("error"):
-                        lines.append(f"| {op} | - | - | ERROR: {data['error']} |")
+                        lines.append(f"| {op} | - | - | - | ERROR: {data['error']} |")
                lines.append("")
+                for op, data in tests.items():
+                    by_size = data.get("by_size", []) if isinstance(data, dict) else []
+                    if not by_size:
+                        continue
+                    lines.append(f"### NCCL {op} by size\n")
+                    lines.append("| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |")
+                    lines.append("|------|---------------------|-------|------|--------|-----------|--------|")
+                    for row in by_size:
+                        runs = ", ".join(str(v) for v in row.get("runs_busbw_gbps", []))
+                        lines.append(
+                            f"| {row.get('size', '')} | {runs} | "
+                            f"{row.get('worst_busbw_gbps', 0):.1f} | "
+                            f"{row.get('mean_busbw_gbps', 0):.1f} | "
+                            f"{row.get('stddev_pct', 0):.2f}% | "
+                            f">= {data.get('min_required_gbps', 0):.0f} | "
+                            f"{row.get('status', '?')} |"
+                        )
+                    lines.append("")
            passed = nccl.get("passed", False)
            lines.append(f"**Overall: {'PASS' if passed else 'FAIL'}**\n")

@ -368,6 +474,21 @@ class ReportGenerator:
            source = stress.get("source", "unknown")
            lines.append(f"- **Source:** {source}")
            lines.append(f"- **Duration:** {elapsed:.0f}s (requested {duration}s)")
+            telemetry = stress.get("telemetry") or {}
+            if telemetry:
+                lines.append(f"- **Telemetry samples:** {telemetry.get('samples', 0)}")
+                lines.append(f"- **Max temp:** {telemetry.get('max_temp_c', {})}")
+                lines.append(f"- **Avg power:** {telemetry.get('avg_power_w', {})}")
+                lines.append(f"- **Temp delta:** {telemetry.get('temp_delta_c', 'N/A')} C")
+                lines.append(f"- **TFLOPS jitter:** {telemetry.get('tflops_jitter_pct', 'N/A')}%")
+                lines.append(f"- **Steady TFLOPS samples:** {telemetry.get('steady_tflops_samples', 0)}")
+                lines.append(f"- **Throttle events:** {telemetry.get('throttle_event_count', len(telemetry.get('throttle_events', [])))}")
+                lines.append(f"- **XID events:** {len(telemetry.get('xid_events', []))}")
+                failures = telemetry.get("failures") or []
+                if failures:
+                    lines.append("- **Failure reasons:**")
+                    for reason in failures:
+                        lines.append(f"  - {reason}")
            lines.append(f"- **Result: {'PASS' if passed else 'FAIL'}**")
            lines.append("")

@ -378,26 +499,70 @@ class ReportGenerator:
            lines.append(f"**Overall: SKIP** [{rdma.get('reason', 'no IB hardware detected')}]\n")
        elif rdma and not rdma.get("error"):
            lines.append("## RDMA/InfiniBand\n")
+            rdma_legacy_note = self._rdma_legacy_note(rdma)
+            if rdma_legacy_note:
+                lines.append(f"> {rdma_legacy_note}\n")
+            port_checks = rdma.get("port_checks", [])
+            if port_checks:
+                lines.append("### RDMA Port Checks\n")
+                lines.append("| Device | Port | State | Rate | Required | Status |")
+                lines.append("|--------|------|-------|------|----------|--------|")
+                for p in port_checks:
+                    lines.append(
+                        f"| {p.get('device', '')} | {p.get('port', '')} | "
+                        f"{p.get('state', '')} | {p.get('rate', '')} | "
+                        f">= {p.get('min_rate_gbps', 400):.0f}Gbps ACTIVE | {p.get('status', '?')} |"
+                    )
+                lines.append("")
            bw_tests = rdma.get("bandwidth_tests", [])
            lat_tests = rdma.get("latency_tests", [])
-            if bw_tests or lat_tests:
+            ibping_tests = rdma.get("ibping_tests", [])
+            if bw_tests or lat_tests or ibping_tests:
                lines.append("| Test | Value | Threshold | Status |")
                lines.append("|------|-------|-----------|--------|")
                for bt in bw_tests:
-                    if not bt.get("error"):
+                    if bt.get("error"):
+                        lines.append(f"| {bt.get('test', 'ib_bw')} | {bt.get('error')} | required runnable test | {bt.get('status', 'FAIL')} |")
+                    else:
+                        threshold, status = self._rdma_bandwidth_verdict(bt)
                        lines.append(f"| {bt['test']} | {bt.get('bandwidth_gbps', 0):.1f} GB/s | "
-                                     f">= {bt.get('min_required_gbps', 0)} GB/s | {bt.get('status', '?')} |")
+                                     f">= {threshold:g} GB/s | {status} |")
                for lt in lat_tests:
-                    if not lt.get("error"):
+                    if lt.get("error"):
+                        lines.append(f"| {lt.get('test', 'ib_lat')} | {lt.get('error')} | required runnable test | {lt.get('status', 'FAIL')} |")
+                    else:
+                        threshold, status = self._rdma_latency_verdict(lt)
                        lines.append(f"| {lt['test']} | {lt.get('latency_us', 0):.2f} us | "
-                                     f"<= {lt.get('max_allowed_us', 0)} us | {lt.get('status', '?')} |")
+                                     f"<= {threshold:g} us | {status} |")
+                for it in ibping_tests:
+                    direction = it.get("direction") or it.get("role", "N/A")
+                    if it.get("error"):
+                        lines.append(f"| {it.get('test', 'ibping')} | {it.get('error')} | bidirectional peer evidence | {it.get('status', 'FAIL')} |")
+                    else:
+                        lines.append(f"| {it['test']} | {direction} target={it.get('target', 'N/A')} count={it.get('count', 'N/A')} | "
+                                     f"0% packet loss | {it.get('status', '?')} |")
                lines.append("")
+            fabric = rdma.get("fabric_counters") or {}
+            if fabric:
+                counters = fabric.get("counters", {})
+                lines.append(f"- **PFC/ECN/CNP/congestion counters checked:** {len(counters)}")
+                lines.append(f"- **PFC/ECN/CNP/congestion non-zero:** {'yes' if fabric.get('failed') else 'no'}")
+                if not counters:
+                    lines.append("- **PFC/ECN/CNP/congestion evidence:** missing")
+            failures = rdma.get("failures") or []
+            if not failures:
+                failures = self._rdma_failure_reasons(rdma)
+            if failures:
+                lines.append("- **Failure reasons:**")
+                for reason in failures:
+                    lines.append(f"  - {reason}")
            passed = rdma.get("passed", False)
            lines.append(f"**Overall: {'PASS' if passed else 'FAIL'}**\n")

        # --- Training ---
        training = results.get("training")
        if training and not training.get("error"):
+            training_status, training_detail, training_missing = self._training_verdict(training)
            lines.append("## Training Simulation\n")
            lines.append("| Metric | Value |")
            lines.append("|--------|-------|")
@ -405,8 +570,14 @@ class ReportGenerator:
            lines.append(f"| Params | {training.get('total_params_m', 0):.1f}M |")
            lines.append(f"| Throughput | {training.get('throughput_tokens_per_sec', 0):.0f} tokens/sec |")
            lines.append(f"| Avg Step Time | {training.get('avg_step_time_ms', 0):.1f} ms |")
+            lines.append(f"| Warmup Steps | {training.get('warmup_steps', 'N/A')} |")
            lines.append(f"| Peak Memory | {training.get('peak_memory_gb', 0):.1f} GB |")
            lines.append(f"| Final Loss | {training.get('final_loss', 'N/A')} |")
+            lines.append(f"| Step Jitter | {training.get('step_jitter_pct', 'N/A')}% |")
+            lines.append(f"| Distributed Mode | {training.get('distributed_mode', 'N/A')} |")
+            if training_missing:
+                lines.append(f"| Acceptance Gaps | missing {', '.join(training_missing)} |")
+            lines.append(f"| Verdict | {training_status} ({training_detail}) |")
            lines.append("")

        # --- Footer ---
@ -441,6 +612,101 @@ class ReportGenerator:
                return bench["compute"]
        return {}

+    @staticmethod
+    def _training_verdict(training: dict) -> tuple[str, str, list[str]]:
+        """Return report status for both current and legacy training result schemas."""
+        tps = float(training.get("throughput_tokens_per_sec", 0) or 0)
+        if "passed" in training:
+            status = "PASS" if training.get("passed") else "FAIL"
+            return status, f"{tps:.0f} tokens/sec", []
+
+        required = ["passed", "step_jitter_pct", "distributed_mode", "loss_finite"]
+        missing = [k for k in required if k not in training]
+        return "UNVERIFIED", f"{tps:.0f} tokens/sec; legacy result lacks explicit acceptance verdict", missing
+
+    def _rdma_cfg_value(self, key: str, default: float) -> float:
+        try:
+            return float((self.config.get("rdma", {}) or {}).get(key, default))
+        except (TypeError, ValueError):
+            return default
+
+    def _rdma_bandwidth_verdict(self, row: dict) -> tuple[float, str]:
+        threshold = self._rdma_cfg_value("min_bandwidth_gbps", 47.0)
+        value = float(row.get("bandwidth_gbps", 0) or 0)
+        return threshold, "PASS" if value >= threshold else "FAIL"
+
+    def _rdma_latency_verdict(self, row: dict) -> tuple[float, str]:
+        name = row.get("test", "")
+        if name == "ib_write_lat":
+            threshold = self._rdma_cfg_value("max_write_latency_us", 2.0)
+        elif name == "ib_read_lat":
+            threshold = self._rdma_cfg_value("max_read_latency_us", 3.5)
+        else:
+            threshold = self._rdma_cfg_value("max_latency_us", 3.5)
+        value = float(row.get("latency_us", 0) or 0)
+        return threshold, "PASS" if 0 < value <= threshold else "FAIL"
+
+    def _rdma_legacy_note(self, rdma: dict) -> str:
+        """Flag old RDMA result schemas whose embedded thresholds were looser."""
+        for row in rdma.get("bandwidth_tests", []) or []:
+            if row.get("min_required_gbps") != self._rdma_cfg_value("min_bandwidth_gbps", 47.0):
+                return (
+                    "Legacy RDMA result re-evaluated with current PDF acceptance thresholds; "
+                    "old WARN statuses and old 50GB/s/10us limits are not used for verdict."
+                )
+        for row in rdma.get("latency_tests", []) or []:
+            threshold, _ = self._rdma_latency_verdict(row)
+            if row.get("max_allowed_us") != threshold:
+                return (
+                    "Legacy RDMA result re-evaluated with current PDF acceptance thresholds; "
+                    "old WARN statuses and old 50GB/s/10us limits are not used for verdict."
+                )
+        return ""
+
+    def _rdma_failure_reasons(self, rdma: dict) -> list[str]:
+        failures = []
+        for row in rdma.get("bandwidth_tests", []) or []:
+            threshold, status = self._rdma_bandwidth_verdict(row)
+            if status != "PASS":
+                failures.append(
+                    f"{row.get('test')} bandwidth {row.get('bandwidth_gbps', 0)}GB/s < {threshold:g}GB/s"
+                )
+        for row in rdma.get("latency_tests", []) or []:
+            threshold, status = self._rdma_latency_verdict(row)
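+            if status != "PASS":
+                if float(row.get("latency_us", 0) or 0) <= 0:
+                    # A missing or zero measurement fails the verdict but is not "above threshold".
+                    failures.append(f"{row.get('test')} latency not measured")
+                else:
+                    failures.append(
+                        f"{row.get('test')} latency {row.get('latency_us', 0)}us > {threshold:g}us"
+                    )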
+        for row in rdma.get("ibping_tests", []) or []:
+            if row.get("status") != "PASS":
+                failures.append(f"{row.get('test')} failed")
+        return failures
+
+    @staticmethod
+    def _overall_acceptance_verdict(summary_items: list[tuple[str, str]]) -> tuple[str, list[tuple[str, str]], list[str]]:
+        """PDF-style machine verdict: every required item must be present and PASS."""
+        required = [
+            "GPU Info",
+            "Health Check",
+            "Memory Bandwidth",
+            "Compute Throughput",
+            "NVLink/NVSwitch",
+            "NCCL",
+            "Stress Test",
+            "RDMA",
+            "DCGM",
+            "Training",
+        ]
+        status_by_name = dict(summary_items)
+        missing = [name for name in required if name not in status_by_name]
+        failures = [
+            (name, status)
+            for name, status in summary_items
+            if name in required and not str(status).startswith("PASS")
+        ]
+        verdict = "PASS" if not missing and not failures else "FAIL"
+        return verdict, failures, missing
+
    def _build_summary(self, results: dict) -> list[tuple[str, str]]:
        """Build summary verdict list from results."""
        items = []
@ -473,7 +739,7 @@ class ReportGenerator:
                d2d = mem.get("d2d_bandwidth_gbps") or 0
                items.append(("Memory Bandwidth", f"WARN ({d2d:.0f} GB/s via PyTorch fallback)"))
            else:
-                eff = mem.get("efficiency_pct") or 0
+                eff = mem.get("d2d_efficiency_pct") or mem.get("efficiency_pct") or 0
                verdict = "PASS" if eff >= 80 else ("WARN" if eff >= 60 else "FAIL")
                items.append(("Memory Bandwidth", f"{verdict} ({eff:.1f}%)"))

@ -491,25 +757,43 @@ class ReportGenerator:
                    rank = {"PASS": 0, "WARN": 1, "FAIL": 2}
                    worst_status = "PASS"
                    worst_dt = None
+                    lowest_margin = None
                    for dt, thr in pass_thresholds.items():
                        val = per_dtype.get(dt)
                        if not isinstance(val, (int, float)):
                            continue
                        if val >= thr:
                            st = "PASS"
-                        elif val >= thr * 0.9:
-                            st = "WARN"
                        else:
                            st = "FAIL"
+                        margin = val / thr if thr else 0
+                        if lowest_margin is None or margin < lowest_margin:
+                            lowest_margin = margin
+                            worst_dt = dt
                        if rank[st] > rank[worst_status]:
                            worst_status = st
-                            worst_dt = dt
                    if worst_dt:
-                        items.append((
-                            "Compute Throughput",
-                            f"{worst_status} (worst {worst_dt.upper()} "
-                            f"{per_dtype[worst_dt]:.0f} vs >= {pass_thresholds[worst_dt]})"
-                        ))
+                        consistency = comp.get("consistency", {}) or {}
+                        failed_consistency = [
+                            (dt, row)
+                            for dt, row in consistency.items()
+                            if not row.get("passed", False)
+                        ]
+                        if failed_consistency:
+                            worst_status = "FAIL"
+                            fail_dt, fail_row = failed_consistency[0]
+                            items.append((
+                                "Compute Throughput",
+                                f"FAIL ({fail_dt.upper()} spread "
+                                f"{fail_row.get('spread_pct', 0):.2f}% > "
+                                f"{fail_row.get('max_allowed_pct', 3)}%)"
+                            ))
+                        else:
+                            items.append((
+                                "Compute Throughput",
+                                f"{worst_status} (worst {worst_dt.upper()} "
+                                f"{per_dtype[worst_dt]:.0f} vs >= {pass_thresholds[worst_dt]})"
+                            ))
                    else:
                        items.append(("Compute Throughput", f"{worst_status}"))
                else:
@ -521,11 +805,32 @@ class ReportGenerator:
                    else:
                        items.append(("Compute Throughput", "N/A"))

+        # NVLink/NVSwitch
+        if "nvlink" in results:
+            nvl = results["nvlink"]
+            if nvl.get("error"):
+                items.append(("NVLink/NVSwitch", f"ERROR: {nvl['error']}"))
+            elif nvl.get("passed"):
+                items.append(("NVLink/NVSwitch", "PASS"))
+            else:
+                items.append(("NVLink/NVSwitch", "FAIL"))
+
+        # DCGM
+        if "dcgm" in results:
+            d = results["dcgm"]
+            if d.get("error"):
+                items.append(("DCGM", f"ERROR: {d['error']}"))
+            elif d.get("passed"):
+                items.append(("DCGM", "PASS"))
+            else:
+                items.append(("DCGM", "FAIL"))
+
        # NCCL
        if "nccl" in results:
            n = results["nccl"]
            if n.get("error"):
                items.append(("NCCL", f"ERROR: {n['error']}"))
+            elif n.get("source") == "torchrun_fallback":
+                items.append(("NCCL", "FAIL (no nccl-tests bus BW)"))
            elif n.get("passed"):
                items.append(("NCCL", "PASS"))
            else:
@ -559,7 +864,7 @@ class ReportGenerator:
            if t.get("error"):
                items.append(("Training", f"ERROR: {t['error']}"))
            else:
-                tps = t.get("throughput_tokens_per_sec", 0)
-                items.append(("Training", f"PASS ({tps:.0f} tokens/sec)"))
+                status, detail, _missing = self._training_verdict(t)
+                items.append(("Training", f"{status} ({detail})"))

        return items
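Note on the strict verdict added above: `_overall_acceptance_verdict` passes only when every required item is present and its status string starts with `PASS`, so a `WARN (...)` or `ERROR: ...` entry counts as a failure. The following is a minimal standalone sketch of that rule, not the module's code; the function name `overall_verdict` and the shortened `REQUIRED` list are assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical, shortened list of required items (the diff above requires ten).
REQUIRED = ["GPU Info", "NCCL", "Stress Test"]


def overall_verdict(
    summary_items: List[Tuple[str, str]],
    required: List[str] = REQUIRED,
) -> Tuple[str, List[Tuple[str, str]], List[str]]:
    """Return (verdict, failures, missing) under the all-present-and-PASS rule."""
    status_by_name = dict(summary_items)
    # An item that never ran is treated as a failure, not as "not applicable".
    missing = [name for name in required if name not in status_by_name]
    # Anything that does not start with "PASS" (WARN, FAIL, ERROR, N/A) fails.
    failures = [
        (name, status)
        for name, status in summary_items
        if name in required and not str(status).startswith("PASS")
    ]
    verdict = "PASS" if not missing and not failures else "FAIL"
    return verdict, failures, missing


if __name__ == "__main__":
    items = [("GPU Info", "PASS"), ("NCCL", "WARN (low bus BW)")]
    print(overall_verdict(items))
```

With the sample input, the NCCL entry is reported as a failure and `Stress Test` as missing, which matches the intent that a partial run cannot produce an overall PASS.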
--- a/modules/stress_test.py
+++ b/modules/stress_test.py
@ -1,9 +1,10 @@
-"""GPU stress test module — wraps gpu-burn for long-running stability tests."""
+"""GPU stress test module — gpu-burn or PyTorch GEMM with telemetry."""

 import glob
 import os
 import shutil
 import subprocess
+import threading
 import time
 from datetime import datetime

@ -46,7 +47,7 @@ class StressTest:
        memory_pct = cfg.get("memory_pct", 90)
        target_gpus = cfg.get("gpus", "all")

-        gpu_burn = self._find_gpu_burn()
+        gpu_burn = self._find_gpu_burn() if cfg.get("use_gpu_burn", False) else ""

        if gpu_burn:
            # Try gpu-burn first
@ -60,7 +61,7 @@ class StressTest:
            
            return result

-        self.console.print("[yellow]gpu_burn not found, using PyTorch stress test[/yellow]")
+        self.console.print("[yellow]Using PyTorch stress test[/yellow]")
        return self._run_pytorch_stress(duration_sec, memory_pct)

    def _run_gpu_burn(self, gpu_burn: str, duration: int,
@ -77,12 +78,26 @@ class StressTest:
        cmd.append(str(duration))

        t0 = time.time()
+        xid_before = self._collect_xid_events()
+        interval = int(self.stress_cfg.get("telemetry_interval_sec", 1))
+        telemetry = []
+        stop_sampling = threading.Event()
+        sampler = threading.Thread(
+            target=self._sample_telemetry,
+            args=(telemetry, stop_sampling, interval),
+            daemon=True,
+        )
+        sampler.start()
        try:
            r = subprocess.run(cmd, capture_output=True, text=True, timeout=duration + 120)
            elapsed = round(time.time() - t0, 1)
+            stop_sampling.set()
+            sampler.join(timeout=interval + 1)

            output = r.stdout + r.stderr
-            passed = r.returncode == 0
+            xid_events = self._new_xid_events(xid_before, self._collect_xid_events())
+            telemetry_summary = self._evaluate_telemetry(telemetry, [], xid_events)
+            passed = r.returncode == 0 and telemetry_summary.get("passed", False)

            gpu_results = []
            for line in output.split("\n"):
@ -96,25 +111,36 @@ class StressTest:
                "duration_sec": duration,
                "elapsed_sec": elapsed,
                "gpu_results": gpu_results,
+                "telemetry": telemetry_summary,
                "raw_output_tail": output[-500:] if output else "",
                "timestamp": datetime.now().isoformat(),
            }

        except subprocess.TimeoutExpired:
+            stop_sampling.set()
            return {
                "source": "gpu-burn",
                "passed": False,
                "duration_sec": duration,
                "error": "timeout",
+                "telemetry": self._evaluate_telemetry(
+                    telemetry, [], self._new_xid_events(xid_before, self._collect_xid_events())
+                ),
                "timestamp": datetime.now().isoformat(),
            }
        except Exception as e:
+            stop_sampling.set()
            return {
                "source": "gpu-burn",
                "passed": False,
                "error": str(e),
+                "telemetry": self._evaluate_telemetry(
+                    telemetry, [], self._new_xid_events(xid_before, self._collect_xid_events())
+                ),
                "timestamp": datetime.now().isoformat(),
            }
+        finally:
+            stop_sampling.set()

    def _run_pytorch_stress(self, duration: int, memory_pct: int = 90) -> dict:
        try:
@ -127,58 +153,79 @@ class StressTest:
        gpu_count = torch.cuda.device_count()
        self.console.print(f"[cyan]PyTorch Stress Test ({duration}s, {gpu_count} GPUs, target {memory_pct}% memory)[/cyan]")

+        dtype_name = self.stress_cfg.get("dtype", "bf16")
+        matrix_size = int(self.stress_cfg.get("matrix_size", 8192))
+        interval = int(self.stress_cfg.get("telemetry_interval_sec", 1))
+        dtype_map = {"fp16": torch.float16, "bf16": torch.bfloat16, "fp32": torch.float32}
+        dtype = dtype_map.get(dtype_name, torch.bfloat16)
+
        gpu_status = {}
+        telemetry = []
+        stop_sampling = threading.Event()
        t0 = time.time()
+        xid_before = self._collect_xid_events()

        try:
+            sampler = threading.Thread(
+                target=self._sample_telemetry,
+                args=(telemetry, stop_sampling, interval),
+                daemon=True,
+            )
+            sampler.start()
            tensors = {}
+            ballast = {}
+            pass_tflops = []
            for i in range(gpu_count):
                with torch.cuda.device(i):
-                    # Get actual free memory (accounting for other processes)
                    free_mem, total_mem = torch.cuda.mem_get_info(i)
-                    
-                    # Calculate allocation from configured memory_pct
-                    target_mem = int(total_mem * memory_pct / 100)
-                    
-                    # Cap at actual free memory with 5% safety margin
-                    alloc_bytes = min(target_mem, int(free_mem * 0.95))
-                    
-                    # matmul(A, A.T) needs 2x input memory (input + output)
-                    mem_side = int((alloc_bytes / 4 / 2) ** 0.5)
-                    # Cap compute matrix so a single matmul completes in ~2s on H100/H200
-                    # (FP32 ≈ 67 TFLOPS → 2*4096³/67e12 ≈ 2s). Without this cap, a 141GB
-                    # HBM yields side ≈ 131K → single matmul ~68s × 8 GPUs serial → loop
-                    # overshoots a 60s duration request by 10×+.
-                    MAX_COMPUTE_SIDE = 4096
-                    side = min(mem_side, MAX_COMPUTE_SIDE)
-
-                    actual_mem_mb = side * side * 4 / 1024 / 1024
+                    side = matrix_size
+                    elem = torch.tensor([], dtype=dtype).element_size()
+                    compute_bytes = side * side * elem * 3
+                    target_mem = min(int(total_mem * memory_pct / 100), int(free_mem * 0.90))
+                    ballast_bytes = max(0, target_mem - compute_bytes)
+                    if ballast_bytes:
+                        ballast_elems = ballast_bytes // 2
+                        ballast[i] = torch.empty(ballast_elems, device=f"cuda:{i}", dtype=torch.float16)
+                    actual_mem_mb = (compute_bytes + ballast_bytes) / 1024 / 1024
                    total_mem_mb = total_mem / 1024 / 1024
                    free_mem_mb = free_mem / 1024 / 1024

                    self.console.print(
                        f"  [dim]GPU {i}: total {total_mem_mb:.0f}MB, free {free_mem_mb:.0f}MB, "
                        f"alloc {actual_mem_mb:.0f}MB ({actual_mem_mb/total_mem_mb*100:.0f}%) - "
-                        f"matrix {side}x{side}[/dim]"
+                        f"{dtype_name} matrix {side}x{side}[/dim]"
+                    )
+                    tensors[i] = (
+                        torch.randn(side, side, device=f"cuda:{i}", dtype=dtype),
+                        torch.randn(side, side, device=f"cuda:{i}", dtype=dtype),
+                        torch.empty(side, side, device=f"cuda:{i}", dtype=dtype),
                    )
-                    tensors[i] = torch.randn(side, side, device=f"cuda:{i}", dtype=torch.float32)

            self.console.print(f"\n[cyan]Starting stress test for {duration} seconds...[/cyan]")
            
            elapsed_check = 0
            while time.time() - t0 < duration:
+                loop_start = time.perf_counter()
                # Dispatch matmul on all GPUs in parallel — do NOT synchronize between
                # GPUs, otherwise the 8 GPUs run serially and overshoot the duration.
                for i in range(gpu_count):
                    with torch.cuda.device(i):
-                        tensors[i] = torch.matmul(tensors[i], tensors[i].T)
+                        a, b, out = tensors[i]
+                        torch.matmul(a, b, out=out)
                # Single sync per pass — waits for all 8 streams concurrently
                for i in range(gpu_count):
                    with torch.cuda.device(i):
                        torch.cuda.synchronize()
+                loop_elapsed = time.perf_counter() - loop_start
+                current_elapsed = time.time() - t0
+                if loop_elapsed > 0:
+                    flops = gpu_count * 2 * (matrix_size ** 3)
+                    pass_tflops.append({
+                        "elapsed_sec": current_elapsed,
+                        "tflops": flops / loop_elapsed / 1e12,
+                    })

                # Show progress every 10 seconds
-                current_elapsed = time.time() - t0
                if int(current_elapsed) != int(elapsed_check) and int(current_elapsed) % 10 == 0:
                    self.console.print(f"  [dim]Running {int(current_elapsed)}s / {duration}s[/dim]")
                    elapsed_check = current_elapsed
@ -198,21 +245,196 @@ class StressTest:
                "duration_sec": duration,
                "error": error_msg,
                "gpu_status": gpu_status,
+                "telemetry": self._evaluate_telemetry(
+                    telemetry, pass_tflops if "pass_tflops" in locals() else [],
+                    self._new_xid_events(xid_before, self._collect_xid_events()),
+                ),
            }
        finally:
+            stop_sampling.set()
            tensors.clear()
+            ballast.clear()
            torch.cuda.empty_cache()

        elapsed = round(time.time() - t0, 1)
+        xid_events = self._new_xid_events(xid_before, self._collect_xid_events())
+        telemetry_summary = self._evaluate_telemetry(telemetry, pass_tflops, xid_events)
+        passed = all(v == "PASS" for v in gpu_status.values()) and telemetry_summary.get("passed", False)
        return {
            "source": "pytorch",
-            "passed": True,
+            "passed": passed,
            "duration_sec": duration,
            "elapsed_sec": elapsed,
            "gpu_status": gpu_status,
+            "telemetry": telemetry_summary,
            "timestamp": datetime.now().isoformat(),
        }

+    def _sample_telemetry(self, telemetry: list, stop_event: threading.Event, interval: int):
+        query = "index,temperature.gpu,power.draw,clocks_throttle_reasons.active"
+        while not stop_event.is_set():
+            try:
+                r = subprocess.run(
+                    ["nvidia-smi", f"--query-gpu={query}", "--format=csv,noheader,nounits"],
+                    capture_output=True, text=True, timeout=10,
+                )
+                if r.returncode == 0:
+                    sample = {"time": time.time(), "gpus": []}
+                    for line in r.stdout.splitlines():
+                        parts = [p.strip() for p in line.split(",")]
+                        if len(parts) >= 4:
+                            sample["gpus"].append({
+                                "index": int(parts[0]),
+                                "temp_c": float(parts[1]),
+                                "power_w": float(parts[2]),
+                                "throttle": parts[3],
+                            })
+                    telemetry.append(sample)
+            except Exception:
+                # Best-effort sampling: drop this sample if nvidia-smi fails or its output cannot be parsed.
+                pass
+            stop_event.wait(interval)
+
+    def _collect_xid_events(self) -> list[str]:
+        try:
+            r = subprocess.run(
+                ["dmesg", "--color=never"],
+                capture_output=True, text=True, timeout=10,
+            )
+            if r.returncode != 0:
+                return []
+            return [
+                line.strip()
+                for line in r.stdout.splitlines()
+                if "XID" in line.upper()
+            ]
+        except Exception:
+            return []
+
+    @staticmethod
+    def _new_xid_events(before: list[str], after: list[str]) -> list[str]:
+        seen = set(before)
+        return [line for line in after if line not in seen]
+
+    def _evaluate_telemetry(self, telemetry: list, pass_tflops: list, xid_events: list[str] | None = None) -> dict:
+        cfg = self.stress_cfg
+        max_temp = float(cfg.get("max_temp_c", 80))
+        max_delta = float(cfg.get("max_temp_delta_c", 5))
+        min_power = float(cfg.get("min_power_watts", 630))
+        max_jitter = float(cfg.get("max_tflops_jitter_pct", 5))
+        require_jitter = bool(cfg.get("require_tflops_jitter", True))
+        duration = float(cfg.get("duration_sec", 60))
+        requested_warmup = float(cfg.get("warmup_sec", 60))
+        warmup_sec = min(requested_warmup, max(0.0, duration * 0.2))
+        min_steady_samples = int(cfg.get("min_steady_samples", 10))
+        temps = {}
+        powers = {}
+        throttle_bad = []
+        xid_events = xid_events or []
+        steady_telemetry = [
+            sample for sample in telemetry
+            if sample.get("time", 0) - telemetry[0].get("time", 0) >= warmup_sec
+        ] if telemetry else []
+        evaluation_samples = steady_telemetry if len(steady_telemetry) >= min_steady_samples else telemetry
+        for sample in evaluation_samples:
+            for g in sample.get("gpus", []):
+                idx = g["index"]
+                temps.setdefault(idx, []).append(g["temp_c"])
+                powers.setdefault(idx, []).append(g["power_w"])
+                try:
+                    bitmask = int(str(g["throttle"]), 16)
+                except ValueError:
+                    bitmask = 0
+                real_throttle = bitmask & ~0x1
+                if real_throttle:
+                    throttle_bad.append({
+                        "gpu": idx,
+                        "throttle": g["throttle"],
+                        "real_throttle": f"0x{real_throttle:x}",
+                    })
+        max_temps = {idx: max(vals) for idx, vals in temps.items() if vals}
+        avg_powers = {idx: sum(vals) / len(vals) for idx, vals in powers.items() if vals}
+        temp_delta = (max(max_temps.values()) - min(max_temps.values())) if len(max_temps) >= 2 else 0
+        jitter = 0
+        steady_tflops = []
+        for item in pass_tflops:
+            if isinstance(item, dict):
+                if float(item.get("elapsed_sec", 0)) >= warmup_sec:
+                    steady_tflops.append(float(item.get("tflops", 0)))
+            else:
+                steady_tflops.append(float(item))
+        if len(steady_tflops) < 2 and pass_tflops:
+            steady_tflops = [
+                float(item.get("tflops", 0)) if isinstance(item, dict) else float(item)
+                for item in pass_tflops
+            ]
+        if steady_tflops:
+            mean = sum(steady_tflops) / len(steady_tflops)
+            jitter = max(abs(v - mean) / mean * 100 for v in steady_tflops) if mean else 0
+        failures = []
+        temp_failures = {idx: v for idx, v in max_temps.items() if v > max_temp}
+        power_failures = {idx: v for idx, v in avg_powers.items() if v < min_power}
+        if not evaluation_samples:
+            failures.append("no telemetry samples available for evaluation")
+        if temp_failures:
+            failures.append(
+                "max temperature above threshold: "
+                + ", ".join(f"GPU {idx} {val:.1f}C" for idx, val in sorted(temp_failures.items()))
+            )
+        if temp_delta > max_delta:
+            failures.append(f"GPU temperature delta {temp_delta:.1f}C exceeds {max_delta:.1f}C")
+        if power_failures:
+            failures.append(
+                "average steady-state power below threshold: "
+                + ", ".join(f"GPU {idx} {val:.1f}W" for idx, val in sorted(power_failures.items()))
+            )
+        if throttle_bad:
+            failures.append(
+                f"non-idle throttle reasons observed in {len(throttle_bad)} samples "
+                f"(first: GPU {throttle_bad[0]['gpu']} {throttle_bad[0]['real_throttle']})"
+            )
+        if xid_events:
+            failures.append(f"{len(xid_events)} new XID/NVRM XID events observed")
+        if require_jitter and len(steady_tflops) < 2:
+            failures.append(
+                f"insufficient steady TFLOPS samples for jitter evaluation: {len(steady_tflops)} < 2"
+            )
+        if jitter > max_jitter:
+            failures.append(f"TFLOPS jitter {jitter:.2f}% exceeds {max_jitter:.2f}%")
+        passed = (
+            bool(evaluation_samples)
+            and all(v <= max_temp for v in max_temps.values())
+            and temp_delta <= max_delta
+            and all(v >= min_power for v in avg_powers.values())
+            and not throttle_bad
+            and not xid_events
+            and (not require_jitter or len(steady_tflops) >= 2)
+            and jitter <= max_jitter
+        )
+        return {
+            "passed": passed,
+            "samples": len(telemetry),
+            "steady_samples": len(evaluation_samples),
+            "warmup_sec": round(warmup_sec, 1),
+            "max_temp_c": {k: round(v, 1) for k, v in max_temps.items()},
+            "avg_power_w": {k: round(v, 1) for k, v in avg_powers.items()},
+            "temp_delta_c": round(temp_delta, 1),
+            "throttle_events": throttle_bad[:20],
+            "throttle_event_count": len(throttle_bad),
+            "xid_events": xid_events[-20:],
+            "tflops_jitter_pct": round(jitter, 2),
+            "steady_tflops_samples": len(steady_tflops),
+            "failures": failures,
+            "thresholds": {
+                "max_temp_c": max_temp,
+                "max_temp_delta_c": max_delta,
+                "min_power_w": min_power,
+                "max_tflops_jitter_pct": max_jitter,
+                "require_tflops_jitter": require_jitter,
+                "warmup_sec": requested_warmup,
+                "min_steady_samples": min_steady_samples,
+            },
+        }
+
    @staticmethod
    def print_results(results: dict, console: Console = None):
        c = console or Console()
@ -245,5 +467,21 @@ class StressTest:
                color = "green" if status == "PASS" else "red"
                c.print(f"    GPU {gid}: [{color}]{status}[/{color}]")

+        telemetry = results.get("telemetry") or {}
+        if telemetry:
+            c.print("\n  Telemetry:")
+            c.print(f"    Samples: {telemetry.get('samples', 0)} total, {telemetry.get('steady_samples', 0)} evaluated after {telemetry.get('warmup_sec', 0)}s warmup")
+            c.print(f"    Avg steady power: {telemetry.get('avg_power_w', {})}")
+            c.print(f"    Max steady temp: {telemetry.get('max_temp_c', {})}")
+            c.print(f"    Temp delta: {telemetry.get('temp_delta_c', 'N/A')} C")
+            c.print(f"    TFLOPS jitter: {telemetry.get('tflops_jitter_pct', 'N/A')}%")
+            c.print(f"    Throttle events: {telemetry.get('throttle_event_count', len(telemetry.get('throttle_events', [])))}")
+            c.print(f"    XID events: {len(telemetry.get('xid_events', []))}")
+            failures = telemetry.get("failures", [])
+            if failures:
+                c.print("  [red]Failure reasons:[/red]")
+                for reason in failures:
+                    c.print(f"    [red]- {reason}[/red]")
+
        if results.get("error"):
            c.print(f"  [red]Error: {results['error']}[/red]")
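Note on two checks in `_evaluate_telemetry` above: the throttle check parses the hexadecimal bitmask reported by `nvidia-smi` and clears bit `0x1` (GPU idle) before deciding whether a sample counts as throttled, and the jitter check reports the largest deviation from the mean as a percentage. The sketch below restates both in isolation, assuming the same semantics as the diff; the helper names `real_throttle_bits` and `jitter_pct` are hypothetical and do not exist in the module.

```python
from typing import Sequence


def real_throttle_bits(raw: str) -> int:
    """Parse a throttle-reasons hex string and drop the GPU-idle bit (0x1)."""
    try:
        bitmask = int(str(raw), 16)  # int() with base 16 accepts a "0x" prefix
    except ValueError:
        # Unparsable fields (for example "N/A") are treated as no throttle,
        # mirroring the fallback in the diff.
        return 0
    return bitmask & ~0x1


def jitter_pct(values: Sequence[float]) -> float:
    """Largest absolute deviation from the mean, as a percentage of the mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    if not mean:
        return 0.0
    return max(abs(v - mean) / mean * 100 for v in values)


if __name__ == "__main__":
    for raw in ("0x0000000000000001", "0x0000000000000004", "N/A"):
        print(raw, hex(real_throttle_bits(raw)))
    print(jitter_pct([95.0, 100.0, 105.0]))
```

Under this rule an idle-only mask is ignored, while a value such as `0x4` (the one reported in the smoke runs in the coverage notes) is kept and would fail the strict "no throttle" criterion.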
--- a/modules/training_sim.py
+++ b/modules/training_sim.py
@ -1,8 +1,13 @@
 """Training simulation module - LLM training workload with PyTorch."""

+import json
+import os
+import sys
+import tempfile
 import time
 import subprocess
 import shutil
+import math
 from datetime import datetime
 from typing import Optional

@ -36,6 +41,7 @@ class TrainingSim:
        batch_size = self.train_cfg.get("batch_size", 8)
        seq_length = self.train_cfg.get("seq_length", 2048)
        num_steps = self.train_cfg.get("num_steps", 50)
+        warmup_steps = int(self.train_cfg.get("warmup_steps", 5))
        dtype_str = self.train_cfg.get("dtype", "bf16")

        dtype_map = {
@ -47,7 +53,13 @@ class TrainingSim:

        self.console.print(f"[cyan]Training Simulation[/cyan]")
        self.console.print(f"  Model: {model_name} | Batch: {batch_size} | Seq: {seq_length} | "
-                           f"DType: {dtype_str} | Steps: {num_steps} | GPUs: {gpu_count}")
+                           f"DType: {dtype_str} | Steps: {num_steps} | Warmup: {warmup_steps} | GPUs: {gpu_count}")
+
+        if self.train_cfg.get("mode", "ddp") == "ddp" and gpu_count > 1:
+            ddp_result = self._run_synthetic_ddp(gpu_count, batch_size, seq_length, num_steps, dtype_str)
+            if ddp_result.get("passed") or not self.train_cfg.get("allow_fallback", False):
+                return ddp_result
+            self.console.print("[yellow]DDP synthetic training failed, falling back to single-process synthetic path[/yellow]")

        try:
            from transformers import AutoModelForCausalLM, AutoTokenizer
@ -87,9 +99,10 @@ class TrainingSim:
                BarColumn(), TextColumn("{task.completed}/{task.total}"),
                TimeElapsedColumn(), console=self.console,
            ) as progress:
-                task = progress.add_task("Training steps...", total=num_steps)
+                total_steps = num_steps + warmup_steps
+                task = progress.add_task("Training steps...", total=total_steps)

-                for step in range(num_steps):
+                for step in range(total_steps):
                    torch.cuda.synchronize()
                    t0 = time.perf_counter()

@ -119,8 +132,15 @@ class TrainingSim:

                    progress.advance(task)

-            avg_step_time = sum(step_times) / len(step_times)
+            measured_steps = step_times[warmup_steps:] if len(step_times) > warmup_steps else step_times
+            avg_step_time = sum(measured_steps) / len(measured_steps)
            throughput = batch_size * seq_length / avg_step_time
+            jitter = self._jitter_pct(measured_steps)
+            peak_mem = round(max(mem_usage) if mem_usage else 0, 2)
+            final_loss = float(loss.item()) if hasattr(loss, "item") else float("nan")
+            passed = self._acceptance_pass(throughput, jitter, peak_mem, final_loss)
+            if self.train_cfg.get("require_distributed", True):
+                passed = False

            return {
                "model": model_name,
@ -130,11 +150,18 @@ class TrainingSim:
                "batch_size": batch_size,
                "seq_length": seq_length,
                "num_steps": num_steps,
+                "warmup_steps": warmup_steps,
+                "total_steps": total_steps,
                "avg_step_time_ms": round(avg_step_time * 1000, 1),
                "throughput_tokens_per_sec": round(throughput, 0),
                "throughput_samples_per_sec": round(batch_size / avg_step_time, 2),
-                "peak_memory_gb": round(max(mem_usage) if mem_usage else 0, 2),
-                "final_loss": round(loss.item(), 4) if hasattr(loss, 'item') else None,
+                "peak_memory_gb": peak_mem,
+                "final_loss": round(final_loss, 4),
+                "step_jitter_pct": round(jitter, 2),
+                "distributed_mode": "device_map",
+                "loss_finite": math.isfinite(final_loss),
+                "passed": passed,
+                "acceptance_gap": "8-GPU DDP was not used" if self.train_cfg.get("require_distributed", True) else "",
                "timestamp": datetime.now().isoformat(),
            }

@ -142,6 +169,196 @@ class TrainingSim:
            self.console.print(f"[yellow]Model loading failed: {e}[/yellow]")
            return self._run_synthetic(gpu_count, batch_size, seq_length, num_steps, dtype)

+    def _run_synthetic_ddp(self, gpu_count: int, batch_size: int, seq_length: int,
+                           num_steps: int, dtype_str: str) -> dict:
+        """Run the 1.5B synthetic Transformer with one process per GPU."""
+        torchrun = os.path.join(os.path.dirname(sys.executable), "torchrun")
+        if not os.path.isfile(torchrun):
+            torchrun = shutil.which("torchrun") or ""
+        if not torchrun:
+            return {
+                "model": "synthetic_transformer_1.5b",
+                "gpu_count": gpu_count,
+                "distributed_mode": "ddp",
+                "passed": False,
+                "error": "torchrun not found",
+                "timestamp": datetime.now().isoformat(),
+            }
+
+        script = r'''
+import json
+import math
+import os
+import time
+import torch
+import torch.distributed as dist
+from torch.nn.parallel import DistributedDataParallel as DDP
+
+def main():
+    local_rank = int(os.environ["LOCAL_RANK"])
+    world_size = int(os.environ["WORLD_SIZE"])
+    torch.cuda.set_device(local_rank)
+    dist.init_process_group("nccl")
+
+    global_batch = int(os.environ["TRAIN_BATCH_SIZE"])
+    local_batch = max(1, global_batch // world_size)
+    seq_length = int(os.environ["TRAIN_SEQ_LENGTH"])
+    num_steps = int(os.environ["TRAIN_NUM_STEPS"])
+    warmup_steps = int(os.environ.get("TRAIN_WARMUP_STEPS", "5"))
+    total_steps = num_steps + warmup_steps
+    dtype_name = os.environ.get("TRAIN_DTYPE", "bf16")
+    dtype = {"fp16": torch.float16, "bf16": torch.bfloat16, "fp32": torch.float32}.get(dtype_name, torch.bfloat16)
+
+    hidden_size = 4096
+    num_layers = 6
+    num_heads = 32
+    vocab_size = 32000
+
+    class SyntheticTransformer(torch.nn.Module):
+        def __init__(self):
+            super().__init__()
+            self.embed = torch.nn.Embedding(vocab_size, hidden_size)
+            self.layers = torch.nn.ModuleList([
+                torch.nn.TransformerEncoderLayer(
+                    d_model=hidden_size,
+                    nhead=num_heads,
+                    dim_feedforward=hidden_size * 4,
+                    batch_first=True,
+                    dtype=dtype,
+                ) for _ in range(num_layers)
+            ])
+            self.head = torch.nn.Linear(hidden_size, vocab_size, dtype=dtype)
+
+        def forward(self, x):
+            h = self.embed(x).to(dtype)
+            for layer in self.layers:
+                h = layer(h)
+            return self.head(h)
+
+    model = SyntheticTransformer().cuda()
+    total_params = sum(p.numel() for p in model.parameters())
+    model = DDP(model, device_ids=[local_rank], output_device=local_rank)
+    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
+    input_ids = torch.randint(0, vocab_size, (local_batch, seq_length), device="cuda")
+    step_times = []
+    last_loss = torch.tensor(float("nan"), device="cuda")
+    torch.cuda.reset_peak_memory_stats(local_rank)
+
+    for _ in range(total_steps):
+        torch.cuda.synchronize()
+        t0 = time.perf_counter()
+        with torch.amp.autocast("cuda", dtype=dtype, enabled=dtype in (torch.float16, torch.bfloat16)):
+            logits = model(input_ids)
+            loss = torch.nn.functional.cross_entropy(logits.reshape(-1, vocab_size), input_ids.reshape(-1))
+        loss.backward()
+        optimizer.step()
+        optimizer.zero_grad(set_to_none=True)
+        torch.cuda.synchronize()
+        step_times.append(time.perf_counter() - t0)
+        last_loss = loss.detach()
+
+    peak_mem = torch.tensor(torch.cuda.max_memory_allocated(local_rank) / 1024**3, device="cuda")
+    dist.all_reduce(peak_mem, op=dist.ReduceOp.MAX)
+    finite = torch.tensor(1 if math.isfinite(float(last_loss.item())) else 0, device="cuda")
+    dist.all_reduce(finite, op=dist.ReduceOp.MIN)
+
+    if dist.get_rank() == 0:
+        measured_steps = step_times[warmup_steps:] if len(step_times) > warmup_steps else step_times
+        avg_step = sum(measured_steps) / len(measured_steps)
+        mean = avg_step
+        jitter = max(abs(v - mean) / mean * 100 for v in measured_steps) if mean else 0.0
+        # Each rank runs local_batch samples (floored), so derive throughput from the batch actually processed.
+        effective_batch = local_batch * world_size
+        throughput = effective_batch * seq_length / avg_step if avg_step else 0.0
+        print("TRAINING_DDP_JSON=" + json.dumps({
+            "model": "synthetic_transformer_1.5b",
+            "total_params_m": round(total_params / 1e6, 1),
+            "num_layers": num_layers,
+            "hidden_size": hidden_size,
+            "gpu_count": world_size,
+            "dtype": dtype_name,
+            "batch_size": global_batch,
+            "local_batch_size": local_batch,
+            "seq_length": seq_length,
+            "num_steps": num_steps,
+            "warmup_steps": warmup_steps,
+            "total_steps": total_steps,
+            "avg_step_time_ms": round(avg_step * 1000, 1),
+            "throughput_tokens_per_sec": round(throughput, 0),
+            "throughput_samples_per_sec": round(effective_batch / avg_step, 2) if avg_step else 0,
+            "peak_memory_gb": round(float(peak_mem.item()), 2),
+            "final_loss": round(float(last_loss.item()), 4),
+            "step_jitter_pct": round(jitter, 2),
+            "distributed_mode": "ddp",
+            "loss_finite": bool(int(finite.item())),
+        }), flush=True)
+    dist.destroy_process_group()
+
+if __name__ == "__main__":
+    main()
+'''
+        tmp = tempfile.NamedTemporaryFile("w", suffix="_training_ddp.py", delete=False)
+        tmp.write(script)
+        tmp.close()
+
+        env = {
+            **os.environ,
+            "TRAIN_BATCH_SIZE": str(batch_size),
+            "TRAIN_SEQ_LENGTH": str(seq_length),
+            "TRAIN_NUM_STEPS": str(num_steps),
+            "TRAIN_WARMUP_STEPS": str(int(self.train_cfg.get("warmup_steps", 5))),
+            "TRAIN_DTYPE": dtype_str,
+            "NCCL_DEBUG": os.environ.get("NCCL_DEBUG", "WARN"),
+        }
+        cmd = [torchrun, f"--nproc_per_node={gpu_count}", tmp.name]
+        self.console.print(f"  Running synthetic 1.5B DDP via torchrun ({gpu_count} processes)...")
+        try:
+            timeout = int(self.train_cfg.get("timeout_sec", max(600, num_steps * 180)))
+            r = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout, env=env)
+        except subprocess.TimeoutExpired:
+            return {
+                "model": "synthetic_transformer_1.5b",
+                "gpu_count": gpu_count,
+                "distributed_mode": "ddp",
+                "passed": False,
+                "error": "training_ddp_timeout",
+                "timestamp": datetime.now().isoformat(),
+            }
+        finally:
+            if os.path.exists(tmp.name):
+                try:
+                    os.unlink(tmp.name)
+                except OSError:
+                    pass
+
+        marker = "TRAINING_DDP_JSON="
+        payload = None
+        for line in (r.stdout + "\n" + r.stderr).splitlines():
+            if marker in line:
+                payload = line.split(marker, 1)[1].strip()
+        if r.returncode != 0 or not payload:
+            return {
+                "model": "synthetic_transformer_1.5b",
+                "gpu_count": gpu_count,
+                "distributed_mode": "ddp",
+                "passed": False,
+                "error": (r.stderr or r.stdout or "training_ddp_failed")[-1000:],
+                "timestamp": datetime.now().isoformat(),
+            }
+
+        result = json.loads(payload)
+        loss_value = float(result.get("final_loss", "nan"))
+        passed = self._acceptance_pass(
+            float(result.get("throughput_tokens_per_sec", 0)),
+            float(result.get("step_jitter_pct", 999)),
+            float(result.get("peak_memory_gb", 999)),
+            loss_value,
+        ) and bool(result.get("loss_finite", False)) and result.get("gpu_count") == gpu_count
+        result.update({
+            "passed": passed,
+            "timestamp": datetime.now().isoformat(),
+        })
+        return result
+
    def _run_synthetic(self, gpu_count, batch_size, seq_length, num_steps, dtype) -> dict:
        self.console.print("  Running synthetic training benchmark...")

@ -170,11 +387,17 @@ class TrainingSim:
                    h = layer(h)
                return self.head(h)

-        model = SyntheticTransformer().cuda()
+        model = SyntheticTransformer()
        total_params = sum(p.numel() for p in model.parameters())

        self.console.print(f"  Synthetic params: {total_params / 1e6:.1f}M")

+        distributed_mode = "single_gpu"
+        if gpu_count > 1:
+            model = torch.nn.DataParallel(model).cuda()
+            distributed_mode = "data_parallel"
+        else:
+            model = model.cuda()
        model.train()
        optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

@ -183,14 +406,17 @@ class TrainingSim:
        step_times = []
        mem_usage = []

+        warmup_steps = int(self.train_cfg.get("warmup_steps", 5))
+        total_steps = num_steps + warmup_steps
+
        with Progress(
            SpinnerColumn(), TextColumn("[progress.description]{task.description}"),
            BarColumn(), TextColumn("{task.completed}/{task.total}"),
            TimeElapsedColumn(), console=self.console,
        ) as progress:
-            task = progress.add_task("Synthetic training...", total=num_steps)
+            task = progress.add_task("Synthetic training...", total=total_steps)

-            for step in range(num_steps):
+            for step in range(total_steps):
                torch.cuda.synchronize()
                t0 = time.perf_counter()

@ -206,14 +432,22 @@ class TrainingSim:
                elapsed = time.perf_counter() - t0
                step_times.append(elapsed)

-                mem_used = torch.cuda.max_memory_allocated() / 1024**3
+                mem_used = max(torch.cuda.max_memory_allocated(i) for i in range(gpu_count)) / 1024**3
                mem_usage.append(mem_used)
-                torch.cuda.reset_peak_memory_stats()
+                for i in range(gpu_count):
+                    torch.cuda.reset_peak_memory_stats(i)

                progress.advance(task)

-        avg_step_time = sum(step_times) / len(step_times)
+        measured_steps = step_times[warmup_steps:] if len(step_times) > warmup_steps else step_times
+        avg_step_time = sum(measured_steps) / len(measured_steps)
        throughput = batch_size * seq_length / avg_step_time
+        jitter = self._jitter_pct(measured_steps)
+        peak_mem = round(max(mem_usage) if mem_usage else 0, 2)
+        final_loss = float(loss.item())
+        passed = self._acceptance_pass(throughput, jitter, peak_mem, final_loss)
+        if self.train_cfg.get("require_distributed", True):
+            passed = False

        return {
            "model": "synthetic_transformer",
@ -225,14 +459,36 @@ class TrainingSim:
            "batch_size": batch_size,
            "seq_length": seq_length,
            "num_steps": num_steps,
+            "warmup_steps": warmup_steps,
+            "total_steps": total_steps,
            "avg_step_time_ms": round(avg_step_time * 1000, 1),
            "throughput_tokens_per_sec": round(throughput, 0),
            "throughput_samples_per_sec": round(batch_size / avg_step_time, 2),
-            "peak_memory_gb": round(max(mem_usage) if mem_usage else 0, 2),
-            "final_loss": round(loss.item(), 4),
+            "peak_memory_gb": peak_mem,
+            "final_loss": round(final_loss, 4),
+            "step_jitter_pct": round(jitter, 2),
+            "distributed_mode": distributed_mode,
+            "loss_finite": math.isfinite(final_loss),
+            "passed": passed,
+            "acceptance_gap": "8-GPU DDP was not used" if self.train_cfg.get("require_distributed", True) else "",
            "timestamp": datetime.now().isoformat(),
        }

+    @staticmethod
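+    def _jitter_pct(step_times: list[float]) -> float:
+        """Return the largest absolute deviation from the mean step time, as a percentage of the mean."""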
+        if not step_times:
+            return 0.0
+        mean = sum(step_times) / len(step_times)
+        return max(abs(v - mean) / mean * 100 for v in step_times) if mean else 0.0
+
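+    def _acceptance_pass(self, throughput: float, jitter: float, peak_mem: float, loss_value: float) -> bool:
+        """Check throughput, step jitter, peak memory and loss finiteness against the train_cfg thresholds."""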
+        return (
+            throughput >= float(self.train_cfg.get("min_tokens_per_sec", 45000))
+            and jitter <= float(self.train_cfg.get("max_step_jitter_pct", 3))
+            and peak_mem <= float(self.train_cfg.get("max_peak_memory_gb", 70))
+            and math.isfinite(loss_value)
+        )
+
    @staticmethod
    def print_results(results: dict, console: Console = None):
        c = console or Console()
@ -254,11 +510,15 @@ class TrainingSim:
            ("Batch Size", str(results.get("batch_size", "N/A"))),
            ("Seq Length", str(results.get("seq_length", "N/A"))),
            ("Steps", str(results.get("num_steps", "N/A"))),
+            ("Warmup Steps", str(results.get("warmup_steps", "N/A"))),
            ("Avg Step Time", f"{results.get('avg_step_time_ms', 'N/A')} ms"),
            ("Throughput", f"{results.get('throughput_tokens_per_sec', 'N/A')} tokens/s"),
            ("Samples/sec", f"{results.get('throughput_samples_per_sec', 'N/A')}"),
            ("Peak Memory", f"{results.get('peak_memory_gb', 'N/A')} GB"),
            ("Final Loss", str(results.get("final_loss", "N/A"))),
+            ("Step Jitter", f"{results.get('step_jitter_pct', 'N/A')}%"),
+            ("Distributed Mode", results.get("distributed_mode", "N/A")),
+            ("Verdict", "PASS" if results.get("passed") else "FAIL"),
        ]
        for label, val in metrics:
            table.add_row(label, str(val))
--- a/reports_all_aikubeworker0016.json
+++ b/reports_all_aikubeworker0016.json
@ -0,0 +1,921 @@
+{
+  "timestamp": "2026-05-22T15:49:02.368516",
+  "gpu_info": {
+    "driver_version": "580.159.03",
+    "cuda_version": "13.0",
+    "gpu_count": 8,
+    "gpus": [
+      {
+        "index": 0,
+        "name": "NVIDIA H100 80GB HBM3",
+        "uuid": "GPU-dfbc9513-255d-4fe7-2b77-7b1ec3972e75",
+        "pci_bus_id": "00000000:18:00.0",
+        "pcie_link_gen": 5,
+        "pcie_link_width": 16,
+        "vram_total_mb": 81559,
+        "vram_used_mb": 4,
+        "vram_free_mb": 81076,
+        "power_draw": 69.98,
+        "power_limit": 700.0,
+        "clock_sm": 345,
+        "clock_mem": 2619,
+        "temperature": 21,
+        "fan_speed": 0,
+        "persistence_mode": false,
+        "compute_mode": "Default",
+        "serial_number": "1651924016120",
+        "ecc_errors_single": 0,
+        "ecc_errors_double": 0
+      },
+      {
+        "index": 1,
+        "name": "NVIDIA H100 80GB HBM3",
+        "uuid": "GPU-bb845ef7-d7b5-f011-9395-ea74274e2282",
+        "pci_bus_id": "00000000:2A:00.0",
+        "pcie_link_gen": 5,
+        "pcie_link_width": 16,
+        "vram_total_mb": 81559,
+        "vram_used_mb": 4,
+        "vram_free_mb": 81076,
+        "power_draw": 67.54,
+        "power_limit": 700.0,
+        "clock_sm": 345,
+        "clock_mem": 2619,
+        "temperature": 21,
+        "fan_speed": 0,
+        "persistence_mode": false,
+        "compute_mode": "Default",
+        "serial_number": "1651924015483",
+        "ecc_errors_single": 0,
+        "ecc_errors_double": 0
+      },
+      {
+        "index": 2,
+        "name": "NVIDIA H100 80GB HBM3",
+        "uuid": "GPU-3720cf13-2a34-be38-27be-0a7adc4addc4",
+        "pci_bus_id": "00000000:3A:00.0",
+        "pcie_link_gen": 5,
+        "pcie_link_width": 16,
+        "vram_total_mb": 81559,
+        "vram_used_mb": 4,
+        "vram_free_mb": 81076,
+        "power_draw": 66.82,
+        "power_limit": 700.0,
+        "clock_sm": 345,
+        "clock_mem": 2619,
+        "temperature": 22,
+        "fan_speed": 0,
+        "persistence_mode": false,
+        "compute_mode": "Default",
+        "serial_number": "1651924025595",
+        "ecc_errors_single": 0,
+        "ecc_errors_double": 0
+      },
+      {
+        "index": 3,
+        "name": "NVIDIA H100 80GB HBM3",
+        "uuid": "GPU-87080b2d-ac43-be0d-d574-c193078850ae",
+        "pci_bus_id": "00000000:5D:00.0",
+        "pcie_link_gen": 5,
+        "pcie_link_width": 16,
+        "vram_total_mb": 81559,
+        "vram_used_mb": 4,
+        "vram_free_mb": 81076,
+        "power_draw": 67.02,
+        "power_limit": 700.0,
+        "clock_sm": 345,
+        "clock_mem": 2619,
+        "temperature": 21,
+        "fan_speed": 0,
+        "persistence_mode": false,
+        "compute_mode": "Default",
+        "serial_number": "1651924016862",
+        "ecc_errors_single": 0,
+        "ecc_errors_double": 0
+      },
+      {
+        "index": 4,
+        "name": "NVIDIA H100 80GB HBM3",
+        "uuid": "GPU-599bd883-cc5c-a5dd-6c33-c15f7049da48",
+        "pci_bus_id": "00000000:9A:00.0",
+        "pcie_link_gen": 5,
+        "pcie_link_width": 16,
+        "vram_total_mb": 81559,
+        "vram_used_mb": 4,
+        "vram_free_mb": 81076,
+        "power_draw": 67.24,
+        "power_limit": 700.0,
+        "clock_sm": 345,
+        "clock_mem": 2619,
+        "temperature": 21,
+        "fan_speed": 0,
+        "persistence_mode": false,
+        "compute_mode": "Default",
+        "serial_number": "1651924025670",
+        "ecc_errors_single": 0,
+        "ecc_errors_double": 0
+      },
+      {
+        "index": 5,
+        "name": "NVIDIA H100 80GB HBM3",
+        "uuid": "GPU-a1c6bba4-61b0-e623-06c9-9c88635e26fe",
+        "pci_bus_id": "00000000:AB:00.0",
+        "pcie_link_gen": 5,
+        "pcie_link_width": 16,
+        "vram_total_mb": 81559,
+        "vram_used_mb": 4,
+        "vram_free_mb": 81076,
+        "power_draw": 69.31,
+        "power_limit": 700.0,
+        "clock_sm": 345,
+        "clock_mem": 2619,
+        "temperature": 23,
+        "fan_speed": 0,
+        "persistence_mode": false,
+        "compute_mode": "Default",
+        "serial_number": "1651924027166",
+        "ecc_errors_single": 0,
+        "ecc_errors_double": 0
+      },
+      {
+        "index": 6,
+        "name": "NVIDIA H100 80GB HBM3",
+        "uuid": "GPU-98745a0c-39bd-3e56-d6ca-54ba3647ab6d",
+        "pci_bus_id": "00000000:BA:00.0",
+        "pcie_link_gen": 5,
+        "pcie_link_width": 16,
+        "vram_total_mb": 81559,
+        "vram_used_mb": 4,
+        "vram_free_mb": 81076,
+        "power_draw": 67.84,
+        "power_limit": 700.0,
+        "clock_sm": 345,
+        "clock_mem": 2619,
+        "temperature": 21,
+        "fan_speed": 0,
+        "persistence_mode": false,
+        "compute_mode": "Default",
+        "serial_number": "1651924026234",
+        "ecc_errors_single": 0,
+        "ecc_errors_double": 0
+      },
+      {
+        "index": 7,
+        "name": "NVIDIA H100 80GB HBM3",
+        "uuid": "GPU-8c73bd8b-666b-357e-ac5d-c75ac7a759db",
+        "pci_bus_id": "00000000:DB:00.0",
+        "pcie_link_gen": 5,
+        "pcie_link_width": 16,
+        "vram_total_mb": 81559,
+        "vram_used_mb": 4,
+        "vram_free_mb": 81076,
+        "power_draw": 66.21,
+        "power_limit": 700.0,
+        "clock_sm": 345,
+        "clock_mem": 2619,
+        "temperature": 21,
+        "fan_speed": 0,
+        "persistence_mode": false,
+        "compute_mode": "Default",
+        "serial_number": "1651924027255",
+        "ecc_errors_single": 0,
+        "ecc_errors_double": 0
+      }
+    ],
+    "topology": "\t\u001b[4mGPU0\tGPU1\tGPU2\tGPU3\tGPU4\tGPU5\tGPU6\tGPU7\tNIC0\tNIC1\tNIC2\tNIC3\tNIC4\tNIC5\tNIC6\tNIC7\tNIC8\tNIC9\tCPU Affinity\tNUMA Affinity\tGPU NUMA ID\u001b[0m\nGPU0\t X \tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\tPIX\tNODE\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t0-55,112-167\t0\t\tN/A\nGPU1\tNV18\t X \tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\tNODE\tPIX\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t0-55,112-167\t0\t\tN/A\nGPU2\tNV18\tNV18\t X \tNV18\tNV18\tNV18\tNV18\tNV18\tNODE\tNODE\tPIX\tPIX\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t0-55,112-167\t0\t\tN/A\nGPU3\tNV18\tNV18\tNV18\t X \tNV18\tNV18\tNV18\tNV18\tNODE\tNODE\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t0-55,112-167\t0\t\tN/A\nGPU4\tNV18\tNV18\tNV18\tNV18\t X \tNV18\tNV18\tNV18\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tPIX\tNODE\tNODE\tNODE\t56-111,168-223\t1\t\tN/A\nGPU5\tNV18\tNV18\tNV18\tNV18\tNV18\t X \tNV18\tNV18\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\tPIX\tNODE\tNODE\t56-111,168-223\t1\t\tN/A\nGPU6\tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\t X \tNV18\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tPIX\tPIX\t56-111,168-223\t1\t\tN/A\nGPU7\tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\t X \tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tNODE\tNODE\t56-111,168-223\t1\t\tN/A\nNIC0\tPIX\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t X \tNODE\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC1\tNODE\tPIX\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\tNODE\t X \tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC2\tNODE\tNODE\tPIX\tNODE\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\t X \tPIX\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC3\tNODE\tNODE\tPIX\tNODE\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tPIX\t X \tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC4\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tNODE\tNODE\t X \tPIX\tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC5\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tNODE\tNODE\tPIX\t X \tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC6\tSYS\tSYS\tSYS\tSYS\tPIX\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\t X \tNODE\tNODE\tNODE\t\t\t\t\nNIC7\tSYS\tSYS\tSYS\tSYS\tNODE\tPIX\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\t X \tNODE\tNODE\t\t\t\t\nNIC8\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tPIX\tNODE\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\t X \tPIX\t\t\t\t\nNIC9\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tPIX\tNODE\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tPIX\t X \t\t\t\t\n\nLegend:\n\n  X    = Self\n  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)\n  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node\n  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)\n  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)\n  PIX  = Connection traversing at most a single PCIe bridge\n  NV#  = Connection traversing a bonded set of # NVLinks\n\nNIC Legend:\n\n  NIC0: mlx5_0\n  NIC1: mlx5_1\n  NIC2: mlx5_2\n  NIC3: mlx5_3\n  NIC4: mlx5_4\n  NIC5: mlx5_5\n  NIC6: mlx5_6\n  NIC7: mlx5_7\n  NIC8: mlx5_8\n  NIC9: mlx5_9\n\n",
+    "timestamp": "2026-05-22T15:49:09.197459",
+    "detected_gpu_type": "h100",
+    "gpu_label": "H100 SXM5"
+  },
+  "health": {
+    "passed": true,
+    "gpu_health": [
+      {
+        "index": 0,
+        "status": "WARN",
+        "checks": {
+          "temperature": {
+            "value": 21,
+            "status": "PASS",
+            "threshold": 75
+          },
+          "power": {
+            "value": 69.86,
+            "limit": 700.0,
+            "status": "PASS"
+          },
+          "ecc_errors": {
+            "single": 0,
+            "double": 0,
+            "status": "PASS"
+          },
+          "memory_errors": {
+            "status": "PASS"
+          },
+          "pcie_link": {
+            "gen": 5,
+            "width": 16,
+            "status": "PASS"
+          },
+          "clock_speed": {
+            "sm": 345,
+            "mem": 2619,
+            "status": "PASS"
+          },
+          "throttling": {
+            "status": "PASS",
+            "reasons": []
+          },
+          "persistence_mode": {
+            "enabled": false,
+            "status": "WARN"
+          }
+        }
+      },
+      {
+        "index": 1,
+        "status": "WARN",
+        "checks": {
+          "temperature": {
+            "value": 21,
+            "status": "PASS",
+            "threshold": 75
+          },
+          "power": {
+            "value": 67.48,
+            "limit": 700.0,
+            "status": "PASS"
+          },
+          "ecc_errors": {
+            "single": 0,
+            "double": 0,
+            "status": "PASS"
+          },
+          "memory_errors": {
+            "status": "PASS"
+          },
+          "pcie_link": {
+            "gen": 5,
+            "width": 16,
+            "status": "PASS"
+          },
+          "clock_speed": {
+            "sm": 345,
+            "mem": 2619,
+            "status": "PASS"
+          },
+          "throttling": {
+            "status": "PASS",
+            "reasons": []
+          },
+          "persistence_mode": {
+            "enabled": false,
+            "status": "WARN"
+          }
+        }
+      },
+      {
+        "index": 2,
+        "status": "WARN",
+        "checks": {
+          "temperature": {
+            "value": 22,
+            "status": "PASS",
+            "threshold": 75
+          },
+          "power": {
+            "value": 66.76,
+            "limit": 700.0,
+            "status": "PASS"
+          },
+          "ecc_errors": {
+            "single": 0,
+            "double": 0,
+            "status": "PASS"
+          },
+          "memory_errors": {
+            "status": "PASS"
+          },
+          "pcie_link": {
+            "gen": 5,
+            "width": 16,
+            "status": "PASS"
+          },
+          "clock_speed": {
+            "sm": 345,
+            "mem": 2619,
+            "status": "PASS"
+          },
+          "throttling": {
+            "status": "PASS",
+            "reasons": []
+          },
+          "persistence_mode": {
+            "enabled": false,
+            "status": "WARN"
+          }
+        }
+      },
+      {
+        "index": 3,
+        "status": "WARN",
+        "checks": {
+          "temperature": {
+            "value": 21,
+            "status": "PASS",
+            "threshold": 75
+          },
+          "power": {
+            "value": 67.06,
+            "limit": 700.0,
+            "status": "PASS"
+          },
+          "ecc_errors": {
+            "single": 0,
+            "double": 0,
+            "status": "PASS"
+          },
+          "memory_errors": {
+            "status": "PASS"
+          },
+          "pcie_link": {
+            "gen": 5,
+            "width": 16,
+            "status": "PASS"
+          },
+          "clock_speed": {
+            "sm": 345,
+            "mem": 2619,
+            "status": "PASS"
+          },
+          "throttling": {
+            "status": "PASS",
+            "reasons": []
+          },
+          "persistence_mode": {
+            "enabled": false,
+            "status": "WARN"
+          }
+        }
+      },
+      {
+        "index": 4,
+        "status": "WARN",
+        "checks": {
+          "temperature": {
+            "value": 21,
+            "status": "PASS",
+            "threshold": 75
+          },
+          "power": {
+            "value": 67.23,
+            "limit": 700.0,
+            "status": "PASS"
+          },
+          "ecc_errors": {
+            "single": 0,
+            "double": 0,
+            "status": "PASS"
+          },
+          "memory_errors": {
+            "status": "PASS"
+          },
+          "pcie_link": {
+            "gen": 5,
+            "width": 16,
+            "status": "PASS"
+          },
+          "clock_speed": {
+            "sm": 345,
+            "mem": 2619,
+            "status": "PASS"
+          },
+          "throttling": {
+            "status": "PASS",
+            "reasons": []
+          },
+          "persistence_mode": {
+            "enabled": false,
+            "status": "WARN"
+          }
+        }
+      },
+      {
+        "index": 5,
+        "status": "WARN",
+        "checks": {
+          "temperature": {
+            "value": 23,
+            "status": "PASS",
+            "threshold": 75
+          },
+          "power": {
+            "value": 69.27,
+            "limit": 700.0,
+            "status": "PASS"
+          },
+          "ecc_errors": {
+            "single": 0,
+            "double": 0,
+            "status": "PASS"
+          },
+          "memory_errors": {
+            "status": "PASS"
+          },
+          "pcie_link": {
+            "gen": 5,
+            "width": 16,
+            "status": "PASS"
+          },
+          "clock_speed": {
+            "sm": 345,
+            "mem": 2619,
+            "status": "PASS"
+          },
+          "throttling": {
+            "status": "PASS",
+            "reasons": []
+          },
+          "persistence_mode": {
+            "enabled": false,
+            "status": "WARN"
+          }
+        }
+      },
+      {
+        "index": 6,
+        "status": "WARN",
+        "checks": {
+          "temperature": {
+            "value": 21,
+            "status": "PASS",
+            "threshold": 75
+          },
+          "power": {
+            "value": 67.81,
+            "limit": 700.0,
+            "status": "PASS"
+          },
+          "ecc_errors": {
+            "single": 0,
+            "double": 0,
+            "status": "PASS"
+          },
+          "memory_errors": {
+            "status": "PASS"
+          },
+          "pcie_link": {
+            "gen": 5,
+            "width": 16,
+            "status": "PASS"
+          },
+          "clock_speed": {
+            "sm": 345,
+            "mem": 2619,
+            "status": "PASS"
+          },
+          "throttling": {
+            "status": "PASS",
+            "reasons": []
+          },
+          "persistence_mode": {
+            "enabled": false,
+            "status": "WARN"
+          }
+        }
+      },
+      {
+        "index": 7,
+        "status": "WARN",
+        "checks": {
+          "temperature": {
+            "value": 21,
+            "status": "PASS",
+            "threshold": 75
+          },
+          "power": {
+            "value": 66.3,
+            "limit": 700.0,
+            "status": "PASS"
+          },
+          "ecc_errors": {
+            "single": 0,
+            "double": 0,
+            "status": "PASS"
+          },
+          "memory_errors": {
+            "status": "PASS"
+          },
+          "pcie_link": {
+            "gen": 5,
+            "width": 16,
+            "status": "PASS"
+          },
+          "clock_speed": {
+            "sm": 345,
+            "mem": 2619,
+            "status": "PASS"
+          },
+          "throttling": {
+            "status": "PASS",
+            "reasons": []
+          },
+          "persistence_mode": {
+            "enabled": false,
+            "status": "WARN"
+          }
+        }
+      }
+    ],
+    "system_health": {
+      "nvidia_persistenced": {
+        "installed": true,
+        "running": false
+      },
+      "hugepages": {
+        "configured": false,
+        "count": 0
+      },
+      "swap": {
+        "enabled": true
+      },
+      "transparent_hugepage": "madvise",
+      "file_descriptors": {
+        "soft": 1024,
+        "max": 1048576
+      },
+      "infiniband_devices": [
+        "mlx5_4",
+        "mlx5_2",
+        "mlx5_0",
+        "mlx5_9",
+        "mlx5_7",
+        "mlx5_5",
+        "mlx5_3",
+        "mlx5_1",
+        "mlx5_8",
+        "mlx5_6"
+      ],
+      "rdma_devices": [
+        "abi_version",
+        "uverbs4",
+        "uverbs2",
+        "uverbs0",
+        "uverbs9",
+        "uverbs7",
+        "uverbs5",
+        "uverbs3",
+        "uverbs1",
+        "uverbs8",
+        "uverbs6"
+      ],
+      "nccl_env_vars": {}
+    },
+    "timestamp": "2026-05-22T15:49:11.294816",
+    "detected_gpu_type": "h100"
+  },
+  "memory_bench": {
+    "memory": {
+      "source": "nvbandwidth",
+      "h2d_bandwidth_gbps": 55.5,
+      "d2h_bandwidth_gbps": 55.3,
+      "d2d_bandwidth_gbps": 486.5,
+      "h2d_peak_gbps": 64,
+      "d2h_peak_gbps": 64,
+      "d2d_peak_gbps": 450.0,
+      "h2d_efficiency_pct": 86.7,
+      "d2h_efficiency_pct": 86.4,
+      "d2d_efficiency_pct": 108.1,
+      "peak_bandwidth_gbps": 3400,
+      "efficiency_pct": 108.1,
+      "results_by_test": {
+        "h2d": 55.5,
+        "d2h": 55.3,
+        "d2d_write": 397.4,
+        "d2d_read": 395.1,
+        "d2d_bidir": 486.5
+      },
+      "per_gpu": []
+    }
+  },
+  "compute_bench": {
+    "compute": {
+      "per_dtype_tflops": {
+        "fp32": 51.9,
+        "tf32": 357.0,
+        "fp16": 664.0,
+        "bf16": 700.1,
+        "fp8": 1116.2
+      },
+      "peak_tflops": {
+        "fp32": 67,
+        "tf32": 495,
+        "fp16": 990,
+        "bf16": 990,
+        "fp8": 1979
+      },
+      "efficiency_pct": {
+        "fp32": 77.5,
+        "tf32": 72.1,
+        "fp16": 67.1,
+        "bf16": 70.7,
+        "fp8": 56.4
+      },
+      "pass_thresholds_tflops": {
+        "fp32": 54,
+        "tf32": 444,
+        "fp16": 734,
+        "bf16": 745,
+        "fp8": 1400
+      },
+      "per_gpu": [
+        {
+          "index": 0,
+          "fp32": 51.9,
+          "tf32": 357.0,
+          "fp16": 664.0,
+          "bf16": 700.1,
+          "fp8": 1116.2
+        },
+        {
+          "index": 1,
+          "fp32": 51.9,
+          "tf32": 357.0,
+          "fp16": 664.0,
+          "bf16": 700.1,
+          "fp8": 1116.2
+        },
+        {
+          "index": 2,
+          "fp32": 51.9,
+          "tf32": 357.0,
+          "fp16": 664.0,
+          "bf16": 700.1,
+          "fp8": 1116.2
+        },
+        {
+          "index": 3,
+          "fp32": 51.9,
+          "tf32": 357.0,
+          "fp16": 664.0,
+          "bf16": 700.1,
+          "fp8": 1116.2
+        },
+        {
+          "index": 4,
+          "fp32": 51.9,
+          "tf32": 357.0,
+          "fp16": 664.0,
+          "bf16": 700.1,
+          "fp8": 1116.2
+        },
+        {
+          "index": 5,
+          "fp32": 51.9,
+          "tf32": 357.0,
+          "fp16": 664.0,
+          "bf16": 700.1,
+          "fp8": 1116.2
+        },
+        {
+          "index": 6,
+          "fp32": 51.9,
+          "tf32": 357.0,
+          "fp16": 664.0,
+          "bf16": 700.1,
+          "fp8": 1116.2
+        },
+        {
+          "index": 7,
+          "fp32": 51.9,
+          "tf32": 357.0,
+          "fp16": 664.0,
+          "bf16": 700.1,
+          "fp8": 1116.2
+        }
+      ],
+      "matrix_size": 8192,
+      "warmup": 50,
+      "iterations": 500
+    }
+  },
+  "nccl": {
+    "passed": false,
+    "source": "torchrun_fallback",
+    "tests": {
+      "NCCL version 2.21.5+cuda12.4": {
+        "status": "FAIL",
+        "error": null
+      },
+      "allreduce": {
+        "status": "PASS",
+        "error": null
+      },
+      "broadcast": {
+        "status": "PASS",
+        "error": null
+      },
+      "allgather": {
+        "status": "PASS",
+        "error": null
+      },
+      "reducescatter": {
+        "status": "PASS",
+        "error": null
+      },
+      "alltoall": {
+        "status": "PASS",
+        "error": null
+      }
+    },
+    "gpu_count": 8
+  },
+  "stress": {
+    "source": "pytorch",
+    "passed": true,
+    "duration_sec": 60,
+    "elapsed_sec": 60.0,
+    "gpu_status": {
+      "0": "PASS",
+      "1": "PASS",
+      "2": "PASS",
+      "3": "PASS",
+      "4": "PASS",
+      "5": "PASS",
+      "6": "PASS",
+      "7": "PASS"
+    },
+    "timestamp": "2026-05-22T15:51:56.803540"
+  },
+  "rdma": {
+    "passed": false,
+    "devices": [
+      {
+        "name": "mlx5_0",
+        "ports": [
+          {
+            "port": "1",
+            "rate": "400 Gb/sec (4X NDR)",
+            "state": "4: ACTIVE",
+            "phys_state": "5: LinkUp",
+            "gid": "fe80:0000:0000:0000:58a2:e103:0088:81e0"
+          }
+        ]
+      },
+      {
+        "name": "mlx5_1",
+        "ports": [
+          {
+            "port": "1",
+            "rate": "400 Gb/sec (4X NDR)",
+            "state": "4: ACTIVE",
+            "phys_state": "5: LinkUp",
+            "gid": "fe80:0000:0000:0000:9c63:c003:0054:e00a"
+          }
+        ]
+      },
+      {
+        "name": "mlx5_2",
+        "ports": [
+          {
+            "port": "1",
+            "rate": "25 Gb/sec (1X EDR)",
+            "state": "4: ACTIVE",
+            "phys_state": "5: LinkUp",
+            "gid": "fe80:0000:0000:0000:a02d:75ff:feae:2bcf"
+          }
+        ]
+      },
+      {
+        "name": "mlx5_3",
+        "ports": [
+          {
+            "port": "1",
+            "rate": "25 Gb/sec (1X EDR)",
+            "state": "1: DOWN",
+            "phys_state": "3: Disabled",
+            "gid": "fe80:0000:0000:0000:c670:bdff:fefd:5bd9"
+          }
+        ]
+      },
+      {
+        "name": "mlx5_4",
+        "ports": [
+          {
+            "port": "1",
+            "rate": "100 Gb/sec (2X HDR)",
+            "state": "4: ACTIVE",
+            "phys_state": "5: LinkUp",
+            "gid": "fe80:0000:0000:0000:9c63:c003:005f:58ec"
+          }
+        ]
+      },
+      {
+        "name": "mlx5_5",
+        "ports": [
+          {
+            "port": "1",
+            "rate": "100 Gb/sec (2X HDR)",
+            "state": "4: ACTIVE",
+            "phys_state": "5: LinkUp",
+            "gid": "fe80:0000:0000:0000:9c63:c003:005f:58ed"
+          }
+        ]
+      },
+      {
+        "name": "mlx5_6",
+        "ports": [
+          {
+            "port": "1",
+            "rate": "400 Gb/sec (4X NDR)",
+            "state": "4: ACTIVE",
+            "phys_state": "5: LinkUp",
+            "gid": "fe80:0000:0000:0000:9c63:c003:0055:0e56"
+          }
+        ]
+      },
+      {
+        "name": "mlx5_7",
+        "ports": [
+          {
+            "port": "1",
+            "rate": "400 Gb/sec (4X NDR)",
+            "state": "4: ACTIVE",
+            "phys_state": "5: LinkUp",
+            "gid": "fe80:0000:0000:0000:a088:c203:00f0:286c"
+          }
+        ]
+      },
+      {
+        "name": "mlx5_8",
+        "ports": [
+          {
+            "port": "1",
+            "rate": "25 Gb/sec (1X EDR)",
+            "state": "4: ACTIVE",
+            "phys_state": "5: LinkUp",
+            "gid": "fe80:0000:0000:0000:a02d:75ff:feae:2bcf"
+          }
+        ]
+      },
+      {
+        "name": "mlx5_9",
+        "ports": [
+          {
+            "port": "1",
+            "rate": "25 Gb/sec (1X EDR)",
+            "state": "1: DOWN",
+            "phys_state": "3: Disabled",
+            "gid": "fe80:0000:0000:0000:c670:bdff:fefd:569d"
+          }
+        ]
+      }
+    ],
+    "bandwidth_tests": [
+      {
+        "test": "ib_write_bw",
+        "status": "WARN",
+        "bandwidth_gbps": 0.13,
+        "min_required_gbps": 50
+      },
+      {
+        "test": "ib_read_bw",
+        "status": "WARN",
+        "bandwidth_gbps": 0.13,
+        "min_required_gbps": 50
+      }
+    ],
+    "latency_tests": [
+      {
+        "test": "ib_write_lat",
+        "status": "PASS",
+        "latency_us": 4.1,
+        "max_allowed_us": 10
+      },
+      {
+        "test": "ib_read_lat",
+        "status": "WARN",
+        "latency_us": 16.0,
+        "max_allowed_us": 10
+      }
+    ],
+    "timestamp": "2026-05-22T15:52:03.507540"
+  },
+  "training": {
+    "model": "synthetic_transformer",
+    "total_params_m": 1470.5,
+    "num_layers": 6,
+    "hidden_size": 4096,
+    "gpu_count": 8,
+    "dtype": "bfloat16",
+    "batch_size": 8,
+    "seq_length": 2048,
+    "num_steps": 50,
+    "avg_step_time_ms": 312.3,
+    "throughput_tokens_per_sec": 52471.0,
+    "throughput_samples_per_sec": 25.62,
+    "peak_memory_gb": 27.31,
+    "final_loss": 0.0041,
+    "timestamp": "2026-05-22T15:52:32.650522"
+  }
+}
--- a/reports_all_aikubeworker0016.md
+++ b/reports_all_aikubeworker0016.md
@ -0,0 +1,157 @@
+# GPU Test Report
+
+- **Date:** 2026-05-22T15:49:02.368516
+- **Host:** aikubeworker0016
+- **GPU:** NVIDIA H100 80GB HBM3 x8
+- **Driver:** 580.159.03 | **CUDA:** 13.0
+
+## Overall Acceptance Verdict
+
+**Result: FAIL**
+
+Failed or unverified items:
+- Compute Throughput: FAIL (worst FP32 52 vs >= 54)
+- NCCL: FAIL (no nccl-tests bus BW)
+- RDMA: FAIL
+- Training: UNVERIFIED (52471 tokens/sec; legacy result lacks explicit acceptance verdict)
+
+Missing required evidence:
+- NVLink/NVSwitch
+- DCGM
+
+## Summary
+
+| Test | Result |
+|------|--------|
+| GPU Info | PASS (8 GPUs detected) |
+| Health Check | PASS |
+| Memory Bandwidth | PASS (108.1%) |
+| Compute Throughput | FAIL (worst FP32 52 vs >= 54) |
+| NCCL | FAIL (no nccl-tests bus BW) |
+| Stress Test | PASS |
+| RDMA | FAIL |
+| Training | UNVERIFIED (52471 tokens/sec; legacy result lacks explicit acceptance verdict) |
+
+## GPU Information
+
+| GPU | Model | VRAM | Temp | Power | SM Clock |
+|-----|-------|------|------|-------|----------|
+| 0 | NVIDIA H100 80GB HBM3 | 81559 MB | 21C | 70/700W | 345 MHz |
+| 1 | NVIDIA H100 80GB HBM3 | 81559 MB | 21C | 68/700W | 345 MHz |
+| 2 | NVIDIA H100 80GB HBM3 | 81559 MB | 22C | 67/700W | 345 MHz |
+| 3 | NVIDIA H100 80GB HBM3 | 81559 MB | 21C | 67/700W | 345 MHz |
+| 4 | NVIDIA H100 80GB HBM3 | 81559 MB | 21C | 67/700W | 345 MHz |
+| 5 | NVIDIA H100 80GB HBM3 | 81559 MB | 23C | 69/700W | 345 MHz |
+| 6 | NVIDIA H100 80GB HBM3 | 81559 MB | 21C | 68/700W | 345 MHz |
+| 7 | NVIDIA H100 80GB HBM3 | 81559 MB | 21C | 66/700W | 345 MHz |
+
+## Health Check
+
+**Overall: PASS**
+
+| GPU | Temp | Power | ECC | PCIe | Throttle | Status |
+|-----|------|-------|-----|------|----------|--------|
+| 0 | 21C PASS | 70W PASS | S:0 D:0 | Gen5x16 | PASS | **WARN** |
+| 1 | 21C PASS | 67W PASS | S:0 D:0 | Gen5x16 | PASS | **WARN** |
+| 2 | 22C PASS | 67W PASS | S:0 D:0 | Gen5x16 | PASS | **WARN** |
+| 3 | 21C PASS | 67W PASS | S:0 D:0 | Gen5x16 | PASS | **WARN** |
+| 4 | 21C PASS | 67W PASS | S:0 D:0 | Gen5x16 | PASS | **WARN** |
+| 5 | 23C PASS | 69W PASS | S:0 D:0 | Gen5x16 | PASS | **WARN** |
+| 6 | 21C PASS | 68W PASS | S:0 D:0 | Gen5x16 | PASS | **WARN** |
+| 7 | 21C PASS | 66W PASS | S:0 D:0 | Gen5x16 | PASS | **WARN** |
+
+## Memory Bandwidth
+
+Source: nvbandwidth
+
+| Metric | Value | Peak | Efficiency |
+|--------|-------|------|------------|
+| H2D (PCIe) | 55.5 GB/s | 64 GB/s | 86.7% |
+| D2H (PCIe) | 55.3 GB/s | 64 GB/s | 86.4% |
+| D2D (NVLink) | 486.5 GB/s | 450 GB/s | 108.1% |
+
+**Verdict: PASS** (D2D efficiency 108.1%)
+
+## Compute Throughput
+
+| DType | Achieved (TFLOPS) | Peak | Threshold | Status |
+|-------|-------------------|------|------------|--------|
+| FP32 | 51.9 | 67 | >= 54 | FAIL |
+| TF32 | 357.0 | 495 | >= 444 | FAIL |
+| FP16 | 664.0 | 990 | >= 734 | FAIL |
+| BF16 | 700.1 | 990 | >= 745 | FAIL |
+| FP8 | 1116.2 | 1979 | >= 1400 | FAIL |
+
+**Verdict: FAIL** (absolute TFLOPS thresholds; worst efficiency 56.4%)
+
+### Compute Per-GPU TFLOPS
+
+| GPU | FP32 | TF32 | FP16 | BF16 | FP8 |
+|---|---|---|---|---|---|
+| 0 | 51.9 | 357.0 | 664.0 | 700.1 | 1116.2 |
+| 1 | 51.9 | 357.0 | 664.0 | 700.1 | 1116.2 |
+| 2 | 51.9 | 357.0 | 664.0 | 700.1 | 1116.2 |
+| 3 | 51.9 | 357.0 | 664.0 | 700.1 | 1116.2 |
+| 4 | 51.9 | 357.0 | 664.0 | 700.1 | 1116.2 |
+| 5 | 51.9 | 357.0 | 664.0 | 700.1 | 1116.2 |
+| 6 | 51.9 | 357.0 | 664.0 | 700.1 | 1116.2 |
+| 7 | 51.9 | 357.0 | 664.0 | 700.1 | 1116.2 |
+
+## NCCL Multi-GPU
+
+Source: torchrun_fallback | GPUs: 8
+
+> Functional NCCL smoke only: nccl-tests bus bandwidth was not measured, so this does not satisfy production acceptance.
+
+| Operation | Bus BW (GB/s) | Threshold | Status |
+|-----------|---------------|-----------|--------|
+| NCCL version 2.21.5+cuda12.4 | 0.0 | >= 0 | FAIL |
+| allreduce | 0.0 | >= 0 | PASS |
+| broadcast | 0.0 | >= 0 | PASS |
+| allgather | 0.0 | >= 0 | PASS |
+| reducescatter | 0.0 | >= 0 | PASS |
+| alltoall | 0.0 | >= 0 | PASS |
+
+**Overall: FAIL**
+
+## Stress Test
+
+- **Source:** pytorch
+- **Duration:** 60s (requested 60s)
+- **Result: PASS**
+
+## RDMA/InfiniBand
+
+> Legacy RDMA result re-evaluated with current PDF acceptance thresholds; old WARN statuses and old 50GB/s/10us limits are not used for verdict.
+
+| Test | Value | Threshold | Status |
+|------|-------|-----------|--------|
+| ib_write_bw | 0.1 GB/s | >= 47 GB/s | FAIL |
+| ib_read_bw | 0.1 GB/s | >= 47 GB/s | FAIL |
+| ib_write_lat | 4.10 us | <= 2 us | FAIL |
+| ib_read_lat | 16.00 us | <= 3.5 us | FAIL |
+
+- **Failure reasons:**
+  - ib_write_bw bandwidth 0.13GB/s < 47GB/s
+  - ib_read_bw bandwidth 0.13GB/s < 47GB/s
+  - ib_write_lat latency 4.1us > 2us
+  - ib_read_lat latency 16.0us > 3.5us
+**Overall: FAIL**
+
+## Training Simulation
+
+| Metric | Value |
+|--------|-------|
+| Model | synthetic_transformer |
+| Params | 1470.5M |
+| Throughput | 52471 tokens/sec |
+| Avg Step Time | 312.3 ms |
+| Peak Memory | 27.3 GB |
+| Final Loss | 0.0041 |
+| Step Jitter | N/A |
+| Distributed Mode | N/A |
+| Acceptance Gaps | missing passed, step_jitter_pct, distributed_mode, loss_finite |
+| Verdict | UNVERIFIED (52471 tokens/sec; legacy result lacks explicit acceptance verdict) |
+
+---
+*Generated by GPU Test Suite v0.2.0*
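Review note: the RDMA section of the report above says the legacy result was re-evaluated against the current PDF thresholds, ignoring the old `WARN` statuses and the old 50 GB/s / 10 us limits. The following is a minimal sketch of what such a re-evaluation could look like. The function name `reevaluate_rdma` and the two limit tables are hypothetical and not taken from the suite; only the threshold values and the input fields come from the report and JSON above.

```python
# Hypothetical sketch: re-judge a legacy RDMA result dict against stricter limits.
# Limit values are copied from the report table; all names here are illustrative.
BW_MIN_GBPS = {"ib_write_bw": 47.0, "ib_read_bw": 47.0}
LAT_MAX_US = {"ib_write_lat": 2.0, "ib_read_lat": 3.5}


def reevaluate_rdma(rdma: dict) -> list[str]:
    """Return failure reasons; the legacy status and limit fields are never read."""
    reasons = []
    for t in rdma.get("bandwidth_tests", []):
        limit = BW_MIN_GBPS.get(t["test"])
        if limit is not None and t["bandwidth_gbps"] < limit:
            reasons.append(
                f"{t['test']} bandwidth {t['bandwidth_gbps']}GB/s < {limit:g}GB/s"
            )
    for t in rdma.get("latency_tests", []):
        limit = LAT_MAX_US.get(t["test"])
        if limit is not None and t["latency_us"] > limit:
            reasons.append(
                f"{t['test']} latency {t['latency_us']}us > {limit:g}us"
            )
    return reasons


# Values taken from the legacy "rdma" block in the JSON report above.
legacy = {
    "bandwidth_tests": [
        {"test": "ib_write_bw", "status": "WARN", "bandwidth_gbps": 0.13, "min_required_gbps": 50},
        {"test": "ib_read_bw", "status": "WARN", "bandwidth_gbps": 0.13, "min_required_gbps": 50},
    ],
    "latency_tests": [
        {"test": "ib_write_lat", "status": "PASS", "latency_us": 4.1, "max_allowed_us": 10},
        {"test": "ib_read_lat", "status": "WARN", "latency_us": 16.0, "max_allowed_us": 10},
    ],
}
reasons = reevaluate_rdma(legacy)
for line in reasons:
    print(line)
```

Under these assumptions, `ib_write_lat` becomes a failure even though the legacy status was `PASS`, because 4.1 us exceeds the 2 us limit.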
--- a/reports_dcgm_r3_aikubeworker0012_20260522_200338.md
+++ b/reports_dcgm_r3_aikubeworker0012_20260522_200338.md
@ -0,0 +1,65 @@
+# GPU Test Report
+
+- **Date:** 2026-05-22T20:26:56.947796
+- **Host:** aikubeworker0012
+
+## Overall Acceptance Verdict
+
+**Result: FAIL**
+
+Missing required evidence:
+- GPU Info
+- Health Check
+- Memory Bandwidth
+- Compute Throughput
+- NVLink/NVSwitch
+- NCCL
+- Stress Test
+- RDMA
+- Training
+
+## Summary
+
+| Test | Result |
+|------|--------|
+| DCGM | PASS |
+
+## DCGM Diagnostic
+
+**Overall: PASS**
+
+| Subtest | Status |
+|---------|--------|
+| Hardware/nvbandwidth/GPU6 | PASS |
+| Hardware/nvbandwidth/GPU7 | PASS |
+| Hardware/nvbandwidth/summary | PASS |
+| Integration/pcie/GPU0 | PASS |
+| Integration/pcie/GPU1 | PASS |
+| Integration/pcie/GPU2 | PASS |
+| Integration/pcie/GPU3 | PASS |
+| Integration/pcie/GPU4 | PASS |
+| Integration/pcie/GPU5 | PASS |
+| Integration/pcie/GPU6 | PASS |
+| Integration/pcie/GPU7 | PASS |
+| Integration/pcie/summary | PASS |
+| Stress/targeted_stress/GPU0 | PASS |
+| Stress/targeted_stress/GPU1 | PASS |
+| Stress/targeted_stress/GPU2 | PASS |
+| Stress/targeted_stress/GPU3 | PASS |
+| Stress/targeted_stress/GPU4 | PASS |
+| Stress/targeted_stress/GPU5 | PASS |
+| Stress/targeted_stress/GPU6 | PASS |
+| Stress/targeted_stress/GPU7 | PASS |
+| Stress/targeted_stress/summary | PASS |
+| Stress/targeted_power/GPU0 | PASS |
+| Stress/targeted_power/GPU1 | PASS |
+| Stress/targeted_power/GPU2 | PASS |
+| Stress/targeted_power/GPU3 | PASS |
+| Stress/targeted_power/GPU4 | PASS |
+| Stress/targeted_power/GPU5 | PASS |
+| Stress/targeted_power/GPU6 | PASS |
+| Stress/targeted_power/GPU7 | PASS |
+| Stress/targeted_power/summary | PASS |
+
+---
+*Generated by GPU Test Suite v0.2.0*
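Review note: the DCGM-only report above shows an overall `FAIL` even though its single test passed, because the other required items are listed as missing evidence. Here is a rough sketch of that rule, assuming the required list is the union of items named across the reports in this commit. `REQUIRED` and `overall_verdict` are hypothetical names, not the suite's actual API.

```python
# Hypothetical sketch of the "Overall Acceptance Verdict" rule seen in the reports.
# The required list is inferred from the report sections; it is an assumption.
REQUIRED = [
    "GPU Info", "Health Check", "Memory Bandwidth", "Compute Throughput",
    "NVLink/NVSwitch", "NCCL", "Stress Test", "RDMA", "DCGM", "Training",
]


def overall_verdict(results: dict[str, str]) -> tuple[str, list[str], list[str]]:
    """Return (verdict, failed_or_unverified, missing_evidence)."""
    # A required item with no result at all counts as missing evidence.
    missing = [name for name in REQUIRED if name not in results]
    # Anything other than PASS (FAIL, UNVERIFIED, ...) blocks acceptance.
    not_passed = [f"{name}: {status}" for name, status in results.items() if status != "PASS"]
    verdict = "PASS" if not missing and not not_passed else "FAIL"
    return verdict, not_passed, missing


# A DCGM-only run: the one test passes, yet nine required items have no evidence.
verdict, not_passed, missing = overall_verdict({"DCGM": "PASS"})
print(verdict, missing)
```

Treating missing evidence as a failure keeps a partial run from being read as a full acceptance.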
--- a/reports_dcgm_r3_aikubeworker0016_20260522_200538.md
+++ b/reports_dcgm_r3_aikubeworker0016_20260522_200538.md
@ -0,0 +1,65 @@
+# GPU Test Report
+
+- **Date:** 2026-05-22T20:28:58.716266
+- **Host:** aikubeworker0016
+
+## Overall Acceptance Verdict
+
+**Result: FAIL**
+
+Missing required evidence:
+- GPU Info
+- Health Check
+- Memory Bandwidth
+- Compute Throughput
+- NVLink/NVSwitch
+- NCCL
+- Stress Test
+- RDMA
+- Training
+
+## Summary
+
+| Test | Result |
+|------|--------|
+| DCGM | PASS |
+
+## DCGM Diagnostic
+
+**Overall: PASS**
+
+| Subtest | Status |
+|---------|--------|
+| Hardware/nvbandwidth/GPU6 | PASS |
+| Hardware/nvbandwidth/GPU7 | PASS |
+| Hardware/nvbandwidth/summary | PASS |
+| Integration/pcie/GPU0 | PASS |
+| Integration/pcie/GPU1 | PASS |
+| Integration/pcie/GPU2 | PASS |
+| Integration/pcie/GPU3 | PASS |
+| Integration/pcie/GPU4 | PASS |
+| Integration/pcie/GPU5 | PASS |
+| Integration/pcie/GPU6 | PASS |
+| Integration/pcie/GPU7 | PASS |
+| Integration/pcie/summary | PASS |
+| Stress/targeted_stress/GPU0 | PASS |
+| Stress/targeted_stress/GPU1 | PASS |
+| Stress/targeted_stress/GPU2 | PASS |
+| Stress/targeted_stress/GPU3 | PASS |
+| Stress/targeted_stress/GPU4 | PASS |
+| Stress/targeted_stress/GPU5 | PASS |
+| Stress/targeted_stress/GPU6 | PASS |
+| Stress/targeted_stress/GPU7 | PASS |
+| Stress/targeted_stress/summary | PASS |
+| Stress/targeted_power/GPU0 | PASS |
+| Stress/targeted_power/GPU1 | PASS |
+| Stress/targeted_power/GPU2 | PASS |
+| Stress/targeted_power/GPU3 | PASS |
+| Stress/targeted_power/GPU4 | PASS |
+| Stress/targeted_power/GPU5 | PASS |
+| Stress/targeted_power/GPU6 | PASS |
+| Stress/targeted_power/GPU7 | PASS |
+| Stress/targeted_power/summary | PASS |
+
+---
+*Generated by GPU Test Suite v0.2.0*
--- a/reports_nvbandwidth_aikubeworker0012.json
+++ b/reports_nvbandwidth_aikubeworker0012.json
@ -0,0 +1,70 @@
+{
+  "benchmark": {
+    "memory": {
+      "source": "nvbandwidth",
+      "h2d_bandwidth_gbps": 55.5,
+      "d2h_bandwidth_gbps": 54.8,
+      "d2d_bandwidth_gbps": 0.0,
+      "h2d_peak_gbps": 64,
+      "d2h_peak_gbps": 64,
+      "d2d_peak_gbps": 450.0,
+      "h2d_efficiency_pct": 86.7,
+      "d2h_efficiency_pct": 85.6,
+      "d2d_efficiency_pct": null,
+      "peak_bandwidth_gbps": 3400,
+      "efficiency_pct": null,
+      "results_by_test": {
+        "h2d": 55.5,
+        "d2h": 54.8,
+        "d2d_write": 0.0,
+        "d2d_read": 0.0,
+        "d2d_bidir": 0.0
+      },
+      "per_gpu": []
+    },
+    "compute": {
+      "per_dtype_tflops": {
+        "fp32": 52.2,
+        "tf32": 360.7,
+        "fp16": 680.0,
+        "bf16": 707.6,
+        "fp8": 1142.4
+      },
+      "peak_tflops": {
+        "fp32": 67,
+        "tf32": 495,
+        "fp16": 990,
+        "bf16": 990,
+        "fp8": 1979
+      },
+      "efficiency_pct": {
+        "fp32": 77.9,
+        "tf32": 72.9,
+        "fp16": 68.7,
+        "bf16": 71.5,
+        "fp8": 57.7
+      },
+      "pass_thresholds_tflops": {
+        "fp32": 54,
+        "tf32": 444,
+        "fp16": 734,
+        "bf16": 745,
+        "fp8": 1400
+      },
+      "per_gpu": [
+        {
+          "index": 0,
+          "fp32": 52.2,
+          "tf32": 360.7,
+          "fp16": 680.0,
+          "bf16": 707.6,
+          "fp8": 1142.4
+        }
+      ],
+      "matrix_size": 8192,
+      "warmup": 50,
+      "iterations": 500
+    }
+  },
+  "timestamp": "2026-05-22T15:35:16.675924"
+}
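Review note: the `efficiency_pct` fields in the JSON above are consistent with measured value divided by theoretical peak, rounded to one decimal place. The sketch below reproduces those figures and the threshold comparison. Returning `None` for a zero measurement mirrors the `null` recorded for `d2d_efficiency_pct`, but that mapping is an assumption; `efficiency_pct` here is an illustrative helper, not the suite's function.

```python
from typing import Optional


def efficiency_pct(measured: float, peak: float) -> Optional[float]:
    """Measured / peak as a percentage, or None when nothing was measured."""
    if measured <= 0 or peak <= 0:
        return None  # assumption: a 0.0 reading means "not measured"
    return round(measured / peak * 100, 1)


# Numbers copied from reports_nvbandwidth_aikubeworker0012.json.
measured = {"fp32": 52.2, "tf32": 360.7, "fp16": 680.0, "bf16": 707.6, "fp8": 1142.4}
peak = {"fp32": 67, "tf32": 495, "fp16": 990, "bf16": 990, "fp8": 1979}
thresholds = {"fp32": 54, "tf32": 444, "fp16": 734, "bf16": 745, "fp8": 1400}

eff = {k: efficiency_pct(measured[k], peak[k]) for k in measured}
below = [k for k in measured if measured[k] < thresholds[k]]

h2d_eff = efficiency_pct(55.5, 64)    # PCIe host-to-device
d2d_eff = efficiency_pct(0.0, 450.0)  # D2D was not measured in this run

print(eff, below, h2d_eff, d2d_eff)
```

With these inputs every dtype falls below its absolute TFLOPS threshold, which matches the overall compute `FAIL` in the accompanying Markdown report.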
--- a/reports_nvbandwidth_aikubeworker0012.md
+++ b/reports_nvbandwidth_aikubeworker0012.md
@ -0,0 +1,38 @@
+# GPU Test Report
+
+- **Date:** 2026-05-22 15:37:12
+- **Host:** aikubeworker0012
+
+## Summary
+
+| Test | Result |
+|------|--------|
+| Memory Bandwidth | FAIL (0.0%) |
+| Compute Throughput | FAIL (worst TF32 361 vs >= 444) |
+
+## Memory Bandwidth
+
+Source: nvbandwidth
+
+| Metric | Value | Peak | Efficiency |
+|--------|-------|------|------------|
+| H2D (PCIe) | 55.5 GB/s | 64 GB/s | 86.7% |
+| D2H (PCIe) | 54.8 GB/s | 64 GB/s | 85.6% |
+| D2D (NVLink) | 0.0 GB/s | 450 GB/s | 0.0% |
+
+**Verdict: FAIL** (D2D efficiency 0.0%)
+
+## Compute Throughput
+
+| DType | Achieved (TFLOPS) | Peak | Threshold | Status |
+|-------|-------------------|------|------------|--------|
+| FP32 | 52.2 | 67 | >= 54 | WARN |
+| TF32 | 360.7 | 495 | >= 444 | FAIL |
+| FP16 | 680.0 | 990 | >= 734 | WARN |
+| BF16 | 707.6 | 990 | >= 745 | WARN |
+| FP8 | 1142.4 | 1979 | >= 1400 | FAIL |
+
+**Verdict: FAIL** (absolute TFLOPS thresholds; worst efficiency 57.7%)
+
+---
+*Generated by GPU Test Suite v0.2.0*
--- a/reports_nvbandwidth_aikubeworker0016.json
+++ b/reports_nvbandwidth_aikubeworker0016.json
@ -0,0 +1,70 @@
+{
+  "benchmark": {
+    "memory": {
+      "source": "nvbandwidth",
+      "h2d_bandwidth_gbps": 55.5,
+      "d2h_bandwidth_gbps": 55.0,
+      "d2d_bandwidth_gbps": 0.0,
+      "h2d_peak_gbps": 64,
+      "d2h_peak_gbps": 64,
+      "d2d_peak_gbps": 450.0,
+      "h2d_efficiency_pct": 86.7,
+      "d2h_efficiency_pct": 85.9,
+      "d2d_efficiency_pct": null,
+      "peak_bandwidth_gbps": 3400,
+      "efficiency_pct": null,
+      "results_by_test": {
+        "h2d": 55.5,
+        "d2h": 55.0,
+        "d2d_write": 0.0,
+        "d2d_read": 0.0,
+        "d2d_bidir": 0.0
+      },
+      "per_gpu": []
+    },
+    "compute": {
+      "per_dtype_tflops": {
+        "fp32": 52.2,
+        "tf32": 357.5,
+        "fp16": 665.3,
+        "bf16": 697.1,
+        "fp8": 1138.8
+      },
+      "peak_tflops": {
+        "fp32": 67,
+        "tf32": 495,
+        "fp16": 990,
+        "bf16": 990,
+        "fp8": 1979
+      },
+      "efficiency_pct": {
+        "fp32": 77.9,
+        "tf32": 72.2,
+        "fp16": 67.2,
+        "bf16": 70.4,
+        "fp8": 57.5
+      },
+      "pass_thresholds_tflops": {
+        "fp32": 54,
+        "tf32": 444,
+        "fp16": 734,
+        "bf16": 745,
+        "fp8": 1400
+      },
+      "per_gpu": [
+        {
+          "index": 0,
+          "fp32": 52.2,
+          "tf32": 357.5,
+          "fp16": 665.3,
+          "bf16": 697.1,
+          "fp8": 1138.8
+        }
+      ],
+      "matrix_size": 8192,
+      "warmup": 50,
+      "iterations": 500
+    }
+  },
+  "timestamp": "2026-05-22T15:35:19.219299"
+}
--- a/reports_nvbandwidth_aikubeworker0016.md
+++ b/reports_nvbandwidth_aikubeworker0016.md
@ -0,0 +1,38 @@
+# GPU Test Report
+
+- **Date:** 2026-05-22 15:37:18
+- **Host:** aikubeworker0016
+
+## Summary
+
+| Test | Result |
+|------|--------|
+| Memory Bandwidth | FAIL (0.0%) |
+| Compute Throughput | FAIL (worst TF32 358 vs >= 444) |
+
+## Memory Bandwidth
+
+Source: nvbandwidth
+
+| Metric | Value | Peak | Efficiency |
+|--------|-------|------|------------|
+| H2D (PCIe) | 55.5 GB/s | 64 GB/s | 86.7% |
+| D2H (PCIe) | 55.0 GB/s | 64 GB/s | 85.9% |
+| D2D (NVLink) | 0.0 GB/s | 450 GB/s | 0.0% |
+
+**Verdict: FAIL** (D2D efficiency 0.0%)
+
+## Compute Throughput
+
+| DType | Achieved (TFLOPS) | Peak | Threshold | Status |
+|-------|-------------------|------|------------|--------|
+| FP32 | 52.2 | 67 | >= 54 | WARN |
+| TF32 | 357.5 | 495 | >= 444 | FAIL |
+| FP16 | 665.3 | 990 | >= 734 | WARN |
+| BF16 | 697.1 | 990 | >= 745 | WARN |
+| FP8 | 1138.8 | 1979 | >= 1400 | FAIL |
+
+**Verdict: FAIL** (absolute TFLOPS thresholds; worst efficiency 57.5%)
+
+---
+*Generated by GPU Test Suite v0.2.0*
--- a/reports_rdma_aikubeworker0012.json
+++ b/reports_rdma_aikubeworker0012.json
@ -0,0 +1,157 @@
+{
+  "rdma": {
+    "passed": false,
+    "devices": [
+      {
+        "name": "mlx5_0",
+        "ports": [
+          {
+            "port": "1",
+            "rate": "400 Gb/sec (4X NDR)",
+            "state": "4: ACTIVE",
+            "phys_state": "5: LinkUp",
+            "gid": "fe80:0000:0000:0000:58a2:e103:0093:3898"
+          }
+        ]
+      },
+      {
+        "name": "mlx5_1",
+        "ports": [
+          {
+            "port": "1",
+            "rate": "400 Gb/sec (4X NDR)",
+            "state": "4: ACTIVE",
+            "phys_state": "5: LinkUp",
+            "gid": "fe80:0000:0000:0000:58a2:e103:0093:3db0"
+          }
+        ]
+      },
+      {
+        "name": "mlx5_2",
+        "ports": [
+          {
+            "port": "1",
+            "rate": "25 Gb/sec (1X EDR)",
+            "state": "4: ACTIVE",
+            "phys_state": "5: LinkUp",
+            "gid": "fe80:0000:0000:0000:5c3f:b8ff:fe5e:7832"
+          }
+        ]
+      },
+      {
+        "name": "mlx5_3",
+        "ports": [
+          {
+            "port": "1",
+            "rate": "25 Gb/sec (1X EDR)",
+            "state": "1: DOWN",
+            "phys_state": "3: Disabled",
+            "gid": "fe80:0000:0000:0000:5e25:73ff:fe4e:eac1"
+          }
+        ]
+      },
+      {
+        "name": "mlx5_4",
+        "ports": [
+          {
+            "port": "1",
+            "rate": "100 Gb/sec (2X HDR)",
+            "state": "4: ACTIVE",
+            "phys_state": "5: LinkUp",
+            "gid": "fe80:0000:0000:0000:9c63:c003:005f:63cc"
+          }
+        ]
+      },
+      {
+        "name": "mlx5_5",
+        "ports": [
+          {
+            "port": "1",
+            "rate": "100 Gb/sec (2X HDR)",
+            "state": "4: ACTIVE",
+            "phys_state": "5: LinkUp",
+            "gid": "fe80:0000:0000:0000:9c63:c003:005f:63cd"
+          }
+        ]
+      },
+      {
+        "name": "mlx5_6",
+        "ports": [
+          {
+            "port": "1",
+            "rate": "400 Gb/sec (4X NDR)",
+            "state": "4: ACTIVE",
+            "phys_state": "5: LinkUp",
+            "gid": "fe80:0000:0000:0000:58a2:e103:0093:3bf4"
+          }
+        ]
+      },
+      {
+        "name": "mlx5_7",
+        "ports": [
+          {
+            "port": "1",
+            "rate": "400 Gb/sec (4X NDR)",
+            "state": "4: ACTIVE",
+            "phys_state": "5: LinkUp",
+            "gid": "fe80:0000:0000:0000:58a2:e103:0093:3e28"
+          }
+        ]
+      },
+      {
+        "name": "mlx5_8",
+        "ports": [
+          {
+            "port": "1",
+            "rate": "25 Gb/sec (1X EDR)",
+            "state": "4: ACTIVE",
+            "phys_state": "5: LinkUp",
+            "gid": "fe80:0000:0000:0000:5c3f:b8ff:fe5e:7832"
+          }
+        ]
+      },
+      {
+        "name": "mlx5_9",
+        "ports": [
+          {
+            "port": "1",
+            "rate": "25 Gb/sec (1X EDR)",
+            "state": "1: DOWN",
+            "phys_state": "3: Disabled",
+            "gid": "fe80:0000:0000:0000:5e25:73ff:fe63:1717"
+          }
+        ]
+      }
+    ],
+    "bandwidth_tests": [
+      {
+        "test": "ib_write_bw",
+        "status": "WARN",
+        "bandwidth_gbps": 0.13,
+        "min_required_gbps": 50
+      },
+      {
+        "test": "ib_read_bw",
+        "status": "WARN",
+        "bandwidth_gbps": 0.13,
+        "min_required_gbps": 50
+      }
+    ],
+    "latency_tests": [
+      {
+        "test": "ib_write_lat",
+        "status": "PASS",
+        "latency_us": 4.53,
+        "max_allowed_us": 10
+      },
+      {
+        "test": "ib_read_lat",
+        "status": "WARN",
+        "latency_us": 16.0,
+        "max_allowed_us": 10
+      }
+    ],
+    "timestamp": "2026-05-22T15:41:20.534115"
+  },
+  "timestamp": "2026-05-22T15:41:20.544589"
+}
--- a/reports_rdma_aikubeworker0016.json
+++ b/reports_rdma_aikubeworker0016.json
@ -0,0 +1,157 @@
+{
+  "rdma": {
+    "passed": false,
+    "devices": [
+      {
+        "name": "mlx5_0",
+        "ports": [
+          {
+            "port": "1",
+            "rate": "400 Gb/sec (4X NDR)",
+            "state": "4: ACTIVE",
+            "phys_state": "5: LinkUp",
+            "gid": "fe80:0000:0000:0000:58a2:e103:0088:81e0"
+          }
+        ]
+      },
+      {
+        "name": "mlx5_1",
+        "ports": [
+          {
+            "port": "1",
+            "rate": "400 Gb/sec (4X NDR)",
+            "state": "4: ACTIVE",
+            "phys_state": "5: LinkUp",
+            "gid": "fe80:0000:0000:0000:9c63:c003:0054:e00a"
+          }
+        ]
+      },
+      {
+        "name": "mlx5_2",
+        "ports": [
+          {
+            "port": "1",
+            "rate": "25 Gb/sec (1X EDR)",
+            "state": "4: ACTIVE",
+            "phys_state": "5: LinkUp",
+            "gid": "fe80:0000:0000:0000:a02d:75ff:feae:2bcf"
+          }
+        ]
+      },
+      {
+        "name": "mlx5_3",
+        "ports": [
+          {
+            "port": "1",
+            "rate": "25 Gb/sec (1X EDR)",
+            "state": "1: DOWN",
+            "phys_state": "3: Disabled",
+            "gid": "fe80:0000:0000:0000:c670:bdff:fefd:5bd9"
+          }
+        ]
+      },
+      {
+        "name": "mlx5_4",
+        "ports": [
+          {
+            "port": "1",
+            "rate": "100 Gb/sec (2X HDR)",
+            "state": "4: ACTIVE",
+            "phys_state": "5: LinkUp",
+            "gid": "fe80:0000:0000:0000:9c63:c003:005f:58ec"
+          }
+        ]
+      },
+      {
+        "name": "mlx5_5",
+        "ports": [
+          {
+            "port": "1",
+            "rate": "100 Gb/sec (2X HDR)",
+            "state": "4: ACTIVE",
+            "phys_state": "5: LinkUp",
+            "gid": "fe80:0000:0000:0000:9c63:c003:005f:58ed"
+          }
+        ]
+      },
+      {
+        "name": "mlx5_6",
+        "ports": [
+          {
+            "port": "1",
+            "rate": "400 Gb/sec (4X NDR)",
+            "state": "4: ACTIVE",
+            "phys_state": "5: LinkUp",
+            "gid": "fe80:0000:0000:0000:9c63:c003:0055:0e56"
+          }
+        ]
+      },
+      {
+        "name": "mlx5_7",
+        "ports": [
+          {
+            "port": "1",
+            "rate": "400 Gb/sec (4X NDR)",
+            "state": "4: ACTIVE",
+            "phys_state": "5: LinkUp",
+            "gid": "fe80:0000:0000:0000:a088:c203:00f0:286c"
+          }
+        ]
+      },
+      {
+        "name": "mlx5_8",
+        "ports": [
+          {
+            "port": "1",
+            "rate": "25 Gb/sec (1X EDR)",
+            "state": "4: ACTIVE",
+            "phys_state": "5: LinkUp",
+            "gid": "fe80:0000:0000:0000:a02d:75ff:feae:2bcf"
+          }
+        ]
+      },
+      {
+        "name": "mlx5_9",
+        "ports": [
+          {
+            "port": "1",
+            "rate": "25 Gb/sec (1X EDR)",
+            "state": "1: DOWN",
+            "phys_state": "3: Disabled",
+            "gid": "fe80:0000:0000:0000:c670:bdff:fefd:569d"
+          }
+        ]
+      }
+    ],
+    "bandwidth_tests": [
+      {
+        "test": "ib_write_bw",
+        "status": "WARN",
+        "bandwidth_gbps": 0.13,
+        "min_required_gbps": 50
+      },
+      {
+        "test": "ib_read_bw",
+        "status": "WARN",
+        "bandwidth_gbps": 0.13,
+        "min_required_gbps": 50
+      }
+    ],
+    "latency_tests": [
+      {
+        "test": "ib_write_lat",
+        "status": "PASS",
+        "latency_us": 4.22,
+        "max_allowed_us": 10
+      },
+      {
+        "test": "ib_read_lat",
+        "status": "WARN",
+        "latency_us": 16.0,
+        "max_allowed_us": 10
+      }
+    ],
+    "timestamp": "2026-05-22T15:41:07.851101"
+  },
+  "timestamp": "2026-05-22T15:41:07.861558"
+}
--- a/reports_rdma_counter_aikubeworker0012_20260522_194808.md
+++ b/reports_rdma_counter_aikubeworker0012_20260522_194808.md
@ -0,0 +1,62 @@
+# GPU Test Report
+
+- **Date:** 2026-05-22T19:48:26.622179
+- **Host:** aikubeworker0012
+
+## Overall Acceptance Verdict
+
+**Result: FAIL**
+
+Failed or unverified items:
+- RDMA: FAIL
+
+Missing required evidence:
+- GPU Info
+- Health Check
+- Memory Bandwidth
+- Compute Throughput
+- NVLink/NVSwitch
+- NCCL
+- Stress Test
+- DCGM
+- Training
+
+## Summary
+
+| Test | Result |
+|------|--------|
+| RDMA | FAIL |
+
+## RDMA/InfiniBand
+
+### RDMA Port Checks
+
+| Device | Port | State | Rate | Required | Status |
+|--------|------|-------|------|----------|--------|
+| mlx5_0 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
+| mlx5_1 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
+| mlx5_4 | 1 | 4: ACTIVE | 100 Gb/sec (2X HDR) | >= 400Gbps ACTIVE | FAIL |
+| mlx5_5 | 1 | 4: ACTIVE | 100 Gb/sec (2X HDR) | >= 400Gbps ACTIVE | FAIL |
+| mlx5_6 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
+| mlx5_7 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
+
+| Test | Value | Threshold | Status |
+|------|-------|-----------|--------|
+| ib_write_bw | 49.3 GB/s | >= 47 GB/s | PASS |
+| ib_read_bw | 39.2 GB/s | >= 47 GB/s | FAIL |
+| ib_write_lat | 4.49 us | <= 2 us | FAIL |
+| ib_read_lat | 16.00 us | <= 3.5 us | FAIL |
+| ibping | target=0x58 count=5 | 0% packet loss | PASS |
+
+- **PFC/ECN/CNP/congestion counters checked:** 146
+- **PFC/ECN/CNP/congestion non-zero:** no
+- **Failure reasons:**
+  - mlx5_4 port 1 state/rate failed (4: ACTIVE, 100 Gb/sec (2X HDR); required >= 400.0Gbps ACTIVE)
+  - mlx5_5 port 1 state/rate failed (4: ACTIVE, 100 Gb/sec (2X HDR); required >= 400.0Gbps ACTIVE)
+  - ib_read_bw bandwidth 39.21GB/s < 47GB/s
+  - ib_write_lat latency 4.49us > 2.0us
+  - ib_read_lat latency 16.0us > 3.5us
+**Overall: FAIL**
+
+---
+*Generated by GPU Test Suite v0.2.0*
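Review note: the port check table above requires each listed port to be `ACTIVE` at 400 Gb/s or faster. Below is a small sketch of that check on the rate and state strings as they appear in the reports. `rate_gbps` and `port_ok` are hypothetical helpers. How the suite decides which devices to include (the table omits `mlx5_2`, `mlx5_3`, `mlx5_8`, `mlx5_9`) is not visible in this diff and is not modelled here.

```python
import re


def rate_gbps(rate: str) -> float:
    """Parse a string such as '400 Gb/sec (4X NDR)' into 400.0; 0.0 if unparseable."""
    m = re.match(r"\s*([\d.]+)\s*Gb/sec", rate)
    return float(m.group(1)) if m else 0.0


def port_ok(port: dict, min_gbps: float = 400.0) -> bool:
    """A port passes only if it is ACTIVE and meets the minimum link rate."""
    return port["state"].endswith("ACTIVE") and rate_gbps(port["rate"]) >= min_gbps


# Port entries copied from the report; the dict layout follows the JSON "ports" items.
ports = {
    "mlx5_0": {"port": "1", "rate": "400 Gb/sec (4X NDR)", "state": "4: ACTIVE"},
    "mlx5_4": {"port": "1", "rate": "100 Gb/sec (2X HDR)", "state": "4: ACTIVE"},
    "mlx5_3": {"port": "1", "rate": "25 Gb/sec (1X EDR)", "state": "1: DOWN"},
}
status = {name: ("PASS" if port_ok(p) else "FAIL") for name, p in ports.items()}
print(status)
```

The 100 Gb/s HDR ports fail on rate alone even though they are `ACTIVE`, which is the reason given for `mlx5_4` and `mlx5_5` in the failure list.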
--- a/reports_rdma_counter_aikubeworker0016_20260522_194828.md
+++ b/reports_rdma_counter_aikubeworker0016_20260522_194828.md
@ -0,0 +1,62 @@
+# GPU Test Report
+
+- **Date:** 2026-05-22T19:48:45.899570
+- **Host:** aikubeworker0016
+
+## Overall Acceptance Verdict
+
+**Result: FAIL**
+
+Failed or unverified items:
+- RDMA: FAIL
+
+Missing required evidence:
+- GPU Info
+- Health Check
+- Memory Bandwidth
+- Compute Throughput
+- NVLink/NVSwitch
+- NCCL
+- Stress Test
+- DCGM
+- Training
+
+## Summary
+
+| Test | Result |
+|------|--------|
+| RDMA | FAIL |
+
+## RDMA/InfiniBand
+
+### RDMA Port Checks
+
+| Device | Port | State | Rate | Required | Status |
+|--------|------|-------|------|----------|--------|
+| mlx5_0 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
+| mlx5_1 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
+| mlx5_4 | 1 | 4: ACTIVE | 100 Gb/sec (2X HDR) | >= 400Gbps ACTIVE | FAIL |
+| mlx5_5 | 1 | 4: ACTIVE | 100 Gb/sec (2X HDR) | >= 400Gbps ACTIVE | FAIL |
+| mlx5_6 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
+| mlx5_7 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
+
+| Test | Value | Threshold | Status |
+|------|-------|-----------|--------|
+| ib_write_bw | 48.1 GB/s | >= 47 GB/s | PASS |
+| ib_read_bw | 40.3 GB/s | >= 47 GB/s | FAIL |
+| ib_write_lat | 4.28 us | <= 2 us | FAIL |
+| ib_read_lat | 16.00 us | <= 3.5 us | FAIL |
+| ibping | target=0x4b count=5 | 0% packet loss | PASS |
+
+- **PFC/ECN/CNP/congestion counters checked:** 146
+- **PFC/ECN/CNP/congestion non-zero:** no
+- **Failure reasons:**
+  - mlx5_4 port 1 state/rate failed (4: ACTIVE, 100 Gb/sec (2X HDR); required >= 400.0Gbps ACTIVE)
+  - mlx5_5 port 1 state/rate failed (4: ACTIVE, 100 Gb/sec (2X HDR); required >= 400.0Gbps ACTIVE)
+  - ib_read_bw bandwidth 40.3GB/s < 47GB/s
+  - ib_write_lat latency 4.28us > 2.0us
+  - ib_read_lat latency 16.0us > 3.5us
+**Overall: FAIL**
+
+---
+*Generated by GPU Test Suite v0.2.0*
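The port table in the report above applies a single rule: a port passes only if it is `ACTIVE` and its reported rate is at least 400 Gb/s. The following is a minimal sketch of that rule, not the code in `modules/rdma_test.py`; the function name `port_meets_requirement` and its signature are hypothetical, and the input strings are assumed to have the `ibstat`-style shape shown in the table.

```python
import re


def port_meets_requirement(state: str, rate: str, min_gbps: float = 400.0) -> bool:
    """Return True if the port is ACTIVE and its rate is >= min_gbps.

    Hypothetical helper for illustration only.
    `state` looks like "4: ACTIVE"; `rate` looks like "400 Gb/sec (4X NDR)".
    """
    # Accept both "4: ACTIVE" and a bare "ACTIVE".
    state_name = state.split(":")[-1].strip().upper()
    # Take the leading number in front of "Gb/sec".
    match = re.match(r"\s*(\d+(?:\.\d+)?)\s*Gb/sec", rate)
    if match is None:
        return False
    return state_name == "ACTIVE" and float(match.group(1)) >= min_gbps


# Rows copied from the port table above.
ports = {
    "mlx5_0": ("4: ACTIVE", "400 Gb/sec (4X NDR)"),
    "mlx5_4": ("4: ACTIVE", "100 Gb/sec (2X HDR)"),
}
for name, (state, rate) in ports.items():
    verdict = "PASS" if port_meets_requirement(state, rate) else "FAIL"
    print(f"{name}: {verdict}")
```

Under this rule `mlx5_4` and `mlx5_5` fail on rate alone even though the links are up, which matches the failure reasons listed in the report.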
--- a/reports_rdma_cross_node_mlx5_0_20260523.md
+++ b/reports_rdma_cross_node_mlx5_0_20260523.md
@ -0,0 +1,50 @@
+# RDMA Cross-node Evidence Report
+
+- **Date:** 2026-05-23 Asia/Shanghai
+- **Scope:** `aikubeworker0012` <-> `aikubeworker0016`, single rail `mlx5_0`, port 1
+- **Client/server bootstrap IPs:** `172.72.8.12` and `172.72.8.16`
+- **Bandwidth message size:** 4MB
+- **Latency message size:** 8B
+- **Iterations:** 1000
+
+## Port Evidence
+
+| Host | Device | State | Rate | Link | LID |
+|---|---|---|---|---|---|
+| aikubeworker0012 | mlx5_0/1 | ACTIVE | 400 Gb/sec (4X NDR) | InfiniBand | 0x58 |
+| aikubeworker0016 | mlx5_0/1 | ACTIVE | 400 Gb/sec (4X NDR) | InfiniBand | 0x4b |
+
+## Cross-node Perftest Results
+
+| Direction | Test | Value | PDF Threshold | Status |
+|---|---|---:|---:|---|
+| 0016 -> 0012 | ib_write_bw | 49.35 GB/s | >= 47 GB/s | PASS |
+| 0016 -> 0012 | ib_read_bw | 44.36 GB/s | >= 47 GB/s | FAIL |
+| 0016 -> 0012 | ib_write_lat avg | 2.17 us | <= 2.0 us | FAIL |
+| 0016 -> 0012 | ib_read_lat avg | 4.05 us | <= 3.5 us | FAIL |
+| 0012 -> 0016 | ib_write_bw | 48.38 GB/s | >= 47 GB/s | PASS |
+| 0012 -> 0016 | ib_read_bw | 44.37 GB/s | >= 47 GB/s | FAIL |
+| 0012 -> 0016 | ib_write_lat avg | 2.13 us | <= 2.0 us | FAIL |
+| 0012 -> 0016 | ib_read_lat avg | 4.08 us | <= 3.5 us | FAIL |
+
+## Bidirectional ibping
+
+| Direction | Target LID | Result |
+|---|---|---|
+| 0016 -> 0012 | 0x58 | 5 transmitted, 5 received, 0% packet loss; avg 0.005 ms |
+| 0012 -> 0016 | 0x4b | 5 transmitted, 5 received, 0% packet loss; avg 0.005 ms |
+
+## Fabric Counters
+
+| Host | PFC/ECN/CNP/congestion Counters Checked | Non-zero Counters | Status |
+|---|---:|---:|---|
+| aikubeworker0012 | 146 | 0 | PASS |
+| aikubeworker0016 | 146 | 0 | PASS |
+
+## Verdict
+
+**RDMA cross-node verdict: FAIL**
+
+Reason: bidirectional connectivity is good, PFC/ECN/CNP/congestion counters are clean, and write bandwidth passes. However, read bandwidth is below 47 GB/s in both directions, write latency is slightly above 2.0 us in both directions, and read latency is above 3.5 us in both directions.
+
+Note: `modules/rdma_test.py` was corrected on 2026-05-23 to parse the `t_avg[usec]` column of `ib_write_lat` / `ib_read_lat` output rather than the 99.9th-percentile column. Older reports that show `read_lat` around 16 us therefore reflect the previous parser, not the current parser output.
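The parser note above comes down to selecting a column by name instead of by position from the end of the row. Below is a minimal sketch of that idea, assuming the usual perftest latency table layout; it is not the code in `modules/rdma_test.py`. The function name `parse_t_avg_usec` is hypothetical, and `SAMPLE` is an illustrative table with made-up values, not captured output from these hosts.

```python
from typing import Optional


def parse_t_avg_usec(perftest_output: str) -> Optional[float]:
    """Return the t_avg[usec] value from a perftest latency table, or None.

    Hypothetical helper. The header is split on whitespace; every column
    before t_avg[usec] is a single token, so the header index of
    t_avg[usec] is also its index in the data row.
    """
    lines = perftest_output.splitlines()
    for i, line in enumerate(lines):
        header = line.split()
        if "t_avg[usec]" not in header:
            continue
        col = header.index("t_avg[usec]")
        # The first following row that starts with a byte count is the data row.
        for row in lines[i + 1:]:
            fields = row.split()
            if len(fields) > col and fields[0].isdigit():
                return float(fields[col])
    return None


# Illustrative sample only (assumed layout, invented numbers).
SAMPLE = """\
 #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]    t_avg[usec]    t_stdev[usec]   99% percentile[usec]   99.9% percentile[usec]
 8       1000          1.98           17.20        2.10               2.17           0.40            3.10                   16.00
"""

print(parse_t_avg_usec(SAMPLE))
```

Taking the last field of the data row instead would return the 99.9th-percentile value, which is the kind of mix-up the note describes.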
--- a/reports_rdma_single_node_summary.md
+++ b/reports_rdma_single_node_summary.md
@ -0,0 +1,73 @@
+# Single-node RDMA/IB Report
+
+Generated: 2026-05-22 23:41 Asia/Shanghai
+
+Scope: project CLI `gpu_tester.py --test rdma --report --format json`, run separately on each host.
+
+Important note: the current repository RDMA test is single-node only. In `modules/rdma_test.py`, the perftest client connects to `localhost`, so this report validates local IB device discovery and local perftest behavior. It does not validate cross-node RDMA bandwidth between `aikubeworker0012` and `aikubeworker0016`.
+
+## Summary
+
+| Host | Devices Found | Active 400G Ports | Active 100G Ports | Down Ports | Overall |
+| --- | ---: | --- | --- | --- | --- |
+| aikubeworker0012 / 172.72.8.12 | 10 | mlx5_0, mlx5_1, mlx5_6, mlx5_7 | mlx5_4, mlx5_5 | mlx5_3, mlx5_9 | WARN |
+| aikubeworker0016 / 172.72.8.16 | 10 | mlx5_0, mlx5_1, mlx5_6, mlx5_7 | mlx5_4, mlx5_5 | mlx5_3, mlx5_9 | WARN |
+
+## Bandwidth
+
+The bandwidth numbers below come from the repo's `localhost` RDMA perftest path, so they do not measure link bandwidth between hosts.
+
+| Host | ib_write_bw | Threshold | Status | ib_read_bw | Threshold | Status |
+| --- | ---: | ---: | --- | ---: | ---: | --- |
+| aikubeworker0012 | 0.13 GB/s | 50 GB/s | WARN | 0.13 GB/s | 50 GB/s | WARN |
+| aikubeworker0016 | 0.13 GB/s | 50 GB/s | WARN | 0.13 GB/s | 50 GB/s | WARN |
+
+## Latency
+
+| Host | ib_write_lat | Limit | Status | ib_read_lat | Limit | Status |
+| --- | ---: | ---: | --- | ---: | ---: | --- |
+| aikubeworker0012 | 4.53 us | 10 us | PASS | 16.00 us | 10 us | WARN |
+| aikubeworker0016 | 4.22 us | 10 us | PASS | 16.00 us | 10 us | WARN |
+
+## Device Inventory
+
+### aikubeworker0012
+
+| Device | Port | State | Physical State | Rate |
+| --- | --- | --- | --- | --- |
+| mlx5_0 | 1 | ACTIVE | LinkUp | 400 Gb/sec (4X NDR) |
+| mlx5_1 | 1 | ACTIVE | LinkUp | 400 Gb/sec (4X NDR) |
+| mlx5_2 | 1 | ACTIVE | LinkUp | 25 Gb/sec (1X EDR) |
+| mlx5_3 | 1 | DOWN | Disabled | 25 Gb/sec (1X EDR) |
+| mlx5_4 | 1 | ACTIVE | LinkUp | 100 Gb/sec (2X HDR) |
+| mlx5_5 | 1 | ACTIVE | LinkUp | 100 Gb/sec (2X HDR) |
+| mlx5_6 | 1 | ACTIVE | LinkUp | 400 Gb/sec (4X NDR) |
+| mlx5_7 | 1 | ACTIVE | LinkUp | 400 Gb/sec (4X NDR) |
+| mlx5_8 | 1 | ACTIVE | LinkUp | 25 Gb/sec (1X EDR) |
+| mlx5_9 | 1 | DOWN | Disabled | 25 Gb/sec (1X EDR) |
+
+### aikubeworker0016
+
+| Device | Port | State | Physical State | Rate |
+| --- | --- | --- | --- | --- |
+| mlx5_0 | 1 | ACTIVE | LinkUp | 400 Gb/sec (4X NDR) |
+| mlx5_1 | 1 | ACTIVE | LinkUp | 400 Gb/sec (4X NDR) |
+| mlx5_2 | 1 | ACTIVE | LinkUp | 25 Gb/sec (1X EDR) |
+| mlx5_3 | 1 | DOWN | Disabled | 25 Gb/sec (1X EDR) |
+| mlx5_4 | 1 | ACTIVE | LinkUp | 100 Gb/sec (2X HDR) |
+| mlx5_5 | 1 | ACTIVE | LinkUp | 100 Gb/sec (2X HDR) |
+| mlx5_6 | 1 | ACTIVE | LinkUp | 400 Gb/sec (4X NDR) |
+| mlx5_7 | 1 | ACTIVE | LinkUp | 400 Gb/sec (4X NDR) |
+| mlx5_8 | 1 | ACTIVE | LinkUp | 25 Gb/sec (1X EDR) |
+| mlx5_9 | 1 | DOWN | Disabled | 25 Gb/sec (1X EDR) |
+
+## Files
+
+Raw JSON:
+
+- `reports_rdma_aikubeworker0012.json`
+- `reports_rdma_aikubeworker0016.json`
+
+Markdown summary:
+
+- `reports_rdma_single_node_summary.md`
--- a/reports_single_gpu_aikubeworker0012.json
+++ b/reports_single_gpu_aikubeworker0012.json
@ -0,0 +1,292 @@
+{
+  "timestamp": "2026-05-22T15:26:26.973586",
+  "gpu_info": {
+    "driver_version": "580.159.03",
+    "cuda_version": "13.0",
+    "gpu_count": 8,
+    "gpus": [
+      {
+        "index": 0,
+        "name": "NVIDIA H100 80GB HBM3",
+        "uuid": "GPU-7658c03c-7659-9886-041e-545c21d53e12",
+        "pci_bus_id": "00000000:18:00.0",
+        "pcie_link_gen": 5,
+        "pcie_link_width": 16,
+        "vram_total_mb": 81559,
+        "vram_used_mb": 4,
+        "vram_free_mb": 81076,
+        "power_draw": 69.72,
+        "power_limit": 700.0,
+        "clock_sm": 345,
+        "clock_mem": 2619,
+        "temperature": 25,
+        "fan_speed": 0,
+        "persistence_mode": false,
+        "compute_mode": "Default",
+        "serial_number": "1654923030411",
+        "ecc_errors_single": 0,
+        "ecc_errors_double": 0
+      },
+      {
+        "index": 1,
+        "name": "NVIDIA H100 80GB HBM3",
+        "uuid": "GPU-6392d40b-893b-9fc2-4284-a3f1d8c4d7f1",
+        "pci_bus_id": "00000000:2A:00.0",
+        "pcie_link_gen": 5,
+        "pcie_link_width": 16,
+        "vram_total_mb": 81559,
+        "vram_used_mb": 0,
+        "vram_free_mb": 81079,
+        "power_draw": 73.17,
+        "power_limit": 700.0,
+        "clock_sm": 345,
+        "clock_mem": 2619,
+        "temperature": 25,
+        "fan_speed": 0,
+        "persistence_mode": false,
+        "compute_mode": "Default",
+        "serial_number": "1654724063165",
+        "ecc_errors_single": 0,
+        "ecc_errors_double": 0
+      },
+      {
+        "index": 2,
+        "name": "NVIDIA H100 80GB HBM3",
+        "uuid": "GPU-2ae38735-10de-fb0b-fb20-9d1b5b434558",
+        "pci_bus_id": "00000000:3A:00.0",
+        "pcie_link_gen": 5,
+        "pcie_link_width": 16,
+        "vram_total_mb": 81559,
+        "vram_used_mb": 0,
+        "vram_free_mb": 81079,
+        "power_draw": 68.71,
+        "power_limit": 700.0,
+        "clock_sm": 345,
+        "clock_mem": 2619,
+        "temperature": 26,
+        "fan_speed": 0,
+        "persistence_mode": false,
+        "compute_mode": "Default",
+        "serial_number": "1654823036530",
+        "ecc_errors_single": 0,
+        "ecc_errors_double": 0
+      },
+      {
+        "index": 3,
+        "name": "NVIDIA H100 80GB HBM3",
+        "uuid": "GPU-ec62123f-0c48-6dbd-49e4-8b231b3fed0e",
+        "pci_bus_id": "00000000:5D:00.0",
+        "pcie_link_gen": 5,
+        "pcie_link_width": 16,
+        "vram_total_mb": 81559,
+        "vram_used_mb": 0,
+        "vram_free_mb": 81079,
+        "power_draw": 69.73,
+        "power_limit": 700.0,
+        "clock_sm": 345,
+        "clock_mem": 2619,
+        "temperature": 25,
+        "fan_speed": 0,
+        "persistence_mode": false,
+        "compute_mode": "Default",
+        "serial_number": "1654923021638",
+        "ecc_errors_single": 0,
+        "ecc_errors_double": 0
+      },
+      {
+        "index": 4,
+        "name": "NVIDIA H100 80GB HBM3",
+        "uuid": "GPU-b64fc270-109e-1543-fb0c-be7feecf14f1",
+        "pci_bus_id": "00000000:9A:00.0",
+        "pcie_link_gen": 5,
+        "pcie_link_width": 16,
+        "vram_total_mb": 81559,
+        "vram_used_mb": 0,
+        "vram_free_mb": 81079,
+        "power_draw": 68.84,
+        "power_limit": 700.0,
+        "clock_sm": 345,
+        "clock_mem": 2619,
+        "temperature": 24,
+        "fan_speed": 0,
+        "persistence_mode": false,
+        "compute_mode": "Default",
+        "serial_number": "1655023033179",
+        "ecc_errors_single": 0,
+        "ecc_errors_double": 0
+      },
+      {
+        "index": 5,
+        "name": "NVIDIA H100 80GB HBM3",
+        "uuid": "GPU-15ab7baf-9010-7cf3-5462-eeb09f8dbe65",
+        "pci_bus_id": "00000000:AB:00.0",
+        "pcie_link_gen": 5,
+        "pcie_link_width": 16,
+        "vram_total_mb": 81559,
+        "vram_used_mb": 0,
+        "vram_free_mb": 81079,
+        "power_draw": 69.94,
+        "power_limit": 700.0,
+        "clock_sm": 345,
+        "clock_mem": 2619,
+        "temperature": 27,
+        "fan_speed": 0,
+        "persistence_mode": false,
+        "compute_mode": "Default",
+        "serial_number": "1655023034225",
+        "ecc_errors_single": 0,
+        "ecc_errors_double": 0
+      },
+      {
+        "index": 6,
+        "name": "NVIDIA H100 80GB HBM3",
+        "uuid": "GPU-225f6f3c-6fef-d1e2-5428-d90f665fb3d3",
+        "pci_bus_id": "00000000:BA:00.0",
+        "pcie_link_gen": 5,
+        "pcie_link_width": 16,
+        "vram_total_mb": 81559,
+        "vram_used_mb": 0,
+        "vram_free_mb": 81079,
+        "power_draw": 70.46,
+        "power_limit": 700.0,
+        "clock_sm": 345,
+        "clock_mem": 2619,
+        "temperature": 25,
+        "fan_speed": 0,
+        "persistence_mode": false,
+        "compute_mode": "Default",
+        "serial_number": "1654923078278",
+        "ecc_errors_single": 0,
+        "ecc_errors_double": 0
+      },
+      {
+        "index": 7,
+        "name": "NVIDIA H100 80GB HBM3",
+        "uuid": "GPU-79aeb6a8-c00c-6edb-956f-779ef56950a3",
+        "pci_bus_id": "00000000:DB:00.0",
+        "pcie_link_gen": 5,
+        "pcie_link_width": 16,
+        "vram_total_mb": 81559,
+        "vram_used_mb": 0,
+        "vram_free_mb": 81079,
+        "power_draw": 71.76,
+        "power_limit": 700.0,
+        "clock_sm": 345,
+        "clock_mem": 2619,
+        "temperature": 24,
+        "fan_speed": 0,
+        "persistence_mode": false,
+        "compute_mode": "Default",
+        "serial_number": "1654024031464",
+        "ecc_errors_single": 0,
+        "ecc_errors_double": 0
+      }
+    ],
+    "topology": "\t\u001b[4mGPU0\tGPU1\tGPU2\tGPU3\tGPU4\tGPU5\tGPU6\tGPU7\tNIC0\tNIC1\tNIC2\tNIC3\tNIC4\tNIC5\tNIC6\tNIC7\tNIC8\tNIC9\tCPU Affinity\tNUMA Affinity\tGPU NUMA ID\u001b[0m\nGPU0\t X \tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\tPIX\tNODE\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t0-55,112-167\t0\t\tN/A\nGPU1\tNV18\t X \tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\tNODE\tPIX\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t0-55,112-167\t0\t\tN/A\nGPU2\tNV18\tNV18\t X \tNV18\tNV18\tNV18\tNV18\tNV18\tNODE\tNODE\tPIX\tPIX\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t0-55,112-167\t0\t\tN/A\nGPU3\tNV18\tNV18\tNV18\t X \tNV18\tNV18\tNV18\tNV18\tNODE\tNODE\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t0-55,112-167\t0\t\tN/A\nGPU4\tNV18\tNV18\tNV18\tNV18\t X \tNV18\tNV18\tNV18\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tPIX\tNODE\tNODE\tNODE\t56-111,168-223\t1\t\tN/A\nGPU5\tNV18\tNV18\tNV18\tNV18\tNV18\t X \tNV18\tNV18\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\tPIX\tNODE\tNODE\t56-111,168-223\t1\t\tN/A\nGPU6\tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\t X \tNV18\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tPIX\tPIX\t56-111,168-223\t1\t\tN/A\nGPU7\tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\t X \tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tNODE\tNODE\t56-111,168-223\t1\t\tN/A\nNIC0\tPIX\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t X \tNODE\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC1\tNODE\tPIX\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\tNODE\t X \tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC2\tNODE\tNODE\tPIX\tNODE\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\t X \tPIX\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC3\tNODE\tNODE\tPIX\tNODE\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tPIX\t X \tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC4\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tNODE\tNODE\t X \tPIX\tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC5\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tNODE\tNODE\tPIX\t X \tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC6\tSYS\tSYS\tSYS\tSYS\tPIX\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\t X \tNODE\tNODE\tNODE\t\t\t\t\nNIC7\tSYS\tSYS\tSYS\tSYS\tNODE\tPIX\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\t X \tNODE\tNODE\t\t\t\t\nNIC8\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tPIX\tNODE\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\t X \tPIX\t\t\t\t\nNIC9\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tPIX\tNODE\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tPIX\t X \t\t\t\t\n\nLegend:\n\n  X    = Self\n  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)\n  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node\n  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)\n  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)\n  PIX  = Connection traversing at most a single PCIe bridge\n  NV#  = Connection traversing a bonded set of # NVLinks\n\nNIC Legend:\n\n  NIC0: mlx5_0\n  NIC1: mlx5_1\n  NIC2: mlx5_2\n  NIC3: mlx5_3\n  NIC4: mlx5_4\n  NIC5: mlx5_5\n  NIC6: mlx5_6\n  NIC7: mlx5_7\n  NIC8: mlx5_8\n  NIC9: mlx5_9\n\n",
+    "timestamp": "2026-05-22T15:26:34.187409",
+    "detected_gpu_type": "h100",
+    "gpu_label": "H100 SXM5"
+  },
+  "memory_bench": {
+    "memory": {
+      "source": "pytorch",
+      "h2d_bandwidth_gbps": 11.8,
+      "d2h_bandwidth_gbps": 9.9,
+      "d2d_bandwidth_gbps": 829.1,
+      "peak_bandwidth_gbps": 3400,
+      "efficiency_pct": 24.4,
+      "test_sizes_mb": [
+        1,
+        4,
+        16,
+        64,
+        256,
+        1024,
+        4096
+      ],
+      "bandwidth_by_size": {
+        "1": {
+          "h2d_gbps": 3.8,
+          "d2h_gbps": 1.4,
+          "d2d_gbps": 40.6
+        },
+        "4": {
+          "h2d_gbps": 7.6,
+          "d2h_gbps": 9.9,
+          "d2d_gbps": 141.5
+        },
+        "16": {
+          "h2d_gbps": 11.0,
+          "d2h_gbps": 1.9,
+          "d2d_gbps": 450.3
+        },
+        "64": {
+          "h2d_gbps": 11.8,
+          "d2h_gbps": 1.4,
+          "d2d_gbps": 726.5
+        },
+        "256": {
+          "h2d_gbps": 9.0,
+          "d2h_gbps": 1.4,
+          "d2d_gbps": 793.8
+        },
+        "1024": {
+          "h2d_gbps": 5.5,
+          "d2h_gbps": 1.4,
+          "d2d_gbps": 821.2
+        },
+        "4096": {
+          "h2d_gbps": 5.9,
+          "d2h_gbps": 1.4,
+          "d2d_gbps": 829.1
+        }
+      },
+      "per_gpu": []
+    }
+  },
+  "compute_bench": {
+    "compute": {
+      "per_dtype_tflops": {
+        "fp32": 52.0,
+        "tf32": 362.3,
+        "fp16": 691.0,
+        "bf16": 713.0,
+        "fp8": 1148.8
+      },
+      "peak_tflops": {
+        "fp32": 67,
+        "tf32": 495,
+        "fp16": 990,
+        "bf16": 990,
+        "fp8": 1979
+      },
+      "efficiency_pct": {
+        "fp32": 77.6,
+        "tf32": 73.2,
+        "fp16": 69.8,
+        "bf16": 72.0,
+        "fp8": 58.0
+      },
+      "pass_thresholds_tflops": {
+        "fp32": 54,
+        "tf32": 444,
+        "fp16": 734,
+        "bf16": 745,
+        "fp8": 1400
+      },
+      "per_gpu": [
+        {
+          "index": 0,
+          "fp32": 52.0,
+          "tf32": 362.3,
+          "fp16": 691.0,
+          "bf16": 713.0,
+          "fp8": 1148.8
+        }
+      ],
+      "matrix_size": 8192,
+      "warmup": 50,
+      "iterations": 500
+    }
+  }
+}
--- a/reports_single_gpu_aikubeworker0012.md
+++ b/reports_single_gpu_aikubeworker0012.md
@ -0,0 +1,54 @@
+# GPU Test Report
+
+- **Date:** 2026-05-22 15:27:51
+- **Host:** aikubeworker0012
+- **GPU:** NVIDIA H100 80GB HBM3 x8
+- **Driver:** 580.159.03 | **CUDA:** 13.0
+
+## Summary
+
+| Test | Result |
+|------|--------|
+| GPU Info | PASS (8 GPUs detected) |
+| Memory Bandwidth | WARN (829 GB/s via PyTorch fallback) |
+| Compute Throughput | FAIL (worst TF32 362 vs >= 444) |
+
+## GPU Information
+
+| GPU | Model | VRAM | Temp | Power | SM Clock |
+|-----|-------|------|------|-------|----------|
+| 0 | NVIDIA H100 80GB HBM3 | 81559 MB | 25C | 70/700W | 345 MHz |
+| 1 | NVIDIA H100 80GB HBM3 | 81559 MB | 25C | 73/700W | 345 MHz |
+| 2 | NVIDIA H100 80GB HBM3 | 81559 MB | 26C | 69/700W | 345 MHz |
+| 3 | NVIDIA H100 80GB HBM3 | 81559 MB | 25C | 70/700W | 345 MHz |
+| 4 | NVIDIA H100 80GB HBM3 | 81559 MB | 24C | 69/700W | 345 MHz |
+| 5 | NVIDIA H100 80GB HBM3 | 81559 MB | 27C | 70/700W | 345 MHz |
+| 6 | NVIDIA H100 80GB HBM3 | 81559 MB | 25C | 70/700W | 345 MHz |
+| 7 | NVIDIA H100 80GB HBM3 | 81559 MB | 24C | 72/700W | 345 MHz |
+
+## Memory Bandwidth
+
+Source: pytorch
+
+| Metric | Value | Peak | Efficiency |
+|--------|-------|------|------------|
+| H2D (PCIe) | 11.8 GB/s | N/A | N/A |
+| D2H (PCIe) | 9.9 GB/s | N/A | N/A |
+| D2D (HBM) | 829.1 GB/s | 3400 GB/s | 24.4% |
+
+**Verdict: WARN** (D2D 829.1 GB/s via PyTorch fallback; nvbandwidth unavailable — figure is indicative only, not a true HBM peak)
+
+## Compute Throughput
+
+| DType | Achieved (TFLOPS) | Peak | Threshold | Status |
+|-------|-------------------|------|------------|--------|
+| FP32 | 52.0 | 67 | >= 54 | WARN |
+| TF32 | 362.3 | 495 | >= 444 | FAIL |
+| FP16 | 691.0 | 990 | >= 734 | WARN |
+| BF16 | 713.0 | 990 | >= 745 | WARN |
+| FP8 | 1148.8 | 1979 | >= 1400 | FAIL |
+
+**Verdict: FAIL** (every dtype is below its absolute TFLOPS threshold; worst efficiency vs peak is FP8 at 58.0%)
+
+---
+*Generated by GPU Test Suite v0.2.0*
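The efficiency figures in the report above are achieved throughput divided by the listed peak. The sketch below recomputes them from the numbers in `reports_single_gpu_aikubeworker0012.json` and compares each dtype with its absolute threshold. It is a worked check, not the suite's code: the helper name `efficiency_pct` is hypothetical, and the sketch does not attempt to reproduce how the suite splits below-threshold results into WARN and FAIL, since the report does not state that rule.

```python
def efficiency_pct(achieved: float, peak: float) -> float:
    """Achieved value as a percentage of peak, rounded to one decimal place."""
    return round(100.0 * achieved / peak, 1)


# Values copied from the compute_bench section of the aikubeworker0012 JSON report.
achieved = {"fp32": 52.0, "tf32": 362.3, "fp16": 691.0, "bf16": 713.0, "fp8": 1148.8}
peak = {"fp32": 67, "tf32": 495, "fp16": 990, "bf16": 990, "fp8": 1979}
threshold = {"fp32": 54, "tf32": 444, "fp16": 734, "bf16": 745, "fp8": 1400}

efficiency = {d: efficiency_pct(achieved[d], peak[d]) for d in achieved}
below_threshold = [d for d in achieved if achieved[d] < threshold[d]]

for dtype in achieved:
    print(f"{dtype}: {achieved[dtype]} TFLOPS, {efficiency[dtype]}% of peak, "
          f"threshold >= {threshold[dtype]}")
print("below threshold:", below_threshold)
```

The same helper applied to the memory figures (829.1 GB/s against a 3400 GB/s peak) gives the 24.4% shown in the Memory Bandwidth table.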
--- a/reports_single_gpu_aikubeworker0016.json
+++ b/reports_single_gpu_aikubeworker0016.json
@ -0,0 +1,292 @@
+{
+  "timestamp": "2026-05-22T15:26:29.511252",
+  "gpu_info": {
+    "driver_version": "580.159.03",
+    "cuda_version": "13.0",
+    "gpu_count": 8,
+    "gpus": [
+      {
+        "index": 0,
+        "name": "NVIDIA H100 80GB HBM3",
+        "uuid": "GPU-dfbc9513-255d-4fe7-2b77-7b1ec3972e75",
+        "pci_bus_id": "00000000:18:00.0",
+        "pcie_link_gen": 5,
+        "pcie_link_width": 16,
+        "vram_total_mb": 81559,
+        "vram_used_mb": 4,
+        "vram_free_mb": 81076,
+        "power_draw": 69.81,
+        "power_limit": 700.0,
+        "clock_sm": 345,
+        "clock_mem": 2619,
+        "temperature": 20,
+        "fan_speed": 0,
+        "persistence_mode": false,
+        "compute_mode": "Default",
+        "serial_number": "1651924016120",
+        "ecc_errors_single": 0,
+        "ecc_errors_double": 0
+      },
+      {
+        "index": 1,
+        "name": "NVIDIA H100 80GB HBM3",
+        "uuid": "GPU-bb845ef7-d7b5-f011-9395-ea74274e2282",
+        "pci_bus_id": "00000000:2A:00.0",
+        "pcie_link_gen": 5,
+        "pcie_link_width": 16,
+        "vram_total_mb": 81559,
+        "vram_used_mb": 0,
+        "vram_free_mb": 81079,
+        "power_draw": 67.45,
+        "power_limit": 700.0,
+        "clock_sm": 345,
+        "clock_mem": 2619,
+        "temperature": 20,
+        "fan_speed": 0,
+        "persistence_mode": false,
+        "compute_mode": "Default",
+        "serial_number": "1651924015483",
+        "ecc_errors_single": 0,
+        "ecc_errors_double": 0
+      },
+      {
+        "index": 2,
+        "name": "NVIDIA H100 80GB HBM3",
+        "uuid": "GPU-3720cf13-2a34-be38-27be-0a7adc4addc4",
+        "pci_bus_id": "00000000:3A:00.0",
+        "pcie_link_gen": 5,
+        "pcie_link_width": 16,
+        "vram_total_mb": 81559,
+        "vram_used_mb": 0,
+        "vram_free_mb": 81079,
+        "power_draw": 66.69,
+        "power_limit": 700.0,
+        "clock_sm": 345,
+        "clock_mem": 2619,
+        "temperature": 21,
+        "fan_speed": 0,
+        "persistence_mode": false,
+        "compute_mode": "Default",
+        "serial_number": "1651924025595",
+        "ecc_errors_single": 0,
+        "ecc_errors_double": 0
+      },
+      {
+        "index": 3,
+        "name": "NVIDIA H100 80GB HBM3",
+        "uuid": "GPU-87080b2d-ac43-be0d-d574-c193078850ae",
+        "pci_bus_id": "00000000:5D:00.0",
+        "pcie_link_gen": 5,
+        "pcie_link_width": 16,
+        "vram_total_mb": 81559,
+        "vram_used_mb": 0,
+        "vram_free_mb": 81079,
+        "power_draw": 66.86,
+        "power_limit": 700.0,
+        "clock_sm": 345,
+        "clock_mem": 2619,
+        "temperature": 20,
+        "fan_speed": 0,
+        "persistence_mode": false,
+        "compute_mode": "Default",
+        "serial_number": "1651924016862",
+        "ecc_errors_single": 0,
+        "ecc_errors_double": 0
+      },
+      {
+        "index": 4,
+        "name": "NVIDIA H100 80GB HBM3",
+        "uuid": "GPU-599bd883-cc5c-a5dd-6c33-c15f7049da48",
+        "pci_bus_id": "00000000:9A:00.0",
+        "pcie_link_gen": 5,
+        "pcie_link_width": 16,
+        "vram_total_mb": 81559,
+        "vram_used_mb": 0,
+        "vram_free_mb": 81079,
+        "power_draw": 67.07,
+        "power_limit": 700.0,
+        "clock_sm": 345,
+        "clock_mem": 2619,
+        "temperature": 20,
+        "fan_speed": 0,
+        "persistence_mode": false,
+        "compute_mode": "Default",
+        "serial_number": "1651924025670",
+        "ecc_errors_single": 0,
+        "ecc_errors_double": 0
+      },
+      {
+        "index": 5,
+        "name": "NVIDIA H100 80GB HBM3",
+        "uuid": "GPU-a1c6bba4-61b0-e623-06c9-9c88635e26fe",
+        "pci_bus_id": "00000000:AB:00.0",
+        "pcie_link_gen": 5,
+        "pcie_link_width": 16,
+        "vram_total_mb": 81559,
+        "vram_used_mb": 0,
+        "vram_free_mb": 81079,
+        "power_draw": 69.12,
+        "power_limit": 700.0,
+        "clock_sm": 345,
+        "clock_mem": 2619,
+        "temperature": 22,
+        "fan_speed": 0,
+        "persistence_mode": false,
+        "compute_mode": "Default",
+        "serial_number": "1651924027166",
+        "ecc_errors_single": 0,
+        "ecc_errors_double": 0
+      },
+      {
+        "index": 6,
+        "name": "NVIDIA H100 80GB HBM3",
+        "uuid": "GPU-98745a0c-39bd-3e56-d6ca-54ba3647ab6d",
+        "pci_bus_id": "00000000:BA:00.0",
+        "pcie_link_gen": 5,
+        "pcie_link_width": 16,
+        "vram_total_mb": 81559,
+        "vram_used_mb": 0,
+        "vram_free_mb": 81079,
+        "power_draw": 67.61,
+        "power_limit": 700.0,
+        "clock_sm": 345,
+        "clock_mem": 2619,
+        "temperature": 20,
+        "fan_speed": 0,
+        "persistence_mode": false,
+        "compute_mode": "Default",
+        "serial_number": "1651924026234",
+        "ecc_errors_single": 0,
+        "ecc_errors_double": 0
+      },
+      {
+        "index": 7,
+        "name": "NVIDIA H100 80GB HBM3",
+        "uuid": "GPU-8c73bd8b-666b-357e-ac5d-c75ac7a759db",
+        "pci_bus_id": "00000000:DB:00.0",
+        "pcie_link_gen": 5,
+        "pcie_link_width": 16,
+        "vram_total_mb": 81559,
+        "vram_used_mb": 0,
+        "vram_free_mb": 81079,
+        "power_draw": 66.19,
+        "power_limit": 700.0,
+        "clock_sm": 345,
+        "clock_mem": 2619,
+        "temperature": 20,
+        "fan_speed": 0,
+        "persistence_mode": false,
+        "compute_mode": "Default",
+        "serial_number": "1651924027255",
+        "ecc_errors_single": 0,
+        "ecc_errors_double": 0
+      }
+    ],
+    "topology": "\t\u001b[4mGPU0\tGPU1\tGPU2\tGPU3\tGPU4\tGPU5\tGPU6\tGPU7\tNIC0\tNIC1\tNIC2\tNIC3\tNIC4\tNIC5\tNIC6\tNIC7\tNIC8\tNIC9\tCPU Affinity\tNUMA Affinity\tGPU NUMA ID\u001b[0m\nGPU0\t X \tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\tPIX\tNODE\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t0-55,112-167\t0\t\tN/A\nGPU1\tNV18\t X \tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\tNODE\tPIX\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t0-55,112-167\t0\t\tN/A\nGPU2\tNV18\tNV18\t X \tNV18\tNV18\tNV18\tNV18\tNV18\tNODE\tNODE\tPIX\tPIX\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t0-55,112-167\t0\t\tN/A\nGPU3\tNV18\tNV18\tNV18\t X \tNV18\tNV18\tNV18\tNV18\tNODE\tNODE\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t0-55,112-167\t0\t\tN/A\nGPU4\tNV18\tNV18\tNV18\tNV18\t X \tNV18\tNV18\tNV18\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tPIX\tNODE\tNODE\tNODE\t56-111,168-223\t1\t\tN/A\nGPU5\tNV18\tNV18\tNV18\tNV18\tNV18\t X \tNV18\tNV18\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\tPIX\tNODE\tNODE\t56-111,168-223\t1\t\tN/A\nGPU6\tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\t X \tNV18\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tPIX\tPIX\t56-111,168-223\t1\t\tN/A\nGPU7\tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\t X \tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tNODE\tNODE\t56-111,168-223\t1\t\tN/A\nNIC0\tPIX\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t X \tNODE\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC1\tNODE\tPIX\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\tNODE\t X \tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC2\tNODE\tNODE\tPIX\tNODE\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\t X \tPIX\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC3\tNODE\tNODE\tPIX\tNODE\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tPIX\t X \tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC4\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tNODE\tNODE\t X \tPIX\tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC5\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tNODE\tNODE\tPIX\t X \tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC6\tSYS\tSYS\tSYS\tSYS\tPIX\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\t X \tNODE\tNODE\tNODE\t\t\t\t\nNIC7\tSYS\tSYS\tSYS\tSYS\tNODE\tPIX\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\t X \tNODE\tNODE\t\t\t\t\nNIC8\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tPIX\tNODE\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\t X \tPIX\t\t\t\t\nNIC9\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tPIX\tNODE\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tPIX\t X \t\t\t\t\n\nLegend:\n\n  X    = Self\n  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)\n  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node\n  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)\n  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)\n  PIX  = Connection traversing at most a single PCIe bridge\n  NV#  = Connection traversing a bonded set of # NVLinks\n\nNIC Legend:\n\n  NIC0: mlx5_0\n  NIC1: mlx5_1\n  NIC2: mlx5_2\n  NIC3: mlx5_3\n  NIC4: mlx5_4\n  NIC5: mlx5_5\n  NIC6: mlx5_6\n  NIC7: mlx5_7\n  NIC8: mlx5_8\n  NIC9: mlx5_9\n\n",
+    "timestamp": "2026-05-22T15:26:36.627805",
+    "detected_gpu_type": "h100",
+    "gpu_label": "H100 SXM5"
+  },
+  "memory_bench": {
+    "memory": {
+      "source": "pytorch",
+      "h2d_bandwidth_gbps": 11.8,
+      "d2h_bandwidth_gbps": 10.1,
+      "d2d_bandwidth_gbps": 829.0,
+      "peak_bandwidth_gbps": 3400,
+      "efficiency_pct": 24.4,
+      "test_sizes_mb": [
+        1,
+        4,
+        16,
+        64,
+        256,
+        1024,
+        4096
+      ],
+      "bandwidth_by_size": {
+        "1": {
+          "h2d_gbps": 3.6,
+          "d2h_gbps": 1.4,
+          "d2d_gbps": 40.3
+        },
+        "4": {
+          "h2d_gbps": 7.7,
+          "d2h_gbps": 10.1,
+          "d2d_gbps": 159.5
+        },
+        "16": {
+          "h2d_gbps": 10.9,
+          "d2h_gbps": 1.9,
+          "d2d_gbps": 439.5
+        },
+        "64": {
+          "h2d_gbps": 11.8,
+          "d2h_gbps": 1.4,
+          "d2d_gbps": 740.5
+        },
+        "256": {
+          "h2d_gbps": 9.0,
+          "d2h_gbps": 1.4,
+          "d2d_gbps": 792.1
+        },
+        "1024": {
+          "h2d_gbps": 8.4,
+          "d2h_gbps": 1.4,
+          "d2d_gbps": 818.9
+        },
+        "4096": {
+          "h2d_gbps": 6.1,
+          "d2h_gbps": 1.4,
+          "d2d_gbps": 829.0
+        }
+      },
+      "per_gpu": []
+    }
+  },
+  "compute_bench": {
+    "compute": {
+      "per_dtype_tflops": {
+        "fp32": 51.9,
+        "tf32": 357.8,
+        "fp16": 667.2,
+        "bf16": 699.1,
+        "fp8": 1146.2
+      },
+      "peak_tflops": {
+        "fp32": 67,
+        "tf32": 495,
+        "fp16": 990,
+        "bf16": 990,
+        "fp8": 1979
+      },
+      "efficiency_pct": {
+        "fp32": 77.5,
+        "tf32": 72.3,
+        "fp16": 67.4,
+        "bf16": 70.6,
+        "fp8": 57.9
+      },
+      "pass_thresholds_tflops": {
+        "fp32": 54,
+        "tf32": 444,
+        "fp16": 734,
+        "bf16": 745,
+        "fp8": 1400
+      },
+      "per_gpu": [
+        {
+          "index": 0,
+          "fp32": 51.9,
+          "tf32": 357.8,
+          "fp16": 667.2,
+          "bf16": 699.1,
+          "fp8": 1146.2
+        }
+      ],
+      "matrix_size": 8192,
+      "warmup": 50,
+      "iterations": 500
+    }
+  }
+}
--- a/reports_single_gpu_aikubeworker0016.md
+++ b/reports_single_gpu_aikubeworker0016.md
@ -0,0 +1,54 @@
+# GPU Test Report
+
+- **Date:** 2026-05-22 15:27:53
+- **Host:** aikubeworker0016
+- **GPU:** NVIDIA H100 80GB HBM3 x8
+- **Driver:** 580.159.03 | **CUDA:** 13.0
+
+## Summary
+
+| Test | Result |
+|------|--------|
+| GPU Info | PASS (8 GPUs detected) |
+| Memory Bandwidth | WARN (829 GB/s via PyTorch fallback) |
+| Compute Throughput | FAIL (worst TF32 358 vs >= 444) |
+
+## GPU Information
+
+| GPU | Model | VRAM | Temp | Power | SM Clock |
+|-----|-------|------|------|-------|----------|
+| 0 | NVIDIA H100 80GB HBM3 | 81559 MB | 20C | 70/700W | 345 MHz |
+| 1 | NVIDIA H100 80GB HBM3 | 81559 MB | 20C | 67/700W | 345 MHz |
+| 2 | NVIDIA H100 80GB HBM3 | 81559 MB | 21C | 67/700W | 345 MHz |
+| 3 | NVIDIA H100 80GB HBM3 | 81559 MB | 20C | 67/700W | 345 MHz |
+| 4 | NVIDIA H100 80GB HBM3 | 81559 MB | 20C | 67/700W | 345 MHz |
+| 5 | NVIDIA H100 80GB HBM3 | 81559 MB | 22C | 69/700W | 345 MHz |
+| 6 | NVIDIA H100 80GB HBM3 | 81559 MB | 20C | 68/700W | 345 MHz |
+| 7 | NVIDIA H100 80GB HBM3 | 81559 MB | 20C | 66/700W | 345 MHz |
+
+## Memory Bandwidth
+
+Source: pytorch
+
+| Metric | Value | Peak | Efficiency |
+|--------|-------|------|------------|
+| H2D (PCIe) | 11.8 GB/s | 0 GB/s | 0.0% |
+| D2H (PCIe) | 10.1 GB/s | 0 GB/s | 0.0% |
+| D2D (NVLink) | 829.0 GB/s | 3400 GB/s | 24.4% |
+
+**Verdict: WARN** (D2D 829.0 GB/s via PyTorch fallback; nvbandwidth unavailable — figure is indicative only, not a true HBM peak)
+
+## Compute Throughput
+
+| DType | Achieved (TFLOPS) | Peak | Threshold | Status |
+|-------|-------------------|------|------------|--------|
+| FP32 | 51.9 | 67 | >= 54 | WARN |
+| TF32 | 357.8 | 495 | >= 444 | FAIL |
+| FP16 | 667.2 | 990 | >= 734 | WARN |
+| BF16 | 699.1 | 990 | >= 745 | WARN |
+| FP8 | 1146.2 | 1979 | >= 1400 | FAIL |
+
+**Verdict: FAIL** (absolute TFLOPS thresholds; worst efficiency 57.9%)
+
+---
+*Generated by GPU Test Suite v0.2.0*
--- a/reports_stress_smoke_reasons_aikubeworker0012.json
+++ b/reports_stress_smoke_reasons_aikubeworker0012.json
@ -0,0 +1,165 @@
+{
+  "stress": {
+    "source": "pytorch",
+    "passed": false,
+    "duration_sec": 45,
+    "elapsed_sec": 45.4,
+    "gpu_status": {
+      "0": "PASS",
+      "1": "PASS",
+      "2": "PASS",
+      "3": "PASS",
+      "4": "PASS",
+      "5": "PASS",
+      "6": "PASS",
+      "7": "PASS"
+    },
+    "telemetry": {
+      "passed": false,
+      "samples": 39,
+      "steady_samples": 31,
+      "warmup_sec": 9.0,
+      "max_temp_c": {
+        "0": 59.0,
+        "1": 58.0,
+        "2": 65.0,
+        "3": 54.0,
+        "4": 59.0,
+        "5": 66.0,
+        "6": 62.0,
+        "7": 55.0
+      },
+      "avg_power_w": {
+        "0": 697.0,
+        "1": 697.4,
+        "2": 697.9,
+        "3": 698.0,
+        "4": 697.8,
+        "5": 697.6,
+        "6": 697.9,
+        "7": 698.2
+      },
+      "temp_delta_c": 12.0,
+      "throttle_events": [
+        {
+          "gpu": 0,
+          "throttle": "0x0000000000000004",
+          "real_throttle": "0x4"
+        },
+        {
+          "gpu": 1,
+          "throttle": "0x0000000000000004",
+          "real_throttle": "0x4"
+        },
+        {
+          "gpu": 2,
+          "throttle": "0x0000000000000004",
+          "real_throttle": "0x4"
+        },
+        {
+          "gpu": 3,
+          "throttle": "0x0000000000000004",
+          "real_throttle": "0x4"
+        },
+        {
+          "gpu": 4,
+          "throttle": "0x0000000000000004",
+          "real_throttle": "0x4"
+        },
+        {
+          "gpu": 5,
+          "throttle": "0x0000000000000004",
+          "real_throttle": "0x4"
+        },
+        {
+          "gpu": 6,
+          "throttle": "0x0000000000000004",
+          "real_throttle": "0x4"
+        },
+        {
+          "gpu": 7,
+          "throttle": "0x0000000000000004",
+          "real_throttle": "0x4"
+        },
+        {
+          "gpu": 0,
+          "throttle": "0x0000000000000004",
+          "real_throttle": "0x4"
+        },
+        {
+          "gpu": 1,
+          "throttle": "0x0000000000000004",
+          "real_throttle": "0x4"
+        },
+        {
+          "gpu": 2,
+          "throttle": "0x0000000000000004",
+          "real_throttle": "0x4"
+        },
+        {
+          "gpu": 3,
+          "throttle": "0x0000000000000004",
+          "real_throttle": "0x4"
+        },
+        {
+          "gpu": 4,
+          "throttle": "0x0000000000000004",
+          "real_throttle": "0x4"
+        },
+        {
+          "gpu": 5,
+          "throttle": "0x0000000000000004",
+          "real_throttle": "0x4"
+        },
+        {
+          "gpu": 6,
+          "throttle": "0x0000000000000004",
+          "real_throttle": "0x4"
+        },
+        {
+          "gpu": 7,
+          "throttle": "0x0000000000000004",
+          "real_throttle": "0x4"
+        },
+        {
+          "gpu": 0,
+          "throttle": "0x0000000000000004",
+          "real_throttle": "0x4"
+        },
+        {
+          "gpu": 1,
+          "throttle": "0x0000000000000004",
+          "real_throttle": "0x4"
+        },
+        {
+          "gpu": 2,
+          "throttle": "0x0000000000000004",
+          "real_throttle": "0x4"
+        },
+        {
+          "gpu": 3,
+          "throttle": "0x0000000000000004",
+          "real_throttle": "0x4"
+        }
+      ],
+      "throttle_event_count": 248,
+      "xid_events": [],
+      "tflops_jitter_pct": 4.07,
+      "steady_tflops_samples": 781,
+      "failures": [
+        "GPU temperature delta 12.0C exceeds 5.0C",
+        "non-idle throttle reasons observed in 248 samples (first: GPU 0 0x4)"
+      ],
+      "thresholds": {
+        "max_temp_c": 80.0,
+        "max_temp_delta_c": 5.0,
+        "min_power_w": 630.0,
+        "max_tflops_jitter_pct": 5.0,
+        "warmup_sec": 10.0,
+        "min_steady_samples": 10
+      }
+    },
+    "timestamp": "2026-05-22T17:52:09.074859"
+  },
+  "timestamp": "2026-05-22T17:52:09.082873"
+}
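Review note: two telemetry fields in the JSON above can be cross-checked against each other. `temp_delta_c` matches the gap between the hottest and coolest per-GPU `max_temp_c`, and `throttle_event_count` equals `steady_samples` times the GPU count, which suggests every steady-state sample on every GPU reported a throttle reason. The sketch assumes those two definitions; the suite may compute them differently (for example per sample), and the function name is hypothetical.

```python
# Values copied from the aikubeworker0012 stress smoke JSON above.
max_temp_c = {"0": 59.0, "1": 58.0, "2": 65.0, "3": 54.0,
              "4": 59.0, "5": 66.0, "6": 62.0, "7": 55.0}
steady_samples = 31
throttle_event_count = 248
max_temp_delta_c = 5.0  # from the "thresholds" object


def temp_delta(per_gpu_max):
    """Gap between the hottest and coolest GPU (assumed definition)."""
    temps = list(per_gpu_max.values())
    return max(temps) - min(temps)


delta = temp_delta(max_temp_c)
delta_ok = delta <= max_temp_delta_c
# True when every steady sample on every GPU carried a throttle reason.
throttled_every_sample = throttle_event_count == steady_samples * len(max_temp_c)
print(delta, delta_ok, throttled_every_sample)
```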
--- a/reports_stress_smoke_reasons_aikubeworker0012.md
+++ b/reports_stress_smoke_reasons_aikubeworker0012.md
@ -0,0 +1,29 @@
+# GPU Test Report
+
+- **Date:** 2026-05-22T17:52:09.082873
+- **Host:** aikubeworker0012
+
+## Summary
+
+| Test | Result |
+|------|--------|
+| Stress Test | FAIL |
+
+## Stress Test
+
+- **Source:** pytorch
+- **Duration:** 45s (requested 45s)
+- **Telemetry samples:** 39
+- **Max temp:** {'0': 59.0, '1': 58.0, '2': 65.0, '3': 54.0, '4': 59.0, '5': 66.0, '6': 62.0, '7': 55.0}
+- **Avg power:** {'0': 697.0, '1': 697.4, '2': 697.9, '3': 698.0, '4': 697.8, '5': 697.6, '6': 697.9, '7': 698.2}
+- **Temp delta:** 12.0 C
+- **TFLOPS jitter:** 4.07%
+- **Throttle events:** 248
+- **XID events:** 0
+- **Failure reasons:**
+  - GPU temperature delta 12.0C exceeds 5.0C
+  - non-idle throttle reasons observed in 248 samples (first: GPU 0 0x4)
+- **Result: FAIL**
+
+---
+*Generated by GPU Test Suite v0.2.0*
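Review note: the report prints the throttle reason only as the raw mask `0x4`. In NVML's clocks-throttle-reasons bitmask, bit `0x4` is the software power cap reason. A minimal decoding sketch follows, assuming the `throttle` strings in the JSON are that NVML bitmask. The dictionary keys are the NVML bit values; the short names and both helper functions are illustrative, not part of the suite.

```python
# Bit values from NVML's nvmlClocksThrottleReason* constants.
NVML_THROTTLE_BITS = {
    0x001: "gpu_idle",
    0x002: "applications_clocks_setting",
    0x004: "sw_power_cap",
    0x008: "hw_slowdown",
    0x010: "sync_boost",
    0x020: "sw_thermal_slowdown",
    0x040: "hw_thermal_slowdown",
    0x080: "hw_power_brake_slowdown",
    0x100: "display_clock_setting",
}
IDLE_BIT = 0x001


def decode_throttle(mask_hex):
    """Return the names of all reason bits set in a hex mask string."""
    mask = int(mask_hex, 16)
    return [name for bit, name in NVML_THROTTLE_BITS.items() if mask & bit]


def non_idle(mask_hex):
    """Mask with the idle bit cleared (assumed meaning of 'real_throttle')."""
    return int(mask_hex, 16) & ~IDLE_BIT


print(decode_throttle("0x0000000000000004"))
```

Read this way, the 248 events indicate the GPUs were held at their power limit, which fits the reported average of roughly 698 W against a 700 W cap. The strict `0x0` criterion still counts each of those samples as a failure.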
--- a/reports_stress_smoke_reasons_aikubeworker0016.json
+++ b/reports_stress_smoke_reasons_aikubeworker0016.json
@ -0,0 +1,165 @@
+{
+  "stress": {
+    "source": "pytorch",
+    "passed": false,
+    "duration_sec": 45,
+    "elapsed_sec": 45.4,
+    "gpu_status": {
+      "0": "PASS",
+      "1": "PASS",
+      "2": "PASS",
+      "3": "PASS",
+      "4": "PASS",
+      "5": "PASS",
+      "6": "PASS",
+      "7": "PASS"
+    },
+    "telemetry": {
+      "passed": false,
+      "samples": 39,
+      "steady_samples": 31,
+      "warmup_sec": 9.0,
+      "max_temp_c": {
+        "0": 50.0,
+        "1": 56.0,
+        "2": 57.0,
+        "3": 52.0,
+        "4": 51.0,
+        "5": 58.0,
+        "6": 53.0,
+        "7": 51.0
+      },
+      "avg_power_w": {
+        "0": 698.3,
+        "1": 698.5,
+        "2": 697.6,
+        "3": 697.9,
+        "4": 697.8,
+        "5": 698.0,
+        "6": 697.5,
+        "7": 698.0
+      },
+      "temp_delta_c": 8.0,
+      "throttle_events": [
+        {
+          "gpu": 0,
+          "throttle": "0x0000000000000004",
+          "real_throttle": "0x4"
+        },
+        {
+          "gpu": 1,
+          "throttle": "0x0000000000000004",
+          "real_throttle": "0x4"
+        },
+        {
+          "gpu": 2,
+          "throttle": "0x0000000000000004",
+          "real_throttle": "0x4"
+        },
+        {
+          "gpu": 3,
+          "throttle": "0x0000000000000004",
+          "real_throttle": "0x4"
+        },
+        {
+          "gpu": 4,
+          "throttle": "0x0000000000000004",
+          "real_throttle": "0x4"
+        },
+        {
+          "gpu": 5,
+          "throttle": "0x0000000000000004",
+          "real_throttle": "0x4"
+        },
+        {
+          "gpu": 6,
+          "throttle": "0x0000000000000004",
+          "real_throttle": "0x4"
+        },
+        {
+          "gpu": 7,
+          "throttle": "0x0000000000000004",
+          "real_throttle": "0x4"
+        },
+        {
+          "gpu": 0,
+          "throttle": "0x0000000000000004",
+          "real_throttle": "0x4"
+        },
+        {
+          "gpu": 1,
+          "throttle": "0x0000000000000004",
+          "real_throttle": "0x4"
+        },
+        {
+          "gpu": 2,
+          "throttle": "0x0000000000000004",
+          "real_throttle": "0x4"
+        },
+        {
+          "gpu": 3,
+          "throttle": "0x0000000000000004",
+          "real_throttle": "0x4"
+        },
+        {
+          "gpu": 4,
+          "throttle": "0x0000000000000004",
+          "real_throttle": "0x4"
+        },
+        {
+          "gpu": 5,
+          "throttle": "0x0000000000000004",
+          "real_throttle": "0x4"
+        },
+        {
+          "gpu": 6,
+          "throttle": "0x0000000000000004",
+          "real_throttle": "0x4"
+        },
+        {
+          "gpu": 7,
+          "throttle": "0x0000000000000004",
+          "real_throttle": "0x4"
+        },
+        {
+          "gpu": 0,
+          "throttle": "0x0000000000000004",
+          "real_throttle": "0x4"
+        },
+        {
+          "gpu": 1,
+          "throttle": "0x0000000000000004",
+          "real_throttle": "0x4"
+        },
+        {
+          "gpu": 2,
+          "throttle": "0x0000000000000004",
+          "real_throttle": "0x4"
+        },
+        {
+          "gpu": 3,
+          "throttle": "0x0000000000000004",
+          "real_throttle": "0x4"
+        }
+      ],
+      "throttle_event_count": 248,
+      "xid_events": [],
+      "tflops_jitter_pct": 3.77,
+      "steady_tflops_samples": 787,
+      "failures": [
+        "GPU temperature delta 8.0C exceeds 5.0C",
+        "non-idle throttle reasons observed in 248 samples (first: GPU 0 0x4)"
+      ],
+      "thresholds": {
+        "max_temp_c": 80.0,
+        "max_temp_delta_c": 5.0,
+        "min_power_w": 630.0,
+        "max_tflops_jitter_pct": 5.0,
+        "warmup_sec": 10.0,
+        "min_steady_samples": 10
+      }
+    },
+    "timestamp": "2026-05-22T17:53:02.058687"
+  },
+  "timestamp": "2026-05-22T17:53:02.066792"
+}
--- a/reports_stress_smoke_reasons_aikubeworker0016.md
+++ b/reports_stress_smoke_reasons_aikubeworker0016.md
@ -0,0 +1,29 @@
+# GPU Test Report
+
+- **Date:** 2026-05-22T17:53:02.066792
+- **Host:** aikubeworker0016
+
+## Summary
+
+| Test | Result |
+|------|--------|
+| Stress Test | FAIL |
+
+## Stress Test
+
+- **Source:** pytorch
+- **Duration:** 45s (requested 45s)
+- **Telemetry samples:** 39
+- **Max temp:** {'0': 50.0, '1': 56.0, '2': 57.0, '3': 52.0, '4': 51.0, '5': 58.0, '6': 53.0, '7': 51.0}
+- **Avg power:** {'0': 698.3, '1': 698.5, '2': 697.6, '3': 697.9, '4': 697.8, '5': 698.0, '6': 697.5, '7': 698.0}
+- **Temp delta:** 8.0 C
+- **TFLOPS jitter:** 3.77%
+- **Throttle events:** 248
+- **XID events:** 0
+- **Failure reasons:**
+  - GPU temperature delta 8.0C exceeds 5.0C
+  - non-idle throttle reasons observed in 248 samples (first: GPU 0 0x4)
+- **Result: FAIL**
+
+---
+*Generated by GPU Test Suite v0.2.0*
--- a/reports_test_all_latest_aikubeworker0012_20260522_203246.md
+++ b/reports_test_all_latest_aikubeworker0012_20260522_203246.md
@ -0,0 +1,322 @@
+# GPU Test Report
+
+- **Date:** 2026-05-22T20:32:51.687830
+- **Host:** aikubeworker0012
+- **GPU:** NVIDIA H100 80GB HBM3 x8
+- **Driver:** 580.159.03 | **CUDA:** 13.0
+
+## Overall Acceptance Verdict
+
+**Result: FAIL**
+
+Failed or unverified items:
+- Compute Throughput: FAIL (FP16 spread 3.04% > 3%)
+- NCCL: FAIL
+- Stress Test: FAIL
+- RDMA: FAIL
+
+## Summary
+
+| Test | Result |
+|------|--------|
+| GPU Info | PASS (8 GPUs detected) |
+| Health Check | PASS |
+| Memory Bandwidth | PASS (108.1%) |
+| Compute Throughput | FAIL (FP16 spread 3.04% > 3%) |
+| NVLink/NVSwitch | PASS |
+| DCGM | PASS |
+| NCCL | FAIL |
+| Stress Test | FAIL |
+| RDMA | FAIL |
+| Training | PASS (216498 tokens/sec) |
+
+## GPU Information
+
+| GPU | Model | VRAM | Temp | Power | SM Clock |
+|-----|-------|------|------|-------|----------|
+| 0 | NVIDIA H100 80GB HBM3 | 81559 MB | 25C | 69/700W | 345 MHz |
+| 1 | NVIDIA H100 80GB HBM3 | 81559 MB | 25C | 73/700W | 345 MHz |
+| 2 | NVIDIA H100 80GB HBM3 | 81559 MB | 26C | 69/700W | 345 MHz |
+| 3 | NVIDIA H100 80GB HBM3 | 81559 MB | 24C | 69/700W | 345 MHz |
+| 4 | NVIDIA H100 80GB HBM3 | 81559 MB | 24C | 69/700W | 345 MHz |
+| 5 | NVIDIA H100 80GB HBM3 | 81559 MB | 27C | 70/700W | 345 MHz |
+| 6 | NVIDIA H100 80GB HBM3 | 81559 MB | 25C | 70/700W | 345 MHz |
+| 7 | NVIDIA H100 80GB HBM3 | 81559 MB | 24C | 71/700W | 345 MHz |
+
+## Health Check
+
+**Overall: PASS**
+
+| GPU | Temp | Power | ECC | PCIe | Throttle | Status |
+|-----|------|-------|-----|------|----------|--------|
+| 0 | 25C PASS | 69W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
+| 1 | 25C PASS | 73W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
+| 2 | 26C PASS | 69W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
+| 3 | 24C PASS | 70W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
+| 4 | 24C PASS | 69W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
+| 5 | 27C PASS | 70W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
+| 6 | 25C PASS | 70W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
+| 7 | 24C PASS | 71W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
+
+## Memory Bandwidth
+
+Source: nvbandwidth
+
+| Metric | Value | Peak | Efficiency |
+|--------|-------|------|------------|
+| H2D (PCIe) | 55.4 GB/s | 64 GB/s | 86.6% |
+| D2H (PCIe) | 54.0 GB/s | 64 GB/s | 84.4% |
+| D2D (NVLink) | 486.5 GB/s | 450 GB/s | 108.1% |
+
+**Verdict: PASS** (D2D efficiency 108.1%)
+
+## Compute Throughput
+
+| DType | Achieved (TFLOPS) | Peak | Threshold | Status |
+|-------|-------------------|------|------------|--------|
+| FP32 | 51.9 | 67 | >= 54 | FAIL |
+| TF32 | 364.9 | 495 | >= 444 | FAIL |
+| FP16 | 680.0 | 990 | >= 734 | FAIL |
+| BF16 | 713.2 | 990 | >= 745 | FAIL |
+| FP8 | 1170.4 | 1979 | >= 1400 | FAIL |
+| FP64 | 46.9 | 67 | >= 63 | FAIL |
+| INT8 | 100.4 | 1979 | >= 1536 | FAIL |
+
+**Verdict: FAIL** (absolute TFLOPS thresholds; worst efficiency 5.1%)
+
+### Compute Consistency
+
+| DType | Min | Mean | Max | Spread | Limit | Status |
+|-------|-----|------|-----|--------|-------|--------|
+| FP32 | 51.9 | 52.0 | 52.1 | 0.38% | <= 3% | PASS |
+| TF32 | 361.0 | 364.9 | 369.0 | 2.19% | <= 3% | PASS |
+| FP16 | 667.3 | 680.0 | 688.0 | 3.04% | <= 3% | FAIL |
+| BF16 | 703.0 | 713.3 | 735.7 | 4.58% | <= 3% | FAIL |
+| FP8 | 1156.9 | 1170.5 | 1186.1 | 2.49% | <= 3% | PASS |
+| FP64 | 45.9 | 46.9 | 47.5 | 3.41% | <= 3% | FAIL |
+| INT8 | 100.4 | 100.4 | 100.4 | 0.00% | <= 3% | PASS |
+
+### Compute Per-GPU TFLOPS
+
+| GPU | FP32 | TF32 | FP16 | BF16 | FP8 | FP64 | INT8 |
+|---|---|---|---|---|---|---|---|
+| 0 | 52.0 | 369.0 | 688.0 | 735.7 | 1186.1 | 47.5 | 100.4 |
+| 1 | 51.9 | 365.6 | 675.3 | 711.6 | 1171.0 | 47.0 | 100.4 |
+| 2 | 51.9 | 364.9 | 685.7 | 715.3 | 1175.3 | 47.1 | 100.4 |
+| 3 | 51.9 | 364.0 | 679.9 | 704.0 | 1167.6 | 47.4 | 100.4 |
+| 4 | 51.9 | 367.7 | 681.2 | 719.0 | 1178.0 | 46.6 | 100.4 |
+| 5 | 52.0 | 364.3 | 680.8 | 712.3 | 1165.5 | 46.8 | 100.4 |
+| 6 | 52.1 | 362.9 | 681.8 | 703.0 | 1156.9 | 46.9 | 100.4 |
+| 7 | 51.9 | 361.0 | 667.3 | 705.3 | 1163.2 | 45.9 | 100.4 |
+
+## NVLink/NVSwitch
+
+**Overall: PASS**
+
+| GPU | Active Links | Issues |
+|-----|--------------|--------|
+| 0 | 18/18 | OK |
+| 1 | 18/18 | OK |
+| 2 | 18/18 | OK |
+| 3 | 18/18 | OK |
+| 4 | 18/18 | OK |
+| 5 | 18/18 | OK |
+| 6 | 18/18 | OK |
+| 7 | 18/18 | OK |
+
+## DCGM Diagnostic
+
+**Overall: PASS**
+
+| Subtest | Status |
+|---------|--------|
+| Deployment/software/GPU0 | PASS |
+| Deployment/software/GPU1 | PASS |
+| Deployment/software/GPU2 | PASS |
+| Deployment/software/GPU3 | PASS |
+| Deployment/software/GPU4 | PASS |
+| Deployment/software/GPU5 | PASS |
+| Deployment/software/GPU6 | PASS |
+| Deployment/software/GPU7 | PASS |
+| Deployment/software/summary | PASS |
+| Hardware/memory/GPU0 | PASS |
+| Hardware/memory/GPU1 | PASS |
+| Hardware/memory/GPU2 | PASS |
+| Hardware/memory/GPU3 | PASS |
+| Hardware/memory/GPU4 | PASS |
+| Hardware/memory/GPU5 | PASS |
+| Hardware/memory/GPU6 | PASS |
+| Hardware/memory/GPU7 | PASS |
+| Hardware/memory/summary | PASS |
+| Hardware/diagnostic/GPU0 | PASS |
+| Hardware/diagnostic/GPU1 | PASS |
+| Hardware/diagnostic/GPU2 | PASS |
+| Hardware/diagnostic/GPU3 | PASS |
+| Hardware/diagnostic/GPU4 | PASS |
+| Hardware/diagnostic/GPU5 | PASS |
+| Hardware/diagnostic/GPU6 | PASS |
+| Hardware/diagnostic/GPU7 | PASS |
+| Hardware/diagnostic/summary | PASS |
+| Hardware/nvbandwidth/GPU0 | PASS |
+| Hardware/nvbandwidth/GPU1 | PASS |
+| Hardware/nvbandwidth/GPU2 | PASS |
+| Hardware/nvbandwidth/GPU3 | PASS |
+| Hardware/nvbandwidth/GPU4 | PASS |
+| Hardware/nvbandwidth/GPU5 | PASS |
+| Hardware/nvbandwidth/GPU6 | PASS |
+| Hardware/nvbandwidth/GPU7 | PASS |
+| Hardware/nvbandwidth/summary | PASS |
+| Integration/pcie/GPU0 | PASS |
+| Integration/pcie/GPU1 | PASS |
+| Integration/pcie/GPU2 | PASS |
+| Integration/pcie/GPU3 | PASS |
+| Integration/pcie/GPU4 | PASS |
+| Integration/pcie/GPU5 | PASS |
+| Integration/pcie/GPU6 | PASS |
+| Integration/pcie/GPU7 | PASS |
+| Integration/pcie/summary | PASS |
+| Stress/targeted_stress/GPU0 | PASS |
+| Stress/targeted_stress/GPU1 | PASS |
+| Stress/targeted_stress/GPU2 | PASS |
+| Stress/targeted_stress/GPU3 | PASS |
+| Stress/targeted_stress/GPU4 | PASS |
+| Stress/targeted_stress/GPU5 | PASS |
+| Stress/targeted_stress/GPU6 | PASS |
+| Stress/targeted_stress/GPU7 | PASS |
+| Stress/targeted_stress/summary | PASS |
+| Stress/targeted_power/GPU0 | PASS |
+| Stress/targeted_power/GPU1 | PASS |
+| Stress/targeted_power/GPU2 | PASS |
+| Stress/targeted_power/GPU3 | PASS |
+| Stress/targeted_power/GPU4 | PASS |
+| Stress/targeted_power/GPU5 | PASS |
+| Stress/targeted_power/GPU6 | PASS |
+| Stress/targeted_power/GPU7 | PASS |
+| Stress/targeted_power/summary | PASS |
+
+## NCCL Multi-GPU
+
+Source: nccl-tests | GPUs: 8
+
+| Operation | Bus BW (GB/s) | Threshold | Status |
+|-----------|---------------|-----------|--------|
+| allreduce | 472.3 | >= 405 | FAIL |
+| alltoall | 343.3 | >= 315 | FAIL |
+| broadcast | 364.1 | >= 360 | FAIL |
+| reducescatter | 352.8 | >= 405 | FAIL |
+| allgather | 366.4 | >= 405 | FAIL |
+| sendrecv | 369.0 | >= 360 | FAIL |
+
+### NCCL allreduce by size
+
+| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
+|------|---------------------|-------|------|--------|-----------|--------|
+| 1M | 24.9, 25.0, 24.7 | 24.7 | 24.9 | 0.50% | >= 405 | FAIL |
+| 256M | 421.6, 421.8, 421.6 | 421.6 | 421.7 | 0.02% | >= 405 | PASS |
+| 2G | 472.8, 472.7, 471.5 | 471.5 | 472.3 | 0.13% | >= 405 | PASS |
+
+### NCCL alltoall by size
+
+| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
+|------|---------------------|-------|------|--------|-----------|--------|
+| 1M | 8.1, 8.0, 8.0 | 8.0 | 8.0 | 0.59% | >= 315 | FAIL |
+| 256M | 305.3, 314.9, 313.1 | 305.3 | 311.1 | 1.34% | >= 315 | FAIL |
+| 2G | 342.1, 342.5, 345.4 | 342.1 | 343.3 | 0.43% | >= 315 | PASS |
+
+### NCCL broadcast by size
+
+| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
+|------|---------------------|-------|------|--------|-----------|--------|
+| 1M | 14.5, 14.6, 14.2 | 14.2 | 14.4 | 1.18% | >= 360 | FAIL |
+| 256M | 344.2, 345.9, 344.6 | 344.2 | 344.9 | 0.21% | >= 360 | FAIL |
+| 2G | 364.2, 364.0, 364.1 | 364.0 | 364.1 | 0.02% | >= 360 | PASS |
+
+### NCCL reducescatter by size
+
+| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
+|------|---------------------|-------|------|--------|-----------|--------|
+| 1M | 14.1, 13.8, 14.2 | 13.8 | 14.0 | 1.21% | >= 405 | FAIL |
+| 256M | 328.6, 328.3, 328.2 | 328.2 | 328.4 | 0.05% | >= 405 | FAIL |
+| 2G | 352.6, 352.4, 353.3 | 352.4 | 352.8 | 0.11% | >= 405 | FAIL |
+
+### NCCL allgather by size
+
+| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
+|------|---------------------|-------|------|--------|-----------|--------|
+| 1M | 14.6, 14.3, 14.4 | 14.3 | 14.4 | 0.86% | >= 405 | FAIL |
+| 256M | 350.5, 350.4, 349.9 | 349.9 | 350.3 | 0.07% | >= 405 | FAIL |
+| 2G | 366.3, 366.6, 366.2 | 366.2 | 366.4 | 0.05% | >= 405 | FAIL |
+
+### NCCL sendrecv by size
+
+| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
+|------|---------------------|-------|------|--------|-----------|--------|
+| 1M | 18.4, 18.4, 18.4 | 18.4 | 18.4 | 0.00% | >= 360 | FAIL |
+| 256M | 350.9, 351.6, 351.4 | 350.9 | 351.3 | 0.08% | >= 360 | FAIL |
+| 2G | 368.9, 369.1, 368.9 | 368.9 | 369.0 | 0.03% | >= 360 | PASS |
+
+**Overall: FAIL**
+
+## Stress Test
+
+- **Source:** pytorch
+- **Duration:** 1800s (requested 1800s)
+- **Telemetry samples:** 1266
+- **Max temp:** {0: 60.0, 1: 60.0, 2: 68.0, 3: 56.0, 4: 60.0, 5: 68.0, 6: 64.0, 7: 56.0}
+- **Avg power:** {0: 697.7, 1: 697.5, 2: 697.1, 3: 697.8, 4: 697.8, 5: 697.9, 6: 697.7, 7: 698.3}
+- **Temp delta:** 12.0 C
+- **TFLOPS jitter:** 4.37%
+- **Steady TFLOPS samples:** 37672
+- **Throttle events:** 9712
+- **XID events:** 0
+- **Failure reasons:**
+  - GPU temperature delta 12.0C exceeds 5.0C
+  - non-idle throttle reasons observed in 9712 samples (first: GPU 0 0x4)
+- **Result: FAIL**
+
+## RDMA/InfiniBand
+
+### RDMA Port Checks
+
+| Device | Port | State | Rate | Required | Status |
+|--------|------|-------|------|----------|--------|
+| mlx5_0 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
+| mlx5_1 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
+| mlx5_4 | 1 | 4: ACTIVE | 100 Gb/sec (2X HDR) | >= 400Gbps ACTIVE | FAIL |
+| mlx5_5 | 1 | 4: ACTIVE | 100 Gb/sec (2X HDR) | >= 400Gbps ACTIVE | FAIL |
+| mlx5_6 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
+| mlx5_7 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
+
+| Test | Value | Threshold | Status |
+|------|-------|-----------|--------|
+| ib_write_bw | 49.5 GB/s | >= 47 GB/s | PASS |
+| ib_read_bw | 39.1 GB/s | >= 47 GB/s | FAIL |
+| ib_write_lat | 1.25 us | <= 2 us | PASS |
+| ib_read_lat | 2.60 us | <= 3.5 us | PASS |
+| ibping | local_loopback target=0x58 count=5 | 0% packet loss | PASS |
+
+- **PFC/ECN/CNP/congestion counters checked:** 146
+- **PFC/ECN/CNP/congestion non-zero:** no
+- **Failure reasons:**
+  - mlx5_4 port 1 state/rate failed (4: ACTIVE, 100 Gb/sec (2X HDR); required >= 400.0Gbps ACTIVE)
+  - mlx5_5 port 1 state/rate failed (4: ACTIVE, 100 Gb/sec (2X HDR); required >= 400.0Gbps ACTIVE)
+  - ib_read_bw bandwidth 39.12GB/s < 47GB/s
+**Overall: FAIL**
+
+## Training Simulation
+
+| Metric | Value |
+|--------|-------|
+| Model | synthetic_transformer_1.5b |
+| Params | 1470.5M |
+| Throughput | 216498 tokens/sec |
+| Avg Step Time | 75.7 ms |
+| Warmup Steps | 5 |
+| Peak Memory | 18.1 GB |
+| Final Loss | 0.0039 |
+| Step Jitter | 1.89% |
+| Distributed Mode | ddp |
+| Verdict | PASS (216498 tokens/sec) |
+
+---
+*Generated by GPU Test Suite v0.2.0*
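Review note: the NCCL summary table in the report above shows rows such as `allreduce | 472.3 | >= 405 | FAIL`, which looks contradictory on its own. The per-size tables resolve it: the summary value matches the 2G mean, while the status appears to fail if the worst run at any size misses the threshold. The `StdDev` column matches the population standard deviation divided by the mean, and the `Spread` column in Compute Consistency matches `(max - min) / mean`. The sketch below reproduces those figures under these assumptions; the function names are hypothetical.

```python
from statistics import mean, pstdev


def spread_pct(values):
    """(max - min) / mean as a percentage, as in the Compute Consistency table."""
    return round(100.0 * (max(values) - min(values)) / mean(values), 2)


def nccl_size_row(runs, threshold):
    """Per-size statistics; a size passes only if its worst run meets the threshold."""
    m = mean(runs)
    return {
        "worst": min(runs),
        "mean": round(m, 1),
        "stddev_pct": round(100.0 * pstdev(runs) / m, 2),
        "passed": min(runs) >= threshold,
    }


# Values copied from the aikubeworker0012 report above.
fp16_per_gpu = [688.0, 675.3, 685.7, 679.9, 681.2, 680.8, 681.8, 667.3]
allreduce_runs = {
    "1M": [24.9, 25.0, 24.7],
    "256M": [421.6, 421.8, 421.6],
    "2G": [472.8, 472.7, 471.5],
}

rows = {size: nccl_size_row(runs, 405) for size, runs in allreduce_runs.items()}
# Assumed rule: the operation passes only if every message size passes.
allreduce_passed = all(row["passed"] for row in rows.values())
print(spread_pct(fp16_per_gpu), rows, allreduce_passed)
```

Under this reading, allreduce fails solely because of the 1M row, so whether the 405 GB/s threshold should apply to small messages is the open question, not the large-message bandwidth.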
--- a/reports_test_all_latest_aikubeworker0016_20260522_203447.md
+++ b/reports_test_all_latest_aikubeworker0016_20260522_203447.md
@ -0,0 +1,322 @@
+# GPU Test Report
+
+- **Date:** 2026-05-22T20:34:52.129246
+- **Host:** aikubeworker0016
+- **GPU:** NVIDIA H100 80GB HBM3 x8
+- **Driver:** 580.159.03 | **CUDA:** 13.0
+
+## Overall Acceptance Verdict
+
+**Result: FAIL**
+
+Failed or unverified items:
+- Compute Throughput: FAIL (BF16 spread 3.44% > 3%)
+- NCCL: FAIL
+- Stress Test: FAIL
+- RDMA: FAIL
+
+## Summary
+
+| Test | Result |
+|------|--------|
+| GPU Info | PASS (8 GPUs detected) |
+| Health Check | PASS |
+| Memory Bandwidth | PASS (108.1%) |
+| Compute Throughput | FAIL (BF16 spread 3.44% > 3%) |
+| NVLink/NVSwitch | PASS |
+| DCGM | PASS |
+| NCCL | FAIL |
+| Stress Test | FAIL |
+| RDMA | FAIL |
+| Training | PASS (216683 tokens/sec) |
+
+## GPU Information
+
+| GPU | Model | VRAM | Temp | Power | SM Clock |
+|-----|-------|------|------|-------|----------|
+| 0 | NVIDIA H100 80GB HBM3 | 81559 MB | 20C | 70/700W | 345 MHz |
+| 1 | NVIDIA H100 80GB HBM3 | 81559 MB | 21C | 68/700W | 345 MHz |
+| 2 | NVIDIA H100 80GB HBM3 | 81559 MB | 21C | 67/700W | 345 MHz |
+| 3 | NVIDIA H100 80GB HBM3 | 81559 MB | 20C | 67/700W | 345 MHz |
+| 4 | NVIDIA H100 80GB HBM3 | 81559 MB | 20C | 68/700W | 345 MHz |
+| 5 | NVIDIA H100 80GB HBM3 | 81559 MB | 22C | 69/700W | 345 MHz |
+| 6 | NVIDIA H100 80GB HBM3 | 81559 MB | 20C | 68/700W | 345 MHz |
+| 7 | NVIDIA H100 80GB HBM3 | 81559 MB | 20C | 66/700W | 345 MHz |
+
+## Health Check
+
+**Overall: PASS**
+
+| GPU | Temp | Power | ECC | PCIe | Throttle | Status |
+|-----|------|-------|-----|------|----------|--------|
+| 0 | 20C PASS | 70W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
+| 1 | 21C PASS | 68W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
+| 2 | 21C PASS | 67W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
+| 3 | 20C PASS | 67W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
+| 4 | 20C PASS | 68W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
+| 5 | 22C PASS | 69W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
+| 6 | 20C PASS | 68W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
+| 7 | 20C PASS | 66W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
+
+## Memory Bandwidth
+
+Source: nvbandwidth
+
+| Metric | Value | Peak | Efficiency |
+|--------|-------|------|------------|
+| H2D (PCIe) | 55.4 GB/s | 64 GB/s | 86.6% |
+| D2H (PCIe) | 54.4 GB/s | 64 GB/s | 85.0% |
+| D2D (NVLink) | 486.6 GB/s | 450 GB/s | 108.1% |
+
+**Verdict: PASS** (D2D efficiency 108.1%)
+
+## Compute Throughput
+
+| DType | Achieved (TFLOPS) | Peak | Threshold | Status |
+|-------|-------------------|------|------------|--------|
+| FP32 | 52.1 | 67 | >= 54 | FAIL |
+| TF32 | 366.7 | 495 | >= 444 | FAIL |
+| FP16 | 682.7 | 990 | >= 734 | FAIL |
+| BF16 | 717.3 | 990 | >= 745 | FAIL |
+| FP8 | 1173.5 | 1979 | >= 1400 | FAIL |
+| FP64 | 47.4 | 67 | >= 63 | FAIL |
+| INT8 | 100.4 | 1979 | >= 1536 | FAIL |
+
+**Verdict: FAIL** (absolute TFLOPS thresholds; worst efficiency 5.1%)
+
+### Compute Consistency
+
+| DType | Min | Mean | Max | Spread | Limit | Status |
+|-------|-----|------|-----|--------|-------|--------|
+| FP32 | 51.9 | 52.1 | 52.2 | 0.58% | <= 3% | PASS |
+| TF32 | 362.3 | 366.7 | 369.2 | 1.88% | <= 3% | PASS |
+| FP16 | 674.4 | 682.7 | 693.1 | 2.74% | <= 3% | PASS |
+| BF16 | 705.3 | 717.2 | 730.0 | 3.44% | <= 3% | FAIL |
+| FP8 | 1155.2 | 1173.5 | 1186.2 | 2.64% | <= 3% | PASS |
+| FP64 | 46.3 | 47.4 | 48.5 | 4.64% | <= 3% | FAIL |
+| INT8 | 100.4 | 100.4 | 100.4 | 0.00% | <= 3% | PASS |
+
+### Compute Per-GPU TFLOPS
+
+| GPU | FP32 | TF32 | FP16 | BF16 | FP8 | FP64 | INT8 |
+|---|---|---|---|---|---|---|---|
+| 0 | 52.2 | 362.3 | 674.4 | 714.3 | 1159.0 | 46.3 | 100.4 |
+| 1 | 51.9 | 366.5 | 674.7 | 721.4 | 1185.4 | 47.7 | 100.4 |
+| 2 | 52.2 | 367.4 | 693.1 | 730.0 | 1185.7 | 48.5 | 100.4 |
+| 3 | 52.2 | 367.8 | 682.2 | 708.2 | 1163.4 | 47.4 | 100.4 |
+| 4 | 52.0 | 366.4 | 686.9 | 714.1 | 1186.2 | 47.3 | 100.4 |
+| 5 | 52.0 | 369.2 | 679.9 | 721.1 | 1155.2 | 47.3 | 100.4 |
+| 6 | 51.9 | 365.1 | 677.7 | 705.3 | 1169.0 | 47.0 | 100.4 |
+| 7 | 52.2 | 369.0 | 692.8 | 723.5 | 1184.3 | 47.6 | 100.4 |
+
+## NVLink/NVSwitch
+
+**Overall: PASS**
+
+| GPU | Active Links | Issues |
+|-----|--------------|--------|
+| 0 | 18/18 | OK |
+| 1 | 18/18 | OK |
+| 2 | 18/18 | OK |
+| 3 | 18/18 | OK |
+| 4 | 18/18 | OK |
+| 5 | 18/18 | OK |
+| 6 | 18/18 | OK |
+| 7 | 18/18 | OK |
+
+## DCGM Diagnostic
+
+**Overall: PASS**
+
+| Subtest | Status |
+|---------|--------|
+| Deployment/software/GPU0 | PASS |
+| Deployment/software/GPU1 | PASS |
+| Deployment/software/GPU2 | PASS |
+| Deployment/software/GPU3 | PASS |
+| Deployment/software/GPU4 | PASS |
+| Deployment/software/GPU5 | PASS |
+| Deployment/software/GPU6 | PASS |
+| Deployment/software/GPU7 | PASS |
+| Deployment/software/summary | PASS |
+| Hardware/memory/GPU0 | PASS |
+| Hardware/memory/GPU1 | PASS |
+| Hardware/memory/GPU2 | PASS |
+| Hardware/memory/GPU3 | PASS |
+| Hardware/memory/GPU4 | PASS |
+| Hardware/memory/GPU5 | PASS |
+| Hardware/memory/GPU6 | PASS |
+| Hardware/memory/GPU7 | PASS |
+| Hardware/memory/summary | PASS |
+| Hardware/diagnostic/GPU0 | PASS |
+| Hardware/diagnostic/GPU1 | PASS |
+| Hardware/diagnostic/GPU2 | PASS |
+| Hardware/diagnostic/GPU3 | PASS |
+| Hardware/diagnostic/GPU4 | PASS |
+| Hardware/diagnostic/GPU5 | PASS |
+| Hardware/diagnostic/GPU6 | PASS |
+| Hardware/diagnostic/GPU7 | PASS |
+| Hardware/diagnostic/summary | PASS |
+| Hardware/nvbandwidth/GPU0 | PASS |
+| Hardware/nvbandwidth/GPU1 | PASS |
+| Hardware/nvbandwidth/GPU2 | PASS |
+| Hardware/nvbandwidth/GPU3 | PASS |
+| Hardware/nvbandwidth/GPU4 | PASS |
+| Hardware/nvbandwidth/GPU5 | PASS |
+| Hardware/nvbandwidth/GPU6 | PASS |
+| Hardware/nvbandwidth/GPU7 | PASS |
+| Hardware/nvbandwidth/summary | PASS |
+| Integration/pcie/GPU0 | PASS |
+| Integration/pcie/GPU1 | PASS |
+| Integration/pcie/GPU2 | PASS |
+| Integration/pcie/GPU3 | PASS |
+| Integration/pcie/GPU4 | PASS |
+| Integration/pcie/GPU5 | PASS |
+| Integration/pcie/GPU6 | PASS |
+| Integration/pcie/GPU7 | PASS |
+| Integration/pcie/summary | PASS |
+| Stress/targeted_stress/GPU0 | PASS |
+| Stress/targeted_stress/GPU1 | PASS |
+| Stress/targeted_stress/GPU2 | PASS |
+| Stress/targeted_stress/GPU3 | PASS |
+| Stress/targeted_stress/GPU4 | PASS |
+| Stress/targeted_stress/GPU5 | PASS |
+| Stress/targeted_stress/GPU6 | PASS |
+| Stress/targeted_stress/GPU7 | PASS |
+| Stress/targeted_stress/summary | PASS |
+| Stress/targeted_power/GPU0 | PASS |
+| Stress/targeted_power/GPU1 | PASS |
+| Stress/targeted_power/GPU2 | PASS |
+| Stress/targeted_power/GPU3 | PASS |
+| Stress/targeted_power/GPU4 | PASS |
+| Stress/targeted_power/GPU5 | PASS |
+| Stress/targeted_power/GPU6 | PASS |
+| Stress/targeted_power/GPU7 | PASS |
+| Stress/targeted_power/summary | PASS |
+
+## NCCL Multi-GPU
+
+Source: nccl-tests | GPUs: 8
+
+| Operation | Bus BW (GB/s) | Threshold | Status |
+|-----------|---------------|-----------|--------|
+| allreduce | 472.4 | >= 405 | FAIL |
+| alltoall | 344.3 | >= 315 | FAIL |
+| broadcast | 363.6 | >= 360 | FAIL |
+| reducescatter | 353.1 | >= 405 | FAIL |
+| allgather | 366.4 | >= 405 | FAIL |
+| sendrecv | 368.9 | >= 360 | FAIL |
+
+### NCCL allreduce by size
+
+| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
+|------|---------------------|-------|------|--------|-----------|--------|
+| 1M | 24.9, 24.4, 24.9 | 24.4 | 24.7 | 0.95% | >= 405 | FAIL |
+| 256M | 421.9, 421.1, 421.9 | 421.1 | 421.6 | 0.09% | >= 405 | PASS |
+| 2G | 472.6, 472.0, 472.5 | 472.0 | 472.4 | 0.06% | >= 405 | PASS |
+
+### NCCL alltoall by size
+
+| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
+|------|---------------------|-------|------|--------|-----------|--------|
+| 1M | 7.9, 7.8, 8.1 | 7.8 | 7.9 | 1.57% | >= 315 | FAIL |
+| 256M | 298.7, 312.7, 303.2 | 298.7 | 304.9 | 1.91% | >= 315 | FAIL |
+| 2G | 342.2, 345.4, 345.2 | 342.2 | 344.3 | 0.43% | >= 315 | PASS |
+
+### NCCL broadcast by size
+
+| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
+|------|---------------------|-------|------|--------|-----------|--------|
+| 1M | 14.5, 14.3, 14.4 | 14.3 | 14.4 | 0.57% | >= 360 | FAIL |
+| 256M | 344.1, 344.3, 344.8 | 344.1 | 344.4 | 0.09% | >= 360 | FAIL |
+| 2G | 364.0, 363.6, 363.3 | 363.3 | 363.6 | 0.08% | >= 360 | PASS |
+
+### NCCL reducescatter by size
+
+| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
+|------|---------------------|-------|------|--------|-----------|--------|
+| 1M | 14.0, 14.2, 14.3 | 14.0 | 14.2 | 0.88% | >= 405 | FAIL |
+| 256M | 328.8, 328.7, 328.4 | 328.4 | 328.6 | 0.05% | >= 405 | FAIL |
+| 2G | 351.9, 353.8, 353.6 | 351.9 | 353.1 | 0.24% | >= 405 | FAIL |
+
+### NCCL allgather by size
+
+| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
+|------|---------------------|-------|------|--------|-----------|--------|
+| 1M | 14.4, 13.9, 14.0 | 13.9 | 14.1 | 1.53% | >= 405 | FAIL |
+| 256M | 350.2, 350.4, 350.7 | 350.2 | 350.4 | 0.06% | >= 405 | FAIL |
+| 2G | 366.9, 366.4, 366.0 | 366.0 | 366.4 | 0.10% | >= 405 | FAIL |
+
+### NCCL sendrecv by size
+
+| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
+|------|---------------------|-------|------|--------|-----------|--------|
+| 1M | 18.4, 18.3, 18.5 | 18.3 | 18.4 | 0.44% | >= 360 | FAIL |
+| 256M | 351.1, 351.4, 351.3 | 351.1 | 351.3 | 0.04% | >= 360 | FAIL |
+| 2G | 368.9, 368.8, 368.9 | 368.8 | 368.9 | 0.01% | >= 360 | PASS |
+
+**Overall: FAIL**
+
+## Stress Test
+
+- **Source:** pytorch
+- **Duration:** 1800s (requested 1800s)
+- **Telemetry samples:** 1295
+- **Max temp:** {0: 51.0, 1: 59.0, 2: 61.0, 3: 53.0, 4: 53.0, 5: 62.0, 6: 56.0, 7: 52.0}
+- **Avg power:** {0: 698.8, 1: 697.8, 2: 698.1, 3: 697.9, 4: 697.9, 5: 698.2, 6: 698.0, 7: 697.8}
+- **Temp delta:** 11.0 C
+- **TFLOPS jitter:** 3.4%
+- **Steady TFLOPS samples:** 37874
+- **Throttle events:** 9944
+- **XID events:** 0
+- **Failure reasons:**
+  - GPU temperature delta 11.0C exceeds 5.0C
+  - non-idle throttle reasons observed in 9944 samples (first: GPU 0 0x4)
+- **Result: FAIL**
+
+## RDMA/InfiniBand
+
+### RDMA Port Checks
+
+| Device | Port | State | Rate | Required | Status |
+|--------|------|-------|------|----------|--------|
+| mlx5_0 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
+| mlx5_1 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
+| mlx5_4 | 1 | 4: ACTIVE | 100 Gb/sec (2X HDR) | >= 400Gbps ACTIVE | FAIL |
+| mlx5_5 | 1 | 4: ACTIVE | 100 Gb/sec (2X HDR) | >= 400Gbps ACTIVE | FAIL |
+| mlx5_6 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
+| mlx5_7 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
+
+| Test | Value | Threshold | Status |
+|------|-------|-----------|--------|
+| ib_write_bw | 48.6 GB/s | >= 47 GB/s | PASS |
+| ib_read_bw | 40.3 GB/s | >= 47 GB/s | FAIL |
+| ib_write_lat | 1.29 us | <= 2 us | PASS |
+| ib_read_lat | 2.59 us | <= 3.5 us | PASS |
+| ibping | local_loopback target=0x4b count=5 | 0% packet loss | PASS |
+
+- **PFC/ECN/CNP/congestion counters checked:** 146
+- **PFC/ECN/CNP/congestion non-zero:** no
+- **Failure reasons:**
+  - mlx5_4 port 1 state/rate failed (4: ACTIVE, 100 Gb/sec (2X HDR); required >= 400.0Gbps ACTIVE)
+  - mlx5_5 port 1 state/rate failed (4: ACTIVE, 100 Gb/sec (2X HDR); required >= 400.0Gbps ACTIVE)
+  - ib_read_bw bandwidth 40.29GB/s < 47GB/s
+**Overall: FAIL**
+
+## Training Simulation
+
+| Metric | Value |
+|--------|-------|
+| Model | synthetic_transformer_1.5b |
+| Params | 1470.5M |
+| Throughput | 216683 tokens/sec |
+| Avg Step Time | 75.6 ms |
+| Warmup Steps | 5 |
+| Peak Memory | 18.1 GB |
+| Final Loss | 0.0039 |
+| Step Jitter | 1.2% |
+| Distributed Mode | ddp |
+| Verdict | PASS (216683 tokens/sec) |
+
+---
+*Generated by GPU Test Suite v0.2.0*
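
The per-size NCCL tables in the report above list three runs per message size with Worst, Mean, and StdDev columns. The StdDev percentages are consistent with the population standard deviation divided by the mean (for example, the allreduce 1M runs 24.9, 24.4, 24.9 give 0.95%). That formula is inferred from the numbers, not confirmed by the source. A minimal sketch under that assumption follows. `summarize_runs` and `max_cv_pct` are hypothetical names, and applying a 3% spread limit to the status is an assumption based on the PDF requirement quoted elsewhere in this commit.

```python
from statistics import mean, pstdev


def summarize_runs(runs, threshold, max_cv_pct=3.0):
    """Summarize repeated bus-bandwidth runs (GB/s) for one op and message size.

    Hypothetical helper: the status rule (worst run must meet the threshold and
    the relative spread must stay within max_cv_pct) is an assumption.
    """
    worst = min(runs)
    avg = mean(runs)
    # Relative spread: population standard deviation as a percentage of the mean.
    cv_pct = pstdev(runs) / avg * 100.0
    status = "PASS" if worst >= threshold and cv_pct <= max_cv_pct else "FAIL"
    return {
        "worst": worst,
        "mean": round(avg, 1),
        "stddev_pct": round(cv_pct, 2),
        "status": status,
    }


if __name__ == "__main__":
    # allreduce rows from the report above, threshold >= 405 GB/s
    for size, runs in [("1M", [24.9, 24.4, 24.9]), ("256M", [421.9, 421.1, 421.9])]:
        print(size, summarize_runs(runs, threshold=405))
```

Judging on the worst run rather than the mean keeps a single slow repetition from being averaged away.
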
--- a/reports_test_all_latest_summary_cn_20260523.md
+++ b/reports_test_all_latest_summary_cn_20260523.md
@ -0,0 +1,101 @@
+# H100 Single-Node `test all` Summary
+
+Generated: 2026-05-23  
+Test scope: single-node `python gpu_tester.py --test all --report --format md` on `aikubeworker0012` and `aikubeworker0016`
+
+Raw reports:
+
+- `reports_test_all_latest_aikubeworker0012_20260522_203246.md`
+- `reports_test_all_latest_aikubeworker0016_20260522_203447.md`
+
+## Overall Conclusion
+
+| Host | Suite | PDF acceptance verdict | Main failed items |
+|---|---:|---|---|
+| aikubeworker0012 | 6/10 PASS | FAIL | Compute, NCCL, Stress, RDMA |
+| aikubeworker0016 | 6/10 PASS | FAIL | Compute, NCCL, Stress, RDMA |
+
+Under the PDF criteria, a FAIL on any mandatory sub-item fails the whole machine, so neither machine currently passes production acceptance.
+
+## Passed Items
+
+| Item | aikubeworker0012 | aikubeworker0016 | Notes |
+|---|---|---|---|
+| GPU Info | PASS | PASS | 8x H100 |
+| Health | PASS | PASS | Temperature, idle power, ECC, PCIe, and idle throttle all normal |
+| Memory Bandwidth | PASS | PASS | D2D efficiency about 108.1% on both |
+| NVLink/NVSwitch | PASS | PASS | 18/18 links on all 8 GPUs |
+| DCGM diag -r 3 | PASS | PASS | software, memory, diagnostic, nvbandwidth, pcie, and targeted stress/power all PASS |
+| Training Simulation | PASS | PASS | 8-GPU DDP, synthetic 1.5B model, loss finite |
+
+Training results:
+
+| Host | Throughput | Step jitter | Peak memory | Verdict |
+|---|---:|---:|---:|---|
+| aikubeworker0012 | 216498 tokens/s | 1.89% | 18.08 GB | PASS |
+| aikubeworker0016 | 216683 tokens/s | 1.20% | 18.08 GB | PASS |
+
+## Failed Items
+
+### Compute
+
+Neither machine reaches the currently configured H100 absolute TFLOPS thresholds, and the cross-GPU spread exceeds 3% for some dtypes.
+
+| Host | Representative failures |
+|---|---|
+| aikubeworker0012 | FP16 spread 3.04%, BF16 spread 4.58%, FP64 spread 3.41%; absolute thresholds FAIL for FP32/TF32/FP16/BF16/FP8/FP64/INT8 |
+| aikubeworker0016 | BF16 spread 3.44%, FP64 spread 4.64%; absolute thresholds FAIL for FP32/TF32/FP16/BF16/FP8/FP64/INT8 |
+
+### NCCL
+
+NCCL results now come from real `nccl-tests` bus BW, not the torchrun fallback. The failures come mainly from the small message size, plus some ops that miss the threshold at 256M or 2G.
+
+| Host | allreduce best | alltoall best | broadcast best | reducescatter best | allgather best | sendrecv best | Verdict |
+|---|---:|---:|---:|---:|---:|---:|---|
+| aikubeworker0012 | 472.3 | 343.3 | 364.1 | 352.8 | 366.4 | 369.0 | FAIL |
+| aikubeworker0016 | 472.4 | 344.3 | 363.6 | 353.1 | 366.4 | 368.9 | FAIL |
+
+Key reasons:
+
+- At the `1M` size, every op is far below its threshold.
+- `reducescatter` and `allgather` are also below the 405 GB/s threshold at 2G.
+- `broadcast` and `sendrecv` are below the 360 GB/s threshold at 256M.
+
+### Stress
+
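+The 1800-second PyTorch BF16 GEMM stress test ran for its full duration on both machines, but the telemetry verdict is FAIL.
+
+| Host | Avg steady-state power | Max temperature range | Temp delta | TFLOPS jitter | throttle events | XID | Verdict |
+|---|---|---|---:|---:|---:|---:|---|
+| aikubeworker0012 | about 697-698W/GPU | 56-68C | 12C | 4.37% | 9712 | 0 | FAIL |
+| aikubeworker0016 | about 698W/GPU | 51-62C | 11C | 3.40% | 9944 | 0 | FAIL |
+
+Failure reasons:
+
+- The temperature delta across GPUs exceeds the 5C threshold.
+- A large number of non-idle throttle samples were observed; the first reason is `0x4`, i.e. `sw_power_cap`.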
+
+### RDMA/InfiniBand
+
+This round of `test all` exercises the single-node RDMA path, and `ibping` reports `local_loopback`. These results cannot replace cross-node RDMA acceptance, but they still show that single-node perftest read bandwidth misses the threshold.
+
+| Host | ib_write_bw | ib_read_bw | ib_write_lat | ib_read_lat | Verdict |
+|---|---:|---:|---:|---:|---|
+| aikubeworker0012 | 49.5 GB/s PASS | 39.1 GB/s FAIL | 1.25 us PASS | 2.60 us PASS | FAIL |
+| aikubeworker0016 | 48.6 GB/s PASS | 40.3 GB/s FAIL | 1.29 us PASS | 2.59 us PASS | FAIL |
+
+In addition, on both machines `mlx5_4` and `mlx5_5` are ACTIVE but run at 100 Gb/sec, below the current 400G port threshold, so the RDMA port check also has FAILs.
+
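+## Current Blockers
+
+1. The Compute thresholds are strict: measured absolute TFLOPS miss the configured thresholds for every dtype, and the INT8 path in particular reaches only about 100 TFLOPS.
+2. Real NCCL bus BW is now measurable, but several op/size combinations miss the PDF thresholds.
+3. The stress load sustains the full 30 minutes, but the temperature delta and `sw_power_cap` throttle cause a FAIL.
+4. Single-node RDMA read bandwidth misses 47 GB/s, and some IB ports run below 400G.
+5. Cross-node RDMA still needs separate server/client reports; this round's `local_loopback` must not be treated as cross-node acceptance.
+
+## Status Assessment
+
+Script capability now largely matches the PDF acceptance criteria. Real nccl-tests, 30-minute stress telemetry, NVLink, DCGM r3, RDMA perftest/ibping/counters, per-GPU compute, 8-GPU DDP training, and a final verdict in which any FAIL fails the whole machine all run end to end.
+
+The remaining problems are mostly not missing script coverage; rather, several measured acceptance results on both machines miss their thresholds.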
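
The strict rule described in this summary (any failed required item fails the whole machine, and the standalone training reports in this commit also fail on missing evidence) can be sketched as below. This is a minimal illustration, not the actual `gpu_tester.py` logic. `REQUIRED_SUITES` and `overall_verdict` are hypothetical names; the suite names are copied from the Summary tables of the generated reports.

```python
# Hypothetical sketch of strict verdict aggregation; not the project's real code.
REQUIRED_SUITES = [
    "GPU Info", "Health Check", "Memory Bandwidth", "Compute Throughput",
    "NVLink/NVSwitch", "DCGM", "NCCL", "Stress Test", "RDMA", "Training",
]


def overall_verdict(results):
    """Return (verdict, problems) for a mapping of suite name -> status string.

    A suite counts as passed only if its status is exactly "PASS"; a missing
    suite is reported as missing evidence and also fails the machine.
    """
    problems = []
    for suite in REQUIRED_SUITES:
        status = results.get(suite)
        if status is None:
            problems.append(f"{suite}: missing required evidence")
        elif status != "PASS":
            problems.append(f"{suite}: {status}")
    return ("PASS" if not problems else "FAIL"), problems


if __name__ == "__main__":
    # Statuses as summarized above for both machines: 6 of 10 suites pass.
    latest = {name: "PASS" for name in REQUIRED_SUITES}
    for failed in ("Compute Throughput", "NCCL", "Stress Test", "RDMA"):
        latest[failed] = "FAIL"
    verdict, problems = overall_verdict(latest)
    print(verdict)
    for line in problems:
        print(" -", line)
```

Treating a missing suite as a failure is what keeps a partial run, such as a training-only report, from reading as an acceptance PASS.
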
--- a/reports_test_all_pdf_aikubeworker0012_20260522_182656.md
+++ b/reports_test_all_pdf_aikubeworker0012_20260522_182656.md
@ -0,0 +1,259 @@
+# GPU Test Report
+
+- **Date:** 2026-05-22T18:27:01.103760
+- **Host:** aikubeworker0012
+- **GPU:** NVIDIA H100 80GB HBM3 x8
+- **Driver:** 580.159.03 | **CUDA:** 13.0
+
+## Overall Acceptance Verdict
+
+**Result: FAIL**
+
+Failed or unverified items:
+- Compute Throughput: FAIL (worst FP32 52 vs >= 54)
+- DCGM: ERROR: dcgmi diag -r 3 timeout after 1200s
+- NCCL: FAIL
+- Stress Test: FAIL
+- RDMA: FAIL
+- Training: FAIL (188741 tokens/sec)
+
+## Summary
+
+| Test | Result |
+|------|--------|
+| GPU Info | PASS (8 GPUs detected) |
+| Health Check | PASS |
+| Memory Bandwidth | PASS (108.1%) |
+| Compute Throughput | FAIL (worst FP32 52 vs >= 54) |
+| NVLink/NVSwitch | PASS |
+| DCGM | ERROR: dcgmi diag -r 3 timeout after 1200s |
+| NCCL | FAIL |
+| Stress Test | FAIL |
+| RDMA | FAIL |
+| Training | FAIL (188741 tokens/sec) |
+
+## GPU Information
+
+| GPU | Model | VRAM | Temp | Power | SM Clock |
+|-----|-------|------|------|-------|----------|
+| 0 | NVIDIA H100 80GB HBM3 | 81559 MB | 25C | 70/700W | 345 MHz |
+| 1 | NVIDIA H100 80GB HBM3 | 81559 MB | 25C | 73/700W | 345 MHz |
+| 2 | NVIDIA H100 80GB HBM3 | 81559 MB | 26C | 69/700W | 345 MHz |
+| 3 | NVIDIA H100 80GB HBM3 | 81559 MB | 24C | 70/700W | 345 MHz |
+| 4 | NVIDIA H100 80GB HBM3 | 81559 MB | 24C | 69/700W | 345 MHz |
+| 5 | NVIDIA H100 80GB HBM3 | 81559 MB | 27C | 70/700W | 345 MHz |
+| 6 | NVIDIA H100 80GB HBM3 | 81559 MB | 25C | 71/700W | 345 MHz |
+| 7 | NVIDIA H100 80GB HBM3 | 81559 MB | 24C | 72/700W | 345 MHz |
+
+## Health Check
+
+**Overall: PASS**
+
+| GPU | Temp | Power | ECC | PCIe | Throttle | Status |
+|-----|------|-------|-----|------|----------|--------|
+| 0 | 25C PASS | 70W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
+| 1 | 25C PASS | 73W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
+| 2 | 26C PASS | 69W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
+| 3 | 24C PASS | 70W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
+| 4 | 24C PASS | 69W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
+| 5 | 27C PASS | 70W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
+| 6 | 25C PASS | 71W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
+| 7 | 24C PASS | 72W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
+
+## Memory Bandwidth
+
+Source: nvbandwidth
+
+| Metric | Value | Peak | Efficiency |
+|--------|-------|------|------------|
+| H2D (PCIe) | 55.5 GB/s | 64 GB/s | 86.7% |
+| D2H (PCIe) | 54.3 GB/s | 64 GB/s | 84.8% |
+| D2D (NVLink) | 486.6 GB/s | 450 GB/s | 108.1% |
+
+**Verdict: PASS** (D2D efficiency 108.1%)
+
+## Compute Throughput
+
+| DType | Achieved (TFLOPS) | Peak | Threshold | Status |
+|-------|-------------------|------|------------|--------|
+| FP32 | 52.0 | 67 | >= 54 | FAIL |
+| TF32 | 364.8 | 495 | >= 444 | FAIL |
+| FP16 | 685.0 | 990 | >= 734 | FAIL |
+| BF16 | 715.9 | 990 | >= 745 | FAIL |
+| FP8 | 1166.6 | 1979 | >= 1400 | FAIL |
+| FP64 | 46.9 | 0 | >= 63 | FAIL |
+| INT8 | 100.4 | 0 | >= 1536 | FAIL |
+
+**Verdict: FAIL** (absolute TFLOPS thresholds; worst efficiency 58.9%)
+
+### Compute Consistency
+
+| DType | Min | Mean | Max | Spread | Limit | Status |
+|-------|-----|------|-----|--------|-------|--------|
+| FP32 | 51.9 | 52.0 | 52.2 | 0.58% | <= 3% | PASS |
+| TF32 | 360.9 | 364.9 | 368.2 | 2.00% | <= 3% | PASS |
+| FP16 | 676.0 | 685.0 | 689.9 | 2.03% | <= 3% | PASS |
+| BF16 | 697.3 | 715.9 | 730.2 | 4.60% | <= 3% | FAIL |
+| FP8 | 1141.8 | 1166.6 | 1180.3 | 3.30% | <= 3% | FAIL |
+| FP64 | 45.8 | 46.9 | 47.7 | 4.05% | <= 3% | FAIL |
+| INT8 | 100.4 | 100.4 | 100.4 | 0.00% | <= 3% | PASS |
+
+### Compute Per-GPU TFLOPS
+
+| GPU | FP32 | TF32 | FP16 | BF16 | FP8 | FP64 | INT8 |
+|---|---|---|---|---|---|---|---|
+| 0 | 51.9 | 368.2 | 689.5 | 730.2 | 1180.3 | 47.1 | 100.4 |
+| 1 | 51.9 | 366.8 | 688.7 | 721.6 | 1170.1 | 47.7 | 100.4 |
+| 2 | 51.9 | 366.3 | 689.9 | 711.3 | 1167.8 | 47.2 | 100.4 |
+| 3 | 51.9 | 363.0 | 677.6 | 699.2 | 1176.3 | 46.6 | 100.4 |
+| 4 | 52.2 | 365.3 | 685.0 | 725.4 | 1163.0 | 46.8 | 100.4 |
+| 5 | 52.1 | 363.9 | 684.2 | 725.0 | 1172.1 | 46.9 | 100.4 |
+| 6 | 51.9 | 364.4 | 688.8 | 717.3 | 1161.2 | 46.9 | 100.4 |
+| 7 | 51.9 | 360.9 | 676.0 | 697.3 | 1141.8 | 45.8 | 100.4 |
+
+## NVLink/NVSwitch
+
+**Overall: PASS**
+
+| GPU | Active Links | Issues |
+|-----|--------------|--------|
+| 0 | 18/18 | OK |
+| 1 | 18/18 | OK |
+| 2 | 18/18 | OK |
+| 3 | 18/18 | OK |
+| 4 | 18/18 | OK |
+| 5 | 18/18 | OK |
+| 6 | 18/18 | OK |
+| 7 | 18/18 | OK |
+
+## DCGM Diagnostic
+
+**Overall: FAIL** (dcgmi diag -r 3 timeout after 1200s)
+
+## NCCL Multi-GPU
+
+Source: nccl-tests | GPUs: 8
+
+| Operation | Bus BW (GB/s) | Threshold | Status |
+|-----------|---------------|-----------|--------|
+| allreduce | 472.4 | >= 405 | FAIL |
+| alltoall | 344.4 | >= 315 | FAIL |
+| broadcast | 363.8 | >= 360 | FAIL |
+| reducescatter | 353.0 | >= 405 | FAIL |
+| allgather | 366.4 | >= 405 | FAIL |
+| sendrecv | 368.9 | >= 360 | FAIL |
+
+### NCCL allreduce by size
+
+| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
+|------|---------------------|-------|------|--------|-----------|--------|
+| 1M | 24.0, 24.9, 24.7 | 24.0 | 24.5 | 1.57% | >= 405 | FAIL |
+| 256M | 421.4, 421.7, 421.4 | 421.4 | 421.5 | 0.03% | >= 405 | PASS |
+| 2G | 471.8, 473.0, 472.3 | 471.8 | 472.4 | 0.10% | >= 405 | PASS |
+
+### NCCL alltoall by size
+
+| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
+|------|---------------------|-------|------|--------|-----------|--------|
+| 1M | 8.1, 8.0, 8.0 | 8.0 | 8.0 | 0.59% | >= 315 | FAIL |
+| 256M | 312.3, 310.9, 319.2 | 310.9 | 314.1 | 1.15% | >= 315 | FAIL |
+| 2G | 343.1, 346.2, 344.0 | 343.1 | 344.4 | 0.38% | >= 315 | PASS |
+
+### NCCL broadcast by size
+
+| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
+|------|---------------------|-------|------|--------|-----------|--------|
+| 1M | 14.6, 13.6, 14.5 | 13.6 | 14.2 | 3.16% | >= 360 | FAIL |
+| 256M | 343.8, 344.2, 344.5 | 343.8 | 344.2 | 0.08% | >= 360 | FAIL |
+| 2G | 363.5, 363.3, 364.7 | 363.3 | 363.8 | 0.17% | >= 360 | PASS |
+
+### NCCL reducescatter by size
+
+| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
+|------|---------------------|-------|------|--------|-----------|--------|
+| 1M | 14.1, 14.3, 14.3 | 14.1 | 14.2 | 0.66% | >= 405 | FAIL |
+| 256M | 328.1, 328.3, 328.3 | 328.1 | 328.2 | 0.03% | >= 405 | FAIL |
+| 2G | 354.0, 352.6, 352.3 | 352.3 | 353.0 | 0.21% | >= 405 | FAIL |
+
+### NCCL allgather by size
+
+| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
+|------|---------------------|-------|------|--------|-----------|--------|
+| 1M | 14.5, 14.5, 14.3 | 14.3 | 14.4 | 0.65% | >= 405 | FAIL |
+| 256M | 350.7, 350.7, 350.5 | 350.5 | 350.6 | 0.03% | >= 405 | FAIL |
+| 2G | 366.6, 366.3, 366.3 | 366.3 | 366.4 | 0.04% | >= 405 | FAIL |
+
+### NCCL sendrecv by size
+
+| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
+|------|---------------------|-------|------|--------|-----------|--------|
+| 1M | 18.5, 18.4, 18.1 | 18.1 | 18.3 | 0.93% | >= 360 | FAIL |
+| 256M | 352.3, 350.6, 350.5 | 350.5 | 351.1 | 0.24% | >= 360 | FAIL |
+| 2G | 368.8, 369.0, 368.8 | 368.8 | 368.9 | 0.03% | >= 360 | PASS |
+
+**Overall: FAIL**
+
+## Stress Test
+
+- **Source:** pytorch
+- **Duration:** 1800s (requested 1800s)
+- **Telemetry samples:** 1541
+- **Max temp:** {0: 60.0, 1: 60.0, 2: 68.0, 3: 56.0, 4: 60.0, 5: 68.0, 6: 65.0, 7: 56.0}
+- **Avg power:** {0: 697.7, 1: 697.4, 2: 697.2, 3: 697.7, 4: 697.5, 5: 698.0, 6: 697.8, 7: 698.4}
+- **Temp delta:** 12.0 C
+- **TFLOPS jitter:** 3.16%
+- **Steady TFLOPS samples:** 37676
+- **Throttle events:** 11912
+- **XID events:** 0
+- **Failure reasons:**
+  - GPU temperature delta 12.0C exceeds 5.0C
+  - non-idle throttle reasons observed in 11912 samples (first: GPU 0 0x4)
+- **Result: FAIL**
+
+## RDMA/InfiniBand
+
+### RDMA Port Checks
+
+| Device | Port | State | Rate | Required | Status |
+|--------|------|-------|------|----------|--------|
+| mlx5_0 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
+| mlx5_1 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
+| mlx5_4 | 1 | 4: ACTIVE | 100 Gb/sec (2X HDR) | >= 400Gbps ACTIVE | FAIL |
+| mlx5_5 | 1 | 4: ACTIVE | 100 Gb/sec (2X HDR) | >= 400Gbps ACTIVE | FAIL |
+| mlx5_6 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
+| mlx5_7 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
+
+| Test | Value | Threshold | Status |
+|------|-------|-----------|--------|
+| ib_write_bw | 49.2 GB/s | >= 47 GB/s | PASS |
+| ib_read_bw | 39.1 GB/s | >= 47 GB/s | FAIL |
+| ib_write_lat | 5.68 us | <= 2 us | FAIL |
+| ib_read_lat | 16.00 us | <= 3.5 us | FAIL |
+| ibping | target=0x58 count=5 | 0% packet loss | PASS |
+
+- **PFC/ECN/CNP/congestion counters checked:** 0
+- **PFC/ECN/CNP/congestion non-zero:** no
+- **Failure reasons:**
+  - mlx5_4 port 1 state/rate failed (4: ACTIVE, 100 Gb/sec (2X HDR); required >= 400.0Gbps ACTIVE)
+  - mlx5_5 port 1 state/rate failed (4: ACTIVE, 100 Gb/sec (2X HDR); required >= 400.0Gbps ACTIVE)
+  - ib_read_bw bandwidth 39.11GB/s < 47GB/s
+  - ib_write_lat latency 5.68us > 2.0us
+  - ib_read_lat latency 16.0us > 3.5us
+**Overall: FAIL**
+
+## Training Simulation
+
+| Metric | Value |
+|--------|-------|
+| Model | synthetic_transformer_1.5b |
+| Params | 1470.5M |
+| Throughput | 188741 tokens/sec |
+| Avg Step Time | 86.8 ms |
+| Peak Memory | 18.1 GB |
+| Final Loss | 0.0041 |
+| Step Jitter | 626.74% |
+| Distributed Mode | ddp |
+| Verdict | FAIL (188741 tokens/sec) |
+
+---
+*Generated by GPU Test Suite v0.2.0*
--- a/reports_test_all_pdf_aikubeworker0016_20260522_182856.md
+++ b/reports_test_all_pdf_aikubeworker0016_20260522_182856.md
@ -0,0 +1,259 @@
+# GPU Test Report
+
+- **Date:** 2026-05-22T18:29:01.245683
+- **Host:** aikubeworker0016
+- **GPU:** NVIDIA H100 80GB HBM3 x8
+- **Driver:** 580.159.03 | **CUDA:** 13.0
+
+## Overall Acceptance Verdict
+
+**Result: FAIL**
+
+Failed or unverified items:
+- Compute Throughput: FAIL (worst FP32 52 vs >= 54)
+- DCGM: ERROR: dcgmi diag -r 3 timeout after 1200s
+- NCCL: FAIL
+- Stress Test: FAIL
+- RDMA: FAIL
+- Training: FAIL (193836 tokens/sec)
+
+## Summary
+
+| Test | Result |
+|------|--------|
+| GPU Info | PASS (8 GPUs detected) |
+| Health Check | PASS |
+| Memory Bandwidth | PASS (108.1%) |
+| Compute Throughput | FAIL (worst FP32 52 vs >= 54) |
+| NVLink/NVSwitch | PASS |
+| DCGM | ERROR: dcgmi diag -r 3 timeout after 1200s |
+| NCCL | FAIL |
+| Stress Test | FAIL |
+| RDMA | FAIL |
+| Training | FAIL (193836 tokens/sec) |
+
+## GPU Information
+
+| GPU | Model | VRAM | Temp | Power | SM Clock |
+|-----|-------|------|------|-------|----------|
+| 0 | NVIDIA H100 80GB HBM3 | 81559 MB | 19C | 70/700W | 345 MHz |
+| 1 | NVIDIA H100 80GB HBM3 | 81559 MB | 20C | 67/700W | 345 MHz |
+| 2 | NVIDIA H100 80GB HBM3 | 81559 MB | 20C | 67/700W | 345 MHz |
+| 3 | NVIDIA H100 80GB HBM3 | 81559 MB | 19C | 67/700W | 345 MHz |
+| 4 | NVIDIA H100 80GB HBM3 | 81559 MB | 19C | 67/700W | 345 MHz |
+| 5 | NVIDIA H100 80GB HBM3 | 81559 MB | 21C | 69/700W | 345 MHz |
+| 6 | NVIDIA H100 80GB HBM3 | 81559 MB | 19C | 68/700W | 345 MHz |
+| 7 | NVIDIA H100 80GB HBM3 | 81559 MB | 19C | 66/700W | 345 MHz |
+
+## Health Check
+
+**Overall: PASS**
+
+| GPU | Temp | Power | ECC | PCIe | Throttle | Status |
+|-----|------|-------|-----|------|----------|--------|
+| 0 | 19C PASS | 70W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
+| 1 | 20C PASS | 67W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
+| 2 | 20C PASS | 67W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
+| 3 | 19C PASS | 67W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
+| 4 | 19C PASS | 67W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
+| 5 | 21C PASS | 69W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
+| 6 | 19C PASS | 68W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
+| 7 | 19C PASS | 66W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
+
+## Memory Bandwidth
+
+Source: nvbandwidth
+
+| Metric | Value | Peak | Efficiency |
+|--------|-------|------|------------|
+| H2D (PCIe) | 55.5 GB/s | 64 GB/s | 86.7% |
+| D2H (PCIe) | 54.7 GB/s | 64 GB/s | 85.5% |
+| D2D (NVLink) | 486.6 GB/s | 450 GB/s | 108.1% |
+
+**Verdict: PASS** (D2D efficiency 108.1%)
+
+## Compute Throughput
+
+| DType | Achieved (TFLOPS) | Peak | Threshold | Status |
+|-------|-------------------|------|------------|--------|
+| FP32 | 52.0 | 67 | >= 54 | FAIL |
+| TF32 | 366.2 | 495 | >= 444 | FAIL |
+| FP16 | 684.8 | 990 | >= 734 | FAIL |
+| BF16 | 720.7 | 990 | >= 745 | FAIL |
+| FP8 | 1180.3 | 1979 | >= 1400 | FAIL |
+| FP64 | 47.3 | 0 | >= 63 | FAIL |
+| INT8 | 100.5 | 0 | >= 1536 | FAIL |
+
+**Verdict: FAIL** (absolute TFLOPS thresholds; worst efficiency 59.6%)
+
+### Compute Consistency
+
+| DType | Min | Mean | Max | Spread | Limit | Status |
+|-------|-----|------|-----|--------|-------|--------|
+| FP32 | 51.9 | 52.0 | 52.2 | 0.58% | <= 3% | PASS |
+| TF32 | 361.1 | 366.2 | 368.9 | 2.13% | <= 3% | PASS |
+| FP16 | 672.6 | 684.8 | 695.0 | 3.27% | <= 3% | FAIL |
+| BF16 | 703.6 | 720.7 | 734.2 | 4.25% | <= 3% | FAIL |
+| FP8 | 1158.6 | 1180.3 | 1241.8 | 7.05% | <= 3% | FAIL |
+| FP64 | 46.7 | 47.3 | 48.0 | 2.75% | <= 3% | PASS |
+| INT8 | 100.4 | 100.5 | 101.1 | 0.70% | <= 3% | PASS |
+
+### Compute Per-GPU TFLOPS
+
+| GPU | FP32 | TF32 | FP16 | BF16 | FP8 | FP64 | INT8 |
+|---|---|---|---|---|---|---|---|
+| 0 | 51.9 | 361.1 | 673.3 | 703.6 | 1158.6 | 46.7 | 100.4 |
+| 1 | 52.0 | 367.0 | 684.0 | 725.7 | 1184.3 | 47.3 | 100.4 |
+| 2 | 52.2 | 368.7 | 695.0 | 734.2 | 1197.7 | 48.0 | 100.4 |
+| 3 | 51.9 | 367.8 | 688.0 | 708.1 | 1174.8 | 47.3 | 100.4 |
+| 4 | 52.0 | 365.2 | 688.4 | 718.2 | 1160.5 | 47.0 | 101.1 |
+| 5 | 52.1 | 368.9 | 684.2 | 733.7 | 1160.5 | 47.3 | 100.4 |
+| 6 | 51.9 | 364.0 | 672.6 | 715.6 | 1164.4 | 47.1 | 100.4 |
+| 7 | 51.9 | 367.0 | 692.5 | 726.5 | 1241.8 | 47.6 | 100.4 |
+
+## NVLink/NVSwitch
+
+**Overall: PASS**
+
+| GPU | Active Links | Issues |
+|-----|--------------|--------|
+| 0 | 18/18 | OK |
+| 1 | 18/18 | OK |
+| 2 | 18/18 | OK |
+| 3 | 18/18 | OK |
+| 4 | 18/18 | OK |
+| 5 | 18/18 | OK |
+| 6 | 18/18 | OK |
+| 7 | 18/18 | OK |
+
+## DCGM Diagnostic
+
+**Overall: FAIL** (dcgmi diag -r 3 timeout after 1200s)
+
+## NCCL Multi-GPU
+
+Source: nccl-tests | GPUs: 8
+
+| Operation | Bus BW (GB/s) | Threshold | Status |
+|-----------|---------------|-----------|--------|
+| allreduce | 472.5 | >= 405 | FAIL |
+| alltoall | 344.2 | >= 315 | FAIL |
+| broadcast | 363.8 | >= 360 | FAIL |
+| reducescatter | 352.5 | >= 405 | FAIL |
+| allgather | 366.8 | >= 405 | FAIL |
+| sendrecv | 369.0 | >= 360 | FAIL |
+
+### NCCL allreduce by size
+
+| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
+|------|---------------------|-------|------|--------|-----------|--------|
+| 1M | 24.7, 24.1, 24.5 | 24.1 | 24.4 | 1.02% | >= 405 | FAIL |
+| 256M | 421.8, 422.1, 421.4 | 421.4 | 421.8 | 0.07% | >= 405 | PASS |
+| 2G | 472.8, 472.2, 472.6 | 472.2 | 472.5 | 0.05% | >= 405 | PASS |
+
+### NCCL alltoall by size
+
+| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
+|------|---------------------|-------|------|--------|-----------|--------|
+| 1M | 8.0, 8.0, 7.9 | 7.9 | 8.0 | 0.59% | >= 315 | FAIL |
+| 256M | 326.8, 315.4, 315.8 | 315.4 | 319.3 | 1.65% | >= 315 | PASS |
+| 2G | 344.2, 343.8, 344.6 | 343.8 | 344.2 | 0.09% | >= 315 | PASS |
+
+### NCCL broadcast by size
+
+| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
+|------|---------------------|-------|------|--------|-----------|--------|
+| 1M | 14.4, 14.2, 14.1 | 14.1 | 14.2 | 0.88% | >= 360 | FAIL |
+| 256M | 345.3, 344.9, 344.4 | 344.4 | 344.9 | 0.11% | >= 360 | FAIL |
+| 2G | 363.6, 363.9, 363.8 | 363.6 | 363.8 | 0.03% | >= 360 | PASS |
+
+### NCCL reducescatter by size
+
+| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
+|------|---------------------|-------|------|--------|-----------|--------|
+| 1M | 14.3, 14.1, 14.1 | 14.1 | 14.2 | 0.67% | >= 405 | FAIL |
+| 256M | 328.2, 328.3, 328.4 | 328.2 | 328.3 | 0.02% | >= 405 | FAIL |
+| 2G | 352.2, 352.7, 352.6 | 352.2 | 352.5 | 0.06% | >= 405 | FAIL |
+
+### NCCL allgather by size
+
+| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
+|------|---------------------|-------|------|--------|-----------|--------|
+| 1M | 14.2, 14.5, 14.3 | 14.2 | 14.3 | 0.87% | >= 405 | FAIL |
+| 256M | 350.6, 350.6, 350.5 | 350.5 | 350.6 | 0.01% | >= 405 | FAIL |
+| 2G | 367.0, 366.8, 366.5 | 366.5 | 366.8 | 0.06% | >= 405 | FAIL |
+
+### NCCL sendrecv by size
+
+| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
+|------|---------------------|-------|------|--------|-----------|--------|
+| 1M | 18.4, 18.2, 18.6 | 18.2 | 18.4 | 0.89% | >= 360 | FAIL |
+| 256M | 350.7, 350.8, 351.1 | 350.7 | 350.9 | 0.05% | >= 360 | FAIL |
+| 2G | 369.0, 369.0, 368.9 | 368.9 | 369.0 | 0.01% | >= 360 | PASS |
+
+**Overall: FAIL**
+
+## Stress Test
+
+- **Source:** pytorch
+- **Duration:** 1800s (requested 1800s)
+- **Telemetry samples:** 1541
+- **Max temp:** {0: 51.0, 1: 59.0, 2: 62.0, 3: 53.0, 4: 53.0, 5: 62.0, 6: 57.0, 7: 53.0}
+- **Avg power:** {0: 698.7, 1: 698.0, 2: 698.1, 3: 697.9, 4: 697.7, 5: 698.2, 6: 698.0, 7: 697.7}
+- **Temp delta:** 11.0 C
+- **TFLOPS jitter:** 3.05%
+- **Steady TFLOPS samples:** 37841
+- **Throttle events:** 11912
+- **XID events:** 0
+- **Failure reasons:**
+  - GPU temperature delta 11.0C exceeds 5.0C
+  - non-idle throttle reasons observed in 11912 samples (first: GPU 0 0x4)
+- **Result: FAIL**
+
+## RDMA/InfiniBand
+
+### RDMA Port Checks
+
+| Device | Port | State | Rate | Required | Status |
+|--------|------|-------|------|----------|--------|
+| mlx5_0 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
+| mlx5_1 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
+| mlx5_4 | 1 | 4: ACTIVE | 100 Gb/sec (2X HDR) | >= 400Gbps ACTIVE | FAIL |
+| mlx5_5 | 1 | 4: ACTIVE | 100 Gb/sec (2X HDR) | >= 400Gbps ACTIVE | FAIL |
+| mlx5_6 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
+| mlx5_7 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
+
+| Test | Value | Threshold | Status |
+|------|-------|-----------|--------|
+| ib_write_bw | 48.4 GB/s | >= 47 GB/s | PASS |
+| ib_read_bw | 40.3 GB/s | >= 47 GB/s | FAIL |
+| ib_write_lat | 2.44 us | <= 2 us | FAIL |
+| ib_read_lat | 16.00 us | <= 3.5 us | FAIL |
+| ibping | target=0x4b count=5 | 0% packet loss | PASS |
+
+- **PFC/ECN/CNP/congestion counters checked:** 0
+- **PFC/ECN/CNP/congestion non-zero:** no
+- **Failure reasons:**
+  - mlx5_4 port 1 state/rate failed (4: ACTIVE, 100 Gb/sec (2X HDR); required >= 400.0Gbps ACTIVE)
+  - mlx5_5 port 1 state/rate failed (4: ACTIVE, 100 Gb/sec (2X HDR); required >= 400.0Gbps ACTIVE)
+  - ib_read_bw bandwidth 40.29GB/s < 47GB/s
+  - ib_write_lat latency 2.44us > 2.0us
+  - ib_read_lat latency 16.0us > 3.5us
+**Overall: FAIL**
+
+## Training Simulation
+
+| Metric | Value |
+|--------|-------|
+| Model | synthetic_transformer_1.5b |
+| Params | 1470.5M |
+| Throughput | 193836 tokens/sec |
+| Avg Step Time | 84.5 ms |
+| Peak Memory | 18.1 GB |
+| Final Loss | 0.004 |
+| Step Jitter | 521.24% |
+| Distributed Mode | ddp |
+| Verdict | FAIL (193836 tokens/sec) |
+
+---
+*Generated by GPU Test Suite v0.2.0*
--- a/reports_training_warmup_aikubeworker0012_20260522_194528.md
+++ b/reports_training_warmup_aikubeworker0012_20260522_194528.md
@ -0,0 +1,43 @@
+# GPU Test Report
+
+- **Date:** 2026-05-22T19:46:07.450315
+- **Host:** aikubeworker0012
+
+## Overall Acceptance Verdict
+
+**Result: FAIL**
+
+Missing required evidence:
+- GPU Info
+- Health Check
+- Memory Bandwidth
+- Compute Throughput
+- NVLink/NVSwitch
+- NCCL
+- Stress Test
+- RDMA
+- DCGM
+
+## Summary
+
+| Test | Result |
+|------|--------|
+| Training | PASS (216654 tokens/sec) |
+
+## Training Simulation
+
+| Metric | Value |
+|--------|-------|
+| Model | synthetic_transformer_1.5b |
+| Params | 1470.5M |
+| Throughput | 216654 tokens/sec |
+| Avg Step Time | 75.6 ms |
+| Warmup Steps | 5 |
+| Peak Memory | 18.1 GB |
+| Final Loss | 0.0039 |
+| Step Jitter | 0.87% |
+| Distributed Mode | ddp |
+| Verdict | PASS (216654 tokens/sec) |
+
+---
+*Generated by GPU Test Suite v0.2.0*
--- a/reports_training_warmup_aikubeworker0016_20260522_194609.md
+++ b/reports_training_warmup_aikubeworker0016_20260522_194609.md
@ -0,0 +1,43 @@
+# GPU Test Report
+
+- **Date:** 2026-05-22T19:46:48.023650
+- **Host:** aikubeworker0016
+
+## Overall Acceptance Verdict
+
+**Result: FAIL**
+
+Missing required evidence:
+- GPU Info
+- Health Check
+- Memory Bandwidth
+- Compute Throughput
+- NVLink/NVSwitch
+- NCCL
+- Stress Test
+- RDMA
+- DCGM
+
+## Summary
+
+| Test | Result |
+|------|--------|
+| Training | PASS (217236 tokens/sec) |
+
+## Training Simulation
+
+| Metric | Value |
+|--------|-------|
+| Model | synthetic_transformer_1.5b |
+| Params | 1470.5M |
+| Throughput | 217236 tokens/sec |
+| Avg Step Time | 75.4 ms |
+| Warmup Steps | 5 |
+| Peak Memory | 18.1 GB |
+| Final Loss | 0.0039 |
+| Step Jitter | 1.23% |
+| Distributed Mode | ddp |
+| Verdict | PASS (217236 tokens/sec) |
+
+---
+*Generated by GPU Test Suite v0.2.0*
--- a/test_all_aikubeworker0016_中文结果与验收差距.md
+++ b/test_all_aikubeworker0016_中文结果与验收差距.md
@ -0,0 +1,73 @@
+# aikubeworker0016 `test all` Results and H100 Acceptance Gaps
+
+Test command:
+
+```bash
+/root/gpu-test-venv/bin/python gpu_tester.py --test all --report --format json --output reports_all/test_all.json
+```
+
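+Test machine: `aikubeworker0016 / 172.72.8.16`
+
+Raw results: `reports_all_aikubeworker0016.json`
+
+## Conclusion First
+
+The project output ends with `Suite complete: 8/8 tests passed`, but that line cannot be taken as a production acceptance PASS.
+
+The current `all` summary logic mainly checks whether a module raised an `error`; it does not treat `nccl.passed=false` or `rdma.passed=false` as a suite failure. Under the PDF production acceptance criteria, this machine therefore cannot yet be considered fully accepted.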
+
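+## Actual Results of This `test all` Run
+
+| Module | Current result | Key data | Against PDF acceptance |
+| --- | --- | --- | --- |
+| GPU info | Covered | 8x H100, Driver 580.159.03, CUDA 13.0 | Basic info OK, but dedicated NVLink link checks are insufficient |
+| Health check | PASS | health.passed=true | Basic health OK, but missing retired pages, AER/Replay, fabricmanager logs, and sampling during stress |
+| Memory | Has results | H2D 55.5 GB/s, D2H 55.3 GB/s, D2D 486.5 GB/s | Individual numbers look good, but 8x8 P2P matrix acceptance is missing |
+| Compute | Has results | FP32 51.9, TF32 357.0, FP16 664.0, BF16 700.1, FP8 1116.2 TFLOPS | Does not fully pass the PDF absolute thresholds |
+| NCCL | Effectively failing | source=torchrun_fallback, `nccl.passed=false`, no bus BW performance data | Does not meet PDF NCCL performance acceptance |
+| Stress | PASS | PyTorch fallback, 60 seconds, all 8 GPUs PASS | Does not meet the PDF 30/60-minute burn-in; the load is only about 64MB per GPU, clearly too light |
+| RDMA/IB | Effectively failing | ib_write_bw/read_bw 0.13 GB/s WARN; write_lat 4.10us PASS; read_lat 16us WARN | Currently a localhost single-node setup; does not meet PDF RDMA production acceptance |
+| Training | Has results | synthetic 1.47B, 52471 tokens/s, peak 27.31GB, loss 0.0041 | tokens/s clears the bar, but the code does not actually perform 8-GPU distributed training acceptance |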
+
+## Compute Against the PDF Thresholds
+
+PDF H100 PASS thresholds:
+
+| DType | This run | PDF PASS threshold | Verdict |
+| --- | ---: | ---: | --- |
+| FP32 | 51.9 TFLOPS | >= 54 | WARN |
+| TF32 | 357.0 TFLOPS | >= 444 | FAIL |
+| FP16 | 664.0 TFLOPS | >= 734 | WARN |
+| BF16 | 700.1 TFLOPS | >= 745 | WARN |
+| FP8 | 1116.2 TFLOPS | >= 1400 | FAIL |
+| FP64 | Not tested | >= 63 | Missing |
+| INT8 | Not tested | >= 1536 | Missing |
+
+Note: in the PDF, the WARN range is 90%-100% of the PASS threshold. TF32 and FP8 fall below 90% of the threshold, so they are FAIL under the PDF.
+
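+## What Is Missing If Only the Current Repository `test all` Is Run
+
+1. Missing dedicated NVLink acceptance: no per-GPU check of 18 links, 25GB/s rate, and CRC/Replay/Recovery error = 0.
+2. Missing DCGM diagnostics: no `dcgmi diag -r 3`.
+3. Missing long burn-in: currently 60 seconds, not 30/60 minutes.
+4. Missing 1-second sampling during stress: temperature, power, throttle, XID, and TFLOPS jitter are not collected as the PDF requires.
+5. Missing real NCCL performance: the run currently degrades to the torchrun fallback, with no `nccl-tests` bus BW.
+6. Missing full NCCL op and three-size coverage: the PDF requires AllReduce/AllGather/ReduceScatter/Broadcast/SendRecv/AllToAll, all passing at 1MB/256MB/2GB.
+7. Missing NCCL 3x repetition with worst-value selection and standard deviation <=3%.
+8. Missing full 8x8 P2P matrix: no off-diagonal mean, minimum, or deviation check.
+9. Missing per-GPU compute consistency: the 8 GPUs are not actually measured individually for same-dtype range/mean <=3%.
+10. Missing FP64 and INT8.
+11. Missing production RDMA parameters: currently `localhost`, 64KB message, 10us threshold; the PDF requires 4MB BW, 8B latency, write/read >=47GB/s, write_lat <=2us, read_lat <=3.5us.
+12. Missing PFC/ECN error counters and bidirectional ibping.
+13. Missing real 8-GPU distributed Training Simulation acceptance.
+14. Missing a strict final verdict: the current code counts modules with `passed=false` as passed, which is a flaw in the acceptance logic.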
+
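+## Recommendations
+
+`test all` can keep serving as a quick first-pass screen, but aligning with `H100_production_acceptance.pdf` requires upgrading it to a production acceptance mode. Priorities:
+
+1. First fix the summary verdict: any submodule with `passed=false` must fail the whole machine.
+2. Install `nccl-tests` and `gpu-burn` first; otherwise neither NCCL nor Stress runs under production criteria.
+3. Add NVLink, DCGM, long-duration telemetry, and the P2P matrix.
+4. Switch RDMA to production parameters and support cross-node runs.
+5. Change compute/training to per-GPU and 8-GPU distributed acceptance.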
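
The compute section of this note states that the PDF's WARN band is 90%-100% of the PASS threshold. A minimal sketch of that three-way classification, using the thresholds and measurements from the compute table in this note, might look like the following. `classify` and `warn_ratio` are hypothetical names, and treating exactly 90% of the threshold as WARN is an assumption, since the note does not say which side the boundary falls on.

```python
# Thresholds and measurements (TFLOPS) copied from the compute table in this note.
PASS_THRESHOLDS = {"FP32": 54, "TF32": 444, "FP16": 734, "BF16": 745, "FP8": 1400}
MEASURED = {"FP32": 51.9, "TF32": 357.0, "FP16": 664.0, "BF16": 700.1, "FP8": 1116.2}


def classify(measured, threshold, warn_ratio=0.90):
    """Classify one measurement as PASS, WARN, or FAIL.

    Hypothetical helper: PASS at or above the threshold, WARN from
    warn_ratio * threshold up to the threshold, FAIL below that.
    """
    if measured >= threshold:
        return "PASS"
    if measured >= warn_ratio * threshold:
        return "WARN"
    return "FAIL"


if __name__ == "__main__":
    for dtype, threshold in PASS_THRESHOLDS.items():
        value = MEASURED[dtype]
        ratio = value / threshold * 100.0
        print(f"{dtype}: {value} vs >= {threshold} ({ratio:.1f}%) -> {classify(value, threshold)}")
```

Under a strict final verdict, WARN would still need an explicit policy decision; this sketch only reproduces the per-dtype labels.
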