Add H100 acceptance test coverage and reports

2026-05-23 10:41:09 +08:00 · 86f15544d7
commit 86f15544d7
parent dd77a882f1
44 changed files with 6938 additions and 190 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -15,3 +15,4 @@ reports/
 venv/
 .qoder/*
 .claude/settings.local.json
 .omx/
--- a/H100_test_all_vs_PDF_覆盖对比.md
+++ b/H100_test_all_vs_PDF_覆盖对比.md
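@@ -0,0 +1,85 @@
 # H100 PDF acceptance items vs. current `test all` coverage
 Compared:
 - PDF: `/Users/d-robotics/Downloads/H100_production_acceptance.pdf`
 - Current script: `python gpu_tester.py --config configs/default.yaml --test all --report --format md`
 - Scope: single node, 8× H100. Cross-node NCCL/RDMA is out of scope for this round.
 ## Conclusion
 `test all` has grown from a "functional inspection" into a single-node suite that is "close to production acceptance": GPU health, NVLink/NVSwitch, HBM/PCIe/NVLink bandwidth, compute, NCCL, stress, local RDMA ports, DCGM, and training simulation all run under the same `all`.
 The latest stress smoke run confirms that the PyTorch BF16 GEMM load drives both machines into the power range the PDF requires:
 - `aikubeworker0012`: 45-second smoke run, steady-state average power about `697-698W/card`, TFLOPS jitter `4.07%`, XID `0`, but temperature delta `12C` and `clocks_throttle_reasons.active=0x4`, so a strict FAIL under the PDF.
 - `aikubeworker0016`: 45-second smoke run, steady-state average power about `697-699W/card`, TFLOPS jitter `3.77%`, XID `0`, but temperature delta `8C` and `clocks_throttle_reasons.active=0x4`, so a strict FAIL under the PDF.
 In other words, the blocker is no longer "the script cannot saturate the H100". Under full-power load the machines miss two strict PDF gates: `temperature delta <=5C` and `Throttle Reasons 0x0 throughout`.
 For a final acceptance judged strictly by the PDF, the following is still missing:
 1. 24-hour metrics are not covered: the PDF requires the 24h SBE growth rate and long steady-state observation; the current `all` is a single snapshot plus 30 minutes of stress, which is not equivalent to a 24-hour burn-in.
 2. Cross-node items are deliberately not tested this round: production IB/RDMA acceptance in the PDF normally needs two-sided `ib_write_bw/read_bw/lat` and `ibping`; as requested, this round is single-node only, so cross-node is excluded.
 3. PFC/ECN/AER coverage depends on the system counters the machine exposes: the script reads whatever sysfs counters and dmesg entries it can find, but if switch-side PFC/ECN is not exposed on the host, the network team still has to supply evidence.
 4. The NCCL 1MB size fails the strict threshold: measured 1MB AllReduce bus BW is about 23 GB/s, while 256MB AllReduce has been verified with `nccl-tests` at about 421 GB/s; if the PDF requires 405 GB/s at 1MB as well, this item is not "untested" but will be judged FAIL.
 5. Stress already meets the power and jitter requirements, but the short runs have exposed strict FAILs on temperature delta and throttle; the full 1800-second run will only provide more formal evidence and will not change this verdict by itself.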
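 ## Coverage table
 | PDF acceptance item | Current `test all` status | What is still missing |
 |---|---:|---|
 | GPU basic info, Driver/CUDA | Covered | Nothing; driver, CUDA, and GPU model are recorded |
 | Temperature limits: steady state ≤75C, peak ≤85C | Health snapshot covered; stress item covers ≤80C | A 24h steady-state curve is not part of a single `all` run |
 | idle power ≤100W/card | Partially covered | health samples power, but the idle criterion is not yet a standalone acceptance item |
 | stress power ≥630W/card | Covered; short runs on both machines reached about 697-699W/card | Full 1800-second run still pending |
 | throttle reasons active=0x0 | Covered; short runs on both machines showed 0x4 | Strict FAIL under the PDF; this is not an item the script skips |
 | DBE/SBE/retired pages | Partially covered | Retired pages and kernel errors are checked; the 24h SBE growth rate is not covered |
 | PCIe Gen5 x16 | Partially covered | Visible in GPU info/topology; Replay/AER depends on dmesg/sysfs and may need extra motherboard-side evidence |
 | Fabric Manager active with no ERROR | Covered | Nothing; health checks systemd and the journal |
 | NVLink: 18 links/GPU, 25GB/s/link, zero errors | Covered | Nothing; new `nvlink` item |
 | D2D/H2D/D2H bandwidth | Covered | Depends on `nvbandwidth`, which is available on both machines |
 | 8x8 P2P matrix off-diagonal mean/min/deviation | Covered | Nothing; parsed from the nvbandwidth JSON |
 | Compute FP32/TF32/FP16/BF16/FP8/FP64/INT8 | Covered | INT8 uses the PyTorch `_int_mm` path; a vendor-standard INT8 kernel would need a different implementation |
 | NCCL AllReduce/AllGather/ReduceScatter/Broadcast/SendRecv/AllToAll | Covered | Nothing; `nccl-tests` is built on both machines |
 | NCCL 1MB/256MB/2GB, repeat 3, stddev ≤3% | Covered | Under the strict PDF thresholds the 1MB size will most likely FAIL; 256MB AllReduce measured about 421GB/s with `nccl-tests` on both machines |
 | Stress ≥30min, BF16/FP16 GEMM 8192, 1s telemetry | Covered; defaults to BF16 GEMM `24576`, 1s telemetry, steady-state judgment after warmup | Full 1800-second run pending; short runs already exposed temperature-delta/throttle FAIL |
 | DCGM `dcgmi diag -r 3` | Covered; DCGM 4.5.3 installed, service enabled | Full `-r 3` runs PASSED on both machines; logs at `/root/test_gpu_scripts/reports/dcgm_r3_*_20260522_17010*.log` |
 | RDMA port ACTIVE, 400Gbps | Partially covered | Ports can be checked on a single node; strict two-sided throughput/latency is not run this round |
 | RDMA write/read bw ≥47GB/s, latency ≤2/3.5us | Partially covered | Single-host localhost/perftest is not equivalent to cross-node line-rate acceptance |
 | PFC/ECN errors=0, ibping OK in both directions | Partially covered | Counters readable on the host are checked; switch-side counters and cross-node ibping are not covered |
 | 1.5B synthetic Transformer BF16, 8 GPUs, ≥45k tokens/s | DDP path covered | 8-process DDP smoke run passed; full 50-step run pending |
 | Any sub-item FAIL means overall acceptance FAIL | Covered | `all` now exits non-zero according to the strict verdict |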
 ## If you run `all` right now
 Recommended command:
 ```bash
 cd /root/test_gpu_scripts
 /root/gpu-test-venv/bin/python gpu_tester.py --config configs/default.yaml --test all --report --format json --output reports/h100_all_$(hostname)_$(date +%Y%m%d_%H%M%S).json
 ```
 To generate the Chinese Markdown report directly, use:
 ```bash
 cd /root/test_gpu_scripts
 /root/gpu-test-venv/bin/python gpu_tester.py --config configs/default.yaml --test all --report --format md --output reports/h100_all_$(hostname)_$(date +%Y%m%d_%H%M%S).md
 ```
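 Expected behavior:
 - Runs the full single-node suite; stress defaults to 1800 seconds, uses the PyTorch BF16 GEMM load by default, and samples telemetry/XID every 1 second.
 - The default stress matrix is `24576`, chosen to push the H100 to ≥630W/card; the PDF only requires `matrix_size >=8192`, so the larger size exists to meet the power gate.
 - NCCL runs 6 ops × 3 message sizes × 3 repeats.
 - DCGM runs `dcgmi diag -r 3 -n gpu:8 -j`; the DCGM toolchain is installed and running, and both `diag -r 1` and the standalone long `r3` runs on the two machines have PASSED.
 - The NCCL 1MB size also fails against the 405GB/s threshold; 256MB AllReduce has been verified through `nccl-tests` at about 421GB/s on both machines.
 - stress is expected to FAIL under the strict PDF criteria: current short-run evidence shows a temperature delta above 5C and throttle active at `0x4`.
 - Cross-node RDMA/NCCL is not part of this single-node `all`.
 ## Current minimal to-do list
 1. For strict RDMA production acceptance, run a two-sided server/client test across the two machines in the next round.
 2. Run the full 1.5B DDP 50-step training acceptance and archive tokens/s, jitter, memory, and loss.
 3. Run the full 1800-second stress and archive 1-second telemetry, XID, throttle, power, and temperature; the current expectation is a FAIL on temperature delta/throttle.
 4. For 24-hour acceptance, add a 24h monitor mode that records the SBE growth rate, XID, temperature, power, and the clock-throttling curve.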
--- a/H100验收_vs_test_all_差距分析.md
+++ b/H100验收_vs_test_all_差距分析.md
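@@ -0,0 +1,100 @@
 # H100 production acceptance standard vs. current `gpu_tester.py --test all` coverage gaps
 Reference file: `/Users/d-robotics/Downloads/H100_production_acceptance.pdf`
 Compared against: running `python gpu_tester.py --test all --report --format md/json` in the current repository
 ## Conclusion
 The repository's `test all` covers the broad categories of the acceptance document, but it is not yet a complete H100 production acceptance.
 It runs 8 modules: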
 1. GPU Information
 2. Health Check
 3. Memory Benchmark
 4. Compute Benchmark
 5. NCCL Test
 6. GPU Stress Test
 7. RDMA/IB Test
 8. Training Simulation
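 Measured against the PDF's production acceptance standard, these key items are still missing:
 - Per-link acceptance of active state, speed, and error counters for the 18 NVLinks on each GPU
 - DCGM `dcgmi diag -r 3`
 - 30-60 minute burn-in with 1-second sampling of temperature/power/throttle/XID
 - Performance acceptance with the official `nccl-tests`, including the three message sizes 1MB/256MB/2GB, 3 repeats with the worst value taken, and standard deviation
 - RDMA production criteria: 4MB bandwidth, 8B latency, PFC/ECN errors, bidirectional ibping
 - Per-GPU compute consistency across 8 GPUs, requiring range/mean <= 3% per dtype
 - FP64 and INT8 compute items
 - The training item should be an 8-GPU 1.5B synthetic Transformer, judged on 45k tokens/s, step jitter, memory, and loss health
 ## Coverage matrix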
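 | PDF acceptance item | Covered by `test all`? | Current coverage | Main gaps |
 | --- | --- | --- | --- |
 | 1. Health check | Partial | Temperature, power, ECC, PCIe, clocks, throttle, persistence, IB devices | Idle power <=100W is not judged separately; stress power >=630W is not judged; retired pages are not checked; the 24h SBE growth rate is not checked; AER/Replay errors are not checked; the fabricmanager service and its ERROR logs are not checked |
 | 2. NVLink topology and links | Partial | GPU info saves `nvidia-smi topo -m` | `nvidia-smi nvlink -s/-c/-e` is not run; 18 NVLinks per GPU is not verified; 25GB/s per link is not verified; CRC/Replay/Recovery error = 0 is not verified |
 | 3. Memory Bandwidth | Partial | Uses nvbandwidth to measure H2D, D2H, D2D write/read/bidir | The full 8x8 P2P matrix is not output; off-diagonal mean >=360GB/s, minimum >=320GB/s, and deviation from the mean <=±5% are not checked; the D2D methodology is not yet fully aligned with the PDF's single-GPU/P2P acceptance methodology |
 | 4. Compute Throughput | Mostly covered | Default config is already matrix_size=8192, warmup=50, iterations=500, use_compile=true; absolute H100 TFLOPS thresholds are in `gpu_specs.py` | Results are currently aggregate/single-process; the 8 GPUs are not measured individually to get the range/mean; FP64 and INT8 are not tested |
 | 5. NCCL Multi-GPU | Partial, tool-dependent | The code supports nccl-tests; if the binary is missing it falls back to a torchrun functional connectivity test | nccl-tests is not properly installed on the remote hosts, so the run degrades to a functional test and fails or yields no performance data; only allreduce/alltoall/broadcast are enabled by default, not allgather/reducescatter/sendrecv; message sizes are not the three points 1MB/256MB/2GB; there are no 3 repeats with the worst taken; no standard deviation is computed |
 | 6. Stress/Burn-in | Partial | Runs stress, 60 seconds by default; uses a PyTorch fallback when gpu-burn is absent | The PDF requires >=30min, with 60min recommended; large FP16/BF16 GEMM with matrix >=8192; per-minute TFLOPS jitter, temperature <=80, inter-GPU temperature delta <=5, power >=630W, throttle=0, XID=0; the current PyTorch fallback allocates only about 64MB per GPU, which is not enough load |
 | 7. DCGM diagnostics | Not covered | None | `dcgmi diag -r 3` is not executed, and the Software/Deployment/Hardware/Integration/Stress/Power sub-items are not parsed |
 | 8. RDMA/IB | Partial | Discovers IB devices and runs ib_write_bw/read_bw/write_lat/read_lat | The script uses `localhost`, not a cross-node peer; msg_size is 64KB, not 4MB; latency does not specify 8B; thresholds are 50GB/s and 10us rather than the PDF's write/read >=47GB/s, write_lat <=2us, read_lat <=3.5us; PFC/ECN and bidirectional ibping are not checked |
 | 9. Training Simulation | Partial | Runs GPT-2 or a synthetic transformer and outputs tokens/s, step time, memory, loss | The synthetic model has about 1.47B parameters but actually runs as a single process with `.cuda()`, not as 8-GPU distributed training; there are no hard verdicts on 45k tokens/s, step jitter <=±3%, peak <=70GB per GPU, or NaN/Inf |
 | 10. Overall verdict | Partial | The report has a summary | The `all` pass/fail logic leans toward "did the module raise an error" rather than the PDF rule that any sub-item FAIL bars the whole machine from production |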
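 ## What you get if you run `test all` right now
 You get a "single-node general check-up/benchmark report" containing:
 - Basic info, driver/CUDA, PCIe, memory, temperature, and power for the 8 H100s
 - Health check results
 - nvbandwidth summary bandwidth for H2D/D2H/D2D
 - FP32/TF32/FP16/BF16/FP8 compute throughput
 - NCCL test results; if nccl-tests is missing, this degrades to the torchrun fallback
 - 60-second stress results
 - Local localhost RDMA/IB results
 - Training simulation results
 This report works as a "quick smoke test plus single-node pre-screening". It cannot serve directly as a "production acceptance passed" report under the PDF standard.
 ## Pre-run status of the two machines
 Confirmed:
 - `nvbandwidth` is installed and can be called by the project scripts
 - The PyTorch CUDA environment is installed
 - The RDMA perftest tools are present
 - `nccl-tests` and `gpu-burn` are not yet set up to the PDF's production acceptance criteria
 Also, regarding the `test all` that I triggered by mistake earlier:
 - `aikubeworker0016` is already running single-node `test all` and has reached Training Simulation
 - `aikubeworker0012` did not start successfully
 ## Minimal list of additions needed to reach the PDF acceptance criteria
 1. Install/repair `nccl-tests` so that it reports real bus BW rather than the torchrun fallback.
 2. Install/repair `gpu-burn`, or turn the PyTorch stress into a truly high-occupancy FP16/BF16 GEMM that supports 30/60 minutes.
 3. Add a dedicated NVLink item: `nvidia-smi nvlink -s/-c/-e`, judged on 18 links per GPU, 25GB/s, error=0.
 4. Add a dedicated DCGM item: `dcgmi diag -r 3`, parsing sub-item PASS/FAIL.
 5. Add telemetry sampling: every 1 second during stress, sample temperature, power, throttle, and XID; compute steady-state power, temperature delta, and jitter.
 6. Change RDMA: support an explicit server/client, 4MB bandwidth, 8B latency, bidirectional ibping, and PFC/ECN counters.
 7. Change the NCCL config: enable all ops, use the three sizes 1MB/256MB/2GB, repeat 3 times, and take the worst value and the standard deviation.
 8. Change Compute: run on each GPU separately and compute the per-dtype range/mean; add FP64 and INT8.
 9. Change Training Simulation: make it explicitly an 8-GPU 1.5B synthetic distributed training run, with PASS/FAIL on tokens/s, step jitter, memory, and loss NaN/Inf.
 10. Change the final verdict: per the PDF rule, any sub-item FAIL fails the whole machine.
 ## Suggested execution strategy
 Running this right now: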
 ```bash
 /root/gpu-test-venv/bin/python gpu_tester.py --test all --report --format md --output reports_all/test_all.md
 ```
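 yields a "report on what the current repository's `all` covers".
 To use it for production acceptance, the gaps above must be closed first, especially `nccl-tests`, `gpu-burn`, NVLink, DCGM, long burn-in, and cross-node RDMA.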
--- a/README.md
+++ b/README.md
@@ -159,7 +159,7 @@ python3 gpu_tester.py
 [3]  Memory Benchmark (nvbandwidth)
 [4]  Compute Benchmark
 [5]  NCCL Multi-GPU Test
- [6]  GPU Stress Test (gpu-burn)
+ [6]  GPU Stress Test (PyTorch/gpu-burn)
 [7]  RDMA/IB Test
 [8]  Training Simulation
 [9]  Full Test Suite (All Tests)
@@ -279,33 +279,35 @@ python3 gpu_tester.py --config /path/to/config.yaml --test all
 | FP16 | 312 TFLOPS | 990 TFLOPS | 2,250 TFLOPS | 3,500 TFLOPS |
 | BF16 | 312 TFLOPS | 990 TFLOPS | 2,250 TFLOPS | 3,500 TFLOPS |
 | FP8 | N/A | 1,979 TFLOPS | 4,500 TFLOPS | 7,000 TFLOPS |
 | FP64 | 9.7 TFLOPS | 67 TFLOPS | TBD | TBD |
 | INT8 | 624 TOPS | 1,979 TOPS | TBD | TBD |
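-Default configuration: 4096×4096 matrix, 10 warmup iterations, 100 measured iterations.
+Default configuration: 8192×8192 matrix, 50 warmup iterations, 500 measured iterations; FP32/TF32/FP16/BF16/FP8/FP64/INT8 run on each GPU individually, and consistency is judged by the per-dtype range/mean.
 ### 5. NCCL Multi-GPU Test
-Prefers the official nccl-tests (invoked via mpirun); falls back to torchrun when unavailable.
+Prefers the official nccl-tests (invoked via mpirun) and parses the real bus BW; if only the torchrun fallback can run, the acceptance result is marked FAIL.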
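 | Operation | Description |
 |---|---|
 | AllReduce | The most commonly used collective |
 | AllToAll | Key operation for model parallelism |
 | Broadcast | Parameter synchronization |
-| ReduceScatter | Optional |
+| ReduceScatter | Required |
-| AllGather | Optional |
+| AllGather | Required |
-| SendRecv | Optional |
+| SendRecv | Required |
-Default message sizes range from 8B to 256MB, with 5 warmup iterations and 20 measured iterations.
+By default, following the PDF, three sizes are tested (1MB, 256MB, 2GB); each op is repeated 3 times, and the worst bus BW and the standard deviation are taken. A standard deviation above 3% is a FAIL.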
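 **NVLink reference bandwidth:** A100/A800 ≥ 240 GB/s | H100/H200 ≥ 360 GB/s | B200/B300 ≥ 720 GB/s (40% of NVLink peak)
 ### 6. GPU Stress Test
-Uses gpu-burn for long-running full-load testing to verify thermal stability and memory correctness.
+Uses PyTorch BF16/FP16 GEMM by default for long-running, high-power full-load testing; gpu-burn can also be enabled in the config. Temperature, power, throttle, and XID are sampled during the test, and steady-state power, temperature delta, and TFLOPS jitter are computed.
 | Parameter | Default | Description |
 |---|---|---|
-| duration_sec | 60 | Test duration (seconds) |
+| duration_sec | 1800 | Test duration (seconds) |
 | use_tensor_cores | true | Use Tensor Cores |
 | memory_pct | 90 | Memory usage percentage |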
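@@ -320,18 +322,18 @@ python3 gpu_tester.py --config /path/to/config.yaml --test all
 | Write latency | ib_write_lat |
 | Read latency | ib_read_lat |
-**Reference thresholds:** bandwidth ≥ 50 GB/s, latency ≤ 10 μs
+**Reference thresholds:** port ACTIVE and ≥400Gbps; 4MB write/read bandwidth ≥47GB/s; 8B write latency ≤2μs, read latency ≤3.5μs; PFC/ECN/CNP/congestion counters at 0.
 ### 8. Training Simulation
-Simulates a training workload with a real or synthetic model.
+Runs an 8-GPU DDP synthetic 1.5B Transformer training simulation by default.
 | Mode | Description |
 |---|---|
-| Real model | Loads HuggingFace GPT-2 (requires transformers) |
+| DDP synthetic model | About 1.5B parameters, 8-GPU torchrun |
-| Synthetic model | 6-layer Transformer (no extra dependencies) |
+| Single-process fallback | For debugging only; counts as FAIL for production acceptance |
-Output: tokens/sec, step time, peak memory, final loss.
+Output: tokens/sec, step time, post-warmup step jitter, peak memory, final loss, plus a check that the loss is not NaN/Inf.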
 ---
@@ -351,14 +353,14 @@ benchmark:
    nvbandwidth_buffer_mb: 512          # nvbandwidth buffer size
    nvbandwidth_samples: 3              # nvbandwidth sample count
  compute:
-    dtypes: [fp32, tf32, fp16, bf16, fp8]
+    dtypes: [fp32, tf32, fp16, bf16, fp8, fp64, int8]
-    matrix_size: 4096                   # GEMM matrix dimension
+    matrix_size: 8192                   # GEMM matrix dimension
-    warmup: 10
+    warmup: 50
-    iterations: 100
+    iterations: 500
 health:
-  temp_warning: 80                      # temperature warning threshold, °C
+  temp_warning: 75                      # temperature warning threshold, °C
-  temp_critical: 90                     # temperature critical threshold, °C
+  temp_critical: 85                     # temperature critical threshold, °C
  power_limit: null                     # null = auto-match the GPU TDP
 nccl:
@@ -366,26 +368,62 @@ nccl:
  test_allreduce: true
  test_alltoall: true
  test_broadcast: true
  test_reduce_scatter: true
  test_allgather: true
  test_sendrecv: true
  message_sizes: [1M, 256M, 2G]
  repeats: 3
  max_stddev_pct: 3
 stress:
-  duration_sec: 60                     # stress test duration
+  duration_sec: 1800                   # stress test duration
  use_gpu_burn: false                  # defaults to the PyTorch GEMM stress
  dtype: bf16
  matrix_size: 24576
  telemetry_interval_sec: 1
  min_power_watts: 630
  max_tflops_jitter_pct: 5
  require_tflops_jitter: true
  use_tensor_cores: true
 rdma:
-  min_bandwidth_gbps: 50              # minimum acceptable RDMA bandwidth
+  min_bandwidth_gbps: 47              # minimum acceptable RDMA bandwidth
-  max_latency_us: 10                  # maximum acceptable RDMA latency
+  min_port_rate_gbps: 400             # minimum IB port rate
-  msg_size: 65536                     # test message size
+  max_write_latency_us: 2.0
  max_read_latency_us: 3.5
  msg_size: 4194304                   # 4MB bandwidth test message
  latency_msg_size: 8                 # 8B latency test message
  server_addr: null                   # peer IP for perftest in client mode
  ibping_target: null                 # ibping peer LID/GID, not an IP
  role: auto                          # auto / server / client
  pfc_ecn_counters: true
 nvlink:
  expected_links_per_gpu: 18
  expected_link_speed_gbps: 25
  require_zero_errors: true
 dcgm:
  diag_level: 3
  timeout_sec: 3600
  expected_num_gpus: 8
  json_output: true
  require_subtests: true
 training:
-  model: gpt2                          # HuggingFace model name
+  model: synthetic_1.5b                # 8-GPU synthetic Transformer
  batch_size: 8
  seq_length: 2048
  num_steps: 50
  warmup_steps: 5
  dtype: bf16
  mode: ddp
  min_tokens_per_sec: 45000
  max_step_jitter_pct: 3
 report:
  output_dir: ./reports
-  format: json                         # json or html
+  format: json                         # json / html / md
 ```
 ---
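@@ -493,9 +531,11 @@ report:
 Step 2: RDMA network test
 ├── python3 gpu_tester.py --test rdma
 ├── Confirm: IB devices are detected
-├── Confirm: port state Active
+├── Confirm: port state ACTIVE and ≥400Gbps
-├── Confirm: write bandwidth ≥ 50 GB/s
+├── Confirm: 4MB write/read bandwidth ≥47 GB/s
-├── Confirm: latency ≤ 10 μs
+├── Confirm: 8B write latency ≤2 μs, read latency ≤3.5 μs
 ├── Confirm: ibping works in both directions
 ├── Confirm: PFC/ECN/CNP/congestion counters are 0
 └── On failure: check IB cables, switch configuration, and the subnet manager
 Step 3: Multi-node NCCL test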
--- a/docs/h100_test_all_metrics_guide_cn.md
+++ b/docs/h100_test_all_metrics_guide_cn.md
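@@ -0,0 +1,255 @@
 # H100 `test all` metrics guide
 This document explains what each metric in the `gpu_tester.py --test all` report means, what it represents for acceptance, and what to investigate first when it is abnormal.
 Applicable reports:
 - `reports_test_all_latest_aikubeworker0012_20260522_203246.md`
 - `reports_test_all_latest_aikubeworker0016_20260522_203447.md`
 - `reports_test_all_latest_summary_cn_20260523.md`
 ## Overall verdict
 | Metric | Meaning | How to read it |
 |---|---|---|
 | `Overall Acceptance Verdict` | Whole-machine acceptance conclusion | Under the PDF production acceptance rules, any mandatory sub-item FAIL means the whole machine FAILs |
 | `Suite complete: x/10 tests passed` | How many of the 10 test modules passed | A quick view of overall health; the final word is `Overall Acceptance Verdict` |
 | `PASS` | Meets the currently configured threshold | The metric passes under the current test methodology |
 | `FAIL` | Misses the currently configured threshold, or evidence is insufficient | The item cannot be used as evidence of passing production acceptance |
 | `WARN` | Used by older reports or for non-mandatory warnings | Under the current PDF production acceptance, a key performance miss should be treated as FAIL |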
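The "any mandatory sub-item FAIL fails the whole machine" rule can be sketched as a small aggregation function. This is only an illustration: `overall_verdict`, `exit_code`, and the status strings are hypothetical names, not the actual interface of `gpu_tester.py`. It assumes `WARN` and missing evidence are both treated as failures, in line with the table above.

```python
from typing import Dict


def overall_verdict(results: Dict[str, str]) -> str:
    """Return "PASS" only when every mandatory sub-item reports "PASS"."""
    if not results:
        # No evidence at all is treated as a failure (assumption).
        return "FAIL"
    all_passed = all(status == "PASS" for status in results.values())
    return "PASS" if all_passed else "FAIL"


def exit_code(verdict: str) -> int:
    """Map the verdict to a process exit code: 0 for PASS, non-zero otherwise."""
    return 0 if verdict == "PASS" else 1


# Hypothetical per-module statuses, for illustration only.
sample = {"health": "PASS", "nvlink": "PASS", "nccl": "FAIL", "stress": "WARN"}
verdict = overall_verdict(sample)
print(verdict, exit_code(verdict))
```

Treating anything other than an explicit `PASS` as a failure keeps the aggregation conservative: a module that crashed or was skipped cannot silently count as passing.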
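 ## GPU Info
 GPU Info is a basic inventory item that confirms the machine's hardware, driver, and CUDA environment match expectations.
 | Metric | Meaning | Impact when abnormal |
 |---|---|---|
 | GPU count | Number of GPUs the system currently detects | If an 8-GPU H100 machine does not show 8 GPUs, none of the later multi-GPU tests can be trusted |
 | GPU model | GPU model, for example H100 | A wrong model invalidates thresholds, peaks, and acceptance criteria |
 | Driver version | NVIDIA driver version | An outdated version may affect CUDA, NCCL, DCGM, and NVLink tooling |
 | CUDA version | CUDA runtime version or the version supported by the driver | A CUDA mismatch can break PyTorch, nccl-tests, or build tools |
 | GPU UUID / PCI bus id | Unique GPU identifier and PCIe topology location | Used to locate the specific faulty card and its slot and link |
 This item usually says nothing direct about performance. It confirms that the test ran on the intended machine, the intended GPUs, and the intended software stack.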
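 ## Health Check
 Health Check is a basic health inspection under idle or light load.
 | Metric | Meaning | How to read it |
 |---|---|---|
 | Temperature | Current GPU temperature | A high idle temperature may indicate a cooling, airflow, or ambient temperature problem |
 | Power | Current power draw | Abnormally high idle power may indicate leftover processes or an abnormal power state |
 | ECC errors | Memory error-correction errors | Many single-bit errors, or any double-bit error, usually call for a close look at hardware stability |
 | PCIe | PCIe generation and width, for example Gen5 x16 | Reduced speed or width affects CPU-GPU, RDMA, and some data-movement performance |
 | Throttle | Whether throttling is currently active | A non-idle throttle reason while idle is abnormal and may affect later performance |
 | XID / NVRM events | Driver or GPU error events | A new XID usually indicates a hardware, driver, power-delivery, or kernel-level problem |
 Health PASS only says the basic state is normal. It does not guarantee that full-load performance meets the bar.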
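 ## Memory Bandwidth
 Memory Bandwidth measures data-movement capability: CPU to GPU, GPU to CPU, and GPU to GPU.
 | Metric | Meaning | What it represents |
 |---|---|---|
 | H2D | Host to Device, bandwidth from CPU memory to GPU memory | Affected by PCIe, NUMA, CPU memory, and the driver |
 | D2H | Device to Host, bandwidth from GPU memory to CPU memory | Affected by PCIe, NUMA, CPU memory, and the driver |
 | D2D | Device to Device, GPU-to-GPU bandwidth | On a single multi-GPU node this is mainly determined by NVLink/NVSwitch |
 | Efficiency | Measured value as a fraction of the theoretical value or configured threshold | A quick check of whether the expected bandwidth is reached |
 H2D/D2H mainly show whether PCIe and the CPU-side path are healthy. D2D is closer to the foundation for multi-GPU training, NCCL, and P2P communication.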
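 ## Compute Throughput
 Compute Throughput measures the GPU's matrix-compute throughput in different numeric formats, usually in TFLOPS.
 | Metric | Meaning | Typical use |
 |---|---|---|
 | FP32 | 32-bit floating-point performance | Traditional scientific computing, some model training and validation |
 | TF32 | TensorFloat-32 Tensor Core performance | The common FP32 acceleration path on NVIDIA Ampere/Hopper |
 | FP16 | 16-bit floating-point Tensor Core performance | Widely used for deep-learning training and inference |
 | BF16 | bfloat16 Tensor Core performance | Widely used for large-model training; wider dynamic range than FP16, so numerically more robust |
 | FP8 | 8-bit floating-point Tensor Core performance | Newer low-precision training/inference acceleration |
 | FP64 | 64-bit double-precision performance | HPC, scientific computing, simulation |
 | INT8 | 8-bit integer performance | Inference, quantized models |
 | Achieved | Measured throughput | The closer to peak, the better |
 | Peak | Theoretical or spec-sheet peak | Used to compute efficiency |
 | Threshold | Current acceptance threshold | Below the threshold is a FAIL |
 | Efficiency | `Achieved / Peak` | Measures achieved utilization |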
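 ### Compute Consistency
 Consistency checks whether performance is balanced across GPUs for the same dtype.
 | Metric | Meaning | What an anomaly means |
 |---|---|---|
 | Min | Measured value of the slowest of the 8 GPUs | Used to find the card that drags the rest down |
 | Mean | Average across the 8 GPUs | Shows the overall level |
 | Max | Measured value of the fastest of the 8 GPUs | Combined with Min to compute dispersion |
 | Spread | `(Max - Min) / Mean` | Reflects the performance difference between cards |
 A Spread above the threshold usually means some cards are affected by temperature, power, PCIe, background load, clock policy, or hardware state. Even when the average looks fine, a large inter-card difference slows down distributed training.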
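As a worked example of the Spread formula `(Max - Min) / Mean`, the sketch below computes it as a percentage for one dtype. The function name `spread_pct` and the sample TFLOPS values are made up for illustration; the 3% limit follows the per-dtype range/mean criterion quoted from the PDF elsewhere in this commit.

```python
from statistics import mean
from typing import Sequence


def spread_pct(per_gpu_tflops: Sequence[float]) -> float:
    """Return (max - min) / mean as a percentage for one dtype."""
    avg = mean(per_gpu_tflops)
    return (max(per_gpu_tflops) - min(per_gpu_tflops)) / avg * 100.0


# Hypothetical BF16 results for 8 GPUs, in TFLOPS.
bf16 = [702.0, 698.0, 705.0, 700.0, 699.0, 701.0, 675.0, 703.0]
value = spread_pct(bf16)
print(f"spread = {value:.2f}%  ->  {'PASS' if value <= 3.0 else 'FAIL'}")
```

In this made-up sample a single slow card (675 TFLOPS) is enough to push the spread above 3%, even though the mean barely moves.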
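 ## NVLink / NVSwitch
 The NVLink/NVSwitch test confirms that the high-speed GPU interconnect is complete, runs at the right speed, and has clean error counters.
 | Metric | Meaning | How to read it |
 |---|---|---|
 | Active Links | Number of currently active NVLinks per GPU | For 8-GPU H100 SXM systems the usual expectation is 18 per GPU |
 | Expected Links | Configured expected link count | Even one missing link can affect topology and NCCL performance |
 | Link speed | Speed of a single link | A wrong speed indicates link degradation or a detection problem |
 | Error counters | NVLink error counters, for example CRC/replay/recovery | Non-zero values may indicate link quality or hardware problems |
 NVLink PASS means the link state looks normal, but NCCL can still miss its thresholds because of the algorithm, topology, message size, NCCL parameters, or system noise.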
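A minimal sketch of the three NVLink checks (link count, per-link speed, zero errors). It assumes the link data has already been parsed into plain dictionaries; the structure, field names, and `check_nvlink` itself are hypothetical and do not reflect the real `nvlink_test` module or the exact `nvidia-smi nvlink` output format.

```python
from typing import Dict, List


def check_nvlink(
    gpu_links: Dict[int, List[dict]],
    expected_links: int = 18,
    min_speed_gbs: float = 25.0,
    require_zero_errors: bool = True,
) -> List[str]:
    """Return a list of human-readable failures; an empty list means PASS."""
    failures = []
    for gpu, links in sorted(gpu_links.items()):
        active = [link for link in links if link.get("active")]
        if len(active) != expected_links:
            failures.append(
                f"GPU{gpu}: {len(active)} active links, expected {expected_links}"
            )
        for link in active:
            if link["speed_gbs"] < min_speed_gbs:
                failures.append(f"GPU{gpu} link {link['id']}: speed below threshold")
            if require_zero_errors and link.get("errors", 0) != 0:
                failures.append(f"GPU{gpu} link {link['id']}: non-zero error counter")
    return failures


# Hypothetical parsed data: 8 GPUs, 18 healthy links each.
healthy = {
    gpu: [{"id": i, "active": True, "speed_gbs": 26.5, "errors": 0} for i in range(18)]
    for gpu in range(8)
}
print(check_nvlink(healthy))  # an empty list means every check passed
```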
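 ## DCGM Diagnostic
 DCGM is NVIDIA's official diagnostic tool. `dcgmi diag -r 3` is a fairly thorough, production-grade diagnostic level.
 | Sub-item | Meaning |
 |---|---|
 | Deployment/software | Checks the driver, libraries, and system software dependencies |
 | Hardware/memory | GPU memory health check |
 | Hardware/diagnostic | Basic GPU hardware diagnostics |
 | Hardware/nvbandwidth | GPU/NVLink/NVSwitch bandwidth diagnostics |
 | Integration/pcie | PCIe integration and link checks |
 | Stress/targeted_stress | DCGM's built-in targeted stress test |
 | Stress/targeted_power | DCGM's built-in targeted power stress test |
 | summary | Summary result for the category |
 DCGM PASS is strong evidence that the official diagnostics found no obvious hardware fault. It does not replace the project's NCCL, RDMA, long-duration telemetry, and training simulation acceptance.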
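 ## NCCL Multi-GPU
 The NCCL test measures collective communication across the GPUs of a single node. It directly affects multi-GPU training efficiency.
 | Metric | Meaning | Why it matters |
 |---|---|---|
 | source | Where the result came from | Only `nccl-tests` gives a real bus BW; `torchrun_fallback` shows functional connectivity only and is not a performance acceptance |
 | bus BW | Bus-equivalent bandwidth reported by NCCL | Measures whether communication saturates NVLink/NVSwitch |
 | message size | Message size, for example 1M, 256M, 2G | Small messages reflect latency and scheduling; medium and large messages reflect bandwidth |
 | repeats | Number of repetitions | Reduces random fluctuation; currently 3 samples |
 | worst bus BW | Worst value across the repetitions | Production acceptance cares most about the worst case |
 | mean bus BW | Average across the repetitions | Reflects the stable level |
 | stddev | Standard deviation or fluctuation | Large fluctuation means communication is not stable enough |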
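The worst/mean/stddev columns can be derived from the repeated runs as sketched below. This is an assumption-laden illustration: the source does not say whether the standard deviation is a population or sample value, or how it is normalized, so this sketch uses the population standard deviation divided by the mean. `summarize_busbw` and the sample numbers are hypothetical; only the 405 GB/s threshold and the 3% limit come from this document.

```python
from statistics import mean, pstdev
from typing import Dict, Sequence


def summarize_busbw(
    samples_gbs: Sequence[float],
    threshold_gbs: float,
    max_stddev_pct: float = 3.0,
) -> Dict[str, object]:
    """Summarize repeated bus BW samples for one op and one message size."""
    avg = mean(samples_gbs)
    worst = min(samples_gbs)
    # Assumption: relative standard deviation = population stddev / mean.
    stddev_pct = pstdev(samples_gbs) / avg * 100.0
    passed = worst >= threshold_gbs and stddev_pct <= max_stddev_pct
    return {"worst": worst, "mean": avg, "stddev_pct": stddev_pct, "passed": passed}


# Hypothetical three repeats of a 256MB AllReduce, in GB/s.
print(summarize_busbw([421.0, 420.0, 422.0], threshold_gbs=405.0))
```

Judging on the worst sample rather than the mean matches the "worst bus BW" row above: one slow repeat is enough to fail the size.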
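 ### What each NCCL op means
 | Op | Meaning | Typical scenario |
 |---|---|---|
 | allreduce | Every GPU holds a copy of the data; after reduction every GPU gets the result | The most common operation for data-parallel gradient synchronization |
 | allgather | Every GPU collects the data shards of all GPUs | Model parallelism, tensor parallelism, parameter/activation gathering |
 | reducescatter | Reduce first, then split the result across GPUs | ZeRO, optimizer state sharding, common in distributed training |
 | broadcast | One GPU broadcasts data to the others | Parameter synchronization, initial weight distribution |
 | sendrecv | Point-to-point send and receive | Pipeline parallelism, custom communication, topology verification |
 | alltoall | Every GPU exchanges different data with every other GPU | MoE, expert parallelism, shuffle-style communication |
 NCCL small-message failures are usually about latency, scheduling, or a strict threshold; large-message failures point more toward link bandwidth, topology, NCCL parameters, or NVSwitch/PCIe/NUMA configuration.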
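 ## Stress Test
 Stress Test is a long-duration, high-load stability test. It is not only about whether the run completes; temperature, power, throttling, and error events during full load all count.
 | Metric | Meaning | How to read it |
 |---|---|---|
 | duration | Actual stress test duration | Production acceptance usually requires 30/60 minutes |
 | source | Load source, for example `pytorch` or `gpu-burn` | Says which workload stresses the GPUs |
 | dtype | Data type of the stress computation, for example BF16 | Affects Tensor Core usage, power, and temperature |
 | matrix_size | GEMM matrix side length | Larger sizes make sustained high occupancy easier to reach |
 | memory_pct | Target memory usage percentage | Avoids testing only a tiny load |
 | Avg steady power | Steady-state average power | Shows whether the cards are really being pushed |
 | Max steady temp | Steady-state maximum temperature | Shows the cooling ceiling |
 | Temp delta | Difference between the hottest and coolest of the 8 GPUs | A large difference points to uneven airflow, cooling, or slot position |
 | TFLOPS jitter | Steady-state throughput fluctuation | Large fluctuation means unstable performance |
 | Throttle events | Number of throttle events | Non-idle throttling affects performance stability |
 | XID events | New XID errors during the stress run | An XID is usually a serious risk |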
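The sketch below shows one way the steady-state metrics in this table could be turned into a verdict. The telemetry record layout, the function name `stress_failures`, and the exact definitions of temperature delta and jitter are assumptions made for illustration; the thresholds are the defaults that appear in `DEFAULT_CONFIG` in this commit (630 W, 80 C, 5 C, 5%, 60 s warmup).

```python
from statistics import mean
from typing import List, Sequence


def stress_failures(
    samples: Sequence[dict],
    tflops_series: Sequence[float],
    xid_count: int = 0,
    warmup_sec: float = 60.0,
    min_power_w: float = 630.0,
    max_temp_c: float = 80.0,
    max_temp_delta_c: float = 5.0,
    max_jitter_pct: float = 5.0,
) -> List[str]:
    """Return the names of failed gates; an empty list means PASS."""
    steady = [s for s in samples if s["t"] >= warmup_sec]
    if not steady:
        return ["no_steady_samples"]
    gpus = range(len(steady[0]["power_w"]))
    avg_power = [mean(s["power_w"][g] for s in steady) for g in gpus]
    peak_temp = [max(s["temp_c"][g] for s in steady) for g in gpus]
    failures = []
    if min(avg_power) < min_power_w:
        failures.append("power")
    if max(peak_temp) > max_temp_c:
        failures.append("temp")
    # Assumption: delta = hottest GPU peak minus coolest GPU peak.
    if max(peak_temp) - min(peak_temp) > max_temp_delta_c:
        failures.append("temp_delta")
    # Strict reading: any non-zero throttle bitmask in steady state fails.
    if any(mask != 0 for s in steady for mask in s["throttle"]):
        failures.append("throttle")
    # Assumption: jitter = (max - min) / mean of the TFLOPS series.
    jitter = (max(tflops_series) - min(tflops_series)) / mean(tflops_series) * 100.0
    if jitter > max_jitter_pct:
        failures.append("jitter")
    if xid_count != 0:
        failures.append("xid")
    return failures


# Hypothetical telemetry: 120 one-second samples for 8 GPUs.
temps = [60, 62, 64, 66, 68, 70, 71, 72]
telemetry = [
    {"t": t, "power_w": [698.0] * 8, "temp_c": temps, "throttle": [0x4] * 8}
    for t in range(120)
]
print(stress_failures(telemetry, tflops_series=[700.0, 710.0, 690.0]))
```

With these made-up samples, power and jitter pass while the 12 C temperature delta and the `0x4` throttle mask fail.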
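 ### Common throttle reason codes
 | Code | Common meaning | Explanation |
 |---|---|---|
 | `0x1` | `gpu_idle` | Throttled because the GPU is idle; usually not a real problem |
 | `0x2` | `applications_clocks_setting` | Clocks limited by the applications clocks setting |
 | `0x4` | `sw_power_cap` | Software power cap reached; performance may be limited by the power wall |
 | `0x8` | `hw_slowdown` | Hardware-triggered slowdown |
 | `0x10` | `sync_boost` | Clocks held down to stay in step with other GPUs in a sync boost group |
 | `0x20` | `sw_thermal_slowdown` | Slowdown triggered by the software thermal policy |
 | `0x40` | `hw_thermal_slowdown` | Hardware slowdown triggered by temperature |
 | `0x80` | `hw_power_brake_slowdown` | External power supply or hardware power protection (power brake) |
 The `sw_power_cap` in the current reports means the load really did reach the power wall, but the acceptance criteria count any non-idle throttle as a failure reason because it affects sustained, stable output.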
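The throttle value is a bitmask, so several reasons can be active at once. The sketch below decodes it using the bit values from NVML's clocks-throttle-reason constants; the short names are illustrative labels, and `decode_throttle` is a hypothetical helper rather than part of the project.

```python
from typing import List

# Bit values as defined by NVML's clocks throttle reason constants.
THROTTLE_BITS = {
    0x1: "gpu_idle",
    0x2: "applications_clocks_setting",
    0x4: "sw_power_cap",
    0x8: "hw_slowdown",
    0x10: "sync_boost",
    0x20: "sw_thermal_slowdown",
    0x40: "hw_thermal_slowdown",
    0x80: "hw_power_brake_slowdown",
    0x100: "display_clock_setting",
}


def decode_throttle(mask: int) -> List[str]:
    """Return the names of all throttle reasons set in the bitmask."""
    return [name for bit, name in THROTTLE_BITS.items() if mask & bit]


print(decode_throttle(0x4))   # ['sw_power_cap']
print(decode_throttle(0x44))  # ['sw_power_cap', 'hw_thermal_slowdown']
```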
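 ## RDMA / InfiniBand
 The RDMA test measures IB NIC and network link performance. Single-node loopback and cross-node server/client are two different kinds of evidence and must not be mixed.
 | Metric | Meaning | How to read it |
 |---|---|---|
 | Device | IB device name, for example `mlx5_0` | Maps to a specific HCA/port |
 | Port | Port number | Usually port 1 |
 | State | Port state, for example ACTIVE/DOWN | Only ACTIVE counts as a usable link |
 | Rate | Port rate, for example 400 Gb/sec | Below expectation means the link is degraded or plugged into the wrong network |
 | GID/LID | IB addressing information | Used by `ibping` and for cross-node location |
 | ib_write_bw | RDMA write bandwidth | Throughput of the client writing to the remote side |
 | ib_read_bw | RDMA read bandwidth | Throughput of the client reading from the remote side |
 | ib_write_lat | RDMA write latency | Small-message write latency |
 | ib_read_lat | RDMA read latency | Small-message read latency |
 | ibping | IB-layer connectivity test | Shows whether the peer is reachable at the LID/GID layer |
 | PFC/ECN/CNP counters | Congestion and flow-control counters | Non-zero or growing values may indicate congestion, packet loss, or flow-control problems |
 ### Single-node vs. cross-node
 | Methodology | Meaning | What it can prove | What it cannot prove |
 |---|---|---|---|
 | `local_loopback` | perftest server and client both start on the same machine | The tools, devices, and local ports basically work | It cannot prove that the RDMA network between two machines meets the bar |
 | Cross-node server/client | One machine is the server, the other the client | It can prove actual cross-node RDMA bandwidth/latency | Requires explicit server_addr, ib_device, ib_port, ibping_target |
 RDMA read bandwidth below write bandwidth is common, but production acceptance sets separate thresholds for read and write. When read misses the line, check HCA firmware, BIOS, PCIe, NUMA, RoCE/IB configuration, the switch, PFC/ECN, cables, and port rate.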
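A minimal sketch of the RDMA gates, using the thresholds quoted in this commit (write/read bandwidth at least 47 GB/s, write latency at most 2 us, read latency at most 3.5 us). It assumes the perftest results have already been parsed and converted to GB/s and microseconds; the field names, the `mode` values, and `check_rdma` are hypothetical.

```python
from typing import Dict


def check_rdma(
    measured: Dict[str, float],
    mode: str,
    min_bw_gbs: float = 47.0,
    max_write_lat_us: float = 2.0,
    max_read_lat_us: float = 3.5,
) -> Dict[str, bool]:
    """Return one boolean per gate; all must be True for a cross-node PASS."""
    return {
        # Loopback results are not accepted as cross-node evidence.
        "cross_node_evidence": mode == "cross_node",
        "write_bw": measured["write_bw_gbs"] >= min_bw_gbs,
        "read_bw": measured["read_bw_gbs"] >= min_bw_gbs,
        "write_lat": measured["write_lat_us"] <= max_write_lat_us,
        "read_lat": measured["read_lat_us"] <= max_read_lat_us,
    }


# Hypothetical measurements, for illustration only.
result = check_rdma(
    {"write_bw_gbs": 48.5, "read_bw_gbs": 46.2, "write_lat_us": 1.8, "read_lat_us": 3.1},
    mode="cross_node",
)
print(result, "PASS" if all(result.values()) else "FAIL")
```

Keeping write and read as separate gates mirrors the note above: read bandwidth can miss its threshold even when write passes.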
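 ## Training Simulation
 Training Simulation uses a synthetic 1.5B Transformer training workload to verify that 8-GPU distributed training runs stably.
 | Metric | Meaning | How to read it |
 |---|---|---|
 | Model | Model type | Currently synthetic 1.5B, with no dependency on a real dataset |
 | Parameters | Parameter count | Confirms the workload size is as expected |
 | GPU Count | Number of GPUs taking part in training | The production criteria require 8-GPU DDP |
 | DType | Training numeric format, for example BF16 | BF16 is common for large-model training |
 | Batch Size | Batch size per step | Affects throughput and memory |
 | Seq Length | Sequence length | Affects compute and memory |
 | Steps | Training steps counted in the statistics | Too few steps make the statistics unstable |
 | Warmup Steps | Warmup steps | Keeps CUDA initialization, compilation, and cold caches out of the measurement |
 | Avg Step Time | Average time per step | Lower is better |
 | Throughput | tokens/sec | The core training throughput metric |
 | Samples/sec | Samples per second | A secondary measure of processing speed |
 | Peak Memory | Peak GPU memory | Shows whether the run is close to OOM or underuses memory |
 | Final Loss | Final loss | Confirms the value is finite, with no NaN/Inf |
 | Step Jitter | Step time jitter | Large jitter means training is unstable |
 | Distributed Mode | Distributed mode | Must be `ddp` to meet the 8-GPU distributed criteria |
 Training PASS means the 8-GPU DDP training path, NCCL functional connectivity, PyTorch CUDA, and basic numerical stability are all fine. It cannot replace the NCCL performance test, because the training workload may not exercise every communication pattern and message size.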
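The throughput and jitter columns follow from the per-step times, as in the sketch below. Two things here are assumptions: that `batch_size` is per GPU (so tokens per step is 8 GPUs × 8 × 2048), and that step jitter is the largest deviation from the mean step time as a percentage. `training_verdict` is a hypothetical helper; the 45,000 tokens/s and 3% limits come from the config in this commit.

```python
import math
from statistics import mean
from typing import Dict, Sequence


def training_verdict(
    step_times_sec: Sequence[float],
    tokens_per_step: int,
    final_loss: float,
    min_tokens_per_sec: float = 45000.0,
    max_step_jitter_pct: float = 3.0,
) -> Dict[str, object]:
    """Compute throughput and jitter from post-warmup step times."""
    avg = mean(step_times_sec)
    tokens_per_sec = tokens_per_step / avg
    # Assumption: jitter = largest deviation from the mean, relative to the mean.
    jitter_pct = max(abs(t - avg) for t in step_times_sec) / avg * 100.0
    passed = (
        tokens_per_sec >= min_tokens_per_sec
        and jitter_pct <= max_step_jitter_pct
        and math.isfinite(final_loss)
    )
    return {"tokens_per_sec": tokens_per_sec, "jitter_pct": jitter_pct, "passed": passed}


# Assumption: batch_size=8 is per GPU, so one step covers 8 * 8 * 2048 tokens.
tokens_per_step = 8 * 8 * 2048
print(training_verdict([2.0, 2.0, 2.0], tokens_per_step, final_loss=2.5))
```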
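 ## Common misreadings
 1. `DCGM PASS` does not mean whole-machine acceptance PASS. DCGM is one part of the official diagnostics and does not cover every workload performance gate.
 2. `Training PASS` does not mean NCCL performance PASS. A training run that completes only shows the functional path works; NCCL bus BW can still miss the bar.
 3. `NVLink PASS` does not mean NCCL PASS. Normal link counts and error counters do not mean every NCCL op/size reaches its threshold.
 4. `ibping PASS` does not mean RDMA bandwidth PASS. `ibping` only proves connectivity, not throughput or latency.
 5. `local_loopback` cannot serve as cross-node RDMA evidence. Cross-node acceptance needs evidence from both the server and the client side.
 6. A stress run that lasts the full 30 minutes is not automatically a PASS. Temperature delta, power, throttle, XID, and jitter must all be considered together.
 7. Low small-message NCCL numbers do not necessarily mean a link is down; latency, algorithm choice, launch overhead, or the threshold definition can cause them. Production acceptance still judges by the threshold.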
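 ## Suggested troubleshooting priorities
 | Failed item | What to check first |
 |---|---|
 | Compute FAIL | GPU clocks, power policy, MIG/MPS, background processes, PyTorch/CUDA versions, whether the benchmark actually uses the intended Tensor Core path |
 | NCCL FAIL | `NCCL_DEBUG=INFO`, topology, NVSwitch/NVLink, NCCL algorithm, message size, PCIe/NUMA, process CPU pinning |
 | Stress FAIL | Chassis airflow, fans, ambient temperature, power limit, `nvidia-smi -q -d POWER,CLOCK,TEMPERATURE` |
 | RDMA FAIL | Port rate, HCA firmware, cables, switch, PFC/ECN, NUMA, BIOS, cross-node server/client configuration |
 | Training FAIL | torchrun, NCCL environment variables, CUDA OOM, loss NaN/Inf, DDP initialization, network/shared memory |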
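 ## One-sentence version
 This report does not just check whether the GPUs light up and training runs. It verifies hardware detection, basic health, memory and interconnect bandwidth, compute throughput, multi-GPU communication, long-duration full-load stability, the IB/RDMA network, the official DCGM diagnostics, and the 8-GPU training path together. If any key item FAILs, production acceptance should fail the whole machine.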
--- a/docs/multinode_nccl_concepts.md
+++ b/docs/multinode_nccl_concepts.md
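@@ -0,0 +1,362 @@
 # Multi-node multi-GPU NCCL testing: concepts
 This document covers concepts only and does not involve script changes. The goal is to understand which layers to verify, step by step, when two 8-GPU H100 servers run multi-node multi-GPU communication tests, and what each layer actually proves.
 Current example machines:
 | Alias | Hostname | Internal IP | GPU |
 |---|---|---|---|
 | nccl-gpu-1 | aikubeworker0012 | 172.72.8.12 | 8 x H100 |
 | nccl-gpu-2 | aikubeworker0016 | 172.72.8.16 | 8 x H100 |
 Together the two machines have 16 GPUs. The core question of multi-node NCCL testing is whether these 16 GPUs can complete collective communication efficiently and correctly over the right GPU, NVLink, PCIe, and IB/RDMA network paths.
 ## 1. Overall approach
 Multi-node multi-GPU communication testing is a bottom-up process. The lower the layer, the closer to hardware and links; the higher the layer, the closer to a real training workload.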
 ```mermaid
 flowchart TD
    L0["0. Physical and basic connectivity<br/>Power / GPU / NIC / cables / switch / SSH"] --> L1["1. System detection layer<br/>nvidia-smi / lspci / ibstat / ibdev2netdev"]
    L1 --> L2["2. Single-node GPU health layer<br/>Temperature / power / ECC / PCIe / Throttling / NVLink Topo"]
    L2 --> L3["3. Single-node GPU performance layer<br/>HBM bandwidth / H2D-D2H / FP32-TF32-FP16-BF16-FP8 compute"]
    L3 --> L4["4. Single-node multi-GPU communication layer<br/>Single-node 8-GPU NCCL over NVLink/NVSwitch"]
    L4 --> L5["5. Cross-node network and RDMA layer<br/>IP connectivity / IB Active / RDMA bandwidth / RDMA latency"]
    L5 --> L6["6. Cross-node NCCL layer<br/>Two-node 16-GPU AllReduce / AllGather / ReduceScatter / Broadcast / AllToAll"]
    L6 --> L7["7. Training workload layer<br/>torchrun / Megatron / DeepSpeed / production training stress"]
 ```
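 The most important principle:
 **A failure at an upper layer is not necessarily an upper-layer problem.**
 For example, when a two-node `all_reduce_perf` fails, the cause may be in NCCL, but it may also be SSH, MPI, IB, GID, NIC selection, driver version, CUDA version, NCCL version, or GPU Direct RDMA.
 So the troubleshooting order should be:
 ```text
 basic connectivity -> single-node health -> single-node performance -> single-node NCCL -> cross-node RDMA -> cross-node NCCL -> training workload
 ```
 ## 2. Communication paths across two nodes and 16 GPUs
 Inside a node, traffic mainly goes over NVLink/NVSwitch. Across nodes, data has to pass through the GPU, PCIe/NVLink, the NIC, the switch, and the peer NIC.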
 ```mermaid
 flowchart LR
    subgraph A["aikubeworker0012 / 172.72.8.12"]
        A0["GPU0"] --- ASW["NVSwitch / NVLink"]
        A1["GPU1"] --- ASW
        A2["..."] --- ASW
        A7["GPU7"] --- ASW
        ASW --> ANIC["IB/RDMA NIC(s)"]
    end
    subgraph NET["InfiniBand / RoCE Fabric"]
        SW["IB Switch"]
    end
    subgraph B["aikubeworker0016 / 172.72.8.16"]
        BNIC["IB/RDMA NIC(s)"] --> BSW["NVSwitch / NVLink"]
        B0["GPU0"] --- BSW
        B1["GPU1"] --- BSW
        B2["..."] --- BSW
        B7["GPU7"] --- BSW
    end
    ANIC <--> SW
    SW <--> BNIC
 ```
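 There are two different communication domains here:
 | Domain | Typical path | Main tests |
 |---|---|---|
 | 8 GPUs within a node | GPU -> NVLink/NVSwitch -> GPU | Single-node NCCL, NVLink topo, D2D |
 | 16 GPUs across nodes | GPU -> NIC -> IB/RDMA network -> NIC -> GPU | RDMA, cross-node NCCL |
 The performance thresholds of the two domains must not be mixed. Single-node NVSwitch is very fast, cross-node RDMA is generally slower, and the bottleneck of cross-node NCCL is usually the IB/RDMA network.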
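 ## 3. What to test at each layer
 ### 3.1 Basic connectivity layer
 This layer only proves that the machines are reachable and that identities and addresses are correct.
 Confirm:
 | Check | Purpose |
 |---|---|
 | Mutual SSH access | Multi-node MPI/NCCL launch depends on starting processes remotely |
 | Correct hostname | Avoids logging in to the wrong machine |
 | Correct IP | Confirms the training network or the network that maps to IB/RDMA is being used |
 | Time synchronization | Makes long training logs and timeout investigations more reliable |
 This layer does not prove GPU or RDMA performance. It only proves that "the machines can find each other".
 ### 3.2 System detection layer
 This layer proves that the system can see the GPUs and NICs.
 Common information:
 | Tool | What to look at |
 |---|---|
 | `nvidia-smi` | GPU count, model, driver, CUDA, temperature, power |
 | `nvidia-smi topo -m` | GPU, NIC, CPU NUMA, NVLink/NVSwitch topology |
 | `ibstat` | IB devices, port state, link rate |
 | `ibdev2netdev` | Mapping between mlx5 devices and network interfaces |
 | `/sys/class/infiniband` | Port state, link layer, rate, GID |
 This layer matters a lot, because NCCL often ends up on TCP or the wrong interface after selecting the wrong NIC.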
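 ### 3.3 Single-node GPU health layer
 This layer proves that each machine is healthy on its own.
 ```mermaid
 flowchart LR
    H["Single-node health check"] --> T["Temperature"]
    H --> P["Power"]
    H --> E["ECC errors"]
    H --> PCIE["PCIe Gen/Width"]
    H --> C["SM/Mem Clock"]
    H --> TH["Throttling"]
    H --> PM["Persistence Mode"]
 ```
 If any card runs too hot, reports double-bit ECC errors, has a degraded PCIe link, or is throttling, later NCCL results cannot be trusted even if the tests run.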
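 ### 3.4 Single-node GPU performance layer
 This layer proves that the GPUs in each machine perform normally on their own.
 | Test | What it proves |
 |---|---|
 | HBM/D2D bandwidth | GPU memory and device-to-device copy capability |
 | H2D/D2H bandwidth | The PCIe path between CPU/host and GPU |
 | FP32/TF32 | Basic matrix compute capability |
 | FP16/BF16/FP8 | Tensor Core capability commonly used in training |
 This step is single-node acceptance. It cannot prove that communication between the two machines works, but it can rule out "one machine's GPU compute or bandwidth is abnormal".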
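 ### 3.5 Single-node multi-GPU NCCL layer
 This layer verifies collective communication among the 8 GPUs of one machine.
 ```mermaid
 flowchart TD
    S["Single-node 8-GPU NCCL"] --> AR["AllReduce"]
    S --> AG["AllGather"]
    S --> RS["ReduceScatter"]
    S --> BC["Broadcast"]
    S --> AT["AllToAll"]
 ```
 Single-node NCCL mainly shows whether the NVLink/NVSwitch communication path is healthy. Common metrics:
 | Metric | Meaning |
 |---|---|
 | `algbw` | Effective bandwidth from the algorithm's point of view |
 | `busbw` | Bandwidth from the bus's point of view; better suited to comparing link utilization |
 | `#wrong` | Number of incorrect results; must be 0 |
 Passing the single-node test only shows that communication among the 8 GPUs inside one server is normal.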
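The relation between `algbw` and `busbw` can be made concrete with the correction factors described in the nccl-tests performance notes: `2(n-1)/n` for AllReduce, `(n-1)/n` for AllGather and ReduceScatter, and `1` for Broadcast and Reduce. The sketch below covers only those five operations; the function names are hypothetical and the 240 GB/s input is a made-up value.

```python
def busbw_factor(op: str, n_ranks: int) -> float:
    """Return the algbw -> busbw correction factor for a collective."""
    if op == "allreduce":
        return 2.0 * (n_ranks - 1) / n_ranks
    if op in ("allgather", "reducescatter"):
        return (n_ranks - 1) / n_ranks
    if op in ("broadcast", "reduce"):
        return 1.0
    raise ValueError(f"factor not covered in this sketch: {op}")


def busbw(algbw_gbs: float, op: str, n_ranks: int) -> float:
    """Convert algorithm bandwidth to bus bandwidth."""
    return algbw_gbs * busbw_factor(op, n_ranks)


# Hypothetical algbw of 240 GB/s for an 8-rank AllReduce.
print(busbw(240.0, "allreduce", 8))  # 420.0
```

Because the factor depends on the rank count, `busbw` stays comparable between an 8-rank single-node run and a 16-rank two-node run, while `algbw` does not.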
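 ### 3.6 Cross-node RDMA layer
 This layer verifies the network and RDMA capability between the two machines and does not involve NCCL.
 ```mermaid
 sequenceDiagram
    participant N1 as aikubeworker0012
    participant FAB as IB/RDMA Fabric
    participant N2 as aikubeworker0016
    N1->>N2: ping / ssh
    N1->>FAB: ib_write_bw client
    FAB->>N2: ib_write_bw server
    N1->>FAB: ib_read_bw client
    FAB->>N2: ib_read_bw server
    N1->>N2: ib_write_lat / ib_read_lat
 ```
 This layer answers:
 | Question | Notes |
 |---|---|
 | Is the IB port Active? | If it is not Active, there is no point running NCCL |
 | Does RDMA bandwidth meet the bar? | Proves the network data plane can carry traffic |
 | Is RDMA latency normal? | High latency hurts small messages and training synchronization |
 | Is it InfiniBand or RoCE? | The two differ in environment variables and troubleshooting points |
 If the RDMA layer fails, cross-node NCCL will most likely fail too or fall back to TCP.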
 ### 3.7 Cross-node NCCL layer
 This layer is the actual multi-node multi-GPU NCCL test.
 Two 8-GPU machines are typically laid out as:
 ```text
 2 nodes x 8 GPUs = 16 ranks
 each rank is bound to 1 GPU
 ```
 Conceptually:
 ```mermaid
 flowchart LR
    subgraph N1["Node 1: 172.72.8.12"]
        R0["rank 0 / GPU0"]
        R1["rank 1 / GPU1"]
        R2["..."]
        R7["rank 7 / GPU7"]
    end
    subgraph N2["Node 2: 172.72.8.16"]
        R8["rank 8 / GPU0"]
        R9["rank 9 / GPU1"]
        R10["..."]
        R15["rank 15 / GPU7"]
    end
    R0 <--> R8
    R1 <--> R9
    R7 <--> R15
    N1 <--> N2
 ```
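The rank layout in the diagram (ranks 0-7 on node 1, ranks 8-15 on node 2, one GPU per rank) corresponds to a simple block mapping. The sketch below assumes that block placement; a launcher configured differently, for example with round-robin placement, would need a different formula. `rank_to_gpu` is a hypothetical helper.

```python
from typing import Tuple


def rank_to_gpu(global_rank: int, gpus_per_node: int = 8) -> Tuple[int, int]:
    """Map a global rank to (node index, local GPU index) under block placement."""
    return global_rank // gpus_per_node, global_rank % gpus_per_node


for rank in (0, 7, 8, 15):
    node, gpu = rank_to_gpu(rank)
    print(f"rank {rank} -> node {node}, GPU{gpu}")
```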
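 Typical test items:
 | NCCL test | What it maps to in training |
 |---|---|
 | AllReduce | Data-parallel gradient synchronization |
 | ReduceScatter | ZeRO/FSDP gradient sharding |
 | AllGather | ZeRO/FSDP parameter gathering |
 | Broadcast | Parameter broadcast, initialization |
 | AllToAll | MoE, expert parallelism, some parallel strategies |
 | SendRecv | Point-to-point communication, pipeline parallel |
 For cross-node NCCL, look at:
 | Metric | Judgment |
 |---|---|
 | Whether all 16 ranks start | Whether MPI/SSH/paths/environment are in order |
 | `#wrong == 0` | Correctness must pass |
 | `busbw` | Cross-node link utilization |
 | Whether traffic uses IB/RDMA | Confirm from `NCCL_DEBUG=INFO` output |
 | Whether it fell back to TCP | If it did, performance is noticeably lower |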
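One way to check the last two rows is to scan the `NCCL_DEBUG=INFO` output for the network backend that NCCL reports. The sketch below is heuristic: it assumes log lines of the form `Using network IB` / `Using network Socket` or `NET/IB` / `NET/Socket` markers, which is what NCCL commonly prints, but the exact wording varies between NCCL versions, so treat both the patterns and the sample log lines as assumptions to verify against your own logs.

```python
import re


def detect_nccl_transport(log_text: str) -> str:
    """Guess the inter-node transport from NCCL_DEBUG=INFO output."""
    match = re.search(r"Using network (\w+)", log_text)
    if match:
        name = match.group(1).lower()
        return "ib" if name == "ib" else name
    # Fall back to the per-plugin markers if the summary line is absent.
    if "NET/IB" in log_text:
        return "ib"
    if "NET/Socket" in log_text:
        return "socket"
    return "unknown"


# Illustrative log fragments; real lines differ by NCCL version.
ib_log = "node1:123:456 [0] NCCL INFO Using network IB"
tcp_log = "node1:123:456 [0] NCCL INFO NET/Socket : Using [0]eth0:172.72.8.12<0>"
print(detect_nccl_transport(ib_log), detect_nccl_transport(tcp_log))
```

A result of `socket` on a machine that is supposed to use IB/RDMA is the TCP fallback case described in the table.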
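 ## 4. Why NCCL is split into single-node and cross-node
 Single-node 8-GPU communication and cross-node 16-GPU communication have different bottlenecks.
 ```mermaid
 flowchart TD
    A["NCCL performance result"] --> B{"Test scope"}
    B --> C["Single node, 8 GPUs"]
    B --> D["Cross node, 16 GPUs"]
    C --> C1["Main bottleneck: NVLink / NVSwitch"]
    C --> C2["Threshold can reference the GPU's NVLink capability"]
    D --> D1["Main bottleneck: IB/RDMA network"]
    D --> D2["Threshold should reference NIC count, rate, topology, and rail count"]
 ```
 So a single-node NVLink threshold cannot be applied directly to cross-node NCCL. Cross-node thresholds have to be set from the real network capability, for example:
 | Network configuration | How to read the theoretical ceiling |
 |---|---|
 | One 400G NIC | About 50 GB/s raw unidirectional bandwidth |
 | Eight 400G NICs | About 400 GB/s raw aggregate bandwidth |
 | Measured NCCL busbw | Affected by topology, GDR, rails, NUMA, the switch, and the NCCL algorithm |
 For actual acceptance, first find out how many IB/RDMA NICs each machine has, the rate of each, and the GPU-to-NIC topology, and only then set the cross-node NCCL threshold.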
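The figures in the table follow from a plain unit conversion: link rates are quoted in gigabits per second, and dividing by 8 gives gigabytes per second. The sketch below reproduces that arithmetic; it deliberately ignores encoding and protocol overhead, so the result is a raw ceiling rather than an achievable number. `raw_gbytes_per_sec` is a hypothetical helper.

```python
def raw_gbytes_per_sec(rate_gbps: float, num_nics: int = 1) -> float:
    """Convert a per-NIC line rate in Gb/s to aggregate raw GB/s (8 bits per byte)."""
    return rate_gbps * num_nics / 8.0


print(raw_gbytes_per_sec(400))              # 50.0 GB/s for one 400G NIC
print(raw_gbytes_per_sec(400, num_nics=8))  # 400.0 GB/s for eight 400G NICs
```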
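 ## 5. Common failure points
 ```mermaid
 flowchart TD
    F["Cross-node NCCL failure"] --> A["Fails to start"]
    F --> B["Starts but is slow"]
    F --> C["Timeout during the run"]
    F --> D["#wrong is non-zero"]
    A --> A1["SSH unreachable"]
    A --> A2["Remote path does not exist"]
    A --> A3["Inconsistent MPI environments"]
    A --> A4["Running as root not allowed"]
    B --> B1["Wrong NCCL_SOCKET_IFNAME"]
    B --> B2["Not using IB/RDMA, fell back to TCP"]
    B --> B3["Wrong NCCL_IB_HCA"]
    B --> B4["GPU Direct RDMA not in effect"]
    C --> C1["Unstable IB port"]
    C --> C2["Switch/PFC/ECN problem"]
    C --> C3["NCCL timeout configuration"]
    C --> C4["Incompatible driver/CUDA/NCCL versions"]
    D --> D1["Communication correctness failure"]
    D --> D2["Must FAIL; bandwidth alone is not enough"]
 ```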
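 ## 6. Recommended acceptance order
 The recommended order for two 8-GPU machines:
 ```mermaid
 flowchart TD
    A["Step 1: Basic info of both machines"] --> B["Step 2: Single-node GPU health on both machines"]
    B --> C["Step 3: Single-node benchmark on both machines"]
    C --> D["Step 4: Single-node 8-GPU NCCL on each machine"]
    D --> E["Step 5: RDMA bandwidth/latency between the two machines"]
    E --> F["Step 6: Two-node 16-GPU NCCL correctness"]
    F --> G["Step 7: Two-node 16-GPU NCCL performance"]
    G --> H["Step 8: Two-node training demo or production stress test"]
 ```
 What each step is for:
 | Step | Purpose |
 |---|---|
 | Step 1 | Confirm you are on the right machines and that the basic network and environment exist |
 | Step 2 | Rule out GPU health problems |
 | Step 3 | Rule out single-GPU/single-node performance problems |
 | Step 4 | Rule out single-node NVLink/NVSwitch/NCCL problems |
 | Step 5 | Rule out cross-node RDMA problems |
 | Step 6 | Prove NCCL correctness first |
 | Step 7 | Then prove NCCL performance |
 | Step 8 | Finally verify stability with a realistic training workload |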
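 ## 7. Mapping to the current scripts
 How the existing modules relate to the layers above:
 | Current module | Layer covered | Notes |
 |---|---|---|
 | `gpu_info` | System detection layer | Single node |
 | `health` | Single-node GPU health layer | Single node |
 | `benchmark` | Single-node GPU performance layer | Single node |
 | `nccl` | Single-node multi-GPU communication layer | Currently mostly single node |
 | `rdma` | RDMA checks | Currently leans toward local checks, not a two-machine test |
 | `stress` | Stability | Single node |
 | `training` | Training workload layer | Currently leans toward single node |
 | Suggested new `multi_node_nccl` | Cross-node NCCL layer | Dedicated to hostfile, mpirun, multi-node environment, and result parsing |
 If the scripts are extended later, the natural direction is a new multi-node module rather than packing all the logic into the existing `nccl` module.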
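 ## 8. Minimal conceptual model
 It is enough to remember this:
 ```text
 Single-node NCCL verifies NVLink/NVSwitch between GPUs.
 Cross-node RDMA verifies the network between machines.
 Cross-node NCCL verifies that NCCL can combine GPUs and the network to provide efficient communication for real training.
 ```
 Multi-node multi-GPU testing is therefore not a single command but a chain of verification steps.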
--- a/gpu_tester.py
+++ b/gpu_tester.py
@@ -5,6 +5,7 @@ import argparse
 import json
 import os
 import signal
 import socket
 import sys
 import time
 from datetime import datetime
@@ -25,6 +26,8 @@ from modules.nccl_test import NCCLTest
 from modules.training_sim import TrainingSim
 from modules.stress_test import StressTest
 from modules.rdma_test import RDMATest
 from modules.nvlink_test import NVLinkTest
 from modules.dcgm_test import DCGMTest
 from modules.report import ReportGenerator
 from modules.gpu_specs import detect_gpu_type, get_gpu_specs, get_gpu_label, get_supported_gpus, validate_driver_compatibility
@ -32,43 +35,87 @@ DEFAULT_CONFIG = {
    "benchmark": {
        "memory": {"size_mb": 4096, "iterations": 10, "nvbandwidth_buffer_mb": 512, "nvbandwidth_samples": 3},
        "compute": {
-            "dtypes": ["fp32", "tf32", "fp16", "bf16", "fp8"],
+            "dtypes": ["fp32", "tf32", "fp16", "bf16", "fp8", "fp64", "int8"],
-            "matrix_size": 4096,
+            "matrix_size": 8192,
-            "warmup": 10,
+            "warmup": 50,
-            "iterations": 100,
+            "iterations": 500,
            "use_compile": True,
        },
    },
-    "health": {"temp_warning": 80, "temp_critical": 90, "power_limit": None},
+    "health": {"temp_warning": 75, "temp_critical": 85, "power_limit": None},
    "nccl": {
        "min_bandwidth_gbps": None,
        "test_allreduce": True,
        "test_alltoall": True,
        "test_broadcast": True,
-        "test_reduce_scatter": False,
+        "test_reduce_scatter": True,
-        "test_allgather": False,
+        "test_allgather": True,
-        "test_sendrecv": False,
+        "test_sendrecv": True,
        "message_sizes": ["1M", "256M", "2G"],
        "repeats": 3,
        "max_stddev_pct": 3,
    },
    "stress": {
-        "duration_sec": 60,
+        "duration_sec": 1800,
        "production_duration_sec": 1800,
        "use_gpu_burn": False,
        "use_doubles": False,
        "use_tensor_cores": True,
        "memory_pct": 90,
        "gpus": "all",
        "dtype": "bf16",
        "matrix_size": 24576,
        "telemetry_interval_sec": 1,
        "warmup_sec": 60,
        "min_steady_samples": 10,
        "max_temp_c": 80,
        "max_temp_delta_c": 5,
        "min_power_watts": 630,
        "max_tflops_jitter_pct": 5,
        "require_tflops_jitter": True,
    },
    "rdma": {
-        "min_bandwidth_gbps": 50,
+        "min_bandwidth_gbps": 47,
-        "max_latency_us": 10,
+        "min_port_rate_gbps": 400,
        "max_latency_us": 3.5,
        "max_write_latency_us": 2.0,
        "max_read_latency_us": 3.5,
        "ib_iterations": 1000,
-        "msg_size": 65536,
+        "msg_size": 4194304,
        "latency_msg_size": 8,
        "ib_device": None,
        "ib_port": 1,
        "server_addr": None,
        "ibping_target": None,
        "ibping_count": 5,
        "role": "auto",
        "pfc_ecn_counters": True,
    },
    "nvlink": {
        "expected_links_per_gpu": 18,
        "expected_link_speed_gbps": 25,
        "require_zero_errors": True,
    },
    "dcgm": {
        "diag_level": 3,
        "timeout_sec": 1200,
        "expected_num_gpus": 8,
        "json_output": True,
        "require_subtests": True,
    },
    "training": {
-        "model": "gpt2",
+        "model": "synthetic_1.5b",
        "batch_size": 8,
        "seq_length": 2048,
        "num_steps": 50,
        "warmup_steps": 5,
        "dtype": "bf16",
        "mode": "ddp",
        "synthetic_params_b": 1.5,
        "min_tokens_per_sec": 45000,
        "max_step_jitter_pct": 3,
        "max_peak_memory_gb": 70,
        "require_distributed": True,
    },
    "report": {"output_dir": "./reports", "format": "json"},
    "tools": {"install_dir": "/opt/gpu-test-tools"},
@ -131,7 +178,7 @@ def interactive_menu(config: dict):
    if not check_prerequisites(console):
        return
-    results_store: dict = {"timestamp": datetime.now().isoformat(), "tests": {}}
+    results_store: dict = {"timestamp": datetime.now().isoformat(), "hostname": socket.gethostname(), "tests": {}}
    menu_items = [
        ("1", "GPU Information", "gpu_info"),
@ -139,10 +186,12 @@ def interactive_menu(config: dict):
        ("3", "Memory Benchmark (nvbandwidth)", "memory_bench"),
        ("4", "Compute Benchmark", "compute_bench"),
        ("5", "NCCL Multi-GPU Test", "nccl"),
-        ("6", "GPU Stress Test (gpu-burn)", "stress"),
+        ("6", "GPU Stress Test (PyTorch/gpu-burn)", "stress"),
        ("7", "RDMA/IB Test", "rdma"),
-        ("8", "Training Simulation", "training"),
+        ("8", "NVLink/NVSwitch Test", "nvlink"),
-        ("9", "Full Test Suite (All Tests)", "all"),
+        ("9", "DCGM Diagnostic", "dcgm"),
        ("10", "Training Simulation", "training"),
        ("11", "Full Test Suite (All Tests)", "all"),
        ("0", "Generate Report", "report"),
    ]
@ -164,8 +213,10 @@ def interactive_menu(config: dict):
            "memory_bench": "HBM bandwidth via nvbandwidth",
            "compute_bench": "GEMM TFLOPS across FP32/TF32/FP16/BF16/FP8",
            "nccl": "AllReduce, AllToAll, Broadcast via nccl-tests",
-            "stress": "Long-running GPU stress via gpu-burn",
+            "stress": "Long-running high-power GEMM stress with telemetry",
            "rdma": "InfiniBand bandwidth & latency (ib_write_bw)",
            "nvlink": "NVLink links, speed, and error counters",
            "dcgm": "DCGM diag -r 3 production diagnostic",
            "training": "Simulate LLM training with PyTorch",
            "all": "Run all tests sequentially",
            "report": "Export results to JSON/HTML",
@ -257,6 +308,18 @@ def _run_test(test_name: str, config: dict, console: Console) -> dict:
            m.print_results(result)
            return result
        elif test_name == "nvlink":
            m = NVLinkTest(config)
            result = m.run()
            m.print_results(result)
            return result
        elif test_name == "dcgm":
            m = DCGMTest(config)
            result = m.run()
            m.print_results(result)
            return result
        elif test_name == "training":
            m = TrainingSim(config)
            result = m.run()
@ -280,15 +343,17 @@ def _run_test(test_name: str, config: dict, console: Console) -> dict:
 def _run_full_suite(config: dict, console: Console) -> dict:
    """Run all tests sequentially."""
    console.print(Panel("[bold cyan]Running Full Test Suite[/bold cyan]", box=box.DOUBLE))
-    all_results: dict = {"timestamp": datetime.now().isoformat()}
+    all_results: dict = {"timestamp": datetime.now().isoformat(), "hostname": socket.gethostname()}
    tests = [
        ("gpu_info", "GPU Information", GPUInfo),
        ("health", "Health Check", HealthCheck),
        ("memory_bench", "Memory Benchmark", lambda c: Benchmark(c)),
        ("compute_bench", "Compute Benchmark", lambda c: Benchmark(c)),
        ("nvlink", "NVLink/NVSwitch Test", NVLinkTest),
        ("nccl", "NCCL Test", NCCLTest),
        ("stress", "GPU Stress Test", StressTest),
        ("rdma", "RDMA/IB Test", RDMATest),
        ("dcgm", "DCGM Diagnostic", DCGMTest),
        ("training", "Training Simulation", TrainingSim),
    ]
@ -313,14 +378,49 @@ def _run_full_suite(config: dict, console: Console) -> dict:
    # Summary
    console.print("\n" + "=" * 60)
    # Only count test results, exclude metadata like timestamp
-    test_results = {k: v for k, v in all_results.items() if k != "timestamp"}
+    test_results = {k: v for k, v in all_results.items() if k not in ("timestamp", "hostname")}
-    passed = sum(1 for v in test_results.values() if not isinstance(v, dict) or "error" not in v)
+    passed = sum(1 for v in test_results.values() if _test_result_passed(v))
    total = len(test_results)
    color = "green" if passed == total else ("yellow" if passed > 0 else "red")
    console.print(f"[bold {color}]Suite complete: {passed}/{total} tests passed[/bold {color}]")
    return all_results
 def _test_result_passed(result) -> bool:
    """Strict production verdict helper for full-suite exit status."""
    if not isinstance(result, dict):
        # Non-dict results carry no verdict fields; treat them as passed.
        return True
    if result.get("error"):
        return False
    # A skipped test proves nothing, so it cannot satisfy acceptance.
    if result.get("skipped") or result.get("status") == "SKIP":
        return False
    # The torchrun fallback is a functional smoke only, not a bandwidth measurement.
    if result.get("source") == "torchrun_fallback":
        return False
    if "passed" in result:
        return bool(result.get("passed"))
    if "memory" in result:
        mem = result["memory"]
        if not isinstance(mem, dict):
            # A malformed sub-result must fail instead of raising AttributeError.
            return False
        if "passed" in mem:
            return bool(mem.get("passed"))
        if mem.get("error") or mem.get("source") == "pytorch":
            return False
        eff = mem.get("d2d_efficiency_pct") or mem.get("efficiency_pct") or 0
        return eff >= 80
    if "compute" in result:
        comp = result["compute"]
        if not isinstance(comp, dict):
            return False
        if "passed" in comp:
            return bool(comp.get("passed"))
        thresholds = comp.get("pass_thresholds_tflops", {}) or {}
        per_dtype = comp.get("per_dtype_tflops", {}) or {}
        for dt, threshold in thresholds.items():
            val = per_dtype.get(dt)
            if not isinstance(val, (int, float)) or val < threshold:
                return False
        consistency = comp.get("consistency", {}) or {}
        return all(c.get("passed", False) for c in consistency.values())
    return True
 def main():
    gpu_list_str = " / ".join(g.upper() for g in get_supported_gpus())
    parser = argparse.ArgumentParser(
@ -335,15 +435,17 @@ Examples:
   python gpu_tester.py --test benchmark --type memory
   python gpu_tester.py --test benchmark --type compute --dtype fp16
   python gpu_tester.py --test nccl            # NCCL test
   python gpu_tester.py --test nvlink          # NVLink/NVSwitch test
   python gpu_tester.py --test dcgm            # DCGM diagnostic
   python gpu_tester.py --test training        # Training sim
   python gpu_tester.py --test all             # Full suite
   python gpu_tester.py --report --format json --output report.json
        """,
    )
-    parser.add_argument("--test", choices=["gpu-info", "health", "benchmark", "nccl", "stress", "rdma", "training", "all"],
+    parser.add_argument("--test", choices=["gpu-info", "health", "benchmark", "nccl", "stress", "rdma", "nvlink", "dcgm", "training", "all"],
                        help="Run a specific test")
    parser.add_argument("--type", choices=["memory", "compute"], help="Benchmark type (with --test benchmark)")
-    parser.add_argument("--dtype", choices=["fp32", "tf32", "fp16", "bf16", "fp8"],
+    parser.add_argument("--dtype", choices=["fp32", "tf32", "fp16", "bf16", "fp8", "fp64", "int8"],
                        help="Compute benchmark dtype (with --test benchmark --type compute)")
    parser.add_argument("--interactive", action="store_true", help="Force interactive mode")
    parser.add_argument("--report", action="store_true", help="Generate report from last results")
@ -399,6 +501,8 @@ Examples:
        "nccl": "nccl",
        "stress": "stress",
        "rdma": "rdma",
        "nvlink": "nvlink",
        "dcgm": "dcgm",
        "training": "training",
        "all": "all",
    }
@ -415,19 +519,30 @@ Examples:
            result = bench.run()
            Benchmark.print_results(result)
        if args.report:
-            ReportGenerator(config).generate({"benchmark": result, "timestamp": datetime.now().isoformat()},
+            ReportGenerator(config).generate({
                "benchmark": result,
                "timestamp": datetime.now().isoformat(),
                "hostname": socket.gethostname(),
            },
                                             fmt=args.format, output=args.output)
        sys.exit(0 if _test_result_passed(result) else 1)
    elif args.test == "all":
        results = _run_full_suite(config, console)
        if args.report:
            ReportGenerator(config).generate(results, fmt=args.format, output=args.output)
-        has_errors = any("error" in v for v in results.values() if isinstance(v, dict))
+        failed = any(not _test_result_passed(v) for k, v in results.items() if k not in ("timestamp", "hostname"))
-        sys.exit(1 if has_errors else 0)
+        sys.exit(1 if failed else 0)
    else:
        result = _run_test(test_map[args.test], config, console)
        if args.report and result:
-            ReportGenerator(config).generate({args.test: result, "timestamp": datetime.now().isoformat()},
+            report_key = test_map[args.test] or args.test
            ReportGenerator(config).generate({
                report_key: result,
                "timestamp": datetime.now().isoformat(),
                "hostname": socket.gethostname(),
            },
                                             fmt=args.format, output=args.output)
        sys.exit(0 if _test_result_passed(result) else 1)
 if __name__ == "__main__":
--- a/modules/dcgm_test.py
+++ b/modules/dcgm_test.py
@ -0,0 +1,231 @@
 """DCGM diagnostic acceptance wrapper."""
 import json
 import os
 import re
 import shutil
 import signal
 import subprocess
 from datetime import datetime
 from typing import Optional
 from rich.console import Console
 from rich.table import Table
 class DCGMTest:
    def __init__(self, config: dict):
        self.config = config
        self.console = Console()
        self.cfg = config.get("dcgm", {})
    def run(self) -> dict:
        dcgmi = shutil.which("dcgmi")
        if not dcgmi:
            return {
                "passed": False,
                "error": "dcgmi not found",
                "timestamp": datetime.now().isoformat(),
            }
        level = str(self.cfg.get("diag_level", 3))
        timeout = int(self.cfg.get("timeout_sec", 1200))
        cmd = [dcgmi, "diag", "-r", level]
        expected_gpus = self.cfg.get("expected_num_gpus")
        if expected_gpus:
            cmd.extend(["-n", f"gpu:{int(expected_gpus)}"])
        if self.cfg.get("json_output", True):
            cmd.append("-j")
        try:
            r = self._run_with_process_group_timeout(cmd, timeout)
        except subprocess.TimeoutExpired as e:
            output = ((e.output or "") + "\n" + (e.stderr or "")).strip()
            return {
                "passed": False,
                "error": f"dcgmi diag -r {level} timeout after {timeout}s",
                "command": cmd,
                "raw_output_tail": output[-8000:],
                "timestamp": datetime.now().isoformat(),
            }
        output = r.stdout + "\n" + r.stderr
        subtests = self._parse_json_output(output) or self._parse_output(output)
        strict_statuses = {"PASS"}
        failed = [s for s in subtests if s["status"] not in strict_statuses]
        require_subtests = bool(self.cfg.get("require_subtests", True))
        passed = r.returncode == 0 and not failed and (bool(subtests) or not require_subtests)
        return {
            "passed": passed,
            "returncode": r.returncode,
            "level": int(level),
            "command": cmd,
            "expected_num_gpus": int(expected_gpus) if expected_gpus else None,
            "subtests": subtests,
            "raw_output_tail": output[-8000:],
            "timestamp": datetime.now().isoformat(),
        }
    @staticmethod
    def _run_with_process_group_timeout(cmd: list[str], timeout: int) -> subprocess.CompletedProcess:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True,
            start_new_session=True,
        )
        try:
            stdout, stderr = proc.communicate(timeout=timeout)
        except subprocess.TimeoutExpired as e:
            try:
                os.killpg(proc.pid, signal.SIGTERM)
                stdout, stderr = proc.communicate(timeout=10)
            except subprocess.TimeoutExpired:
                os.killpg(proc.pid, signal.SIGKILL)
                stdout, stderr = proc.communicate(timeout=10)
            raise subprocess.TimeoutExpired(cmd, timeout, output=stdout, stderr=stderr) from e
        return subprocess.CompletedProcess(cmd, proc.returncode, stdout, stderr)
    @classmethod
    def _parse_json_output(cls, output: str) -> list[dict]:
        text = output.strip()
        if not text:
            return []
        try:
            payload = json.loads(text)
        except json.JSONDecodeError:
            m = re.search(r"(\{.*\})", text, re.S)
            if not m:
                return []
            try:
                payload = json.loads(m.group(1))
            except json.JSONDecodeError:
                return []
        dcgm_payload = payload.get("DCGM Diagnostic") if isinstance(payload, dict) else None
        if isinstance(dcgm_payload, dict):
            parsed = cls._parse_dcgm_diagnostic_json(dcgm_payload)
            if parsed:
                return parsed
        subtests = []
        def walk(node, path: list[str]):
            if isinstance(node, dict):
                node_name = (
                    node.get("name")
                    or node.get("testName")
                    or node.get("test_name")
                    or node.get("category")
                    or node.get("category_name")
                )
                child_path = [*path, str(node_name)] if node_name else path
                status = node.get("status") or node.get("result") or node.get("Result")
                if isinstance(status, str):
                    name = (
                        node_name
                        or " / ".join(path[-3:])
                    )
                    normalized = cls._normalize_status(status)
                    if normalized:
                        subtests.append({
                            "name": str(name)[:160],
                            "status": normalized,
                            "raw": json.dumps(node, default=str)[:1000],
                        })
                for key, value in node.items():
                    walk(value, [*child_path, str(key)])
            elif isinstance(node, list):
                for idx, item in enumerate(node):
                    walk(item, [*path, str(idx)])
        walk(payload, [])
        return subtests
    @classmethod
    def _parse_dcgm_diagnostic_json(cls, payload: dict) -> list[dict]:
        subtests = []
        for category in payload.get("test_categories", []) or []:
            category_name = str(category.get("category") or "DCGM")
            for test in category.get("tests", []) or []:
                test_name = str(test.get("name") or "unnamed")
                for result in test.get("results", []) or []:
                    status = cls._normalize_status(str(result.get("status", "")))
                    if not status:
                        continue
                    entity_group = result.get("entity_group") or "entity"
                    entity_id = result.get("entity_id", "unknown")
                    name = f"{category_name}/{test_name}/{entity_group}{entity_id}"
                    subtests.append({
                        "name": name[:160],
                        "status": status,
                        "raw": json.dumps(result, default=str)[:1000],
                    })
                summary = test.get("test_summary") or {}
                status = cls._normalize_status(str(summary.get("status", "")))
                if status:
                    subtests.append({
                        "name": f"{category_name}/{test_name}/summary"[:160],
                        "status": status,
                        "raw": json.dumps(summary, default=str)[:1000],
                    })
        return subtests
    @staticmethod
    def _normalize_status(status: str) -> str:
        s = status.strip().upper()
        aliases = {
            "PASS": "PASS",
            "PASSED": "PASS",
            "OK": "PASS",
            "FAIL": "FAIL",
            "FAILED": "FAIL",
            "ERROR": "ERROR",
            "WARN": "WARN",
            "WARNING": "WARN",
            "SKIP": "SKIP",
            "SKIPPED": "SKIP",
            "NOT_RUN": "SKIP",
            "NOT RUN": "SKIP",
        }
        # Every canonical status is already a key in `aliases`; anything else is unknown.
        return aliases.get(s, "")
    @staticmethod
    def _parse_output(output: str) -> list[dict]:
        subtests = []
        for line in output.splitlines():
            stripped = line.strip()
            if not stripped:
                continue
            m = re.search(r"(.+?)\s*[:|]\s*(PASS|FAIL|WARN|ERROR|SKIP)\b", stripped, re.I)
            if not m:
                m = re.search(r"\b(PASS|FAIL|WARN|ERROR|SKIP)\b\s*[-:|]\s*(.+)", stripped, re.I)
                if m:
                    status = DCGMTest._normalize_status(m.group(1))
                    name = m.group(2).strip()
                else:
                    continue
            else:
                name = m.group(1).strip(" .|-")
                status = DCGMTest._normalize_status(m.group(2))
            if name and len(name) < 160:
                subtests.append({"name": name, "status": status, "raw": stripped})
        return subtests
    @staticmethod
    def print_results(results: dict, console: Optional[Console] = None):
        c = console or Console()
        if results.get("error"):
            c.print(f"[bold red]DCGM error: {results['error']}[/bold red]")
            return
        passed = results.get("passed", False)
        c.print("[bold green]✓ DCGM diag PASSED[/bold green]" if passed else "[bold red]✗ DCGM diag FAILED[/bold red]")
        subtests = results.get("subtests", [])
        if subtests:
            table = Table(box=None, padding=(0, 1))
            table.add_column("Subtest")
            table.add_column("Status", style="bold")
            for s in subtests:
                table.add_row(s.get("name", ""), s.get("status", ""))
            c.print(table)
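To make the strict verdict concrete, here is a small self-contained sketch of the flattening that `_parse_dcgm_diagnostic_json` performs. The sample payload is invented for illustration; its shape is inferred from the parser above, not from DCGM documentation, and `flatten` and `strict_pass` are hypothetical helper names.

```python
import json

# Invented sample: keys mirror what the parser above reads, values are made up.
SAMPLE = json.dumps({
    "DCGM Diagnostic": {
        "test_categories": [
            {"category": "Hardware", "tests": [
                {"name": "memory",
                 "results": [
                     {"entity_group": "GPU", "entity_id": 0, "status": "Pass"},
                     {"entity_group": "GPU", "entity_id": 1, "status": "Fail"},
                 ]},
            ]},
        ]
    }
})


def flatten(payload_text: str) -> list[tuple[str, str]]:
    """Return (name, STATUS) rows, one per entity result."""
    diag = json.loads(payload_text).get("DCGM Diagnostic", {})
    rows = []
    for cat in diag.get("test_categories") or []:
        for test in cat.get("tests") or []:
            for res in test.get("results") or []:
                entity = f"{res.get('entity_group')}{res.get('entity_id')}"
                name = f"{cat.get('category')}/{test.get('name')}/{entity}"
                rows.append((name, str(res.get("status", "")).strip().upper()))
    return rows


def strict_pass(rows: list[tuple[str, str]]) -> bool:
    """Pass only if at least one subtest was parsed and every one is PASS."""
    return bool(rows) and all(status == "PASS" for _, status in rows)


if __name__ == "__main__":
    for name, status in flatten(SAMPLE):
        print(status, name)
    print("accepted:", strict_pass(flatten(SAMPLE)))
```

Requiring a non-empty row list corresponds to `require_subtests`: output that cannot be parsed is treated as a failure, not as a silent pass.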
--- a/modules/health_check.py
+++ b/modules/health_check.py
@ -171,6 +171,10 @@ class HealthCheck:
            gpu_health.append({"index": i, "status": worst, "checks": checks})
        system_health = self._check_system()
        for key in ("fabricmanager", "retired_pages", "kernel_errors"):
            item = system_health.get(key, {})
            if isinstance(item, dict) and item.get("status") == "FAIL":
                overall_pass = False
        return {
            "passed": overall_pass,
@ -228,6 +232,9 @@ class HealthCheck:
            rdma_devs = os.listdir("/sys/class/infiniband_verbs")
        nccl_env = {k: v for k, v in os.environ.items() if k.startswith("NCCL_")}
        fabric = self._check_fabricmanager()
        retired = self._check_retired_pages()
        kernel_errors = self._check_kernel_errors()
        return {
            "nvidia_persistenced": {"installed": persistd, "running": persistd_running},
@ -238,6 +245,41 @@ class HealthCheck:
            "infiniband_devices": ib_devs,
            "rdma_devices": rdma_devs,
            "nccl_env_vars": nccl_env,
            "fabricmanager": fabric,
            "retired_pages": retired,
            "kernel_errors": kernel_errors,
        }
    def _check_fabricmanager(self) -> dict:
        r = self._run_cmd(["systemctl", "is-active", "nvidia-fabricmanager"], timeout=5)
        active = r == "active"
        logs = self._run_cmd(["journalctl", "-u", "nvidia-fabricmanager", "-n", "200", "--no-pager"], timeout=10) or ""
        has_error = "ERROR" in logs.upper() or "FAILED" in logs.upper()
        return {
            "active": active,
            "has_error_logs": has_error,
            "status": "PASS" if active and not has_error else "FAIL",
        }
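    def _check_retired_pages(self) -> dict:
        import re  # local import keeps this helper self-contained
        raw = self._run_cmd(["nvidia-smi", "-q", "-d", "PAGE_RETIREMENT"], timeout=30) or ""
        # Assumption: counts appear on "Single Bit ECC : N" / "Double Bit ECC : N" lines
        # under a "Retired Pages" header; "N/A" values do not match and are ignored.
        nums = [int(x) for x in re.findall(
            r"(?:Retired Pages|Single Bit ECC|Double Bit ECC)[^:\n]*:\s*(\d+)", raw, flags=re.I)]
        # Only a "Yes" on the pending line itself counts, not a "Yes" elsewhere in the output.
        pending = bool(re.search(r"Pending(?: Page Blacklist)?\s*:\s*Yes\b", raw, flags=re.I))
        total = sum(nums)
        return {
            "retired_pages": total,
            "pending_blacklist": pending,
            "status": "PASS" if total == 0 and not pending else "FAIL",
        }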
    def _check_kernel_errors(self) -> dict:
        raw = self._run_cmd(["dmesg", "--ctime", "--level=err,crit,alert,emerg"], timeout=10) or ""
        # Keep only error-level lines that mention GPU- or PCIe-related keywords.
        keywords = ("XID", "AER", "PCIE", "NVRM")
        hits = [line for line in raw.splitlines() if any(k in line.upper() for k in keywords)]
        return {
            "count": len(hits),
            "tail": hits[-20:],
            "status": "PASS" if not hits else "FAIL",
        }
    @staticmethod
--- a/modules/nccl_test.py
+++ b/modules/nccl_test.py
@ -5,6 +5,8 @@ import os
 import re
 import shutil
 import subprocess
 import statistics
 import sys
 from datetime import datetime
 from typing import Optional
@ -70,6 +72,38 @@ class NCCLTest:
                return p
        return None
    def _message_sizes(self) -> list[str]:
        return list(self.nccl_cfg.get("message_sizes") or ["1M", "256M", "2G"])
    def _repeats(self) -> int:
        return int(self.nccl_cfg.get("repeats", 3))
    def _max_stddev_pct(self) -> float:
        return float(self.nccl_cfg.get("max_stddev_pct", 3))
    def _runtime_env(self) -> dict:
        env = {**os.environ, "NCCL_DEBUG": "WARN"}
        lib_dirs = []
        nccl_home = env.get("NCCL_HOME") or self.nccl_cfg.get("nccl_home")
        if nccl_home:
            lib_dirs.append(os.path.join(str(nccl_home), "lib"))
        for path in sys.path:
            lib_dirs.append(os.path.join(path, "nvidia", "nccl", "lib"))
        venv_root = os.path.dirname(os.path.dirname(sys.executable))
        lib_dirs.extend(glob.glob(os.path.join(venv_root, "lib", "python*", "site-packages", "nvidia", "nccl", "lib")))
        existing = env.get("LD_LIBRARY_PATH", "")
        valid_dirs = []
        for d in lib_dirs:
            if d and os.path.isdir(d) and d not in valid_dirs:
                valid_dirs.append(d)
        if valid_dirs:
            env["LD_LIBRARY_PATH"] = ":".join(valid_dirs + ([existing] if existing else []))
        return env
    def run(self) -> dict:
        gpu_count = 0
        if TORCH_AVAILABLE:
@ -89,7 +123,7 @@ class NCCLTest:
        if self.nccl_cfg.get("test_reduce_scatter", False):
            tests.append(("reduce_scatter_perf", "ReduceScatter"))
        if self.nccl_cfg.get("test_allgather", False):
-            tests.append(("allgather_perf", "AllGather"))
+            tests.append(("all_gather_perf", "AllGather"))
        if self.nccl_cfg.get("test_sendrecv", False):
            tests.append(("sendrecv_perf", "SendRecv"))
@ -170,39 +204,7 @@ class NCCLTest:
        if not binary:
            return {"status": "SKIP", "error": f"{binary_name} not found"}
-        cmd = [
-            binary,
-            "-b", "8M",
-            "-e", "8G",
-            "-f", "2",
-            "-g", str(gpu_count),
-            "-w", "5",
-            "-n", "20",
-        ]
-        try:
-            env = os.environ.copy()
-            env["NCCL_DEBUG"] = "WARN"
-            r = subprocess.run(cmd, capture_output=True, text=True, timeout=180, env=env)
-            combined = r.stdout + r.stderr
-            # Check for NCCL/CUDA compatibility errors
-            if "CUDA driver version is insufficient" in combined or \
-               "Test NCCL failure" in combined:
-                error_msg = "NCCL/CUDA driver version mismatch" \
-                    if "CUDA driver version" in combined \
-                    else "NCCL test failure (library incompatibility)"
-                return {"status": "FAIL", "error": error_msg}
-            if r.returncode != 0:
-                return {"status": "FAIL", "error": r.stderr[:300]}
-            return self._parse_nccl_output(r.stdout, min_bw)
-        except subprocess.TimeoutExpired:
-            return {"status": "FAIL", "error": "timeout"}
-        except Exception as e:
-            return {"status": "FAIL", "error": str(e)}
+        return self._run_nccl_matrix([binary, "-g", str(gpu_count)], min_bw)
    def _run_one_nccl_test_mpirun(self, binary_name: str, label: str,
                                   gpu_count: int, mpirun: str, min_bw: float) -> dict:
@ -218,37 +220,64 @@ class NCCLTest:
            "-x", "NCCL_DEBUG=WARN",
            "-x", "CUDA_VISIBLE_DEVICES=" + ",".join(str(i) for i in range(gpu_count)),
            binary,
            "-b", "8",
            "-e", "256M",
            "-f", "2",
            "-g", "1",
            "-w", "5",
            "-n", "20",
        ]
        return self._run_nccl_matrix(cmd, min_bw)
    def _run_nccl_matrix(self, base_cmd: list[str], min_bw: float) -> dict:
        size_results = []
        failures = []
        env = self._runtime_env()
        try:
-            env = os.environ.copy()
-            env["NCCL_DEBUG"] = "WARN"
-            r = subprocess.run(cmd, capture_output=True, text=True, timeout=180, env=env)
-
+            for size in self._message_sizes():
+                runs = []
+                for _ in range(self._repeats()):
+                    cmd = [*base_cmd, "-b", size, "-e", size, "-f", "2", "-w", "5", "-n", "20"]
+                    r = subprocess.run(cmd, capture_output=True, text=True, timeout=300, env=env)
                    combined = r.stdout + r.stderr
-            if "CUDA driver version is insufficient" in combined or \
-               "Test NCCL failure" in combined:
-                error_msg = "NCCL/CUDA driver version mismatch" \
-                    if "CUDA driver version" in combined \
-                    else "NCCL test failure (library incompatibility)"
-                return {"status": "FAIL", "error": error_msg}
+                    if "CUDA driver version is insufficient" in combined or "Test NCCL failure" in combined:
+                        failures.append({"size": size, "error": "NCCL/CUDA/library failure"})
+                        continue
                    if r.returncode != 0:
-                return {"status": "FAIL", "error": r.stderr[:300]}
-
-            return self._parse_nccl_output(r.stdout, min_bw)
+                        failures.append({"size": size, "error": r.stderr[:300]})
+                        continue
+                    parsed = self._parse_nccl_output(r.stdout, min_bw)
+                    runs.append(parsed.get("best_busbw_gbps", 0))
                if runs:
                    worst = min(runs)
                    mean = sum(runs) / len(runs)
                    std_pct = (statistics.pstdev(runs) / mean * 100) if len(runs) > 1 and mean else 0
                    size_results.append({
                        "size": size,
                        "runs_busbw_gbps": [round(v, 1) for v in runs],
                        "worst_busbw_gbps": round(worst, 1),
                        "mean_busbw_gbps": round(mean, 1),
                        "stddev_pct": round(std_pct, 2),
                        "status": "PASS" if worst >= min_bw and std_pct <= self._max_stddev_pct() else "FAIL",
                    })
                else:
                    size_results.append({"size": size, "status": "FAIL", "runs_busbw_gbps": []})
        except subprocess.TimeoutExpired:
            return {"status": "FAIL", "error": "timeout"}
        except Exception as e:
            return {"status": "FAIL", "error": str(e)}
        best_bus = max((r.get("mean_busbw_gbps", 0) for r in size_results), default=0)
        worst_bus = min((r.get("worst_busbw_gbps", 0) for r in size_results if r.get("runs_busbw_gbps")), default=0)
        passed = bool(size_results) and all(r.get("status") == "PASS" for r in size_results) and not failures
        return {
            "status": "PASS" if passed else "FAIL",
            "best_busbw_gbps": round(best_bus, 1),
            "worst_busbw_gbps": round(worst_bus, 1),
            "min_required_gbps": min_bw,
            "max_stddev_pct": self._max_stddev_pct(),
            "by_size": size_results,
            "failures": failures,
        }
    @staticmethod
    def _parse_nccl_output(stdout: str, min_bw: float) -> dict:
        """Parse nccl-tests tabular output and extract bandwidth results."""
@ -363,7 +392,7 @@ dist.destroy_process_group()
            r = subprocess.run(
                [torchrun_cmd, f"--nproc_per_node={gpu_count}", tmp.name],
                capture_output=True, text=True, timeout=120,
-                env={**os.environ, "NCCL_DEBUG": "WARN"},
+                env=self._runtime_env(),
            )
            os.unlink(tmp.name)
@ -390,10 +419,15 @@ dist.destroy_process_group()
                }
            return {
-                "passed": all_passed,
+                # The torchrun fallback is only a functional smoke test. It never measures
                # production bus bandwidth, so it must not satisfy acceptance.
                "passed": False,
                "functional_passed": all_passed,
                "source": "torchrun_fallback",
                "tests": tests,
                "gpu_count": gpu_count,
                "error": None if all_passed else "torchrun functional NCCL smoke failed",
                "acceptance_gap": "nccl-tests bus bandwidth was not measured",
            }
        except Exception as e:
            return {"passed": False, "source": "torchrun_fallback", "error": str(e)}
@ -410,7 +444,8 @@ dist.destroy_process_group()
        if source == "torchrun_fallback":
            # Connectivity check mode
-            verdict = "[bold green]✓ NCCL Connectivity OK[/bold green]" if passed else "[bold red]✗ NCCL Connectivity FAILED[/bold red]"
+            functional = results.get("functional_passed", passed)
            verdict = "[bold yellow]⚠ NCCL bus BW NOT VERIFIED[/bold yellow]" if functional else "[bold red]✗ NCCL Connectivity FAILED[/bold red]"
            c.print(f"{verdict} [dim](basic check via torchrun)[/dim]")
            tests = results.get("tests", {})
@ -427,7 +462,7 @@ dist.destroy_process_group()
                    else:
                        c.print(f"  [{s_color}]{op_name}[/{s_color}]")
-            c.print("\n[yellow]Note: functional connectivity test only (no performance data)[/yellow]")
+            c.print("\n[yellow]Note: functional connectivity test only (no bus bandwidth data; acceptance FAIL)[/yellow]")
        else:
            # nccl-tests mode
            verdict = "[bold green]✓ NCCL tests PASSED[/bold green]" if passed else "[bold yellow]⚠ NCCL tests WARNING[/bold yellow]"
@ -448,12 +483,16 @@ dist.destroy_process_group()
                if by_size:
                    t = Table(box=None, padding=(0, 1))
                    t.add_column("Size", style="bold", justify="right")
-                    t.add_column("Time (us)", justify="right")
+                    t.add_column("Worst Bus BW", justify="right")
-                    t.add_column("Alg BW (GB/s)", justify="right")
+                    t.add_column("Mean Bus BW", justify="right")
-                    t.add_column("Bus BW (GB/s)", justify="right")
+                    t.add_column("StdDev", justify="right")
                    t.add_column("Status", justify="right")
                    for r in by_size:
-                        sz = r.get("size", 0)
+                        t.add_row(
-                        sz_str = f"{sz/1024:.0f}K" if sz < 1048576 else f"{sz/1048576:.0f}M"
+                            str(r.get("size", "")),
-                        t.add_row(sz_str, f"{r.get('time_us',0):.1f}",
+                            f"{r.get('worst_busbw_gbps', 0):.1f}",
-                                  f"{r.get('algbw_gbps',0):.1f}", f"{r.get('busbw_gbps',0):.1f}")
+                            f"{r.get('mean_busbw_gbps', 0):.1f}",
                            f"{r.get('stddev_pct', 0):.2f}%",
                            r.get("status", "?"),
                        )
                    c.print(t)
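The per-size verdict built above (worst run against the bus-bandwidth floor, plus a cap on run-to-run spread) can be sketched in isolation as follows. This is a minimal illustration, not the module's code: `summarize_size` is a hypothetical helper name, the spread is assumed to be the population standard deviation relative to the mean (the diff does not show how `std_pct` is computed), and the sample run values are made up for the example.

```python
import statistics


def summarize_size(size, runs, min_bw, max_stddev_pct):
    """Hypothetical helper: summarise repeated bus-BW runs for one message size."""
    if not runs:
        # No parsed runs for this size counts as a failure, as in the diff above.
        return {"size": size, "status": "FAIL", "runs_busbw_gbps": []}
    worst = min(runs)
    mean = statistics.fmean(runs)
    # Assumption: spread = population stddev as a percentage of the mean.
    std_pct = statistics.pstdev(runs) / mean * 100.0 if mean > 0 else 0.0
    ok = worst >= min_bw and std_pct <= max_stddev_pct
    return {
        "size": size,
        "runs_busbw_gbps": [round(v, 1) for v in runs],
        "worst_busbw_gbps": round(worst, 1),
        "mean_busbw_gbps": round(mean, 1),
        "stddev_pct": round(std_pct, 2),
        "status": "PASS" if ok else "FAIL",
    }


if __name__ == "__main__":
    # Illustrative inputs only: three repeats at one size, 405 GB/s floor, 3% spread cap.
    print(summarize_size("256M", [420.0, 422.0, 421.0], 405.0, 3.0))
    # The worst run clears the floor here, but the spread does not.
    print(summarize_size("256M", [410.0, 450.0, 410.0], 405.0, 3.0))
```

Judging on the worst run rather than the mean means a single slow repeat fails the size, which matches the strict reading of the acceptance criteria used elsewhere in this commit.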
--- a/modules/nvlink_test.py
+++ b/modules/nvlink_test.py
@ -0,0 +1,188 @@
 """NVLink / NVSwitch production acceptance checks."""
 import re
 import shutil
 import subprocess
 from datetime import datetime
 from typing import Optional
 from rich.console import Console
 from rich.table import Table
 class NVLinkTest:
    def __init__(self, config: dict):
        self.config = config
        self.console = Console()
        self.cfg = config.get("nvlink", {})
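    def _run(self, args: list[str], timeout: int = 60) -> tuple[int, str, str]:
        if not shutil.which("nvidia-smi"):
            return 127, "", "nvidia-smi not found"
        try:
            r = subprocess.run(["nvidia-smi", *args], capture_output=True, text=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            # Report a hung nvidia-smi as a failed command (124 mirrors timeout(1)) instead of raising.
            return 124, "", f"nvidia-smi {' '.join(args)} timed out after {timeout}s"
        return r.returncode, r.stdout, r.stderr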
    def run(self) -> dict:
        expected_links = int(self.cfg.get("expected_links_per_gpu", 18))
        expected_speed = float(self.cfg.get("expected_link_speed_gbps", 25))
        require_zero_errors = bool(self.cfg.get("require_zero_errors", True))
        rc_s, out_s, err_s = self._run(["nvlink", "-s"])
        rc_c, out_c, err_c = self._run(["nvlink", "-c"])
        rc_e, out_e, err_e = self._run(["nvlink", "-e"])
        if rc_s != 0:
            return {
                "passed": False,
                "error": (err_s or out_s or "nvidia-smi nvlink -s failed")[:1000],
                "timestamp": datetime.now().isoformat(),
            }
        links = self._parse_status(out_s)
        if not links:
            return {
                "passed": False,
                "error": "no NVLink status entries parsed from nvidia-smi nvlink -s",
                "raw_status": out_s[-4000:],
                "timestamp": datetime.now().isoformat(),
            }
        speeds = self._parse_speeds(out_c) if rc_c == 0 else {}
        status_speeds = self._parse_speeds(out_s)
        for gpu, gpu_speeds in status_speeds.items():
            # Speeds from `nvlink -c` take precedence; `nvlink -s` only fills gaps.
            merged = speeds.setdefault(gpu, {})
            for lid, speed in gpu_speeds.items():
                merged.setdefault(lid, speed)
        errors = self._parse_errors(out_e) if rc_e == 0 else {}
        gpu_results = []
        overall = True
        for gpu, gpu_links in sorted(links.items(), key=lambda x: int(x[0])):
            active = sum(1 for info in gpu_links.values() if info.get("active"))
            inactive = [lid for lid, info in gpu_links.items() if not info.get("active")]
            speed_bad = []
            for lid in gpu_links:
                speed = speeds.get(gpu, {}).get(lid)
                if speed is not None and speed < expected_speed:
                    speed_bad.append({"link": lid, "speed_gbps": speed})
            err_bad = []
            if require_zero_errors:
                for lid, counters in errors.get(gpu, {}).items():
                    total = sum(v for v in counters.values() if isinstance(v, int))
                    if total:
                        err_bad.append({"link": lid, "counters": counters})
            passed = active == expected_links and not inactive and not speed_bad and not err_bad
            if not passed:
                overall = False
            gpu_results.append({
                "gpu": int(gpu),
                "active_links": active,
                "expected_links": expected_links,
                "inactive_links": inactive,
                "speed_issues": speed_bad,
                "error_issues": err_bad,
                "passed": passed,
            })
        return {
            "passed": overall,
            "expected_links_per_gpu": expected_links,
            "expected_link_speed_gbps": expected_speed,
            "require_zero_errors": require_zero_errors,
            "gpus": gpu_results,
            "raw_status": out_s[-4000:],
            "raw_speed": out_c[-4000:] if out_c else "",
            "raw_errors": out_e[-4000:] if out_e else "",
            "timestamp": datetime.now().isoformat(),
        }
    @staticmethod
    def _parse_status(text: str) -> dict[str, dict[str, dict]]:
        result: dict[str, dict[str, dict]] = {}
        gpu = None
        for line in text.splitlines():
            m_gpu = re.search(r"GPU\s+(\d+)", line, re.I)
            if m_gpu:
                gpu = m_gpu.group(1)
                result.setdefault(gpu, {})
                continue
            if gpu is None:
                continue
            m_link = re.search(r"Link\s+(\d+).*?(Active|Inactive|Disabled|Off|Down)", line, re.I)
            if m_link:
                state = m_link.group(2)
                result[gpu][m_link.group(1)] = {
                    "state": state,
                    "active": state.lower() == "active",
                    "raw": line.strip(),
                }
                continue
            m_speed = re.search(r"Link\s+(\d+).*?([0-9.]+)\s*GB/s", line, re.I)
            if m_speed:
                result[gpu][m_speed.group(1)] = {
                    "state": "Active",
                    "active": True,
                    "raw": line.strip(),
                }
        return result
    @staticmethod
    def _parse_speeds(text: str) -> dict[str, dict[str, float]]:
        result: dict[str, dict[str, float]] = {}
        gpu = None
        for line in text.splitlines():
            m_gpu = re.search(r"GPU\s+(\d+)", line, re.I)
            if m_gpu:
                gpu = m_gpu.group(1)
                result.setdefault(gpu, {})
                continue
            if gpu is None:
                continue
            m_link = re.search(r"Link\s+(\d+).*?([0-9.]+)\s*GB/s", line, re.I)
            if m_link:
                result[gpu][m_link.group(1)] = float(m_link.group(2))
        return result
    @staticmethod
    def _parse_errors(text: str) -> dict[str, dict[str, dict[str, int]]]:
        result: dict[str, dict[str, dict[str, int]]] = {}
        gpu = None
        link = None
        for line in text.splitlines():
            m_gpu = re.search(r"GPU\s+(\d+)", line, re.I)
            if m_gpu:
                gpu = m_gpu.group(1)
                result.setdefault(gpu, {})
                continue
            m_link = re.search(r"Link\s+(\d+)", line, re.I)
            if m_link and gpu is not None:
                link = m_link.group(1)
                result[gpu].setdefault(link, {})
            if gpu is None or link is None:
                continue
            for name in ("CRC", "Replay", "Recovery"):
                m = re.search(rf"{name}[^0-9]*(\d+)", line, re.I)
                if m:
                    result[gpu][link][name.lower()] = int(m.group(1))
        return result
    @staticmethod
    def print_results(results: dict, console: Optional[Console] = None):
        c = console or Console()
        if results.get("error"):
            c.print(f"[bold red]NVLink error: {results['error']}[/bold red]")
            return
        passed = results.get("passed", False)
        c.print("[bold green]✓ NVLink PASSED[/bold green]" if passed else "[bold red]✗ NVLink FAILED[/bold red]")
        table = Table(box=None, padding=(0, 1))
        table.add_column("GPU", style="bold")
        table.add_column("Active Links", justify="right")
        table.add_column("Issues")
        for g in results.get("gpus", []):
            issues = []
            if g.get("inactive_links"):
                issues.append("inactive=" + ",".join(g["inactive_links"]))
            if g.get("speed_issues"):
                issues.append(f"speed={len(g['speed_issues'])}")
            if g.get("error_issues"):
                issues.append(f"errors={len(g['error_issues'])}")
            table.add_row(str(g["gpu"]), f"{g['active_links']}/{g['expected_links']}", "; ".join(issues) or "OK")
        c.print(table)
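As a rough illustration of what `_parse_status` and `_parse_speeds` do together, the sketch below runs the same regular expressions over a small sample. The sample text is an assumption about the shape of `nvidia-smi nvlink -s` output (one `GPU N:` header followed by `Link N:` lines); real output may differ by driver version. `parse_links` is a hypothetical name, not a function in the module.

```python
import re

# Assumed sample of `nvidia-smi nvlink -s` output; UUIDs are placeholders.
SAMPLE = (
    "GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-00000000)\n"
    "\t Link 0: 26.562 GB/s\n"
    "\t Link 1: <inactive>\n"
    "GPU 1: NVIDIA H100 80GB HBM3 (UUID: GPU-11111111)\n"
    "\t Link 0: 26.562 GB/s\n"
    "\t Link 1: 26.562 GB/s\n"
)


def parse_links(text):
    """Hypothetical helper: map gpu -> link -> {active, speed_gbps}."""
    links = {}
    gpu = None
    for line in text.splitlines():
        m_gpu = re.search(r"GPU\s+(\d+)", line, re.I)
        if m_gpu:
            gpu = m_gpu.group(1)
            links.setdefault(gpu, {})
            continue
        if gpu is None:
            continue
        # An explicit state keyword wins over a speed value on the same line.
        m_state = re.search(r"Link\s+(\d+).*?(Active|Inactive|Disabled|Off|Down)", line, re.I)
        if m_state:
            active = m_state.group(2).lower() == "active"
            links[gpu][m_state.group(1)] = {"active": active, "speed_gbps": None}
            continue
        # A line that reports a speed is treated as an active link.
        m_speed = re.search(r"Link\s+(\d+).*?([0-9.]+)\s*GB/s", line, re.I)
        if m_speed:
            links[gpu][m_speed.group(1)] = {"active": True, "speed_gbps": float(m_speed.group(2))}
    return links


if __name__ == "__main__":
    for gpu, gpu_links in parse_links(SAMPLE).items():
        active = sum(1 for info in gpu_links.values() if info["active"])
        print(f"GPU {gpu}: {active}/{len(gpu_links)} links active")
```

With the default `expected_links_per_gpu` of 18 used in `run()`, a GPU like the sample's GPU 0 would fail on both the active-link count and the inactive-link list.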
--- a/modules/report.py
+++ b/modules/report.py
@ -93,8 +93,8 @@ class ReportGenerator:
    def _generate_html(self, results: dict, output: str) -> str:
        import socket
-        hostname = socket.gethostname()
+        hostname = results.get("hostname") or socket.gethostname()
-        timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
+        timestamp = results.get("timestamp") or datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        sections = []
@ -178,8 +178,8 @@ class ReportGenerator:
    def _generate_markdown(self, results: dict, output: str) -> str:
        import socket
-        hostname = socket.gethostname()
+        hostname = results.get("hostname") or socket.gethostname()
-        timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
+        timestamp = results.get("timestamp") or datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        lines: list[str] = []
@ -201,6 +201,21 @@ class ReportGenerator:
        # --- Summary table ---
        summary_items = self._build_summary(results)
        if summary_items:
            verdict, failures, missing = self._overall_acceptance_verdict(summary_items)
            lines.append("## Overall Acceptance Verdict\n")
            lines.append(f"**Result: {verdict}**")
            lines.append("")
            if failures:
                lines.append("Failed or unverified items:")
                for name, status in failures:
                    lines.append(f"- {name}: {status}")
                lines.append("")
            if missing:
                lines.append("Missing required evidence:")
                for name in missing:
                    lines.append(f"- {name}")
                lines.append("")
            lines.append("## Summary\n")
            lines.append("| Test | Result |")
            lines.append("|------|--------|")
@ -319,8 +334,6 @@ class ReportGenerator:
                    if use_abs and thr:
                        if val >= thr:
                            status = "PASS"
                        elif val >= thr * 0.9:
                            status = "WARN"
                        else:
                            status = "FAIL"
                        lines.append(f"| {dt.upper()} | {val:.1f} | {pk:.0f} | >= {thr} | {status} |")
@ -331,29 +344,122 @@ class ReportGenerator:
                        overall_status = status
            lines.append("")
            if use_abs:
                if any(not row.get("passed", False) for row in (comp_data.get("consistency", {}) or {}).values()):
                    overall_status = "FAIL"
                lines.append(f"**Verdict: {overall_status}** (absolute TFLOPS thresholds; worst efficiency {worst_eff:.1f}%)\n")
            else:
                overall_status = "PASS" if worst_eff >= 80 else ("WARN" if worst_eff >= 50 else "FAIL")
                lines.append(f"**Verdict: {overall_status}** (worst efficiency {worst_eff:.1f}%)\n")
            consistency = comp_data.get("consistency", {}) or {}
            if consistency:
                lines.append("### Compute Consistency\n")
                lines.append("| DType | Min | Mean | Max | Spread | Limit | Status |")
                lines.append("|-------|-----|------|-----|--------|-------|--------|")
                for dt, row in consistency.items():
                    status = "PASS" if row.get("passed") else "FAIL"
                    lines.append(
                        f"| {dt.upper()} | {row.get('min_tflops', 0):.1f} | "
                        f"{row.get('mean_tflops', 0):.1f} | {row.get('max_tflops', 0):.1f} | "
                        f"{row.get('spread_pct', 0):.2f}% | <= {row.get('max_allowed_pct', 3)}% | {status} |"
                    )
                lines.append("")
            per_gpu = comp_data.get("per_gpu", []) or []
            dtype_order = [dt for dt in per_dtype.keys() if not isinstance(per_dtype.get(dt), str)]
            if per_gpu and dtype_order:
                lines.append("### Compute Per-GPU TFLOPS\n")
                headers = ["GPU", *[dt.upper() for dt in dtype_order]]
                lines.append("| " + " | ".join(headers) + " |")
                lines.append("|" + "|".join(["---"] * len(headers)) + "|")
                for row in per_gpu:
                    cells = [str(row.get("index", ""))]
                    for dt in dtype_order:
                        val = row.get(dt, "")
                        cells.append(f"{val:.1f}" if isinstance(val, (int, float)) else str(val))
                    lines.append("| " + " | ".join(cells) + " |")
                lines.append("")
        # --- NVLink/NVSwitch ---
        nvlink = results.get("nvlink")
        if nvlink and not nvlink.get("error"):
            lines.append("## NVLink/NVSwitch\n")
            lines.append(f"**Overall: {'PASS' if nvlink.get('passed') else 'FAIL'}**\n")
            lines.append("| GPU | Active Links | Issues |")
            lines.append("|-----|--------------|--------|")
            for g in nvlink.get("gpus", []):
                issues = []
                if g.get("inactive_links"):
                    issues.append("inactive=" + ",".join(g["inactive_links"]))
                if g.get("speed_issues"):
                    issues.append(f"speed issues={len(g['speed_issues'])}")
                if g.get("error_issues"):
                    issues.append(f"errors={len(g['error_issues'])}")
                lines.append(f"| {g.get('gpu')} | {g.get('active_links')}/{g.get('expected_links')} | {', '.join(issues) or 'OK'} |")
            lines.append("")
        elif nvlink and nvlink.get("error"):
            lines.append("## NVLink/NVSwitch\n")
            lines.append(f"**Overall: FAIL** ({nvlink.get('error')})\n")
        dcgm = results.get("dcgm")
        if dcgm and not dcgm.get("error"):
            lines.append("## DCGM Diagnostic\n")
            lines.append(f"**Overall: {'PASS' if dcgm.get('passed') else 'FAIL'}**\n")
            if dcgm.get("subtests"):
                lines.append("| Subtest | Status |")
                lines.append("|---------|--------|")
                for s in dcgm.get("subtests", []):
                    lines.append(f"| {s.get('name', '')} | {s.get('status', '')} |")
                lines.append("")
        elif dcgm and dcgm.get("error"):
            lines.append("## DCGM Diagnostic\n")
            lines.append(f"**Overall: FAIL** ({dcgm.get('error')})\n")
        # --- NCCL ---
        nccl = results.get("nccl")
        if nccl and not nccl.get("error"):
            lines.append("## NCCL Multi-GPU\n")
            lines.append(f"Source: {nccl.get('source', 'unknown')} | "
                         f"GPUs: {nccl.get('gpu_count', '?')}\n")
            if nccl.get("source") == "torchrun_fallback":
                lines.append("> Functional NCCL smoke only: nccl-tests bus bandwidth was not measured, so this does not satisfy production acceptance.\n")
            tests = nccl.get("tests", {})
            if tests:
-                lines.append("| Operation | Bus BW (GB/s) | Threshold | Status |")
+                lines.append("> Summary reports the best Bus BW observed for each operation. PASS/FAIL is evaluated across every tested message size and repeat run shown in the detail table below.\n")
-                lines.append("|-----------|---------------|-----------|--------|")
+                lines.append("| Operation | Best Bus BW (GB/s) | Failed Sizes | Threshold | Status |")
                lines.append("|-----------|--------------------|--------------|-----------|--------|")
                for op, data in tests.items():
                    if isinstance(data, dict) and not data.get("error"):
                        bw = data.get("best_busbw_gbps", 0)
                        req = data.get("min_required_gbps", 0)
                        status = data.get("status", "?")
-                        lines.append(f"| {op} | {bw:.1f} | >= {req:.0f} | {status} |")
+                        failed_sizes = [
                            str(row.get("size", "?"))
                            for row in data.get("by_size", [])
                            if row.get("status") != "PASS"
                        ]
                        failed_sizes_text = ", ".join(failed_sizes) if failed_sizes else "-"
                        lines.append(f"| {op} | {bw:.1f} | {failed_sizes_text} | >= {req:.0f} | {status} |")
                    elif isinstance(data, dict) and data.get("error"):
-                        lines.append(f"| {op} | - | - | ERROR: {data['error']} |")
+                        lines.append(f"| {op} | - | - | - | ERROR: {data['error']} |")
                lines.append("")
                for op, data in tests.items():
                    by_size = data.get("by_size", []) if isinstance(data, dict) else []
                    if not by_size:
                        continue
                    lines.append(f"### NCCL {op} by size\n")
                    lines.append("| Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |")
                    lines.append("|------|---------------------|-------|------|--------|-----------|--------|")
                    for row in by_size:
                        runs = ", ".join(str(v) for v in row.get("runs_busbw_gbps", []))
                        lines.append(
                            f"| {row.get('size', '')} | {runs} | "
                            f"{row.get('worst_busbw_gbps', 0):.1f} | "
                            f"{row.get('mean_busbw_gbps', 0):.1f} | "
                            f"{row.get('stddev_pct', 0):.2f}% | "
                            f">= {data.get('min_required_gbps', 0):.0f} | "
                            f"{row.get('status', '?')} |"
                        )
                    lines.append("")
            passed = nccl.get("passed", False)
            lines.append(f"**Overall: {'PASS' if passed else 'FAIL'}**\n")
@ -368,6 +474,21 @@ class ReportGenerator:
            source = stress.get("source", "unknown")
            lines.append(f"- **Source:** {source}")
            lines.append(f"- **Duration:** {elapsed:.0f}s (requested {duration}s)")
            telemetry = stress.get("telemetry") or {}
            if telemetry:
                lines.append(f"- **Telemetry samples:** {telemetry.get('samples', 0)}")
                lines.append(f"- **Max temp:** {telemetry.get('max_temp_c', {})}")
                lines.append(f"- **Avg power:** {telemetry.get('avg_power_w', {})}")
                lines.append(f"- **Temp delta:** {telemetry.get('temp_delta_c', 'N/A')} C")
                lines.append(f"- **TFLOPS jitter:** {telemetry.get('tflops_jitter_pct', 'N/A')}%")
                lines.append(f"- **Steady TFLOPS samples:** {telemetry.get('steady_tflops_samples', 0)}")
                lines.append(f"- **Throttle events:** {telemetry.get('throttle_event_count', len(telemetry.get('throttle_events', [])))}")
                lines.append(f"- **XID events:** {len(telemetry.get('xid_events', []))}")
                failures = telemetry.get("failures") or []
                if failures:
                    lines.append("- **Failure reasons:**")
                    for reason in failures:
                        lines.append(f"  - {reason}")
            lines.append(f"- **Result: {'PASS' if passed else 'FAIL'}**")
            lines.append("")
@ -378,26 +499,70 @@ class ReportGenerator:
            lines.append(f"**Overall: SKIP** [{rdma.get('reason', 'no IB hardware detected')}]\n")
        elif rdma and not rdma.get("error"):
            lines.append("## RDMA/InfiniBand\n")
            rdma_legacy_note = self._rdma_legacy_note(rdma)
            if rdma_legacy_note:
                lines.append(f"> {rdma_legacy_note}\n")
            port_checks = rdma.get("port_checks", [])
            if port_checks:
                lines.append("### RDMA Port Checks\n")
                lines.append("| Device | Port | State | Rate | Required | Status |")
                lines.append("|--------|------|-------|------|----------|--------|")
                for p in port_checks:
                    lines.append(
                        f"| {p.get('device', '')} | {p.get('port', '')} | "
                        f"{p.get('state', '')} | {p.get('rate', '')} | "
                        f">= {p.get('min_rate_gbps', 400):.0f}Gbps ACTIVE | {p.get('status', '?')} |"
                    )
                lines.append("")
            bw_tests = rdma.get("bandwidth_tests", [])
            lat_tests = rdma.get("latency_tests", [])
-            if bw_tests or lat_tests:
+            ibping_tests = rdma.get("ibping_tests", [])
            if bw_tests or lat_tests or ibping_tests:
                lines.append("| Test | Value | Threshold | Status |")
                lines.append("|------|-------|-----------|--------|")
                for bt in bw_tests:
-                    if not bt.get("error"):
+                    if bt.get("error"):
                        lines.append(f"| {bt.get('test', 'ib_bw')} | {bt.get('error')} | required runnable test | {bt.get('status', 'FAIL')} |")
                    else:
                        threshold, status = self._rdma_bandwidth_verdict(bt)
                        lines.append(f"| {bt['test']} | {bt.get('bandwidth_gbps', 0):.1f} GB/s | "
-                                     f">= {bt.get('min_required_gbps', 0)} GB/s | {bt.get('status', '?')} |")
+                                     f">= {threshold:g} GB/s | {status} |")
                for lt in lat_tests:
-                    if not lt.get("error"):
+                    if lt.get("error"):
                        lines.append(f"| {lt.get('test', 'ib_lat')} | {lt.get('error')} | required runnable test | {lt.get('status', 'FAIL')} |")
                    else:
                        threshold, status = self._rdma_latency_verdict(lt)
                        lines.append(f"| {lt['test']} | {lt.get('latency_us', 0):.2f} us | "
-                                     f"<= {lt.get('max_allowed_us', 0)} us | {lt.get('status', '?')} |")
+                                     f"<= {threshold:g} us | {status} |")
                for it in ibping_tests:
                    direction = it.get("direction") or it.get("role", "N/A")
                    if it.get("error"):
                        lines.append(f"| {it.get('test', 'ibping')} | {it.get('error')} | bidirectional peer evidence | {it.get('status', 'FAIL')} |")
                    else:
                        lines.append(f"| {it['test']} | {direction} target={it.get('target', 'N/A')} count={it.get('count', 'N/A')} | "
                                     f"0% packet loss | {it.get('status', '?')} |")
                lines.append("")
            fabric = rdma.get("fabric_counters") or {}
            if fabric:
                counters = fabric.get("counters", {})
                lines.append(f"- **PFC/ECN/CNP/congestion counters checked:** {len(counters)}")
                lines.append(f"- **PFC/ECN/CNP/congestion non-zero:** {'yes' if fabric.get('failed') else 'no'}")
                if not counters:
                    lines.append("- **PFC/ECN/CNP/congestion evidence:** missing")
            failures = rdma.get("failures") or []
            if not failures:
                failures = self._rdma_failure_reasons(rdma)
            if failures:
                lines.append("- **Failure reasons:**")
                for reason in failures:
                    lines.append(f"  - {reason}")
            passed = rdma.get("passed", False)
            lines.append(f"**Overall: {'PASS' if passed else 'FAIL'}**\n")
        # --- Training ---
        training = results.get("training")
        if training and not training.get("error"):
            training_status, training_detail, training_missing = self._training_verdict(training)
            lines.append("## Training Simulation\n")
            lines.append("| Metric | Value |")
            lines.append("|--------|-------|")
@ -405,8 +570,14 @@ class ReportGenerator:
            lines.append(f"| Params | {training.get('total_params_m', 0):.1f}M |")
            lines.append(f"| Throughput | {training.get('throughput_tokens_per_sec', 0):.0f} tokens/sec |")
            lines.append(f"| Avg Step Time | {training.get('avg_step_time_ms', 0):.1f} ms |")
            lines.append(f"| Warmup Steps | {training.get('warmup_steps', 'N/A')} |")
            lines.append(f"| Peak Memory | {training.get('peak_memory_gb', 0):.1f} GB |")
            lines.append(f"| Final Loss | {training.get('final_loss', 'N/A')} |")
            lines.append(f"| Step Jitter | {training.get('step_jitter_pct', 'N/A')}% |")
            lines.append(f"| Distributed Mode | {training.get('distributed_mode', 'N/A')} |")
            if training_missing:
                lines.append(f"| Acceptance Gaps | missing {', '.join(training_missing)} |")
            lines.append(f"| Verdict | {training_status} ({training_detail}) |")
            lines.append("")
        # --- Footer ---
@ -441,6 +612,101 @@ class ReportGenerator:
                return bench["compute"]
        return {}
    @staticmethod
    def _training_verdict(training: dict) -> tuple[str, str, list[str]]:
        """Return report status for both current and legacy training result schemas."""
        tps = float(training.get("throughput_tokens_per_sec", 0) or 0)
        if "passed" in training:
            status = "PASS" if training.get("passed") else "FAIL"
            return status, f"{tps:.0f} tokens/sec", []
        required = ["passed", "step_jitter_pct", "distributed_mode", "loss_finite"]
        missing = [k for k in required if k not in training]
        return "UNVERIFIED", f"{tps:.0f} tokens/sec; legacy result lacks explicit acceptance verdict", missing
    def _rdma_cfg_value(self, key: str, default: float) -> float:
        try:
            return float((self.config.get("rdma", {}) or {}).get(key, default))
        except (TypeError, ValueError):
            return default
    def _rdma_bandwidth_verdict(self, row: dict) -> tuple[float, str]:
        threshold = self._rdma_cfg_value("min_bandwidth_gbps", 47.0)
        value = float(row.get("bandwidth_gbps", 0) or 0)
        return threshold, "PASS" if value >= threshold else "FAIL"
    def _rdma_latency_verdict(self, row: dict) -> tuple[float, str]:
        name = row.get("test", "")
        if name == "ib_write_lat":
            threshold = self._rdma_cfg_value("max_write_latency_us", 2.0)
        elif name == "ib_read_lat":
            threshold = self._rdma_cfg_value("max_read_latency_us", 3.5)
        else:
            threshold = self._rdma_cfg_value("max_latency_us", 3.5)
        value = float(row.get("latency_us", 0) or 0)
        return threshold, "PASS" if 0 < value <= threshold else "FAIL"
    def _rdma_legacy_note(self, rdma: dict) -> str:
        """Flag old RDMA result schemas whose embedded thresholds were looser."""
        for row in rdma.get("bandwidth_tests", []) or []:
            if row.get("min_required_gbps") != self._rdma_cfg_value("min_bandwidth_gbps", 47.0):
                return (
                    "Legacy RDMA result re-evaluated with current PDF acceptance thresholds; "
                    "old WARN statuses and old 50GB/s/10us limits are not used for verdict."
                )
        for row in rdma.get("latency_tests", []) or []:
            threshold, _ = self._rdma_latency_verdict(row)
            if row.get("max_allowed_us") != threshold:
                return (
                    "Legacy RDMA result re-evaluated with current PDF acceptance thresholds; "
                    "old WARN statuses and old 50GB/s/10us limits are not used for verdict."
                )
        return ""
    def _rdma_failure_reasons(self, rdma: dict) -> list[str]:
        failures = []
        for row in rdma.get("bandwidth_tests", []) or []:
            threshold, status = self._rdma_bandwidth_verdict(row)
            if status != "PASS":
                failures.append(
                    f"{row.get('test')} bandwidth {row.get('bandwidth_gbps', 0)}GB/s < {threshold:g}GB/s"
                )
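        for row in rdma.get("latency_tests", []) or []:
            threshold, status = self._rdma_latency_verdict(row)
            if status != "PASS":
                value = float(row.get("latency_us", 0) or 0)
                # A missing or non-positive latency also fails the verdict, so do not report it as "> threshold".
                detail = f"{value:g}us > {threshold:g}us" if value > 0 else "missing or non-positive"
                failures.append(f"{row.get('test')} latency {detail}")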
        for row in rdma.get("ibping_tests", []) or []:
            if row.get("status") != "PASS":
                failures.append(f"{row.get('test')} failed")
        return failures
    @staticmethod
    def _overall_acceptance_verdict(summary_items: list[tuple[str, str]]) -> tuple[str, list[tuple[str, str]], list[str]]:
        """PDF-style machine verdict: every required item must be present and PASS."""
        required = [
            "GPU Info",
            "Health Check",
            "Memory Bandwidth",
            "Compute Throughput",
            "NVLink/NVSwitch",
            "NCCL",
            "Stress Test",
            "RDMA",
            "DCGM",
            "Training",
        ]
        status_by_name = dict(summary_items)
        missing = [name for name in required if name not in status_by_name]
        failures = [
            (name, status)
            for name, status in summary_items
            if name in required and not str(status).startswith("PASS")
        ]
        verdict = "PASS" if not missing and not failures else "FAIL"
        return verdict, failures, missing
    def _build_summary(self, results: dict) -> list[tuple[str, str]]:
        """Build summary verdict list from results."""
        items = []
@ -473,7 +739,7 @@ class ReportGenerator:
                d2d = mem.get("d2d_bandwidth_gbps") or 0
                items.append(("Memory Bandwidth", f"WARN ({d2d:.0f} GB/s via PyTorch fallback)"))
            else:
-                eff = mem.get("efficiency_pct") or 0
+                eff = mem.get("d2d_efficiency_pct") or mem.get("efficiency_pct") or 0
                verdict = "PASS" if eff >= 80 else ("WARN" if eff >= 60 else "FAIL")
                items.append(("Memory Bandwidth", f"{verdict} ({eff:.1f}%)"))
@ -491,20 +757,38 @@ class ReportGenerator:
                    rank = {"PASS": 0, "WARN": 1, "FAIL": 2}
                    worst_status = "PASS"
                    worst_dt = None
                    lowest_margin = None
                    for dt, thr in pass_thresholds.items():
                        val = per_dtype.get(dt)
                        if not isinstance(val, (int, float)):
                            continue
                        if val >= thr:
                            st = "PASS"
                        elif val >= thr * 0.9:
                            st = "WARN"
                        else:
                            st = "FAIL"
                        margin = val / thr if thr else 0
                        if lowest_margin is None or margin < lowest_margin:
                            lowest_margin = margin
                            worst_dt = dt
                        if rank[st] > rank[worst_status]:
                            worst_status = st
                            worst_dt = dt
                    if worst_dt:
                        consistency = comp.get("consistency", {}) or {}
                        failed_consistency = [
                            (dt, row)
                            for dt, row in consistency.items()
                            if not row.get("passed", False)
                        ]
                        if failed_consistency:
                            worst_status = "FAIL"
                            fail_dt, fail_row = failed_consistency[0]
                            items.append((
                                "Compute Throughput",
                                f"FAIL ({fail_dt.upper()} spread "
                                f"{fail_row.get('spread_pct', 0):.2f}% > "
                                f"{fail_row.get('max_allowed_pct', 3)}%)"
                            ))
                        else:
                            items.append((
                                "Compute Throughput",
                                f"{worst_status} (worst {worst_dt.upper()} "
@ -521,11 +805,32 @@ class ReportGenerator:
                    else:
                        items.append(("Compute Throughput", "N/A"))
        # NVLink/NVSwitch
        if "nvlink" in results:
            nvl = results["nvlink"]
            if nvl.get("error"):
                items.append(("NVLink/NVSwitch", f"ERROR: {nvl['error']}"))
            elif nvl.get("passed"):
                items.append(("NVLink/NVSwitch", "PASS"))
            else:
                items.append(("NVLink/NVSwitch", "FAIL"))
        if "dcgm" in results:
            d = results["dcgm"]
            if d.get("error"):
                items.append(("DCGM", f"ERROR: {d['error']}"))
            elif d.get("passed"):
                items.append(("DCGM", "PASS"))
            else:
                items.append(("DCGM", "FAIL"))
        # NCCL
        if "nccl" in results:
            n = results["nccl"]
            if n.get("error"):
                items.append(("NCCL", f"ERROR: {n['error']}"))
            elif n.get("source") == "torchrun_fallback":
                items.append(("NCCL", "FAIL (no nccl-tests bus BW)"))
            elif n.get("passed"):
                items.append(("NCCL", "PASS"))
            else:
@ -559,7 +864,7 @@ class ReportGenerator:
            if t.get("error"):
                items.append(("Training", f"ERROR: {t['error']}"))
            else:
-                tps = t.get("throughput_tokens_per_sec", 0)
+                status, detail, _missing = self._training_verdict(t)
-                items.append(("Training", f"PASS ({tps:.0f} tokens/sec)"))
+                items.append(("Training", f"{status} ({detail})"))
        return items
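Reviewer note: the aggregation rule in `_overall_acceptance_verdict` can be exercised without the rest of `ReportGenerator`. The sketch below re-implements the same logic as a standalone function; the name `overall_verdict`, the shortened `required` tuple, and the sample status strings are illustrative assumptions, not part of this commit. It shows that a `WARN` status and an absent item both turn the machine verdict into `FAIL`.

```python
from __future__ import annotations


def overall_verdict(
    summary_items: list[tuple[str, str]],
    required: tuple[str, ...],
) -> tuple[str, list[tuple[str, str]], list[str]]:
    """Every required item must be present and its status must start with PASS."""
    status_by_name = dict(summary_items)
    # Required items that never produced a summary row.
    missing = [name for name in required if name not in status_by_name]
    # Required items whose status is anything other than PASS (WARN, FAIL, ERROR, N/A).
    failures = [
        (name, status)
        for name, status in summary_items
        if name in required and not str(status).startswith("PASS")
    ]
    verdict = "PASS" if not missing and not failures else "FAIL"
    return verdict, failures, missing


# Hypothetical summary rows, shaped like the output of _build_summary.
items = [
    ("GPU Info", "PASS"),
    ("Stress Test", "FAIL (temp delta 12.0C)"),
    ("Memory Bandwidth", "WARN (2900 GB/s via PyTorch fallback)"),
]
verdict, failures, missing = overall_verdict(
    items, required=("GPU Info", "Stress Test", "Memory Bandwidth", "DCGM")
)
print(verdict, failures, missing)
```

Because the check is `startswith("PASS")`, a status such as `PASS (92.3%)` still counts as passing, while `WARN (...)` does not; under this rule the PyTorch-fallback memory bandwidth path can never yield an overall `PASS`.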
--- a/modules/stress_test.py
+++ b/modules/stress_test.py
@ -1,9 +1,10 @@
-"""GPU stress test module — wraps gpu-burn for long-running stability tests."""
+"""GPU stress test module — gpu-burn or PyTorch GEMM with telemetry."""
 import glob
 import os
 import shutil
 import subprocess
 import threading
 import time
 from datetime import datetime
@ -46,7 +47,7 @@ class StressTest:
        memory_pct = cfg.get("memory_pct", 90)
        target_gpus = cfg.get("gpus", "all")
-        gpu_burn = self._find_gpu_burn()
+        gpu_burn = self._find_gpu_burn() if cfg.get("use_gpu_burn", False) else ""
        if gpu_burn:
            # Try gpu-burn first
@ -60,7 +61,7 @@ class StressTest:
            return result
-        self.console.print("[yellow]gpu_burn not found, using PyTorch stress test[/yellow]")
+        self.console.print("[yellow]Using PyTorch stress test[/yellow]")
        return self._run_pytorch_stress(duration_sec, memory_pct)
    def _run_gpu_burn(self, gpu_burn: str, duration: int,
@ -77,12 +78,26 @@ class StressTest:
        cmd.append(str(duration))
        t0 = time.time()
        xid_before = self._collect_xid_events()
        interval = int(self.stress_cfg.get("telemetry_interval_sec", 1))
        telemetry = []
        stop_sampling = threading.Event()
        sampler = threading.Thread(
            target=self._sample_telemetry,
            args=(telemetry, stop_sampling, interval),
            daemon=True,
        )
        sampler.start()
        try:
            r = subprocess.run(cmd, capture_output=True, text=True, timeout=duration + 120)
            elapsed = round(time.time() - t0, 1)
            stop_sampling.set()
            sampler.join(timeout=interval + 1)
            output = r.stdout + r.stderr
-            passed = r.returncode == 0
+            xid_events = self._new_xid_events(xid_before, self._collect_xid_events())
            telemetry_summary = self._evaluate_telemetry(telemetry, [], xid_events)
            passed = r.returncode == 0 and telemetry_summary.get("passed", False)
            gpu_results = []
            for line in output.split("\n"):
@ -96,25 +111,36 @@ class StressTest:
                "duration_sec": duration,
                "elapsed_sec": elapsed,
                "gpu_results": gpu_results,
                "telemetry": telemetry_summary,
                "raw_output_tail": output[-500:] if output else "",
                "timestamp": datetime.now().isoformat(),
            }
        except subprocess.TimeoutExpired:
            stop_sampling.set()
            return {
                "source": "gpu-burn",
                "passed": False,
                "duration_sec": duration,
                "error": "timeout",
                "telemetry": self._evaluate_telemetry(
                    telemetry, [], self._new_xid_events(xid_before, self._collect_xid_events())
                ),
                "timestamp": datetime.now().isoformat(),
            }
        except Exception as e:
            stop_sampling.set()
            return {
                "source": "gpu-burn",
                "passed": False,
                "error": str(e),
                "telemetry": self._evaluate_telemetry(
                    telemetry, [], self._new_xid_events(xid_before, self._collect_xid_events())
                ),
                "timestamp": datetime.now().isoformat(),
            }
        finally:
            stop_sampling.set()
    def _run_pytorch_stress(self, duration: int, memory_pct: int = 90) -> dict:
        try:
@ -127,58 +153,79 @@ class StressTest:
        gpu_count = torch.cuda.device_count()
        self.console.print(f"[cyan]PyTorch Stress Test ({duration}s, {gpu_count} GPUs, target {memory_pct}% memory)[/cyan]")
        dtype_name = self.stress_cfg.get("dtype", "bf16")
        matrix_size = int(self.stress_cfg.get("matrix_size", 8192))
        interval = int(self.stress_cfg.get("telemetry_interval_sec", 1))
        dtype_map = {"fp16": torch.float16, "bf16": torch.bfloat16, "fp32": torch.float32}
        dtype = dtype_map.get(dtype_name, torch.bfloat16)
        gpu_status = {}
        telemetry = []
        stop_sampling = threading.Event()
        t0 = time.time()
        xid_before = self._collect_xid_events()
        try:
            sampler = threading.Thread(
                target=self._sample_telemetry,
                args=(telemetry, stop_sampling, interval),
                daemon=True,
            )
            sampler.start()
            tensors = {}
            ballast = {}
            pass_tflops = []
            for i in range(gpu_count):
                with torch.cuda.device(i):
                    # Get actual free memory (accounting for other processes)
                    free_mem, total_mem = torch.cuda.mem_get_info(i)
-                    
+                    side = matrix_size
-                    # Calculate allocation from configured memory_pct
+                    elem = torch.tensor([], dtype=dtype).element_size()
-                    target_mem = int(total_mem * memory_pct / 100)
+                    compute_bytes = side * side * elem * 3
-                    
+                    target_mem = min(int(total_mem * memory_pct / 100), int(free_mem * 0.90))
-                    # Cap at actual free memory with 5% safety margin
+                    ballast_bytes = max(0, target_mem - compute_bytes)
-                    alloc_bytes = min(target_mem, int(free_mem * 0.95))
+                    if ballast_bytes:
-                    
+                        ballast_elems = ballast_bytes // 2
-                    # matmul(A, A.T) needs 2x input memory (input + output)
+                        ballast[i] = torch.empty(ballast_elems, device=f"cuda:{i}", dtype=torch.float16)
-                    mem_side = int((alloc_bytes / 4 / 2) ** 0.5)
+                    actual_mem_mb = (compute_bytes + ballast_bytes) / 1024 / 1024
-                    # Cap compute matrix so a single matmul completes in ~2s on H100/H200
-                    # (FP32 ≈ 67 TFLOPS → 2*4096³/67e12 ≈ 2s). Without this cap, a 141GB
-                    # HBM yields side ≈ 131K → single matmul ~68s × 8 GPUs serial → loop
-                    # overshoots a 60s duration request by 10×+.
-                    MAX_COMPUTE_SIDE = 4096
-                    side = min(mem_side, MAX_COMPUTE_SIDE)
-                    actual_mem_mb = side * side * 4 / 1024 / 1024
                    total_mem_mb = total_mem / 1024 / 1024
                    free_mem_mb = free_mem / 1024 / 1024
                    self.console.print(
                        f"  [dim]GPU {i}: total {total_mem_mb:.0f}MB, free {free_mem_mb:.0f}MB, "
                        f"alloc {actual_mem_mb:.0f}MB ({actual_mem_mb/total_mem_mb*100:.0f}%) - "
-                        f"matrix {side}x{side}[/dim]"
+                        f"{dtype_name} matrix {side}x{side}[/dim]"
                    )
                    tensors[i] = (
                        torch.randn(side, side, device=f"cuda:{i}", dtype=dtype),
                        torch.randn(side, side, device=f"cuda:{i}", dtype=dtype),
                        torch.empty(side, side, device=f"cuda:{i}", dtype=dtype),
                    )
-                    tensors[i] = torch.randn(side, side, device=f"cuda:{i}", dtype=torch.float32)
            self.console.print(f"\n[cyan]Starting stress test for {duration} seconds...[/cyan]")
            elapsed_check = 0
            while time.time() - t0 < duration:
                loop_start = time.perf_counter()
                # Dispatch matmul on all GPUs in parallel — do NOT synchronize between
                # GPUs, otherwise the 8 GPUs run serially and overshoot the duration.
                for i in range(gpu_count):
                    with torch.cuda.device(i):
-                        tensors[i] = torch.matmul(tensors[i], tensors[i].T)
+                        a, b, out = tensors[i]
                        torch.matmul(a, b, out=out)
                # Single sync per pass — waits for all 8 streams concurrently
                for i in range(gpu_count):
                    with torch.cuda.device(i):
                        torch.cuda.synchronize()
                loop_elapsed = time.perf_counter() - loop_start
                current_elapsed = time.time() - t0
                if loop_elapsed > 0:
                    flops = gpu_count * 2 * (matrix_size ** 3)
                    pass_tflops.append({
                        "elapsed_sec": current_elapsed,
                        "tflops": flops / loop_elapsed / 1e12,
                    })
                # Show progress every 10 seconds
                current_elapsed = time.time() - t0
                if int(current_elapsed) != int(elapsed_check) and int(current_elapsed) % 10 == 0:
                    self.console.print(f"  [dim]Running {int(current_elapsed)}s / {duration}s[/dim]")
                    elapsed_check = current_elapsed
@ -198,21 +245,196 @@ class StressTest:
                "duration_sec": duration,
                "error": error_msg,
                "gpu_status": gpu_status,
                "telemetry": self._evaluate_telemetry(
                    telemetry, pass_tflops if "pass_tflops" in locals() else [],
                    self._new_xid_events(xid_before, self._collect_xid_events()),
                ),
            }
        finally:
            stop_sampling.set()
            tensors.clear()
            ballast.clear()
            torch.cuda.empty_cache()
        elapsed = round(time.time() - t0, 1)
        xid_events = self._new_xid_events(xid_before, self._collect_xid_events())
        telemetry_summary = self._evaluate_telemetry(telemetry, pass_tflops, xid_events)
        passed = all(v == "PASS" for v in gpu_status.values()) and telemetry_summary.get("passed", False)
        return {
            "source": "pytorch",
-            "passed": True,
+            "passed": passed,
            "duration_sec": duration,
            "elapsed_sec": elapsed,
            "gpu_status": gpu_status,
            "telemetry": telemetry_summary,
            "timestamp": datetime.now().isoformat(),
        }
    def _sample_telemetry(self, telemetry: list, stop_event: threading.Event, interval: int):
        query = "index,temperature.gpu,power.draw,clocks_throttle_reasons.active"
        while not stop_event.is_set():
            try:
                r = subprocess.run(
                    ["nvidia-smi", f"--query-gpu={query}", "--format=csv,noheader,nounits"],
                    capture_output=True, text=True, timeout=10,
                )
                if r.returncode == 0:
                    sample = {"time": time.time(), "gpus": []}
                    for line in r.stdout.splitlines():
                        parts = [p.strip() for p in line.split(",")]
                        if len(parts) >= 4:
                            sample["gpus"].append({
                                "index": int(parts[0]),
                                "temp_c": float(parts[1]),
                                "power_w": float(parts[2]),
                                "throttle": parts[3],
                            })
                    telemetry.append(sample)
            except Exception:
                pass
            stop_event.wait(interval)
    def _collect_xid_events(self) -> list[str]:
        try:
            r = subprocess.run(
                ["dmesg", "--color=never"],
                capture_output=True, text=True, timeout=10,
            )
            if r.returncode != 0:
                return []
            return [
                line.strip()
                for line in r.stdout.splitlines()
                # "NVRM: Xid" lines already contain "XID" once upper-cased,
                # so a single substring check covers both spellings.
                if "XID" in line.upper()
            ]
        except Exception:
            return []
    @staticmethod
    def _new_xid_events(before: list[str], after: list[str]) -> list[str]:
        seen = set(before)
        return [line for line in after if line not in seen]
    def _evaluate_telemetry(self, telemetry: list, pass_tflops: list, xid_events: list[str] | None = None) -> dict:
        cfg = self.stress_cfg
        max_temp = float(cfg.get("max_temp_c", 80))
        max_delta = float(cfg.get("max_temp_delta_c", 5))
        min_power = float(cfg.get("min_power_watts", 630))
        max_jitter = float(cfg.get("max_tflops_jitter_pct", 5))
        require_jitter = bool(cfg.get("require_tflops_jitter", True))
        duration = float(cfg.get("duration_sec", 60))
        requested_warmup = float(cfg.get("warmup_sec", 60))
        warmup_sec = min(requested_warmup, max(0.0, duration * 0.2))
        min_steady_samples = int(cfg.get("min_steady_samples", 10))
        temps = {}
        powers = {}
        throttle_bad = []
        xid_events = xid_events or []
        steady_telemetry = [
            sample for sample in telemetry
            if sample.get("time", 0) - telemetry[0].get("time", 0) >= warmup_sec
        ] if telemetry else []
        evaluation_samples = steady_telemetry if len(steady_telemetry) >= min_steady_samples else telemetry
        for sample in evaluation_samples:
            for g in sample.get("gpus", []):
                idx = g["index"]
                temps.setdefault(idx, []).append(g["temp_c"])
                powers.setdefault(idx, []).append(g["power_w"])
                try:
                    bitmask = int(str(g["throttle"]), 16)
                except ValueError:
                    bitmask = 0
                real_throttle = bitmask & ~0x1
                if real_throttle:
                    throttle_bad.append({
                        "gpu": idx,
                        "throttle": g["throttle"],
                        "real_throttle": f"0x{real_throttle:x}",
                    })
        max_temps = {idx: max(vals) for idx, vals in temps.items() if vals}
        avg_powers = {idx: sum(vals) / len(vals) for idx, vals in powers.items() if vals}
        temp_delta = (max(max_temps.values()) - min(max_temps.values())) if len(max_temps) >= 2 else 0
        jitter = 0
        steady_tflops = []
        for item in pass_tflops:
            if isinstance(item, dict):
                if float(item.get("elapsed_sec", 0)) >= warmup_sec:
                    steady_tflops.append(float(item.get("tflops", 0)))
            else:
                steady_tflops.append(float(item))
        if len(steady_tflops) < 2 and pass_tflops:
            steady_tflops = [
                float(item.get("tflops", 0)) if isinstance(item, dict) else float(item)
                for item in pass_tflops
            ]
        if steady_tflops:
            mean = sum(steady_tflops) / len(steady_tflops)
            jitter = max(abs(v - mean) / mean * 100 for v in steady_tflops) if mean else 0
        failures = []
        temp_failures = {idx: v for idx, v in max_temps.items() if v > max_temp}
        power_failures = {idx: v for idx, v in avg_powers.items() if v < min_power}
        if not evaluation_samples:
            failures.append("no telemetry samples available for evaluation")
        if temp_failures:
            failures.append(
                "max temperature above threshold: "
                + ", ".join(f"GPU {idx} {val:.1f}C" for idx, val in sorted(temp_failures.items()))
            )
        if temp_delta > max_delta:
            failures.append(f"GPU temperature delta {temp_delta:.1f}C exceeds {max_delta:.1f}C")
        if power_failures:
            failures.append(
                "average steady-state power below threshold: "
                + ", ".join(f"GPU {idx} {val:.1f}W" for idx, val in sorted(power_failures.items()))
            )
        if throttle_bad:
            failures.append(
                f"non-idle throttle reasons observed in {len(throttle_bad)} samples "
                f"(first: GPU {throttle_bad[0]['gpu']} {throttle_bad[0]['real_throttle']})"
            )
        if xid_events:
            failures.append(f"{len(xid_events)} new XID/NVRM XID events observed")
        if require_jitter and len(steady_tflops) < 2:
            failures.append(
                f"insufficient steady TFLOPS samples for jitter evaluation: {len(steady_tflops)} < 2"
            )
        if jitter > max_jitter:
            failures.append(f"TFLOPS jitter {jitter:.2f}% exceeds {max_jitter:.2f}%")
        passed = (
            bool(evaluation_samples)
            and all(v <= max_temp for v in max_temps.values())
            and temp_delta <= max_delta
            and all(v >= min_power for v in avg_powers.values())
            and not throttle_bad
            and not xid_events
            and (not require_jitter or len(steady_tflops) >= 2)
            and jitter <= max_jitter
        )
        return {
            "passed": passed,
            "samples": len(telemetry),
            "steady_samples": len(evaluation_samples),
            "warmup_sec": round(warmup_sec, 1),
            "max_temp_c": {k: round(v, 1) for k, v in max_temps.items()},
            "avg_power_w": {k: round(v, 1) for k, v in avg_powers.items()},
            "temp_delta_c": round(temp_delta, 1),
            "throttle_events": throttle_bad[:20],
            "throttle_event_count": len(throttle_bad),
            "xid_events": xid_events[-20:],
            "tflops_jitter_pct": round(jitter, 2),
            "steady_tflops_samples": len(steady_tflops),
            "failures": failures,
            "thresholds": {
                "max_temp_c": max_temp,
                "max_temp_delta_c": max_delta,
                "min_power_w": min_power,
                "max_tflops_jitter_pct": max_jitter,
                "require_tflops_jitter": require_jitter,
                "warmup_sec": requested_warmup,
                "min_steady_samples": min_steady_samples,
            },
        }
    @staticmethod
    def print_results(results: dict, console: Console = None):
        c = console or Console()
@ -245,5 +467,21 @@ class StressTest:
                color = "green" if status == "PASS" else "red"
                c.print(f"    GPU {gid}: [{color}]{status}[/{color}]")
        telemetry = results.get("telemetry") or {}
        if telemetry:
            c.print("\n  Telemetry:")
            c.print(f"    Samples: {telemetry.get('samples', 0)} total, {telemetry.get('steady_samples', 0)} evaluated after {telemetry.get('warmup_sec', 0)}s warmup")
            c.print(f"    Avg steady power: {telemetry.get('avg_power_w', {})}")
            c.print(f"    Max steady temp: {telemetry.get('max_temp_c', {})}")
            c.print(f"    Temp delta: {telemetry.get('temp_delta_c', 'N/A')} C")
            c.print(f"    TFLOPS jitter: {telemetry.get('tflops_jitter_pct', 'N/A')}%")
            c.print(f"    Throttle events: {telemetry.get('throttle_event_count', len(telemetry.get('throttle_events', [])))}")
            c.print(f"    XID events: {len(telemetry.get('xid_events', []))}")
            failures = telemetry.get("failures", [])
            if failures:
                c.print("  [red]Failure reasons:[/red]")
                for reason in failures:
                    c.print(f"    [red]- {reason}[/red]")
        if results.get("error"):
            c.print(f"  [red]Error: {results['error']}[/red]")
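Reviewer note: three small rules inside `_evaluate_telemetry` decide most stress verdicts: how the throttle bitmask is filtered, how the warmup window is capped, and how TFLOPS jitter is computed. The sketch below isolates them as standalone helpers; the function names are hypothetical and the sample values are made up for illustration. It assumes, as the code above does, that bit `0x1` of `clocks_throttle_reasons.active` is the "GPU idle" reason and should not count as throttling.

```python
from __future__ import annotations


def real_throttle_bits(raw: str) -> int:
    """Parse the hex bitmask from nvidia-smi and clear bit 0 (GPU idle)."""
    try:
        bitmask = int(str(raw), 16)
    except ValueError:
        # Unparseable values (for example "N/A") are treated as "no throttle",
        # mirroring the behaviour in _evaluate_telemetry.
        return 0
    return bitmask & ~0x1


def effective_warmup(requested_sec: float, duration_sec: float) -> float:
    """Warmup is capped at 20% of the run so short smoke tests keep steady samples."""
    return min(requested_sec, max(0.0, duration_sec * 0.2))


def jitter_pct(values: list[float]) -> float:
    """Largest absolute deviation from the mean, as a percentage of the mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return max(abs(v - mean) / mean * 100 for v in values) if mean else 0.0


print(hex(real_throttle_bits("0x0000000000000004")))  # non-idle bit survives the mask
print(hex(real_throttle_bits("0x0000000000000001")))  # idle-only mask is ignored
print(effective_warmup(60, 45))                        # 45 s smoke run
print(jitter_pct([100.0, 104.0, 96.0]))
```

With these rules a 45-second smoke run and the default 60-second warmup evaluates roughly the last 36 seconds, and a sample reporting `0x4` is counted as a throttle event, which is consistent with the strict FAIL results described in the coverage report.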
--- a/modules/training_sim.py
+++ b/modules/training_sim.py
@ -1,8 +1,13 @@
 """Training simulation module - LLM training workload with PyTorch."""
 import json
 import os
 import sys
 import tempfile
 import time
 import subprocess
 import shutil
 import math
 from datetime import datetime
 from typing import Optional
@ -36,6 +41,7 @@ class TrainingSim:
        batch_size = self.train_cfg.get("batch_size", 8)
        seq_length = self.train_cfg.get("seq_length", 2048)
        num_steps = self.train_cfg.get("num_steps", 50)
        warmup_steps = int(self.train_cfg.get("warmup_steps", 5))
        dtype_str = self.train_cfg.get("dtype", "bf16")
        dtype_map = {
@ -47,7 +53,13 @@ class TrainingSim:
        self.console.print(f"[cyan]Training Simulation[/cyan]")
        self.console.print(f"  Model: {model_name} | Batch: {batch_size} | Seq: {seq_length} | "
-                           f"DType: {dtype_str} | Steps: {num_steps} | GPUs: {gpu_count}")
+                           f"DType: {dtype_str} | Steps: {num_steps} | Warmup: {warmup_steps} | GPUs: {gpu_count}")
        if self.train_cfg.get("mode", "ddp") == "ddp" and gpu_count > 1:
            ddp_result = self._run_synthetic_ddp(gpu_count, batch_size, seq_length, num_steps, dtype_str)
            if ddp_result.get("passed") or not self.train_cfg.get("allow_fallback", False):
                return ddp_result
            self.console.print("[yellow]DDP synthetic training failed, falling back to single-process synthetic path[/yellow]")
        try:
            from transformers import AutoModelForCausalLM, AutoTokenizer
@ -87,9 +99,10 @@ class TrainingSim:
                BarColumn(), TextColumn("{task.completed}/{task.total}"),
                TimeElapsedColumn(), console=self.console,
            ) as progress:
-                task = progress.add_task("Training steps...", total=num_steps)
+                total_steps = num_steps + warmup_steps
                task = progress.add_task("Training steps...", total=total_steps)
-                for step in range(num_steps):
+                for step in range(total_steps):
                    torch.cuda.synchronize()
                    t0 = time.perf_counter()
@ -119,8 +132,15 @@ class TrainingSim:
                    progress.advance(task)
-            avg_step_time = sum(step_times) / len(step_times)
+            measured_steps = step_times[warmup_steps:] if len(step_times) > warmup_steps else step_times
            avg_step_time = sum(measured_steps) / len(measured_steps)
            throughput = batch_size * seq_length / avg_step_time
            jitter = self._jitter_pct(measured_steps)
            peak_mem = round(max(mem_usage) if mem_usage else 0, 2)
            final_loss = float(loss.item()) if hasattr(loss, "item") else float("nan")
            passed = self._acceptance_pass(throughput, jitter, peak_mem, final_loss)
            if self.train_cfg.get("require_distributed", True):
                passed = False
            return {
                "model": model_name,
@ -130,11 +150,18 @@ class TrainingSim:
                "batch_size": batch_size,
                "seq_length": seq_length,
                "num_steps": num_steps,
                "warmup_steps": warmup_steps,
                "total_steps": total_steps,
                "avg_step_time_ms": round(avg_step_time * 1000, 1),
                "throughput_tokens_per_sec": round(throughput, 0),
                "throughput_samples_per_sec": round(batch_size / avg_step_time, 2),
-                "peak_memory_gb": round(max(mem_usage) if mem_usage else 0, 2),
+                "peak_memory_gb": peak_mem,
-                "final_loss": round(loss.item(), 4) if hasattr(loss, 'item') else None,
+                "final_loss": round(final_loss, 4),
                "step_jitter_pct": round(jitter, 2),
                "distributed_mode": "device_map",
                "loss_finite": math.isfinite(final_loss),
                "passed": passed,
                "acceptance_gap": "8-GPU DDP was not used" if self.train_cfg.get("require_distributed", True) else "",
                "timestamp": datetime.now().isoformat(),
            }
@ -142,6 +169,196 @@ class TrainingSim:
            self.console.print(f"[yellow]Model loading failed: {e}[/yellow]")
            return self._run_synthetic(gpu_count, batch_size, seq_length, num_steps, dtype)
    def _run_synthetic_ddp(self, gpu_count: int, batch_size: int, seq_length: int,
                           num_steps: int, dtype_str: str) -> dict:
        """Run the 1.5B synthetic Transformer with one process per GPU."""
        torchrun = os.path.join(os.path.dirname(sys.executable), "torchrun")
        if not os.path.isfile(torchrun):
            torchrun = shutil.which("torchrun") or ""
        if not torchrun:
            return {
                "model": "synthetic_transformer_1.5b",
                "gpu_count": gpu_count,
                "distributed_mode": "ddp",
                "passed": False,
                "error": "torchrun not found",
                "timestamp": datetime.now().isoformat(),
            }
        script = r'''
 import json
 import math
 import os
 import time
 import torch
 import torch.distributed as dist
 from torch.nn.parallel import DistributedDataParallel as DDP
 def main():
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group("nccl")
    global_batch = int(os.environ["TRAIN_BATCH_SIZE"])
    local_batch = max(1, global_batch // world_size)
    seq_length = int(os.environ["TRAIN_SEQ_LENGTH"])
    num_steps = int(os.environ["TRAIN_NUM_STEPS"])
    warmup_steps = int(os.environ.get("TRAIN_WARMUP_STEPS", "5"))
    total_steps = num_steps + warmup_steps
    dtype_name = os.environ.get("TRAIN_DTYPE", "bf16")
    dtype = {"fp16": torch.float16, "bf16": torch.bfloat16, "fp32": torch.float32}.get(dtype_name, torch.bfloat16)
    hidden_size = 4096
    num_layers = 6
    num_heads = 32
    vocab_size = 32000
    class SyntheticTransformer(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.embed = torch.nn.Embedding(vocab_size, hidden_size)
            self.layers = torch.nn.ModuleList([
                torch.nn.TransformerEncoderLayer(
                    d_model=hidden_size,
                    nhead=num_heads,
                    dim_feedforward=hidden_size * 4,
                    batch_first=True,
                    dtype=dtype,
                ) for _ in range(num_layers)
            ])
            self.head = torch.nn.Linear(hidden_size, vocab_size, dtype=dtype)
        def forward(self, x):
            h = self.embed(x).to(dtype)
            for layer in self.layers:
                h = layer(h)
            return self.head(h)
    model = SyntheticTransformer().cuda()
    total_params = sum(p.numel() for p in model.parameters())
    model = DDP(model, device_ids=[local_rank], output_device=local_rank)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    input_ids = torch.randint(0, vocab_size, (local_batch, seq_length), device="cuda")
    step_times = []
    last_loss = torch.tensor(float("nan"), device="cuda")
    torch.cuda.reset_peak_memory_stats(local_rank)
    for _ in range(total_steps):
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        with torch.amp.autocast("cuda", dtype=dtype, enabled=dtype in (torch.float16, torch.bfloat16)):
            logits = model(input_ids)
            loss = torch.nn.functional.cross_entropy(logits.reshape(-1, vocab_size), input_ids.reshape(-1))
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
        torch.cuda.synchronize()
        step_times.append(time.perf_counter() - t0)
        last_loss = loss.detach()
    peak_mem = torch.tensor(torch.cuda.max_memory_allocated(local_rank) / 1024**3, device="cuda")
    dist.all_reduce(peak_mem, op=dist.ReduceOp.MAX)
    finite = torch.tensor(1 if math.isfinite(float(last_loss.item())) else 0, device="cuda")
    dist.all_reduce(finite, op=dist.ReduceOp.MIN)
    if dist.get_rank() == 0:
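        # Exclude warmup iterations so one-time startup costs do not skew the statistics.
        measured_steps = step_times[warmup_steps:] if len(step_times) > warmup_steps else step_times
        avg_step = sum(measured_steps) / len(measured_steps) if measured_steps else 0.0
        # Jitter is the worst-case relative deviation from the mean step time, in percent.
        jitter = max(abs(v - avg_step) / avg_step * 100 for v in measured_steps) if avg_step else 0.0
        throughput = global_batch * seq_length / avg_step if avg_step else 0.0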
        print("TRAINING_DDP_JSON=" + json.dumps({
            "model": "synthetic_transformer_1.5b",
            "total_params_m": round(total_params / 1e6, 1),
            "num_layers": num_layers,
            "hidden_size": hidden_size,
            "gpu_count": world_size,
            "dtype": dtype_name,
            "batch_size": global_batch,
            "local_batch_size": local_batch,
            "seq_length": seq_length,
            "num_steps": num_steps,
            "warmup_steps": warmup_steps,
            "total_steps": total_steps,
            "avg_step_time_ms": round(avg_step * 1000, 1),
            "throughput_tokens_per_sec": round(throughput, 0),
            "throughput_samples_per_sec": round(global_batch / avg_step, 2) if avg_step else 0,
            "peak_memory_gb": round(float(peak_mem.item()), 2),
            "final_loss": round(float(last_loss.item()), 4),
            "step_jitter_pct": round(jitter, 2),
            "distributed_mode": "ddp",
            "loss_finite": bool(int(finite.item())),
        }), flush=True)
    dist.destroy_process_group()
 if __name__ == "__main__":
    main()
 '''
        tmp = tempfile.NamedTemporaryFile("w", suffix="_training_ddp.py", delete=False)
        tmp.write(script)
        tmp.close()
        env = {
            **os.environ,
            "TRAIN_BATCH_SIZE": str(batch_size),
            "TRAIN_SEQ_LENGTH": str(seq_length),
            "TRAIN_NUM_STEPS": str(num_steps),
            "TRAIN_WARMUP_STEPS": str(int(self.train_cfg.get("warmup_steps", 5))),
            "TRAIN_DTYPE": dtype_str,
            "NCCL_DEBUG": os.environ.get("NCCL_DEBUG", "WARN"),
        }
        cmd = [torchrun, f"--nproc_per_node={gpu_count}", tmp.name]
        self.console.print(f"  Running synthetic 1.5B DDP via torchrun ({gpu_count} processes)...")
        try:
            timeout = int(self.train_cfg.get("timeout_sec", max(600, num_steps * 180)))
            r = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout, env=env)
        except subprocess.TimeoutExpired:
            # The finally block below removes the temporary script, so no unlink is needed here.
            return {
                "model": "synthetic_transformer_1.5b",
                "gpu_count": gpu_count,
                "distributed_mode": "ddp",
                "passed": False,
                "error": "training_ddp_timeout",
                "timestamp": datetime.now().isoformat(),
            }
        finally:
            if os.path.exists(tmp.name):
                try:
                    os.unlink(tmp.name)
                except OSError:
                    pass
        marker = "TRAINING_DDP_JSON="
        payload = None
        for line in (r.stdout + "\n" + r.stderr).splitlines():
            if marker in line:
                payload = line.split(marker, 1)[1].strip()
        if r.returncode != 0 or not payload:
            return {
                "model": "synthetic_transformer_1.5b",
                "gpu_count": gpu_count,
                "distributed_mode": "ddp",
                "passed": False,
                "error": (r.stderr or r.stdout or "training_ddp_failed")[-1000:],
                "timestamp": datetime.now().isoformat(),
            }
        result = json.loads(payload)
        loss_value = float(result.get("final_loss", "nan"))
        passed = self._acceptance_pass(
            float(result.get("throughput_tokens_per_sec", 0)),
            float(result.get("step_jitter_pct", 999)),
            float(result.get("peak_memory_gb", 999)),
            loss_value,
        ) and bool(result.get("loss_finite", False)) and result.get("gpu_count") == gpu_count
        result.update({
            "passed": passed,
            "timestamp": datetime.now().isoformat(),
        })
        return result
    def _run_synthetic(self, gpu_count, batch_size, seq_length, num_steps, dtype) -> dict:
        self.console.print("  Running synthetic training benchmark...")
@ -170,11 +387,17 @@ class TrainingSim:
                    h = layer(h)
                return self.head(h)
-        model = SyntheticTransformer().cuda()
+        model = SyntheticTransformer()
        total_params = sum(p.numel() for p in model.parameters())
        self.console.print(f"  Synthetic params: {total_params / 1e6:.1f}M")
        distributed_mode = "single_gpu"
        if gpu_count > 1:
            model = torch.nn.DataParallel(model).cuda()
            distributed_mode = "data_parallel"
        else:
            model = model.cuda()
        model.train()
        optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
@ -183,14 +406,17 @@ class TrainingSim:
        step_times = []
        mem_usage = []
        warmup_steps = int(self.train_cfg.get("warmup_steps", 5))
        total_steps = num_steps + warmup_steps
        with Progress(
            SpinnerColumn(), TextColumn("[progress.description]{task.description}"),
            BarColumn(), TextColumn("{task.completed}/{task.total}"),
            TimeElapsedColumn(), console=self.console,
        ) as progress:
-            task = progress.add_task("Synthetic training...", total=num_steps)
+            task = progress.add_task("Synthetic training...", total=total_steps)
-            for step in range(num_steps):
+            for step in range(total_steps):
                torch.cuda.synchronize()
                t0 = time.perf_counter()
@ -206,14 +432,22 @@ class TrainingSim:
                elapsed = time.perf_counter() - t0
                step_times.append(elapsed)
-                mem_used = torch.cuda.max_memory_allocated() / 1024**3
+                mem_used = max(torch.cuda.max_memory_allocated(i) for i in range(gpu_count)) / 1024**3
                mem_usage.append(mem_used)
-                torch.cuda.reset_peak_memory_stats()
+                for i in range(gpu_count):
                    torch.cuda.reset_peak_memory_stats(i)
                progress.advance(task)
-        avg_step_time = sum(step_times) / len(step_times)
+        measured_steps = step_times[warmup_steps:] if len(step_times) > warmup_steps else step_times
        avg_step_time = sum(measured_steps) / len(measured_steps)
        throughput = batch_size * seq_length / avg_step_time
        jitter = self._jitter_pct(measured_steps)
        peak_mem = round(max(mem_usage) if mem_usage else 0, 2)
        final_loss = float(loss.item())
        passed = self._acceptance_pass(throughput, jitter, peak_mem, final_loss)
        if self.train_cfg.get("require_distributed", True):
            passed = False
        return {
            "model": "synthetic_transformer",
@ -225,14 +459,36 @@ class TrainingSim:
            "batch_size": batch_size,
            "seq_length": seq_length,
            "num_steps": num_steps,
            "warmup_steps": warmup_steps,
            "total_steps": total_steps,
            "avg_step_time_ms": round(avg_step_time * 1000, 1),
            "throughput_tokens_per_sec": round(throughput, 0),
            "throughput_samples_per_sec": round(batch_size / avg_step_time, 2),
-            "peak_memory_gb": round(max(mem_usage) if mem_usage else 0, 2),
+            "peak_memory_gb": peak_mem,
-            "final_loss": round(loss.item(), 4),
+            "final_loss": round(final_loss, 4),
            "step_jitter_pct": round(jitter, 2),
            "distributed_mode": distributed_mode,
            "loss_finite": math.isfinite(final_loss),
            "passed": passed,
            "acceptance_gap": "8-GPU DDP was not used" if self.train_cfg.get("require_distributed", True) else "",
            "timestamp": datetime.now().isoformat(),
        }
    @staticmethod
    def _jitter_pct(step_times: list[float]) -> float:
        if not step_times:
            return 0.0
        mean = sum(step_times) / len(step_times)
        return max(abs(v - mean) / mean * 100 for v in step_times) if mean else 0.0
    def _acceptance_pass(self, throughput: float, jitter: float, peak_mem: float, loss_value: float) -> bool:
        return (
            throughput >= float(self.train_cfg.get("min_tokens_per_sec", 45000))
            and jitter <= float(self.train_cfg.get("max_step_jitter_pct", 3))
            and peak_mem <= float(self.train_cfg.get("max_peak_memory_gb", 70))
            and math.isfinite(loss_value)
        )
    @staticmethod
    def print_results(results: dict, console: Console = None):
        c = console or Console()
@ -254,11 +510,15 @@ class TrainingSim:
            ("Batch Size", str(results.get("batch_size", "N/A"))),
            ("Seq Length", str(results.get("seq_length", "N/A"))),
            ("Steps", str(results.get("num_steps", "N/A"))),
            ("Warmup Steps", str(results.get("warmup_steps", "N/A"))),
            ("Avg Step Time", f"{results.get('avg_step_time_ms', 'N/A')} ms"),
            ("Throughput", f"{results.get('throughput_tokens_per_sec', 'N/A')} tokens/s"),
            ("Samples/sec", f"{results.get('throughput_samples_per_sec', 'N/A')}"),
            ("Peak Memory", f"{results.get('peak_memory_gb', 'N/A')} GB"),
            ("Final Loss", str(results.get("final_loss", "N/A"))),
            ("Step Jitter", f"{results.get('step_jitter_pct', 'N/A')}%"),
            ("Distributed Mode", results.get("distributed_mode", "N/A")),
            ("Verdict", "PASS" if results.get("passed") else "FAIL"),
        ]
        for label, val in metrics:
            table.add_row(label, str(val))
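
Review note: the jitter metric and the acceptance gate introduced in this hunk can be exercised without a GPU. The sketch below is a minimal standalone copy of `_jitter_pct` and `_acceptance_pass`, assuming the default thresholds shown in the diff (45000 tokens/s, 3% jitter, 70 GB peak memory). The `DEFAULTS` dict, the `cfg` parameter, and the sample step times are hypothetical and used only for illustration; they are not measured data.

```python
import math

# Default thresholds copied from _acceptance_pass in the diff above.
DEFAULTS = {
    "min_tokens_per_sec": 45000,
    "max_step_jitter_pct": 3,
    "max_peak_memory_gb": 70,
}


def jitter_pct(step_times):
    """Worst-case relative deviation from the mean step time, in percent."""
    if not step_times:
        return 0.0
    mean = sum(step_times) / len(step_times)
    return max(abs(v - mean) / mean * 100 for v in step_times) if mean else 0.0


def acceptance_pass(throughput, jitter, peak_mem, loss_value, cfg=None):
    """Standalone copy of the gate; cfg (hypothetical) overrides the defaults."""
    merged = {**DEFAULTS, **(cfg or {})}
    return (
        throughput >= float(merged["min_tokens_per_sec"])
        and jitter <= float(merged["max_step_jitter_pct"])
        and peak_mem <= float(merged["max_peak_memory_gb"])
        and math.isfinite(loss_value)
    )


# Hypothetical post-warmup step times in seconds.
steps = [0.100, 0.102, 0.098, 0.100]
jitter = jitter_pct(steps)

ok = acceptance_pass(50000, jitter, 60.0, 2.5)
nan_loss = acceptance_pass(50000, jitter, 60.0, float("nan"))
too_slow = acceptance_pass(40000, jitter, 60.0, 2.5)
print(f"jitter={jitter:.2f}% ok={ok} nan_loss={nan_loss} too_slow={too_slow}")
```

Because the metric takes the maximum deviation rather than a standard deviation, a single slow step after warmup is enough to push a run over the 3% limit, which is worth keeping in mind when reading the `step_jitter_pct` values in the reports.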
--- a/reports_all_aikubeworker0016.json
+++ b/reports_all_aikubeworker0016.json
@ -0,0 +1,921 @@
 {
  "timestamp": "2026-05-22T15:49:02.368516",
  "gpu_info": {
    "driver_version": "580.159.03",
    "cuda_version": "13.0",
    "gpu_count": 8,
    "gpus": [
      {
        "index": 0,
        "name": "NVIDIA H100 80GB HBM3",
        "uuid": "GPU-dfbc9513-255d-4fe7-2b77-7b1ec3972e75",
        "pci_bus_id": "00000000:18:00.0",
        "pcie_link_gen": 5,
        "pcie_link_width": 16,
        "vram_total_mb": 81559,
        "vram_used_mb": 4,
        "vram_free_mb": 81076,
        "power_draw": 69.98,
        "power_limit": 700.0,
        "clock_sm": 345,
        "clock_mem": 2619,
        "temperature": 21,
        "fan_speed": 0,
        "persistence_mode": false,
        "compute_mode": "Default",
        "serial_number": "1651924016120",
        "ecc_errors_single": 0,
        "ecc_errors_double": 0
      },
      {
        "index": 1,
        "name": "NVIDIA H100 80GB HBM3",
        "uuid": "GPU-bb845ef7-d7b5-f011-9395-ea74274e2282",
        "pci_bus_id": "00000000:2A:00.0",
        "pcie_link_gen": 5,
        "pcie_link_width": 16,
        "vram_total_mb": 81559,
        "vram_used_mb": 4,
        "vram_free_mb": 81076,
        "power_draw": 67.54,
        "power_limit": 700.0,
        "clock_sm": 345,
        "clock_mem": 2619,
        "temperature": 21,
        "fan_speed": 0,
        "persistence_mode": false,
        "compute_mode": "Default",
        "serial_number": "1651924015483",
        "ecc_errors_single": 0,
        "ecc_errors_double": 0
      },
      {
        "index": 2,
        "name": "NVIDIA H100 80GB HBM3",
        "uuid": "GPU-3720cf13-2a34-be38-27be-0a7adc4addc4",
        "pci_bus_id": "00000000:3A:00.0",
        "pcie_link_gen": 5,
        "pcie_link_width": 16,
        "vram_total_mb": 81559,
        "vram_used_mb": 4,
        "vram_free_mb": 81076,
        "power_draw": 66.82,
        "power_limit": 700.0,
        "clock_sm": 345,
        "clock_mem": 2619,
        "temperature": 22,
        "fan_speed": 0,
        "persistence_mode": false,
        "compute_mode": "Default",
        "serial_number": "1651924025595",
        "ecc_errors_single": 0,
        "ecc_errors_double": 0
      },
      {
        "index": 3,
        "name": "NVIDIA H100 80GB HBM3",
        "uuid": "GPU-87080b2d-ac43-be0d-d574-c193078850ae",
        "pci_bus_id": "00000000:5D:00.0",
        "pcie_link_gen": 5,
        "pcie_link_width": 16,
        "vram_total_mb": 81559,
        "vram_used_mb": 4,
        "vram_free_mb": 81076,
        "power_draw": 67.02,
        "power_limit": 700.0,
        "clock_sm": 345,
        "clock_mem": 2619,
        "temperature": 21,
        "fan_speed": 0,
        "persistence_mode": false,
        "compute_mode": "Default",
        "serial_number": "1651924016862",
        "ecc_errors_single": 0,
        "ecc_errors_double": 0
      },
      {
        "index": 4,
        "name": "NVIDIA H100 80GB HBM3",
        "uuid": "GPU-599bd883-cc5c-a5dd-6c33-c15f7049da48",
        "pci_bus_id": "00000000:9A:00.0",
        "pcie_link_gen": 5,
        "pcie_link_width": 16,
        "vram_total_mb": 81559,
        "vram_used_mb": 4,
        "vram_free_mb": 81076,
        "power_draw": 67.24,
        "power_limit": 700.0,
        "clock_sm": 345,
        "clock_mem": 2619,
        "temperature": 21,
        "fan_speed": 0,
        "persistence_mode": false,
        "compute_mode": "Default",
        "serial_number": "1651924025670",
        "ecc_errors_single": 0,
        "ecc_errors_double": 0
      },
      {
        "index": 5,
        "name": "NVIDIA H100 80GB HBM3",
        "uuid": "GPU-a1c6bba4-61b0-e623-06c9-9c88635e26fe",
        "pci_bus_id": "00000000:AB:00.0",
        "pcie_link_gen": 5,
        "pcie_link_width": 16,
        "vram_total_mb": 81559,
        "vram_used_mb": 4,
        "vram_free_mb": 81076,
        "power_draw": 69.31,
        "power_limit": 700.0,
        "clock_sm": 345,
        "clock_mem": 2619,
        "temperature": 23,
        "fan_speed": 0,
        "persistence_mode": false,
        "compute_mode": "Default",
        "serial_number": "1651924027166",
        "ecc_errors_single": 0,
        "ecc_errors_double": 0
      },
      {
        "index": 6,
        "name": "NVIDIA H100 80GB HBM3",
        "uuid": "GPU-98745a0c-39bd-3e56-d6ca-54ba3647ab6d",
        "pci_bus_id": "00000000:BA:00.0",
        "pcie_link_gen": 5,
        "pcie_link_width": 16,
        "vram_total_mb": 81559,
        "vram_used_mb": 4,
        "vram_free_mb": 81076,
        "power_draw": 67.84,
        "power_limit": 700.0,
        "clock_sm": 345,
        "clock_mem": 2619,
        "temperature": 21,
        "fan_speed": 0,
        "persistence_mode": false,
        "compute_mode": "Default",
        "serial_number": "1651924026234",
        "ecc_errors_single": 0,
        "ecc_errors_double": 0
      },
      {
        "index": 7,
        "name": "NVIDIA H100 80GB HBM3",
        "uuid": "GPU-8c73bd8b-666b-357e-ac5d-c75ac7a759db",
        "pci_bus_id": "00000000:DB:00.0",
        "pcie_link_gen": 5,
        "pcie_link_width": 16,
        "vram_total_mb": 81559,
        "vram_used_mb": 4,
        "vram_free_mb": 81076,
        "power_draw": 66.21,
        "power_limit": 700.0,
        "clock_sm": 345,
        "clock_mem": 2619,
        "temperature": 21,
        "fan_speed": 0,
        "persistence_mode": false,
        "compute_mode": "Default",
        "serial_number": "1651924027255",
        "ecc_errors_single": 0,
        "ecc_errors_double": 0
      }
    ],
    "topology": "\t\u001b[4mGPU0\tGPU1\tGPU2\tGPU3\tGPU4\tGPU5\tGPU6\tGPU7\tNIC0\tNIC1\tNIC2\tNIC3\tNIC4\tNIC5\tNIC6\tNIC7\tNIC8\tNIC9\tCPU Affinity\tNUMA Affinity\tGPU NUMA ID\u001b[0m\nGPU0\t X \tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\tPIX\tNODE\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t0-55,112-167\t0\t\tN/A\nGPU1\tNV18\t X \tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\tNODE\tPIX\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t0-55,112-167\t0\t\tN/A\nGPU2\tNV18\tNV18\t X \tNV18\tNV18\tNV18\tNV18\tNV18\tNODE\tNODE\tPIX\tPIX\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t0-55,112-167\t0\t\tN/A\nGPU3\tNV18\tNV18\tNV18\t X \tNV18\tNV18\tNV18\tNV18\tNODE\tNODE\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t0-55,112-167\t0\t\tN/A\nGPU4\tNV18\tNV18\tNV18\tNV18\t X \tNV18\tNV18\tNV18\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tPIX\tNODE\tNODE\tNODE\t56-111,168-223\t1\t\tN/A\nGPU5\tNV18\tNV18\tNV18\tNV18\tNV18\t X \tNV18\tNV18\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\tPIX\tNODE\tNODE\t56-111,168-223\t1\t\tN/A\nGPU6\tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\t X \tNV18\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tPIX\tPIX\t56-111,168-223\t1\t\tN/A\nGPU7\tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\t X \tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tNODE\tNODE\t56-111,168-223\t1\t\tN/A\nNIC0\tPIX\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t X \tNODE\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC1\tNODE\tPIX\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\tNODE\t X \tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC2\tNODE\tNODE\tPIX\tNODE\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\t X \tPIX\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC3\tNODE\tNODE\tPIX\tNODE\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tPIX\t X \tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC4\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tNODE\tNODE\t X \tPIX\tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC5\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tNODE\tNODE\tPIX\t X \tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC6\tSYS\tSYS\tSYS\tSYS\tPIX\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\t X 
\tNODE\tNODE\tNODE\t\t\t\t\nNIC7\tSYS\tSYS\tSYS\tSYS\tNODE\tPIX\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\t X \tNODE\tNODE\t\t\t\t\nNIC8\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tPIX\tNODE\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\t X \tPIX\t\t\t\t\nNIC9\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tPIX\tNODE\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tPIX\t X \t\t\t\t\n\nLegend:\n\n  X    = Self\n  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)\n  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node\n  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)\n  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)\n  PIX  = Connection traversing at most a single PCIe bridge\n  NV#  = Connection traversing a bonded set of # NVLinks\n\nNIC Legend:\n\n  NIC0: mlx5_0\n  NIC1: mlx5_1\n  NIC2: mlx5_2\n  NIC3: mlx5_3\n  NIC4: mlx5_4\n  NIC5: mlx5_5\n  NIC6: mlx5_6\n  NIC7: mlx5_7\n  NIC8: mlx5_8\n  NIC9: mlx5_9\n\n",
    "timestamp": "2026-05-22T15:49:09.197459",
    "detected_gpu_type": "h100",
    "gpu_label": "H100 SXM5"
  },
  "health": {
    "passed": true,
    "gpu_health": [
      {
        "index": 0,
        "status": "WARN",
        "checks": {
          "temperature": {
            "value": 21,
            "status": "PASS",
            "threshold": 75
          },
          "power": {
            "value": 69.86,
            "limit": 700.0,
            "status": "PASS"
          },
          "ecc_errors": {
            "single": 0,
            "double": 0,
            "status": "PASS"
          },
          "memory_errors": {
            "status": "PASS"
          },
          "pcie_link": {
            "gen": 5,
            "width": 16,
            "status": "PASS"
          },
          "clock_speed": {
            "sm": 345,
            "mem": 2619,
            "status": "PASS"
          },
          "throttling": {
            "status": "PASS",
            "reasons": []
          },
          "persistence_mode": {
            "enabled": false,
            "status": "WARN"
          }
        }
      },
      {
        "index": 1,
        "status": "WARN",
        "checks": {
          "temperature": {
            "value": 21,
            "status": "PASS",
            "threshold": 75
          },
          "power": {
            "value": 67.48,
            "limit": 700.0,
            "status": "PASS"
          },
          "ecc_errors": {
            "single": 0,
            "double": 0,
            "status": "PASS"
          },
          "memory_errors": {
            "status": "PASS"
          },
          "pcie_link": {
            "gen": 5,
            "width": 16,
            "status": "PASS"
          },
          "clock_speed": {
            "sm": 345,
            "mem": 2619,
            "status": "PASS"
          },
          "throttling": {
            "status": "PASS",
            "reasons": []
          },
          "persistence_mode": {
            "enabled": false,
            "status": "WARN"
          }
        }
      },
      {
        "index": 2,
        "status": "WARN",
        "checks": {
          "temperature": {
            "value": 22,
            "status": "PASS",
            "threshold": 75
          },
          "power": {
            "value": 66.76,
            "limit": 700.0,
            "status": "PASS"
          },
          "ecc_errors": {
            "single": 0,
            "double": 0,
            "status": "PASS"
          },
          "memory_errors": {
            "status": "PASS"
          },
          "pcie_link": {
            "gen": 5,
            "width": 16,
            "status": "PASS"
          },
          "clock_speed": {
            "sm": 345,
            "mem": 2619,
            "status": "PASS"
          },
          "throttling": {
            "status": "PASS",
            "reasons": []
          },
          "persistence_mode": {
            "enabled": false,
            "status": "WARN"
          }
        }
      },
      {
        "index": 3,
        "status": "WARN",
        "checks": {
          "temperature": {
            "value": 21,
            "status": "PASS",
            "threshold": 75
          },
          "power": {
            "value": 67.06,
            "limit": 700.0,
            "status": "PASS"
          },
          "ecc_errors": {
            "single": 0,
            "double": 0,
            "status": "PASS"
          },
          "memory_errors": {
            "status": "PASS"
          },
          "pcie_link": {
            "gen": 5,
            "width": 16,
            "status": "PASS"
          },
          "clock_speed": {
            "sm": 345,
            "mem": 2619,
            "status": "PASS"
          },
          "throttling": {
            "status": "PASS",
            "reasons": []
          },
          "persistence_mode": {
            "enabled": false,
            "status": "WARN"
          }
        }
      },
      {
        "index": 4,
        "status": "WARN",
        "checks": {
          "temperature": {
            "value": 21,
            "status": "PASS",
            "threshold": 75
          },
          "power": {
            "value": 67.23,
            "limit": 700.0,
            "status": "PASS"
          },
          "ecc_errors": {
            "single": 0,
            "double": 0,
            "status": "PASS"
          },
          "memory_errors": {
            "status": "PASS"
          },
          "pcie_link": {
            "gen": 5,
            "width": 16,
            "status": "PASS"
          },
          "clock_speed": {
            "sm": 345,
            "mem": 2619,
            "status": "PASS"
          },
          "throttling": {
            "status": "PASS",
            "reasons": []
          },
          "persistence_mode": {
            "enabled": false,
            "status": "WARN"
          }
        }
      },
      {
        "index": 5,
        "status": "WARN",
        "checks": {
          "temperature": {
            "value": 23,
            "status": "PASS",
            "threshold": 75
          },
          "power": {
            "value": 69.27,
            "limit": 700.0,
            "status": "PASS"
          },
          "ecc_errors": {
            "single": 0,
            "double": 0,
            "status": "PASS"
          },
          "memory_errors": {
            "status": "PASS"
          },
          "pcie_link": {
            "gen": 5,
            "width": 16,
            "status": "PASS"
          },
          "clock_speed": {
            "sm": 345,
            "mem": 2619,
            "status": "PASS"
          },
          "throttling": {
            "status": "PASS",
            "reasons": []
          },
          "persistence_mode": {
            "enabled": false,
            "status": "WARN"
          }
        }
      },
      {
        "index": 6,
        "status": "WARN",
        "checks": {
          "temperature": {
            "value": 21,
            "status": "PASS",
            "threshold": 75
          },
          "power": {
            "value": 67.81,
            "limit": 700.0,
            "status": "PASS"
          },
          "ecc_errors": {
            "single": 0,
            "double": 0,
            "status": "PASS"
          },
          "memory_errors": {
            "status": "PASS"
          },
          "pcie_link": {
            "gen": 5,
            "width": 16,
            "status": "PASS"
          },
          "clock_speed": {
            "sm": 345,
            "mem": 2619,
            "status": "PASS"
          },
          "throttling": {
            "status": "PASS",
            "reasons": []
          },
          "persistence_mode": {
            "enabled": false,
            "status": "WARN"
          }
        }
      },
      {
        "index": 7,
        "status": "WARN",
        "checks": {
          "temperature": {
            "value": 21,
            "status": "PASS",
            "threshold": 75
          },
          "power": {
            "value": 66.3,
            "limit": 700.0,
            "status": "PASS"
          },
          "ecc_errors": {
            "single": 0,
            "double": 0,
            "status": "PASS"
          },
          "memory_errors": {
            "status": "PASS"
          },
          "pcie_link": {
            "gen": 5,
            "width": 16,
            "status": "PASS"
          },
          "clock_speed": {
            "sm": 345,
            "mem": 2619,
            "status": "PASS"
          },
          "throttling": {
            "status": "PASS",
            "reasons": []
          },
          "persistence_mode": {
            "enabled": false,
            "status": "WARN"
          }
        }
      }
    ],
    "system_health": {
      "nvidia_persistenced": {
        "installed": true,
        "running": false
      },
      "hugepages": {
        "configured": false,
        "count": 0
      },
      "swap": {
        "enabled": true
      },
      "transparent_hugepage": "madvise",
      "file_descriptors": {
        "soft": 1024,
        "max": 1048576
      },
      "infiniband_devices": [
        "mlx5_4",
        "mlx5_2",
        "mlx5_0",
        "mlx5_9",
        "mlx5_7",
        "mlx5_5",
        "mlx5_3",
        "mlx5_1",
        "mlx5_8",
        "mlx5_6"
      ],
      "rdma_devices": [
        "abi_version",
        "uverbs4",
        "uverbs2",
        "uverbs0",
        "uverbs9",
        "uverbs7",
        "uverbs5",
        "uverbs3",
        "uverbs1",
        "uverbs8",
        "uverbs6"
      ],
      "nccl_env_vars": {}
    },
    "timestamp": "2026-05-22T15:49:11.294816",
    "detected_gpu_type": "h100"
  },
  "memory_bench": {
    "memory": {
      "source": "nvbandwidth",
      "h2d_bandwidth_gbps": 55.5,
      "d2h_bandwidth_gbps": 55.3,
      "d2d_bandwidth_gbps": 486.5,
      "h2d_peak_gbps": 64,
      "d2h_peak_gbps": 64,
      "d2d_peak_gbps": 450.0,
      "h2d_efficiency_pct": 86.7,
      "d2h_efficiency_pct": 86.4,
      "d2d_efficiency_pct": 108.1,
      "peak_bandwidth_gbps": 3400,
      "efficiency_pct": 108.1,
      "results_by_test": {
        "h2d": 55.5,
        "d2h": 55.3,
        "d2d_write": 397.4,
        "d2d_read": 395.1,
        "d2d_bidir": 486.5
      },
      "per_gpu": []
    }
  },
  "compute_bench": {
    "compute": {
      "per_dtype_tflops": {
        "fp32": 51.9,
        "tf32": 357.0,
        "fp16": 664.0,
        "bf16": 700.1,
        "fp8": 1116.2
      },
      "peak_tflops": {
        "fp32": 67,
        "tf32": 495,
        "fp16": 990,
        "bf16": 990,
        "fp8": 1979
      },
      "efficiency_pct": {
        "fp32": 77.5,
        "tf32": 72.1,
        "fp16": 67.1,
        "bf16": 70.7,
        "fp8": 56.4
      },
      "pass_thresholds_tflops": {
        "fp32": 54,
        "tf32": 444,
        "fp16": 734,
        "bf16": 745,
        "fp8": 1400
      },
      "per_gpu": [
        {
          "index": 0,
          "fp32": 51.9,
          "tf32": 357.0,
          "fp16": 664.0,
          "bf16": 700.1,
          "fp8": 1116.2
        },
        {
          "index": 1,
          "fp32": 51.9,
          "tf32": 357.0,
          "fp16": 664.0,
          "bf16": 700.1,
          "fp8": 1116.2
        },
        {
          "index": 2,
          "fp32": 51.9,
          "tf32": 357.0,
          "fp16": 664.0,
          "bf16": 700.1,
          "fp8": 1116.2
        },
        {
          "index": 3,
          "fp32": 51.9,
          "tf32": 357.0,
          "fp16": 664.0,
          "bf16": 700.1,
          "fp8": 1116.2
        },
        {
          "index": 4,
          "fp32": 51.9,
          "tf32": 357.0,
          "fp16": 664.0,
          "bf16": 700.1,
          "fp8": 1116.2
        },
        {
          "index": 5,
          "fp32": 51.9,
          "tf32": 357.0,
          "fp16": 664.0,
          "bf16": 700.1,
          "fp8": 1116.2
        },
        {
          "index": 6,
          "fp32": 51.9,
          "tf32": 357.0,
          "fp16": 664.0,
          "bf16": 700.1,
          "fp8": 1116.2
        },
        {
          "index": 7,
          "fp32": 51.9,
          "tf32": 357.0,
          "fp16": 664.0,
          "bf16": 700.1,
          "fp8": 1116.2
        }
      ],
      "matrix_size": 8192,
      "warmup": 50,
      "iterations": 500
    }
  },
  "nccl": {
    "passed": false,
    "source": "torchrun_fallback",
    "tests": {
      "NCCL version 2.21.5+cuda12.4": {
        "status": "FAIL",
        "error": null
      },
      "allreduce": {
        "status": "PASS",
        "error": null
      },
      "broadcast": {
        "status": "PASS",
        "error": null
      },
      "allgather": {
        "status": "PASS",
        "error": null
      },
      "reducescatter": {
        "status": "PASS",
        "error": null
      },
      "alltoall": {
        "status": "PASS",
        "error": null
      }
    },
    "gpu_count": 8
  },
  "stress": {
    "source": "pytorch",
    "passed": true,
    "duration_sec": 60,
    "elapsed_sec": 60.0,
    "gpu_status": {
      "0": "PASS",
      "1": "PASS",
      "2": "PASS",
      "3": "PASS",
      "4": "PASS",
      "5": "PASS",
      "6": "PASS",
      "7": "PASS"
    },
    "timestamp": "2026-05-22T15:51:56.803540"
  },
  "rdma": {
    "passed": false,
    "devices": [
      {
        "name": "mlx5_0",
        "ports": [
          {
            "port": "1",
            "rate": "400 Gb/sec (4X NDR)",
            "state": "4: ACTIVE",
            "phys_state": "5: LinkUp",
            "gid": "fe80:0000:0000:0000:58a2:e103:0088:81e0"
          }
        ]
      },
      {
        "name": "mlx5_1",
        "ports": [
          {
            "port": "1",
            "rate": "400 Gb/sec (4X NDR)",
            "state": "4: ACTIVE",
            "phys_state": "5: LinkUp",
            "gid": "fe80:0000:0000:0000:9c63:c003:0054:e00a"
          }
        ]
      },
      {
        "name": "mlx5_2",
        "ports": [
          {
            "port": "1",
            "rate": "25 Gb/sec (1X EDR)",
            "state": "4: ACTIVE",
            "phys_state": "5: LinkUp",
            "gid": "fe80:0000:0000:0000:a02d:75ff:feae:2bcf"
          }
        ]
      },
      {
        "name": "mlx5_3",
        "ports": [
          {
            "port": "1",
            "rate": "25 Gb/sec (1X EDR)",
            "state": "1: DOWN",
            "phys_state": "3: Disabled",
            "gid": "fe80:0000:0000:0000:c670:bdff:fefd:5bd9"
          }
        ]
      },
      {
        "name": "mlx5_4",
        "ports": [
          {
            "port": "1",
            "rate": "100 Gb/sec (2X HDR)",
            "state": "4: ACTIVE",
            "phys_state": "5: LinkUp",
            "gid": "fe80:0000:0000:0000:9c63:c003:005f:58ec"
          }
        ]
      },
      {
        "name": "mlx5_5",
        "ports": [
          {
            "port": "1",
            "rate": "100 Gb/sec (2X HDR)",
            "state": "4: ACTIVE",
            "phys_state": "5: LinkUp",
            "gid": "fe80:0000:0000:0000:9c63:c003:005f:58ed"
          }
        ]
      },
      {
        "name": "mlx5_6",
        "ports": [
          {
            "port": "1",
            "rate": "400 Gb/sec (4X NDR)",
            "state": "4: ACTIVE",
            "phys_state": "5: LinkUp",
            "gid": "fe80:0000:0000:0000:9c63:c003:0055:0e56"
          }
        ]
      },
      {
        "name": "mlx5_7",
        "ports": [
          {
            "port": "1",
            "rate": "400 Gb/sec (4X NDR)",
            "state": "4: ACTIVE",
            "phys_state": "5: LinkUp",
            "gid": "fe80:0000:0000:0000:a088:c203:00f0:286c"
          }
        ]
      },
      {
        "name": "mlx5_8",
        "ports": [
          {
            "port": "1",
            "rate": "25 Gb/sec (1X EDR)",
            "state": "4: ACTIVE",
            "phys_state": "5: LinkUp",
            "gid": "fe80:0000:0000:0000:a02d:75ff:feae:2bcf"
          }
        ]
      },
      {
        "name": "mlx5_9",
        "ports": [
          {
            "port": "1",
            "rate": "25 Gb/sec (1X EDR)",
            "state": "1: DOWN",
            "phys_state": "3: Disabled",
            "gid": "fe80:0000:0000:0000:c670:bdff:fefd:569d"
          }
        ]
      }
    ],
    "bandwidth_tests": [
      {
        "test": "ib_write_bw",
        "status": "WARN",
        "bandwidth_gbps": 0.13,
        "min_required_gbps": 50
      },
      {
        "test": "ib_read_bw",
        "status": "WARN",
        "bandwidth_gbps": 0.13,
        "min_required_gbps": 50
      }
    ],
    "latency_tests": [
      {
        "test": "ib_write_lat",
        "status": "PASS",
        "latency_us": 4.1,
        "max_allowed_us": 10
      },
      {
        "test": "ib_read_lat",
        "status": "WARN",
        "latency_us": 16.0,
        "max_allowed_us": 10
      }
    ],
    "timestamp": "2026-05-22T15:52:03.507540"
  },
  "training": {
    "model": "synthetic_transformer",
    "total_params_m": 1470.5,
    "num_layers": 6,
    "hidden_size": 4096,
    "gpu_count": 8,
    "dtype": "bfloat16",
    "batch_size": 8,
    "seq_length": 2048,
    "num_steps": 50,
    "avg_step_time_ms": 312.3,
    "throughput_tokens_per_sec": 52471.0,
    "throughput_samples_per_sec": 25.62,
    "peak_memory_gb": 27.31,
    "final_loss": 0.0041,
    "timestamp": "2026-05-22T15:52:32.650522"
  }
 }
--- a/reports_all_aikubeworker0016.md
+++ b/reports_all_aikubeworker0016.md
@ -0,0 +1,157 @@
 # GPU Test Report
 - **Date:** 2026-05-22T15:49:02.368516
 - **Host:** aikubeworker0016
 - **GPU:** NVIDIA H100 80GB HBM3 x8
 - **Driver:** 580.159.03 | **CUDA:** 13.0
 ## Overall Acceptance Verdict
 **Result: FAIL**
 Failed or unverified items:
 - Compute Throughput: FAIL (worst FP32 52 vs >= 54)
 - NCCL: FAIL (no nccl-tests bus BW)
 - RDMA: FAIL
 - Training: UNVERIFIED (52471 tokens/sec; legacy result lacks explicit acceptance verdict)
 Missing required evidence:
 - NVLink/NVSwitch
 - DCGM
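The verdict above combines two lists: items that ran but did not pass, and required items with no evidence at all. A minimal sketch of that rule follows, assuming the required list is the union of the items named in these reports; `overall_verdict` and the shape of `results` are hypothetical and not taken from the repository.

```python
# Required evidence items, collected from the item names in these reports.
REQUIRED_ITEMS = [
    "GPU Info", "Health Check", "Memory Bandwidth", "Compute Throughput",
    "NVLink/NVSwitch", "NCCL", "Stress Test", "RDMA", "DCGM", "Training",
]


def overall_verdict(results):
    """Return (verdict, failed_or_unverified, missing).

    `results` maps an item name to "PASS", "FAIL", or "UNVERIFIED"
    (a hypothetical shape chosen for this sketch).
    """
    failed = [name for name in REQUIRED_ITEMS
              if name in results and results[name] != "PASS"]
    missing = [name for name in REQUIRED_ITEMS if name not in results]
    verdict = "PASS" if not failed and not missing else "FAIL"
    return verdict, failed, missing


# Statuses copied from the Summary table of this report.
results_0016 = {
    "GPU Info": "PASS",
    "Health Check": "PASS",
    "Memory Bandwidth": "PASS",
    "Compute Throughput": "FAIL",
    "NCCL": "FAIL",
    "Stress Test": "PASS",
    "RDMA": "FAIL",
    "Training": "UNVERIFIED",
}
verdict, failed, missing = overall_verdict(results_0016)
print(verdict, failed, missing)
```

Under this rule a report that contains only one passing item, such as a DCGM-only run, still ends in FAIL because the other nine items are missing.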
 ## Summary
 | Test | Result |
 |------|--------|
 | GPU Info | PASS (8 GPUs detected) |
 | Health Check | PASS |
 | Memory Bandwidth | PASS (108.1%) |
 | Compute Throughput | FAIL (worst FP32 52 vs >= 54) |
 | NCCL | FAIL (no nccl-tests bus BW) |
 | Stress Test | PASS |
 | RDMA | FAIL |
 | Training | UNVERIFIED (52471 tokens/sec; legacy result lacks explicit acceptance verdict) |
 ## GPU Information
 | GPU | Model | VRAM | Temp | Power | SM Clock |
 |-----|-------|------|------|-------|----------|
 | 0 | NVIDIA H100 80GB HBM3 | 81559 MB | 21C | 70/700W | 345 MHz |
 | 1 | NVIDIA H100 80GB HBM3 | 81559 MB | 21C | 68/700W | 345 MHz |
 | 2 | NVIDIA H100 80GB HBM3 | 81559 MB | 22C | 67/700W | 345 MHz |
 | 3 | NVIDIA H100 80GB HBM3 | 81559 MB | 21C | 67/700W | 345 MHz |
 | 4 | NVIDIA H100 80GB HBM3 | 81559 MB | 21C | 67/700W | 345 MHz |
 | 5 | NVIDIA H100 80GB HBM3 | 81559 MB | 23C | 69/700W | 345 MHz |
 | 6 | NVIDIA H100 80GB HBM3 | 81559 MB | 21C | 68/700W | 345 MHz |
 | 7 | NVIDIA H100 80GB HBM3 | 81559 MB | 21C | 66/700W | 345 MHz |
 ## Health Check
 **Overall: PASS**
 | GPU | Temp | Power | ECC | PCIe | Throttle | Status |
 |-----|------|-------|-----|------|----------|--------|
 | 0 | 21C PASS | 70W PASS | S:0 D:0 | Gen5x16 | PASS | **WARN** |
 | 1 | 21C PASS | 67W PASS | S:0 D:0 | Gen5x16 | PASS | **WARN** |
 | 2 | 22C PASS | 67W PASS | S:0 D:0 | Gen5x16 | PASS | **WARN** |
 | 3 | 21C PASS | 67W PASS | S:0 D:0 | Gen5x16 | PASS | **WARN** |
 | 4 | 21C PASS | 67W PASS | S:0 D:0 | Gen5x16 | PASS | **WARN** |
 | 5 | 23C PASS | 69W PASS | S:0 D:0 | Gen5x16 | PASS | **WARN** |
 | 6 | 21C PASS | 68W PASS | S:0 D:0 | Gen5x16 | PASS | **WARN** |
 | 7 | 21C PASS | 66W PASS | S:0 D:0 | Gen5x16 | PASS | **WARN** |
 ## Memory Bandwidth
 Source: nvbandwidth
 | Metric | Value | Peak | Efficiency |
 |--------|-------|------|------------|
 | H2D (PCIe) | 55.5 GB/s | 64 GB/s | 86.7% |
 | D2H (PCIe) | 55.3 GB/s | 64 GB/s | 86.4% |
 | D2D (NVLink) | 486.5 GB/s | 450 GB/s | 108.1% |
 **Verdict: PASS** (D2D efficiency 108.1%)
 ## Compute Throughput
 | DType | Achieved (TFLOPS) | Peak | Threshold | Status |
 |-------|-------------------|------|------------|--------|
 | FP32 | 51.9 | 67 | >= 54 | FAIL |
 | TF32 | 357.0 | 495 | >= 444 | FAIL |
 | FP16 | 664.0 | 990 | >= 734 | FAIL |
 | BF16 | 700.1 | 990 | >= 745 | FAIL |
 | FP8 | 1116.2 | 1979 | >= 1400 | FAIL |
 **Verdict: FAIL** (absolute TFLOPS thresholds; worst efficiency 56.4%)
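The table mixes two measures: the status column compares achieved TFLOPS with an absolute threshold, while the efficiency quoted in the verdict is achieved divided by peak. The sketch below recomputes both from the values in the table; `evaluate_compute` is a hypothetical helper, not the repository's implementation.

```python
# Peak and threshold values copied from the Compute Throughput table.
PEAK_TFLOPS = {"FP32": 67, "TF32": 495, "FP16": 990, "BF16": 990, "FP8": 1979}
THRESHOLD_TFLOPS = {"FP32": 54, "TF32": 444, "FP16": 734, "BF16": 745, "FP8": 1400}


def evaluate_compute(achieved):
    """Return per-dtype efficiency (percent of peak) and PASS/FAIL status."""
    rows = {}
    for dtype, tflops in achieved.items():
        rows[dtype] = {
            "efficiency_pct": round(100.0 * tflops / PEAK_TFLOPS[dtype], 1),
            "status": "PASS" if tflops >= THRESHOLD_TFLOPS[dtype] else "FAIL",
        }
    return rows


# Achieved values for aikubeworker0016, copied from the table.
achieved = {"FP32": 51.9, "TF32": 357.0, "FP16": 664.0, "BF16": 700.1, "FP8": 1116.2}
rows = evaluate_compute(achieved)
worst = min(rows, key=lambda dtype: rows[dtype]["efficiency_pct"])
print(worst, rows[worst])
```

The lowest efficiency comes from FP8 (1116.2 / 1979), which is where the 56.4% in the verdict line originates.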
 ### Compute Per-GPU TFLOPS
 | GPU | FP32 | TF32 | FP16 | BF16 | FP8 |
 |---|---|---|---|---|---|
 | 0 | 51.9 | 357.0 | 664.0 | 700.1 | 1116.2 |
 | 1 | 51.9 | 357.0 | 664.0 | 700.1 | 1116.2 |
 | 2 | 51.9 | 357.0 | 664.0 | 700.1 | 1116.2 |
 | 3 | 51.9 | 357.0 | 664.0 | 700.1 | 1116.2 |
 | 4 | 51.9 | 357.0 | 664.0 | 700.1 | 1116.2 |
 | 5 | 51.9 | 357.0 | 664.0 | 700.1 | 1116.2 |
 | 6 | 51.9 | 357.0 | 664.0 | 700.1 | 1116.2 |
 | 7 | 51.9 | 357.0 | 664.0 | 700.1 | 1116.2 |
 ## NCCL Multi-GPU
 Source: torchrun_fallback | GPUs: 8
 > Functional NCCL smoke only: nccl-tests bus bandwidth was not measured, so this does not satisfy production acceptance.
 | Operation | Bus BW (GB/s) | Threshold | Status |
 |-----------|---------------|-----------|--------|
 | NCCL version 2.21.5+cuda12.4 | 0.0 | >= 0 | FAIL |
 | allreduce | 0.0 | >= 0 | PASS |
 | broadcast | 0.0 | >= 0 | PASS |
 | allgather | 0.0 | >= 0 | PASS |
 | reducescatter | 0.0 | >= 0 | PASS |
 | alltoall | 0.0 | >= 0 | PASS |
 **Overall: FAIL**
 ## Stress Test
 - **Source:** pytorch
 - **Duration:** 60s (requested 60s)
 - **Result: PASS**
 ## RDMA/InfiniBand
 > Legacy RDMA result re-evaluated against the current PDF acceptance thresholds; the old WARN statuses and the old 50 GB/s and 10 us limits are not used for the verdict.
 | Test | Value | Threshold | Status |
 |------|-------|-----------|--------|
 | ib_write_bw | 0.1 GB/s | >= 47 GB/s | FAIL |
 | ib_read_bw | 0.1 GB/s | >= 47 GB/s | FAIL |
 | ib_write_lat | 4.10 us | <= 2 us | FAIL |
 | ib_read_lat | 16.00 us | <= 3.5 us | FAIL |
 - **Failure reasons:**
  - ib_write_bw bandwidth 0.13GB/s < 47GB/s
  - ib_read_bw bandwidth 0.13GB/s < 47GB/s
  - ib_write_lat latency 4.1us > 2us
  - ib_read_lat latency 16.0us > 3.5us
 **Overall: FAIL**
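The re-evaluation described in the note can be sketched as follows: the stored `status`, `min_required_gbps`, and `max_allowed_us` fields of the legacy JSON are ignored, and the measured values are compared with the thresholds shown in the table. `reevaluate_rdma` is a hypothetical name, and the dict layout assumes the `rdma` block of the JSON report.

```python
# Thresholds taken from the table above.
BW_MIN_GBPS = 47.0
LAT_MAX_US = {"ib_write_lat": 2.0, "ib_read_lat": 3.5}


def reevaluate_rdma(legacy):
    """Recompute the verdict from measured values only, ignoring stored statuses."""
    reasons = []
    for t in legacy.get("bandwidth_tests", []):
        if t["bandwidth_gbps"] < BW_MIN_GBPS:
            reasons.append(
                f'{t["test"]} bandwidth {t["bandwidth_gbps"]}GB/s < {BW_MIN_GBPS:g}GB/s')
    for t in legacy.get("latency_tests", []):
        limit = LAT_MAX_US[t["test"]]
        if t["latency_us"] > limit:
            reasons.append(f'{t["test"]} latency {t["latency_us"]}us > {limit:g}us')
    return ("FAIL" if reasons else "PASS"), reasons


# Legacy values copied from the JSON report for this host.
legacy = {
    "bandwidth_tests": [
        {"test": "ib_write_bw", "status": "WARN", "bandwidth_gbps": 0.13, "min_required_gbps": 50},
        {"test": "ib_read_bw", "status": "WARN", "bandwidth_gbps": 0.13, "min_required_gbps": 50},
    ],
    "latency_tests": [
        {"test": "ib_write_lat", "status": "PASS", "latency_us": 4.1, "max_allowed_us": 10},
        {"test": "ib_read_lat", "status": "WARN", "latency_us": 16.0, "max_allowed_us": 10},
    ],
}
verdict, reasons = reevaluate_rdma(legacy)
for reason in reasons:
    print(reason)
```

Note that `ib_write_lat` carried a legacy PASS at 4.1 us but fails once the 2 us limit is applied, which is why the stored status cannot be reused.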
 ## Training Simulation
 | Metric | Value |
 |--------|-------|
 | Model | synthetic_transformer |
 | Params | 1470.5M |
 | Throughput | 52471 tokens/sec |
 | Avg Step Time | 312.3 ms |
 | Peak Memory | 27.3 GB |
 | Final Loss | 0.0041 |
 | Step Jitter | N/A |
 | Distributed Mode | N/A |
 | Acceptance Gaps | missing passed, step_jitter_pct, distributed_mode, loss_finite |
 | Verdict | UNVERIFIED (52471 tokens/sec; legacy result lacks explicit acceptance verdict) |
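The UNVERIFIED verdict follows from the "Acceptance Gaps" row: the legacy training result carries no explicit verdict fields. A minimal sketch of that check, assuming the four field names listed in the row are the complete required set; `training_verdict` is a hypothetical helper.

```python
# Field names taken from the "Acceptance Gaps" row above.
REQUIRED_FIELDS = ("passed", "step_jitter_pct", "distributed_mode", "loss_finite")


def training_verdict(result):
    """Return (verdict, gaps); any missing required field yields UNVERIFIED."""
    gaps = [field for field in REQUIRED_FIELDS if field not in result]
    if gaps:
        return "UNVERIFIED", gaps
    return ("PASS" if result["passed"] else "FAIL"), gaps


# Subset of the legacy training result from the JSON report.
legacy_training = {
    "model": "synthetic_transformer",
    "throughput_tokens_per_sec": 52471.0,
    "avg_step_time_ms": 312.3,
    "final_loss": 0.0041,
}
training_status, gaps = training_verdict(legacy_training)
print(training_status, gaps)
```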
 ---
 *Generated by GPU Test Suite v0.2.0*
--- a/reports_dcgm_r3_aikubeworker0012_20260522_200338.md
+++ b/reports_dcgm_r3_aikubeworker0012_20260522_200338.md
@ -0,0 +1,65 @@
 # GPU Test Report
 - **Date:** 2026-05-22T20:26:56.947796
 - **Host:** aikubeworker0012
 ## Overall Acceptance Verdict
 **Result: FAIL**
 Missing required evidence:
 - GPU Info
 - Health Check
 - Memory Bandwidth
 - Compute Throughput
 - NVLink/NVSwitch
 - NCCL
 - Stress Test
 - RDMA
 - Training
 ## Summary
 | Test | Result |
 |------|--------|
 | DCGM | PASS |
 ## DCGM Diagnostic
 **Overall: PASS**
 | Subtest | Status |
 |---------|--------|
 | Hardware/nvbandwidth/GPU6 | PASS |
 | Hardware/nvbandwidth/GPU7 | PASS |
 | Hardware/nvbandwidth/summary | PASS |
 | Integration/pcie/GPU0 | PASS |
 | Integration/pcie/GPU1 | PASS |
 | Integration/pcie/GPU2 | PASS |
 | Integration/pcie/GPU3 | PASS |
 | Integration/pcie/GPU4 | PASS |
 | Integration/pcie/GPU5 | PASS |
 | Integration/pcie/GPU6 | PASS |
 | Integration/pcie/GPU7 | PASS |
 | Integration/pcie/summary | PASS |
 | Stress/targeted_stress/GPU0 | PASS |
 | Stress/targeted_stress/GPU1 | PASS |
 | Stress/targeted_stress/GPU2 | PASS |
 | Stress/targeted_stress/GPU3 | PASS |
 | Stress/targeted_stress/GPU4 | PASS |
 | Stress/targeted_stress/GPU5 | PASS |
 | Stress/targeted_stress/GPU6 | PASS |
 | Stress/targeted_stress/GPU7 | PASS |
 | Stress/targeted_stress/summary | PASS |
 | Stress/targeted_power/GPU0 | PASS |
 | Stress/targeted_power/GPU1 | PASS |
 | Stress/targeted_power/GPU2 | PASS |
 | Stress/targeted_power/GPU3 | PASS |
 | Stress/targeted_power/GPU4 | PASS |
 | Stress/targeted_power/GPU5 | PASS |
 | Stress/targeted_power/GPU6 | PASS |
 | Stress/targeted_power/GPU7 | PASS |
 | Stress/targeted_power/summary | PASS |
 ---
 *Generated by GPU Test Suite v0.2.0*
--- a/reports_dcgm_r3_aikubeworker0016_20260522_200538.md
+++ b/reports_dcgm_r3_aikubeworker0016_20260522_200538.md
@ -0,0 +1,65 @@
 # GPU Test Report
 - **Date:** 2026-05-22T20:28:58.716266
 - **Host:** aikubeworker0016
 ## Overall Acceptance Verdict
 **Result: FAIL**
 Missing required evidence:
 - GPU Info
 - Health Check
 - Memory Bandwidth
 - Compute Throughput
 - NVLink/NVSwitch
 - NCCL
 - Stress Test
 - RDMA
 - Training
 ## Summary
 | Test | Result |
 |------|--------|
 | DCGM | PASS |
 ## DCGM Diagnostic
 **Overall: PASS**
 | Subtest | Status |
 |---------|--------|
 | Hardware/nvbandwidth/GPU6 | PASS |
 | Hardware/nvbandwidth/GPU7 | PASS |
 | Hardware/nvbandwidth/summary | PASS |
 | Integration/pcie/GPU0 | PASS |
 | Integration/pcie/GPU1 | PASS |
 | Integration/pcie/GPU2 | PASS |
 | Integration/pcie/GPU3 | PASS |
 | Integration/pcie/GPU4 | PASS |
 | Integration/pcie/GPU5 | PASS |
 | Integration/pcie/GPU6 | PASS |
 | Integration/pcie/GPU7 | PASS |
 | Integration/pcie/summary | PASS |
 | Stress/targeted_stress/GPU0 | PASS |
 | Stress/targeted_stress/GPU1 | PASS |
 | Stress/targeted_stress/GPU2 | PASS |
 | Stress/targeted_stress/GPU3 | PASS |
 | Stress/targeted_stress/GPU4 | PASS |
 | Stress/targeted_stress/GPU5 | PASS |
 | Stress/targeted_stress/GPU6 | PASS |
 | Stress/targeted_stress/GPU7 | PASS |
 | Stress/targeted_stress/summary | PASS |
 | Stress/targeted_power/GPU0 | PASS |
 | Stress/targeted_power/GPU1 | PASS |
 | Stress/targeted_power/GPU2 | PASS |
 | Stress/targeted_power/GPU3 | PASS |
 | Stress/targeted_power/GPU4 | PASS |
 | Stress/targeted_power/GPU5 | PASS |
 | Stress/targeted_power/GPU6 | PASS |
 | Stress/targeted_power/GPU7 | PASS |
 | Stress/targeted_power/summary | PASS |
 ---
 *Generated by GPU Test Suite v0.2.0*
--- a/reports_nvbandwidth_aikubeworker0012.json
+++ b/reports_nvbandwidth_aikubeworker0012.json
@ -0,0 +1,70 @@
 {
  "benchmark": {
    "memory": {
      "source": "nvbandwidth",
      "h2d_bandwidth_gbps": 55.5,
      "d2h_bandwidth_gbps": 54.8,
      "d2d_bandwidth_gbps": 0.0,
      "h2d_peak_gbps": 64,
      "d2h_peak_gbps": 64,
      "d2d_peak_gbps": 450.0,
      "h2d_efficiency_pct": 86.7,
      "d2h_efficiency_pct": 85.6,
      "d2d_efficiency_pct": null,
      "peak_bandwidth_gbps": 3400,
      "efficiency_pct": null,
      "results_by_test": {
        "h2d": 55.5,
        "d2h": 54.8,
        "d2d_write": 0.0,
        "d2d_read": 0.0,
        "d2d_bidir": 0.0
      },
      "per_gpu": []
    },
    "compute": {
      "per_dtype_tflops": {
        "fp32": 52.2,
        "tf32": 360.7,
        "fp16": 680.0,
        "bf16": 707.6,
        "fp8": 1142.4
      },
      "peak_tflops": {
        "fp32": 67,
        "tf32": 495,
        "fp16": 990,
        "bf16": 990,
        "fp8": 1979
      },
      "efficiency_pct": {
        "fp32": 77.9,
        "tf32": 72.9,
        "fp16": 68.7,
        "bf16": 71.5,
        "fp8": 57.7
      },
      "pass_thresholds_tflops": {
        "fp32": 54,
        "tf32": 444,
        "fp16": 734,
        "bf16": 745,
        "fp8": 1400
      },
      "per_gpu": [
        {
          "index": 0,
          "fp32": 52.2,
          "tf32": 360.7,
          "fp16": 680.0,
          "bf16": 707.6,
          "fp8": 1142.4
        }
      ],
      "matrix_size": 8192,
      "warmup": 50,
      "iterations": 500
    }
  },
  "timestamp": "2026-05-22T15:35:16.675924"
 }
--- a/reports_nvbandwidth_aikubeworker0012.md
+++ b/reports_nvbandwidth_aikubeworker0012.md
@ -0,0 +1,38 @@
 # GPU Test Report
 - **Date:** 2026-05-22 15:37:12
 - **Host:** aikubeworker0012
 ## Summary
 | Test | Result |
 |------|--------|
 | Memory Bandwidth | FAIL (0.0%) |
 | Compute Throughput | FAIL (worst TF32 361 vs >= 444) |
 ## Memory Bandwidth
 Source: nvbandwidth
 | Metric | Value | Peak | Efficiency |
 |--------|-------|------|------------|
 | H2D (PCIe) | 55.5 GB/s | 64 GB/s | 86.7% |
 | D2H (PCIe) | 54.8 GB/s | 64 GB/s | 85.6% |
 | D2D (NVLink) | 0.0 GB/s | 450 GB/s | 0.0% |
 **Verdict: FAIL** (D2D efficiency 0.0%)
 ## Compute Throughput
 | DType | Achieved (TFLOPS) | Peak | Threshold | Status |
 |-------|-------------------|------|------------|--------|
 | FP32 | 52.2 | 67 | >= 54 | WARN |
 | TF32 | 360.7 | 495 | >= 444 | FAIL |
 | FP16 | 680.0 | 990 | >= 734 | WARN |
 | BF16 | 707.6 | 990 | >= 745 | WARN |
 | FP8 | 1142.4 | 1979 | >= 1400 | FAIL |
 **Verdict: FAIL** (absolute TFLOPS thresholds; worst efficiency 57.7%)
 ---
 *Generated by GPU Test Suite v0.2.0*
--- a/reports_nvbandwidth_aikubeworker0016.json
+++ b/reports_nvbandwidth_aikubeworker0016.json
@ -0,0 +1,70 @@
 {
  "benchmark": {
    "memory": {
      "source": "nvbandwidth",
      "h2d_bandwidth_gbps": 55.5,
      "d2h_bandwidth_gbps": 55.0,
      "d2d_bandwidth_gbps": 0.0,
      "h2d_peak_gbps": 64,
      "d2h_peak_gbps": 64,
      "d2d_peak_gbps": 450.0,
      "h2d_efficiency_pct": 86.7,
      "d2h_efficiency_pct": 85.9,
      "d2d_efficiency_pct": null,
      "peak_bandwidth_gbps": 3400,
      "efficiency_pct": null,
      "results_by_test": {
        "h2d": 55.5,
        "d2h": 55.0,
        "d2d_write": 0.0,
        "d2d_read": 0.0,
        "d2d_bidir": 0.0
      },
      "per_gpu": []
    },
    "compute": {
      "per_dtype_tflops": {
        "fp32": 52.2,
        "tf32": 357.5,
        "fp16": 665.3,
        "bf16": 697.1,
        "fp8": 1138.8
      },
      "peak_tflops": {
        "fp32": 67,
        "tf32": 495,
        "fp16": 990,
        "bf16": 990,
        "fp8": 1979
      },
      "efficiency_pct": {
        "fp32": 77.9,
        "tf32": 72.2,
        "fp16": 67.2,
        "bf16": 70.4,
        "fp8": 57.5
      },
      "pass_thresholds_tflops": {
        "fp32": 54,
        "tf32": 444,
        "fp16": 734,
        "bf16": 745,
        "fp8": 1400
      },
      "per_gpu": [
        {
          "index": 0,
          "fp32": 52.2,
          "tf32": 357.5,
          "fp16": 665.3,
          "bf16": 697.1,
          "fp8": 1138.8
        }
      ],
      "matrix_size": 8192,
      "warmup": 50,
      "iterations": 500
    }
  },
  "timestamp": "2026-05-22T15:35:19.219299"
 }
--- a/reports_nvbandwidth_aikubeworker0016.md
+++ b/reports_nvbandwidth_aikubeworker0016.md
@ -0,0 +1,38 @@
 # GPU Test Report
 - **Date:** 2026-05-22 15:37:18
 - **Host:** aikubeworker0016
 ## Summary
 | Test | Result |
 |------|--------|
 | Memory Bandwidth | FAIL (0.0%) |
 | Compute Throughput | FAIL (worst TF32 358 vs >= 444) |
 ## Memory Bandwidth
 Source: nvbandwidth
 | Metric | Value | Peak | Efficiency |
 |--------|-------|------|------------|
 | H2D (PCIe) | 55.5 GB/s | 64 GB/s | 86.7% |
 | D2H (PCIe) | 55.0 GB/s | 64 GB/s | 85.9% |
 | D2D (NVLink) | 0.0 GB/s | 450 GB/s | 0.0% |
 **Verdict: FAIL** (D2D efficiency 0.0%)
 ## Compute Throughput
 | DType | Achieved (TFLOPS) | Peak | Threshold | Status |
 |-------|-------------------|------|------------|--------|
 | FP32 | 52.2 | 67 | >= 54 | WARN |
 | TF32 | 357.5 | 495 | >= 444 | FAIL |
 | FP16 | 665.3 | 990 | >= 734 | WARN |
 | BF16 | 697.1 | 990 | >= 745 | WARN |
 | FP8 | 1138.8 | 1979 | >= 1400 | FAIL |
 **Verdict: FAIL** (absolute TFLOPS thresholds; worst efficiency 57.5%)
 ---
 *Generated by GPU Test Suite v0.2.0*
--- a/reports_rdma_aikubeworker0012.json
+++ b/reports_rdma_aikubeworker0012.json
@ -0,0 +1,157 @@
 {
  "rdma": {
    "passed": false,
    "devices": [
      {
        "name": "mlx5_0",
        "ports": [
          {
            "port": "1",
            "rate": "400 Gb/sec (4X NDR)",
            "state": "4: ACTIVE",
            "phys_state": "5: LinkUp",
            "gid": "fe80:0000:0000:0000:58a2:e103:0093:3898"
          }
        ]
      },
      {
        "name": "mlx5_1",
        "ports": [
          {
            "port": "1",
            "rate": "400 Gb/sec (4X NDR)",
            "state": "4: ACTIVE",
            "phys_state": "5: LinkUp",
            "gid": "fe80:0000:0000:0000:58a2:e103:0093:3db0"
          }
        ]
      },
      {
        "name": "mlx5_2",
        "ports": [
          {
            "port": "1",
            "rate": "25 Gb/sec (1X EDR)",
            "state": "4: ACTIVE",
            "phys_state": "5: LinkUp",
            "gid": "fe80:0000:0000:0000:5c3f:b8ff:fe5e:7832"
          }
        ]
      },
      {
        "name": "mlx5_3",
        "ports": [
          {
            "port": "1",
            "rate": "25 Gb/sec (1X EDR)",
            "state": "1: DOWN",
            "phys_state": "3: Disabled",
            "gid": "fe80:0000:0000:0000:5e25:73ff:fe4e:eac1"
          }
        ]
      },
      {
        "name": "mlx5_4",
        "ports": [
          {
            "port": "1",
            "rate": "100 Gb/sec (2X HDR)",
            "state": "4: ACTIVE",
            "phys_state": "5: LinkUp",
            "gid": "fe80:0000:0000:0000:9c63:c003:005f:63cc"
          }
        ]
      },
      {
        "name": "mlx5_5",
        "ports": [
          {
            "port": "1",
            "rate": "100 Gb/sec (2X HDR)",
            "state": "4: ACTIVE",
            "phys_state": "5: LinkUp",
            "gid": "fe80:0000:0000:0000:9c63:c003:005f:63cd"
          }
        ]
      },
      {
        "name": "mlx5_6",
        "ports": [
          {
            "port": "1",
            "rate": "400 Gb/sec (4X NDR)",
            "state": "4: ACTIVE",
            "phys_state": "5: LinkUp",
            "gid": "fe80:0000:0000:0000:58a2:e103:0093:3bf4"
          }
        ]
      },
      {
        "name": "mlx5_7",
        "ports": [
          {
            "port": "1",
            "rate": "400 Gb/sec (4X NDR)",
            "state": "4: ACTIVE",
            "phys_state": "5: LinkUp",
            "gid": "fe80:0000:0000:0000:58a2:e103:0093:3e28"
          }
        ]
      },
      {
        "name": "mlx5_8",
        "ports": [
          {
            "port": "1",
            "rate": "25 Gb/sec (1X EDR)",
            "state": "4: ACTIVE",
            "phys_state": "5: LinkUp",
            "gid": "fe80:0000:0000:0000:5c3f:b8ff:fe5e:7832"
          }
        ]
      },
      {
        "name": "mlx5_9",
        "ports": [
          {
            "port": "1",
            "rate": "25 Gb/sec (1X EDR)",
            "state": "1: DOWN",
            "phys_state": "3: Disabled",
            "gid": "fe80:0000:0000:0000:5e25:73ff:fe63:1717"
          }
        ]
      }
    ],
    "bandwidth_tests": [
      {
        "test": "ib_write_bw",
        "status": "WARN",
        "bandwidth_gbps": 0.13,
        "min_required_gbps": 50
      },
      {
        "test": "ib_read_bw",
        "status": "WARN",
        "bandwidth_gbps": 0.13,
        "min_required_gbps": 50
      }
    ],
    "latency_tests": [
      {
        "test": "ib_write_lat",
        "status": "PASS",
        "latency_us": 4.53,
        "max_allowed_us": 10
      },
      {
        "test": "ib_read_lat",
        "status": "WARN",
        "latency_us": 16.0,
        "max_allowed_us": 10
      }
    ],
    "timestamp": "2026-05-22T15:41:20.534115"
  },
  "timestamp": "2026-05-22T15:41:20.544589"
 }
--- a/reports_rdma_aikubeworker0016.json
+++ b/reports_rdma_aikubeworker0016.json
@ -0,0 +1,157 @@
 {
  "rdma": {
    "passed": false,
    "devices": [
      {
        "name": "mlx5_0",
        "ports": [
          {
            "port": "1",
            "rate": "400 Gb/sec (4X NDR)",
            "state": "4: ACTIVE",
            "phys_state": "5: LinkUp",
            "gid": "fe80:0000:0000:0000:58a2:e103:0088:81e0"
          }
        ]
      },
      {
        "name": "mlx5_1",
        "ports": [
          {
            "port": "1",
            "rate": "400 Gb/sec (4X NDR)",
            "state": "4: ACTIVE",
            "phys_state": "5: LinkUp",
            "gid": "fe80:0000:0000:0000:9c63:c003:0054:e00a"
          }
        ]
      },
      {
        "name": "mlx5_2",
        "ports": [
          {
            "port": "1",
            "rate": "25 Gb/sec (1X EDR)",
            "state": "4: ACTIVE",
            "phys_state": "5: LinkUp",
            "gid": "fe80:0000:0000:0000:a02d:75ff:feae:2bcf"
          }
        ]
      },
      {
        "name": "mlx5_3",
        "ports": [
          {
            "port": "1",
            "rate": "25 Gb/sec (1X EDR)",
            "state": "1: DOWN",
            "phys_state": "3: Disabled",
            "gid": "fe80:0000:0000:0000:c670:bdff:fefd:5bd9"
          }
        ]
      },
      {
        "name": "mlx5_4",
        "ports": [
          {
            "port": "1",
            "rate": "100 Gb/sec (2X HDR)",
            "state": "4: ACTIVE",
            "phys_state": "5: LinkUp",
            "gid": "fe80:0000:0000:0000:9c63:c003:005f:58ec"
          }
        ]
      },
      {
        "name": "mlx5_5",
        "ports": [
          {
            "port": "1",
            "rate": "100 Gb/sec (2X HDR)",
            "state": "4: ACTIVE",
            "phys_state": "5: LinkUp",
            "gid": "fe80:0000:0000:0000:9c63:c003:005f:58ed"
          }
        ]
      },
      {
        "name": "mlx5_6",
        "ports": [
          {
            "port": "1",
            "rate": "400 Gb/sec (4X NDR)",
            "state": "4: ACTIVE",
            "phys_state": "5: LinkUp",
            "gid": "fe80:0000:0000:0000:9c63:c003:0055:0e56"
          }
        ]
      },
      {
        "name": "mlx5_7",
        "ports": [
          {
            "port": "1",
            "rate": "400 Gb/sec (4X NDR)",
            "state": "4: ACTIVE",
            "phys_state": "5: LinkUp",
            "gid": "fe80:0000:0000:0000:a088:c203:00f0:286c"
          }
        ]
      },
      {
        "name": "mlx5_8",
        "ports": [
          {
            "port": "1",
            "rate": "25 Gb/sec (1X EDR)",
            "state": "4: ACTIVE",
            "phys_state": "5: LinkUp",
            "gid": "fe80:0000:0000:0000:a02d:75ff:feae:2bcf"
          }
        ]
      },
      {
        "name": "mlx5_9",
        "ports": [
          {
            "port": "1",
            "rate": "25 Gb/sec (1X EDR)",
            "state": "1: DOWN",
            "phys_state": "3: Disabled",
            "gid": "fe80:0000:0000:0000:c670:bdff:fefd:569d"
          }
        ]
      }
    ],
    "bandwidth_tests": [
      {
        "test": "ib_write_bw",
        "status": "WARN",
        "bandwidth_gbps": 0.13,
        "min_required_gbps": 50
      },
      {
        "test": "ib_read_bw",
        "status": "WARN",
        "bandwidth_gbps": 0.13,
        "min_required_gbps": 50
      }
    ],
    "latency_tests": [
      {
        "test": "ib_write_lat",
        "status": "PASS",
        "latency_us": 4.22,
        "max_allowed_us": 10
      },
      {
        "test": "ib_read_lat",
        "status": "WARN",
        "latency_us": 16.0,
        "max_allowed_us": 10
      }
    ],
    "timestamp": "2026-05-22T15:41:07.851101"
  },
  "timestamp": "2026-05-22T15:41:07.861558"
 }
--- a/reports_rdma_counter_aikubeworker0012_20260522_194808.md
+++ b/reports_rdma_counter_aikubeworker0012_20260522_194808.md
@ -0,0 +1,62 @@
 # GPU Test Report
 - **Date:** 2026-05-22T19:48:26.622179
 - **Host:** aikubeworker0012
 ## Overall Acceptance Verdict
 **Result: FAIL**
 Failed or unverified items:
 - RDMA: FAIL
 Missing required evidence:
 - GPU Info
 - Health Check
 - Memory Bandwidth
 - Compute Throughput
 - NVLink/NVSwitch
 - NCCL
 - Stress Test
 - DCGM
 - Training
 ## Summary
 | Test | Result |
 |------|--------|
 | RDMA | FAIL |
 ## RDMA/InfiniBand
 ### RDMA Port Checks
 | Device | Port | State | Rate | Required | Status |
 |--------|------|-------|------|----------|--------|
 | mlx5_0 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
 | mlx5_1 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
 | mlx5_4 | 1 | 4: ACTIVE | 100 Gb/sec (2X HDR) | >= 400Gbps ACTIVE | FAIL |
 | mlx5_5 | 1 | 4: ACTIVE | 100 Gb/sec (2X HDR) | >= 400Gbps ACTIVE | FAIL |
 | mlx5_6 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
 | mlx5_7 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
 | Test | Value | Threshold | Status |
 |------|-------|-----------|--------|
 | ib_write_bw | 49.3 GB/s | >= 47 GB/s | PASS |
 | ib_read_bw | 39.2 GB/s | >= 47 GB/s | FAIL |
 | ib_write_lat | 4.49 us | <= 2 us | FAIL |
 | ib_read_lat | 16.00 us | <= 3.5 us | FAIL |
 | ibping | target=0x58 count=5 | 0% packet loss | PASS |
 - **PFC/ECN/CNP/congestion counters checked:** 146
 - **PFC/ECN/CNP/congestion non-zero:** no
 - **Failure reasons:**
  - mlx5_4 port 1 state/rate failed (4: ACTIVE, 100 Gb/sec (2X HDR); required >= 400.0Gbps ACTIVE)
  - mlx5_5 port 1 state/rate failed (4: ACTIVE, 100 Gb/sec (2X HDR); required >= 400.0Gbps ACTIVE)
  - ib_read_bw bandwidth 39.21GB/s < 47GB/s
  - ib_write_lat latency 4.49us > 2.0us
  - ib_read_lat latency 16.0us > 3.5us
 **Overall: FAIL**
 ---
 *Generated by GPU Test Suite v0.2.0*
--- a/reports_rdma_counter_aikubeworker0016_20260522_194828.md
+++ b/reports_rdma_counter_aikubeworker0016_20260522_194828.md
@ -0,0 +1,62 @@
 # GPU Test Report
 - **Date:** 2026-05-22T19:48:45.899570
 - **Host:** aikubeworker0016
 ## Overall Acceptance Verdict
 **Result: FAIL**
 Failed or unverified items:
 - RDMA: FAIL
 Missing required evidence:
 - GPU Info
 - Health Check
 - Memory Bandwidth
 - Compute Throughput
 - NVLink/NVSwitch
 - NCCL
 - Stress Test
 - DCGM
 - Training
 ## Summary
 | Test | Result |
 |------|--------|
 | RDMA | FAIL |
 ## RDMA/InfiniBand
 ### RDMA Port Checks
 | Device | Port | State | Rate | Required | Status |
 |--------|------|-------|------|----------|--------|
 | mlx5_0 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
 | mlx5_1 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
 | mlx5_4 | 1 | 4: ACTIVE | 100 Gb/sec (2X HDR) | >= 400Gbps ACTIVE | FAIL |
 | mlx5_5 | 1 | 4: ACTIVE | 100 Gb/sec (2X HDR) | >= 400Gbps ACTIVE | FAIL |
 | mlx5_6 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
 | mlx5_7 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
 | Test | Value | Threshold | Status |
 |------|-------|-----------|--------|
 | ib_write_bw | 48.1 GB/s | >= 47 GB/s | PASS |
 | ib_read_bw | 40.3 GB/s | >= 47 GB/s | FAIL |
 | ib_write_lat | 4.28 us | <= 2 us | FAIL |
 | ib_read_lat | 16.00 us | <= 3.5 us | FAIL |
 | ibping | target=0x4b count=5 | 0% packet loss | PASS |
 - **PFC/ECN/CNP/congestion counters checked:** 146
 - **PFC/ECN/CNP/congestion non-zero:** no
 - **Failure reasons:**
  - mlx5_4 port 1 state/rate failed (4: ACTIVE, 100 Gb/sec (2X HDR); required >= 400.0Gbps ACTIVE)
  - mlx5_5 port 1 state/rate failed (4: ACTIVE, 100 Gb/sec (2X HDR); required >= 400.0Gbps ACTIVE)
  - ib_read_bw bandwidth 40.3GB/s < 47GB/s
  - ib_write_lat latency 4.28us > 2.0us
  - ib_read_lat latency 16.0us > 3.5us
 **Overall: FAIL**
 ---
 *Generated by GPU Test Suite v0.2.0*
--- a/reports_rdma_cross_node_mlx5_0_20260523.md
+++ b/reports_rdma_cross_node_mlx5_0_20260523.md
@ -0,0 +1,50 @@
 # RDMA Cross-node Evidence Report
 - **Date:** 2026-05-23 Asia/Shanghai
 - **Scope:** `aikubeworker0012` <-> `aikubeworker0016`, single rail `mlx5_0`, port 1
 - **Client/server bootstrap IPs:** `172.72.8.12` and `172.72.8.16`
 - **Bandwidth message size:** 4MB
 - **Latency message size:** 8B
 - **Iterations:** 1000
 ## Port Evidence
 | Host | Device | State | Rate | Link | LID |
 |---|---|---|---|---|---|
 | aikubeworker0012 | mlx5_0/1 | ACTIVE | 400 Gb/sec (4X NDR) | InfiniBand | 0x58 |
 | aikubeworker0016 | mlx5_0/1 | ACTIVE | 400 Gb/sec (4X NDR) | InfiniBand | 0x4b |
 ## Cross-node Perftest Results
 | Direction | Test | Value | PDF Threshold | Status |
 |---|---|---:|---:|---|
 | 0016 -> 0012 | ib_write_bw | 49.35 GB/s | >= 47 GB/s | PASS |
 | 0016 -> 0012 | ib_read_bw | 44.36 GB/s | >= 47 GB/s | FAIL |
 | 0016 -> 0012 | ib_write_lat avg | 2.17 us | <= 2.0 us | FAIL |
 | 0016 -> 0012 | ib_read_lat avg | 4.05 us | <= 3.5 us | FAIL |
 | 0012 -> 0016 | ib_write_bw | 48.38 GB/s | >= 47 GB/s | PASS |
 | 0012 -> 0016 | ib_read_bw | 44.37 GB/s | >= 47 GB/s | FAIL |
 | 0012 -> 0016 | ib_write_lat avg | 2.13 us | <= 2.0 us | FAIL |
 | 0012 -> 0016 | ib_read_lat avg | 4.08 us | <= 3.5 us | FAIL |
 ## Bidirectional ibping
 | Direction | Target LID | Result |
 |---|---|---|
 | 0016 -> 0012 | 0x58 | 5 transmitted, 5 received, 0% packet loss; avg 0.005 ms |
 | 0012 -> 0016 | 0x4b | 5 transmitted, 5 received, 0% packet loss; avg 0.005 ms |
 ## Fabric Counters
 | Host | PFC/ECN/CNP/congestion Counters Checked | Non-zero Counters | Status |
 |---|---:|---:|---|
 | aikubeworker0012 | 146 | 0 | PASS |
 | aikubeworker0016 | 146 | 0 | PASS |
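A count like the one in this table can be gathered from the per-port counter files that Linux exposes under `/sys/class/infiniband/<device>/ports/<port>/counters` and `hw_counters`. The sketch below is an assumption about how such a scan could work, not the repository's code: the keyword list and the `scan_counters` name are hypothetical, and the set of counter files depends on the driver and firmware.

```python
import os

# Substrings used to select congestion-related counters (an assumption for this sketch).
KEYWORDS = ("pfc", "ecn", "cnp", "congestion", "pause")


def scan_counters(root="/sys/class/infiniband"):
    """Return (number of matching counters read, {counter: value} for non-zero ones)."""
    checked, nonzero = 0, {}
    for dev in sorted(os.listdir(root)):
        ports_dir = os.path.join(root, dev, "ports")
        if not os.path.isdir(ports_dir):
            continue
        for port in sorted(os.listdir(ports_dir)):
            for group in ("counters", "hw_counters"):
                group_dir = os.path.join(ports_dir, port, group)
                if not os.path.isdir(group_dir):
                    continue
                for name in sorted(os.listdir(group_dir)):
                    if not any(key in name.lower() for key in KEYWORDS):
                        continue
                    try:
                        with open(os.path.join(group_dir, name)) as fh:
                            value = int(fh.read().strip())
                    except (OSError, ValueError):
                        continue  # unreadable or non-numeric counter: skip it
                    checked += 1
                    if value:
                        nonzero[f"{dev}/{port}/{name}"] = value
    return checked, nonzero


if os.path.isdir("/sys/class/infiniband"):
    count, bad = scan_counters()
    print(f"checked={count} nonzero={len(bad)}")
```

Host-side counters only cover what the adapter reports; switch-side PFC/ECN evidence still has to come from the network side.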
 ## Verdict
 **RDMA cross-node verdict: FAIL**
 Reason: bidirectional connectivity is good, PFC/ECN/CNP/congestion counters are clean, and write bandwidth passes in both directions. However, read bandwidth (about 44.4 GB/s) is below 47 GB/s in both directions, write latency (2.13 to 2.17 us) is slightly above 2.0 us in both directions, and read latency (4.05 to 4.08 us) is above 3.5 us in both directions.
 Note: `modules/rdma_test.py` was corrected on 2026-05-23 to parse the `t_avg[usec]` column of `ib_write_lat` / `ib_read_lat` output rather than the 99.9% percentile column. Older reports that show `read_lat` around 16 us were produced by the old parser and do not reflect the current parser output.
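The parser fix in the note can be illustrated by selecting the latency column by header name instead of by position. This is a sketch under assumptions: `parse_t_avg` is a hypothetical function, not the code in `modules/rdma_test.py`, and the sample text imitates the usual perftest latency layout with made-up values rather than output captured from these hosts.

```python
import re

# Illustrative perftest-style latency output; the numbers are invented for this sketch.
SAMPLE = """\
 #bytes #iterations    t_min[usec]    t_max[usec]  t_typical[usec]    t_avg[usec]    t_stdev[usec]   99% percentile[usec]   99.9% percentile[usec]
 8       1000          2.05           16.40        2.15               2.17           0.31            3.02                   16.00
"""


def parse_t_avg(output):
    """Return the t_avg[usec] value from a latency result row, or None if absent."""
    header = None
    for line in output.splitlines():
        if "t_avg[usec]" in line:
            # Join "99% percentile[usec]" into one token so the header
            # splits into the same number of fields as the data row.
            header = re.sub(r"%\s+percentile", "%_percentile", line).split()
            continue
        fields = line.split()
        if header and len(fields) == len(header) and fields[0].isdigit():
            return float(fields[header.index("t_avg[usec]")])
    return None


print(parse_t_avg(SAMPLE))
```

Looking the column up by name keeps the parser correct if a perftest build adds or reorders columns, whereas taking the last field of the row would return the 99.9% percentile value.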
--- a/reports_rdma_single_node_summary.md
+++ b/reports_rdma_single_node_summary.md
@ -0,0 +1,73 @@
 # Single-node RDMA/IB Report
 Generated: 2026-05-22 23:41 Asia/Shanghai
 Scope: project CLI `gpu_tester.py --test rdma --report --format json`, run separately on each host.
 Important note: the current repository RDMA test is single-node only. In `modules/rdma_test.py`, the perftest client connects to `localhost`, so this report validates local IB device discovery and local perftest behavior. It does not validate cross-node RDMA bandwidth between `aikubeworker0012` and `aikubeworker0016`.
 ## Summary
 | Host | Devices Found | Active 400G Ports | Active 100G Ports | Active 25G Ports | Down Ports | Overall |
 | --- | ---: | --- | --- | --- | --- | --- |
 | aikubeworker0012 / 172.72.8.12 | 10 | mlx5_0, mlx5_1, mlx5_6, mlx5_7 | mlx5_4, mlx5_5 | mlx5_2, mlx5_8 | mlx5_3, mlx5_9 | WARN |
 | aikubeworker0016 / 172.72.8.16 | 10 | mlx5_0, mlx5_1, mlx5_6, mlx5_7 | mlx5_4, mlx5_5 | mlx5_2, mlx5_8 | mlx5_3, mlx5_9 | WARN |
 ## Bandwidth
 The bandwidth numbers below come from the repository's `localhost` perftest path, so they reflect local loopback behavior rather than fabric bandwidth between hosts.
 | Host | ib_write_bw | Threshold | Status | ib_read_bw | Threshold | Status |
 | --- | ---: | ---: | --- | ---: | ---: | --- |
 | aikubeworker0012 | 0.13 GB/s | 50 GB/s | WARN | 0.13 GB/s | 50 GB/s | WARN |
 | aikubeworker0016 | 0.13 GB/s | 50 GB/s | WARN | 0.13 GB/s | 50 GB/s | WARN |
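For scale, a 400 Gb/s NDR port can carry at most 400 / 8 = 50 GB/s before protocol overhead, which matches the threshold column. Whether the threshold was actually derived from line rate is an assumption; the report does not say. A small sketch of the conversion, with hypothetical helper names:

```python
def line_rate_gbytes(rate_gbits):
    """Convert a link rate in Gb/s to GB/s (8 bits per byte, no overhead)."""
    return rate_gbits / 8.0

def efficiency_pct(measured_gbytes, rate_gbits):
    """Measured bandwidth as a percentage of the link's line rate."""
    return round(100.0 * measured_gbytes / line_rate_gbytes(rate_gbits), 2)

ndr_limit = line_rate_gbytes(400)          # 50.0 GB/s
loopback_eff = efficiency_pct(0.13, 400)   # the 0.13 GB/s localhost result
```

At well under 1% of line rate, the 0.13 GB/s figure says nothing about the 400G links themselves.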
 ## Latency
 | Host | ib_write_lat | Limit | Status | ib_read_lat | Limit | Status |
 | --- | ---: | ---: | --- | ---: | ---: | --- |
 | aikubeworker0012 | 4.53 us | 10 us | PASS | 16.00 us | 10 us | WARN |
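 | aikubeworker0016 | 4.22 us | 10 us | PASS | 16.00 us | 10 us | WARN |
 Note: this report predates the 2026-05-23 fix to `modules/rdma_test.py`, so the 16.00 us `ib_read_lat` values likely come from the 99.9% percentile column rather than `t_avg[usec]`.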
 ## Device Inventory
 ### aikubeworker0012
 | Device | Port | State | Physical State | Rate |
 | --- | --- | --- | --- | --- |
 | mlx5_0 | 1 | ACTIVE | LinkUp | 400 Gb/sec (4X NDR) |
 | mlx5_1 | 1 | ACTIVE | LinkUp | 400 Gb/sec (4X NDR) |
 | mlx5_2 | 1 | ACTIVE | LinkUp | 25 Gb/sec (1X EDR) |
 | mlx5_3 | 1 | DOWN | Disabled | 25 Gb/sec (1X EDR) |
 | mlx5_4 | 1 | ACTIVE | LinkUp | 100 Gb/sec (2X HDR) |
 | mlx5_5 | 1 | ACTIVE | LinkUp | 100 Gb/sec (2X HDR) |
 | mlx5_6 | 1 | ACTIVE | LinkUp | 400 Gb/sec (4X NDR) |
 | mlx5_7 | 1 | ACTIVE | LinkUp | 400 Gb/sec (4X NDR) |
 | mlx5_8 | 1 | ACTIVE | LinkUp | 25 Gb/sec (1X EDR) |
 | mlx5_9 | 1 | DOWN | Disabled | 25 Gb/sec (1X EDR) |
 ### aikubeworker0016
 | Device | Port | State | Physical State | Rate |
 | --- | --- | --- | --- | --- |
 | mlx5_0 | 1 | ACTIVE | LinkUp | 400 Gb/sec (4X NDR) |
 | mlx5_1 | 1 | ACTIVE | LinkUp | 400 Gb/sec (4X NDR) |
 | mlx5_2 | 1 | ACTIVE | LinkUp | 25 Gb/sec (1X EDR) |
 | mlx5_3 | 1 | DOWN | Disabled | 25 Gb/sec (1X EDR) |
 | mlx5_4 | 1 | ACTIVE | LinkUp | 100 Gb/sec (2X HDR) |
 | mlx5_5 | 1 | ACTIVE | LinkUp | 100 Gb/sec (2X HDR) |
 | mlx5_6 | 1 | ACTIVE | LinkUp | 400 Gb/sec (4X NDR) |
 | mlx5_7 | 1 | ACTIVE | LinkUp | 400 Gb/sec (4X NDR) |
 | mlx5_8 | 1 | ACTIVE | LinkUp | 25 Gb/sec (1X EDR) |
 | mlx5_9 | 1 | DOWN | Disabled | 25 Gb/sec (1X EDR) |
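The per-host summary can be reproduced from the inventory by grouping devices on link state and rate. A minimal sketch using the rows above, which are identical on both hosts. The tuple layout and the name `summarize_ports` are hypothetical:

```python
from collections import defaultdict

def summarize_ports(ports):
    """Group device names by state and rate; `ports` holds (device, state, rate)."""
    groups = defaultdict(list)
    for device, state, rate in ports:
        if state != "ACTIVE":
            groups["down"].append(device)
        else:
            gbits = int(rate.split()[0])  # "400 Gb/sec (4X NDR)" -> 400
            groups[f"active_{gbits}g"].append(device)
    return dict(groups)

inventory = [
    ("mlx5_0", "ACTIVE", "400 Gb/sec (4X NDR)"),
    ("mlx5_1", "ACTIVE", "400 Gb/sec (4X NDR)"),
    ("mlx5_2", "ACTIVE", "25 Gb/sec (1X EDR)"),
    ("mlx5_3", "DOWN", "25 Gb/sec (1X EDR)"),
    ("mlx5_4", "ACTIVE", "100 Gb/sec (2X HDR)"),
    ("mlx5_5", "ACTIVE", "100 Gb/sec (2X HDR)"),
    ("mlx5_6", "ACTIVE", "400 Gb/sec (4X NDR)"),
    ("mlx5_7", "ACTIVE", "400 Gb/sec (4X NDR)"),
    ("mlx5_8", "ACTIVE", "25 Gb/sec (1X EDR)"),
    ("mlx5_9", "DOWN", "25 Gb/sec (1X EDR)"),
]

summary = summarize_ports(inventory)
```

Grouping by rate accounts for all ten devices: four at 400G, two at 100G, two active at 25G, and two down.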
 ## Files
 Raw JSON:
 - `reports_rdma_aikubeworker0012.json`
 - `reports_rdma_aikubeworker0016.json`
 Markdown summary:
 - `reports_rdma_single_node_summary.md`
--- a/reports_single_gpu_aikubeworker0012.json
+++ b/reports_single_gpu_aikubeworker0012.json
@ -0,0 +1,292 @@
 {
  "timestamp": "2026-05-22T15:26:26.973586",
  "gpu_info": {
    "driver_version": "580.159.03",
    "cuda_version": "13.0",
    "gpu_count": 8,
    "gpus": [
      {
        "index": 0,
        "name": "NVIDIA H100 80GB HBM3",
        "uuid": "GPU-7658c03c-7659-9886-041e-545c21d53e12",
        "pci_bus_id": "00000000:18:00.0",
        "pcie_link_gen": 5,
        "pcie_link_width": 16,
        "vram_total_mb": 81559,
        "vram_used_mb": 4,
        "vram_free_mb": 81076,
        "power_draw": 69.72,
        "power_limit": 700.0,
        "clock_sm": 345,
        "clock_mem": 2619,
        "temperature": 25,
        "fan_speed": 0,
        "persistence_mode": false,
        "compute_mode": "Default",
        "serial_number": "1654923030411",
        "ecc_errors_single": 0,
        "ecc_errors_double": 0
      },
      {
        "index": 1,
        "name": "NVIDIA H100 80GB HBM3",
        "uuid": "GPU-6392d40b-893b-9fc2-4284-a3f1d8c4d7f1",
        "pci_bus_id": "00000000:2A:00.0",
        "pcie_link_gen": 5,
        "pcie_link_width": 16,
        "vram_total_mb": 81559,
        "vram_used_mb": 0,
        "vram_free_mb": 81079,
        "power_draw": 73.17,
        "power_limit": 700.0,
        "clock_sm": 345,
        "clock_mem": 2619,
        "temperature": 25,
        "fan_speed": 0,
        "persistence_mode": false,
        "compute_mode": "Default",
        "serial_number": "1654724063165",
        "ecc_errors_single": 0,
        "ecc_errors_double": 0
      },
      {
        "index": 2,
        "name": "NVIDIA H100 80GB HBM3",
        "uuid": "GPU-2ae38735-10de-fb0b-fb20-9d1b5b434558",
        "pci_bus_id": "00000000:3A:00.0",
        "pcie_link_gen": 5,
        "pcie_link_width": 16,
        "vram_total_mb": 81559,
        "vram_used_mb": 0,
        "vram_free_mb": 81079,
        "power_draw": 68.71,
        "power_limit": 700.0,
        "clock_sm": 345,
        "clock_mem": 2619,
        "temperature": 26,
        "fan_speed": 0,
        "persistence_mode": false,
        "compute_mode": "Default",
        "serial_number": "1654823036530",
        "ecc_errors_single": 0,
        "ecc_errors_double": 0
      },
      {
        "index": 3,
        "name": "NVIDIA H100 80GB HBM3",
        "uuid": "GPU-ec62123f-0c48-6dbd-49e4-8b231b3fed0e",
        "pci_bus_id": "00000000:5D:00.0",
        "pcie_link_gen": 5,
        "pcie_link_width": 16,
        "vram_total_mb": 81559,
        "vram_used_mb": 0,
        "vram_free_mb": 81079,
        "power_draw": 69.73,
        "power_limit": 700.0,
        "clock_sm": 345,
        "clock_mem": 2619,
        "temperature": 25,
        "fan_speed": 0,
        "persistence_mode": false,
        "compute_mode": "Default",
        "serial_number": "1654923021638",
        "ecc_errors_single": 0,
        "ecc_errors_double": 0
      },
      {
        "index": 4,
        "name": "NVIDIA H100 80GB HBM3",
        "uuid": "GPU-b64fc270-109e-1543-fb0c-be7feecf14f1",
        "pci_bus_id": "00000000:9A:00.0",
        "pcie_link_gen": 5,
        "pcie_link_width": 16,
        "vram_total_mb": 81559,
        "vram_used_mb": 0,
        "vram_free_mb": 81079,
        "power_draw": 68.84,
        "power_limit": 700.0,
        "clock_sm": 345,
        "clock_mem": 2619,
        "temperature": 24,
        "fan_speed": 0,
        "persistence_mode": false,
        "compute_mode": "Default",
        "serial_number": "1655023033179",
        "ecc_errors_single": 0,
        "ecc_errors_double": 0
      },
      {
        "index": 5,
        "name": "NVIDIA H100 80GB HBM3",
        "uuid": "GPU-15ab7baf-9010-7cf3-5462-eeb09f8dbe65",
        "pci_bus_id": "00000000:AB:00.0",
        "pcie_link_gen": 5,
        "pcie_link_width": 16,
        "vram_total_mb": 81559,
        "vram_used_mb": 0,
        "vram_free_mb": 81079,
        "power_draw": 69.94,
        "power_limit": 700.0,
        "clock_sm": 345,
        "clock_mem": 2619,
        "temperature": 27,
        "fan_speed": 0,
        "persistence_mode": false,
        "compute_mode": "Default",
        "serial_number": "1655023034225",
        "ecc_errors_single": 0,
        "ecc_errors_double": 0
      },
      {
        "index": 6,
        "name": "NVIDIA H100 80GB HBM3",
        "uuid": "GPU-225f6f3c-6fef-d1e2-5428-d90f665fb3d3",
        "pci_bus_id": "00000000:BA:00.0",
        "pcie_link_gen": 5,
        "pcie_link_width": 16,
        "vram_total_mb": 81559,
        "vram_used_mb": 0,
        "vram_free_mb": 81079,
        "power_draw": 70.46,
        "power_limit": 700.0,
        "clock_sm": 345,
        "clock_mem": 2619,
        "temperature": 25,
        "fan_speed": 0,
        "persistence_mode": false,
        "compute_mode": "Default",
        "serial_number": "1654923078278",
        "ecc_errors_single": 0,
        "ecc_errors_double": 0
      },
      {
        "index": 7,
        "name": "NVIDIA H100 80GB HBM3",
        "uuid": "GPU-79aeb6a8-c00c-6edb-956f-779ef56950a3",
        "pci_bus_id": "00000000:DB:00.0",
        "pcie_link_gen": 5,
        "pcie_link_width": 16,
        "vram_total_mb": 81559,
        "vram_used_mb": 0,
        "vram_free_mb": 81079,
        "power_draw": 71.76,
        "power_limit": 700.0,
        "clock_sm": 345,
        "clock_mem": 2619,
        "temperature": 24,
        "fan_speed": 0,
        "persistence_mode": false,
        "compute_mode": "Default",
        "serial_number": "1654024031464",
        "ecc_errors_single": 0,
        "ecc_errors_double": 0
      }
    ],
    "topology": "\t\u001b[4mGPU0\tGPU1\tGPU2\tGPU3\tGPU4\tGPU5\tGPU6\tGPU7\tNIC0\tNIC1\tNIC2\tNIC3\tNIC4\tNIC5\tNIC6\tNIC7\tNIC8\tNIC9\tCPU Affinity\tNUMA Affinity\tGPU NUMA ID\u001b[0m\nGPU0\t X \tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\tPIX\tNODE\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t0-55,112-167\t0\t\tN/A\nGPU1\tNV18\t X \tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\tNODE\tPIX\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t0-55,112-167\t0\t\tN/A\nGPU2\tNV18\tNV18\t X \tNV18\tNV18\tNV18\tNV18\tNV18\tNODE\tNODE\tPIX\tPIX\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t0-55,112-167\t0\t\tN/A\nGPU3\tNV18\tNV18\tNV18\t X \tNV18\tNV18\tNV18\tNV18\tNODE\tNODE\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t0-55,112-167\t0\t\tN/A\nGPU4\tNV18\tNV18\tNV18\tNV18\t X \tNV18\tNV18\tNV18\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tPIX\tNODE\tNODE\tNODE\t56-111,168-223\t1\t\tN/A\nGPU5\tNV18\tNV18\tNV18\tNV18\tNV18\t X \tNV18\tNV18\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\tPIX\tNODE\tNODE\t56-111,168-223\t1\t\tN/A\nGPU6\tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\t X \tNV18\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tPIX\tPIX\t56-111,168-223\t1\t\tN/A\nGPU7\tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\t X \tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tNODE\tNODE\t56-111,168-223\t1\t\tN/A\nNIC0\tPIX\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t X \tNODE\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC1\tNODE\tPIX\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\tNODE\t X \tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC2\tNODE\tNODE\tPIX\tNODE\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\t X \tPIX\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC3\tNODE\tNODE\tPIX\tNODE\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tPIX\t X \tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC4\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tNODE\tNODE\t X \tPIX\tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC5\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tNODE\tNODE\tPIX\t X \tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC6\tSYS\tSYS\tSYS\tSYS\tPIX\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\t X 
\tNODE\tNODE\tNODE\t\t\t\t\nNIC7\tSYS\tSYS\tSYS\tSYS\tNODE\tPIX\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\t X \tNODE\tNODE\t\t\t\t\nNIC8\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tPIX\tNODE\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\t X \tPIX\t\t\t\t\nNIC9\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tPIX\tNODE\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tPIX\t X \t\t\t\t\n\nLegend:\n\n  X    = Self\n  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)\n  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node\n  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)\n  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)\n  PIX  = Connection traversing at most a single PCIe bridge\n  NV#  = Connection traversing a bonded set of # NVLinks\n\nNIC Legend:\n\n  NIC0: mlx5_0\n  NIC1: mlx5_1\n  NIC2: mlx5_2\n  NIC3: mlx5_3\n  NIC4: mlx5_4\n  NIC5: mlx5_5\n  NIC6: mlx5_6\n  NIC7: mlx5_7\n  NIC8: mlx5_8\n  NIC9: mlx5_9\n\n",
    "timestamp": "2026-05-22T15:26:34.187409",
    "detected_gpu_type": "h100",
    "gpu_label": "H100 SXM5"
  },
  "memory_bench": {
    "memory": {
      "source": "pytorch",
      "h2d_bandwidth_gbps": 11.8,
      "d2h_bandwidth_gbps": 9.9,
      "d2d_bandwidth_gbps": 829.1,
      "peak_bandwidth_gbps": 3400,
      "efficiency_pct": 24.4,
      "test_sizes_mb": [
        1,
        4,
        16,
        64,
        256,
        1024,
        4096
      ],
      "bandwidth_by_size": {
        "1": {
          "h2d_gbps": 3.8,
          "d2h_gbps": 1.4,
          "d2d_gbps": 40.6
        },
        "4": {
          "h2d_gbps": 7.6,
          "d2h_gbps": 9.9,
          "d2d_gbps": 141.5
        },
        "16": {
          "h2d_gbps": 11.0,
          "d2h_gbps": 1.9,
          "d2d_gbps": 450.3
        },
        "64": {
          "h2d_gbps": 11.8,
          "d2h_gbps": 1.4,
          "d2d_gbps": 726.5
        },
        "256": {
          "h2d_gbps": 9.0,
          "d2h_gbps": 1.4,
          "d2d_gbps": 793.8
        },
        "1024": {
          "h2d_gbps": 5.5,
          "d2h_gbps": 1.4,
          "d2d_gbps": 821.2
        },
        "4096": {
          "h2d_gbps": 5.9,
          "d2h_gbps": 1.4,
          "d2d_gbps": 829.1
        }
      },
      "per_gpu": []
    }
  },
  "compute_bench": {
    "compute": {
      "per_dtype_tflops": {
        "fp32": 52.0,
        "tf32": 362.3,
        "fp16": 691.0,
        "bf16": 713.0,
        "fp8": 1148.8
      },
      "peak_tflops": {
        "fp32": 67,
        "tf32": 495,
        "fp16": 990,
        "bf16": 990,
        "fp8": 1979
      },
      "efficiency_pct": {
        "fp32": 77.6,
        "tf32": 73.2,
        "fp16": 69.8,
        "bf16": 72.0,
        "fp8": 58.0
      },
      "pass_thresholds_tflops": {
        "fp32": 54,
        "tf32": 444,
        "fp16": 734,
        "bf16": 745,
        "fp8": 1400
      },
      "per_gpu": [
        {
          "index": 0,
          "fp32": 52.0,
          "tf32": 362.3,
          "fp16": 691.0,
          "bf16": 713.0,
          "fp8": 1148.8
        }
      ],
      "matrix_size": 8192,
      "warmup": 50,
      "iterations": 500
    }
  }
 }
--- a/reports_single_gpu_aikubeworker0012.md
+++ b/reports_single_gpu_aikubeworker0012.md
@ -0,0 +1,54 @@
 # GPU Test Report
 - **Date:** 2026-05-22 15:27:51
 - **Host:** aikubeworker0012
 - **GPU:** NVIDIA H100 80GB HBM3 x8
 - **Driver:** 580.159.03 | **CUDA:** 13.0
 ## Summary
 | Test | Result |
 |------|--------|
 | GPU Info | PASS (8 GPUs detected) |
 | Memory Bandwidth | WARN (829 GB/s via PyTorch fallback) |
 | Compute Throughput | FAIL (worst TF32 362 vs >= 444) |
 ## GPU Information
 | GPU | Model | VRAM | Temp | Power | SM Clock |
 |-----|-------|------|------|-------|----------|
 | 0 | NVIDIA H100 80GB HBM3 | 81559 MB | 25C | 70/700W | 345 MHz |
 | 1 | NVIDIA H100 80GB HBM3 | 81559 MB | 25C | 73/700W | 345 MHz |
 | 2 | NVIDIA H100 80GB HBM3 | 81559 MB | 26C | 69/700W | 345 MHz |
 | 3 | NVIDIA H100 80GB HBM3 | 81559 MB | 25C | 70/700W | 345 MHz |
 | 4 | NVIDIA H100 80GB HBM3 | 81559 MB | 24C | 69/700W | 345 MHz |
 | 5 | NVIDIA H100 80GB HBM3 | 81559 MB | 27C | 70/700W | 345 MHz |
 | 6 | NVIDIA H100 80GB HBM3 | 81559 MB | 25C | 70/700W | 345 MHz |
 | 7 | NVIDIA H100 80GB HBM3 | 81559 MB | 24C | 72/700W | 345 MHz |
 ## Memory Bandwidth
 Source: pytorch
 | Metric | Value | Peak | Efficiency |
 |--------|-------|------|------------|
 | H2D (PCIe) | 11.8 GB/s | n/a | n/a |
 | D2H (PCIe) | 9.9 GB/s | n/a | n/a |
 | D2D (HBM) | 829.1 GB/s | 3400 GB/s | 24.4% |
 **Verdict: WARN** (D2D 829.1 GB/s via PyTorch fallback; nvbandwidth unavailable, so the figure is indicative only, not a true HBM peak)
 ## Compute Throughput
 | DType | Achieved (TFLOPS) | Peak | Threshold | Status |
 |-------|-------------------|------|------------|--------|
 | FP32 | 52.0 | 67 | >= 54 | WARN |
 | TF32 | 362.3 | 495 | >= 444 | FAIL |
 | FP16 | 691.0 | 990 | >= 734 | WARN |
 | BF16 | 713.0 | 990 | >= 745 | WARN |
 | FP8 | 1148.8 | 1979 | >= 1400 | FAIL |
 **Verdict: FAIL** (absolute TFLOPS thresholds; worst efficiency 58.0%)
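The efficiency and status columns can be reproduced from the values in the JSON report. Efficiency is achieved TFLOPS divided by peak. The WARN/FAIL split is not documented here, so the sketch below assumes a hypothetical rule: WARN when the result reaches at least 90% of the threshold, FAIL otherwise. That band happens to match the rows above but is an inference, not the suite's confirmed logic:

```python
PEAK = {"fp32": 67, "tf32": 495, "fp16": 990, "bf16": 990, "fp8": 1979}
THRESHOLD = {"fp32": 54, "tf32": 444, "fp16": 734, "bf16": 745, "fp8": 1400}
WARN_BAND = 0.90  # hypothetical: fraction of the threshold that still earns WARN

def efficiency_pct(achieved, peak):
    """Achieved TFLOPS as a percentage of the datasheet peak."""
    return round(100.0 * achieved / peak, 1)

def status(achieved, threshold, warn_band=WARN_BAND):
    """PASS at or above threshold, WARN inside the band, FAIL below it."""
    if achieved >= threshold:
        return "PASS"
    if achieved >= warn_band * threshold:
        return "WARN"
    return "FAIL"

# Measured values for aikubeworker0012 from the JSON report.
measured = {"fp32": 52.0, "tf32": 362.3, "fp16": 691.0, "bf16": 713.0, "fp8": 1148.8}
report = {
    dtype: (efficiency_pct(value, PEAK[dtype]), status(value, THRESHOLD[dtype]))
    for dtype, value in measured.items()
}
```

Under this rule the overall verdict is FAIL as soon as any dtype is FAIL, which is consistent with the TF32 and FP8 rows.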
 ---
 *Generated by GPU Test Suite v0.2.0*
--- a/reports_single_gpu_aikubeworker0016.json
+++ b/reports_single_gpu_aikubeworker0016.json
@ -0,0 +1,292 @@
 {
  "timestamp": "2026-05-22T15:26:29.511252",
  "gpu_info": {
    "driver_version": "580.159.03",
    "cuda_version": "13.0",
    "gpu_count": 8,
    "gpus": [
      {
        "index": 0,
        "name": "NVIDIA H100 80GB HBM3",
        "uuid": "GPU-dfbc9513-255d-4fe7-2b77-7b1ec3972e75",
        "pci_bus_id": "00000000:18:00.0",
        "pcie_link_gen": 5,
        "pcie_link_width": 16,
        "vram_total_mb": 81559,
        "vram_used_mb": 4,
        "vram_free_mb": 81076,
        "power_draw": 69.81,
        "power_limit": 700.0,
        "clock_sm": 345,
        "clock_mem": 2619,
        "temperature": 20,
        "fan_speed": 0,
        "persistence_mode": false,
        "compute_mode": "Default",
        "serial_number": "1651924016120",
        "ecc_errors_single": 0,
        "ecc_errors_double": 0
      },
      {
        "index": 1,
        "name": "NVIDIA H100 80GB HBM3",
        "uuid": "GPU-bb845ef7-d7b5-f011-9395-ea74274e2282",
        "pci_bus_id": "00000000:2A:00.0",
        "pcie_link_gen": 5,
        "pcie_link_width": 16,
        "vram_total_mb": 81559,
        "vram_used_mb": 0,
        "vram_free_mb": 81079,
        "power_draw": 67.45,
        "power_limit": 700.0,
        "clock_sm": 345,
        "clock_mem": 2619,
        "temperature": 20,
        "fan_speed": 0,
        "persistence_mode": false,
        "compute_mode": "Default",
        "serial_number": "1651924015483",
        "ecc_errors_single": 0,
        "ecc_errors_double": 0
      },
      {
        "index": 2,
        "name": "NVIDIA H100 80GB HBM3",
        "uuid": "GPU-3720cf13-2a34-be38-27be-0a7adc4addc4",
        "pci_bus_id": "00000000:3A:00.0",
        "pcie_link_gen": 5,
        "pcie_link_width": 16,
        "vram_total_mb": 81559,
        "vram_used_mb": 0,
        "vram_free_mb": 81079,
        "power_draw": 66.69,
        "power_limit": 700.0,
        "clock_sm": 345,
        "clock_mem": 2619,
        "temperature": 21,
        "fan_speed": 0,
        "persistence_mode": false,
        "compute_mode": "Default",
        "serial_number": "1651924025595",
        "ecc_errors_single": 0,
        "ecc_errors_double": 0
      },
      {
        "index": 3,
        "name": "NVIDIA H100 80GB HBM3",
        "uuid": "GPU-87080b2d-ac43-be0d-d574-c193078850ae",
        "pci_bus_id": "00000000:5D:00.0",
        "pcie_link_gen": 5,
        "pcie_link_width": 16,
        "vram_total_mb": 81559,
        "vram_used_mb": 0,
        "vram_free_mb": 81079,
        "power_draw": 66.86,
        "power_limit": 700.0,
        "clock_sm": 345,
        "clock_mem": 2619,
        "temperature": 20,
        "fan_speed": 0,
        "persistence_mode": false,
        "compute_mode": "Default",
        "serial_number": "1651924016862",
        "ecc_errors_single": 0,
        "ecc_errors_double": 0
      },
      {
        "index": 4,
        "name": "NVIDIA H100 80GB HBM3",
        "uuid": "GPU-599bd883-cc5c-a5dd-6c33-c15f7049da48",
        "pci_bus_id": "00000000:9A:00.0",
        "pcie_link_gen": 5,
        "pcie_link_width": 16,
        "vram_total_mb": 81559,
        "vram_used_mb": 0,
        "vram_free_mb": 81079,
        "power_draw": 67.07,
        "power_limit": 700.0,
        "clock_sm": 345,
        "clock_mem": 2619,
        "temperature": 20,
        "fan_speed": 0,
        "persistence_mode": false,
        "compute_mode": "Default",
        "serial_number": "1651924025670",
        "ecc_errors_single": 0,
        "ecc_errors_double": 0
      },
      {
        "index": 5,
        "name": "NVIDIA H100 80GB HBM3",
        "uuid": "GPU-a1c6bba4-61b0-e623-06c9-9c88635e26fe",
        "pci_bus_id": "00000000:AB:00.0",
        "pcie_link_gen": 5,
        "pcie_link_width": 16,
        "vram_total_mb": 81559,
        "vram_used_mb": 0,
        "vram_free_mb": 81079,
        "power_draw": 69.12,
        "power_limit": 700.0,
        "clock_sm": 345,
        "clock_mem": 2619,
        "temperature": 22,
        "fan_speed": 0,
        "persistence_mode": false,
        "compute_mode": "Default",
        "serial_number": "1651924027166",
        "ecc_errors_single": 0,
        "ecc_errors_double": 0
      },
      {
        "index": 6,
        "name": "NVIDIA H100 80GB HBM3",
        "uuid": "GPU-98745a0c-39bd-3e56-d6ca-54ba3647ab6d",
        "pci_bus_id": "00000000:BA:00.0",
        "pcie_link_gen": 5,
        "pcie_link_width": 16,
        "vram_total_mb": 81559,
        "vram_used_mb": 0,
        "vram_free_mb": 81079,
        "power_draw": 67.61,
        "power_limit": 700.0,
        "clock_sm": 345,
        "clock_mem": 2619,
        "temperature": 20,
        "fan_speed": 0,
        "persistence_mode": false,
        "compute_mode": "Default",
        "serial_number": "1651924026234",
        "ecc_errors_single": 0,
        "ecc_errors_double": 0
      },
      {
        "index": 7,
        "name": "NVIDIA H100 80GB HBM3",
        "uuid": "GPU-8c73bd8b-666b-357e-ac5d-c75ac7a759db",
        "pci_bus_id": "00000000:DB:00.0",
        "pcie_link_gen": 5,
        "pcie_link_width": 16,
        "vram_total_mb": 81559,
        "vram_used_mb": 0,
        "vram_free_mb": 81079,
        "power_draw": 66.19,
        "power_limit": 700.0,
        "clock_sm": 345,
        "clock_mem": 2619,
        "temperature": 20,
        "fan_speed": 0,
        "persistence_mode": false,
        "compute_mode": "Default",
        "serial_number": "1651924027255",
        "ecc_errors_single": 0,
        "ecc_errors_double": 0
      }
    ],
    "topology": "\t\u001b[4mGPU0\tGPU1\tGPU2\tGPU3\tGPU4\tGPU5\tGPU6\tGPU7\tNIC0\tNIC1\tNIC2\tNIC3\tNIC4\tNIC5\tNIC6\tNIC7\tNIC8\tNIC9\tCPU Affinity\tNUMA Affinity\tGPU NUMA ID\u001b[0m\nGPU0\t X \tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\tPIX\tNODE\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t0-55,112-167\t0\t\tN/A\nGPU1\tNV18\t X \tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\tNODE\tPIX\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t0-55,112-167\t0\t\tN/A\nGPU2\tNV18\tNV18\t X \tNV18\tNV18\tNV18\tNV18\tNV18\tNODE\tNODE\tPIX\tPIX\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t0-55,112-167\t0\t\tN/A\nGPU3\tNV18\tNV18\tNV18\t X \tNV18\tNV18\tNV18\tNV18\tNODE\tNODE\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t0-55,112-167\t0\t\tN/A\nGPU4\tNV18\tNV18\tNV18\tNV18\t X \tNV18\tNV18\tNV18\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tPIX\tNODE\tNODE\tNODE\t56-111,168-223\t1\t\tN/A\nGPU5\tNV18\tNV18\tNV18\tNV18\tNV18\t X \tNV18\tNV18\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\tPIX\tNODE\tNODE\t56-111,168-223\t1\t\tN/A\nGPU6\tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\t X \tNV18\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tPIX\tPIX\t56-111,168-223\t1\t\tN/A\nGPU7\tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\tNV18\t X \tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tNODE\tNODE\t56-111,168-223\t1\t\tN/A\nNIC0\tPIX\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t X \tNODE\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC1\tNODE\tPIX\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\tNODE\t X \tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC2\tNODE\tNODE\tPIX\tNODE\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\t X \tPIX\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC3\tNODE\tNODE\tPIX\tNODE\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tPIX\t X \tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC4\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tNODE\tNODE\t X \tPIX\tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC5\tNODE\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tNODE\tNODE\tPIX\t X \tSYS\tSYS\tSYS\tSYS\t\t\t\t\nNIC6\tSYS\tSYS\tSYS\tSYS\tPIX\tNODE\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\t X 
\tNODE\tNODE\tNODE\t\t\t\t\nNIC7\tSYS\tSYS\tSYS\tSYS\tNODE\tPIX\tNODE\tNODE\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\t X \tNODE\tNODE\t\t\t\t\nNIC8\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tPIX\tNODE\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\t X \tPIX\t\t\t\t\nNIC9\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tPIX\tNODE\tSYS\tSYS\tSYS\tSYS\tSYS\tSYS\tNODE\tNODE\tPIX\t X \t\t\t\t\n\nLegend:\n\n  X    = Self\n  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)\n  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node\n  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)\n  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)\n  PIX  = Connection traversing at most a single PCIe bridge\n  NV#  = Connection traversing a bonded set of # NVLinks\n\nNIC Legend:\n\n  NIC0: mlx5_0\n  NIC1: mlx5_1\n  NIC2: mlx5_2\n  NIC3: mlx5_3\n  NIC4: mlx5_4\n  NIC5: mlx5_5\n  NIC6: mlx5_6\n  NIC7: mlx5_7\n  NIC8: mlx5_8\n  NIC9: mlx5_9\n\n",
    "timestamp": "2026-05-22T15:26:36.627805",
    "detected_gpu_type": "h100",
    "gpu_label": "H100 SXM5"
  },
  "memory_bench": {
    "memory": {
      "source": "pytorch",
      "h2d_bandwidth_gbps": 11.8,
      "d2h_bandwidth_gbps": 10.1,
      "d2d_bandwidth_gbps": 829.0,
      "peak_bandwidth_gbps": 3400,
      "efficiency_pct": 24.4,
      "test_sizes_mb": [
        1,
        4,
        16,
        64,
        256,
        1024,
        4096
      ],
      "bandwidth_by_size": {
        "1": {
          "h2d_gbps": 3.6,
          "d2h_gbps": 1.4,
          "d2d_gbps": 40.3
        },
        "4": {
          "h2d_gbps": 7.7,
          "d2h_gbps": 10.1,
          "d2d_gbps": 159.5
        },
        "16": {
          "h2d_gbps": 10.9,
          "d2h_gbps": 1.9,
          "d2d_gbps": 439.5
        },
        "64": {
          "h2d_gbps": 11.8,
          "d2h_gbps": 1.4,
          "d2d_gbps": 740.5
        },
        "256": {
          "h2d_gbps": 9.0,
          "d2h_gbps": 1.4,
          "d2d_gbps": 792.1
        },
        "1024": {
          "h2d_gbps": 8.4,
          "d2h_gbps": 1.4,
          "d2d_gbps": 818.9
        },
        "4096": {
          "h2d_gbps": 6.1,
          "d2h_gbps": 1.4,
          "d2d_gbps": 829.0
        }
      },
      "per_gpu": []
    }
  },
  "compute_bench": {
    "compute": {
      "per_dtype_tflops": {
        "fp32": 51.9,
        "tf32": 357.8,
        "fp16": 667.2,
        "bf16": 699.1,
        "fp8": 1146.2
      },
      "peak_tflops": {
        "fp32": 67,
        "tf32": 495,
        "fp16": 990,
        "bf16": 990,
        "fp8": 1979
      },
      "efficiency_pct": {
        "fp32": 77.5,
        "tf32": 72.3,
        "fp16": 67.4,
        "bf16": 70.6,
        "fp8": 57.9
      },
      "pass_thresholds_tflops": {
        "fp32": 54,
        "tf32": 444,
        "fp16": 734,
        "bf16": 745,
        "fp8": 1400
      },
      "per_gpu": [
        {
          "index": 0,
          "fp32": 51.9,
          "tf32": 357.8,
          "fp16": 667.2,
          "bf16": 699.1,
          "fp8": 1146.2
        }
      ],
      "matrix_size": 8192,
      "warmup": 50,
      "iterations": 500
    }
  }
 }
--- a/reports_single_gpu_aikubeworker0016.md
+++ b/reports_single_gpu_aikubeworker0016.md
@ -0,0 +1,54 @@
 # GPU Test Report
 - **Date:** 2026-05-22 15:27:53
 - **Host:** aikubeworker0016
 - **GPU:** NVIDIA H100 80GB HBM3 x8
 - **Driver:** 580.159.03 | **CUDA:** 13.0
 ## Summary
 | Test | Result |
 |------|--------|
 | GPU Info | PASS (8 GPUs detected) |
 | Memory Bandwidth | WARN (829 GB/s via PyTorch fallback) |
 | Compute Throughput | FAIL (worst TF32 358 vs >= 444) |
 ## GPU Information
 | GPU | Model | VRAM | Temp | Power | SM Clock |
 |-----|-------|------|------|-------|----------|
 | 0 | NVIDIA H100 80GB HBM3 | 81559 MB | 20C | 70/700W | 345 MHz |
 | 1 | NVIDIA H100 80GB HBM3 | 81559 MB | 20C | 67/700W | 345 MHz |
 | 2 | NVIDIA H100 80GB HBM3 | 81559 MB | 21C | 67/700W | 345 MHz |
 | 3 | NVIDIA H100 80GB HBM3 | 81559 MB | 20C | 67/700W | 345 MHz |
 | 4 | NVIDIA H100 80GB HBM3 | 81559 MB | 20C | 67/700W | 345 MHz |
 | 5 | NVIDIA H100 80GB HBM3 | 81559 MB | 22C | 69/700W | 345 MHz |
 | 6 | NVIDIA H100 80GB HBM3 | 81559 MB | 20C | 68/700W | 345 MHz |
 | 7 | NVIDIA H100 80GB HBM3 | 81559 MB | 20C | 66/700W | 345 MHz |
 ## Memory Bandwidth
 Source: pytorch
 | Metric | Value | Peak | Efficiency |
 |--------|-------|------|------------|
 | H2D (PCIe) | 11.8 GB/s | n/a | n/a |
 | D2H (PCIe) | 10.1 GB/s | n/a | n/a |
 | D2D (HBM) | 829.0 GB/s | 3400 GB/s | 24.4% |
 **Verdict: WARN** (D2D 829.0 GB/s via PyTorch fallback; nvbandwidth unavailable, so the figure is indicative only, not a true HBM peak)
 ## Compute Throughput
 | DType | Achieved (TFLOPS) | Peak | Threshold | Status |
 |-------|-------------------|------|------------|--------|
 | FP32 | 51.9 | 67 | >= 54 | WARN |
 | TF32 | 357.8 | 495 | >= 444 | FAIL |
 | FP16 | 667.2 | 990 | >= 734 | WARN |
 | BF16 | 699.1 | 990 | >= 745 | WARN |
 | FP8 | 1146.2 | 1979 | >= 1400 | FAIL |
 **Verdict: FAIL** (absolute TFLOPS thresholds; worst efficiency 57.9%)
 ---
 *Generated by GPU Test Suite v0.2.0*
--- a/reports_stress_smoke_reasons_aikubeworker0012.json
+++ b/reports_stress_smoke_reasons_aikubeworker0012.json
@ -0,0 +1,165 @@
 {
  "stress": {
    "source": "pytorch",
    "passed": false,
    "duration_sec": 45,
    "elapsed_sec": 45.4,
    "gpu_status": {
      "0": "PASS",
      "1": "PASS",
      "2": "PASS",
      "3": "PASS",
      "4": "PASS",
      "5": "PASS",
      "6": "PASS",
      "7": "PASS"
    },
    "telemetry": {
      "passed": false,
      "samples": 39,
      "steady_samples": 31,
      "warmup_sec": 9.0,
      "max_temp_c": {
        "0": 59.0,
        "1": 58.0,
        "2": 65.0,
        "3": 54.0,
        "4": 59.0,
        "5": 66.0,
        "6": 62.0,
        "7": 55.0
      },
      "avg_power_w": {
        "0": 697.0,
        "1": 697.4,
        "2": 697.9,
        "3": 698.0,
        "4": 697.8,
        "5": 697.6,
        "6": 697.9,
        "7": 698.2
      },
      "temp_delta_c": 12.0,
      "throttle_events": [
        {
          "gpu": 0,
          "throttle": "0x0000000000000004",
          "real_throttle": "0x4"
        },
        {
          "gpu": 1,
          "throttle": "0x0000000000000004",
          "real_throttle": "0x4"
        },
        {
          "gpu": 2,
          "throttle": "0x0000000000000004",
          "real_throttle": "0x4"
        },
        {
          "gpu": 3,
          "throttle": "0x0000000000000004",
          "real_throttle": "0x4"
        },
        {
          "gpu": 4,
          "throttle": "0x0000000000000004",
          "real_throttle": "0x4"
        },
        {
          "gpu": 5,
          "throttle": "0x0000000000000004",
          "real_throttle": "0x4"
        },
        {
          "gpu": 6,
          "throttle": "0x0000000000000004",
          "real_throttle": "0x4"
        },
        {
          "gpu": 7,
          "throttle": "0x0000000000000004",
          "real_throttle": "0x4"
        },
        {
          "gpu": 0,
          "throttle": "0x0000000000000004",
          "real_throttle": "0x4"
        },
        {
          "gpu": 1,
          "throttle": "0x0000000000000004",
          "real_throttle": "0x4"
        },
        {
          "gpu": 2,
          "throttle": "0x0000000000000004",
          "real_throttle": "0x4"
        },
        {
          "gpu": 3,
          "throttle": "0x0000000000000004",
          "real_throttle": "0x4"
        },
        {
          "gpu": 4,
          "throttle": "0x0000000000000004",
          "real_throttle": "0x4"
        },
        {
          "gpu": 5,
          "throttle": "0x0000000000000004",
          "real_throttle": "0x4"
        },
        {
          "gpu": 6,
          "throttle": "0x0000000000000004",
          "real_throttle": "0x4"
        },
        {
          "gpu": 7,
          "throttle": "0x0000000000000004",
          "real_throttle": "0x4"
        },
        {
          "gpu": 0,
          "throttle": "0x0000000000000004",
          "real_throttle": "0x4"
        },
        {
          "gpu": 1,
          "throttle": "0x0000000000000004",
          "real_throttle": "0x4"
        },
        {
          "gpu": 2,
          "throttle": "0x0000000000000004",
          "real_throttle": "0x4"
        },
        {
          "gpu": 3,
          "throttle": "0x0000000000000004",
          "real_throttle": "0x4"
        }
      ],
      "throttle_event_count": 248,
      "xid_events": [],
      "tflops_jitter_pct": 4.07,
      "steady_tflops_samples": 781,
      "failures": [
        "GPU temperature delta 12.0C exceeds 5.0C",
        "non-idle throttle reasons observed in 248 samples (first: GPU 0 0x4)"
      ],
      "thresholds": {
        "max_temp_c": 80.0,
        "max_temp_delta_c": 5.0,
        "min_power_w": 630.0,
        "max_tflops_jitter_pct": 5.0,
        "warmup_sec": 10.0,
        "min_steady_samples": 10
      }
    },
    "timestamp": "2026-05-22T17:52:09.074859"
  },
  "timestamp": "2026-05-22T17:52:09.082873"
 }
--- a/reports_stress_smoke_reasons_aikubeworker0012.md
+++ b/reports_stress_smoke_reasons_aikubeworker0012.md
@ -0,0 +1,29 @@
 # GPU Test Report
 - **Date:** 2026-05-22T17:52:09.082873
 - **Host:** aikubeworker0012
 ## Summary
 | Test | Result |
 |------|--------|
 | Stress Test | FAIL |
 ## Stress Test
 - **Source:** pytorch
 - **Duration:** 45s (requested 45s)
 - **Telemetry samples:** 39
 - **Max temp:** {'0': 59.0, '1': 58.0, '2': 65.0, '3': 54.0, '4': 59.0, '5': 66.0, '6': 62.0, '7': 55.0}
 - **Avg power:** {'0': 697.0, '1': 697.4, '2': 697.9, '3': 698.0, '4': 697.8, '5': 697.6, '6': 697.9, '7': 698.2}
 - **Temp delta:** 12.0 C
 - **TFLOPS jitter:** 4.07%
 - **Throttle events:** 248
 - **XID events:** 0
 - **Failure reasons:**
  - GPU temperature delta 12.0C exceeds 5.0C
  - non-idle throttle reasons observed in 248 samples (first: GPU 0 0x4)
 - **Result: FAIL**
 ---
 *Generated by GPU Test Suite v0.2.0*
--- a/reports_stress_smoke_reasons_aikubeworker0016.json
+++ b/reports_stress_smoke_reasons_aikubeworker0016.json
@ -0,0 +1,165 @@
 {
  "stress": {
    "source": "pytorch",
    "passed": false,
    "duration_sec": 45,
    "elapsed_sec": 45.4,
    "gpu_status": {
      "0": "PASS",
      "1": "PASS",
      "2": "PASS",
      "3": "PASS",
      "4": "PASS",
      "5": "PASS",
      "6": "PASS",
      "7": "PASS"
    },
    "telemetry": {
      "passed": false,
      "samples": 39,
      "steady_samples": 31,
      "warmup_sec": 9.0,
      "max_temp_c": {
        "0": 50.0,
        "1": 56.0,
        "2": 57.0,
        "3": 52.0,
        "4": 51.0,
        "5": 58.0,
        "6": 53.0,
        "7": 51.0
      },
      "avg_power_w": {
        "0": 698.3,
        "1": 698.5,
        "2": 697.6,
        "3": 697.9,
        "4": 697.8,
        "5": 698.0,
        "6": 697.5,
        "7": 698.0
      },
      "temp_delta_c": 8.0,
      "throttle_events": [
        {
          "gpu": 0,
          "throttle": "0x0000000000000004",
          "real_throttle": "0x4"
        },
        {
          "gpu": 1,
          "throttle": "0x0000000000000004",
          "real_throttle": "0x4"
        },
        {
          "gpu": 2,
          "throttle": "0x0000000000000004",
          "real_throttle": "0x4"
        },
        {
          "gpu": 3,
          "throttle": "0x0000000000000004",
          "real_throttle": "0x4"
        },
        {
          "gpu": 4,
          "throttle": "0x0000000000000004",
          "real_throttle": "0x4"
        },
        {
          "gpu": 5,
          "throttle": "0x0000000000000004",
          "real_throttle": "0x4"
        },
        {
          "gpu": 6,
          "throttle": "0x0000000000000004",
          "real_throttle": "0x4"
        },
        {
          "gpu": 7,
          "throttle": "0x0000000000000004",
          "real_throttle": "0x4"
        },
        {
          "gpu": 0,
          "throttle": "0x0000000000000004",
          "real_throttle": "0x4"
        },
        {
          "gpu": 1,
          "throttle": "0x0000000000000004",
          "real_throttle": "0x4"
        },
        {
          "gpu": 2,
          "throttle": "0x0000000000000004",
          "real_throttle": "0x4"
        },
        {
          "gpu": 3,
          "throttle": "0x0000000000000004",
          "real_throttle": "0x4"
        },
        {
          "gpu": 4,
          "throttle": "0x0000000000000004",
          "real_throttle": "0x4"
        },
        {
          "gpu": 5,
          "throttle": "0x0000000000000004",
          "real_throttle": "0x4"
        },
        {
          "gpu": 6,
          "throttle": "0x0000000000000004",
          "real_throttle": "0x4"
        },
        {
          "gpu": 7,
          "throttle": "0x0000000000000004",
          "real_throttle": "0x4"
        },
        {
          "gpu": 0,
          "throttle": "0x0000000000000004",
          "real_throttle": "0x4"
        },
        {
          "gpu": 1,
          "throttle": "0x0000000000000004",
          "real_throttle": "0x4"
        },
        {
          "gpu": 2,
          "throttle": "0x0000000000000004",
          "real_throttle": "0x4"
        },
        {
          "gpu": 3,
          "throttle": "0x0000000000000004",
          "real_throttle": "0x4"
        }
      ],
      "throttle_event_count": 248,
      "xid_events": [],
      "tflops_jitter_pct": 3.77,
      "steady_tflops_samples": 787,
      "failures": [
        "GPU temperature delta 8.0C exceeds 5.0C",
        "non-idle throttle reasons observed in 248 samples (first: GPU 0 0x4)"
      ],
      "thresholds": {
        "max_temp_c": 80.0,
        "max_temp_delta_c": 5.0,
        "min_power_w": 630.0,
        "max_tflops_jitter_pct": 5.0,
        "warmup_sec": 10.0,
        "min_steady_samples": 10
      }
    },
    "timestamp": "2026-05-22T17:53:02.058687"
  },
  "timestamp": "2026-05-22T17:53:02.066792"
 }
--- a/reports_stress_smoke_reasons_aikubeworker0016.md
+++ b/reports_stress_smoke_reasons_aikubeworker0016.md
@ -0,0 +1,29 @@
 # GPU Test Report
 - **Date:** 2026-05-22T17:53:02.066792
 - **Host:** aikubeworker0016
 ## Summary
 | Test | Result |
 |------|--------|
 | Stress Test | FAIL |
 ## Stress Test
 - **Source:** pytorch
 - **Duration:** 45s (requested 45s)
 - **Telemetry samples:** 39
 - **Max temp:** {'0': 50.0, '1': 56.0, '2': 57.0, '3': 52.0, '4': 51.0, '5': 58.0, '6': 53.0, '7': 51.0}
 - **Avg power:** {'0': 698.3, '1': 698.5, '2': 697.6, '3': 697.9, '4': 697.8, '5': 698.0, '6': 697.5, '7': 698.0}
 - **Temp delta:** 8.0 C
 - **TFLOPS jitter:** 3.77%
 - **Throttle events:** 248
 - **XID events:** 0
 - **Failure reasons:**
  - GPU temperature delta 8.0C exceeds 5.0C
  - non-idle throttle reasons observed in 248 samples (first: GPU 0 0x4)
 - **Result: FAIL**
 ---
 *Generated by GPU Test Suite v0.2.0*
--- a/reports_test_all_latest_aikubeworker0012_20260522_203246.md
+++ b/reports_test_all_latest_aikubeworker0012_20260522_203246.md
@ -0,0 +1,322 @@
 # GPU Test Report
 - **Date:** 2026-05-22T20:32:51.687830
 - **Host:** aikubeworker0012
 - **GPU:** NVIDIA H100 80GB HBM3 x8
 - **Driver:** 580.159.03 | **CUDA:** 13.0
 ## Overall Acceptance Verdict
 **Result: FAIL**
 Failed or unverified items:
 - Compute Throughput: FAIL (FP16 spread 3.04% > 3%)
 - NCCL: FAIL
 - Stress Test: FAIL
 - RDMA: FAIL
 ## Summary
 | Test | Result |
 |------|--------|
 | GPU Info | PASS (8 GPUs detected) |
 | Health Check | PASS |
 | Memory Bandwidth | PASS (108.1%) |
 | Compute Throughput | FAIL (FP16 spread 3.04% > 3%) |
 | NVLink/NVSwitch | PASS |
 | DCGM | PASS |
 | NCCL | FAIL |
 | Stress Test | FAIL |
 | RDMA | FAIL |
 | Training | PASS (216498 tokens/sec) |
 ## GPU Information
 | GPU | Model | VRAM | Temp | Power | SM Clock |
 |-----|-------|------|------|-------|----------|
 | 0 | NVIDIA H100 80GB HBM3 | 81559 MB | 25C | 69/700W | 345 MHz |
 | 1 | NVIDIA H100 80GB HBM3 | 81559 MB | 25C | 73/700W | 345 MHz |
 | 2 | NVIDIA H100 80GB HBM3 | 81559 MB | 26C | 69/700W | 345 MHz |
 | 3 | NVIDIA H100 80GB HBM3 | 81559 MB | 24C | 69/700W | 345 MHz |
 | 4 | NVIDIA H100 80GB HBM3 | 81559 MB | 24C | 69/700W | 345 MHz |
 | 5 | NVIDIA H100 80GB HBM3 | 81559 MB | 27C | 70/700W | 345 MHz |
 | 6 | NVIDIA H100 80GB HBM3 | 81559 MB | 25C | 70/700W | 345 MHz |
 | 7 | NVIDIA H100 80GB HBM3 | 81559 MB | 24C | 71/700W | 345 MHz |
 ## Health Check
 **Overall: PASS**
 | GPU | Temp | Power | ECC | PCIe | Throttle | Status |
 |-----|------|-------|-----|------|----------|--------|
 | 0 | 25C PASS | 69W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
 | 1 | 25C PASS | 73W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
 | 2 | 26C PASS | 69W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
 | 3 | 24C PASS | 70W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
 | 4 | 24C PASS | 69W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
 | 5 | 27C PASS | 70W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
 | 6 | 25C PASS | 70W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
 | 7 | 24C PASS | 71W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
 ## Memory Bandwidth
 Source: nvbandwidth
 | Metric | Value | Peak | Efficiency |
 |--------|-------|------|------------|
 | H2D (PCIe) | 55.4 GB/s | 64 GB/s | 86.6% |
 | D2H (PCIe) | 54.0 GB/s | 64 GB/s | 84.4% |
 | D2D (NVLink) | 486.5 GB/s | 450 GB/s | 108.1% |
 **Verdict: PASS** (D2D efficiency 108.1%)
 ## Compute Throughput
 | DType | Achieved (TFLOPS) | Peak | Threshold | Status |
 |-------|-------------------|------|------------|--------|
 | FP32 | 51.9 | 67 | >= 54 | FAIL |
 | TF32 | 364.9 | 495 | >= 444 | FAIL |
 | FP16 | 680.0 | 990 | >= 734 | FAIL |
 | BF16 | 713.2 | 990 | >= 745 | FAIL |
 | FP8 | 1170.4 | 1979 | >= 1400 | FAIL |
 | FP64 | 46.9 | 67 | >= 63 | FAIL |
 | INT8 | 100.4 | 1979 | >= 1536 | FAIL |
 **Verdict: FAIL** (absolute TFLOPS thresholds; worst efficiency 5.1%)
 ### Compute Consistency
 | DType | Min | Mean | Max | Spread | Limit | Status |
 |-------|-----|------|-----|--------|-------|--------|
 | FP32 | 51.9 | 52.0 | 52.1 | 0.38% | <= 3% | PASS |
 | TF32 | 361.0 | 364.9 | 369.0 | 2.19% | <= 3% | PASS |
 | FP16 | 667.3 | 680.0 | 688.0 | 3.04% | <= 3% | FAIL |
 | BF16 | 703.0 | 713.3 | 735.7 | 4.58% | <= 3% | FAIL |
 | FP8 | 1156.9 | 1170.5 | 1186.1 | 2.49% | <= 3% | PASS |
 | FP64 | 45.9 | 46.9 | 47.5 | 3.41% | <= 3% | FAIL |
 | INT8 | 100.4 | 100.4 | 100.4 | 0.00% | <= 3% | PASS |
 ### Compute Per-GPU TFLOPS
 | GPU | FP32 | TF32 | FP16 | BF16 | FP8 | FP64 | INT8 |
 |---|---|---|---|---|---|---|---|
 | 0 | 52.0 | 369.0 | 688.0 | 735.7 | 1186.1 | 47.5 | 100.4 |
 | 1 | 51.9 | 365.6 | 675.3 | 711.6 | 1171.0 | 47.0 | 100.4 |
 | 2 | 51.9 | 364.9 | 685.7 | 715.3 | 1175.3 | 47.1 | 100.4 |
 | 3 | 51.9 | 364.0 | 679.9 | 704.0 | 1167.6 | 47.4 | 100.4 |
 | 4 | 51.9 | 367.7 | 681.2 | 719.0 | 1178.0 | 46.6 | 100.4 |
 | 5 | 52.0 | 364.3 | 680.8 | 712.3 | 1165.5 | 46.8 | 100.4 |
 | 6 | 52.1 | 362.9 | 681.8 | 703.0 | 1156.9 | 46.9 | 100.4 |
 | 7 | 51.9 | 361.0 | 667.3 | 705.3 | 1163.2 | 45.9 | 100.4 |
 ## NVLink/NVSwitch
 **Overall: PASS**
 | GPU | Active Links | Issues |
 |-----|--------------|--------|
 | 0 | 18/18 | OK |
 | 1 | 18/18 | OK |
 | 2 | 18/18 | OK |
 | 3 | 18/18 | OK |
 | 4 | 18/18 | OK |
 | 5 | 18/18 | OK |
 | 6 | 18/18 | OK |
 | 7 | 18/18 | OK |
 ## DCGM Diagnostic
 **Overall: PASS**
 | Subtest | Status |
 |---------|--------|
 | Deployment/software/GPU0 | PASS |
 | Deployment/software/GPU1 | PASS |
 | Deployment/software/GPU2 | PASS |
 | Deployment/software/GPU3 | PASS |
 | Deployment/software/GPU4 | PASS |
 | Deployment/software/GPU5 | PASS |
 | Deployment/software/GPU6 | PASS |
 | Deployment/software/GPU7 | PASS |
 | Deployment/software/summary | PASS |
 | Hardware/memory/GPU0 | PASS |
 | Hardware/memory/GPU1 | PASS |
 | Hardware/memory/GPU2 | PASS |
 | Hardware/memory/GPU3 | PASS |
 | Hardware/memory/GPU4 | PASS |
 | Hardware/memory/GPU5 | PASS |
 | Hardware/memory/GPU6 | PASS |
 | Hardware/memory/GPU7 | PASS |
 | Hardware/memory/summary | PASS |
 | Hardware/diagnostic/GPU0 | PASS |
 | Hardware/diagnostic/GPU1 | PASS |
 | Hardware/diagnostic/GPU2 | PASS |
 | Hardware/diagnostic/GPU3 | PASS |
 | Hardware/diagnostic/GPU4 | PASS |
 | Hardware/diagnostic/GPU5 | PASS |
 | Hardware/diagnostic/GPU6 | PASS |
 | Hardware/diagnostic/GPU7 | PASS |
 | Hardware/diagnostic/summary | PASS |
 | Hardware/nvbandwidth/GPU0 | PASS |
 | Hardware/nvbandwidth/GPU1 | PASS |
 | Hardware/nvbandwidth/GPU2 | PASS |
 | Hardware/nvbandwidth/GPU3 | PASS |
 | Hardware/nvbandwidth/GPU4 | PASS |
 | Hardware/nvbandwidth/GPU5 | PASS |
 | Hardware/nvbandwidth/GPU6 | PASS |
 | Hardware/nvbandwidth/GPU7 | PASS |
 | Hardware/nvbandwidth/summary | PASS |
 | Integration/pcie/GPU0 | PASS |
 | Integration/pcie/GPU1 | PASS |
 | Integration/pcie/GPU2 | PASS |
 | Integration/pcie/GPU3 | PASS |
 | Integration/pcie/GPU4 | PASS |
 | Integration/pcie/GPU5 | PASS |
 | Integration/pcie/GPU6 | PASS |
 | Integration/pcie/GPU7 | PASS |
 | Integration/pcie/summary | PASS |
 | Stress/targeted_stress/GPU0 | PASS |
 | Stress/targeted_stress/GPU1 | PASS |
 | Stress/targeted_stress/GPU2 | PASS |
 | Stress/targeted_stress/GPU3 | PASS |
 | Stress/targeted_stress/GPU4 | PASS |
 | Stress/targeted_stress/GPU5 | PASS |
 | Stress/targeted_stress/GPU6 | PASS |
 | Stress/targeted_stress/GPU7 | PASS |
 | Stress/targeted_stress/summary | PASS |
 | Stress/targeted_power/GPU0 | PASS |
 | Stress/targeted_power/GPU1 | PASS |
 | Stress/targeted_power/GPU2 | PASS |
 | Stress/targeted_power/GPU3 | PASS |
 | Stress/targeted_power/GPU4 | PASS |
 | Stress/targeted_power/GPU5 | PASS |
 | Stress/targeted_power/GPU6 | PASS |
 | Stress/targeted_power/GPU7 | PASS |
 | Stress/targeted_power/summary | PASS |
 ## NCCL Multi-GPU
 Source: nccl-tests | GPUs: 8
 | Operation | Bus BW (GB/s) | Threshold | Status |
 |-----------|---------------|-----------|--------|
 | allreduce | 472.3 | >= 405 | FAIL |
 | alltoall | 343.3 | >= 315 | FAIL |
 | broadcast | 364.1 | >= 360 | FAIL |
 | reducescatter | 352.8 | >= 405 | FAIL |
 | allgather | 366.4 | >= 405 | FAIL |
 | sendrecv | 369.0 | >= 360 | FAIL |
 ### NCCL allreduce by size
 | Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
 |------|---------------------|-------|------|--------|-----------|--------|
 | 1M | 24.9, 25.0, 24.7 | 24.7 | 24.9 | 0.50% | >= 405 | FAIL |
 | 256M | 421.6, 421.8, 421.6 | 421.6 | 421.7 | 0.02% | >= 405 | PASS |
 | 2G | 472.8, 472.7, 471.5 | 471.5 | 472.3 | 0.13% | >= 405 | PASS |
 ### NCCL alltoall by size
 | Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
 |------|---------------------|-------|------|--------|-----------|--------|
 | 1M | 8.1, 8.0, 8.0 | 8.0 | 8.0 | 0.59% | >= 315 | FAIL |
 | 256M | 305.3, 314.9, 313.1 | 305.3 | 311.1 | 1.34% | >= 315 | FAIL |
 | 2G | 342.1, 342.5, 345.4 | 342.1 | 343.3 | 0.43% | >= 315 | PASS |
 ### NCCL broadcast by size
 | Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
 |------|---------------------|-------|------|--------|-----------|--------|
 | 1M | 14.5, 14.6, 14.2 | 14.2 | 14.4 | 1.18% | >= 360 | FAIL |
 | 256M | 344.2, 345.9, 344.6 | 344.2 | 344.9 | 0.21% | >= 360 | FAIL |
 | 2G | 364.2, 364.0, 364.1 | 364.0 | 364.1 | 0.02% | >= 360 | PASS |
 ### NCCL reducescatter by size
 | Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
 |------|---------------------|-------|------|--------|-----------|--------|
 | 1M | 14.1, 13.8, 14.2 | 13.8 | 14.0 | 1.21% | >= 405 | FAIL |
 | 256M | 328.6, 328.3, 328.2 | 328.2 | 328.4 | 0.05% | >= 405 | FAIL |
 | 2G | 352.6, 352.4, 353.3 | 352.4 | 352.8 | 0.11% | >= 405 | FAIL |
 ### NCCL allgather by size
 | Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
 |------|---------------------|-------|------|--------|-----------|--------|
 | 1M | 14.6, 14.3, 14.4 | 14.3 | 14.4 | 0.86% | >= 405 | FAIL |
 | 256M | 350.5, 350.4, 349.9 | 349.9 | 350.3 | 0.07% | >= 405 | FAIL |
 | 2G | 366.3, 366.6, 366.2 | 366.2 | 366.4 | 0.05% | >= 405 | FAIL |
 ### NCCL sendrecv by size
 | Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
 |------|---------------------|-------|------|--------|-----------|--------|
 | 1M | 18.4, 18.4, 18.4 | 18.4 | 18.4 | 0.00% | >= 360 | FAIL |
 | 256M | 350.9, 351.6, 351.4 | 350.9 | 351.3 | 0.08% | >= 360 | FAIL |
 | 2G | 368.9, 369.1, 368.9 | 368.9 | 369.0 | 0.03% | >= 360 | PASS |
 **Overall: FAIL**
 ## Stress Test
 - **Source:** pytorch
 - **Duration:** 1800s (requested 1800s)
 - **Telemetry samples:** 1266
 - **Max temp:** {0: 60.0, 1: 60.0, 2: 68.0, 3: 56.0, 4: 60.0, 5: 68.0, 6: 64.0, 7: 56.0}
 - **Avg power:** {0: 697.7, 1: 697.5, 2: 697.1, 3: 697.8, 4: 697.8, 5: 697.9, 6: 697.7, 7: 698.3}
 - **Temp delta:** 12.0 C
 - **TFLOPS jitter:** 4.37%
 - **Steady TFLOPS samples:** 37672
 - **Throttle events:** 9712
 - **XID events:** 0
 - **Failure reasons:**
  - GPU temperature delta 12.0C exceeds 5.0C
  - non-idle throttle reasons observed in 9712 samples (first: GPU 0 0x4)
 - **Result: FAIL**
 ## RDMA/InfiniBand
 ### RDMA Port Checks
 | Device | Port | State | Rate | Required | Status |
 |--------|------|-------|------|----------|--------|
 | mlx5_0 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
 | mlx5_1 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
 | mlx5_4 | 1 | 4: ACTIVE | 100 Gb/sec (2X HDR) | >= 400Gbps ACTIVE | FAIL |
 | mlx5_5 | 1 | 4: ACTIVE | 100 Gb/sec (2X HDR) | >= 400Gbps ACTIVE | FAIL |
 | mlx5_6 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
 | mlx5_7 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
 | Test | Value | Threshold | Status |
 |------|-------|-----------|--------|
 | ib_write_bw | 49.5 GB/s | >= 47 GB/s | PASS |
 | ib_read_bw | 39.1 GB/s | >= 47 GB/s | FAIL |
 | ib_write_lat | 1.25 us | <= 2 us | PASS |
 | ib_read_lat | 2.60 us | <= 3.5 us | PASS |
 | ibping | local_loopback target=0x58 count=5 | 0% packet loss | PASS |
 - **PFC/ECN/CNP/congestion counters checked:** 146
 - **PFC/ECN/CNP/congestion non-zero:** no
 - **Failure reasons:**
  - mlx5_4 port 1 state/rate failed (4: ACTIVE, 100 Gb/sec (2X HDR); required >= 400.0Gbps ACTIVE)
  - mlx5_5 port 1 state/rate failed (4: ACTIVE, 100 Gb/sec (2X HDR); required >= 400.0Gbps ACTIVE)
  - ib_read_bw bandwidth 39.12GB/s < 47GB/s
 **Overall: FAIL**
 ## Training Simulation
 | Metric | Value |
 |--------|-------|
 | Model | synthetic_transformer_1.5b |
 | Params | 1470.5M |
 | Throughput | 216498 tokens/sec |
 | Avg Step Time | 75.7 ms |
 | Warmup Steps | 5 |
 | Peak Memory | 18.1 GB |
 | Final Loss | 0.0039 |
 | Step Jitter | 1.89% |
 | Distributed Mode | ddp |
 | Verdict | PASS (216498 tokens/sec) |
 ---
 *Generated by GPU Test Suite v0.2.0*
--- a/reports_test_all_latest_aikubeworker0016_20260522_203447.md
+++ b/reports_test_all_latest_aikubeworker0016_20260522_203447.md
@ -0,0 +1,322 @@
 # GPU Test Report
 - **Date:** 2026-05-22T20:34:52.129246
 - **Host:** aikubeworker0016
 - **GPU:** NVIDIA H100 80GB HBM3 x8
 - **Driver:** 580.159.03 | **CUDA:** 13.0
 ## Overall Acceptance Verdict
 **Result: FAIL**
 Failed or unverified items:
 - Compute Throughput: FAIL (BF16 spread 3.44% > 3%)
 - NCCL: FAIL
 - Stress Test: FAIL
 - RDMA: FAIL
 ## Summary
 | Test | Result |
 |------|--------|
 | GPU Info | PASS (8 GPUs detected) |
 | Health Check | PASS |
 | Memory Bandwidth | PASS (108.1%) |
 | Compute Throughput | FAIL (BF16 spread 3.44% > 3%) |
 | NVLink/NVSwitch | PASS |
 | DCGM | PASS |
 | NCCL | FAIL |
 | Stress Test | FAIL |
 | RDMA | FAIL |
 | Training | PASS (216683 tokens/sec) |
 ## GPU Information
 | GPU | Model | VRAM | Temp | Power | SM Clock |
 |-----|-------|------|------|-------|----------|
 | 0 | NVIDIA H100 80GB HBM3 | 81559 MB | 20C | 70/700W | 345 MHz |
 | 1 | NVIDIA H100 80GB HBM3 | 81559 MB | 21C | 68/700W | 345 MHz |
 | 2 | NVIDIA H100 80GB HBM3 | 81559 MB | 21C | 67/700W | 345 MHz |
 | 3 | NVIDIA H100 80GB HBM3 | 81559 MB | 20C | 67/700W | 345 MHz |
 | 4 | NVIDIA H100 80GB HBM3 | 81559 MB | 20C | 68/700W | 345 MHz |
 | 5 | NVIDIA H100 80GB HBM3 | 81559 MB | 22C | 69/700W | 345 MHz |
 | 6 | NVIDIA H100 80GB HBM3 | 81559 MB | 20C | 68/700W | 345 MHz |
 | 7 | NVIDIA H100 80GB HBM3 | 81559 MB | 20C | 66/700W | 345 MHz |
 ## Health Check
 **Overall: PASS**
 | GPU | Temp | Power | ECC | PCIe | Throttle | Status |
 |-----|------|-------|-----|------|----------|--------|
 | 0 | 20C PASS | 70W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
 | 1 | 21C PASS | 68W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
 | 2 | 21C PASS | 67W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
 | 3 | 20C PASS | 67W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
 | 4 | 20C PASS | 68W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
 | 5 | 22C PASS | 69W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
 | 6 | 20C PASS | 68W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
 | 7 | 20C PASS | 66W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
 ## Memory Bandwidth
 Source: nvbandwidth
 | Metric | Value | Peak | Efficiency |
 |--------|-------|------|------------|
 | H2D (PCIe) | 55.4 GB/s | 64 GB/s | 86.6% |
 | D2H (PCIe) | 54.4 GB/s | 64 GB/s | 85.0% |
 | D2D (NVLink) | 486.6 GB/s | 450 GB/s | 108.1% |
 **Verdict: PASS** (D2D efficiency 108.1%)
 ## Compute Throughput
 | DType | Achieved (TFLOPS) | Peak | Threshold | Status |
 |-------|-------------------|------|------------|--------|
 | FP32 | 52.1 | 67 | >= 54 | FAIL |
 | TF32 | 366.7 | 495 | >= 444 | FAIL |
 | FP16 | 682.7 | 990 | >= 734 | FAIL |
 | BF16 | 717.3 | 990 | >= 745 | FAIL |
 | FP8 | 1173.5 | 1979 | >= 1400 | FAIL |
 | FP64 | 47.4 | 67 | >= 63 | FAIL |
 | INT8 | 100.4 | 1979 | >= 1536 | FAIL |
 **Verdict: FAIL** (absolute TFLOPS thresholds; worst efficiency 5.1%)
 ### Compute Consistency
 | DType | Min | Mean | Max | Spread | Limit | Status |
 |-------|-----|------|-----|--------|-------|--------|
 | FP32 | 51.9 | 52.1 | 52.2 | 0.58% | <= 3% | PASS |
 | TF32 | 362.3 | 366.7 | 369.2 | 1.88% | <= 3% | PASS |
 | FP16 | 674.4 | 682.7 | 693.1 | 2.74% | <= 3% | PASS |
 | BF16 | 705.3 | 717.2 | 730.0 | 3.44% | <= 3% | FAIL |
 | FP8 | 1155.2 | 1173.5 | 1186.2 | 2.64% | <= 3% | PASS |
 | FP64 | 46.3 | 47.4 | 48.5 | 4.64% | <= 3% | FAIL |
 | INT8 | 100.4 | 100.4 | 100.4 | 0.00% | <= 3% | PASS |
 ### Compute Per-GPU TFLOPS
 | GPU | FP32 | TF32 | FP16 | BF16 | FP8 | FP64 | INT8 |
 |---|---|---|---|---|---|---|---|
 | 0 | 52.2 | 362.3 | 674.4 | 714.3 | 1159.0 | 46.3 | 100.4 |
 | 1 | 51.9 | 366.5 | 674.7 | 721.4 | 1185.4 | 47.7 | 100.4 |
 | 2 | 52.2 | 367.4 | 693.1 | 730.0 | 1185.7 | 48.5 | 100.4 |
 | 3 | 52.2 | 367.8 | 682.2 | 708.2 | 1163.4 | 47.4 | 100.4 |
 | 4 | 52.0 | 366.4 | 686.9 | 714.1 | 1186.2 | 47.3 | 100.4 |
 | 5 | 52.0 | 369.2 | 679.9 | 721.1 | 1155.2 | 47.3 | 100.4 |
 | 6 | 51.9 | 365.1 | 677.7 | 705.3 | 1169.0 | 47.0 | 100.4 |
 | 7 | 52.2 | 369.0 | 692.8 | 723.5 | 1184.3 | 47.6 | 100.4 |
 ## NVLink/NVSwitch
 **Overall: PASS**
 | GPU | Active Links | Issues |
 |-----|--------------|--------|
 | 0 | 18/18 | OK |
 | 1 | 18/18 | OK |
 | 2 | 18/18 | OK |
 | 3 | 18/18 | OK |
 | 4 | 18/18 | OK |
 | 5 | 18/18 | OK |
 | 6 | 18/18 | OK |
 | 7 | 18/18 | OK |
 ## DCGM Diagnostic
 **Overall: PASS**
 | Subtest | Status |
 |---------|--------|
 | Deployment/software/GPU0 | PASS |
 | Deployment/software/GPU1 | PASS |
 | Deployment/software/GPU2 | PASS |
 | Deployment/software/GPU3 | PASS |
 | Deployment/software/GPU4 | PASS |
 | Deployment/software/GPU5 | PASS |
 | Deployment/software/GPU6 | PASS |
 | Deployment/software/GPU7 | PASS |
 | Deployment/software/summary | PASS |
 | Hardware/memory/GPU0 | PASS |
 | Hardware/memory/GPU1 | PASS |
 | Hardware/memory/GPU2 | PASS |
 | Hardware/memory/GPU3 | PASS |
 | Hardware/memory/GPU4 | PASS |
 | Hardware/memory/GPU5 | PASS |
 | Hardware/memory/GPU6 | PASS |
 | Hardware/memory/GPU7 | PASS |
 | Hardware/memory/summary | PASS |
 | Hardware/diagnostic/GPU0 | PASS |
 | Hardware/diagnostic/GPU1 | PASS |
 | Hardware/diagnostic/GPU2 | PASS |
 | Hardware/diagnostic/GPU3 | PASS |
 | Hardware/diagnostic/GPU4 | PASS |
 | Hardware/diagnostic/GPU5 | PASS |
 | Hardware/diagnostic/GPU6 | PASS |
 | Hardware/diagnostic/GPU7 | PASS |
 | Hardware/diagnostic/summary | PASS |
 | Hardware/nvbandwidth/GPU0 | PASS |
 | Hardware/nvbandwidth/GPU1 | PASS |
 | Hardware/nvbandwidth/GPU2 | PASS |
 | Hardware/nvbandwidth/GPU3 | PASS |
 | Hardware/nvbandwidth/GPU4 | PASS |
 | Hardware/nvbandwidth/GPU5 | PASS |
 | Hardware/nvbandwidth/GPU6 | PASS |
 | Hardware/nvbandwidth/GPU7 | PASS |
 | Hardware/nvbandwidth/summary | PASS |
 | Integration/pcie/GPU0 | PASS |
 | Integration/pcie/GPU1 | PASS |
 | Integration/pcie/GPU2 | PASS |
 | Integration/pcie/GPU3 | PASS |
 | Integration/pcie/GPU4 | PASS |
 | Integration/pcie/GPU5 | PASS |
 | Integration/pcie/GPU6 | PASS |
 | Integration/pcie/GPU7 | PASS |
 | Integration/pcie/summary | PASS |
 | Stress/targeted_stress/GPU0 | PASS |
 | Stress/targeted_stress/GPU1 | PASS |
 | Stress/targeted_stress/GPU2 | PASS |
 | Stress/targeted_stress/GPU3 | PASS |
 | Stress/targeted_stress/GPU4 | PASS |
 | Stress/targeted_stress/GPU5 | PASS |
 | Stress/targeted_stress/GPU6 | PASS |
 | Stress/targeted_stress/GPU7 | PASS |
 | Stress/targeted_stress/summary | PASS |
 | Stress/targeted_power/GPU0 | PASS |
 | Stress/targeted_power/GPU1 | PASS |
 | Stress/targeted_power/GPU2 | PASS |
 | Stress/targeted_power/GPU3 | PASS |
 | Stress/targeted_power/GPU4 | PASS |
 | Stress/targeted_power/GPU5 | PASS |
 | Stress/targeted_power/GPU6 | PASS |
 | Stress/targeted_power/GPU7 | PASS |
 | Stress/targeted_power/summary | PASS |
 ## NCCL Multi-GPU
 Source: nccl-tests | GPUs: 8
 | Operation | Bus BW (GB/s) | Threshold | Status |
 |-----------|---------------|-----------|--------|
 | allreduce | 472.4 | >= 405 | FAIL |
 | alltoall | 344.3 | >= 315 | FAIL |
 | broadcast | 363.6 | >= 360 | FAIL |
 | reducescatter | 353.1 | >= 405 | FAIL |
 | allgather | 366.4 | >= 405 | FAIL |
 | sendrecv | 368.9 | >= 360 | FAIL |
 ### NCCL allreduce by size
 | Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
 |------|---------------------|-------|------|--------|-----------|--------|
 | 1M | 24.9, 24.4, 24.9 | 24.4 | 24.7 | 0.95% | >= 405 | FAIL |
 | 256M | 421.9, 421.1, 421.9 | 421.1 | 421.6 | 0.09% | >= 405 | PASS |
 | 2G | 472.6, 472.0, 472.5 | 472.0 | 472.4 | 0.06% | >= 405 | PASS |
 ### NCCL alltoall by size
 | Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
 |------|---------------------|-------|------|--------|-----------|--------|
 | 1M | 7.9, 7.8, 8.1 | 7.8 | 7.9 | 1.57% | >= 315 | FAIL |
 | 256M | 298.7, 312.7, 303.2 | 298.7 | 304.9 | 1.91% | >= 315 | FAIL |
 | 2G | 342.2, 345.4, 345.2 | 342.2 | 344.3 | 0.43% | >= 315 | PASS |
 ### NCCL broadcast by size
 | Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
 |------|---------------------|-------|------|--------|-----------|--------|
 | 1M | 14.5, 14.3, 14.4 | 14.3 | 14.4 | 0.57% | >= 360 | FAIL |
 | 256M | 344.1, 344.3, 344.8 | 344.1 | 344.4 | 0.09% | >= 360 | FAIL |
 | 2G | 364.0, 363.6, 363.3 | 363.3 | 363.6 | 0.08% | >= 360 | PASS |
 ### NCCL reducescatter by size
 | Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
 |------|---------------------|-------|------|--------|-----------|--------|
 | 1M | 14.0, 14.2, 14.3 | 14.0 | 14.2 | 0.88% | >= 405 | FAIL |
 | 256M | 328.8, 328.7, 328.4 | 328.4 | 328.6 | 0.05% | >= 405 | FAIL |
 | 2G | 351.9, 353.8, 353.6 | 351.9 | 353.1 | 0.24% | >= 405 | FAIL |
 ### NCCL allgather by size
 | Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
 |------|---------------------|-------|------|--------|-----------|--------|
 | 1M | 14.4, 13.9, 14.0 | 13.9 | 14.1 | 1.53% | >= 405 | FAIL |
 | 256M | 350.2, 350.4, 350.7 | 350.2 | 350.4 | 0.06% | >= 405 | FAIL |
 | 2G | 366.9, 366.4, 366.0 | 366.0 | 366.4 | 0.10% | >= 405 | FAIL |
 ### NCCL sendrecv by size
 | Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
 |------|---------------------|-------|------|--------|-----------|--------|
 | 1M | 18.4, 18.3, 18.5 | 18.3 | 18.4 | 0.44% | >= 360 | FAIL |
 | 256M | 351.1, 351.4, 351.3 | 351.1 | 351.3 | 0.04% | >= 360 | FAIL |
 | 2G | 368.9, 368.8, 368.9 | 368.8 | 368.9 | 0.01% | >= 360 | PASS |
 **Overall: FAIL**
 ## Stress Test
 - **Source:** pytorch
 - **Duration:** 1800s (requested 1800s)
 - **Telemetry samples:** 1295
 - **Max temp:** {0: 51.0, 1: 59.0, 2: 61.0, 3: 53.0, 4: 53.0, 5: 62.0, 6: 56.0, 7: 52.0}
 - **Avg power:** {0: 698.8, 1: 697.8, 2: 698.1, 3: 697.9, 4: 697.9, 5: 698.2, 6: 698.0, 7: 697.8}
 - **Temp delta:** 11.0 C
 - **TFLOPS jitter:** 3.4%
 - **Steady TFLOPS samples:** 37874
 - **Throttle events:** 9944
 - **XID events:** 0
 - **Failure reasons:**
  - GPU temperature delta 11.0C exceeds 5.0C
  - non-idle throttle reasons observed in 9944 samples (first: GPU 0 0x4)
 - **Result: FAIL**
 ## RDMA/InfiniBand
 ### RDMA Port Checks
 | Device | Port | State | Rate | Required | Status |
 |--------|------|-------|------|----------|--------|
 | mlx5_0 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
 | mlx5_1 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
 | mlx5_4 | 1 | 4: ACTIVE | 100 Gb/sec (2X HDR) | >= 400Gbps ACTIVE | FAIL |
 | mlx5_5 | 1 | 4: ACTIVE | 100 Gb/sec (2X HDR) | >= 400Gbps ACTIVE | FAIL |
 | mlx5_6 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
 | mlx5_7 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
 | Test | Value | Threshold | Status |
 |------|-------|-----------|--------|
 | ib_write_bw | 48.6 GB/s | >= 47 GB/s | PASS |
 | ib_read_bw | 40.3 GB/s | >= 47 GB/s | FAIL |
 | ib_write_lat | 1.29 us | <= 2 us | PASS |
 | ib_read_lat | 2.59 us | <= 3.5 us | PASS |
 | ibping | local_loopback target=0x4b count=5 | 0% packet loss | PASS |
 - **PFC/ECN/CNP/congestion counters checked:** 146
 - **PFC/ECN/CNP/congestion non-zero:** no
 - **Failure reasons:**
  - mlx5_4 port 1 state/rate failed (4: ACTIVE, 100 Gb/sec (2X HDR); required >= 400.0Gbps ACTIVE)
  - mlx5_5 port 1 state/rate failed (4: ACTIVE, 100 Gb/sec (2X HDR); required >= 400.0Gbps ACTIVE)
  - ib_read_bw bandwidth 40.29GB/s < 47GB/s
 **Overall: FAIL**
 ## Training Simulation
 | Metric | Value |
 |--------|-------|
 | Model | synthetic_transformer_1.5b |
 | Params | 1470.5M |
 | Throughput | 216683 tokens/sec |
 | Avg Step Time | 75.6 ms |
 | Warmup Steps | 5 |
 | Peak Memory | 18.1 GB |
 | Final Loss | 0.0039 |
 | Step Jitter | 1.2% |
 | Distributed Mode | ddp |
 | Verdict | PASS (216683 tokens/sec) |
 ---
 *Generated by GPU Test Suite v0.2.0*
--- a/reports_test_all_latest_summary_cn_20260523.md
+++ b/reports_test_all_latest_summary_cn_20260523.md
@ -0,0 +1,101 @@
 # H100 Single-Node test all Summary
 Generated: 2026-05-23  
 Scope: single-node `python gpu_tester.py --test all --report --format md` runs on `aikubeworker0012` and `aikubeworker0016`
 Raw reports:
 - `reports_test_all_latest_aikubeworker0012_20260522_203246.md`
 - `reports_test_all_latest_aikubeworker0016_20260522_203447.md`
 ## Overall Conclusion
 | Host | Suite | PDF acceptance verdict | Main failed items |
 |---|---:|---|---|
 | aikubeworker0012 | 6/10 PASS | FAIL | Compute, NCCL, Stress, RDMA |
 | aikubeworker0016 | 6/10 PASS | FAIL | Compute, NCCL, Stress, RDMA |
 Under the PDF's criteria, a FAIL on any mandatory sub-item fails the whole machine, so neither machine currently passes production acceptance.
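 The any-FAIL-fails-the-machine rule can be sketched as follows. This is a minimal illustration, not the code in `gpu_tester.py`: the helper name `overall_verdict` is hypothetical, and the per-suite results are copied from the Summary tables of the two raw reports, which show the same PASS/FAIL pattern.

```python
# Per-suite results copied from the "Summary" tables in the raw reports
# (both hosts show the same PASS/FAIL pattern).
SUITE_RESULTS = {
    "GPU Info": "PASS",
    "Health Check": "PASS",
    "Memory Bandwidth": "PASS",
    "Compute Throughput": "FAIL",
    "NVLink/NVSwitch": "PASS",
    "DCGM": "PASS",
    "NCCL": "FAIL",
    "Stress Test": "FAIL",
    "RDMA": "FAIL",
    "Training": "PASS",
}


def overall_verdict(results):
    """Hypothetical helper: any result other than PASS fails the machine."""
    failed = [name for name, status in results.items() if status != "PASS"]
    passed = len(results) - len(failed)
    verdict = "PASS" if not failed else "FAIL"
    return verdict, passed, failed


verdict, passed, failed = overall_verdict(SUITE_RESULTS)
print(f"{passed}/{len(SUITE_RESULTS)} PASS -> {verdict}; failed: {', '.join(failed)}")
```

 Treating every non-PASS status as a failure also covers ERROR or unverified items, in line with the "Failed or unverified items" heading used in the raw reports.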
 ## Passed Items
 | Item | aikubeworker0012 | aikubeworker0016 | Notes |
 |---|---|---|---|
 | GPU Info | PASS | PASS | 8x H100 |
 | Health | PASS | PASS | Temperature, idle power, ECC, PCIe, and idle throttle are normal |
 | Memory Bandwidth | PASS | PASS | D2D efficiency about 108.1% on both |
 | NVLink/NVSwitch | PASS | PASS | 18/18 links on all 8 GPUs |
 | DCGM diag -r 3 | PASS | PASS | software, memory, diagnostic, nvbandwidth, pcie, targeted stress/power all PASS |
 | Training Simulation | PASS | PASS | 8-GPU DDP, synthetic 1.5B model, loss finite |
 Training results:
 | Host | Throughput | Step jitter | Peak memory | Verdict |
 |---|---:|---:|---:|---|
 | aikubeworker0012 | 216498 tokens/s | 1.89% | 18.08 GB | PASS |
 | aikubeworker0016 | 216683 tokens/s | 1.20% | 18.08 GB | PASS |
 ## Failed Items
 ### Compute
 Neither machine reaches the currently configured H100 absolute TFLOPS thresholds, and the cross-GPU spread exceeds 3% for some dtypes.
 | Host | Representative failures |
 |---|---|
 | aikubeworker0012 | FP16 spread 3.04%, BF16 spread 4.58%, FP64 spread 3.41%; FP32/TF32/FP16/BF16/FP8/FP64/INT8 all FAIL the absolute thresholds |
 | aikubeworker0016 | BF16 spread 3.44%, FP64 spread 4.64%; FP32/TF32/FP16/BF16/FP8/FP64/INT8 all FAIL the absolute thresholds |
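 The spread figures can be reproduced from the per-GPU numbers in the raw report. The sketch below assumes spread is (max - min) / mean; that formula is inferred from the reported values, not confirmed against the tool's source. The FP16 values are the per-GPU TFLOPS for `aikubeworker0012`, and `spread_pct` is a hypothetical helper name.

```python
# FP16 per-GPU TFLOPS for aikubeworker0012, from "Compute Per-GPU TFLOPS".
FP16_TFLOPS = [688.0, 675.3, 685.7, 679.9, 681.2, 680.8, 681.8, 667.3]
SPREAD_LIMIT_PCT = 3.0  # limit shown in the "Compute Consistency" table


def spread_pct(values):
    """Assumed formula: (max - min) / mean, expressed as a percentage."""
    mean = sum(values) / len(values)
    return (max(values) - min(values)) / mean * 100.0


spread = spread_pct(FP16_TFLOPS)
status = "PASS" if spread <= SPREAD_LIMIT_PCT else "FAIL"
print(f"FP16 spread {spread:.2f}% -> {status}")
```

 Under this assumption the FP16 case is a marginal failure: the slowest GPU (GPU 7, 667.3 TFLOPS) alone pushes the spread just over the 3% limit.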
 ### NCCL
 NCCL now uses real `nccl-tests` bus BW rather than the torchrun fallback. The failures come mainly from the small size, plus some ops missing the threshold at 256M/2G.
 | Host | allreduce best | alltoall best | broadcast best | reducescatter best | allgather best | sendrecv best | Verdict |
 |---|---:|---:|---:|---:|---:|---:|---|
 | aikubeworker0012 | 472.3 | 343.3 | 364.1 | 352.8 | 366.4 | 369.0 | FAIL |
 | aikubeworker0016 | 472.4 | 344.3 | 363.6 | 353.1 | 366.4 | 368.9 | FAIL |
 Key reasons:
 - The `1M` size is far below the threshold for every op.
 - `reducescatter` and `allgather` are also below the 405 GB/s threshold at 2G.
 - `broadcast/sendrecv` are below the 360 GB/s threshold at 256M.
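 This also explains why an op such as allreduce is marked FAIL in the raw reports even though its best bus BW (472.3 GB/s) is above the 405 GB/s threshold. A minimal sketch of the gating, assuming a size passes only when its worst run meets the threshold and an op passes only when every size passes; both rules are inferred from the report tables, and `size_status` and `op_status` are hypothetical helper names:

```python
# allreduce bus BW runs (GB/s) for aikubeworker0012, from the raw report.
ALLREDUCE_RUNS = {
    "1M": [24.9, 25.0, 24.7],
    "256M": [421.6, 421.8, 421.6],
    "2G": [472.8, 472.7, 471.5],
}
THRESHOLD_GBPS = 405.0


def size_status(runs, threshold):
    """Assumed rule: a size passes only if its worst run meets the threshold."""
    return "PASS" if min(runs) >= threshold else "FAIL"


def op_status(runs_by_size, threshold):
    """Assumed rule: an op passes only if every tested size passes."""
    per_size = {
        size: size_status(runs, threshold) for size, runs in runs_by_size.items()
    }
    overall = "PASS" if all(s == "PASS" for s in per_size.values()) else "FAIL"
    return overall, per_size


overall, per_size = op_status(ALLREDUCE_RUNS, THRESHOLD_GBPS)
print(overall, per_size)
```

 With the same threshold applied to every size, the 1M row decides the verdict for every op, regardless of the large-message results.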
 ### Stress
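 On both machines the 1800-second PyTorch BF16 GEMM stress test ran for its full duration, but the telemetry verdict is FAIL.
 | Host | Avg steady-state power | Range of per-GPU max temperature | Temp delta | TFLOPS jitter | throttle events | XID | Verdict |
 |---|---|---|---:|---:|---:|---:|---|
 | aikubeworker0012 | ~697-698 W/GPU | 56-68C | 12C | 4.37% | 9712 | 0 | FAIL |
 | aikubeworker0016 | ~698 W/GPU | 51-62C | 11C | 3.40% | 9944 | 0 | FAIL |
 Failure reasons:
 - The temperature delta across GPUs exceeds the 5C threshold.
 - A large number of non-idle throttle samples were observed; the first reason recorded is `0x4`, i.e. `sw_power_cap`.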
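The two failure reasons can be illustrated with a short sketch. The bit values follow NVML's clocks-throttle-reason constants and the names follow the `nvidia-smi` query field suffixes; only the most common bits are listed. The helper names and the choice to ignore only the idle bit are assumptions about how the script classifies a sample as "non-idle throttle". The temperatures are the per-GPU maxima from the 18:26 aikubeworker0012 report.

```python
# Common throttle-reason bits (NVML nvmlClocksThrottleReason* values).
THROTTLE_BITS = {
    0x1: "gpu_idle",
    0x2: "applications_clocks_setting",
    0x4: "sw_power_cap",
    0x8: "hw_slowdown",
    0x10: "sync_boost",
    0x20: "sw_thermal_slowdown",
    0x40: "hw_thermal_slowdown",
    0x80: "hw_power_brake_slowdown",
}
IDLE_BIT = 0x1
MAX_TEMP_DELTA_C = 5.0


def decode_throttle(mask):
    """Return the names of all known bits set in a throttle-reason mask."""
    return [name for bit, name in THROTTLE_BITS.items() if mask & bit]


def is_non_idle_throttle(mask):
    """True if any bit other than gpu_idle is set (assumed strict rule)."""
    return (mask & ~IDLE_BIT) != 0


# Per-GPU max temperature during stress, from the 18:26 report.
max_temp_c = {0: 60.0, 1: 60.0, 2: 68.0, 3: 56.0, 4: 60.0, 5: 68.0, 6: 65.0, 7: 56.0}
temp_delta = max(max_temp_c.values()) - min(max_temp_c.values())

print(decode_throttle(0x4), is_non_idle_throttle(0x4))
print(f"temp delta {temp_delta:.1f}C, limit {MAX_TEMP_DELTA_C:.1f}C")
```

Treating the mask as a bit field rather than a single code matters because several reasons can be active in the same sample.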
 ### RDMA/InfiniBand
 This `test all` run exercises the single-node RDMA path, and `ibping` is reported as `local_loopback`. The result cannot substitute for cross-node RDMA acceptance, but it does show that single-node perftest read bandwidth is below target.
 | Host | ib_write_bw | ib_read_bw | ib_write_lat | ib_read_lat | Verdict |
 |---|---:|---:|---:|---:|---|
 | aikubeworker0012 | 49.5 GB/s PASS | 39.1 GB/s FAIL | 1.25 us PASS | 2.60 us PASS | FAIL |
 | aikubeworker0016 | 48.6 GB/s PASS | 40.3 GB/s FAIL | 1.29 us PASS | 2.59 us PASS | FAIL |
 In addition, on both machines `mlx5_4` and `mlx5_5` are ACTIVE but run at 100 Gb/sec, below the current 400G port threshold, so the RDMA port check also FAILs.
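As an illustration of the port check, the sketch below parses rate strings in the form printed in the reports (for example `400 Gb/sec (4X NDR)`) and applies the 400 Gb/s requirement. The function names and the exact matching rule are hypothetical; the script's real parser may differ. The last helper converts the 47 GB/s perftest threshold into Gb/s to show how it relates to the port rate.

```python
import re


def parse_rate_gbps(rate_text):
    """Extract the Gb/sec number from text such as '400 Gb/sec (4X NDR)'."""
    match = re.match(r"\s*([\d.]+)\s*Gb/sec", rate_text)
    if match is None:
        raise ValueError(f"unrecognized rate: {rate_text!r}")
    return float(match.group(1))


def port_ok(state_text, rate_text, required_gbps=400.0):
    """A port passes only if it is ACTIVE and meets the required rate."""
    return "ACTIVE" in state_text and parse_rate_gbps(rate_text) >= required_gbps


def gbytes_to_gbits(gbytes_per_s):
    """Convert GB/s to Gb/s (8 bits per byte)."""
    return gbytes_per_s * 8.0


ports = {
    "mlx5_0": ("4: ACTIVE", "400 Gb/sec (4X NDR)"),
    "mlx5_4": ("4: ACTIVE", "100 Gb/sec (2X HDR)"),
}
for device, (state, rate) in ports.items():
    print(device, "PASS" if port_ok(state, rate) else "FAIL")

# 47 GB/s corresponds to 376 Gb/s, i.e. 94% of a 400 Gb/s port.
print(gbytes_to_gbits(47.0))
```

A 100 Gb/sec port tops out at 12.5 GB/s, so it could never meet a 47 GB/s bandwidth threshold regardless of tuning.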
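 ## Current blockers
 1. The compute thresholds are strict: measured absolute TFLOPS misses the configured threshold for every dtype, and the INT8 path in particular reaches only about 100 TFLOPS.
 2. Real NCCL bus BW can now be measured, but multiple op/size combinations miss the PDF thresholds.
 3. The stress load sustains the full 30 minutes, but the temperature delta and `sw_power_cap` throttling cause a FAIL.
 4. Single-node RDMA read bandwidth is below 47 GB/s, and some IB ports run below 400G.
 5. Cross-node RDMA still needs separate server/client reports; this run's `local_loopback` must not be treated as cross-node acceptance.
 ## Status assessment
 The script's capabilities now largely match the PDF acceptance criteria. The following all run end to end: real nccl-tests, 30-minute stress telemetry, NVLink, DCGM r3, RDMA perftest/ibping/counters, per-GPU compute, 8-GPU DDP training, and the final rule that any FAIL fails the whole machine.
 The remaining problems are mostly not gaps in the script; the two machines' actual acceptance measurements miss the targets on multiple items.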
--- a/reports_test_all_pdf_aikubeworker0012_20260522_182656.md
+++ b/reports_test_all_pdf_aikubeworker0012_20260522_182656.md
@ -0,0 +1,259 @@
 # GPU Test Report
 - **Date:** 2026-05-22T18:27:01.103760
 - **Host:** aikubeworker0012
 - **GPU:** NVIDIA H100 80GB HBM3 x8
 - **Driver:** 580.159.03 | **CUDA:** 13.0
 ## Overall Acceptance Verdict
 **Result: FAIL**
 Failed or unverified items:
 - Compute Throughput: FAIL (worst FP32 52 vs >= 54)
 - DCGM: ERROR: dcgmi diag -r 3 timeout after 1200s
 - NCCL: FAIL
 - Stress Test: FAIL
 - RDMA: FAIL
 - Training: FAIL (188741 tokens/sec)
 ## Summary
 | Test | Result |
 |------|--------|
 | GPU Info | PASS (8 GPUs detected) |
 | Health Check | PASS |
 | Memory Bandwidth | PASS (108.1%) |
 | Compute Throughput | FAIL (worst FP32 52 vs >= 54) |
 | NVLink/NVSwitch | PASS |
 | DCGM | ERROR: dcgmi diag -r 3 timeout after 1200s |
 | NCCL | FAIL |
 | Stress Test | FAIL |
 | RDMA | FAIL |
 | Training | FAIL (188741 tokens/sec) |
 ## GPU Information
 | GPU | Model | VRAM | Temp | Power | SM Clock |
 |-----|-------|------|------|-------|----------|
 | 0 | NVIDIA H100 80GB HBM3 | 81559 MB | 25C | 70/700W | 345 MHz |
 | 1 | NVIDIA H100 80GB HBM3 | 81559 MB | 25C | 73/700W | 345 MHz |
 | 2 | NVIDIA H100 80GB HBM3 | 81559 MB | 26C | 69/700W | 345 MHz |
 | 3 | NVIDIA H100 80GB HBM3 | 81559 MB | 24C | 70/700W | 345 MHz |
 | 4 | NVIDIA H100 80GB HBM3 | 81559 MB | 24C | 69/700W | 345 MHz |
 | 5 | NVIDIA H100 80GB HBM3 | 81559 MB | 27C | 70/700W | 345 MHz |
 | 6 | NVIDIA H100 80GB HBM3 | 81559 MB | 25C | 71/700W | 345 MHz |
 | 7 | NVIDIA H100 80GB HBM3 | 81559 MB | 24C | 72/700W | 345 MHz |
 ## Health Check
 **Overall: PASS**
 | GPU | Temp | Power | ECC | PCIe | Throttle | Status |
 |-----|------|-------|-----|------|----------|--------|
 | 0 | 25C PASS | 70W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
 | 1 | 25C PASS | 73W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
 | 2 | 26C PASS | 69W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
 | 3 | 24C PASS | 70W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
 | 4 | 24C PASS | 69W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
 | 5 | 27C PASS | 70W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
 | 6 | 25C PASS | 71W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
 | 7 | 24C PASS | 72W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
 ## Memory Bandwidth
 Source: nvbandwidth
 | Metric | Value | Peak | Efficiency |
 |--------|-------|------|------------|
 | H2D (PCIe) | 55.5 GB/s | 64 GB/s | 86.7% |
 | D2H (PCIe) | 54.3 GB/s | 64 GB/s | 84.8% |
 | D2D (NVLink) | 486.6 GB/s | 450 GB/s | 108.1% |
 **Verdict: PASS** (D2D efficiency 108.1%)
 ## Compute Throughput
 | DType | Achieved (TFLOPS) | Peak | Threshold | Status |
 |-------|-------------------|------|------------|--------|
 | FP32 | 52.0 | 67 | >= 54 | FAIL |
 | TF32 | 364.8 | 495 | >= 444 | FAIL |
 | FP16 | 685.0 | 990 | >= 734 | FAIL |
 | BF16 | 715.9 | 990 | >= 745 | FAIL |
 | FP8 | 1166.6 | 1979 | >= 1400 | FAIL |
 | FP64 | 46.9 | 0 | >= 63 | FAIL |
 | INT8 | 100.4 | 0 | >= 1536 | FAIL |
 **Verdict: FAIL** (absolute TFLOPS thresholds; worst efficiency 58.9%)
 ### Compute Consistency
 | DType | Min | Mean | Max | Spread | Limit | Status |
 |-------|-----|------|-----|--------|-------|--------|
 | FP32 | 51.9 | 52.0 | 52.2 | 0.58% | <= 3% | PASS |
 | TF32 | 360.9 | 364.9 | 368.2 | 2.00% | <= 3% | PASS |
 | FP16 | 676.0 | 685.0 | 689.9 | 2.03% | <= 3% | PASS |
 | BF16 | 697.3 | 715.9 | 730.2 | 4.60% | <= 3% | FAIL |
 | FP8 | 1141.8 | 1166.6 | 1180.3 | 3.30% | <= 3% | FAIL |
 | FP64 | 45.8 | 46.9 | 47.7 | 4.05% | <= 3% | FAIL |
 | INT8 | 100.4 | 100.4 | 100.4 | 0.00% | <= 3% | PASS |
 ### Compute Per-GPU TFLOPS
 | GPU | FP32 | TF32 | FP16 | BF16 | FP8 | FP64 | INT8 |
 |---|---|---|---|---|---|---|---|
 | 0 | 51.9 | 368.2 | 689.5 | 730.2 | 1180.3 | 47.1 | 100.4 |
 | 1 | 51.9 | 366.8 | 688.7 | 721.6 | 1170.1 | 47.7 | 100.4 |
 | 2 | 51.9 | 366.3 | 689.9 | 711.3 | 1167.8 | 47.2 | 100.4 |
 | 3 | 51.9 | 363.0 | 677.6 | 699.2 | 1176.3 | 46.6 | 100.4 |
 | 4 | 52.2 | 365.3 | 685.0 | 725.4 | 1163.0 | 46.8 | 100.4 |
 | 5 | 52.1 | 363.9 | 684.2 | 725.0 | 1172.1 | 46.9 | 100.4 |
 | 6 | 51.9 | 364.4 | 688.8 | 717.3 | 1161.2 | 46.9 | 100.4 |
 | 7 | 51.9 | 360.9 | 676.0 | 697.3 | 1141.8 | 45.8 | 100.4 |
 ## NVLink/NVSwitch
 **Overall: PASS**
 | GPU | Active Links | Issues |
 |-----|--------------|--------|
 | 0 | 18/18 | OK |
 | 1 | 18/18 | OK |
 | 2 | 18/18 | OK |
 | 3 | 18/18 | OK |
 | 4 | 18/18 | OK |
 | 5 | 18/18 | OK |
 | 6 | 18/18 | OK |
 | 7 | 18/18 | OK |
 ## DCGM Diagnostic
 **Overall: FAIL** (dcgmi diag -r 3 timeout after 1200s)
 ## NCCL Multi-GPU
 Source: nccl-tests | GPUs: 8
 | Operation | Bus BW (GB/s) | Threshold | Status |
 |-----------|---------------|-----------|--------|
 | allreduce | 472.4 | >= 405 | FAIL |
 | alltoall | 344.4 | >= 315 | FAIL |
 | broadcast | 363.8 | >= 360 | FAIL |
 | reducescatter | 353.0 | >= 405 | FAIL |
 | allgather | 366.4 | >= 405 | FAIL |
 | sendrecv | 368.9 | >= 360 | FAIL |
 ### NCCL allreduce by size
 | Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
 |------|---------------------|-------|------|--------|-----------|--------|
 | 1M | 24.0, 24.9, 24.7 | 24.0 | 24.5 | 1.57% | >= 405 | FAIL |
 | 256M | 421.4, 421.7, 421.4 | 421.4 | 421.5 | 0.03% | >= 405 | PASS |
 | 2G | 471.8, 473.0, 472.3 | 471.8 | 472.4 | 0.10% | >= 405 | PASS |
 ### NCCL alltoall by size
 | Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
 |------|---------------------|-------|------|--------|-----------|--------|
 | 1M | 8.1, 8.0, 8.0 | 8.0 | 8.0 | 0.59% | >= 315 | FAIL |
 | 256M | 312.3, 310.9, 319.2 | 310.9 | 314.1 | 1.15% | >= 315 | FAIL |
 | 2G | 343.1, 346.2, 344.0 | 343.1 | 344.4 | 0.38% | >= 315 | PASS |
 ### NCCL broadcast by size
 | Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
 |------|---------------------|-------|------|--------|-----------|--------|
 | 1M | 14.6, 13.6, 14.5 | 13.6 | 14.2 | 3.16% | >= 360 | FAIL |
 | 256M | 343.8, 344.2, 344.5 | 343.8 | 344.2 | 0.08% | >= 360 | FAIL |
 | 2G | 363.5, 363.3, 364.7 | 363.3 | 363.8 | 0.17% | >= 360 | PASS |
 ### NCCL reducescatter by size
 | Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
 |------|---------------------|-------|------|--------|-----------|--------|
 | 1M | 14.1, 14.3, 14.3 | 14.1 | 14.2 | 0.66% | >= 405 | FAIL |
 | 256M | 328.1, 328.3, 328.3 | 328.1 | 328.2 | 0.03% | >= 405 | FAIL |
 | 2G | 354.0, 352.6, 352.3 | 352.3 | 353.0 | 0.21% | >= 405 | FAIL |
 ### NCCL allgather by size
 | Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
 |------|---------------------|-------|------|--------|-----------|--------|
 | 1M | 14.5, 14.5, 14.3 | 14.3 | 14.4 | 0.65% | >= 405 | FAIL |
 | 256M | 350.7, 350.7, 350.5 | 350.5 | 350.6 | 0.03% | >= 405 | FAIL |
 | 2G | 366.6, 366.3, 366.3 | 366.3 | 366.4 | 0.04% | >= 405 | FAIL |
 ### NCCL sendrecv by size
 | Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
 |------|---------------------|-------|------|--------|-----------|--------|
 | 1M | 18.5, 18.4, 18.1 | 18.1 | 18.3 | 0.93% | >= 360 | FAIL |
 | 256M | 352.3, 350.6, 350.5 | 350.5 | 351.1 | 0.24% | >= 360 | FAIL |
 | 2G | 368.8, 369.0, 368.8 | 368.8 | 368.9 | 0.03% | >= 360 | PASS |
 **Overall: FAIL**
 ## Stress Test
 - **Source:** pytorch
 - **Duration:** 1800s (requested 1800s)
 - **Telemetry samples:** 1541
 - **Max temp:** {0: 60.0, 1: 60.0, 2: 68.0, 3: 56.0, 4: 60.0, 5: 68.0, 6: 65.0, 7: 56.0}
 - **Avg power:** {0: 697.7, 1: 697.4, 2: 697.2, 3: 697.7, 4: 697.5, 5: 698.0, 6: 697.8, 7: 698.4}
 - **Temp delta:** 12.0 C
 - **TFLOPS jitter:** 3.16%
 - **Steady TFLOPS samples:** 37676
 - **Throttle events:** 11912
 - **XID events:** 0
 - **Failure reasons:**
  - GPU temperature delta 12.0C exceeds 5.0C
  - non-idle throttle reasons observed in 11912 samples (first: GPU 0 0x4)
 - **Result: FAIL**
 ## RDMA/InfiniBand
 ### RDMA Port Checks
 | Device | Port | State | Rate | Required | Status |
 |--------|------|-------|------|----------|--------|
 | mlx5_0 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
 | mlx5_1 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
 | mlx5_4 | 1 | 4: ACTIVE | 100 Gb/sec (2X HDR) | >= 400Gbps ACTIVE | FAIL |
 | mlx5_5 | 1 | 4: ACTIVE | 100 Gb/sec (2X HDR) | >= 400Gbps ACTIVE | FAIL |
 | mlx5_6 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
 | mlx5_7 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
 | Test | Value | Threshold | Status |
 |------|-------|-----------|--------|
 | ib_write_bw | 49.2 GB/s | >= 47 GB/s | PASS |
 | ib_read_bw | 39.1 GB/s | >= 47 GB/s | FAIL |
 | ib_write_lat | 5.68 us | <= 2 us | FAIL |
 | ib_read_lat | 16.00 us | <= 3.5 us | FAIL |
 | ibping | target=0x58 count=5 | 0% packet loss | PASS |
 - **PFC/ECN/CNP/congestion counters checked:** 0
 - **PFC/ECN/CNP/congestion non-zero:** no
 - **Failure reasons:**
  - mlx5_4 port 1 state/rate failed (4: ACTIVE, 100 Gb/sec (2X HDR); required >= 400.0Gbps ACTIVE)
  - mlx5_5 port 1 state/rate failed (4: ACTIVE, 100 Gb/sec (2X HDR); required >= 400.0Gbps ACTIVE)
  - ib_read_bw bandwidth 39.11GB/s < 47GB/s
  - ib_write_lat latency 5.68us > 2.0us
  - ib_read_lat latency 16.0us > 3.5us
 **Overall: FAIL**
 ## Training Simulation
 | Metric | Value |
 |--------|-------|
 | Model | synthetic_transformer_1.5b |
 | Params | 1470.5M |
 | Throughput | 188741 tokens/sec |
 | Avg Step Time | 86.8 ms |
 | Peak Memory | 18.1 GB |
 | Final Loss | 0.0041 |
 | Step Jitter | 626.74% |
 | Distributed Mode | ddp |
 | Verdict | FAIL (188741 tokens/sec) |
 ---
 *Generated by GPU Test Suite v0.2.0*
--- a/reports_test_all_pdf_aikubeworker0016_20260522_182856.md
+++ b/reports_test_all_pdf_aikubeworker0016_20260522_182856.md
@ -0,0 +1,259 @@
 # GPU Test Report
 - **Date:** 2026-05-22T18:29:01.245683
 - **Host:** aikubeworker0016
 - **GPU:** NVIDIA H100 80GB HBM3 x8
 - **Driver:** 580.159.03 | **CUDA:** 13.0
 ## Overall Acceptance Verdict
 **Result: FAIL**
 Failed or unverified items:
 - Compute Throughput: FAIL (worst FP32 52 vs >= 54)
 - DCGM: ERROR: dcgmi diag -r 3 timeout after 1200s
 - NCCL: FAIL
 - Stress Test: FAIL
 - RDMA: FAIL
 - Training: FAIL (193836 tokens/sec)
 ## Summary
 | Test | Result |
 |------|--------|
 | GPU Info | PASS (8 GPUs detected) |
 | Health Check | PASS |
 | Memory Bandwidth | PASS (108.1%) |
 | Compute Throughput | FAIL (worst FP32 52 vs >= 54) |
 | NVLink/NVSwitch | PASS |
 | DCGM | ERROR: dcgmi diag -r 3 timeout after 1200s |
 | NCCL | FAIL |
 | Stress Test | FAIL |
 | RDMA | FAIL |
 | Training | FAIL (193836 tokens/sec) |
 ## GPU Information
 | GPU | Model | VRAM | Temp | Power | SM Clock |
 |-----|-------|------|------|-------|----------|
 | 0 | NVIDIA H100 80GB HBM3 | 81559 MB | 19C | 70/700W | 345 MHz |
 | 1 | NVIDIA H100 80GB HBM3 | 81559 MB | 20C | 67/700W | 345 MHz |
 | 2 | NVIDIA H100 80GB HBM3 | 81559 MB | 20C | 67/700W | 345 MHz |
 | 3 | NVIDIA H100 80GB HBM3 | 81559 MB | 19C | 67/700W | 345 MHz |
 | 4 | NVIDIA H100 80GB HBM3 | 81559 MB | 19C | 67/700W | 345 MHz |
 | 5 | NVIDIA H100 80GB HBM3 | 81559 MB | 21C | 69/700W | 345 MHz |
 | 6 | NVIDIA H100 80GB HBM3 | 81559 MB | 19C | 68/700W | 345 MHz |
 | 7 | NVIDIA H100 80GB HBM3 | 81559 MB | 19C | 66/700W | 345 MHz |
 ## Health Check
 **Overall: PASS**
 | GPU | Temp | Power | ECC | PCIe | Throttle | Status |
 |-----|------|-------|-----|------|----------|--------|
 | 0 | 19C PASS | 70W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
 | 1 | 20C PASS | 67W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
 | 2 | 20C PASS | 67W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
 | 3 | 19C PASS | 67W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
 | 4 | 19C PASS | 67W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
 | 5 | 21C PASS | 69W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
 | 6 | 19C PASS | 68W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
 | 7 | 19C PASS | 66W PASS | S:0 D:0 | Gen5x16 | PASS | **PASS** |
 ## Memory Bandwidth
 Source: nvbandwidth
 | Metric | Value | Peak | Efficiency |
 |--------|-------|------|------------|
 | H2D (PCIe) | 55.5 GB/s | 64 GB/s | 86.7% |
 | D2H (PCIe) | 54.7 GB/s | 64 GB/s | 85.5% |
 | D2D (NVLink) | 486.6 GB/s | 450 GB/s | 108.1% |
 **Verdict: PASS** (D2D efficiency 108.1%)
 ## Compute Throughput
 | DType | Achieved (TFLOPS) | Peak | Threshold | Status |
 |-------|-------------------|------|------------|--------|
 | FP32 | 52.0 | 67 | >= 54 | FAIL |
 | TF32 | 366.2 | 495 | >= 444 | FAIL |
 | FP16 | 684.8 | 990 | >= 734 | FAIL |
 | BF16 | 720.7 | 990 | >= 745 | FAIL |
 | FP8 | 1180.3 | 1979 | >= 1400 | FAIL |
 | FP64 | 47.3 | 0 | >= 63 | FAIL |
 | INT8 | 100.5 | 0 | >= 1536 | FAIL |
 **Verdict: FAIL** (absolute TFLOPS thresholds; worst efficiency 59.6%)
 ### Compute Consistency
 | DType | Min | Mean | Max | Spread | Limit | Status |
 |-------|-----|------|-----|--------|-------|--------|
 | FP32 | 51.9 | 52.0 | 52.2 | 0.58% | <= 3% | PASS |
 | TF32 | 361.1 | 366.2 | 368.9 | 2.13% | <= 3% | PASS |
 | FP16 | 672.6 | 684.8 | 695.0 | 3.27% | <= 3% | FAIL |
 | BF16 | 703.6 | 720.7 | 734.2 | 4.25% | <= 3% | FAIL |
 | FP8 | 1158.6 | 1180.3 | 1241.8 | 7.05% | <= 3% | FAIL |
 | FP64 | 46.7 | 47.3 | 48.0 | 2.75% | <= 3% | PASS |
 | INT8 | 100.4 | 100.5 | 101.1 | 0.70% | <= 3% | PASS |
 ### Compute Per-GPU TFLOPS
 | GPU | FP32 | TF32 | FP16 | BF16 | FP8 | FP64 | INT8 |
 |---|---|---|---|---|---|---|---|
 | 0 | 51.9 | 361.1 | 673.3 | 703.6 | 1158.6 | 46.7 | 100.4 |
 | 1 | 52.0 | 367.0 | 684.0 | 725.7 | 1184.3 | 47.3 | 100.4 |
 | 2 | 52.2 | 368.7 | 695.0 | 734.2 | 1197.7 | 48.0 | 100.4 |
 | 3 | 51.9 | 367.8 | 688.0 | 708.1 | 1174.8 | 47.3 | 100.4 |
 | 4 | 52.0 | 365.2 | 688.4 | 718.2 | 1160.5 | 47.0 | 101.1 |
 | 5 | 52.1 | 368.9 | 684.2 | 733.7 | 1160.5 | 47.3 | 100.4 |
 | 6 | 51.9 | 364.0 | 672.6 | 715.6 | 1164.4 | 47.1 | 100.4 |
 | 7 | 51.9 | 367.0 | 692.5 | 726.5 | 1241.8 | 47.6 | 100.4 |
 ## NVLink/NVSwitch
 **Overall: PASS**
 | GPU | Active Links | Issues |
 |-----|--------------|--------|
 | 0 | 18/18 | OK |
 | 1 | 18/18 | OK |
 | 2 | 18/18 | OK |
 | 3 | 18/18 | OK |
 | 4 | 18/18 | OK |
 | 5 | 18/18 | OK |
 | 6 | 18/18 | OK |
 | 7 | 18/18 | OK |
 ## DCGM Diagnostic
 **Overall: FAIL** (dcgmi diag -r 3 timeout after 1200s)
 ## NCCL Multi-GPU
 Source: nccl-tests | GPUs: 8
 | Operation | Bus BW (GB/s) | Threshold | Status |
 |-----------|---------------|-----------|--------|
 | allreduce | 472.5 | >= 405 | FAIL |
 | alltoall | 344.2 | >= 315 | FAIL |
 | broadcast | 363.8 | >= 360 | FAIL |
 | reducescatter | 352.5 | >= 405 | FAIL |
 | allgather | 366.8 | >= 405 | FAIL |
 | sendrecv | 369.0 | >= 360 | FAIL |
 ### NCCL allreduce by size
 | Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
 |------|---------------------|-------|------|--------|-----------|--------|
 | 1M | 24.7, 24.1, 24.5 | 24.1 | 24.4 | 1.02% | >= 405 | FAIL |
 | 256M | 421.8, 422.1, 421.4 | 421.4 | 421.8 | 0.07% | >= 405 | PASS |
 | 2G | 472.8, 472.2, 472.6 | 472.2 | 472.5 | 0.05% | >= 405 | PASS |
 ### NCCL alltoall by size
 | Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
 |------|---------------------|-------|------|--------|-----------|--------|
 | 1M | 8.0, 8.0, 7.9 | 7.9 | 8.0 | 0.59% | >= 315 | FAIL |
 | 256M | 326.8, 315.4, 315.8 | 315.4 | 319.3 | 1.65% | >= 315 | PASS |
 | 2G | 344.2, 343.8, 344.6 | 343.8 | 344.2 | 0.09% | >= 315 | PASS |
 ### NCCL broadcast by size
 | Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
 |------|---------------------|-------|------|--------|-----------|--------|
 | 1M | 14.4, 14.2, 14.1 | 14.1 | 14.2 | 0.88% | >= 360 | FAIL |
 | 256M | 345.3, 344.9, 344.4 | 344.4 | 344.9 | 0.11% | >= 360 | FAIL |
 | 2G | 363.6, 363.9, 363.8 | 363.6 | 363.8 | 0.03% | >= 360 | PASS |
 ### NCCL reducescatter by size
 | Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
 |------|---------------------|-------|------|--------|-----------|--------|
 | 1M | 14.3, 14.1, 14.1 | 14.1 | 14.2 | 0.67% | >= 405 | FAIL |
 | 256M | 328.2, 328.3, 328.4 | 328.2 | 328.3 | 0.02% | >= 405 | FAIL |
 | 2G | 352.2, 352.7, 352.6 | 352.2 | 352.5 | 0.06% | >= 405 | FAIL |
 ### NCCL allgather by size
 | Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
 |------|---------------------|-------|------|--------|-----------|--------|
 | 1M | 14.2, 14.5, 14.3 | 14.2 | 14.3 | 0.87% | >= 405 | FAIL |
 | 256M | 350.6, 350.6, 350.5 | 350.5 | 350.6 | 0.01% | >= 405 | FAIL |
 | 2G | 367.0, 366.8, 366.5 | 366.5 | 366.8 | 0.06% | >= 405 | FAIL |
 ### NCCL sendrecv by size
 | Size | Runs Bus BW (GB/s) | Worst | Mean | StdDev | Threshold | Status |
 |------|---------------------|-------|------|--------|-----------|--------|
 | 1M | 18.4, 18.2, 18.6 | 18.2 | 18.4 | 0.89% | >= 360 | FAIL |
 | 256M | 350.7, 350.8, 351.1 | 350.7 | 350.9 | 0.05% | >= 360 | FAIL |
 | 2G | 369.0, 369.0, 368.9 | 368.9 | 369.0 | 0.01% | >= 360 | PASS |
 **Overall: FAIL**
 ## Stress Test
 - **Source:** pytorch
 - **Duration:** 1800s (requested 1800s)
 - **Telemetry samples:** 1541
 - **Max temp:** {0: 51.0, 1: 59.0, 2: 62.0, 3: 53.0, 4: 53.0, 5: 62.0, 6: 57.0, 7: 53.0}
 - **Avg power:** {0: 698.7, 1: 698.0, 2: 698.1, 3: 697.9, 4: 697.7, 5: 698.2, 6: 698.0, 7: 697.7}
 - **Temp delta:** 11.0 C
 - **TFLOPS jitter:** 3.05%
 - **Steady TFLOPS samples:** 37841
 - **Throttle events:** 11912
 - **XID events:** 0
 - **Failure reasons:**
  - GPU temperature delta 11.0C exceeds 5.0C
  - non-idle throttle reasons observed in 11912 samples (first: GPU 0 0x4)
 - **Result: FAIL**
 ## RDMA/InfiniBand
 ### RDMA Port Checks
 | Device | Port | State | Rate | Required | Status |
 |--------|------|-------|------|----------|--------|
 | mlx5_0 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
 | mlx5_1 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
 | mlx5_4 | 1 | 4: ACTIVE | 100 Gb/sec (2X HDR) | >= 400Gbps ACTIVE | FAIL |
 | mlx5_5 | 1 | 4: ACTIVE | 100 Gb/sec (2X HDR) | >= 400Gbps ACTIVE | FAIL |
 | mlx5_6 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
 | mlx5_7 | 1 | 4: ACTIVE | 400 Gb/sec (4X NDR) | >= 400Gbps ACTIVE | PASS |
 | Test | Value | Threshold | Status |
 |------|-------|-----------|--------|
 | ib_write_bw | 48.4 GB/s | >= 47 GB/s | PASS |
 | ib_read_bw | 40.3 GB/s | >= 47 GB/s | FAIL |
 | ib_write_lat | 2.44 us | <= 2 us | FAIL |
 | ib_read_lat | 16.00 us | <= 3.5 us | FAIL |
 | ibping | target=0x4b count=5 | 0% packet loss | PASS |
 - **PFC/ECN/CNP/congestion counters checked:** 0
 - **PFC/ECN/CNP/congestion non-zero:** no
 - **Failure reasons:**
  - mlx5_4 port 1 state/rate failed (4: ACTIVE, 100 Gb/sec (2X HDR); required >= 400.0Gbps ACTIVE)
  - mlx5_5 port 1 state/rate failed (4: ACTIVE, 100 Gb/sec (2X HDR); required >= 400.0Gbps ACTIVE)
  - ib_read_bw bandwidth 40.29GB/s < 47GB/s
  - ib_write_lat latency 2.44us > 2.0us
  - ib_read_lat latency 16.0us > 3.5us
 **Overall: FAIL**
 ## Training Simulation
 | Metric | Value |
 |--------|-------|
 | Model | synthetic_transformer_1.5b |
 | Params | 1470.5M |
 | Throughput | 193836 tokens/sec |
 | Avg Step Time | 84.5 ms |
 | Peak Memory | 18.1 GB |
 | Final Loss | 0.004 |
 | Step Jitter | 521.24% |
 | Distributed Mode | ddp |
 | Verdict | FAIL (193836 tokens/sec) |
 ---
 *Generated by GPU Test Suite v0.2.0*
--- a/reports_training_warmup_aikubeworker0012_20260522_194528.md
+++ b/reports_training_warmup_aikubeworker0012_20260522_194528.md
@ -0,0 +1,43 @@
 # GPU Test Report
 - **Date:** 2026-05-22T19:46:07.450315
 - **Host:** aikubeworker0012
 ## Overall Acceptance Verdict
 **Result: FAIL**
 Missing required evidence:
 - GPU Info
 - Health Check
 - Memory Bandwidth
 - Compute Throughput
 - NVLink/NVSwitch
 - NCCL
 - Stress Test
 - RDMA
 - DCGM
 ## Summary
 | Test | Result |
 |------|--------|
 | Training | PASS (216654 tokens/sec) |
 ## Training Simulation
 | Metric | Value |
 |--------|-------|
 | Model | synthetic_transformer_1.5b |
 | Params | 1470.5M |
 | Throughput | 216654 tokens/sec |
 | Avg Step Time | 75.6 ms |
 | Warmup Steps | 5 |
 | Peak Memory | 18.1 GB |
 | Final Loss | 0.0039 |
 | Step Jitter | 0.87% |
 | Distributed Mode | ddp |
 | Verdict | PASS (216654 tokens/sec) |
 ---
 *Generated by GPU Test Suite v0.2.0*
--- a/reports_training_warmup_aikubeworker0016_20260522_194609.md
+++ b/reports_training_warmup_aikubeworker0016_20260522_194609.md
@ -0,0 +1,43 @@
 # GPU Test Report
 - **Date:** 2026-05-22T19:46:48.023650
 - **Host:** aikubeworker0016
 ## Overall Acceptance Verdict
 **Result: FAIL**
 Missing required evidence:
 - GPU Info
 - Health Check
 - Memory Bandwidth
 - Compute Throughput
 - NVLink/NVSwitch
 - NCCL
 - Stress Test
 - RDMA
 - DCGM
 ## Summary
 | Test | Result |
 |------|--------|
 | Training | PASS (217236 tokens/sec) |
 ## Training Simulation
 | Metric | Value |
 |--------|-------|
 | Model | synthetic_transformer_1.5b |
 | Params | 1470.5M |
 | Throughput | 217236 tokens/sec |
 | Avg Step Time | 75.4 ms |
 | Warmup Steps | 5 |
 | Peak Memory | 18.1 GB |
 | Final Loss | 0.0039 |
 | Step Jitter | 1.23% |
 | Distributed Mode | ddp |
 | Verdict | PASS (217236 tokens/sec) |
 ---
 *Generated by GPU Test Suite v0.2.0*
--- a/test_all_aikubeworker0016_中文结果与验收差距.md
+++ b/test_all_aikubeworker0016_中文结果与验收差距.md
@ -0,0 +1,73 @@
 # aikubeworker0016 `test all` results and gaps against H100 acceptance
 Test command:
 ```bash
 /root/gpu-test-venv/bin/python gpu_tester.py --test all --report --format json --output reports_all/test_all.json
 ```
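 Test host: `aikubeworker0016 / 172.72.8.16`
 Raw results: `reports_all_aikubeworker0016.json`
 ## Conclusion first
 The run ends with `Suite complete: 8/8 tests passed`, but that line cannot be taken as a production-acceptance PASS.
 The reason is that the current `all` summary logic mainly checks whether a module raised an `error`; it does not treat `nccl.passed=false` or `rdma.passed=false` as a suite failure. Under the PDF production-acceptance criteria, this machine therefore cannot yet be counted as fully accepted.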
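A stricter aggregation could look like the following minimal sketch. It assumes each module result is a dict with optional `passed` and `error` keys, which is an assumption about the JSON layout rather than the script's confirmed schema; `overall_verdict` and the `REQUIRED` list are hypothetical names. Missing modules, errors, and `passed=false` all count against the machine.

```python
REQUIRED = [
    "gpu_info", "health", "memory", "compute", "nvlink",
    "dcgm", "nccl", "stress", "rdma", "training",
]


def overall_verdict(results, required=REQUIRED):
    """Return ('PASS' | 'FAIL', problems) under an any-FAIL-fails-all rule."""
    problems = []
    for name in required:
        module = results.get(name)
        if module is None:
            problems.append(f"{name}: missing required evidence")
        elif module.get("error"):
            problems.append(f"{name}: error ({module['error']})")
        elif module.get("passed") is not True:
            # passed=False or an absent flag is not treated as a pass.
            problems.append(f"{name}: passed={module.get('passed')}")
    return ("PASS" if not problems else "FAIL"), problems


# Hypothetical result set: no module raised an error, but two did not pass.
example = {name: {"passed": True} for name in REQUIRED}
example["nccl"] = {"passed": False}
example["rdma"] = {"passed": False}
del example["dcgm"]

verdict, problems = overall_verdict(example)
print(verdict)
for line in problems:
    print(" -", line)
```

Requiring `passed is True` explicitly means a module that reports nothing cannot silently count as a pass.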
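 ## Actual results of this `test all` run
 | Module | Current result | Key data | Against PDF acceptance |
 | --- | --- | --- | --- |
 | GPU info | Covered | 8x H100, Driver 580.159.03, CUDA 13.0 | Basic info OK, but no dedicated NVLink link check |
 | Health check | PASS | health.passed=true | Basic health OK, but missing retired pages, AER/Replay, fabricmanager logs, and sampling during stress |
 | Memory | Has results | H2D 55.5 GB/s, D2H 55.3 GB/s, D2D 486.5 GB/s | Individual numbers look good, but the 8x8 P2P matrix check is missing |
 | Compute | Has results | FP32 51.9, TF32 357.0, FP16 664.0, BF16 700.1, FP8 1116.2 TFLOPS | Does not pass all PDF absolute thresholds |
 | NCCL | Effectively failing | source=torchrun_fallback, `nccl.passed=false`, no bus BW performance data | Does not meet PDF NCCL performance acceptance |
 | Stress | PASS | PyTorch fallback, 60 s, all 8 GPUs PASS | Does not meet the PDF 30/60-minute burn-in; the load is only about 64MB per GPU, clearly insufficient stress |
 | RDMA/IB | Effectively failing | ib_write_bw/read_bw 0.13 GB/s WARN; write_lat 4.10us PASS; read_lat 16us WARN | Currently a localhost single-node setup; does not meet PDF RDMA production acceptance |
 | Training | Has results | synthetic 1.47B, 52471 tokens/s, peak 27.31GB, loss 0.0041 | tokens/s clears the bar, but the code is not actually an 8-GPU distributed training acceptance test |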
 ## Compute against the PDF thresholds
 PDF H100 PASS thresholds:
 | DType | This run | PDF PASS threshold | Verdict |
 | --- | ---: | ---: | --- |
 | FP32 | 51.9 TFLOPS | >= 54 | WARN |
 | TF32 | 357.0 TFLOPS | >= 444 | FAIL |
 | FP16 | 664.0 TFLOPS | >= 734 | WARN |
 | BF16 | 700.1 TFLOPS | >= 745 | WARN |
 | FP8 | 1116.2 TFLOPS | >= 1400 | FAIL |
 | FP64 | Not tested | >= 63 | Missing |
 | INT8 | Not tested | >= 1536 | Missing |
 Note: in the PDF, the WARN band is 90%-100% of the PASS threshold. TF32 (about 80.4% of its threshold) and FP8 (about 79.7%) fall below the 90% line, so under the PDF they are FAIL.
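The three-way judgment in the table can be reproduced with a small sketch. It assumes the WARN band is inclusive at 90% of the PASS threshold and that an untested dtype is reported as missing; the function name `classify` and the `MISSING` label are hypothetical.

```python
def classify(value_tflops, pass_threshold, warn_ratio=0.9):
    """Classify a measured TFLOPS value against a PDF PASS threshold."""
    if value_tflops is None:
        return "MISSING"
    if value_tflops >= pass_threshold:
        return "PASS"
    if value_tflops >= warn_ratio * pass_threshold:
        return "WARN"
    return "FAIL"


# (measured, PASS threshold) pairs taken from the table above.
measurements = {
    "FP32": (51.9, 54),
    "TF32": (357.0, 444),
    "FP16": (664.0, 734),
    "BF16": (700.1, 745),
    "FP8": (1116.2, 1400),
    "FP64": (None, 63),
    "INT8": (None, 1536),
}

verdicts = {dtype: classify(value, thr) for dtype, (value, thr) in measurements.items()}
for dtype, verdict in verdicts.items():
    print(f"{dtype}: {verdict}")
```

FP16 is the closest call here: 664.0 TFLOPS sits just above the 660.6 TFLOPS WARN floor (90% of 734).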
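 ## What is missing if only the current repository's `test all` is run
 1. No dedicated NVLink acceptance: no per-GPU check of 18 links, 25GB/s rate, and CRC/Replay/Recovery errors = 0.
 2. No DCGM diagnostic: `dcgmi diag -r 3` is not run.
 3. No long burn-in: currently 60 seconds, not 30/60 minutes.
 4. No 1-second sampling during stress: temperature, power, throttle, XID, and TFLOPS jitter are not collected the way the PDF requires.
 5. No real NCCL performance: the run degrades to the torchrun fallback, with no `nccl-tests` bus BW.
 6. No full NCCL op and message-size coverage: the PDF requires AllReduce/AllGather/ReduceScatter/Broadcast/SendRecv/AllToAll, each clearing the bar at 1MB/256MB/2GB.
 7. No NCCL 3-run repetition taking the worst value, with standard deviation <= 3%.
 8. No full 8x8 P2P matrix: no off-diagonal mean, minimum, or deviation check.
 9. No per-GPU compute consistency: the 8 GPUs are not actually measured separately to check (max - min) / mean <= 3% per dtype.
 10. No FP64 or INT8.
 11. No production-grade RDMA: currently `localhost`, 64KB messages, 10us threshold; the PDF requires 4MB BW, 8B latency, write/read >= 47GB/s, write_lat <= 2us, read_lat <= 3.5us.
 12. No PFC/ECN error counters or bidirectional ibping.
 13. No true 8-GPU distributed Training Simulation acceptance.
 14. No strict final verdict: the current code counts modules with `passed=false` as "passed", which is a hole in the acceptance logic.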
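 ## Recommendations
 `test all` can keep running as a quick first-pass screen, but to align with `H100_production_acceptance.pdf` it needs to be upgraded into a "production acceptance mode". In priority order:
 1. First fix the summary verdict: any sub-module with `passed=false` must make the whole machine FAIL.
 2. Install `nccl-tests` and `gpu-burn` first; without them neither NCCL nor Stress is production-grade.
 3. Add NVLink, DCGM, long-duration telemetry, and the P2P matrix.
 4. Switch RDMA to production parameters, with cross-node support.
 5. Change compute/training to per-GPU / 8-GPU distributed acceptance.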