Record multinode NCCL PDF matrix run

2026-05-23 19:30:14 +08:00 · 2026-05-23 19:30:14 +08:00 · c73d738557
commit c73d738557
parent 8923270ce0
5 changed files with 205 additions and 10 deletions
--- a/modules/report.py
+++ b/modules/report.py
@@ -750,8 +750,14 @@ class ReportGenerator:

    @staticmethod
    def _overall_acceptance_verdict(summary_items: list[tuple[str, str]]) -> tuple[str, list[tuple[str, str]], list[str]]:
-        """PDF-style machine verdict: every required item must be present and PASS."""
-        required = [
+        """PDF-style verdict for the report scope.
+
+        Full-suite reports require every single-node acceptance item. Standalone
+        reports, such as `--test multinode-nccl`, should only judge the items
+        that were actually requested instead of reporting unrelated evidence as
+        missing.
+        """
+        single_node_required = [
            "GPU Info",
            "Health Check",
            "Memory Bandwidth",
@@ -764,6 +770,13 @@ class ReportGenerator:
            "Training",
        ]
        status_by_name = dict(summary_items)
+        present_single_node = [name for name in single_node_required if name in status_by_name]
+        if len(present_single_node) >= 3:
+            required = list(single_node_required)
+            if "Multi-node NCCL" in status_by_name:
+                required.append("Multi-node NCCL")
+        else:
+            required = list(status_by_name)
        missing = [name for name in required if name not in status_by_name]
        failures = [
            (name, status)
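
For review, the scope selection added in this hunk can be exercised in isolation. The following is a minimal sketch, not the code in `modules/report.py`: `required_items` is a hypothetical helper name, and the middle entries of `SINGLE_NODE_REQUIRED` are an assumption taken from the "Missing required evidence" list in the raw report, because the hunk elides them.

```python
# Assumed full list: the diff only shows the first three entries and the last one.
SINGLE_NODE_REQUIRED = [
    "GPU Info", "Health Check", "Memory Bandwidth", "Compute Throughput",
    "NVLink/NVSwitch", "NCCL", "Stress Test", "RDMA", "DCGM", "Training",
]


def required_items(summary_items):
    """Return the item names the verdict should judge for this report scope."""
    status_by_name = dict(summary_items)
    present = [name for name in SINGLE_NODE_REQUIRED if name in status_by_name]
    if len(present) >= 3:
        # Looks like a full-suite report: require every single-node item,
        # plus multi-node NCCL only if it was actually run.
        required = list(SINGLE_NODE_REQUIRED)
        if "Multi-node NCCL" in status_by_name:
            required.append("Multi-node NCCL")
    else:
        # Standalone report: judge only what was requested.
        required = list(status_by_name)
    return required


STANDALONE = required_items([("Multi-node NCCL", "FAIL")])
FULL = required_items(
    [("GPU Info", "PASS"), ("Health Check", "PASS"), ("NCCL", "PASS")]
)
print(STANDALONE)
print(len(FULL))
```

With only `Multi-node NCCL` in the summary, fewer than three single-node items are present, so the verdict judges just that item instead of listing ten unrelated ones as missing.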
--- a/reports_multinode_nccl_handoff_plan_20260523.md
+++ b/reports_multinode_nccl_handoff_plan_20260523.md
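@@ -11,8 +11,9 @@
 | Four 400G IB rails are usable by NCCL on both machines | `mlx5_0,mlx5_1,mlx5_6,mlx5_7` are all `400 Gb/sec (4X NDR)` |
 | The other HCAs are not equivalent | `mlx5_4/5` are 100G IB, `mlx5_2/8` are 25G Ethernet, `mlx5_3/9` are DOWN |
 | NCCL 2.27.7 GDR is available | GDR enabled in the GRAPH/NET logs |
-| allreduce is already close to the physical ceiling of the current 4 rails | `354 GB/s busbw`, which back-calculates to `189 GB/s algbw`, close to the `200 GB/s` one-way raw bandwidth of 4 x 400G |
-| alltoall is balanced across rails with PXN disabled but still low | `36-37 GB/s busbw`, about `19-20 GB/s` per rail |
+| allreduce is already close to the physical ceiling of the current 4 rails | Latest PDF matrix 2x8: `354.56 GB/s busbw`, which back-calculates to `189.10 GB/s algbw`, close to the `200 GB/s` one-way raw bandwidth of 4 x 400G |
+| alltoall is balanced across rails with PXN disabled but still low | Latest PDF matrix 2x8: `36.82 GB/s busbw`, about `19-20 GB/s` per rail |
+| The formal PDF matrix has been rerun | `reports_multinode_nccl_pdf_matrix_20260523_112247.md`; every case passes correctness but FAILs the performance threshold |
 | No hard errors observed | No growth in discard, RoCE retrans, slow restart, packet sequence error, or similar counters |
 | External NCCL network components are currently missing | `libnccl-net*.so*` / `libsharp*.so*` not found; no SHARP/HCOLL packages seen |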

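@@ -61,8 +62,19 @@ busbw = algbw * 1.875

 Recommendation: re-evaluate the explainable target for the current 2x8 allreduce against the physical capability of 4 x 400G rails:

- allreduce currently reaches `354 GB/s busbw`, which back-calculates to `189 GB/s algbw`, close to the `200 GB/s` one-way raw ceiling.
- alltoall at `36-37 GB/s` is still low and needs further investigation as a separate issue.
+- allreduce currently reaches `354.56 GB/s busbw`, which back-calculates to `189.10 GB/s algbw`, close to the `200 GB/s` one-way raw ceiling.
+- alltoall at `36.82 GB/s` is still low and needs further investigation as a separate issue.
+
+## Latest PDF matrix results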
+
+| Topology | AllReduce | AllReduce Target | AllToAll | AllToAll Target |
+|---|---:|---:|---:|---:|
+| 2 nodes x 1 GPU | `47.15` | `48.90` | `24.85` | `27.25` |
+| 2 nodes x 2 GPUs | `136.62` | `136.93` | `47.71` | `54.41` |
+| 2 nodes x 4 GPUs | `335.19` | `335.48` | `72.63` | `73.73` |
+| 2 nodes x 8 GPUs | `354.56` | `491.84` | `36.82` | `76.54` |
+
+Every case returned code `0`, and NCCL `Out of bounds values` is `0 OK`. This round's FAIL is therefore a performance-threshold failure, not an NCCL correctness or launch-chain failure.

 ### C. To continue optimizing alltoall

@@ -154,6 +166,8 @@ OUT_DIR=/root/test_gpu_scripts/reports/nccl_deep_diag_plugin_check_$(date +%Y%m%
 | File | Purpose |
 |---|---|
 | `reports_multinode_nccl_diagnosis_20260523.md` | Overall diagnosis report |
+| `reports_multinode_nccl_pdf_matrix_20260523_112247.md` | Raw report of the latest multi-node multi-GPU PDF matrix |
+| `reports_multinode_nccl_pdf_matrix_run_20260523.md` | Summary of the latest multi-node multi-GPU PDF matrix |
 | `reports_multinode_nccl_deep_diagnose_run_20260523.md` | Deep rerun results for this round |
 | `reports_multinode_nccl_environment_gap_20260523.md` | Hardware/software environment equivalence gaps |
 | `reports_multinode_nccl_counter_probe_20260523.md` | RDMA rail/counter evidence |
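
The back-calculation quoted in this file's evidence table (`354.56 GB/s busbw` to `189.10 GB/s algbw`) follows from the nccl-tests convention that allreduce bus bandwidth is algorithm bandwidth scaled by 2(n-1)/n, which is 1.875 for the 16 ranks of the 2x8 case. A small sketch of the arithmetic, assuming that convention and treating 400 Gb/s per rail as 50 GB/s of raw one-way bandwidth; the function name is illustrative only:

```python
def allreduce_busbw_factor(nranks: int) -> float:
    """Ratio busbw / algbw that nccl-tests uses for allreduce."""
    return 2 * (nranks - 1) / nranks


NRANKS = 2 * 8                      # 2 nodes x 8 GPUs
MEASURED_BUSBW = 354.56             # GB/s, from the latest PDF matrix run

FACTOR = allreduce_busbw_factor(NRANKS)
ALGBW = MEASURED_BUSBW / FACTOR     # implied algorithm bandwidth, GB/s

RAILS = 4
RAIL_GBIT = 400                     # Gb/s per NDR rail
RAW_CEILING = RAILS * RAIL_GBIT / 8  # GB/s one way, before protocol overhead

print(f"factor={FACTOR}, algbw={ALGBW:.2f} GB/s, raw ceiling={RAW_CEILING:.0f} GB/s")
print(f"algbw / ceiling = {ALGBW / RAW_CEILING:.1%}")
```

The raw ceiling ignores encoding and protocol overhead, so the achievable figure is somewhat lower than 200 GB/s; the sketch is only meant to show why the measured allreduce is read as near the 4-rail limit.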
--- a/reports_multinode_nccl_latest_index_20260523.md
+++ b/reports_multinode_nccl_latest_index_20260523.md
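@@ -6,10 +6,11 @@

 Current conclusions:

- After the GPU-NIC affinity fix, the 2-machine 4-GPU tier is close to the PDF reference value.
+- The formal multi-node multi-GPU PDF matrix rerun completed on 2026-05-23 at `11:22`; the raw report is `reports_multinode_nccl_pdf_matrix_20260523_112247.md`, and the summary of conclusions is `reports_multinode_nccl_pdf_matrix_run_20260523.md`.
+- The 2-machine tiers with 1/2/4 GPUs per node are close to the PDF reference values but still FAIL when the thresholds are applied strictly.
 - The 2-machine 8-GPU tier still falls short of the PDF reference values:
-  - allreduce currently about `354 GB/s busbw`; PDF target `491.84 GB/s`.
-  - alltoall currently about `36-37 GB/s busbw`; PDF target `76.54 GB/s`.
+  - allreduce measured `354.56 GB/s busbw`; PDF target `491.84 GB/s`.
+  - alltoall measured `36.82 GB/s busbw`; PDF target `76.54 GB/s`.
 - The remaining 2-machine 8-GPU gap no longer looks like an old-NCCL, GDR-disabled, HCA-ordering, SSH/mpirun, or obviously-bad-link problem.
 - It now looks more like a hardware rail count that is not equivalent to the PDF environment, a missing NCCL net plugin / SHARP, or a cross-Leaf alltoall network/graph-policy issue.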

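@@ -19,7 +20,8 @@
 |---:|---|---|
 | 1 | `reports_multinode_nccl_handoff_plan_20260523.md` | Handoff plan for the network/hardware/environment teams, including a decision tree, questions to ask, and rerun commands |
 | 2 | `reports_multinode_nccl_environment_gap_20260523.md` | Explains why the current environment cannot be shown to be equivalent to the PDF environment, focusing on the 4 x 400G rails and the missing NCCL net plugin / SHARP |
-| 3 | `reports_multinode_nccl_deep_diagnose_run_20260523.md` | Full deep-diagnosis rerun results for this round, including counter, GRAPH, and PXN sweep |
+| 3 | `reports_multinode_nccl_pdf_matrix_run_20260523.md` | Summary of the latest formal multi-node multi-GPU PDF matrix results |
+| 4 | `reports_multinode_nccl_deep_diagnose_run_20260523.md` | Full deep-diagnosis rerun results for this round, including counter, GRAPH, and PXN sweep |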

 ## Key scripts

@@ -107,6 +109,14 @@ aikubeworker0012: /root/test_gpu_scripts/reports/nccl_environment_snapshot_aikub
 aikubeworker0016: /root/test_gpu_scripts/reports/nccl_environment_snapshot_aikubeworker0016_20260523_111143.md
 ```

+Latest multi-node multi-GPU PDF matrix:
+
+```text
+aikubeworker0012: /root/test_gpu_scripts/reports/multinode_nccl_pdf_matrix_20260523_112247.md
+local copy: reports_multinode_nccl_pdf_matrix_20260523_112247.md
+summary: reports_multinode_nccl_pdf_matrix_run_20260523.md
+```
+
 ## Current evidence summary

 ### HCA / rail
@@ -142,6 +152,15 @@ libsharp*.so*

 ### Deep diagnosis

+Formal PDF matrix rerun:
+
+| Topology | AllReduce | AllReduce Target | AllToAll | AllToAll Target |
+|---|---:|---:|---:|---:|
+| 2 nodes x 1 GPU | `47.15` | `48.90` | `24.85` | `27.25` |
+| 2 nodes x 2 GPUs | `136.62` | `136.93` | `47.71` | `54.41` |
+| 2 nodes x 4 GPUs | `335.19` | `335.48` | `72.63` | `73.73` |
+| 2 nodes x 8 GPUs | `354.56` | `491.84` | `36.82` | `76.54` |
+
 Full rerun for this round:

 | Item | Result |
@@ -162,6 +181,8 @@ PXN disabled sweep found no effective parameter:
 |---|---|
 | `reports_multinode_nccl_diagnosis_20260523.md` | Long-form overall diagnosis, covering the whole path from old NCCL / GDR disabled to alignment with the PDF matrix |
 | `reports_multinode_nccl_pdf_matrix_nccl227.md` | Formal raw report produced by running the PDF matrix |
+| `reports_multinode_nccl_pdf_matrix_20260523_112247.md` | Raw report of the latest formal PDF matrix |
+| `reports_multinode_nccl_pdf_matrix_run_20260523.md` | Summary of the latest formal PDF matrix |
 | `reports_multinode_nccl_counter_probe_20260523.md` | RDMA rail and counter evidence |
 | `reports_multinode_nccl_alltoall_tuning_20260523.md` | alltoall PXN and parameter sweep conclusions |
 | `reports_rdma_single_node_summary.md` | Single-node RDMA/HCA rate summary |
--- a/reports_multinode_nccl_pdf_matrix_20260523_112247.md
+++ b/reports_multinode_nccl_pdf_matrix_20260523_112247.md
@@ -0,0 +1,84 @@
+# GPU Test Report
+
+- **Date:** 2026-05-23T11:26:21.306224
+- **Host:** aikubeworker0012
+
+## Overall Acceptance Verdict
+
+**Result: FAIL**
+
+Missing required evidence:
+- GPU Info
+- Health Check
+- Memory Bandwidth
+- Compute Throughput
+- NVLink/NVSwitch
+- NCCL
+- Stress Test
+- RDMA
+- DCGM
+- Training
+
+## Summary
+
+| Test | Result |
+|------|--------|
+| Multi-node NCCL | FAIL |
+
+## Multi-node NCCL / Cross Leaf
+
+Source: nccl-tests-mpirun | Mode: cross-leaf-pdf-matrix-nccl-2.27.7
+
+- **Hosts:** nccl-gpu-1(172.72.8.12), nccl-gpu-2(172.72.8.16)
+- **Preflight:** PASS
+
+### Multi-node NCCL allreduce
+
+| Topology | CUDA Visible Devices | Peak Bus BW | Peak Size | Avg Bus BW | Threshold | Status |
+|----------|----------------------|-------------|-----------|------------|-----------|--------|
+| 2 nodes x 1 GPU (PDF 2 machines 2 GPUs) | - | 47.15 GB/s | 16G | 47.18 GB/s | >= 49 GB/s | FAIL |
+| 2 nodes x 2 GPUs (PDF 2 machines 4 GPUs) | - | 136.62 GB/s | 16G | 136.67 GB/s | >= 137 GB/s | FAIL |
+| 2 nodes x 4 GPUs (PDF 2 machines 8 GPUs) | 0,1,4,5 | 335.19 GB/s | 16G | 334.85 GB/s | >= 335 GB/s | FAIL |
+| 2 nodes x 8 GPUs (PDF 2 machines 16 GPUs) | - | 354.56 GB/s | 16G | 354.21 GB/s | >= 492 GB/s | FAIL |
+
+| Topology | NCCL Network | GPU Direct RDMA | GDR Enabled HCAs | GDR Disabled HCAs |
+|----------|--------------|-----------------|------------------|-------------------|
+| 2 nodes x 1 GPU (PDF 2 machines 2 GPUs) | IB | ENABLED | mlx5_0, mlx5_1, mlx5_6, mlx5_7 | - |
+| 2 nodes x 2 GPUs (PDF 2 machines 4 GPUs) | IB | ENABLED | mlx5_0, mlx5_1, mlx5_6, mlx5_7 | - |
+| 2 nodes x 4 GPUs (PDF 2 machines 8 GPUs) | IB | ENABLED | mlx5_0, mlx5_1, mlx5_6, mlx5_7 | - |
+| 2 nodes x 8 GPUs (PDF 2 machines 16 GPUs) | IB | ENABLED | mlx5_0, mlx5_1, mlx5_6, mlx5_7 | - |
+
+| Topology | Return Code | Error / Output Tail |
+|----------|-------------|---------------------|
+| 2 nodes x 1 GPU (PDF 2 machines 2 GPUs) | 0 | ranks 2 cudaDev 0 busId 18000 - Destroy COMPLETE aikubeworker0016:1321368:1321509 [0] NCCL INFO comm 0x56428b645570 rank 1 nranks 2 cudaDev 0 busId 18000 - Destroy COMPLETE # Out of bounds values : 0 OK # Avg bus bandwidth    : 47.1841  #   |
+| 2 nodes x 2 GPUs (PDF 2 machines 4 GPUs) | 0 | ranks 4 cudaDev 1 busId 2a000 - Destroy COMPLETE aikubeworker0012:2199872:2199936 [0] NCCL INFO comm 0x561da4512280 rank 0 nranks 4 cudaDev 0 busId 18000 - Destroy COMPLETE # Out of bounds values : 0 OK # Avg bus bandwidth    : 136.668  #   |
+| 2 nodes x 4 GPUs (PDF 2 machines 8 GPUs) | 0 | ranks 8 cudaDev 0 busId 18000 - Destroy COMPLETE aikubeworker0016:1321707:1321805 [0] NCCL INFO comm 0x562bad8777a0 rank 4 nranks 8 cudaDev 0 busId 18000 - Destroy COMPLETE # Out of bounds values : 0 OK # Avg bus bandwidth    : 334.846  #   |
+| 2 nodes x 8 GPUs (PDF 2 machines 16 GPUs) | 0 | nks 16 cudaDev 0 busId 18000 - Destroy COMPLETE aikubeworker0016:1321873:1322056 [0] NCCL INFO comm 0x55ba6708f500 rank 8 nranks 16 cudaDev 0 busId 18000 - Destroy COMPLETE # Out of bounds values : 0 OK # Avg bus bandwidth    : 354.211  #   |
+
+### Multi-node NCCL alltoall
+
+| Topology | CUDA Visible Devices | Peak Bus BW | Peak Size | Avg Bus BW | Threshold | Status |
+|----------|----------------------|-------------|-----------|------------|-----------|--------|
+| 2 nodes x 1 GPU (PDF 2 machines 2 GPUs) | - | 24.85 GB/s | 16G | 24.92 GB/s | >= 27 GB/s | FAIL |
+| 2 nodes x 2 GPUs (PDF 2 machines 4 GPUs) | - | 47.71 GB/s | 16G | 47.93 GB/s | >= 54 GB/s | FAIL |
+| 2 nodes x 4 GPUs (PDF 2 machines 8 GPUs) | 0,1,4,5 | 72.63 GB/s | 16G | 72.67 GB/s | >= 74 GB/s | FAIL |
+| 2 nodes x 8 GPUs (PDF 2 machines 16 GPUs) | - | 36.82 GB/s | 16G | 36.86 GB/s | >= 77 GB/s | FAIL |
+
+| Topology | NCCL Network | GPU Direct RDMA | GDR Enabled HCAs | GDR Disabled HCAs |
+|----------|--------------|-----------------|------------------|-------------------|
+| 2 nodes x 1 GPU (PDF 2 machines 2 GPUs) | IB | ENABLED | mlx5_0, mlx5_1, mlx5_6, mlx5_7 | - |
+| 2 nodes x 2 GPUs (PDF 2 machines 4 GPUs) | IB | ENABLED | mlx5_0, mlx5_1, mlx5_6, mlx5_7 | - |
+| 2 nodes x 4 GPUs (PDF 2 machines 8 GPUs) | IB | ENABLED | mlx5_0, mlx5_1, mlx5_6, mlx5_7 | - |
+| 2 nodes x 8 GPUs (PDF 2 machines 16 GPUs) | IB | ENABLED | mlx5_0, mlx5_1, mlx5_6, mlx5_7 | - |
+
+| Topology | Return Code | Error / Output Tail |
+|----------|-------------|---------------------|
+| 2 nodes x 1 GPU (PDF 2 machines 2 GPUs) | 0 | nranks 2 cudaDev 0 busId 18000 - Destroy COMPLETE aikubeworker0016:1322113:1322193 [0] NCCL INFO comm 0x55b760411150 rank 1 nranks 2 cudaDev 0 busId 18000 - Destroy COMPLETE # Out of bounds values : 0 OK # Avg bus bandwidth    : 24.917  #   |
+| 2 nodes x 2 GPUs (PDF 2 machines 4 GPUs) | 0 | ker0012:2200344:2200469 [1] NCCL INFO comm 0x55efef439da0 rank 1 nranks 4 cudaDev 1 busId 2a000 - Destroy COMPLETE aikubeworker0016:1322250:1322338 [1] NCCL INFO comm 0x558ecf546380 rank 3 nranks 4 cudaDev 1 busId 2a000 - Destroy COMPLETE   |
+| 2 nodes x 4 GPUs (PDF 2 machines 8 GPUs) | 0 | ranks 8 cudaDev 0 busId 18000 - Destroy COMPLETE aikubeworker0012:2200479:2200573 [0] NCCL INFO comm 0x55db60daef30 rank 0 nranks 8 cudaDev 0 busId 18000 - Destroy COMPLETE # Out of bounds values : 0 OK # Avg bus bandwidth    : 72.6664  #   |
+| 2 nodes x 8 GPUs (PDF 2 machines 16 GPUs) | 0 | r0012:2200587:2200767 [5] NCCL INFO comm 0x5556a6f71620 rank 5 nranks 16 cudaDev 5 busId ab000 - Destroy COMPLETE aikubeworker0012:2200588:2200772 [6] NCCL INFO comm 0x5585a1623170 rank 6 nranks 16 cudaDev 6 busId ba000 - Destroy COMPLETE   |
+
+**Overall: FAIL**
+
+---
+*Generated by GPU Test Suite v0.2.0*
--- a/reports_multinode_nccl_pdf_matrix_run_20260523.md
+++ b/reports_multinode_nccl_pdf_matrix_run_20260523.md
@@ -0,0 +1,63 @@
+# Multi-node multi-GPU NCCL PDF matrix measurements, 2026-05-23
+
+Executing node: `aikubeworker0012`
+
+Peer node: `aikubeworker0016`
+
+Raw report: `reports_multinode_nccl_pdf_matrix_20260523_112247.md`
+
+Remote report: `/root/test_gpu_scripts/reports/multinode_nccl_pdf_matrix_20260523_112247.md`
+
+Remote log: `/root/test_gpu_scripts/reports/run_logs/multinode_nccl_pdf_matrix_20260523_112247.log`
+
+Command:
+
+```bash
+cd /root/test_gpu_scripts
+bash scripts/run_multinode_nccl_pdf_matrix.sh
+```
+
+## Conclusion
+
+The formal matrix ran end to end this round: `mpirun`, SSH, `nccl-tests`, GDRDMA, and the four 400G HCAs all work. The failure is not a launch failure or a functional error; bus bandwidth did not reach the PDF thresholds.
+
+Every case returned code `0` and reported `Out of bounds values` as `0 OK`, so NCCL reported no correctness errors. The FAIL comes from the performance thresholds.
+
+## Preflight
+
+| Item | Result |
+|---|---|
+| OpenMPI | PASS, `/usr/mpi/gcc/openmpi-4.1.9a1/bin/mpirun` |
+| all_reduce_perf | PASS, `/data/nccl-tests-latest/build/all_reduce_perf` |
+| alltoall_perf | PASS, `/data/nccl-tests-latest/build/alltoall_perf` |
+| SSH 172.72.8.12 | PASS |
+| SSH 172.72.8.16 | PASS |
+| HCA | `mlx5_0,mlx5_1,mlx5_6,mlx5_7` are `400 Gb/sec (4X NDR)` and ACTIVE on both ends |
+| NCCL network | IB |
+| GPU Direct RDMA | ENABLED |
+
+## AllReduce
+
+| Topology | Peak Bus BW | Avg Bus BW | PDF Threshold | Gap | Status |
+|---|---:|---:|---:|---:|---|
+| 2 nodes x 1 GPU | 47.15 GB/s | 47.18 GB/s | >= 48.90 GB/s | -1.75 GB/s | FAIL |
+| 2 nodes x 2 GPUs | 136.62 GB/s | 136.67 GB/s | >= 136.93 GB/s | -0.31 GB/s | FAIL |
+| 2 nodes x 4 GPUs | 335.19 GB/s | 334.85 GB/s | >= 335.48 GB/s | -0.29 GB/s | FAIL |
+| 2 nodes x 8 GPUs | 354.56 GB/s | 354.21 GB/s | >= 491.84 GB/s | -137.28 GB/s | FAIL |
+
+## AllToAll
+
+| Topology | Peak Bus BW | Avg Bus BW | PDF Threshold | Gap | Status |
+|---|---:|---:|---:|---:|---|
+| 2 nodes x 1 GPU | 24.85 GB/s | 24.92 GB/s | >= 27.25 GB/s | -2.40 GB/s | FAIL |
+| 2 nodes x 2 GPUs | 47.71 GB/s | 47.93 GB/s | >= 54.41 GB/s | -6.70 GB/s | FAIL |
+| 2 nodes x 4 GPUs | 72.63 GB/s | 72.67 GB/s | >= 73.73 GB/s | -1.10 GB/s | FAIL |
+| 2 nodes x 8 GPUs | 36.82 GB/s | 36.86 GB/s | >= 76.54 GB/s | -39.72 GB/s | FAIL |
+
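+## Assessment
+
+1. AllReduce at 2x2 and 2x4 is already very close to the PDF thresholds, short by only `0.31` and `0.29 GB/s` respectively.
+2. AllToAll at 2x4 is also close to the threshold, short by `1.10 GB/s`.
+3. 2x8 is the main problem: AllReduce reaches only `354.56 / 491.84` and AllToAll only `36.82 / 76.54`.
+4. The current environment is confirmed to have only four 400G IB rails participating in NCCL, and no external NCCL net plugin / SHARP was found; this remains the strongest evidence for why the 2x8 targets are unreachable or throughput drops so sharply.
+5. No GDR-disabled or unavailable-HCA condition was seen this round, so the next step should not dwell on the SSH/mpirun/nccl-tests launch chain. Instead, align the rail count, net plugin/SHARP, and the switch cross-Leaf policy with the PDF reference environment.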
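
The Gap columns in the AllReduce and AllToAll tables are measured peak bus bandwidth minus the PDF threshold. A short sketch that recomputes them from the published numbers; the data layout and names are illustrative, and the ratio values are an added convenience that does not appear in the report:

```python
# (topology, allreduce peak, allreduce target, alltoall peak, alltoall target)
# All values are bus bandwidth in GB/s, copied from the summary tables.
MATRIX = [
    ("2 nodes x 1 GPU", 47.15, 48.90, 24.85, 27.25),
    ("2 nodes x 2 GPUs", 136.62, 136.93, 47.71, 54.41),
    ("2 nodes x 4 GPUs", 335.19, 335.48, 72.63, 73.73),
    ("2 nodes x 8 GPUs", 354.56, 491.84, 36.82, 76.54),
]


def gaps(rows):
    """Return per-topology gap (measured - target) and measured/target ratio."""
    out = {}
    for topo, ar, ar_target, a2a, a2a_target in rows:
        out[topo] = {
            "allreduce_gap": round(ar - ar_target, 2),
            "alltoall_gap": round(a2a - a2a_target, 2),
            "allreduce_ratio": ar / ar_target,
            "alltoall_ratio": a2a / a2a_target,
        }
    return out


RESULT = gaps(MATRIX)
for topo, row in RESULT.items():
    print(
        f"{topo}: allreduce {row['allreduce_gap']:+.2f} GB/s "
        f"({row['allreduce_ratio']:.1%}), "
        f"alltoall {row['alltoall_gap']:+.2f} GB/s "
        f"({row['alltoall_ratio']:.1%})"
    )
```

On these numbers the 2x8 case reaches roughly 72% of the allreduce threshold and 48% of the alltoall threshold, while every smaller topology is above 87% of its threshold.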