Record multinode NCCL artifacts run
parent 1a8cf6cbbb
commit 18cebd8e06
@@ -11,9 +11,10 @@
 | The two machines have four 400G IB rails usable by NCCL | `mlx5_0,mlx5_1,mlx5_6,mlx5_7` are all `400 Gb/sec (4X NDR)` |
 | The other HCAs are not equivalent | `mlx5_4/5` are 100G IB, `mlx5_2/8` are 25G Ethernet, `mlx5_3/9` are DOWN |
 | NCCL 2.27.7 GDR is available | GRAPH/NET logs show GDR enabled |
-| allreduce is already close to the physical ceiling of the current 4 rails | Latest PDF matrix 2x8 is `354.56 GB/s busbw`, which back-computes to `189.10 GB/s algbw`, close to the `200 GB/s` unidirectional raw bandwidth of 4 x 400G |
-| alltoall with PXN disabled is balanced across rails but still low | Latest PDF matrix 2x8 is `36.82 GB/s busbw`, about `19-20 GB/s` per rail |
-| The formal PDF matrix has been rerun | `reports_multinode_nccl_pdf_matrix_20260523_112247.md`; all cases pass correctness but FAIL the performance thresholds |
+| allreduce is already close to the physical ceiling of the current 4 rails | Latest PDF matrix 2x8 is `353.85 GB/s busbw`, which back-computes to `188.72 GB/s algbw`, close to the `200 GB/s` unidirectional raw bandwidth of 4 x 400G |
+| alltoall with PXN disabled is balanced across rails but still low | Latest PDF matrix 2x8 is `36.83 GB/s busbw`, about `19-20 GB/s` per rail |
+| The formal PDF matrix has been rerun | `reports_multinode_nccl_pdf_matrix_20260523_113803.md`; all cases pass correctness; except for 2x2 allreduce, the performance thresholds still FAIL |
+| Raw artifacts have been archived | `/root/test_gpu_scripts/reports/multinode_nccl_pdf_matrix_20260523_113803_artifacts`; each case has a complete `cmd/stdout/stderr/json` set |
 | No hard errors seen | No growth in discard, RoCE retrans, slow restart, packet sequence error, or similar counters |
 | External NCCL network components are currently missing | `libnccl-net*.so*` / `libsharp*.so*` not found; no SHARP/HCOLL packages seen |

@@ -62,17 +63,17 @@ busbw = algbw * 1.875

 We recommend re-evaluating the explainable target for the current 2x8 allreduce against the physical capability of 4 x 400G rails:

-- allreduce is currently `354.56 GB/s busbw`, which back-computes to `189.10 GB/s algbw`, close to the `200 GB/s` unidirectional raw ceiling.
-- alltoall is currently `36.82 GB/s`, still low, and needs to be investigated further as a separate issue.
+- allreduce is currently `353.85 GB/s busbw`, which back-computes to `188.72 GB/s algbw`, close to the `200 GB/s` unidirectional raw ceiling.
+- alltoall is currently `36.83 GB/s`, still low, and needs to be investigated further as a separate issue.

 ## Latest PDF matrix results

 | Topology | AllReduce | AllReduce Target | AllToAll | AllToAll Target |
 |---|---:|---:|---:|---:|
-| 2 nodes x 1 GPU | `47.15` | `48.90` | `24.85` | `27.25` |
-| 2 nodes x 2 GPUs | `136.62` | `136.93` | `47.71` | `54.41` |
-| 2 nodes x 4 GPUs | `335.19` | `335.48` | `72.63` | `73.73` |
-| 2 nodes x 8 GPUs | `354.56` | `491.84` | `36.82` | `76.54` |
+| 2 nodes x 1 GPU | `47.29` | `48.90` | `24.85` | `27.25` |
+| 2 nodes x 2 GPUs | `137.16` | `136.93` | `47.76` | `54.41` |
+| 2 nodes x 4 GPUs | `335.07` | `335.48` | `72.74` | `73.73` |
+| 2 nodes x 8 GPUs | `353.85` | `491.84` | `36.83` | `76.54` |

 All cases have return code `0`, and NCCL `Out of bounds values` is `0 OK`. The FAIL in this round is therefore a performance-threshold failure, not an NCCL correctness or launch-chain failure.

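The hunk context `busbw = algbw * 1.875` and the back-computed algbw figures are consistent with the nccl-tests bus-bandwidth convention for allreduce, `busbw = algbw * 2(n-1)/n`, evaluated at n = 16 ranks. The following is a minimal sketch of that arithmetic, assuming a 400 Gb/s rail converts to GB/s by dividing by 8 and ignoring protocol overhead; the function names are illustrative and are not part of the test scripts.

```python
def allreduce_busbw_factor(n_ranks: int) -> float:
    """nccl-tests allreduce correction factor: busbw = algbw * 2(n-1)/n."""
    return 2 * (n_ranks - 1) / n_ranks


def algbw_from_busbw(busbw_gbs: float, n_ranks: int) -> float:
    """Back-compute algorithm bandwidth (GB/s) from reported bus bandwidth."""
    return busbw_gbs / allreduce_busbw_factor(n_ranks)


def raw_unidirectional_gbs(rails: int, gbit_per_rail: float) -> float:
    """Raw one-way link capacity in GB/s (assumption: 8 bits per byte, no overhead)."""
    return rails * gbit_per_rail / 8


if __name__ == "__main__":
    n_ranks = 16  # 2 nodes x 8 GPUs
    ceiling = raw_unidirectional_gbs(rails=4, gbit_per_rail=400)
    for label, busbw in (("previous run", 354.56), ("latest run", 353.85)):
        algbw = algbw_from_busbw(busbw, n_ranks)
        print(f"{label}: busbw={busbw:.2f} algbw={algbw:.2f} "
              f"({algbw / ceiling:.1%} of {ceiling:.0f} GB/s raw)")
```

Comparing algbw rather than busbw against the 4-rail figure is what makes the "close to the physical ceiling" reading in the report checkable.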
@@ -166,8 +167,10 @@ OUT_DIR=/root/test_gpu_scripts/reports/nccl_deep_diag_plugin_check_$(date +%Y%m%
 | File | Purpose |
 |---|---|
 | `reports_multinode_nccl_diagnosis_20260523.md` | Overall diagnosis report |
-| `reports_multinode_nccl_pdf_matrix_20260523_112247.md` | Latest multi-node multi-GPU PDF matrix raw report |
+| `reports_multinode_nccl_pdf_matrix_20260523_112247.md` | Previous multi-node multi-GPU PDF matrix raw report |
+| `reports_multinode_nccl_pdf_matrix_20260523_113803.md` | Latest multi-node multi-GPU PDF matrix raw report, with artifacts |
 | `reports_multinode_nccl_pdf_matrix_run_20260523.md` | Chinese summary of the latest multi-node multi-GPU PDF matrix |
+| `reports_multinode_nccl_pdf_matrix_artifacts_manifest_20260523_113803.md` | Latest artifacts manifest and checksums |
 | `reports_multinode_nccl_deep_diagnose_run_20260523.md` | Results of this round's in-depth rerun |
 | `reports_multinode_nccl_environment_gap_20260523.md` | Hardware/software environment equivalence gaps |
 | `reports_multinode_nccl_counter_probe_20260523.md` | RDMA rail/counter evidence |
@@ -6,11 +6,11 @@

 Current conclusions:

-- On 2026-05-23 at `11:22` the formal multi-node multi-GPU PDF matrix rerun was completed. The raw report is `reports_multinode_nccl_pdf_matrix_20260523_112247.md` and the Chinese summary is `reports_multinode_nccl_pdf_matrix_run_20260523.md`.
+- On 2026-05-23 at `11:38` the formal multi-node multi-GPU PDF matrix rerun with artifacts was completed. The raw report is `reports_multinode_nccl_pdf_matrix_20260523_113803.md`, the Chinese summary is `reports_multinode_nccl_pdf_matrix_run_20260523.md`, and the artifact manifest is `reports_multinode_nccl_pdf_matrix_artifacts_manifest_20260523_113803.md`.
 - The 2-node tiers with 1/2/4 GPUs per node are close to the PDF reference values, but still FAIL under the strict thresholds.
 - The 2-node 8-GPU tier still falls short of the PDF reference values:
-  - allreduce measured `354.56 GB/s busbw`; PDF target `491.84 GB/s`.
-  - alltoall measured `36.82 GB/s busbw`; PDF target `76.54 GB/s`.
+  - allreduce measured `353.85 GB/s busbw`; PDF target `491.84 GB/s`.
+  - alltoall measured `36.83 GB/s busbw`; PDF target `76.54 GB/s`.
 - The remaining 2-node 8-GPU gap no longer looks like an old-NCCL, GDR-disabled, HCA-ordering, SSH/mpirun, or obvious bad-link problem.
 - It now looks more like a mismatch in hardware rail count versus the PDF setup, a missing NCCL net plugin / SHARP, or a cross-Leaf alltoall network/graph policy issue.

@@ -112,9 +112,12 @@ aikubeworker0016: /root/test_gpu_scripts/reports/nccl_environment_snapshot_aikub
 Latest multi-node multi-GPU PDF matrix:

 ```text
-aikubeworker0012: /root/test_gpu_scripts/reports/multinode_nccl_pdf_matrix_20260523_112247.md
-local copy: reports_multinode_nccl_pdf_matrix_20260523_112247.md
+aikubeworker0012: /root/test_gpu_scripts/reports/multinode_nccl_pdf_matrix_20260523_113803.md
+artifacts: /root/test_gpu_scripts/reports/multinode_nccl_pdf_matrix_20260523_113803_artifacts
+artifacts tar: /root/test_gpu_scripts/reports/multinode_nccl_pdf_matrix_20260523_113803_artifacts.tar.gz
+local copy: reports_multinode_nccl_pdf_matrix_20260523_113803.md
 summary: reports_multinode_nccl_pdf_matrix_run_20260523.md
+manifest: reports_multinode_nccl_pdf_matrix_artifacts_manifest_20260523_113803.md
 ```

 The next rerun with `scripts/run_multinode_nccl_pdf_matrix.sh` will also generate:
@@ -164,10 +167,10 @@ libsharp*.so*

 | Topology | AllReduce | AllReduce Target | AllToAll | AllToAll Target |
 |---|---:|---:|---:|---:|
-| 2 nodes x 1 GPU | `47.15` | `48.90` | `24.85` | `27.25` |
-| 2 nodes x 2 GPUs | `136.62` | `136.93` | `47.71` | `54.41` |
-| 2 nodes x 4 GPUs | `335.19` | `335.48` | `72.63` | `73.73` |
-| 2 nodes x 8 GPUs | `354.56` | `491.84` | `36.82` | `76.54` |
+| 2 nodes x 1 GPU | `47.29` | `48.90` | `24.85` | `27.25` |
+| 2 nodes x 2 GPUs | `137.16` | `136.93` | `47.76` | `54.41` |
+| 2 nodes x 4 GPUs | `335.07` | `335.48` | `72.74` | `73.73` |
+| 2 nodes x 8 GPUs | `353.85` | `491.84` | `36.83` | `76.54` |

 Full rerun for this round:

@@ -189,8 +192,10 @@ PXN disabled sweep found no effective parameters:
 |---|---|
 | `reports_multinode_nccl_diagnosis_20260523.md` | Long-form overall diagnosis, covering the whole process from old NCCL / GDR disabled through alignment with the PDF matrix |
 | `reports_multinode_nccl_pdf_matrix_nccl227.md` | Formal raw report produced by running the PDF matrix |
-| `reports_multinode_nccl_pdf_matrix_20260523_112247.md` | Latest formal PDF matrix raw report |
+| `reports_multinode_nccl_pdf_matrix_20260523_112247.md` | Previous formal PDF matrix raw report |
+| `reports_multinode_nccl_pdf_matrix_20260523_113803.md` | Latest formal PDF matrix raw report, with artifacts |
 | `reports_multinode_nccl_pdf_matrix_run_20260523.md` | Chinese summary of the latest formal PDF matrix |
+| `reports_multinode_nccl_pdf_matrix_artifacts_manifest_20260523_113803.md` | Latest artifacts manifest and checksums |
 | `reports_multinode_nccl_counter_probe_20260523.md` | RDMA rail and counter evidence |
 | `reports_multinode_nccl_alltoall_tuning_20260523.md` | alltoall PXN and parameter sweep conclusions |
 | `reports_rdma_single_node_summary.md` | Single-node RDMA/HCA rate summary |
reports_multinode_nccl_pdf_matrix_20260523_113803.md (Normal file, 75 lines added)
@@ -0,0 +1,75 @@
+# GPU Test Report
+
+- **Date:** 2026-05-23T11:41:35.567886
+- **Host:** aikubeworker0012
+
+## Overall Acceptance Verdict
+
+**Result: FAIL**
+
+Failed or unverified items:
+- Multi-node NCCL: FAIL
+
+## Summary
+
+| Test | Result |
+|------|--------|
+| Multi-node NCCL | FAIL |
+
+## Multi-node NCCL / Cross Leaf
+
+Source: nccl-tests-mpirun | Mode: cross-leaf-pdf-matrix-nccl-2.27.7
+
+- **Artifacts:** `/root/test_gpu_scripts/reports/multinode_nccl_pdf_matrix_20260523_113803_artifacts`
+- **Hosts:** nccl-gpu-1(172.72.8.12), nccl-gpu-2(172.72.8.16)
+- **Preflight:** PASS
+
+### Multi-node NCCL allreduce
+
+| Topology | CUDA Visible Devices | Peak Bus BW | Peak Size | Avg Bus BW | Threshold | Status |
+|----------|----------------------|-------------|-----------|------------|-----------|--------|
+| 2 nodes x 1 GPU (PDF 2 machines 2 GPUs) | - | 47.29 GB/s | 16G | 47.26 GB/s | >= 48.90 GB/s | FAIL |
+| 2 nodes x 2 GPUs (PDF 2 machines 4 GPUs) | - | 137.16 GB/s | 16G | 137.13 GB/s | >= 136.93 GB/s | PASS |
+| 2 nodes x 4 GPUs (PDF 2 machines 8 GPUs) | 0,1,4,5 | 335.07 GB/s | 16G | 335.02 GB/s | >= 335.48 GB/s | FAIL |
+| 2 nodes x 8 GPUs (PDF 2 machines 16 GPUs) | - | 353.85 GB/s | 16G | 353.85 GB/s | >= 491.84 GB/s | FAIL |
+
+| Topology | NCCL Network | GPU Direct RDMA | GDR Enabled HCAs | GDR Disabled HCAs |
+|----------|--------------|-----------------|------------------|-------------------|
+| 2 nodes x 1 GPU (PDF 2 machines 2 GPUs) | IB | ENABLED | mlx5_0, mlx5_1, mlx5_6, mlx5_7 | - |
+| 2 nodes x 2 GPUs (PDF 2 machines 4 GPUs) | IB | ENABLED | mlx5_0, mlx5_1, mlx5_6, mlx5_7 | - |
+| 2 nodes x 4 GPUs (PDF 2 machines 8 GPUs) | IB | ENABLED | mlx5_0, mlx5_1, mlx5_6, mlx5_7 | - |
+| 2 nodes x 8 GPUs (PDF 2 machines 16 GPUs) | IB | ENABLED | mlx5_0, mlx5_1, mlx5_6, mlx5_7 | - |
+
+| Topology | Return Code | Error / Output Tail |
+|----------|-------------|---------------------|
+| 2 nodes x 1 GPU (PDF 2 machines 2 GPUs) | 0 | ranks 2 cudaDev 0 busId 18000 - Destroy COMPLETE aikubeworker0012:2203142:2203200 [0] NCCL INFO comm 0x55e463572510 rank 0 nranks 2 cudaDev 0 busId 18000 - Destroy COMPLETE # Out of bounds values : 0 OK # Avg bus bandwidth : 47.2628 # |
+| 2 nodes x 4 GPUs (PDF 2 machines 8 GPUs) | 0 | ranks 8 cudaDev 0 busId 18000 - Destroy COMPLETE aikubeworker0012:2203280:2203363 [0] NCCL INFO comm 0x55e2f3808c60 rank 0 nranks 8 cudaDev 0 busId 18000 - Destroy COMPLETE # Out of bounds values : 0 OK # Avg bus bandwidth : 335.021 # |
+| 2 nodes x 8 GPUs (PDF 2 machines 16 GPUs) | 0 | nks 16 cudaDev 0 busId 18000 - Destroy COMPLETE aikubeworker0012:2203376:2203528 [0] NCCL INFO comm 0x55a5166a30c0 rank 0 nranks 16 cudaDev 0 busId 18000 - Destroy COMPLETE # Out of bounds values : 0 OK # Avg bus bandwidth : 353.854 # |
+
+### Multi-node NCCL alltoall
+
+| Topology | CUDA Visible Devices | Peak Bus BW | Peak Size | Avg Bus BW | Threshold | Status |
+|----------|----------------------|-------------|-----------|------------|-----------|--------|
+| 2 nodes x 1 GPU (PDF 2 machines 2 GPUs) | - | 24.85 GB/s | 16G | 24.90 GB/s | >= 27.25 GB/s | FAIL |
+| 2 nodes x 2 GPUs (PDF 2 machines 4 GPUs) | - | 47.76 GB/s | 16G | 47.98 GB/s | >= 54.41 GB/s | FAIL |
+| 2 nodes x 4 GPUs (PDF 2 machines 8 GPUs) | 0,1,4,5 | 72.74 GB/s | 16G | 72.80 GB/s | >= 73.73 GB/s | FAIL |
+| 2 nodes x 8 GPUs (PDF 2 machines 16 GPUs) | - | 36.83 GB/s | 16G | 36.85 GB/s | >= 76.54 GB/s | FAIL |
+
+| Topology | NCCL Network | GPU Direct RDMA | GDR Enabled HCAs | GDR Disabled HCAs |
+|----------|--------------|-----------------|------------------|-------------------|
+| 2 nodes x 1 GPU (PDF 2 machines 2 GPUs) | IB | ENABLED | mlx5_0, mlx5_1, mlx5_6, mlx5_7 | - |
+| 2 nodes x 2 GPUs (PDF 2 machines 4 GPUs) | IB | ENABLED | mlx5_0, mlx5_1, mlx5_6, mlx5_7 | - |
+| 2 nodes x 4 GPUs (PDF 2 machines 8 GPUs) | IB | ENABLED | mlx5_0, mlx5_1, mlx5_6, mlx5_7 | - |
+| 2 nodes x 8 GPUs (PDF 2 machines 16 GPUs) | IB | ENABLED | mlx5_0, mlx5_1, mlx5_6, mlx5_7 | - |
+
+| Topology | Return Code | Error / Output Tail |
+|----------|-------------|---------------------|
+| 2 nodes x 1 GPU (PDF 2 machines 2 GPUs) | 0 | ranks 2 cudaDev 0 busId 18000 - Destroy COMPLETE aikubeworker0012:2203543:2203602 [0] NCCL INFO comm 0x55af2a804ba0 rank 0 nranks 2 cudaDev 0 busId 18000 - Destroy COMPLETE # Out of bounds values : 0 OK # Avg bus bandwidth : 24.9006 # |
+| 2 nodes x 2 GPUs (PDF 2 machines 4 GPUs) | 0 | ker0012:2203610:2203792 [1] NCCL INFO comm 0x55e99a564500 rank 1 nranks 4 cudaDev 1 busId 2a000 - Destroy COMPLETE aikubeworker0016:1325607:1325696 [0] NCCL INFO comm 0x55eaaa7389c0 rank 2 nranks 4 cudaDev 0 busId 18000 - Destroy COMPLETE |
+| 2 nodes x 4 GPUs (PDF 2 machines 8 GPUs) | 0 | ranks 8 cudaDev 0 busId 18000 - Destroy COMPLETE aikubeworker0016:1325765:1325869 [3] NCCL INFO comm 0x55cb0f1c9c10 rank 7 nranks 8 cudaDev 3 busId ab000 - Destroy COMPLETE # Out of bounds values : 0 OK # Avg bus bandwidth : 72.7968 # |
+| 2 nodes x 8 GPUs (PDF 2 machines 16 GPUs) | 0 | 0016:1325927:1326140 [2] NCCL INFO comm 0x5627d2adee20 rank 10 nranks 16 cudaDev 2 busId 3a000 - Destroy COMPLETE aikubeworker0016:1325926:1326135 [1] NCCL INFO comm 0x55c00c344ea0 rank 9 nranks 16 cudaDev 1 busId 2a000 - Destroy COMPLETE |
+
+**Overall: FAIL**
+
+---
+*Generated by GPU Test Suite v0.2.0*
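The "Error / Output Tail" column in the report above carries the two nccl-tests summary fields that the correctness claim rests on (`Out of bounds values` and `Avg bus bandwidth`). A rough sketch of pulling them out of a tail string is shown here; the regular expressions are written against the tails shown in this report only, and `parse_tail` is a hypothetical helper, not a function of the GPU Test Suite.

```python
import re

# Patterns modelled on the output tails shown in the report (assumption).
OOB_RE = re.compile(r"#\s*Out of bounds values\s*:\s*(\d+)\s+(\w+)")
AVG_RE = re.compile(r"#\s*Avg bus bandwidth\s*:\s*([0-9.]+)")


def parse_tail(tail: str) -> dict:
    """Extract correctness and average busbw from an nccl-tests output tail.

    Fields are None when the tail was truncated before the summary lines.
    """
    oob = OOB_RE.search(tail)
    avg = AVG_RE.search(tail)
    return {
        "out_of_bounds": int(oob.group(1)) if oob else None,
        "oob_status": oob.group(2) if oob else None,
        "avg_busbw_gbs": float(avg.group(1)) if avg else None,
    }


if __name__ == "__main__":
    sample = ("busId 18000 - Destroy COMPLETE # Out of bounds values : 0 OK "
              "# Avg bus bandwidth : 47.2628 #")
    print(parse_tail(sample))
```

Two of the alltoall tails in the table (2x2 and 2x8) are cut off before the summary lines, so a parser like this would return `None` for them; the full values would have to come from the archived `stdout` artifacts.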
@@ -0,0 +1,33 @@
+# Multi-node Multi-GPU NCCL PDF Matrix Artifacts Manifest 2026-05-23
+
+- Remote report: `reports/multinode_nccl_pdf_matrix_20260523_113803.md`
+- Remote artifact dir: `reports/multinode_nccl_pdf_matrix_20260523_113803_artifacts`
+- Remote artifact tar: `reports/multinode_nccl_pdf_matrix_20260523_113803_artifacts.tar.gz`
+- Case count: `8`
+- Artifact files: `32`
+
+## Case Summary
+
+| Case | Peak Bus BW | Avg Bus BW | Threshold | Wrong | Return Code | Status |
+|---|---:|---:|---:|---:|---:|---|
+| `allreduce_2x1_2_nodes_x_1_GPU_PDF_2_machines_2_GPUs` | 47.29 | 47.26 | 48.90 | 0 | 0 | FAIL |
+| `allreduce_2x2_2_nodes_x_2_GPUs_PDF_2_machines_4_GPUs` | 137.16 | 137.13 | 136.93 | 0 | 0 | PASS |
+| `allreduce_2x4_2_nodes_x_4_GPUs_PDF_2_machines_8_GPUs` | 335.07 | 335.02 | 335.48 | 0 | 0 | FAIL |
+| `allreduce_2x8_2_nodes_x_8_GPUs_PDF_2_machines_16_GPUs` | 353.85 | 353.85 | 491.84 | 0 | 0 | FAIL |
+| `alltoall_2x1_2_nodes_x_1_GPU_PDF_2_machines_2_GPUs` | 24.85 | 24.90 | 27.25 | 0 | 0 | FAIL |
+| `alltoall_2x2_2_nodes_x_2_GPUs_PDF_2_machines_4_GPUs` | 47.76 | 47.98 | 54.41 | 0 | 0 | FAIL |
+| `alltoall_2x4_2_nodes_x_4_GPUs_PDF_2_machines_8_GPUs` | 72.74 | 72.80 | 73.73 | 0 | 0 | FAIL |
+| `alltoall_2x8_2_nodes_x_8_GPUs_PDF_2_machines_16_GPUs` | 36.83 | 36.85 | 76.54 | 0 | 0 | FAIL |
+
+## Checksums
+
+```text
+682ac637460472d464a0d56ccc0f3335ed7f79a270157a403ebec23b8d9feceb  reports/multinode_nccl_pdf_matrix_20260523_113803.md
+7371fcaf7269f92eb1544e5e63573ebf77f4ae38f668b5b22169ca86e6d603ee  reports/multinode_nccl_pdf_matrix_20260523_113803_artifacts.tar.gz
+```
+
+Per-file artifact checksums are on the remote node at:
+
+```text
+reports/multinode_nccl_pdf_matrix_20260523_113803_artifacts.sha256
+```
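The checksum lines in the manifest above can be checked against local copies of the report and tarball. The sketch below assumes the entries follow the `<sha256 hex>  <relative path>` layout shown in the manifest and that paths are resolved against a directory containing `reports/`; `verify` and its `root` argument are illustrative names, not part of the repository.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large tarballs need little memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_manifest(text: str) -> dict:
    """Map relative path -> expected hex digest; skips lines of any other shape."""
    entries = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and len(parts[0]) == 64:
            entries[parts[1].strip()] = parts[0].lower()
    return entries


def verify(manifest_text: str, root=".") -> dict:
    """Return {relative path: True/False}; a missing file counts as False."""
    results = {}
    for rel, expected in parse_manifest(manifest_text).items():
        target = Path(root) / rel
        results[rel] = target.is_file() and sha256_of(target) == expected
    return results
```

On the remote node, `sha256sum -c` run against the per-file `.sha256` list should serve the same purpose, provided that file is in standard `sha256sum` output format, which the manifest does not state explicitly.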
@@ -4,11 +4,15 @@

 Peer node: `aikubeworker0016`

-Raw report: `reports_multinode_nccl_pdf_matrix_20260523_112247.md`
+Raw report: `reports_multinode_nccl_pdf_matrix_20260523_113803.md`

-Remote report: `/root/test_gpu_scripts/reports/multinode_nccl_pdf_matrix_20260523_112247.md`
+Remote report: `/root/test_gpu_scripts/reports/multinode_nccl_pdf_matrix_20260523_113803.md`

-Remote log: `/root/test_gpu_scripts/reports/run_logs/multinode_nccl_pdf_matrix_20260523_112247.log`
+Remote artifacts: `/root/test_gpu_scripts/reports/multinode_nccl_pdf_matrix_20260523_113803_artifacts`
+
+Remote artifacts tar: `/root/test_gpu_scripts/reports/multinode_nccl_pdf_matrix_20260523_113803_artifacts.tar.gz`
+
+Artifacts manifest: `reports_multinode_nccl_pdf_matrix_artifacts_manifest_20260523_113803.md`

 Command executed:

@@ -40,24 +44,24 @@ bash scripts/run_multinode_nccl_pdf_matrix.sh

 | Topology | Peak Bus BW | Avg Bus BW | PDF Threshold | Gap | Status |
 |---|---:|---:|---:|---:|---|
-| 2 nodes x 1 GPU | 47.15 GB/s | 47.18 GB/s | >= 48.90 GB/s | -1.75 GB/s | FAIL |
-| 2 nodes x 2 GPUs | 136.62 GB/s | 136.67 GB/s | >= 136.93 GB/s | -0.31 GB/s | FAIL |
-| 2 nodes x 4 GPUs | 335.19 GB/s | 334.85 GB/s | >= 335.48 GB/s | -0.29 GB/s | FAIL |
-| 2 nodes x 8 GPUs | 354.56 GB/s | 354.21 GB/s | >= 491.84 GB/s | -137.28 GB/s | FAIL |
+| 2 nodes x 1 GPU | 47.29 GB/s | 47.26 GB/s | >= 48.90 GB/s | -1.61 GB/s | FAIL |
+| 2 nodes x 2 GPUs | 137.16 GB/s | 137.13 GB/s | >= 136.93 GB/s | +0.23 GB/s | PASS |
+| 2 nodes x 4 GPUs | 335.07 GB/s | 335.02 GB/s | >= 335.48 GB/s | -0.41 GB/s | FAIL |
+| 2 nodes x 8 GPUs | 353.85 GB/s | 353.85 GB/s | >= 491.84 GB/s | -137.99 GB/s | FAIL |

 ## AllToAll

 | Topology | Peak Bus BW | Avg Bus BW | PDF Threshold | Gap | Status |
 |---|---:|---:|---:|---:|---|
-| 2 nodes x 1 GPU | 24.85 GB/s | 24.92 GB/s | >= 27.25 GB/s | -2.40 GB/s | FAIL |
-| 2 nodes x 2 GPUs | 47.71 GB/s | 47.93 GB/s | >= 54.41 GB/s | -6.70 GB/s | FAIL |
-| 2 nodes x 4 GPUs | 72.63 GB/s | 72.67 GB/s | >= 73.73 GB/s | -1.10 GB/s | FAIL |
-| 2 nodes x 8 GPUs | 36.82 GB/s | 36.86 GB/s | >= 76.54 GB/s | -39.72 GB/s | FAIL |
+| 2 nodes x 1 GPU | 24.85 GB/s | 24.90 GB/s | >= 27.25 GB/s | -2.40 GB/s | FAIL |
+| 2 nodes x 2 GPUs | 47.76 GB/s | 47.98 GB/s | >= 54.41 GB/s | -6.65 GB/s | FAIL |
+| 2 nodes x 4 GPUs | 72.74 GB/s | 72.80 GB/s | >= 73.73 GB/s | -0.99 GB/s | FAIL |
+| 2 nodes x 8 GPUs | 36.83 GB/s | 36.85 GB/s | >= 76.54 GB/s | -39.71 GB/s | FAIL |

 ## Assessment

-1. AllReduce for 2x2 and 2x4 is already very close to the PDF thresholds, with gaps of only `0.31` and `0.29 GB/s` respectively.
-2. AllToAll for 2x4 is also close to the threshold, short by `1.10 GB/s`.
-3. 2x8 is the main problem: AllReduce reaches only `354.56 / 491.84`, and AllToAll only `36.82 / 76.54`.
+1. AllReduce for 2x2 cleared the threshold in this run; AllReduce for 2x4 is very close to the PDF threshold, short by `0.41 GB/s`.
+2. AllToAll for 2x4 is also close to the threshold, short by `0.99 GB/s`.
+3. 2x8 is the main problem: AllReduce reaches only `353.85 / 491.84`, and AllToAll only `36.83 / 76.54`.
 4. The current environment is confirmed to have only four 400G IB rails participating in NCCL, and no external NCCL net plugin / SHARP was found; this remains the strongest evidence for why the 2x8 target is unreachable or performance drops sharply.
 5. No GDR-disabled or unavailable HCA was seen in this round, so the next step should not be further digging into the SSH/mpirun/nccl-tests launch chain. It should be aligning with the PDF reference environment on rail count, net plugin/SHARP, and the switches' cross-Leaf policy.
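The Gap and Status columns in the tables above can be recomputed from the peak and threshold values. The sketch below uses the latest-run figures from this summary and assumes the gap is `peak - threshold` and that a case passes when the peak meets the threshold; the source does not say whether the suite compares the peak or the average, so that choice is an assumption here, as is the helper name `evaluate`.

```python
# (topology, peak busbw GB/s, PDF threshold GB/s), latest run from the summary
ALLREDUCE = [
    ("2 nodes x 1 GPU", 47.29, 48.90),
    ("2 nodes x 2 GPUs", 137.16, 136.93),
    ("2 nodes x 4 GPUs", 335.07, 335.48),
    ("2 nodes x 8 GPUs", 353.85, 491.84),
]
ALLTOALL = [
    ("2 nodes x 1 GPU", 24.85, 27.25),
    ("2 nodes x 2 GPUs", 47.76, 54.41),
    ("2 nodes x 4 GPUs", 72.74, 73.73),
    ("2 nodes x 8 GPUs", 36.83, 76.54),
]


def evaluate(rows):
    """Return (topology, gap, fraction of target, status) per row."""
    results = []
    for topology, peak, threshold in rows:
        gap = round(peak - threshold, 2)
        status = "PASS" if peak >= threshold else "FAIL"  # assumed rule
        results.append((topology, gap, peak / threshold, status))
    return results


if __name__ == "__main__":
    for name, rows in (("allreduce", ALLREDUCE), ("alltoall", ALLTOALL)):
        for topology, gap, fraction, status in evaluate(rows):
            print(f"{name:9s} {topology:17s} gap={gap:+8.2f} GB/s "
                  f"{fraction:6.1%} of target  {status}")
```

Expressing each result as a fraction of its target separates the near-misses at 2x1 through 2x4 from the 2x8 cases, which is the distinction the assessment draws.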