Document NCCL deep diagnosis rerun
parent b55666948c
commit c183f5a9d1
reports_multinode_nccl_deep_diagnose_run_20260523.md (normal file, 125 lines added)
@ -0,0 +1,125 @@
# Multi-node NCCL deep-diagnosis rerun report, 2026-05-23

## Run information

- Launch node: `aikubeworker0012`
- Peer node: `aikubeworker0016`
- Test scale: 2 nodes x 8 GPUs
- NCCL: `2.27.7+cuda12.4`
- nccl-tests: `/data/nccl-tests-latest/build`
- OpenMPI: `/usr/mpi/gcc/openmpi-4.1.9a1/bin/mpirun`
- Remote artifact directory: `/root/test_gpu_scripts/reports/nccl_deep_diag_20260523_103932`
- Diagnosis script: `scripts/multinode_nccl_deep_diagnose.sh all`

## Preflight

Both machines passed the lightweight environment check:

| Item | aikubeworker0012 | aikubeworker0016 |
|---|---:|---:|
| OpenMPI | `4.1.9a1` | `4.1.9a1` |
| `all_reduce_perf` | OK | OK |
| `alltoall_perf` | OK | OK |
| `mlx5_0` | 400 Gb/sec ACTIVE | 400 Gb/sec ACTIVE |
| `mlx5_1` | 400 Gb/sec ACTIVE | 400 Gb/sec ACTIVE |
| `mlx5_6` | 400 Gb/sec ACTIVE | 400 Gb/sec ACTIVE |
| `mlx5_7` | 400 Gb/sec ACTIVE | 400 Gb/sec ACTIVE |

## 16G core results

| Test | Configuration | Avg Bus BW | Conclusion |
|---|---|---:|---|
| allreduce | automatic parameters | `354.025 GB/s` | Stably reproduces the current high baseline |
| alltoall | `NCCL_PXN_DISABLE=1` | `36.9377 GB/s` | Stably reproduces the current bottleneck baseline |
| graph allreduce | `NCCL_DEBUG=INFO` | `354.224 GB/s` | Consistent with the counter run |
| graph alltoall | `NCCL_PXN_DISABLE=1`, `NCCL_DEBUG=INFO` | `37.14 GB/s` | Consistent with the counter run |

Implications for the PDF targets:

- 2x8 allreduce is still well below the PDF target of `491.84 GB/s` for 2 nodes with 16 GPUs.
- 2x8 alltoall is still well below the PDF target of `76.54 GB/s` for 2 nodes with 16 GPUs.
- This run found no parameter that moves 8-GPU alltoall off the `36-37 GB/s` plateau.
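
The gap to the PDF targets can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name `target_gap` and the dictionary layout are assumptions, and the numbers are the counter-run averages and PDF targets quoted above.

```python
# Measured 16G averages and PDF targets, both in GB/s, copied from the report.
results = {
    "allreduce": (354.025, 491.84),
    "alltoall": (36.9377, 76.54),
}


def target_gap(measured, target):
    """Return (fraction of the target reached, shortfall in GB/s)."""
    return measured / target, target - measured


for name, (measured, target) in results.items():
    fraction, shortfall = target_gap(measured, target)
    print(f"{name}: {fraction:.1%} of target, short by {shortfall:.2f} GB/s")
```

On these inputs allreduce reaches roughly 72% of its target and alltoall roughly 48%, so the alltoall gap is the larger one in relative terms.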

## Counter observations

### Rail traffic

allreduce sent about `178.03-178.07 GiB` on each rail; alltoall with PXN disabled sent about `712.23-712.28 GiB` on each rail. The four 400G rails are balanced in both tests.
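
As a rough check of the balance claim, the spread between the least and most loaded rail can be expressed as a percentage of the mean. This is a minimal sketch; `rail_spread_pct` is a hypothetical helper, and only the reported minimum and maximum per test are used because the report does not list the individual rails.

```python
def rail_spread_pct(low_gib, high_gib):
    """Spread between the lightest and heaviest rail, as a percent of their mean."""
    mean = (low_gib + high_gib) / 2
    return (high_gib - low_gib) / mean * 100


# Per-rail send ranges in GiB, copied from the paragraph above.
allreduce_spread = rail_spread_pct(178.03, 178.07)
alltoall_spread = rail_spread_pct(712.23, 712.28)

print(f"allreduce spread: {allreduce_spread:.3f}%")
print(f"alltoall spread:  {alltoall_spread:.3f}%")
```

Both spreads come out well under 0.1%, which is consistent with describing the rails as balanced.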

### Error and congestion counters

This run showed no growth in hard-error counters such as discards, symbol errors, RoCE retransmissions, slow restarts, or packet sequence errors.

The counter that did grow is `port_xmit_wait`:

| Test | Counter growth |
|---|---|
| allreduce | `aikubeworker0016 mlx5_1 +6725565`, `mlx5_7 +6103180` |
| alltoall + PXN disabled | `aikubeworker0016 mlx5_1 +20988680`, `mlx5_7 +16271960` |

This shows that `port_xmit_wait` growth is not unique to alltoall; high-throughput allreduce shows it too. It can be handed to the network team as a signal of switch-fabric or credit waiting, but on its own it cannot explain the low alltoall bandwidth.
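
One way to compare the two rows of the table above is to normalize the `port_xmit_wait` growth by the data each rail sent. The sketch below does that under a stated assumption: it uses the midpoint of the reported per-rail send range, because the report gives no per-device traffic. The names `xmit_wait`, `sent_gib`, and `wait_per_gib` are illustrative.

```python
# port_xmit_wait growth on aikubeworker0016, copied from the table above.
xmit_wait = {
    "allreduce": {"mlx5_1": 6725565, "mlx5_7": 6103180},
    "alltoall_pxn_off": {"mlx5_1": 20988680, "mlx5_7": 16271960},
}

# Assumption: midpoint of the reported per-rail send range, in GiB.
sent_gib = {
    "allreduce": (178.03 + 178.07) / 2,
    "alltoall_pxn_off": (712.23 + 712.28) / 2,
}


def wait_per_gib(test):
    """Counter growth per GiB sent, for each device listed for the test."""
    return {dev: count / sent_gib[test] for dev, count in xmit_wait[test].items()}


for test in xmit_wait:
    for dev, value in wait_per_gib(test).items():
        print(f"{test} {dev}: {value:.0f} per GiB")
```

With this normalization, allreduce accumulates more wait per GiB than alltoall on both devices, which fits the reading that `port_xmit_wait` alone does not separate the two workloads.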

## GRAPH/TUNING comparison

| Observation | allreduce | alltoall + `NCCL_PXN_DISABLE=1` |
|---|---:|---:|
| `avg_busbw` (GB/s) | `354.224` | `37.14` |
| `plugin_missing` | `16` | `16` |
| GDR enabled lines | `1344` | `704` |
| channel summary | `16 coll / 16 nvls / 16 p2p` | `16 coll / 16 nvls / 16 p2p` |
| Pattern 4 | `crossNic 0`, `NVL/PXN` | `crossNic 2`, `NVL/PIX` |
| `NET/IB/*/GDRDMA` lines | `256` | `512` |
| `P2P/CUMEM` lines | `0` | `224` |
| total NET/P2P edge lines | `256` | `736` |

Interpretation:

- HCA, GDR, NCCL version, and the base channel count are not the root cause of the difference.
- The alltoall communication graph is clearly more complex: it introduces more NET/P2P edges, and Pattern 4 changes from `NVL/PXN` in allreduce to `NVL/PIX`.
- This continues to support the view that the problem lies with the NCCL alltoall graph strategy, the internal IB plugin, the missing external `libnccl-net.so`/SHARP, or switch-fabric policy, rather than simply a bad link, an unreachable HCA, or GDR being off.
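
The edge-line rows of the table above can be cross-checked against each other. The following sketch only re-adds the reported counts; the dictionary keys and `edge_total` are illustrative names, not fields produced by the diagnosis script.

```python
# Line counts copied from the GRAPH/TUNING table above.
edges = {
    "allreduce": {"net_ib_gdrdma": 256, "p2p_cumem": 0, "reported_total": 256},
    "alltoall_pxn_off": {"net_ib_gdrdma": 512, "p2p_cumem": 224, "reported_total": 736},
}


def edge_total(test):
    """Sum of NET/IB GDRDMA and P2P/CUMEM edge lines for one test."""
    row = edges[test]
    return row["net_ib_gdrdma"] + row["p2p_cumem"]


edge_ratio = edge_total("alltoall_pxn_off") / edge_total("allreduce")
print(f"alltoall has {edge_ratio:.3f}x the edge lines of allreduce")
```

The sums match the reported totals, and the alltoall run logs a little under three times as many edge lines as allreduce.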

## PXN Disabled Sweep

Every case uses `NCCL_PXN_DISABLE=1`, 16G, and 2x8 GPUs as its base configuration.

| Case | Extra parameters | Avg Bus BW (GB/s) |
|---|---|---:|
| baseline | none | `36.8024` |
| nvls_off | `NCCL_NVLS_ENABLE=0` | `36.8095` |
| qps4_split1 | `NCCL_IB_QPS_PER_CONNECTION=4 NCCL_IB_SPLIT_DATA_ON_QPS=1` | `30.5464` |
| qps8_split1 | `NCCL_IB_QPS_PER_CONNECTION=8 NCCL_IB_SPLIT_DATA_ON_QPS=1` | `23.9345` |
| qps4_split0 | `NCCL_IB_QPS_PER_CONNECTION=4 NCCL_IB_SPLIT_DATA_ON_QPS=0` | `35.8679` |
| channels16 | `NCCL_MIN_NCHANNELS=16 NCCL_MAX_NCHANNELS=16` | `37.1776` |
| buff8m | `NCCL_BUFFSIZE=8388608` | `37.0265` |
| p2pchunk4m | `NCCL_P2P_NET_CHUNKSIZE=4194304` | `37.0188` |
| netpeer8 | `NCCL_NCHANNELS_PER_NET_PEER=8` | `31.103` |
| ar0 | `NCCL_IB_AR_THRESHOLD=0` | `36.9965` |

Conclusions:

- `channels16`, `buff8m`, `p2pchunk4m`, and `ar0` vary by only about 0.2-1.0%, which cannot be counted as an effective optimization.
- `qps4_split1`, `qps8_split1`, and `netpeer8` are clearly negative.
- Applying the PDF's fixed QP/split parameters to the current 8-GPU alltoall is not recommended.
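
The sweep table is easier to read as percentage change against its own baseline row. This is a minimal sketch over the reported averages; `sweep` and `delta_pct` are illustrative names.

```python
# Avg bus bandwidth in GB/s for each sweep case, copied from the table above.
sweep = {
    "baseline": 36.8024,
    "nvls_off": 36.8095,
    "qps4_split1": 30.5464,
    "qps8_split1": 23.9345,
    "qps4_split0": 35.8679,
    "channels16": 37.1776,
    "buff8m": 37.0265,
    "p2pchunk4m": 37.0188,
    "netpeer8": 31.103,
    "ar0": 36.9965,
}


def delta_pct(case):
    """Percent change of a case relative to the sweep baseline."""
    base = sweep["baseline"]
    return (sweep[case] - base) / base * 100


# Print the cases from worst to best.
for case in sorted(sweep, key=delta_pct):
    print(f"{case:12s} {delta_pct(case):+7.2f}%")
```

Measured this way, the four near-baseline cases sit within about one percent of the baseline, while `qps4_split1`, `netpeer8`, and `qps8_split1` lose roughly 15% to 35%.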

## Script fix verification

The rerun revealed that, after GRAPH mode, the script carried `NCCL_DEBUG=INFO` over into the sweep, which made the sweep logs excessively large. In addition, OpenMPI printed a warning for every `-x` variable that was not set.

Fixed:

- `set_common_env` resets each case to the default `NCCL_DEBUG=WARN`.
- `mpi_xargs` exports only environment variables that are already set.

Verification:

- Local `bash -n scripts/multinode_nccl_deep_diagnose.sh` passes.
- The remote 1M tiny `all` smoke test passes.
- The count of `could not find environment variable` in the tiny artifacts is `0`.
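
The two fixes follow a common pattern: reset shared state at the start of every case, and pass `-x NAME` to `mpirun` only for variables that actually exist. The real script is Bash and its code is not shown in this report, so the Python below is only a sketch of the idea; `common_env` and `mpi_x_args` are hypothetical stand-ins for `set_common_env` and `mpi_xargs`.

```python
import os


def common_env(overrides=None):
    """Start every case from NCCL_DEBUG=WARN, then apply per-case overrides."""
    env = {"NCCL_DEBUG": "WARN"}
    env.update(overrides or {})
    return env


def mpi_x_args(names, env=None):
    """Build `-x NAME` arguments for mpirun, skipping variables that are unset."""
    env = os.environ if env is None else env
    args = []
    for name in names:
        if name in env:
            args += ["-x", name]
    return args


# A GRAPH case may raise the log level, but the next case starts from WARN again.
graph_env = common_env({"NCCL_DEBUG": "INFO", "NCCL_PXN_DISABLE": "1"})
sweep_env = common_env({"NCCL_PXN_DISABLE": "1", "NCCL_BUFFSIZE": "8388608"})

candidates = ["NCCL_DEBUG", "NCCL_PXN_DISABLE", "NCCL_BUFFSIZE", "NCCL_IB_AR_THRESHOLD"]
print(mpi_x_args(candidates, sweep_env))
```

Because each case builds its environment from scratch, a debug level set for one case cannot leak into the next, and unset variables never reach the `mpirun` command line.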
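
## Current assessment

1. The allreduce high baseline is stable; 2x8 remains at about `354 GB/s`.
2. alltoall holds steady at only `36-37 GB/s`, even with PXN disabled and balanced rails.
3. No obvious bad link, retransmission, packet loss, unreachable HCA, or disabled GDR was found.
4. The hardware layout of the current four 400G rails appears not to be equivalent to the setup behind the PDF targets; back-calculating from the PDF 2x8 allreduce target of `491.84 GB/s` requires more than the one-way theoretical limit of the current 4 rails.
5. alltoall still needs further investigation on three fronts: the NCCL net plugin/SHARP, the switch path/ECMP/congestion control, and the NCCL alltoall graph strategy.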
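
The rail limit mentioned in item 4 can be estimated from the preflight table. The sketch below assumes decimal units (1 GB/s = 8 Gb/s) and ignores protocol overhead; `aggregate_one_way_gbyte_per_s` is an illustrative helper. Bus bandwidth in nccl-tests is a derived metric, so this is an order-of-magnitude comparison rather than a strict bound.

```python
RAILS = 4               # mlx5_0, mlx5_1, mlx5_6, mlx5_7 from the preflight table
RAIL_GBIT_PER_S = 400   # per-rail line rate reported as 400 Gb/sec
PDF_ALLREDUCE_TARGET = 491.84  # GB/s, PDF target for 2 nodes with 16 GPUs


def aggregate_one_way_gbyte_per_s(rails, gbit_per_s):
    """Aggregate one-way line rate per node in GB/s, ignoring protocol overhead."""
    return rails * gbit_per_s / 8


limit = aggregate_one_way_gbyte_per_s(RAILS, RAIL_GBIT_PER_S)
ratio = PDF_ALLREDUCE_TARGET / limit
print(f"4-rail one-way line rate: {limit:.0f} GB/s")
print(f"PDF target / line rate:   {ratio:.2f}x")
```

Under these assumptions the four rails give 200 GB/s one way per node, and the PDF allreduce target is about 2.5 times that figure.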