Compare commits
10 Commits
05294a66d8...ec6b868d3f

| SHA1 |
|---|
| ec6b868d3f |
| 3a0e739991 |
| b914cb7b4b |
| adcdb36e05 |
| 0d63ea5e05 |
| 08e0d93a16 |
| 9a32645c9d |
| 71ac97a24e |
| 890e623be4 |
| 17441a4583 |

11
README.md
@ -575,6 +575,17 @@ report:
└── Confirm: training loss decreases normally
```

#### Multi-node NCCL deep diagnosis

When the SOP-3 multi-node NCCL results do not match the acceptance PDF, run the deep-diagnosis script on the launching node to reproduce the counter capture, the GRAPH/TUNING logs, and the PXN-disabled sweep:

```bash
bash scripts/multinode_nccl_deep_diagnose.sh preflight
bash scripts/multinode_nccl_deep_diagnose.sh all
```

For detailed parameters, the output directory, and how to interpret the results, see [docs/multinode_nccl_deep_diagnose_runbook.md](/Users/d-robotics/lab/test_gpu_scripts/docs/multinode_nccl_deep_diagnose_runbook.md).

---

### SOP-4: Troubleshooting

201
docs/multinode_nccl_deep_diagnose_runbook.md
Normal file
@ -0,0 +1,201 @@
# Multi-node NCCL deep-diagnosis runbook

This document reproduces the key steps from the 2026-05-23 round of 2-node, 8-GPU-per-node NCCL troubleshooting: counter capture, GRAPH/TUNING logs, and a second parameter sweep on top of the PXN-disabled baseline.

## Applicable scenario

The current default parameters target:

- `aikubeworker0012` / `172.72.8.12`
- `aikubeworker0016` / `172.72.8.16`
- 8 GPUs per node
- Four 400G HCAs per node: `mlx5_0,mlx5_1,mlx5_6,mlx5_7`
- Temporary NCCL runtime library: `/tmp/nccl-2.27.7-cuda12.4`
- nccl-tests: `/data/nccl-tests-latest/build`
- OpenMPI: `/usr/mpi/gcc/openmpi-4.1.9a1/bin/mpirun`

Run the script on the coordinator node, which is currently `aikubeworker0012`.

## Quick run

```bash
cd /root/test_gpu_scripts
bash scripts/multinode_nccl_deep_diagnose.sh preflight
bash scripts/multinode_nccl_deep_diagnose.sh all
```

The default output directory is:

```text
/tmp/nccl_deep_diagnose_YYYYMMDD_HHMMSS
```

To run a single item:

```bash
# Lightweight check of SSH, mpirun, nccl-tests, and the HCA paths
bash scripts/multinode_nccl_deep_diagnose.sh preflight

# allreduce counter comparison
bash scripts/multinode_nccl_deep_diagnose.sh allreduce-counter

# PXN-disabled alltoall counters
bash scripts/multinode_nccl_deep_diagnose.sh alltoall-counter

# NCCL GRAPH/TUNING/COLL comparison
bash scripts/multinode_nccl_deep_diagnose.sh graph

# Second parameter sweep on top of the PXN-disabled baseline
bash scripts/multinode_nccl_deep_diagnose.sh pxn-sweep
```

## Common parameter overrides

```bash
OUT_DIR=/tmp/my_nccl_diag \
HOSTS=172.72.8.12:8,172.72.8.16:8 \
PEER_HOST=172.72.8.16 \
HCAS="mlx5_0 mlx5_1 mlx5_6 mlx5_7" \
HCA_CSV=mlx5_0,mlx5_1,mlx5_6,mlx5_7 \
bash scripts/multinode_nccl_deep_diagnose.sh all
```

If the nccl-tests or NCCL runtime library paths change:

```bash
NCCL_TESTS_DIR=/opt/gpu-test-tools/nccl-tests/build \
NCCL_LD_LIBRARY_PATH=/usr/mpi/gcc/openmpi-4.1.9a1/lib:/path/to/nccl/lib:/usr/local/cuda/lib64 \
bash scripts/multinode_nccl_deep_diagnose.sh graph
```

## Interpreting the output

### preflight mode

Typical output file:

```text
preflight.txt
```

This mode runs no NCCL workload. It only checks:

- The local and peer hostnames.
- Whether OpenMPI `mpirun` exists and is executable.
- Whether `all_reduce_perf` / `alltoall_perf` exist and are executable.
- Whether state/rate can be read under `/sys/class/infiniband/<hca>/ports/1` for each configured HCA.
- Whether root SSH from the launching node to `PEER_HOST` works.

If `MISSING` appears here, fix the environment first, and only then run `all` or a single diagnostic.
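
The HCA item in the checklist above can also be spot-checked by hand. The following is a minimal sketch, assuming the standard sysfs layout `/sys/class/infiniband/<hca>/ports/1/{state,rate}`. The function name `check_hca` and the `SYSFS_IB` override are hypothetical and are not taken from the diagnosis script.

```bash
# Hypothetical helper, not the script's actual code.
# SYSFS_IB may be overridden (for example in tests); it defaults to the real sysfs tree.
SYSFS_IB="${SYSFS_IB:-/sys/class/infiniband}"

check_hca() {
    hca="$1"
    port_dir="$SYSFS_IB/$hca/ports/1"
    if [ -r "$port_dir/state" ] && [ -r "$port_dir/rate" ]; then
        # Print the raw contents of the state and rate files for this port.
        printf '%s state=%s rate=%s\n' "$hca" "$(cat "$port_dir/state")" "$(cat "$port_dir/rate")"
    else
        # Mirror the MISSING marker that the runbook tells you to look for.
        printf '%s MISSING\n' "$hca"
        return 1
    fi
}
```

A typical call would loop over the configured HCAs, for example `for h in mlx5_0 mlx5_1 mlx5_6 mlx5_7; do check_hca "$h"; done`, on both the launching node and the peer.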

### counter mode

Typical output files:

```text
allreduce_counter/
  allreduce.log
  before.local
  before.remote
  after.local
  after.remote
  counter_delta.txt

alltoall_pxn_counter/
  alltoall_pxn.log
  before.local
  before.remote
  after.local
  after.remote
  counter_delta.txt
```

Focus on `counter_delta.txt`:

- `port_xmit_data` / `port_rcv_data`: port traffic, in units of 4-byte words; the script also converts it to GiB.
- `port_xmit_wait`: a signal of transmit waiting, or of credit/congestion waiting. Note that it is not a root cause unique to alltoall, because high-throughput allreduce shows it as well.
- `port_xmit_discards`, `port_rcv_errors`, `symbol_error`, `roce_adp_retrans`, `packet_seq_err`, and similar: signals of errors, packet loss, retransmission, or link faults.
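
The word-to-GiB conversion mentioned for `port_xmit_data` / `port_rcv_data` is simple arithmetic. The sketch below shows it under the stated assumption that one counter unit is 4 bytes; `counter_delta_gib` is a hypothetical name, and how the script itself parses the `before.*` / `after.*` snapshots is not shown in this document.

```bash
# Hypothetical helper: turn a before/after pair of port_xmit_data or
# port_rcv_data readings into GiB. One counter unit is a 4-byte word,
# so bytes = words * 4 and GiB = bytes / 2^30.
counter_delta_gib() {
    before="$1"
    after="$2"
    awk -v b="$before" -v a="$after" \
        'BEGIN { printf "%.2f\n", (a - b) * 4 / (1024 * 1024 * 1024) }'
}
```

For example, a delta of 268435456 words corresponds to exactly 1 GiB.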

Currently known baselines:

- allreduce reaches about `354 GB/s busbw`, with the four rails balanced.
- PXN-disabled alltoall usually sits around `36-37 GB/s busbw`, but varies between test windows.
- With PXN disabled, the alltoall rails are balanced and there is no obvious error/retrans/slow restart.

### graph mode

Typical output files:

```text
graph/
  allreduce.log
  allreduce_summary.txt
  alltoall_pxn.log
  alltoall_pxn_summary.txt
```

Focus on:

- `nccl_version`
- `plugin_missing`
- `gdr_enabled_lines`
- `pattern_counts`
- `channel_summary`
- `NET/IB/*/GDRDMA`
- `P2P/CUMEM`
- `channel_edge_lines`

Currently known comparison:

| Item | allreduce | alltoall + `NCCL_PXN_DISABLE=1` |
|--------|-----------|----------------------------------|
| HCA / GDR | 4 HCA, GDR enabled | 4 HCA, GDR enabled |
| channels | `16 coll / 16 nvls / 16 p2p` | `16 coll / 16 nvls / 16 p2p` |
| `NET/IB/*/GDRDMA` channel edge lines | `256` | `512` |
| `P2P/CUMEM` channel edge lines | `0` | `224` |
| total NET/P2P channel edge lines | `256` | `736` |

How far the conclusions carry:

- If the HCA/GDR/channel base state is identical but the alltoall graph is clearly more complex, the problem points more toward the NCCL collective graph, the way P2P and NET paths are combined, the internal IB plugin, or switch network policy.
- If GDR is disabled, the HCA set is incomplete, or the plugin path has changed, the results cannot be compared directly with the conclusions of the current report.
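
The edge-line counts in the comparison table can be approximated directly from a captured log. This is a rough sketch only: `count_edges` is a hypothetical helper, and the two grep patterns are assumptions about how the transports named above (`NET/IB/*/GDRDMA`, `P2P/CUMEM`) appear in NCCL INFO lines, not the patterns the script actually uses.

```bash
# Hypothetical summary helper. It counts log lines that mention each
# transport and prints the two counts plus their sum.
count_edges() {
    log="$1"
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
    net=$(grep -c 'NET/IB/.*GDRDMA' "$log" || true)
    p2p=$(grep -c 'P2P/CUMEM' "$log" || true)
    printf 'net_ib_gdrdma=%s p2p_cumem=%s total=%s\n' "$net" "$p2p" "$((net + p2p))"
}
```

Run it once per log, for example `count_edges graph/allreduce.log` and `count_edges graph/alltoall_pxn.log`, and compare the two output lines.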

### pxn-sweep mode

Typical output:

```text
pxn_sweep/
  baseline.log
  nvls_off.log
  qps4_split1.log
  qps8_split1.log
  qps4_split0.log
  channels16.log
  buff8m.log
  p2pchunk4m.log
  netpeer8.log
  ar0.log
  summary.txt
```

Current conclusions:

- `NCCL_PXN_DISABLE=1` is the only setting found so far with a stable positive effect.
- Stacking NVLS, P2P chunk, buffer, channel, QP/split, or AR settings on top of the PXN-disabled baseline gives no stable gain.
- QP/split and `NCCL_NCHANNELS_PER_NET_PEER=8` are clearly worse in the current environment.
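
## Key points for handoff to the network/NCCL environment team

1. This is not an old-NCCL or GDR-disabled problem: under NCCL `2.27.7`, all four HCAs are GDR enabled.
2. This is not a case of traffic landing entirely on the wrong rails: with `NCCL_PXN_DISABLE=1`, the four alltoall rails are balanced.
3. This is not an obvious bad-link or retransmission problem: no growth was seen in discard, symbol error, RoCE retrans, slow restart, packet sequence error, or similar counters.
4. allreduce is already close to the physically available bandwidth of the current 4 x 400G rails; working backward from the PDF 8-GPU allreduce target gives a requirement above the theoretical one-way bandwidth of the current four rails.
5. The remaining alltoall gap looks more like an issue with the NCCL internal alltoall graph, the way P2P and NET paths are combined, the missing NCCL net plugin/SHARP, or switch network policy/ECMP/congestion control.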
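
Item 4 above can be checked with two lines of arithmetic. The sketch below inverts the allreduce relation used elsewhere in these reports, `busbw = algbw * 2 * (nranks - 1) / nranks`, and compares the result with the one-way line rate of the rails. The function names are hypothetical, and the rail limit ignores protocol overhead.

```bash
# Hypothetical helpers for the feasibility check.

# allreduce: busbw = algbw * 2 * (nranks - 1) / nranks, solved for algbw.
required_algbw() {
    busbw="$1"
    nranks="$2"
    awk -v bw="$busbw" -v n="$nranks" 'BEGIN { printf "%.2f\n", bw * n / (2 * (n - 1)) }'
}

# One-way line rate of N rails at G Gb/s each, in GB/s (8 bits per byte).
rail_limit_gbytes() {
    rails="$1"
    gbits="$2"
    awk -v r="$rails" -v g="$gbits" 'BEGIN { printf "%.2f\n", r * g / 8 }'
}
```

With the PDF target of `491.84 GB/s` busbw on 16 ranks, `required_algbw 491.84 16` gives about 262 GB/s, while `rail_limit_gbytes 4 400` gives 200 GB/s, which is the mismatch the item describes.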

## Related reports

- `reports_multinode_nccl_diagnosis_20260523.md`
- `reports_multinode_nccl_alltoall_tuning_20260523.md`
- `reports_multinode_nccl_counter_probe_20260523.md`
- `reports_multinode_nccl_pdf_matrix_nccl227.md`
@ -10,7 +10,11 @@
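
`NCCL_PXN_DISABLE=1` is the only parameter in this round with a positive effect: it raises 8-GPU alltoall from about `30.06 GB/s` to about `37.24 GB/s`. After it was added to the formal PDF matrix configuration, the raw report result for 8-GPU alltoall was `36.70 GB/s peak` / `36.74 GB/s avg`.

This improvement is useful in practice but is still far below the PDF reference of `76.54 GB/s`. The other parameters brought no improvement, and some were clearly worse:
A supplementary counter probe shows that what `NCCL_PXN_DISABLE=1` actually does is redistribute alltoall traffic evenly across the four 400G rails. In the baseline, the traffic ratio of `mlx5_0/6` to `mlx5_1/7` is about 3:1; with PXN disabled, the four HCAs are balanced. Even so, each rail only carries about `19-20 GB/s` and does not saturate a 400G rail.

A retest of the error/congestion counters showed no growth in discards, link errors, RoCE retransmissions, slow restarts, or packet sequence errors; the main non-zero anomaly is `port_xmit_wait` on some ports. However, the allreduce comparison shows the same kind of `port_xmit_wait` at `354 GB/s busbw`, so the data does not support the claim that bad packets or retransmissions on the links cause the slowdown, and `port_xmit_wait` alone cannot explain the low alltoall throughput. More likely directions are the efficiency of the NCCL internal alltoall communication pattern, switch-side scheduling/congestion control, or the missing NCCL net plugin/SHARP.

This improvement is useful in practice but is still far below the PDF reference of `76.54 GB/s`. A parameter sweep run before `NCCL_PXN_DISABLE=1` was adopted showed no improvement from the other parameters, and some were clearly worse:

| Case | Avg Bus BW | Verdict |
|------|------------|------|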
@ -28,6 +32,109 @@
| `NCCL_IB_ADAPTIVE_ROUTING=0` | `30.0535 GB/s` | No improvement |
| `NCCL_IB_PCI_RELAXED_ORDERING=0` | Did not complete | Clearly abnormal, not recommended |

With `NCCL_PXN_DISABLE=1` as the baseline, a further sweep of stacked parameters was run. In the short test window, `NVLS_ENABLE=0` and `P2P_NET_CHUNKSIZE=4M` showed small gains within normal fluctuation, but a longer `-w 10 -n 10` retest did not reproduce them, so they cannot be treated as stable optimizations.

| Case | Avg Bus BW | Verdict |
|------|------------|------|
| `NCCL_PXN_DISABLE=1` | `37.0069 GB/s` | Short-run baseline |
| `+ NCCL_NVLS_ENABLE=0` | `37.2217 GB/s` | Small fluctuation, not stable |
| `+ NCCL_P2P_NET_CHUNKSIZE=4194304` | `37.2522 GB/s` | Small fluctuation, not stable |
| `+ NCCL_BUFFSIZE=8388608` | `37.0911 GB/s` | No material improvement |
| `+ NCCL_MIN_NCHANNELS=16 NCCL_MAX_NCHANNELS=16` | `37.0189 GB/s` | No material improvement |
| `+ NCCL_IB_AR_THRESHOLD=0` | `37.0843 GB/s` | No material improvement |
| `+ NCCL_IB_QPS_PER_CONNECTION=4 NCCL_IB_SPLIT_DATA_ON_QPS=0` | `35.9847 GB/s` | Worse |
| `+ NCCL_IB_QPS_PER_CONNECTION=4 NCCL_IB_SPLIT_DATA_ON_QPS=1` | `29.8406 GB/s` | Clearly worse |
| `+ NCCL_IB_QPS_PER_CONNECTION=8 NCCL_IB_SPLIT_DATA_ON_QPS=1` | `24.1183 GB/s` | Clearly worse |
| `+ NCCL_NCHANNELS_PER_NET_PEER=8` | `29.8904 GB/s` | Clearly worse |

Long-run recheck:

| Case | Avg Bus BW | Verdict |
|------|------------|------|
| `NCCL_PXN_DISABLE=1` | `32.7280 GB/s` | Baseline dropped in this window |
| `+ NCCL_P2P_NET_CHUNKSIZE=4194304` | `31.9340 GB/s` | Short-run gain not reproduced |
| `+ NCCL_NVLS_ENABLE=0 NCCL_P2P_NET_CHUNKSIZE=4194304` | `27.6585 GB/s` | Clearly worse |

Supplementary ENV/INIT/NET logs confirm that, while performance fluctuated, the setup was still NCCL `2.27.7+cuda12.4`, four 400G HCAs, GDR enabled, and the internal IB plugin; it had not fallen back to an old NCCL, picked the wrong HCAs, or lost GDR.

## NCCL GRAPH/TUNING comparison

To avoid looking only at bandwidth results, `NCCL_DEBUG_SUBSYS=INIT,NET,GRAPH,TUNING,COLL` logs were also captured for allreduce and for PXN-disabled alltoall. These logs were sampled with short iterations and are used only to inspect the NCCL graph and channel selection, not as performance results.

In common:

| Item | allreduce | alltoall + `NCCL_PXN_DISABLE=1` |
|--------|-----------|----------------------------------|
| NCCL version | `2.27.7+cuda12.4` | `2.27.7+cuda12.4` |
| HCA | `mlx5_0,mlx5_1,mlx5_6,mlx5_7` | `mlx5_0,mlx5_1,mlx5_6,mlx5_7` |
| GDR | enabled | enabled |
| external net plugin | missing, internal IB | missing, internal IB |
| channels | `16 coll / 16 nvls / 16 p2p` | `16 coll / 16 nvls / 16 p2p` |
| p2p channels per peer | `2` | `2` |
| P2P chunk | `131072` | `131072` |

Differences:

| Item | allreduce | alltoall + `NCCL_PXN_DISABLE=1` |
|--------|-----------|----------------------------------|
| Pattern 4 | `crossNic 0`, `type NVL/PXN`, `nChannels 8` | `crossNic 2`, `type NVL/PIX`, `nChannels 8` |
| `NET/IB/*/GDRDMA` channel edge lines | `256` | `512` |
| `P2P/CUMEM` channel edge lines | `0` | `224` |
| total NET/P2P channel edge lines | `256` | `736` |

Assessment: with PXN disabled, all four IB/GDRDMA rails are still in use and the channel count has not dropped, but the alltoall NCCL graph is clearly more complex and mixes in a large number of node-local `P2P/CUMEM` paths. This further supports the view that the remaining gap is not caused by the HCA/GDR base environment failing to take effect, but by the alltoall collective graph, the way P2P and NET paths are combined, the capability of the internal IB plugin, or switch network policy.

## PXN-disabled port counters

With `NCCL_PXN_DISABLE=1`, 8-GPU alltoall reports:

| Metric | Value |
|--------|-------|
| `algbw` | `39.37 / 39.46 GB/s` |
| `busbw` | `36.91 / 37.00 GB/s` |
| `Avg bus bandwidth` | `36.9518 GB/s` |

Port counters:

| Host | HCA | Xmit GB | Recv GB | Xmit GB/s | Recv GB/s |
|------|-----|---------|---------|-----------|-----------|
| 172.72.8.12 | `mlx5_0` | `590.98` | `590.91` | `19.82` | `19.82` |
| 172.72.8.12 | `mlx5_1` | `590.98` | `590.98` | `19.82` | `19.82` |
| 172.72.8.12 | `mlx5_6` | `590.98` | `590.90` | `19.82` | `19.82` |
| 172.72.8.12 | `mlx5_7` | `590.98` | `590.98` | `19.82` | `19.82` |
| 172.72.8.16 | `mlx5_0` | `590.94` | `590.98` | `19.82` | `19.82` |
| 172.72.8.16 | `mlx5_1` | `590.94` | `590.98` | `19.82` | `19.82` |
| 172.72.8.16 | `mlx5_6` | `590.94` | `590.98` | `19.82` | `19.82` |
| 172.72.8.16 | `mlx5_7` | `590.94` | `590.98` | `19.82` | `19.82` |

Compared with the baseline:

| Case | Rail distribution | Avg Bus BW |
|------|-----------|------------|
| baseline | `mlx5_0/6` about `885 GB`, `mlx5_1/7` about `295 GB` | `30.04 GB/s` |
| `NCCL_PXN_DISABLE=1` | all four HCAs about `591 GB` | `36.95 GB/s` |

### Error/wait counter retest

PXN-disabled retest results:

| Item | Result |
|--------|------|
| `Avg bus bandwidth` | `36.4512 GB/s` |
| Traffic per HCA | about `712.18-712.28 GiB`, balanced across the four rails |
| discard / rcv error / symbol error / link down / link recovery | `0` delta |
| RoCE retrans / slow restart / packet sequence error / out of sequence | `0` delta |
| `port_xmit_wait` | grew on `mlx5_1` and `mlx5_7`, by about `15.65M-23.49M` |

allreduce comparison:

| Item | Result |
|--------|------|
| `Avg bus bandwidth` | `354.366 GB/s` |
| Traffic per HCA | about `178.03-178.07 GiB`, balanced across the four rails |
| Error/retransmission counters | `0` delta |
| `port_xmit_wait` | grew on `mlx5_1` and `mlx5_7`, by about `6.11M-6.59M` |

## Formal configuration update

For alltoall on 2 nodes x 8 GPUs, `configs/multinode_nccl_nccl227_pdf_matrix.yaml` now adds:
@ -47,5 +154,7 @@ op_env:
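## Assessment

1. In the current topology PXN hurts 8-GPU alltoall; disabling it gives an improvement of about `22-24%`.
2. Even with PXN disabled, the result is only about half of the PDF target; no single NCCL environment variable can close the remaining gap.
3. Follow-up work should still focus on the NCCL net plugin/SHARP, switch network policy, routing/congestion, and alltoall rail distribution.
2. Disabling PXN fixes the unbalanced rail distribution but cannot saturate each 400G rail.
3. Stacking NVLS, P2P chunk, buffer, channel, QP/split, AR, and similar parameters on top of the PXN-disabled baseline gives no stable gain; QP/split and `NCCL_NCHANNELS_PER_NET_PEER=8` are in fact clearly worse.
4. Even with PXN disabled, the result is only about half of the PDF target; no single NCCL environment variable can close the remaining gap.
5. Follow-up work should still focus on the NCCL net plugin/SHARP, switch network policy, and the efficiency of the NCCL internal alltoall implementation; `port_xmit_wait` must be read against the allreduce comparison and cannot stand alone as the alltoall root cause.
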
@ -14,6 +14,10 @@
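
8-GPU alltoall is still only `30 GB/s busbw`, and HCA order is not the cause: the HCA-order sweeps all stay within `30.02-30.07 GB/s`. Counters show that alltoall traffic is concentrated on `mlx5_0` and `mlx5_6`, while `mlx5_1` and `mlx5_7` carry only about one third as much, which suggests the remaining problem is more likely NCCL alltoall rail distribution, routing, congestion, the NCCL net plugin/SHARP, or network-side policy.

A supplementary test shows that `NCCL_PXN_DISABLE=1` spreads alltoall traffic evenly across the four HCAs and raises busbw to about `36.5-37.0 GB/s`. Even so, each 400G rail still carries only about `19-20 GB/s`, well short of what a single rail achieves with raw RDMA.

Further capture of `counters`/`hw_counters` showed no growth in error counters such as discard, CRC/symbol error, packet sequence error, RoCE retrans, or slow restart; only `port_xmit_wait` grew on some ports. The allreduce comparison shows the same kind of `port_xmit_wait` at `354 GB/s busbw`, so `port_xmit_wait` is not a sufficient explanation for low alltoall throughput; it only indicates that the sender is waiting. The remaining problem looks more like the NCCL internal alltoall communication pattern, switch network scheduling/congestion control, or the missing NCCL net plugin/SHARP capability.

## Raw RDMA, four rails concurrently

Command type: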
@ -58,6 +62,40 @@ busbw = algbw * 2 * (nranks - 1) / nranks

The current `189.12 GB/s algbw` is already close to the theoretical one-way aggregate bandwidth of `4 x 400Gb/s = 200 GB/s`.

### allreduce counter comparison

Counters were re-measured for a 16G allreduce on the same 2 nodes x 8 GPUs with the same four HCAs:

| Metric | Value |
|--------|-------|
| `algbw` | `189.22 / 188.77 GB/s` |
| `busbw` | `354.79 / 353.94 GB/s` |
| `Avg bus bandwidth` | `354.366 GB/s` |
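
The `algbw` and `busbw` rows are related by a fixed factor per collective. The sketch below applies the allreduce relation quoted in this report, `busbw = algbw * 2 * (nranks - 1) / nranks`, and, as an assumption taken from the nccl-tests performance notes rather than from this report, the alltoall relation `busbw = algbw * (nranks - 1) / nranks`. The function names are hypothetical.

```bash
# Hypothetical helpers for converting nccl-tests algbw to busbw.

# allreduce: each byte crosses the bus 2 * (n - 1) / n times.
busbw_allreduce() {
    awk -v a="$1" -v n="$2" 'BEGIN { printf "%.2f\n", a * 2 * (n - 1) / n }'
}

# alltoall (assumed factor): only (n - 1) / n of the data leaves each rank.
busbw_alltoall() {
    awk -v a="$1" -v n="$2" 'BEGIN { printf "%.2f\n", a * (n - 1) / n }'
}
```

For 16 ranks, `busbw_allreduce 189.22 16` reproduces the `354.79` in the table above, and `busbw_alltoall 39.37 16` reproduces the `36.91` reported for PXN-disabled alltoall.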

Traffic distribution:

| Host | HCA | Xmit GiB | Recv GiB |
|------|-----|----------|----------|
| aikubeworker0012 | `mlx5_0` | `178.07` | `178.03` |
| aikubeworker0012 | `mlx5_1` | `178.07` | `178.07` |
| aikubeworker0012 | `mlx5_6` | `178.07` | `178.03` |
| aikubeworker0012 | `mlx5_7` | `178.07` | `178.07` |
| aikubeworker0016 | `mlx5_0` | `178.03` | `178.07` |
| aikubeworker0016 | `mlx5_1` | `178.07` | `178.07` |
| aikubeworker0016 | `mlx5_6` | `178.03` | `178.07` |
| aikubeworker0016 | `mlx5_7` | `178.07` | `178.07` |

Error counter deltas are likewise `0`. The non-zero wait counters are:

| Host | HCA | `port_xmit_wait` delta |
|------|-----|------------------------|
| aikubeworker0012 | `mlx5_1` | `6,555,518` |
| aikubeworker0012 | `mlx5_7` | `6,325,059` |
| aikubeworker0016 | `mlx5_1` | `6,585,965` |
| aikubeworker0016 | `mlx5_7` | `6,112,874` |

Assessment: allreduce also shows `port_xmit_wait` when it runs near the physical limit of the current 4 x 400G rails, so this counter alone cannot explain why alltoall reaches only `36-37 GB/s`. The alltoall problem points more toward communication-pattern efficiency or network scheduling policy than toward simple link errors.

## 8-GPU alltoall

NCCL output:
@ -93,10 +131,79 @@ NCCL output:
| `mlx5_1,mlx5_0,mlx5_7,mlx5_6` | `30.0413 GB/s` |
| `mlx5_6,mlx5_7,mlx5_0,mlx5_1` | `30.0230 GB/s` |

## PXN-disabled alltoall counters

With `NCCL_PXN_DISABLE=1`:

| Metric | Value |
|--------|-------|
| `Avg bus bandwidth` | `36.9518 GB/s` |
| Traffic per HCA | about `590.94-590.98 GB` |
| Throughput per HCA | about `19.82 GB/s` |
| Aggregate throughput of 4 HCAs per node | about `79.29 GB/s` |
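
To put the per-HCA throughput in the table above into perspective, it helps to express it as a share of the rail's line rate. A minimal sketch, assuming a nominal 400 Gb/s per rail and ignoring protocol overhead; `rail_utilization_pct` is a hypothetical name.

```bash
# Hypothetical helper: percentage of a rail's nominal line rate that a
# measured throughput represents. Throughput is in GB/s, link speed in Gb/s.
rail_utilization_pct() {
    gbytes_per_s="$1"
    link_gbits="$2"
    awk -v t="$gbytes_per_s" -v l="$link_gbits" \
        'BEGIN { printf "%.1f\n", t * 8 / l * 100 }'
}
```

`rail_utilization_pct 19.82 400` works out to roughly 40 percent, which is why balanced rails alone do not close the gap.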

Assessment: disabling PXN fixes the unbalanced rail distribution but does not let alltoall saturate the current four 400G rails.

### PXN-disabled error/congestion counter retest

The retest command is again 2 nodes x 8 GPUs, `alltoall_perf -b 16G -e 16G -w 10 -n 10`, with:

```bash
NCCL_PXN_DISABLE=1
NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_6,mlx5_7
NCCL_NET_PLUGIN=none
NCCL_NET_GDR_LEVEL=5
NCCL_NET_GDR_READ=1
NCCL_DMABUF_ENABLE=0
```

NCCL output:

| Metric | Value |
|--------|-------|
| `algbw` | `39.04 / 38.72 GB/s` |
| `busbw` | `36.60 / 36.30 GB/s` |
| `Avg bus bandwidth` | `36.4512 GB/s` |

Traffic distribution stays balanced:

| Host | HCA | Xmit GiB | Recv GiB |
|------|-----|----------|----------|
| aikubeworker0012 | `mlx5_0` | `712.28` | `712.19` |
| aikubeworker0012 | `mlx5_1` | `712.27` | `712.27` |
| aikubeworker0012 | `mlx5_6` | `712.28` | `712.18` |
| aikubeworker0012 | `mlx5_7` | `712.27` | `712.27` |
| aikubeworker0016 | `mlx5_0` | `712.23` | `712.27` |
| aikubeworker0016 | `mlx5_1` | `712.23` | `712.27` |
| aikubeworker0016 | `mlx5_6` | `712.23` | `712.27` |
| aikubeworker0016 | `mlx5_7` | `712.23` | `712.27` |

Error counter deltas:

| Counter group | Result |
|---------------|--------|
| `port_xmit_discards`, `port_rcv_errors`, `port_rcv_remote_physical_errors`, `port_rcv_switch_relay_errors` | `0` |
| `symbol_error`, `link_error_recovery`, `link_downed`, `local_link_integrity_errors`, `excessive_buffer_overrun_errors` | `0` |
| `roce_adp_retrans`, `roce_adp_retrans_to`, `roce_slow_restart*` | `0` |
| `packet_seq_err`, `out_of_sequence`, `out_of_buffer`, `duplicate_request`, `implied_nak_seq_err` | `0` |
| `local_ack_timeout_err`, `req_transport_retries_exceeded`, `rnr_nak_retry_err` | `0` |

Non-zero wait counters:

| Host | HCA | `port_xmit_wait` delta |
|------|-----|------------------------|
| aikubeworker0012 | `mlx5_1` | `23,492,853` |
| aikubeworker0012 | `mlx5_7` | `17,420,720` |
| aikubeworker0016 | `mlx5_1` | `20,428,901` |
| aikubeworker0016 | `mlx5_7` | `15,650,027` |

Assessment: with PXN disabled, alltoall shows no clear evidence of link errors, retransmissions, or packet loss. Taken together with the allreduce comparison, `port_xmit_wait` is only a transmit-wait signal and cannot alone explain the low alltoall throughput; the remaining performance gap points more toward the communication-pattern efficiency of NCCL internal alltoall in the current topology, switch network scheduling/congestion control, or the missing external NCCL net plugin/SHARP.
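
## Assessment

1. Raw RDMA on four rails concurrently reaches about `184.62 GB/s`, so base network bandwidth is not limited to a single rail.
2. 8-GPU allreduce is not something minor software parameter tuning can fix at this point; its performance is already close to the physical bandwidth limit of the current four 400G rails.
3. 8-GPU alltoall is still clearly abnormal, and HCA order is not the cause; further investigation is needed into NCCL alltoall rail distribution, network routing/congestion, the NCCL net plugin/SHARP, and switch-side policy.
4. If acceptance requires the PDF figures of `491.84/76.54 GB/s` for 2 nodes and 16 GPUs, it must be confirmed whether the current two machines have the same effective number of cross-node rails and the same switch network capability as the PDF reference environment.
5. Neither machine currently has `libnccl-net.so` or the SHARP/HCOLL packages, and NCCL uses the internal IB plugin; if the target values depend on the NCCL net plugin/SHARP, the corresponding runtime environment must be installed first.
3. 8-GPU alltoall is still clearly abnormal, and HCA order is not the cause; with PXN disabled the rails are balanced, and `port_xmit_wait` is not unique to alltoall, so further investigation should cover the NCCL alltoall pattern, switch-side policy, and the NCCL net plugin/SHARP.
4. `NCCL_PXN_DISABLE=1` improves rail balance and performance for 8-GPU alltoall but cannot close the gap to the PDF target.
5. If acceptance requires the PDF figures of `491.84/76.54 GB/s` for 2 nodes and 16 GPUs, it must be confirmed whether the current two machines have the same effective number of cross-node rails and the same switch network capability as the PDF reference environment.
6. Neither machine currently has `libnccl-net.so` or the SHARP/HCOLL packages, and NCCL uses the internal IB plugin; if the target values depend on the NCCL net plugin/SHARP, the corresponding runtime environment must be installed first.
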
125
reports_multinode_nccl_deep_diagnose_run_20260523.md
Normal file
@ -0,0 +1,125 @@
# Multi-node NCCL deep-diagnosis rerun report, 2026-05-23

## Run information

- Launching node: `aikubeworker0012`
- Peer node: `aikubeworker0016`
- Test scale: 2 nodes x 8 GPUs
- NCCL: `2.27.7+cuda12.4`
- nccl-tests: `/data/nccl-tests-latest/build`
- OpenMPI: `/usr/mpi/gcc/openmpi-4.1.9a1/bin/mpirun`
- Remote artifact directory: `/root/test_gpu_scripts/reports/nccl_deep_diag_20260523_103932`
- Diagnosis script: `scripts/multinode_nccl_deep_diagnose.sh all`

## Preflight

Both machines passed the lightweight environment check:

| Item | aikubeworker0012 | aikubeworker0016 |
|---|---:|---:|
| OpenMPI | `4.1.9a1` | `4.1.9a1` |
| `all_reduce_perf` | OK | OK |
| `alltoall_perf` | OK | OK |
| `mlx5_0` | 400 Gb/sec ACTIVE | 400 Gb/sec ACTIVE |
| `mlx5_1` | 400 Gb/sec ACTIVE | 400 Gb/sec ACTIVE |
| `mlx5_6` | 400 Gb/sec ACTIVE | 400 Gb/sec ACTIVE |
| `mlx5_7` | 400 Gb/sec ACTIVE | 400 Gb/sec ACTIVE |

## 16G core results

| Test | Configuration | Avg Bus BW | Verdict |
|---|---|---:|---|
| allreduce | Automatic parameters | `354.025 GB/s` | Reliably reproduces the current high baseline |
| alltoall | `NCCL_PXN_DISABLE=1` | `36.9377 GB/s` | Reliably reproduces the current bottleneck baseline |
| graph allreduce | `NCCL_DEBUG=INFO` | `354.224 GB/s` | Consistent with the counter run |
| graph alltoall | `NCCL_PXN_DISABLE=1`, `NCCL_DEBUG=INFO` | `37.14 GB/s` | Consistent with the counter run |

What this means against the PDF targets:

- 2x8 allreduce is still clearly below the PDF 2-node, 16-GPU target of `491.84 GB/s`.
- 2x8 alltoall is still clearly below the PDF 2-node, 16-GPU target of `76.54 GB/s`.
- This round found no parameter that lifts 8-GPU alltoall off the `36-37 GB/s` plateau.

## Counter observations

### Rail traffic

allreduce sends about `178.03-178.07 GiB` per rail, and alltoall with PXN disabled sends about `712.23-712.28 GiB` per rail. The four 400G rails are balanced in both tests.

### Error/congestion counters

This round showed no growth in hard-error counters such as discard, symbol error, RoCE retrans, slow restart, or packet sequence error.

The counter that did grow is `port_xmit_wait`:

| Test | Counter growth |
|---|---|
| allreduce | `aikubeworker0016 mlx5_1 +6725565`, `mlx5_7 +6103180` |
| alltoall + PXN disabled | `aikubeworker0016 mlx5_1 +20988680`, `mlx5_7 +16271960` |

This shows that `port_xmit_wait` is not unique to alltoall; high-throughput allreduce shows it too. It can be passed to the network team as a signal of switch network/credit waiting, but it cannot alone explain the low alltoall bandwidth.

## GRAPH/TUNING comparison

| Item | allreduce | alltoall + `NCCL_PXN_DISABLE=1` |
|---|---:|---:|
| `avg_busbw` | `354.224` | `37.14` |
| `plugin_missing` | `16` | `16` |
| GDR enabled lines | `1344` | `704` |
| channel summary | `16 coll / 16 nvls / 16 p2p` | `16 coll / 16 nvls / 16 p2p` |
| Pattern 4 | `crossNic 0`, `NVL/PXN` | `crossNic 2`, `NVL/PIX` |
| `NET/IB/*/GDRDMA` lines | `256` | `512` |
| `P2P/CUMEM` lines | `0` | `224` |
| total NET/P2P edge lines | `256` | `736` |

Interpretation:

- The HCAs, GDR, the NCCL version, and the base channel count are not the root cause of the difference.
- The alltoall communication graph is clearly more complex and introduces more NET/P2P edges, and Pattern 4 changes from `NVL/PXN` in allreduce to `NVL/PIX`.
- This continues to support the view that the problem lies with the NCCL alltoall graph strategy, the internal IB plugin, the missing external `libnccl-net.so`/SHARP, or switch network policy, rather than simply a bad link, an unreachable HCA, or GDR being off.

## PXN Disabled Sweep

The baseline for every case is `NCCL_PXN_DISABLE=1`, 16G, 2x8 GPUs.

| Case | Extra parameters | Avg Bus BW |
|---|---|---:|
| baseline | None | `36.8024` |
| nvls_off | `NCCL_NVLS_ENABLE=0` | `36.8095` |
| qps4_split1 | `NCCL_IB_QPS_PER_CONNECTION=4 NCCL_IB_SPLIT_DATA_ON_QPS=1` | `30.5464` |
| qps8_split1 | `NCCL_IB_QPS_PER_CONNECTION=8 NCCL_IB_SPLIT_DATA_ON_QPS=1` | `23.9345` |
| qps4_split0 | `NCCL_IB_QPS_PER_CONNECTION=4 NCCL_IB_SPLIT_DATA_ON_QPS=0` | `35.8679` |
| channels16 | `NCCL_MIN_NCHANNELS=16 NCCL_MAX_NCHANNELS=16` | `37.1776` |
| buff8m | `NCCL_BUFFSIZE=8388608` | `37.0265` |
| p2pchunk4m | `NCCL_P2P_NET_CHUNKSIZE=4194304` | `37.0188` |
| netpeer8 | `NCCL_NCHANNELS_PER_NET_PEER=8` | `31.103` |
| ar0 | `NCCL_IB_AR_THRESHOLD=0` | `36.9965` |

Conclusions:

- `channels16`, `buff8m`, `p2pchunk4m`, and `ar0` fluctuate by only about 0.2-1.0% and cannot be considered effective optimizations.
- `qps4_split1`, `qps8_split1`, and `netpeer8` are clearly negative.
- Applying the fixed QP/split parameters from the PDF is currently not recommended for 8-GPU alltoall.
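
When reading a sweep table like the one above, it is easier to judge each case as a signed percentage change against the baseline. A minimal sketch; `pct_vs_baseline` is a hypothetical helper and is not part of the diagnosis script.

```bash
# Hypothetical helper: signed percentage change of a case relative to the
# baseline average bus bandwidth.
pct_vs_baseline() {
    awk -v b="$1" -v v="$2" 'BEGIN { printf "%+.2f\n", (v - b) / b * 100 }'
}
```

For example, `pct_vs_baseline 36.8024 23.9345` puts `qps8_split1` at roughly a 35 percent loss, while `pct_vs_baseline 36.8024 37.1776` puts `channels16` at about one percent, which is within the window-to-window variation seen elsewhere in these reports.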
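
## Script fix verification

The rerun revealed that, after GRAPH mode, the script carried `NCCL_DEBUG=INFO` over into the sweep, which made the sweep logs excessively large; in addition, OpenMPI prints a warning for every `-x` variable that is not set.

Fixed:

- `set_common_env` resets each case to the default `NCCL_DEBUG=WARN`.
- `mpi_xargs` exports only the environment variables that are already set.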
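
The second fix can be pictured as follows. This is an illustrative sketch only, not the script's actual `mpi_xargs`: the name `mpi_export_args` is hypothetical, and it assumes the arguments are trusted variable names, since they are passed through `eval`.

```bash
# Hypothetical re-implementation of the idea behind the mpi_xargs fix:
# emit "-x NAME" for mpirun only when NAME is set in the current shell,
# so OpenMPI has no unset variable to warn about.
mpi_export_args() {
    for name in "$@"; do
        # ${NAME+set} expands to "set" if NAME is set (even to an empty
        # string) and to nothing if it is unset.
        if eval "[ -n \"\${${name}+set}\" ]"; then
            printf '%s %s ' -x "$name"
        fi
    done
}
```

With `NCCL_DEBUG=WARN` set and `NCCL_BUFFSIZE` unset, `mpi_export_args NCCL_DEBUG NCCL_BUFFSIZE` emits only `-x NCCL_DEBUG`, and the result can be spliced unquoted into the `mpirun` command line.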
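
How it was verified:

- `bash -n scripts/multinode_nccl_deep_diagnose.sh` passes locally.
- A remote 1M tiny `all` smoke test passes.
- In the tiny run's artifacts, the count of `could not find environment variable` is `0`.

## Current assessment

1. The high allreduce baseline is stable; 2x8 still runs at around `354 GB/s`.
2. Even with PXN disabled and the rails balanced, alltoall only holds steady at `36-37 GB/s`.
3. No obvious bad link, retransmission, packet loss, unreachable HCA, or disabled GDR was found.
4. The hardware form of the current four 400G rails appears not to be equivalent to the PDF target environment; working backward from the PDF 2x8 allreduce target of `491.84 GB/s` gives a requirement above the theoretical one-way limit of the current four rails.
5. alltoall still needs further investigation on the NCCL net plugin/SHARP side, the switch path/ECMP/congestion control side, and the NCCL alltoall graph strategy side.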
@ -16,7 +16,7 @@
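
After continuing to align with the matrix in `sx算力节点跨Leaf NCCL测试报告.pdf`, the core problem in the 2-node, 4-GPU tier turned out to be that the default GPU selection does not match GPU-NIC affinity. With `CUDA_VISIBLE_DEVICES=0,1,4,5` selected explicitly, 2-node 4-GPU allreduce recovers to the `333-335 GB/s` range, close to the PDF's `335.48 GB/s`; alltoall with the PDF's fixed NCCL parameters reaches `72.93 GB/s`, close to the PDF's `73.73 GB/s`. The 2-node, 8-GPU tier, however, still reaches only `354.02 GB/s` for allreduce and `30.04 GB/s` for alltoall, a clear gap from the PDF's `491.84/76.54 GB/s`.

After a further sweep of network parameters for 8-GPU alltoall, `NCCL_PXN_DISABLE=1` is the only setting with a positive effect. The formal matrix configuration now adds this variable for alltoall on 2 nodes x 8 GPUs only, and 8-GPU alltoall rises from about `30.04 GB/s` to `36.70 GB/s` peak / `36.74 GB/s` avg, still below the PDF reference of `76.54 GB/s`.
After a further sweep of network parameters for 8-GPU alltoall, `NCCL_PXN_DISABLE=1` is the only setting with a positive effect. The formal matrix configuration now adds this variable for alltoall on 2 nodes x 8 GPUs only, and 8-GPU alltoall rises from about `30.04 GB/s` to `36.70 GB/s` peak / `36.74 GB/s` avg, still below the PDF reference of `76.54 GB/s`. A retest of the port counters shows that, with PXN disabled, traffic on the four rails is balanced and there are no obvious link errors, packet loss, RoCE retransmissions, or slow restarts; the same kind of `port_xmit_wait` also appears in high-throughput allreduce, so it is not a sufficient explanation for low alltoall throughput. Stacking NVLS, P2P chunk, buffer, channel, QP/split, AR, and similar parameters on top of the PXN-disabled baseline gives no stable gain. The NCCL GRAPH/TUNING logs show that the alltoall channel graph is much more complex than the allreduce one and mixes in a large number of node-local `P2P/CUMEM` paths, while the HCA/GDR/channel base state is identical. The remaining gap looks more like the efficiency of the NCCL internal alltoall communication pattern, switch network policy, or the missing NCCL net plugin/SHARP capability.

In addition, the SSH entry point of `nccl-gpu-2` at one point randomly rejected connections because too many unauthenticated connections triggered `MaxStartups`, which caused `mpirun` to fail when launching remote ranks. A temporary SSHD mitigation has been applied, and valid 2-node x 8-GPU allreduce/alltoall reports have been obtained.
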
@ -36,6 +36,10 @@
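12. Added topology-level `cuda_visible_devices`, `env`, and `op_env` configuration, so that environment variables can be set per GPU/NIC affinity and per NCCL op.
13. Generated the PDF-matrix-style raw report `reports_multinode_nccl_pdf_matrix_nccl227.md`, covering 2 nodes with 1/2/4/8 GPUs per node.
14. Ran an NCCL network parameter sweep for 8-GPU alltoall and fixed the effective setting `NCCL_PXN_DISABLE=1` into the PDF matrix configuration.
15. Captured `counters`/`hw_counters` deltas for 8-GPU alltoall with PXN disabled, confirming that the rails are balanced and there are no obvious errors or retransmissions.
16. Captured counters for the same 2x8 allreduce as a comparison, confirming that high-throughput allreduce also shows `port_xmit_wait`, so this counter is not the sole root cause of low alltoall throughput.
17. Continued sweeping NVLS, P2P chunk, buffer, channel, QP/split, AR, and similar parameters on top of the PXN-disabled baseline, confirming that none gives a stable gain and some are clearly worse.
18. Captured `GRAPH/TUNING/COLL` logs for allreduce and PXN-disabled alltoall, confirming that the HCA/GDR/channel base state is the same for both while the alltoall graph is clearly more complex.

## Key evidence
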
@ -285,6 +289,70 @@ NET/IB : GPU Direct RDMA Disabled for HCA 0 'mlx5_0'

Other variables, such as `NCCL_P2P_PXN_LEVEL`, `NCCL_NET_SHARED_COMMS`, `NCCL_NET_SHARED_BUFFERS`, `NCCL_NCHANNELS_PER_NET_PEER`, and `NCCL_IB_ADAPTIVE_ROUTING`, brought no improvement or made things worse.

The PXN-disabled counters show that this parameter does fix the rail distribution:

| Case | Rail distribution | Avg Bus BW |
|------|-----------|------------|
| baseline | `mlx5_0/6` about `885 GB`, `mlx5_1/7` about `295 GB` | `30.04 GB/s` |
| `NCCL_PXN_DISABLE=1` | all four HCAs about `591 GB` | `36.95 GB/s` |

Even with PXN disabled, however, each 400G rail still carries only about `19-20 GB/s`, nowhere near the `347-387 Gb/s` of a single rail under raw RDMA. It therefore addresses the unbalanced rail distribution, which is only part of the alltoall performance problem.

`counters`/`hw_counters` were captured again during the PXN-disabled alltoall retest:

| Item | Result |
|--------|------|
| alltoall `Avg bus bandwidth` | `36.4512 GB/s` |
| Traffic per HCA | about `712.18-712.28 GiB`, balanced across the four rails |
| discard / rcv error / symbol error / link down / link recovery | `0` delta |
| RoCE retrans / slow restart / packet sequence error / out of sequence | `0` delta |
| `port_xmit_wait` | grew on `mlx5_1` and `mlx5_7`, by about `15.65M-23.49M` |

Assessment: there is currently no clear evidence of a bad link, packet loss, or retransmission; `port_xmit_wait` looks more like the sender waiting on credit/congestion control/switch-side scheduling, or NCCL internal alltoall failing to drive rail throughput up in the current topology.

16G allreduce comparison on the same 2 nodes x 8 GPUs with the same four HCAs:

| Item | Result |
|--------|------|
| allreduce `Avg bus bandwidth` | `354.366 GB/s` |
| Traffic per HCA | about `178.03-178.07 GiB`, balanced across the four rails |
| Error/retransmission counters | `0` delta |
| `port_xmit_wait` | grew on `mlx5_1` and `mlx5_7`, by about `6.11M-6.59M` |

Assessment: allreduce also shows `port_xmit_wait` when close to the physical limit, so the core alltoall problem cannot be attributed to this counter alone. The focus should now be on the NCCL alltoall communication pattern, switch network policy, and the difference in NCCL net plugin/SHARP capability.

Second parameter sweep on top of the PXN-disabled baseline:

| Case | Avg Bus BW | Verdict |
|------|------------|------|
| `NCCL_PXN_DISABLE=1` | `37.0069 GB/s` | Short-run baseline |
| `+ NCCL_NVLS_ENABLE=0` | `37.2217 GB/s` | Small fluctuation, not stable |
| `+ NCCL_P2P_NET_CHUNKSIZE=4194304` | `37.2522 GB/s` | Small fluctuation, not stable |
| `+ NCCL_BUFFSIZE=8388608` | `37.0911 GB/s` | No material improvement |
| `+ NCCL_MIN_NCHANNELS=16 NCCL_MAX_NCHANNELS=16` | `37.0189 GB/s` | No material improvement |
| `+ NCCL_IB_AR_THRESHOLD=0` | `37.0843 GB/s` | No material improvement |
| `+ NCCL_IB_QPS_PER_CONNECTION=4 NCCL_IB_SPLIT_DATA_ON_QPS=0` | `35.9847 GB/s` | Worse |
| `+ NCCL_IB_QPS_PER_CONNECTION=4 NCCL_IB_SPLIT_DATA_ON_QPS=1` | `29.8406 GB/s` | Clearly worse |
| `+ NCCL_IB_QPS_PER_CONNECTION=8 NCCL_IB_SPLIT_DATA_ON_QPS=1` | `24.1183 GB/s` | Clearly worse |
| `+ NCCL_NCHANNELS_PER_NET_PEER=8` | `29.8904 GB/s` | Clearly worse |

The long-run recheck did not reproduce the small short-run gains from `NVLS/P2P chunk`: the environment was confirmed to still be NCCL `2.27.7+cuda12.4`, four 400G HCAs, GDR enabled, and the internal IB plugin, yet the baseline dropped to `32.7280 GB/s` in that window, `P2P_NET_CHUNKSIZE=4M` gave `31.9340 GB/s`, and `NVLS_ENABLE=0 + P2P_NET_CHUNKSIZE=4M` gave `27.6585 GB/s`. These parameters should therefore not be fixed into the formal configuration.

`GRAPH/TUNING/COLL` log comparison:

| Item | allreduce | alltoall + `NCCL_PXN_DISABLE=1` |
|--------|-----------|----------------------------------|
| NCCL version | `2.27.7+cuda12.4` | `2.27.7+cuda12.4` |
| HCA / GDR | 4 HCA, GDR enabled | 4 HCA, GDR enabled |
| external net plugin | missing, internal IB | missing, internal IB |
| channels | `16 coll / 16 nvls / 16 p2p` | `16 coll / 16 nvls / 16 p2p` |
| Pattern 4 | `crossNic 0`, `type NVL/PXN`, `nChannels 8` | `crossNic 2`, `type NVL/PIX`, `nChannels 8` |
| `NET/IB/*/GDRDMA` channel edge lines | `256` | `512` |
| `P2P/CUMEM` channel edge lines | `0` | `224` |
| total NET/P2P channel edge lines | `256` | `736` |

Assessment: with PXN disabled, the four IB/GDRDMA rails and the 16 p2p/coll/nvls channels are all still present, but the alltoall graph is clearly more complex than the allreduce one and contains a large number of node-local P2P/CUMEM edges. This further indicates that the problem is not that HCA/GDR failed to take effect, but lies in the alltoall collective graph, the way P2P and NET paths are combined, the internal IB plugin, or switch network policy.

### 8. 8-GPU link counters and physical-limit assessment

Counter probe report: `reports_multinode_nccl_counter_probe_20260523.md`

@ -327,6 +395,8 @@ busbw = algbw * 2 * (nranks - 1) / nranks = algbw * 1.875
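
Port counters from the same test window show that alltoall traffic is unevenly distributed: `mlx5_0` and `mlx5_6` carry about `885 GB`, while `mlx5_1` and `mlx5_7` carry about `295 GB`, roughly a threefold difference. After the `NCCL_IB_HCA` order was rearranged, 8-GPU alltoall still held steady at `30.02-30.07 GB/s`, so this is not simply an HCA list ordering problem.

With `NCCL_PXN_DISABLE=1`, port traffic becomes about `591 GB` on each of the four HCAs and alltoall `Avg bus bandwidth` rises to `36.9518 GB/s`, but per-rail throughput is still only about `19.82 GB/s`.

### 9. NCCL net plugin / SHARP status

Not found on either machine:
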
@ -380,9 +450,13 @@ libnccl-dev
|
||||
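- 8-GPU allreduce `algbw ~= 189 GB/s`, close to the theoretical one-way aggregate of `200 GB/s` for the current 4 x 400G HCAs
- Raw RDMA `ib_write_bw` with 4 rails in parallel totals `1476.95 Gb/s` / `184.62 GB/s`
- The PDF's 8-GPU allreduce `491.84 GB/s busbw` implies about `262 GB/s algbw`, which exceeds the physical one-way aggregate bandwidth of the current 4 x 400G
- 8-GPU alltoall port counters show uneven rail distribution, and an HCA-order sweep brings no improvement
- 8-GPU alltoall baseline port counters show uneven rail distribution, and an HCA-order sweep brings no improvement
- The current environment lacks an NCCL net plugin/SHARP, so NCCL can only use the internal IB plugin
- `NCCL_PXN_DISABLE=1` raises 8-GPU alltoall to about `36.7 GB/s`, but that is still less than half the PDF reference
- `NCCL_PXN_DISABLE=1` raises 8-GPU alltoall to about `36.7 GB/s` and fixes the uneven rail distribution, but that is still less than half the PDF reference
- The PXN-disabled rerun showed no growth in error counters such as discard, link errors, RoCE retransmission, slow restart, or packet sequence error
- The allreduce control run also shows `port_xmit_wait` yet reaches `354.366 GB/s`, so `port_xmit_wait` is not the sole root cause of low alltoall throughput
- Stacking NVLS, P2P chunk, buffer, channel, QP/split, and AR parameters on the PXN-disabled baseline gave no stable gain; QP/split and `NCCL_NCHANNELS_PER_NET_PEER=8` were clearly worse
- The NCCL GRAPH/TUNING comparison shows that alltoall and allreduce share the same basic HCA/GDR/channel state, but alltoall has more channel edges and mixes in many local `P2P/CUMEM` paths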
|
||||
|
||||
### Blocker 3: `nccl-gpu-2` SSH is under external connection pressure
|
||||
|
||||
@ -403,9 +477,9 @@ libnccl-dev
|
||||
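4. For 4-GPU allreduce, keep letting NCCL choose channel/QP automatically; for 4-GPU alltoall, to get closer to the PDF, apply `NCCL_IB_QPS_PER_CONNECTION=4`, `NCCL_MIN_NCHANNELS=4`, and `NCCL_IB_SPLIT_DATA_ON_QPS=1` to that case only.
5. For 8 GPUs per node, these fixed parameters are not recommended because they lower allreduce; stay on auto.
6. Try installing or enabling an NCCL net plugin/SHARP that matches the current OFED/driver; the current log shows `Could not find: libnccl-net.so`, and NCCL is using the internal IB plugin.
7. Check the rail mapping, switch port rates, routing, and congestion counters on cross-Leaf links, and confirm whether all four 400Gb/s HCAs are fully used in cross-node communication.
7. Check the rail mapping, switch port rates, routing, credit/congestion waits, and switch-side queue counters on cross-Leaf links; use allreduce as a control to avoid misreading `port_xmit_wait` as a root cause unique to alltoall.
8. Confirm whether the PDF's `491.84/76.54 GB/s` is also required of these two nodes in their current form with only four 400G rails; if so, the network/hardware side needs to stay involved.
9. For 8-GPU alltoall, focus on NCCL rail distribution, switch ECMP/adaptive routing, congestion counters, and SHARP/NCCL net plugin rather than continuing to tune the `NCCL_IB_HCA` order.
9. For 8-GPU alltoall, blind tuning of NCCL environment variables is not recommended for now; focus on SHARP/NCCL net plugin, NCCL-internal alltoall behavior, switch ECMP/adaptive routing, and congestion/credit waits; `NCCL_IB_HCA` order and rail distribution are no longer the main issue.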
|
||||
|
||||
## Current deliverables
|
||||
|
||||
|
||||
168
reports_multinode_nccl_environment_gap_20260523.md
Normal file
@ -0,0 +1,168 @@
|
||||
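# Multi-node NCCL environment equivalence gap, 2026-05-23

## Purpose

This document answers one core question: do `aikubeworker0012` / `aikubeworker0016` currently have the same hardware and NCCL network software environment as the 2-node, 16-GPU NCCL target in the reference PDF?

Conclusion first: **the current environment cannot be shown to be equivalent to the PDF reference environment**. There are two main differences:

1. Each node currently has only four 400G InfiniBand rails usable by NCCL.
2. There is currently no external NCCL net plugin / SHARP / HCOLL component, so NCCL uses the internal IB plugin.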
|
||||
|
||||
## Collection time and nodes

Collection time: `2026-05-23T10:53:18+00:00` to `2026-05-23T10:53:21+00:00`

| Node | SSH alias | Internal address | kernel |
|---|---|---|---|
| `aikubeworker0012` | `nccl-gpu-1` | `172.72.8.12` | `5.15.0-119-generic` |
| `aikubeworker0016` | `nccl-gpu-2` | `172.72.8.16` | `5.15.0-119-generic` |
|
||||
|
||||
## HCA / rail status

`/sys/class/infiniband/mlx5_*/ports/1` reports the same results on both machines:

| HCA | State | Rate | Link layer | Meaning for cross-node NCCL acceptance |
|---|---|---:|---|---|
| `mlx5_0` | ACTIVE | `400 Gb/sec (4X NDR)` | InfiniBand | Usable as a 400G rail |
| `mlx5_1` | ACTIVE | `400 Gb/sec (4X NDR)` | InfiniBand | Usable as a 400G rail |
| `mlx5_2` | ACTIVE | `25 Gb/sec (1X EDR)` | Ethernet | Not a 400G IB rail |
| `mlx5_3` | DOWN | `25 Gb/sec (1X EDR)` | Ethernet | Unusable |
| `mlx5_4` | ACTIVE | `100 Gb/sec (2X HDR)` | InfiniBand | Not a 400G rail |
| `mlx5_5` | ACTIVE | `100 Gb/sec (2X HDR)` | InfiniBand | Not a 400G rail |
| `mlx5_6` | ACTIVE | `400 Gb/sec (4X NDR)` | InfiniBand | Usable as a 400G rail |
| `mlx5_7` | ACTIVE | `400 Gb/sec (4X NDR)` | InfiniBand | Usable as a 400G rail |
| `mlx5_8` | ACTIVE | `25 Gb/sec (1X EDR)` | Ethernet | Not a 400G IB rail |
| `mlx5_9` | DOWN | `25 Gb/sec (1X EDR)` | Ethernet | Unusable |

The recommended HCA list, which is also the one actually in use, is therefore:

```text
NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_6,mlx5_7
```
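The HCA list above can be derived mechanically from the same sysfs attributes the table is based on. The following is a sketch under assumptions: it reads `state`, `rate`, and `link_layer` under `ports/1` only, and the function name `find_400g_ib_rails` and the `min_gbps` threshold are introduced here for illustration.

```python
from pathlib import Path


def _read(path):
    """Read a sysfs attribute, returning an empty string if it is unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return ""


def find_400g_ib_rails(root="/sys/class/infiniband", min_gbps=400):
    """List devices whose port 1 is ACTIVE, InfiniBand, and at least min_gbps."""
    rails = []
    root = Path(root)
    if not root.is_dir():
        return rails
    for dev in sorted(root.iterdir(), key=lambda p: p.name):
        port = dev / "ports" / "1"
        state = _read(port / "state")        # e.g. "4: ACTIVE"
        rate = _read(port / "rate")          # e.g. "400 Gb/sec (4X NDR)"
        layer = _read(port / "link_layer")   # e.g. "InfiniBand"
        try:
            gbps = float(rate.split()[0])
        except (IndexError, ValueError):
            continue
        if "ACTIVE" in state and layer == "InfiniBand" and gbps >= min_gbps:
            rails.append(dev.name)
    return rails


if __name__ == "__main__":
    print("NCCL_IB_HCA=" + ",".join(find_400g_ib_rails()))
```

Running this on each node and comparing the output against the list above is a quick way to notice if a rail has changed state or rate since the report was written.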
|
||||
|
||||
This gives `4 x 400Gb/s` per node, a theoretical one-way raw bandwidth of about:

```text
4 * 400Gb/s / 8 = 200 GB/s
```
|
||||
|
||||
## Physical bandwidth relative to the PDF target

The reference PDF's 2-node, 16-GPU targets:

| Operation | PDF Bus BW |
|---|---:|
| AllReduce | `491.84 GB/s` |
| AllToAll | `76.54 GB/s` |

For NCCL allreduce with 16 ranks, `busbw = algbw * 2 * (n - 1) / n = algbw * 1.875`.

The PDF's allreduce `491.84 GB/s busbw` therefore implies:

```text
491.84 / 1.875 = 262.31 GB/s algbw
```
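The conversion above, together with the rail ceiling, can be checked with a few lines of arithmetic. This is a small sketch; the function names are illustrative only and are not part of nccl-tests or the diagnostic script.

```python
def allreduce_busbw_factor(nranks):
    """busbw / algbw ratio for allreduce: 2 * (n - 1) / n."""
    return 2 * (nranks - 1) / nranks


def algbw_from_busbw(busbw_gbytes, nranks):
    """Convert an allreduce bus bandwidth (GB/s) to algorithm bandwidth (GB/s)."""
    return busbw_gbytes / allreduce_busbw_factor(nranks)


def rail_ceiling_gbytes(num_rails, gbit_per_rail):
    """One-way raw ceiling in GB/s for num_rails links of gbit_per_rail Gb/s."""
    return num_rails * gbit_per_rail / 8


if __name__ == "__main__":
    nranks = 16
    ceiling = rail_ceiling_gbytes(4, 400)
    for label, busbw in (("PDF target", 491.84), ("measured", 354.025)):
        algbw = algbw_from_busbw(busbw, nranks)
        print(f"{label}: busbw={busbw} GB/s -> algbw={algbw:.2f} GB/s "
              f"({algbw / ceiling:.1%} of the {ceiling:.0f} GB/s rail ceiling)")
```

The PDF target lands above 100% of the four-rail ceiling while the measured value sits just under it, which is the core of the equivalence argument in this section.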
|
||||
|
||||
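However, the theoretical one-way raw bandwidth of the current four 400G rails is about `200 GB/s`. Measured 2x8 allreduce in this project:

| Test | Bus BW | Implied Alg BW |
|---|---:|---:|
| Deep-diagnosis allreduce (this round) | `354.025 GB/s` | `188.81 GB/s` |
| GRAPH allreduce (this round) | `354.224 GB/s` | `188.92 GB/s` |

This is already close to the physical one-way ceiling of the current 4 x 400G rails. Unless the PDF reference environment has more effective 400G rails, a more capable switch fabric, or network acceleration components that are missing here, the current 2x8 allreduce is unlikely to reach `491.84 GB/s` through minor NCCL environment-variable tuning.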
|
||||
|
||||
## GPU-NIC affinity impact

The NIC legend shown by `nvidia-smi topo -m` is identical on both machines:

| NIC | HCA |
|---|---|
| NIC0 | `mlx5_0` |
| NIC1 | `mlx5_1` |
| NIC2 | `mlx5_2` |
| NIC3 | `mlx5_3` |
| NIC4 | `mlx5_4` |
| NIC5 | `mlx5_5` |
| NIC6 | `mlx5_6` |
| NIC7 | `mlx5_7` |
| NIC8 | `mlx5_8` |
| NIC9 | `mlx5_9` |

Key affinity relationships:

| GPU | Nearest effective 400G HCA |
|---|---|
| GPU0 | `mlx5_0` |
| GPU1 | `mlx5_1` |
| GPU4 | `mlx5_6` |
| GPU5 | `mlx5_7` |

This explains why the 2-node, 4-GPU tier needs:

```text
CUDA_VISIBLE_DEVICES=0,1,4,5
```

The default GPU0/1/2/3 would put GPU2/GPU3 on non-ideal NIC affinity paths; the `mlx5_2/3` nearest to GPU2 are not usable 400G IB rails.
|
||||
|
||||
## NCCL net plugin / SHARP status

Searched on both nodes:

```text
find /usr /opt /tmp /root -name 'libnccl-net*.so*' -o -name 'libsharp*.so*'
```
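For use inside a script or report generator, the same search can be approximated in Python. This is a sketch, not an exact replacement for `find`: it matches regular file names (including symlinks to files) under the given roots, and the function name `find_plugin_libs` is hypothetical.

```python
import fnmatch
import os

# File-name patterns taken from the find command above.
PATTERNS = ("libnccl-net*.so*", "libsharp*.so*")


def find_plugin_libs(roots, patterns=PATTERNS):
    """Return sorted paths under roots whose file name matches any pattern."""
    hits = []
    for root in roots:
        # os.walk silently skips directories it cannot read.
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                if any(fnmatch.fnmatch(name, pat) for pat in patterns):
                    hits.append(os.path.join(dirpath, name))
    return sorted(hits)


if __name__ == "__main__":
    found = find_plugin_libs(["/usr", "/opt", "/tmp", "/root"])
    if found:
        print("\n".join(found))
    else:
        print("no NCCL net plugin or SHARP library found")
```

An empty result corresponds to the finding reported below; a non-empty one would be the first thing to verify after installing a plugin.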
|
||||
|
||||
The result was empty.

The package lists on both nodes include:

| Package | Version / notes |
|---|---|
| `doca-ofed` | `3.3.0-088000` |
| `mlnx-ofed-kernel-dkms` | `26.01.OFED.26.01.1.0.0.1-1` |
| `ucx` | `1.20.0-1.20260211...` |

Not seen:

- `libnccl-net.so`
- `libsharp*.so`
- SHARP packages
- HCOLL packages

This round's NCCL GRAPH log also shows `plugin_missing=16`, indicating that NCCL can only use the internal IB plugin.
|
||||
|
||||
## Attribution boundaries for the current 2x8 results

Largely ruled out:

- Not an SSH / mpirun launch problem: preflight passed.
- Not wholly unusable HCAs: all four 400G rails are ACTIVE and allreduce reaches about `354 GB/s busbw`.
- Not GDR disabled: GDR is enabled in the NCCL `2.27.7` log.
- Not badly skewed rails: with `NCCL_PXN_DISABLE=1`, alltoall traffic is balanced across the four rails.
- Not obviously bad links or retransmission: counters show no growth in discard, RoCE retrans, slow restart, packet sequence error, and similar.

Gaps that still stand:

1. **The PDF target for 2x8 allreduce appears to exceed the physical capability of the current 4 x 400G rails.**
2. **2x8 alltoall reaches only `36-37 GB/s` even with balanced rails, which looks more like an NCCL alltoall graph-strategy issue, a limit of the internal IB plugin, the missing SHARP/NCCL net plugin, or a switch-fabric policy issue.**
|
||||
|
||||
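## Checklist for the network/environment side

Please have the network/environment side confirm the following:

1. How many 400G rails per node actually take part in NCCL in the PDF reference environment? Is it eight 400G rails rather than the current four?
2. Are all HCAs listed in the PDF command 400G InfiniBand and ACTIVE in the reference environment?
3. Does the PDF reference environment enable an NCCL net plugin, SHARP, HCOLL, a UCX plugin, or switch-side SHARP aggregation?
4. Does the current switch fabric have adaptive routing / ECMP / congestion control enabled, and are there hash or path restrictions that are unfriendly to the alltoall pattern in cross-Leaf scenarios?
5. Why are `mlx5_4/5` only 100G, why are `mlx5_2/8` 25G Ethernet, and why are `mlx5_3/9` DOWN? Does this match the procurement and acceptance expectations for these machines?
6. If acceptance must follow the PDF's `491.84/76.54 GB/s`, is it necessary to retest on a rail count / switch fabric / software stack equivalent to the PDF's?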
|
||||
|
||||
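## Suggested next steps

1. Pause blind tuning of minor NCCL parameters; existing sweeps show unstable or negative gains.
2. First have the hardware/network side confirm whether rail count and rate are equivalent to the PDF.
3. If hardware is confirmed equivalent, install the NCCL net plugin / SHARP environment, then use `scripts/multinode_nccl_deep_diagnose.sh graph` to recheck plugin and graph changes.
4. If hardware is not equivalent, adjust the acceptance threshold or retest on a node pair equivalent to the PDF.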
|
||||
150
reports_multinode_nccl_handoff_plan_20260523.md
Normal file
@ -0,0 +1,150 @@
|
||||
# Multi-node NCCL handoff plan, 2026-05-23

## One-sentence conclusion

The current 2-node, 8-GPU-per-node NCCL investigation has ruled out old NCCL, GDR disabled, wrong HCA selection, SSH/mpirun launch, and obvious link errors; the remaining gap centers on **whether the hardware rail count is equivalent to the PDF**, **whether the NCCL net plugin / SHARP is missing**, and **alltoall graph strategy / switching-path efficiency on the current cross-Leaf network**.

## Verified facts

| Fact | Current evidence |
|---|---|
| Each machine has four 400G IB rails usable by NCCL | `mlx5_0,mlx5_1,mlx5_6,mlx5_7` are all `400 Gb/sec (4X NDR)` |
| The other HCAs are not equivalent | `mlx5_4/5` are 100G IB, `mlx5_2/8` are 25G Ethernet, `mlx5_3/9` are DOWN |
| GDR works with NCCL 2.27.7 | GDR enabled in GRAPH/NET logs |
| allreduce is already near the physical ceiling of the current 4 rails | `354 GB/s busbw`, implying `189 GB/s algbw`, close to the `200 GB/s` one-way raw bandwidth of 4 x 400G |
| alltoall has balanced rails with PXN disabled but is still low | `36-37 GB/s busbw`, about `19-20 GB/s` per rail |
| No hard errors seen | No growth in discard, RoCE retrans, slow restart, packet sequence error, and similar |
| External NCCL network components are missing | `libnccl-net*.so*` / `libsharp*.so*` not found, no SHARP/HCOLL packages seen |
|
||||
|
||||
## Conflict between the PDF target and current physical capability

The PDF's 2-node, 16-GPU allreduce target is:

```text
491.84 GB/s busbw
```

Conversion for 16-rank allreduce:

```text
busbw = algbw * 1.875
```

So the PDF target implies:

```text
491.84 / 1.875 = 262.31 GB/s algbw
```

Theoretical one-way raw bandwidth of the current four 400G rails per node:

```text
4 * 400Gb/s / 8 = 200 GB/s
```

So if the PDF environment has more effective 400G rails, or has SHARP/NCCL net plugin enabled while the current environment does not, the current nodes should not be judged directly against the PDF 2x8 target.
|
||||
|
||||
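## Decision tree

### A. If acceptance insists on the original PDF thresholds

Equivalence between the current environment and the PDF must be proven first:

1. Does each node have eight usable 400G IB rails?
2. Are all HCAs in the PDF command 400G IB and ACTIVE in the reference environment?
3. Does the PDF environment enable SHARP / NCCL net plugin / HCOLL / UCX plugin?
4. Is the current cross-Leaf switching policy consistent with the PDF environment?

If any answer is no or unknown, fix the hardware/software/network environment first and then retest, rather than continuing to chase `491.84/76.54 GB/s` with minor NCCL parameters.

### B. If acceptance is recalibrated to the current hardware

Re-evaluate an explainable target for the current 2x8 allreduce against the physical capability of 4 x 400G rails:

- allreduce is currently `354 GB/s busbw`, implying `189 GB/s algbw`, close to the `200 GB/s` one-way raw ceiling.
- alltoall at `36-37 GB/s` is still low and needs further investigation as a separate problem.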
|
||||
|
||||
### C. If alltoall optimization continues

Do not keep blindly sweeping the following parameters:

- `NCCL_IB_QPS_PER_CONNECTION`
- `NCCL_IB_SPLIT_DATA_ON_QPS`
- `NCCL_NCHANNELS_PER_NET_PEER`
- `NCCL_BUFFSIZE`
- `NCCL_P2P_NET_CHUNKSIZE`
- `NCCL_IB_AR_THRESHOLD`

Existing sweeps show that they bring no stable gain, and some are clearly negative.

Prioritize:

1. Install and verify the `libnccl-net.so` / SHARP environment.
2. Have the network side check cross-Leaf ECMP / adaptive routing / congestion control / credit wait.
3. Use `scripts/multinode_nccl_deep_diagnose.sh graph` to compare the NCCL graph before and after enabling the plugin.
4. If equivalent 8-rail nodes are available, move the same script there and retest to confirm whether the allreduce physical ceiling rises.
|
||||
|
||||
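## Questions for the network/hardware/environment side

Please confirm the following directly:

1. Were these two machines supposed to have eight 400G IB rails? If so, why are there only four now?
2. `mlx5_4/5` are currently only 100G: is this a limit of configuration, cabling, modules, switch ports, or hardware?
3. Why are `mlx5_2/8` 25G Ethernet? Are they expected not to take part in IB NCCL?
4. Is it expected that `mlx5_3/9` are DOWN?
5. Does the PDF reference environment have SHARP, HCOLL, or an NCCL net plugin installed?
6. Do the current switches have adaptive routing enabled, and is it friendly to many-to-many traffic such as alltoall?
7. Do the current cross-Leaf paths suffer from uneven ECMP hashing, PFC/credit wait, or differing congestion-control parameters?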
|
||||
|
||||
## Rerun commands

### Lightweight check

```bash
cd /root/test_gpu_scripts
bash scripts/multinode_nccl_deep_diagnose.sh preflight
```

### Full deep diagnosis

```bash
cd /root/test_gpu_scripts
OUT_DIR=/root/test_gpu_scripts/reports/nccl_deep_diag_$(date +%Y%m%d_%H%M%S) \
bash scripts/multinode_nccl_deep_diagnose.sh all
```

### Minimal recheck after enabling a new NCCL plugin / SHARP

```bash
cd /root/test_gpu_scripts
OUT_DIR=/root/test_gpu_scripts/reports/nccl_deep_diag_plugin_check_$(date +%Y%m%d_%H%M%S) \
bash scripts/multinode_nccl_deep_diagnose.sh graph
```

Note that `scripts/multinode_nccl_deep_diagnose.sh` defaults `NCCL_NET_PLUGIN` to `none` unless the variable is already set in the caller's environment, so export the intended `NCCL_NET_PLUGIN` value before this run; otherwise the script keeps requesting the `none` plugin and the newly installed one is not exercised.
|
||||
|
||||
Recheck focus:

- Whether `plugin_missing` disappears or drops markedly.
- Whether an external net plugin appears in the NCCL log.
- Whether `P2P/CUMEM`, `NET/IB/*/GDRDMA`, and `channel_edge_lines` change in the alltoall graph.
- Whether alltoall busbw breaks through the `36-37 GB/s` plateau.
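The recheck points above amount to diffing two `graph/*_summary.txt` files, one from before and one from after the plugin change. The sketch below assumes the summary layout written by `summarize_graph_log` in the script (top-level `key value` lines, with indented detail lines underneath); the helper names are hypothetical.

```python
import sys
from pathlib import Path


def parse_summary(text):
    """Collect top-level 'key number' lines; skip indented and non-numeric lines."""
    metrics = {}
    for line in text.splitlines():
        if not line or line[0].isspace():
            continue
        parts = line.split()
        if len(parts) != 2:
            continue
        try:
            metrics[parts[0]] = float(parts[1])
        except ValueError:
            continue
    return metrics


def diff_summaries(before_text, after_text):
    """Return {key: (before, after)} for every metric whose value changed."""
    before = parse_summary(before_text)
    after = parse_summary(after_text)
    return {
        key: (before.get(key), after.get(key))
        for key in sorted(set(before) | set(after))
        if before.get(key) != after.get(key)
    }


if __name__ == "__main__":
    # Usage: python3 diff_graph_summary.py BEFORE_summary.txt AFTER_summary.txt
    old_text = Path(sys.argv[1]).read_text()
    new_text = Path(sys.argv[2]).read_text()
    for key, (old, new) in diff_summaries(old_text, new_text).items():
        print(f"{key}: {old} -> {new}")
```

Keys such as `avg_busbw`, `plugin_missing`, `P2P/CUMEM`, and `channel_edge_lines` are the ones to watch; an empty diff means the plugin change had no visible effect on the graph summary.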
|
||||
|
||||
## Key files

| File | Purpose |
|---|---|
| `reports_multinode_nccl_diagnosis_20260523.md` | Overall diagnosis report |
| `reports_multinode_nccl_deep_diagnose_run_20260523.md` | Results of this round's deep rerun |
| `reports_multinode_nccl_environment_gap_20260523.md` | Hardware/software environment equivalence gap |
| `reports_multinode_nccl_counter_probe_20260523.md` | RDMA rail/counter evidence |
| `reports_multinode_nccl_alltoall_tuning_20260523.md` | alltoall parameter sweep and conclusions |
| `docs/multinode_nccl_deep_diagnose_runbook.md` | Diagnostic script runbook |
| `scripts/multinode_nccl_deep_diagnose.sh` | Rerunnable diagnostic script |
|
||||
|
||||
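## Current recommendation

Spending more effort on fine-tuning NCCL environment variables is not recommended at this point. Higher-value actions are:

1. Confirm the rail count, rate, and SHARP/plugin status of the PDF reference environment.
2. Install the NCCL net plugin / SHARP, or explicitly rule it out.
3. Have the network side examine cross-Leaf paths and congestion policy for the alltoall many-to-many pattern.
4. If hardware is not equivalent, adjust the acceptance threshold or retest on equivalent nodes.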
|
||||
144
reports_multinode_nccl_latest_index_20260523.md
Normal file
@ -0,0 +1,144 @@
|
||||
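# Multi-node NCCL latest index, 2026-05-23

## Current status

Current working branch: `h100-acceptance-current`

Current conclusions:

- The 2-node, 4-GPU tier is close to the PDF reference after the GPU-NIC affinity fix.
- The 2-node, 8-GPU tier still falls short of the PDF reference:
  - allreduce is currently about `354 GB/s busbw`; the PDF target is `491.84 GB/s`.
  - alltoall is currently about `36-37 GB/s busbw`; the PDF target is `76.54 GB/s`.
- The remaining 2-node, 8-GPU gap no longer looks like an old NCCL, GDR disabled, HCA order, SSH/mpirun, or obviously bad link problem.
- It now looks more like a hardware rail count that is not equivalent to the PDF, a missing NCCL net plugin / SHARP, or a cross-Leaf alltoall network/graph-strategy issue.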
|
||||
|
||||
## Read these three first

| Order | File | Purpose |
|---:|---|---|
| 1 | `reports_multinode_nccl_handoff_plan_20260523.md` | Handoff plan for the network/hardware/environment side, including the decision tree, questions to ask, and rerun commands |
| 2 | `reports_multinode_nccl_environment_gap_20260523.md` | Explains why the current environment cannot be shown equivalent to the PDF, focusing on 4 x 400G rails and the missing NCCL net plugin / SHARP |
| 3 | `reports_multinode_nccl_deep_diagnose_run_20260523.md` | Results of this round's full deep-diagnosis rerun, including counter, GRAPH, and PXN sweep |
|
||||
|
||||
## Key scripts

| File | Purpose |
|---|---|
| `scripts/multinode_nccl_deep_diagnose.sh` | Rerunnable multi-node NCCL deep-diagnosis script |
| `docs/multinode_nccl_deep_diagnose_runbook.md` | Runbook for the diagnostic script (Chinese) |

Run the lightweight check first:

```bash
cd /root/test_gpu_scripts
bash scripts/multinode_nccl_deep_diagnose.sh preflight
```

Full rerun:

```bash
cd /root/test_gpu_scripts
OUT_DIR=/root/test_gpu_scripts/reports/nccl_deep_diag_$(date +%Y%m%d_%H%M%S) \
bash scripts/multinode_nccl_deep_diagnose.sh all
```

Minimal recheck after enabling NCCL plugin / SHARP:

```bash
cd /root/test_gpu_scripts
OUT_DIR=/root/test_gpu_scripts/reports/nccl_deep_diag_plugin_check_$(date +%Y%m%d_%H%M%S) \
bash scripts/multinode_nccl_deep_diagnose.sh graph
```
|
||||
|
||||
## Latest synced files on the remote machines

The three key reports have been synced to both nodes:

```text
/root/test_gpu_scripts/reports_multinode_nccl_handoff_plan_20260523.md
/root/test_gpu_scripts/reports_multinode_nccl_environment_gap_20260523.md
/root/test_gpu_scripts/reports_multinode_nccl_deep_diagnose_run_20260523.md
```

The latest full diagnostic output directory is on `aikubeworker0012`:

```text
/root/test_gpu_scripts/reports/nccl_deep_diag_20260523_103932
```

The directory contains:

- `preflight.txt`
- `allreduce_counter/`
- `alltoall_pxn_counter/`
- `graph/`
- `pxn_sweep/`
|
||||
|
||||
## Current evidence summary

### HCA / rail

The effective 400G IB rails are the same on both nodes:

```text
mlx5_0, mlx5_1, mlx5_6, mlx5_7
```

Non-equivalent HCAs:

```text
mlx5_4, mlx5_5: 100G InfiniBand
mlx5_2, mlx5_8: 25G Ethernet
mlx5_3, mlx5_9: DOWN
```

So each node currently has four 400G rails usable by NCCL, a theoretical one-way raw bandwidth of about `200 GB/s`.

The PDF allreduce target `491.84 GB/s busbw` implies `262.31 GB/s algbw`, exceeding the theoretical one-way bandwidth of the current 4 x 400G rails.
|
||||
|
||||
### NCCL / plugin

Not found on either node:

```text
libnccl-net*.so*
libsharp*.so*
```

No SHARP/HCOLL packages were seen either. The NCCL GRAPH log shows `plugin_missing=16`, and NCCL currently uses the internal IB plugin.
|
||||
|
||||
### Deep diagnosis

This round's full rerun:

| Item | Result |
|---|---:|
| allreduce 16G | `354.025 GB/s` |
| graph allreduce 16G | `354.224 GB/s` |
| alltoall + PXN disabled 16G | `36.9377 GB/s` |
| graph alltoall + PXN disabled 16G | `37.14 GB/s` |

The PXN-disabled sweep found no effective parameter:

- `channels16`, `buff8m`, `p2pchunk4m`, and `ar0` show only small, noise-level fluctuations.
- `qps4_split1`, `qps8_split1`, and `netpeer8` are clearly negative.
|
||||
|
||||
## Historical / supporting reports

| File | Description |
|---|---|
| `reports_multinode_nccl_diagnosis_20260523.md` | Long-form overall diagnosis, covering the whole process from old NCCL / GDR disabled to PDF matrix alignment |
| `reports_multinode_nccl_pdf_matrix_nccl227.md` | Formal raw report run against the PDF matrix |
| `reports_multinode_nccl_counter_probe_20260523.md` | RDMA rail and counter evidence |
| `reports_multinode_nccl_alltoall_tuning_20260523.md` | alltoall PXN and parameter sweep conclusions |
| `reports_rdma_single_node_summary.md` | Single-node RDMA/HCA rate summary |
| `docs/multinode_nccl_concepts.md` | Explanations of NCCL/RDMA concepts |
|
||||
|
||||
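## Route for the next owner

1. Read `reports_multinode_nccl_handoff_plan_20260523.md` first.
2. Use `reports_multinode_nccl_environment_gap_20260523.md` to confirm with the hardware/network side whether the current nodes should have eight 400G rails.
3. If hardware is not equivalent, adjust the acceptance criteria or retest on equivalent nodes.
4. If hardware is confirmed equivalent, first install the NCCL net plugin / SHARP, then run `scripts/multinode_nccl_deep_diagnose.sh graph` to compare before and after the plugin.
5. When continuing to investigate alltoall, look first at network path / ECMP / adaptive routing / congestion policy; continuing to blindly sweep minor NCCL parameters is not recommended.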
|
||||
425
scripts/multinode_nccl_deep_diagnose.sh
Executable file
@ -0,0 +1,425 @@
|
||||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
|
||||
# Deep-diagnose multi-node NCCL behavior from the coordinator node.
|
||||
# Default values match the current 2-node H100 cross-leaf investigation.
|
||||
|
||||
MODE="${1:-all}"
|
||||
|
||||
MPI_BIN="${MPI_BIN:-/usr/mpi/gcc/openmpi-4.1.9a1/bin/mpirun}"
|
||||
NCCL_TESTS_DIR="${NCCL_TESTS_DIR:-/data/nccl-tests-latest/build}"
|
||||
HOSTS="${HOSTS:-172.72.8.12:8,172.72.8.16:8}"
|
||||
PEER_HOST="${PEER_HOST:-172.72.8.16}"
|
||||
SSH_USER="${SSH_USER:-root}"
|
||||
HCAS="${HCAS:-mlx5_0 mlx5_1 mlx5_6 mlx5_7}"
|
||||
HCA_CSV="${HCA_CSV:-mlx5_0,mlx5_1,mlx5_6,mlx5_7}"
|
||||
OUT_DIR="${OUT_DIR:-/tmp/nccl_deep_diagnose_$(date +%Y%m%d_%H%M%S)}"
|
||||
|
||||
BEGIN_SIZE="${BEGIN_SIZE:-16G}"
|
||||
END_SIZE="${END_SIZE:-16G}"
|
||||
WARMUP_ITERS="${WARMUP_ITERS:-10}"
|
||||
ITERS="${ITERS:-10}"
|
||||
GRAPH_WARMUP_ITERS="${GRAPH_WARMUP_ITERS:-1}"
|
||||
GRAPH_ITERS="${GRAPH_ITERS:-1}"
|
||||
SWEEP_WARMUP_ITERS="${SWEEP_WARMUP_ITERS:-3}"
|
||||
SWEEP_ITERS="${SWEEP_ITERS:-5}"
|
||||
|
||||
NCCL_LD_LIBRARY_PATH="${NCCL_LD_LIBRARY_PATH:-/usr/mpi/gcc/openmpi-4.1.9a1/lib:/tmp/nccl-2.27.7-cuda12.4/usr/lib/x86_64-linux-gnu:/usr/local/cuda-12.4/targets/x86_64-linux/lib}"
|
||||
DEFAULT_NCCL_DEBUG="${NCCL_DEBUG:-WARN}"
|
||||
|
||||
COUNTERS="${COUNTERS:-port_xmit_data port_rcv_data port_xmit_packets port_rcv_packets port_xmit_wait port_xmit_discards port_rcv_errors port_rcv_remote_physical_errors port_rcv_switch_relay_errors port_xmit_constraint_errors port_rcv_constraint_errors symbol_error link_error_recovery link_downed local_link_integrity_errors excessive_buffer_overrun_errors VL15_dropped}"
|
||||
HW_COUNTERS="${HW_COUNTERS:-roce_adp_retrans roce_adp_retrans_to roce_slow_restart roce_slow_restart_cnps roce_slow_restart_trans packet_seq_err out_of_sequence out_of_buffer duplicate_request implied_nak_seq_err local_ack_timeout_err req_transport_retries_exceeded rnr_nak_retry_err rx_write_requests rx_read_requests}"
|
||||
|
||||
mkdir -p "$OUT_DIR"
|
||||
|
||||
mpi_base=(
|
||||
"$MPI_BIN"
|
||||
--allow-run-as-root
|
||||
--mca btl_openib_warn_no_device_params_found 0
|
||||
--mca btl_tcp_if_include bond0
|
||||
--mca oob_tcp_if_include bond0
|
||||
--mca plm_rsh_args "-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=10"
|
||||
-H "$HOSTS"
|
||||
--map-by ppr:8:node
|
||||
-np 16
|
||||
)
|
||||
|
||||
base_exports=(
|
||||
LD_LIBRARY_PATH
|
||||
NCCL_IB_GID_INDEX NCCL_IB_SL NCCL_IB_TC NCCL_SOCKET_IFNAME
|
||||
NCCL_DEBUG NCCL_DEBUG_SUBSYS NCCL_IB_TIMEOUT NCCL_IB_HCA
|
||||
NCCL_NET_PLUGIN NCCL_NVLS_ENABLE NCCL_NET_GDR_LEVEL NCCL_NET_GDR_READ
|
||||
NCCL_DMABUF_ENABLE NCCL_PXN_DISABLE NCCL_IB_QPS_PER_CONNECTION
|
||||
NCCL_IB_SPLIT_DATA_ON_QPS NCCL_MIN_NCHANNELS NCCL_MAX_NCHANNELS
|
||||
NCCL_BUFFSIZE NCCL_P2P_NET_CHUNKSIZE NCCL_NCHANNELS_PER_NET_PEER
|
||||
NCCL_IB_AR_THRESHOLD
|
||||
)
|
||||
|
||||
set_common_env() {
|
||||
unset NCCL_DEBUG_SUBSYS NCCL_PXN_DISABLE NCCL_IB_QPS_PER_CONNECTION
|
||||
unset NCCL_IB_SPLIT_DATA_ON_QPS NCCL_MIN_NCHANNELS NCCL_MAX_NCHANNELS
|
||||
unset NCCL_BUFFSIZE NCCL_P2P_NET_CHUNKSIZE NCCL_NCHANNELS_PER_NET_PEER
|
||||
unset NCCL_IB_AR_THRESHOLD
|
||||
|
||||
export LD_LIBRARY_PATH="$NCCL_LD_LIBRARY_PATH"
|
||||
export NCCL_IB_GID_INDEX="${NCCL_IB_GID_INDEX:-3}"
|
||||
export NCCL_IB_SL="${NCCL_IB_SL:-5}"
|
||||
export NCCL_IB_TC="${NCCL_IB_TC:-136}"
|
||||
export NCCL_SOCKET_IFNAME="${NCCL_SOCKET_IFNAME:-bond0}"
|
||||
export NCCL_DEBUG="$DEFAULT_NCCL_DEBUG"
|
||||
export NCCL_IB_TIMEOUT="${NCCL_IB_TIMEOUT:-22}"
|
||||
export NCCL_IB_HCA="$HCA_CSV"
|
||||
export NCCL_NET_PLUGIN="${NCCL_NET_PLUGIN:-none}"
|
||||
export NCCL_NVLS_ENABLE="${NCCL_NVLS_ENABLE:-1}"
|
||||
export NCCL_NET_GDR_LEVEL="${NCCL_NET_GDR_LEVEL:-5}"
|
||||
export NCCL_NET_GDR_READ="${NCCL_NET_GDR_READ:-1}"
|
||||
export NCCL_DMABUF_ENABLE="${NCCL_DMABUF_ENABLE:-0}"
|
||||
}
|
||||
|
||||
mpi_xargs() {
|
||||
for name in "${base_exports[@]}"; do
|
||||
if [[ -n "${!name+x}" ]]; then
|
||||
printf -- '-x\n%s\n' "$name"
|
||||
fi
|
||||
done
|
||||
}
|
||||
|
||||
run_nccl() {
|
||||
local op="$1"
|
||||
local bin="$2"
|
||||
local log="$3"
|
||||
local warmup="$4"
|
||||
local iters="$5"
|
||||
mapfile -t xargs < <(mpi_xargs)
|
||||
"${mpi_base[@]}" "${xargs[@]}" \
|
||||
"$bin" -b "$BEGIN_SIZE" -e "$END_SIZE" -g 1 -f 2 -w "$warmup" -n "$iters" \
|
||||
>"$log" 2>&1
|
||||
awk -v op="$op" '/Avg bus bandwidth/ {print op, $0}' "$log"
|
||||
}
|
||||
|
||||
read_one_snapshot() {
|
||||
local host_label="$1"
|
||||
local out="$2"
|
||||
: >"$out"
|
||||
for hca in $HCAS; do
|
||||
for c in $COUNTERS; do
|
||||
local f="/sys/class/infiniband/$hca/ports/1/counters/$c"
|
||||
if [[ -r "$f" ]]; then
|
||||
printf '%s %s counters %s %s\n' "$host_label" "$hca" "$c" "$(cat "$f" 2>/dev/null || echo 0)" >>"$out"
|
||||
fi
|
||||
done
|
||||
for c in $HW_COUNTERS; do
|
||||
local f="/sys/class/infiniband/$hca/ports/1/hw_counters/$c"
|
||||
if [[ -r "$f" ]]; then
|
||||
printf '%s %s hw_counters %s %s\n' "$host_label" "$hca" "$c" "$(cat "$f" 2>/dev/null || echo 0)" >>"$out"
|
||||
fi
|
||||
done
|
||||
done
|
||||
}
|
||||
|
||||
read_remote_snapshot() {
|
||||
local out="$1"
|
||||
ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null \
|
||||
-o BatchMode=yes -o ConnectTimeout=5 "${SSH_USER}@${PEER_HOST}" \
|
||||
"HCAS='$HCAS' COUNTERS='$COUNTERS' HW_COUNTERS='$HW_COUNTERS' bash -s" <<'EOS' >"$out"
|
||||
for hca in $HCAS; do
|
||||
for c in $COUNTERS; do
|
||||
f="/sys/class/infiniband/$hca/ports/1/counters/$c"
|
||||
if [ -r "$f" ]; then
|
||||
printf '%s %s counters %s %s\n' "$HOSTNAME" "$hca" "$c" "$(cat "$f" 2>/dev/null || echo 0)"
|
||||
fi
|
||||
done
|
||||
for c in $HW_COUNTERS; do
|
||||
f="/sys/class/infiniband/$hca/ports/1/hw_counters/$c"
|
||||
if [ -r "$f" ]; then
|
||||
printf '%s %s hw_counters %s %s\n' "$HOSTNAME" "$hca" "$c" "$(cat "$f" 2>/dev/null || echo 0)"
|
||||
fi
|
||||
done
|
||||
done
|
||||
EOS
|
||||
}
|
||||
|
||||
summarize_counter_delta() {
|
||||
local before_a="$1"
|
||||
local before_b="$2"
|
||||
local after_a="$3"
|
||||
local after_b="$4"
|
||||
local out="$5"
|
||||
python3 - "$before_a" "$before_b" "$after_a" "$after_b" >"$out" <<'PY'
|
||||
import pathlib
|
||||
import sys
|
||||
|
||||
interesting = {
|
||||
"port_xmit_wait", "port_xmit_discards", "port_rcv_errors",
|
||||
"port_rcv_remote_physical_errors", "port_rcv_switch_relay_errors",
|
||||
"port_xmit_constraint_errors", "port_rcv_constraint_errors",
|
||||
"symbol_error", "link_error_recovery", "link_downed",
|
||||
"local_link_integrity_errors", "excessive_buffer_overrun_errors",
|
||||
"VL15_dropped", "roce_adp_retrans", "roce_adp_retrans_to",
|
||||
"roce_slow_restart", "roce_slow_restart_cnps", "roce_slow_restart_trans",
|
||||
"packet_seq_err", "out_of_sequence", "out_of_buffer",
|
||||
"duplicate_request", "implied_nak_seq_err", "local_ack_timeout_err",
|
||||
"req_transport_retries_exceeded", "rnr_nak_retry_err",
|
||||
}
|
||||
|
||||
def load(path):
|
||||
data = {}
|
||||
for line in pathlib.Path(path).read_text().splitlines():
|
||||
parts = line.split()
|
||||
if len(parts) != 5:
|
||||
continue
|
||||
host, hca, kind, counter, value = parts
|
||||
try:
|
||||
data[(host, hca, kind, counter)] = int(value)
|
||||
except ValueError:
|
||||
pass
|
||||
return data
|
||||
|
||||
before = {}
|
||||
after = {}
|
||||
before.update(load(sys.argv[1]))
|
||||
before.update(load(sys.argv[2]))
|
||||
after.update(load(sys.argv[3]))
|
||||
after.update(load(sys.argv[4]))
|
||||
|
||||
print("NONZERO_DELTAS")
|
||||
for key in sorted(set(before) | set(after)):
|
||||
delta = after.get(key, 0) - before.get(key, 0)
|
||||
if not delta:
|
||||
continue
|
||||
host, hca, kind, counter = key
|
||||
if counter in {"port_xmit_data", "port_rcv_data"}:
|
||||
gib = delta * 4 / (1024 ** 3)
|
||||
print(f"{host} {hca} {kind} {counter} {delta} words4B {gib:.2f} GiB")
|
||||
else:
|
||||
print(f"{host} {hca} {kind} {counter} {delta}")
|
||||
|
||||
print("ERROR_OR_CONGESTION_DELTAS")
|
||||
seen = False
|
||||
for key in sorted(set(before) | set(after)):
|
||||
delta = after.get(key, 0) - before.get(key, 0)
|
||||
if delta and key[3] in interesting:
|
||||
seen = True
|
||||
print(*key, delta)
|
||||
if not seen:
|
||||
print("none")
|
||||
PY
|
||||
}
|
||||
|
||||
run_counter_case() {
|
||||
local op="$1"
|
||||
local bin="$2"
|
||||
local extra="${3:-}"
|
||||
set_common_env
|
||||
if [[ -n "$extra" ]]; then
|
||||
eval "export $extra"
|
||||
fi
|
||||
local dir="$OUT_DIR/${op}_counter"
|
||||
mkdir -p "$dir"
|
||||
read_one_snapshot "$(hostname)" "$dir/before.local"
|
||||
read_remote_snapshot "$dir/before.remote"
|
||||
run_nccl "$op" "$bin" "$dir/${op}.log" "$WARMUP_ITERS" "$ITERS"
|
||||
read_one_snapshot "$(hostname)" "$dir/after.local"
|
||||
read_remote_snapshot "$dir/after.remote"
|
||||
summarize_counter_delta "$dir/before.local" "$dir/before.remote" "$dir/after.local" "$dir/after.remote" "$dir/counter_delta.txt"
|
||||
echo "$dir"
|
||||
}
|
||||
|
||||
summarize_graph_log() {
|
||||
local log="$1"
|
||||
local out="$2"
|
||||
python3 - "$log" >"$out" <<'PY'
|
||||
from pathlib import Path
|
||||
import collections
|
||||
import re
|
||||
import sys
|
||||
|
||||
text = Path(sys.argv[1]).read_text(errors="ignore")
|
||||
print("avg_busbw", (re.findall(r"Avg bus bandwidth\s*:\s*([0-9.]+)", text) or ["NA"])[-1])
|
||||
print("nccl_version", sorted(set(re.findall(r"NCCL version ([^\s]+)", text))))
|
||||
print("plugin_missing", len(re.findall(r"Could not find: none libnccl-net-none\.so", text)))
|
||||
print("gdr_enabled_lines", len(re.findall(r"GPU Direct RDMA Enabled", text)))
|
||||
print("using_hca")
|
||||
for value, count in collections.Counter(re.findall(r"NET/IB : Using \[(.*?)\]; OOB", text)).most_common(4):
|
||||
print(f" {count} {value}")
|
||||
print("pattern_counts")
|
||||
patterns = re.findall(
|
||||
r"Pattern (\d+), crossNic (\d+), nChannels (\d+), bw ([0-9.]+)/([0-9.]+), type ([^,]+), sameChannels (\d+)",
|
||||
text,
|
||||
)
|
||||
for key, count in collections.Counter(patterns).most_common():
|
||||
print(f" {count} {key}")
|
||||
print("channel_summary")
|
||||
for value, count in collections.Counter(
|
||||
re.findall(r"(\d+ coll channels, \d+ collnet channels, \d+ nvls channels, \d+ p2p channels, \d+ p2p channels per peer)", text)
|
||||
).most_common():
|
||||
print(f" {count} {value}")
|
||||
print("p2p_chunks", collections.Counter(re.findall(r"P2P Chunksize set to (\d+)", text)))
|
||||
print("check_p2p", collections.Counter(re.findall(r"Check P2P Type ([^\n]+)", text)))
|
||||
for token in ["NET/IB/0/GDRDMA", "NET/IB/1/GDRDMA", "NET/IB/2/GDRDMA", "NET/IB/3/GDRDMA", "P2P/CUMEM", "P2P/IPC", "SHM"]:
|
||||
print(token, text.count(token))
|
||||
print("channel_edge_lines", len([line for line in text.splitlines() if "Channel " in line and ("via NET/IB" in line or "via P2P" in line)]))
|
||||
PY
|
||||
}
|
||||
|
||||
run_graph_case() {
|
||||
local op="$1"
|
||||
local bin="$2"
|
||||
local extra="${3:-}"
|
||||
set_common_env
|
||||
export NCCL_DEBUG=INFO
|
||||
export NCCL_DEBUG_SUBSYS=INIT,NET,GRAPH,TUNING,COLL
|
||||
if [[ -n "$extra" ]]; then
|
||||
eval "export $extra"
|
||||
fi
|
||||
local dir="$OUT_DIR/graph"
|
||||
mkdir -p "$dir"
|
||||
local log="$dir/${op}.log"
|
||||
run_nccl "$op" "$bin" "$log" "$GRAPH_WARMUP_ITERS" "$GRAPH_ITERS"
|
||||
summarize_graph_log "$log" "$dir/${op}_summary.txt"
|
||||
echo "$dir/${op}_summary.txt"
|
||||
}
|
||||
|
||||
run_pxn_sweep() {
|
||||
local dir="$OUT_DIR/pxn_sweep"
|
||||
mkdir -p "$dir"
|
||||
local cases=(
|
||||
"baseline|"
|
||||
"nvls_off|NCCL_NVLS_ENABLE=0"
|
||||
"qps4_split1|NCCL_IB_QPS_PER_CONNECTION=4 NCCL_IB_SPLIT_DATA_ON_QPS=1"
|
||||
"qps8_split1|NCCL_IB_QPS_PER_CONNECTION=8 NCCL_IB_SPLIT_DATA_ON_QPS=1"
|
||||
"qps4_split0|NCCL_IB_QPS_PER_CONNECTION=4 NCCL_IB_SPLIT_DATA_ON_QPS=0"
|
||||
"channels16|NCCL_MIN_NCHANNELS=16 NCCL_MAX_NCHANNELS=16"
|
||||
"buff8m|NCCL_BUFFSIZE=8388608"
|
||||
"p2pchunk4m|NCCL_P2P_NET_CHUNKSIZE=4194304"
|
||||
"netpeer8|NCCL_NCHANNELS_PER_NET_PEER=8"
|
||||
"ar0|NCCL_IB_AR_THRESHOLD=0"
|
||||
)
|
||||
: >"$dir/summary.txt"
|
||||
for item in "${cases[@]}"; do
|
||||
local name="${item%%|*}"
|
||||
local extra="${item#*|}"
|
||||
set_common_env
|
||||
export NCCL_PXN_DISABLE=1
|
||||
if [[ -n "$extra" ]]; then
|
||||
eval "export $extra"
|
||||
fi
|
||||
local log="$dir/${name}.log"
|
||||
{
|
||||
echo "===== CASE $name ====="
|
||||
echo "extra: ${extra:-none}"
|
||||
run_nccl "alltoall" "$NCCL_TESTS_DIR/alltoall_perf" "$log" "$SWEEP_WARMUP_ITERS" "$SWEEP_ITERS"
|
||||
awk '/Avg bus bandwidth/ {print}' "$log" | tail -1
|
||||
} | tee -a "$dir/summary.txt"
|
||||
done
|
||||
echo "$dir/summary.txt"
|
||||
}
|
||||
|
||||
# Lightweight check of mpirun, the nccl-tests binaries, and the HCA sysfs
# paths on the local node, then the same checks on the peer node over SSH.
run_preflight() {
  set_common_env
  local out="$OUT_DIR/preflight.txt"
  {
    echo "===== LOCAL ====="
    echo "hostname: $(hostname)"
    echo "mpirun: $MPI_BIN"
    if [[ -x "$MPI_BIN" ]]; then
      "$MPI_BIN" --version 2>&1 | sed -n '1p'
    else
      echo "MISSING executable: $MPI_BIN"
    fi
    for bin in "$NCCL_TESTS_DIR/all_reduce_perf" "$NCCL_TESTS_DIR/alltoall_perf"; do
      if [[ -x "$bin" ]]; then
        echo "OK executable: $bin"
      else
        echo "MISSING executable: $bin"
      fi
    done
    for hca in $HCAS; do
      local state="/sys/class/infiniband/$hca/ports/1/state"
      local rate="/sys/class/infiniband/$hca/ports/1/rate"
      if [[ -r "$state" ]]; then
        echo "OK HCA: $hca state=$(cat "$state") rate=$(cat "$rate" 2>/dev/null || echo unknown)"
      else
        echo "MISSING HCA path: $hca"
      fi
    done

    echo "===== REMOTE ====="
    # The heredoc is quoted, so it expands on the peer; the three variables
    # it needs are passed in on the remote command line.
    ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null \
      -o BatchMode=yes -o ConnectTimeout=5 "${SSH_USER}@${PEER_HOST}" \
      "MPI_BIN='$MPI_BIN' NCCL_TESTS_DIR='$NCCL_TESTS_DIR' HCAS='$HCAS' bash -s" <<'EOS'
echo "hostname: $(hostname)"
echo "mpirun: $MPI_BIN"
if [ -x "$MPI_BIN" ]; then
  "$MPI_BIN" --version 2>&1 | sed -n '1p'
else
  echo "MISSING executable: $MPI_BIN"
fi
for bin in "$NCCL_TESTS_DIR/all_reduce_perf" "$NCCL_TESTS_DIR/alltoall_perf"; do
  if [ -x "$bin" ]; then
    echo "OK executable: $bin"
  else
    echo "MISSING executable: $bin"
  fi
done
for hca in $HCAS; do
  state="/sys/class/infiniband/$hca/ports/1/state"
  rate="/sys/class/infiniband/$hca/ports/1/rate"
  if [ -r "$state" ]; then
    echo "OK HCA: $hca state=$(cat "$state") rate=$(cat "$rate" 2>/dev/null || echo unknown)"
  else
    echo "MISSING HCA path: $hca"
  fi
done
EOS
  } | tee "$out"
  echo "$out"
}

usage() {
  cat <<EOF
Usage: $0 [preflight|all|allreduce-counter|alltoall-counter|graph|pxn-sweep]

Outputs are written to: $OUT_DIR

Common overrides:
  HOSTS, PEER_HOST, HCAS, HCA_CSV, MPI_BIN, NCCL_TESTS_DIR,
  NCCL_LD_LIBRARY_PATH, BEGIN_SIZE, END_SIZE, WARMUP_ITERS, ITERS
EOF
}

case "$MODE" in
  preflight)
    run_preflight
    ;;
  all)
    run_preflight
    run_counter_case allreduce "$NCCL_TESTS_DIR/all_reduce_perf" ""
    run_counter_case alltoall_pxn "$NCCL_TESTS_DIR/alltoall_perf" "NCCL_PXN_DISABLE=1"
    run_graph_case allreduce "$NCCL_TESTS_DIR/all_reduce_perf" ""
    run_graph_case alltoall_pxn "$NCCL_TESTS_DIR/alltoall_perf" "NCCL_PXN_DISABLE=1"
    run_pxn_sweep
    ;;
  allreduce-counter)
    run_counter_case allreduce "$NCCL_TESTS_DIR/all_reduce_perf" ""
    ;;
  alltoall-counter)
    run_counter_case alltoall_pxn "$NCCL_TESTS_DIR/alltoall_perf" "NCCL_PXN_DISABLE=1"
    ;;
  graph)
    run_graph_case allreduce "$NCCL_TESTS_DIR/all_reduce_perf" ""
    run_graph_case alltoall_pxn "$NCCL_TESTS_DIR/alltoall_perf" "NCCL_PXN_DISABLE=1"
    ;;
  pxn-sweep)
    run_pxn_sweep
    ;;
  -h|--help|help)
    usage
    ;;
  *)
    usage
    exit 2
    ;;
esac

echo "OUT_DIR=$OUT_DIR"
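
The sweep loop writes a `===== CASE <name> =====` header and the case's `Avg bus bandwidth` line to `summary.txt`. A minimal sketch for ranking the cases by that figure might look like the following. `rank_sweep_cases` is a hypothetical helper that is not part of this script, and it assumes the nccl-tests line has the form `# Avg bus bandwidth : <number>`.

```shell
# rank_sweep_cases: hypothetical helper, not part of the original script.
# Reads a summary.txt produced by the PXN sweep and prints
# "<bandwidth> <case>" lines, highest bandwidth first.
rank_sweep_cases() {
  local summary="$1"
  awk '
    # Header written by the sweep loop: "===== CASE <name> ====="
    /^===== CASE / { name = $3; next }
    # Assumed nccl-tests format: "# Avg bus bandwidth    : <number>"
    /Avg bus bandwidth/ && name != "" {
      n = split($0, parts, ":")
      bw[name] = parts[n] + 0   # keep the last value seen for this case
    }
    END { for (c in bw) printf "%s %s\n", bw[c], c }
  ' "$summary" | sort -k1,1nr
}
```

It would be called on the file whose path the sweep prints when it finishes, for example `rank_sweep_cases "$dir/summary.txt"`. Keeping only the last bandwidth value per case avoids double counting if the benchmark output itself is also captured in the summary.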