Stabilize multinode NCCL launch diagnostics
parent 28fe55d5c7
commit b82ac54218
@ -73,6 +73,8 @@ multinode_nccl:
|
|||||||
gpus_per_rank: 1
|
gpus_per_rank: 1
|
||||||
timeout_sec: 1800
|
timeout_sec: 1800
|
||||||
socket_ifname: bond0
|
socket_ifname: bond0
|
||||||
|
oob_tcp_ifname: bond0
|
||||||
|
plm_rsh_args: "-o StrictHostKeyChecking=accept-new -o ConnectTimeout=10 -o ServerAliveInterval=30"
|
||||||
ib_gid_index: 3
|
ib_gid_index: 3
|
||||||
ib_sl: 5
|
ib_sl: 5
|
||||||
ib_tc: 136
|
ib_tc: 136
|
||||||
|
|||||||
@ -40,6 +40,8 @@ multinode_nccl:
|
|||||||
timeout_sec: 600
|
timeout_sec: 600
|
||||||
debug: INFO
|
debug: INFO
|
||||||
socket_ifname: bond0
|
socket_ifname: bond0
|
||||||
|
oob_tcp_ifname: bond0
|
||||||
|
plm_rsh_args: "-o StrictHostKeyChecking=accept-new -o ConnectTimeout=10 -o ServerAliveInterval=30"
|
||||||
ib_gid_index: 3
|
ib_gid_index: 3
|
||||||
ib_sl: 5
|
ib_sl: 5
|
||||||
ib_tc: 136
|
ib_tc: 136
|
||||||
|
|||||||
@ -245,10 +245,14 @@ class MultiNodeNCCLTest:
|
|||||||
"--allow-run-as-root",
|
"--allow-run-as-root",
|
||||||
"--mca", "btl_openib_warn_no_device_params_found", "0",
|
"--mca", "btl_openib_warn_no_device_params_found", "0",
|
||||||
"--mca", "btl_tcp_if_include", str(self.cfg.get("socket_ifname", "bond0")),
|
"--mca", "btl_tcp_if_include", str(self.cfg.get("socket_ifname", "bond0")),
|
||||||
|
"--mca", "oob_tcp_if_include", str(self.cfg.get("oob_tcp_ifname", self.cfg.get("socket_ifname", "bond0"))),
|
||||||
"-H", host_arg,
|
"-H", host_arg,
|
||||||
"--map-by", f"ppr:{gpus_per_node}:node",
|
"--map-by", f"ppr:{gpus_per_node}:node",
|
||||||
"-np", str(ranks),
|
"-np", str(ranks),
|
||||||
]
|
]
|
||||||
|
plm_rsh_args = self.cfg.get("plm_rsh_args")
|
||||||
|
if plm_rsh_args:
|
||||||
|
cmd.extend(["--mca", "plm_rsh_args", str(plm_rsh_args)])
|
||||||
for key, value in self._env_exports():
|
for key, value in self._env_exports():
|
||||||
cmd.extend(["-x", f"{key}={value}"])
|
cmd.extend(["-x", f"{key}={value}"])
|
||||||
|
|
||||||
|
|||||||
@ -4,13 +4,13 @@
|
|||||||
- Test entry point: `nccl-gpu-1` / `aikubeworker0012` / `172.72.8.12`
|
- Test entry point: `nccl-gpu-1` / `aikubeworker0012` / `172.72.8.12`
|
||||||
- Peer node: `nccl-gpu-2` / `aikubeworker0016` / `172.72.8.16`
|
- Peer node: `nccl-gpu-2` / `aikubeworker0016` / `172.72.8.16`
|
||||||
- Diagnostic config: `configs/multinode_nccl_diagnostic.yaml`
|
- Diagnostic config: `configs/multinode_nccl_diagnostic.yaml`
|
||||||
- Raw script report: `reports_multinode_nccl_diagnostic_2x8_debug_v2.md`
|
- Raw script report: `reports_multinode_nccl_diagnostic_2x8_sshfix.md`
|
||||||
|
|
||||||
## Current Conclusion
|
## Current Conclusion
|
||||||
|
|
||||||
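This is not simply an "IB is down" problem. The underlying CUDA RDMA perftest reaches close to the single-port 400 Gb/s level, but NCCL disables GPU Direct RDMA during actual 2-node communication, which leaves NCCL bandwidth far below the acceptance threshold.
|
This is not simply an "IB is down" problem. The underlying CUDA RDMA perftest reaches close to the single-port 400 Gb/s level, but NCCL disables GPU Direct RDMA during actual 2-node communication, which leaves NCCL bandwidth far below the acceptance threshold.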
|
||||||
|
|
||||||
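In addition, the SSH entry point on `nccl-gpu-2` is unstable, which causes `mpirun` to fail when launching remote ranks. This directly affects the stability of multi-node tests such as alltoall and needs to be handled together with the NCCL GDR problem.
|
In addition, the SSH entry point on `nccl-gpu-2` previously hit `MaxStartups` random rejection because of too many unauthenticated connections, which caused `mpirun` to fail when launching remote ranks. A temporary SSHD mitigation has been applied and a valid 2-node x 8-GPU allreduce/alltoall report has been obtained; the remaining core problem is that NCCL GDR is still disabled.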
|
||||||
|
|
||||||
## Completed Fixes
|
## Completed Fixes
|
||||||
|
|
||||||
@ -20,6 +20,8 @@
|
|||||||
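4. Added multi-node NCCL network diagnostic parsing to the script; the report now shows `NCCL Network`, `GPU Direct RDMA`, and `GDR Disabled HCAs`.
|
4. Added multi-node NCCL network diagnostic parsing to the script; the report now shows `NCCL Network`, `GPU Direct RDMA`, and `GDR Disabled HCAs`.
|
||||||
5. Added `multinode_nccl.extra_env`, so NCCL environment variables can be tried quickly from the config without code changes.
|
5. Added `multinode_nccl.extra_env`, so NCCL environment variables can be tried quickly from the config without code changes.
|
||||||
6. Added the diagnostic config `configs/multinode_nccl_diagnostic.yaml`, which always runs 2 nodes x 8 GPUs, 256M, `NCCL_DEBUG=INFO`, and `NCCL_DEBUG_SUBSYS=INIT,NET`.
|
6. Added the diagnostic config `configs/multinode_nccl_diagnostic.yaml`, which always runs 2 nodes x 8 GPUs, 256M, `NCCL_DEBUG=INFO`, and `NCCL_DEBUG_SUBSYS=INIT,NET`.
|
||||||
|
7. Temporarily raised SSHD `MaxStartups` and shortened `LoginGraceTime` on `nccl-gpu-2` to mitigate random SSH rejections caused by too many unauthenticated connections.
|
||||||
|
8. Pinned the OpenMPI OOB TCP control channel to `bond0` and added `plm_rsh_args`, reducing the chance that remote `mpirun` launch is affected by SSH, host keys, or interface selection.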
|
||||||
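As a rough illustration of item 8, the sketch below assembles an `mpirun` argument list with the OOB interface falling back to `socket_ifname` and `plm_rsh_args` appended only when configured. The function name `build_mpirun_cmd` and the `host:slots` form of the `-H` argument are assumptions made for this example; the MCA parameter names and config keys are the ones that appear in this commit.

```python
from typing import Dict, List


def build_mpirun_cmd(cfg: Dict[str, str], hosts: List[str], gpus_per_node: int) -> List[str]:
    """Hypothetical helper: build an mpirun argv pinned to one interface."""
    socket_if = str(cfg.get("socket_ifname", "bond0"))
    # The OOB control channel falls back to the data-plane interface.
    oob_if = str(cfg.get("oob_tcp_ifname", socket_if))
    # Assumption: hosts are passed to -H as "host:slots", comma-separated.
    host_arg = ",".join(f"{h}:{gpus_per_node}" for h in hosts)
    cmd = [
        "mpirun",
        "--allow-run-as-root",
        "--mca", "btl_tcp_if_include", socket_if,
        "--mca", "oob_tcp_if_include", oob_if,
        "-H", host_arg,
        "--map-by", f"ppr:{gpus_per_node}:node",
        "-np", str(len(hosts) * gpus_per_node),
    ]
    plm_rsh_args = cfg.get("plm_rsh_args")
    if plm_rsh_args:
        # Passed as a single argv element so the ssh options stay together.
        cmd.extend(["--mca", "plm_rsh_args", str(plm_rsh_args)])
    return cmd
```

Keeping the fallback in one place means a config that only sets `socket_ifname` still pins both the TCP BTL and the OOB channel to the same interface.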
|
|
||||||
## Key Evidence
|
## Key Evidence
|
||||||
|
|
||||||
@ -77,12 +79,12 @@ NCCL INFO NET/IB : GPU Direct RDMA Disabled for HCA 0 'mlx5_0'
|
|||||||
|
|
||||||
### 4. Script 2-node x 8-GPU diagnostic results
|
### 4. Script 2-node x 8-GPU diagnostic results
|
||||||
|
|
||||||
Raw report: `reports_multinode_nccl_diagnostic_2x8_debug_v2.md`
|
Raw report: `reports_multinode_nccl_diagnostic_2x8_sshfix.md`
|
||||||
|
|
||||||
| Operation | Topology | Peak Bus BW | Threshold | Status | NCCL Network | GPU Direct RDMA |
|
| Operation | Topology | Peak Bus BW | Threshold | Status | NCCL Network | GPU Direct RDMA |
|
||||||
|-----------|----------|-------------|-----------|--------|--------------|-----------------|
|
|-----------|----------|-------------|-----------|--------|--------------|-----------------|
|
||||||
| allreduce | 2 nodes x 8 GPUs | `68.69 GB/s` | `>= 480 GB/s` | FAIL | IB | DISABLED |
|
| allreduce | 2 nodes x 8 GPUs | `67.42 GB/s` | `>= 480 GB/s` | FAIL | IB | DISABLED |
|
||||||
| alltoall | 2 nodes x 8 GPUs | `0.00 GB/s` | `>= 75 GB/s` | FAIL | unknown | UNKNOWN |
|
| alltoall | 2 nodes x 8 GPUs | `9.56 GB/s` | `>= 75 GB/s` | FAIL | IB | DISABLED |
|
||||||
|
|
||||||
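allreduce fails because bandwidth is below the threshold, and the report captured that NCCL disabled GDR:
|
allreduce fails because bandwidth is below the threshold, and the report captured that NCCL disabled GDR: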
|
||||||
|
|
||||||
@ -90,12 +92,51 @@ allreduce fails because bandwidth is below the threshold, and the report captured GDR being dis
|
|||||||
|-------------------|
|
|-------------------|
|
||||||
| `mlx5_0, mlx5_1, mlx5_6, mlx5_7` |
|
| `mlx5_0, mlx5_1, mlx5_6, mlx5_7` |
|
||||||
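The `GDR Disabled HCAs` column can be derived from NCCL debug output. The sketch below is one possible way to do it, not the script's actual parser: it scans for the `GPU Direct RDMA Disabled for HCA` message quoted earlier in this report and collects the HCA names in first-seen order. The function name `gdr_disabled_hcas` is hypothetical.

```python
import re
from typing import Iterable, List

# Matches lines such as:
#   NCCL INFO NET/IB : GPU Direct RDMA Disabled for HCA 0 'mlx5_0'
_GDR_DISABLED = re.compile(r"GPU Direct RDMA Disabled for HCA \d+ '([^']+)'")


def gdr_disabled_hcas(log_lines: Iterable[str]) -> List[str]:
    """Return unique HCA names reported as GDR-disabled, in first-seen order."""
    seen: List[str] = []
    for line in log_lines:
        match = _GDR_DISABLED.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen
```

Every rank prints the same message, so de-duplicating keeps the report column short.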
|
|
||||||
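This round, alltoall failed not because of performance itself but because the `mpirun` stage failed due to SSH/network discovery issues; the report tail shows:
|
Both allreduce and alltoall completed normally this round with `returncode=0` and `wrong=0`; they fail because bandwidth is below the threshold, not because of a correctness failure.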
|
||||||
|
|
||||||
|
### 5. SSHD MaxStartups blocking temporarily mitigated
|
||||||
|
|
||||||
|
`nccl-gpu-2` previously showed:
|
||||||
|
|
||||||
```text
|
```text
|
||||||
lack of common network interfaces and/or no route found between them
|
sshd: /usr/sbin/sshd -D [listener] 52 of 10-100 startups
|
||||||
|
maxstartups 10:30:100
|
||||||
```
|
```
|
||||||
|
|
||||||
|
There were also many unauthenticated `sshd: unknown [priv]` / `sshd: unknown [net]` connections, mostly from `172.239.10.85`. This triggers OpenSSH `MaxStartups` random rejection, which shows up directly as:
|
||||||
|
|
||||||
|
```text
|
||||||
|
kex_exchange_identification: Connection closed by remote host
|
||||||
|
```
|
||||||
|
|
||||||
|
Temporarily changed to:
|
||||||
|
|
||||||
|
```text
|
||||||
|
MaxStartups 120:30:240
|
||||||
|
LoginGraceTime 20
|
||||||
|
```
|
||||||
|
|
||||||
|
After the change, 5 consecutive SSH connections from 0012 to 0016 succeeded, 2-node `mpirun hostname` succeeded, and 2-node x 8-GPU allreduce/alltoall both produced valid results.
|
||||||
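To see why the old setting rejected so many connections, recall that `MaxStartups start:rate:full` drops new unauthenticated connections with probability `rate`% once `start` are pending, rising linearly to 100% at `full`. The sketch below applies that linear ramp as described in `sshd_config(5)`; the function name is hypothetical and the result is an approximation that ignores any integer rounding inside sshd.

```python
def max_startups_drop_probability(startups: int, start: int, rate: int, full: int) -> float:
    """Approximate chance that sshd refuses a new unauthenticated connection."""
    if startups < start:
        return 0.0  # below the ramp: nothing is dropped
    if startups >= full or rate >= 100:
        return 1.0  # at or beyond the hard limit: everything is dropped
    # Linear ramp from rate% at `start` to 100% at `full`.
    percent = rate + (100 - rate) * (startups - start) / (full - start)
    return percent / 100.0


# Observed state: 52 pending connections under 10:30:100.
old = max_startups_drop_probability(52, 10, 30, 100)
# Same load under the temporary 120:30:240 setting.
new = max_startups_drop_probability(52, 120, 30, 240)
print(f"old={old:.2f} new={new:.2f}")
```

Under these assumptions, roughly 63% of new connections would be refused with the old limits at 52 pending startups, while the temporary limits refuse none at the same load.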
|
|
||||||
|
### 6. `nvidia_peermem` legacy mode experiment had no effect
|
||||||
|
|
||||||
|
Both machines have identical default parameters:
|
||||||
|
|
||||||
|
| Parameter | Value |
|
||||||
|
|------|----|
|
||||||
|
| `nvidia_peermem` version | `580.159.03` |
|
||||||
|
| `peerdirect_support` | `0` |
|
||||||
|
| `persistent_api_support` | `1` |
|
||||||
|
| OFED | `OFED-internal-26.01-1.0.0` |
|
||||||
|
|
||||||
|
After temporarily switching both machines to `peerdirect_support=1`, 2-node x 1-GPU NCCL still shows:
|
||||||
|
|
||||||
|
```text
|
||||||
|
NET/IB : GPU Direct RDMA Disabled for HCA 0 'mlx5_0'
|
||||||
|
```
|
||||||
|
|
||||||
|
Bandwidth remained at about `13.4 GB/s`. After the test, the defaults `peerdirect_support=0,persistent_api_support=1` were restored.
|
||||||
|
|
||||||
## Current Blockers
|
## Current Blockers
|
||||||
|
|
||||||
### Blocker 1: NCCL disables GPU Direct RDMA
|
### Blocker 1: NCCL disables GPU Direct RDMA
|
||||||
@ -109,26 +150,27 @@ lack of common network interfaces and/or no route found between them
|
|||||||
|
|
||||||
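Assessment: the underlying RDMA capability exists, but NCCL's GDR detection/registration path is not working end to end. Prioritize checking compatibility between NCCL and the NVIDIA driver, OFED, `nvidia_peermem`, and the NCCL net plugin/internal IB backend.
|
Assessment: the underlying RDMA capability exists, but NCCL's GDR detection/registration path is not working end to end. Prioritize checking compatibility between NCCL and the NVIDIA driver, OFED, `nvidia_peermem`, and the NCCL net plugin/internal IB backend.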
|
||||||
|
|
||||||
### Blocker 2: `nccl-gpu-2` SSH is unstable
|
### Blocker 2: `nccl-gpu-2` SSH is under external connection pressure
|
||||||
|
|
||||||
Symptoms:
|
Symptoms:
|
||||||
|
|
||||||
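- Seen repeatedly: `kex_exchange_identification: Connection closed by remote host`
|
- Was seen repeatedly: `kex_exchange_identification: Connection closed by remote host`
|
||||||
- Direct MCP connections to `nccl-gpu-2` also fail or hang until they time out
|
- Root cause: too many unauthenticated connections triggering `MaxStartups`
|
||||||
- `mpirun` relies on SSH to launch remote ranks, so SSH flakiness leaves tests such as alltoall with no valid output at all
|
- Currently mitigated by the temporary SSHD configuration, and a valid 2x8 report has been obtained
|
||||||
|
- If the external connection pressure persists, handling the source connections on the network or security-policy side is still recommended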
|
||||||
|
|
||||||
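Assessment: the SSHD/connection limits/MaxStartups/security policy on `aikubeworker0016` must be handled first; otherwise multi-node tests cannot be reproduced reliably.
|
Assessment: this no longer blocks producing the current report, but it remains an environment stability risk.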
|
||||||
|
|
||||||
## Suggested Next Steps
|
## Suggested Next Steps
|
||||||
|
|
||||||
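1. First fix `nccl-gpu-2` SSH stability: check `MaxStartups` in `sshd_config`, connection limits, security audit components, and whether there are too many half-open SSH sessions.
|
1. Handle the unauthenticated SSH connection pressure from sources such as `172.239.10.85` on the network/security side, or keep the higher `MaxStartups` setting as a temporary policy for the test window.
|
||||||
2. Confirm on each machine that the `nvidia_peermem` parameters, OFED version, and NVIDIA driver version are consistent.
|
2. Try installing or enabling an NCCL net plugin that matches the current OFED/driver; the current log shows `No plugin found (libnccl-net.so)`, and NCCL is using the internal network plugin.
|
||||||
3. Test on both machines whether the `nvidia_peermem peerdirect_support` mode needs to be switched, and confirm before the change that no production jobs are running.
|
3. With the same software stack versions, re-test the GDR status using `nccl-tests` plus the NCCL net plugin; the key criterion is that `GPU Direct RDMA` in the report changes from `DISABLED` to not disabled and 2x8 bandwidth rises significantly.
|
||||||
4. Try installing or enabling an NCCL net plugin that matches the current OFED/driver; the current log shows `No plugin found (libnccl-net.so)`, and NCCL is using the internal network plugin.
|
4. If GDR is still disabled, continue checking the compatibility matrix for NVIDIA driver 580.159.03, OFED 26.01, and NCCL 2.21.5 with the H100/IB NDR combination.
|
||||||
5. Once SSH is stable, rerun the full multi-node configuration: 2 nodes x 8 GPUs, covering at least `all_reduce_perf` and `alltoall_perf`, with message sizes from `1K` to `16G`.
|
5. After GDR is fixed, rerun the full multi-node configuration: 2 nodes x 8 GPUs, covering at least `all_reduce_perf` and `alltoall_perf`, with message sizes from `1K` to `16G`.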
|
||||||
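For the full rerun in step 5, the size sweep from `1K` to `16G` can be expressed with the `-b`, `-e`, and `-f` options of `nccl-tests`. The sketch below is only an illustration: the helper names are hypothetical, and it assumes binary suffixes (1K = 1024 bytes) and a doubling factor of 2, neither of which is stated in this report.

```python
from typing import List

_SUFFIX = {"K": 2**10, "M": 2**20, "G": 2**30}


def parse_size(text: str) -> int:
    """Convert a size such as '1K' or '16G' to bytes (binary suffixes assumed)."""
    text = text.strip().upper()
    if text and text[-1] in _SUFFIX:
        return int(text[:-1]) * _SUFFIX[text[-1]]
    return int(text)


def sweep_sizes(min_size: str, max_size: str, factor: int = 2) -> List[int]:
    """List the message sizes a multiplicative sweep would visit."""
    size, stop = parse_size(min_size), parse_size(max_size)
    sizes = []
    while size <= stop:
        sizes.append(size)
        size *= factor
    return sizes


def sweep_args(min_size: str, max_size: str, factor: int = 2) -> List[str]:
    """Arguments to append to all_reduce_perf / alltoall_perf."""
    return ["-b", min_size, "-e", max_size, "-f", str(factor)]


print(len(sweep_sizes("1K", "16G")), sweep_args("1K", "16G"))
```

Listing the sizes up front makes it easy to estimate run time before starting a full sweep on both nodes.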
|
|
||||||
## Current Deliverables
|
## Current Deliverables
|
||||||
|
|
||||||
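- `configs/multinode_nccl_diagnostic.yaml`: multi-node multi-GPU diagnostic config
|
- `configs/multinode_nccl_diagnostic.yaml`: multi-node multi-GPU diagnostic config
|
||||||
- `reports_multinode_nccl_diagnostic_2x8_debug_v2.md`: raw 2x8 diagnostic report generated by the script
|
- `reports_multinode_nccl_diagnostic_2x8_sshfix.md`: raw 2x8 diagnostic report generated by the script
|
||||||
- `reports_multinode_nccl_diagnosis_20260523.md`: this Chinese-language diagnosis summary
|
- `reports_multinode_nccl_diagnosis_20260523.md`: this Chinese-language diagnosis summary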
|
||||||
|
|||||||
66
reports_multinode_nccl_diagnostic_2x8_sshfix.md
Normal file
@ -0,0 +1,66 @@
|
|||||||
|
# GPU Test Report
|
||||||
|
|
||||||
|
- **Date:** 2026-05-23T07:46:11.464439
|
||||||
|
- **Host:** aikubeworker0012
|
||||||
|
|
||||||
|
## Overall Acceptance Verdict
|
||||||
|
|
||||||
|
**Result: FAIL**
|
||||||
|
|
||||||
|
Missing required evidence:
|
||||||
|
- GPU Info
|
||||||
|
- Health Check
|
||||||
|
- Memory Bandwidth
|
||||||
|
- Compute Throughput
|
||||||
|
- NVLink/NVSwitch
|
||||||
|
- NCCL
|
||||||
|
- Stress Test
|
||||||
|
- RDMA
|
||||||
|
- DCGM
|
||||||
|
- Training
|
||||||
|
|
||||||
|
## Summary
|
||||||
|
|
||||||
|
| Test | Result |
|
||||||
|
|------|--------|
|
||||||
|
| Multi-node NCCL | FAIL |
|
||||||
|
|
||||||
|
## Multi-node NCCL / Cross Leaf
|
||||||
|
|
||||||
|
Source: nccl-tests-mpirun | Mode: diagnostic
|
||||||
|
|
||||||
|
- **Hosts:** nccl-gpu-1(172.72.8.12), nccl-gpu-2(172.72.8.16)
|
||||||
|
- **Preflight:** PASS
|
||||||
|
|
||||||
|
### Multi-node NCCL allreduce
|
||||||
|
|
||||||
|
| Topology | Peak Bus BW | Peak Size | Avg Bus BW | Threshold | Status |
|
||||||
|
|----------|-------------|-----------|------------|-----------|--------|
|
||||||
|
| 2 nodes x 8 GPUs diagnostic | 67.42 GB/s | 256M | 67.50 GB/s | >= 480 GB/s | FAIL |
|
||||||
|
|
||||||
|
| Topology | NCCL Network | GPU Direct RDMA | GDR Disabled HCAs |
|
||||||
|
|----------|--------------|-----------------|-------------------|
|
||||||
|
| 2 nodes x 8 GPUs diagnostic | IB | DISABLED | mlx5_0, mlx5_1, mlx5_6, mlx5_7 |
|
||||||
|
|
||||||
|
| Topology | Return Code | Error / Output Tail |
|
||||||
|
|----------|-------------|---------------------|
|
||||||
|
| 2 nodes x 8 GPUs diagnostic | 0 | orker0016:986293:986293 [1] NCCL INFO comm 0x563abe94c350 rank 9 nranks 16 cudaDev 1 busId 2a000 - Destroy COMPLETE aikubeworker0016:986292:986292 [0] NCCL INFO comm 0x560ffac51160 rank 8 nranks 16 cudaDev 0 busId 18000 - Destroy COMPLETE |
|
||||||
|
|
||||||
|
### Multi-node NCCL alltoall
|
||||||
|
|
||||||
|
| Topology | Peak Bus BW | Peak Size | Avg Bus BW | Threshold | Status |
|
||||||
|
|----------|-------------|-----------|------------|-----------|--------|
|
||||||
|
| 2 nodes x 8 GPUs diagnostic | 9.56 GB/s | 256M | 9.55 GB/s | >= 75 GB/s | FAIL |
|
||||||
|
|
||||||
|
| Topology | NCCL Network | GPU Direct RDMA | GDR Disabled HCAs |
|
||||||
|
|----------|--------------|-----------------|-------------------|
|
||||||
|
| 2 nodes x 8 GPUs diagnostic | IB | DISABLED | mlx5_0, mlx5_1, mlx5_6, mlx5_7 |
|
||||||
|
|
||||||
|
| Topology | Return Code | Error / Output Tail |
|
||||||
|
|----------|-------------|---------------------|
|
||||||
|
| 2 nodes x 8 GPUs diagnostic | 0 | TE aikubeworker0012:2141982:2141982 [4] NCCL INFO comm 0x55d0bf9c6a00 rank 4 nranks 16 cudaDev 4 busId 9a000 - Destroy COMPLETE # Out of bounds values : 0 OK # Avg bus bandwidth : 9.55234 # # Collective test concluded: alltoall_perf # |
|
||||||
|
|
||||||
|
**Overall: FAIL**
|
||||||
|
|
||||||
|
---
|
||||||
|
*Generated by GPU Test Suite v0.2.0*
|
||||||