# Multi-node NCCL deep-diagnosis runbook

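This document reproduces the key steps from the 2026-05-23 NCCL investigation on 2 nodes with 8 GPUs each: counter capture, GRAPH/TUNING logging, and a second-round parameter sweep on top of the PXN-disabled baseline.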

## Applicable environment

The current default parameters target:

- `aikubeworker0012` / `172.72.8.12`
- `aikubeworker0016` / `172.72.8.16`
- 8 GPUs per node
- 4 x 400G HCAs per node: `mlx5_0,mlx5_1,mlx5_6,mlx5_7`
- Temporary NCCL runtime library: `/tmp/nccl-2.27.7-cuda12.4`
- nccl-tests: `/data/nccl-tests-latest/build`
- OpenMPI: `/usr/mpi/gcc/openmpi-4.1.9a1/bin/mpirun`

Run the scripts on the coordinator node, which is currently `aikubeworker0012`.

## Quick start

```bash
cd /root/test_gpu_scripts
bash scripts/multinode_nccl_deep_diagnose.sh preflight
bash scripts/multinode_nccl_deep_diagnose.sh all
```

To produce the formal multi-node, multi-GPU report that follows the PDF reference matrix, use:

```bash
cd /root/test_gpu_scripts
bash scripts/run_multinode_nccl_pdf_matrix.sh
```

It runs `all_reduce_perf` and `alltoall_perf` on 2 nodes x 1/2/4/8 GPUs per node and writes the results to
`reports/multinode_nccl_pdf_matrix_YYYYMMDD_HHMMSS.md`.

The default output directory of `multinode_nccl_deep_diagnose.sh` is:

```text
/tmp/nccl_deep_diagnose_YYYYMMDD_HHMMSS
```

To run a single mode only:

```bash
# Lightweight check of SSH, mpirun, nccl-tests, and HCA paths
bash scripts/multinode_nccl_deep_diagnose.sh preflight

# allreduce counter comparison
bash scripts/multinode_nccl_deep_diagnose.sh allreduce-counter

# alltoall counters with PXN disabled
bash scripts/multinode_nccl_deep_diagnose.sh alltoall-counter

# NCCL GRAPH/TUNING/COLL comparison
bash scripts/multinode_nccl_deep_diagnose.sh graph

# Second-round parameter sweep on the PXN-disabled baseline
bash scripts/multinode_nccl_deep_diagnose.sh pxn-sweep
```

## Common parameter overrides

```bash
OUT_DIR=/tmp/my_nccl_diag \
HOSTS=172.72.8.12:8,172.72.8.16:8 \
PEER_HOST=172.72.8.16 \
HCAS="mlx5_0 mlx5_1 mlx5_6 mlx5_7" \
HCA_CSV=mlx5_0,mlx5_1,mlx5_6,mlx5_7 \
bash scripts/multinode_nccl_deep_diagnose.sh all
```

If the nccl-tests path or the NCCL runtime library path changes:

```bash
NCCL_TESTS_DIR=/data/nccl-tests-latest/build \
NCCL_LD_LIBRARY_PATH=/usr/mpi/gcc/openmpi-4.1.9a1/lib:/path/to/nccl/lib:/usr/local/cuda/lib64 \
bash scripts/multinode_nccl_deep_diagnose.sh graph
```

## Interpreting the output

### preflight mode

Typical output file:

```text
preflight.txt
```

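This mode does not run any NCCL workload. It only checks:

- The local and peer hostnames.
- Whether OpenMPI `mpirun` exists and is executable.
- Whether `all_reduce_perf` / `alltoall_perf` exist and are executable.
- Whether state/rate can be read for each configured HCA under `/sys/class/infiniband/<hca>/ports/1`.
- Whether root SSH from the launching node to `PEER_HOST` works.

If `MISSING` appears here, fix the environment first; only then run `all` or an individual diagnostic.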
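The HCA item in the list above can be sketched roughly as follows. This is a minimal illustration, not the code inside `multinode_nccl_deep_diagnose.sh`: the function name `check_hca_ports` and the `SYSFS_IB_ROOT` override are hypothetical, and the output wording is only assumed to resemble the `MISSING` marker mentioned above.

```bash
# Sketch only: check_hca_ports and SYSFS_IB_ROOT are hypothetical names.
SYSFS_IB_ROOT="${SYSFS_IB_ROOT:-/sys/class/infiniband}"

# Print one line per HCA: OK with state/rate, or MISSING with the path tried.
# The body runs in a subshell so its variables do not leak to the caller.
check_hca_ports() (
  for hca in "$@"; do
    port_dir="${SYSFS_IB_ROOT}/${hca}/ports/1"
    if [ -r "${port_dir}/state" ] && [ -r "${port_dir}/rate" ]; then
      state="$(cat "${port_dir}/state")"
      rate="$(cat "${port_dir}/rate")"
      echo "${hca} OK state=${state} rate=${rate}"
    else
      echo "${hca} MISSING ${port_dir}"
    fi
  done
)

# Example call with the default HCA list from this runbook:
# check_hca_ports mlx5_0 mlx5_1 mlx5_6 mlx5_7
```

Making the sysfs root overridable lets the same function be exercised against a fake directory tree on a machine without HCAs.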

### counter mode

Typical output files:

```text
allreduce_counter/
  allreduce.log
  before.local
  before.remote
  after.local
  after.remote
  counter_delta.txt

alltoall_pxn_counter/
  alltoall_pxn.log
  before.local
  before.remote
  after.local
  after.remote
  counter_delta.txt
```

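Focus on `counter_delta.txt`:

- `port_xmit_data` / `port_rcv_data`: port traffic, in units of 4-byte words; the script also converts the values to GiB.
- `port_xmit_wait`: indicates that the sender stalled waiting for credits or because of congestion. It is not a root cause unique to alltoall, because high-throughput allreduce shows it too.
- `port_xmit_discards`, `port_rcv_errors`, `symbol_error`, `roce_adp_retrans`, `packet_seq_err`, and similar: signals of errors, drops, retransmissions, or link faults.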
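The unit conversion for `port_xmit_data` / `port_rcv_data` can be illustrated with a short helper. It is a sketch under the assumption stated above (one counter unit equals 4 bytes); `words_to_gib` is a hypothetical name and not necessarily how the script computes `counter_delta.txt`.

```bash
# Sketch only: words_to_gib is a hypothetical helper name.
# $1 = counter value before the run, $2 = counter value after the run.
# Each counter unit is a 4-byte word, so bytes = delta * 4 and GiB = bytes / 2^30.
words_to_gib() {
  awk -v before="$1" -v after="$2" \
    'BEGIN { printf "%.2f\n", (after - before) * 4 / (1024 * 1024 * 1024) }'
}

# Example: a delta of 268435456 words is exactly 1 GiB.
words_to_gib 0 268435456
```

awk does the arithmetic in double precision, which is exact for integers up to 2^53; counters beyond that would need integer arithmetic instead.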

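Currently known baseline:

- allreduce reaches about `354 GB/s busbw`, balanced across the 4 rails.
- alltoall with PXN disabled usually sits around `36-37 GB/s busbw`, with some run-to-run fluctuation.
- With PXN disabled, alltoall traffic is balanced across the rails and shows no noticeable error/retrans/slow restart growth.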

### graph mode

Typical output files:

```text
graph/
  allreduce.log
  allreduce_summary.txt
  alltoall_pxn.log
  alltoall_pxn_summary.txt
```

Focus on:

- `nccl_version`
- `plugin_missing`
- `gdr_enabled_lines`
- `pattern_counts`
- `channel_summary`
- `NET/IB/*/GDRDMA`
- `P2P/CUMEM`
- `channel_edge_lines`

Currently known comparison:

| Observation | allreduce | alltoall + `NCCL_PXN_DISABLE=1` |
|--------|-----------|----------------------------------|
| HCA / GDR | 4 HCA, GDR enabled | 4 HCA, GDR enabled |
| channels | `16 coll / 16 nvls / 16 p2p` | `16 coll / 16 nvls / 16 p2p` |
| `NET/IB/*/GDRDMA` channel edge lines | `256` | `512` |
| `P2P/CUMEM` channel edge lines | `0` | `224` |
| total NET/P2P channel edge lines | `256` | `736` |

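How to read the result:

- If the basic HCA/GDR/channel state matches but the alltoall graph is clearly more complex, the problem more likely lies in NCCL's collective graph, the way P2P and NET paths are combined, the internal IB plugin, or the switch-fabric policy.
- If GDR is disabled, HCAs are missing, or the plugin path has changed, the results cannot be compared directly with the conclusions of the current reports.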
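The edge-line counts in the comparison table can be reproduced by hand with `grep`. The sketch below assumes that each channel edge appears in the NCCL INFO log as a line containing `NET/IB/<n>/GDRDMA` or `P2P/CUMEM`; `count_edges` is a hypothetical helper, and the summary files produced by the script may count differently.

```bash
# Sketch only: count_edges is a hypothetical helper name.
# Counts log lines that mention a NET/IB GDRDMA edge or a P2P/CUMEM edge.
count_edges() (
  log="$1"
  # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
  net="$(grep -c 'NET/IB/[^ ]*/GDRDMA' "$log" || true)"
  p2p="$(grep -c 'P2P/CUMEM' "$log" || true)"
  echo "net_ib_gdrdma=${net} p2p_cumem=${p2p} total=$((net + p2p))"
)

# Example: count_edges graph/alltoall_pxn.log
```

Comparing the two logs this way makes the `256` versus `736` total in the table easy to re-check after a rerun.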

### pxn-sweep mode

Typical output:

```text
pxn_sweep/
  baseline.log
  nvls_off.log
  qps4_split1.log
  qps8_split1.log
  qps4_split0.log
  channels16.log
  buff8m.log
  p2pchunk4m.log
  netpeer8.log
  ar0.log
  summary.txt
```

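Current conclusions:

- `NCCL_PXN_DISABLE=1` is the only consistently beneficial setting found so far.
- Stacking NVLS, P2P chunk, buffer, channel, QP/split, or AR changes on top of the PXN-disabled baseline gives no consistent gain.
- QP/split and `NCCL_NCHANNELS_PER_NET_PEER=8` are clearly worse in the current environment.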
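To compare sweep variants quickly, the average bus bandwidth can be pulled out of each log. This sketch assumes every log contains the `# Avg bus bandwidth` line that nccl-tests prints at the end of a run; `summarize_sweep` is a hypothetical helper and is not claimed to match how `summary.txt` is generated.

```bash
# Sketch only: summarize_sweep is a hypothetical helper name.
# Prints "<variant><TAB><avg busbw>" per log, highest bandwidth first.
summarize_sweep() (
  dir="$1"
  for log in "$dir"/*.log; do
    [ -f "$log" ] || continue
    # Take the text after the last ':' on the summary line; NA if it is absent.
    bw="$(awk -F: '/Avg bus bandwidth/ { gsub(/[ \t]/, "", $NF); v = $NF }
                   END { print (v == "" ? "NA" : v) }' "$log")"
    printf '%s\t%s\n' "$(basename "$log" .log)" "$bw"
  done | LC_ALL=C sort -k2,2nr
)

# Example: summarize_sweep /tmp/my_nccl_diag/pxn_sweep
```

Because alltoall results fluctuate between runs, a single ranking like this is only a starting point; repeated runs are needed before calling a variant better or worse.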

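## Key points for handoff to the network/NCCL environment team

1. This is not an old-NCCL or GDR-disabled problem: with NCCL `2.27.7`, all 4 HCAs have GDR enabled.
2. This is not a severely skewed rail problem: with `NCCL_PXN_DISABLE=1`, alltoall traffic is balanced across the 4 rails.
3. This is not an obvious bad-link or retransmission problem: no growth was seen in discards, symbol errors, RoCE retransmissions, slow restarts, or packet sequence errors.
4. allreduce is already close to the physically available bandwidth of the current 4 x 400G rails; working backward from the PDF's 8-GPU allreduce target requires more than the theoretical unidirectional bandwidth of the current 4 rails.
5. The remaining alltoall gap looks more like an issue with NCCL's internal alltoall graph, the way P2P and NET paths are combined, the missing NCCL net plugin/SHARP, or the switch-fabric policy/ECMP/congestion control.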
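The bandwidth argument in point 4 can be checked with a rough calculation. The sketch uses the nccl-tests definition for allreduce, `busbw = algbw * 2(n-1)/n`, and compares the resulting `algbw` with the per-node unidirectional line rate of the rails. Treating `algbw` as the per-node network load is a simplifying assumption for the 2-node case, and `rail_budget` is a hypothetical helper.

```bash
# Sketch only: rail_budget is a hypothetical helper name.
# $1 = rails per node, $2 = Gb/s per rail, $3 = total ranks, $4 = measured busbw (GB/s)
rail_budget() {
  awk -v rails="$1" -v gbps="$2" -v n="$3" -v busbw="$4" 'BEGIN {
    line_rate = rails * gbps / 8        # GB/s per node, one direction
    algbw = busbw * n / (2 * (n - 1))   # invert busbw = algbw * 2(n-1)/n
    printf "line_rate=%.1f algbw=%.1f ratio=%.2f\n", line_rate, algbw, algbw / line_rate
  }'
}

# Baseline from this runbook: 4 rails x 400G, 16 ranks, about 354 GB/s busbw.
rail_budget 4 400 16 354
```

Under these assumptions the measured allreduce corresponds to roughly 94% of the 200 GB/s line rate, which is consistent with the statement that allreduce is already close to what the 4 rails can deliver.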

## Related reports

- `reports_multinode_nccl_diagnosis_20260523.md`
- `reports_multinode_nccl_alltoall_tuning_20260523.md`
- `reports_multinode_nccl_counter_probe_20260523.md`
- `reports_multinode_nccl_pdf_matrix_nccl227.md`