Add multinode NCCL deep diagnosis tools
parent 08e0d93a16
commit 0d63ea5e05

README.md (11 lines changed)
@@ -575,6 +575,17 @@ report:
└── Confirm: training loss decreases normally
```

#### Multi-node NCCL deep diagnosis

When the multi-node NCCL results from SOP-3 do not match the acceptance PDF, run the deep diagnosis script on the launching node to reproduce the counter capture, the GRAPH/TUNING logs, and the PXN-disabled sweep:

```bash
bash scripts/multinode_nccl_deep_diagnose.sh preflight
bash scripts/multinode_nccl_deep_diagnose.sh all
```

For detailed parameters, the output directory layout, and how to read the results, see [docs/multinode_nccl_deep_diagnose_runbook.md](docs/multinode_nccl_deep_diagnose_runbook.md).

---

### SOP-4: Troubleshooting

docs/multinode_nccl_deep_diagnose_runbook.md (new file, 201 lines)
@@ -0,0 +1,201 @@
# Multi-node NCCL deep diagnosis runbook

This document reproduces the key steps of the 2026-05-23 round of NCCL troubleshooting on 2 nodes with 8 GPUs each: counter capture, GRAPH/TUNING logging, and a second-round parameter sweep on top of the PXN-disabled baseline.

## Applicable scenarios

The current default parameters target:

- `aikubeworker0012` / `172.72.8.12`
- `aikubeworker0016` / `172.72.8.16`
- 8 GPUs per node
- Four 400G HCAs per node: `mlx5_0,mlx5_1,mlx5_6,mlx5_7`
- Temporary NCCL runtime library: `/tmp/nccl-2.27.7-cuda12.4`
- nccl-tests: `/data/nccl-tests-latest/build`
- OpenMPI: `/usr/mpi/gcc/openmpi-4.1.9a1/bin/mpirun`

Run the script on the coordinator node, which is currently `aikubeworker0012`.

## Quick start

```bash
cd /root/test_gpu_scripts
bash scripts/multinode_nccl_deep_diagnose.sh preflight
bash scripts/multinode_nccl_deep_diagnose.sh all
```

The default output directory is:

```text
/tmp/nccl_deep_diagnose_YYYYMMDD_HHMMSS
```

To run a single mode:

```bash
# Lightweight check of SSH, mpirun, nccl-tests, and the HCA paths
bash scripts/multinode_nccl_deep_diagnose.sh preflight

# allreduce counter comparison
bash scripts/multinode_nccl_deep_diagnose.sh allreduce-counter

# alltoall counters with PXN disabled
bash scripts/multinode_nccl_deep_diagnose.sh alltoall-counter

# NCCL GRAPH/TUNING/COLL comparison
bash scripts/multinode_nccl_deep_diagnose.sh graph

# Second-round parameter sweep on the PXN-disabled baseline
bash scripts/multinode_nccl_deep_diagnose.sh pxn-sweep
```

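Because the default output directory name embeds a timestamp, the most recent run can be found by sorting directory names. The following is a minimal Python sketch, assuming the default `/tmp/nccl_deep_diagnose_` prefix has not been overridden through `OUT_DIR`; the helper name `latest_run_dir` is hypothetical and not part of the script.

```python
from pathlib import Path
from typing import Optional


def latest_run_dir(base: str = "/tmp", prefix: str = "nccl_deep_diagnose_") -> Optional[Path]:
    """Return the newest default output directory, or None if there is none.

    The YYYYMMDD_HHMMSS suffix sorts lexicographically in time order,
    so a plain name sort is enough.
    """
    candidates = sorted(p for p in Path(base).glob(prefix + "*") if p.is_dir())
    return candidates[-1] if candidates else None


# Prints None when no run with the default directory name exists yet.
print(latest_run_dir())
```

Runs started with a custom `OUT_DIR` are not found by this helper, since their names need not follow the timestamp pattern.
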
## Common parameter overrides

```bash
OUT_DIR=/tmp/my_nccl_diag \
HOSTS=172.72.8.12:8,172.72.8.16:8 \
PEER_HOST=172.72.8.16 \
HCAS="mlx5_0 mlx5_1 mlx5_6 mlx5_7" \
HCA_CSV=mlx5_0,mlx5_1,mlx5_6,mlx5_7 \
bash scripts/multinode_nccl_deep_diagnose.sh all
```

If the nccl-tests path or the NCCL runtime library path changes:

```bash
NCCL_TESTS_DIR=/opt/gpu-test-tools/nccl-tests/build \
NCCL_LD_LIBRARY_PATH=/usr/mpi/gcc/openmpi-4.1.9a1/lib:/path/to/nccl/lib:/usr/local/cuda/lib64 \
bash scripts/multinode_nccl_deep_diagnose.sh graph
```

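`HCAS` (space-separated, used by the script for the sysfs counter reads) and `HCA_CSV` (comma-separated, exported as `NCCL_IB_HCA`) describe the same device list in two formats, so overriding only one of them would make the counters and the NCCL run cover different devices. The sketch below derives both values from a single list; the helper name `hca_overrides` is hypothetical and only illustrates the idea.

```python
def hca_overrides(hcas):
    """Build matching HCAS / HCA_CSV values from one list of HCA names."""
    names = [h.strip() for h in hcas if h.strip()]
    if not names:
        raise ValueError("at least one HCA name is required")
    return {"HCAS": " ".join(names), "HCA_CSV": ",".join(names)}


env = hca_overrides(["mlx5_0", "mlx5_1", "mlx5_6", "mlx5_7"])
print(env["HCAS"])     # mlx5_0 mlx5_1 mlx5_6 mlx5_7
print(env["HCA_CSV"])  # mlx5_0,mlx5_1,mlx5_6,mlx5_7
```

The resulting dictionary could be merged into the environment of a `subprocess.run` call that launches the script.
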
## Interpreting the output

### preflight mode

Typical output file:

```text
preflight.txt
```

This mode does not run any NCCL workload. It only checks:

- The hostnames of the local node and the peer.
- Whether OpenMPI `mpirun` exists and is executable.
- Whether `all_reduce_perf` and `alltoall_perf` exist and are executable.
- Whether state and rate can be read for each configured HCA under `/sys/class/infiniband/<hca>/ports/1`.
- Whether SSH from the launching node to `PEER_HOST` works as root (the default `SSH_USER`).

If `MISSING` appears in the output, fix the environment first. Only then run `all` or an individual diagnosis mode.

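The `MISSING` check can be automated before launching the heavier modes. A minimal sketch, assuming the `MISSING executable:` and `MISSING HCA path:` prefixes that the script prints; the helper name `missing_items` and the sample text are made up for illustration.

```python
def missing_items(preflight_text):
    """Return the preflight lines that report a MISSING executable or HCA path."""
    return [
        line.strip()
        for line in preflight_text.splitlines()
        if line.strip().startswith("MISSING")
    ]


# Illustrative preflight output, not captured from a real run.
sample = """\
===== LOCAL =====
hostname: node-a
OK executable: /data/nccl-tests-latest/build/all_reduce_perf
MISSING executable: /data/nccl-tests-latest/build/alltoall_perf
OK HCA: mlx5_0 state=4: ACTIVE rate=400 Gb/sec (4X NDR)
"""

problems = missing_items(sample)
print(problems)
```

An empty list means no `MISSING` line was found. It does not prove that every HCA is in the expected state, so the `state=` and `rate=` values still deserve a manual look.
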
### counter mode

Typical output files:

```text
allreduce_counter/
  allreduce.log
  before.local
  before.remote
  after.local
  after.remote
  counter_delta.txt

alltoall_pxn_counter/
  alltoall_pxn.log
  before.local
  before.remote
  after.local
  after.remote
  counter_delta.txt
```

Focus on `counter_delta.txt`:

- `port_xmit_data` / `port_rcv_data`: port traffic in units of 4-byte words; the script also converts the delta to GiB.
- `port_xmit_wait`: a signal of transmit wait, credit wait, or congestion wait. It is not by itself a root cause specific to alltoall, because high-throughput allreduce shows it as well.
- `port_xmit_discards`, `port_rcv_errors`, `symbol_error`, `roce_adp_retrans`, `packet_seq_err`, and similar counters: signals of errors, drops, retransmissions, and link faults.

Currently known baseline:

- allreduce reaches about `354 GB/s busbw`, with the four rails balanced.
- alltoall with PXN disabled usually sits around `36-37 GB/s busbw`, with some fluctuation between measurement windows.
- With PXN disabled, alltoall traffic is balanced across rails and shows no notable error, retransmission, or slow-restart growth.

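Rail balance can be checked numerically from the `port_xmit_data` lines in `counter_delta.txt`. The sketch below assumes the line layout written by the script's delta summary (`host hca kind counter delta words4B gib GiB`); the helper names and the sample deltas are hypothetical and are not measurements.

```python
def xmit_gib_per_rail(delta_text):
    """Map (host, hca) -> transmitted GiB, parsed from counter_delta.txt lines."""
    rails = {}
    for line in delta_text.splitlines():
        parts = line.split()
        # Data lines: host hca kind port_xmit_data <delta> words4B <gib> GiB
        if len(parts) == 8 and parts[3] == "port_xmit_data" and parts[5] == "words4B":
            words = int(parts[4])
            rails[(parts[0], parts[1])] = words * 4 / 1024 ** 3
    return rails


def imbalance(rails):
    """Ratio of the busiest rail to the quietest one (1.0 means perfectly even)."""
    values = list(rails.values())
    if not values:
        raise ValueError("no port_xmit_data lines found")
    low = min(values)
    return float("inf") if low == 0 else max(values) / low


# Made-up deltas chosen so that the GiB values are round numbers.
sample = """\
NONZERO_DELTAS
node-a mlx5_0 counters port_xmit_data 26843545600 words4B 100.00 GiB
node-a mlx5_1 counters port_xmit_data 27380416512 words4B 102.00 GiB
node-a mlx5_0 counters port_xmit_wait 12345
"""

rails = xmit_gib_per_rail(sample)
print(f"{imbalance(rails):.2f}")  # 1.02
```

What ratio counts as "balanced" is a judgment call that this sketch leaves to the reader.
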
### graph mode

Typical output files:

```text
graph/
  allreduce.log
  allreduce_summary.txt
  alltoall_pxn.log
  alltoall_pxn_summary.txt
```

Focus on:

- `nccl_version`
- `plugin_missing`
- `gdr_enabled_lines`
- `pattern_counts`
- `channel_summary`
- `NET/IB/*/GDRDMA`
- `P2P/CUMEM`
- `channel_edge_lines`

Currently known comparison:

| Observation | allreduce | alltoall + `NCCL_PXN_DISABLE=1` |
|-------------|-----------|----------------------------------|
| HCA / GDR | 4 HCAs, GDR enabled | 4 HCAs, GDR enabled |
| channels | `16 coll / 16 nvls / 16 p2p` | `16 coll / 16 nvls / 16 p2p` |
| `NET/IB/*/GDRDMA` channel edge lines | `256` | `512` |
| `P2P/CUMEM` channel edge lines | `0` | `224` |
| total NET/P2P channel edge lines | `256` | `736` |

Interpretation boundaries:

- If the basic HCA, GDR, and channel state is the same but the alltoall graph is clearly more complex, the problem leans toward the NCCL collective graph, the way P2P and NET are combined, the internal IB plugin, or switch-fabric policy.
- If GDR is disabled, the HCA set is incomplete, or the plugin path has changed, the results cannot be compared directly with the conclusions of the current report.

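The edge-line rows in the table can be reproduced by splitting the script's `channel_edge_lines` count by transport. This is a sketch under the assumption that channel lines in the NCCL INFO log carry a `via NET/IB...` or `via P2P...` suffix, which is the same substring test the script uses; the sample lines are illustrative, not copied from a real log.

```python
import re
from collections import Counter


def edge_counts(log_text):
    """Count channel edge lines per transport family (NET/IB vs P2P)."""
    counts = Counter()
    for line in log_text.splitlines():
        if "Channel " not in line:
            continue
        match = re.search(r"via (NET/IB|P2P)\S*", line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Illustrative log lines, shaped like NCCL INFO channel output.
sample = """\
node-a:100:200 [0] NCCL INFO Channel 00/0 : 0[0] -> 8[0] [send] via NET/IB/0/GDRDMA
node-a:100:200 [0] NCCL INFO Channel 00/1 : 0[0] -> 1[1] via P2P/CUMEM
node-a:100:200 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE; OOB bond0
"""

counts = edge_counts(sample)
print(counts["NET/IB"], counts["P2P"], sum(counts.values()))
```

Applied to the numbers in the table, the alltoall total is simply the sum of the two families: 512 + 224 = 736.
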
### pxn-sweep mode

Typical output files:

```text
pxn_sweep/
  baseline.log
  nvls_off.log
  qps4_split1.log
  qps8_split1.log
  qps4_split0.log
  channels16.log
  buff8m.log
  p2pchunk4m.log
  netpeer8.log
  ar0.log
  summary.txt
```

Current conclusions:

- `NCCL_PXN_DISABLE=1` is the only consistently positive setting found so far.
- Layering NVLS, P2P chunk size, buffer size, channel count, QP/split, or AR changes on top of the PXN-disabled baseline gives no consistent gain.
- QP/split and `NCCL_NCHANNELS_PER_NET_PEER=8` are clearly worse in the current environment.

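To compare sweep cases against the baseline, `summary.txt` can be parsed into one bandwidth value per case. A minimal sketch, assuming the `===== CASE <name> =====` headers and the `Avg bus bandwidth` lines that the script appends; the function names are hypothetical and the sample values are placeholders, not measured results.

```python
import re


def parse_sweep_summary(text):
    """Return {case_name: avg_busbw} using the last bandwidth line of each case."""
    results = {}
    case = None
    for line in text.splitlines():
        header = re.match(r"===== CASE (\S+) =====", line)
        if header:
            case = header.group(1)
            continue
        bw = re.search(r"Avg bus bandwidth\s*:\s*([0-9.]+)", line)
        if bw and case is not None:
            results[case] = float(bw.group(1))
    return results


def rank_against_baseline(results, baseline="baseline"):
    """List (case, delta vs baseline) pairs, best case first."""
    base = results[baseline]
    deltas = [(name, value - base) for name, value in results.items() if name != baseline]
    return sorted(deltas, key=lambda item: item[1], reverse=True)


# Placeholder numbers for illustration only.
sample = """\
===== CASE baseline =====
extra: none
alltoall # Avg bus bandwidth    : 36.5
===== CASE nvls_off =====
extra: NCCL_NVLS_ENABLE=0
alltoall # Avg bus bandwidth    : 36.0
===== CASE netpeer8 =====
extra: NCCL_NCHANNELS_PER_NET_PEER=8
alltoall # Avg bus bandwidth    : 30.0
"""

results = parse_sweep_summary(sample)
print(rank_against_baseline(results))
```

Since alltoall bandwidth fluctuates between measurement windows, small deltas from a single sweep should be treated as noise rather than as a gain or a regression.
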
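## Key points for handoff to the network / NCCL environment team

1. This is not the old "outdated NCCL or GDR disabled" problem: under NCCL `2.27.7`, all four HCAs report GDR enabled.
2. This is not a case of traffic landing entirely on the wrong rails: with `NCCL_PXN_DISABLE=1`, alltoall traffic is balanced across the four rails.
3. This is not an obvious bad-link or retransmission problem: no growth was seen in discard, symbol error, RoCE retrans, slow restart, or packet sequence error counters.
4. allreduce is already close to the physically usable bandwidth of the current 4 x 400G rails. Working backwards from the PDF's 8-GPU allreduce target, that target would require more than the theoretical unidirectional bandwidth of the current four rails.
5. The remaining alltoall gap looks more like an issue with NCCL's internal alltoall graph, the way P2P and NET are combined, the missing NCCL net plugin or SHARP, or switch-fabric policy, ECMP, or congestion control.
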
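The bandwidth argument in item 4 can be made concrete with a short calculation. This is a sketch under stated assumptions rather than the method used in the reports: it takes the nominal 400 Gb/s line rate per rail, ignores protocol overhead, assumes that a 2-node allreduce of S bytes moves roughly S bytes per direction between the nodes (so the algorithm bandwidth is capped by the inter-node rate), and uses the nccl-tests definition busbw = algbw x 2(n-1)/n for allreduce.

```python
RAILS = 4        # HCAs per node, from the runbook defaults
RAIL_GBIT = 400  # nominal line rate per rail in Gb/s
RANKS = 16       # 2 nodes x 8 GPUs


def node_unidirectional_gbytes(rails=RAILS, rail_gbit=RAIL_GBIT):
    """Nominal one-way inter-node bandwidth of one node in GB/s."""
    return rails * rail_gbit / 8


def allreduce_busbw_ceiling(ranks=RANKS, link_gbytes=None):
    """Upper bound on nccl-tests allreduce busbw when the inter-node link is the bottleneck."""
    if link_gbytes is None:
        link_gbytes = node_unidirectional_gbytes()
    # Assumption: algbw <= link rate, and busbw = algbw * 2 * (n - 1) / n.
    return link_gbytes * 2 * (ranks - 1) / ranks


link = node_unidirectional_gbytes()
ceiling = allreduce_busbw_ceiling()
print(link)                      # 200.0
print(ceiling)                   # 375.0
print(round(354 / ceiling, 3))   # 0.944
```

Under these assumptions the measured baseline of about 354 GB/s busbw sits at roughly 94% of the nominal ceiling, which is consistent with the statement that allreduce is already close to what the four rails can deliver.
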
## Related reports

- `reports_multinode_nccl_diagnosis_20260523.md`
- `reports_multinode_nccl_alltoall_tuning_20260523.md`
- `reports_multinode_nccl_counter_probe_20260523.md`
- `reports_multinode_nccl_pdf_matrix_nccl227.md`

scripts/multinode_nccl_deep_diagnose.sh (new executable file, 425 lines)
@@ -0,0 +1,425 @@
#!/usr/bin/env bash
set -euo pipefail

# Deep-diagnose multi-node NCCL behavior from the coordinator node.
# Default values match the current 2-node H100 cross-leaf investigation.

MODE="${1:-all}"

MPI_BIN="${MPI_BIN:-/usr/mpi/gcc/openmpi-4.1.9a1/bin/mpirun}"
NCCL_TESTS_DIR="${NCCL_TESTS_DIR:-/data/nccl-tests-latest/build}"
HOSTS="${HOSTS:-172.72.8.12:8,172.72.8.16:8}"
PEER_HOST="${PEER_HOST:-172.72.8.16}"
SSH_USER="${SSH_USER:-root}"
HCAS="${HCAS:-mlx5_0 mlx5_1 mlx5_6 mlx5_7}"
HCA_CSV="${HCA_CSV:-mlx5_0,mlx5_1,mlx5_6,mlx5_7}"
OUT_DIR="${OUT_DIR:-/tmp/nccl_deep_diagnose_$(date +%Y%m%d_%H%M%S)}"

BEGIN_SIZE="${BEGIN_SIZE:-16G}"
END_SIZE="${END_SIZE:-16G}"
WARMUP_ITERS="${WARMUP_ITERS:-10}"
ITERS="${ITERS:-10}"
GRAPH_WARMUP_ITERS="${GRAPH_WARMUP_ITERS:-1}"
GRAPH_ITERS="${GRAPH_ITERS:-1}"
SWEEP_WARMUP_ITERS="${SWEEP_WARMUP_ITERS:-3}"
SWEEP_ITERS="${SWEEP_ITERS:-5}"

NCCL_LD_LIBRARY_PATH="${NCCL_LD_LIBRARY_PATH:-/usr/mpi/gcc/openmpi-4.1.9a1/lib:/tmp/nccl-2.27.7-cuda12.4/usr/lib/x86_64-linux-gnu:/usr/local/cuda-12.4/targets/x86_64-linux/lib}"
DEFAULT_NCCL_DEBUG="${NCCL_DEBUG:-WARN}"

COUNTERS="${COUNTERS:-port_xmit_data port_rcv_data port_xmit_packets port_rcv_packets port_xmit_wait port_xmit_discards port_rcv_errors port_rcv_remote_physical_errors port_rcv_switch_relay_errors port_xmit_constraint_errors port_rcv_constraint_errors symbol_error link_error_recovery link_downed local_link_integrity_errors excessive_buffer_overrun_errors VL15_dropped}"
HW_COUNTERS="${HW_COUNTERS:-roce_adp_retrans roce_adp_retrans_to roce_slow_restart roce_slow_restart_cnps roce_slow_restart_trans packet_seq_err out_of_sequence out_of_buffer duplicate_request implied_nak_seq_err local_ack_timeout_err req_transport_retries_exceeded rnr_nak_retry_err rx_write_requests rx_read_requests}"

mkdir -p "$OUT_DIR"

mpi_base=(
  "$MPI_BIN"
  --allow-run-as-root
  --mca btl_openib_warn_no_device_params_found 0
  --mca btl_tcp_if_include bond0
  --mca oob_tcp_if_include bond0
  --mca plm_rsh_args "-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=10"
  -H "$HOSTS"
  --map-by ppr:8:node
  -np 16
)

base_exports=(
  LD_LIBRARY_PATH
  NCCL_IB_GID_INDEX NCCL_IB_SL NCCL_IB_TC NCCL_SOCKET_IFNAME
  NCCL_DEBUG NCCL_DEBUG_SUBSYS NCCL_IB_TIMEOUT NCCL_IB_HCA
  NCCL_NET_PLUGIN NCCL_NVLS_ENABLE NCCL_NET_GDR_LEVEL NCCL_NET_GDR_READ
  NCCL_DMABUF_ENABLE NCCL_PXN_DISABLE NCCL_IB_QPS_PER_CONNECTION
  NCCL_IB_SPLIT_DATA_ON_QPS NCCL_MIN_NCHANNELS NCCL_MAX_NCHANNELS
  NCCL_BUFFSIZE NCCL_P2P_NET_CHUNKSIZE NCCL_NCHANNELS_PER_NET_PEER
  NCCL_IB_AR_THRESHOLD
)

set_common_env() {
  # Clear per-case tuning knobs so each case starts from the same baseline.
  unset NCCL_DEBUG_SUBSYS NCCL_PXN_DISABLE NCCL_IB_QPS_PER_CONNECTION
  unset NCCL_IB_SPLIT_DATA_ON_QPS NCCL_MIN_NCHANNELS NCCL_MAX_NCHANNELS
  unset NCCL_BUFFSIZE NCCL_P2P_NET_CHUNKSIZE NCCL_NCHANNELS_PER_NET_PEER
  unset NCCL_IB_AR_THRESHOLD

  export LD_LIBRARY_PATH="$NCCL_LD_LIBRARY_PATH"
  export NCCL_IB_GID_INDEX="${NCCL_IB_GID_INDEX:-3}"
  export NCCL_IB_SL="${NCCL_IB_SL:-5}"
  export NCCL_IB_TC="${NCCL_IB_TC:-136}"
  export NCCL_SOCKET_IFNAME="${NCCL_SOCKET_IFNAME:-bond0}"
  export NCCL_DEBUG="$DEFAULT_NCCL_DEBUG"
  export NCCL_IB_TIMEOUT="${NCCL_IB_TIMEOUT:-22}"
  export NCCL_IB_HCA="$HCA_CSV"
  export NCCL_NET_PLUGIN="${NCCL_NET_PLUGIN:-none}"
  export NCCL_NVLS_ENABLE="${NCCL_NVLS_ENABLE:-1}"
  export NCCL_NET_GDR_LEVEL="${NCCL_NET_GDR_LEVEL:-5}"
  export NCCL_NET_GDR_READ="${NCCL_NET_GDR_READ:-1}"
  export NCCL_DMABUF_ENABLE="${NCCL_DMABUF_ENABLE:-0}"
}

mpi_xargs() {
  # Emit "-x NAME" (one token per line) for every variable that is currently set.
  for name in "${base_exports[@]}"; do
    if [[ -n "${!name+x}" ]]; then
      printf -- '-x\n%s\n' "$name"
    fi
  done
}

run_nccl() {
  local op="$1"
  local bin="$2"
  local log="$3"
  local warmup="$4"
  local iters="$5"
  mapfile -t xargs < <(mpi_xargs)
  "${mpi_base[@]}" "${xargs[@]}" \
    "$bin" -b "$BEGIN_SIZE" -e "$END_SIZE" -g 1 -f 2 -w "$warmup" -n "$iters" \
    >"$log" 2>&1
  awk -v op="$op" '/Avg bus bandwidth/ {print op, $0}' "$log"
}

read_one_snapshot() {
  local host_label="$1"
  local out="$2"
  : >"$out"
  for hca in $HCAS; do
    for c in $COUNTERS; do
      local f="/sys/class/infiniband/$hca/ports/1/counters/$c"
      if [[ -r "$f" ]]; then
        printf '%s %s counters %s %s\n' "$host_label" "$hca" "$c" "$(cat "$f" 2>/dev/null || echo 0)" >>"$out"
      fi
    done
    for c in $HW_COUNTERS; do
      local f="/sys/class/infiniband/$hca/ports/1/hw_counters/$c"
      if [[ -r "$f" ]]; then
        printf '%s %s hw_counters %s %s\n' "$host_label" "$hca" "$c" "$(cat "$f" 2>/dev/null || echo 0)" >>"$out"
      fi
    done
  done
}

read_remote_snapshot() {
  local out="$1"
  ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null \
    -o BatchMode=yes -o ConnectTimeout=5 "${SSH_USER}@${PEER_HOST}" \
    "HCAS='$HCAS' COUNTERS='$COUNTERS' HW_COUNTERS='$HW_COUNTERS' bash -s" <<'EOS' >"$out"
for hca in $HCAS; do
  for c in $COUNTERS; do
    f="/sys/class/infiniband/$hca/ports/1/counters/$c"
    if [ -r "$f" ]; then
      printf '%s %s counters %s %s\n' "$HOSTNAME" "$hca" "$c" "$(cat "$f" 2>/dev/null || echo 0)"
    fi
  done
  for c in $HW_COUNTERS; do
    f="/sys/class/infiniband/$hca/ports/1/hw_counters/$c"
    if [ -r "$f" ]; then
      printf '%s %s hw_counters %s %s\n' "$HOSTNAME" "$hca" "$c" "$(cat "$f" 2>/dev/null || echo 0)"
    fi
  done
done
EOS
}

summarize_counter_delta() {
  local before_a="$1"
  local before_b="$2"
  local after_a="$3"
  local after_b="$4"
  local out="$5"
  python3 - "$before_a" "$before_b" "$after_a" "$after_b" >"$out" <<'PY'
import pathlib
import sys

interesting = {
    "port_xmit_wait", "port_xmit_discards", "port_rcv_errors",
    "port_rcv_remote_physical_errors", "port_rcv_switch_relay_errors",
    "port_xmit_constraint_errors", "port_rcv_constraint_errors",
    "symbol_error", "link_error_recovery", "link_downed",
    "local_link_integrity_errors", "excessive_buffer_overrun_errors",
    "VL15_dropped", "roce_adp_retrans", "roce_adp_retrans_to",
    "roce_slow_restart", "roce_slow_restart_cnps", "roce_slow_restart_trans",
    "packet_seq_err", "out_of_sequence", "out_of_buffer",
    "duplicate_request", "implied_nak_seq_err", "local_ack_timeout_err",
    "req_transport_retries_exceeded", "rnr_nak_retry_err",
}

def load(path):
    data = {}
    for line in pathlib.Path(path).read_text().splitlines():
        parts = line.split()
        if len(parts) != 5:
            continue
        host, hca, kind, counter, value = parts
        try:
            data[(host, hca, kind, counter)] = int(value)
        except ValueError:
            pass
    return data

before = {}
after = {}
before.update(load(sys.argv[1]))
before.update(load(sys.argv[2]))
after.update(load(sys.argv[3]))
after.update(load(sys.argv[4]))

print("NONZERO_DELTAS")
for key in sorted(set(before) | set(after)):
    delta = after.get(key, 0) - before.get(key, 0)
    if not delta:
        continue
    host, hca, kind, counter = key
    if counter in {"port_xmit_data", "port_rcv_data"}:
        # Data counters are in 4-byte words; convert the delta to GiB.
        gib = delta * 4 / (1024 ** 3)
        print(f"{host} {hca} {kind} {counter} {delta} words4B {gib:.2f} GiB")
    else:
        print(f"{host} {hca} {kind} {counter} {delta}")

print("ERROR_OR_CONGESTION_DELTAS")
seen = False
for key in sorted(set(before) | set(after)):
    delta = after.get(key, 0) - before.get(key, 0)
    if delta and key[3] in interesting:
        seen = True
        print(*key, delta)
if not seen:
    print("none")
PY
}

run_counter_case() {
  local op="$1"
  local bin="$2"
  local extra="${3:-}"
  set_common_env
  if [[ -n "$extra" ]]; then
    eval "export $extra"
  fi
  local dir="$OUT_DIR/${op}_counter"
  mkdir -p "$dir"
  read_one_snapshot "$(hostname)" "$dir/before.local"
  read_remote_snapshot "$dir/before.remote"
  run_nccl "$op" "$bin" "$dir/${op}.log" "$WARMUP_ITERS" "$ITERS"
  read_one_snapshot "$(hostname)" "$dir/after.local"
  read_remote_snapshot "$dir/after.remote"
  summarize_counter_delta "$dir/before.local" "$dir/before.remote" "$dir/after.local" "$dir/after.remote" "$dir/counter_delta.txt"
  echo "$dir"
}

summarize_graph_log() {
  local log="$1"
  local out="$2"
  python3 - "$log" >"$out" <<'PY'
from pathlib import Path
import collections
import re
import sys

text = Path(sys.argv[1]).read_text(errors="ignore")
print("avg_busbw", (re.findall(r"Avg bus bandwidth\s*:\s*([0-9.]+)", text) or ["NA"])[-1])
print("nccl_version", sorted(set(re.findall(r"NCCL version ([^\s]+)", text))))
print("plugin_missing", len(re.findall(r"Could not find: none libnccl-net-none\.so", text)))
print("gdr_enabled_lines", len(re.findall(r"GPU Direct RDMA Enabled", text)))
print("using_hca")
for value, count in collections.Counter(re.findall(r"NET/IB : Using \[(.*?)\]; OOB", text)).most_common(4):
    print(f"  {count} {value}")
print("pattern_counts")
patterns = re.findall(
    r"Pattern (\d+), crossNic (\d+), nChannels (\d+), bw ([0-9.]+)/([0-9.]+), type ([^,]+), sameChannels (\d+)",
    text,
)
for key, count in collections.Counter(patterns).most_common():
    print(f"  {count} {key}")
print("channel_summary")
for value, count in collections.Counter(
    re.findall(r"(\d+ coll channels, \d+ collnet channels, \d+ nvls channels, \d+ p2p channels, \d+ p2p channels per peer)", text)
).most_common():
    print(f"  {count} {value}")
print("p2p_chunks", collections.Counter(re.findall(r"P2P Chunksize set to (\d+)", text)))
print("check_p2p", collections.Counter(re.findall(r"Check P2P Type ([^\n]+)", text)))
for token in ["NET/IB/0/GDRDMA", "NET/IB/1/GDRDMA", "NET/IB/2/GDRDMA", "NET/IB/3/GDRDMA", "P2P/CUMEM", "P2P/IPC", "SHM"]:
    print(token, text.count(token))
print("channel_edge_lines", len([line for line in text.splitlines() if "Channel " in line and ("via NET/IB" in line or "via P2P" in line)]))
PY
}

run_graph_case() {
  local op="$1"
  local bin="$2"
  local extra="${3:-}"
  set_common_env
  export NCCL_DEBUG=INFO
  export NCCL_DEBUG_SUBSYS=INIT,NET,GRAPH,TUNING,COLL
  if [[ -n "$extra" ]]; then
    eval "export $extra"
  fi
  local dir="$OUT_DIR/graph"
  mkdir -p "$dir"
  local log="$dir/${op}.log"
  run_nccl "$op" "$bin" "$log" "$GRAPH_WARMUP_ITERS" "$GRAPH_ITERS"
  summarize_graph_log "$log" "$dir/${op}_summary.txt"
  echo "$dir/${op}_summary.txt"
}

run_pxn_sweep() {
  local dir="$OUT_DIR/pxn_sweep"
  mkdir -p "$dir"
  # Remember the caller's NVLS setting. set_common_env keeps whatever value
  # is currently exported, so without this the nvls_off case would leak
  # NCCL_NVLS_ENABLE=0 into every later case.
  local nvls_default="${NCCL_NVLS_ENABLE:-1}"
  local cases=(
    "baseline|"
    "nvls_off|NCCL_NVLS_ENABLE=0"
    "qps4_split1|NCCL_IB_QPS_PER_CONNECTION=4 NCCL_IB_SPLIT_DATA_ON_QPS=1"
    "qps8_split1|NCCL_IB_QPS_PER_CONNECTION=8 NCCL_IB_SPLIT_DATA_ON_QPS=1"
    "qps4_split0|NCCL_IB_QPS_PER_CONNECTION=4 NCCL_IB_SPLIT_DATA_ON_QPS=0"
    "channels16|NCCL_MIN_NCHANNELS=16 NCCL_MAX_NCHANNELS=16"
    "buff8m|NCCL_BUFFSIZE=8388608"
    "p2pchunk4m|NCCL_P2P_NET_CHUNKSIZE=4194304"
    "netpeer8|NCCL_NCHANNELS_PER_NET_PEER=8"
    "ar0|NCCL_IB_AR_THRESHOLD=0"
  )
  : >"$dir/summary.txt"
  for item in "${cases[@]}"; do
    local name="${item%%|*}"
    local extra="${item#*|}"
    set_common_env
    export NCCL_NVLS_ENABLE="$nvls_default"
    export NCCL_PXN_DISABLE=1
    if [[ -n "$extra" ]]; then
      eval "export $extra"
    fi
    local log="$dir/${name}.log"
    {
      echo "===== CASE $name ====="
      echo "extra: ${extra:-none}"
      run_nccl "alltoall" "$NCCL_TESTS_DIR/alltoall_perf" "$log" "$SWEEP_WARMUP_ITERS" "$SWEEP_ITERS"
      awk '/Avg bus bandwidth/ {print}' "$log" | tail -1
    } | tee -a "$dir/summary.txt"
  done
  echo "$dir/summary.txt"
}

run_preflight() {
  set_common_env
  local out="$OUT_DIR/preflight.txt"
  {
    echo "===== LOCAL ====="
    echo "hostname: $(hostname)"
    echo "mpirun: $MPI_BIN"
    if [[ -x "$MPI_BIN" ]]; then
      "$MPI_BIN" --version 2>&1 | sed -n '1p'
    else
      echo "MISSING executable: $MPI_BIN"
    fi
    for bin in "$NCCL_TESTS_DIR/all_reduce_perf" "$NCCL_TESTS_DIR/alltoall_perf"; do
      if [[ -x "$bin" ]]; then
        echo "OK executable: $bin"
      else
        echo "MISSING executable: $bin"
      fi
    done
    for hca in $HCAS; do
      local state="/sys/class/infiniband/$hca/ports/1/state"
      local rate="/sys/class/infiniband/$hca/ports/1/rate"
      if [[ -r "$state" ]]; then
        echo "OK HCA: $hca state=$(cat "$state") rate=$(cat "$rate" 2>/dev/null || echo unknown)"
      else
        echo "MISSING HCA path: $hca"
      fi
    done

    echo "===== REMOTE ====="
    ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null \
      -o BatchMode=yes -o ConnectTimeout=5 "${SSH_USER}@${PEER_HOST}" \
      "MPI_BIN='$MPI_BIN' NCCL_TESTS_DIR='$NCCL_TESTS_DIR' HCAS='$HCAS' bash -s" <<'EOS'
echo "hostname: $(hostname)"
echo "mpirun: $MPI_BIN"
if [ -x "$MPI_BIN" ]; then
  "$MPI_BIN" --version 2>&1 | sed -n '1p'
else
  echo "MISSING executable: $MPI_BIN"
fi
for bin in "$NCCL_TESTS_DIR/all_reduce_perf" "$NCCL_TESTS_DIR/alltoall_perf"; do
  if [ -x "$bin" ]; then
    echo "OK executable: $bin"
  else
    echo "MISSING executable: $bin"
  fi
done
for hca in $HCAS; do
  state="/sys/class/infiniband/$hca/ports/1/state"
  rate="/sys/class/infiniband/$hca/ports/1/rate"
  if [ -r "$state" ]; then
    echo "OK HCA: $hca state=$(cat "$state") rate=$(cat "$rate" 2>/dev/null || echo unknown)"
  else
    echo "MISSING HCA path: $hca"
  fi
done
EOS
  } | tee "$out"
  echo "$out"
}

usage() {
  cat <<EOF
Usage: $0 [preflight|all|allreduce-counter|alltoall-counter|graph|pxn-sweep]

Outputs are written to: $OUT_DIR

Common overrides:
  HOSTS, PEER_HOST, HCAS, HCA_CSV, MPI_BIN, NCCL_TESTS_DIR,
  NCCL_LD_LIBRARY_PATH, BEGIN_SIZE, END_SIZE, WARMUP_ITERS, ITERS
EOF
}

case "$MODE" in
  preflight)
    run_preflight
    ;;
  all)
    run_preflight
    run_counter_case allreduce "$NCCL_TESTS_DIR/all_reduce_perf" ""
    run_counter_case alltoall_pxn "$NCCL_TESTS_DIR/alltoall_perf" "NCCL_PXN_DISABLE=1"
    run_graph_case allreduce "$NCCL_TESTS_DIR/all_reduce_perf" ""
    run_graph_case alltoall_pxn "$NCCL_TESTS_DIR/alltoall_perf" "NCCL_PXN_DISABLE=1"
    run_pxn_sweep
    ;;
  allreduce-counter)
    run_counter_case allreduce "$NCCL_TESTS_DIR/all_reduce_perf" ""
    ;;
  alltoall-counter)
    run_counter_case alltoall_pxn "$NCCL_TESTS_DIR/alltoall_perf" "NCCL_PXN_DISABLE=1"
    ;;
  graph)
    run_graph_case allreduce "$NCCL_TESTS_DIR/all_reduce_perf" ""
    run_graph_case alltoall_pxn "$NCCL_TESTS_DIR/alltoall_perf" "NCCL_PXN_DISABLE=1"
    ;;
  pxn-sweep)
    run_pxn_sweep
    ;;
  -h|--help|help)
    usage
    ;;
  *)
    usage
    exit 2
    ;;
esac

echo "OUT_DIR=$OUT_DIR"