diff --git a/README.md b/README.md
index eed4791..fd890d4 100644
--- a/README.md
+++ b/README.md
@@ -575,6 +575,17 @@ report:
 └── Confirm: training loss decreases normally
 ```
 
+#### Multi-node NCCL deep diagnosis
+
+When the multi-node NCCL results from SOP-3 disagree with the acceptance PDF, run the deep-diagnosis script on the launch node to reproduce the counter capture, the GRAPH/TUNING logs, and the PXN-disabled sweep:
+
+```bash
+bash scripts/multinode_nccl_deep_diagnose.sh preflight
+bash scripts/multinode_nccl_deep_diagnose.sh all
+```
+
+For detailed parameters, output directories, and how to interpret the results, see [docs/multinode_nccl_deep_diagnose_runbook.md](docs/multinode_nccl_deep_diagnose_runbook.md).
+
 ---
 
 ### SOP-4: Fault diagnosis
diff --git a/docs/multinode_nccl_deep_diagnose_runbook.md b/docs/multinode_nccl_deep_diagnose_runbook.md
new file mode 100644
index 0000000..11a0629
--- /dev/null
+++ b/docs/multinode_nccl_deep_diagnose_runbook.md
@@ -0,0 +1,201 @@
+# Multi-node NCCL deep-diagnosis runbook
+
+This document reproduces the key steps of the 2026-05-23 NCCL investigation on 2 nodes with 8 GPUs each: counter capture, GRAPH/TUNING logs, and a second-round parameter sweep on top of the PXN-disabled baseline.
+
+## Applicable scenario
+
+The current default parameters target:
+
+- `aikubeworker0012` / `172.72.8.12`
+- `aikubeworker0016` / `172.72.8.16`
+- 8 GPUs per node
+- 4 x 400G HCAs per node: `mlx5_0,mlx5_1,mlx5_6,mlx5_7`
+- Temporary NCCL runtime library: `/tmp/nccl-2.27.7-cuda12.4`
+- nccl-tests: `/data/nccl-tests-latest/build`
+- OpenMPI: `/usr/mpi/gcc/openmpi-4.1.9a1/bin/mpirun`
+
+Run the script on the coordinator node, currently `aikubeworker0012`.
+
+## Quick run
+
+```bash
+cd /root/test_gpu_scripts
+bash scripts/multinode_nccl_deep_diagnose.sh preflight
+bash scripts/multinode_nccl_deep_diagnose.sh all
+```
+
+The default output directory is:
+
+```text
+/tmp/nccl_deep_diagnose_YYYYMMDD_HHMMSS
+```
+
+To run a single mode:
+
+```bash
+# Lightweight check of SSH, mpirun, nccl-tests, and the HCA paths
+bash scripts/multinode_nccl_deep_diagnose.sh preflight
+
+# allreduce counter comparison
+bash scripts/multinode_nccl_deep_diagnose.sh allreduce-counter
+
+# PXN-disabled alltoall counter
+bash scripts/multinode_nccl_deep_diagnose.sh alltoall-counter
+
+# NCCL GRAPH/TUNING/COLL comparison
+bash scripts/multinode_nccl_deep_diagnose.sh graph
+
+# Second-round parameter sweep on the PXN-disabled baseline
+bash scripts/multinode_nccl_deep_diagnose.sh pxn-sweep
+```
+
+## Common parameter overrides
+
+```bash
+OUT_DIR=/tmp/my_nccl_diag \
+HOSTS=172.72.8.12:8,172.72.8.16:8 \
+PEER_HOST=172.72.8.16 \
+HCAS="mlx5_0 mlx5_1 mlx5_6 mlx5_7" \
+HCA_CSV=mlx5_0,mlx5_1,mlx5_6,mlx5_7 \
+bash scripts/multinode_nccl_deep_diagnose.sh all
+```
+
+If the nccl-tests or NCCL runtime library path changes:
+
+```bash
+NCCL_TESTS_DIR=/opt/gpu-test-tools/nccl-tests/build \
+NCCL_LD_LIBRARY_PATH=/usr/mpi/gcc/openmpi-4.1.9a1/lib:/path/to/nccl/lib:/usr/local/cuda/lib64 \
+bash scripts/multinode_nccl_deep_diagnose.sh graph
+```
+
+## Interpreting the output
+
+### preflight mode
+
+Typical output files:
+
+```text
+preflight.txt
+```
+
+This mode runs no NCCL workload; it only checks:
+
+- The local and peer hostnames.
+- Whether OpenMPI `mpirun` exists and is executable.
+- Whether `all_reduce_perf` / `alltoall_perf` exist and are executable.
+- Whether state/rate can be read under `/sys/class/infiniband/<hca>/ports/1` for each configured HCA.
+- Whether root SSH from the launch node to `PEER_HOST` works.
+
+If `MISSING` appears here, fix the environment first; otherwise proceed to `all` or an individual diagnostic.
+
+### counter mode
+
+Typical output files:
+
+```text
+allreduce_counter/
+  allreduce.log
+  before.local
+  before.remote
+  after.local
+  after.remote
+  counter_delta.txt
+
+alltoall_pxn_counter/
+  alltoall_pxn.log
+  before.local
+  before.remote
+  after.local
+  after.remote
+  counter_delta.txt
+```
+
+Focus on `counter_delta.txt`:
+
+- `port_xmit_data` / `port_rcv_data`: port traffic in 4-byte words; the script also converts it to GiB.
+- `port_xmit_wait`: signals transmit waits, i.e. credit or congestion stalls. Note that it is not a root cause unique to alltoall, because high-throughput allreduce shows it too.
+- `port_xmit_discards`, `port_rcv_errors`, `symbol_error`, `roce_adp_retrans`, `packet_seq_err`, etc.: signals of errors, packet loss, retransmission, or link faults.
+
+Currently known baseline:
+
+- allreduce reaches about `354 GB/s busbw`, with the 4 rails balanced.
+- PXN-disabled alltoall usually sits around `36-37 GB/s busbw`, but varies between measurement windows.
+- With PXN disabled, the alltoall rails are balanced, with no noticeable error/retrans/slow restart growth.
+
+### graph mode
+
+Typical output files:
+
+```text
+graph/
+  allreduce.log
+  allreduce_summary.txt
+  alltoall_pxn.log
+  alltoall_pxn_summary.txt
+```
+
+Focus on:
+
+- `nccl_version`
+- `plugin_missing`
+- `gdr_enabled_lines`
+- `pattern_counts`
+- `channel_summary`
+- `NET/IB/*/GDRDMA`
+- `P2P/CUMEM`
+- `channel_edge_lines`
+
+Currently known comparison:
+
+| Observation | allreduce | alltoall + `NCCL_PXN_DISABLE=1` |
+|--------|-----------|----------------------------------|
+| HCA / GDR | 4 HCA, GDR enabled | 4 HCA, GDR enabled |
+| channels | `16 coll / 16 nvls / 16 p2p` | `16 coll / 16 nvls / 16 p2p` |
+| `NET/IB/*/GDRDMA` channel edge lines | `256` | `512` |
+| `P2P/CUMEM` channel edge lines | `0` | `224` |
+| total NET/P2P channel edge lines | `256` | `736` |
+
+How to read the comparison:
+
+- If the basic HCA/GDR/channel state matches but the alltoall graph is clearly more complex, the problem more likely lies in the NCCL collective graph, the way P2P and NET are combined, the internal IB plugin, or the switch network policy.
+- If GDR is disabled, an HCA is missing, or the plugin path has changed, the results cannot be compared directly with the conclusions of the current report.
+
+### pxn-sweep mode
+
+Typical output:
+
+```text
+pxn_sweep/
+  baseline.log
+  nvls_off.log
+  qps4_split1.log
+  qps8_split1.log
+  qps4_split0.log
+  channels16.log
+  buff8m.log
+  p2pchunk4m.log
+  netpeer8.log
+  ar0.log
+  summary.txt
+```
+
+Current conclusions:
+
+- `NCCL_PXN_DISABLE=1` is the only stable positive setting found so far.
+- On top of the PXN-disabled baseline, further changes to NVLS, P2P chunk, buffer, channel, QP/split, and AR gave no stable gain.
+- QP/split and `NCCL_NCHANNELS_PER_NET_PEER=8` are clearly worse in the current environment.
+
+## Key points for handoff to the network/NCCL environment team
+
+1. This is not an old-NCCL or GDR-disabled problem: under NCCL `2.27.7`, all 4 HCAs are GDR enabled.
+2. This is not a case of traffic skewed entirely onto some rails: with `NCCL_PXN_DISABLE=1`, the 4 alltoall rails are balanced.
+3. This is not an obvious bad-link or retransmission problem: no growth was seen in discard, symbol error, RoCE retrans, slow restart, packet sequence error, or similar counters.
+4. allreduce is already close to the physically usable bandwidth of the current 4 x 400G rails; working backward from the PDF's 8-GPU allreduce target gives a requirement above the one-way theoretical bandwidth of the current 4 rails.
+5. The remaining alltoall gap looks more like an issue with the NCCL internal alltoall graph, the way P2P and NET are combined, the missing NCCL net plugin/SHARP, or the switch network policy/ECMP/congestion control.
+
+## Related reports
+
+- `reports_multinode_nccl_diagnosis_20260523.md`
+- `reports_multinode_nccl_alltoall_tuning_20260523.md`
+- `reports_multinode_nccl_counter_probe_20260523.md`
+- `reports_multinode_nccl_pdf_matrix_nccl227.md`
diff --git a/scripts/multinode_nccl_deep_diagnose.sh b/scripts/multinode_nccl_deep_diagnose.sh
new file mode 100755
index 0000000..b16409c
--- /dev/null
+++ b/scripts/multinode_nccl_deep_diagnose.sh
@@ -0,0 +1,425 @@
+#!/usr/bin/env bash
+set -euo pipefail
+
+# Deep-diagnose multi-node NCCL behavior from the coordinator node.
+# Default values match the current 2-node H100 cross-leaf investigation.
+
+MODE="${1:-all}"
+
+MPI_BIN="${MPI_BIN:-/usr/mpi/gcc/openmpi-4.1.9a1/bin/mpirun}"
+NCCL_TESTS_DIR="${NCCL_TESTS_DIR:-/data/nccl-tests-latest/build}"
+HOSTS="${HOSTS:-172.72.8.12:8,172.72.8.16:8}"
+PEER_HOST="${PEER_HOST:-172.72.8.16}"
+SSH_USER="${SSH_USER:-root}"
+HCAS="${HCAS:-mlx5_0 mlx5_1 mlx5_6 mlx5_7}"
+HCA_CSV="${HCA_CSV:-mlx5_0,mlx5_1,mlx5_6,mlx5_7}"
+OUT_DIR="${OUT_DIR:-/tmp/nccl_deep_diagnose_$(date +%Y%m%d_%H%M%S)}"
+
+BEGIN_SIZE="${BEGIN_SIZE:-16G}"
+END_SIZE="${END_SIZE:-16G}"
+WARMUP_ITERS="${WARMUP_ITERS:-10}"
+ITERS="${ITERS:-10}"
+GRAPH_WARMUP_ITERS="${GRAPH_WARMUP_ITERS:-1}"
+GRAPH_ITERS="${GRAPH_ITERS:-1}"
+SWEEP_WARMUP_ITERS="${SWEEP_WARMUP_ITERS:-3}"
+SWEEP_ITERS="${SWEEP_ITERS:-5}"
+
+NCCL_LD_LIBRARY_PATH="${NCCL_LD_LIBRARY_PATH:-/usr/mpi/gcc/openmpi-4.1.9a1/lib:/tmp/nccl-2.27.7-cuda12.4/usr/lib/x86_64-linux-gnu:/usr/local/cuda-12.4/targets/x86_64-linux/lib}"
+DEFAULT_NCCL_DEBUG="${NCCL_DEBUG:-WARN}"
+
+COUNTERS="${COUNTERS:-port_xmit_data port_rcv_data port_xmit_packets port_rcv_packets port_xmit_wait port_xmit_discards port_rcv_errors port_rcv_remote_physical_errors port_rcv_switch_relay_errors port_xmit_constraint_errors port_rcv_constraint_errors symbol_error link_error_recovery link_downed local_link_integrity_errors excessive_buffer_overrun_errors VL15_dropped}"
+HW_COUNTERS="${HW_COUNTERS:-roce_adp_retrans roce_adp_retrans_to roce_slow_restart roce_slow_restart_cnps roce_slow_restart_trans packet_seq_err out_of_sequence out_of_buffer duplicate_request implied_nak_seq_err local_ack_timeout_err req_transport_retries_exceeded rnr_nak_retry_err rx_write_requests rx_read_requests}"
+
+mkdir -p "$OUT_DIR"
+
+mpi_base=(
+  "$MPI_BIN"
+  --allow-run-as-root
+  --mca btl_openib_warn_no_device_params_found 0
+  --mca btl_tcp_if_include bond0
+  --mca oob_tcp_if_include bond0
+  --mca plm_rsh_args "-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=10"
+  -H "$HOSTS"
+  --map-by ppr:8:node
+  -np 16
+)
+
+base_exports=(
+  LD_LIBRARY_PATH
+  NCCL_IB_GID_INDEX NCCL_IB_SL NCCL_IB_TC NCCL_SOCKET_IFNAME
+  NCCL_DEBUG NCCL_DEBUG_SUBSYS NCCL_IB_TIMEOUT NCCL_IB_HCA
+  NCCL_NET_PLUGIN NCCL_NVLS_ENABLE NCCL_NET_GDR_LEVEL NCCL_NET_GDR_READ
+  NCCL_DMABUF_ENABLE NCCL_PXN_DISABLE NCCL_IB_QPS_PER_CONNECTION
+  NCCL_IB_SPLIT_DATA_ON_QPS NCCL_MIN_NCHANNELS NCCL_MAX_NCHANNELS
+  NCCL_BUFFSIZE NCCL_P2P_NET_CHUNKSIZE NCCL_NCHANNELS_PER_NET_PEER
+  NCCL_IB_AR_THRESHOLD
+)
+
+set_common_env() {
+  unset NCCL_DEBUG_SUBSYS NCCL_PXN_DISABLE NCCL_IB_QPS_PER_CONNECTION
+  unset NCCL_IB_SPLIT_DATA_ON_QPS NCCL_MIN_NCHANNELS NCCL_MAX_NCHANNELS
+  unset NCCL_BUFFSIZE NCCL_P2P_NET_CHUNKSIZE NCCL_NCHANNELS_PER_NET_PEER
+  unset NCCL_IB_AR_THRESHOLD
+
+  export LD_LIBRARY_PATH="$NCCL_LD_LIBRARY_PATH"
+  export NCCL_IB_GID_INDEX="${NCCL_IB_GID_INDEX:-3}"
+  export NCCL_IB_SL="${NCCL_IB_SL:-5}"
+  export NCCL_IB_TC="${NCCL_IB_TC:-136}"
+  export NCCL_SOCKET_IFNAME="${NCCL_SOCKET_IFNAME:-bond0}"
+  export NCCL_DEBUG="$DEFAULT_NCCL_DEBUG"
+  export NCCL_IB_TIMEOUT="${NCCL_IB_TIMEOUT:-22}"
+  export NCCL_IB_HCA="$HCA_CSV"
+  export NCCL_NET_PLUGIN="${NCCL_NET_PLUGIN:-none}"
+  export NCCL_NVLS_ENABLE="${NCCL_NVLS_ENABLE:-1}"
+  export NCCL_NET_GDR_LEVEL="${NCCL_NET_GDR_LEVEL:-5}"
+  export NCCL_NET_GDR_READ="${NCCL_NET_GDR_READ:-1}"
+  export NCCL_DMABUF_ENABLE="${NCCL_DMABUF_ENABLE:-0}"
+}
+
+mpi_xargs() {
+  for name in "${base_exports[@]}"; do
+    if [[ -n "${!name+x}" ]]; then
+      printf -- '-x\n%s\n' "$name"
+    fi
+  done
+}
+
+run_nccl() {
+  local op="$1"
+  local bin="$2"
+  local log="$3"
+  local warmup="$4"
+  local iters="$5"
+  mapfile -t xargs < <(mpi_xargs)
+  "${mpi_base[@]}" "${xargs[@]}" \
+    "$bin" -b "$BEGIN_SIZE" -e "$END_SIZE" -g 1 -f 2 -w "$warmup" -n "$iters" \
+    >"$log" 2>&1
+  awk -v op="$op" '/Avg bus bandwidth/ {print op, $0}' "$log"
+}
+
+read_one_snapshot() {
+  local host_label="$1"
+  local out="$2"
+  : >"$out"
+  for hca in $HCAS; do
+    for c in $COUNTERS; do
+      local f="/sys/class/infiniband/$hca/ports/1/counters/$c"
+      if [[ -r "$f" ]]; then
+        printf '%s %s counters %s %s\n' "$host_label" "$hca" "$c" "$(cat "$f" 2>/dev/null || echo 0)" >>"$out"
+      fi
+    done
+    for c in $HW_COUNTERS; do
+      local f="/sys/class/infiniband/$hca/ports/1/hw_counters/$c"
+      if [[ -r "$f" ]]; then
+        printf '%s %s hw_counters %s %s\n' "$host_label" "$hca" "$c" "$(cat "$f" 2>/dev/null || echo 0)" >>"$out"
+      fi
+    done
+  done
+}
+
+read_remote_snapshot() {
+  local out="$1"
+  ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null \
+    -o BatchMode=yes -o ConnectTimeout=5 "${SSH_USER}@${PEER_HOST}" \
+    "HCAS='$HCAS' COUNTERS='$COUNTERS' HW_COUNTERS='$HW_COUNTERS' bash -s" <<'EOS' >"$out"
+for hca in $HCAS; do
+  for c in $COUNTERS; do
+    f="/sys/class/infiniband/$hca/ports/1/counters/$c"
+    if [ -r "$f" ]; then
+      printf '%s %s counters %s %s\n' "$HOSTNAME" "$hca" "$c" "$(cat "$f" 2>/dev/null || echo 0)"
+    fi
+  done
+  for c in $HW_COUNTERS; do
+    f="/sys/class/infiniband/$hca/ports/1/hw_counters/$c"
+    if [ -r "$f" ]; then
+      printf '%s %s hw_counters %s %s\n' "$HOSTNAME" "$hca" "$c" "$(cat "$f" 2>/dev/null || echo 0)"
+    fi
+  done
+done
+EOS
+}
+
+summarize_counter_delta() {
+  local before_a="$1"
+  local before_b="$2"
+  local after_a="$3"
+  local after_b="$4"
+  local out="$5"
+  python3 - "$before_a" "$before_b" "$after_a" "$after_b" >"$out" <<'PY'
+import pathlib
+import sys
+
+interesting = {
+    "port_xmit_wait", "port_xmit_discards", "port_rcv_errors",
+    "port_rcv_remote_physical_errors", "port_rcv_switch_relay_errors",
+    "port_xmit_constraint_errors", "port_rcv_constraint_errors",
+    "symbol_error", "link_error_recovery", "link_downed",
+    "local_link_integrity_errors", "excessive_buffer_overrun_errors",
+    "VL15_dropped", "roce_adp_retrans", "roce_adp_retrans_to",
+    "roce_slow_restart", "roce_slow_restart_cnps", "roce_slow_restart_trans",
+    "packet_seq_err", "out_of_sequence", "out_of_buffer",
+    "duplicate_request", "implied_nak_seq_err", "local_ack_timeout_err",
+    "req_transport_retries_exceeded", "rnr_nak_retry_err",
+}
+
+def load(path):
+    data = {}
+    for line in pathlib.Path(path).read_text().splitlines():
+        parts = line.split()
+        if len(parts) != 5:
+            continue
+        host, hca, kind, counter, value = parts
+        try:
+            data[(host, hca, kind, counter)] = int(value)
+        except ValueError:
+            pass
+    return data
+
+before = {}
+after = {}
+before.update(load(sys.argv[1]))
+before.update(load(sys.argv[2]))
+after.update(load(sys.argv[3]))
+after.update(load(sys.argv[4]))
+
+print("NONZERO_DELTAS")
+for key in sorted(set(before) | set(after)):
+    delta = after.get(key, 0) - before.get(key, 0)
+    if not delta:
+        continue
+    host, hca, kind, counter = key
+    if counter in {"port_xmit_data", "port_rcv_data"}:
+        gib = delta * 4 / (1024 ** 3)
+        print(f"{host} {hca} {kind} {counter} {delta} words4B {gib:.2f} GiB")
+    else:
+        print(f"{host} {hca} {kind} {counter} {delta}")
+
+print("ERROR_OR_CONGESTION_DELTAS")
+seen = False
+for key in sorted(set(before) | set(after)):
+    delta = after.get(key, 0) - before.get(key, 0)
+    if delta and key[3] in interesting:
+        seen = True
+        print(*key, delta)
+if not seen:
+    print("none")
+PY
+}
+
+run_counter_case() {
+  local op="$1"
+  local bin="$2"
+  local extra="${3:-}"
+  set_common_env
+  if [[ -n "$extra" ]]; then
+    eval "export $extra"
+  fi
+  local dir="$OUT_DIR/${op}_counter"
+  mkdir -p "$dir"
+  read_one_snapshot "$(hostname)" "$dir/before.local"
+  read_remote_snapshot "$dir/before.remote"
+  run_nccl "$op" "$bin" "$dir/${op}.log" "$WARMUP_ITERS" "$ITERS"
+  read_one_snapshot "$(hostname)" "$dir/after.local"
+  read_remote_snapshot "$dir/after.remote"
+  summarize_counter_delta "$dir/before.local" "$dir/before.remote" "$dir/after.local" "$dir/after.remote" "$dir/counter_delta.txt"
+  echo "$dir"
+}
+
+summarize_graph_log() {
+  local log="$1"
+  local out="$2"
+  python3 - "$log" >"$out" <<'PY'
+from pathlib import Path
+import collections
+import re
+import sys
+
+text = Path(sys.argv[1]).read_text(errors="ignore")
+print("avg_busbw", (re.findall(r"Avg bus bandwidth\s*:\s*([0-9.]+)", text) or ["NA"])[-1])
+print("nccl_version", sorted(set(re.findall(r"NCCL version ([^\s]+)", text))))
+print("plugin_missing", len(re.findall(r"Could not find: none libnccl-net-none\.so", text)))
+print("gdr_enabled_lines", len(re.findall(r"GPU Direct RDMA Enabled", text)))
+print("using_hca")
+for value, count in collections.Counter(re.findall(r"NET/IB : Using \[(.*?)\]; OOB", text)).most_common(4):
+    print(f" {count} {value}")
+print("pattern_counts")
+patterns = re.findall(
+    r"Pattern (\d+), crossNic (\d+), nChannels (\d+), bw ([0-9.]+)/([0-9.]+), type ([^,]+), sameChannels (\d+)",
+    text,
+)
+for key, count in collections.Counter(patterns).most_common():
+    print(f" {count} {key}")
+print("channel_summary")
+for value, count in collections.Counter(
+    re.findall(r"(\d+ coll channels, \d+ collnet channels, \d+ nvls channels, \d+ p2p channels, \d+ p2p channels per peer)", text)
+).most_common():
+    print(f" {count} {value}")
+print("p2p_chunks", collections.Counter(re.findall(r"P2P Chunksize set to (\d+)", text)))
+print("check_p2p", collections.Counter(re.findall(r"Check P2P Type ([^\n]+)", text)))
+for token in ["NET/IB/0/GDRDMA", "NET/IB/1/GDRDMA", "NET/IB/2/GDRDMA", "NET/IB/3/GDRDMA", "P2P/CUMEM", "P2P/IPC", "SHM"]:
+    print(token, text.count(token))
+print("channel_edge_lines", len([line for line in text.splitlines() if "Channel " in line and ("via NET/IB" in line or "via P2P" in line)]))
+PY
+}
+
+run_graph_case() {
+  local op="$1"
+  local bin="$2"
+  local extra="${3:-}"
+  set_common_env
+  export NCCL_DEBUG=INFO
+  export NCCL_DEBUG_SUBSYS=INIT,NET,GRAPH,TUNING,COLL
+  if [[ -n "$extra" ]]; then
+    eval "export $extra"
+  fi
+  local dir="$OUT_DIR/graph"
+  mkdir -p "$dir"
+  local log="$dir/${op}.log"
+  run_nccl "$op" "$bin" "$log" "$GRAPH_WARMUP_ITERS" "$GRAPH_ITERS"
+  summarize_graph_log "$log" "$dir/${op}_summary.txt"
+  echo "$dir/${op}_summary.txt"
+}
+
+run_pxn_sweep() {
+  local dir="$OUT_DIR/pxn_sweep"
+  mkdir -p "$dir"
+  local cases=(
+    "baseline|"
+    "nvls_off|NCCL_NVLS_ENABLE=0"
+    "qps4_split1|NCCL_IB_QPS_PER_CONNECTION=4 NCCL_IB_SPLIT_DATA_ON_QPS=1"
+    "qps8_split1|NCCL_IB_QPS_PER_CONNECTION=8 NCCL_IB_SPLIT_DATA_ON_QPS=1"
+    "qps4_split0|NCCL_IB_QPS_PER_CONNECTION=4 NCCL_IB_SPLIT_DATA_ON_QPS=0"
+    "channels16|NCCL_MIN_NCHANNELS=16 NCCL_MAX_NCHANNELS=16"
+    "buff8m|NCCL_BUFFSIZE=8388608"
+    "p2pchunk4m|NCCL_P2P_NET_CHUNKSIZE=4194304"
+    "netpeer8|NCCL_NCHANNELS_PER_NET_PEER=8"
+    "ar0|NCCL_IB_AR_THRESHOLD=0"
+  )
+  : >"$dir/summary.txt"
+  for item in "${cases[@]}"; do
+    local name="${item%%|*}"
+    local extra="${item#*|}"
+    local log="$dir/${name}.log"
+    # Subshell per case: overrides such as NCCL_NVLS_ENABLE=0 must not leak into later cases.
+    (
+      set_common_env
+      export NCCL_PXN_DISABLE=1
+      if [[ -n "$extra" ]]; then
+        eval "export $extra"
+      fi
+      echo "===== CASE $name ====="
+      echo "extra: ${extra:-none}"
+      run_nccl "alltoall" "$NCCL_TESTS_DIR/alltoall_perf" "$log" "$SWEEP_WARMUP_ITERS" "$SWEEP_ITERS"
+    ) | tee -a "$dir/summary.txt"
+  done
+  echo "$dir/summary.txt"
+}
+
+run_preflight() {
+  set_common_env
+  local out="$OUT_DIR/preflight.txt"
+  {
+    echo "===== LOCAL ====="
+    echo "hostname: $(hostname)"
+    echo "mpirun: $MPI_BIN"
+    if [[ -x "$MPI_BIN" ]]; then
+      "$MPI_BIN" --version 2>&1 | sed -n '1p'
+    else
+      echo "MISSING executable: $MPI_BIN"
+    fi
+    for bin in "$NCCL_TESTS_DIR/all_reduce_perf" "$NCCL_TESTS_DIR/alltoall_perf"; do
+      if [[ -x "$bin" ]]; then
+        echo "OK executable: $bin"
+      else
+        echo "MISSING executable: $bin"
+      fi
+    done
+    for hca in $HCAS; do
+      local state="/sys/class/infiniband/$hca/ports/1/state"
+      local rate="/sys/class/infiniband/$hca/ports/1/rate"
+      if [[ -r "$state" ]]; then
+        echo "OK HCA: $hca state=$(cat "$state") rate=$(cat "$rate" 2>/dev/null || echo unknown)"
+      else
+        echo "MISSING HCA path: $hca"
+      fi
+    done
+
+    echo "===== REMOTE ====="
+    ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null \
+      -o BatchMode=yes -o ConnectTimeout=5 "${SSH_USER}@${PEER_HOST}" \
+      "MPI_BIN='$MPI_BIN' NCCL_TESTS_DIR='$NCCL_TESTS_DIR' HCAS='$HCAS' bash -s" <<'EOS'
+echo "hostname: $(hostname)"
+echo "mpirun: $MPI_BIN"
+if [ -x "$MPI_BIN" ]; then
+  "$MPI_BIN" --version 2>&1 | sed -n '1p'
+else
+  echo "MISSING executable: $MPI_BIN"
+fi
+for bin in "$NCCL_TESTS_DIR/all_reduce_perf" "$NCCL_TESTS_DIR/alltoall_perf"; do
+  if [ -x "$bin" ]; then
+    echo "OK executable: $bin"
+  else
+    echo "MISSING executable: $bin"
+  fi
+done
+for hca in $HCAS; do
+  state="/sys/class/infiniband/$hca/ports/1/state"
+  rate="/sys/class/infiniband/$hca/ports/1/rate"
+  if [ -r "$state" ]; then
+    echo "OK HCA: $hca state=$(cat "$state") rate=$(cat "$rate" 2>/dev/null || echo unknown)"
+  else
+    echo "MISSING HCA path: $hca"
+  fi
+done
+EOS
+  } | tee "$out"
+  echo "$out"
+}
+
+usage() {
+  cat <