From 619a471634f3435fff56f836040a4650234969fb Mon Sep 17 00:00:00 2001 From: cs Date: Sat, 23 May 2026 17:00:03 +0800 Subject: [PATCH] Tune multinode alltoall PXN behavior --- .../multinode_nccl_nccl227_pdf_matrix.yaml | 3 ++ ...multinode_nccl_alltoall_tuning_20260523.md | 51 +++++++++++++++++++ reports_multinode_nccl_diagnosis_20260523.md | 23 +++++++-- reports_multinode_nccl_pdf_matrix_nccl227.md | 33 ++++++------ 4 files changed, 90 insertions(+), 20 deletions(-) create mode 100644 reports_multinode_nccl_alltoall_tuning_20260523.md diff --git a/configs/multinode_nccl_nccl227_pdf_matrix.yaml b/configs/multinode_nccl_nccl227_pdf_matrix.yaml index 34ce13e..00a3220 100644 --- a/configs/multinode_nccl_nccl227_pdf_matrix.yaml +++ b/configs/multinode_nccl_nccl227_pdf_matrix.yaml @@ -55,6 +55,9 @@ multinode_nccl: - nodes: 2 gpus_per_node: 8 label: 2 nodes x 8 GPUs (PDF 2 machines 16 GPUs) + op_env: + alltoall: + NCCL_PXN_DISABLE: 1 min_peak_busbw_gbps: allreduce: 491.84 alltoall: 76.54 diff --git a/reports_multinode_nccl_alltoall_tuning_20260523.md b/reports_multinode_nccl_alltoall_tuning_20260523.md new file mode 100644 index 0000000..d26630a --- /dev/null +++ b/reports_multinode_nccl_alltoall_tuning_20260523.md @@ -0,0 +1,51 @@ +# 多机 NCCL 8 卡 alltoall 网络参数 sweep + +- 日期:2026-05-23 +- 主机:`aikubeworker0012` / `172.72.8.12`,`aikubeworker0016` / `172.72.8.16` +- NCCL:临时 `2.27.7+cuda12.4` +- 测试:2 nodes x 8 GPUs,`alltoall_perf -b 16G -e 16G` +- HCA:`mlx5_0,mlx5_1,mlx5_6,mlx5_7` + +## 结论 + +`NCCL_PXN_DISABLE=1` 是本轮唯一有效正向参数,可以把 8 卡 alltoall 从约 `30.06 GB/s` 提升到约 `37.24 GB/s`。纳入正式 PDF 矩阵配置后,8 卡 alltoall 原始报告结果为 `36.70 GB/s peak` / `36.74 GB/s avg`。 + +这个提升有实际价值,但仍远低于 PDF 参考 `76.54 GB/s`。其他参数没有改善,部分明显变差: + +| Case | Avg Bus BW | 结论 | +|------|------------|------| +| baseline | `30.0633 GB/s` | 基线 | +| `NCCL_PXN_DISABLE=1` | `37.2421 GB/s` | 有效提升 | +| `NCCL_P2P_PXN_LEVEL=0` | `20.1205 GB/s` | 明显变差 | +| `NCCL_P2P_PXN_LEVEL=1` | `30.0588 GB/s` | 无改善 | +| `NCCL_P2P_PXN_LEVEL=2` | `30.0437 GB/s` | 无改善 | +| `NCCL_NET_SHARED_COMMS=0` | `27.3889 GB/s` | 变差 | +| `NCCL_NET_SHARED_BUFFERS=0` | `28.2389 GB/s` | 变差 | +| `NCCL_NET_SHARED_COMMS=0 NCCL_NET_SHARED_BUFFERS=0` | `28.2279 GB/s` | 变差 | +| `NCCL_NCHANNELS_PER_NET_PEER=2` | `30.0281 GB/s` | 无改善 | +| `NCCL_NCHANNELS_PER_NET_PEER=4` | `29.9802 GB/s` | 无改善 | +| `NCCL_IB_ADAPTIVE_ROUTING=1 NCCL_IB_AR_THRESHOLD=0` | `30.0526 GB/s` | 无改善 | +| `NCCL_IB_ADAPTIVE_ROUTING=0` | `30.0535 GB/s` | 无改善 | +| `NCCL_IB_PCI_RELAXED_ORDERING=0` | 未完成 | 明显异常,不建议 | + +## 正式配置更新 + +`configs/multinode_nccl_nccl227_pdf_matrix.yaml` 已对 2 nodes x 8 GPUs 的 alltoall 增加: + +```yaml +op_env: + alltoall: + NCCL_PXN_DISABLE: 1 +``` + +正式矩阵报告:`reports_multinode_nccl_pdf_matrix_nccl227.md` + +| Topology | alltoall Peak Bus BW | alltoall Avg Bus BW | PDF Reference | Status | +|----------|----------------------|---------------------|---------------|--------| +| 2 nodes x 8 GPUs | `36.70 GB/s` | `36.74 GB/s` | `76.54 GB/s` | FAIL | + +## 判断 + +1. PXN 在当前拓扑下对 8 卡 alltoall 有负面影响,禁用后有约 `22-24%` 提升。 +2. 禁用 PXN 后仍只有 PDF 目标的一半左右,剩余差距不是单一 NCCL 环境变量可以补齐。 +3. 后续重点仍应放在 NCCL net plugin/SHARP、交换网络策略、路由/拥塞和 alltoall rail 分布。 diff --git a/reports_multinode_nccl_diagnosis_20260523.md b/reports_multinode_nccl_diagnosis_20260523.md index 8253caf..732a6ac 100644 --- a/reports_multinode_nccl_diagnosis_20260523.md +++ b/reports_multinode_nccl_diagnosis_20260523.md @@ -16,6 +16,8 @@ 按 `sx算力节点跨Leaf NCCL测试报告.pdf` 的矩阵继续对齐后,发现 2 机 4 卡档位的核心问题是默认 GPU 选择不符合 GPU-NIC 亲和性。显式选择 `CUDA_VISIBLE_DEVICES=0,1,4,5` 后,2 机 4 卡 allreduce 可以恢复到 `333-335 GB/s` 区间,接近 PDF 的 `335.48 GB/s`;alltoall 配合 PDF 固定 NCCL 参数可到 `72.93 GB/s`,接近 PDF 的 `73.73 GB/s`。但 2 机 8 卡档位仍只有 allreduce `354.02 GB/s`、alltoall `30.04 GB/s`,与 PDF 的 `491.84/76.54 GB/s` 差距明显。 +进一步 sweep 8 卡 alltoall 网络参数后,`NCCL_PXN_DISABLE=1` 是唯一有效正向项。正式矩阵配置已对 2 机 8 GPU 的 alltoall 单独加入该变量,8 卡 alltoall 从约 `30.04 GB/s` 提升到 `36.70 GB/s` peak / `36.74 GB/s` avg,但仍低于 PDF 参考 `76.54 GB/s`。 + 同时,`nccl-gpu-2` 的 SSH 入口曾因未认证连接过多触发 `MaxStartups` 随机拒绝,导致 `mpirun` 拉起远端 rank 失败。已经做了临时 SSHD 缓解并拿到有效的 2 节点 x 8 GPU allreduce/alltoall 报告。 ## 已完成的修正 @@ -33,6 +35,7 @@ 11. 将 multi-node NCCL 配置中的 `qps_per_connection`、`min_nchannels`、`split_data_on_qps` 改为 `null`,避免默认导出会压低大包 allreduce 的固定 NCCL 参数。 12. 增加 topology 级 `cuda_visible_devices`、`env`、`op_env` 配置能力,支持按 GPU/NIC 亲和性和不同 NCCL op 分别设置环境变量。 13. 生成 PDF 矩阵式原始报告 `reports_multinode_nccl_pdf_matrix_nccl227.md`,覆盖 2 机 1/2/4/8 GPU per node。 +14. 对 8 卡 alltoall 做 NCCL 网络参数 sweep,并将有效项 `NCCL_PXN_DISABLE=1` 固化到 PDF 矩阵配置。 ## 关键证据 @@ -265,13 +268,23 @@ NET/IB : GPU Direct RDMA Disabled for HCA 0 'mlx5_0' | Topology | allreduce | PDF Reference | Status | alltoall | PDF Reference | Status | |----------|-----------|---------------|--------|----------|---------------|--------| -| 2 nodes x 1 GPU | `47.23 GB/s` | `48.90 GB/s` | FAIL | `24.84 GB/s` | `27.25 GB/s` | FAIL | -| 2 nodes x 2 GPUs | `136.97 GB/s` | `136.93 GB/s` | PASS | `47.67 GB/s` | `54.41 GB/s` | FAIL | -| 2 nodes x 4 GPUs | `333.22 GB/s` | `335.48 GB/s` | FAIL | `72.93 GB/s` | `73.73 GB/s` | FAIL | -| 2 nodes x 8 GPUs | `354.02 GB/s` | `491.84 GB/s` | FAIL | `30.04 GB/s` | `76.54 GB/s` | FAIL | +| 2 nodes x 1 GPU | `47.26 GB/s` | `48.90 GB/s` | FAIL | `24.87 GB/s` | `27.25 GB/s` | FAIL | +| 2 nodes x 2 GPUs | `136.36 GB/s` | `136.93 GB/s` | FAIL | `47.69 GB/s` | `54.41 GB/s` | FAIL | +| 2 nodes x 4 GPUs | `333.23 GB/s` | `335.48 GB/s` | FAIL | `72.82 GB/s` | `73.73 GB/s` | FAIL | +| 2 nodes x 8 GPUs | `353.47 GB/s` | `491.84 GB/s` | FAIL | `36.70 GB/s` | `76.54 GB/s` | FAIL | 解释:2 机 4 卡档位已经基本定位并修复到接近 PDF;2 机 8 卡档位不是简单 GPU 顺序问题。尝试调整 8 卡 `CUDA_VISIBLE_DEVICES` 顺序、加入 100G/25G active HCA、以及套 PDF 固定参数都没有改善;固定参数反而会把 8 卡 allreduce 从约 `354 GB/s` 压到约 `239 GB/s`。 +8 卡 alltoall 目前的最佳软件侧改动是 `NCCL_PXN_DISABLE=1`: + +| Case | 8 卡 alltoall Avg Bus BW | +|------|--------------------------| +| baseline | `30.06 GB/s` | +| `NCCL_PXN_DISABLE=1` | `37.24 GB/s` | +| 正式矩阵报告 | `36.74 GB/s` | + +其他变量如 `NCCL_P2P_PXN_LEVEL`、`NCCL_NET_SHARED_COMMS`、`NCCL_NET_SHARED_BUFFERS`、`NCCL_NCHANNELS_PER_NET_PEER`、`NCCL_IB_ADAPTIVE_ROUTING` 均无改善或变差。 + ### 8. 8 卡链路计数器与物理上限判断 计数器探测报告:`reports_multinode_nccl_counter_probe_20260523.md` @@ -369,6 +382,7 @@ libnccl-dev - PDF 8 卡 allreduce `491.84 GB/s busbw` 反推需要约 `262 GB/s algbw`,超过当前 4 x 400G 的物理单向总带宽 - 8 卡 alltoall 端口计数器显示 rail 分布不均,且 HCA 顺序 sweep 无改善 - 当前环境缺失 NCCL net plugin/SHARP,NCCL 只能使用 internal IB plugin +- `NCCL_PXN_DISABLE=1` 可将 8 卡 alltoall 提升到约 `36.7 GB/s`,但仍不到 PDF 参考值的一半 ### 阻塞 3:`nccl-gpu-2` SSH 存在外部连接压力 @@ -408,4 +422,5 @@ libnccl-dev - `reports_multinode_nccl_16g_2x8_nccl227_auto.md`:NCCL 2.27.7 16G 自动 channel/QP 原始报告 - `reports_multinode_nccl_pdf_matrix_nccl227.md`:NCCL 2.27.7 PDF 矩阵式原始报告 - `reports_multinode_nccl_counter_probe_20260523.md`:8 卡链路计数器与 HCA 顺序 sweep 报告 +- `reports_multinode_nccl_alltoall_tuning_20260523.md`:8 卡 alltoall NCCL 网络参数 sweep 报告 - `reports_multinode_nccl_diagnosis_20260523.md`:本中文诊断总结 diff --git a/reports_multinode_nccl_pdf_matrix_nccl227.md b/reports_multinode_nccl_pdf_matrix_nccl227.md index a18fb0d..c04d023 100644 --- a/reports_multinode_nccl_pdf_matrix_nccl227.md +++ b/reports_multinode_nccl_pdf_matrix_nccl227.md @@ -1,6 +1,6 @@ # GPU Test Report -- **Date:** 2026-05-23T08:32:58.113416 +- **Date:** 2026-05-23T08:58:19.911230 - **Host:** aikubeworker0012 ## Overall Acceptance Verdict @@ -36,10 +36,10 @@ Source: nccl-tests-mpirun | Mode: cross-leaf-pdf-matrix-nccl-2.27.7 | Topology | CUDA Visible Devices | Peak Bus BW | Peak Size | Avg Bus BW | Threshold | Status | |----------|----------------------|-------------|-----------|------------|-----------|--------| -| 2 nodes x 1 GPU (PDF 2 machines 2 GPUs) | - | 47.23 GB/s | 16G | 47.24 GB/s | >= 49 GB/s | FAIL | -| 2 nodes x 2 GPUs (PDF 2 machines 4 GPUs) | - | 136.97 GB/s | 16G | 137.17 GB/s | >= 137 GB/s | PASS | -| 2 nodes x 4 GPUs (PDF 2 machines 8 GPUs) | 0,1,4,5 | 333.22 GB/s | 16G | 333.24 GB/s | >= 335 GB/s | FAIL | -| 2 nodes x 8 GPUs (PDF 2 machines 16 GPUs) | - | 354.02 GB/s | 16G | 353.92 GB/s | >= 492 GB/s | FAIL | +| 2 nodes x 1 GPU (PDF 2 machines 2 GPUs) | - | 47.26 GB/s | 16G | 47.19 GB/s | >= 49 GB/s | FAIL | +| 2 nodes x 2 GPUs (PDF 2 machines 4 GPUs) | - | 136.36 GB/s | 16G | 136.69 GB/s | >= 137 GB/s | FAIL | +| 2 nodes x 4 GPUs (PDF 2 machines 8 GPUs) | 0,1,4,5 | 333.23 GB/s | 16G | 333.45 GB/s | >= 335 GB/s | FAIL | +| 2 nodes x 8 GPUs (PDF 2 machines 16 GPUs) | - | 353.47 GB/s | 16G | 353.86 GB/s | >= 492 GB/s | FAIL | | Topology | NCCL Network | GPU Direct RDMA | GDR Enabled HCAs | GDR Disabled HCAs | |----------|--------------|-----------------|------------------|-------------------| @@ -50,18 +50,19 @@ Source: nccl-tests-mpirun | Mode: cross-leaf-pdf-matrix-nccl-2.27.7 | Topology | Return Code | Error / Output Tail | |----------|-------------|---------------------| -| 2 nodes x 1 GPU (PDF 2 machines 2 GPUs) | 0 | E aikubeworker0012:2157248:2157325 [0] NCCL INFO comm 0x5595f28bf420 rank 0 nranks 2 cudaDev 0 busId 18000 - Destroy COMPLETE # Out of bounds values : 0 OK # Avg bus bandwidth : 47.2399 # # Collective test concluded: all_reduce_perf # | -| 2 nodes x 4 GPUs (PDF 2 machines 8 GPUs) | 0 | ker0012:2157429:2157526 [3] NCCL INFO comm 0x55a8a0147090 rank 3 nranks 8 cudaDev 3 busId ab000 - Destroy COMPLETE aikubeworker0012:2157427:2157524 [1] NCCL INFO comm 0x55b1b0f86630 rank 1 nranks 8 cudaDev 1 busId 2a000 - Destroy COMPLETE | -| 2 nodes x 8 GPUs (PDF 2 machines 16 GPUs) | 0 | aikubeworker0016:1138578:1139592 [0] NCCL INFO comm 0x556eff26c190 rank 8 nranks 16 cudaDev 0 busId 18000 - Destroy COMPLETE # Out of bounds values : 0 OK # Avg bus bandwidth : 353.915 # # Collective test concluded: all_reduce_perf # | +| 2 nodes x 1 GPU (PDF 2 machines 2 GPUs) | 0 | TE aikubeworker0012:2165982:2166060 [0] NCCL INFO comm 0x55d452f2df80 rank 0 nranks 2 cudaDev 0 busId 18000 - Destroy COMPLETE # Out of bounds values : 0 OK # Avg bus bandwidth : 47.189 # # Collective test concluded: all_reduce_perf # | +| 2 nodes x 2 GPUs (PDF 2 machines 4 GPUs) | 0 | ker0016:1221425:1222411 [0] NCCL INFO comm 0x56437384f040 rank 2 nranks 4 cudaDev 0 busId 18000 - Destroy COMPLETE aikubeworker0016:1221427:1222412 [1] NCCL INFO comm 0x55ab9313f950 rank 3 nranks 4 cudaDev 1 busId 2a000 - Destroy COMPLETE | +| 2 nodes x 4 GPUs (PDF 2 machines 8 GPUs) | 0 | E aikubeworker0012:2166160:2166257 [0] NCCL INFO comm 0x557243829d50 rank 0 nranks 8 cudaDev 0 busId 18000 - Destroy COMPLETE # Out of bounds values : 0 OK # Avg bus bandwidth : 333.449 # # Collective test concluded: all_reduce_perf # | +| 2 nodes x 8 GPUs (PDF 2 machines 16 GPUs) | 0 | r0012:2166272:2166442 [5] NCCL INFO comm 0x55721e270960 rank 5 nranks 16 cudaDev 5 busId ab000 - Destroy COMPLETE aikubeworker0012:2166268:2166447 [1] NCCL INFO comm 0x5644fafd24e0 rank 1 nranks 16 cudaDev 1 busId 2a000 - Destroy COMPLETE | ### Multi-node NCCL alltoall | Topology | CUDA Visible Devices | Peak Bus BW | Peak Size | Avg Bus BW | Threshold | Status | |----------|----------------------|-------------|-----------|------------|-----------|--------| -| 2 nodes x 1 GPU (PDF 2 machines 2 GPUs) | - | 24.84 GB/s | 16G | 24.89 GB/s | >= 27 GB/s | FAIL | -| 2 nodes x 2 GPUs (PDF 2 machines 4 GPUs) | - | 47.67 GB/s | 16G | 47.91 GB/s | >= 54 GB/s | FAIL | -| 2 nodes x 4 GPUs (PDF 2 machines 8 GPUs) | 0,1,4,5 | 72.93 GB/s | 16G | 72.97 GB/s | >= 74 GB/s | FAIL | -| 2 nodes x 8 GPUs (PDF 2 machines 16 GPUs) | - | 30.04 GB/s | 16G | 30.04 GB/s | >= 77 GB/s | FAIL | +| 2 nodes x 1 GPU (PDF 2 machines 2 GPUs) | - | 24.87 GB/s | 16G | 24.93 GB/s | >= 27 GB/s | FAIL | +| 2 nodes x 2 GPUs (PDF 2 machines 4 GPUs) | - | 47.69 GB/s | 16G | 47.93 GB/s | >= 54 GB/s | FAIL | +| 2 nodes x 4 GPUs (PDF 2 machines 8 GPUs) | 0,1,4,5 | 72.82 GB/s | 16G | 72.87 GB/s | >= 74 GB/s | FAIL | +| 2 nodes x 8 GPUs (PDF 2 machines 16 GPUs) | - | 36.70 GB/s | 16G | 36.74 GB/s | >= 77 GB/s | FAIL | | Topology | NCCL Network | GPU Direct RDMA | GDR Enabled HCAs | GDR Disabled HCAs | |----------|--------------|-----------------|------------------|-------------------| @@ -72,10 +73,10 @@ Source: nccl-tests-mpirun | Mode: cross-leaf-pdf-matrix-nccl-2.27.7 | Topology | Return Code | Error / Output Tail | |----------|-------------|---------------------| -| 2 nodes x 1 GPU (PDF 2 machines 2 GPUs) | 0 | ETE aikubeworker0012:2157727:2157802 [0] NCCL INFO comm 0x55a0349b02b0 rank 0 nranks 2 cudaDev 0 busId 18000 - Destroy COMPLETE # Out of bounds values : 0 OK # Avg bus bandwidth : 24.8897 # # Collective test concluded: alltoall_perf # | -| 2 nodes x 2 GPUs (PDF 2 machines 4 GPUs) | 0 | ETE aikubeworker0016:1141290:1142410 [0] NCCL INFO comm 0x55fabbea6410 rank 2 nranks 4 cudaDev 0 busId 18000 - Destroy COMPLETE # Out of bounds values : 0 OK # Avg bus bandwidth : 47.9094 # # Collective test concluded: alltoall_perf # | -| 2 nodes x 4 GPUs (PDF 2 machines 8 GPUs) | 0 | ETE aikubeworker0012:2158071:2158172 [0] NCCL INFO comm 0x563312baa7f0 rank 0 nranks 8 cudaDev 0 busId 18000 - Destroy COMPLETE # Out of bounds values : 0 OK # Avg bus bandwidth : 72.9657 # # Collective test concluded: alltoall_perf # | -| 2 nodes x 8 GPUs (PDF 2 machines 16 GPUs) | 0 | 016:1143717:1145948 [7] NCCL INFO comm 0x5558cc9de640 rank 15 nranks 16 cudaDev 7 busId db000 - Destroy COMPLETE aikubeworker0016:1143713:1145946 [3] NCCL INFO comm 0x55c1af080e60 rank 11 nranks 16 cudaDev 3 busId 5d000 - Destroy COMPLETE | +| 2 nodes x 1 GPU (PDF 2 machines 2 GPUs) | 0 | ETE aikubeworker0012:2166458:2166534 [0] NCCL INFO comm 0x5603baefb150 rank 0 nranks 2 cudaDev 0 busId 18000 - Destroy COMPLETE # Out of bounds values : 0 OK # Avg bus bandwidth : 24.9304 # # Collective test concluded: alltoall_perf # | +| 2 nodes x 2 GPUs (PDF 2 machines 4 GPUs) | 0 | ETE aikubeworker0012:2166543:2166743 [0] NCCL INFO comm 0x5569d31d4f50 rank 0 nranks 4 cudaDev 0 busId 18000 - Destroy COMPLETE # Out of bounds values : 0 OK # Avg bus bandwidth : 47.9258 # # Collective test concluded: alltoall_perf # | +| 2 nodes x 4 GPUs (PDF 2 machines 8 GPUs) | 0 | ker0016:1227342:1228382 [1] NCCL INFO comm 0x55cdec231780 rank 5 nranks 8 cudaDev 1 busId 2a000 - Destroy COMPLETE aikubeworker0016:1227344:1228381 [3] NCCL INFO comm 0x563c7ed39680 rank 7 nranks 8 cudaDev 3 busId ab000 - Destroy COMPLETE | +| 2 nodes x 8 GPUs (PDF 2 machines 16 GPUs) | 0 | TE aikubeworker0012:2166925:2167127 [7] NCCL INFO comm 0x560553b91250 rank 7 nranks 16 cudaDev 7 busId db000 - Destroy COMPLETE # Out of bounds values : 0 OK # Avg bus bandwidth : 36.7382 # # Collective test concluded: alltoall_perf # | **Overall: FAIL**