cs 71ac97a24e Compare NCCL allreduce alltoall counters

2026-05-23 17:17:22 +08:00

Multi-node NCCL 8-GPU link counter probe

Date: 2026-05-23
Hosts: aikubeworker0012 / 172.72.8.12, aikubeworker0016 / 172.72.8.16
NCCL: 2.27.7+cuda12.4 (temporary)
HCA: mlx5_0,mlx5_1,mlx5_6,mlx5_7
HCA rate: 4 x 400 Gb/s NDR per node, about 200 GB/s theoretical one-way aggregate

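Conclusions

8-GPU allreduce (2 nodes x 8 GPUs) already reaches an NCCL algbw of about 189 GB/s, close to the 200 GB/s theoretical one-way aggregate of the four 400G rails currently available per node. The PDF reference of 491.84 GB/s busbw corresponds to 262 GB/s algbw, which is therefore unlikely to be reachable with the current 4 x 400G rail configuration unless the usable cross-node rail count or network capacity is actually higher than the four 400G rails the nodes expose.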

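A concurrent raw RDMA perftest run also confirms that the four 400G rails can work simultaneously: ib_write_bw running concurrently on the 4 HCAs totals 1476.95 Gb/s, i.e. 184.62 GB/s. This is consistent with the 189 GB/s algbw derived from the 8-GPU NCCL allreduce, indicating that allreduce is already close to the usable raw network bandwidth.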

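8-GPU alltoall still reaches only 30 GB/s busbw, and HCA order is not the cause: the HCA order sweep stays at 30.02-30.07 GB/s throughout. Counters show that alltoall traffic is concentrated on mlx5_0 and mlx5_6, while mlx5_1 and mlx5_7 each carry only about one third as much. This suggests the remaining problem lies in NCCL alltoall rail distribution, routing, congestion, the NCCL net plugin/SHARP, or network-side policy.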

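A follow-up test shows that NCCL_PXN_DISABLE=1 spreads alltoall traffic evenly across the four HCAs and raises busbw to about 36.5-37.0 GB/s. Even so, each 400G rail carries only about 19-20 GB/s, below the raw RDMA single-rail capability.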

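Further captures of counters/hw_counters showed no growth in error counters such as discards, CRC/symbol errors, packet sequence errors, RoCE retransmissions, or slow restarts; only port_xmit_wait grew on some ports. A comparison run shows that allreduce at 354 GB/s busbw produces the same kind of port_xmit_wait growth, so port_xmit_wait is not a sufficient explanation for the low alltoall throughput; it only indicates that the sending side had to wait. The remaining problem more likely lies in the NCCL internal alltoall communication pattern, switch fabric scheduling/congestion control, or the missing NCCL net plugin/SHARP capability.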

Raw RDMA, 4 rails concurrent

Command template:

ib_write_bw -d <mlx5_X> -i 1 -p <port> -s 4194304 -n 5000 -F --report_gbits

Results:

HCA	BW average
`mlx5_0`	`387.16 Gb/s`
`mlx5_1`	`387.07 Gb/s`
`mlx5_6`	`355.02 Gb/s`
`mlx5_7`	`347.70 Gb/s`
Total	`1476.95 Gb/s` / `184.62 GB/s`
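
The total row can be re-derived from the per-HCA averages. The following Python sketch sums the four rates from the table, converts Gb/s to GB/s by dividing by 8, and compares the result with the nominal 4 x 400 Gb/s line rate. Variable names are illustrative only and not part of the original test tooling.

```python
# Per-HCA ib_write_bw averages from the table above, in Gb/s.
rates_gbit = {
    "mlx5_0": 387.16,
    "mlx5_1": 387.07,
    "mlx5_6": 355.02,
    "mlx5_7": 347.70,
}

LINE_RATE_GBIT = 400.0  # nominal NDR rate per rail

total_gbit = sum(rates_gbit.values())
total_gbyte = total_gbit / 8  # 8 bits per byte
line_total_gbyte = LINE_RATE_GBIT * len(rates_gbit) / 8

print(f"total: {total_gbit:.2f} Gb/s = {total_gbyte:.2f} GB/s")
print(f"share of nominal line rate: {total_gbyte / line_total_gbyte:.1%}")
```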

8-GPU allreduce

NCCL output:

Metric	Value
`algbw`	`189.16 / 189.07 GB/s`
`busbw`	`354.68 / 354.52 GB/s`
`Avg bus bandwidth`	`354.597 GB/s`

For allreduce, busbw relates to algbw approximately as:

busbw = algbw * 2 * (nranks - 1) / nranks
      = algbw * 1.875  # nranks=16

Therefore:

Item	busbw	Derived algbw
Current test	`354.60 GB/s`	`189.12 GB/s`
PDF reference	`491.84 GB/s`	`262.31 GB/s`

The current 189.12 GB/s algbw is already close to the theoretical one-way aggregate of 4 x 400 Gb/s = 1600 Gb/s = 200 GB/s.
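
The conversion in the table can be checked with a few lines of Python. This is a minimal sketch of the allreduce relation given above for nranks=16; the function names are made up for illustration.

```python
def allreduce_busbw_factor(nranks: int) -> float:
    """busbw / algbw ratio for allreduce: 2 * (nranks - 1) / nranks."""
    return 2 * (nranks - 1) / nranks


def algbw_from_busbw(busbw: float, nranks: int) -> float:
    """Invert the relation to recover algbw from a reported busbw."""
    return busbw / allreduce_busbw_factor(nranks)


NRANKS = 16  # 2 nodes x 8 GPUs

for label, busbw in [("current test", 354.60), ("PDF reference", 491.84)]:
    algbw = algbw_from_busbw(busbw, NRANKS)
    print(f"{label}: busbw {busbw:.2f} GB/s -> algbw {algbw:.2f} GB/s")
```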

allreduce counter comparison

Counters were re-measured for a 16G allreduce with the same 2 nodes x 8 GPUs and the same 4 HCAs:

Metric	Value
`algbw`	`189.22 / 188.77 GB/s`
`busbw`	`354.79 / 353.94 GB/s`
`Avg bus bandwidth`	`354.366 GB/s`

Traffic distribution:

Host	HCA	Xmit GiB	Recv GiB
aikubeworker0012	`mlx5_0`	`178.07`	`178.03`
aikubeworker0012	`mlx5_1`	`178.07`	`178.07`
aikubeworker0012	`mlx5_6`	`178.07`	`178.03`
aikubeworker0012	`mlx5_7`	`178.07`	`178.07`
aikubeworker0016	`mlx5_0`	`178.03`	`178.07`
aikubeworker0016	`mlx5_1`	`178.07`	`178.07`
aikubeworker0016	`mlx5_6`	`178.03`	`178.07`
aikubeworker0016	`mlx5_7`	`178.07`	`178.07`
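
Per-HCA traffic figures like these can be obtained by snapshotting the port data counters before and after a run. The Python sketch below assumes the counters are read from the standard Linux sysfs location and that `port_xmit_data` / `port_rcv_data` count 4-byte units, as the kernel ABI documentation describes. The helper names are hypothetical, and this is not necessarily how the numbers in this report were collected.

```python
from pathlib import Path

# Assumed sysfs layout for RDMA port counters on Linux.
SYSFS_ROOT = Path("/sys/class/infiniband")


def read_port_counter(hca: str, name: str, port: int = 1,
                      root: Path = SYSFS_ROOT) -> int:
    """Read one counter, e.g. port_xmit_data, for the given HCA port."""
    path = root / hca / "ports" / str(port) / "counters" / name
    return int(path.read_text().strip())


def data_delta_gib(before: int, after: int) -> float:
    """Convert a port_xmit_data / port_rcv_data delta to GiB.

    These counters count 4-byte units, so multiply by 4 to get bytes.
    """
    return (after - before) * 4 / 2**30
```

A typical use would be to call `read_port_counter("mlx5_0", "port_xmit_data")` on each host immediately before and after the NCCL test, then pass both values to `data_delta_gib`.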

Error counter deltas are likewise 0. The nonzero wait counters are:

Host	HCA	`port_xmit_wait` delta
aikubeworker0012	`mlx5_1`	`6,555,518`
aikubeworker0012	`mlx5_7`	`6,325,059`
aikubeworker0016	`mlx5_1`	`6,585,965`
aikubeworker0016	`mlx5_7`	`6,112,874`

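Assessment: allreduce also shows port_xmit_wait growth while running near the physical limit of the current 4 x 400G rails, so this counter alone cannot explain why alltoall reaches only 36-37 GB/s. The alltoall problem points to communication pattern efficiency or network scheduling policy rather than simple link errors.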

8-GPU alltoall

NCCL output:

Metric	Value
`algbw`	`32.04 / 32.05 GB/s`
`busbw`	`30.03 / 30.04 GB/s`
`Avg bus bandwidth`	`30.0389 GB/s`
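
The algbw and busbw rows are consistent with the alltoall conversion used by nccl-tests, which to my understanding is busbw = algbw * (nranks - 1) / nranks. The short Python check below assumes that convention; the function name is illustrative.

```python
def alltoall_busbw(algbw: float, nranks: int) -> float:
    """Assumed nccl-tests convention for alltoall: busbw = algbw * (n - 1) / n."""
    return algbw * (nranks - 1) / nranks


NRANKS = 16  # 2 nodes x 8 GPUs

for algbw in (32.04, 32.05):
    print(f"algbw {algbw:.2f} GB/s -> busbw {alltoall_busbw(algbw, NRANKS):.2f} GB/s")
```

Small differences in the last digit against the table are expected because the reported algbw values are already rounded.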

Within the same test window, port counter deltas show unbalanced traffic:

Host	HCA	Xmit GB	Recv GB
172.72.8.12	`mlx5_0`	`885.54`	`885.51`
172.72.8.12	`mlx5_1`	`295.19`	`295.19`
172.72.8.12	`mlx5_6`	`885.53`	`885.51`
172.72.8.12	`mlx5_7`	`295.19`	`295.19`
172.72.8.16	`mlx5_0`	`885.51`	`885.54`
172.72.8.16	`mlx5_1`	`295.19`	`295.19`
172.72.8.16	`mlx5_6`	`885.51`	`885.53`
172.72.8.16	`mlx5_7`	`295.19`	`295.19`
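
To quantify the imbalance, the transmit figures for one host can be turned into per-HCA shares. The Python sketch below uses the 172.72.8.12 rows from the table; it is only an arithmetic illustration, not part of the original measurement tooling.

```python
# Xmit GB per HCA on 172.72.8.12, copied from the table above.
xmit_gb = {
    "mlx5_0": 885.54,
    "mlx5_1": 295.19,
    "mlx5_6": 885.53,
    "mlx5_7": 295.19,
}

total = sum(xmit_gb.values())
shares = {hca: value / total for hca, value in xmit_gb.items()}

for hca, share in shares.items():
    print(f"{hca}: {share:.1%} of node transmit traffic")

heavy_to_light = xmit_gb["mlx5_0"] / xmit_gb["mlx5_1"]
print(f"mlx5_0 carries {heavy_to_light:.2f}x the traffic of mlx5_1")
```

With these inputs the split is close to 3:1 between the heavily loaded rails (mlx5_0, mlx5_6) and the lightly loaded ones (mlx5_1, mlx5_7).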

HCA order sweep

8-GPU alltoall is insensitive to HCA order:

`NCCL_IB_HCA`	Avg Bus BW
`mlx5_0,mlx5_1,mlx5_6,mlx5_7`	`30.0367 GB/s`
`mlx5_0,mlx5_6,mlx5_1,mlx5_7`	`30.0696 GB/s`
`mlx5_0,mlx5_7,mlx5_1,mlx5_6`	`30.0397 GB/s`
`mlx5_1,mlx5_0,mlx5_7,mlx5_6`	`30.0413 GB/s`
`mlx5_6,mlx5_7,mlx5_0,mlx5_1`	`30.0230 GB/s`
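
How insensitive the result is can be expressed as the spread of the sweep relative to its mean. The following Python sketch uses the five values from the table; the choice of spread over mean as the metric is mine, for illustration.

```python
# Avg bus bandwidth (GB/s) for each NCCL_IB_HCA ordering in the table above.
sweep = {
    "mlx5_0,mlx5_1,mlx5_6,mlx5_7": 30.0367,
    "mlx5_0,mlx5_6,mlx5_1,mlx5_7": 30.0696,
    "mlx5_0,mlx5_7,mlx5_1,mlx5_6": 30.0397,
    "mlx5_1,mlx5_0,mlx5_7,mlx5_6": 30.0413,
    "mlx5_6,mlx5_7,mlx5_0,mlx5_1": 30.0230,
}

values = list(sweep.values())
mean = sum(values) / len(values)
spread = max(values) - min(values)

print(f"mean {mean:.4f} GB/s, spread {spread:.4f} GB/s ({spread / mean:.2%} of mean)")
```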

PXN disabled alltoall counters

With NCCL_PXN_DISABLE=1:

Metric	Value
`Avg bus bandwidth`	`36.9518 GB/s`
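Traffic per HCA	about `590.94-590.98 GB`
Throughput per HCA	about `19.82 GB/s`
Aggregate throughput of 4 HCAs per node	about `79.29 GB/s`

Assessment: disabling PXN fixes the uneven rail distribution, but it does not let alltoall saturate the current four 400G rails.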
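
To put the per-HCA figure in context, it can be compared with both the nominal 400 Gb/s rail rate and the raw ib_write_bw result for each rail. The Python sketch below does this arithmetic with numbers taken from this report; variable names are illustrative.

```python
ALLTOALL_PER_HCA_GBYTE = 19.82  # per-HCA throughput with NCCL_PXN_DISABLE=1
NOMINAL_RAIL_GBYTE = 400 / 8    # 400 Gb/s -> 50 GB/s

# Raw ib_write_bw averages per rail, in Gb/s, from the perftest section.
raw_rail_gbit = {
    "mlx5_0": 387.16,
    "mlx5_1": 387.07,
    "mlx5_6": 355.02,
    "mlx5_7": 347.70,
}

nominal_util = ALLTOALL_PER_HCA_GBYTE / NOMINAL_RAIL_GBYTE
raw_util = {
    hca: ALLTOALL_PER_HCA_GBYTE / (gbit / 8)
    for hca, gbit in raw_rail_gbit.items()
}

print(f"vs nominal line rate: {nominal_util:.1%}")
for hca, util in raw_util.items():
    print(f"vs raw RDMA on {hca}: {util:.1%}")
```

On these numbers each rail runs at well under half of what raw RDMA achieved on the same rail.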

PXN disabled error/congestion counter retest

The retest again uses 2 nodes x 8 GPUs with alltoall_perf -b 16G -e 16G -w 10 -n 10, and sets:

NCCL_PXN_DISABLE=1
NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_6,mlx5_7
NCCL_NET_PLUGIN=none
NCCL_NET_GDR_LEVEL=5
NCCL_NET_GDR_READ=1
NCCL_DMABUF_ENABLE=0

NCCL output:

Metric	Value
`algbw`	`39.04 / 38.72 GB/s`
`busbw`	`36.60 / 36.30 GB/s`
`Avg bus bandwidth`	`36.4512 GB/s`

Traffic distribution remains balanced:

Host	HCA	Xmit GiB	Recv GiB
aikubeworker0012	`mlx5_0`	`712.28`	`712.19`
aikubeworker0012	`mlx5_1`	`712.27`	`712.27`
aikubeworker0012	`mlx5_6`	`712.28`	`712.18`
aikubeworker0012	`mlx5_7`	`712.27`	`712.27`
aikubeworker0016	`mlx5_0`	`712.23`	`712.27`
aikubeworker0016	`mlx5_1`	`712.23`	`712.27`
aikubeworker0016	`mlx5_6`	`712.23`	`712.27`
aikubeworker0016	`mlx5_7`	`712.23`	`712.27`

Error counter deltas:

Counter group	Result
`port_xmit_discards`, `port_rcv_errors`, `port_rcv_remote_physical_errors`, `port_rcv_switch_relay_errors`	`0`
`symbol_error`, `link_error_recovery`, `link_downed`, `local_link_integrity_errors`, `excessive_buffer_overrun_errors`	`0`
`roce_adp_retrans`, `roce_adp_retrans_to`, `roce_slow_restart*`	`0`
`packet_seq_err`, `out_of_sequence`, `out_of_buffer`, `duplicate_request`, `implied_nak_seq_err`	`0`
`local_ack_timeout_err`, `req_transport_retries_exceeded`, `rnr_nak_retry_err`	`0`
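
A check like the one summarized in this table can be sketched as a filter over two counter snapshots. The Python example below assumes each snapshot is a dict mapping counter name to integer value (however it was collected) and reports only error counters whose value changed. The function name and the sample snapshot values are hypothetical.

```python
from fnmatch import fnmatchcase

# Error counter names from the table above; "*" matches any suffix.
ERROR_PATTERNS = [
    "port_xmit_discards", "port_rcv_errors",
    "port_rcv_remote_physical_errors", "port_rcv_switch_relay_errors",
    "symbol_error", "link_error_recovery", "link_downed",
    "local_link_integrity_errors", "excessive_buffer_overrun_errors",
    "roce_adp_retrans", "roce_adp_retrans_to", "roce_slow_restart*",
    "packet_seq_err", "out_of_sequence", "out_of_buffer",
    "duplicate_request", "implied_nak_seq_err",
    "local_ack_timeout_err", "req_transport_retries_exceeded",
    "rnr_nak_retry_err",
]


def nonzero_error_deltas(before: dict, after: dict,
                         patterns=ERROR_PATTERNS) -> dict:
    """Return {counter: delta} for error counters that changed."""
    deltas = {}
    for name, end_value in after.items():
        if not any(fnmatchcase(name, p) for p in patterns):
            continue  # not an error counter, e.g. port_xmit_wait
        delta = end_value - before.get(name, 0)
        if delta != 0:
            deltas[name] = delta
    return deltas


# Hypothetical snapshots, for illustration only.
before = {"symbol_error": 0, "out_of_sequence": 7, "port_xmit_wait": 10}
after = {"symbol_error": 0, "out_of_sequence": 7, "port_xmit_wait": 500}
print(nonzero_error_deltas(before, after))  # empty dict: no error growth
```

Wait-type counters such as port_xmit_wait are deliberately left out of the pattern list so they can be reported separately.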

Nonzero wait counters:

Host	HCA	`port_xmit_wait` delta
aikubeworker0012	`mlx5_1`	`23,492,853`
aikubeworker0012	`mlx5_7`	`17,420,720`
aikubeworker0016	`mlx5_1`	`20,428,901`
aikubeworker0016	`mlx5_7`	`15,650,027`

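Assessment: with PXN disabled, alltoall shows no clear evidence of link errors, retransmissions, or packet loss. Combined with the allreduce comparison, port_xmit_wait can only be read as a sender-side wait signal and cannot by itself explain the low alltoall throughput. The remaining performance gap points to the efficiency of the NCCL internal alltoall communication pattern on the current topology, switch fabric scheduling/congestion control, or the absence of an external NCCL net plugin/SHARP.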
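
One way to compare the port_xmit_wait deltas of the two runs is to normalize them by the traffic each port transmitted, since the runs moved different amounts of data. The Python sketch below does this for aikubeworker0012 / mlx5_1 using values from the tables in this report. Normalizing by transmitted GiB is an illustrative choice of mine rather than the original analysis, and the counter is left in its native tick units.

```python
# aikubeworker0012 / mlx5_1, values copied from the tables in this report.
runs = {
    "allreduce": {"xmit_wait": 6_555_518, "xmit_gib": 178.07},
    "alltoall (PXN disabled)": {"xmit_wait": 23_492_853, "xmit_gib": 712.27},
}

wait_per_gib = {
    name: run["xmit_wait"] / run["xmit_gib"] for name, run in runs.items()
}

for name, value in wait_per_gib.items():
    print(f"{name}: {value:,.0f} port_xmit_wait ticks per GiB transmitted")

ratio = wait_per_gib["alltoall (PXN disabled)"] / wait_per_gib["allreduce"]
print(f"alltoall / allreduce ratio: {ratio:.2f}")
```

For this port the two runs accumulate wait ticks at a similar rate per GiB, which fits the reading that port_xmit_wait does not single out alltoall.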

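Assessment

Raw RDMA on 4 rails concurrently reaches about 184.62 GB/s, so base network bandwidth is not bottlenecked on a single rail.
8-GPU allreduce is not something that minor software parameter tuning can fix at this point; performance already sits close to the physical bandwidth limit of the current four 400G rails.
8-GPU alltoall remains clearly abnormal, and HCA order is not the cause. With PXN disabled the rails are balanced, and port_xmit_wait is not unique to alltoall, so investigation needs to continue on the NCCL alltoall pattern, switch-side policy, and the NCCL net plugin/SHARP.
NCCL_PXN_DISABLE=1 improves rail balance and performance for 8-GPU alltoall, but cannot close the gap to the PDF target.
If acceptance requires the PDF figures of 491.84/76.54 GB/s for 2 nodes and 16 GPUs, it must be confirmed whether the two current machines have the same effective cross-node rail count and switch fabric capability as the PDF reference environment.
Neither machine currently has libnccl-net.so or SHARP/HCOLL packages installed, so NCCL uses its internal IB plugin. If the target figures depend on an NCCL net plugin/SHARP, the corresponding runtime environment must be provisioned first.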
