GPU_Test Consolidated Report

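Date: 2026-05-24
Nodes: aikubeworker0012 / 172.72.8.12, aikubeworker0016 / 172.72.8.16
GPU: NVIDIA H100 80GB HBM3 x8 / node
Scope: single-node single-GPU compute performance and multi-node multi-GPU NCCL communication
Note: This report consolidates existing raw test results; no additional stress tests were run.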

Overall Conclusions

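Test item	Verdict	Notes
Single-node GPU detection	PASS	Both nodes detect 8 H100 80GB HBM3 GPUs
Single-node single-GPU FP8 hardware compute	PASS	direct cuBLASLt FP8 GEMM meets the `>= 1400 TFLOPS` threshold on both nodes
PyTorch `_scaled_mm` FP8 path	FAIL / software-stack signal	About `1170-1180 TFLOPS`, below the threshold; traced to the PyTorch eager / `_scaled_mm` benchmark path running low, and not used as evidence of hardware failure
Multi-node multi-GPU NCCL correctness	PASS	return code `0`, `Wrong=0` / `Out of bounds values: 0 OK`
Multi-node multi-GPU NCCL performance	Consistent with the current 4x400Gbps network configuration	2x8 allreduce / alltoall are below the PDF 8x400Gbps thresholds, but those thresholds should not be applied directly to the current 4x400Gbps environment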

Single-Node Single-GPU / Compute Tests

Machine Information

Host	GPU	Driver	CUDA	GPU count
`aikubeworker0012`	NVIDIA H100 80GB HBM3	580.159.03	13.0	8
`aikubeworker0016`	NVIDIA H100 80GB HBM3	580.159.03	13.0	8

Sources:

reports_single_gpu_aikubeworker0012.md
reports_single_gpu_aikubeworker0016.md

Original PyTorch Single-Node Compute Results

Host	FP32 (TFLOPS)	TF32 (TFLOPS)	FP16 (TFLOPS)	BF16 (TFLOPS)	FP8 `_scaled_mm` (TFLOPS)	Original verdict
`aikubeworker0012`	52.0	362.3	691.0	713.0	1148.8	FAIL
`aikubeworker0016`	51.9	357.8	667.2	699.1	1146.2	FAIL

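The original PyTorch path runs the FP8 GEMM through `torch._scaled_mm`. A follow-up review showed that this path is affected by PyTorch eager dispatch, output tensor allocation, the cuBLASLt heuristic path, and the default `use_fast_accum=False`, among other factors, so it cannot be taken as the H100 FP8 Tensor Core hardware ceiling.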

direct cuBLASLt FP8 GEMM Cross-Check

Test parameters:

Parameter	Value
Benchmark	direct cuBLASLt FP8 GEMM
Source	`scripts/cublaslt_fp8_gemm_bench.cu`
Matrix	`8192 x 8192 x 8192`
A/B dtype	FP8 E4M3
Output dtype	BF16
Compute type	`CUBLAS_COMPUTE_32F`
Scale type	`CUDA_R_32F`
Scale A/B	`1.0`
Layout	TN
fast accumulation	enabled
Threshold	`>= 1400 TFLOPS`
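
For orientation, GEMM throughput is conventionally reported as 2·M·N·K floating-point operations divided by elapsed time. The sketch below assumes that convention; the benchmark source is not reproduced in this report, and the helper names are hypothetical. It shows how much time one `8192 x 8192 x 8192` GEMM may take while still meeting the `>= 1400 TFLOPS` threshold.

```python
def gemm_flops(m: int, n: int, k: int) -> int:
    """Floating-point operations for one M x N x K GEMM (one multiply and one add per term)."""
    return 2 * m * n * k


def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Throughput in TFLOPS for a single GEMM that took `seconds` to run."""
    return gemm_flops(m, n, k) / seconds / 1e12


def seconds_at_threshold(m: int, n: int, k: int, tflops: float) -> float:
    """Longest per-GEMM time that still meets a given TFLOPS threshold."""
    return gemm_flops(m, n, k) / (tflops * 1e12)


# Matrix shape and threshold are taken from the parameter table above.
flops = gemm_flops(8192, 8192, 8192)
budget_ms = seconds_at_threshold(8192, 8192, 8192, 1400.0) * 1e3

print(f"FLOPs per GEMM: {flops}")
print(f"Time budget at 1400 TFLOPS: {budget_ms:.4f} ms")
```

A real benchmark would time many iterations with GPU-side events and average them; that part is omitted here.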

Results:

Host	Mean FP8 TFLOPS	Min	Max	Spread	Threshold	Verdict
`aikubeworker0012`	1608.6	1599.0	1615.6	1.03%	>= 1400	PASS
`aikubeworker0016`	1613.7	1602.3	1630.3	1.74%	>= 1400	PASS

Per-GPU results (FP8 TFLOPS):

Host	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7
`aikubeworker0012`	1615.6	1611.0	1599.0	1607.1	1614.0	1604.4	1608.4	1609.1
`aikubeworker0016`	1602.3	1604.0	1616.9	1610.6	1620.5	1630.3	1605.1	1620.2
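
The summary columns can be reproduced from the per-GPU values. The sketch below assumes that Spread is defined as (max - min) / mean, which matches the reported figures, and that the verdict requires every GPU to meet the threshold individually; the report does not state either rule explicitly, and `summarize` is a hypothetical helper.

```python
from statistics import mean

THRESHOLD_TFLOPS = 1400.0

# Per-GPU FP8 TFLOPS copied from the per-GPU results table (GPU0..GPU7).
per_gpu_tflops = {
    "aikubeworker0012": [1615.6, 1611.0, 1599.0, 1607.1, 1614.0, 1604.4, 1608.4, 1609.1],
    "aikubeworker0016": [1602.3, 1604.0, 1616.9, 1610.6, 1620.5, 1630.3, 1605.1, 1620.2],
}


def summarize(values, threshold=THRESHOLD_TFLOPS):
    """Return mean, min, max, spread (percent of mean) and a verdict for one host."""
    avg = mean(values)
    lo, hi = min(values), max(values)
    spread_pct = (hi - lo) / avg * 100.0
    # Assumption: the slowest GPU on the host must also meet the threshold.
    verdict = "PASS" if lo >= threshold else "FAIL"
    return {"mean": avg, "min": lo, "max": hi, "spread_pct": spread_pct, "verdict": verdict}


summaries = {host: summarize(vals) for host, vals in per_gpu_tflops.items()}
for host, s in summaries.items():
    print(f"{host}: mean={s['mean']:.1f} min={s['min']:.1f} max={s['max']:.1f} "
          f"spread={s['spread_pct']:.2f}% {s['verdict']}")
```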

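Conclusion: direct cuBLASLt FP8 GEMM meets the `>= 1400 TFLOPS` threshold, which shows that the FP8 hardware compute path on both nodes is capable of reaching the target. The PyTorch `_scaled_mm` FAIL is better recorded as a software-stack benchmark path issue than as a GPU hardware failure.

Sources: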

reports_cublaslt_fp8_crosscheck_20260524.md
reports_cublaslt_fp8_gemm_aikubeworker0012_20260524_071148.json
reports_cublaslt_fp8_gemm_aikubeworker0016_20260524_071200.json

Multi-Node Multi-GPU NCCL Tests

Test Environment

Item	Result
Hosts	`nccl-gpu-1(172.72.8.12)`, `nccl-gpu-2(172.72.8.16)`
Topology	2 nodes x 8 GPUs, 16 GPUs total
NCCL source	`nccl-tests-mpirun`
NCCL network	IB
GPU Direct RDMA	ENABLED
Active HCA rails	`mlx5_0, mlx5_1, mlx5_6, mlx5_7`
HCA speed	4 ports at `400 Gb/sec (4X NDR)`, all ACTIVE

Note: GB/s in the NCCL tables uses an uppercase B, meaning bytes per second. The IB NIC rating of 400 Gb/s uses a lowercase b, meaning bits per second.
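
To make the unit difference concrete, the sketch below converts the rated port speed into bytes per second for the 4-rail and 8-rail cases. It assumes 8 bits per byte and ignores protocol overhead; the function names are hypothetical.

```python
def gbit_to_gbyte(gbit_per_s: float) -> float:
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbit_per_s / 8.0


def aggregate_line_rate_gbyte(rails: int, gbit_per_rail: float = 400.0) -> float:
    """Raw aggregate line rate across all rails, ignoring protocol overhead."""
    return rails * gbit_to_gbyte(gbit_per_rail)


for rails in (4, 8):
    total = aggregate_line_rate_gbyte(rails)
    print(f"{rails} x 400 Gb/s = {total:.0f} GB/s raw line rate")
```

These are raw line rates, not predictions of NCCL bus bandwidth, which also depends on the collective algorithm and on intra-node links.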

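2x8 All-Collectives Results

Operation	Peak Bus BW	Avg Bus BW	PDF 8x400Gbps Threshold	Correctness	Against current 4x400Gbps setup
allreduce	354.27 GB/s	354.45 GB/s	>= 491.84 GB/s	PASS	Consistent with the current hardware configuration; below the PDF 8-rail threshold
alltoall	37.00 GB/s	37.14 GB/s	>= 76.54 GB/s	PASS	Consistent with the current hardware configuration; below the PDF 8-rail threshold
broadcast	191.65 GB/s	190.25 GB/s	No PDF threshold configured	PASS	PASS / recorded only
reducescatter	192.75 GB/s	192.74 GB/s	No PDF threshold configured	PASS	PASS / recorded only
allgather	192.14 GB/s	192.47 GB/s	No PDF threshold configured	PASS	PASS / recorded only
sendrecv	26.98 GB/s	26.97 GB/s	No PDF threshold configured	PASS	PASS / recorded only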
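
For readers comparing bus bandwidth across operations: nccl-tests derives bus bandwidth from the measured algorithm bandwidth using a per-collective factor that depends on the rank count. The sketch below follows the factors described in the nccl-tests performance notes as I understand them; the function name is hypothetical, and the implied algorithm bandwidths are back-computed from the table, not read from the raw logs.

```python
def busbw_factor(op: str, nranks: int) -> float:
    """Factor that maps algorithm bandwidth to bus bandwidth for one collective."""
    if op == "allreduce":
        return 2.0 * (nranks - 1) / nranks
    if op in ("allgather", "reducescatter", "alltoall"):
        return (nranks - 1) / nranks
    if op in ("broadcast", "sendrecv"):
        return 1.0
    raise ValueError(f"unknown collective: {op}")


NRANKS = 16  # 2 nodes x 8 GPUs

# Peak bus bandwidth in GB/s, copied from the 2x8 table above.
peak_busbw = {
    "allreduce": 354.27,
    "alltoall": 37.00,
    "broadcast": 191.65,
    "reducescatter": 192.75,
    "allgather": 192.14,
    "sendrecv": 26.98,
}

# Invert the factor to see the algorithm bandwidth each figure implies.
implied_algbw = {op: bw / busbw_factor(op, NRANKS) for op, bw in peak_busbw.items()}

for op, algbw in implied_algbw.items():
    print(f"{op}: busbw={peak_busbw[op]:.2f} GB/s, implied algbw={algbw:.2f} GB/s")
```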

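Conclusion: NCCL correctness passes in the 2x8 all-collectives test. allreduce and alltoall are below the PDF 8x400Gbps reference thresholds, but only 4 x 400Gbps rails are confirmed to participate in NCCL on the current machines, so this gap should not be read directly as a failure of the current 4x400Gbps environment.

Sources: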

reports_multinode_nccl_all_collectives_20260523_120144.md
reports_multinode_nccl_all_collectives_artifacts_manifest_20260523_120144.md

PDF Matrix allreduce / alltoall Results

AllReduce (compared against the PDF 8x400Gbps thresholds, for reference only):

Topology	Peak Bus BW	Avg Bus BW	PDF 8x400Gbps Threshold	Gap (Peak - Threshold)	Current interpretation
2 nodes x 1 GPU	47.29 GB/s	47.26 GB/s	>= 48.90 GB/s	-1.61 GB/s	Close to the PDF threshold
2 nodes x 2 GPUs	137.16 GB/s	137.13 GB/s	>= 136.93 GB/s	+0.23 GB/s	Meets the PDF threshold
2 nodes x 4 GPUs	335.07 GB/s	335.02 GB/s	>= 335.48 GB/s	-0.41 GB/s	Close to the PDF threshold
2 nodes x 8 GPUs	353.85 GB/s	353.85 GB/s	>= 491.84 GB/s	-137.99 GB/s	Below the PDF 8-rail threshold; the current environment has 4 rails, so not judged a failure directly

AllToAll (compared against the PDF 8x400Gbps thresholds, for reference only):

Topology	Peak Bus BW	Avg Bus BW	PDF 8x400Gbps Threshold	Gap (Peak - Threshold)	Current interpretation
2 nodes x 1 GPU	24.85 GB/s	24.90 GB/s	>= 27.25 GB/s	-2.40 GB/s	Close to the PDF threshold
2 nodes x 2 GPUs	47.76 GB/s	47.98 GB/s	>= 54.41 GB/s	-6.65 GB/s	Below the PDF 8-rail threshold
2 nodes x 4 GPUs	72.74 GB/s	72.80 GB/s	>= 73.73 GB/s	-0.99 GB/s	Close to the PDF threshold
2 nodes x 8 GPUs	36.83 GB/s	36.85 GB/s	>= 76.54 GB/s	-39.71 GB/s	Below the PDF 8-rail threshold; the current environment has 4 rails, so not judged a failure directly
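
The Gap column in both tables equals peak bus bandwidth minus the PDF threshold. A minimal sketch that recomputes it from the table values follows; the data layout and the `gap` helper are illustrative only.

```python
# (operation, topology, peak bus BW in GB/s, PDF 8x400Gbps threshold in GB/s)
rows = [
    ("allreduce", "2 nodes x 1 GPU", 47.29, 48.90),
    ("allreduce", "2 nodes x 2 GPUs", 137.16, 136.93),
    ("allreduce", "2 nodes x 4 GPUs", 335.07, 335.48),
    ("allreduce", "2 nodes x 8 GPUs", 353.85, 491.84),
    ("alltoall", "2 nodes x 1 GPU", 24.85, 27.25),
    ("alltoall", "2 nodes x 2 GPUs", 47.76, 54.41),
    ("alltoall", "2 nodes x 4 GPUs", 72.74, 73.73),
    ("alltoall", "2 nodes x 8 GPUs", 36.83, 76.54),
]


def gap(peak: float, threshold: float) -> float:
    """Signed gap in GB/s; negative means the peak is below the threshold."""
    return round(peak - threshold, 2)


gaps = [gap(peak, thr) for _, _, peak, thr in rows]

for (op, topo, peak, thr), g in zip(rows, gaps):
    status = "meets" if g >= 0 else "below"
    print(f"{op:<10} {topo:<17} peak={peak:7.2f} threshold={thr:7.2f} gap={g:+8.2f} ({status})")
```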

Sources:

reports_multinode_nccl_pdf_matrix_run_20260523.md
reports_multinode_nccl_pdf_matrix_20260523_113803.md

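Risks and Assessment

Single-node FP8 hardware capability is verified through direct cuBLASLt; the current evidence does not support treating the PyTorch `_scaled_mm` FAIL as a GPU hardware fault.
Multi-node NCCL correctness passes, and the performance results should be interpreted against the current 4x400Gbps rail environment.
In the current multi-node environment, 4 x 400G IB rails are confirmed to participate in NCCL; the PDF reference environment is an 8x400G compute management network, so the 2x8 thresholds are not equivalent to the current hardware configuration.
2x8 allreduce and alltoall are below the PDF 8-rail thresholds; this should be recorded as a difference from the PDF reference environment, not as a failure of the current 4-rail environment.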

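Recommendations

Base single-node FP8 acceptance primarily on the direct cuBLASLt or Transformer Engine GEMM benchmark, and keep PyTorch `_scaled_mm` as a software-stack reference item.
If multi-node NCCL is later to be accepted against the PDF thresholds, first align with the PDF reference environment on the 8x400Gbps rail count, the NCCL net plugin / SHARP, the cross-leaf switching strategy, and the ECMP / congestion control configuration.
External reports should clearly distinguish GB/s from Gb/s: NCCL bus bandwidth uses an uppercase B, while IB port speed uses a lowercase b.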
