cuBLASLt FP8 GEMM Cross-Check Report

Date: 2026-05-24

Scope: Determine whether the single-node FP8 compute FAIL is caused by hardware or platform limits, or by the original PyTorch `_scaled_mm` benchmark path.

Method

Added a direct cuBLASLt FP8 GEMM micro-benchmark:

Source: scripts/cublaslt_fp8_gemm_bench.cu
Wrapper: scripts/run_cublaslt_fp8_gemm.sh
Input dtype: CUDA_R_8F_E4M3
Output dtype: CUDA_R_16BF
Accumulate / compute type: CUBLAS_COMPUTE_32F
Layout: TN (A transposed, B non-transposed), as cuBLASLt requires for FP8
Matrix size: 8192
Warmup: 50
Iterations: 500
GPUs: 8 per node, measured one GPU at a time

The NVIDIA cuBLASLt documentation states that FP8 kernels require the TN format, the CUBLAS_COMPUTE_32F compute type, and the CUDA_R_32F scale type. The benchmark follows those constraints.
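
For orientation, the sketch below shows how a per-iteration GEMM time maps to a TFLOPS figure. It is not taken from scripts/cublaslt_fp8_gemm_bench.cu: the conventional 2*M*N*K operation count and a square problem (M = N = K = 8192) are assumptions, and the function names are hypothetical.

```python
def gemm_tflops(m: int, n: int, k: int, seconds_per_iter: float) -> float:
    """Convert a measured per-iteration GEMM time into TFLOPS.

    Assumes the conventional count of 2*M*N*K floating-point operations
    (one multiply and one add per inner-product term).
    """
    flops = 2.0 * m * n * k
    return flops / seconds_per_iter / 1e12


def seconds_for_tflops(m: int, n: int, k: int, tflops: float) -> float:
    """Inverse of gemm_tflops: per-iteration time implied by a TFLOPS figure."""
    return 2.0 * m * n * k / (tflops * 1e12)


# Assumption: "Matrix size: 8192" means a square problem, M = N = K = 8192.
SIZE = 8192

# Per-iteration time implied by a rate of 1608.6 TFLOPS at this size.
implied_seconds = seconds_for_tflops(SIZE, SIZE, SIZE, 1608.6)
print(f"{implied_seconds * 1e3:.3f} ms per GEMM")
```

Under these assumptions each GEMM takes well under a millisecond, which is why a warmup phase and several hundred timed iterations are useful for a stable average.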

Results

aikubeworker0012 / nccl-gpu-1

Raw report: reports_cublaslt_fp8_gemm_aikubeworker0012_20260524_071148.json

GPU	FP8 TFLOPS
0	1615.6
1	1611.0
2	1599.0
3	1607.1
4	1614.0
5	1604.4
6	1608.4
7	1609.1

Summary:

Mean: 1608.6 TFLOPS
Min / Max: 1599.0 / 1615.6 TFLOPS
Spread ((max - min) / mean): 1.03%
FP8 absolute threshold: >= 1400 TFLOPS
Verdict against FP8 absolute threshold: PASS
Verdict against 8-GPU consistency threshold <= 3%: PASS

aikubeworker0016 / nccl-gpu-2

Raw report: reports_cublaslt_fp8_gemm_aikubeworker0016_20260524_071200.json

GPU	FP8 TFLOPS
0	1602.3
1	1604.0
2	1616.9
3	1610.6
4	1620.5
5	1630.3
6	1605.1
7	1620.2

Summary:

Mean: 1613.7 TFLOPS
Min / Max: 1602.3 / 1630.3 TFLOPS
Spread ((max - min) / mean): 1.74%
FP8 absolute threshold: >= 1400 TFLOPS
Verdict against FP8 absolute threshold: PASS
Verdict against 8-GPU consistency threshold <= 3%: PASS
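
The summary figures for both hosts can be re-derived from the per-GPU tables with a short script. This is a minimal sketch, not the tooling that produced the report. The spread is taken as (max - min) / mean, which reproduces the reported percentages, and comparing the slowest GPU against the absolute threshold is an assumption (the report does not say whether the mean or the minimum is tested). The function and constant names are hypothetical.

```python
from statistics import mean

FP8_MIN_TFLOPS = 1400.0   # FP8 absolute threshold from the report
MAX_SPREAD_PCT = 3.0      # 8-GPU consistency threshold from the report


def summarize(tflops: list[float]) -> dict:
    """Compute mean, min, max, spread, and both verdicts for one host."""
    avg = mean(tflops)
    lo, hi = min(tflops), max(tflops)
    # Spread as (max - min) / mean, expressed as a percentage.
    spread_pct = (hi - lo) / avg * 100.0
    return {
        "mean": avg,
        "min": lo,
        "max": hi,
        "spread_pct": spread_pct,
        # Assumption: the slowest GPU must clear the absolute threshold.
        "absolute_pass": lo >= FP8_MIN_TFLOPS,
        "consistency_pass": spread_pct <= MAX_SPREAD_PCT,
    }


# Per-GPU cuBLASLt FP8 results copied from the tables above.
per_host = {
    "aikubeworker0012": [1615.6, 1611.0, 1599.0, 1607.1, 1614.0, 1604.4, 1608.4, 1609.1],
    "aikubeworker0016": [1602.3, 1604.0, 1616.9, 1610.6, 1620.5, 1630.3, 1605.1, 1620.2],
}

summaries = {host: summarize(values) for host, values in per_host.items()}
for host, s in summaries.items():
    print(
        f"{host}: mean={s['mean']:.1f} min={s['min']:.1f} max={s['max']:.1f} "
        f"spread={s['spread_pct']:.2f}% "
        f"absolute={'PASS' if s['absolute_pass'] else 'FAIL'} "
        f"consistency={'PASS' if s['consistency_pass'] else 'FAIL'}"
    )
```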

Comparison With Existing PyTorch `_scaled_mm` Result

Host	PyTorch `_scaled_mm` FP8 (TFLOPS)	cuBLASLt FP8 (TFLOPS)	Delta (TFLOPS)
aikubeworker0012	1170.4	1608.6	+438.2
aikubeworker0016	1179.5	1613.7	+434.2

The cuBLASLt path passes the >= 1400 TFLOPS FP8 absolute threshold on both machines. The original PyTorch `_scaled_mm` path remains at about 1170-1180 TFLOPS, roughly 27% below the cuBLASLt result on the same hardware and well short of the threshold.
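
The gap between the two paths can be quantified directly from the comparison table. The sketch below is illustrative only: the variable names are hypothetical, and expressing the shortfall relative to the cuBLASLt figure is a presentation choice rather than something the report defines.

```python
FP8_MIN_TFLOPS = 1400.0  # FP8 absolute threshold from the report

# (PyTorch _scaled_mm TFLOPS, cuBLASLt TFLOPS) per host, from the table above.
results = {
    "aikubeworker0012": (1170.4, 1608.6),
    "aikubeworker0016": (1179.5, 1613.7),
}

comparison = {}
for host, (scaled_mm, cublaslt) in results.items():
    delta = cublaslt - scaled_mm
    comparison[host] = {
        "delta": delta,
        # How far _scaled_mm falls below cuBLASLt, as a share of cuBLASLt.
        "shortfall_pct": delta / cublaslt * 100.0,
        # How far _scaled_mm falls below the acceptance threshold.
        "below_threshold": FP8_MIN_TFLOPS - scaled_mm,
    }

for host, c in comparison.items():
    print(
        f"{host}: delta={c['delta']:.1f} TFLOPS, "
        f"shortfall={c['shortfall_pct']:.1f}%, "
        f"below threshold by {c['below_threshold']:.1f} TFLOPS"
    )
```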

Conclusion

The FP8 hardware path can exceed the configured H100 FP8 acceptance threshold on both machines. The earlier FP8 FAIL is therefore most likely a benchmark implementation issue in the current PyTorch `_scaled_mm` path, not a GPU hardware, power, clock, thermal, MIG, ECC, or Fabric Manager issue.

Recommended next action: replace or augment the existing FP8 compute acceptance item with the cuBLASLt FP8 GEMM cross-check, while keeping the PyTorch _scaled_mm result as a secondary software-stack signal.
