cuBLASLt FP8 GEMM Cross-Check Report

Date: 2026-05-24

Scope: Determine whether the single-node FP8 compute FAIL is caused by hardware or platform limits, or by the original PyTorch _scaled_mm benchmark path.

Method

Added a direct cuBLASLt FP8 GEMM micro-benchmark:

  • Source: scripts/cublaslt_fp8_gemm_bench.cu
  • Wrapper: scripts/run_cublaslt_fp8_gemm.sh
  • Input dtype: CUDA_R_8F_E4M3
  • Output dtype: CUDA_R_16BF
  • Accumulate / compute type: CUBLAS_COMPUTE_32F
  • Layout: cuBLASLt FP8-required TN format
  • Matrix size: 8192
  • Warmup: 50
  • Iterations: 500
  • GPUs: single-node 8 GPUs, measured one GPU at a time

NVIDIA's cuBLASLt documentation states that FP8 kernels require the TN format, CUBLAS_COMPUTE_32F as the compute type, and CUDA_R_32F as the scale type. The benchmark follows those constraints.
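
As a rough illustration of how a TFLOPS figure is usually derived from a timed GEMM loop, the sketch below converts an average per-iteration time into throughput. It assumes a square problem (M = N = K = 8192) and the conventional 2·M·N·K operation count; the report does not state either, and the function name and the example timing are hypothetical.

```python
def gemm_tflops(m: int, n: int, k: int, seconds_per_iter: float) -> float:
    """Convert the average time of one GEMM into TFLOPS.

    Uses the conventional count of 2*M*N*K floating-point operations
    (one multiply and one add per inner-product term).
    """
    flops = 2.0 * m * n * k
    return flops / seconds_per_iter / 1e12


if __name__ == "__main__":
    size = 8192  # assumption: "Matrix size: 8192" means M = N = K = 8192
    # Hypothetical average time per iteration, for illustration only;
    # this value is not taken from the raw reports.
    example_seconds = 0.683e-3
    print(f"{gemm_tflops(size, size, size, example_seconds):.1f} TFLOPS")
```

Under these assumptions, one 8192-cubed GEMM is about 1.1 trillion operations, so a result near 1600 TFLOPS corresponds to an average iteration time of a little under 0.7 ms.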

Results

aikubeworker0012 / nccl-gpu-1

Raw report: reports_cublaslt_fp8_gemm_aikubeworker0012_20260524_071148.json

| GPU | FP8 TFLOPS |
|-----|-----------:|
| 0 | 1615.6 |
| 1 | 1611.0 |
| 2 | 1599.0 |
| 3 | 1607.1 |
| 4 | 1614.0 |
| 5 | 1604.4 |
| 6 | 1608.4 |
| 7 | 1609.1 |

Summary:

  • Mean: 1608.6 TFLOPS
  • Min / Max: 1599.0 / 1615.6 TFLOPS
  • Spread: 1.03%
  • FP8 absolute threshold: >= 1400 TFLOPS
  • Verdict against FP8 absolute threshold: PASS
  • Verdict against 8-GPU consistency threshold <= 3%: PASS

aikubeworker0016 / nccl-gpu-2

Raw report: reports_cublaslt_fp8_gemm_aikubeworker0016_20260524_071200.json

| GPU | FP8 TFLOPS |
|-----|-----------:|
| 0 | 1602.3 |
| 1 | 1604.0 |
| 2 | 1616.9 |
| 3 | 1610.6 |
| 4 | 1620.5 |
| 5 | 1630.3 |
| 6 | 1605.1 |
| 7 | 1620.2 |

Summary:

  • Mean: 1613.7 TFLOPS
  • Min / Max: 1602.3 / 1630.3 TFLOPS
  • Spread: 1.74%
  • FP8 absolute threshold: >= 1400 TFLOPS
  • Verdict against FP8 absolute threshold: PASS
  • Verdict against 8-GPU consistency threshold <= 3%: PASS
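
The summary figures for both hosts can be re-derived from the per-GPU values with a few lines of Python. This is a minimal sketch, not the benchmark's own reporting code. It assumes that spread is defined as (max - min) / mean, which is the definition that reproduces both reported percentages, and that the absolute threshold applies to the slowest GPU. The function and constant names are hypothetical.

```python
from statistics import mean

FP8_MIN_TFLOPS = 1400.0  # FP8 absolute threshold from the report
MAX_SPREAD_PCT = 3.0     # 8-GPU consistency threshold from the report

# Per-GPU results copied from the two tables above.
PER_GPU_TFLOPS = {
    "aikubeworker0012": [1615.6, 1611.0, 1599.0, 1607.1, 1614.0, 1604.4, 1608.4, 1609.1],
    "aikubeworker0016": [1602.3, 1604.0, 1616.9, 1610.6, 1620.5, 1630.3, 1605.1, 1620.2],
}


def summarize(values):
    """Return mean, min, max, spread and both verdicts for one host."""
    avg = mean(values)
    lo, hi = min(values), max(values)
    # Assumed definition of "spread": range relative to the mean, in percent.
    spread_pct = (hi - lo) / avg * 100.0
    return {
        "mean": avg,
        "min": lo,
        "max": hi,
        "spread_pct": spread_pct,
        # Assumption: every GPU must meet the absolute threshold.
        "absolute_pass": lo >= FP8_MIN_TFLOPS,
        "consistency_pass": spread_pct <= MAX_SPREAD_PCT,
    }


if __name__ == "__main__":
    for host, values in PER_GPU_TFLOPS.items():
        s = summarize(values)
        print(f"{host}: mean={s['mean']:.1f} min={s['min']:.1f} "
              f"max={s['max']:.1f} spread={s['spread_pct']:.2f}% "
              f"absolute={'PASS' if s['absolute_pass'] else 'FAIL'} "
              f"consistency={'PASS' if s['consistency_pass'] else 'FAIL'}")
```

Dividing the range by the minimum or the maximum instead of the mean changes the second host's spread in the second decimal place, so the exact definition matters if the consistency threshold is ever tightened.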

Comparison With Existing PyTorch _scaled_mm Result

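| Host | PyTorch _scaled_mm FP8 (TFLOPS) | cuBLASLt FP8 (TFLOPS) | Delta (TFLOPS) |
|------|--------------------------------:|----------------------:|---------------:|
| aikubeworker0012 | 1170.4 | 1608.6 | +438.2 |
| aikubeworker0016 | 1179.5 | 1613.7 | +434.2 |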

The cuBLASLt path passes the >= 1400 TFLOPS FP8 absolute threshold on both machines, while the original PyTorch _scaled_mm path remains around 1170-1180 TFLOPS, roughly 27% below the cuBLASLt result on the same hardware.
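
The gap between the two paths can be checked with a short calculation. The sketch below recomputes the delta for each host and also expresses it as a percentage of the cuBLASLt figure. The percentage is a derived quantity that does not appear in the original table, and the names used are hypothetical.

```python
# (PyTorch _scaled_mm TFLOPS, cuBLASLt mean TFLOPS), copied from the table above.
RESULTS = {
    "aikubeworker0012": (1170.4, 1608.6),
    "aikubeworker0016": (1179.5, 1613.7),
}


def compare(scaled_mm_tflops: float, cublaslt_tflops: float):
    """Return the absolute delta and the _scaled_mm shortfall in percent."""
    delta = cublaslt_tflops - scaled_mm_tflops
    shortfall_pct = delta / cublaslt_tflops * 100.0
    return delta, shortfall_pct


if __name__ == "__main__":
    for host, (scaled_mm, cublaslt) in RESULTS.items():
        delta, shortfall = compare(scaled_mm, cublaslt)
        print(f"{host}: delta=+{delta:.1f} TFLOPS, "
              f"_scaled_mm is {shortfall:.1f}% below cuBLASLt")
```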

Conclusion

The FP8 hardware path is capable of exceeding the configured H100 FP8 acceptance threshold on both machines. The earlier FP8 FAIL is therefore most likely a benchmark implementation issue in the current PyTorch _scaled_mm path, not a GPU hardware, power, clock, thermal, MIG, ECC, or Fabric Manager issue.

Recommended next action: replace or augment the existing FP8 compute acceptance item with the cuBLASLt FP8 GEMM cross-check, while keeping the PyTorch _scaled_mm result as a secondary software-stack signal.
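
One way the recommended policy could look in an acceptance script is sketched below: the cuBLASLt result decides the verdict, and a low _scaled_mm result is recorded as a warning rather than a failure. This is only an illustration under assumptions. The function name, the result type, and the choice to reuse the 1400 TFLOPS threshold for the secondary signal are all hypothetical and are not taken from the existing test suite.

```python
from dataclasses import dataclass, field


@dataclass
class Fp8Acceptance:
    verdict: str                       # "PASS" or "FAIL", from cuBLASLt only
    warnings: list = field(default_factory=list)


def evaluate_fp8(cublaslt_min_tflops: float,
                 scaled_mm_tflops: float,
                 threshold: float = 1400.0) -> Fp8Acceptance:
    """Primary verdict from cuBLASLt; _scaled_mm is a secondary signal."""
    verdict = "PASS" if cublaslt_min_tflops >= threshold else "FAIL"
    result = Fp8Acceptance(verdict=verdict)
    if scaled_mm_tflops < threshold:
        # Does not change the verdict; flags the software stack for follow-up.
        result.warnings.append(
            f"PyTorch _scaled_mm at {scaled_mm_tflops:.1f} TFLOPS is below "
            f"{threshold:.0f} TFLOPS"
        )
    return result


if __name__ == "__main__":
    # Slowest GPU and _scaled_mm figure for aikubeworker0012 from this report.
    outcome = evaluate_fp8(cublaslt_min_tflops=1599.0, scaled_mm_tflops=1170.4)
    print(outcome.verdict, outcome.warnings)
```

Keeping the two signals separate means a hardware regression and a software-stack regression show up as different outcomes instead of the same FAIL.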