cuBLASLt FP8 GEMM Cross-Check Report
Date: 2026-05-24
Scope: Validate whether the single-node FP8 compute FAIL is caused by hardware/platform limits or by the original PyTorch _scaled_mm benchmark path.
Method
Added a direct cuBLASLt FP8 GEMM micro-benchmark:
- Source: scripts/cublaslt_fp8_gemm_bench.cu
- Wrapper: scripts/run_cublaslt_fp8_gemm.sh
- Input dtype: CUDA_R_8F_E4M3
- Output dtype: CUDA_R_16BF
- Accumulate / compute type: CUBLAS_COMPUTE_32F
- Layout: cuBLASLt FP8-required TN format
- Matrix size: 8192
- Warmup: 50
- Iterations: 500
- GPUs: single-node 8 GPUs, measured one GPU at a time
NVIDIA cuBLASLt documentation states FP8 kernels require TN format, CUBLAS_COMPUTE_32F, and CUDA_R_32F scale type. The implemented benchmark follows those constraints.
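For readers checking the numbers, the following is a minimal sketch of how a TFLOPS figure is typically derived from the parameters above. It assumes a square GEMM (M = N = K = 8192) and the conventional 2*M*N*K operation count; neither is confirmed by the benchmark source in this report, and the function name is hypothetical.

```python
def gemm_tflops(m: int, n: int, k: int, iters: int, elapsed_s: float) -> float:
    """Achieved TFLOPS for `iters` GEMMs of shape (m x k) @ (k x n).

    Assumption: each GEMM is counted as 2*m*n*k floating-point operations
    (one multiply and one add per inner-product term).
    """
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    flops_per_gemm = 2 * m * n * k
    return flops_per_gemm * iters / elapsed_s / 1e12


def seconds_per_gemm(m: int, n: int, k: int, tflops: float) -> float:
    """Inverse view: time one GEMM takes at a given sustained TFLOPS rate."""
    return 2 * m * n * k / (tflops * 1e12)


if __name__ == "__main__":
    n = 8192
    # Time per GEMM implied by a sustained rate of 1608.6 TFLOPS.
    print(f"{seconds_per_gemm(n, n, n, 1608.6) * 1e3:.3f} ms per GEMM")
```

Under these assumptions, a rate of about 1600 TFLOPS corresponds to roughly 0.68 ms per 8192-sized GEMM, so the 500 timed iterations cover only a fraction of a second per GPU.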
Results
aikubeworker0012 / nccl-gpu-1
Raw report: reports_cublaslt_fp8_gemm_aikubeworker0012_20260524_071148.json
| GPU | FP8 TFLOPS |
|---|---|
| 0 | 1615.6 |
| 1 | 1611.0 |
| 2 | 1599.0 |
| 3 | 1607.1 |
| 4 | 1614.0 |
| 5 | 1604.4 |
| 6 | 1608.4 |
| 7 | 1609.1 |
Summary:
- Mean: 1608.6 TFLOPS
- Min / Max: 1599.0 / 1615.6 TFLOPS
- Spread: 1.03%
- FP8 absolute threshold: >= 1400 TFLOPS
- Verdict against FP8 absolute threshold: PASS
- Verdict against 8-GPU consistency threshold <= 3%: PASS
aikubeworker0016 / nccl-gpu-2
Raw report: reports_cublaslt_fp8_gemm_aikubeworker0016_20260524_071200.json
| GPU | FP8 TFLOPS |
|---|---|
| 0 | 1602.3 |
| 1 | 1604.0 |
| 2 | 1616.9 |
| 3 | 1610.6 |
| 4 | 1620.5 |
| 5 | 1630.3 |
| 6 | 1605.1 |
| 7 | 1620.2 |
Summary:
- Mean: 1613.7 TFLOPS
- Min / Max: 1602.3 / 1630.3 TFLOPS
- Spread: 1.74%
- FP8 absolute threshold: >= 1400 TFLOPS
- Verdict against FP8 absolute threshold: PASS
- Verdict against 8-GPU consistency threshold <= 3%: PASS
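The per-host summaries can be re-derived from the tables with a short script. This is a sketch under two assumptions that the report does not state: spread is taken as (max - min) / mean, which reproduces the reported 1.03% and 1.74%, and the absolute threshold is applied to the slowest GPU (applying it to the mean gives the same verdict here). The function and constant names are hypothetical.

```python
from statistics import mean

FP8_MIN_TFLOPS = 1400.0  # FP8 absolute threshold from the report
MAX_SPREAD_PCT = 3.0     # 8-GPU consistency threshold from the report

HOSTS = {
    "aikubeworker0012": [1615.6, 1611.0, 1599.0, 1607.1, 1614.0, 1604.4, 1608.4, 1609.1],
    "aikubeworker0016": [1602.3, 1604.0, 1616.9, 1610.6, 1620.5, 1630.3, 1605.1, 1620.2],
}


def summarize(tflops):
    """Return mean, min, max, spread and both verdicts for one host."""
    avg = mean(tflops)
    lo, hi = min(tflops), max(tflops)
    # Assumed definition: spread relative to the mean, in percent.
    spread_pct = (hi - lo) / avg * 100.0
    return {
        "mean": avg,
        "min": lo,
        "max": hi,
        "spread_pct": spread_pct,
        # Assumed policy: the slowest GPU must clear the absolute threshold.
        "absolute_pass": lo >= FP8_MIN_TFLOPS,
        "consistency_pass": spread_pct <= MAX_SPREAD_PCT,
    }


if __name__ == "__main__":
    for host, values in HOSTS.items():
        s = summarize(values)
        print(f"{host}: mean={s['mean']:.1f} spread={s['spread_pct']:.2f}% "
              f"absolute={'PASS' if s['absolute_pass'] else 'FAIL'} "
              f"consistency={'PASS' if s['consistency_pass'] else 'FAIL'}")
```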
Comparison With Existing PyTorch _scaled_mm Result
| Host | PyTorch _scaled_mm FP8 (TFLOPS) | cuBLASLt FP8 (TFLOPS) | Delta (TFLOPS) |
|---|---|---|---|
| aikubeworker0012 | 1170.4 | 1608.6 | +438.2 |
| aikubeworker0016 | 1179.5 | 1613.7 | +434.2 |
The cuBLASLt path passes the >= 1400 TFLOPS FP8 absolute threshold on both machines, while the original PyTorch _scaled_mm path remains around 1170-1180 TFLOPS.
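The comparison can be expressed as a small check against the same threshold. This sketch uses only the per-host means from the table; the helper name and the ratio metric are illustrative additions, not part of the original benchmark tooling.

```python
FP8_MIN_TFLOPS = 1400.0  # FP8 absolute threshold from the report

# host -> (PyTorch _scaled_mm mean, cuBLASLt mean), both in TFLOPS
RESULTS = {
    "aikubeworker0012": (1170.4, 1608.6),
    "aikubeworker0016": (1179.5, 1613.7),
}


def compare(pytorch_tflops, cublaslt_tflops, threshold=FP8_MIN_TFLOPS):
    """Delta, ratio and threshold verdicts for one host."""
    return {
        "delta": cublaslt_tflops - pytorch_tflops,
        # Fraction of the cuBLASLt rate that the PyTorch path reaches.
        "ratio": pytorch_tflops / cublaslt_tflops,
        "pytorch_pass": pytorch_tflops >= threshold,
        "cublaslt_pass": cublaslt_tflops >= threshold,
    }


if __name__ == "__main__":
    for host, (pt, lt) in RESULTS.items():
        c = compare(pt, lt)
        print(f"{host}: delta=+{c['delta']:.1f} TFLOPS, "
              f"PyTorch/cuBLASLt={c['ratio']:.1%}, "
              f"PyTorch pass={c['pytorch_pass']}, cuBLASLt pass={c['cublaslt_pass']}")
```

On both hosts the PyTorch path reaches roughly 73% of the cuBLASLt figure, which is why it falls below the threshold while the direct path clears it.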
Conclusion
The FP8 hardware path is capable of exceeding the configured H100 FP8 acceptance threshold on both machines. The earlier FP8 FAIL is therefore most likely a benchmark implementation issue in the current PyTorch _scaled_mm path, not a GPU hardware, power, clock, thermal, MIG, ECC, or Fabric Manager issue.
Recommended next action: replace or augment the existing FP8 compute acceptance item with the cuBLASLt FP8 GEMM cross-check, while keeping the PyTorch _scaled_mm result as a secondary software-stack signal.