# cuBLASLt FP8 GEMM Cross-Check Report

Date: 2026-05-24

Scope: Validate whether the single-node FP8 compute FAIL is caused by hardware/platform limits or by the original PyTorch `_scaled_mm` benchmark path.

## Method

Added a direct cuBLASLt FP8 GEMM micro-benchmark:

- Source: `scripts/cublaslt_fp8_gemm_bench.cu`
- Wrapper: `scripts/run_cublaslt_fp8_gemm.sh`
- Input dtype: `CUDA_R_8F_E4M3`
- Output dtype: `CUDA_R_16BF`
- Accumulate / compute type: `CUBLAS_COMPUTE_32F`
- Layout: cuBLASLt FP8-required TN format
- Matrix size: `8192`
- Warmup: `50`
- Iterations: `500`
- GPUs: single-node 8 GPUs, measured one GPU at a time

NVIDIA cuBLASLt documentation states FP8 kernels require TN format, `CUBLAS_COMPUTE_32F`, and `CUDA_R_32F` scale type. The implemented benchmark follows those constraints.

## Results

### aikubeworker0012 / nccl-gpu-1

Raw report: `reports_cublaslt_fp8_gemm_aikubeworker0012_20260524_071148.json`

| GPU | FP8 TFLOPS |
|---:|---:|
| 0 | 1615.6 |
| 1 | 1611.0 |
| 2 | 1599.0 |
| 3 | 1607.1 |
| 4 | 1614.0 |
| 5 | 1604.4 |
| 6 | 1608.4 |
| 7 | 1609.1 |

Summary:

- Mean: `1608.6 TFLOPS`
- Min / Max: `1599.0 / 1615.6 TFLOPS`
- Spread: `1.03%`
- FP8 absolute threshold: `>= 1400 TFLOPS`
- Verdict against FP8 absolute threshold: **PASS**
- Verdict against 8-GPU consistency threshold `<= 3%`: **PASS**

### aikubeworker0016 / nccl-gpu-2

Raw report: `reports_cublaslt_fp8_gemm_aikubeworker0016_20260524_071200.json`

| GPU | FP8 TFLOPS |
|---:|---:|
| 0 | 1602.3 |
| 1 | 1604.0 |
| 2 | 1616.9 |
| 3 | 1610.6 |
| 4 | 1620.5 |
| 5 | 1630.3 |
| 6 | 1605.1 |
| 7 | 1620.2 |

Summary:

- Mean: `1613.7 TFLOPS`
- Min / Max: `1602.3 / 1630.3 TFLOPS`
- Spread: `1.74%`
- FP8 absolute threshold: `>= 1400 TFLOPS`
- Verdict against FP8 absolute threshold: **PASS**
- Verdict against 8-GPU consistency threshold `<= 3%`: **PASS**

## Comparison With Existing PyTorch `_scaled_mm` Result

| Host | PyTorch `_scaled_mm` FP8 (TFLOPS) | cuBLASLt FP8 (TFLOPS) | Delta (TFLOPS) |
|---|---:|---:|---:|
| aikubeworker0012 | 1170.4 | 1608.6 | +438.2 |
| aikubeworker0016 | 1179.5 | 1613.7 | +434.2 |

The cuBLASLt path passes the `>= 1400 TFLOPS` FP8 absolute threshold on both machines, while the original PyTorch `_scaled_mm` path remains around `1170-1180 TFLOPS`.

## Conclusion

The FP8 hardware path is capable of exceeding the configured H100 FP8 acceptance threshold on both machines. The earlier FP8 FAIL is therefore most likely a benchmark implementation issue in the current PyTorch `_scaled_mm` path, not a GPU hardware, power, clock, thermal, MIG, ECC, or Fabric Manager issue.

Recommended next action: replace or augment the existing FP8 compute acceptance item with the cuBLASLt FP8 GEMM cross-check, while keeping the PyTorch `_scaled_mm` result as a secondary software-stack signal.
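
For reference, the Summary figures and PASS/FAIL verdicts in the Results section can be re-derived from the per-GPU tables with a short script. The sketch below is not part of the benchmark sources. The names `RESULTS`, `summarize`, and `gemm_tflops` are hypothetical, and two conventions are assumptions rather than facts taken from the benchmark: spread is computed as `(max - min) / mean`, which is the definition that matches the reported `1.03%` and `1.74%`, and a square GEMM of size `N` is counted as `2 * N**3` floating-point operations.

```python
from statistics import mean

# Per-GPU FP8 TFLOPS copied from the Results tables in this report.
RESULTS = {
    "aikubeworker0012": [1615.6, 1611.0, 1599.0, 1607.1, 1614.0, 1604.4, 1608.4, 1609.1],
    "aikubeworker0016": [1602.3, 1604.0, 1616.9, 1610.6, 1620.5, 1630.3, 1605.1, 1620.2],
}

FP8_MIN_TFLOPS = 1400.0  # FP8 absolute threshold stated in the report
MAX_SPREAD_PCT = 3.0     # 8-GPU consistency threshold stated in the report


def gemm_tflops(n: int, seconds_per_iter: float) -> float:
    """Convert a per-iteration time into TFLOPS.

    Assumption: a square N x N x N GEMM is counted as 2 * N**3 FLOPs
    (one multiply and one add per inner-product term).
    """
    return 2.0 * n**3 / seconds_per_iter / 1e12


def summarize(tflops: list[float]) -> dict:
    """Compute mean, min, max, spread, and both verdicts for one host."""
    avg = mean(tflops)
    lo, hi = min(tflops), max(tflops)
    # Assumption: spread is (max - min) relative to the mean, in percent.
    spread_pct = (hi - lo) / avg * 100.0
    return {
        "mean": round(avg, 1),
        "min": lo,
        "max": hi,
        "spread_pct": round(spread_pct, 2),
        # Every GPU must meet the absolute threshold for the host to pass.
        "absolute_pass": lo >= FP8_MIN_TFLOPS,
        "consistency_pass": spread_pct <= MAX_SPREAD_PCT,
    }


if __name__ == "__main__":
    for host, values in RESULTS.items():
        print(host, summarize(values))
```

Applying the absolute threshold to the slowest GPU rather than to the mean is a deliberately strict design choice in this sketch; with the values above, both readings give the same verdict.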