# cuBLASLt FP8 GEMM Cross-Check Report

Date: 2026-05-24

Scope: Determine whether the single-node FP8 compute FAIL is caused by hardware or platform limits, or by the original PyTorch `_scaled_mm` benchmark path.

## Method

A direct cuBLASLt FP8 GEMM micro-benchmark was added with the following configuration:

- Source: `scripts/cublaslt_fp8_gemm_bench.cu`
- Wrapper: `scripts/run_cublaslt_fp8_gemm.sh`
- Input dtype: `CUDA_R_8F_E4M3`
- Output dtype: `CUDA_R_16BF`
- Accumulate / compute type: `CUBLAS_COMPUTE_32F`
- Layout: TN format (A transposed, B non-transposed), as cuBLASLt requires for FP8
- Matrix size: `8192`
- Warmup: `50`
- Iterations: `500`
- GPUs: all 8 GPUs on a single node, measured one GPU at a time

NVIDIA cuBLASLt documentation states FP8 kernels require TN format, `CUBLAS_COMPUTE_32F`, and `CUDA_R_32F` scale type. The implemented benchmark follows those constraints.
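
The TFLOPS figures in this report follow from the GEMM operation count and the measured time. The sketch below shows that conversion under two assumptions that the report does not state explicitly: that `Matrix size: 8192` means a square problem with M = N = K = 8192, and that one GEMM is counted as 2·M·N·K floating-point operations (the usual convention). The helper names are hypothetical and are not taken from `scripts/cublaslt_fp8_gemm_bench.cu`.

```python
def gemm_tflops(m, n, k, iterations, elapsed_s):
    """Throughput in TFLOPS for `iterations` GEMMs of shape (m x k) @ (k x n)."""
    # One multiply and one add per inner-product term: 2 * m * n * k per GEMM.
    flops = 2 * m * n * k * iterations
    return flops / elapsed_s / 1e12


def implied_ms_per_gemm(m, n, k, tflops):
    """Invert the formula: time per GEMM (ms) implied by a TFLOPS figure."""
    return 2 * m * n * k / (tflops * 1e12) * 1e3


# Assumption: square problem, M = N = K = 8192.
SIZE = 8192
print(f"FLOPs per GEMM: {2 * SIZE**3}")
print(f"{implied_ms_per_gemm(SIZE, SIZE, SIZE, 1608.6):.3f} ms per GEMM at 1608.6 TFLOPS")
```

Under these assumptions, a throughput of about 1600 TFLOPS corresponds to roughly 0.68 ms per GEMM, so the 500 timed iterations amount to well under a second of kernel time per GPU.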

## Results

### aikubeworker0012 / nccl-gpu-1

Raw report: `reports_cublaslt_fp8_gemm_aikubeworker0012_20260524_071148.json`

| GPU | FP8 TFLOPS |
|---:|---:|
| 0 | 1615.6 |
| 1 | 1611.0 |
| 2 | 1599.0 |
| 3 | 1607.1 |
| 4 | 1614.0 |
| 5 | 1604.4 |
| 6 | 1608.4 |
| 7 | 1609.1 |

Summary:

- Mean: `1608.6 TFLOPS`
- Min / Max: `1599.0 / 1615.6 TFLOPS`
- Spread ((max - min) / mean): `1.03%`
- FP8 absolute threshold: `>= 1400 TFLOPS`
- Verdict against FP8 absolute threshold: **PASS**
- Verdict against 8-GPU consistency threshold `<= 3%`: **PASS**
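
The summary values above can be reproduced from the per-GPU table. The sketch below assumes that spread is defined as (max - min) / mean, which matches the reported figures on both hosts, and that the absolute threshold is applied to the slowest GPU; the function and constant names are illustrative and are not part of the benchmark scripts.

```python
from statistics import mean

FP8_MIN_TFLOPS = 1400.0  # FP8 absolute threshold from this report
MAX_SPREAD_PCT = 3.0     # 8-GPU consistency threshold from this report


def summarize(per_gpu_tflops):
    """Compute mean, min/max, spread, and both verdicts for one host."""
    lo, hi = min(per_gpu_tflops), max(per_gpu_tflops)
    avg = mean(per_gpu_tflops)
    # Assumed definition: spread is the max-min range relative to the mean.
    spread_pct = (hi - lo) / avg * 100.0
    return {
        "mean": round(avg, 1),
        "min": lo,
        "max": hi,
        "spread_pct": round(spread_pct, 2),
        # Assumption: every GPU must individually meet the absolute threshold.
        "absolute_pass": lo >= FP8_MIN_TFLOPS,
        "consistency_pass": spread_pct <= MAX_SPREAD_PCT,
    }


# Per-GPU values for aikubeworker0012, copied from the table above.
aikubeworker0012 = [1615.6, 1611.0, 1599.0, 1607.1, 1614.0, 1604.4, 1608.4, 1609.1]
print(summarize(aikubeworker0012))
```

Applying the same function to the second host's per-GPU values gives its summary in the same form.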

### aikubeworker0016 / nccl-gpu-2

Raw report: `reports_cublaslt_fp8_gemm_aikubeworker0016_20260524_071200.json`

| GPU | FP8 TFLOPS |
|---:|---:|
| 0 | 1602.3 |
| 1 | 1604.0 |
| 2 | 1616.9 |
| 3 | 1610.6 |
| 4 | 1620.5 |
| 5 | 1630.3 |
| 6 | 1605.1 |
| 7 | 1620.2 |

Summary:

- Mean: `1613.7 TFLOPS`
- Min / Max: `1602.3 / 1630.3 TFLOPS`
- Spread ((max - min) / mean): `1.74%`
- FP8 absolute threshold: `>= 1400 TFLOPS`
- Verdict against FP8 absolute threshold: **PASS**
- Verdict against 8-GPU consistency threshold `<= 3%`: **PASS**

## Comparison With Existing PyTorch `_scaled_mm` Result

| Host | PyTorch `_scaled_mm` FP8 (TFLOPS) | cuBLASLt FP8 (TFLOPS) | Delta (TFLOPS) |
|---|---:|---:|---:|
| aikubeworker0012 | 1170.4 | 1608.6 | +438.2 |
| aikubeworker0016 | 1179.5 | 1613.7 | +434.2 |

The cuBLASLt path passes the `>= 1400 TFLOPS` FP8 absolute threshold on both machines, while the original PyTorch `_scaled_mm` path remains at roughly `1170-1180 TFLOPS`, about 220-230 TFLOPS below that threshold.
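
The comparison can be restated as a small calculation. The sketch below uses only the values from the table above; the `compare` helper and its field names are hypothetical and are shown purely to make the arithmetic explicit.

```python
FP8_MIN_TFLOPS = 1400.0  # FP8 absolute threshold from this report

# Values copied from the comparison table (TFLOPS).
results = {
    "aikubeworker0012": {"scaled_mm": 1170.4, "cublaslt": 1608.6},
    "aikubeworker0016": {"scaled_mm": 1179.5, "cublaslt": 1613.7},
}


def compare(scaled_mm, cublaslt, threshold=FP8_MIN_TFLOPS):
    """Relate the PyTorch result to the cuBLASLt result and the threshold."""
    return {
        "delta": round(cublaslt - scaled_mm, 1),
        "scaled_mm_fraction": scaled_mm / cublaslt,
        "scaled_mm_shortfall": round(threshold - scaled_mm, 1),
        "cublaslt_pass": cublaslt >= threshold,
        "scaled_mm_pass": scaled_mm >= threshold,
    }


for host, r in results.items():
    c = compare(r["scaled_mm"], r["cublaslt"])
    print(
        f"{host}: delta=+{c['delta']} TFLOPS, "
        f"_scaled_mm at {c['scaled_mm_fraction']:.1%} of cuBLASLt, "
        f"{c['scaled_mm_shortfall']} TFLOPS below threshold"
    )
```

On both hosts the `_scaled_mm` path reaches roughly 73% of the cuBLASLt throughput.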

## Conclusion

When driven directly through cuBLASLt, the FP8 hardware path exceeds the configured H100 FP8 acceptance threshold on both machines. The earlier FP8 FAIL is therefore most likely a benchmark implementation issue in the current PyTorch `_scaled_mm` path rather than a GPU hardware, power, clock, thermal, MIG, ECC, or Fabric Manager issue.

Recommended next action: replace or augment the existing FP8 compute acceptance item with the cuBLASLt FP8 GEMM cross-check, while keeping the PyTorch `_scaled_mm` result as a secondary software-stack signal.