# FP8 GEMM Path Comparison Test Report

Test date: 2026-05-25  
Test nodes: aikubeworker0012, aikubeworker0016  
GPU under test: NVIDIA H100 80GB HBM3  
Goal: compare the performance of PyTorch eager, CUDA Graph, Transformer Engine, and direct cuBLASLt at the same FP8 GEMM size.

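## 1. Conclusions

All five paths, A through E, were measured on both nodes.

Key findings:

1. Direct cuBLASLt is the fastest path in this set, reaching 1626.6 TFLOPS on aikubeworker0012 and 1598.1 TFLOPS on aikubeworker0016.
2. The default PyTorch eager `_scaled_mm` path reaches about 1161.9-1186.1 TFLOPS.
3. Enabling `use_fast_accum=True` gives the PyTorch eager path a consistent gain of about 5.0%-6.7%.
4. CUDA Graph + `_scaled_mm(use_fast_accum=True)` raises throughput further to 1277.7-1322.2 TFLOPS, which is still below direct cuBLASLt.
5. The Transformer Engine path used here is `te.Linear` under `fp8_autocast`, not a bare GEMM. It therefore includes the overhead of the TE module, casts, and the FP8 recipe, and its result is lower than both direct cuBLASLt and CUDA Graph `_scaled_mm`.

This shows that the GPU hardware and the bare cuBLASLt GEMM capability are not the problem. The earlier PyTorch `_scaled_mm` results of roughly 1170-1180 TFLOPS mainly reflect the end-to-end performance of the PyTorch eager path under the current benchmark method, not the compute limit of the GPU.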

## 2. Test Method

Common parameters:

| Parameter | Value |
|---|---:|
| matrix_size | 8192 |
| M/N/K | 8192/8192/8192 |
| warmup | 50 |
| iterations | 500 |
| GPU index | 0 |
| PyTorch | 2.6.0+cu124 |
| CUDA | 12.4 |
| Input dtype | FP8 E4M3 |
| Output dtype | BF16 |
| accumulation | FP32 |
| scale_a / scale_b | 1.0 / 1.0 |

Path definitions:

| Path | Name | Meaning |
|---|---|---|
| A | Current eager `_scaled_mm` | Calls `torch._scaled_mm` in PyTorch eager mode with the default accumulation setting |
| B | `_scaled_mm(use_fast_accum=True)` | PyTorch eager path with fast accumulation explicitly enabled |
| C | CUDA Graph + `_scaled_mm(use_fast_accum=True)` | Captures and replays the same `_scaled_mm` call to reduce Python/PyTorch launch gaps |
| D | Transformer Engine FP8 GEMM | `te.Linear` executed under `fp8_autocast`, including TE layer wrapping and FP8 recipe overhead |
| E | direct cuBLASLt | C++/CUDA code that calls `cublasLtMatmul` directly, bypassing PyTorch eager |

Reproduction command:

```bash
MATRIX_SIZE=8192 WARMUP=50 ITERATIONS=500 GPU_INDEX=0 WORKSPACE_MB=256 \
  /root/test_gpu_scripts/scripts/run_fp8_path_comparison.sh
```
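The benchmark script itself is not reproduced in this report. As a rough illustration only, paths A, B, and C could be timed along the following lines. The function names (`gemm_tflops`, `seconds_per_call`, `run_paths`), the random input data, and the 2·M·N·K FLOP convention are assumptions made for this sketch and may differ from what `pytorch_fp8_path_bench.py` actually does.

```python
def gemm_tflops(m, n, k, seconds):
    """TFLOPS of one M x N x K GEMM, counting 2*M*N*K floating-point operations."""
    return 2.0 * m * n * k / seconds / 1e12


def seconds_per_call(torch, fn, warmup, iters):
    """Average seconds per call of fn, timed with CUDA events on the current stream."""
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / 1e3 / iters  # elapsed_time() returns milliseconds


def run_paths(size=8192, warmup=50, iters=500):
    """Hypothetical re-creation of paths A, B, C; returns {path: TFLOPS}."""
    try:
        import torch  # lazy import so gemm_tflops stays usable without PyTorch
    except ImportError:
        print("PyTorch is not installed; skipping")
        return {}
    if not torch.cuda.is_available() or torch.cuda.get_device_capability(0) < (8, 9):
        print("FP8 _scaled_mm needs a GPU with compute capability >= 8.9; skipping")
        return {}

    dev = torch.device("cuda", 0)
    a = torch.randn(size, size, device=dev).to(torch.float8_e4m3fn)
    # _scaled_mm expects the second operand in column-major layout,
    # so build a row-major (N, K) tensor and take its transpose view.
    b = torch.randn(size, size, device=dev).to(torch.float8_e4m3fn).t()
    scale = torch.tensor(1.0, device=dev)  # float32 per-tensor scale of 1.0

    def mm(fast):
        return torch._scaled_mm(a, b, scale_a=scale, scale_b=scale,
                                out_dtype=torch.bfloat16, use_fast_accum=fast)

    secs = {
        "A": seconds_per_call(torch, lambda: mm(False), warmup, iters),
        "B": seconds_per_call(torch, lambda: mm(True), warmup, iters),
    }

    # Path C: capture one call into a CUDA Graph, then time replay() only.
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_out = mm(True)  # output buffer is rewritten by every replay
    secs["C"] = seconds_per_call(torch, graph.replay, warmup, iters)

    return {p: gemm_tflops(size, size, size, s) for p, s in secs.items()}


if __name__ == "__main__":
    for path, tflops in run_paths().items():
        print(f"{path}: {tflops:.1f} TFLOPS")
```

The sketch reuses the same input tensors on every iteration and times a back-to-back loop, so it measures sustained launch throughput rather than the latency of a single isolated call.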

## 3. Measured Results

### aikubeworker0012

Raw JSON: `/Users/d-robotics/lab/test_gpu_scripts/reports_fp8_paths_combined_aikubeworker0012_20260525_045408.json`

| Path | Status | TFLOPS | CUDA event time per iteration |
|---|---|---:|---:|
| A eager `_scaled_mm` default | OK | 1186.1 | 927.014 us |
| B eager `_scaled_mm` fast_accum | OK | 1266.0 | 868.481 us |
| C CUDA Graph + fast_accum | OK | 1322.2 | 831.573 us |
| D Transformer Engine FP8 Linear | OK | 1153.2 | 953.478 us |
| E direct cuBLASLt fast_accum | OK | 1626.6 | not recorded in the combined JSON |

Change relative to A:

| Path | Relative to A |
|---|---:|
| B | +6.7% |
| C | +11.5% |
| D | -2.8% |
| E | +37.1% |
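The percentages above follow directly from the TFLOPS column. A minimal check, assuming the change is defined as `(value / A - 1) * 100` (the helper name `relative_to` is introduced here for illustration only):

```python
# TFLOPS values copied from the aikubeworker0012 results table.
tflops_0012 = {"A": 1186.1, "B": 1266.0, "C": 1322.2, "D": 1153.2, "E": 1626.6}


def relative_to(baseline, value):
    """Percentage change of value with respect to baseline."""
    return (value / baseline - 1.0) * 100.0


for path in "BCDE":
    change = relative_to(tflops_0012["A"], tflops_0012[path])
    print(f"{path}: {change:+.1f}%")
```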

cuBLASLt algorithm information for path E:

| Field | Value |
|---|---:|
| algo_id | 52 |
| tile_id | 23 |
| splitk | 1 |
| stages_id | 36 |
| inner_shape_id | 0 |
| cluster_shape_id | 3 |

### aikubeworker0016

Raw JSON: `/Users/d-robotics/lab/test_gpu_scripts/reports_fp8_paths_combined_aikubeworker0016_20260525_050048.json`

| Path | Status | TFLOPS | CUDA event time per iteration |
|---|---|---:|---:|
| A eager `_scaled_mm` default | OK | 1161.9 | 946.313 us |
| B eager `_scaled_mm` fast_accum | OK | 1220.4 | 900.960 us |
| C CUDA Graph + fast_accum | OK | 1277.7 | 860.543 us |
| D Transformer Engine FP8 Linear | OK | 1125.3 | 977.054 us |
| E direct cuBLASLt fast_accum | OK | 1598.1 | not recorded in the combined JSON |

Change relative to A:

| Path | Relative to A |
|---|---:|
| B | +5.0% |
| C | +10.0% |
| D | -3.2% |
| E | +37.5% |

cuBLASLt algorithm information for path E:

| Field | Value |
|---|---:|
| algo_id | 52 |
| tile_id | 23 |
| splitk | 1 |
| stages_id | 36 |
| inner_shape_id | 0 |
| cluster_shape_id | 3 |

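## 4. Can PyTorch FP8 Go Higher?

Based on these results, the PyTorch FP8 path can be pushed higher in two ways:

1. Enable faster math/accumulation options, for example `use_fast_accum=True`.
2. Use CUDA Graph replay to reduce the scheduling and enqueue gaps between iterations in eager mode.

However, with the current test setup (`matrix_size=8192`, a single `_scaled_mm`, PyTorch eager/Graph benchmarking), the PyTorch paths still do not reach the 1598-1626 TFLOPS of direct cuBLASLt. In other words, direct cuBLASLt shows that the hardware and the underlying library can run faster, while the eager `_scaled_mm` number reflects how PyTorch's current wrapper path actually performs at this shape.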
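To put a number on the remaining gap, the fraction of direct cuBLASLt throughput reached by each PyTorch path can be computed from the result tables. This is a small arithmetic sketch; the dictionary layout and the helper name `fraction_of_e` are chosen here for illustration. With the reported figures, path C lands at roughly 80-81% of path E on both nodes.

```python
# TFLOPS values copied from the two result tables in this report.
results = {
    "aikubeworker0012": {"A": 1186.1, "B": 1266.0, "C": 1322.2, "E": 1626.6},
    "aikubeworker0016": {"A": 1161.9, "B": 1220.4, "C": 1277.7, "E": 1598.1},
}


def fraction_of_e(node, path):
    """Throughput of a path as a fraction of direct cuBLASLt (path E) on one node."""
    return results[node][path] / results[node]["E"]


for node in results:
    for path in "ABC":
        print(f"{node} {path}: {fraction_of_e(node, path):.1%} of E")
```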

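If the goal is to bring the PyTorch code path closer to bare cuBLASLt, the following are worth testing next:

1. Larger GEMM sizes, for example 16384.
2. `torch.compile` or Inductor with a fixed shape.
3. A CUDA Graph that covers a more complete step rather than replaying a single op.
4. Transformer Engine's lower-level GEMM API or an official microbenchmark instead of the `te.Linear` module forward.
5. Nsight Systems / Nsight Compute captures of `_scaled_mm` to confirm the actual kernels, the gaps between them, and the cuBLASLt algorithm selection.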

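## 5. Terminology

`eager` refers to PyTorch's immediate execution mode. Each time Python calls `torch._scaled_mm`, PyTorch goes through the dispatcher, argument checks, tensor creation, descriptor setup, and the cuBLASLt heuristic call before it enqueues the matmul on the CUDA stream.

`cuBLAS` is NVIDIA's basic matrix multiplication library. `cuBLASLt` is a more flexible matmul interface that supports more layouts, FP8, algorithm heuristics, workspaces, epilogues, and similar features.

`direct cuBLASLt` means our own C++/CUDA code calls `cublasLtMatmul` directly without going through PyTorch eager, so the result is closer to the bare GEMM peak.

`CUDA Graph` means capturing a piece of CUDA work into a graph ahead of time and then replaying it directly, which reduces the gaps caused by repeated CPU-side launches and scheduling.

`Transformer Engine` is NVIDIA's library optimized for Transformer/FP8 training. Path D in this report uses the `te.Linear` module forward, which is not equivalent to a bare GEMM microbenchmark.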

## 6. File List

Local scripts:

| File | Purpose |
|---|---|
| `/Users/d-robotics/lab/test_gpu_scripts/scripts/pytorch_fp8_path_bench.py` | Paths A/B/C/D (PyTorch and Transformer Engine) |
| `/Users/d-robotics/lab/test_gpu_scripts/scripts/cublaslt_fp8_gemm_bench.cu` | Path E (direct cuBLASLt) |
| `/Users/d-robotics/lab/test_gpu_scripts/scripts/run_fp8_path_comparison.sh` | Runs all paths and merges the A-E results |

Local results:

| File | Purpose |
|---|---|
| `/Users/d-robotics/lab/test_gpu_scripts/reports_fp8_paths_combined_aikubeworker0012_20260525_045408.json` | Raw A-E results for aikubeworker0012 |
| `/Users/d-robotics/lab/test_gpu_scripts/reports_fp8_paths_combined_aikubeworker0016_20260525_050048.json` | Raw A-E results for aikubeworker0016 |
| `/Users/d-robotics/lab/test_gpu_scripts/reports_fp8_path_comparison_20260525.md` | This summary report (Chinese original) |