zulifeng
fc97a768cf
feat: update test metrics and verdict logic to the H100 production acceptance criteria
- gpu_specs: add a compute_pass_thresholds_tflops field for H100
(fp32:54 / tf32:444 / fp16:734 / bf16:745 / fp8:1400),
decoupled from marketing peak and used as the absolute TFLOPS PASS threshold
- benchmark: expose pass_thresholds_tflops in compute results for report to consume
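- report: compute verdicts now use absolute TFLOPS (PASS ≥ threshold /
WARN ≥ threshold × 90% / FAIL < threshold × 90%); table header switches to a
Threshold column; Memory D2D verdict tightened from 50/30 to 80/60;
GPUs without configured thresholds keep the legacy % efficiency logic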
- nccl: tighten _OP_BW_FRACTIONS to 0.45 for AllReduce/AllGather/ReduceScatter,
0.40 for Broadcast/SendRecv, and 0.35 for AllToAll, consistent with §5 of the
acceptance document
- configs: benchmark defaults change matrix_size 4096→8192, warmup 10→50,
iterations 100→500, and use_compile to true; health temp_warning
80→75 and temp_critical 90→85, matching the steady-state temperature
requirements for production acceptance
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-13 14:52:41 +08:00
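
The compute verdict rule described in the report bullet above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the threshold values and the PASS / WARN / FAIL bands come from the commit message, while the function names (`absolute_verdict`, `compute_verdict`), the `legacy_verdict` callback, and the dictionary layout are hypothetical and chosen only for this example.

```python
from typing import Callable, Mapping, Optional

# H100 thresholds in TFLOPS, as quoted in the commit message.
H100_PASS_THRESHOLDS_TFLOPS = {
    "fp32": 54.0,
    "tf32": 444.0,
    "fp16": 734.0,
    "bf16": 745.0,
    "fp8": 1400.0,
}

# WARN band from the commit message: at least 90% of the PASS threshold.
WARN_FRACTION = 0.90


def absolute_verdict(measured_tflops: float, threshold_tflops: float) -> str:
    """Classify a measurement against an absolute TFLOPS threshold."""
    if measured_tflops >= threshold_tflops:
        return "PASS"
    if measured_tflops >= threshold_tflops * WARN_FRACTION:
        return "WARN"
    return "FAIL"


def compute_verdict(
    measured_tflops: float,
    dtype: str,
    thresholds: Optional[Mapping[str, float]],
    legacy_verdict: Callable[[float], str],
) -> str:
    """Use the absolute threshold when one is configured for this dtype.

    Otherwise fall back to the legacy %-efficiency logic. The legacy rule is
    not specified in the commit message, so it is passed in as a hypothetical
    callback rather than guessed here.
    """
    if thresholds and dtype in thresholds:
        return absolute_verdict(measured_tflops, thresholds[dtype])
    return legacy_verdict(measured_tflops)


if __name__ == "__main__":
    # Placeholder for the legacy path; the real rule lives in the report code.
    legacy = lambda tflops: "LEGACY"
    print(compute_verdict(700.0, "fp16", H100_PASS_THRESHOLDS_TFLOPS, legacy))
    print(compute_verdict(700.0, "fp16", None, legacy))
```

Passing the legacy rule in as a callback keeps the sketch from inventing percentage cutoffs that the commit message does not state for the compute path.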