dd77a882f1
feat: merge cross-node RDMA into rdma_test.py + align H800 compute thresholds with H100
...
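- modules/rdma_test.py: add SSH-orchestrated cross-node RDMA
(run_cross_node / _cross_node_perftest / parser); for each device, the
client side starts the peer's perftest server and then runs the local
client, replacing the deleted scripts/rdma_cross_node.sh; measured on
two nodes with 4×NDR400, all PASS (~387-392 Gb/s, ~2 µs).
- configs/default.yaml: add rdma.cross_node config block (default enabled:false).
- modules/gpu_specs.py: align H800 PASS thresholds with the measured H100
floor (tf32 400->385, bf16 720->730, fp8 1400->1200); H800 is the same
silicon as H100, and the PyTorch tensorwise fp8 ceiling is ~1310, so the
original 1400 was unreachable.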
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 19:38:43 +08:00
e49ea32094
feat: add multi-node nccl test script
2026-05-25 14:19:02 +08:00
fc97a768cf
feat: update test metrics and verdict logic to match H100 production acceptance criteria
...
- gpu_specs: add compute_pass_thresholds_tflops field for H100
(fp32:54 / tf32:444 / fp16:734 / bf16:745 / fp8:1400),
decoupled from marketing peak and used as absolute TFLOPS PASS thresholds
- benchmark: expose pass_thresholds_tflops in compute results for report to use
- report: compute verdict now uses absolute TFLOPS (PASS ≥ threshold /
WARN ≥ threshold×90% / FAIL < threshold×90%); table header switched to a
Threshold column; Memory D2D verdict tightened from 50/30 to 80/60; GPUs
without configured thresholds keep the old % efficiency logic
- nccl: tighten _OP_BW_FRACTIONS to AllReduce/AllGather/ReduceScatter
0.45, Broadcast/SendRecv 0.40, AllToAll 0.35, consistent with acceptance
document §5
- configs: benchmark defaults matrix_size 4096→8192, warmup 10→50,
iterations 100→500, use_compile set to true; health temp_warning
80→75, temp_critical 90→85, matching the steady-state temperature
requirements of production acceptance
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-13 14:52:41 +08:00
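The absolute-TFLOPS verdict rule described in this commit (PASS at or above the threshold, WARN at or above 90% of it, FAIL below that) can be sketched roughly as follows. The function name `compute_verdict` and the dictionary name `H100_PASS_THRESHOLDS_TFLOPS` are hypothetical and chosen for illustration; only the threshold values and the 90% cutoff come from the commit message.

```python
# Thresholds quoted in the commit message (absolute TFLOPS, not marketing peak).
# The dictionary name is hypothetical; the repo field is compute_pass_thresholds_tflops.
H100_PASS_THRESHOLDS_TFLOPS = {
    "fp32": 54,
    "tf32": 444,
    "fp16": 734,
    "bf16": 745,
    "fp8": 1400,
}


def compute_verdict(measured_tflops: float, threshold_tflops: float) -> str:
    """Return PASS / WARN / FAIL for one precision (hypothetical helper)."""
    if measured_tflops >= threshold_tflops:
        return "PASS"
    # WARN band: at least 90% of the threshold but below the threshold itself.
    if measured_tflops >= threshold_tflops * 0.9:
        return "WARN"
    return "FAIL"
```

For example, under these assumptions a bf16 result of 700 TFLOPS falls in the WARN band (745 × 0.9 = 670.5), while 600 TFLOPS is a FAIL.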
375d439abb
feat: add H20 support, improve compute benchmark accuracy, and fix several stability issues
...
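- gpu_specs: add H20/H20-3e (China-compliant variant of H200) spec
definitions, and fix GPU name matching order so that "H200" is not
mis-matched by the "H20" substring
- benchmark(compute): introduce matrix pool rotation to avoid L2 cache
hits + optional torch.compile(max-autotune); add _scaled_mm probing for
FP8; significantly improves the accuracy of measured FP16/BF16/FP8
throughput
- benchmark(memory): add --disableAffinity to nvbandwidth to work around
the fabricmanager NVML incompatibility; automatically fall back to
PyTorch when all results are 0; exclude diagonal zeros from the D2D
average
- nccl: each collective operation (AllReduce/AllToAll/Broadcast, etc.)
uses its own bandwidth threshold fraction, avoiding false WARNs on
AllToAll
- rdma: filter ports only by link_layer=InfiniBand; when there is no IB
hardware or all ports are DOWN, SKIP instead of raising an error
- stress: cap compute matrix size at 4096, and dispatch work to all GPUs
concurrently before synchronizing once, fixing the severe duration
overrun caused by serial execution across 8 GPUs
- report: handle the RDMA SKIP status and Memory verdicts for the PyTorch
fallback case, so fallback results are not misjudged as FAIL
- config: add benchmark.compute.use_compile switch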
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-12 21:41:46 +08:00
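The name-matching fix in this commit addresses a substring pitfall: "H20" is contained in "H200", so checking "H20" first would misclassify an H200. One possible way to avoid this is to try longer keys first. The commit message only says the matching order was fixed, so the longest-first strategy, the `KNOWN_GPUS` list, and the `match_gpu` helper below are all assumptions made for illustration.

```python
from typing import Optional

# Hypothetical key list; the real gpu_specs table is not shown in the log.
KNOWN_GPUS = ["H20", "H100", "H200", "H800"]


def match_gpu(device_name: str) -> Optional[str]:
    """Return the spec key contained in device_name, trying longer keys first."""
    upper = device_name.upper()
    # Longest keys first, so "H200" is tested before its substring "H20".
    for key in sorted(KNOWN_GPUS, key=len, reverse=True):
        if key in upper:
            return key
    return None
```

With this ordering, a device string containing "H200" resolves to the H200 entry, while a string such as "NVIDIA H20" still resolves to H20 because no longer key matches it.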
qinyusen
fefef8e03b
refactor: remove hardcoding, fix AMP bug, unify English output
...
- Fix AMP autocast: bf16 now uses torch.amp.autocast (was skipped)
- Fix NCCL threshold: unknown GPU gets 10 GB/s floor instead of 0
- Fix PCIe health check: use specs-driven pcie_gen, not hardcoded Gen4
- Remove hardcoded GPU lists: dynamic banner, CLI choices, version
- Unknown GPU efficiency displays N/A instead of 0%
- Unify all console output to English (stress_test, gpu_tester)
- Use importlib.metadata for runtime version resolution
- Remove dir="/tmp" from tempfile (use system default)
🤖 Generated with [Qoder](https://qoder.com)
2026-05-07 21:32:35 +08:00
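Runtime version resolution via `importlib.metadata`, as mentioned in this commit, might look something like the sketch below. The helper name `resolve_version` and the fallback string are hypothetical; the distribution name passed in would be whatever the project declares in its packaging metadata, which the log does not state.

```python
from importlib.metadata import PackageNotFoundError, version


def resolve_version(dist_name: str, fallback: str = "unknown") -> str:
    """Look up the installed version of dist_name (hypothetical helper)."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # Raised when the distribution is not installed, e.g. when running
        # straight from a source checkout without installing the package.
        return fallback
```

Falling back to a placeholder rather than raising keeps a banner or `--version` flag usable when the tool is run uninstalled.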
qinyusen
3e967dd34a
feat: add Ampere (A100/A800) support and generalize project naming
...
- Expand GPU specs database to include A100/A800 with Ampere architecture parameters
- Rename h200_tester.py to gpu_tester.py for architecture-neutral branding
- Add driver/CUDA compatibility validation per GPU generation
- Enhance report module with HTML and Markdown output formats
- Improve nvbandwidth binary discovery (system paths, DCGM locations)
- Add pyproject.toml with uv for dependency management
- Update install_deps.sh, configs, and README for multi-architecture support
🤖 Generated with [Qoder](https://qoder.com)
2026-05-07 01:02:28 +08:00
qinyusen
98e4977e28
add: GPU specs database with auto-detection (H100/H200/B200/B300)
...
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-05-06 19:31:44 +08:00