65 Commits

Author SHA1 Message Date
cs
017c981062 Remove remaining report docs from PR 2026-05-26 00:44:56 +08:00
cs
1c3c811254 Remove generated reports from PR 2026-05-26 00:44:39 +08:00
cs
7ec2da18bc Clean report whitespace 2026-05-26 00:15:48 +08:00
cs
4dddab27b3 Add FP8 GEMM path comparison reports 2026-05-26 00:13:33 +08:00
cs
4484c731b6 Add H100 acceptance PR summary 2026-05-26 00:12:59 +08:00
cs
f80a3b3636 Add H100 acceptance delivery manifest 2026-05-26 00:12:59 +08:00
cs
639651ef24 Add H100 network escalation request 2026-05-26 00:12:59 +08:00
cs
edb4612cc6 Add H100 acceptance closure checklist 2026-05-26 00:12:59 +08:00
cs
1203b025a0 Document H100 acceptance entrypoint 2026-05-26 00:12:59 +08:00
cs
5b022d5849 Summarize current H100 acceptance status 2026-05-26 00:12:59 +08:00
cs
90c46e40b3 Archive all-collectives NCCL artifacts 2026-05-26 00:12:59 +08:00
cs
c2db68f608 Add multinode NCCL all collectives run 2026-05-26 00:12:59 +08:00
cs
e0cb796b0c Analyze multinode NCCL artifact signals 2026-05-26 00:12:59 +08:00
cs
4d06639129 Record multinode NCCL artifacts run 2026-05-26 00:12:59 +08:00
cs
098d1715f2 Archive multinode NCCL raw artifacts 2026-05-26 00:12:59 +08:00
cs
7bc15742ea Clarify multinode NCCL report thresholds 2026-05-26 00:12:59 +08:00
cs
c73d738557 Record multinode NCCL PDF matrix run 2026-05-26 00:12:55 +08:00
cs
8923270ce0 Add multinode NCCL PDF matrix runner 2026-05-26 00:12:55 +08:00
cs
2c5c31e451 Add single-node H100 all runner 2026-05-26 00:12:55 +08:00
cs
cadfbcfaa3 Add NCCL environment snapshot script 2026-05-26 00:12:55 +08:00
cs
ef56e5f15a Add NCCL latest report index 2026-05-26 00:12:55 +08:00
cs
892f833ff4 Add NCCL network handoff plan 2026-05-26 00:12:55 +08:00
cs
f64e85efaf Document NCCL environment equivalence gaps 2026-05-26 00:12:55 +08:00
cs
c183f5a9d1 Document NCCL deep diagnosis rerun 2026-05-26 00:12:55 +08:00
cs
b55666948c Add multinode NCCL deep diagnosis tools 2026-05-26 00:12:55 +08:00
cs
24a7bd5c1b Document NCCL graph comparison 2026-05-26 00:12:55 +08:00
cs
82c6316716 Document NCCL alltoall secondary sweep 2026-05-26 00:12:55 +08:00
cs
1813c11bbf Compare NCCL allreduce alltoall counters 2026-05-26 00:12:55 +08:00
cs
edc469cee9 Document NCCL alltoall counter probe 2026-05-26 00:12:55 +08:00
cs
2e194ded14 Document PXN alltoall rail balancing 2026-05-26 00:12:55 +08:00
cs
619a471634 Tune multinode alltoall PXN behavior 2026-05-26 00:12:54 +08:00
cs
a64e964e3c Add raw RDMA rail bandwidth evidence 2026-05-26 00:12:54 +08:00
cs
ce363b2f7a Document missing NCCL network plugin 2026-05-26 00:12:54 +08:00
cs
e756f0b7b4 Document NCCL rail saturation evidence 2026-05-26 00:12:54 +08:00
cs
aa05ccab2e Add NCCL PDF matrix topology report 2026-05-26 00:12:54 +08:00
cs
6c9f049b71 Tune multinode NCCL auto parameters 2026-05-26 00:12:50 +08:00
cs
1f907e9691 Validate NCCL 2.27 multinode GDR performance 2026-05-26 00:12:50 +08:00
cs
c660e04c99 Stabilize multinode NCCL launch diagnostics 2026-05-26 00:12:50 +08:00
cs
4b93fc785f Add multinode NCCL diagnostic report 2026-05-26 00:12:43 +08:00
cs
4b17bafd53 Add multi-node NCCL sweep test 2026-05-26 00:12:25 +08:00
cs
86f15544d7 Add H100 acceptance test coverage and reports 2026-05-26 00:12:10 +08:00
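dd77a882f1 feat: merge cross-node RDMA into rdma_test.py + align H800 compute thresholds with H100
- modules/rdma_test.py: add SSH-orchestrated cross-node RDMA (run_cross_node /
  _cross_node_perftest / parsers); for each device, the client side starts the
  peer's perftest server and then runs the local client, replacing the deleted
  scripts/rdma_cross_node.sh; measured on two nodes with 4×NDR400, all PASS
  (~387-392 Gb/s, ~2 µs).
- configs/default.yaml: add the rdma.cross_node config block (default enabled:false).
- modules/gpu_specs.py: align H800 PASS thresholds with the measured H100 floor
  (tf32 400->385, bf16 720->730, fp8 1400->1200); H800 uses the same silicon as H100,
  and the PyTorch tensorwise fp8 ceiling is ~1310, so the original 1400 was unreachable.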

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 19:38:43 +08:00
e49ea32094 feat: add multi-node nccl test script 2026-05-25 14:19:02 +08:00
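fc97a768cf feat: update test metrics and verdict logic to the H100 production acceptance standard
- gpu_specs: add a compute_pass_thresholds_tflops field for H100
  (fp32:54 / tf32:444 / fp16:734 / bf16:745 / fp8:1400),
  decoupled from marketing peak and used as absolute TFLOPS PASS thresholds
- benchmark: expose pass_thresholds_tflops in compute results for use by report
- report: compute verdicts now use absolute TFLOPS (PASS ≥ threshold / WARN ≥ threshold×90% /
  FAIL < threshold×90%); table header switches to a Threshold column; Memory D2D verdict
  tightened from 50/30 to 80/60; GPUs without configured thresholds keep the old % efficiency logic
- nccl: tighten _OP_BW_FRACTIONS to AllReduce/AllGather/ReduceScatter
  0.45, Broadcast/SendRecv 0.40, AllToAll 0.35, consistent with acceptance document §5
- configs: benchmark defaults matrix_size 4096→8192, warmup 10→50,
  iterations 100→500, use_compile set to true; health temp_warning
  80→75, temp_critical 90→85, matching the production acceptance steady-state temperature requirement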

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-13 14:52:41 +08:00
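The absolute-TFLOPS verdict rule in fc97a768cf (PASS at or above the threshold, WARN at or above 90% of it, FAIL below that) can be sketched roughly as follows. This is an illustration only: the function name `compute_verdict` and the `H100_THRESHOLDS_TFLOPS` dict are hypothetical and not taken from the repository; only the threshold numbers and the 90% band come from the commit message.

```python
# Thresholds quoted in the commit message (TFLOPS); the dict name is hypothetical.
H100_THRESHOLDS_TFLOPS = {"fp32": 54, "tf32": 444, "fp16": 734, "bf16": 745, "fp8": 1400}

WARN_FRACTION = 0.90  # WARN band: at least 90% of the PASS threshold


def compute_verdict(measured_tflops: float, threshold_tflops: float) -> str:
    """Return PASS / WARN / FAIL for a measured throughput against an absolute threshold."""
    if measured_tflops >= threshold_tflops:
        return "PASS"
    if measured_tflops >= threshold_tflops * WARN_FRACTION:
        return "WARN"
    return "FAIL"


if __name__ == "__main__":
    # Hypothetical measurements, for illustration only.
    for dtype, measured in {"bf16": 750.0, "fp16": 700.0, "fp8": 1200.0}.items():
        print(dtype, compute_verdict(measured, H100_THRESHOLDS_TFLOPS[dtype]))
```

Keeping the threshold table separate from the verdict function mirrors the commit's intent of decoupling PASS criteria from marketing peak numbers.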
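375d439abb feat: add H20 support, improve compute benchmark accuracy, and fix several stability issues
- gpu_specs: add H20/H20-3e (China-compliant version of H200) spec definitions, and fix
  the GPU name matching order so that "H200" is not mismatched by the "H20" substring
- benchmark(compute): introduce matrix pool rotation to avoid L2 cache hits, plus
  optional torch.compile(max-autotune); add _scaled_mm probing for FP8,
  significantly improving the accuracy of measured FP16/BF16/FP8 throughput
- benchmark(memory): add --disableAffinity to nvbandwidth to work around a
  fabricmanager NVML incompatibility; fall back to PyTorch automatically when all results are 0;
  exclude diagonal zeros from the D2D average
- nccl: each collective (AllReduce/AllToAll/Broadcast, etc.) uses its own
  bandwidth threshold fraction, avoiding false WARNs for AllToAll
- rdma: filter ports by link_layer=InfiniBand only; when there is no IB hardware or all
  ports are DOWN, SKIP directly instead of raising an error
- stress: cap the compute matrix size at 4096 and dispatch work concurrently before a single
  synchronization, fixing severe duration overruns caused by serial execution across 8 GPUs
- report: handle the RDMA SKIP status and the PyTorch fallback case in Memory verdicts,
  so that fallback results are not misjudged as FAIL
- config: add the benchmark.compute.use_compile switch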

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-12 21:41:46 +08:00
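The "H200" versus "H20" substring problem mentioned in 375d439abb can be illustrated with a small sketch. The commit only says the matching order was fixed; trying longer keys first is one possible way to do that, not necessarily what gpu_specs does. The names `KNOWN_GPUS` and `match_gpu` are hypothetical.

```python
from typing import Optional

# Hypothetical spec keys; the real gpu_specs table is not shown in the commit log.
KNOWN_GPUS = ["H100", "H20", "H200", "H20-3E"]


def match_gpu(device_name: str) -> Optional[str]:
    """Return the spec key contained in device_name, preferring the longest key.

    Checking longer keys first prevents "H20" from matching a device
    whose name actually contains "H200" or "H20-3e".
    """
    upper = device_name.upper()
    for key in sorted(KNOWN_GPUS, key=len, reverse=True):
        if key in upper:
            return key
    return None


if __name__ == "__main__":
    for name in ["NVIDIA H200", "NVIDIA H20", "NVIDIA H20-3e", "Tesla V100"]:
        print(name, "->", match_gpu(name))
```

A naive loop over `KNOWN_GPUS` in list order would return "H20" for "NVIDIA H200", which is the mismatch the commit describes.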
ef2ca11c58 fix: resolve FP8 benchmark, NCCL parsing, and report None-value bugs
- benchmark.py: FP8 dtype now uses torch._scaled_mm() with scale tensors
  instead of torch.matmul() which does not support float8_e4m3fn on Hopper;
  fixes "addmm_cuda not implemented" error and enables FP8 TFLOPS measurement

- nccl_test.py: fix two bugs causing all-zero bandwidth results
  1. buffer size changed from -b 8 (8 bytes) to -b 8M -e 8G for meaningful load
  2. column parser corrected: parts[2] is dtype string not time value;
     now reads time=parts[5], algbw=parts[6], busbw=parts[7] per nccl-tests format

- report.py: replace .get(key, 0) with .get(key) or 0 at all bandwidth/stress
  fields to handle None values stored in result dicts (dict.get with default
  does not override an explicitly stored None)

- .gitignore: exclude .claude/settings.local.json

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-10 15:53:12 +08:00
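The report.py fix in ef2ca11c58 relies on a detail of `dict.get`: the default is used only when the key is missing, not when the key is present with a value of `None`. A minimal sketch of the difference, using a hypothetical result dict and helper name:

```python
def value_or_zero(result: dict, key: str) -> float:
    """Return result[key], or 0 when the key is missing or its value is None.

    Note that `or 0` also maps other falsy values (such as 0.0) to 0,
    which is harmless for bandwidth fields.
    """
    return result.get(key) or 0


if __name__ == "__main__":
    # Hypothetical result dict: busbw was stored explicitly as None.
    result = {"algbw": 12.5, "busbw": None}

    print(result.get("busbw", 0))          # None: the default is ignored
    print(value_or_zero(result, "busbw"))  # 0
    print(value_or_zero(result, "time"))   # 0 (missing key)
    print(value_or_zero(result, "algbw"))  # 12.5
```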
09f81973bc Merge pull request 'hanks/test_gpu' (#1) from hanks/test_gpu into main
Reviewed-on: #1
2026-05-07 21:34:56 +08:00
qinyusen
fefef8e03b refactor: remove hardcoding, fix AMP bug, unify English output
- Fix AMP autocast: bf16 now uses torch.amp.autocast (was skipped)
- Fix NCCL threshold: unknown GPU gets 10 GB/s floor instead of 0
- Fix PCIe health check: use specs-driven pcie_gen, not hardcoded Gen4
- Remove hardcoded GPU lists: dynamic banner, CLI choices, version
- Unknown GPU efficiency displays N/A instead of 0%
- Unify all console output to English (stress_test, gpu_tester)
- Use importlib.metadata for runtime version resolution
- Remove dir="/tmp" from tempfile (use system default)

🤖 Generated with [Qoder](https://qoder.com)
2026-05-07 21:32:35 +08:00
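The "importlib.metadata for runtime version resolution" item in fefef8e03b might look something like the sketch below. The helper name `resolve_version`, the distribution name in the usage example, and the fallback string are assumptions for illustration; the repository's actual code is not shown in the log.

```python
from importlib.metadata import PackageNotFoundError, version


def resolve_version(dist_name: str, fallback: str = "unknown") -> str:
    """Look up the installed version of a distribution at runtime.

    Falls back to a placeholder when the package is not installed
    (for example, when running straight from a source checkout).
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return fallback


if __name__ == "__main__":
    # "gpu-tester" is a hypothetical distribution name.
    print(resolve_version("gpu-tester"))
```

Resolving the version from installed metadata avoids keeping a hardcoded version string in sync with pyproject.toml.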
qinyusen
f2158f6cd3 fix: resolve stress OOM, D2D efficiency calculation, NCCL execution failures
Key changes:
- stress_test: use torch.cuda.mem_get_info() for free memory instead of total,
  allocate 40% to avoid OOM when other processes occupy GPU memory
- benchmark: fix D2D efficiency by comparing to NVLink per-direction bandwidth
  (not HBM), add H2D/D2H efficiency against PCIe peak
- nccl_test: implement direct binary → mpirun → torchrun fallback chain,
  fix min_bw None bug when YAML value is empty
- report: update memory section to use per-metric peak fields
- install_deps.sh: add NCCL compatibility detection, enhance CUDA version
  detection with CUDA_HOME/standard paths, improve _map_cuda_tag logging
- gpu_info: parse CUDA version from nvidia-smi header (query field removed
  in newer drivers)
- health_check: parse throttle_reasons bitmask properly, ignore gpu_idle bit
- gpu_tester: fix suite summary to exclude metadata keys from pass count

🤖 Generated with [Qoder](https://qoder.com)
2026-05-07 18:09:22 +08:00
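The health_check item in f2158f6cd3 (parse the throttle_reasons bitmask, ignore the gpu_idle bit) can be sketched as below. The bit values follow NVML's nvmlClocksThrottleReason constants as commonly documented and should be verified against the installed driver headers; the names `THROTTLE_BITS` and `active_throttle_reasons` are hypothetical and not taken from the repository.

```python
from typing import List, Union

# Bit values assumed from NVML's nvmlClocksThrottleReason* constants.
THROTTLE_BITS = {
    0x01: "gpu_idle",
    0x02: "applications_clocks_setting",
    0x04: "sw_power_cap",
    0x08: "hw_slowdown",
    0x10: "sync_boost",
    0x20: "sw_thermal_slowdown",
    0x40: "hw_thermal_slowdown",
    0x80: "hw_power_brake_slowdown",
}


def active_throttle_reasons(mask: Union[int, str], ignore=("gpu_idle",)) -> List[str]:
    """Decode a throttle-reasons bitmask into names, skipping ignored bits.

    Accepts an int or a hex string such as "0x0000000000000004".
    """
    if isinstance(mask, str):
        mask = int(mask, 16)
    return [
        name
        for bit, name in THROTTLE_BITS.items()
        if mask & bit and name not in ignore
    ]


if __name__ == "__main__":
    print(active_throttle_reasons("0x0000000000000001"))  # idle only: nothing to report
    print(active_throttle_reasons(0x01 | 0x04 | 0x40))
```

Ignoring the idle bit keeps an otherwise healthy, unloaded GPU from being flagged as throttled.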
qinyusen
24934bc182 feat: rewrite install_deps.sh with env isolation and add numpy to requirements
- Complete rewrite of install_deps.sh (6-phase architecture):
  environment validation, uv-based venv isolation, CUDA auto-detection,
  idempotent native tool compilation, env.sh/run-gpu-tests generation
- Add numpy>=1.24 to requirements.txt to align with pyproject.toml
- Support --install-system-deps, --skip-pytorch, --rebuild, -y flags
- Use subshells for compilation to prevent CWD pollution
- Generate env.sh activation script and run-gpu-tests wrapper

🤖 Generated with [Qoder](https://qoder.com)
2026-05-07 01:32:13 +08:00