test_gpu_scripts

han.zhao/test_gpu_scripts

Commit Graph

Author

SHA1

Message

Date

zulifeng

ef2ca11c58

fix: resolve FP8 benchmark, NCCL parsing, and report None-value bugs

- benchmark.py: FP8 dtype now uses torch._scaled_mm() with scale tensors
  instead of torch.matmul() which does not support float8_e4m3fn on Hopper;
  fixes "addmm_cuda not implemented" error and enables FP8 TFLOPS measurement

- nccl_test.py: fix two bugs causing all-zero bandwidth results
  1. buffer size changed from -b 8 (8 bytes) to -b 8M -e 8G for meaningful load
  2. column parser corrected: parts[2] is dtype string not time value;
     now reads time=parts[5], algbw=parts[6], busbw=parts[7] per nccl-tests format

- report.py: replace .get(key, 0) with .get(key) or 0 at all bandwidth/stress
  fields to handle None values stored in result dicts (dict.get with default
  does not override an explicitly stored None)

- .gitignore: exclude .claude/settings.local.json
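Note on the nccl_test.py column fix above: a minimal sketch of a row parser using the indices named in the commit message (time=parts[5], algbw=parts[6], busbw=parts[7]). The function name `parse_nccl_line`, the dictionary keys, and the sample row values are hypothetical and are not taken from the repository; the column order assumes the standard nccl-tests layout (size, count, type, redop, root, then the out-of-place time, algbw, busbw).

```python
def parse_nccl_line(line):
    """Parse one data row of nccl-tests output; return None for any other line."""
    parts = line.split()
    # Data rows start with the message size in bytes; header rows start with '#'.
    if len(parts) < 8 or not parts[0].isdigit():
        return None
    try:
        return {
            "size_bytes": int(parts[0]),
            "dtype": parts[2],            # a string such as "float", not a timing
            "time_us": float(parts[5]),   # out-of-place time
            "algbw": float(parts[6]),     # algorithm bandwidth
            "busbw": float(parts[7]),     # bus bandwidth
        }
    except ValueError:
        return None


# Illustrative row only; the numbers are made up, not measured.
sample = "8388608 2097152 float sum -1 123.4 67.98 118.96 0 122.9 68.25 119.44 0"
row = parse_nccl_line(sample)
header = parse_nccl_line("# size count type redop root time algbw busbw #wrong")
```

Reading parts[2] as a number fails on "float", which is one way a parser can end up reporting zero bandwidth for every row.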

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-10 15:53:12 +08:00
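
Note on the report.py change in ef2ca11c58: a short illustration of why `.get(key, 0)` does not help when the key exists with a stored `None`. The dictionary and key names here are hypothetical.

```python
# The key is present, but its stored value is None.
result = {"busbw": None}

with_default = result.get("busbw", 0)   # default is ignored because the key exists
with_or = result.get("busbw") or 0      # None is falsy, so the fallback applies

# A missing key behaves the same under both forms.
missing_default = result.get("algbw", 0)
missing_or = result.get("algbw") or 0
```

One caveat of the `or 0` form: it also replaces other falsy values such as `0.0` or an empty string, which is harmless when the fallback is itself zero.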

qinyusen

f2158f6cd3

fix: resolve stress OOM, D2D efficiency calculation, NCCL execution failures

Key changes:
- stress_test: use torch.cuda.mem_get_info() for free memory instead of total,
  allocate 40% to avoid OOM when other processes occupy GPU memory
- benchmark: fix D2D efficiency by comparing to NVLink per-direction bandwidth
  (not HBM), add H2D/D2H efficiency against PCIe peak
- nccl_test: implement direct binary → mpirun → torchrun fallback chain,
  fix min_bw None bug when YAML value is empty
- report: update memory section to use per-metric peak fields
- install_deps.sh: add NCCL compatibility detection, enhance CUDA version
  detection with CUDA_HOME/standard paths, improve _map_cuda_tag logging
- gpu_info: parse CUDA version from nvidia-smi header (query field removed
  in newer drivers)
- health_check: parse throttle_reasons bitmask properly, ignore gpu_idle bit
- gpu_tester: fix suite summary to exclude metadata keys from pass count
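
Note on the health_check item above: a minimal sketch of decoding a throttle-reasons bitmask while ignoring the idle bit. The function and table names are hypothetical, and the bit values are the NVML clocks-throttle-reason constants as I understand them from nvml.h; check them against the header shipped with your driver.

```python
# Bit values assumed from NVML's nvmlClocksThrottleReason* constants.
THROTTLE_BITS = {
    0x001: "gpu_idle",
    0x002: "applications_clocks_setting",
    0x004: "sw_power_cap",
    0x008: "hw_slowdown",
    0x010: "sync_boost",
    0x020: "sw_thermal_slowdown",
    0x040: "hw_thermal_slowdown",
    0x080: "hw_power_brake_slowdown",
    0x100: "display_clock_setting",
}
GPU_IDLE = 0x001


def active_throttle_reasons(raw):
    """Return the names of set throttle bits, excluding gpu_idle."""
    # Accept either an int or a hex string such as "0x0000000000000005".
    mask = int(raw.strip(), 16) if isinstance(raw, str) else int(raw)
    mask &= ~GPU_IDLE  # an idle GPU is not a health problem
    return [name for bit, name in THROTTLE_BITS.items() if mask & bit]


idle_only = active_throttle_reasons("0x0000000000000001")
idle_and_power = active_throttle_reasons("0x0000000000000005")
thermal = active_throttle_reasons(0x48)
```

Masking out the idle bit before testing the others means a GPU that is merely idle yields an empty list, so it is not flagged as throttled.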

🤖 Generated with [Qoder][https://qoder.com]

2026-05-07 18:09:22 +08:00

qinyusen

418dc70efb

init: project scaffolding with README, config, and requirements

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>

2026-04-25 17:23:27 +08:00

3 Commits