test_gpu_scripts/.gitignore at f2158f6cd3d15307b79a9cee69afe8b9d5841962 - test_gpu_scripts - 地瓜机器人

han.zhao/test_gpu_scripts

qinyusen f2158f6cd3 fix: resolve stress OOM, D2D efficiency calculation, NCCL execution failures

Key changes:
- stress_test: use torch.cuda.mem_get_info() for free memory instead of total,
  allocate 40% to avoid OOM when other processes occupy GPU memory
- benchmark: fix D2D efficiency by comparing to NVLink per-direction bandwidth
  (not HBM), add H2D/D2H efficiency against PCIe peak
- nccl_test: implement direct binary → mpirun → torchrun fallback chain,
  fix min_bw None bug when YAML value is empty
- report: update memory section to use per-metric peak fields
- install_deps.sh: add NCCL compatibility detection, enhance CUDA version
  detection with CUDA_HOME/standard paths, improve _map_cuda_tag logging
- gpu_info: parse CUDA version from nvidia-smi header (query field removed
  in newer drivers)
- health_check: parse throttle_reasons bitmask properly, ignore gpu_idle bit
- gpu_tester: fix suite summary to exclude metadata keys from pass count
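
Three of the fixes above (health_check, stress_test, nccl_test) can be sketched in a few lines. This is a minimal illustration, not the repository's code: the function names, the `percent` parameter and the default bandwidth value are hypothetical. The bit values are assumed to follow NVML's `nvmlClocksThrottleReason*` constants, as reported in hex by `nvidia-smi`.

```python
from typing import List, Optional

# Assumed mapping of NVML throttle-reason bits to readable names.
THROTTLE_BITS = {
    0x001: "gpu_idle",
    0x002: "applications_clocks_setting",
    0x004: "sw_power_cap",
    0x008: "hw_slowdown",
    0x010: "sync_boost",
    0x020: "sw_thermal_slowdown",
    0x040: "hw_thermal_slowdown",
    0x080: "hw_power_brake_slowdown",
    0x100: "display_clock_setting",
}

# An idle GPU lowers its clocks on purpose, so this bit is not a fault.
IGNORED_REASONS = {"gpu_idle"}


def active_throttle_reasons(mask_text: str) -> List[str]:
    """Decode a hex bitmask string such as '0x0000000000000005'."""
    mask = int(mask_text.strip(), 16)
    return [
        name
        for bit, name in THROTTLE_BITS.items()
        if mask & bit and name not in IGNORED_REASONS
    ]


def stress_alloc_bytes(free_bytes: int, percent: int = 40) -> int:
    """Size the stress buffer from free memory, not total memory."""
    # Integer arithmetic avoids floating-point rounding on large byte counts.
    return free_bytes * percent // 100


def resolve_min_bw(raw: Optional[float], default: float) -> float:
    """Fall back to a default when the YAML value is empty (parsed as None)."""
    return default if raw is None else float(raw)
```

In the stress test, `free_bytes` would come from the first element of the `(free, total)` tuple returned by `torch.cuda.mem_get_info()`. The `None` check matters because an empty YAML value still creates the key, so a plain `dict.get("min_bw", default)` returns `None` rather than the default.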

🤖 Generated with [Qoder](https://qoder.com)

2026-05-07 18:09:22 +08:00

17 lines

130 B

Plaintext


__pycache__/
*.pyc
*.pyo
.pytest_cache/
*.egg-info/
dist/
build/
reports/
*.egg
.eggs/
*.log
.DS_Store
.env
.venv/
venv/
.qoder/*