test_gpu_scripts

han.zhao/test_gpu_scripts

Fork 0

Commit Graph

Author	SHA1	Message	Date
qinyusen	fefef8e03b	refactor: remove hardcoding, fix AMP bug, unify English output - Fix AMP autocast: bf16 now uses torch.amp.autocast (was skipped) - Fix NCCL threshold: unknown GPU gets 10 GB/s floor instead of 0 - Fix PCIe health check: use specs-driven pcie_gen, not hardcoded Gen4 - Remove hardcoded GPU lists: dynamic banner, CLI choices, version - Unknown GPU efficiency displays N/A instead of 0% - Unify all console output to English (stress_test, gpu_tester) - Use importlib.metadata for runtime version resolution - Remove dir="/tmp" from tempfile (use system default) 🤖 Generated with [Qoder][https://qoder.com]	2026-05-07 21:32:35 +08:00
qinyusen	f2158f6cd3	fix: resolve stress OOM, D2D efficiency calculation, NCCL execution failures Key changes: - stress_test: use torch.cuda.mem_get_info() for free memory instead of total, allocate 40% to avoid OOM when other processes occupy GPU memory - benchmark: fix D2D efficiency by comparing to NVLink per-direction bandwidth (not HBM), add H2D/D2H efficiency against PCIe peak - nccl_test: implement direct binary → mpirun → torchrun fallback chain, fix min_bw None bug when YAML value is empty - report: update memory section to use per-metric peak fields - install_deps.sh: add NCCL compatibility detection, enhance CUDA version detection with CUDA_HOME/standard paths, improve _map_cuda_tag logging - gpu_info: parse CUDA version from nvidia-smi header (query field removed in newer drivers) - health_check: parse throttle_reasons bitmask properly, ignore gpu_idle bit - gpu_tester: fix suite summary to exclude metadata keys from pass count 🤖 Generated with [Qoder][https://qoder.com]	2026-05-07 18:09:22 +08:00
qinyusen	3e967dd34a	feat: add Ampere (A100/A800) support and generalize project naming - Expand GPU specs database to include A100/A800 with Ampere architecture parameters - Rename h200_tester.py to gpu_tester.py for architecture-neutral branding - Add driver/CUDA compatibility validation per GPU generation - Enhance report module with HTML and Markdown output formats - Improve nvbandwidth binary discovery (system paths, DCGM locations) - Add pyproject.toml with uv for dependency management - Update install_deps.sh, configs, and README for multi-architecture support 🤖 Generated with [Qoder][https://qoder.com]	2026-05-07 01:02:28 +08:00

Author

SHA1

Message

Date

qinyusen

fefef8e03b

refactor: remove hardcoding, fix AMP bug, unify English output

- Fix AMP autocast: bf16 now uses torch.amp.autocast (was skipped)
- Fix NCCL threshold: unknown GPU gets 10 GB/s floor instead of 0
- Fix PCIe health check: use specs-driven pcie_gen, not hardcoded Gen4
- Remove hardcoded GPU lists: dynamic banner, CLI choices, version
- Unknown GPU efficiency displays N/A instead of 0%
- Unify all console output to English (stress_test, gpu_tester)
- Use importlib.metadata for runtime version resolution
- Remove dir="/tmp" from tempfile (use system default)

🤖 Generated with [Qoder][https://qoder.com]

2026-05-07 21:32:35 +08:00

qinyusen

f2158f6cd3

fix: resolve stress OOM, D2D efficiency calculation, NCCL execution failures

Key changes:
- stress_test: use torch.cuda.mem_get_info() for free memory instead of total,
  allocate 40% to avoid OOM when other processes occupy GPU memory
- benchmark: fix D2D efficiency by comparing to NVLink per-direction bandwidth
  (not HBM), add H2D/D2H efficiency against PCIe peak
- nccl_test: implement direct binary → mpirun → torchrun fallback chain,
  fix min_bw None bug when YAML value is empty
- report: update memory section to use per-metric peak fields
- install_deps.sh: add NCCL compatibility detection, enhance CUDA version
  detection with CUDA_HOME/standard paths, improve _map_cuda_tag logging
- gpu_info: parse CUDA version from nvidia-smi header (query field removed
  in newer drivers)
- health_check: parse throttle_reasons bitmask properly, ignore gpu_idle bit
- gpu_tester: fix suite summary to exclude metadata keys from pass count

🤖 Generated with [Qoder][https://qoder.com]

2026-05-07 18:09:22 +08:00

qinyusen

3e967dd34a

feat: add Ampere (A100/A800) support and generalize project naming

- Expand GPU specs database to include A100/A800 with Ampere architecture parameters
- Rename h200_tester.py to gpu_tester.py for architecture-neutral branding
- Add driver/CUDA compatibility validation per GPU generation
- Enhance report module with HTML and Markdown output formats
- Improve nvbandwidth binary discovery (system paths, DCGM locations)
- Add pyproject.toml with uv for dependency management
- Update install_deps.sh, configs, and README for multi-architecture support

🤖 Generated with [Qoder][https://qoder.com]

2026-05-07 01:02:28 +08:00

3 Commits