- benchmark.py: FP8 dtype now uses torch._scaled_mm() with scale tensors
instead of torch.matmul(), which does not support float8_e4m3fn on Hopper;
this fixes the "addmm_cuda not implemented" error and enables FP8 TFLOPS
measurement
- nccl_test.py: fix two bugs that caused all-zero bandwidth results
  1. buffer size changed from -b 8 (8 bytes) to a -b 8M -e 8G sweep, so
     messages are large enough to load the interconnect meaningfully
  2. column parser corrected: parts[2] is the dtype string, not the time;
     it now reads time=parts[5], algbw=parts[6], busbw=parts[7] per the
     nccl-tests output format
- report.py: replace .get(key, 0) with .get(key) or 0 in all bandwidth/stress
fields so that None values stored in result dicts are treated as 0 (dict.get's
default applies only to missing keys, not to a key explicitly set to None)
- .gitignore: exclude .claude/settings.local.json
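The benchmark.py bullet reports FP8 throughput in TFLOPS, which is derived from a GEMM timing. A minimal sketch of that arithmetic, assuming the benchmark times one (m x k) @ (k x n) product and counts each multiply-add as two FLOPs; the helper name gemm_tflops is hypothetical and not taken from benchmark.py:

```python
def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Throughput of one (m x k) @ (k x n) matmul that took `seconds` to run.

    Each of the m * n outputs needs k multiply-adds, i.e. 2 * k FLOPs,
    so the whole GEMM performs 2 * m * n * k floating-point operations.
    """
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    flops = 2.0 * m * n * k
    return flops / seconds / 1e12  # FLOP/s -> TFLOPS


if __name__ == "__main__":
    # An 8192^3 GEMM timed at 1 ms
    print(f"{gemm_tflops(8192, 8192, 8192, 1e-3):.1f} TFLOPS")  # 1099.5 TFLOPS
```

The formula is the same for every dtype; only the kernel being timed changes. Because CUDA launches are asynchronous, the elapsed time should be taken after a synchronization point (for example torch.cuda.synchronize() or CUDA events).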
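The nccl_test.py column fix can be illustrated with a small row parser. This is a sketch assuming the nccl-tests row layout named in the bullet (size, count, type, redop, root, then the out-of-place time, algbw, busbw, #wrong); parse_nccl_line and NcclRow are hypothetical names, not the ones in nccl_test.py:

```python
from typing import NamedTuple, Optional


class NcclRow(NamedTuple):
    size_bytes: int
    time_us: float
    algbw_gbps: float
    busbw_gbps: float


def parse_nccl_line(line: str) -> Optional[NcclRow]:
    """Parse one data row of nccl-tests output, using the out-of-place columns."""
    stripped = line.strip()
    # Header, summary and blank lines carry no per-size data.
    if not stripped or stripped.startswith("#"):
        return None
    parts = stripped.split()
    # Layout: size count type redop root time algbw busbw #wrong ...
    # parts[2] is the dtype string (e.g. "float"), so it is never read as a number.
    if len(parts) < 8 or not parts[0].isdigit():
        return None
    try:
        return NcclRow(
            size_bytes=int(parts[0]),
            time_us=float(parts[5]),
            algbw_gbps=float(parts[6]),
            busbw_gbps=float(parts[7]),
        )
    except ValueError:
        return None
```

Reading parts[2] as the time, as the old parser did, tries to convert the dtype string to a number; if that failure is swallowed, every bandwidth value stays at its zero default.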
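The report.py fix hinges on how dict.get treats a stored None. A minimal sketch of the pattern; the helper name numeric_field and the result keys are hypothetical:

```python
from typing import Any, Dict


def numeric_field(result: Dict[str, Any], key: str) -> float:
    """Return result[key], treating a missing key *or* a stored None as 0.

    result.get(key, 0) falls back to 0 only when the key is absent; if the
    key exists with value None, it returns None. `or 0` covers both cases
    (and also maps other falsy values, such as 0.0, to 0).
    """
    return result.get(key) or 0


if __name__ == "__main__":
    result = {"h2d_gbps": None, "d2h_gbps": 52.3}  # hypothetical keys
    print(result.get("h2d_gbps", 0))          # None: the default is not used
    print(numeric_field(result, "h2d_gbps"))  # 0
    print(numeric_field(result, "missing"))   # 0
    print(numeric_field(result, "d2h_gbps"))  # 52.3
```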
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Key changes:
- stress_test: size the stress buffer from the free memory reported by
torch.cuda.mem_get_info() instead of total memory, and allocate 40% of it to
avoid OOM when other processes occupy GPU memory
- benchmark: compute D2D efficiency against NVLink per-direction bandwidth
instead of HBM bandwidth; add H2D/D2H efficiency against the PCIe peak
- nccl_test: add a direct binary → mpirun → torchrun launcher fallback chain;
fix min_bw becoming None when its YAML value is empty
- report: update memory section to use per-metric peak fields
- install_deps.sh: add NCCL compatibility detection; extend CUDA version
detection to check CUDA_HOME and standard install paths; improve
_map_cuda_tag logging
- gpu_info: parse the CUDA version from the nvidia-smi header (the query field
was removed in newer drivers)
- health_check: decode throttle_reasons as a bitmask and ignore the gpu_idle bit
- gpu_tester: fix suite summary to exclude metadata keys from pass count
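The stress_test change sizes the buffer from free rather than total memory. A minimal sketch of that sizing, assuming float32 elements (element_size=4); stress_alloc_elems is a hypothetical helper that takes the (free_bytes, total_bytes) tuple torch.cuda.mem_get_info() returns, so the arithmetic runs without a GPU:

```python
from typing import Tuple


def stress_alloc_elems(
    mem_info: Tuple[int, int], fraction: float = 0.4, element_size: int = 4
) -> int:
    """Number of elements to allocate for a stress buffer.

    mem_info is the (free_bytes, total_bytes) pair from torch.cuda.mem_get_info().
    Sizing from free memory rather than total leaves room for whatever other
    processes already hold on the GPU; fraction=0.4 matches the 40% above.
    """
    free_bytes, _total_bytes = mem_info
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return int(free_bytes * fraction) // element_size
```

With PyTorch and a GPU available, this would be used roughly as `torch.empty(stress_alloc_elems(torch.cuda.mem_get_info()), dtype=torch.float32, device="cuda")`.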
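The benchmark efficiency change compares each transfer direction against a different peak. A sketch under stated assumptions: the peak values below (450 GB/s NVLink per direction, 64 GB/s PCIe per direction) are illustrative placeholders, not values read from the GPU specs database, and all function and key names are hypothetical:

```python
from typing import Dict, Mapping, Optional

# Illustrative per-direction peaks in GB/s; real values belong in the specs database.
ASSUMED_PEAKS_GBPS: Dict[str, float] = {
    "nvlink_per_direction": 450.0,
    "pcie_per_direction": 64.0,
}

# Which peak each measured direction is compared against.
REFERENCE_PEAK = {
    "d2d": "nvlink_per_direction",  # D2D against NVLink per direction, not HBM
    "h2d": "pcie_per_direction",    # host transfers against the PCIe peak
    "d2h": "pcie_per_direction",
}


def efficiency_pct(measured_gbps: float, peak_gbps: float) -> float:
    if peak_gbps <= 0:
        raise ValueError("peak bandwidth must be positive")
    return 100.0 * measured_gbps / peak_gbps


def transfer_efficiencies(
    measured: Mapping[str, Optional[float]],
    peaks: Mapping[str, float] = ASSUMED_PEAKS_GBPS,
) -> Dict[str, float]:
    """Efficiency (%) per direction; directions without a result are skipped."""
    out: Dict[str, float] = {}
    for direction, peak_key in REFERENCE_PEAK.items():
        value = measured.get(direction)
        if value is not None:
            out[direction] = efficiency_pct(value, peaks[peak_key])
    return out
```

For example, 360 GB/s D2D against the assumed 450 GB/s peak gives 80%, while 48 GB/s H2D against 64 GB/s gives 75%.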
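The nccl_test launcher chain can be sketched as an ordered list of commands tried until one succeeds. This assumes the mpirun variant uses an nccl-tests binary built with MPI support (one rank per GPU, so -g 1) and that the torchrun fallback runs a PyTorch benchmark script; the script name and all helper names are hypothetical:

```python
import subprocess
from typing import Callable, List, Sequence, Tuple

SIZE_SWEEP = ["-b", "8M", "-e", "8G"]


def build_candidates(
    binary: str, num_gpus: int, torch_script: str
) -> List[Tuple[str, List[str]]]:
    """Launch commands in fallback order: direct binary, mpirun, torchrun."""
    return [
        # One process driving all GPUs through -g.
        ("direct", [binary, *SIZE_SWEEP, "-g", str(num_gpus)]),
        # One MPI rank per GPU, so each rank uses -g 1.
        ("mpirun", ["mpirun", "-np", str(num_gpus), binary, *SIZE_SWEEP, "-g", "1"]),
        # PyTorch fallback; torch_script is a hypothetical benchmark script.
        ("torchrun", ["torchrun", f"--nproc_per_node={num_gpus}", torch_script]),
    ]


def default_run(cmd: Sequence[str]) -> str:
    return subprocess.run(list(cmd), check=True, capture_output=True, text=True).stdout


def run_with_fallback(
    candidates: List[Tuple[str, List[str]]],
    run: Callable[[Sequence[str]], str] = default_run,
) -> Tuple[str, str]:
    """Try each launcher in order; return (launcher name, stdout) of the first success."""
    errors: List[str] = []
    for name, cmd in candidates:
        try:
            return name, run(cmd)
        except (FileNotFoundError, subprocess.CalledProcessError) as exc:
            # Missing executable or non-zero exit: record it and try the next one.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all launchers failed: " + "; ".join(errors))
```

Injecting `run` keeps the chain testable without GPUs. The min_bw fix is the same None-handling issue as in report.py: an empty YAML value loads as None, so the threshold needs an explicit fallback rather than relying on a .get default.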
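The gpu_info change reads the CUDA version from the nvidia-smi banner. A minimal sketch, assuming the banner contains a `CUDA Version: X.Y` field; cuda_version_from_smi is a hypothetical name:

```python
import re
from typing import Optional

# Matches the "CUDA Version: 12.4" field in the nvidia-smi banner.
_CUDA_VERSION_RE = re.compile(r"CUDA Version:\s*(\d+\.\d+)")


def cuda_version_from_smi(output: str) -> Optional[str]:
    """Return the CUDA version from plain `nvidia-smi` output, or None if absent."""
    match = _CUDA_VERSION_RE.search(output)
    return match.group(1) if match else None
```

The input would typically be `subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout`. This field reports the highest CUDA version the driver supports, which can differ from the toolkit version found under CUDA_HOME.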
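The health_check change treats the throttle reasons value as a bitmask. A sketch assuming the input is the hex string printed by `nvidia-smi --query-gpu=clocks_throttle_reasons.active --format=csv,noheader`; the bit values follow NVML's nvmlClocksThrottleReason* constants, while the lowercase labels and the function name are hypothetical:

```python
from typing import List

# Bit values of NVML's clocks throttle reasons; labels are illustrative.
THROTTLE_REASON_BITS = {
    0x0000000000000001: "gpu_idle",
    0x0000000000000002: "applications_clocks_setting",
    0x0000000000000004: "sw_power_cap",
    0x0000000000000008: "hw_slowdown",
    0x0000000000000010: "sync_boost",
    0x0000000000000020: "sw_thermal_slowdown",
    0x0000000000000040: "hw_thermal_slowdown",
    0x0000000000000080: "hw_power_brake_slowdown",
    0x0000000000000100: "display_clock_setting",
}
GPU_IDLE_BIT = 0x0000000000000001


def active_throttle_reasons(raw: str, ignore_idle: bool = True) -> List[str]:
    """Decode a hex mask such as '0x0000000000000044' into reason labels.

    Raises ValueError for non-hex values such as '[Not Supported]'.
    """
    mask = int(raw.strip(), 16)  # base 16 accepts the 0x prefix
    if ignore_idle:
        # An idle GPU lowers its clocks by design; that is not a health problem.
        mask &= ~GPU_IDLE_BIT
    # Bits missing from the table are ignored in this sketch.
    return [name for bit, name in sorted(THROTTLE_REASON_BITS.items()) if mask & bit]
```

Testing individual bits avoids treating any nonzero value as a failure, which would flag every idle GPU.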
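The gpu_tester fix keeps metadata entries out of the pass count. A minimal sketch, assuming test results are dicts with a boolean "passed" field stored next to metadata keys in the same suite dict; the key names and the helper are hypothetical:

```python
from typing import Any, Dict

# Hypothetical metadata keys stored alongside test results.
METADATA_KEYS = frozenset({"timestamp", "hostname", "gpu_model"})


def summarize_suite(results: Dict[str, Any]) -> Dict[str, int]:
    """Count passed tests, ignoring metadata entries."""
    tests = {k: v for k, v in results.items() if k not in METADATA_KEYS}
    passed = sum(
        1 for v in tests.values() if isinstance(v, dict) and v.get("passed") is True
    )
    return {"passed": passed, "total": len(tests)}
```

Without the filter, metadata entries would inflate the total and skew the reported pass ratio.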
🤖 Generated with [Qoder][https://qoder.com]
- Expand GPU specs database to include A100/A800 with Ampere architecture parameters
- Rename h200_tester.py to gpu_tester.py for architecture-neutral branding
- Add driver/CUDA compatibility validation per GPU generation
- Enhance report module with HTML and Markdown output formats
- Improve nvbandwidth binary discovery (system paths, DCGM locations)
- Add pyproject.toml with uv for dependency management
- Update install_deps.sh, configs, and README for multi-architecture support