The PCIe link health check was producing inconsistent verdicts: when the
negotiated link did not meet the GPU's expected Gen/Width (e.g. an H200
running at Gen4 instead of Gen5, or any GPU dropping below x16), the code
correctly flipped overall_pass to False but recorded the per-GPU status
as "WARN" rather than "FAIL".

This mismatch broke the convention used by every other check in the
module (temperature, ECC, throttling), where FAIL is the only status
that drives overall_pass=False and WARN is purely informational. As a
result, the rendered Markdown / table output showed a yellow WARN badge
for the affected GPU while the overall Health Check verdict came back
red FAIL, leaving operators to wonder which signal to trust.

A PCIe link downgrade is not a soft warning: dropping one generation
halves H2D/D2H bandwidth (Gen5 x16 ~64 GB/s -> Gen4 x16 ~32 GB/s),
directly impacting data loading, checkpoint I/O, and ZeRO/offload
throughput. For an acceptance-test tool this should be a hard failure,
consistent with how overall_pass already treats it.
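The bandwidth figures above follow from the per-lane signalling rate and the 128b/130b line encoding. The sketch below reproduces the arithmetic; the helper name `pcie_peak_gb_per_s` is hypothetical and not part of this repository, and the result is a theoretical ceiling before protocol overhead, not a measured value.

```python
# Per-lane raw signalling rate in GT/s for each PCIe generation.
GT_PER_SEC = {3: 8.0, 4: 16.0, 5: 32.0}

# Gen3 and later use 128b/130b line encoding.
ENCODING_EFFICIENCY = 128 / 130


def pcie_peak_gb_per_s(gen: int, width: int) -> float:
    """Theoretical one-direction bandwidth in GB/s (hypothetical helper)."""
    # GT/s per lane * encoding efficiency * lanes, divided by 8 bits per byte.
    return GT_PER_SEC[gen] * ENCODING_EFFICIENCY * width / 8


gen5 = pcie_peak_gb_per_s(5, 16)
gen4 = pcie_peak_gb_per_s(4, 16)
print(f"Gen5 x16: {gen5:.1f} GB/s, Gen4 x16: {gen4:.1f} GB/s")
# prints: Gen5 x16: 63.0 GB/s, Gen4 x16: 31.5 GB/s
```

A width drop has the same effect: Gen5 x8 lands at the Gen4 x16 figure.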

Change: in modules/health_check.py, set status to "FAIL" (not "WARN")
when pcie_ok is False. This applies to both the known-GPU path
(Gen >= expected and Width >= 16) and the unknown-GPU fallback path
(Width >= 8). There is no behavioral change to overall_pass; only the
per-GPU status string is corrected, so the table view, Markdown report,
and overall verdict now agree.
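A minimal sketch of the corrected rule, assuming the check reduces to the two thresholds described above. The function name `pcie_link_status`, its parameters, and the "PASS" string are illustrative assumptions; the actual code in modules/health_check.py may be structured differently.

```python
from typing import Optional, Tuple


def pcie_link_status(cur_gen: int, cur_width: int,
                     expected_gen: Optional[int]) -> Tuple[bool, str]:
    """Return (pcie_ok, status) for one GPU (hypothetical helper)."""
    if expected_gen is not None:
        # Known-GPU path: must reach the expected generation at x16.
        pcie_ok = cur_gen >= expected_gen and cur_width >= 16
    else:
        # Unknown-GPU fallback path: only require at least x8.
        pcie_ok = cur_width >= 8
    # A failed link check is a hard FAIL, never a WARN, so the per-GPU
    # status always agrees with the overall verdict.
    return pcie_ok, ("PASS" if pcie_ok else "FAIL")


# Example: one healthy GPU and one H200-class GPU stuck at Gen4.
results = [pcie_link_status(5, 16, 5), pcie_link_status(4, 16, 5)]
overall_pass = all(ok for ok, _ in results)
```

With this rule a per-GPU "FAIL" appears exactly when that GPU contributes to overall_pass being False.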

- benchmark.py: FP8 dtype now uses torch._scaled_mm() with scale tensors
  instead of torch.matmul(), which does not support float8_e4m3fn on
  Hopper; fixes the "addmm_cuda not implemented" error and enables FP8
  TFLOPS measurement
- nccl_test.py: fix two bugs that caused all-zero bandwidth results
  1. message size changed from -b 8 (8 bytes) to -b 8M -e 8G so the test
     moves a meaningful load
  2. column parser corrected: parts[2] is the dtype string, not a time
     value; now reads time=parts[5], algbw=parts[6], busbw=parts[7] per
     the nccl-tests output format
- report.py: replace .get(key, 0) with .get(key) or 0 at all
  bandwidth/stress fields to handle None values stored in result dicts
  (dict.get's default applies only when the key is missing, not when
  the key is present with a stored None)
- .gitignore: exclude .claude/settings.local.json
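The nccl_test.py and report.py fixes above can be sketched together as follows. The function name `parse_nccl_row`, the result keys, and the sample row are illustrative assumptions (the numbers are made up for the example, not measured); only the column indices and the `.get(key) or 0` idiom come from this change.

```python
from typing import Dict, Optional


def parse_nccl_row(line: str) -> Optional[Dict[str, float]]:
    """Parse one nccl-tests data row (hypothetical helper)."""
    parts = line.split()
    # Data rows start with the message size; header rows start with '#'.
    if len(parts) < 8 or not parts[0].isdigit():
        return None
    # Columns: size, count, type, redop, root, time, algbw, busbw, ...
    # parts[2] is the dtype string, so the timing starts at parts[5].
    return {
        "time_us": float(parts[5]),
        "algbw_gb_s": float(parts[6]),
        "busbw_gb_s": float(parts[7]),
    }


# Illustrative row only; the values are invented for this example.
sample = " 8388608 2097152 float sum -1 185.3 45.27 79.22 0 184.9 45.37 79.40 0"
row = parse_nccl_row(sample)

# A result dict can hold an explicit None when a run produced no data.
result = {"busbw_gb_s": None}
with_default = result.get("busbw_gb_s", 0)  # still None: the key exists
with_or = result.get("busbw_gb_s") or 0     # 0: None is replaced
```

The `or 0` form also maps other falsy values such as 0.0 to 0, which is harmless for numeric report fields.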

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Key changes:
- stress_test: use torch.cuda.mem_get_info() to size allocations from
  free memory instead of total, and allocate 40% of it to avoid OOM when
  other processes occupy GPU memory
- benchmark: fix D2D efficiency by comparing to NVLink per-direction
  bandwidth (not HBM); add H2D/D2H efficiency against PCIe peak
- nccl_test: implement direct binary → mpirun → torchrun fallback chain;
  fix min_bw None bug when the YAML value is empty
- report: update memory section to use per-metric peak fields
- install_deps.sh: add NCCL compatibility detection, enhance CUDA version
  detection with CUDA_HOME/standard paths, improve _map_cuda_tag logging
- gpu_info: parse CUDA version from the nvidia-smi header (the query
  field was removed in newer drivers)
- health_check: parse the throttle_reasons bitmask properly and ignore
  the gpu_idle bit
- gpu_tester: fix suite summary to exclude metadata keys from pass count
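As an illustration of the health_check item above, here is a minimal sketch of decoding a throttle-reasons bitmask while ignoring the idle bit. The bit values follow NVML's clocks-throttle-reason constants as I understand them; the function name, the lowercase reason names, and the choice to ignore only gpu_idle are assumptions, not the module's actual code.

```python
from typing import List

# Bit positions as defined by NVML's clocks throttle reason constants.
THROTTLE_BITS = {
    0x0001: "gpu_idle",
    0x0002: "applications_clocks_setting",
    0x0004: "sw_power_cap",
    0x0008: "hw_slowdown",
    0x0010: "sync_boost",
    0x0020: "sw_thermal_slowdown",
    0x0040: "hw_thermal_slowdown",
    0x0080: "hw_power_brake_slowdown",
    0x0100: "display_clock_setting",
}

# An idle GPU is expected to clock down, so this bit is not a fault.
IGNORED = {"gpu_idle"}


def active_throttle_reasons(raw: str) -> List[str]:
    """Decode a hex bitmask string such as '0x0000000000000005'."""
    mask = int(raw.strip(), 16)
    return [
        name
        for bit, name in THROTTLE_BITS.items()
        if mask & bit and name not in IGNORED
    ]


print(active_throttle_reasons("0x0000000000000005"))
# prints: ['sw_power_cap']
```

Testing individual bits avoids the pitfall of comparing the whole mask against zero, which would flag every idle GPU as throttled.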

🤖 Generated with [Qoder](https://qoder.com)

- Expand GPU specs database to include A100/A800 with Ampere architecture parameters
- Rename h200_tester.py to gpu_tester.py for architecture-neutral branding
- Add driver/CUDA compatibility validation per GPU generation
- Enhance report module with HTML and Markdown output formats
- Improve nvbandwidth binary discovery (system paths, DCGM locations)
- Add pyproject.toml with uv for dependency management
- Update install_deps.sh, configs, and README for multi-architecture support