6 Commits

Author SHA1 Message Date
1db1313d50 fix(health): mark PCIe link downgrade as FAIL instead of WARN
The PCIe link health check was producing inconsistent verdicts: when the
negotiated link did not meet the GPU's expected Gen/Width (e.g. an H200
running at Gen4 instead of Gen5, or any GPU dropping below x16), the code
correctly flipped overall_pass to False — but recorded the per-GPU status
as "WARN" rather than "FAIL".

This mismatch broke the convention used by every other check in the
module (temperature, ECC, throttling), where FAIL is the only status
that drives overall_pass=False, and WARN is purely informational. As a
result the rendered Markdown / table output would show a yellow WARN
badge for the affected GPU while the overall Health Check verdict came
back red FAIL, leaving operators to wonder which signal to trust.

A PCIe link downgrade is not a soft warning — it halves H2D/D2H
bandwidth (Gen5 x16 ~64 GB/s -> Gen4 x16 ~32 GB/s), directly impacting
data loading, checkpoint I/O, and ZeRO/offload throughput. For an
acceptance-test tool this should be a hard failure, consistent with how
overall_pass already treats it.
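
The bandwidth figures above can be reproduced with a short calculation. This is a sketch of the theoretical per-direction rate only, assuming the standard per-lane transfer rates and 128b/130b line encoding for Gen3 to Gen5; the function name and table are illustrative and not part of the repository.

```python
# Per-lane transfer rates in GT/s; Gen3..Gen5 all use 128b/130b encoding.
GT_PER_LANE = {3: 8.0, 4: 16.0, 5: 32.0}


def pcie_bandwidth_gbs(gen: int, width: int) -> float:
    """Theoretical one-direction bandwidth in GB/s, before protocol overhead."""
    raw_gbit = GT_PER_LANE[gen] * width      # one transfer carries one bit per lane
    payload_gbit = raw_gbit * 128 / 130      # remove 128b/130b encoding overhead
    return payload_gbit / 8                  # bits -> bytes


gen5 = pcie_bandwidth_gbs(5, 16)
gen4 = pcie_bandwidth_gbs(4, 16)
print(f"Gen5 x16: {gen5:.1f} GB/s, Gen4 x16: {gen4:.1f} GB/s, ratio: {gen4 / gen5:.2f}")
```

Measured H2D/D2H throughput will be lower than these values because of packet and protocol overhead, but the 2:1 ratio between Gen5 and Gen4 holds.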

Change: in modules/health_check.py, set status to "FAIL" (not "WARN")
when pcie_ok is False. This applies to both the known-GPU path
(Gen >= expected and Width >= 16) and the unknown-GPU fallback path
(Width >= 8). No behavioral change to overall_pass — only the per-GPU
status string is corrected so the table view, Markdown report, and the
overall verdict now agree.
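
A minimal sketch of the corrected rule, not the actual code in modules/health_check.py: the function names, the tuple layout, and the use of `None` to mean "unknown GPU" are assumptions made for illustration. Only the thresholds and the FAIL status come from the description above.

```python
from typing import Iterable, List, Optional, Tuple


def pcie_status(cur_gen: int, cur_width: int, expected_gen: Optional[int]) -> str:
    """Return "PASS" or "FAIL" for one GPU's negotiated PCIe link."""
    if expected_gen is not None:
        # Known GPU: must reach the expected generation at full x16 width.
        pcie_ok = cur_gen >= expected_gen and cur_width >= 16
    else:
        # Unknown GPU: fall back to a width-only floor.
        pcie_ok = cur_width >= 8
    # A failed link check is reported as FAIL, matching its effect on overall_pass.
    return "PASS" if pcie_ok else "FAIL"


def check_links(links: Iterable[Tuple[int, int, Optional[int]]]) -> Tuple[List[str], bool]:
    """Evaluate every link; overall_pass is False as soon as any status is FAIL."""
    statuses = [pcie_status(gen, width, expected) for gen, width, expected in links]
    overall_pass = all(status != "FAIL" for status in statuses)
    return statuses, overall_pass
```

With this shape the per-GPU status and the overall verdict are derived from the same boolean, so they cannot disagree.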
2026-05-10 17:23:51 +08:00
qinyusen
fefef8e03b refactor: remove hardcoding, fix AMP bug, unify English output
- Fix AMP autocast: bf16 now uses torch.amp.autocast (was skipped)
- Fix NCCL threshold: unknown GPU gets 10 GB/s floor instead of 0
- Fix PCIe health check: use specs-driven pcie_gen, not hardcoded Gen4
- Remove hardcoded GPU lists: dynamic banner, CLI choices, version
- Unknown GPU efficiency displays N/A instead of 0%
- Unify all console output to English (stress_test, gpu_tester)
- Use importlib.metadata for runtime version resolution
- Remove dir="/tmp" from tempfile (use system default)
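
As an illustration of the runtime version lookup mentioned in the list above, a sketch using the standard library could look like the following. The helper name, the fallback string, and the distribution name in the usage line are hypothetical; the real package name is not given in this log.

```python
from importlib.metadata import PackageNotFoundError, version


def resolve_version(dist_name: str, fallback: str = "unknown") -> str:
    """Read the installed distribution's version instead of hardcoding it."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # Running from a source checkout that was never installed.
        return fallback


# Hypothetical distribution name, used here only to show the call.
print(resolve_version("gpu-tester"))
```

The fallback keeps the banner printable when the tool is run straight from a clone without installation.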

🤖 Generated with [Qoder](https://qoder.com)
2026-05-07 21:32:35 +08:00
qinyusen
f2158f6cd3 fix: resolve stress OOM, D2D efficiency calculation, NCCL execution failures
Key changes:
- stress_test: use torch.cuda.mem_get_info() for free memory instead of total,
  allocate 40% to avoid OOM when other processes occupy GPU memory
- benchmark: fix D2D efficiency by comparing to NVLink per-direction bandwidth
  (not HBM), add H2D/D2H efficiency against PCIe peak
- nccl_test: implement direct binary → mpirun → torchrun fallback chain,
  fix min_bw None bug when YAML value is empty
- report: update memory section to use per-metric peak fields
- install_deps.sh: add NCCL compatibility detection, enhance CUDA version
  detection with CUDA_HOME/standard paths, improve _map_cuda_tag logging
- gpu_info: parse CUDA version from nvidia-smi header (query field removed
  in newer drivers)
- health_check: parse throttle_reasons bitmask properly, ignore gpu_idle bit
- gpu_tester: fix suite summary to exclude metadata keys from pass count
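
The throttle-reason handling in the health_check item above can be sketched as follows. This assumes the raw value is a hexadecimal string and that the bit assignments follow NVML's clocks-throttle-reason constants; the function name and the returned labels are illustrative, and which of the remaining bits should count as a failure is a policy decision this log does not specify.

```python
from typing import List

GPU_IDLE = 0x1  # set whenever the GPU is idle; not a fault, so it is masked out

# Remaining bits, assumed to follow NVML's nvmlClocksThrottleReason* constants.
THROTTLE_BITS = {
    0x2: "applications_clocks_setting",
    0x4: "sw_power_cap",
    0x8: "hw_slowdown",
    0x10: "sync_boost",
    0x20: "sw_thermal_slowdown",
    0x40: "hw_thermal_slowdown",
    0x80: "hw_power_brake_slowdown",
}


def active_throttle_reasons(raw: str) -> List[str]:
    """Decode a hex bitmask such as "0x0000000000000009" into reason names."""
    mask = int(raw.strip(), 16) & ~GPU_IDLE
    return [name for bit, name in THROTTLE_BITS.items() if mask & bit]


print(active_throttle_reasons("0x0000000000000001"))  # idle only
print(active_throttle_reasons("0x0000000000000044"))  # power cap + HW thermal
```

Treating the mask as a set of flags, rather than comparing the whole value to zero, is what keeps an idle GPU from being reported as throttled.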

🤖 Generated with [Qoder](https://qoder.com)
2026-05-07 18:09:22 +08:00
qinyusen
3e967dd34a feat: add Ampere (A100/A800) support and generalize project naming
- Expand GPU specs database to include A100/A800 with Ampere architecture parameters
- Rename h200_tester.py to gpu_tester.py for architecture-neutral branding
- Add driver/CUDA compatibility validation per GPU generation
- Enhance report module with HTML and Markdown output formats
- Improve nvbandwidth binary discovery (system paths, DCGM locations)
- Add pyproject.toml with uv for dependency management
- Update install_deps.sh, configs, and README for multi-architecture support

🤖 Generated with [Qoder](https://qoder.com)
2026-05-07 01:02:28 +08:00
qinyusen
52fe96f2f5 refactor: replace hardcoded H200 specs with dynamic GPU detection
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-05-06 19:31:51 +08:00
qinyusen
b6dff76ef7 add: health check module (temperature, power, ECC, PCIe, system checks)
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-25 17:23:44 +08:00