21 Commits

Author SHA1 Message Date
1db1313d50 fix(health): mark PCIe link downgrade as FAIL instead of WARN
The PCIe link health check was producing inconsistent verdicts: when the
negotiated link did not meet the GPU's expected Gen/Width (e.g. an H200
running at Gen4 instead of Gen5, or any GPU dropping below x16), the code
correctly flipped overall_pass to False — but recorded the per-GPU status
as "WARN" rather than "FAIL".

This mismatch broke the convention used by every other check in the
module (temperature, ECC, throttling), where FAIL is the only status
that drives overall_pass=False, and WARN is purely informational. As a
result the rendered Markdown / table output would show a yellow WARN
badge for the affected GPU while the overall Health Check verdict came
back red FAIL, leaving operators to wonder which signal to trust.

A PCIe link downgrade is not a soft warning: dropping one generation halves
H2D/D2H bandwidth (Gen5 x16 ~64 GB/s -> Gen4 x16 ~32 GB/s per direction),
directly impacting data loading, checkpoint I/O, and ZeRO/offload throughput. For an
acceptance-test tool this should be a hard failure, consistent with how
overall_pass already treats it.

Change: in modules/health_check.py, set status to "FAIL" (not "WARN")
when pcie_ok is False. This applies to both the known-GPU path
(Gen >= expected and Width >= 16) and the unknown-GPU fallback path
(Width >= 8). No behavioral change to overall_pass — only the per-GPU
status string is corrected so the table view, Markdown report, and the
overall verdict now agree.
2026-05-10 17:23:51 +08:00
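
The status rule described in commit 1db1313d50 can be sketched as follows. This is a minimal illustration, not the code in modules/health_check.py: the function names `pcie_status` and `overall_pass`, the `"PASS"` label, and the use of `None` to mark an unknown GPU are all assumptions made for the example. Only the thresholds (Gen >= expected and Width >= 16 for known GPUs, Width >= 8 for the fallback) and the FAIL/WARN convention come from the commit message.

```python
from typing import Iterable, Optional


def pcie_status(gen: int, width: int, expected_gen: Optional[int]) -> str:
    """Return a per-GPU status string for a negotiated PCIe link.

    expected_gen is None when the GPU model is unknown (hypothetical convention).
    """
    if expected_gen is not None:
        # Known GPU: link must reach the expected generation at x16 or wider.
        pcie_ok = gen >= expected_gen and width >= 16
    else:
        # Unknown GPU: fallback path only requires x8 or wider.
        pcie_ok = width >= 8
    # A downgrade is a hard failure, not an informational warning.
    return "PASS" if pcie_ok else "FAIL"


def overall_pass(statuses: Iterable[str]) -> bool:
    """Only FAIL drives the overall verdict to False; WARN is informational."""
    return all(status != "FAIL" for status in statuses)
```

With this shape, an H200 negotiated at Gen4 x16 against an expected Gen5 yields `"FAIL"` per GPU, and the overall verdict is `False` for the same reason, so the two signals cannot disagree.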
ef2ca11c58 fix: resolve FP8 benchmark, NCCL parsing, and report None-value bugs
- benchmark.py: FP8 dtype now uses torch._scaled_mm() with scale tensors
  instead of torch.matmul() which does not support float8_e4m3fn on Hopper;
  fixes "addmm_cuda not implemented" error and enables FP8 TFLOPS measurement

- nccl_test.py: fix two bugs causing all-zero bandwidth results
  1. buffer size changed from -b 8 (8 bytes) to -b 8M -e 8G for meaningful load
  2. column parser corrected: parts[2] is dtype string not time value;
     now reads time=parts[5], algbw=parts[6], busbw=parts[7] per nccl-tests format

- report.py: replace .get(key, 0) with .get(key) or 0 at all bandwidth/stress
  fields to handle None values stored in result dicts (dict.get with default
  does not override an explicitly stored None)

- .gitignore: exclude .claude/settings.local.json

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-10 15:53:12 +08:00
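
Two of the fixes in commit ef2ca11c58 are easy to show in isolation: the nccl-tests column indices and the `None`-safe lookup. The sketch below assumes the usual nccl-tests row layout (size, count, type, redop, root, then out-of-place time, algbw, busbw); the helper names `parse_nccl_row` and `value_or_zero` and the result keys are hypothetical and do not come from nccl_test.py or report.py.

```python
from typing import Optional


def parse_nccl_row(line: str) -> Optional[dict]:
    """Parse one nccl-tests data row; return None for headers and comments."""
    parts = line.split()
    # Data rows start with a numeric size; header lines start with '#'.
    if len(parts) < 8 or not parts[0].isdigit():
        return None
    try:
        return {
            "size_bytes": int(parts[0]),
            "dtype": parts[2],               # a string such as "float", not a time
            "time_us": float(parts[5]),      # out-of-place time
            "algbw_gb_s": float(parts[6]),   # algorithm bandwidth
            "busbw_gb_s": float(parts[7]),   # bus bandwidth
        }
    except ValueError:
        return None


def value_or_zero(result: dict, key: str) -> float:
    """Treat both a missing key and an explicitly stored None as 0."""
    return result.get(key) or 0


# dict.get's default only applies when the key is absent:
stored_none = {"busbw": None}
default_is_ignored = stored_none.get("busbw", 0)   # still None
```

One design note: `or 0` also maps other falsy values such as `0.0` to `0`, which is harmless for bandwidth fields but would not suit a field where an empty string or `False` carries meaning.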
09f81973bc Merge pull request 'hanks/test_gpu' (#1) from hanks/test_gpu into main
Reviewed-on: #1
2026-05-07 21:34:56 +08:00
qinyusen
fefef8e03b refactor: remove hardcoding, fix AMP bug, unify English output
- Fix AMP autocast: bf16 now uses torch.amp.autocast (was skipped)
- Fix NCCL threshold: unknown GPU gets 10 GB/s floor instead of 0
- Fix PCIe health check: use specs-driven pcie_gen, not hardcoded Gen4
- Remove hardcoded GPU lists: dynamic banner, CLI choices, version
- Unknown GPU efficiency displays N/A instead of 0%
- Unify all console output to English (stress_test, gpu_tester)
- Use importlib.metadata for runtime version resolution
- Remove dir="/tmp" from tempfile (use system default)

🤖 Generated with [Qoder](https://qoder.com)
2026-05-07 21:32:35 +08:00
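
Two items from commit fefef8e03b, runtime version resolution via importlib.metadata and showing N/A for unknown-GPU efficiency, might look roughly like this. The function names, the `"unknown"` fallback string, and the one-decimal percentage format are assumptions for illustration; the commit message does not show the actual code.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def resolve_version(dist_name: str, fallback: str = "unknown") -> str:
    """Look up the installed distribution's version instead of hardcoding it."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # Not installed as a package (e.g. run straight from a checkout).
        return fallback


def format_efficiency(measured: float, peak: Optional[float]) -> str:
    """Render measured/peak as a percentage, or N/A when no peak spec is known."""
    if not peak:
        # Unknown GPU: there is no reference value, so 0% would be misleading.
        return "N/A"
    return f"{measured / peak * 100:.1f}%"
```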
qinyusen
f2158f6cd3 fix: resolve stress OOM, D2D efficiency calculation, NCCL execution failures
Key changes:
- stress_test: use torch.cuda.mem_get_info() for free memory instead of total,
  allocate 40% to avoid OOM when other processes occupy GPU memory
- benchmark: fix D2D efficiency by comparing to NVLink per-direction bandwidth
  (not HBM), add H2D/D2H efficiency against PCIe peak
- nccl_test: implement direct binary → mpirun → torchrun fallback chain,
  fix min_bw None bug when YAML value is empty
- report: update memory section to use per-metric peak fields
- install_deps.sh: add NCCL compatibility detection, enhance CUDA version
  detection with CUDA_HOME/standard paths, improve _map_cuda_tag logging
- gpu_info: parse CUDA version from nvidia-smi header (query field removed
  in newer drivers)
- health_check: parse throttle_reasons bitmask properly, ignore gpu_idle bit
- gpu_tester: fix suite summary to exclude metadata keys from pass count

🤖 Generated with [Qoder](https://qoder.com)
2026-05-07 18:09:22 +08:00
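
The health_check item in commit f2158f6cd3 (parse the throttle_reasons bitmask, ignore the gpu_idle bit) can be sketched as below. The bit values follow the NVML clocks-throttle-reason constants as commonly documented and should be verified against the nvml.h shipped with your driver; the function name, the reason labels, and the assumption that the input is a hex string such as `0x0000000000000004` are illustrative, not taken from the project.

```python
from typing import List

# Assumed NVML bit assignments; verify against your driver's nvml.h.
GPU_IDLE = 0x1
THROTTLE_BITS = {
    0x2: "applications_clocks_setting",
    0x4: "sw_power_cap",
    0x8: "hw_slowdown",
    0x10: "sync_boost",
    0x20: "sw_thermal_slowdown",
    0x40: "hw_thermal_slowdown",
    0x80: "hw_power_brake_slowdown",
}


def active_throttle_reasons(raw: str) -> List[str]:
    """Decode a hex bitmask string into reason names, ignoring gpu_idle."""
    mask = int(raw.strip(), 16)   # accepts an optional "0x" prefix
    mask &= ~GPU_IDLE             # an idle GPU is not a health problem
    return [name for bit, name in THROTTLE_BITS.items() if mask & bit]
```

An idle GPU reporting `0x0000000000000001` then decodes to an empty list, so the check does not flag hardware that is simply not under load.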
qinyusen
24934bc182 feat: rewrite install_deps.sh with env isolation and add numpy to requirements
- Complete rewrite of install_deps.sh (6-phase architecture):
  environment validation, uv-based venv isolation, CUDA auto-detection,
  idempotent native tool compilation, env.sh/run-gpu-tests generation
- Add numpy>=1.24 to requirements.txt to align with pyproject.toml
- Support --install-system-deps, --skip-pytorch, --rebuild, -y flags
- Use subshells for compilation to prevent CWD pollution
- Generate env.sh activation script and run-gpu-tests wrapper

🤖 Generated with [Qoder](https://qoder.com)
2026-05-07 01:32:13 +08:00
qinyusen
3e967dd34a feat: add Ampere (A100/A800) support and generalize project naming
- Expand GPU specs database to include A100/A800 with Ampere architecture parameters
- Rename h200_tester.py to gpu_tester.py for architecture-neutral branding
- Add driver/CUDA compatibility validation per GPU generation
- Enhance report module with HTML and Markdown output formats
- Improve nvbandwidth binary discovery (system paths, DCGM locations)
- Add pyproject.toml with uv for dependency management
- Update install_deps.sh, configs, and README for multi-architecture support

🤖 Generated with [Qoder](https://qoder.com)
2026-05-07 01:02:28 +08:00
qinyusen
07250af845 docs: update README for multi-GPU support with auto-detection guide
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-05-06 19:32:09 +08:00
qinyusen
2cb776d7d5 fix: generic branding, wire up report generation, fix --config flag
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-05-06 19:32:01 +08:00
qinyusen
52fe96f2f5 refactor: replace hardcoded H200 specs with dynamic GPU detection
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-05-06 19:31:51 +08:00
qinyusen
98e4977e28 add: GPU specs database with auto-detection (H100/H200/B200/B300)
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-05-06 19:31:44 +08:00
qinyusen
8f7539d9b0 add: research notes on GPU server testing frameworks and tools
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-25 17:24:06 +08:00
qinyusen
82cd4d5180 add: training simulation and report generation modules
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-25 17:24:01 +08:00
qinyusen
1c6ba4809a add: stress test (gpu-burn) and RDMA/IB test modules
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-25 17:23:57 +08:00
qinyusen
eac1438227 add: NCCL test module (nccl-tests integration + torchrun fallback)
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-25 17:23:54 +08:00
qinyusen
65f10dd365 add: benchmark module (nvbandwidth integration + PyTorch compute)
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-25 17:23:49 +08:00
qinyusen
b6dff76ef7 add: health check module (temperature, power, ECC, PCIe, system checks)
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-25 17:23:44 +08:00
qinyusen
f5fdde5fc1 add: GPU information module (nvidia-smi wrapper, NVLink topology)
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-25 17:23:40 +08:00
qinyusen
d4f46b6394 add: CLI entry point with interactive menu and argument parsing
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-25 17:23:35 +08:00
qinyusen
65cf7feee5 add: dependency installation script (nvbandwidth, nccl-tests, gpu-burn)
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-25 17:23:32 +08:00
qinyusen
418dc70efb init: project scaffolding with README, config, and requirements
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-25 17:23:27 +08:00