20 Commits

Author SHA1 Message Date
ef2ca11c58 fix: resolve FP8 benchmark, NCCL parsing, and report None-value bugs
- benchmark.py: FP8 dtype now uses torch._scaled_mm() with scale tensors
  instead of torch.matmul() which does not support float8_e4m3fn on Hopper;
  fixes "addmm_cuda not implemented" error and enables FP8 TFLOPS measurement

- nccl_test.py: fix two bugs causing all-zero bandwidth results
  1. buffer size changed from -b 8 (8 bytes) to -b 8M -e 8G for meaningful load
  2. column parser corrected: parts[2] is dtype string not time value;
     now reads time=parts[5], algbw=parts[6], busbw=parts[7] per nccl-tests format

- report.py: replace .get(key, 0) with .get(key) or 0 at all bandwidth/stress
  fields to handle None values stored in result dicts (dict.get with default
  does not override an explicitly stored None)

- .gitignore: exclude .claude/settings.local.json

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-10 15:53:12 +08:00
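
The two Python fixes in the commit above can be illustrated with a minimal sketch. It is not the repository's code: `parse_nccl_row` and `bandwidth_for_report` are hypothetical names, and the column positions follow the commit message (time=parts[5], algbw=parts[6], busbw=parts[7]), which assumes a row layout like that of `all_reduce_perf`.

```python
from typing import Optional


def parse_nccl_row(line: str) -> Optional[dict]:
    """Parse one data row of nccl-tests output; return None for other lines."""
    parts = line.split()
    # Data rows begin with the message size in bytes; header lines begin with '#'.
    if len(parts) < 8 or not parts[0].isdigit():
        return None
    try:
        return {
            "size_bytes": int(parts[0]),
            "dtype": parts[2],             # a string such as "float", not a time
            "time_us": float(parts[5]),    # out-of-place time
            "algbw": float(parts[6]),      # algorithm bandwidth
            "busbw": float(parts[7]),      # bus bandwidth
        }
    except ValueError:
        return None


def bandwidth_for_report(result: dict, key: str) -> float:
    """Return a number even when the key exists but holds None."""
    # result.get(key, 0) would return None here, because the default is
    # only used when the key is missing entirely.
    return result.get(key) or 0


# Illustrative row (values made up for the example, not measured results).
row = "8388608 2097152 float sum -1 123.4 67.98 118.96 0 122.9 68.25 119.44 0"
parsed = parse_nccl_row(row)
```

Note that `or 0` also turns other falsy values such as `0.0` or an empty string into `0`, which is harmless for numeric report fields but worth keeping in mind elsewhere.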
09f81973bc Merge pull request 'hanks/test_gpu' (#1) from hanks/test_gpu into main
Reviewed-on: #1
2026-05-07 21:34:56 +08:00
qinyusen
fefef8e03b refactor: remove hardcoding, fix AMP bug, unify English output
- Fix AMP autocast: bf16 now uses torch.amp.autocast (was skipped)
- Fix NCCL threshold: unknown GPU gets 10 GB/s floor instead of 0
- Fix PCIe health check: use specs-driven pcie_gen, not hardcoded Gen4
- Remove hardcoded GPU lists: dynamic banner, CLI choices, version
- Unknown GPU efficiency displays N/A instead of 0%
- Unify all console output to English (stress_test, gpu_tester)
- Use importlib.metadata for runtime version resolution
- Remove dir="/tmp" from tempfile (use system default)

🤖 Generated with [Qoder](https://qoder.com)
2026-05-07 21:32:35 +08:00
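
Runtime version resolution with `importlib.metadata`, as mentioned in the commit above, might look roughly like the following sketch. The helper name `resolve_version` and the `"unknown"` fallback are assumptions for illustration; the commit does not say how the project handles a missing distribution.

```python
from importlib.metadata import PackageNotFoundError, version


def resolve_version(dist_name: str, fallback: str = "unknown") -> str:
    """Look up the installed version of a distribution at runtime."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # Raised when the package is not installed, for example when
        # running from a source checkout without `pip install`.
        return fallback
```

This avoids keeping a hardcoded version string in the source, since the value comes from the installed package metadata.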
qinyusen
f2158f6cd3 fix: resolve stress OOM, D2D efficiency calculation, NCCL execution failures
Key changes:
- stress_test: use torch.cuda.mem_get_info() for free memory instead of total,
  allocate 40% to avoid OOM when other processes occupy GPU memory
- benchmark: fix D2D efficiency by comparing to NVLink per-direction bandwidth
  (not HBM), add H2D/D2H efficiency against PCIe peak
- nccl_test: implement direct binary → mpirun → torchrun fallback chain,
  fix min_bw None bug when YAML value is empty
- report: update memory section to use per-metric peak fields
- install_deps.sh: add NCCL compatibility detection, enhance CUDA version
  detection with CUDA_HOME/standard paths, improve _map_cuda_tag logging
- gpu_info: parse CUDA version from nvidia-smi header (query field removed
  in newer drivers)
- health_check: parse throttle_reasons bitmask properly, ignore gpu_idle bit
- gpu_tester: fix suite summary to exclude metadata keys from pass count

🤖 Generated with [Qoder](https://qoder.com)
2026-05-07 18:09:22 +08:00
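
The throttle_reasons handling described in the commit above can be sketched as a bitmask decode that skips the idle bit. This is an illustration, not the project's code: the function name is hypothetical, the input is assumed to be a hex string, and the bit values are the NVML clocks-throttle-reason constants as commonly documented, so they should be checked against the `nvml.h` of the installed driver.

```python
from typing import List

# Assumed bit assignments (NVML nvmlClocksThrottleReason* constants);
# verify against the driver's nvml.h before relying on them.
THROTTLE_BITS = {
    0x01: "gpu_idle",
    0x02: "applications_clocks_setting",
    0x04: "sw_power_cap",
    0x08: "hw_slowdown",
    0x10: "sync_boost",
    0x20: "sw_thermal_slowdown",
    0x40: "hw_thermal_slowdown",
    0x80: "hw_power_brake_slowdown",
}

# An idle GPU lowers its clocks by design, so this bit is not a fault.
IGNORED = {"gpu_idle"}


def active_throttle_reasons(raw: str) -> List[str]:
    """Decode a hex bitmask string into the throttle reasons worth reporting."""
    mask = int(raw.strip(), 16)  # base 16 accepts an optional "0x" prefix
    return [
        name
        for bit, name in THROTTLE_BITS.items()
        if mask & bit and name not in IGNORED
    ]
```

With this approach a mask of `0x0000000000000001` (idle only) yields an empty list, so an idle GPU is not flagged as throttled.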
qinyusen
24934bc182 feat: rewrite install_deps.sh with env isolation and add numpy to requirements
- Complete rewrite of install_deps.sh (6-phase architecture):
  environment validation, uv-based venv isolation, CUDA auto-detection,
  idempotent native tool compilation, env.sh/run-gpu-tests generation
- Add numpy>=1.24 to requirements.txt to align with pyproject.toml
- Support --install-system-deps, --skip-pytorch, --rebuild, -y flags
- Use subshells for compilation to prevent CWD pollution
- Generate env.sh activation script and run-gpu-tests wrapper

🤖 Generated with [Qoder](https://qoder.com)
2026-05-07 01:32:13 +08:00
qinyusen
3e967dd34a feat: add Ampere (A100/A800) support and generalize project naming
- Expand GPU specs database to include A100/A800 with Ampere architecture parameters
- Rename h200_tester.py to gpu_tester.py for architecture-neutral branding
- Add driver/CUDA compatibility validation per GPU generation
- Enhance report module with HTML and Markdown output formats
- Improve nvbandwidth binary discovery (system paths, DCGM locations)
- Add pyproject.toml with uv for dependency management
- Update install_deps.sh, configs, and README for multi-architecture support

🤖 Generated with [Qoder](https://qoder.com)
2026-05-07 01:02:28 +08:00
qinyusen
07250af845 docs: update README for multi-GPU support with auto-detection guide
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-05-06 19:32:09 +08:00
qinyusen
2cb776d7d5 fix: generic branding, wire up report generation, fix --config flag
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-05-06 19:32:01 +08:00
qinyusen
52fe96f2f5 refactor: replace hardcoded H200 specs with dynamic GPU detection
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-05-06 19:31:51 +08:00
qinyusen
98e4977e28 add: GPU specs database with auto-detection (H100/H200/B200/B300)
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-05-06 19:31:44 +08:00
qinyusen
8f7539d9b0 add: research notes on GPU server testing frameworks and tools
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-25 17:24:06 +08:00
qinyusen
82cd4d5180 add: training simulation and report generation modules
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-25 17:24:01 +08:00
qinyusen
1c6ba4809a add: stress test (gpu-burn) and RDMA/IB test modules
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-25 17:23:57 +08:00
qinyusen
eac1438227 add: NCCL test module (nccl-tests integration + torchrun fallback)
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-25 17:23:54 +08:00
qinyusen
65f10dd365 add: benchmark module (nvbandwidth integration + PyTorch compute)
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-25 17:23:49 +08:00
qinyusen
b6dff76ef7 add: health check module (temperature, power, ECC, PCIe, system checks)
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-25 17:23:44 +08:00
qinyusen
f5fdde5fc1 add: GPU information module (nvidia-smi wrapper, NVLink topology)
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-25 17:23:40 +08:00
qinyusen
d4f46b6394 add: CLI entry point with interactive menu and argument parsing
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-25 17:23:35 +08:00
qinyusen
65cf7feee5 add: dependency installation script (nvbandwidth, nccl-tests, gpu-burn)
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-25 17:23:32 +08:00
qinyusen
418dc70efb init: project scaffolding with README, config, and requirements
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-25 17:23:27 +08:00