qinyusen
3e967dd34a
feat: add Ampere (A100/A800) support and generalize project naming
- Expand GPU specs database to include A100/A800 with Ampere architecture parameters
- Rename h200_tester.py to gpu_tester.py for architecture-neutral branding
- Add driver/CUDA compatibility validation per GPU generation
- Enhance report module with HTML and Markdown output formats
- Improve nvbandwidth binary discovery (system paths, DCGM locations)
- Add pyproject.toml with uv for dependency management
- Update install_deps.sh, configs, and README for multi-architecture support
🤖 Generated with [Qoder](https://qoder.com)
2026-05-07 01:02:28 +08:00
qinyusen
07250af845
docs: update README for multi-GPU support with auto-detection guide
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-05-06 19:32:09 +08:00
qinyusen
2cb776d7d5
fix: generic branding, wire up report generation, fix --config flag
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-05-06 19:32:01 +08:00
qinyusen
52fe96f2f5
refactor: replace hardcoded H200 specs with dynamic GPU detection
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-05-06 19:31:51 +08:00
qinyusen
98e4977e28
add: GPU specs database with auto-detection (H100/H200/B200/B300)
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-05-06 19:31:44 +08:00
qinyusen
8f7539d9b0
add: research notes on GPU server testing frameworks and tools
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-25 17:24:06 +08:00
qinyusen
82cd4d5180
add: training simulation and report generation modules
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-25 17:24:01 +08:00
qinyusen
1c6ba4809a
add: stress test (gpu-burn) and RDMA/IB test modules
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-25 17:23:57 +08:00
qinyusen
eac1438227
add: NCCL test module (nccl-tests integration + torchrun fallback)
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-25 17:23:54 +08:00
qinyusen
65f10dd365
add: benchmark module (nvbandwidth integration + PyTorch compute)
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-25 17:23:49 +08:00
qinyusen
b6dff76ef7
add: health check module (temperature, power, ECC, PCIe, system checks)
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-25 17:23:44 +08:00
qinyusen
f5fdde5fc1
add: GPU information module (nvidia-smi wrapper, NVLink topology)
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-25 17:23:40 +08:00
qinyusen
d4f46b6394
add: CLI entry point with interactive menu and argument parsing
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-25 17:23:35 +08:00
qinyusen
65cf7feee5
add: dependency installation script (nvbandwidth, nccl-tests, gpu-burn)
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-25 17:23:32 +08:00
qinyusen
418dc70efb
init: project scaffolding with README, config, and requirements
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
2026-04-25 17:23:27 +08:00