test_gpu_scripts

Author	SHA1	Message	Date
han.zhao	09f81973bc	Merge pull request 'hanks/test_gpu' (#1) from hanks/test_gpu into main Reviewed-on: #1	2026-05-07 21:34:56 +08:00
qinyusen	fefef8e03b	refactor: remove hardcoding, fix AMP bug, unify English output - Fix AMP autocast: bf16 now uses torch.amp.autocast (was skipped) - Fix NCCL threshold: unknown GPU gets 10 GB/s floor instead of 0 - Fix PCIe health check: use specs-driven pcie_gen, not hardcoded Gen4 - Remove hardcoded GPU lists: dynamic banner, CLI choices, version - Unknown GPU efficiency displays N/A instead of 0% - Unify all console output to English (stress_test, gpu_tester) - Use importlib.metadata for runtime version resolution - Remove dir="/tmp" from tempfile (use system default) 🤖 Generated with [Qoder](https://qoder.com)	2026-05-07 21:32:35 +08:00
qinyusen	f2158f6cd3	fix: resolve stress OOM, D2D efficiency calculation, NCCL execution failures Key changes: - stress_test: use torch.cuda.mem_get_info() for free memory instead of total, allocate 40% to avoid OOM when other processes occupy GPU memory - benchmark: fix D2D efficiency by comparing to NVLink per-direction bandwidth (not HBM), add H2D/D2H efficiency against PCIe peak - nccl_test: implement direct binary → mpirun → torchrun fallback chain, fix min_bw None bug when YAML value is empty - report: update memory section to use per-metric peak fields - install_deps.sh: add NCCL compatibility detection, enhance CUDA version detection with CUDA_HOME/standard paths, improve _map_cuda_tag logging - gpu_info: parse CUDA version from nvidia-smi header (query field removed in newer drivers) - health_check: parse throttle_reasons bitmask properly, ignore gpu_idle bit - gpu_tester: fix suite summary to exclude metadata keys from pass count 🤖 Generated with [Qoder](https://qoder.com)	2026-05-07 18:09:22 +08:00
qinyusen	24934bc182	feat: rewrite install_deps.sh with env isolation and add numpy to requirements - Complete rewrite of install_deps.sh (6-phase architecture): environment validation, uv-based venv isolation, CUDA auto-detection, idempotent native tool compilation, env.sh/run-gpu-tests generation - Add numpy>=1.24 to requirements.txt to align with pyproject.toml - Support --install-system-deps, --skip-pytorch, --rebuild, -y flags - Use subshells for compilation to prevent CWD pollution - Generate env.sh activation script and run-gpu-tests wrapper 🤖 Generated with [Qoder](https://qoder.com)	2026-05-07 01:32:13 +08:00
qinyusen	3e967dd34a	feat: add Ampere (A100/A800) support and generalize project naming - Expand GPU specs database to include A100/A800 with Ampere architecture parameters - Rename h200_tester.py to gpu_tester.py for architecture-neutral branding - Add driver/CUDA compatibility validation per GPU generation - Enhance report module with HTML and Markdown output formats - Improve nvbandwidth binary discovery (system paths, DCGM locations) - Add pyproject.toml with uv for dependency management - Update install_deps.sh, configs, and README for multi-architecture support 🤖 Generated with [Qoder](https://qoder.com)	2026-05-07 01:02:28 +08:00
qinyusen	07250af845	docs: update README for multi-GPU support with auto-detection guide Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>	2026-05-06 19:32:09 +08:00
qinyusen	2cb776d7d5	fix: generic branding, wire up report generation, fix --config flag Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>	2026-05-06 19:32:01 +08:00
qinyusen	52fe96f2f5	refactor: replace hardcoded H200 specs with dynamic GPU detection Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>	2026-05-06 19:31:51 +08:00
qinyusen	98e4977e28	add: GPU specs database with auto-detection (H100/H200/B200/B300) Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>	2026-05-06 19:31:44 +08:00
qinyusen	8f7539d9b0	add: research notes on GPU server testing frameworks and tools Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>	2026-04-25 17:24:06 +08:00
qinyusen	82cd4d5180	add: training simulation and report generation modules Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>	2026-04-25 17:24:01 +08:00
qinyusen	1c6ba4809a	add: stress test (gpu-burn) and RDMA/IB test modules Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>	2026-04-25 17:23:57 +08:00
qinyusen	eac1438227	add: NCCL test module (nccl-tests integration + torchrun fallback) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>	2026-04-25 17:23:54 +08:00
qinyusen	65f10dd365	add: benchmark module (nvbandwidth integration + PyTorch compute) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>	2026-04-25 17:23:49 +08:00
qinyusen	b6dff76ef7	add: health check module (temperature, power, ECC, PCIe, system checks) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>	2026-04-25 17:23:44 +08:00
qinyusen	f5fdde5fc1	add: GPU information module (nvidia-smi wrapper, NVLink topology) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>	2026-04-25 17:23:40 +08:00
qinyusen	d4f46b6394	add: CLI entry point with interactive menu and argument parsing Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>	2026-04-25 17:23:35 +08:00
qinyusen	65cf7feee5	add: dependency installation script (nvbandwidth, nccl-tests, gpu-burn) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>	2026-04-25 17:23:32 +08:00
qinyusen	418dc70efb	init: project scaffolding with README, config, and requirements Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>	2026-04-25 17:23:27 +08:00
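
Commit f2158f6cd3 mentions parsing the throttle_reasons bitmask in health_check and ignoring the gpu_idle bit. The repository's actual code is not shown here, so the following is only a minimal sketch of how such a check might look. The bit values follow NVML's nvmlClocksThrottleReason* constants; the function name `parse_throttle_reasons` and the reason labels are hypothetical.

```python
from __future__ import annotations

# Bit values follow NVML's nvmlClocksThrottleReason* constants.
# The string labels are illustrative, not taken from the repository.
THROTTLE_REASONS = {
    0x001: "gpu_idle",
    0x002: "applications_clocks_setting",
    0x004: "sw_power_cap",
    0x008: "hw_slowdown",
    0x010: "sync_boost",
    0x020: "sw_thermal_slowdown",
    0x040: "hw_thermal_slowdown",
    0x080: "hw_power_brake_slowdown",
    0x100: "display_clock_setting",
}

# An idle GPU lowers its clocks by design, so this bit is not a fault.
IGNORED = {"gpu_idle"}


def parse_throttle_reasons(raw: str) -> list[str]:
    """Decode a hex bitmask string (e.g. '0x0000000000000004') into reason names."""
    mask = int(raw.strip(), 16)  # base 16 accepts an optional '0x' prefix
    return [
        name
        for bit, name in THROTTLE_REASONS.items()
        if mask & bit and name not in IGNORED
    ]


if __name__ == "__main__":
    print(parse_throttle_reasons("0x0000000000000001"))  # idle only
    print(parse_throttle_reasons("0x0000000000000045"))  # idle + power cap + HW thermal
```

Treating the mask as an integer and testing each bit avoids the false positives that come from comparing the raw string against a single "no throttling" value.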

19 Commits
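
Two of the commits above touch the NCCL pass threshold: f2158f6cd3 fixes a min_bw None bug when the YAML value is empty, and fefef8e03b gives unknown GPUs a 10 GB/s floor instead of 0. As a rough illustration of that resolution order, and not the project's actual implementation, the sketch below assumes a hypothetical helper `resolve_min_bw`, a hypothetical `ratio` parameter, and a spec bandwidth that is `None` for GPUs missing from the specs database. An empty YAML value such as `min_bw:` loads as `None` in PyYAML, which is the case the first branch guards against.

```python
from typing import Optional

# Floor named in commit fefef8e03b for GPUs not in the specs database.
UNKNOWN_GPU_FLOOR_GBPS = 10.0


def resolve_min_bw(
    configured: Optional[float],
    spec_bw_gbps: Optional[float],
    ratio: float = 0.5,  # hypothetical: fraction of spec bandwidth required to pass
) -> float:
    """Pick the minimum acceptable NCCL bus bandwidth in GB/s."""
    # 1. An explicit config value wins. An empty YAML key arrives as None,
    #    so it must be checked before any numeric comparison.
    if configured is not None:
        return float(configured)
    # 2. Otherwise derive the threshold from the detected GPU's spec.
    if spec_bw_gbps:
        return spec_bw_gbps * ratio
    # 3. Unknown GPU: fall back to a non-zero floor so the test can still fail.
    return UNKNOWN_GPU_FLOOR_GBPS


if __name__ == "__main__":
    print(resolve_min_bw(None, None))    # unknown GPU, empty config
    print(resolve_min_bw(None, 400.0))   # derived from spec
    print(resolve_min_bw(150, 400.0))    # explicit override
```

A floor of 0 would let every run pass on unrecognized hardware, which is why a small positive default is the safer fallback.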