test_gpu_scripts

Author	SHA1	Message	Date
hongshuai.dong	1db1313d50	fix(health): mark PCIe link downgrade as FAIL instead of WARN The PCIe link health check was producing inconsistent verdicts: when the negotiated link did not meet the GPU's expected Gen/Width (e.g. an H200 running at Gen4 instead of Gen5, or any GPU dropping below x16), the code correctly flipped overall_pass to False — but recorded the per-GPU status as "WARN" rather than "FAIL". This mismatch broke the convention used by every other check in the module (temperature, ECC, throttling), where FAIL is the only status that drives overall_pass=False, and WARN is purely informational. As a result the rendered Markdown / table output would show a yellow WARN badge for the affected GPU while the overall Health Check verdict came back red FAIL, leaving operators to wonder which signal to trust. A PCIe link downgrade is not a soft warning — it halves H2D/D2H bandwidth (Gen5 x16 ~64 GB/s -> Gen4 x16 ~32 GB/s), directly impacting data loading, checkpoint I/O, and ZeRO/offload throughput. For an acceptance-test tool this should be a hard failure, consistent with how overall_pass already treats it. Change: in modules/health_check.py, set status to "FAIL" (not "WARN") when pcie_ok is False. This applies to both the known-GPU path (Gen >= expected and Width >= 16) and the unknown-GPU fallback path (Width >= 8). No behavioral change to overall_pass — only the per-GPU status string is corrected so the table view, Markdown report, and the overall verdict now agree.	2026-05-10 17:23:51 +08:00
zulifeng	ef2ca11c58	fix: resolve FP8 benchmark, NCCL parsing, and report None-value bugs - benchmark.py: FP8 dtype now uses torch._scaled_mm() with scale tensors instead of torch.matmul() which does not support float8_e4m3fn on Hopper; fixes "addmm_cuda not implemented" error and enables FP8 TFLOPS measurement - nccl_test.py: fix two bugs causing all-zero bandwidth results 1. buffer size changed from -b 8 (8 bytes) to -b 8M -e 8G for meaningful load 2. column parser corrected: parts[2] is dtype string not time value; now reads time=parts[5], algbw=parts[6], busbw=parts[7] per nccl-tests format - report.py: replace .get(key, 0) with .get(key) or 0 at all bandwidth/stress fields to handle None values stored in result dicts (dict.get with default does not override an explicitly stored None) - .gitignore: exclude .claude/settings.local.json Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-10 15:53:12 +08:00
han.zhao	09f81973bc	Merge pull request 'hanks/test_gpu' (#1) from hanks/test_gpu into main Reviewed-on: #1	2026-05-07 21:34:56 +08:00
qinyusen	fefef8e03b	refactor: remove hardcoding, fix AMP bug, unify English output - Fix AMP autocast: bf16 now uses torch.amp.autocast (was skipped) - Fix NCCL threshold: unknown GPU gets 10 GB/s floor instead of 0 - Fix PCIe health check: use specs-driven pcie_gen, not hardcoded Gen4 - Remove hardcoded GPU lists: dynamic banner, CLI choices, version - Unknown GPU efficiency displays N/A instead of 0% - Unify all console output to English (stress_test, gpu_tester) - Use importlib.metadata for runtime version resolution - Remove dir="/tmp" from tempfile (use system default) 🤖 Generated with [Qoder][https://qoder.com]	2026-05-07 21:32:35 +08:00
qinyusen	f2158f6cd3	fix: resolve stress OOM, D2D efficiency calculation, NCCL execution failures Key changes: - stress_test: use torch.cuda.mem_get_info() for free memory instead of total, allocate 40% to avoid OOM when other processes occupy GPU memory - benchmark: fix D2D efficiency by comparing to NVLink per-direction bandwidth (not HBM), add H2D/D2H efficiency against PCIe peak - nccl_test: implement direct binary → mpirun → torchrun fallback chain, fix min_bw None bug when YAML value is empty - report: update memory section to use per-metric peak fields - install_deps.sh: add NCCL compatibility detection, enhance CUDA version detection with CUDA_HOME/standard paths, improve _map_cuda_tag logging - gpu_info: parse CUDA version from nvidia-smi header (query field removed in newer drivers) - health_check: parse throttle_reasons bitmask properly, ignore gpu_idle bit - gpu_tester: fix suite summary to exclude metadata keys from pass count 🤖 Generated with [Qoder][https://qoder.com]	2026-05-07 18:09:22 +08:00
qinyusen	24934bc182	feat: rewrite install_deps.sh with env isolation and add numpy to requirements - Complete rewrite of install_deps.sh (6-phase architecture): environment validation, uv-based venv isolation, CUDA auto-detection, idempotent native tool compilation, env.sh/run-gpu-tests generation - Add numpy>=1.24 to requirements.txt to align with pyproject.toml - Support --install-system-deps, --skip-pytorch, --rebuild, -y flags - Use subshells for compilation to prevent CWD pollution - Generate env.sh activation script and run-gpu-tests wrapper 🤖 Generated with [Qoder][https://qoder.com]	2026-05-07 01:32:13 +08:00
qinyusen	3e967dd34a	feat: add Ampere (A100/A800) support and generalize project naming - Expand GPU specs database to include A100/A800 with Ampere architecture parameters - Rename h200_tester.py to gpu_tester.py for architecture-neutral branding - Add driver/CUDA compatibility validation per GPU generation - Enhance report module with HTML and Markdown output formats - Improve nvbandwidth binary discovery (system paths, DCGM locations) - Add pyproject.toml with uv for dependency management - Update install_deps.sh, configs, and README for multi-architecture support 🤖 Generated with [Qoder][https://qoder.com]	2026-05-07 01:02:28 +08:00
qinyusen	07250af845	docs: update README for multi-GPU support with auto-detection guide Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>	2026-05-06 19:32:09 +08:00
qinyusen	2cb776d7d5	fix: generic branding, wire up report generation, fix --config flag Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>	2026-05-06 19:32:01 +08:00
qinyusen	52fe96f2f5	refactor: replace hardcoded H200 specs with dynamic GPU detection Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>	2026-05-06 19:31:51 +08:00
qinyusen	98e4977e28	add: GPU specs database with auto-detection (H100/H200/B200/B300) Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>	2026-05-06 19:31:44 +08:00
qinyusen	8f7539d9b0	add: research notes on GPU server testing frameworks and tools Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>	2026-04-25 17:24:06 +08:00
qinyusen	82cd4d5180	add: training simulation and report generation modules Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>	2026-04-25 17:24:01 +08:00
qinyusen	1c6ba4809a	add: stress test (gpu-burn) and RDMA/IB test modules Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>	2026-04-25 17:23:57 +08:00
qinyusen	eac1438227	add: NCCL test module (nccl-tests integration + torchrun fallback) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>	2026-04-25 17:23:54 +08:00
qinyusen	65f10dd365	add: benchmark module (nvbandwidth integration + PyTorch compute) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>	2026-04-25 17:23:49 +08:00
qinyusen	b6dff76ef7	add: health check module (temperature, power, ECC, PCIe, system checks) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>	2026-04-25 17:23:44 +08:00
qinyusen	f5fdde5fc1	add: GPU information module (nvidia-smi wrapper, NVLink topology) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>	2026-04-25 17:23:40 +08:00
qinyusen	d4f46b6394	add: CLI entry point with interactive menu and argument parsing Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>	2026-04-25 17:23:35 +08:00
qinyusen	65cf7feee5	add: dependency installation script (nvbandwidth, nccl-tests, gpu-burn) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>	2026-04-25 17:23:32 +08:00
qinyusen	418dc70efb	init: project scaffolding with README, config, and requirements Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>	2026-04-25 17:23:27 +08:00
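
Note on commit 1db1313d50: the PCIe verdict rule described in that message can be sketched as follows. This is a minimal illustration, not the code in modules/health_check.py. The function name `pcie_link_status`, its parameters, and the "PASS" status string are assumptions made for the example; only the thresholds (Gen >= expected and Width >= 16 for known GPUs, Width >= 8 for unknown GPUs) and the "FAIL" status come from the commit message.

```python
from typing import Optional, Tuple


def pcie_link_status(gen: int, width: int, expected_gen: Optional[int]) -> Tuple[bool, str]:
    """Return (pcie_ok, status) for one GPU's negotiated PCIe link.

    expected_gen is None for a GPU that is not in the specs database
    (hypothetical convention for this sketch).
    """
    if expected_gen is not None:
        # Known GPU: link must reach the expected generation at full x16 width.
        pcie_ok = gen >= expected_gen and width >= 16
    else:
        # Unknown GPU: fall back to a width-only floor.
        pcie_ok = width >= 8

    # The per-GPU status and the overall verdict are derived from the same flag,
    # so a downgraded link can never show WARN while the overall check fails.
    status = "PASS" if pcie_ok else "FAIL"
    return pcie_ok, status


# Example from the commit message: an H200 negotiated at Gen4 instead of Gen5.
print(pcie_link_status(gen=4, width=16, expected_gen=5))
```

Deriving the status string directly from `pcie_ok` keeps the table view, the Markdown report, and `overall_pass` in agreement by construction.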

21 Commits
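
Note on commit ef2ca11c58: the report.py change relies on a standard Python behavior, which the short sketch below demonstrates. The dictionary key `busbw` and the helper name `value_or_zero` are hypothetical and chosen only for illustration; the actual field names in report.py are not shown in the log.

```python
def value_or_zero(result: dict, key: str):
    """Return result[key], treating both a missing key and a stored None as 0."""
    return result.get(key) or 0


# A result dict in which a failed test explicitly stored None.
result = {"busbw": None}

# dict.get only uses its default when the key is absent, so None leaks through.
with_default = result.get("busbw", 0)

# The `or 0` form replaces the stored None (and any other falsy value) with 0.
with_or = value_or_zero(result, "busbw")

print(with_default, with_or)
```

One caveat with this pattern: `or 0` also converts other falsy values such as `0.0` or an empty string to the integer `0`, which is harmless for numeric bandwidth fields but worth keeping in mind elsewhere.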