-
017c981062
Remove remaining report docs from PR
h100-acceptance-current
cs
2026-05-26 00:44:56 +08:00
-
1c3c811254
Remove generated reports from PR
cs
2026-05-26 00:44:39 +08:00
-
7ec2da18bc
Clean report whitespace
cs
2026-05-26 00:15:48 +08:00
-
4dddab27b3
Add FP8 GEMM path comparison reports
cs
2026-05-26 00:13:33 +08:00
-
4484c731b6
Add H100 acceptance PR summary
cs
2026-05-23 20:37:19 +08:00
-
f80a3b3636
Add H100 acceptance delivery manifest
cs
2026-05-23 20:34:01 +08:00
-
639651ef24
Add H100 network escalation request
cs
2026-05-23 20:29:19 +08:00
-
edb4612cc6
Add H100 acceptance closure checklist
cs
2026-05-23 20:25:39 +08:00
-
1203b025a0
Document H100 acceptance entrypoint
cs
2026-05-23 20:22:15 +08:00
-
5b022d5849
Summarize current H100 acceptance status
cs
2026-05-23 20:15:01 +08:00
-
90c46e40b3
Archive all-collectives NCCL artifacts
cs
2026-05-23 20:11:22 +08:00
-
c2db68f608
Add multinode NCCL all collectives run
cs
2026-05-23 20:07:47 +08:00
-
e0cb796b0c
Analyze multinode NCCL artifact signals
cs
2026-05-23 19:50:51 +08:00
-
4d06639129
Record multinode NCCL artifacts run
cs
2026-05-23 19:45:03 +08:00
-
098d1715f2
Archive multinode NCCL raw artifacts
cs
2026-05-23 19:36:53 +08:00
-
7bc15742ea
Clarify multinode NCCL report thresholds
cs
2026-05-23 19:33:01 +08:00
-
c73d738557
Record multinode NCCL PDF matrix run
cs
2026-05-23 19:30:14 +08:00
-
8923270ce0
Add multinode NCCL PDF matrix runner
cs
2026-05-23 19:21:58 +08:00
-
2c5c31e451
Add single-node H100 all runner
cs
2026-05-23 19:16:40 +08:00
-
cadfbcfaa3
Add NCCL environment snapshot script
cs
2026-05-23 19:13:35 +08:00
-
ef56e5f15a
Add NCCL latest report index
cs
2026-05-23 18:59:45 +08:00
-
892f833ff4
Add NCCL network handoff plan
cs
2026-05-23 18:57:22 +08:00
-
f64e85efaf
Document NCCL environment equivalence gaps
cs
2026-05-23 18:54:35 +08:00
-
c183f5a9d1
Document NCCL deep diagnosis rerun
cs
2026-05-23 18:51:41 +08:00
-
b55666948c
Add multinode NCCL deep diagnosis tools
cs
2026-05-23 17:37:19 +08:00
-
24a7bd5c1b
Document NCCL graph comparison
cs
2026-05-23 17:32:03 +08:00
-
82c6316716
Document NCCL alltoall secondary sweep
cs
2026-05-23 17:28:28 +08:00
-
1813c11bbf
Compare NCCL allreduce alltoall counters
cs
2026-05-23 17:17:22 +08:00
-
edc469cee9
Document NCCL alltoall counter probe
cs
2026-05-23 17:13:03 +08:00
-
2e194ded14
Document PXN alltoall rail balancing
cs
2026-05-23 17:03:02 +08:00
-
619a471634
Tune multinode alltoall PXN behavior
cs
2026-05-23 17:00:03 +08:00
-
a64e964e3c
Add raw RDMA rail bandwidth evidence
cs
2026-05-23 16:46:15 +08:00
-
ce363b2f7a
Document missing NCCL network plugin
cs
2026-05-23 16:43:25 +08:00
-
e756f0b7b4
Document NCCL rail saturation evidence
cs
2026-05-23 16:42:27 +08:00
-
aa05ccab2e
Add NCCL PDF matrix topology report
cs
2026-05-23 16:35:24 +08:00
-
6c9f049b71
Tune multinode NCCL auto parameters
cs
2026-05-23 16:12:32 +08:00
-
1f907e9691
Validate NCCL 2.27 multinode GDR performance
cs
2026-05-23 15:58:21 +08:00
-
c660e04c99
Stabilize multinode NCCL launch diagnostics
cs
2026-05-23 15:49:14 +08:00
-
4b93fc785f
Add multinode NCCL diagnostic report
cs
2026-05-23 15:39:15 +08:00
-
4b17bafd53
Add multi-node NCCL sweep test
cs
2026-05-23 13:03:26 +08:00
-
86f15544d7
Add H100 acceptance test coverage and reports
cs
2026-05-23 10:41:09 +08:00
-
dd77a882f1
feat: merge cross-node RDMA into rdma_test.py + align H800 compute thresholds with H100
main
zulifeng
2026-05-25 19:38:43 +08:00
-
d0ab823766
update
dk/disk_benchmark
dukai
2026-05-25 19:36:53 +08:00
-
6cad5bca5d
update
dukai
2026-05-25 19:35:37 +08:00
-
6ecb0390e5
update
dukai
2026-05-25 19:16:18 +08:00
-
d0c527744b
feat: add disk benchmark script
dukai
2026-05-25 14:37:22 +08:00
-
e49ea32094
feat: add multi-node nccl test script
zulifeng
2026-05-25 14:19:02 +08:00
-
fc97a768cf
feat: update test metrics and pass/fail logic per H100 production acceptance standards
zulifeng
2026-05-13 14:52:41 +08:00
-
375d439abb
feat: add H20 support, improve compute test accuracy, and fix multiple stability issues
zulifeng
2026-05-12 21:41:46 +08:00
-
1db1313d50
fix(health): mark PCIe link downgrade as FAIL instead of WARN
donghongshuai
hongshuai.dong
2026-05-10 17:22:43 +08:00
-
ef2ca11c58
fix: resolve FP8 benchmark, NCCL parsing, and report None-value bugs
zulifeng
2026-05-10 15:53:12 +08:00
-
09f81973bc
Merge pull request 'hanks/test_gpu' (#1) from hanks/test_gpu into main
zulifeng
han.zhao
2026-05-07 21:34:56 +08:00
-
fefef8e03b
refactor: remove hardcoding, fix AMP bug, unify English output
hanks/test_gpu
qinyusen
2026-05-07 21:32:35 +08:00
-
f2158f6cd3
fix: resolve stress OOM, D2D efficiency calculation, NCCL execution failures
qinyusen
2026-05-07 18:09:22 +08:00
-
24934bc182
feat: rewrite install_deps.sh with env isolation and add numpy to requirements
qinyusen
2026-05-07 01:32:13 +08:00
-
3e967dd34a
feat: add Ampere (A100/A800) support and generalize project naming
qinyusen
2026-05-07 01:02:28 +08:00
-
07250af845
docs: update README for multi-GPU support with auto-detection guide
qinyusen
2026-05-06 19:32:09 +08:00
-
2cb776d7d5
fix: generic branding, wire up report generation, fix --config flag
qinyusen
2026-05-06 19:32:01 +08:00
-
52fe96f2f5
refactor: replace hardcoded H200 specs with dynamic GPU detection
qinyusen
2026-05-06 19:31:51 +08:00
-
98e4977e28
add: GPU specs database with auto-detection (H100/H200/B200/B300)
qinyusen
2026-05-06 19:31:44 +08:00
-
8f7539d9b0
add: research notes on GPU server testing frameworks and tools
qinyusen
2026-04-25 17:24:06 +08:00
-
82cd4d5180
add: training simulation and report generation modules
qinyusen
2026-04-25 17:24:01 +08:00
-
1c6ba4809a
add: stress test (gpu-burn) and RDMA/IB test modules
qinyusen
2026-04-25 17:23:57 +08:00
-
eac1438227
add: NCCL test module (nccl-tests integration + torchrun fallback)
qinyusen
2026-04-25 17:23:54 +08:00
-
65f10dd365
add: benchmark module (nvbandwidth integration + PyTorch compute)
qinyusen
2026-04-25 17:23:49 +08:00
-
b6dff76ef7
add: health check module (temperature, power, ECC, PCIe, system checks)
qinyusen
2026-04-25 17:23:44 +08:00
-
f5fdde5fc1
add: GPU information module (nvidia-smi wrapper, NVLink topology)
qinyusen
2026-04-25 17:23:40 +08:00
-
d4f46b6394
add: CLI entry point with interactive menu and argument parsing
qinyusen
2026-04-25 17:23:35 +08:00
-
65cf7feee5
add: dependency installation script (nvbandwidth, nccl-tests, gpu-burn)
qinyusen
2026-04-25 17:23:32 +08:00
-
418dc70efb
init: project scaffolding with README, config, and requirements
qinyusen
2026-04-25 17:23:27 +08:00