h100-acceptance-current #3

Open
cs wants to merge 41 commits from h100-acceptance-current into main

41 Commits

Author | SHA1 | Message | Date
cs | 017c981062 | Remove remaining report docs from PR | 2026-05-26 00:44:56 +08:00
cs | 1c3c811254 | Remove generated reports from PR | 2026-05-26 00:44:39 +08:00
cs | 7ec2da18bc | Clean report whitespace | 2026-05-26 00:15:48 +08:00
cs | 4dddab27b3 | Add FP8 GEMM path comparison reports | 2026-05-26 00:13:33 +08:00
cs | 4484c731b6 | Add H100 acceptance PR summary | 2026-05-26 00:12:59 +08:00
cs | f80a3b3636 | Add H100 acceptance delivery manifest | 2026-05-26 00:12:59 +08:00
cs | 639651ef24 | Add H100 network escalation request | 2026-05-26 00:12:59 +08:00
cs | edb4612cc6 | Add H100 acceptance closure checklist | 2026-05-26 00:12:59 +08:00
cs | 1203b025a0 | Document H100 acceptance entrypoint | 2026-05-26 00:12:59 +08:00
cs | 5b022d5849 | Summarize current H100 acceptance status | 2026-05-26 00:12:59 +08:00
cs | 90c46e40b3 | Archive all-collectives NCCL artifacts | 2026-05-26 00:12:59 +08:00
cs | c2db68f608 | Add multinode NCCL all collectives run | 2026-05-26 00:12:59 +08:00
cs | e0cb796b0c | Analyze multinode NCCL artifact signals | 2026-05-26 00:12:59 +08:00
cs | 4d06639129 | Record multinode NCCL artifacts run | 2026-05-26 00:12:59 +08:00
cs | 098d1715f2 | Archive multinode NCCL raw artifacts | 2026-05-26 00:12:59 +08:00
cs | 7bc15742ea | Clarify multinode NCCL report thresholds | 2026-05-26 00:12:59 +08:00
cs | c73d738557 | Record multinode NCCL PDF matrix run | 2026-05-26 00:12:55 +08:00
cs | 8923270ce0 | Add multinode NCCL PDF matrix runner | 2026-05-26 00:12:55 +08:00
cs | 2c5c31e451 | Add single-node H100 all runner | 2026-05-26 00:12:55 +08:00
cs | cadfbcfaa3 | Add NCCL environment snapshot script | 2026-05-26 00:12:55 +08:00
cs | ef56e5f15a | Add NCCL latest report index | 2026-05-26 00:12:55 +08:00
cs | 892f833ff4 | Add NCCL network handoff plan | 2026-05-26 00:12:55 +08:00
cs | f64e85efaf | Document NCCL environment equivalence gaps | 2026-05-26 00:12:55 +08:00
cs | c183f5a9d1 | Document NCCL deep diagnosis rerun | 2026-05-26 00:12:55 +08:00
cs | b55666948c | Add multinode NCCL deep diagnosis tools | 2026-05-26 00:12:55 +08:00
cs | 24a7bd5c1b | Document NCCL graph comparison | 2026-05-26 00:12:55 +08:00
cs | 82c6316716 | Document NCCL alltoall secondary sweep | 2026-05-26 00:12:55 +08:00
cs | 1813c11bbf | Compare NCCL allreduce alltoall counters | 2026-05-26 00:12:55 +08:00
cs | edc469cee9 | Document NCCL alltoall counter probe | 2026-05-26 00:12:55 +08:00
cs | 2e194ded14 | Document PXN alltoall rail balancing | 2026-05-26 00:12:55 +08:00
cs | 619a471634 | Tune multinode alltoall PXN behavior | 2026-05-26 00:12:54 +08:00
cs | a64e964e3c | Add raw RDMA rail bandwidth evidence | 2026-05-26 00:12:54 +08:00
cs | ce363b2f7a | Document missing NCCL network plugin | 2026-05-26 00:12:54 +08:00
cs | e756f0b7b4 | Document NCCL rail saturation evidence | 2026-05-26 00:12:54 +08:00
cs | aa05ccab2e | Add NCCL PDF matrix topology report | 2026-05-26 00:12:54 +08:00
cs | 6c9f049b71 | Tune multinode NCCL auto parameters | 2026-05-26 00:12:50 +08:00
cs | 1f907e9691 | Validate NCCL 2.27 multinode GDR performance | 2026-05-26 00:12:50 +08:00
cs | c660e04c99 | Stabilize multinode NCCL launch diagnostics | 2026-05-26 00:12:50 +08:00
cs | 4b93fc785f | Add multinode NCCL diagnostic report | 2026-05-26 00:12:43 +08:00
cs | 4b17bafd53 | Add multi-node NCCL sweep test | 2026-05-26 00:12:25 +08:00
cs | 86f15544d7 | Add H100 acceptance test coverage and reports | 2026-05-26 00:12:10 +08:00