Add H100 acceptance PR summary
parent 9b0e6e29df
commit 211140e7f1
README.md (11 lines changed)
@@ -15,11 +15,12 @@
 | 1 | [reports_h100_acceptance_current_status_20260523.md](reports_h100_acceptance_current_status_20260523.md) | Current overall status: tested items, failures, blockers, next steps |
 | 2 | [reports_h100_acceptance_closure_checklist_20260523.md](reports_h100_acceptance_closure_checklist_20260523.md) | Closure checklist: deliverable items, gates still open, shortest path to closure |
 | 3 | [reports_h100_acceptance_delivery_manifest_20260523.md](reports_h100_acceptance_delivery_manifest_20260523.md) | Delivery package manifest: entry points, scripts, remote artifacts, checksums |
-| 4 | [reports_h100_network_hardware_escalation_request_20260523.md](reports_h100_network_hardware_escalation_request_20260523.md) | Closure request and fill-in form for the network/hardware/environment teams |
-| 5 | [reports_multinode_nccl_latest_index_20260523.md](reports_multinode_nccl_latest_index_20260523.md) | Index of multi-node NCCL reports |
-| 6 | [reports_multinode_nccl_handoff_plan_20260523.md](reports_multinode_nccl_handoff_plan_20260523.md) | Plan for the next owner to rerun and continue root-cause analysis |
-| 7 | [reports_test_all_latest_summary_cn_20260523.md](reports_test_all_latest_summary_cn_20260523.md) | Raw Chinese-language summary of single-node `test all` |
-| 8 | [reports_rdma_cross_node_mlx5_0_20260523.md](reports_rdma_cross_node_mlx5_0_20260523.md) | Cross-node RDMA `mlx5_0` bidirectional results |
+| 4 | [reports_h100_acceptance_pr_summary_20260523.md](reports_h100_acceptance_pr_summary_20260523.md) | PR/review summary: change scope, verification, risks, merge notes |
+| 5 | [reports_h100_network_hardware_escalation_request_20260523.md](reports_h100_network_hardware_escalation_request_20260523.md) | Closure request and fill-in form for the network/hardware/environment teams |
+| 6 | [reports_multinode_nccl_latest_index_20260523.md](reports_multinode_nccl_latest_index_20260523.md) | Index of multi-node NCCL reports |
+| 7 | [reports_multinode_nccl_handoff_plan_20260523.md](reports_multinode_nccl_handoff_plan_20260523.md) | Plan for the next owner to rerun and continue root-cause analysis |
+| 8 | [reports_test_all_latest_summary_cn_20260523.md](reports_test_all_latest_summary_cn_20260523.md) | Raw Chinese-language summary of single-node `test all` |
+| 9 | [reports_rdma_cross_node_mlx5_0_20260523.md](reports_rdma_cross_node_mlx5_0_20260523.md) | Cross-node RDMA `mlx5_0` bidirectional results |

 Main blockers at present:

@@ -19,8 +19,9 @@
 | 1 | `README.md` | Repository entry point and current H100 acceptance entry point |
 | 2 | `reports_h100_acceptance_current_status_20260523.md` | Current overall status and blockers |
 | 3 | `reports_h100_acceptance_closure_checklist_20260523.md` | Deliverable items, gates still open, closure path |
-| 4 | `reports_h100_network_hardware_escalation_request_20260523.md` | Fill-in request for the network/hardware/environment teams |
-| 5 | `reports_multinode_nccl_latest_index_20260523.md` | Multi-node NCCL report index |
+| 4 | `reports_h100_acceptance_pr_summary_20260523.md` | PR/review summary |
+| 5 | `reports_h100_network_hardware_escalation_request_20260523.md` | Fill-in request for the network/hardware/environment teams |
+| 6 | `reports_multinode_nccl_latest_index_20260523.md` | Multi-node NCCL report index |

 ## Core reports

@@ -28,6 +29,7 @@
 |---|---|---|
 | Overview | `reports_h100_acceptance_current_status_20260523.md` | FAIL; evidence chain is complete but the gates have not passed |
 | Closure | `reports_h100_acceptance_closure_checklist_20260523.md` | Deliverable as an interim milestone; cannot be judged as production pass |
+| PR summary | `reports_h100_acceptance_pr_summary_20260523.md` | For code review and merge notes |
 | Closure request | `reports_h100_network_hardware_escalation_request_20260523.md` | Awaiting input from the network/hardware/environment teams |
 | Single node | `reports_test_all_latest_summary_cn_20260523.md` | Both nodes `6/10 PASS`; overall FAIL |
 | Cross-node RDMA | `reports_rdma_cross_node_mlx5_0_20260523.md` | write BW PASS; read BW/latency FAIL |
@@ -113,9 +115,10 @@ fa5961d47a5905da6ebc6c726421d73ddc2314a316a8f578683d31fe69c256e5 reports/multin
 The hashes below confirm that the entry files are identical locally and on both remote nodes. This manifest does not include a self-referential hash of itself.

 ```text
-bf3fd8197285dca964b78c584ee6263b0d0f4d47fbf689d121367666d3398231 README.md
+e2faf6cbd968924727c669827d7e838d5165ee961133c8e55e8993134b5e7b63 README.md
 846c3da4ac655a0b3ad072e4c4475d91b55e2bdc9d8aedb9c5f9d800608fb64c reports_h100_acceptance_current_status_20260523.md
 4a0ee9f456acc1284bf3a42df5bd338affb831471c27ca4b6584201acd72fd52 reports_h100_acceptance_closure_checklist_20260523.md
+0c71f36b9b1a6c5a73bd32337a56a702d3faa37c02640b93cb5d00b9b80c362f reports_h100_acceptance_pr_summary_20260523.md
 45438db9204ceef5f65019a6594c016f3183799ed3b89dcf40f383a34f9e3466 reports_h100_network_hardware_escalation_request_20260523.md
 d982d6f3698e8860b8505d65105f6056c11f1f72758401a4613ae8315b6f92d0 reports_multinode_nccl_latest_index_20260523.md
 8fca70e703961745d5bdacaa3fccb814709c426c0fa7713d0df2d1f2fb26a3f4 reports_multinode_nccl_handoff_plan_20260523.md
reports_h100_acceptance_pr_summary_20260523.md (new file, 144 lines added)
@@ -0,0 +1,144 @@
# H100 Acceptance Branch PR Summary 2026-05-23

## Suggested PR title

```text
Add H100 acceptance evidence, multinode NCCL runs, and handoff reports
```

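## PR conclusion

This PR completes the interim, test-side deliverable for H100 acceptance: scripts, single-node reports, multi-node NCCL reports, RDMA evidence, artifacts, checksums, Chinese-language notes, and handoff documents are all in place.

However, this PR **does not mean production acceptance has passed**. Under the current PDF/configuration criteria, both H100 nodes are still `FAIL`. The network/hardware/environment teams must supply the requested information or apply fixes before the tests are rerun.
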
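## Change scope

### Test entry points

- Added/improved the single-node H100 `test all` entry point.
- Added a rerun entry point for the multi-node NCCL PDF matrix.
- Added a rerun entry point for the six multi-node 2x8 collectives.
- Added entry points for NCCL deep diagnostics and environment snapshots.

### Configuration

- Multi-node PDF matrix configuration with the NCCL 2.27.7 / nccl-tests paths pinned.
- Added configuration for the six 2x8 collectives.
- `allreduce/alltoall` keep the known PDF 2x8 thresholds; the newly added `broadcast/reducescatter/allgather/sendrecv` are treated as evidence collection only for now.

### Reports and evidence

- Chinese-language summary of single-node `test all`.
- Cross-node RDMA `mlx5_0` bidirectional evidence.
- Multi-node NCCL PDF matrix: Chinese-language summary, raw report, artifacts manifest.
- Multi-node 2x8 six collectives: Chinese-language summary, raw report, artifacts manifest.
- NCCL artifact signal analysis, environment equivalence analysis, handoff plan, closure checklist.
- Closure request for the network/hardware/environment teams and the delivery package manifest.
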
## Current acceptance status

| Scope | Verdict | Notes |
|---|---|---|
| Single-node `test all` | FAIL | Both nodes `6/10 PASS`; Compute, NCCL, Stress, and RDMA did not pass |
| Cross-node RDMA | FAIL | write BW PASS; read BW and latency did not meet their thresholds |
| Multi-node NCCL PDF matrix | FAIL | Of 8 cases, only 2x2 allreduce passes on performance; all cases pass correctness |
| Multi-node 2x8 six collectives | FAIL / evidence complete | All 6 pass correctness; allreduce/alltoall FAIL against the PDF thresholds |
| Environment equivalence | Not proven | Each node currently has only four 400G rails; no evidence of an external NCCL net plugin / SHARP |

## Key results

### Single node

```text
aikubeworker0012: 6/10 PASS, PDF acceptance FAIL
aikubeworker0016: 6/10 PASS, PDF acceptance FAIL
```

### Cross-node RDMA

```text
ib_write_bw: 48.38-49.35 GB/s, PASS
ib_read_bw: 44.36-44.37 GB/s, FAIL
ib_write_lat avg: 2.13-2.17 us, FAIL
ib_read_lat avg: 4.05-4.08 us, FAIL
```

### Multi-node NCCL PDF matrix

| Topology | AllReduce (GB/s) | Target (GB/s) | Status | AllToAll (GB/s) | Target (GB/s) | Status |
|---|---:|---:|---|---:|---:|---|
| 2 nodes x 1 GPU | 47.29 | 48.90 | FAIL | 24.85 | 27.25 | FAIL |
| 2 nodes x 2 GPUs | 137.16 | 136.93 | PASS | 47.76 | 54.41 | FAIL |
| 2 nodes x 4 GPUs | 335.07 | 335.48 | FAIL | 72.74 | 73.73 | FAIL |
| 2 nodes x 8 GPUs | 353.85 | 491.84 | FAIL | 36.83 | 76.54 | FAIL |

All NCCL cases report `returncode=0` and `wrong=0`; the current failures come from the performance thresholds, not from functional errors.

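
The Status columns above can be reproduced with a small check. This is a minimal sketch, assuming a case passes when the measured bandwidth is at or above its target (an assumption that is consistent with every row of the table, not a rule taken from the repository scripts). `verdict` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print PASS when measured >= target, else FAIL.
# The >= rule is an assumption inferred from the table above.
verdict() {
  awk -v got="$1" -v target="$2" \
    'BEGIN { if (got + 0 >= target + 0) print "PASS"; else print "FAIL" }'
}

# AllReduce rows from the table: topology, measured, target.
while read -r topo got target; do
  printf '%s allreduce %s\n' "${topo}" "$(verdict "${got}" "${target}")"
done <<'EOF'
2x1 47.29 48.90
2x2 137.16 136.93
2x4 335.07 335.48
2x8 353.85 491.84
EOF
```

The same helper applies to the AllToAll columns by substituting the corresponding measured and target values.
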
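## Main risks

1. **Merging this PR must not be read as acceptance passing.**
   The current result is explicitly `FAIL`; what this PR delivers is the evidence chain and the ability to rerun.

2. **The PDF 2x8 allreduce threshold may require stronger rail/plugin capability than the current environment has.**
   Each node currently has only four 400G IB rails. The PDF 2x8 allreduce target of `491.84 GB/s busbw` implies an algbw of `262.31 GB/s`, which exceeds the `200 GB/s` theoretical one-way raw bandwidth of 4 x 400G rails.

3. **alltoall needs further diagnosis on the network side.**
   With `NCCL_PXN_DISABLE=1` the rails are better balanced, but 2x8 alltoall still reaches only `36-37 GB/s`.

4. **The single-node gates have not passed either.**
   Even if multi-node NCCL is resolved later, the single-node Compute, Stress, and RDMA items still need to be closed out.
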
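
The arithmetic behind risk 2 can be checked directly. The sketch below assumes the nccl-tests allreduce convention `busbw = algbw * 2(n-1)/n` with n = 16 ranks (2 nodes x 8 GPUs), and converts the rail line rate from Gb/s to GB/s by dividing by 8. The function names are illustrative only and are not part of the repository.

```bash
#!/usr/bin/env bash
# Illustrative helpers (names are assumptions, not repository code).

# allreduce: busbw = algbw * 2*(n-1)/n, so algbw = busbw * n / (2*(n-1)).
allreduce_algbw() {
  awk -v busbw="$1" -v n="$2" \
    'BEGIN { printf "%.2f\n", busbw * n / (2 * (n - 1)) }'
}

# Aggregate one-way raw bandwidth of the rails, converted from Gb/s to GB/s.
rails_gbytes_per_s() {
  awk -v rails="$1" -v gbps="$2" \
    'BEGIN { printf "%.2f\n", rails * gbps / 8 }'
}

echo "implied algbw: $(allreduce_algbw 491.84 16) GB/s"
echo "rail ceiling:  $(rails_gbytes_per_s 4 400) GB/s"
```

With the PDF target this gives an implied algbw of 262.31 GB/s against a 200.00 GB/s raw ceiling, which is the gap described in risk 2.
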
## Verification

Completed:

- `git diff --check`
- sha256 comparison of the entry files between the local copy and the two remote nodes
- Multi-node NCCL PDF matrix rerun, with artifacts archived
- Multi-node 2x8 six-collective rerun, with artifacts archived
- Cross-node RDMA single-rail bidirectional test
- Single-node `test all` summary

Remote sync paths:

```text
nccl-gpu-1: /root/test_gpu_scripts
nccl-gpu-2: /root/test_gpu_scripts
```

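
The sha256 comparison in the list above could be scripted roughly as follows. This is a sketch only: `compare_tree` is a hypothetical helper, and it assumes the remote copies have already been fetched into a local directory, since the transfer method is not stated here. It relies on the GNU coreutils `sha256sum` command.

```bash
#!/usr/bin/env bash
# Hypothetical helper: compare sha256 digests of the named files in two trees.
# Prints one line per file; returns non-zero if any file is missing or differs.
compare_tree() {
  local ref_dir=$1 other_dir=$2
  shift 2
  local f ref_sum other_sum rc=0
  for f in "$@"; do
    if [ ! -f "${ref_dir}/${f}" ] || [ ! -f "${other_dir}/${f}" ]; then
      echo "MISSING ${f}"
      rc=1
      continue
    fi
    ref_sum=$(sha256sum "${ref_dir}/${f}" | awk '{print $1}')
    other_sum=$(sha256sum "${other_dir}/${f}" | awk '{print $1}')
    if [ "${ref_sum}" = "${other_sum}" ]; then
      echo "OK ${f}"
    else
      echo "MISMATCH ${f}"
      rc=1
    fi
  done
  return "${rc}"
}

# Example call (the second directory is a placeholder for a fetched remote copy):
# compare_tree . /tmp/nccl-gpu-1_copy README.md \
#   reports_h100_acceptance_current_status_20260523.md
```

Comparing digests file by file, rather than whole directories, keeps the output aligned with the per-file hash list in the delivery manifest.
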
## Rerun commands

```bash
cd /root/test_gpu_scripts
bash scripts/multinode_nccl_deep_diagnose.sh preflight
bash scripts/run_multinode_nccl_pdf_matrix.sh
bash scripts/run_multinode_nccl_all_collectives.sh
```

Single-node rerun:

```bash
cd /root/test_gpu_scripts
bash scripts/run_h100_single_node_all.sh
```

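
For unattended reruns, the commands above can be chained so that a step that fails to execute stops the sequence. This is a minimal sketch, assuming a non-zero exit status means the step could not run. It says nothing about acceptance verdicts, which come from the generated reports. `run_steps` is a hypothetical wrapper, not a script in the repository.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: run each command line in order and
# stop at the first one that exits with a non-zero status.
run_steps() {
  local step
  for step in "$@"; do
    echo "==> ${step}"
    if ! bash -c "${step}"; then
      echo "step failed: ${step}" >&2
      return 1
    fi
  done
}

# Example, using the multi-node commands from this section:
# cd /root/test_gpu_scripts && run_steps \
#   "bash scripts/multinode_nccl_deep_diagnose.sh preflight" \
#   "bash scripts/run_multinode_nccl_pdf_matrix.sh" \
#   "bash scripts/run_multinode_nccl_all_collectives.sh"
```
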
## What reviewers should focus on

| File | Why it matters |
|---|---|
| `reports_h100_acceptance_current_status_20260523.md` | Current overview and failed items |
| `reports_h100_acceptance_delivery_manifest_20260523.md` | Delivery package entry points, remote artifacts, checksums |
| `reports_h100_network_hardware_escalation_request_20260523.md` | Questions the network/hardware/environment teams need to answer |
| `reports_multinode_nccl_environment_gap_20260523.md` | Why the current environment cannot be shown to be equivalent to the PDF |
| `reports_multinode_nccl_pdf_matrix_run_20260523.md` | Multi-node PDF matrix results |
| `reports_multinode_nccl_all_collectives_run_20260523.md` | Results of the supplementary six-collective run |

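## Merge recommendation

This can be merged as the test-side delivery branch, but the merge notes must retain:

```text
H100 production acceptance has not passed; this branch delivers test evidence, rerun scripts, and the closure request.
Final acceptance must wait for the network/hardware/environment teams to confirm or fix the issues, followed by a rerun.
```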