From 9b0e6e29dffdb34368071465212e969c73adb017 Mon Sep 17 00:00:00 2001
From: cs
Date: Sat, 23 May 2026 20:34:01 +0800
Subject: [PATCH] Add H100 acceptance delivery manifest

---
 README.md                                     |  11 +-
 ...0_acceptance_closure_checklist_20260523.md |  10 +-
 ...h100_acceptance_current_status_20260523.md |  16 +-
 ...0_acceptance_delivery_manifest_20260523.md | 149 ++++++++++++++++++
 ...ts_multinode_nccl_latest_index_20260523.md |  19 ++-
 5 files changed, 181 insertions(+), 24 deletions(-)
 create mode 100644 reports_h100_acceptance_delivery_manifest_20260523.md

diff --git a/README.md b/README.md
index ea763a1..80e954d 100644
--- a/README.md
+++ b/README.md
@@ -14,11 +14,12 @@
 |---|---|---|
 | 1 | [reports_h100_acceptance_current_status_20260523.md](reports_h100_acceptance_current_status_20260523.md) | Current overall status: tested items, failed items, blockers, next steps |
 | 2 | [reports_h100_acceptance_closure_checklist_20260523.md](reports_h100_acceptance_closure_checklist_20260523.md) | Closure checklist: deliverable items, unclosed gates, shortest path to closure |
-| 3 | [reports_h100_network_hardware_escalation_request_20260523.md](reports_h100_network_hardware_escalation_request_20260523.md) | Closure request and backfill form for the network/hardware/environment teams |
-| 4 | [reports_multinode_nccl_latest_index_20260523.md](reports_multinode_nccl_latest_index_20260523.md) | Index of multi-node NCCL reports |
-| 5 | [reports_multinode_nccl_handoff_plan_20260523.md](reports_multinode_nccl_handoff_plan_20260523.md) | Plan for the next owner to rerun tests and continue troubleshooting |
-| 6 | [reports_test_all_latest_summary_cn_20260523.md](reports_test_all_latest_summary_cn_20260523.md) | Original Chinese-language summary of single-node `test all` |
-| 7 | [reports_rdma_cross_node_mlx5_0_20260523.md](reports_rdma_cross_node_mlx5_0_20260523.md) | Cross-node RDMA `mlx5_0` bidirectional results |
+| 3 | [reports_h100_acceptance_delivery_manifest_20260523.md](reports_h100_acceptance_delivery_manifest_20260523.md) | Delivery package manifest: entry points, scripts, remote artifacts, checksums |
+| 4 | [reports_h100_network_hardware_escalation_request_20260523.md](reports_h100_network_hardware_escalation_request_20260523.md) | Closure request and backfill form for the network/hardware/environment teams |
+| 5 | [reports_multinode_nccl_latest_index_20260523.md](reports_multinode_nccl_latest_index_20260523.md) | Index of multi-node NCCL reports |
+| 6 | [reports_multinode_nccl_handoff_plan_20260523.md](reports_multinode_nccl_handoff_plan_20260523.md) | Plan for the next owner to rerun tests and continue troubleshooting |
+| 7 | [reports_test_all_latest_summary_cn_20260523.md](reports_test_all_latest_summary_cn_20260523.md) | Original Chinese-language summary of single-node `test all` |
+| 8 | [reports_rdma_cross_node_mlx5_0_20260523.md](reports_rdma_cross_node_mlx5_0_20260523.md) | Cross-node RDMA `mlx5_0` bidirectional results |
 
 Current main blockers:
 
diff --git a/reports_h100_acceptance_closure_checklist_20260523.md b/reports_h100_acceptance_closure_checklist_20260523.md
index 670c146..6b0264f 100644
--- a/reports_h100_acceptance_closure_checklist_20260523.md
+++ b/reports_h100_acceptance_closure_checklist_20260523.md
@@ -22,6 +22,7 @@
 | Multi-node 2x8 six collectives | Done | `scripts/run_multinode_nccl_all_collectives.sh`, `reports_multinode_nccl_all_collectives_run_20260523.md` |
 | NCCL artifacts / checksum | Done | `reports_multinode_nccl_pdf_matrix_artifacts_manifest_20260523_113803.md`, `reports_multinode_nccl_all_collectives_artifacts_manifest_20260523_120144.md` |
 | Environment equivalence analysis | Done | `reports_multinode_nccl_environment_gap_20260523.md` |
+| Delivery package manifest | Done | `reports_h100_acceptance_delivery_manifest_20260523.md` |
 | Network/hardware/environment closure request | Done | `reports_h100_network_hardware_escalation_request_20260523.md` |
 | Handoff runbook / README entry point | Done | `README.md`, `reports_multinode_nccl_handoff_plan_20260523.md` |
 
@@ -89,10 +90,11 @@ bash scripts/run_multinode_nccl_all_collectives.sh
 
 1. `reports_h100_acceptance_current_status_20260523.md`
 2. `reports_h100_acceptance_closure_checklist_20260523.md`
-3. `reports_h100_network_hardware_escalation_request_20260523.md`
-4. `reports_multinode_nccl_handoff_plan_20260523.md`
-5. `reports_multinode_nccl_environment_gap_20260523.md`
-6. `reports_multinode_nccl_latest_index_20260523.md`
+3. `reports_h100_acceptance_delivery_manifest_20260523.md`
+4. `reports_h100_network_hardware_escalation_request_20260523.md`
+5. `reports_multinode_nccl_handoff_plan_20260523.md`
+6. `reports_multinode_nccl_environment_gap_20260523.md`
+7. `reports_multinode_nccl_latest_index_20260523.md`
 
 The project can currently be reported externally as:
 
diff --git a/reports_h100_acceptance_current_status_20260523.md b/reports_h100_acceptance_current_status_20260523.md
index 8b74012..0686918 100644
--- a/reports_h100_acceptance_current_status_20260523.md
+++ b/reports_h100_acceptance_current_status_20260523.md
@@ -15,6 +15,7 @@
 | NCCL artifacts signal | `reports_multinode_nccl_artifact_signal_analysis_20260523.md` | Basic link normal | IB/GDRDMA/HCA all normal; no SHARP/CollNet/external net plugin |
 | Environment equivalence | `reports_multinode_nccl_environment_gap_20260523.md` | Equivalence not demonstrated | Only 4 x 400G rails per node; NCCL net plugin / SHARP missing |
 | Closure check | `reports_h100_acceptance_closure_checklist_20260523.md` | Ready for phased delivery | Production acceptance gates are still open |
+| Delivery package manifest | `reports_h100_acceptance_delivery_manifest_20260523.md` | Prepared | Entry points, scripts, remote artifacts, and checksums consolidated |
 | Network/hardware/environment closure | `reports_h100_network_hardware_escalation_request_20260523.md` | Request prepared | Awaiting backfill on rail/plugin/SHARP/switching policy/threshold criteria |
 
 ## Completed Capabilities
@@ -153,10 +154,11 @@ NCCL logs show no sign of SHARP/CollNet; the internal IB plugin is currently in use.
 |---:|---|---|
 | 1 | `reports_h100_acceptance_current_status_20260523.md` | Current overview and blocker list |
 | 2 | `reports_h100_acceptance_closure_checklist_20260523.md` | Closure checklist and closing conditions |
-| 3 | `reports_h100_network_hardware_escalation_request_20260523.md` | Closure request for the network/hardware/environment teams |
-| 4 | `reports_multinode_nccl_handoff_plan_20260523.md` | Handoff plan for the network/hardware/environment teams |
-| 5 | `reports_multinode_nccl_environment_gap_20260523.md` | Gaps in equivalence with the PDF environment |
-| 6 | `reports_multinode_nccl_artifact_signal_analysis_20260523.md` | NCCL artifacts signal analysis |
-| 7 | `reports_multinode_nccl_all_collectives_run_20260523.md` | Summary of the supplementary multi-node 2x8 six-collective run |
-| 8 | `reports_test_all_latest_summary_cn_20260523.md` | Chinese-language summary of single-node test all |
-| 9 | `reports_rdma_cross_node_mlx5_0_20260523.md` | Cross-node RDMA single-rail evidence |
+| 3 | `reports_h100_acceptance_delivery_manifest_20260523.md` | Delivery package manifest and checksums |
+| 4 | `reports_h100_network_hardware_escalation_request_20260523.md` | Closure request for the network/hardware/environment teams |
+| 5 | `reports_multinode_nccl_handoff_plan_20260523.md` | Handoff plan for the network/hardware/environment teams |
+| 6 | `reports_multinode_nccl_environment_gap_20260523.md` | Gaps in equivalence with the PDF environment |
+| 7 | `reports_multinode_nccl_artifact_signal_analysis_20260523.md` | NCCL artifacts signal analysis |
+| 8 | `reports_multinode_nccl_all_collectives_run_20260523.md` | Summary of the supplementary multi-node 2x8 six-collective run |
+| 9 | `reports_test_all_latest_summary_cn_20260523.md` | Chinese-language summary of single-node test all |
+| 10 | `reports_rdma_cross_node_mlx5_0_20260523.md` | Cross-node RDMA single-rail evidence |
diff --git a/reports_h100_acceptance_delivery_manifest_20260523.md b/reports_h100_acceptance_delivery_manifest_20260523.md
new file mode 100644
index 0000000..1de9278
--- /dev/null
+++ b/reports_h100_acceptance_delivery_manifest_20260523.md
@@ -0,0 +1,149 @@
+# H100 Acceptance Delivery Package Manifest 2026-05-23
+
+## Delivery Conclusion
+
+Current branch: `h100-acceptance-current`
+
+Latest commit: as reported by `git log -1 --oneline`.
+
+Current status: **Test-side phased delivery is complete; production acceptance has not passed.**
+
+This delivery package covers single-node `test all`, cross-node RDMA, the multi-node NCCL PDF matrix, the multi-node 2x8 six collectives, the environment equivalence analysis, the network/hardware/environment closure request, rerun scripts, and artifact checksums. The remaining work can move toward final acceptance only after the network/hardware/environment teams confirm their side.
+
+## Main Entry Points
+
+Read in the following order:
+
+| Order | File | Purpose |
+|---:|---|---|
+| 1 | `README.md` | Repository entry point and current H100 acceptance entry point |
+| 2 | `reports_h100_acceptance_current_status_20260523.md` | Current overall status and blockers |
+| 3 | `reports_h100_acceptance_closure_checklist_20260523.md` | Deliverable items, unclosed gates, closure path |
+| 4 | `reports_h100_network_hardware_escalation_request_20260523.md` | Backfill request for the network/hardware/environment teams |
+| 5 | `reports_multinode_nccl_latest_index_20260523.md` | Multi-node NCCL report index |
+
+## Core Reports
+
+| Category | File | Current conclusion |
+|---|---|---|
+| Overview | `reports_h100_acceptance_current_status_20260523.md` | FAIL; the evidence chain is complete but the gates have not passed |
+| Closure | `reports_h100_acceptance_closure_checklist_20260523.md` | Ready for phased delivery; cannot be judged a production pass |
+| Closure request | `reports_h100_network_hardware_escalation_request_20260523.md` | Awaiting backfill from the network/hardware/environment teams |
+| Single node | `reports_test_all_latest_summary_cn_20260523.md` | Both nodes `6/10 PASS`; overall FAIL |
+| Cross-node RDMA | `reports_rdma_cross_node_mlx5_0_20260523.md` | write BW PASS; read BW/latency FAIL |
+| Multi-node NCCL PDF matrix | `reports_multinode_nccl_pdf_matrix_run_20260523.md` | Only 1 of 8 cases passes on performance; correctness OK in all cases |
+| Multi-node NCCL six collectives | `reports_multinode_nccl_all_collectives_run_20260523.md` | Correctness OK for all 6; allreduce/alltoall FAIL against the PDF thresholds |
+| Environment equivalence | `reports_multinode_nccl_environment_gap_20260523.md` | Equivalence with the PDF environment cannot currently be demonstrated |
+| NCCL artifact signal | `reports_multinode_nccl_artifact_signal_analysis_20260523.md` | IB/GDRDMA normal; external plugin/SHARP missing |
+| Handoff plan | `reports_multinode_nccl_handoff_plan_20260523.md` | For whoever continues troubleshooting and reruns the tests |
+
+## Rerun Entry Points
+
+| Script | Purpose | Recommended execution location |
+|---|---|---|
+| `scripts/run_h100_single_node_all.sh` | Full single-node H100 acceptance | Run separately on each of the two nodes |
+| `scripts/run_multinode_nccl_pdf_matrix.sh` | Multi-node NCCL PDF matrix | `nccl-gpu-1` |
+| `scripts/run_multinode_nccl_all_collectives.sh` | Multi-node 2x8 six collectives | `nccl-gpu-1` |
+| `scripts/multinode_nccl_deep_diagnose.sh` | Multi-node NCCL deep diagnosis | `nccl-gpu-1` |
+| `scripts/nccl_environment_snapshot.sh` | Single-node HCA/plugin/topo snapshot | Run separately on each of the two nodes |
+
+Recommended rerun order:
+
+```bash
+cd /root/test_gpu_scripts
+bash scripts/multinode_nccl_deep_diagnose.sh preflight
+bash scripts/run_multinode_nccl_pdf_matrix.sh
+bash scripts/run_multinode_nccl_all_collectives.sh
+```
+
+If the network/hardware/environment teams have changed single-node conditions, also run the following on each of the two nodes:
+
+```bash
+cd /root/test_gpu_scripts
+bash scripts/run_h100_single_node_all.sh
+```
+
+## Remote Locations
+
+Default path on both remote nodes:
+
+```text
+nccl-gpu-1: /root/test_gpu_scripts
+nccl-gpu-2: /root/test_gpu_scripts
+```
+
+The latest raw multi-node NCCL artifacts are on `nccl-gpu-1`:
+
+| Type | Path |
+|---|---|
+| PDF matrix raw report | `/root/test_gpu_scripts/reports/multinode_nccl_pdf_matrix_20260523_113803.md` |
+| PDF matrix artifacts dir | `/root/test_gpu_scripts/reports/multinode_nccl_pdf_matrix_20260523_113803_artifacts` |
+| PDF matrix artifacts tar | `/root/test_gpu_scripts/reports/multinode_nccl_pdf_matrix_20260523_113803_artifacts.tar.gz` |
+| Six-collective raw report | `/root/test_gpu_scripts/reports/multinode_nccl_all_collectives_20260523_120144.md` |
+| Six-collective artifacts dir | `/root/test_gpu_scripts/reports/multinode_nccl_all_collectives_20260523_120144_artifacts` |
+| Six-collective artifacts tar | `/root/test_gpu_scripts/reports/multinode_nccl_all_collectives_20260523_120144_artifacts.tar.gz` |
+
+## Artifact Verification
+
+PDF matrix bundle checksum:
+
+```text
+682ac637460472d464a0d56ccc0f3335ed7f79a270157a403ebec23b8d9feceb  reports/multinode_nccl_pdf_matrix_20260523_113803.md
+7371fcaf7269f92eb1544e5e63573ebf77f4ae38f668b5b22169ca86e6d603ee  reports/multinode_nccl_pdf_matrix_20260523_113803_artifacts.tar.gz
+```
+
+Six-collective bundle checksum:
+
+```text
+06c565281813c4260da9cfee8f0b0289b61b3be95c01dd670c71fa1a441133e3  reports/multinode_nccl_all_collectives_20260523_120144.md
+fa5961d47a5905da6ebc6c726421d73ddc2314a316a8f578683d31fe69c256e5  reports/multinode_nccl_all_collectives_20260523_120144_artifacts.tar.gz
+```
+
+Per-file checksums:
+
+| File | Purpose |
+|---|---|
+| `reports_multinode_nccl_all_collectives_20260523_120144_bundle.sha256` | Six-collective raw report + tar checksum |
+| `reports_multinode_nccl_all_collectives_20260523_120144_artifacts.sha256` | Per-file checksums of the six-collective artifacts |
+| `reports_multinode_nccl_pdf_matrix_artifacts_manifest_20260523_113803.md` | PDF matrix case summary + bundle checksum |
+| `reports_multinode_nccl_all_collectives_artifacts_manifest_20260523_120144.md` | Six-collective case summary + bundle/per-file checksum |
+
+## Entry File SHA256
+
+The hashes below are used to confirm that the local entry files match the copies on both remote nodes. This manifest does not carry a self-referential hash of itself.
+
+```text
+bf3fd8197285dca964b78c584ee6263b0d0f4d47fbf689d121367666d3398231  README.md
+846c3da4ac655a0b3ad072e4c4475d91b55e2bdc9d8aedb9c5f9d800608fb64c  reports_h100_acceptance_current_status_20260523.md
+4a0ee9f456acc1284bf3a42df5bd338affb831471c27ca4b6584201acd72fd52  reports_h100_acceptance_closure_checklist_20260523.md
+45438db9204ceef5f65019a6594c016f3183799ed3b89dcf40f383a34f9e3466  reports_h100_network_hardware_escalation_request_20260523.md
+d982d6f3698e8860b8505d65105f6056c11f1f72758401a4613ae8315b6f92d0  reports_multinode_nccl_latest_index_20260523.md
+8fca70e703961745d5bdacaa3fccb814709c426c0fa7713d0df2d1f2fb26a3f4  reports_multinode_nccl_handoff_plan_20260523.md
+b0d0d1fa9b1aa0d8cbdd2672508df5c7bafffc91b607b35b199e624352147e75  reports_multinode_nccl_environment_gap_20260523.md
+a7bc27c630fb97c0b83a3427ed4017a16a21e1285f4be5a2cc21f653921fab2b  reports_multinode_nccl_pdf_matrix_run_20260523.md
+60bdb85e087e796d59c6f0cb7e79c7e60b4147b5fff8c6b60606f6c1f53b1b58  reports_multinode_nccl_all_collectives_run_20260523.md
+6affec63694d31dc2d7f097210794e7821e931b8c8b9ac8f451c6f7948bf138a  reports_test_all_latest_summary_cn_20260523.md
+3895cdf040220aa13093c3377c301580120f04eb9958dbb7c3df3d7285c2d733  reports_rdma_cross_node_mlx5_0_20260523.md
+```
+
+## Items That Cannot Be Closed Yet
+
+| Item | Current blocker |
+|---|---|
+| Single-node Compute | Absolute TFLOPS thresholds missed for multiple dtypes; spread exceeds 3% on some GPUs |
+| Single-node NCCL | Thresholds missed for multiple ops/sizes, most noticeably for small messages and some 2G cases |
+| Single-node Stress | The full 30 minutes completes, but the temperature delta and `sw_power_cap` throttling trigger FAIL |
+| Single-node RDMA | read BW below 47 GB/s; some ports are not 400G |
+| Cross-node RDMA | read BW and write/read latency miss their thresholds |
+| Multi-node NCCL allreduce | 2x8 `353.85 GB/s`; PDF target `491.84 GB/s` |
+| Multi-node NCCL alltoall | 2x8 `36.83 GB/s`; PDF target `76.54 GB/s` |
+| PDF environment equivalence | Only 4 x 400G rails at present; no evidence of an NCCL net plugin / SHARP |
+
+## Next Closure Conditions
+
+The network/hardware/environment teams need to provide one of the following conclusions:
+
+1. The two machines have been repaired to a state equivalent to the PDF reference environment, and the test side reruns.
+2. The machines are not equivalent to the PDF reference environment, but new thresholds or waiver criteria are acceptable.
+3. The current hardware/network does not meet the delivery specification and must be repaired first.
+4. The PDF thresholds do not apply to the current cross-Leaf/4-rail/plugin-missing scenario, and the acceptance criteria need to be updated.
diff --git a/reports_multinode_nccl_latest_index_20260523.md b/reports_multinode_nccl_latest_index_20260523.md
index 5a7e0af..129b50d 100644
--- a/reports_multinode_nccl_latest_index_20260523.md
+++ b/reports_multinode_nccl_latest_index_20260523.md
@@ -13,6 +13,7 @@
 - Added a current acceptance status overview: `reports_h100_acceptance_current_status_20260523.md`, which merges single-node, multi-node NCCL, cross-node RDMA, environment equivalence, and blockers into one Chinese-language summary table.
 - Added a closure checklist: `reports_h100_acceptance_closure_checklist_20260523.md`, which states which work can be delivered in phases and which acceptance gates still cannot be closed.
 - Added a closure request for the network/hardware/environment teams: `reports_h100_network_hardware_escalation_request_20260523.md`, so that the responsible teams can backfill rail, plugin/SHARP, cross-Leaf, and new threshold criteria.
+- Added a delivery package manifest: `reports_h100_acceptance_delivery_manifest_20260523.md`, which consolidates the main entry points, scripts, remote artifacts, and checksums.
 - The 2-node 1/2/4 GPU-per-node tiers are close to the PDF reference values but still FAIL under a strict reading of the thresholds.
 - The 2-node 8-GPU tier still falls short of the PDF reference values:
   - allreduce measured `353.85 GB/s busbw`; PDF target `491.84 GB/s`.
@@ -26,14 +27,15 @@
 |---:|---|---|
 | 1 | `reports_h100_acceptance_current_status_20260523.md` | Current H100 acceptance overview, summarizing single-node, multi-node NCCL, cross-node RDMA, and blockers |
 | 2 | `reports_h100_acceptance_closure_checklist_20260523.md` | Closure checklist: deliverable items, unclosed gates, shortest path to closure |
-| 3 | `reports_h100_network_hardware_escalation_request_20260523.md` | Closure request and backfill form for the network/hardware/environment teams |
-| 4 | `reports_multinode_nccl_handoff_plan_20260523.md` | Handoff plan for the network/hardware/environment teams, including a decision tree, questions to ask, and rerun commands |
-| 5 | `reports_multinode_nccl_environment_gap_20260523.md` | Explains why the current environment cannot be shown to be equivalent to the PDF environment, focusing on the 4 x 400G rails and the missing NCCL net plugin / SHARP |
-| 6 | `reports_multinode_nccl_artifact_signal_analysis_20260523.md` | Signal analysis of the latest artifacts, confirming IB/GDRDMA/HCA usage and the plugin/SHARP gap |
-| 7 | `reports_multinode_nccl_all_collectives_run_20260523.md` | Results of the supplementary multi-node multi-GPU 2x8 six-collective run, filling in the NCCL coverage of single-node test all |
-| 8 | `reports_multinode_nccl_all_collectives_artifacts_manifest_20260523_120144.md` | Artifacts manifest and checksums for the multi-node multi-GPU 2x8 six-collective run |
-| 9 | `reports_multinode_nccl_pdf_matrix_run_20260523.md` | Summary of the latest formal multi-node multi-GPU PDF matrix results |
-| 10 | `reports_multinode_nccl_deep_diagnose_run_20260523.md` | Results of this round's full deep-diagnosis rerun, including counter, GRAPH, and PXN sweep |
+| 3 | `reports_h100_acceptance_delivery_manifest_20260523.md` | Delivery package manifest: entry points, scripts, remote artifacts, checksums |
+| 4 | `reports_h100_network_hardware_escalation_request_20260523.md` | Closure request and backfill form for the network/hardware/environment teams |
+| 5 | `reports_multinode_nccl_handoff_plan_20260523.md` | Handoff plan for the network/hardware/environment teams, including a decision tree, questions to ask, and rerun commands |
+| 6 | `reports_multinode_nccl_environment_gap_20260523.md` | Explains why the current environment cannot be shown to be equivalent to the PDF environment, focusing on the 4 x 400G rails and the missing NCCL net plugin / SHARP |
+| 7 | `reports_multinode_nccl_artifact_signal_analysis_20260523.md` | Signal analysis of the latest artifacts, confirming IB/GDRDMA/HCA usage and the plugin/SHARP gap |
+| 8 | `reports_multinode_nccl_all_collectives_run_20260523.md` | Results of the supplementary multi-node multi-GPU 2x8 six-collective run, filling in the NCCL coverage of single-node test all |
+| 9 | `reports_multinode_nccl_all_collectives_artifacts_manifest_20260523_120144.md` | Artifacts manifest and checksums for the multi-node multi-GPU 2x8 six-collective run |
+| 10 | `reports_multinode_nccl_pdf_matrix_run_20260523.md` | Summary of the latest formal multi-node multi-GPU PDF matrix results |
+| 11 | `reports_multinode_nccl_deep_diagnose_run_20260523.md` | Results of this round's full deep-diagnosis rerun, including counter, GRAPH, and PXN sweep |
 
 ## Key Scripts
 
@@ -107,6 +109,7 @@ OUT_DIR=/root/test_gpu_scripts/reports/nccl_deep_diag_plugin_check_$(date +%Y%m%
 /root/test_gpu_scripts/reports_multinode_nccl_handoff_plan_20260523.md
 /root/test_gpu_scripts/reports_h100_acceptance_current_status_20260523.md
 /root/test_gpu_scripts/reports_h100_acceptance_closure_checklist_20260523.md
+/root/test_gpu_scripts/reports_h100_acceptance_delivery_manifest_20260523.md
 /root/test_gpu_scripts/reports_h100_network_hardware_escalation_request_20260523.md
 /root/test_gpu_scripts/reports_multinode_nccl_environment_gap_20260523.md
 /root/test_gpu_scripts/reports_multinode_nccl_artifact_signal_analysis_20260523.md