Document H100 acceptance entrypoint

2026-05-23 20:22:15 +08:00 · 2026-05-23 20:22:15 +08:00 · 1203b025a0
commit 1203b025a0
parent 5b022d5849
1 changed file with 62 additions and 15 deletions
--- a/README.md
+++ b/README.md
@@ -6,10 +6,49 @@
 > **Supported GPU architectures:** Ampere (A100/A800) · Hopper (H100/H200) · Blackwell (B200/B300)
 > The system automatically detects the GPU model and benchmarks against the matching spec parameters.

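+## H100 Current Acceptance Entrypoint
+
+The current branch `h100-acceptance-current` now covers the main evidence chain for H100 single-node tests, multi-node NCCL, and cross-node RDMA. Measured against the existing PDF/config criteria, the current verdict is still **FAIL**: the scripts and evidence are largely deliverable, but the machines have not yet reached the production acceptance thresholds.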
+
+| Priority | File | Purpose |
+|---|---|---|
+| 1 | [reports_h100_acceptance_current_status_20260523.md](reports_h100_acceptance_current_status_20260523.md) | Current overall status: tested items, failed items, blockers, next steps |
+| 2 | [reports_multinode_nccl_latest_index_20260523.md](reports_multinode_nccl_latest_index_20260523.md) | Index of multi-node NCCL reports |
+| 3 | [reports_multinode_nccl_handoff_plan_20260523.md](reports_multinode_nccl_handoff_plan_20260523.md) | Plan for the next owner to rerun tests and continue root-cause investigation |
+| 4 | [reports_test_all_latest_summary_cn_20260523.md](reports_test_all_latest_summary_cn_20260523.md) | Raw summary (in Chinese) of the single-node `test all` run |
+| 5 | [reports_rdma_cross_node_mlx5_0_20260523.md](reports_rdma_cross_node_mlx5_0_20260523.md) | Bidirectional cross-node RDMA results on `mlx5_0` |
+
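+Current main blockers:
+
+- Single-node `test all`: both nodes score `6/10 PASS`; Compute, NCCL, Stress, and RDMA fail.
+- Cross-node RDMA: `mlx5_0` write bandwidth is near or at the threshold, but read bandwidth and read/write latency fail.
+- Multi-node NCCL: `2x8 allreduce` and `2x8 alltoall` fail the PDF thresholds; NCCL reports `wrong_count=0`, so the gap is mainly in performance, not correctness.
+- Environment differences: the usable 400G IB rails are currently mainly `mlx5_0,mlx5_1,mlx5_6,mlx5_7`; no external NCCL net plugin, SHARP, or HCOLL was found.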
+
+### H100 Rerun Entrypoints
+
+The default remote path is `/root/test_gpu_scripts`. For multi-node tests, `nccl-gpu-1` is the recommended launch node.
+
+```bash
+# Full single-node acceptance; run on each machine separately
+bash scripts/run_h100_single_node_all.sh
+
+# Multi-node NCCL PDF matrix: allreduce/alltoall x 2x1/2x2/2x4/2x8
+bash scripts/run_multinode_nccl_pdf_matrix.sh
+
+# Multi-node NCCL, six collectives: 2 nodes x 8 GPUs
+bash scripts/run_multinode_nccl_all_collectives.sh
+
+# Multi-node NCCL deep diagnosis and environment evidence capture
+bash scripts/multinode_nccl_deep_diagnose.sh preflight
+bash scripts/multinode_nccl_deep_diagnose.sh all
+```
+
 ---

 ## Table of Contents

+- [H100 Current Acceptance Entrypoint](#h100-当前验收入口)
 - [Project Structure](#项目结构)
 - [Environment Requirements](#环境要求)
 - [Quick Start](#快速开始)
@@ -26,23 +65,31 @@
 ## Project Structure

 ```
-servertest/
-├── gpu_tester.py               # Main entry: CLI + interactive menu
-├── install_deps.sh             # One-step install of third-party tools
+test_gpu_scripts/
+├── gpu_tester.py                               # Main entry: CLI + interactive menu
+├── install_deps.sh                             # One-step install of third-party tools
 ├── configs/
-│   └── default.yaml            # Default config
+│   ├── default.yaml                            # Default config
+│   ├── multinode_nccl_nccl227_pdf_matrix.yaml  # H100 multi-node PDF matrix config
+│   └── multinode_nccl_nccl227_all_collectives_2x8.yaml
 ├── modules/
-│   ├── gpu_specs.py            # GPU spec database (A100/A800/H100/H200/B200/B300)
-│   ├── gpu_info.py             # GPU detection & info
-│   ├── health_check.py         # Health diagnostics
-│   ├── benchmark.py            # Memory bandwidth + compute throughput
-│   ├── nccl_test.py            # NCCL multi-GPU communication
-│   ├── stress_test.py          # GPU stress/stability
-│   ├── rdma_test.py            # RDMA/InfiniBand
-│   ├── training_sim.py         # Training simulation
-│   └── report.py               # Report generation
-├── requirements.txt
-└── 调研.md                     # Industry framework survey
+│   ├── gpu_specs.py                            # GPU spec database
+│   ├── gpu_info.py                             # GPU detection & info
+│   ├── health_check.py                         # Health diagnostics
+│   ├── benchmark.py                            # Memory bandwidth + compute throughput
+│   ├── nccl_test.py                            # NCCL multi-GPU/multi-node communication
+│   ├── stress_test.py                          # GPU stress/stability
+│   ├── rdma_test.py                            # RDMA/InfiniBand
+│   ├── training_sim.py                         # Training simulation
+│   └── report.py                               # Report generation
+├── scripts/
+│   ├── run_h100_single_node_all.sh             # H100 full single-node rerun
+│   ├── run_multinode_nccl_pdf_matrix.sh        # Multi-node NCCL PDF matrix rerun
+│   ├── run_multinode_nccl_all_collectives.sh   # Multi-node NCCL six-collective rerun
+│   └── multinode_nccl_deep_diagnose.sh         # Multi-node NCCL deep diagnosis
+├── docs/                                       # Metric descriptions and runbooks
+├── reports_*20260523*.md                       # Current H100 acceptance evidence and summary reports
+└── requirements.txt
 ```

 ---
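
The multi-node rerun commands documented in the README hunk above could be chained from the launch node. What follows is a minimal sketch only and is not part of this commit: the helper names `run_step` and `rerun_multinode` and the `rerun_logs` directory are hypothetical, and it assumes each script exits non-zero on failure, which the README does not state.

```bash
#!/usr/bin/env bash
# Hypothetical helpers, not part of the repository.
# Assumption: each rerun script exits non-zero when it fails.

REPO_DIR="${REPO_DIR:-/root/test_gpu_scripts}"  # default remote path from the README
LOG_DIR="${LOG_DIR:-./rerun_logs}"              # hypothetical log directory

# run_step NAME CMD...: run CMD, save its output to $LOG_DIR/NAME.log,
# print one status line, and return 1 if CMD failed.
run_step() {
  local name="$1"; shift
  mkdir -p "$LOG_DIR"
  "$@" >"$LOG_DIR/$name.log" 2>&1
  local rc=$?
  if [ "$rc" -eq 0 ]; then
    echo "$name OK"
  else
    echo "$name FAILED (exit $rc)"
    return 1
  fi
}

# rerun_multinode: run preflight, the PDF matrix, and the six-collective run
# in order, continuing past failures. The body runs in a subshell so the
# caller's working directory is left unchanged.
rerun_multinode() (
  cd "$REPO_DIR" || exit 1
  failed=0
  run_step preflight bash scripts/multinode_nccl_deep_diagnose.sh preflight || failed=$((failed + 1))
  run_step pdf_matrix bash scripts/run_multinode_nccl_pdf_matrix.sh || failed=$((failed + 1))
  run_step all_collectives bash scripts/run_multinode_nccl_all_collectives.sh || failed=$((failed + 1))
  echo "steps failed: $failed"
  [ "$failed" -eq 0 ]
)
```

After sourcing this file on `nccl-gpu-1`, calling `rerun_multinode` would run the three steps and leave one log per step under `rerun_logs/` inside the repository directory. Continuing past a failed step is a deliberate choice here, since the current verdict is FAIL and a full set of logs is more useful than stopping at the first failure.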