Commit Graph

  • 017c981062 Remove remaining report docs from PR h100-acceptance-current cs 2026-05-26 00:44:56 +08:00
  • 1c3c811254 Remove generated reports from PR cs 2026-05-26 00:44:39 +08:00
  • 7ec2da18bc Clean report whitespace cs 2026-05-26 00:15:48 +08:00
  • 4dddab27b3 Add FP8 GEMM path comparison reports cs 2026-05-26 00:13:33 +08:00
  • 4484c731b6 Add H100 acceptance PR summary cs 2026-05-23 20:37:19 +08:00
  • f80a3b3636 Add H100 acceptance delivery manifest cs 2026-05-23 20:34:01 +08:00
  • 639651ef24 Add H100 network escalation request cs 2026-05-23 20:29:19 +08:00
  • edb4612cc6 Add H100 acceptance closure checklist cs 2026-05-23 20:25:39 +08:00
  • 1203b025a0 Document H100 acceptance entrypoint cs 2026-05-23 20:22:15 +08:00
  • 5b022d5849 Summarize current H100 acceptance status cs 2026-05-23 20:15:01 +08:00
  • 90c46e40b3 Archive all-collectives NCCL artifacts cs 2026-05-23 20:11:22 +08:00
  • c2db68f608 Add multinode NCCL all collectives run cs 2026-05-23 20:07:47 +08:00
  • e0cb796b0c Analyze multinode NCCL artifact signals cs 2026-05-23 19:50:51 +08:00
  • 4d06639129 Record multinode NCCL artifacts run cs 2026-05-23 19:45:03 +08:00
  • 098d1715f2 Archive multinode NCCL raw artifacts cs 2026-05-23 19:36:53 +08:00
  • 7bc15742ea Clarify multinode NCCL report thresholds cs 2026-05-23 19:33:01 +08:00
  • c73d738557 Record multinode NCCL PDF matrix run cs 2026-05-23 19:30:14 +08:00
  • 8923270ce0 Add multinode NCCL PDF matrix runner cs 2026-05-23 19:21:58 +08:00
  • 2c5c31e451 Add single-node H100 all runner cs 2026-05-23 19:16:40 +08:00
  • cadfbcfaa3 Add NCCL environment snapshot script cs 2026-05-23 19:13:35 +08:00
  • ef56e5f15a Add NCCL latest report index cs 2026-05-23 18:59:45 +08:00
  • 892f833ff4 Add NCCL network handoff plan cs 2026-05-23 18:57:22 +08:00
  • f64e85efaf Document NCCL environment equivalence gaps cs 2026-05-23 18:54:35 +08:00
  • c183f5a9d1 Document NCCL deep diagnosis rerun cs 2026-05-23 18:51:41 +08:00
  • b55666948c Add multinode NCCL deep diagnosis tools cs 2026-05-23 17:37:19 +08:00
  • 24a7bd5c1b Document NCCL graph comparison cs 2026-05-23 17:32:03 +08:00
  • 82c6316716 Document NCCL alltoall secondary sweep cs 2026-05-23 17:28:28 +08:00
  • 1813c11bbf Compare NCCL allreduce alltoall counters cs 2026-05-23 17:17:22 +08:00
  • edc469cee9 Document NCCL alltoall counter probe cs 2026-05-23 17:13:03 +08:00
  • 2e194ded14 Document PXN alltoall rail balancing cs 2026-05-23 17:03:02 +08:00
  • 619a471634 Tune multinode alltoall PXN behavior cs 2026-05-23 17:00:03 +08:00
  • a64e964e3c Add raw RDMA rail bandwidth evidence cs 2026-05-23 16:46:15 +08:00
  • ce363b2f7a Document missing NCCL network plugin cs 2026-05-23 16:43:25 +08:00
  • e756f0b7b4 Document NCCL rail saturation evidence cs 2026-05-23 16:42:27 +08:00
  • aa05ccab2e Add NCCL PDF matrix topology report cs 2026-05-23 16:35:24 +08:00
  • 6c9f049b71 Tune multinode NCCL auto parameters cs 2026-05-23 16:12:32 +08:00
  • 1f907e9691 Validate NCCL 2.27 multinode GDR performance cs 2026-05-23 15:58:21 +08:00
  • c660e04c99 Stabilize multinode NCCL launch diagnostics cs 2026-05-23 15:49:14 +08:00
  • 4b93fc785f Add multinode NCCL diagnostic report cs 2026-05-23 15:39:15 +08:00
  • 4b17bafd53 Add multi-node NCCL sweep test cs 2026-05-23 13:03:26 +08:00
  • 86f15544d7 Add H100 acceptance test coverage and reports cs 2026-05-23 10:41:09 +08:00
  • dd77a882f1 feat: merge cross-node RDMA into rdma_test.py + align H800 compute thresholds with H100 main zulifeng 2026-05-25 19:38:43 +08:00
  • d0ab823766 update dk/disk_benchmark dukai 2026-05-25 19:36:53 +08:00
  • 6cad5bca5d update dukai 2026-05-25 19:35:37 +08:00
  • 6ecb0390e5 update dukai 2026-05-25 19:16:18 +08:00
  • d0c527744b feat: add disk benchmark script dukai 2026-05-25 14:37:22 +08:00
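  • e49ea32094 feat: add multi-node nccl test script zulifeng 2026-05-25 14:19:02 +08:00
  • fc97a768cf feat: update test metrics and pass/fail logic to match H100 production acceptance criteria zulifeng 2026-05-13 14:52:41 +08:00
  • 375d439abb feat: add H20 support, improve compute test accuracy, and fix several stability issues zulifeng 2026-05-12 21:41:46 +08:00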
  • 1db1313d50 fix(health): mark PCIe link downgrade as FAIL instead of WARN donghongshuai hongshuai.dong 2026-05-10 17:22:43 +08:00
  • ef2ca11c58 fix: resolve FP8 benchmark, NCCL parsing, and report None-value bugs zulifeng 2026-05-10 15:53:12 +08:00
  • 09f81973bc Merge pull request 'hanks/test_gpu' (#1) from hanks/test_gpu into main zulifeng han.zhao 2026-05-07 21:34:56 +08:00
  • fefef8e03b refactor: remove hardcoding, fix AMP bug, unify English output hanks/test_gpu qinyusen 2026-05-07 21:32:35 +08:00
  • f2158f6cd3 fix: resolve stress OOM, D2D efficiency calculation, NCCL execution failures qinyusen 2026-05-07 18:09:22 +08:00
  • 24934bc182 feat: rewrite install_deps.sh with env isolation and add numpy to requirements qinyusen 2026-05-07 01:32:13 +08:00
  • 3e967dd34a feat: add Ampere (A100/A800) support and generalize project naming qinyusen 2026-05-07 01:02:28 +08:00
  • 07250af845 docs: update README for multi-GPU support with auto-detection guide qinyusen 2026-05-06 19:32:09 +08:00
  • 2cb776d7d5 fix: generic branding, wire up report generation, fix --config flag qinyusen 2026-05-06 19:32:01 +08:00
  • 52fe96f2f5 refactor: replace hardcoded H200 specs with dynamic GPU detection qinyusen 2026-05-06 19:31:51 +08:00
  • 98e4977e28 add: GPU specs database with auto-detection (H100/H200/B200/B300) qinyusen 2026-05-06 19:31:44 +08:00
  • 8f7539d9b0 add: research notes on GPU server testing frameworks and tools qinyusen 2026-04-25 17:24:06 +08:00
  • 82cd4d5180 add: training simulation and report generation modules qinyusen 2026-04-25 17:24:01 +08:00
  • 1c6ba4809a add: stress test (gpu-burn) and RDMA/IB test modules qinyusen 2026-04-25 17:23:57 +08:00
  • eac1438227 add: NCCL test module (nccl-tests integration + torchrun fallback) qinyusen 2026-04-25 17:23:54 +08:00
  • 65f10dd365 add: benchmark module (nvbandwidth integration + PyTorch compute) qinyusen 2026-04-25 17:23:49 +08:00
  • b6dff76ef7 add: health check module (temperature, power, ECC, PCIe, system checks) qinyusen 2026-04-25 17:23:44 +08:00
  • f5fdde5fc1 add: GPU information module (nvidia-smi wrapper, NVLink topology) qinyusen 2026-04-25 17:23:40 +08:00
  • d4f46b6394 add: CLI entry point with interactive menu and argument parsing qinyusen 2026-04-25 17:23:35 +08:00
  • 65cf7feee5 add: dependency installation script (nvbandwidth, nccl-tests, gpu-burn) qinyusen 2026-04-25 17:23:32 +08:00
  • 418dc70efb init: project scaffolding with README, config, and requirements qinyusen 2026-04-25 17:23:27 +08:00