test_gpu_scripts/reports_multinode_nccl_16g_2x8_nccl227.md

GPU Test Report

  • Date: 2026-05-23T07:56:26.791384
  • Host: aikubeworker0012

Overall Acceptance Verdict

Result: FAIL

Missing required evidence:

  • GPU Info
  • Health Check
  • Memory Bandwidth
  • Compute Throughput
  • NVLink/NVSwitch
  • NCCL
  • Stress Test
  • RDMA
  • DCGM
  • Training

Summary

| Test | Result |
| --- | --- |
| Multi-node NCCL | FAIL |

Multi-node NCCL / Cross Leaf

Source: nccl-tests-mpirun | Mode: large-message-nccl-2.27.7

  • Hosts: nccl-gpu-1(172.72.8.12), nccl-gpu-2(172.72.8.16)
  • Preflight: PASS

Multi-node NCCL allreduce

| Topology | Peak Bus BW | Peak Size | Avg Bus BW | Threshold | Status |
| --- | --- | --- | --- | --- | --- |
| 2 nodes x 8 GPUs NCCL 2.27.7 16G | 237.86 GB/s | 16G | 238.56 GB/s | >= 480 GB/s | FAIL |

| Topology | NCCL Network | GPU Direct RDMA | GDR Enabled HCAs | GDR Disabled HCAs |
| --- | --- | --- | --- | --- |
| 2 nodes x 8 GPUs NCCL 2.27.7 16G | IB | ENABLED | mlx5_0, mlx5_1, mlx5_6, mlx5_7 | - |

| Topology | Return Code | Error / Output Tail |
| --- | --- | --- |
| 2 nodes x 8 GPUs NCCL 2.27.7 16G | 0 | aikubeworker0016:1019342:1020412 [4] NCCL INFO comm 0x559f14871c30 rank 12 nranks 16 cudaDev 4 busId 9a000 - Destroy COMPLETE # Out of bounds values : 0 OK # Avg bus bandwidth : 238.555 # # Collective test concluded: all_reduce_perf # |
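
The allreduce status can be cross-checked from the output tail. The sketch below pulls the `Avg bus bandwidth` value out of the recorded tail and compares it with the 480 GB/s threshold. The function names are hypothetical, and comparing the average (rather than the peak) against the threshold is an assumption, since the report does not state which column the suite uses.

```python
import re

# Tail of the all_reduce_perf output, copied from the report row.
OUTPUT_TAIL = (
    "# Out of bounds values : 0 OK "
    "# Avg bus bandwidth : 238.555 "
    "# # Collective test concluded: all_reduce_perf #"
)


def parse_avg_bus_bw(tail: str) -> float:
    """Extract the 'Avg bus bandwidth' value (GB/s) from an output tail."""
    match = re.search(r"Avg bus bandwidth\s*:\s*([0-9]+(?:\.[0-9]+)?)", tail)
    if match is None:
        raise ValueError("no 'Avg bus bandwidth' entry found")
    return float(match.group(1))


def check_threshold(measured_gbps: float, threshold_gbps: float) -> str:
    """Assumed rule (not confirmed by the report): PASS when measured >= threshold."""
    return "PASS" if measured_gbps >= threshold_gbps else "FAIL"


avg_bw = parse_avg_bus_bw(OUTPUT_TAIL)
status = check_threshold(avg_bw, 480.0)
print(f"avg bus bandwidth {avg_bw} GB/s -> {status}")
```

With the values recorded here the check yields FAIL, matching the Status column. The same check fails if the peak value (237.86 GB/s) is used instead.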

Multi-node NCCL alltoall

| Topology | Peak Bus BW | Peak Size | Avg Bus BW | Threshold | Status |
| --- | --- | --- | --- | --- | --- |
| 2 nodes x 8 GPUs NCCL 2.27.7 16G | 28.62 GB/s | 16G | 28.62 GB/s | >= 75 GB/s | FAIL |

| Topology | NCCL Network | GPU Direct RDMA | GDR Enabled HCAs | GDR Disabled HCAs |
| --- | --- | --- | --- | --- |
| 2 nodes x 8 GPUs NCCL 2.27.7 16G | IB | ENABLED | mlx5_0, mlx5_1, mlx5_6, mlx5_7 | - |

| Topology | Return Code | Error / Output Tail |
| --- | --- | --- |
| 2 nodes x 8 GPUs NCCL 2.27.7 16G | 0 | E aikubeworker0016:1020609:1021756 [5] NCCL INFO comm 0x55f920e55d90 rank 13 nranks 16 cudaDev 5 busId ab000 - Destroy COMPLETE # Out of bounds values : 0 OK # Avg bus bandwidth : 28.6222 # # Collective test concluded: alltoall_perf # |
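
To put both failures in proportion, the following sketch computes how much of each threshold was reached, using the average bus bandwidth figures reported for allreduce and alltoall. The helper name is hypothetical and the choice of the average column is an assumption.

```python
# (test name, measured avg bus BW in GB/s, threshold in GB/s), copied from the report.
RESULTS = [
    ("allreduce", 238.56, 480.0),
    ("alltoall", 28.62, 75.0),
]


def percent_of_threshold(measured_gbps: float, threshold_gbps: float) -> float:
    """Return the measured bandwidth as a percentage of the threshold."""
    return 100.0 * measured_gbps / threshold_gbps


for name, measured, threshold in RESULTS:
    pct = percent_of_threshold(measured, threshold)
    shortfall = threshold - measured
    print(f"{name}: {pct:.1f}% of threshold, {shortfall:.2f} GB/s short")

# allreduce: 49.7% of threshold, 241.44 GB/s short
# alltoall: 38.2% of threshold, 46.38 GB/s short
```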

Overall: FAIL


Generated by GPU Test Suite v0.2.0