test_gpu_scripts/reports_multinode_nccl_pdf_matrix_nccl227.md

GPU Test Report

  • Date: 2026-05-23T08:58:19.911230
  • Host: aikubeworker0012

Overall Acceptance Verdict

Result: FAIL

Missing required evidence:

  • GPU Info
  • Health Check
  • Memory Bandwidth
  • Compute Throughput
  • NVLink/NVSwitch
  • NCCL
  • Stress Test
  • RDMA
  • DCGM
  • Training
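
The verdict above can be read as a set difference between the required evidence categories and the ones actually collected. The following is a minimal sketch under that assumption: it treats the run as FAIL whenever any required category is missing or any executed test failed. The function name `acceptance_verdict` and its arguments are hypothetical and are not taken from GPU Test Suite.

```python
# Required evidence categories, copied from the list in this report.
REQUIRED_EVIDENCE = [
    "GPU Info", "Health Check", "Memory Bandwidth", "Compute Throughput",
    "NVLink/NVSwitch", "NCCL", "Stress Test", "RDMA", "DCGM", "Training",
]

def acceptance_verdict(collected, results):
    """Return (verdict, missing) for one report.

    collected: set of evidence category names that were gathered (assumed input)
    results:   dict mapping test name to "PASS" or "FAIL" (assumed input)
    """
    # Preserve the report's ordering when listing what is missing.
    missing = [name for name in REQUIRED_EVIDENCE if name not in collected]
    any_failed = any(status != "PASS" for status in results.values())
    verdict = "FAIL" if missing or any_failed else "PASS"
    return verdict, missing

verdict, missing = acceptance_verdict(set(), {"Multi-node NCCL": "FAIL"})
print(verdict, len(missing))
```

With no required categories collected and Multi-node NCCL failing, as in this report, the sketch returns FAIL with all ten categories listed as missing.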

Summary

| Test | Result |
|---|---|
| Multi-node NCCL | FAIL |

Multi-node NCCL / Cross Leaf

Source: nccl-tests-mpirun | Mode: cross-leaf-pdf-matrix-nccl-2.27.7

  • Hosts: nccl-gpu-1(172.72.8.12), nccl-gpu-2(172.72.8.16)
  • Preflight: PASS

Multi-node NCCL allreduce

| Topology | CUDA Visible Devices | Peak Bus BW | Peak Size | Avg Bus BW | Threshold | Status |
|---|---|---|---|---|---|---|
| 2 nodes x 1 GPU (PDF 2 machines 2 GPUs) | - | 47.26 GB/s | 16G | 47.19 GB/s | >= 49 GB/s | FAIL |
| 2 nodes x 2 GPUs (PDF 2 machines 4 GPUs) | - | 136.36 GB/s | 16G | 136.69 GB/s | >= 137 GB/s | FAIL |
| 2 nodes x 4 GPUs (PDF 2 machines 8 GPUs) | 0,1,4,5 | 333.23 GB/s | 16G | 333.45 GB/s | >= 335 GB/s | FAIL |
| 2 nodes x 8 GPUs (PDF 2 machines 16 GPUs) | - | 353.47 GB/s | 16G | 353.86 GB/s | >= 492 GB/s | FAIL |

| Topology | NCCL Network | GPU Direct RDMA | GDR Enabled HCAs | GDR Disabled HCAs |
|---|---|---|---|---|
| 2 nodes x 1 GPU (PDF 2 machines 2 GPUs) | IB | ENABLED | mlx5_0, mlx5_1, mlx5_6, mlx5_7 | - |
| 2 nodes x 2 GPUs (PDF 2 machines 4 GPUs) | IB | ENABLED | mlx5_0, mlx5_1, mlx5_6, mlx5_7 | - |
| 2 nodes x 4 GPUs (PDF 2 machines 8 GPUs) | IB | ENABLED | mlx5_0, mlx5_1, mlx5_6, mlx5_7 | - |
| 2 nodes x 8 GPUs (PDF 2 machines 16 GPUs) | IB | ENABLED | mlx5_0, mlx5_1, mlx5_6, mlx5_7 | - |

| Topology | Return Code | Error / Output Tail |
|---|---|---|
| 2 nodes x 1 GPU (PDF 2 machines 2 GPUs) | 0 | TE aikubeworker0012:2165982:2166060 [0] NCCL INFO comm 0x55d452f2df80 rank 0 nranks 2 cudaDev 0 busId 18000 - Destroy COMPLETE # Out of bounds values : 0 OK # Avg bus bandwidth : 47.189 # # Collective test concluded: all_reduce_perf # |
| 2 nodes x 2 GPUs (PDF 2 machines 4 GPUs) | 0 | ker0016:1221425:1222411 [0] NCCL INFO comm 0x56437384f040 rank 2 nranks 4 cudaDev 0 busId 18000 - Destroy COMPLETE aikubeworker0016:1221427:1222412 [1] NCCL INFO comm 0x55ab9313f950 rank 3 nranks 4 cudaDev 1 busId 2a000 - Destroy COMPLETE |
| 2 nodes x 4 GPUs (PDF 2 machines 8 GPUs) | 0 | E aikubeworker0012:2166160:2166257 [0] NCCL INFO comm 0x557243829d50 rank 0 nranks 8 cudaDev 0 busId 18000 - Destroy COMPLETE # Out of bounds values : 0 OK # Avg bus bandwidth : 333.449 # # Collective test concluded: all_reduce_perf # |
| 2 nodes x 8 GPUs (PDF 2 machines 16 GPUs) | 0 | r0012:2166272:2166442 [5] NCCL INFO comm 0x55721e270960 rank 5 nranks 16 cudaDev 5 busId ab000 - Destroy COMPLETE aikubeworker0012:2166268:2166447 [1] NCCL INFO comm 0x5644fafd24e0 rank 1 nranks 16 cudaDev 1 busId 2a000 - Destroy COMPLETE |
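
The Status column for allreduce can be reproduced by comparing each measured bus bandwidth against its threshold, and the size of each miss is easier to judge as a percentage. The sketch below assumes the comparison uses the Peak Bus BW value (the report does not say whether peak or average is used, and every row fails under either choice); `check_row` is a hypothetical helper, not part of the test suite.

```python
# (topology, peak bus BW in GB/s, threshold in GB/s), copied from the allreduce table.
ALLREDUCE_ROWS = [
    ("2 nodes x 1 GPU", 47.26, 49.0),
    ("2 nodes x 2 GPUs", 136.36, 137.0),
    ("2 nodes x 4 GPUs", 333.23, 335.0),
    ("2 nodes x 8 GPUs", 353.47, 492.0),
]

def check_row(measured, threshold):
    """Return (status, shortfall_pct) for one measurement against a '>=' threshold."""
    status = "PASS" if measured >= threshold else "FAIL"
    # Shortfall is reported relative to the threshold; 0.0 when the threshold is met.
    shortfall_pct = max(0.0, (threshold - measured) / threshold * 100.0)
    return status, shortfall_pct

for topology, peak, threshold in ALLREDUCE_ROWS:
    status, shortfall = check_row(peak, threshold)
    print(f"{topology}: {status}, {shortfall:.1f}% below threshold")
```

Under this assumption the 1, 2, and 4 GPU-per-node rows each miss by less than 4%, while the 8 GPU-per-node row misses by roughly 28%, which separates a marginal miss from a substantial one.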

Multi-node NCCL alltoall

| Topology | CUDA Visible Devices | Peak Bus BW | Peak Size | Avg Bus BW | Threshold | Status |
|---|---|---|---|---|---|---|
| 2 nodes x 1 GPU (PDF 2 machines 2 GPUs) | - | 24.87 GB/s | 16G | 24.93 GB/s | >= 27 GB/s | FAIL |
| 2 nodes x 2 GPUs (PDF 2 machines 4 GPUs) | - | 47.69 GB/s | 16G | 47.93 GB/s | >= 54 GB/s | FAIL |
| 2 nodes x 4 GPUs (PDF 2 machines 8 GPUs) | 0,1,4,5 | 72.82 GB/s | 16G | 72.87 GB/s | >= 74 GB/s | FAIL |
| 2 nodes x 8 GPUs (PDF 2 machines 16 GPUs) | - | 36.70 GB/s | 16G | 36.74 GB/s | >= 77 GB/s | FAIL |

| Topology | NCCL Network | GPU Direct RDMA | GDR Enabled HCAs | GDR Disabled HCAs |
|---|---|---|---|---|
| 2 nodes x 1 GPU (PDF 2 machines 2 GPUs) | IB | ENABLED | mlx5_0, mlx5_1, mlx5_6, mlx5_7 | - |
| 2 nodes x 2 GPUs (PDF 2 machines 4 GPUs) | IB | ENABLED | mlx5_0, mlx5_1, mlx5_6, mlx5_7 | - |
| 2 nodes x 4 GPUs (PDF 2 machines 8 GPUs) | IB | ENABLED | mlx5_0, mlx5_1, mlx5_6, mlx5_7 | - |
| 2 nodes x 8 GPUs (PDF 2 machines 16 GPUs) | IB | ENABLED | mlx5_0, mlx5_1, mlx5_6, mlx5_7 | - |

| Topology | Return Code | Error / Output Tail |
|---|---|---|
| 2 nodes x 1 GPU (PDF 2 machines 2 GPUs) | 0 | ETE aikubeworker0012:2166458:2166534 [0] NCCL INFO comm 0x5603baefb150 rank 0 nranks 2 cudaDev 0 busId 18000 - Destroy COMPLETE # Out of bounds values : 0 OK # Avg bus bandwidth : 24.9304 # # Collective test concluded: alltoall_perf # |
| 2 nodes x 2 GPUs (PDF 2 machines 4 GPUs) | 0 | ETE aikubeworker0012:2166543:2166743 [0] NCCL INFO comm 0x5569d31d4f50 rank 0 nranks 4 cudaDev 0 busId 18000 - Destroy COMPLETE # Out of bounds values : 0 OK # Avg bus bandwidth : 47.9258 # # Collective test concluded: alltoall_perf # |
| 2 nodes x 4 GPUs (PDF 2 machines 8 GPUs) | 0 | ker0016:1227342:1228382 [1] NCCL INFO comm 0x55cdec231780 rank 5 nranks 8 cudaDev 1 busId 2a000 - Destroy COMPLETE aikubeworker0016:1227344:1228381 [3] NCCL INFO comm 0x563c7ed39680 rank 7 nranks 8 cudaDev 3 busId ab000 - Destroy COMPLETE |
| 2 nodes x 8 GPUs (PDF 2 machines 16 GPUs) | 0 | TE aikubeworker0012:2166925:2167127 [7] NCCL INFO comm 0x560553b91250 rank 7 nranks 16 cudaDev 7 busId db000 - Destroy COMPLETE # Out of bounds values : 0 OK # Avg bus bandwidth : 36.7382 # # Collective test concluded: alltoall_perf # |
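
Several output tails above end with the nccl-tests summary line `# Avg bus bandwidth : <value>`, while others were cut off after the `Destroy COMPLETE` messages. As an illustration of how that value could be recovered from a captured tail, here is a small sketch; the helper name `parse_avg_bus_bw` and the regular expression are assumptions made for this example, not the suite's actual parser.

```python
import re

# Matches the summary line as it appears in the tails above,
# e.g. "# Avg bus bandwidth : 36.7382".
AVG_BUSBW_RE = re.compile(r"#\s*Avg bus bandwidth\s*:\s*([0-9]+(?:\.[0-9]+)?)")

def parse_avg_bus_bw(output_tail):
    """Return the average bus bandwidth in GB/s, or None if the tail lacks the summary."""
    match = AVG_BUSBW_RE.search(output_tail)
    return float(match.group(1)) if match else None

# Tail fragment copied from the 2 nodes x 8 GPUs alltoall row.
tail = ("# Out of bounds values : 0 OK # Avg bus bandwidth : 36.7382 # "
        "# Collective test concluded: alltoall_perf #")
print(parse_avg_bus_bw(tail))
```

Returning `None` instead of raising keeps the truncated tails (those that only contain `Destroy COMPLETE` lines) distinguishable from a measured value of zero.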

Overall: FAIL


Generated by GPU Test Suite v0.2.0