# GPU Test Report - **Date:** 2026-05-23T08:32:58.113416 - **Host:** aikubeworker0012 ## Overall Acceptance Verdict **Result: FAIL** Missing required evidence: - GPU Info - Health Check - Memory Bandwidth - Compute Throughput - NVLink/NVSwitch - NCCL - Stress Test - RDMA - DCGM - Training ## Summary | Test | Result | |------|--------| | Multi-node NCCL | FAIL | ## Multi-node NCCL / Cross Leaf Source: nccl-tests-mpirun | Mode: cross-leaf-pdf-matrix-nccl-2.27.7 - **Hosts:** nccl-gpu-1(172.72.8.12), nccl-gpu-2(172.72.8.16) - **Preflight:** PASS ### Multi-node NCCL allreduce | Topology | CUDA Visible Devices | Peak Bus BW | Peak Size | Avg Bus BW | Threshold | Status | |----------|----------------------|-------------|-----------|------------|-----------|--------| | 2 nodes x 1 GPU (PDF 2 machines 2 GPUs) | - | 47.23 GB/s | 16G | 47.24 GB/s | >= 49 GB/s | FAIL | | 2 nodes x 2 GPUs (PDF 2 machines 4 GPUs) | - | 136.97 GB/s | 16G | 137.17 GB/s | >= 137 GB/s | PASS | | 2 nodes x 4 GPUs (PDF 2 machines 8 GPUs) | 0,1,4,5 | 333.22 GB/s | 16G | 333.24 GB/s | >= 335 GB/s | FAIL | | 2 nodes x 8 GPUs (PDF 2 machines 16 GPUs) | - | 354.02 GB/s | 16G | 353.92 GB/s | >= 492 GB/s | FAIL | | Topology | NCCL Network | GPU Direct RDMA | GDR Enabled HCAs | GDR Disabled HCAs | |----------|--------------|-----------------|------------------|-------------------| | 2 nodes x 1 GPU (PDF 2 machines 2 GPUs) | IB | ENABLED | mlx5_0, mlx5_1, mlx5_6, mlx5_7 | - | | 2 nodes x 2 GPUs (PDF 2 machines 4 GPUs) | IB | ENABLED | mlx5_0, mlx5_1, mlx5_6, mlx5_7 | - | | 2 nodes x 4 GPUs (PDF 2 machines 8 GPUs) | IB | ENABLED | mlx5_0, mlx5_1, mlx5_6, mlx5_7 | - | | 2 nodes x 8 GPUs (PDF 2 machines 16 GPUs) | IB | ENABLED | mlx5_0, mlx5_1, mlx5_6, mlx5_7 | - | | Topology | Return Code | Error / Output Tail | |----------|-------------|---------------------| | 2 nodes x 1 GPU (PDF 2 machines 2 GPUs) | 0 | E aikubeworker0012:2157248:2157325 [0] NCCL INFO comm 0x5595f28bf420 rank 0 nranks 2 cudaDev 0 busId 18000 - Destroy COMPLETE # Out of bounds values : 0 OK # Avg bus bandwidth : 47.2399 # # Collective test concluded: all_reduce_perf # | | 2 nodes x 4 GPUs (PDF 2 machines 8 GPUs) | 0 | ker0012:2157429:2157526 [3] NCCL INFO comm 0x55a8a0147090 rank 3 nranks 8 cudaDev 3 busId ab000 - Destroy COMPLETE aikubeworker0012:2157427:2157524 [1] NCCL INFO comm 0x55b1b0f86630 rank 1 nranks 8 cudaDev 1 busId 2a000 - Destroy COMPLETE | | 2 nodes x 8 GPUs (PDF 2 machines 16 GPUs) | 0 | aikubeworker0016:1138578:1139592 [0] NCCL INFO comm 0x556eff26c190 rank 8 nranks 16 cudaDev 0 busId 18000 - Destroy COMPLETE # Out of bounds values : 0 OK # Avg bus bandwidth : 353.915 # # Collective test concluded: all_reduce_perf # | ### Multi-node NCCL alltoall | Topology | CUDA Visible Devices | Peak Bus BW | Peak Size | Avg Bus BW | Threshold | Status | |----------|----------------------|-------------|-----------|------------|-----------|--------| | 2 nodes x 1 GPU (PDF 2 machines 2 GPUs) | - | 24.84 GB/s | 16G | 24.89 GB/s | >= 27 GB/s | FAIL | | 2 nodes x 2 GPUs (PDF 2 machines 4 GPUs) | - | 47.67 GB/s | 16G | 47.91 GB/s | >= 54 GB/s | FAIL | | 2 nodes x 4 GPUs (PDF 2 machines 8 GPUs) | 0,1,4,5 | 72.93 GB/s | 16G | 72.97 GB/s | >= 74 GB/s | FAIL | | 2 nodes x 8 GPUs (PDF 2 machines 16 GPUs) | - | 30.04 GB/s | 16G | 30.04 GB/s | >= 77 GB/s | FAIL | | Topology | NCCL Network | GPU Direct RDMA | GDR Enabled HCAs | GDR Disabled HCAs | |----------|--------------|-----------------|------------------|-------------------| | 2 nodes x 1 GPU (PDF 2 machines 2 GPUs) | IB | ENABLED | mlx5_0, mlx5_1, mlx5_6, mlx5_7 | - | | 2 nodes x 2 GPUs (PDF 2 machines 4 GPUs) | IB | ENABLED | mlx5_0, mlx5_1, mlx5_6, mlx5_7 | - | | 2 nodes x 4 GPUs (PDF 2 machines 8 GPUs) | IB | ENABLED | mlx5_0, mlx5_1, mlx5_6, mlx5_7 | - | | 2 nodes x 8 GPUs (PDF 2 machines 16 GPUs) | IB | ENABLED | mlx5_0, mlx5_1, mlx5_6, mlx5_7 | - | | Topology | Return Code | Error / Output Tail | |----------|-------------|---------------------| | 2 nodes x 1 GPU (PDF 2 machines 2 GPUs) | 0 | ETE aikubeworker0012:2157727:2157802 [0] NCCL INFO comm 0x55a0349b02b0 rank 0 nranks 2 cudaDev 0 busId 18000 - Destroy COMPLETE # Out of bounds values : 0 OK # Avg bus bandwidth : 24.8897 # # Collective test concluded: alltoall_perf # | | 2 nodes x 2 GPUs (PDF 2 machines 4 GPUs) | 0 | ETE aikubeworker0016:1141290:1142410 [0] NCCL INFO comm 0x55fabbea6410 rank 2 nranks 4 cudaDev 0 busId 18000 - Destroy COMPLETE # Out of bounds values : 0 OK # Avg bus bandwidth : 47.9094 # # Collective test concluded: alltoall_perf # | | 2 nodes x 4 GPUs (PDF 2 machines 8 GPUs) | 0 | ETE aikubeworker0012:2158071:2158172 [0] NCCL INFO comm 0x563312baa7f0 rank 0 nranks 8 cudaDev 0 busId 18000 - Destroy COMPLETE # Out of bounds values : 0 OK # Avg bus bandwidth : 72.9657 # # Collective test concluded: alltoall_perf # | | 2 nodes x 8 GPUs (PDF 2 machines 16 GPUs) | 0 | 016:1143717:1145948 [7] NCCL INFO comm 0x5558cc9de640 rank 15 nranks 16 cudaDev 7 busId db000 - Destroy COMPLETE aikubeworker0016:1143713:1145946 [3] NCCL INFO comm 0x55c1af080e60 rank 11 nranks 16 cudaDev 3 busId 5d000 - Destroy COMPLETE | **Overall: FAIL** --- *Generated by GPU Test Suite v0.2.0*