5.4 KiB
5.4 KiB
GPU Test Report
- Date: 2026-05-23T08:58:19.911230
- Host: aikubeworker0012
Overall Acceptance Verdict
Result: FAIL
Missing required evidence:
- GPU Info
- Health Check
- Memory Bandwidth
- Compute Throughput
- NVLink/NVSwitch
- NCCL
- Stress Test
- RDMA
- DCGM
- Training
Summary
| Test | Result |
|---|---|
| Multi-node NCCL | FAIL |
Multi-node NCCL / Cross Leaf
Source: nccl-tests-mpirun | Mode: cross-leaf-pdf-matrix-nccl-2.27.7
- Hosts: nccl-gpu-1(172.72.8.12), nccl-gpu-2(172.72.8.16)
- Preflight: PASS
Multi-node NCCL allreduce
| Topology | CUDA Visible Devices | Peak Bus BW | Peak Size | Avg Bus BW | Threshold | Status |
|---|---|---|---|---|---|---|
| 2 nodes x 1 GPU (PDF 2 machines 2 GPUs) | - | 47.26 GB/s | 16G | 47.19 GB/s | >= 49 GB/s | FAIL |
| 2 nodes x 2 GPUs (PDF 2 machines 4 GPUs) | - | 136.36 GB/s | 16G | 136.69 GB/s | >= 137 GB/s | FAIL |
| 2 nodes x 4 GPUs (PDF 2 machines 8 GPUs) | 0,1,4,5 | 333.23 GB/s | 16G | 333.45 GB/s | >= 335 GB/s | FAIL |
| 2 nodes x 8 GPUs (PDF 2 machines 16 GPUs) | - | 353.47 GB/s | 16G | 353.86 GB/s | >= 492 GB/s | FAIL |
| Topology | NCCL Network | GPU Direct RDMA | GDR Enabled HCAs | GDR Disabled HCAs |
|---|---|---|---|---|
| 2 nodes x 1 GPU (PDF 2 machines 2 GPUs) | IB | ENABLED | mlx5_0, mlx5_1, mlx5_6, mlx5_7 | - |
| 2 nodes x 2 GPUs (PDF 2 machines 4 GPUs) | IB | ENABLED | mlx5_0, mlx5_1, mlx5_6, mlx5_7 | - |
| 2 nodes x 4 GPUs (PDF 2 machines 8 GPUs) | IB | ENABLED | mlx5_0, mlx5_1, mlx5_6, mlx5_7 | - |
| 2 nodes x 8 GPUs (PDF 2 machines 16 GPUs) | IB | ENABLED | mlx5_0, mlx5_1, mlx5_6, mlx5_7 | - |
| Topology | Return Code | Error / Output Tail |
|---|---|---|
| 2 nodes x 1 GPU (PDF 2 machines 2 GPUs) | 0 | TE aikubeworker0012:2165982:2166060 [0] NCCL INFO comm 0x55d452f2df80 rank 0 nranks 2 cudaDev 0 busId 18000 - Destroy COMPLETE # Out of bounds values : 0 OK # Avg bus bandwidth : 47.189 # # Collective test concluded: all_reduce_perf # |
| 2 nodes x 2 GPUs (PDF 2 machines 4 GPUs) | 0 | ker0016:1221425:1222411 [0] NCCL INFO comm 0x56437384f040 rank 2 nranks 4 cudaDev 0 busId 18000 - Destroy COMPLETE aikubeworker0016:1221427:1222412 [1] NCCL INFO comm 0x55ab9313f950 rank 3 nranks 4 cudaDev 1 busId 2a000 - Destroy COMPLETE |
| 2 nodes x 4 GPUs (PDF 2 machines 8 GPUs) | 0 | E aikubeworker0012:2166160:2166257 [0] NCCL INFO comm 0x557243829d50 rank 0 nranks 8 cudaDev 0 busId 18000 - Destroy COMPLETE # Out of bounds values : 0 OK # Avg bus bandwidth : 333.449 # # Collective test concluded: all_reduce_perf # |
| 2 nodes x 8 GPUs (PDF 2 machines 16 GPUs) | 0 | r0012:2166272:2166442 [5] NCCL INFO comm 0x55721e270960 rank 5 nranks 16 cudaDev 5 busId ab000 - Destroy COMPLETE aikubeworker0012:2166268:2166447 [1] NCCL INFO comm 0x5644fafd24e0 rank 1 nranks 16 cudaDev 1 busId 2a000 - Destroy COMPLETE |
Multi-node NCCL alltoall
| Topology | CUDA Visible Devices | Peak Bus BW | Peak Size | Avg Bus BW | Threshold | Status |
|---|---|---|---|---|---|---|
| 2 nodes x 1 GPU (PDF 2 machines 2 GPUs) | - | 24.87 GB/s | 16G | 24.93 GB/s | >= 27 GB/s | FAIL |
| 2 nodes x 2 GPUs (PDF 2 machines 4 GPUs) | - | 47.69 GB/s | 16G | 47.93 GB/s | >= 54 GB/s | FAIL |
| 2 nodes x 4 GPUs (PDF 2 machines 8 GPUs) | 0,1,4,5 | 72.82 GB/s | 16G | 72.87 GB/s | >= 74 GB/s | FAIL |
| 2 nodes x 8 GPUs (PDF 2 machines 16 GPUs) | - | 36.70 GB/s | 16G | 36.74 GB/s | >= 77 GB/s | FAIL |
| Topology | NCCL Network | GPU Direct RDMA | GDR Enabled HCAs | GDR Disabled HCAs |
|---|---|---|---|---|
| 2 nodes x 1 GPU (PDF 2 machines 2 GPUs) | IB | ENABLED | mlx5_0, mlx5_1, mlx5_6, mlx5_7 | - |
| 2 nodes x 2 GPUs (PDF 2 machines 4 GPUs) | IB | ENABLED | mlx5_0, mlx5_1, mlx5_6, mlx5_7 | - |
| 2 nodes x 4 GPUs (PDF 2 machines 8 GPUs) | IB | ENABLED | mlx5_0, mlx5_1, mlx5_6, mlx5_7 | - |
| 2 nodes x 8 GPUs (PDF 2 machines 16 GPUs) | IB | ENABLED | mlx5_0, mlx5_1, mlx5_6, mlx5_7 | - |
| Topology | Return Code | Error / Output Tail |
|---|---|---|
| 2 nodes x 1 GPU (PDF 2 machines 2 GPUs) | 0 | ETE aikubeworker0012:2166458:2166534 [0] NCCL INFO comm 0x5603baefb150 rank 0 nranks 2 cudaDev 0 busId 18000 - Destroy COMPLETE # Out of bounds values : 0 OK # Avg bus bandwidth : 24.9304 # # Collective test concluded: alltoall_perf # |
| 2 nodes x 2 GPUs (PDF 2 machines 4 GPUs) | 0 | ETE aikubeworker0012:2166543:2166743 [0] NCCL INFO comm 0x5569d31d4f50 rank 0 nranks 4 cudaDev 0 busId 18000 - Destroy COMPLETE # Out of bounds values : 0 OK # Avg bus bandwidth : 47.9258 # # Collective test concluded: alltoall_perf # |
| 2 nodes x 4 GPUs (PDF 2 machines 8 GPUs) | 0 | ker0016:1227342:1228382 [1] NCCL INFO comm 0x55cdec231780 rank 5 nranks 8 cudaDev 1 busId 2a000 - Destroy COMPLETE aikubeworker0016:1227344:1228381 [3] NCCL INFO comm 0x563c7ed39680 rank 7 nranks 8 cudaDev 3 busId ab000 - Destroy COMPLETE |
| 2 nodes x 8 GPUs (PDF 2 machines 16 GPUs) | 0 | TE aikubeworker0012:2166925:2167127 [7] NCCL INFO comm 0x560553b91250 rank 7 nranks 16 cudaDev 7 busId db000 - Destroy COMPLETE # Out of bounds values : 0 OK # Avg bus bandwidth : 36.7382 # # Collective test concluded: alltoall_perf # |
Overall: FAIL
Generated by GPU Test Suite v0.2.0