# GPU Test Report
- **Date:** 2026-05-22T15:49:02.368516
- **Host:** aikubeworker0016
- **GPU:** NVIDIA H100 80GB HBM3 x8
- **Driver:** 580.159.03 | **CUDA:** 13.0
## Overall Acceptance Verdict
**Result: FAIL**
Failed or unverified items:
- Compute Throughput: FAIL (worst FP32 51.9 TFLOPS vs >= 54 TFLOPS)
- NCCL: FAIL (no nccl-tests bus BW)
- RDMA: FAIL
- Training: UNVERIFIED (52471 tokens/sec; legacy result lacks explicit acceptance verdict)
Missing required evidence:
- NVLink/NVSwitch
- DCGM
## Summary
| Test | Result |
|------|--------|
| GPU Info | PASS (8 GPUs detected) |
| Health Check | PASS |
| Memory Bandwidth | PASS (108.1%) |
| Compute Throughput | FAIL (worst FP32 51.9 TFLOPS vs >= 54 TFLOPS) |
| NCCL | FAIL (no nccl-tests bus BW) |
| Stress Test | PASS |
| RDMA | FAIL |
| Training | UNVERIFIED (52471 tokens/sec; legacy result lacks explicit acceptance verdict) |
## GPU Information
| GPU | Model | VRAM | Temp | Power | SM Clock |
|-----|-------|------|------|-------|----------|
| 0 | NVIDIA H100 80GB HBM3 | 81559 MB | 21C | 70/700W | 345 MHz |
| 1 | NVIDIA H100 80GB HBM3 | 81559 MB | 21C | 68/700W | 345 MHz |
| 2 | NVIDIA H100 80GB HBM3 | 81559 MB | 22C | 67/700W | 345 MHz |
| 3 | NVIDIA H100 80GB HBM3 | 81559 MB | 21C | 67/700W | 345 MHz |
| 4 | NVIDIA H100 80GB HBM3 | 81559 MB | 21C | 67/700W | 345 MHz |
| 5 | NVIDIA H100 80GB HBM3 | 81559 MB | 23C | 69/700W | 345 MHz |
| 6 | NVIDIA H100 80GB HBM3 | 81559 MB | 21C | 68/700W | 345 MHz |
| 7 | NVIDIA H100 80GB HBM3 | 81559 MB | 21C | 66/700W | 345 MHz |
## Health Check
**Overall: PASS**
| GPU | Temp | Power | ECC | PCIe | Throttle | Status |
|-----|------|-------|-----|------|----------|--------|
| 0 | 21C PASS | 70W PASS | S:0 D:0 | Gen5x16 | PASS | **WARN** |
| 1 | 21C PASS | 67W PASS | S:0 D:0 | Gen5x16 | PASS | **WARN** |
| 2 | 22C PASS | 67W PASS | S:0 D:0 | Gen5x16 | PASS | **WARN** |
| 3 | 21C PASS | 67W PASS | S:0 D:0 | Gen5x16 | PASS | **WARN** |
| 4 | 21C PASS | 67W PASS | S:0 D:0 | Gen5x16 | PASS | **WARN** |
| 5 | 23C PASS | 69W PASS | S:0 D:0 | Gen5x16 | PASS | **WARN** |
| 6 | 21C PASS | 68W PASS | S:0 D:0 | Gen5x16 | PASS | **WARN** |
| 7 | 21C PASS | 66W PASS | S:0 D:0 | Gen5x16 | PASS | **WARN** |
## Memory Bandwidth
Source: nvbandwidth
| Metric | Value | Peak | Efficiency |
|--------|-------|------|------------|
| H2D (PCIe) | 55.5 GB/s | 64 GB/s | 86.7% |
| D2H (PCIe) | 55.3 GB/s | 64 GB/s | 86.4% |
| D2D (NVLink) | 486.5 GB/s | 450 GB/s | 108.1% |
**Verdict: PASS** (D2D efficiency 108.1%)
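
The Efficiency column is consistent with measured bandwidth divided by the listed peak. A minimal Python sketch of that arithmetic follows; the function name `efficiency_pct` and the `rows` layout are illustrative assumptions, not taken from the test suite.

```python
def efficiency_pct(measured_gbps: float, peak_gbps: float) -> float:
    """Return measured bandwidth as a percentage of the reference peak."""
    return round(measured_gbps / peak_gbps * 100, 1)


# (measured GB/s, reference peak GB/s) copied from the table above
rows = {
    "H2D (PCIe)": (55.5, 64.0),
    "D2H (PCIe)": (55.3, 64.0),
    "D2D (NVLink)": (486.5, 450.0),
}

for name, (measured, peak) in rows.items():
    print(f"{name}: {efficiency_pct(measured, peak)}%")
```

An efficiency above 100%, as in the D2D row, means the measured value exceeded the reference peak used in the table.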
## Compute Throughput
| DType | Achieved (TFLOPS) | Peak | Threshold | Status |
|-------|-------------------|------|------------|--------|
| FP32 | 51.9 | 67 | >= 54 | FAIL |
| TF32 | 357.0 | 495 | >= 444 | FAIL |
| FP16 | 664.0 | 990 | >= 734 | FAIL |
| BF16 | 700.1 | 990 | >= 745 | FAIL |
| FP8 | 1116.2 | 1979 | >= 1400 | FAIL |
**Verdict: FAIL** (absolute TFLOPS thresholds; worst efficiency 56.4%)
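
The verdict above can be reproduced from the table, assuming each dtype passes only when achieved TFLOPS meets its absolute threshold and that "worst efficiency" is the lowest achieved/peak ratio. The sketch below is an illustration under those assumptions; `evaluate_compute` is a hypothetical helper, not the suite's own code.

```python
# (achieved TFLOPS, peak TFLOPS, minimum acceptable TFLOPS) from the table above
results = {
    "FP32": (51.9, 67, 54),
    "TF32": (357.0, 495, 444),
    "FP16": (664.0, 990, 734),
    "BF16": (700.1, 990, 745),
    "FP8": (1116.2, 1979, 1400),
}


def evaluate_compute(results):
    """Return per-dtype status, worst efficiency in percent, and overall verdict."""
    statuses = {
        dtype: "PASS" if achieved >= threshold else "FAIL"
        for dtype, (achieved, _peak, threshold) in results.items()
    }
    worst_eff = min(achieved / peak for achieved, peak, _ in results.values()) * 100
    verdict = "PASS" if all(s == "PASS" for s in statuses.values()) else "FAIL"
    return statuses, round(worst_eff, 1), verdict


statuses, worst_eff, verdict = evaluate_compute(results)
print(statuses, worst_eff, verdict)
```

With these inputs the lowest ratio comes from the FP8 row (1116.2 / 1979).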
### Compute Per-GPU TFLOPS
| GPU | FP32 | TF32 | FP16 | BF16 | FP8 |
|---|---|---|---|---|---|
| 0 | 51.9 | 357.0 | 664.0 | 700.1 | 1116.2 |
| 1 | 51.9 | 357.0 | 664.0 | 700.1 | 1116.2 |
| 2 | 51.9 | 357.0 | 664.0 | 700.1 | 1116.2 |
| 3 | 51.9 | 357.0 | 664.0 | 700.1 | 1116.2 |
| 4 | 51.9 | 357.0 | 664.0 | 700.1 | 1116.2 |
| 5 | 51.9 | 357.0 | 664.0 | 700.1 | 1116.2 |
| 6 | 51.9 | 357.0 | 664.0 | 700.1 | 1116.2 |
| 7 | 51.9 | 357.0 | 664.0 | 700.1 | 1116.2 |
## NCCL Multi-GPU
Source: torchrun_fallback | GPUs: 8 | NCCL version: 2.21.5+cuda12.4
> Functional NCCL smoke test only: nccl-tests bus bandwidth was not measured, so this does not satisfy production acceptance.
| Operation | Bus BW (GB/s) | Threshold | Status |
|-----------|---------------|-----------|--------|
| allreduce | 0.0 | >= 0 | PASS |
| broadcast | 0.0 | >= 0 | PASS |
| allgather | 0.0 | >= 0 | PASS |
| reducescatter | 0.0 | >= 0 | PASS |
| alltoall | 0.0 | >= 0 | PASS |
**Overall: FAIL**
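
For context, nccl-tests derives bus bandwidth by scaling the algorithm bandwidth with a per-collective factor that depends on the number of ranks. The sketch below applies the factors described in the nccl-tests performance notes; `bus_bandwidth` is a hypothetical helper, and the 100 GB/s input is a placeholder, not a measurement from this host.

```python
def bus_bandwidth(op: str, alg_bw_gbps: float, n_ranks: int) -> float:
    """Scale algorithm bandwidth to bus bandwidth for a given collective."""
    share = (n_ranks - 1) / n_ranks
    factors = {
        "allreduce": 2 * share,
        "allgather": share,
        "reducescatter": share,
        "alltoall": share,
        "broadcast": 1.0,
    }
    return alg_bw_gbps * factors[op]


# Hypothetical algorithm bandwidth of 100 GB/s on 8 GPUs
for op in ("allreduce", "broadcast", "allgather", "reducescatter", "alltoall"):
    print(op, bus_bandwidth(op, 100.0, 8))
```

A bus bandwidth figure of this kind is what the acceptance check expects from nccl-tests; the torchrun fallback used here did not produce one.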
## Stress Test
- **Source:** pytorch
- **Duration:** 60s (requested 60s)
- **Result: PASS**
## RDMA/InfiniBand
> Legacy RDMA result re-evaluated against the current PDF acceptance thresholds; the old WARN statuses and the old 50 GB/s and 10 us limits are not used for the verdict.
| Test | Value | Threshold | Status |
|------|-------|-----------|--------|
| ib_write_bw | 0.1 GB/s | >= 47 GB/s | FAIL |
| ib_read_bw | 0.1 GB/s | >= 47 GB/s | FAIL |
| ib_write_lat | 4.10 us | <= 2 us | FAIL |
| ib_read_lat | 16.00 us | <= 3.5 us | FAIL |
- **Failure reasons:**
  - ib_write_bw bandwidth 0.13 GB/s < 47 GB/s
  - ib_read_bw bandwidth 0.13 GB/s < 47 GB/s
  - ib_write_lat latency 4.1 us > 2 us
  - ib_read_lat latency 16.0 us > 3.5 us
**Overall: FAIL**
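
The RDMA checks mix two directions: bandwidth must be at least the threshold, while latency must be at most the threshold. A minimal sketch of that evaluation, using the values from the failure reasons above, is shown below; the `CHECKS` tuple layout and `evaluate_rdma` are assumptions made for illustration.

```python
import operator

# (test name, measured value, comparison, limit, unit)
CHECKS = [
    ("ib_write_bw", 0.13, operator.ge, 47.0, "GB/s"),
    ("ib_read_bw", 0.13, operator.ge, 47.0, "GB/s"),
    ("ib_write_lat", 4.1, operator.le, 2.0, "us"),
    ("ib_read_lat", 16.0, operator.le, 3.5, "us"),
]


def evaluate_rdma(checks):
    """Return the overall verdict and the names of the failing tests."""
    failures = [
        name for name, value, compare, limit, _unit in checks
        if not compare(value, limit)
    ]
    return ("PASS" if not failures else "FAIL"), failures


overall, failures = evaluate_rdma(CHECKS)
print(overall, failures)
```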
## Training Simulation
| Metric | Value |
|--------|-------|
| Model | synthetic_transformer |
| Params | 1470.5M |
| Throughput | 52471 tokens/sec |
| Avg Step Time | 312.3 ms |
| Peak Memory | 27.3 GB |
| Final Loss | 0.0041 |
| Step Jitter | N/A |
| Distributed Mode | N/A |
| Acceptance Gaps | missing passed, step_jitter_pct, distributed_mode, loss_finite |
| Verdict | UNVERIFIED (52471 tokens/sec; legacy result lacks explicit acceptance verdict) |
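
The UNVERIFIED verdict follows from the fields listed under Acceptance Gaps. One way such a check could work, assuming a result is unverified whenever any required field is absent, is sketched below; the dictionary keys other than the four gap names, and the helper `training_verdict`, are hypothetical.

```python
# Field names taken from the Acceptance Gaps row above
REQUIRED_FIELDS = ("passed", "step_jitter_pct", "distributed_mode", "loss_finite")


def training_verdict(result: dict):
    """Return (verdict, gaps); UNVERIFIED if any required field is missing."""
    gaps = [field for field in REQUIRED_FIELDS if result.get(field) is None]
    if gaps:
        return "UNVERIFIED", gaps
    return ("PASS" if result["passed"] else "FAIL"), []


# A legacy-style result that carries metrics but no acceptance fields
legacy = {
    "model": "synthetic_transformer",
    "throughput_tokens_per_sec": 52471,
    "final_loss": 0.0041,
}
verdict, gaps = training_verdict(legacy)
print(verdict, gaps)
```
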
---
*Generated by GPU Test Suite v0.2.0*