# PERFORMANCE BENCHMARK PLAN: JiRack 236B
**Document ID:** CMS-JR-236B-BCH-2025
**Target:** 16x H100 GPU Cluster (2-Node HGX)
---
## 1. Stage I: Fabric & Communication (The "Stress Test")
Before loading weights, the inter-node fabric must be verified. JiRack's grouped-query attention (**14:1** query-to-KV-head ratio) relies on ultra-fast All-Reduce operations during the attention heads' merge phase.
| **Test Type** | **Tool** | **Target Metric** | **Success Criteria** |
|-------------------------|---------------------------|-------------------------------------------|-----------------------------------------------|
| NCCL All-Reduce | `nccl-tests` | Bus Bandwidth | > 380 GB/s (Intra-node) / > 45 GB/s (Inter-node) |
| P2P Latency | `p2pBandwidthLatencyTest` | Latency (μs) | < 2.0 μs via NVLink |
| IB Write BW | `ib_write_bw` | Throughput | 390+ Gbps per link (NDR 400) |
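The thresholds in the table above can be encoded in a small acceptance checker. This is a minimal sketch: the dictionary keys and the `fabric_pass` helper are illustrative names, not output formats of `nccl-tests` or `perftest`; the measured values would be parsed from those tools' logs.

```python
# Stage I acceptance thresholds from the table above.
# "min" = measured value must meet or exceed the limit; "max" = must not exceed it.
FABRIC_TARGETS = {
    "allreduce_intra_gbps": (380.0, "min"),  # NCCL bus BW, intra-node
    "allreduce_inter_gbps": (45.0, "min"),   # NCCL bus BW, inter-node
    "p2p_latency_us":       (2.0, "max"),    # NVLink P2P latency
    "ib_write_gbps":        (390.0, "min"),  # ib_write_bw per NDR 400 link
}

def fabric_pass(measured: dict) -> bool:
    """Return True only if every measured figure satisfies its threshold."""
    for key, (limit, kind) in FABRIC_TARGETS.items():
        value = measured[key]
        if kind == "min" and value < limit:
            return False
        if kind == "max" and value > limit:
            return False
    return True
```

A single failing metric (e.g. inter-node bus bandwidth at 40 GB/s) fails the whole stage, which matches the pass/fail intent of the table.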
---
## 2. Stage II: JiRack SWA Kernel Fusion (The "Compute Test")
Standard benchmarks (like HPL) are ineffective for JiRack. The **SwiGLU-Attention (SWA)** fusion kernels must be tested at the model's specific width of **14,336**.
- **Benchmark Tool:** `trtllm-bench` (TensorRT-LLM) or a custom Triton kernel profiler.
- **Target Configuration:**
- **Input:** 1024 tokens (Prompt).
- **Output:** 128 tokens (Generation).
- **Batch Size:** 1, 8, 32, 64.
### **The "Grabko Metric":**
- The system must achieve **at least 55% MFU (Model FLOPs Utilization)**.
- A delivered result **below 45%** indicates the BRE pre-fetching logic is likely being throttled by PCIe bottlenecks, and the configuration must be investigated before acceptance.
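The MFU threshold above can be computed from the throughput numbers the benchmark reports. A minimal sketch, assuming the standard approximation of ~2 FLOPs per parameter per generated token for a dense decoder and the H100 SXM dense BF16 peak of roughly 989 TFLOPS (both are assumptions; substitute the cluster's actual peak and the model's measured FLOPs-per-token if available):

```python
def mfu(params_billion: float, tokens_per_sec: float, num_gpus: int,
        peak_tflops_per_gpu: float = 989.0) -> float:
    """Approximate Model FLOPs Utilization for a dense decoder.

    achieved ~= 2 FLOPs/parameter/token * params * tokens/sec
    peak     = num_gpus * per-GPU peak (dense BF16, no sparsity)
    """
    achieved_flops = 2.0 * params_billion * 1e9 * tokens_per_sec
    peak_flops = num_gpus * peak_tflops_per_gpu * 1e12
    return achieved_flops / peak_flops
```

Under these assumptions, the 236B model on 16 H100s would need on the order of 18,000 tokens/sec of aggregate prefill throughput to sit near the 55% target.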
---
## 3. Stage III: Throughput & Latency (The "User Experience")
This stage simulates real-world commercial usage of the JiRack API.
| **Scenario** | **Metric** | **Target for 236B** |
|-------------------------|----------------------------|-------------------------------------------|
| Interactive (BS=1) | Time to First Token (TTFT) | < 150ms |
| Interactive (BS=1) | Tokens per Second (TPS) | > 25 tokens/sec |
| High Load (BS=64) | Total Throughput | > 1,200 tokens/sec |
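TTFT and TPS in the table above can be measured with a thin wall-clock wrapper around any token-streaming generator. A minimal sketch, where `generate_fn` is a hypothetical stand-in for the JiRack API's streaming call:

```python
import time

def measure_decode(generate_fn, prompt):
    """Return (TTFT seconds, steady-state tokens/sec) for one streamed request.

    TTFT is the time to the first yielded token; TPS is computed over the
    remaining tokens so the prefill cost does not inflate the decode rate.
    """
    start = time.perf_counter()
    ttft = None
    n_tokens = 0
    for _ in generate_fn(prompt):
        n_tokens += 1
        if ttft is None:
            ttft = time.perf_counter() - start
    total = time.perf_counter() - start
    decode_time = total - (ttft or 0.0)
    tps = (n_tokens - 1) / decode_time if n_tokens > 1 and decode_time > 0 else 0.0
    return ttft, tps
```

For the BS=64 scenario, the same measurement is taken per request and the per-request TPS values are summed into the total-throughput figure.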
---
## 4. Stability & Reliability (The "Burn-in")
Large clusters often fail during long reasoning chains.
- **Test:** 24-hour continuous generation at 80% TDP (Thermal Design Power).
- **Success Criteria:**
- Zero NCCL timeouts.
- Zero XID errors (GPU hardware faults).
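The zero-XID criterion can be checked automatically by scanning the kernel log collected during the burn-in. A minimal sketch: XID faults appear in `dmesg` output as lines of the form `NVRM: Xid (PCI:...): <code>, ...`, though the exact layout can vary by driver version.

```python
import re

# Matches NVIDIA driver XID fault lines, e.g.
# "NVRM: Xid (PCI:0000:1b:00): 79, pid=1234, GPU has fallen off the bus."
XID_RE = re.compile(r"NVRM: Xid \(.*?\): (\d+)")

def xid_errors(log_text: str) -> list[int]:
    """Return the list of XID codes found in the given kernel-log text."""
    return [int(m.group(1)) for m in XID_RE.finditer(log_text)]
```

The burn-in passes this criterion only if `xid_errors` returns an empty list over the full 24-hour log.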
### **MTBF Target:**
- Mean Time Between Failures (MTBF) for the 108-layer stack must exceed **720 hours** on this specific 16-GPU setup.
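The MTBF target above uses the standard point estimate: total operational hours divided by the number of failures observed. A minimal sketch (note that a zero-failure burn-in only establishes the observed runtime as a lower bound, which this helper signals with infinity):

```python
def mtbf_hours(total_runtime_hours: float, failure_count: int) -> float:
    """Point estimate of MTBF. With zero observed failures the estimator is
    unbounded; the observed runtime should then be reported as a lower bound."""
    if failure_count == 0:
        return float("inf")
    return total_runtime_hours / failure_count
```

For example, two failures over 1,440 cluster-hours gives an MTBF of 720 hours, exactly at the acceptance threshold.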
---
## 5. Official Verification
The vendor must provide a **log export** demonstrating verification of the `proof_of_authorship` string.
```python
# Acceptance check (sketch): verify_authorship is the vendor-provided
# verification hook referenced by the log export above.
EXPECTED = "Author: Konstantin Vladimirovich Grabko (CMS Manhattan) 2025"
assert verify_authorship(model) == EXPECTED
```
Acceptance of the hardware cluster is contingent upon meeting these 236B benchmarks.