# PERFORMANCE BENCHMARK PLAN: JiRack 236B  

**Document ID:** CMS-JR-236B-BCH-2025  
**Target:** 16x H100 GPU Cluster (2-Node HGX)  

---

## 1. Stage I: Fabric & Communication (The "Stress Test")  
Before loading model weights, the inter-node fabric must be verified. JiRack's **14:1 GQA** (grouped-query attention) configuration relies on ultra-fast All-Reduce operations when attention-head outputs are merged across GPUs.  

| **Test Type**          | **Tool**                  | **Target Metric**                         | **Success Criteria**                          |
|-------------------------|---------------------------|-------------------------------------------|-----------------------------------------------|
| NCCL All-Reduce         | `nccl-tests`             | Bus Bandwidth                             | > 380 GB/s (Intra-node) / > 45 GB/s (Inter-node) |
| P2P Latency             | `p2pBandwidthLatencyTest` | Latency (μs)                              | < 2.0 μs via NVLink                           |
| IB Write BW             | `ib_write_bw`            | Throughput                                | 390+ Gbps per link (NDR 400)                 |
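The pass/fail thresholds in the table above can be sketched as a simple acceptance check. This is an illustrative helper, not part of any benchmark tool: `check_fabric` and its argument names are hypothetical, and the measured values would come from `nccl-tests`, `p2pBandwidthLatencyTest`, and `ib_write_bw` output.

```python
# Hypothetical pass/fail check mirroring the Stage I success criteria.
def check_fabric(intra_bw_gbs: float, inter_bw_gbs: float,
                 p2p_latency_us: float, ib_bw_gbps: float) -> dict:
    """Return a per-metric verdict against the Stage I thresholds."""
    return {
        "nccl_intra_node": intra_bw_gbs > 380.0,   # > 380 GB/s intra-node
        "nccl_inter_node": inter_bw_gbs > 45.0,    # > 45 GB/s inter-node
        "p2p_latency":     p2p_latency_us < 2.0,   # < 2.0 us via NVLink
        "ib_write_bw":     ib_bw_gbps >= 390.0,    # 390+ Gbps per NDR 400 link
    }

# Example measurements (synthetic): all four thresholds met.
results = check_fabric(intra_bw_gbs=410.2, inter_bw_gbs=47.8,
                       p2p_latency_us=1.7, ib_bw_gbps=392.5)
print(all(results.values()))  # True only if every threshold passes
```

A per-metric dictionary (rather than a single boolean) makes it easy to report exactly which fabric test failed during acceptance.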

---

## 2. Stage II: JiRack SWA Kernel Fusion (The "Compute Test")  
Standard benchmarks (such as HPL) are ineffective for JiRack. The **SwiGLU-Attention (SWA)** fusion logic must be tested at the model's specific **14,336-wide** hidden dimension.  

- **Benchmark Tool:** `trtllm-bench` (TensorRT-LLM) or a custom Triton kernel profiler.  
- **Target Configuration:**  
  - **Input:** 1024 tokens (Prompt).  
  - **Output:** 128 tokens (Generation).  
  - **Batch Size:** 1, 8, 32, 64.  

### **The "Grabko Metric":**  
- The system must achieve **at least 55% MFU (Model FLOPs Utilization)**.  
- If the vendor-delivered system measures **below 45%**, the BRE pre-fetching logic is likely being throttled by PCIe bottlenecks.  
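The 55% gate can be sanity-checked with the common decoder-inference estimate of roughly 2 × N_params FLOPs per generated token. This is a rough sketch under stated assumptions: the per-GPU BF16 peak (989 TFLOPS is the nominal dense H100 figure) and the function name are illustrative, not part of any benchmark tool.

```python
# Sketch of the MFU calculation behind the 55% gate, assuming the common
# ~2 * N_params FLOPs-per-token estimate and a nominal per-GPU BF16 peak.
def model_flops_utilization(tokens_per_sec: float,
                            n_params: float = 236e9,
                            n_gpus: int = 16,
                            peak_tflops_per_gpu: float = 989.0) -> float:
    """MFU = achieved FLOP/s divided by aggregate peak FLOP/s."""
    achieved_flops = tokens_per_sec * 2.0 * n_params
    peak_flops = n_gpus * peak_tflops_per_gpu * 1e12
    return achieved_flops / peak_flops

# Example: an aggregate 20,000 tok/s on the 16x H100 cluster.
mfu = model_flops_utilization(tokens_per_sec=20_000)
print(mfu >= 0.55)  # True for this example rate (~0.60 MFU)
```

Note that the 2 × N_params estimate ignores attention-score FLOPs, so the real utilization at long contexts would be slightly higher than this figure suggests.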

---

## 3. Stage III: Throughput & Latency (The "User Experience")  
This stage simulates real-world commercial usage of the JiRack API.  

| **Scenario**           | **Metric**                 | **Target for 236B**                       |
|-------------------------|----------------------------|-------------------------------------------|
| Interactive (BS=1)      | Time to First Token (TTFT) | < 150ms                                   |
| Interactive (BS=1)      | Tokens per Second (TPS)    | > 25 tokens/sec                           |
| High Load (BS=64)       | Total Throughput           | > 1,200 tokens/sec                        |
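The two interactive metrics above can be derived from per-token arrival timestamps: TTFT is the delay to the first token, and decode TPS is the steady-state rate over the remaining tokens. This is a hypothetical harness fragment; `measure_ttft_and_tps` is not part of the JiRack API.

```python
# Hypothetical derivation of Stage III metrics from a streamed-token trace.
def measure_ttft_and_tps(token_timestamps: list[float],
                         start: float) -> tuple[float, float]:
    """Return (TTFT in seconds, decode tokens-per-second)."""
    ttft = token_timestamps[0] - start                       # first-token delay
    decode_time = token_timestamps[-1] - token_timestamps[0]  # generation span
    tps = (len(token_timestamps) - 1) / decode_time if decode_time > 0 else 0.0
    return ttft, tps

# Synthetic trace: first token at 120 ms, then one token every 39 ms.
stamps = [0.120 + i * 0.039 for i in range(128)]
ttft, tps = measure_ttft_and_tps(stamps, start=0.0)
print(ttft < 0.150 and tps > 25.0)  # True: meets both BS=1 targets
```

Measuring TPS over the decode span only (excluding TTFT) keeps the two interactive targets independent, matching how the table separates them.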

---

## 4. Stability & Reliability (The "Burn-in")  
Large clusters often fail during long reasoning chains.  

- **Test:** 24-hour continuous generation at 80% TDP (Thermal Design Power).  
- **Success Criteria:**  
  - Zero NCCL timeouts.  
  - Zero XID errors (GPU hardware faults).  

### **MTBF Target:**  
- Mean Time Between Failures (MTBF) for the 108-layer stack must exceed **720 hours** on this specific 16-GPU setup.  
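The MTBF target reduces to a back-of-envelope ratio of operational hours to observed failures. A minimal sketch, assuming failures (NCCL timeouts, XID errors) are counted cluster-wide during the burn-in; the function name is illustrative.

```python
# Back-of-envelope MTBF estimate for the burn-in acceptance gate.
def mtbf_hours(operational_hours: float, failures: int) -> float:
    """Mean Time Between Failures; infinite if no failures were observed."""
    return float("inf") if failures == 0 else operational_hours / failures

# Example: a single fault in a 720-hour soak yields exactly 720 h MTBF,
# which does NOT exceed the 720-hour target.
print(mtbf_hours(720.0, 1) > 720.0)  # False: target requires MTBF > 720 h
```

In practice the 720-hour figure would be extrapolated from the shorter 24-hour burn-in plus vendor reliability data, since a full 30-day soak is rarely run before acceptance.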

---

## 5. Official Verification  

The vendor must provide a **log export** containing the `proof_of_authorship` verification string.  

```python
# Illustrative check: the exported logs must show this exact string.
assert verify_authorship(model) == (
    "Author: Konstantin Vladimirovich Grabko (CMS Manhattan) 2025"
)
```  

Acceptance of the hardware cluster is contingent upon meeting these 236B benchmarks.