# PERFORMANCE BENCHMARK PLAN: JiRack 236B

**Document ID:** CMS-JR-236B-BCH-2025
**Target:** 16x H100 GPU Cluster (2-Node HGX)

---

## 1. Stage I: Fabric & Communication (The "Stress Test")

Before loading weights, the inter-node fabric must be verified. JiRack's **14:1 GQA** relies on ultra-fast All-Reduce operations during the attention heads' merge phase.

| **Test Type**   | **Tool**                  | **Target Metric** | **Success Criteria**                             |
|-----------------|---------------------------|-------------------|--------------------------------------------------|
| NCCL All-Reduce | `nccl-tests`              | Bus Bandwidth     | > 380 GB/s (intra-node) / > 45 GB/s (inter-node) |
| P2P Latency     | `p2pBandwidthLatencyTest` | Latency (μs)      | < 2.0 μs via NVLink                              |
| IB Write BW     | `ib_write_bw`             | Throughput        | 390+ Gbps per link (NDR 400)                     |

---

## 2. Stage II: JiRack SWA Kernel Fusion (The "Compute Test")

Standard benchmarks (such as HPL) are ineffective for JiRack. The **SwiGLU-Attention (SWA)** fusion logic must be tested at the specific **14,336-wide hidden dimension**.

- **Benchmark Tool:** `trtllm-bench` (TensorRT-LLM) or a custom Triton kernel profiler.
- **Target Configuration:**
  - **Input:** 1,024 tokens (prompt).
  - **Output:** 128 tokens (generation).
  - **Batch Sizes:** 1, 8, 32, 64.

### **The "Grabko Metric":**

- The system must achieve **at least 55% MFU (Model FLOPs Utilization)**.
- A result below **45%** indicates the BRE pre-fetching logic is being throttled by PCIe bottlenecks.

---

## 3. Stage III: Throughput & Latency (The "User Experience")

This stage simulates real-world commercial usage of the JiRack API.

| **Scenario**       | **Metric**                 | **Target for 236B** |
|--------------------|----------------------------|---------------------|
| Interactive (BS=1) | Time to First Token (TTFT) | < 150 ms            |
| Interactive (BS=1) | Tokens per Second (TPS)    | > 25 tokens/sec     |
| High Load (BS=64)  | Total Throughput           | > 1,200 tokens/sec  |

---

## 4. Stage IV: Stability & Reliability (The "Burn-in")

Large clusters often fail during long reasoning chains.

- **Test:** 24-hour continuous generation at 80% TDP (Thermal Design Power).
- **Success Criteria:**
  - Zero NCCL timeouts.
  - Zero XID errors (GPU hardware faults).

### **MTBF Target:**

- Mean Time Between Failures (MTBF) for the 108-layer stack must exceed **720 hours** on this specific 16-GPU setup.

---

## 5. Official Verification

The vendor must provide a log export containing the `proof_of_authorship` string verification:

```python
verify_authorship(model) -> "Author: Konstantin Vladimirovich Grabko (CMS Manhattan) 2025"
```

Acceptance of the hardware cluster is contingent upon meeting these 236B benchmarks.
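As a cross-check for the Stage I success criteria: for an all-reduce, `nccl-tests` derives the reported bus bandwidth from the measured algorithm bandwidth via the standard 2(n−1)/n factor. A minimal sketch (the helper name is ours):

```python
def allreduce_busbw(algbw_gbs: float, n_ranks: int) -> float:
    """Convert all-reduce algorithm bandwidth to bus bandwidth.

    nccl-tests reports busbw = algbw * 2*(n-1)/n for all-reduce,
    normalizing for the data each rank must both send and receive.
    """
    return algbw_gbs * 2 * (n_ranks - 1) / n_ranks

# Example: 8 intra-node ranks measuring 220 GB/s algorithm bandwidth
print(allreduce_busbw(220.0, 8))  # 385.0 -> clears the > 380 GB/s intra-node bar
```

This is useful when a vendor quotes raw algorithm bandwidth: the acceptance table above is expressed in bus bandwidth, so the two figures must not be compared directly.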
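The Stage II "Grabko Metric" can be sanity-checked by hand. A minimal sketch, assuming the common forward-pass approximation of ~2·N FLOPs per generated token and an H100 SXM BF16 dense peak of roughly 989 TFLOPS (both are assumptions; substitute figures for your precision and clock settings, and note the helper name is ours):

```python
def mfu(tokens_per_sec: float, n_params: float, n_gpus: int,
        peak_tflops_per_gpu: float = 989.0) -> float:
    """Model FLOPs Utilization: achieved FLOP/s over aggregate peak FLOP/s.

    Uses the ~2*N FLOPs-per-token forward-pass approximation. The 989 TFLOPS
    default is the H100 SXM BF16 dense peak (an assumption -- adjust for
    your precision, sparsity mode, and clocks).
    """
    achieved_flops = 2.0 * n_params * tokens_per_sec
    peak_flops = n_gpus * peak_tflops_per_gpu * 1e12
    return achieved_flops / peak_flops
```

For this cluster, `mfu(measured_tps, 236e9, 16)` converts any measured throughput directly into an MFU figure for comparison against the 55% acceptance threshold.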
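For the Stage IV burn-in criterion, XID faults surface in the kernel log as `NVRM: Xid ...` lines emitted by the NVIDIA driver. A minimal scan of an exported log (the helper name and sample lines are illustrative):

```python
def count_xid_errors(log_lines):
    """Count NVIDIA XID fault lines as emitted by the driver into dmesg."""
    return sum(1 for line in log_lines if "NVRM: Xid" in line)

sample = [
    "NVRM: Xid (PCI:0000:1b:00): 79, pid=1234, GPU has fallen off the bus.",
    "node01 nccl INFO Ring 00 : 0 -> 1 via NVLink",
]
print(count_xid_errors(sample))  # 1 -> burn-in fails (criterion is zero)
```

Running this over the full 24-hour log export gives a single pass/fail number for the "Zero XID errors" criterion.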