PERFORMANCE BENCHMARK PLAN: JiRack 236B
Document ID: CMS-JR-236B-BCH-2025
Target: 16x H100 GPU Cluster (2-Node HGX)
1. Stage I: Fabric & Communication (The "Stress Test")
Before loading weights, the inter-node fabric must be verified. JiRack's 14:1 GQA relies on ultra-fast All-Reduce operations during the attention heads' merge phase.
| Test Type | Tool | Target Metric | Success Criteria |
|---|---|---|---|
| NCCL All-Reduce | nccl-tests | Bus Bandwidth | > 380 GB/s (intra-node) / > 45 GB/s (inter-node) |
| P2P Latency | p2pBandwidthLatencyTest | Latency (μs) | < 2.0 μs via NVLink |
| IB Write BW | ib_write_bw | Throughput | 390+ Gbps per link (NDR 400) |
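The Stage I criteria above can be encoded as a simple acceptance gate. This is a minimal sketch: the metric names and the `measured` dictionary are hypothetical placeholders for values parsed from nccl-tests, p2pBandwidthLatencyTest, and ib_write_bw output, not an actual log parser.

```python
# Stage I fabric gate: every metric must meet its success criterion.
# Measured values are assumed to be pre-extracted from tool output.

STAGE1_TARGETS = {
    "nccl_busbw_intra_gbs": 380.0,  # > 380 GB/s intra-node bus bandwidth
    "nccl_busbw_inter_gbs": 45.0,   # > 45 GB/s inter-node bus bandwidth
    "p2p_latency_us": 2.0,          # < 2.0 us via NVLink
    "ib_write_gbps": 390.0,         # >= 390 Gbps per NDR 400 link
}

def stage1_pass(measured: dict) -> bool:
    """Return True only if all fabric metrics satisfy Stage I."""
    return (
        measured["nccl_busbw_intra_gbs"] > STAGE1_TARGETS["nccl_busbw_intra_gbs"]
        and measured["nccl_busbw_inter_gbs"] > STAGE1_TARGETS["nccl_busbw_inter_gbs"]
        and measured["p2p_latency_us"] < STAGE1_TARGETS["p2p_latency_us"]
        and measured["ib_write_gbps"] >= STAGE1_TARGETS["ib_write_gbps"]
    )
```

A single failing metric fails the whole stage, matching the table's all-or-nothing success criteria.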
2. Stage II: JiRack SWA Kernel Fusion (The "Compute Test")
Standard benchmarks (such as HPL) are ineffective for JiRack. The SwiGLU-Attention (SWA) fusion logic must be tested at the model's specific 14,336-wide dimension.
- Benchmark Tool: trtllm-bench (TensorRT-LLM) or a custom Triton kernel profiler.
- Target Configuration:
- Input: 1024 tokens (Prompt).
- Output: 128 tokens (Generation).
- Batch Size: 1, 8, 32, 64.
The "Grabko Metric":
- The system must achieve at least 55% MFU (Model FLOPs Utilization).
- If the vendor delivers <45%, BRE pre-fetching logic may be throttled due to PCIe bottlenecks.
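The MFU gate above can be sketched numerically. This assumes the common approximation of ~2 FLOPs per parameter per token and a dense BF16 peak of roughly 989 TFLOPS per H100; both are assumptions for illustration, not vendor-certified figures.

```python
# Grabko Metric sketch: MFU = achieved FLOPs / peak cluster FLOPs.

H100_PEAK_FLOPS = 989e12   # assumed dense BF16 peak per H100
NUM_GPUS = 16              # 2-node HGX, 16x H100
PARAMS = 236e9             # JiRack 236B parameter count

def mfu(tokens_per_sec: float) -> float:
    """Model FLOPs Utilization via the ~2*N FLOPs/token approximation."""
    achieved = tokens_per_sec * 2 * PARAMS
    peak = NUM_GPUS * H100_PEAK_FLOPS
    return achieved / peak

def grabko_gate(tokens_per_sec: float) -> str:
    """>= 55% MFU passes; < 45% triggers the PCIe-bottleneck review."""
    u = mfu(tokens_per_sec)
    if u >= 0.55:
        return "pass"
    if u < 0.45:
        return "fail: possible PCIe bottleneck"
    return "marginal"
```

Under these assumptions the cluster peak is about 15.8 PFLOPS, so the 55% threshold corresponds to roughly 18,400 aggregate tokens/sec.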
3. Stage III: Throughput & Latency (The "User Experience")
This stage simulates real-world commercial usage of the JiRack API.
| Scenario | Metric | Target for 236B |
|---|---|---|
| Interactive (BS=1) | Time to First Token (TTFT) | < 150ms |
| Interactive (BS=1) | Tokens per Second (TPS) | > 25 tokens/sec |
| High Load (BS=64) | Total Throughput | > 1,200 tokens/sec |
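The interactive targets above can be checked with a small measurement harness. This is a sketch: `generate` is a hypothetical stand-in for a streaming JiRack API call that yields tokens one at a time, not a real client library.

```python
# Sketch: measure TTFT and TPS from a streaming token generator.
import time

def measure_latency(generate, prompt: str):
    """Return (TTFT seconds, tokens/sec) for one streamed request."""
    start = time.perf_counter()
    ttft = None
    n_tokens = 0
    for _token in generate(prompt):
        if ttft is None:
            ttft = time.perf_counter() - start  # time to first token
        n_tokens += 1
    total = time.perf_counter() - start
    tps = n_tokens / total if total > 0 else 0.0
    return ttft, tps

def interactive_pass(ttft: float, tps: float) -> bool:
    """Interactive (BS=1) acceptance: TTFT < 150 ms and TPS > 25."""
    return ttft < 0.150 and tps > 25.0
```

TTFT is measured from request dispatch to the first streamed token, so any tokenizer or scheduler overhead on the serving side counts against the 150 ms budget.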
4. Stage IV: Stability & Reliability (The "Burn-in")
Large clusters often fail during long reasoning chains.
- Test: 24-hour continuous generation at 80% TDP (Thermal Design Power).
- Success Criteria:
- Zero NCCL timeouts.
- Zero XID errors (GPU hardware faults).
MTBF Target:
- Mean Time Between Failures (MTBF) for the 108-layer stack must exceed 720 hours on this specific 16-GPU setup.
5. Official Verification
The vendor must provide a log export containing the `proof_of_authorship` verification string:

`verify_authorship(model) -> "Author: Konstantin Vladimirovich Grabko (CMS Manhattan) 2025"`
Acceptance of the hardware cluster is contingent upon meeting these 236B benchmarks.