PERFORMANCE BENCHMARK PLAN: JiRack 236B

Document ID: CMS-JR-236B-BCH-2025
Target: 16x H100 GPU Cluster (2-Node HGX)


1. Stage I: Fabric & Communication (The "Stress Test")

Before loading weights, verify both the intra-node (NVLink) and inter-node (InfiniBand) fabric. JiRack's 14:1 GQA relies on fast All-Reduce operations when the attention-head outputs are merged.

| Test Type | Tool | Target Metric | Success Criteria |
| --- | --- | --- | --- |
| NCCL All-Reduce | nccl-tests | Bus Bandwidth | > 380 GB/s (intra-node) / > 45 GB/s (inter-node) |
| P2P Latency | p2pBandwidthLatencyTest | Latency (μs) | < 2.0 μs via NVLink |
| IB Write BW | ib_write_bw | Throughput | 390+ Gbps per link (NDR 400) |
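
As a rough aid for reading the All-Reduce row, the sketch below converts an algorithm bandwidth into the bus bandwidth that nccl-tests reports, using the all-reduce correction factor 2(n-1)/n, and compares it with the thresholds in the table. The function and constant names are illustrative, not part of any tool; nccl-tests already prints bus bandwidth directly, so this is only useful for cross-checking.

```python
# Thresholds taken from the Stage I table above (GB/s).
INTRA_NODE_MIN_GBS = 380.0
INTER_NODE_MIN_GBS = 45.0


def allreduce_bus_bandwidth(algbw_gbs: float, n_ranks: int) -> float:
    """Convert all-reduce algorithm bandwidth to bus bandwidth.

    For all-reduce, bus bandwidth = algorithm bandwidth * 2 * (n - 1) / n,
    where n is the number of participating ranks.
    """
    if n_ranks < 2:
        raise ValueError("all-reduce needs at least two ranks")
    return algbw_gbs * 2 * (n_ranks - 1) / n_ranks


def passes_fabric_check(busbw_gbs: float, inter_node: bool) -> bool:
    """Return True if the measured bus bandwidth exceeds the plan's threshold."""
    threshold = INTER_NODE_MIN_GBS if inter_node else INTRA_NODE_MIN_GBS
    return busbw_gbs > threshold


# Hypothetical reading: 240 GB/s algorithm bandwidth across 8 GPUs in one node.
busbw = allreduce_bus_bandwidth(240.0, 8)
print(busbw, passes_fabric_check(busbw, inter_node=False))
```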

2. Stage II: JiRack SWA Kernel Fusion (The "Compute Test")

Standard benchmarks such as HPL do not exercise JiRack's critical path. The SwiGLU-Attention (SWA) fusion logic must be tested at the model's specific 14,336-width dimension.

  • Benchmark Tool: trtllm-bench (TensorRT-LLM) or a custom Triton kernel profiler.
  • Target Configuration:
    • Input: 1024 tokens (Prompt).
    • Output: 128 tokens (Generation).
    • Batch Size: 1, 8, 32, 64.

The "Grabko Metric":

  • The system must achieve at least 55% MFU (Model FLOPs Utilization).
  • A result below 45% MFU suggests that the BRE pre-fetching logic is being throttled by PCIe bottlenecks.
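
One way to estimate MFU from a measured throughput is sketched below. It rests on several assumptions that are not stated in this plan: roughly 2 FLOPs per parameter per token for a forward pass (attention FLOPs ignored), 236e9 parameters inferred from the model name, and a dense BF16 peak of about 989 TFLOPS per H100 SXM GPU, which should be replaced with the vendor's datasheet figure. The label for the 45% to 55% band is also a placeholder, since the plan does not define it.

```python
N_PARAMS = 236e9                 # assumption: parameter count inferred from "236B"
N_GPUS = 16                      # from the target cluster
PEAK_FLOPS_PER_GPU = 989e12      # assumption: dense BF16 peak, H100 SXM


def model_flops_utilization(tokens_per_sec: float,
                            n_params: float = N_PARAMS,
                            n_gpus: int = N_GPUS,
                            peak_flops_per_gpu: float = PEAK_FLOPS_PER_GPU) -> float:
    """Estimate MFU as achieved FLOPs/s divided by aggregate peak FLOPs/s."""
    achieved = 2.0 * n_params * tokens_per_sec   # forward-pass approximation
    peak = n_gpus * peak_flops_per_gpu
    return achieved / peak


def classify_mfu(mfu: float) -> str:
    """Map an MFU value onto the thresholds named in this plan."""
    if mfu >= 0.55:
        return "pass"
    if mfu < 0.45:
        return "investigate PCIe / BRE pre-fetching"
    return "below target"        # band not defined by the plan; placeholder label
```

Tokens processed during prefill and tokens generated during decode both count toward `tokens_per_sec` here; which of the two the vendor reports should be agreed before comparing numbers.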

3. Stage III: Throughput & Latency (The "User Experience")

This stage simulates real-world commercial usage of the JiRack API.

| Scenario | Metric | Target for 236B |
| --- | --- | --- |
| Interactive (BS=1) | Time to First Token (TTFT) | < 150 ms |
| Interactive (BS=1) | Tokens per Second (TPS) | > 25 tokens/sec |
| High Load (BS=64) | Total Throughput | > 1,200 tokens/sec |
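
The interactive metrics can be measured from any streaming token source. The sketch below is a minimal timing harness, assuming the client exposes generated tokens as a Python iterable; `fake_stream` is a stand-in for a real JiRack API client, which this plan does not specify. TPS is computed over the tokens after the first one, which is one common convention, not the only one.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_stream(token_stream: Iterable[str]) -> Tuple[float, float, int]:
    """Return (TTFT in ms, decode tokens/sec, token count) for one stream."""
    start = time.perf_counter()
    first_token_time = None
    count = 0
    for _ in token_stream:
        if first_token_time is None:
            first_token_time = time.perf_counter()
        count += 1
    end = time.perf_counter()
    if first_token_time is None:
        raise ValueError("stream produced no tokens")
    ttft_ms = (first_token_time - start) * 1000.0
    decode_seconds = end - first_token_time
    # Tokens after the first, divided by the time spent producing them.
    tps = (count - 1) / decode_seconds if count > 1 and decode_seconds > 0 else 0.0
    return ttft_ms, tps, count


def fake_stream(n_tokens: int, delay_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a streaming client; sleeps before each token."""
    for i in range(n_tokens):
        time.sleep(delay_s)
        yield f"tok{i}"


ttft_ms, tps, n = measure_stream(fake_stream(5, 0.01))
print(f"TTFT={ttft_ms:.1f} ms, TPS={tps:.1f}, tokens={n}")
```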

4. Stability & Reliability (The "Burn-in")

Large clusters often fail during long-running generation, such as extended reasoning chains.

  • Test: 24-hour continuous generation at 80% TDP (Thermal Design Power).
  • Success Criteria:
    • Zero NCCL timeouts.
    • Zero Xid errors (GPU faults reported by the NVIDIA driver).
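
The two success criteria can be checked by scanning the collected logs after the run. The sketch below assumes that Xid events appear in the kernel log in the form `NVRM: Xid (PCI:...): <code>, ...` and that NCCL timeouts appear on a line mentioning both "NCCL" and "timeout"; both patterns are assumptions about the log format and should be adjusted to what the vendor's stack actually emits.

```python
import re
from typing import Dict, Iterable, List, Union

# Assumed kernel-log format for NVIDIA Xid events.
XID_PATTERN = re.compile(r"NVRM: Xid \(PCI:[0-9a-fA-F:.]+\): (\d+)")
# Assumed shape of an NCCL timeout message; framework-dependent.
NCCL_TIMEOUT_PATTERN = re.compile(r"NCCL.*timeout|timeout.*NCCL", re.IGNORECASE)


def scan_burn_in_log(lines: Iterable[str]) -> Dict[str, Union[List[int], int, bool]]:
    """Collect Xid codes and count NCCL timeout lines in a burn-in log."""
    xid_codes: List[int] = []
    nccl_timeouts = 0
    for line in lines:
        match = XID_PATTERN.search(line)
        if match:
            xid_codes.append(int(match.group(1)))
        if NCCL_TIMEOUT_PATTERN.search(line):
            nccl_timeouts += 1
    return {
        "xid_codes": xid_codes,
        "nccl_timeouts": nccl_timeouts,
        "passed": not xid_codes and nccl_timeouts == 0,
    }
```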

MTBF Target:

  • Mean Time Between Failures (MTBF) for the 108-layer stack must exceed 720 hours on this specific 16-GPU setup.
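
For reference, MTBF is conventionally estimated as total operating hours divided by the number of failures observed. The helper below is a minimal sketch of that arithmetic; the function name and the choice to return infinity for zero failures are illustrative.

```python
import math

MTBF_TARGET_HOURS = 720.0  # from the MTBF target above


def mtbf_hours(total_operating_hours: float, failures: int) -> float:
    """Estimate MTBF as operating hours per observed failure."""
    if total_operating_hours < 0 or failures < 0:
        raise ValueError("hours and failures must be non-negative")
    if failures == 0:
        return math.inf  # no failures observed: no finite estimate
    return total_operating_hours / failures


# Hypothetical record: 2,160 operating hours with 2 failures.
estimate = mtbf_hours(2160.0, 2)
print(estimate, estimate > MTBF_TARGET_HOURS)
```

A single 24-hour run with zero failures yields no finite estimate, so operating hours would need to be accumulated across runs to evaluate the 720-hour target.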

5. Official Verification

The vendor must provide a log export that includes the proof_of_authorship string verification.

verify_authorship(model) -> "Author: Konstantin Vladimirovich Grabko (CMS Manhattan) 2025"
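
A reviewer could confirm that the exported log carries the expected string with a check along these lines. This is only a sketch: the log format is not defined in this plan, so the helper simply looks for the exact string anywhere in the text, and `log_contains_authorship` is a hypothetical name.

```python
EXPECTED_AUTHORSHIP = "Author: Konstantin Vladimirovich Grabko (CMS Manhattan) 2025"


def log_contains_authorship(log_text: str) -> bool:
    """Return True if the exported log contains the exact authorship string."""
    return EXPECTED_AUTHORSHIP in log_text
```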

Acceptance of the hardware cluster is contingent upon meeting these 236B benchmarks.