# PERFORMANCE BENCHMARK PLAN: JiRack 236B
**Document ID:** CMS-JR-236B-BCH-2025
**Target:** 16x H100 GPU Cluster (2-Node HGX)
---
## 1. Stage I: Fabric & Communication (The "Stress Test")
Before loading weights, the inter-node fabric must be verified. JiRack's grouped-query attention (**14:1** query-to-KV-head ratio) relies on ultra-fast All-Reduce operations during the attention heads' merge phase.
| **Test Type** | **Tool** | **Target Metric** | **Success Criteria** |
|-------------------------|---------------------------|-------------------------------------------|-----------------------------------------------|
| NCCL All-Reduce | `nccl-tests` | Bus Bandwidth | > 380 GB/s (Intra-node) / > 45 GB/s (Inter-node) |
| P2P Latency | `p2pBandwidthLatencyTest` | Latency (μs) | < 2.0 μs via NVLink |
| IB Write BW | `ib_write_bw` | Throughput | 390+ Gbps per link (NDR 400) |
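The thresholds in the table above can be encoded in a small acceptance checker. This is a minimal sketch: the dictionary keys and the `fabric_pass` helper are illustrative names, not output formats of `nccl-tests` or `perftest`; the measured values would be parsed from those tools' logs.

```python
# Stage I acceptance thresholds from the table above.
# "min" = measured value must meet or exceed the limit; "max" = must not exceed it.
FABRIC_TARGETS = {
    "allreduce_intra_gbps": (380.0, "min"),  # NCCL bus BW, intra-node
    "allreduce_inter_gbps": (45.0, "min"),   # NCCL bus BW, inter-node
    "p2p_latency_us":       (2.0, "max"),    # NVLink P2P latency
    "ib_write_gbps":        (390.0, "min"),  # ib_write_bw per NDR 400 link
}

def fabric_pass(measured: dict) -> bool:
    """Return True only if every measured figure satisfies its threshold."""
    for key, (limit, kind) in FABRIC_TARGETS.items():
        value = measured[key]
        if kind == "min" and value < limit:
            return False
        if kind == "max" and value > limit:
            return False
    return True
```

A single failing metric (e.g. inter-node bus bandwidth at 40 GB/s) fails the whole stage, which matches the pass/fail intent of the table.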
---
## 2. Stage II: JiRack SWA Kernel Fusion (The "Compute Test")
Standard benchmarks (like HPL) are ineffective for JiRack. The **SwiGLU-Attention (SWA)** fusion kernels must be tested at the model's specific width of **14,336**.
- **Benchmark Tool:** `trtllm-bench` (TensorRT-LLM) or a custom Triton kernel profiler.
- **Target Configuration:**
- **Input:** 1024 tokens (Prompt).
- **Output:** 128 tokens (Generation).
- **Batch Size:** 1, 8, 32, 64.
### **The "Grabko Metric":**
- The system must achieve **at least 55% MFU (Model FLOPs Utilization)**.
- A delivered result **below 45%** indicates the BRE pre-fetching logic is likely being throttled by PCIe bottlenecks, and the configuration must be investigated before acceptance.
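The MFU threshold above can be computed from the throughput numbers the benchmark reports. A minimal sketch, assuming the standard approximation of ~2 FLOPs per parameter per generated token for a dense decoder and the H100 SXM dense BF16 peak of roughly 989 TFLOPS (both are assumptions; substitute the cluster's actual peak and the model's measured FLOPs-per-token if available):

```python
def mfu(params_billion: float, tokens_per_sec: float, num_gpus: int,
        peak_tflops_per_gpu: float = 989.0) -> float:
    """Approximate Model FLOPs Utilization for a dense decoder.

    achieved ~= 2 FLOPs/parameter/token * params * tokens/sec
    peak     = num_gpus * per-GPU peak (dense BF16, no sparsity)
    """
    achieved_flops = 2.0 * params_billion * 1e9 * tokens_per_sec
    peak_flops = num_gpus * peak_tflops_per_gpu * 1e12
    return achieved_flops / peak_flops
```

Under these assumptions, the 236B model on 16 H100s would need on the order of 18,000 tokens/sec of aggregate prefill throughput to sit near the 55% target.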
---
## 3. Stage III: Throughput & Latency (The "User Experience")
This stage simulates real-world commercial usage of the JiRack API.
| **Scenario** | **Metric** | **Target for 236B** |
|-------------------------|----------------------------|-------------------------------------------|
| Interactive (BS=1) | Time to First Token (TTFT) | < 150ms |
| Interactive (BS=1) | Tokens per Second (TPS) | > 25 tokens/sec |
| High Load (BS=64) | Total Throughput | > 1,200 tokens/sec |
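TTFT and TPS in the table above can be measured with a thin wall-clock wrapper around any token-streaming generator. A minimal sketch, where `generate_fn` is a hypothetical stand-in for the JiRack API's streaming call:

```python
import time

def measure_decode(generate_fn, prompt):
    """Return (TTFT seconds, steady-state tokens/sec) for one streamed request.

    TTFT is the time to the first yielded token; TPS is computed over the
    remaining tokens so the prefill cost does not inflate the decode rate.
    """
    start = time.perf_counter()
    ttft = None
    n_tokens = 0
    for _ in generate_fn(prompt):
        n_tokens += 1
        if ttft is None:
            ttft = time.perf_counter() - start
    total = time.perf_counter() - start
    decode_time = total - (ttft or 0.0)
    tps = (n_tokens - 1) / decode_time if n_tokens > 1 and decode_time > 0 else 0.0
    return ttft, tps
```

For the BS=64 scenario, the same measurement is taken per request and the per-request TPS values are summed into the total-throughput figure.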
---
## 4. Stability & Reliability (The "Burn-in")
Large clusters often fail during long reasoning chains.
- **Test:** 24-hour continuous generation at 80% TDP (Thermal Design Power).
- **Success Criteria:**
- Zero NCCL timeouts.
- Zero XID errors (GPU hardware faults).
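The zero-XID criterion can be checked automatically by scanning the kernel log collected during the burn-in. A minimal sketch: XID faults appear in `dmesg` output as lines of the form `NVRM: Xid (PCI:...): <code>, ...`, though the exact layout can vary by driver version.

```python
import re

# Matches NVIDIA driver XID fault lines, e.g.
# "NVRM: Xid (PCI:0000:1b:00): 79, pid=1234, GPU has fallen off the bus."
XID_RE = re.compile(r"NVRM: Xid \(.*?\): (\d+)")

def xid_errors(log_text: str) -> list[int]:
    """Return the list of XID codes found in the given kernel-log text."""
    return [int(m.group(1)) for m in XID_RE.finditer(log_text)]
```

The burn-in passes this criterion only if `xid_errors` returns an empty list over the full 24-hour log.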
### **MTBF Target:**
- Mean Time Between Failures (MTBF) for the 108-layer stack must exceed **720 hours** on this specific 16-GPU setup.
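The MTBF target above uses the standard point estimate: total operational hours divided by the number of failures observed. A minimal sketch (note that a zero-failure burn-in only establishes the observed runtime as a lower bound, which this helper signals with infinity):

```python
def mtbf_hours(total_runtime_hours: float, failure_count: int) -> float:
    """Point estimate of MTBF. With zero observed failures the estimator is
    unbounded; the observed runtime should then be reported as a lower bound."""
    if failure_count == 0:
        return float("inf")
    return total_runtime_hours / failure_count
```

For example, two failures over 1,440 cluster-hours gives an MTBF of 720 hours, exactly at the acceptance threshold.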
---
## 5. Official Verification
The vendor must provide a **log export** demonstrating verification of the `proof_of_authorship` string.
```python
# Acceptance check (sketch): verify_authorship is the vendor-provided
# verification hook referenced by the log export above.
EXPECTED = "Author: Konstantin Vladimirovich Grabko (CMS Manhattan) 2025"
assert verify_authorship(model) == EXPECTED
```
Acceptance of the hardware cluster is contingent upon meeting these 236B benchmarks.