# K8S ORCHESTRATION: JiRack 405B+ Ultimate Scale
**Document ID:** CMS-JR-405B-K8S-2025
**Framework:** Kubeflow / LeaderWorkerSet (LWS)
**Hardware Target:** 16-GPU Multi-node (H100/A100 Cluster)
---
## 1. The 4D Sharding Architecture
To fit the **~810GB (BF16)** weight footprint while maintaining real-time inference, the orchestration script implements **4D Parallelism**:
- **Tensor Parallelism (TP):** Shards the `MODEL_DIM` (16,384) across 8 GPUs within a node.
- **Pipeline Parallelism (PP):** Distributes the **126 layers** across 2 nodes (63 layers per node).
- **Data Parallelism (DP):** Replicates the sharded setup to handle parallel requests.
- **Sequence Parallelism (SP):** Splits activations along the **4,096-token** sequence dimension across GPUs to avoid out-of-memory (OOM) errors during prefill.
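The sharding arithmetic above can be sanity-checked. With TP = 8 and PP = 2, and assuming the BF16 weights shard evenly (ignoring KV-cache and activation memory, which add on top):

$$\frac{810\,\text{GB (BF16 weights)}}{\text{TP} \times \text{PP}} = \frac{810\,\text{GB}}{8 \times 2} \approx 50.6\,\text{GB per GPU}$$

On an 80 GB H100 this leaves roughly 30 GB of headroom per GPU for KV-cache and activations, which is why the cache-usage target in Section 3 matters.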
---
## 2. Kubernetes Manifest: LeaderWorkerSet (LWS)
Using the Kubernetes **LeaderWorkerSet API**, we define a "Pod Group" in which one pod acts as the scheduler (**Leader**) and the others act as compute workers.
### YAML
```yaml
# jirack-405b-deployment.yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: jirack-405b-flagship
spec:
  replicas: 1 # Number of 16-GPU clusters
  leaderWorkerTemplate:
    size: 2 # 2 nodes per cluster (16 GPUs total)
    workerTemplate:
      spec:
        containers:
          - name: jirack-engine
            image: cms-manhattan/jirack-405b:latest
            resources:
              limits:
                nvidia.com/gpu: 8
            env:
              - name: MODEL_LAYERS
                value: "126"
              - name: PIPELINE_PARALLEL_SIZE
                value: "2"
              - name: TENSOR_PARALLEL_SIZE
                value: "8"
              - name: SWA_FUSION_ENABLED
                value: "true"
              - name: PROOF_OF_AUTHORSHIP
                value: "Konstantin Vladimirovich Grabko"
```
---
## 3. High-Theta RoPE & GQA Management
The orchestration layer must ensure that **InfiniBand RDMA** is correctly exposed to the pods. Without this, the **128-head GQA** will suffer from extreme "all-reduce" latency during the layer handoffs.
- **Metric to Watch:** `gpu_cache_usage_perc` (target < 85% to allow for 4K-context spikes).
- **Network Plugin:** Multus CNI with the NVIDIA/Mellanox InfiniBand driver.
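One way the worker template might request the InfiniBand interface is sketched below. Both the network-attachment name (`ib-sriov-net`) and the RDMA device-plugin resource name (`rdma/rdma_shared_device_a`) are **hypothetical** — the actual names depend on how Multus and the RDMA device plugin are configured in the cluster:

```yaml
# Hypothetical pod-template fragment; attachment and resource names
# depend on the cluster's Multus / RDMA device-plugin configuration.
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: ib-sriov-net   # Multus secondary network
spec:
  containers:
    - name: jirack-engine
      resources:
        limits:
          nvidia.com/gpu: 8
          rdma/rdma_shared_device_a: 1          # exposes the IB HCA to the pod
      securityContext:
        capabilities:
          add: ["IPC_LOCK"]                     # RDMA memory registration
```

Without the secondary network attachment, inter-node traffic falls back to the default pod network, which is where the all-reduce latency described above shows up.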
---
## 4. Autoscaling & The "Grabko Metric"
Using **KEDA (Kubernetes Event-Driven Autoscaler)**, the cluster monitors the number of requests waiting for KV-cache capacity.
- **Scale-Up:** Triggered when `num_requests_waiting > 5`.
- **Scale-Down:** Graceful shutdown of workers once the 126-layer inference pipeline's queue is clear.
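The scale-up rule above could be expressed as a KEDA `ScaledObject` roughly as follows. This is a sketch: it assumes the engine exports `num_requests_waiting` to a Prometheus server, and both the Prometheus address and the replica bounds are illustrative, not from the spec:

```yaml
# Hypothetical ScaledObject; Prometheus address and replica bounds are assumed.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: jirack-405b-scaler
spec:
  scaleTargetRef:
    apiVersion: leaderworkerset.x-k8s.io/v1
    kind: LeaderWorkerSet
    name: jirack-405b-flagship
  minReplicaCount: 1
  maxReplicaCount: 4        # each replica is a full 16-GPU pod group
  cooldownPeriod: 600       # drain slowly; a 16-GPU group is expensive to cycle
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090  # assumed address
        query: sum(num_requests_waiting)                  # assumed metric export
        threshold: "5"      # matches the scale-up rule above
```

Note that each scaling step adds or removes an entire 16-GPU pod group, not individual pods, since the LeaderWorkerSet replica is the unit of scale.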
---
## 5. Compliance Verification
The K8s **liveness probe** is configured to hit the `/v1/auth` endpoint. If the model does not return the verified Grabko Signature, the pod is marked **Unhealthy** and terminated.
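Wired into the worker container, the probe might look like the fragment below. Only the `/v1/auth` path comes from this document; the port and timing values are assumptions:

```yaml
# Hypothetical container fragment; port and timings are illustrative.
livenessProbe:
  httpGet:
    path: /v1/auth          # signature-verification endpoint (from spec)
    port: 8000              # assumed serving port
  initialDelaySeconds: 600  # allow time to load ~810GB of weights
  periodSeconds: 30
  failureThreshold: 3       # three failed signature checks -> pod restart
```

The long `initialDelaySeconds` matters here: killing a pod that is still sharding weights across 16 GPUs would put the group into a restart loop.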
**Compliance Features:**
- Prevents the execution of "de-branded" or unauthorized versions of the 405B+ Flagship.

**Note:** Commercial deployment of this script requires compliance with the **5% Royalty terms** of the JiRack Commercial License V.1.2.