# K8S ORCHESTRATION: JiRack 405B+ Ultimate Scale
**Document ID:** CMS-JR-405B-K8S-2025
**Framework:** Kubeflow / LeaderWorkerSet (LWS)
**Hardware Target:** 16-GPU Multi-node (H100/A100 Cluster)
---
## 1. The 4D Sharding Architecture
To fit the **~810GB (BF16)** weight footprint while maintaining real-time inference, the orchestration script implements **4D Parallelism**:
- **Tensor Parallelism (TP):** Shards the `MODEL_DIM` (16,384) across 8 GPUs within a node.
- **Pipeline Parallelism (PP):** Distributes the **126 layers** across 2 nodes (63 layers per node).
- **Data Parallelism (DP):** Replicates the sharded setup to handle parallel requests.
- **Sequence Parallelism (SP):** Splits the **4,096-token attention** across GPUs to avoid OOM (Out of Memory) during prefill.
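The memory arithmetic behind this layout can be checked with a back-of-envelope sketch. The figures below (405B parameters, BF16, 126 layers, `MODEL_DIM` 16,384, TP=8, PP=2) come from this document; the variable names are illustrative only.

```python
# Back-of-envelope sharding calculator for the 405B deployment.
# All constants are taken from the figures in this document.
PARAMS_B = 405          # model parameters, in billions
BYTES_PER_PARAM = 2     # BF16 = 2 bytes per parameter
LAYERS = 126
MODEL_DIM = 16_384
TP = 8                  # tensor-parallel degree (GPUs per node)
PP = 2                  # pipeline-parallel degree (nodes)

total_gpus = TP * PP                      # 16 GPUs
weights_gb = PARAMS_B * BYTES_PER_PARAM   # ~810 GB of weights in BF16
per_gpu_gb = weights_gb / total_gpus      # weight share per GPU
layers_per_stage = LAYERS // PP           # layers hosted per node
dim_per_gpu = MODEL_DIM // TP             # hidden-dim slice per GPU

print(per_gpu_gb, layers_per_stage, dim_per_gpu)
```

At ~50.6 GB of weights per GPU, an 80 GB H100/A100 retains roughly 30 GB for KV-cache and activations, which is why Sequence Parallelism is still needed for long prefills.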
---
## 2. Kubernetes Manifest: LeaderWorkerSet (LWS)
Using the Kubernetes **LeaderWorkerSet API**, we define a "Pod Group" in which one pod acts as the scheduler (**Leader**) and the others act as compute workers.
### YAML
```yaml
# jirack-405b-deployment.yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: jirack-405b-flagship
spec:
  replicas: 1            # Number of 16-GPU clusters
  leaderWorkerTemplate:
    size: 2              # 2 nodes per cluster (16 GPUs total)
    workerTemplate:
      spec:
        containers:
          - name: jirack-engine
            image: cms-manhattan/jirack-405b:latest
            resources:
              limits:
                nvidia.com/gpu: 8
            env:
              - name: MODEL_LAYERS
                value: "126"
              - name: PIPELINE_PARALLEL_SIZE
                value: "2"
              - name: TENSOR_PARALLEL_SIZE
                value: "8"
              - name: SWA_FUSION_ENABLED
                value: "true"
              - name: PROOF_OF_AUTHORSHIP
                value: "Konstantin Vladimirovich Grabko"
```
---
## 3. High-Theta RoPE & GQA Management
The orchestration layer must ensure that **InfiniBand RDMA** is correctly exposed to the pods. Without it, the **128-head GQA** attention suffers extreme latency on the tensor-parallel "all-reduce" within each node and on the pipeline layer handoffs between nodes.
- **Metric to Watch:** `gpu_cache_usage_perc` (Target < 85% to allow for 4K context spikes).
- **Network Plugin:** Multus CNI with NVIDIA/Mellanox InfiniBand driver.
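A minimal sketch of what the Multus wiring looks like on a worker pod is shown below. The `NetworkAttachmentDefinition` name (`ib-sriov-net`) and the RDMA resource name are assumptions that must match your cluster's CNI and device-plugin configuration; only the `k8s.v1.cni.cncf.io/networks` annotation key is standard Multus.

```yaml
# Sketch: attaching an InfiniBand secondary interface to the worker pod.
# "ib-sriov-net" and the rdma/* resource name are cluster-specific assumptions.
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: ib-sriov-net
spec:
  containers:
    - name: jirack-engine
      resources:
        limits:
          rdma/hca_shared_devices_a: 1   # exposed by the RDMA shared device plugin
```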
---
## 4. Autoscaling & The "Grabko Metric"
Using **KEDA (Kubernetes Event-Driven Autoscaler)**, the cluster monitors the number of requests waiting in the serving queue for KV-cache space.
- **Scale-Up:** Triggered when `num_requests_waiting > 5`.
- **Scale-Down:** Graceful shutdown of workers once the 126-layer inference queue is clear.
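The scale-up rule above can be expressed as a KEDA `ScaledObject`. This is a hedged sketch, not a verified production config: the Prometheus address is an assumption for your monitoring stack, and scaling a `LeaderWorkerSet` requires its `scale` subresource to be enabled.

```yaml
# Sketch: KEDA ScaledObject implementing "scale up when num_requests_waiting > 5".
# serverAddress and replica bounds are illustrative assumptions.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: jirack-405b-scaler
spec:
  scaleTargetRef:
    apiVersion: leaderworkerset.x-k8s.io/v1
    kind: LeaderWorkerSet
    name: jirack-405b-flagship
  minReplicaCount: 1
  maxReplicaCount: 4
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: sum(num_requests_waiting)
        threshold: "5"
```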
---
## 5. Compliance Verification
The K8s **Liveness probe** is configured to hit the `/v1/auth` endpoint. If the model does not return the verified Grabko Signature, the pod is marked as **Unhealthy** and terminated.
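In container-spec terms, the probe described above looks roughly like the following. The `/v1/auth` path comes from this document; the serving port (8000) and the timing values are assumptions, and the long initial delay accounts for the time needed to load ~810 GB of weights.

```yaml
# Sketch: liveness probe against the /v1/auth signature endpoint.
# Port and timing values are assumptions; tune them for your weight-load time.
livenessProbe:
  httpGet:
    path: /v1/auth
    port: 8000
  initialDelaySeconds: 600   # allow time for multi-node weight loading
  periodSeconds: 30
  failureThreshold: 3
```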
**Compliance Features:**
- Prevents the execution of "de-branded" or unauthorized versions of the 405B+ Flagship.
**Note:** Commercial deployment of this script requires compliance with the **5% Royalty terms** of the JiRack Commercial License V.1.2.