Upload ClusterOrchestrationScript.md

ClusterOrchestrationScript.md (+38 -94)

@@ -1,137 +1,81 @@

---

## CI/CD Pipeline (GitHub Actions)

### YAML
```yaml
# .github/workflows/jirack-deploy.yml
name: Build and Deploy JiRack 236B

on:
  push:
    branches: [ main ]

jobs:
  build-and-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Login to DockerHub
        uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}

      - name: Build JiRack Engine
        run: |
          docker build -t cms-manhattan/jirack-236b:${{ github.sha }} .
          docker tag cms-manhattan/jirack-236b:${{ github.sha }} cms-manhattan/jirack-236b:latest

      - name: Push Image
        run: docker push cms-manhattan/jirack-236b:latest

  deploy-to-k8s:
    needs: build-and-push
    runs-on: self-hosted # Use a runner with access to your K8s cluster
    steps:
      - name: Set Kubernetes Context
        uses: azure/k8s-set-context@v3
        with:
          kubeconfig: ${{ secrets.KUBE_CONFIG }}

      - name: Deploy Manifest
        run: |
          kubectl apply -f k8s/jirack-236b-frontier.yaml
          kubectl rollout restart leaderworkerset/jirack-236b-frontier
```

---

## Monitoring Thresholds

| Metric | Threshold | Action |
|---|---|---|
| **KV Cache Latency** | < 120ms (TTFT) | Automatic Rollback |
| **Kernel Throughput** | > 28 tokens/sec | Alert Admin |
| **Auth Verification** | "Grabko" Signature Found | Immediate Kill Pod |

---

## Model Weight Storage

The JiRack 236B model (~470GB in BF16) requires fast storage to load the **108 layers** in under **2 minutes**. Persistent Volume Claims (PVC) backed by NVMe storage are recommended.

### YAML
```yaml
# fragment of pod spec
volumeMounts:
  - name: model-weights
    mountPath: /models/jirack-236b
volumes:
  - name: model-weights
    persistentVolumeClaim:
      claimName: jirack-weights-pvc
```
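
The `jirack-weights-pvc` claim referenced above is not defined in the document; a minimal sketch of what it could look like, assuming an NVMe-backed StorageClass is available (the class name, access mode, and requested size below are illustrative assumptions):

```yaml
# Illustrative only: a claim sized for the ~470GB BF16 weight files.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jirack-weights-pvc
spec:
  accessModes:
    - ReadOnlyMany             # assumes a backend that can serve read-only mounts to both nodes
  storageClassName: nvme-local # placeholder; substitute the cluster's NVMe-backed class
  resources:
    requests:
      storage: 600Gi           # headroom above the ~470GB weight footprint
```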

---

| Parameter | Small | Mega-Cluster |
|---|---|---|
| **GPU Count** | 16 (2 Nodes) | 1,024+ (128+ Nodes) |
| **PP Degree** | 2 | 8 - 16K |
| **K8s Resource** | LeaderWorkerSet (Small) | LeaderWorkerSet (Mega-Cluster) |
| **CI/CD Target** | Standard Production | Multi-Region Canary |

---

# K8S ORCHESTRATION: JiRack 405B+ Ultimate Scale

**Document ID:** CMS-JR-405B-K8S-2025
**Framework:** Kubeflow / LeaderWorkerSet (LWS)
**Hardware Target:** 16-GPU Multi-node (H100/A100 Cluster)

---

## 1. The 4D Sharding Architecture

To fit the **~810GB (BF16)** weight footprint while maintaining real-time inference, the orchestration script implements **4D Parallelism**:

- **Tensor Parallelism (TP):** Shards the `MODEL_DIM` (16,384) across 8 GPUs within a node.
- **Pipeline Parallelism (PP):** Distributes the **126 layers** across 2 nodes (63 layers per node).
- **Data Parallelism (DP):** Replicates the sharded setup to handle parallel requests.
- **Sequence Parallelism (SP):** Splits the **4,096-token attention** across GPUs to avoid OOM (Out of Memory) during prefill.
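
As a quick cross-check of how these four degrees compose for the 16-GPU target (a purely illustrative fragment; the `parallelism` keys below are not a real engine schema, just a compact way to show the arithmetic that the manifest in Section 2 encodes via environment variables):

```yaml
# Illustrative only: how the 4D degrees multiply out for one pod group.
parallelism:
  tensor: 8        # TP: MODEL_DIM 16,384 sharded across the 8 GPUs of a node
  pipeline: 2      # PP: 126 layers / 2 nodes = 63 layers per node
  data: 1          # DP: replicas of the whole group (spec.replicas in the manifest)
  sequence: true   # SP: splits the 4,096-token prefill across GPUs
# GPUs per group  = tensor x pipeline = 8 x 2 = 16
# Weights per GPU ~ 810GB / 16 = ~51GB (BF16), before KV-cache
```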

---

## 2. Kubernetes Manifest: LeaderWorkerSet (LWS)

Using the Kubernetes **LeaderWorkerSet API**, we define a "Pod Group" where one pod acts as the scheduler (**Leader**) and others act as the compute workers.

### YAML
```yaml
# jirack-405b-deployment.yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: jirack-405b-flagship
spec:
  replicas: 1 # Number of 16-GPU clusters
  leaderWorkerTemplate:
    size: 2 # 2 nodes per cluster (16 GPUs total)
    workerTemplate:
      spec:
        containers:
          - name: jirack-engine
            image: cms-manhattan/jirack-405b:latest
            resources:
              limits:
                nvidia.com/gpu: 8
            env:
              - name: MODEL_LAYERS
                value: "126"
              - name: PIPELINE_PARALLEL_SIZE
                value: "2"
              - name: TENSOR_PARALLEL_SIZE
                value: "8"
              - name: SWA_FUSION_ENABLED
                value: "true"
              - name: PROOF_OF_AUTHORSHIP
                value: "Konstantin Vladimirovich Grabko"
```
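
One operational detail the manifest leaves implicit: NCCL-based tensor-parallel engines generally need a `/dev/shm` much larger than the container default (64Mi). A hedged fragment that could be merged into the `workerTemplate` above; the 16Gi size is an assumption, not a value from the original script:

```yaml
# Illustrative only: enlarge /dev/shm for NCCL intra-node communication.
    workerTemplate:
      spec:
        containers:
          - name: jirack-engine
            volumeMounts:
              - name: dshm
                mountPath: /dev/shm
        volumes:
          - name: dshm
            emptyDir:
              medium: Memory
              sizeLimit: 16Gi   # assumed; size to the TP communication buffers
```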

---

## 3. High-Theta RoPE & GQA Management

The orchestration layer must ensure that **InfiniBand RDMA** is correctly exposed to the pods. Without this, the **128-head GQA** will suffer from extreme "all-reduce" latency during the layer handoffs.

- **Metric to Watch:** `gpu_cache_usage_perc` (Target < 85% to allow for 4K context spikes).
- **Network Plugin:** Multus CNI with NVIDIA/Mellanox InfiniBand driver (a sketch of the network attachment follows below).
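
A minimal sketch of how the InfiniBand interface could be attached, assuming Multus and an RDMA device plugin are already installed on the nodes. The attachment name, CNI type, IPAM range, RDMA resource name, and NCCL variable value are illustrative assumptions, not values taken from this deployment:

```yaml
# Illustrative only: secondary IB network via Multus, consumed by the engine pod.
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: ib-rdma-net                 # placeholder name
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "ipoib",
      "master": "ib0",
      "ipam": { "type": "whereabouts", "range": "192.168.100.0/24" }
    }
---
# Fragment of the workerTemplate pod spec that consumes the attachment:
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: ib-rdma-net
spec:
  containers:
    - name: jirack-engine
      resources:
        limits:
          rdma/ib: 1                # placeholder; name depends on the RDMA device plugin
      env:
        - name: NCCL_IB_HCA         # steer NCCL traffic onto the IB HCA
          value: "mlx5"
```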

---

## 4. Autoscaling & The "Grabko Metric"

Using **KEDA (Kubernetes Event-Driven Autoscaler)**, the cluster monitors the number of waiting requests in the KV-cache.

- **Scale-Up:** Triggered when `num_requests_waiting > 5`.
- **Scale-Down:** Graceful shutdown of workers once the 126-layer inference queue is clear (see the ScaledObject sketch below).
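
A minimal KEDA `ScaledObject` sketch for that trigger, assuming the engine exposes `num_requests_waiting` to a Prometheus instance reachable at the address shown, and that the LeaderWorkerSet's scale subresource can be targeted; the Prometheus URL and replica bounds are illustrative assumptions:

```yaml
# Illustrative only: scale the pod-group count on the waiting-request queue.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: jirack-405b-autoscale
spec:
  scaleTargetRef:
    apiVersion: leaderworkerset.x-k8s.io/v1
    kind: LeaderWorkerSet
    name: jirack-405b-flagship
  minReplicaCount: 1
  maxReplicaCount: 4              # each replica is a full 16-GPU pod group
  cooldownPeriod: 300             # let the inference queue drain before scale-down
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090   # placeholder address
        query: sum(num_requests_waiting)
        threshold: "5"
```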

---

## 5. Compliance Verification

The K8s **Liveness probe** is configured to hit the `/v1/auth` endpoint. If the model does not return the verified Grabko Signature, the pod is marked as **Unhealthy** and terminated.
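
A sketch of such a probe on the `jirack-engine` container. Because a plain `httpGet` probe only checks the HTTP status code, this version uses an `exec` probe so the response body can be inspected for the signature; the serving port, the availability of `curl` in the image, and the timing values are assumptions:

```yaml
# Illustrative only: fail the pod if /v1/auth does not return the expected signature.
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      - "curl -sf http://localhost:8000/v1/auth | grep -q 'Konstantin Vladimirovich Grabko'"
  initialDelaySeconds: 180   # allow time for the ~810GB of weights to load
  periodSeconds: 30
  failureThreshold: 3
```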

**Compliance Features:**
- Prevents the execution of "de-branded" or unauthorized versions of the 405B+ Flagship.

**Note:** Commercial deployment of this script requires compliance with the **5% Royalty terms** of the JiRack Commercial License V.1.2.
|