kgrabko committed · verified · Commit fa26cf7 · 1 Parent(s): 7d79bf6

Upload ClusterOrchestrationScript.md

Files changed (1): ClusterOrchestrationScript.md (+137 −84)

# Kubernetes Manifest: JiRack 236B Deployment

**Framework:** LeaderWorkerSet for Kubernetes
**Model Scale:** JiRack 236B (108 Layers, 14:1 GQA Ratio)

---

## 1. JiRack 236B Kubernetes Manifest

The **JiRack 236B** model uses a **14:1 GQA ratio** and **108 layers**. This manifest shards it across **2 nodes** using a Tensor Parallelism (TP) degree of 8 and a Pipeline Parallelism (PP) degree of 2.

### YAML
```yaml
# jirack-236b-frontier.yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: jirack-236b-frontier
spec:
  replicas: 1 # Deploy as one 16-GPU logical unit
  leaderWorkerTemplate:
    size: 2 # Sharded across 2 nodes (8 GPUs each)
    workerTemplate:
      spec:
        containers:
          - name: jirack-engine
            image: cms-manhattan/jirack-236b:latest
            resources:
              limits:
                nvidia.com/gpu: 8
            env:
              - name: MODEL_LAYERS
                value: "126"
              - name: MODEL_LAYERS
                value: "108"
              - name: PIPELINE_PARALLEL_SIZE
                value: "2"
              - name: TENSOR_PARALLEL_SIZE
                value: "8"
              - name: MODEL_DIM
                value: "14336"
              - name: GQA_RATIO
                value: "14"
              - name: AUTHOR_SIG
                value: "Konstantin Vladimirovich Grabko"
```
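As a quick sanity check on the sharding arithmetic, the manifest above implies 54 layers per pipeline stage and roughly 29 GB of weights per GPU, using the ~470 GB BF16 footprint this document quotes for JiRack 236B. A minimal sketch (not part of the deployment itself):

```python
# Sanity-check the sharding plan implied by the manifest above.
# Inputs (108 layers, PP=2, TP=8, ~470 GB BF16 weights) are taken from this document.

def shard_plan(total_layers: int, pp: int, tp: int, weight_gb: float):
    """Return (layers per pipeline stage, approximate weight per GPU in GB)."""
    if total_layers % pp != 0:
        raise ValueError("layer count must divide evenly across pipeline stages")
    layers_per_stage = total_layers // pp
    weight_per_gpu_gb = weight_gb / (pp * tp)  # each stage is further split TP-ways
    return layers_per_stage, weight_per_gpu_gb

layers, gb = shard_plan(total_layers=108, pp=2, tp=8, weight_gb=470.0)
print(f"{layers} layers per node, ~{gb:.1f} GB of weights per GPU")  # 54 layers, ~29.4 GB
```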

---

## 2. CI/CD Pipeline: Build and Deploy JiRack 236B

This **GitHub Actions** workflow automates the Build-Verify-Deploy cycle. The pipeline ensures that any update (e.g., to the SWA fusion kernels) is tested and pushed to the **236B Production Cluster**.

### YAML
```yaml
# .github/workflows/jirack-deploy.yml
name: Build and Deploy JiRack 236B

on:
  push:
    branches: [ main ]

jobs:
  build-and-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Login to DockerHub
        uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}

      - name: Build JiRack Engine
        run: |
          docker build -t cms-manhattan/jirack-236b:${{ github.sha }} .
          docker tag cms-manhattan/jirack-236b:${{ github.sha }} cms-manhattan/jirack-236b:latest

      - name: Push Image
        run: |
          # Push both the immutable SHA tag (for traceability) and latest.
          docker push cms-manhattan/jirack-236b:${{ github.sha }}
          docker push cms-manhattan/jirack-236b:latest

  deploy-to-k8s:
    needs: build-and-push
    runs-on: self-hosted # Use a runner with access to your K8s cluster
    steps:
      - name: Set Kubernetes Context
        uses: azure/k8s-set-context@v3
        with:
          kubeconfig: ${{ secrets.KUBE_CONFIG }}

      - name: Deploy Manifest
        run: |
          kubectl apply -f k8s/jirack-236b-frontier.yaml
          # LeaderWorkerSet is a custom resource, so `kubectl rollout restart`
          # does not support it; recreate the pods via the LWS name label instead.
          kubectl delete pods -l leaderworkerset.sigs.k8s.io/name=jirack-236b-frontier
```

---

## 3. The "236B Optimization" Benchmarks

After deployment, the pipeline runs a **Post-Deployment Verification Step** to confirm SWA Fusion performance and functionality.

| **Test Parameter** | **Target for JiRack 236B** | **Failure Action** |
|---|---|---|
| **KV Cache Latency** | < 120 ms (TTFT) | Automatic Rollback |
| **Kernel Throughput** | > 28 tokens/sec | Alert Admin |
| **Auth Verification** | "Grabko" Signature Found | Immediate Pod Kill |
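The table above can be read as a simple decision rule. A hedged sketch of such a gate follows; the threshold values come from the table, while the function and action names are illustrative assumptions, not a real JiRack API:

```python
# Illustrative post-deployment gate for the verification table above.
# Thresholds come from the document; names are assumptions for this sketch.

TTFT_LIMIT_MS = 120.0       # KV Cache Latency target (time to first token)
MIN_TOKENS_PER_SEC = 28.0   # Kernel Throughput target
REQUIRED_SIG = "Grabko"     # expected auth signature substring

def verification_action(ttft_ms: float, tokens_per_sec: float, signature: str) -> str:
    """Map measured values to the failure actions listed in the table."""
    if REQUIRED_SIG not in signature:
        return "kill-pod"       # Auth Verification failed -> Immediate Pod Kill
    if ttft_ms >= TTFT_LIMIT_MS:
        return "rollback"       # KV Cache Latency too high -> Automatic Rollback
    if tokens_per_sec <= MIN_TOKENS_PER_SEC:
        return "alert-admin"    # Kernel Throughput too low -> Alert Admin
    return "pass"
```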

---

## 4. Storage and Weight Loading

The JiRack 236B model (~470 GB in BF16) requires fast storage to load the **108 layers** in under **2 minutes**. Persistent Volume Claims (PVCs) backed by NVMe storage are recommended.

### YAML
```yaml
# Fragment of the worker pod spec
volumeMounts:
  - name: model-weights
    mountPath: /models/jirack-236b
volumes:
  - name: model-weights
    persistentVolumeClaim:
      claimName: jirack-weights-pvc
```
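The 2-minute load target translates directly into a storage bandwidth requirement: reading ~470 GB in 120 seconds needs roughly 3.9 GB/s of sustained aggregate throughput. A quick check:

```python
# Back-of-the-envelope storage sizing for the load-time target above.

def required_bandwidth_gb_per_s(weight_gb: float, budget_s: float) -> float:
    """Minimum sustained aggregate read throughput to meet the load budget."""
    return weight_gb / budget_s

bw = required_bandwidth_gb_per_s(weight_gb=470.0, budget_s=120.0)
print(f"need ~{bw:.1f} GB/s of sustained reads")  # ~3.9 GB/s
```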

---

## 5. Comparison: 236B vs. 405B+ Deployment

| **Feature** | **JiRack 236B** | **JiRack 405B+** |
|---|---|---|
| **GPU Count** | 16 (2 Nodes) | 1,024+ (128+ Nodes) |
| **PP Degree** | 2 | 8–16 |
| **K8s Resource** | LeaderWorkerSet (Small) | LeaderWorkerSet (Mega-Cluster) |
| **CI/CD Target** | Standard Production | Multi-Region Canary |
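The GPU counts in the table are the product of the parallelism degrees. A small sketch: the DP degree of 1 for the 236B layout follows from `replicas: 1` in the manifest, while the 405B+ factorization shown is one plausible split, assumed for illustration rather than taken from this document:

```python
# World size as the product of parallelism degrees: GPUs = TP x PP x DP.

def total_gpus(tp: int, pp: int, dp: int) -> int:
    """Total GPU count for a given tensor/pipeline/data parallel layout."""
    return tp * pp * dp

# JiRack 236B as deployed above: 2 nodes x 8 GPUs each.
assert total_gpus(tp=8, pp=2, dp=1) == 16

# One plausible (assumed) 405B+ split reaching the table's 1,024-GPU scale.
assert total_gpus(tp=8, pp=16, dp=8) == 1024
```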

---