kgrabko committed · verified · Commit fa26cf7 · 1 Parent(s): 7d79bf6

Upload ClusterOrchestrationScript.md

Files changed (1): ClusterOrchestrationScript.md (+137 −84)

# Kubernetes Manifest: JiRack 236B Deployment

**Framework:** LeaderWorkerSet for Kubernetes
**Model Scale:** JiRack 236B (108 Layers, 14:1 GQA Ratio)

---

## 1. JiRack 236B Kubernetes Manifest

The **JiRack 236B** model uses a **14:1 GQA ratio** and **108 layers**. This manifest shards it across **2 nodes** using a Tensor Parallelism (TP) degree of 8 and a Pipeline Parallelism (PP) degree of 2.

### YAML
```yaml
# jirack-236b-frontier.yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: jirack-236b-frontier
spec:
  replicas: 1 # Deploy as one 16-GPU logical unit
  leaderWorkerTemplate:
    size: 2 # Sharded across 2 nodes (8 GPUs each)
    workerTemplate:
      spec:
        containers:
          - name: jirack-engine
            image: cms-manhattan/jirack-236b:latest
            resources:
              limits:
                nvidia.com/gpu: 8
            env:
              - name: MODEL_LAYERS
                value: "126"
              - name: MODEL_LAYERS
                value: "108"
              - name: PIPELINE_PARALLEL_SIZE
                value: "2"
              - name: TENSOR_PARALLEL_SIZE
                value: "8"
              - name: MODEL_DIM
                value: "14336"
              - name: GQA_RATIO
                value: "14"
              - name: AUTHOR_SIG
                value: "Konstantin Vladimirovich Grabko"
```
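As a quick sanity check on the sharding arithmetic, the manifest above implies 54 layers per pipeline stage and roughly 29 GB of weights per GPU, using the ~470 GB BF16 footprint this document quotes for JiRack 236B. A minimal sketch (not part of the deployment itself):

```python
# Sanity-check the sharding plan implied by the manifest above.
# Inputs (108 layers, PP=2, TP=8, ~470 GB BF16 weights) are taken from this document.

def shard_plan(total_layers: int, pp: int, tp: int, weight_gb: float):
    """Return (layers per pipeline stage, approximate weight per GPU in GB)."""
    if total_layers % pp != 0:
        raise ValueError("layer count must divide evenly across pipeline stages")
    layers_per_stage = total_layers // pp
    weight_per_gpu_gb = weight_gb / (pp * tp)  # each stage is further split TP-ways
    return layers_per_stage, weight_per_gpu_gb

layers, gb = shard_plan(total_layers=108, pp=2, tp=8, weight_gb=470.0)
print(f"{layers} layers per node, ~{gb:.1f} GB of weights per GPU")  # 54 layers, ~29.4 GB
```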

---

## 2. CI/CD Pipeline: Build and Deploy JiRack 236B

This **GitHub Actions** workflow automates the Build-Verify-Deploy cycle. The pipeline ensures that any update (e.g., to the SWA fusion kernels) is tested and pushed to the **236B Production Cluster**.

### YAML
```yaml
# .github/workflows/jirack-deploy.yml
name: Build and Deploy JiRack 236B

on:
  push:
    branches: [ main ]

jobs:
  build-and-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Login to DockerHub
        uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}

      - name: Build JiRack Engine
        run: |
          docker build -t cms-manhattan/jirack-236b:${{ github.sha }} .
          docker tag cms-manhattan/jirack-236b:${{ github.sha }} cms-manhattan/jirack-236b:latest

      - name: Push Image
        run: |
          # Push both the immutable SHA tag (for traceability) and latest.
          docker push cms-manhattan/jirack-236b:${{ github.sha }}
          docker push cms-manhattan/jirack-236b:latest

  deploy-to-k8s:
    needs: build-and-push
    runs-on: self-hosted # Use a runner with access to your K8s cluster
    steps:
      - name: Set Kubernetes Context
        uses: azure/k8s-set-context@v3
        with:
          kubeconfig: ${{ secrets.KUBE_CONFIG }}

      - name: Deploy Manifest
        run: |
          kubectl apply -f k8s/jirack-236b-frontier.yaml
          # LeaderWorkerSet is a custom resource, so `kubectl rollout restart`
          # does not support it; recreate the pods via the LWS name label instead.
          kubectl delete pods -l leaderworkerset.sigs.k8s.io/name=jirack-236b-frontier
```

---

## 3. The "236B Optimization" Benchmarks

After deployment, the pipeline runs a **Post-Deployment Verification Step** to confirm SWA Fusion performance and functionality.

| **Test Parameter** | **Target for JiRack 236B** | **Failure Action** |
|---|---|---|
| **KV Cache Latency** | < 120 ms (TTFT) | Automatic Rollback |
| **Kernel Throughput** | > 28 tokens/sec | Alert Admin |
| **Auth Verification** | "Grabko" Signature Found | Immediate Pod Kill |
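The table above can be read as a simple decision rule. A hedged sketch of such a gate follows; the threshold values come from the table, while the function and action names are illustrative assumptions, not a real JiRack API:

```python
# Illustrative post-deployment gate for the verification table above.
# Thresholds come from the document; names are assumptions for this sketch.

TTFT_LIMIT_MS = 120.0       # KV Cache Latency target (time to first token)
MIN_TOKENS_PER_SEC = 28.0   # Kernel Throughput target
REQUIRED_SIG = "Grabko"     # expected auth signature substring

def verification_action(ttft_ms: float, tokens_per_sec: float, signature: str) -> str:
    """Map measured values to the failure actions listed in the table."""
    if REQUIRED_SIG not in signature:
        return "kill-pod"       # Auth Verification failed -> Immediate Pod Kill
    if ttft_ms >= TTFT_LIMIT_MS:
        return "rollback"       # KV Cache Latency too high -> Automatic Rollback
    if tokens_per_sec <= MIN_TOKENS_PER_SEC:
        return "alert-admin"    # Kernel Throughput too low -> Alert Admin
    return "pass"
```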

---

## 4. Storage and Weight Loading

The JiRack 236B model (~470 GB in BF16) requires fast storage to load the **108 layers** in under **2 minutes**. Persistent Volume Claims (PVCs) backed by NVMe storage are recommended.

### YAML
```yaml
# Fragment of the worker pod spec
volumeMounts:
  - name: model-weights
    mountPath: /models/jirack-236b
volumes:
  - name: model-weights
    persistentVolumeClaim:
      claimName: jirack-weights-pvc
```
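The 2-minute load target translates directly into a storage bandwidth requirement: reading ~470 GB in 120 seconds needs roughly 3.9 GB/s of sustained aggregate throughput. A quick check:

```python
# Back-of-the-envelope storage sizing for the load-time target above.

def required_bandwidth_gb_per_s(weight_gb: float, budget_s: float) -> float:
    """Minimum sustained aggregate read throughput to meet the load budget."""
    return weight_gb / budget_s

bw = required_bandwidth_gb_per_s(weight_gb=470.0, budget_s=120.0)
print(f"need ~{bw:.1f} GB/s of sustained reads")  # ~3.9 GB/s
```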

---

## 5. Comparison: 236B vs. 405B+ Deployment

| **Feature** | **JiRack 236B** | **JiRack 405B+** |
|---|---|---|
| **GPU Count** | 16 (2 Nodes) | 1,024+ (128+ Nodes) |
| **PP Degree** | 2 | 8–16 |
| **K8s Resource** | LeaderWorkerSet (Small) | LeaderWorkerSet (Mega-Cluster) |
| **CI/CD Target** | Standard Production | Multi-Region Canary |
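The GPU counts in the table are the product of the parallelism degrees. A small sketch: the DP degree of 1 for the 236B layout follows from `replicas: 1` in the manifest, while the 405B+ factorization shown is one plausible split, assumed for illustration rather than taken from this document:

```python
# World size as the product of parallelism degrees: GPUs = TP x PP x DP.

def total_gpus(tp: int, pp: int, dp: int) -> int:
    """Total GPU count for a given tensor/pipeline/data parallel layout."""
    return tp * pp * dp

# JiRack 236B as deployed above: 2 nodes x 8 GPUs each.
assert total_gpus(tp=8, pp=2, dp=1) == 16

# One plausible (assumed) 405B+ split reaching the table's 1,024-GPU scale.
assert total_gpus(tp=8, pp=16, dp=8) == 1024
```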

---