kgrabko committed
Commit a23d512 · verified · 1 Parent(s): fa26cf7

Upload ClusterOrchestrationScript.md

Files changed (1): ClusterOrchestrationScript.md +38 -94
ClusterOrchestrationScript.md CHANGED
@@ -1,137 +1,81 @@
- # Kubernetes Manifest: JiRack 236B Deployment

- **Framework:** LeaderWorkerSet for Kubernetes
- **Model Scale:** JiRack 236B (108 Layers, 14:1 GQA Ratio)

  ---

- ## 1. JiRack 236B Kubernetes Manifest

- The **JiRack 236B** model uses a **14:1 GQA ratio** and **108 layers**. This manifest shards it across **2 nodes** using Tensor Parallelism (TP) of 8 and Pipeline Parallelism (PP) of 2.
 
  ### YAML
  ```yaml
- # jirack-236b-frontier.yaml
  apiVersion: leaderworkerset.x-k8s.io/v1
  kind: LeaderWorkerSet
  metadata:
-   name: jirack-236b-frontier
  spec:
-   replicas: 1 # Deploy as one 16-GPU logical unit
    leaderWorkerTemplate:
-     size: 2 # Sharded across 2 nodes (8 GPUs each)
      workerTemplate:
        spec:
          containers:
            - name: jirack-engine
-             image: cms-manhattan/jirack-236b:latest
              resources:
                limits:
                  nvidia.com/gpu: 8
              env:
                - name: MODEL_LAYERS
-                 value: "108"
                - name: PIPELINE_PARALLEL_SIZE
                  value: "2"
                - name: TENSOR_PARALLEL_SIZE
                  value: "8"
-               - name: MODEL_DIM
-                 value: "14336"
-               - name: GQA_RATIO
-                 value: "14"
-               - name: AUTHOR_SIG
                  value: "Konstantin Vladimirovich Grabko"
  ```

  ---

- ## 2. CI/CD Pipeline: Build and Deploy JiRack 236B
-
- This **GitHub Actions** workflow automates the Build-Verify-Deploy cycle. The pipeline ensures that any update (e.g., to SWA fusion kernels) is tested and pushed to the **236B Production Cluster**.
-
- ### YAML
- ```yaml
- # .github/workflows/jirack-deploy.yml
- name: Build and Deploy JiRack 236B
-
- on:
-   push:
-     branches: [ main ]
-
- jobs:
-   build-and-push:
-     runs-on: ubuntu-latest
-     steps:
-       - uses: actions/checkout@v4
-
-       - name: Login to DockerHub
-         uses: docker/login-action@v3
-         with:
-           username: ${{ secrets.DOCKERHUB_USERNAME }}
-           password: ${{ secrets.DOCKERHUB_TOKEN }}
-
-       - name: Build JiRack Engine
-         run: |
-           docker build -t cms-manhattan/jirack-236b:${{ github.sha }} .
-           docker tag cms-manhattan/jirack-236b:${{ github.sha }} cms-manhattan/jirack-236b:latest
-
-       - name: Push Image
-         run: |
-           docker push cms-manhattan/jirack-236b:${{ github.sha }}
-           docker push cms-manhattan/jirack-236b:latest
-
-   deploy-to-k8s:
-     needs: build-and-push
-     runs-on: self-hosted # Use a runner with access to your K8s cluster
-     steps:
-       - name: Set Kubernetes Context
-         uses: azure/k8s-set-context@v3
-         with:
-           kubeconfig: ${{ secrets.KUBE_CONFIG }}
-
-       - name: Deploy Manifest
-         run: |
-           kubectl apply -f k8s/jirack-236b-frontier.yaml
-           # 'kubectl rollout restart' does not support the LeaderWorkerSet CRD;
-           # deleting the pods forces the controller to recreate them on :latest.
-           kubectl delete pods -l leaderworkerset.sigs.k8s.io/name=jirack-236b-frontier
- ```

  ---

- ## 3. The "236B Optimization" Benchmarking
-
- After deployment, the pipeline includes a **Post-Deployment Verification Step** to confirm SWA Fusion performance and functionality.
-
- | **Test Parameter**    | **Target for JiRack 236B** | **Failure Action**   |
- |-----------------------|----------------------------|----------------------|
- | **KV Cache Latency**  | < 120 ms (TTFT)            | Automatic rollback   |
- | **Kernel Throughput** | > 28 tokens/sec            | Alert admin          |
- | **Auth Verification** | "Grabko" signature found   | Kill pod immediately |
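
For illustration, this verification step could be wired into the workflow from Section 2 as an additional job. The sketch below is an assumption-laden example rather than part of the original pipeline: the service URL, port, and `ttft_ms` metric name are hypothetical stand-ins, since the document does not specify how the targets are measured.

```yaml
# Hypothetical job appended under the 'jobs:' key of the Section 2 workflow;
# the metrics endpoint and metric name are illustrative assumptions.
  verify-deployment:
    needs: deploy-to-k8s
    runs-on: self-hosted
    steps:
      - name: Check KV cache latency target (< 120 ms TTFT)
        run: |
          ttft=$(curl -sf http://jirack-236b-frontier:8000/metrics \
                 | awk '/^ttft_ms/ {print $2}')
          echo "Measured TTFT: ${ttft} ms"
          # A non-zero exit fails the job, which can gate an automatic rollback.
          awk -v t="$ttft" 'BEGIN { exit !(t < 120) }'
```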
 
  ---

- ## 4. Storage and Weight Loading
-
- The JiRack 236B model (~470GB in BF16) requires fast storage to load the **108 layers** in under **2 minutes**. Persistent Volume Claims (PVC) backed by NVMe storage are recommended.
-
- ### YAML
- ```yaml
- # Fragment of the pod spec
- volumeMounts:
-   - name: model-weights
-     mountPath: /models/jirack-236b
- volumes:
-   - name: model-weights
-     persistentVolumeClaim:
-       claimName: jirack-weights-pvc
- ```
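
The claim referenced above might look like the following. This is a minimal sketch: the `storageClassName` and the requested size are assumptions (~600 Gi leaves headroom over the ~470 GB weight footprint), not values from the original document.

```yaml
# Hypothetical PVC backing the weights volume; storageClassName and size
# are assumptions.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jirack-weights-pvc
spec:
  accessModes:
    - ReadOnlyMany               # weights are only read by the worker pods
  storageClassName: nvme-local   # assumed NVMe-backed storage class
  resources:
    requests:
      storage: 600Gi             # ~470 GB of BF16 weights plus headroom
```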
-
- ---

- ## 5. Comparison: 236B vs. 405B+ Deployment
-
- | **Feature**      | **JiRack 236B**         | **JiRack 405B+**               |
- |------------------|-------------------------|--------------------------------|
- | **GPU Count**    | 16 (2 Nodes)            | 1,024+ (128+ Nodes)            |
- | **PP Degree**    | 2                       | 8–16                           |
- | **K8s Resource** | LeaderWorkerSet (Small) | LeaderWorkerSet (Mega-Cluster) |
- | **CI/CD Target** | Standard Production     | Multi-Region Canary            |
-
- ---
 
+ # K8S ORCHESTRATION: JiRack 405B+ Ultimate Scale

+ **Document ID:** CMS-JR-405B-K8S-2025
+ **Framework:** Kubeflow / LeaderWorkerSet (LWS)
+ **Hardware Target:** 16-GPU Multi-node (H100/A100 Cluster)

  ---

+ ## 1. The 4D Sharding Architecture
+
+ To fit the **~810GB (BF16)** weight footprint while maintaining real-time inference, the orchestration script implements **4D Parallelism**; a worked memory budget follows the list.
+
+ - **Tensor Parallelism (TP):** Shards the `MODEL_DIM` (16,384) across the 8 GPUs within a node.
+ - **Pipeline Parallelism (PP):** Distributes the **126 layers** across 2 nodes (63 layers per node).
+ - **Data Parallelism (DP):** Replicates the sharded setup to handle parallel requests.
+ - **Sequence Parallelism (SP):** Splits the **4,096-token attention** across GPUs to avoid OOM (out-of-memory) errors during prefill.
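
As a sanity check on that decomposition (simple arithmetic, not stated in the original): with TP × PP = 8 × 2 = 16 GPUs, the static weights occupy roughly 810 GB / 16 ≈ 51 GB per GPU, leaving ~29 GB of an 80 GB H100 for KV cache and activations. The fragment below ties the four degrees together; the DP and SP variable names are assumptions, as only the TP/PP variables appear in the manifest that follows.

```yaml
# Illustrative parallelism configuration; DATA_PARALLEL_SIZE and
# SEQUENCE_PARALLEL_ENABLED are assumed names, not taken from the manifest.
env:
  - name: TENSOR_PARALLEL_SIZE       # TP: MODEL_DIM 16,384 -> 2,048 per GPU
    value: "8"
  - name: PIPELINE_PARALLEL_SIZE     # PP: 126 layers -> 63 per node
    value: "2"
  - name: DATA_PARALLEL_SIZE         # DP: replicas of the 16-GPU shard group
    value: "1"
  - name: SEQUENCE_PARALLEL_ENABLED  # SP: split the 4,096-token prefill
    value: "true"
# Weight budget: 810 GB / (8 TP x 2 PP) ≈ 51 GB per GPU
```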
+
+ ---
+
+ ## 2. Kubernetes Manifest: LeaderWorkerSet (LWS)
+
+ Using the Kubernetes **LeaderWorkerSet API**, we define a "Pod Group" where one pod acts as the scheduler (**Leader**) and others act as the compute workers.

  ### YAML
  ```yaml
+ # jirack-405b-deployment.yaml
  apiVersion: leaderworkerset.x-k8s.io/v1
  kind: LeaderWorkerSet
  metadata:
+   name: jirack-405b-flagship
  spec:
+   replicas: 1 # Number of 16-GPU clusters
    leaderWorkerTemplate:
+     size: 2 # 2 nodes per cluster (16 GPUs total)
      workerTemplate:
        spec:
          containers:
            - name: jirack-engine
+             image: cms-manhattan/jirack-405b:latest
              resources:
                limits:
                  nvidia.com/gpu: 8
              env:
                - name: MODEL_LAYERS
+                 value: "126"
                - name: PIPELINE_PARALLEL_SIZE
                  value: "2"
                - name: TENSOR_PARALLEL_SIZE
                  value: "8"
+               - name: SWA_FUSION_ENABLED
+                 value: "true"
+               - name: PROOF_OF_AUTHORSHIP
                  value: "Konstantin Vladimirovich Grabko"
  ```

  ---

+ ## 3. High-Theta RoPE & GQA Management
+
+ The orchestration layer must ensure that **InfiniBand RDMA** is correctly exposed to the pods. Without it, the **128-head GQA** suffers extreme latency in all-reduce collectives and cross-node layer handoffs; a sketch of the required pod wiring follows the list.
+
+ - **Metric to Watch:** `gpu_cache_usage_perc` (target < 85% to allow for 4K-context spikes).
+ - **Network Plugin:** Multus CNI with the NVIDIA/Mellanox InfiniBand driver.
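
A minimal sketch of that wiring, assuming Multus and an RDMA device plugin are installed: the network attachment name, the `rdma/ib` resource name, and the NCCL values are placeholders that depend on the cluster's configuration.

```yaml
# Hypothetical worker-pod fragment; attachment and resource names are
# assumptions tied to the Multus / RDMA device-plugin setup.
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: ib-network   # Multus secondary IB interface
spec:
  containers:
    - name: jirack-engine
      resources:
        limits:
          nvidia.com/gpu: 8
          rdma/ib: 1                          # exposed by the RDMA device plugin
      env:
        - name: NCCL_IB_HCA                   # bind NCCL to the Mellanox HCA
          value: "mlx5"
        - name: NCCL_SOCKET_IFNAME            # keep control traffic off IB
          value: "eth0"
```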

  ---

+ ## 4. Autoscaling & The "Grabko Metric"
+
+ Using **KEDA (Kubernetes Event-Driven Autoscaler)**, the cluster monitors the number of requests waiting on the KV cache; a sketch of the scaler follows the list.
+
+ - **Scale-Up:** Triggered when `num_requests_waiting > 5`.
+ - **Scale-Down:** Graceful shutdown of workers once the 126-layer inference queue is clear.
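
A minimal KEDA `ScaledObject` matching those rules might look as follows. The Prometheus address, the metric query, and the replica bounds are assumptions, and scaling a LeaderWorkerSet this way presumes it exposes the `scale` subresource.

```yaml
# Hypothetical scaler; serverAddress, query, and replica bounds are assumptions.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: jirack-405b-scaler
spec:
  scaleTargetRef:
    apiVersion: leaderworkerset.x-k8s.io/v1
    kind: LeaderWorkerSet
    name: jirack-405b-flagship
  minReplicaCount: 1        # never drop below one 16-GPU group
  maxReplicaCount: 4
  cooldownPeriod: 600       # drain workers gracefully before scale-down
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090
        query: sum(num_requests_waiting)
        threshold: "5"      # scale up when more than 5 requests wait
```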

  ---

+ ## 5. Compliance Verification
+
+ The K8s **Liveness probe** is configured to hit the `/v1/auth` endpoint. If the model does not return the verified Grabko Signature, the pod is marked as **Unhealthy** and terminated.
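
Expressed as a probe on the `jirack-engine` container, that policy might look like this; the port and the timing values are assumptions (the long initial delay allows for multi-minute weight loading).

```yaml
# Hypothetical probe configuration; port and timings are assumptions.
livenessProbe:
  httpGet:
    path: /v1/auth          # must return the verified Grabko Signature
    port: 8000
  initialDelaySeconds: 600  # allow weight loading to finish before probing
  periodSeconds: 30
  failureThreshold: 3       # ~90 s of failures -> pod is killed and restarted
```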

+ **Compliance Features:**
+ - Prevents the execution of "de-branded" or unauthorized versions of the 405B+ Flagship.

+ **Note:** Commercial deployment of this script requires compliance with the **5% Royalty terms** of the JiRack Commercial License V.1.2.