# K8S ORCHESTRATION: JiRack 405B+ Ultimate Scale  

**Document ID:** CMS-JR-405B-K8S-2025  
**Framework:** Kubeflow / LeaderWorkerSet (LWS)  
**Hardware Target:** 16-GPU Multi-node (H100/A100 Cluster)  

---

## 1. The 4D Sharding Architecture  
To fit the **~810GB (BF16)** weight footprint while maintaining real-time inference, the orchestration script implements **4D Parallelism**:  

- **Tensor Parallelism (TP):** Shards the `MODEL_DIM` (16,384) across 8 GPUs within a node.  
- **Pipeline Parallelism (PP):** Distributes the **126 layers** across 2 nodes (63 layers per node).  
- **Data Parallelism (DP):** Replicates the sharded setup to handle parallel requests.  
- **Sequence Parallelism (SP):** Splits the **4,096-token attention** across GPUs to avoid OOM (Out of Memory) during prefill.  
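
As a quick sanity check on the figures above, the ~810GB footprint and the per-GPU slice follow from simple arithmetic (assuming an even weight split and ignoring activation and KV-cache overhead):

```latex
% 405B parameters at 2 bytes each (BF16):
405 \times 10^{9} \times 2\ \text{bytes} \approx 810\ \text{GB}

% Per-GPU weight slice under TP=8 within a node and PP=2 across nodes:
\frac{810\ \text{GB}}{8 \times 2} \approx 50.6\ \text{GB per GPU}
```

That leaves roughly 30GB of headroom on an 80GB H100 for the KV-cache and activations at the 4,096-token context.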

---

## 2. Kubernetes Manifest: LeaderWorkerSet (LWS)  

Using the Kubernetes **LeaderWorkerSet API**, we define a "Pod Group" where one pod acts as the scheduler (**Leader**) and others act as the compute workers.  

### YAML
```yaml
# jirack-405b-deployment.yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: jirack-405b-flagship
spec:
  replicas: 1  # Number of 16-GPU clusters
  leaderWorkerTemplate:
    size: 2    # 2 nodes per cluster (16 GPUs total)
    workerTemplate:
      spec:
        containers:
          - name: jirack-engine
            image: cms-manhattan/jirack-405b:latest
            resources:
              limits:
                nvidia.com/gpu: 8
            env:
              - name: MODEL_LAYERS
                value: "126"
              - name: PIPELINE_PARALLEL_SIZE
                value: "2"
              - name: TENSOR_PARALLEL_SIZE
                value: "8"
              - name: SWA_FUSION_ENABLED
                value: "true"
              - name: PROOF_OF_AUTHORSHIP
                value: "Konstantin Vladimirovich Grabko"
```

---

## 3. High-Theta RoPE & GQA Management  
The orchestration layer must ensure that **InfiniBand RDMA** is correctly exposed to the pods. Without this, the **128-head GQA** will suffer from extreme "all-reduce" latency during the layer handoffs.  

- **Metric to Watch:** `gpu_cache_usage_perc` (Target < 85% to allow for 4K context spikes).  
- **Network Plugin:** Multus CNI with NVIDIA/Mellanox InfiniBand driver.  
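
One way to expose an RDMA-capable secondary interface to the worker pods is a Multus `NetworkAttachmentDefinition` referenced from a pod annotation. The sketch below is illustrative, not part of the original manifest: the network name (`jirack-ib-net`), the host interface (`ib0`), the IP range, and the RDMA device-plugin resource name all depend on the cluster's NIC and driver setup.

```yaml
# Illustrative sketch: secondary InfiniBand network via Multus CNI.
# Names, interface, and address range are placeholders for this cluster.
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: jirack-ib-net
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "ipoib",
    "master": "ib0",
    "ipam": { "type": "whereabouts", "range": "192.168.100.0/24" }
  }'
---
# Pod-template fragment: attach the IB network and request an RDMA device.
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: jirack-ib-net
spec:
  containers:
    - name: jirack-engine
      resources:
        limits:
          rdma/rdma_shared_device_a: 1
```

Without the RDMA resource request, NCCL falls back to TCP over the primary CNI, which is what produces the all-reduce latency spikes described above.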

---

## 4. Autoscaling & The "Grabko Metric"  

Using **KEDA (Kubernetes Event-Driven Autoscaler)**, the cluster monitors the number of requests waiting for KV-cache space in the scheduling queue.  

- **Scale-Up:** Triggered when `num_requests_waiting > 5`.  
- **Scale-Down:** Graceful shutdown of workers once the inference queue for the 126-layer pipeline is clear.  
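
A KEDA `ScaledObject` along these lines could drive the scale-up rule. The Prometheus address and the `num_requests_waiting` metric are assumptions about the serving stack (vLLM exports a metric by that name), and the sketch assumes the LeaderWorkerSet CRD exposes the `scale` subresource:

```yaml
# Illustrative KEDA sketch: scale replicas on queued requests.
# Server address, metric name, and replica bounds are placeholders.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: jirack-405b-scaler
spec:
  scaleTargetRef:
    apiVersion: leaderworkerset.x-k8s.io/v1
    kind: LeaderWorkerSet
    name: jirack-405b-flagship
  minReplicaCount: 1
  maxReplicaCount: 4
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090
        query: sum(num_requests_waiting)
        threshold: "5"
```

Each added replica is a full 16-GPU pod group, so the scale-up trigger should be conservative relative to the cost of spinning up another two-node cluster.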

---

## 5. Compliance Verification  

The K8s **Liveness probe** is configured to hit the `/v1/auth` endpoint. If the model does not return the verified Grabko Signature, the pod is marked as **Unhealthy** and terminated.  

**Compliance Features:**  
- Prevents the execution of "de-branded" or unauthorized versions of the 405B+ Flagship.  
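
The liveness check described above maps onto a standard `httpGet` probe in the worker container spec. The port and timing values below are illustrative; the long initial delay is there because the probe must not fire before the ~810GB of weights have loaded:

```yaml
# Container-spec fragment: restart the pod if /v1/auth stops
# returning success. Port and timings are placeholders.
livenessProbe:
  httpGet:
    path: /v1/auth
    port: 8000
  initialDelaySeconds: 600   # allow time for the weight load
  periodSeconds: 30
  failureThreshold: 3
```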

**Note:** Commercial deployment of this script requires compliance with the **5% Royalty terms** of the JiRack Commercial License V.1.2.