# K8S ORCHESTRATION: JiRack 405B+ Ultimate Scale
**Document ID:** CMS-JR-405B-K8S-2025
**Framework:** Kubeflow / LeaderWorkerSet (LWS)
**Hardware Target:** 16-GPU Multi-node (H100/A100 Cluster)
---
## 1. The 4D Sharding Architecture
To fit the **~810GB (BF16)** weight footprint while maintaining real-time inference, the orchestration script implements **4D Parallelism**:
- **Tensor Parallelism (TP):** Shards the `MODEL_DIM` (16,384) across 8 GPUs within a node.
- **Pipeline Parallelism (PP):** Distributes the **126 layers** across 2 nodes (63 layers per node).
- **Data Parallelism (DP):** Replicates the sharded setup to handle parallel requests.
- **Sequence Parallelism (SP):** Splits the **4,096-token attention** across GPUs to avoid OOM (Out of Memory) during prefill.
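The arithmetic behind this layout can be sketched quickly. This is a minimal back-of-the-envelope check using only figures stated above (405B parameters, BF16, TP=8, PP=2); no JiRack code is assumed:

```python
# Per-GPU weight footprint under the 4D layout described above.
# Figures come from this document; helper names are illustrative.

PARAMS = 405e9          # ~405B parameters
BYTES_PER_PARAM = 2     # BF16
TP = 8                  # tensor-parallel degree (GPUs per node)
PP = 2                  # pipeline-parallel degree (nodes)

total_gb = PARAMS * BYTES_PER_PARAM / 1e9   # ~810 GB, matching the stated footprint
per_gpu_gb = total_gb / (TP * PP)           # weights only; KV cache is extra

print(f"total: {total_gb:.0f} GB, per GPU: {per_gpu_gb:.3f} GB")
```

Note that this covers weights only; the KV cache for the 4,096-token context is why Sequence Parallelism is still needed on top.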
---
## 2. Kubernetes Manifest: LeaderWorkerSet (LWS)
Using the Kubernetes **LeaderWorkerSet API**, we define a "Pod Group" where one pod acts as the scheduler (**Leader**) and others act as the compute workers.
### YAML
```yaml
# jirack-405b-deployment.yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: jirack-405b-flagship
spec:
  replicas: 1 # Number of 16-GPU clusters
  leaderWorkerTemplate:
    size: 2 # 2 nodes per cluster (16 GPUs total)
    workerTemplate:
      spec:
        containers:
          - name: jirack-engine
            image: cms-manhattan/jirack-405b:latest
            resources:
              limits:
                nvidia.com/gpu: 8
            env:
              - name: MODEL_LAYERS
                value: "126"
              - name: PIPELINE_PARALLEL_SIZE
                value: "2"
              - name: TENSOR_PARALLEL_SIZE
                value: "8"
              - name: SWA_FUSION_ENABLED
                value: "true"
              - name: PROOF_OF_AUTHORSHIP
                value: "Konstantin Vladimirovich Grabko"
```
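The env values in the manifest are mutually constrained, so a pre-flight sanity check is cheap insurance. This is a hypothetical script, not part of the JiRack image; it only restates relationships already implied by the manifest:

```python
# Cross-check the manifest's env values before deploying.
# Mirrors the document's numbers: 126 layers, TP=8, PP=2, 8 GPUs per node.

MODEL_LAYERS = 126
TENSOR_PARALLEL_SIZE = 8
PIPELINE_PARALLEL_SIZE = 2
GPUS_PER_NODE = 8
LWS_SIZE = 2  # leaderWorkerTemplate.size

assert MODEL_LAYERS % PIPELINE_PARALLEL_SIZE == 0, "layers must split evenly across stages"
assert TENSOR_PARALLEL_SIZE == GPUS_PER_NODE, "TP must span all GPUs in a node"
assert LWS_SIZE == PIPELINE_PARALLEL_SIZE, "one pod per pipeline stage"

layers_per_stage = MODEL_LAYERS // PIPELINE_PARALLEL_SIZE
print(layers_per_stage)  # 63, matching the 63-layers-per-node split above
```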
---
## 3. High-Theta RoPE & GQA Management
The orchestration layer must ensure that **InfiniBand RDMA** is correctly exposed to the pods. Without it, the **128-head GQA** attention suffers extreme all-reduce latency in the tensor-parallel collectives, and the cross-node layer handoffs stall the pipeline.
- **Metric to Watch:** `gpu_cache_usage_perc` (Target < 85% to allow for 4K context spikes).
- **Network Plugin:** Multus CNI with NVIDIA/Mellanox InfiniBand driver.
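One possible shape for exposing the InfiniBand device is a Multus `NetworkAttachmentDefinition` attached to the worker pods. The resource below is a sketch assuming the standard `host-device` CNI plugin; the names (`ib-rdma`, `ibs1`) are placeholders for the cluster's actual interface:

```yaml
# ib-rdma-net.yaml -- sketch; device name is site-specific
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: ib-rdma
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "host-device",
      "device": "ibs1"
    }
```

Pods reference it via the `k8s.v1.cni.cncf.io/networks: ib-rdma` annotation on the worker template.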
---
## 4. Autoscaling & The "Grabko Metric"
Using **KEDA (Kubernetes Event-Driven Autoscaler)**, the cluster monitors the number of requests waiting for KV-cache space.
- **Scale-Up:** Triggered when `num_requests_waiting > 5`.
- **Scale-Down:** Graceful shutdown of workers once the 126-layer inference queue is clear.
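A KEDA `ScaledObject` wired to this signal might look as follows. This is a sketch: it assumes the engine exports `num_requests_waiting` to a Prometheus instance at the placeholder address shown, and that the scale target exposes a `/scale` subresource:

```yaml
# jirack-scaler.yaml -- sketch; server address and metric export are assumptions
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: jirack-405b-scaler
spec:
  scaleTargetRef:
    name: jirack-405b-flagship
  minReplicaCount: 1
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090  # placeholder
        query: sum(num_requests_waiting)
        threshold: "5"  # scale up when more than 5 requests are waiting
```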
---
## 5. Compliance Verification
The K8s **Liveness probe** is configured to hit the `/v1/auth` endpoint. If the model does not return the verified Grabko Signature, the pod is marked as **Unhealthy** and terminated.
**Compliance Features:**
- Prevents the execution of "de-branded" or unauthorized versions of the 405B+ Flagship.
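The probe itself is a standard container `livenessProbe` fragment on `jirack-engine`. Only the `/v1/auth` path comes from this document; the port and timings below are illustrative assumptions:

```yaml
# Fragment for the jirack-engine container spec -- sketch
livenessProbe:
  httpGet:
    path: /v1/auth
    port: 8000               # assumed serving port
  initialDelaySeconds: 600   # allow time for the ~810GB weight load
  periodSeconds: 30
  failureThreshold: 3        # three failed signature checks -> pod terminated
```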
**Note:** Commercial deployment of this script requires compliance with the **5% Royalty terms** of the JiRack Commercial License V.1.2.