# K8S ORCHESTRATION: JiRack 405B+ Ultimate Scale
**Document ID:** CMS-JR-405B-K8S-2025
**Framework:** Kubeflow / LeaderWorkerSet (LWS)
**Hardware Target:** 16-GPU Multi-node (H100/A100 Cluster)
---
## 1. The 4D Sharding Architecture
To fit the **~810GB (BF16)** weight footprint while maintaining real-time inference, the orchestration script implements **4D Parallelism**:
- **Tensor Parallelism (TP):** Shards the `MODEL_DIM` (16,384) across 8 GPUs within a node.
- **Pipeline Parallelism (PP):** Distributes the **126 layers** across 2 nodes (63 layers per node).
- **Data Parallelism (DP):** Replicates the sharded setup to handle parallel requests.
- **Sequence Parallelism (SP):** Splits activations along the **4,096-token** sequence dimension across GPUs to avoid out-of-memory (OOM) errors during prefill.
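The sharding arithmetic above can be sanity-checked. With TP = 8 and PP = 2, and assuming the BF16 weights shard evenly (ignoring KV-cache and activation memory, which add on top):

$$\frac{810\,\text{GB (BF16 weights)}}{\text{TP} \times \text{PP}} = \frac{810\,\text{GB}}{8 \times 2} \approx 50.6\,\text{GB per GPU}$$

On an 80 GB H100 this leaves roughly 30 GB of headroom per GPU for KV-cache and activations, which is why the cache-usage target in Section 3 matters.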
---
## 2. Kubernetes Manifest: LeaderWorkerSet (LWS)
Using the Kubernetes **LeaderWorkerSet API**, we define a "Pod Group" in which one pod acts as the scheduler (**Leader**) and the others act as compute workers.
### YAML
```yaml
# jirack-405b-deployment.yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: jirack-405b-flagship
spec:
  replicas: 1 # Number of 16-GPU clusters
  leaderWorkerTemplate:
    size: 2 # 2 nodes per cluster (16 GPUs total)
    workerTemplate:
      spec:
        containers:
          - name: jirack-engine
            image: cms-manhattan/jirack-405b:latest
            resources:
              limits:
                nvidia.com/gpu: 8
            env:
              - name: MODEL_LAYERS
                value: "126"
              - name: PIPELINE_PARALLEL_SIZE
                value: "2"
              - name: TENSOR_PARALLEL_SIZE
                value: "8"
              - name: SWA_FUSION_ENABLED
                value: "true"
              - name: PROOF_OF_AUTHORSHIP
                value: "Konstantin Vladimirovich Grabko"
```
---
## 3. High-Theta RoPE & GQA Management
The orchestration layer must ensure that **InfiniBand RDMA** is correctly exposed to the pods. Without this, the **128-head GQA** will suffer from extreme "all-reduce" latency during the layer handoffs.
- **Metric to Watch:** `gpu_cache_usage_perc` (target < 85% to allow for 4K-context spikes).
- **Network Plugin:** Multus CNI with the NVIDIA/Mellanox InfiniBand driver.
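One way the worker template might request the InfiniBand interface is sketched below. Both the network-attachment name (`ib-sriov-net`) and the RDMA device-plugin resource name (`rdma/rdma_shared_device_a`) are **hypothetical** — the actual names depend on how Multus and the RDMA device plugin are configured in the cluster:

```yaml
# Hypothetical pod-template fragment; attachment and resource names
# depend on the cluster's Multus / RDMA device-plugin configuration.
metadata:
  annotations:
    k8s.v1.cni.cncf.io/networks: ib-sriov-net   # Multus secondary network
spec:
  containers:
    - name: jirack-engine
      resources:
        limits:
          nvidia.com/gpu: 8
          rdma/rdma_shared_device_a: 1          # exposes the IB HCA to the pod
      securityContext:
        capabilities:
          add: ["IPC_LOCK"]                     # RDMA memory registration
```

Without the secondary network attachment, inter-node traffic falls back to the default pod network, which is where the all-reduce latency described above shows up.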
---
## 4. Autoscaling & The "Grabko Metric"
Using **KEDA (Kubernetes Event-Driven Autoscaler)**, the cluster monitors the number of requests waiting for KV-cache capacity.
- **Scale-Up:** Triggered when `num_requests_waiting > 5`.
- **Scale-Down:** Graceful shutdown of workers once the 126-layer inference pipeline's queue is clear.
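The scale-up rule above could be expressed as a KEDA `ScaledObject` roughly as follows. This is a sketch: it assumes the engine exports `num_requests_waiting` to a Prometheus server, and both the Prometheus address and the replica bounds are illustrative, not from the spec:

```yaml
# Hypothetical ScaledObject; Prometheus address and replica bounds are assumed.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: jirack-405b-scaler
spec:
  scaleTargetRef:
    apiVersion: leaderworkerset.x-k8s.io/v1
    kind: LeaderWorkerSet
    name: jirack-405b-flagship
  minReplicaCount: 1
  maxReplicaCount: 4        # each replica is a full 16-GPU pod group
  cooldownPeriod: 600       # drain slowly; a 16-GPU group is expensive to cycle
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090  # assumed address
        query: sum(num_requests_waiting)                  # assumed metric export
        threshold: "5"      # matches the scale-up rule above
```

Note that each scaling step adds or removes an entire 16-GPU pod group, not individual pods, since the LeaderWorkerSet replica is the unit of scale.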
---
## 5. Compliance Verification
The K8s **liveness probe** is configured to hit the `/v1/auth` endpoint. If the model does not return the verified Grabko Signature, the pod is marked **Unhealthy** and terminated.
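Wired into the worker container, the probe might look like the fragment below. Only the `/v1/auth` path comes from this document; the port and timing values are assumptions:

```yaml
# Hypothetical container fragment; port and timings are illustrative.
livenessProbe:
  httpGet:
    path: /v1/auth          # signature-verification endpoint (from spec)
    port: 8000              # assumed serving port
  initialDelaySeconds: 600  # allow time to load ~810GB of weights
  periodSeconds: 30
  failureThreshold: 3       # three failed signature checks -> pod restart
```

The long `initialDelaySeconds` matters here: killing a pod that is still sharding weights across 16 GPUs would put the group into a restart loop.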
**Compliance Features:**
- Prevents the execution of "de-branded" or unauthorized versions of the 405B+ Flagship.

**Note:** Commercial deployment of this script requires compliance with the **5% Royalty terms** of the JiRack Commercial License V.1.2.