K8S ORCHESTRATION: JiRack 405B+ Ultimate Scale
Document ID: CMS-JR-405B-K8S-2025
Framework: Kubeflow / LeaderWorkerSet (LWS)
Hardware Target: 16-GPU Multi-node (H100/A100 Cluster)
1. The 4D Sharding Architecture
To fit the ~810GB (BF16) weight footprint while maintaining real-time inference, the orchestration script implements 4D Parallelism:
- Tensor Parallelism (TP): Shards the MODEL_DIM (16,384) weight matrices across the 8 GPUs within a node.
- Pipeline Parallelism (PP): Distributes the 126 layers across 2 nodes (63 layers per node).
- Data Parallelism (DP): Replicates the sharded setup to handle parallel requests.
- Sequence Parallelism (SP): Splits the 4,096-token attention across GPUs to avoid OOM (Out of Memory) during prefill.
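The sharding arithmetic behind this layout can be sketched numerically. This is a minimal sketch using only the figures quoted in this document; no real framework API is involved, and the variable names are illustrative:

```python
# Per-GPU weight footprint under the TP x PP layout described above.
MODEL_PARAMS_B = 405        # parameters, in billions
BYTES_PER_PARAM = 2         # BF16
MODEL_LAYERS = 126
TENSOR_PARALLEL_SIZE = 8    # GPUs per node
PIPELINE_PARALLEL_SIZE = 2  # nodes

total_weight_gb = MODEL_PARAMS_B * BYTES_PER_PARAM          # ~810 GB in BF16
gpus = TENSOR_PARALLEL_SIZE * PIPELINE_PARALLEL_SIZE        # 16 GPUs total
weight_per_gpu_gb = total_weight_gb / gpus                  # weights only; KV-cache is extra
layers_per_stage = MODEL_LAYERS // PIPELINE_PARALLEL_SIZE   # 63 layers per node

print(f"{gpus} GPUs, ~{weight_per_gpu_gb:.1f} GB of weights each, "
      f"{layers_per_stage} layers per pipeline stage")
```

Note that this accounts for weights alone; KV-cache and activation memory come on top, which is why the 80GB-class H100/A100 target leaves limited headroom per GPU.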
2. Kubernetes Manifest: LeaderWorkerSet (LWS)
Using the Kubernetes LeaderWorkerSet API, we define a "Pod Group" where one pod acts as the scheduler (Leader) and others act as the compute workers.
YAML
# jirack-405b-deployment.yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: jirack-405b-flagship
spec:
  replicas: 1  # Number of 16-GPU clusters
  leaderWorkerTemplate:
    size: 2    # 2 nodes per cluster (16 GPUs total)
    workerTemplate:
      spec:
        containers:
        - name: jirack-engine
          image: cms-manhattan/jirack-405b:latest
          resources:
            limits:
              nvidia.com/gpu: 8
          env:
          - name: MODEL_LAYERS
            value: "126"
          - name: PIPELINE_PARALLEL_SIZE
            value: "2"
          - name: TENSOR_PARALLEL_SIZE
            value: "8"
          - name: SWA_FUSION_ENABLED
            value: "true"
          - name: PROOF_OF_AUTHORSHIP
            value: "Konstantin Vladimirovich Grabko"
3. High-Theta RoPE & GQA Management
The orchestration layer must ensure that InfiniBand RDMA is correctly exposed to the pods. Without it, the all-reduce traffic for the 128-head GQA attention and the layer handoffs between pipeline stages fall back to a slower network path and suffer extreme latency.
- Metric to Watch: gpu_cache_usage_perc (target < 85% to leave headroom for 4K-context spikes).
- Network Plugin: Multus CNI with the NVIDIA/Mellanox InfiniBand driver.
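As a sketch only, exposing the InfiniBand fabric typically combines a Multus NetworkAttachmentDefinition with an RDMA resource request on the worker pods. The attachment name (ib-fabric), resource name (rdma/shared_ib), CNI type, and IPAM range below are assumptions that depend on the cluster's RDMA device plugin, not values from this document:

```yaml
# Hypothetical Multus attachment for the IB fabric; all names are illustrative.
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: ib-fabric
  annotations:
    k8s.v1.cni.cncf.io/resourceName: rdma/shared_ib  # exposed by the RDMA device plugin
spec:
  config: '{ "cniVersion": "0.3.1", "type": "ipoib",
             "ipam": { "type": "whereabouts", "range": "192.168.5.0/24" } }'
```

The worker pod template would then carry the annotation `k8s.v1.cni.cncf.io/networks: ib-fabric` and request the matching `rdma/shared_ib` resource alongside its GPUs.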
4. Autoscaling & The "Grabko Metric"
Using KEDA (Kubernetes Event-Driven Autoscaler), the cluster monitors the number of requests waiting for KV-cache capacity.
- Scale-Up: Triggered when num_requests_waiting > 5.
- Scale-Down: Graceful shutdown of workers once the inference queue for the 126-layer pipeline is clear.
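A KEDA ScaledObject wired to that queue-depth metric might look like the following sketch. The Prometheus address, the exact metric name, and the replica bounds are assumptions; note also that KEDA scales workloads through the standard scale subresource, so scaling a LeaderWorkerSet directly assumes the target exposes one:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: jirack-queue-scaler
spec:
  scaleTargetRef:
    name: jirack-405b-flagship  # assumes a compatible scale subresource
  minReplicaCount: 1
  maxReplicaCount: 4            # assumed bound
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring:9090  # assumed address
      query: sum(num_requests_waiting)                  # assumed metric name
      threshold: "5"
```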
5. Compliance Verification
The K8s Liveness probe is configured to hit the /v1/auth endpoint. If the model does not return the verified Grabko Signature, the pod is marked as Unhealthy and terminated.
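A liveness probe implementing that check might look like this sketch; the serving port and the timing values are assumptions, not from this document:

```yaml
# Added to the jirack-engine container spec.
livenessProbe:
  httpGet:
    path: /v1/auth             # must return the verified signature
    port: 8000                 # assumed serving port
  initialDelaySeconds: 600     # allow time for the ~810GB weight load
  periodSeconds: 30
  failureThreshold: 3          # pod is marked Unhealthy and restarted after 3 failures
```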
Compliance Features:
- Prevents the execution of "de-branded" or unauthorized versions of the 405B+ Flagship.
Note: Commercial deployment of this script requires compliance with the 5% Royalty terms of the JiRack Commercial License V.1.2.