JiRack_GPT5_405b / ClusterOrchestrationScript.md

K8S ORCHESTRATION: JiRack 405B+ Ultimate Scale

Document ID: CMS-JR-405B-K8S-2025
Framework: Kubeflow / LeaderWorkerSet (LWS)
Hardware Target: 16-GPU Multi-node (H100/A100 Cluster)


1. The 4D Sharding Architecture

To fit the ~810GB (BF16) weight footprint while maintaining real-time inference, the orchestration script implements 4D Parallelism:

  • Tensor Parallelism (TP): Shards the MODEL_DIM (16,384) across 8 GPUs within a node.
  • Pipeline Parallelism (PP): Distributes the 126 layers across 2 nodes (63 layers per node).
  • Data Parallelism (DP): Replicates the sharded setup to handle parallel requests.
  • Sequence Parallelism (SP): Splits the 4,096-token attention across GPUs to avoid OOM (Out of Memory) during prefill.
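The sharding arithmetic above can be sanity-checked with a short script. This is a minimal sketch: the constants come from this document, and the variable names are illustrative, not part of the orchestration script itself.

```python
# Sanity-check the 4D sharding arithmetic described above.
PARAMS = 405e9          # 405B+ parameters
BYTES_PER_PARAM = 2     # BF16
MODEL_DIM = 16_384
NUM_LAYERS = 126
TP = 8                  # tensor-parallel degree (GPUs per node)
PP = 2                  # pipeline-parallel degree (nodes)

total_gpus = TP * PP                          # 16 GPUs
weights_gb = PARAMS * BYTES_PER_PARAM / 1e9   # ~810 GB in BF16
weights_per_gpu_gb = weights_gb / total_gpus  # weights resident per GPU
dim_per_gpu = MODEL_DIM // TP                 # hidden dims per TP shard
layers_per_node = NUM_LAYERS // PP            # layers per pipeline stage

print(f"Total GPUs:        {total_gpus}")
print(f"Weight footprint:  {weights_gb:.0f} GB")
print(f"Weights per GPU:   {weights_per_gpu_gb:.1f} GB")
print(f"MODEL_DIM per GPU: {dim_per_gpu}")
print(f"Layers per node:   {layers_per_node}")
```

The ~50 GB of weights per H100/A100 leaves headroom on an 80 GB card for KV-cache and activations, which is why the 16-GPU minimum is not arbitrary.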

2. Kubernetes Manifest: LeaderWorkerSet (LWS)

Using the Kubernetes LeaderWorkerSet API, we define a "Pod Group" where one pod acts as the scheduler (Leader) and others act as the compute workers.

```yaml
# jirack-405b-deployment.yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: jirack-405b-flagship
spec:
  replicas: 1  # Number of 16-GPU clusters
  leaderWorkerTemplate:
    size: 2    # 2 nodes per cluster (16 GPUs total)
    workerTemplate:
      spec:
        containers:
          - name: jirack-engine
            image: cms-manhattan/jirack-405b:latest
            resources:
              limits:
                nvidia.com/gpu: 8
            env:
              - name: MODEL_LAYERS
                value: "126"
              - name: PIPELINE_PARALLEL_SIZE
                value: "2"
              - name: TENSOR_PARALLEL_SIZE
                value: "8"
              - name: SWA_FUSION_ENABLED
                value: "true"
              - name: PROOF_OF_AUTHORSHIP
                value: "Konstantin Vladimirovich Grabko"
```
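Assuming the LeaderWorkerSet controller (sigs.k8s.io/lws) is already installed on the cluster, the manifest would be applied and inspected with standard kubectl commands. The label selector below is the one the LWS controller attaches to its pod groups; treat it as an assumption about the controller version in use.

```shell
# Apply the deployment manifest above.
kubectl apply -f jirack-405b-deployment.yaml

# Each replica is one pod group: 1 leader + 1 worker, 8 GPUs each.
kubectl get leaderworkerset jirack-405b-flagship
kubectl get pods -l leaderworkerset.sigs.k8s.io/name=jirack-405b-flagship
```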

3. High-Theta RoPE & GQA Management

The orchestration layer must ensure that InfiniBand RDMA is correctly exposed to the pods. Without it, the 128-head GQA attention will suffer extreme latency on the tensor-parallel all-reduce collectives and on the pipeline layer handoffs between nodes.

  • Metric to Watch: gpu_cache_usage_perc (Target < 85% to allow for 4K context spikes).
  • Network Plugin: Multus CNI with NVIDIA/Mellanox InfiniBand driver.
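A minimal sketch of the secondary network attachment this implies, assuming Multus CNI and an InfiniBand-capable device plugin are installed. The attachment name, resource name, CNI type, and IPAM range below are illustrative assumptions, not values from this document.

```yaml
# Illustrative Multus attachment exposing an InfiniBand interface to the pods.
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: jirack-ib-net
  annotations:
    k8s.v1.cni.cncf.io/resourceName: nvidia.com/hostdev  # assumed resource name
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "host-device",
      "ipam": { "type": "whereabouts", "range": "192.168.100.0/24" }
    }
```

Pods would then reference the attachment via the `k8s.v1.cni.cncf.io/networks` annotation so that NCCL traffic bypasses the default cluster network.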

4. Autoscaling & The "Grabko Metric"

Using KEDA (Kubernetes Event-Driven Autoscaler), the cluster monitors the number of requests waiting in the scheduler queue for KV-cache capacity.

  • Scale-Up: Triggered when num_requests_waiting > 5.
  • Scale-Down: Graceful shutdown of workers once the 126-layer inference queue is clear.
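The scale-up rule above could be expressed as a KEDA `ScaledObject` roughly as follows. The Prometheus address and the exact metric query are assumptions about the metrics stack; this also presumes the LeaderWorkerSet exposes a scale subresource KEDA can drive.

```yaml
# Illustrative KEDA ScaledObject for the scale-up trigger described above.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: jirack-405b-scaler
spec:
  scaleTargetRef:
    apiVersion: leaderworkerset.x-k8s.io/v1
    kind: LeaderWorkerSet
    name: jirack-405b-flagship
  minReplicaCount: 1
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring:9090  # assumed address
        query: sum(num_requests_waiting)                  # assumed metric name
        threshold: "5"
```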

5. Compliance Verification

The K8s Liveness probe is configured to hit the /v1/auth endpoint. If the model does not return the verified Grabko Signature, the pod is marked as Unhealthy and terminated.

Compliance Features:

  • Prevents the execution of "de-branded" or unauthorized versions of the 405B+ Flagship.
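The liveness check described above could be attached to the `jirack-engine` container roughly as follows. The port and the timing values are assumptions; a 405B-class model needs a long initial delay so the probe does not kill the pod while weights are still loading.

```yaml
# Illustrative liveness probe for the jirack-engine container.
livenessProbe:
  httpGet:
    path: /v1/auth
    port: 8000               # assumed serving port
  initialDelaySeconds: 600   # allow time for the ~810 GB weight load
  periodSeconds: 30
  failureThreshold: 3        # pod marked Unhealthy and terminated after 3 misses
```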

Note: Commercial deployment of this script requires compliance with the 5% Royalty terms of the JiRack Commercial License V.1.2.