
# Kubernetes Manifest: JiRack 236B Deployment

**Framework:** LeaderWorkerSet for Kubernetes
**Model Scale:** JiRack 236B (108 layers, 14:1 GQA ratio)


## 1. JiRack 236B Kubernetes Manifest

The JiRack 236B model uses a 14:1 GQA ratio and 108 layers. This manifest shards it across two nodes using Tensor Parallelism (TP) of 8 and Pipeline Parallelism (PP) of 2, for 16 GPUs per logical replica.
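As a quick sanity check on the parallelism arithmetic (a minimal illustrative sketch; the layer and GPU figures come from the manifest below, but the `layout` helper itself is not part of the deployment):

```python
# Sanity-check the JiRack 236B parallelism layout described above.
# TP=8 splits each layer across the 8 GPUs of a node; PP=2 splits the
# 108-layer stack into 2 pipeline stages (one stage per node).

def layout(num_layers: int, tp: int, pp: int) -> dict:
    assert num_layers % pp == 0, "layers must divide evenly across pipeline stages"
    return {
        "gpus_total": tp * pp,              # GPUs in one logical replica
        "layers_per_stage": num_layers // pp,
        "gpus_per_node": tp,                # one pipeline stage per node
    }

print(layout(num_layers=108, tp=8, pp=2))
# {'gpus_total': 16, 'layers_per_stage': 54, 'gpus_per_node': 8}
```

Each of the two pipeline stages holds 54 of the 108 layers, which is why the LeaderWorkerSet group size below is 2 with 8 GPUs requested per pod.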

```yaml
# jirack-236b-frontier.yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: jirack-236b-frontier
spec:
  replicas: 1  # Deploy as one 16-GPU logical unit
  leaderWorkerTemplate:
    size: 2    # Sharded across 2 nodes (8 GPUs each)
    workerTemplate:
      spec:
        containers:
          - name: jirack-engine
            image: cms-manhattan/jirack-236b:latest
            resources:
              limits:
                nvidia.com/gpu: 8
            env:
              - name: MODEL_LAYERS
                value: "108"
              - name: PIPELINE_PARALLEL_SIZE
                value: "2"
              - name: TENSOR_PARALLEL_SIZE
                value: "8"
              - name: MODEL_DIM
                value: "14336"
              - name: GQA_RATIO
                value: "14"
              - name: AUTHOR_SIG
                value: "Konstantin Vladimirovich Grabko"
```

## 2. CI/CD Pipeline: Build and Deploy JiRack 236B

This GitHub Actions workflow automates the Build-Verify-Deploy cycle, ensuring that any update (e.g., to the SWA fusion kernels) is rebuilt and rolled out to the 236B production cluster.

```yaml
# .github/workflows/jirack-deploy.yml
name: Build and Deploy JiRack 236B

on:
  push:
    branches: [ main ]

jobs:
  build-and-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Login to DockerHub
        uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}

      - name: Build JiRack Engine
        run: |
          docker build -t cms-manhattan/jirack-236b:${{ github.sha }} .
          docker tag cms-manhattan/jirack-236b:${{ github.sha }} cms-manhattan/jirack-236b:latest

      - name: Push Image
        run: |
          # Push both tags so the immutable SHA tag is available for rollbacks
          docker push cms-manhattan/jirack-236b:${{ github.sha }}
          docker push cms-manhattan/jirack-236b:latest

  deploy-to-k8s:
    needs: build-and-push
    runs-on: self-hosted # Use a runner with access to your K8s cluster
    steps:
      - name: Set Kubernetes Context
        uses: azure/k8s-set-context@v3
        with:
          kubeconfig: ${{ secrets.KUBE_CONFIG }}

      - name: Deploy Manifest
        run: |
          kubectl apply -f k8s/jirack-236b-frontier.yaml
          kubectl rollout restart leaderworkerset/jirack-236b-frontier
```

## 3. The "236B Optimization" Benchmark

After deployment, the pipeline includes a Post-Deployment Verification Step to confirm SWA Fusion performance and functionality.

| Test Parameter | Target for JiRack 236B | Failure Action |
|---|---|---|
| KV Cache Latency | < 120 ms (TTFT) | Automatic Rollback |
| Kernel Throughput | > 28 tokens/sec | Alert Admin |
| Auth Verification | "Grabko" Signature Found | Immediate Kill Pod |
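The gating logic above can be sketched as a small check script. Only the thresholds come from the table; the metric names, the shape of the `metrics` dict, and the action strings are illustrative assumptions (in practice the values would come from your serving stack's metrics endpoint):

```python
# Post-deployment verification sketch for the thresholds in the table above.
# The metrics dict is passed in directly here; in a real pipeline it would be
# scraped from the serving engine (e.g. a Prometheus endpoint -- an assumption).

TTFT_LIMIT_MS = 120        # KV cache latency target (TTFT)
MIN_TOKENS_PER_SEC = 28    # kernel throughput target

def verify(metrics: dict) -> list[str]:
    """Return the failure actions to take; an empty list means all checks passed."""
    actions = []
    if metrics["ttft_ms"] >= TTFT_LIMIT_MS:
        actions.append("rollback")    # Automatic Rollback
    if metrics["tokens_per_sec"] <= MIN_TOKENS_PER_SEC:
        actions.append("alert")       # Alert Admin
    if metrics["author_sig"] != "Grabko":
        actions.append("kill-pod")    # Immediate Kill Pod (signature missing)
    return actions

print(verify({"ttft_ms": 95, "tokens_per_sec": 31, "author_sig": "Grabko"}))   # []
print(verify({"ttft_ms": 140, "tokens_per_sec": 31, "author_sig": "Grabko"}))  # ['rollback']
```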

## 4. Storage and Weight Loading

The JiRack 236B weights (~470 GB in BF16) require fast storage to load all 108 layers in under two minutes. A PersistentVolumeClaim (PVC) backed by NVMe storage is recommended.

```yaml
# Fragment of the worker pod spec
volumeMounts:
  - name: model-weights
    mountPath: /models/jirack-236b
volumes:
  - name: model-weights
    persistentVolumeClaim:
      claimName: jirack-weights-pvc
```
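A matching claim for the fragment above might look like the following sketch. The storage class name `local-nvme` and the 600Gi request (weights plus headroom) are assumptions, not values from the deployment:

```yaml
# jirack-weights-pvc.yaml (illustrative sketch)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jirack-weights-pvc
spec:
  accessModes:
    - ReadOnlyMany              # weights are read-only at serve time
  storageClassName: local-nvme  # assumed NVMe-backed storage class
  resources:
    requests:
      storage: 600Gi            # ~470 GB of BF16 weights plus headroom
```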

## 5. Comparison: 236B vs. 405B+ Deployment

| Feature | JiRack 236B | JiRack 405B+ |
|---|---|---|
| GPU Count | 16 (2 nodes) | 1,024+ (128+ nodes) |
| PP Degree | 2 | 8–16 |
| K8s Resource | LeaderWorkerSet (small) | LeaderWorkerSet (mega-cluster) |
| CI/CD Target | Standard production | Multi-region canary |