feat: implement Kubernetes executor for automated cluster scaling and infrastructure management
cf2697b | # AntiAtropos AWS Operations Guide | |
| Everything you need to run the AWS infrastructure for AntiAtropos without blowing up your bill. | |
| **Architecture: FastAPI on Hugging Face Spaces, EKS + AMP + AMG on AWS.** | |
| --- | |
| ## Table of Contents | |
| 1. [Replica Strategy & Caps](#1-replica-strategy--caps) | |
| 2. [Autoscaling Configuration](#2-autoscaling-configuration) | |
| 3. [Cost Guardrails](#3-cost-guardrails) | |
| 4. [Step-by-Step Deployment Walkthrough](#4-step-by-step-deployment-walkthrough) | |
| 5. [Configuring HF Spaces to Connect to AWS](#5-configuring-hf-spaces-to-connect-to-aws) | |
| 6. [Day-2 Operations](#6-day-2-operations) | |
| 7. [Teardown & Cost Recovery](#7-teardown--cost-recovery) | |
| --- | |
| ## 1. Replica Strategy & Caps | |
| ### What Runs Where | |
| | Component | Where | Scaled By | Cost Impact | | |
| |---|---|---|---| | |
| | **AntiAtropos FastAPI server** | HF Spaces | HF auto-scales | $0-5/month (HF billing) | | |
| | **Workload pods** (payments, checkout, etc.) | EKS | SRE agent via `KubernetesExecutor` | **HIGH** — this is where costs spiral | | |
| | **Prometheus Agent** | EKS (monitoring ns) | Static (1 pod) | Low | | |
| | **AMP** | AWS managed | Serverless | Pay per GB ingested | | |
| | **AMG** | AWS managed | Serverless | Pay per editor | | |
| ### Workload Pod Replicas — Where Costs Spiral | |
| The SRE agent's `SCALE_UP` action calls `KubernetesExecutor._scale_deployment()`, which patches `replicas` on real K8s Deployments. A bad agent can scale every deployment to the cap. | |
| The `ANTIATROPOS_MAX_REPLICAS` env var (set on HF Spaces) is the **global** ceiling applied to all deployments. The default in `kubernetes_executor.py` is 20 — with 5 deployments, that's **100 pods** worst case. **Set it to 6.** | |
| **Recommended caps by deployment:** | |
| | Deployment | Min | Max Replicas | Reasoning | | |
| |---|---|---|---| | |
| | `payments` (node-0, VIP) | 2 | 6 | VIP node — needs redundancy, 6 is plenty for the traffic model | | |
| | `checkout` (node-1) | 1 | 5 | Can burst but shouldn't stay high | | |
| | `catalog` (node-2) | 1 | 5 | Same | | |
| | `cart` (node-3) | 1 | 4 | Non-critical, sheddable | | |
| | `auth` (node-4) | 1 | 4 | Non-critical, sheddable | | |
| **Total worst case: 24 workload pods.** | |
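| At ~0.25 vCPU / 256MB per workload pod (nginx containers), that's ~6 vCPU and ~6GB RAM in total. A t3.medium has 2 vCPU and 4 GiB, so 2 nodes cover the memory but not the CPU: if the 0.25 vCPU figure is the pods' CPU request, plan for 3 nodes at minimum and 4 (the node group `maxSize`) once system pods and kubelet reservations are counted. | |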
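The sizing arithmetic can be checked with a short script. This is only a sketch: it assumes the 0.25 vCPU / 256MB figures are scheduler requests rather than typical usage, uses the nominal t3.medium capacity (2 vCPU, 4 GiB), and ignores system pods, so the node count it prints is a lower bound.

```python
import math

# Recommended per-deployment caps from the table above.
MAX_REPLICAS = {"payments": 6, "checkout": 5, "catalog": 5, "cart": 4, "auth": 4}

# Per-pod footprint quoted in the text, treated here as requests (assumption).
CPU_PER_POD = 0.25     # vCPU
MEM_PER_POD_MIB = 256  # MiB

# Nominal t3.medium capacity; allocatable capacity is lower in practice.
NODE_VCPU = 2
NODE_MEM_GIB = 4


def worst_case(caps):
    """Return (pods, vCPU, GiB, minimum node count) with every deployment at its cap."""
    pods = sum(caps.values())
    cpu = pods * CPU_PER_POD
    mem_gib = pods * MEM_PER_POD_MIB / 1024
    # The tighter of the two resources decides how many nodes are needed.
    nodes = max(math.ceil(cpu / NODE_VCPU), math.ceil(mem_gib / NODE_MEM_GIB))
    return pods, cpu, mem_gib, nodes


pods, cpu, mem_gib, nodes = worst_case(MAX_REPLICAS)
print(f"{pods} pods, {cpu} vCPU, {mem_gib} GiB, at least {nodes} nodes")
```

CPU is the binding resource here: memory alone would fit on 2 nodes, while 6 vCPU of requests needs 3.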
| ### How the Cap Works | |
| The `KubernetesExecutor._scale_deployment()` method reads `ANTIATROPOS_MAX_REPLICAS` from the environment and refuses to scale above it: | |
| ``` | |
| Ack: SCALE_UP for node-0 - replicas unchanged at 6 (bounds 1-6) | |
| ``` | |
| This is enforced in code (`kubernetes_executor.py` line 115): | |
| ```python | |
| desired = min(self.max_replicas, current + delta) | |
| ``` | |
| **Set `ANTIATROPOS_MAX_REPLICAS=6` on your HF Space.** | |
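A minimal sketch of the clamping behavior described above. The function names are hypothetical and this is not the actual `kubernetes_executor.py` code; the default of 20 for the maximum comes from the text, while the default of 1 for the minimum is an assumption.

```python
import os


def replica_bounds(env=None):
    """Read the replica bounds from environment variables (hypothetical helper)."""
    env = os.environ if env is None else env
    min_replicas = int(env.get("ANTIATROPOS_MIN_REPLICAS", "1"))   # assumed default
    max_replicas = int(env.get("ANTIATROPOS_MAX_REPLICAS", "20"))  # default per the text
    if min_replicas > max_replicas:
        raise ValueError("ANTIATROPOS_MIN_REPLICAS exceeds ANTIATROPOS_MAX_REPLICAS")
    return min_replicas, max_replicas


def clamp_replicas(current, delta, min_replicas, max_replicas):
    """Apply a scale step, keeping the result inside [min_replicas, max_replicas]."""
    return max(min_replicas, min(max_replicas, current + delta))


lo, hi = replica_bounds({"ANTIATROPOS_MAX_REPLICAS": "6"})
print(clamp_replicas(6, 3, lo, hi))  # already at the cap, so the count stays at 6
```

With a scale step of 3 and a cap of 6, a deployment at 4 replicas would land on 6 rather than 7.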
| --- | |
| ## 2. Autoscaling Configuration | |
| ### EKS Node Autoscaling | |
| The cluster needs to grow nodes when the agent scales workloads. Install the Cluster Autoscaler: | |
| ```bash | |
| helm repo add autoscaler https://kubernetes.github.io/autoscaler | |
| helm repo update | |
| helm install cluster-autoscaler autoscaler/cluster-autoscaler \ | |
| --namespace kube-system \ | |
| -f deploy/aws/cluster-autoscaler-values.yaml | |
| ``` | |
| **The node group `maxSize` in `eksctl-cluster.yaml` (4) is your ultimate cost ceiling.** | |
| ``` | |
| 4 nodes x $0.0416/hr (t3.medium on-demand) = $0.1664/hr = ~$120/month max | |
| ``` | |
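| With spot instances, this drops to ~$36/month max. Both figures cover node compute only; the EKS control plane (about $73/month, see Section 7) is billed on top. | |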
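The node-cost ceiling can be reproduced with a few lines of arithmetic. This sketch assumes 730 billable hours per month and a 70% spot discount (the upper end of the 60-70% range quoted in the checklist below); actual spot prices vary by region and over time.

```python
HOURS_PER_MONTH = 730  # assumption: average hours in a month


def monthly_node_cost(nodes, hourly_rate, discount=0.0):
    """Monthly compute cost for a node group, excluding the EKS control plane."""
    return nodes * hourly_rate * (1.0 - discount) * HOURS_PER_MONTH


on_demand = monthly_node_cost(4, 0.0416)
spot = monthly_node_cost(4, 0.0416, discount=0.70)  # assumed 70% spot discount
print(f"on-demand: ${on_demand:.2f}/month, spot: ${spot:.2f}/month")
```

Changing the first argument shows how directly `maxSize` drives the ceiling: each additional t3.medium adds roughly $30/month at the on-demand rate used here.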
| ### What Happens When the Agent Scales Workloads | |
| 1. Agent on HF Spaces sends `SCALE_UP` action | |
| 2. `KubernetesExecutor._scale_deployment()` patches the Deployment's `spec.replicas` via EKS API server | |
| 3. Kubernetes scheduler tries to place the new pod | |
| 4. If no node has capacity -> pod is `Pending` | |
| 5. Cluster Autoscaler sees `Pending` pods -> adds a node (within `maxSize`) | |
| 6. If `maxSize` is hit -> pod stays `Pending` (agent action succeeded but pod won't schedule) | |
| **This is why `maxSize` in the node group is your ultimate cost ceiling.** | |
| --- | |
| ## 3. Cost Guardrails | |
| ### Monthly Cost Caps by Tier | |
| | Tier | Max Nodes | Max Workload Pods | Estimated Monthly Cost | | |
| |---|---|---|---| | |
| | **Dev/Testing** | 2 | 10 (2/deployment) | ~$80 | | |
| | **Training** | 3 | 15 (3/deployment) | ~$130 | | |
| | **Benchmark Suite** | 4 | 24 (~5/deployment) | ~$160 | | |
| | **Unlimited (danger)** | inf | 100 (20/deployment) | $500+ | | |
| ### AWS Budgets — Get Alerts Before You Overspend | |
| ```bash | |
| aws budgets create-budget \ | |
| --account-id $(aws sts get-caller-identity --query Account --output text) \ | |
| --budget '{ | |
| "BudgetName": "AntiAtropos-Monthly", | |
| "BudgetLimit": {"Amount": "150", "Unit": "USD"}, | |
| "BudgetType": "COST", | |
| "TimeUnit": "MONTHLY", | |
| "CostFilters": { | |
| "TagKeyValue": ["user:Project$AntiAtropos"] | |
| }, | |
| "CostTypes": { | |
| "IncludeTax": true, | |
| "IncludeSubscription": true, | |
| "UseBlended": false | |
| } | |
| }' | |
| # Alert at 50% | |
| aws budgets create-notification \ | |
| --account-id $(aws sts get-caller-identity --query Account --output text) \ | |
| --budget-name "AntiAtropos-Monthly" \ | |
| --notification '{"NotificationType":"ACTUAL","ComparisonOperator":"GREATER_THAN","Threshold":50}' \ | |
| --subscribers '[{"SubscriptionType":"EMAIL","Address":"your-email@example.com"}]' | |
| # Alert at 80% | |
| aws budgets create-notification \ | |
| --account-id $(aws sts get-caller-identity --query Account --output text) \ | |
| --budget-name "AntiAtropos-Monthly" \ | |
| --notification '{"NotificationType":"ACTUAL","ComparisonOperator":"GREATER_THAN","Threshold":80}' \ | |
| --subscribers '[{"SubscriptionType":"EMAIL","Address":"your-email@example.com"}]' | |
| ``` | |
| ### Cost-Saving Checklist | |
| - [ ] Use **spot instances** for node groups (60-70% cheaper, OK for training) | |
| - [ ] Set `ANTIATROPOS_MAX_REPLICAS=6` on HF Spaces (not 20) to prevent agent runaway | |
| - [ ] Cap node group `maxSize` at 4 (in `eksctl-cluster.yaml`) | |
| - [ ] Set AWS Budget alert at $150/month | |
| - [ ] Scale workloads to zero between runs: `kubectl scale deployment -n prod-sre --replicas=0 --all` | |
| - [ ] Delete the cluster for multi-day breaks: `eksctl delete cluster --name antiatropos` | |
| - [ ] AMP free tier covers the first 40M samples ingested and 10GB of metric storage per month | |
| - [ ] AMG free tier is 1 editor for 30 days — cancel if not needed | |
| --- | |
| ## 4. Step-by-Step Deployment Walkthrough | |
| ### Before You Start | |
| You need: | |
| - AWS account with billing alerts enabled | |
| - AWS CLI v2 installed and configured (`aws configure`) | |
| - eksctl, kubectl, helm installed | |
| - About 20-30 minutes | |
| ### Step 1: Create the EKS Cluster (15 min) | |
| ```bash | |
| eksctl create cluster -f deploy/aws/eksctl-cluster.yaml | |
| # Verify | |
| aws eks update-kubeconfig --name antiatropos --region ap-south-1 | |
| kubectl get nodes | |
| ``` | |
| ### Step 2: Deploy Sample Workloads (1 min) | |
| ```bash | |
| kubectl apply -f deploy/aws/k8s-workloads.yaml | |
| kubectl get pods -n prod-sre | |
| ``` | |
| ### Step 3: Create AMP Workspace (1 min) | |
| ```bash | |
| aws amp create-workspace --alias antiatropos-metrics --region ap-south-1 | |
| # Note the workspace ID | |
| aws amp list-workspaces --alias antiatropos-metrics --region ap-south-1 --query 'workspaces[0].workspaceId' --output text | |
| ``` | |
| ### Step 4: Set Up IRSA (2 min) | |
| ```bash | |
| # Prometheus agent needs to write to AMP | |
| eksctl create iamserviceaccount \ | |
| --cluster antiatropos \ | |
| --namespace monitoring \ | |
| --name prometheus-sa \ | |
| --attach-policy-arn arn:aws:iam::aws:policy/AmazonPrometheusRemoteWriteAccess \ | |
| --approve | |
| ``` | |
| ### Step 5: Install Prometheus Agent (2 min) | |
| ```bash | |
| helm repo add prometheus-community https://prometheus-community.github.io/helm-charts | |
| helm repo update | |
| # Replace WORKSPACE_ID | |
| helm install prometheus-agent prometheus-community/prometheus \ | |
| --namespace monitoring --create-namespace \ | |
| -f deploy/aws/prometheus-agent-values.yaml \ | |
| --set "server.remoteWrite[0].url=https://aps-workspaces.ap-south-1.amazonaws.com/workspaces/WORKSPACE_ID/api/v1/remote_write" | |
| ``` | |
| ### Step 6: Set Up AMG (5 min) | |
| ```bash | |
| # Create IAM role for AMG | |
| aws iam create-role \ | |
| --role-name AntiAtroposGrafanaRole \ | |
| --assume-role-policy-document file://deploy/aws/grafana-trust-policy.json | |
| aws iam attach-role-policy \ | |
| --role-name AntiAtroposGrafanaRole \ | |
| --policy-arn arn:aws:iam::aws:policy/AmazonPrometheusQueryAccess | |
| # Create workspace | |
| aws grafana create-workspace \ | |
| --workspace-name antiatropos-dashboards \ | |
| --account-access-type CURRENT_ACCOUNT \ | |
| --authentication-providers AWS_SSO \ | |
| --permission-type SERVICE_MANAGED \ | |
| --workspace-data-sources PROMETHEUS \ | |
| --region ap-south-1 | |
| ``` | |
| Then in the AMG web UI: | |
| 1. Sign in with AWS SSO | |
| 2. Configuration -> Data Sources -> Add AMP workspace | |
| 3. Dashboards -> Import -> Upload JSON from `deploy/grafana/provisioning/dashboards/json/` | |
| 4. Select AMP data source when importing | |
| ### Step 7: Install Cluster Autoscaler (2 min) | |
| ```bash | |
| helm repo add autoscaler https://kubernetes.github.io/autoscaler | |
| helm repo update | |
| helm install cluster-autoscaler autoscaler/cluster-autoscaler \ | |
| --namespace kube-system \ | |
| -f deploy/aws/cluster-autoscaler-values.yaml | |
| ``` | |
| ### Step 8: Generate Kubeconfig for HF Spaces (1 min) | |
| ```bash | |
| ./deploy/aws/generate-kubeconfig.sh | |
| # Outputs: deploy/aws/kubeconfig-antiatropos.yaml | |
| ``` | |
| ### Step 9: Configure HF Spaces | |
| See [Section 5](#5-configuring-hf-spaces-to-connect-to-aws) below. | |
| --- | |
| ## 5. Configuring HF Spaces to Connect to AWS | |
| ### Secrets (HF Space Settings -> Repository secrets) | |
| | Secret | Value | | |
| |---|---| | |
| | `OPENAI_API_KEY` | Your OpenAI API key | | |
| | `KUBECONFIG_CONTENT` | Base64-encoded content of `kubeconfig-antiatropos.yaml` | | |
| To encode the kubeconfig: | |
| ```bash | |
| base64 -w 0 < deploy/aws/kubeconfig-antiatropos.yaml  # GNU coreutils; on macOS use: base64 -i deploy/aws/kubeconfig-antiatropos.yaml | |
| ``` | |
| ### Environment Variables (HF Space Settings -> Variables) | |
| | Variable | Value | | |
| |---|---| | |
| | `ANTIATROPOS_ENV_MODE` | `live` | | |
| | `ANTIATROPOS_STRICT_REAL` | `false` | | |
| | `PROMETHEUS_URL` | `https://aps-workspaces.ap-south-1.amazonaws.com/workspaces/WORKSPACE_ID` | | |
| | `KUBECONFIG` | `/app/kubeconfig.yaml` | | |
| | `ANTIATROPOS_K8S_NAMESPACE` | `prod-sre` | | |
| | `ANTIATROPOS_DEPLOYMENT_PREFIX` | `` (empty) | | |
| | `ANTIATROPOS_MIN_REPLICAS` | `1` | | |
| | `ANTIATROPOS_MAX_REPLICAS` | `6` | | |
| | `ANTIATROPOS_SCALE_STEP` | `3` | | |
| | `ANTIATROPOS_PROM_TIMEOUT_S` | `5.0` | | |
| | `ANTIATROPOS_METRIC_AGGREGATION` | `sum` | | |
| | `ANTIATROPOS_WORKLOAD_MAP` | See below | | |
| ### Workload Map Value | |
| ```json | |
| { | |
| "node-0": {"deployment": "payments", "namespace": "prod-sre"}, | |
| "node-1": {"deployment": "checkout", "namespace": "prod-sre"}, | |
| "node-2": {"deployment": "catalog", "namespace": "prod-sre"}, | |
| "node-3": {"deployment": "cart", "namespace": "prod-sre"}, | |
| "node-4": {"deployment": "auth", "namespace": "prod-sre"} | |
| } | |
| ``` | |
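A sketch of how a value like this could be parsed and used to resolve an agent node ID to a Deployment. The helper names are hypothetical and the real parsing in `KubernetesExecutor` may differ; the fallback to a default namespace when an entry omits `namespace` is an assumption.

```python
import json

# Same shape as the ANTIATROPOS_WORKLOAD_MAP value above (two entries for brevity).
RAW_MAP = """
{
  "node-0": {"deployment": "payments", "namespace": "prod-sre"},
  "node-1": {"deployment": "checkout"}
}
"""


def parse_workload_map(raw, default_namespace="prod-sre"):
    """Return {node_id: (deployment, namespace)} from the JSON env value."""
    parsed = json.loads(raw)
    if not isinstance(parsed, dict):
        raise ValueError("workload map must be a JSON object")
    result = {}
    for node_id, target in parsed.items():
        if "deployment" not in target:
            raise ValueError(f"{node_id}: missing 'deployment'")
        # Assumption: entries without a namespace use the default one.
        result[node_id] = (target["deployment"], target.get("namespace", default_namespace))
    return result


workloads = parse_workload_map(RAW_MAP)
print(workloads["node-0"])
```

Validating the value once at startup surfaces a malformed secret or a typo in a deployment key before the agent issues its first scaling action.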
| ### Entrypoint Modification | |
| Add this to `deploy/entrypoint.sh` before the uvicorn line, so the kubeconfig is decoded from the HF secret: | |
| ```bash | |
| # Decode kubeconfig from HF Spaces secret | |
| if [ -n "${KUBECONFIG_CONTENT:-}" ]; then | |
| echo "${KUBECONFIG_CONTENT}" | base64 -d > /app/kubeconfig.yaml | |
| export KUBECONFIG=/app/kubeconfig.yaml | |
| fi | |
| ``` | |
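Before uploading the secret, it can help to confirm that the encoded string round-trips. The following is a hypothetical local check, not part of the project: it strips whitespace, decodes strictly, and looks for two keys that any kubeconfig contains.

```python
import base64
import binascii


def decode_kubeconfig(encoded):
    """Decode a base64 kubeconfig string and run basic sanity checks on it."""
    compact = "".join(encoded.split())  # tolerate wrapped or padded-with-newline input
    try:
        raw = base64.b64decode(compact, validate=True)
    except binascii.Error as exc:
        raise ValueError(f"not valid base64: {exc}") from exc
    text = raw.decode("utf-8")
    # Cheap structural check; a full check would parse the YAML.
    for key in ("apiVersion", "clusters"):
        if key not in text:
            raise ValueError(f"decoded content has no '{key}' key")
    return text


sample = "apiVersion: v1\nkind: Config\nclusters: []\n"
encoded = base64.b64encode(sample.encode("utf-8")).decode("ascii")
print(decode_kubeconfig(encoded) == sample)
```

A truncated paste into the HF secret field typically shows up here as a padding error or as missing keys, which is easier to diagnose than a failed API call inside the Space.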
| ### Verifying the Connection | |
| After deploying, check from HF Spaces that the server can reach AWS: | |
| 1. Check the HF Space logs for `antiatropos_step` events | |
| 2. Look for `Ack: SCALE_UP` messages (agent is reaching EKS) | |
| 3. Look for non-zero `request_rate` / `cpu_utilization` (PrometheusClient is reaching AMP) | |
| 4. If `ANTIATROPOS_STRICT_REAL=false` (recommended), failures fall back to mock silently | |
| --- | |
| ## 6. Day-2 Operations | |
| ### Scaling Workloads Manually | |
| ```bash | |
| # Scale a specific deployment | |
| kubectl scale deployment/payments -n prod-sre --replicas=4 | |
| # Scale all workloads down | |
| kubectl scale deployment -n prod-sre --replicas=0 --all | |
| # Scale all workloads back up | |
| kubectl scale deployment payments -n prod-sre --replicas=2 | |
| kubectl scale deployment checkout -n prod-sre --replicas=1 | |
| kubectl scale deployment catalog -n prod-sre --replicas=1 | |
| kubectl scale deployment cart -n prod-sre --replicas=1 | |
| kubectl scale deployment auth -n prod-sre --replicas=1 | |
| ``` | |
| ### Pausing Everything (Without Deleting) | |
| ```bash | |
| # Scale all workloads to 0 | |
| kubectl scale deployment -n prod-sre --replicas=0 --all | |
| # Note: EKS nodes still run and cost money. | |
| # For real savings, delete the cluster (Section 7). | |
| ``` | |
| ### Monitoring Agent Behavior | |
| Watch what the SRE agent is doing in real-time: | |
| ```bash | |
| # Check current replica counts for each workload deployment | |
| kubectl get deployments -n prod-sre | |
| # List HorizontalPodAutoscalers, if any are defined | |
| kubectl get hpa -A | |
| # Check node pressure (requires metrics-server) | |
| kubectl top nodes | |
| ``` | |
| ### Checking Current Spend | |
| ```bash | |
| # Current month cost by service | |
| aws ce get-cost-and-usage \ | |
| --time-period Start=$(date +%Y-%m-01),End=$(date +%Y-%m-%d) \ | |
| --granularity MONTHLY \ | |
| --metrics BlendedCost \ | |
| --group-by Type=DIMENSION,Key=SERVICE | |
| ``` | |
| ### Regenerating Kubeconfig | |
| If the EKS cluster is recreated or credentials expire: | |
| ```bash | |
| ./deploy/aws/generate-kubeconfig.sh | |
| # Re-upload the base64-encoded content to HF Spaces secret KUBECONFIG_CONTENT | |
| ``` | |
| --- | |
| ## 7. Teardown & Cost Recovery | |
| ### Partial Teardown (Keep Cluster, Stop Workloads) | |
| ```bash | |
| kubectl scale deployment -n prod-sre --replicas=0 --all | |
| # Still paying for EKS control plane ($73/month) and idle nodes | |
| ``` | |
| ### Full Teardown (Stop All Charges) | |
| ```bash | |
| # Delete workloads | |
| kubectl delete -f deploy/aws/k8s-workloads.yaml | |
| # Delete Prometheus agent | |
| helm uninstall prometheus-agent -n monitoring | |
| kubectl delete namespace monitoring | |
| # Delete AMP workspace | |
| AMP_WS_ID=$(aws amp list-workspaces --alias antiatropos-metrics --region ap-south-1 --query 'workspaces[0].workspaceId' --output text) | |
| aws amp delete-workspace --workspace-id $AMP_WS_ID --region ap-south-1 | |
| # Delete AMG workspace | |
| AMG_WS_ID=$(aws grafana list-workspaces --region ap-south-1 --query "workspaces[?name=='antiatropos-dashboards'].id | [0]" --output text) | |
| aws grafana delete-workspace --workspace-id "$AMG_WS_ID" --region ap-south-1 | |
| # Delete IAM role for Grafana | |
| aws iam detach-role-policy --role-name AntiAtroposGrafanaRole --policy-arn arn:aws:iam::aws:policy/AmazonPrometheusQueryAccess | |
| aws iam delete-role --role-name AntiAtroposGrafanaRole | |
| # Delete the EKS cluster (10-15 min) | |
| eksctl delete cluster --name antiatropos --region ap-south-1 | |
| # Verify nothing is left | |
| aws eks list-clusters --region ap-south-1 | |
| aws amp list-workspaces --region ap-south-1 | |
| ``` | |
| Also remove the `KUBECONFIG_CONTENT` secret and reset `PROMETHEUS_URL` to `mock` in your HF Space. | |
| --- | |
| ## Quick Reference Card | |
| | Task | Command | | |
| |---|---| | |
| | Deploy AWS infra | `./deploy/aws/deploy.sh` | | |
| | Check workloads | `kubectl get pods -n prod-sre` | | |
| | Check monitoring | `kubectl get pods -n monitoring` | | |
| | Scale a workload | `kubectl scale deployment/payments -n prod-sre --replicas=N` | | |
| | Pause all workloads | `kubectl scale deployment -n prod-sre --replicas=0 --all` | | |
| | Check AMP data | `awscurl --service aps "https://aps-workspaces.ap-south-1.amazonaws.com/workspaces/WS_ID/api/v1/query?query=up" --region ap-south-1` | | |
| | Generate kubeconfig | `./deploy/aws/generate-kubeconfig.sh` | | |
| | Delete the EKS cluster (AMP, AMG, and IAM cleanup: see Section 7) | `eksctl delete cluster --name antiatropos --region ap-south-1` | |