# AntiAtropos AWS Operations Guide

Everything you need to run the AWS infrastructure for AntiAtropos without blowing up your bill.

**Architecture: FastAPI on Hugging Face Spaces, EKS + AMP + AMG on AWS.**

---

## Table of Contents

1. [Replica Strategy & Caps](#1-replica-strategy--caps)
2. [Autoscaling Configuration](#2-autoscaling-configuration)
3. [Cost Guardrails](#3-cost-guardrails)
4. [Step-by-Step Deployment Walkthrough](#4-step-by-step-deployment-walkthrough)
5. [Configuring HF Spaces to Connect to AWS](#5-configuring-hf-spaces-to-connect-to-aws)
6. [Day-2 Operations](#6-day-2-operations)
7. [Teardown & Cost Recovery](#7-teardown--cost-recovery)

---

## 1. Replica Strategy & Caps

### What Runs Where

| Component | Where | Scaled By | Cost Impact |
|---|---|---|---|
| **AntiAtropos FastAPI server** | HF Spaces | HF auto-scales | $0-5/month (HF billing) |
| **Workload pods** (payments, checkout, etc.) | EKS | SRE agent via `KubernetesExecutor` | **HIGH**: this is where costs spiral |
| **Prometheus Agent** | EKS (monitoring ns) | Static (1 pod) | Low |
| **AMP** | AWS managed | Serverless | Pay per GB ingested |
| **AMG** | AWS managed | Serverless | Pay per editor |

### Workload Pod Replicas: Where Costs Spiral

The SRE agent's `SCALE_UP` action calls `KubernetesExecutor._scale_deployment()`, which patches `replicas` on real K8s Deployments. A bad agent can scale every deployment to the cap.

The `ANTIATROPOS_MAX_REPLICAS` env var (set on HF Spaces) is the **global** ceiling applied to all deployments. The default in `kubernetes_executor.py` is 20; with 5 deployments, that's **100 pods** worst case.
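
The effect of a global ceiling on the worst-case pod count can be sketched in a few lines. This is a minimal illustration only, not the code in `kubernetes_executor.py`: the function names `clamp_replicas` and `worst_case_pods` are hypothetical, and the lower bound of 1 is an assumption.

```python
def clamp_replicas(current: int, delta: int,
                   min_replicas: int = 1, max_replicas: int = 20) -> int:
    """Apply delta to current and keep the result within [min_replicas, max_replicas]."""
    return max(min_replicas, min(max_replicas, current + delta))


def worst_case_pods(num_deployments: int, max_replicas: int) -> int:
    """Upper bound on workload pods if every deployment is pushed to the global cap."""
    return num_deployments * max_replicas


if __name__ == "__main__":
    print(worst_case_pods(5, 20))  # 100 pods with the default cap
    print(worst_case_pods(5, 6))   # 30 pods with the cap lowered to 6
    print(clamp_replicas(current=5, delta=3, max_replicas=6))  # 6: the request is clamped
```

Lowering the cap shrinks the worst case linearly, which is why a single environment variable is such an effective guardrail.
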
**Set it to 6.**

**Recommended caps by deployment:**

| Deployment | Min | Max Replicas | Reasoning |
|---|---|---|---|
| `payments` (node-0, VIP) | 2 | 6 | VIP node: needs redundancy, 6 is plenty for the traffic model |
| `checkout` (node-1) | 1 | 5 | Can burst but shouldn't stay high |
| `catalog` (node-2) | 1 | 5 | Same |
| `cart` (node-3) | 1 | 4 | Non-critical, sheddable |
| `auth` (node-4) | 1 | 4 | Non-critical, sheddable |

**Total worst case: 24 workload pods.** Because `ANTIATROPOS_MAX_REPLICAS` is a single global cap, the per-deployment maxima above are targets rather than enforced limits; with the cap at 6, the enforced ceiling is 5 x 6 = 30 pods.

At ~0.25 vCPU / 256MB per workload pod (nginx containers), 24 pods need ~6 vCPU and ~6GB RAM. A t3.medium provides 2 vCPU and 4 GiB, so if those figures are resource requests, 2 nodes cannot hold all 24 pods. Three nodes offer 6 vCPU before system reservations, so the full worst case needs the fourth node allowed by the node group's `maxSize`.

### How the Cap Works

The `KubernetesExecutor._scale_deployment()` method reads `ANTIATROPOS_MAX_REPLICAS` from the environment and refuses to scale above it:

```
Ack: SCALE_UP for node-0 - replicas unchanged at 6 (bounds 1-6)
```

This is enforced in code (`kubernetes_executor.py` line 115):

```python
desired = min(self.max_replicas, current + delta)
```

**Set `ANTIATROPOS_MAX_REPLICAS=6` on your HF Space.**

---

## 2. Autoscaling Configuration

### EKS Node Autoscaling

The cluster needs to grow nodes when the agent scales workloads. Install the Cluster Autoscaler:

```bash
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm repo update
helm install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  -f deploy/aws/cluster-autoscaler-values.yaml
```

**The node group `maxSize` in `eksctl-cluster.yaml` (4) is your ultimate cost ceiling.**

```
4 nodes x $0.0416/hr (t3.medium on-demand) = $0.1664/hr = ~$120/month max
```

With spot instances, this drops to ~$36/month max. These figures cover nodes only; the EKS control plane adds about $73/month (see Section 7).

### What Happens When the Agent Scales Workloads

1. Agent on HF Spaces sends `SCALE_UP` action
2. `KubernetesExecutor._scale_deployment()` patches the Deployment's `spec.replicas` via the EKS API server
3. Kubernetes scheduler tries to place the new pod
4. If no node has capacity -> pod is `Pending`
5. Cluster Autoscaler sees `Pending` pods -> adds a node (within `maxSize`)
6. If `maxSize` is hit -> pod stays `Pending` (agent action succeeded but pod won't schedule)

**This is why `maxSize` in the node group is your ultimate cost ceiling.**

---

## 3. Cost Guardrails

### Monthly Cost Caps by Tier

| Tier | Max Nodes | Max Workload Pods | Estimated Monthly Cost |
|---|---|---|---|
| **Dev/Testing** | 2 | 10 (2/deployment) | ~$80 |
| **Training** | 3 | 15 (3/deployment) | ~$130 |
| **Benchmark Suite** | 4 | 24 (~5/deployment) | ~$160 |
| **Unlimited (danger)** | inf | 100 (20/deployment) | $500+ |

### AWS Budgets: Get Alerts Before You Overspend

```bash
# BudgetType is a required field of the budget definition
aws budgets create-budget \
  --account-id $(aws sts get-caller-identity --query Account --output text) \
  --budget '{
    "BudgetName": "AntiAtropos-Monthly",
    "BudgetType": "COST",
    "BudgetLimit": {"Amount": "150", "Unit": "USD"},
    "TimeUnit": "MONTHLY",
    "CostFilters": {
      "TagKeyValue": ["user:Project$AntiAtropos"]
    },
    "CostTypes": {
      "IncludeTax": true,
      "IncludeSubscription": true,
      "UseBlended": false
    }
  }'

# Alert at 50%
aws budgets create-notification \
  --account-id $(aws sts get-caller-identity --query Account --output text) \
  --budget-name "AntiAtropos-Monthly" \
  --notification '{"NotificationType":"ACTUAL","ComparisonOperator":"GREATER_THAN","Threshold":50}' \
  --subscribers '[{"SubscriptionType":"EMAIL","Address":"your-email@example.com"}]'

# Alert at 80%
aws budgets create-notification \
  --account-id $(aws sts get-caller-identity --query Account --output text) \
  --budget-name "AntiAtropos-Monthly" \
  --notification '{"NotificationType":"ACTUAL","ComparisonOperator":"GREATER_THAN","Threshold":80}' \
  --subscribers '[{"SubscriptionType":"EMAIL","Address":"your-email@example.com"}]'
```

### Cost-Saving Checklist

- [ ] Use **spot instances** for node groups (60-70% cheaper, OK for training)
- [ ] Set `ANTIATROPOS_MAX_REPLICAS=6` on HF Spaces (not 20) to prevent agent runaway
- [ ] Cap node group `maxSize` at 4 (in `eksctl-cluster.yaml`)
- [ ] Set AWS Budget alert at $150/month
- [ ] Scale workloads to zero between runs: `kubectl scale deployment -n prod-sre --replicas=0 --all`
- [ ] Delete the cluster for multi-day breaks: `eksctl delete cluster --name antiatropos --region ap-south-1`
- [ ] AMP free tier covers first 10GB ingest/month
- [ ] AMG free tier is 1 editor for 30 days; cancel if not needed

---

## 4. Step-by-Step Deployment Walkthrough

### Before You Start

You need:

- AWS account with billing alerts enabled
- AWS CLI v2 installed and configured (`aws configure`)
- eksctl, kubectl, helm installed
- About 20-30 minutes

### Step 1: Create the EKS Cluster (15 min)

```bash
eksctl create cluster -f deploy/aws/eksctl-cluster.yaml

# Verify
aws eks update-kubeconfig --name antiatropos --region ap-south-1
kubectl get nodes
```

### Step 2: Deploy Sample Workloads (1 min)

```bash
kubectl apply -f deploy/aws/k8s-workloads.yaml
kubectl get pods -n prod-sre
```

### Step 3: Create AMP Workspace (1 min)

```bash
aws amp create-workspace --alias antiatropos-metrics --region ap-south-1

# Note the workspace ID
aws amp list-workspaces --alias antiatropos-metrics --region ap-south-1 \
  --query 'workspaces[0].workspaceId' --output text
```

### Step 4: Set Up IRSA (2 min)

```bash
# Prometheus agent needs to write to AMP
eksctl create iamserviceaccount \
  --cluster antiatropos \
  --namespace monitoring \
  --name prometheus-sa \
  --attach-policy-arn arn:aws:iam::aws:policy/AmazonPrometheusRemoteWriteAccess \
  --approve
```

### Step 5: Install Prometheus Agent (2 min)

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Replace WORKSPACE_ID
# The prometheus chart configures remote write under server.remoteWrite
helm install prometheus-agent prometheus-community/prometheus \
  --namespace monitoring --create-namespace \
  -f deploy/aws/prometheus-agent-values.yaml \
  --set "server.remoteWrite[0].url=https://aps-workspaces.ap-south-1.amazonaws.com/workspaces/WORKSPACE_ID/api/v1/remote_write"
```

### Step 6: Set Up AMG (5 min)

```bash
# Create IAM role for AMG
aws iam create-role \
  --role-name AntiAtroposGrafanaRole \
  --assume-role-policy-document file://deploy/aws/grafana-trust-policy.json

aws iam attach-role-policy \
  --role-name AntiAtroposGrafanaRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonPrometheusQueryAccess

# Create workspace
aws grafana create-workspace \
  --workspace-name antiatropos-dashboards \
  --account-access-type CURRENT_ACCOUNT \
  --authentication-providers AWS_SSO \
  --permission-type SERVICE_MANAGED \
  --workspace-data-sources PROMETHEUS \
  --region ap-south-1
```

Then in the AMG web UI:

1. Sign in with AWS SSO
2. Configuration -> Data Sources -> Add AMP workspace
3. Dashboards -> Import -> Upload JSON from `deploy/grafana/provisioning/dashboards/json/`
4. Select AMP data source when importing

### Step 7: Install Cluster Autoscaler (2 min)

```bash
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm repo update
helm install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  -f deploy/aws/cluster-autoscaler-values.yaml
```

### Step 8: Generate Kubeconfig for HF Spaces (1 min)

```bash
./deploy/aws/generate-kubeconfig.sh
# Outputs: deploy/aws/kubeconfig-antiatropos.yaml
```

### Step 9: Configure HF Spaces

See [Section 5](#5-configuring-hf-spaces-to-connect-to-aws) below.

---

## 5. Configuring HF Spaces to Connect to AWS

### Secrets (HF Space Settings -> Repository secrets)

| Secret | Value |
|---|---|
| `OPENAI_API_KEY` | Your OpenAI API key |
| `KUBECONFIG_CONTENT` | Base64-encoded content of `kubeconfig-antiatropos.yaml` |

To encode the kubeconfig:

```bash
# -w 0 disables line wrapping (GNU coreutils base64)
base64 -w 0 < deploy/aws/kubeconfig-antiatropos.yaml
```

### Environment Variables (HF Space Settings -> Variables)

| Variable | Value |
|---|---|
| `ANTIATROPOS_ENV_MODE` | `live` |
| `ANTIATROPOS_STRICT_REAL` | `false` |
| `PROMETHEUS_URL` | `https://aps-workspaces.ap-south-1.amazonaws.com/workspaces/WORKSPACE_ID` |
| `KUBECONFIG` | `/app/kubeconfig.yaml` |
| `ANTIATROPOS_K8S_NAMESPACE` | `prod-sre` |
| `ANTIATROPOS_DEPLOYMENT_PREFIX` | (empty) |
| `ANTIATROPOS_MIN_REPLICAS` | `1` |
| `ANTIATROPOS_MAX_REPLICAS` | `6` |
| `ANTIATROPOS_SCALE_STEP` | `3` |
| `ANTIATROPOS_PROM_TIMEOUT_S` | `5.0` |
| `ANTIATROPOS_METRIC_AGGREGATION` | `sum` |
| `ANTIATROPOS_WORKLOAD_MAP` | See below |

### Workload Map Value

```json
{
  "node-0": {"deployment": "payments", "namespace": "prod-sre"},
  "node-1": {"deployment": "checkout", "namespace": "prod-sre"},
  "node-2": {"deployment": "catalog", "namespace": "prod-sre"},
  "node-3": {"deployment": "cart", "namespace": "prod-sre"},
  "node-4": {"deployment": "auth", "namespace": "prod-sre"}
}
```

### Entrypoint Modification

Add this to `deploy/entrypoint.sh` before the uvicorn line, so the kubeconfig is decoded from the HF secret:

```bash
# Decode kubeconfig from HF Spaces secret
if [ -n "${KUBECONFIG_CONTENT:-}" ]; then
  echo "${KUBECONFIG_CONTENT}" | base64 -d > /app/kubeconfig.yaml
  # Restrict access: the file holds cluster credentials
  chmod 600 /app/kubeconfig.yaml
  export KUBECONFIG=/app/kubeconfig.yaml
fi
```

### Verifying the Connection

After deploying, check from HF Spaces that the server can reach AWS:

1. Check the HF Space logs for `antiatropos_step` events
2. Look for `Ack: SCALE_UP` messages (agent is reaching EKS)
3. Look for non-zero `request_rate` / `cpu_utilization` (PrometheusClient is reaching AMP)
4. If `ANTIATROPOS_STRICT_REAL=false` (recommended), failures fall back to mock silently, so confirm items 2 and 3 explicitly rather than relying on the absence of errors

---

## 6. Day-2 Operations

### Scaling Workloads Manually

```bash
# Scale a specific deployment
kubectl scale deployment/payments -n prod-sre --replicas=4

# Scale all workloads down
kubectl scale deployment -n prod-sre --replicas=0 --all

# Scale all workloads back up
kubectl scale deployment payments -n prod-sre --replicas=2
kubectl scale deployment checkout -n prod-sre --replicas=1
kubectl scale deployment catalog -n prod-sre --replicas=1
kubectl scale deployment cart -n prod-sre --replicas=1
kubectl scale deployment auth -n prod-sre --replicas=1
```

### Pausing Everything (Without Deleting)

```bash
# Scale all workloads to 0
kubectl scale deployment -n prod-sre --replicas=0 --all

# Note: EKS nodes still run and cost money.
# For real savings, delete the cluster (Section 7).
```

### Monitoring Agent Behavior

Watch what the SRE agent is doing in real time:

```bash
# Check current replica counts per deployment
kubectl get deployments -n prod-sre

# List HorizontalPodAutoscalers, if any are defined
kubectl get hpa -A

# Check node pressure (requires metrics-server)
kubectl top nodes
```

### Checking Current Spend

```bash
# Current month cost by service (Start is the first day of this month)
aws ce get-cost-and-usage \
  --time-period Start=$(date +%Y-%m-01),End=$(date +%Y-%m-%d) \
  --granularity MONTHLY \
  --metrics BlendedCost \
  --group-by Type=DIMENSION,Key=SERVICE
```

### Regenerating Kubeconfig

If the EKS cluster is recreated or credentials expire:

```bash
./deploy/aws/generate-kubeconfig.sh
# Re-upload the base64-encoded content to HF Spaces secret KUBECONFIG_CONTENT
```

---

## 7. Teardown & Cost Recovery

### Partial Teardown (Keep Cluster, Stop Workloads)

```bash
kubectl scale deployment -n prod-sre --replicas=0 --all
# Still paying for EKS control plane ($73/month) and idle nodes
```

### Full Teardown (Stop All Charges)

```bash
# Delete workloads
kubectl delete -f deploy/aws/k8s-workloads.yaml

# Delete Prometheus agent
helm uninstall prometheus-agent -n monitoring
kubectl delete namespace monitoring

# Delete AMP workspace
AMP_WS_ID=$(aws amp list-workspaces --alias antiatropos-metrics --region ap-south-1 \
  --query 'workspaces[0].workspaceId' --output text)
aws amp delete-workspace --workspace-id "$AMP_WS_ID" --region ap-south-1

# Delete AMG workspace (matched by name so no other workspace is picked up)
AMG_WS_ID=$(aws grafana list-workspaces --region ap-south-1 \
  --query "workspaces[?name=='antiatropos-dashboards'].id | [0]" --output text)
aws grafana delete-workspace --workspace-id "$AMG_WS_ID" --region ap-south-1

# Delete IAM role for Grafana (only the query policy was attached in Step 6)
aws iam detach-role-policy --role-name AntiAtroposGrafanaRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonPrometheusQueryAccess
aws iam delete-role --role-name AntiAtroposGrafanaRole

# Delete the EKS cluster (10-15 min)
eksctl delete cluster --name antiatropos --region ap-south-1

# Verify nothing is left
aws eks list-clusters --region ap-south-1
aws amp list-workspaces --region ap-south-1
```

Also remove the `KUBECONFIG_CONTENT` secret and reset `PROMETHEUS_URL` to `mock` in your HF Space.

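
The difference between the partial and the full teardown can be put in rough numbers. The sketch below is an estimate only, reusing this guide's figures ($0.0416/hr per on-demand t3.medium, $73/month for the EKS control plane). The 730-hour month and the helper name `monthly_cluster_cost` are assumptions, and AMP, AMG, storage, and data transfer charges are left out.

```python
HOURS_PER_MONTH = 730  # assumption: average number of hours in a month

NODE_HOURLY_USD = 0.0416          # t3.medium on-demand rate quoted in Section 2
CONTROL_PLANE_MONTHLY_USD = 73.0  # EKS control plane figure quoted in this section


def monthly_cluster_cost(nodes: int, cluster_exists: bool = True) -> float:
    """Estimate monthly USD for control plane plus on-demand nodes (hypothetical helper)."""
    if not cluster_exists:
        # A deleted cluster incurs neither control plane nor node charges.
        return 0.0
    node_cost = nodes * NODE_HOURLY_USD * HOURS_PER_MONTH
    return round(CONTROL_PLANE_MONTHLY_USD + node_cost, 2)


if __name__ == "__main__":
    # Partial teardown: workloads at zero replicas, two idle nodes still running.
    print(monthly_cluster_cost(nodes=2))
    # Full teardown: the cluster has been deleted.
    print(monthly_cluster_cost(nodes=0, cluster_exists=False))
```

Under these assumptions, a paused two-node on-demand cluster still costs on the order of $130/month, which is why the full teardown is the better choice for multi-day breaks.
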

---

## Quick Reference Card

| Task | Command |
|---|---|
| Deploy AWS infra | `./deploy/aws/deploy.sh` |
| Check workloads | `kubectl get pods -n prod-sre` |
| Check monitoring | `kubectl get pods -n monitoring` |
| Scale a workload | `kubectl scale deployment/payments -n prod-sre --replicas=N` |
| Pause all workloads | `kubectl scale deployment -n prod-sre --replicas=0 --all` |
| Check AMP data | `awscurl --service aps "https://aps-workspaces.ap-south-1.amazonaws.com/workspaces/WS_ID/api/v1/query?query=up" --region ap-south-1` |
| Generate kubeconfig | `./deploy/aws/generate-kubeconfig.sh` |
| Nuke everything | `eksctl delete cluster --name antiatropos --region ap-south-1` |