	# AntiAtropos AWS Operations Guide

	Everything you need to run the AWS infrastructure for AntiAtropos without blowing up your bill.

	Architecture: FastAPI on Hugging Face Spaces, EKS + AMP + AMG on AWS.

	---

	## Table of Contents

	1. [Replica Strategy & Caps](#1-replica-strategy--caps)
	2. [Autoscaling Configuration](#2-autoscaling-configuration)
	3. [Cost Guardrails](#3-cost-guardrails)
	4. [Step-by-Step Deployment Walkthrough](#4-step-by-step-deployment-walkthrough)
	5. [Configuring HF Spaces to Connect to AWS](#5-configuring-hf-spaces-to-connect-to-aws)
	6. [Day-2 Operations](#6-day-2-operations)
	7. [Teardown & Cost Recovery](#7-teardown--cost-recovery)

	---

	## 1. Replica Strategy & Caps

	### What Runs Where

	| Component | Where | Scaled By | Cost Impact |
	|---|---|---|---|
	| AntiAtropos FastAPI server | HF Spaces | HF auto-scales | $0-5/month (HF billing) |
	| Workload pods (payments, checkout, etc.) | EKS | SRE agent via `KubernetesExecutor` | HIGH: this is where costs spiral |
	| Prometheus Agent | EKS (monitoring ns) | Static (1 pod) | Low |
	| AMP | AWS managed | Serverless | Pay per GB ingested |
	| AMG | AWS managed | Serverless | Pay per editor |

	### Workload Pod Replicas — Where Costs Spiral

	The SRE agent's `SCALE_UP` action calls `KubernetesExecutor._scale_deployment()`, which patches `replicas` on real K8s Deployments. A bad agent can scale every deployment to the cap.

	The `ANTIATROPOS_MAX_REPLICAS` env var (set on HF Spaces) is the global ceiling applied to all deployments. The default in `kubernetes_executor.py` is 20 — with 5 deployments, that's 100 pods worst case. Set it to 6.

	Recommended caps by deployment:

	| Deployment | Min Replicas | Max Replicas | Reasoning |
	|---|---|---|---|
	| `payments` (node-0, VIP) | 2 | 6 | VIP node; needs redundancy, 6 is plenty for the traffic model |
	| `checkout` (node-1) | 1 | 5 | Can burst but shouldn't stay high |
	| `catalog` (node-2) | 1 | 5 | Same |
	| `cart` (node-3) | 1 | 4 | Non-critical, sheddable |
	| `auth` (node-4) | 1 | 4 | Non-critical, sheddable |

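	Total worst case with these caps: 24 workload pods (6 + 5 + 5 + 4 + 4). Because `ANTIATROPOS_MAX_REPLICAS` is a single global ceiling, a value of 6 on its own still allows 5 x 6 = 30 pods unless the lower per-deployment caps are enforced some other way.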

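	At ~0.25 vCPU / 256MB per workload pod (nginx containers), that's ~6 vCPU and ~6GB RAM. A t3.medium has 2 vCPU and 4 GiB, so 2 nodes cover the memory but only 4 vCPU; if the 0.25 vCPU figure is a CPU request, all 24 pods need at least 3 nodes, and 4 (the node group `maxSize`) leaves room for system pods.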
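
The sizing arithmetic can be reproduced with a short script. This is a sketch only: the per-pod figures are the approximate values quoted above and are assumed to be resource requests, the t3.medium size (2 vCPU, 4 GiB) is the raw instance spec, and system pods and kubelet reservations are ignored, so the node count is a lower bound.

```python
import math

# Recommended max replicas per deployment, copied from the table above.
MAX_REPLICAS = {"payments": 6, "checkout": 5, "catalog": 5, "cart": 4, "auth": 4}

# Approximate per-pod footprint quoted in this guide (assumed to be requests).
POD_VCPU = 0.25
POD_MEM_MIB = 256

# Raw t3.medium capacity; allocatable capacity is lower in practice.
NODE_VCPU = 2
NODE_MEM_MIB = 4096


def worst_case(max_replicas: dict) -> dict:
    """Sum the caps and derive the CPU, memory, and minimum node count."""
    pods = sum(max_replicas.values())
    vcpu = pods * POD_VCPU
    mem_mib = pods * POD_MEM_MIB
    # The tighter of the two resources decides how many nodes are needed.
    min_nodes = max(math.ceil(vcpu / NODE_VCPU), math.ceil(mem_mib / NODE_MEM_MIB))
    return {"pods": pods, "vcpu": vcpu, "mem_gib": mem_mib / 1024, "min_nodes": min_nodes}


print(worst_case(MAX_REPLICAS))
```

Changing a cap in `MAX_REPLICAS` shows immediately whether the node group `maxSize` still covers the worst case.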

	### How the Cap Works

	The `KubernetesExecutor._scale_deployment()` method reads `ANTIATROPOS_MAX_REPLICAS` from the environment and refuses to scale above it:

	```
	Ack: SCALE_UP for node-0 - replicas unchanged at 6 (bounds 1-6)
	```

	This is enforced in code (`kubernetes_executor.py` line 115):
	```python
	desired = min(self.max_replicas, current + delta)
	```

	Set `ANTIATROPOS_MAX_REPLICAS=6` on your HF Space.
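
To make the clamping behaviour concrete, here is a minimal sketch. It is not the code in `kubernetes_executor.py`: the function names are hypothetical, and the lower-bound clamp is an assumption inferred from the `bounds 1-6` text in the log line above (the quoted source line only shows the upper bound).

```python
import os


def clamp_replicas(current: int, delta: int, min_replicas: int, max_replicas: int) -> int:
    """Apply delta to the current replica count and keep the result within bounds."""
    return max(min_replicas, min(max_replicas, current + delta))


def max_replicas_from_env(default: int = 20) -> int:
    """Read ANTIATROPOS_MAX_REPLICAS, falling back to the documented default of 20."""
    return int(os.environ.get("ANTIATROPOS_MAX_REPLICAS", default))


# A SCALE_UP of 3 at the cap leaves the count unchanged, as in the log line above.
print(clamp_replicas(current=6, delta=3, min_replicas=1, max_replicas=6))
```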

	---

	## 2. Autoscaling Configuration

	### EKS Node Autoscaling

	The cluster needs to grow nodes when the agent scales workloads. Install the Cluster Autoscaler:

	```bash
	helm repo add autoscaler https://kubernetes.github.io/autoscaler
	helm repo update

	helm install cluster-autoscaler autoscaler/cluster-autoscaler \
	  --namespace kube-system \
	  -f deploy/aws/cluster-autoscaler-values.yaml
	```

	The node group `maxSize` in `eksctl-cluster.yaml` (4) is your ultimate cost ceiling.

	```
	4 nodes x $0.0416/hr (t3.medium on-demand) = $0.1664/hr = ~$120/month max
	```

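	With spot instances, this drops to ~$36/month max. Both figures cover worker nodes only; the EKS control plane ($73/month, see Section 7) is billed on top.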
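
The node cost estimate is simple enough to script when comparing node counts or instance prices. The sketch below assumes 730 hours per month and treats the spot saving as a flat 70% discount, which is the upper end of the range quoted in this guide; real spot prices fluctuate.

```python
HOURS_PER_MONTH = 730  # average month length: 8760 hours / 12


def monthly_node_cost(nodes: int, hourly_rate: float, discount: float = 0.0) -> float:
    """Monthly cost in USD for a number of identical nodes, before other charges."""
    return nodes * hourly_rate * HOURS_PER_MONTH * (1 - discount)


on_demand = monthly_node_cost(4, 0.0416)
spot = monthly_node_cost(4, 0.0416, discount=0.70)
print(f"on-demand: ${on_demand:.0f}/month, spot: ${spot:.0f}/month")
```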

	### What Happens When the Agent Scales Workloads

	1. Agent on HF Spaces sends `SCALE_UP` action
	2. `KubernetesExecutor._scale_deployment()` patches the Deployment's `spec.replicas` via EKS API server
	3. Kubernetes scheduler tries to place the new pod
	4. If no node has capacity -> pod is `Pending`
	5. Cluster Autoscaler sees `Pending` pods -> adds a node (within `maxSize`)
	6. If `maxSize` is hit -> pod stays `Pending` (agent action succeeded but pod won't schedule)

	This is why `maxSize` in the node group is your ultimate cost ceiling.
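
The flow above can be approximated with a toy model. This is a sketch, not how the scheduler or Cluster Autoscaler is implemented: it assumes every node holds the same fixed number of pods (8 here, from 2 vCPU / 0.25 vCPU per pod, ignoring system pods) and that nodes are added instantly.

```python
import math


def simulate_scale_up(pods_wanted: int, pods_per_node: int,
                      current_nodes: int, max_nodes: int) -> tuple:
    """Return (nodes, running_pods, pending_pods) after the autoscaler reacts."""
    nodes_needed = math.ceil(pods_wanted / pods_per_node)
    # The autoscaler adds nodes for Pending pods but never exceeds maxSize.
    nodes = min(max(current_nodes, nodes_needed), max_nodes)
    running = min(pods_wanted, nodes * pods_per_node)
    return nodes, running, pods_wanted - running


# 24 pods fit once a third node is added; 40 pods hit maxSize and 8 stay Pending.
print(simulate_scale_up(24, pods_per_node=8, current_nodes=2, max_nodes=4))
print(simulate_scale_up(40, pods_per_node=8, current_nodes=2, max_nodes=4))
```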

	---

	## 3. Cost Guardrails

	### Monthly Cost Caps by Tier

	| Tier | Max Nodes | Max Workload Pods | Estimated Monthly Cost |
	|---|---|---|---|
	| Dev/Testing | 2 | 10 (2/deployment) | ~$80 |
	| Training | 3 | 15 (3/deployment) | ~$130 |
	| Benchmark Suite | 4 | 24 (~5/deployment) | ~$160 |
	| Unlimited (danger) | ∞ | 100 (20/deployment) | $500+ |

	### AWS Budgets — Get Alerts Before You Overspend

	```bash
	# BudgetType is a required field; COST tracks spend in USD
	aws budgets create-budget \
	  --account-id $(aws sts get-caller-identity --query Account --output text) \
	  --budget '{
	    "BudgetName": "AntiAtropos-Monthly",
	    "BudgetType": "COST",
	    "BudgetLimit": {"Amount": "150", "Unit": "USD"},
	    "TimeUnit": "MONTHLY",
	    "CostFilters": {
	      "TagKeyValue": ["user:Project$AntiAtropos"]
	    },
	    "CostTypes": {
	      "IncludeTax": true,
	      "IncludeSubscription": true,
	      "UseBlended": false
	    }
	  }'

	# Alert at 50%
	aws budgets create-notification \
	  --account-id $(aws sts get-caller-identity --query Account --output text) \
	  --budget-name "AntiAtropos-Monthly" \
	  --notification '{"NotificationType":"ACTUAL","ComparisonOperator":"GREATER_THAN","Threshold":50}' \
	  --subscribers '[{"SubscriptionType":"EMAIL","Address":"your-email@example.com"}]'

	# Alert at 80%
	aws budgets create-notification \
	  --account-id $(aws sts get-caller-identity --query Account --output text) \
	  --budget-name "AntiAtropos-Monthly" \
	  --notification '{"NotificationType":"ACTUAL","ComparisonOperator":"GREATER_THAN","Threshold":80}' \
	  --subscribers '[{"SubscriptionType":"EMAIL","Address":"your-email@example.com"}]'
	```
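
The thresholds in the notifications are percentages of the budget limit. As a quick check of what they mean in dollars, the following sketch converts them and builds the same notification payloads as the commands above; the helper names are hypothetical and nothing here calls AWS.

```python
import json


def alert_amounts(limit_usd: float, thresholds_pct: list) -> dict:
    """Map each percentage threshold to the spend in USD that triggers it."""
    return {pct: limit_usd * pct / 100 for pct in thresholds_pct}


def notification_payload(threshold_pct: int) -> str:
    """JSON for the --notification argument of `aws budgets create-notification`."""
    return json.dumps({
        "NotificationType": "ACTUAL",
        "ComparisonOperator": "GREATER_THAN",
        "Threshold": threshold_pct,
    })


for pct, usd in alert_amounts(150, [50, 80]).items():
    print(f"{pct}% alert fires above ${usd:.2f}: {notification_payload(pct)}")
```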

	### Cost-Saving Checklist

	- [ ] Use spot instances for node groups (60-70% cheaper, OK for training)
	- [ ] Set `ANTIATROPOS_MAX_REPLICAS=6` on HF Spaces (not 20) to prevent agent runaway
	- [ ] Cap node group `maxSize` at 4 (in `eksctl-cluster.yaml`)
	- [ ] Set AWS Budget alert at $150/month
	- [ ] Scale workloads to zero between runs: `kubectl scale deployment -n prod-sre --replicas=0 --all`
	- [ ] Delete the cluster for multi-day breaks: `eksctl delete cluster --name antiatropos`
	- [ ] AMP free tier covers first 10GB ingest/month
	- [ ] AMG free tier is 1 editor for 30 days — cancel if not needed

	---

	## 4. Step-by-Step Deployment Walkthrough

	### Before You Start

	You need:
	- AWS account with billing alerts enabled
	- AWS CLI v2 installed and configured (`aws configure`)
	- eksctl, kubectl, helm installed
	- About 20-30 minutes

	### Step 1: Create the EKS Cluster (15 min)

	```bash
	eksctl create cluster -f deploy/aws/eksctl-cluster.yaml

	# Verify
	aws eks update-kubeconfig --name antiatropos --region ap-south-1
	kubectl get nodes
	```

	### Step 2: Deploy Sample Workloads (1 min)

	```bash
	kubectl apply -f deploy/aws/k8s-workloads.yaml
	kubectl get pods -n prod-sre
	```

	### Step 3: Create AMP Workspace (1 min)

	```bash
	aws amp create-workspace --alias antiatropos-metrics --region ap-south-1

	# Note the workspace ID
	aws amp list-workspaces --alias antiatropos-metrics --region ap-south-1 --query 'workspaces[0].workspaceId' --output text
	```

	### Step 4: Set Up IRSA (2 min)

	```bash
	# Prometheus agent needs to write to AMP
	eksctl create iamserviceaccount \
	  --cluster antiatropos \
	  --namespace monitoring \
	  --name prometheus-sa \
	  --attach-policy-arn arn:aws:iam::aws:policy/AmazonPrometheusRemoteWriteAccess \
	  --approve
	```

	### Step 5: Install Prometheus Agent (2 min)

	```bash
	helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
	helm repo update

	# Replace WORKSPACE_ID
	# The prometheus chart reads remote write settings from server.remoteWrite
	# (prometheus.prometheusSpec.* belongs to kube-prometheus-stack and is ignored here)
	helm install prometheus-agent prometheus-community/prometheus \
	  --namespace monitoring --create-namespace \
	  -f deploy/aws/prometheus-agent-values.yaml \
	  --set "server.remoteWrite[0].url=https://aps-workspaces.ap-south-1.amazonaws.com/workspaces/WORKSPACE_ID/api/v1/remote_write"
	```

	### Step 6: Set Up AMG (5 min)

	```bash
	# Create IAM role for AMG
	aws iam create-role \
	  --role-name AntiAtroposGrafanaRole \
	  --assume-role-policy-document file://deploy/aws/grafana-trust-policy.json

	aws iam attach-role-policy \
	  --role-name AntiAtroposGrafanaRole \
	  --policy-arn arn:aws:iam::aws:policy/AmazonPrometheusQueryAccess

	# Create workspace
	aws grafana create-workspace \
	  --workspace-name antiatropos-dashboards \
	  --account-access-type CURRENT_ACCOUNT \
	  --authentication-providers AWS_SSO \
	  --permission-type SERVICE_MANAGED \
	  --workspace-data-sources PROMETHEUS \
	  --region ap-south-1
	```

	Then in the AMG web UI:
	1. Sign in with AWS SSO
	2. Configuration -> Data Sources -> Add AMP workspace
	3. Dashboards -> Import -> Upload JSON from `deploy/grafana/provisioning/dashboards/json/`
	4. Select AMP data source when importing

	### Step 7: Install Cluster Autoscaler (2 min)

	```bash
	helm repo add autoscaler https://kubernetes.github.io/autoscaler
	helm repo update

	helm install cluster-autoscaler autoscaler/cluster-autoscaler \
	  --namespace kube-system \
	  -f deploy/aws/cluster-autoscaler-values.yaml
	```

	### Step 8: Generate Kubeconfig for HF Spaces (1 min)

	```bash
	./deploy/aws/generate-kubeconfig.sh
	# Outputs: deploy/aws/kubeconfig-antiatropos.yaml
	```

	### Step 9: Configure HF Spaces

	See [Section 5](#5-configuring-hf-spaces-to-connect-to-aws) below.

	---

	## 5. Configuring HF Spaces to Connect to AWS

	### Secrets (HF Space Settings -> Repository secrets)

	| Secret | Value |
	|---|---|
	| `OPENAI_API_KEY` | Your OpenAI API key |
	| `KUBECONFIG_CONTENT` | Base64-encoded content of `kubeconfig-antiatropos.yaml` |

	To encode the kubeconfig:
	```bash
	cat deploy/aws/kubeconfig-antiatropos.yaml | base64 -w 0
	```
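
On systems where `base64` does not accept `-w 0`, the same single-line encoding can be produced in Python. This is a sketch with hypothetical helper names; the round-trip check mirrors what the entrypoint's `base64 -d` does when the Space starts.

```python
import base64


def encode_kubeconfig(raw: bytes) -> str:
    """Base64-encode kubeconfig bytes as one line, like `base64 -w 0`."""
    return base64.b64encode(raw).decode("ascii")


def decode_kubeconfig(encoded: str) -> bytes:
    """Decode the secret value; validate=True rejects stray characters."""
    return base64.b64decode(encoded, validate=True)


# Stand-in content; in practice read deploy/aws/kubeconfig-antiatropos.yaml as bytes.
sample = b"apiVersion: v1\nkind: Config\n"
secret_value = encode_kubeconfig(sample)
print(secret_value)
print(decode_kubeconfig(secret_value) == sample)
```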

	### Environment Variables (HF Space Settings -> Variables)

	| Variable | Value |
	|---|---|
	| `ANTIATROPOS_ENV_MODE` | `live` |
	| `ANTIATROPOS_STRICT_REAL` | `false` |
	| `PROMETHEUS_URL` | `https://aps-workspaces.ap-south-1.amazonaws.com/workspaces/WORKSPACE_ID` |
	| `KUBECONFIG` | `/app/kubeconfig.yaml` |
	| `ANTIATROPOS_K8S_NAMESPACE` | `prod-sre` |
	| `ANTIATROPOS_DEPLOYMENT_PREFIX` | (empty) |
	| `ANTIATROPOS_MIN_REPLICAS` | `1` |
	| `ANTIATROPOS_MAX_REPLICAS` | `6` |
	| `ANTIATROPOS_SCALE_STEP` | `3` |
	| `ANTIATROPOS_PROM_TIMEOUT_S` | `5.0` |
	| `ANTIATROPOS_METRIC_AGGREGATION` | `sum` |
	| `ANTIATROPOS_WORKLOAD_MAP` | See below |

	### Workload Map Value

	```json
	{
	  "node-0": {"deployment": "payments", "namespace": "prod-sre"},
	  "node-1": {"deployment": "checkout", "namespace": "prod-sre"},
	  "node-2": {"deployment": "catalog", "namespace": "prod-sre"},
	  "node-3": {"deployment": "cart", "namespace": "prod-sre"},
	  "node-4": {"deployment": "auth", "namespace": "prod-sre"}
	}
	```
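
A malformed map is easy to paste into the Variables form, so it can help to check the value locally first. The sketch below is a hypothetical pre-flight check, not the parser the server uses: it only verifies that the value is valid JSON and that each entry carries the two keys shown above.

```python
import json

REQUIRED_KEYS = {"deployment", "namespace"}


def parse_workload_map(raw: str) -> dict:
    """Return {node_id: (namespace, deployment)} or raise ValueError."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("workload map must be a JSON object")
    result = {}
    for node_id, target in data.items():
        if not isinstance(target, dict):
            raise ValueError(f"{node_id}: expected an object")
        missing = REQUIRED_KEYS - set(target)
        if missing:
            raise ValueError(f"{node_id}: missing keys {sorted(missing)}")
        result[node_id] = (target["namespace"], target["deployment"])
    return result


example = '{"node-0": {"deployment": "payments", "namespace": "prod-sre"}}'
print(parse_workload_map(example))
```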

	### Entrypoint Modification

	Add this to `deploy/entrypoint.sh` before the uvicorn line, so the kubeconfig is decoded from the HF secret:

	```bash
	# Decode kubeconfig from HF Spaces secret
	if [ -n "${KUBECONFIG_CONTENT:-}" ]; then
	  echo "${KUBECONFIG_CONTENT}" | base64 -d > /app/kubeconfig.yaml
	  export KUBECONFIG=/app/kubeconfig.yaml
	fi
	```

	### Verifying the Connection

	After deploying, check from HF Spaces that the server can reach AWS:

	1. Check the HF Space logs for `antiatropos_step` events
	2. Look for `Ack: SCALE_UP` messages (agent is reaching EKS)
	3. Look for non-zero `request_rate` / `cpu_utilization` (PrometheusClient is reaching AMP)
	4. With `ANTIATROPOS_STRICT_REAL=false` (recommended), failures fall back to mock silently, so the absence of errors in the log does not by itself confirm a live connection
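
When scanning Space logs by hand gets tedious, the `Ack` lines can be filtered with a small script. This sketch only recognises the one message shape shown in Section 1 (`replicas unchanged at N (bounds A-B)`); any other `Ack` wording is an unknown here and is skipped.

```python
import re

# Matches e.g. "Ack: SCALE_UP for node-0 - replicas unchanged at 6 (bounds 1-6)"
ACK_AT_BOUND = re.compile(
    r"Ack: (?P<action>\w+) for (?P<node>[\w-]+) - replicas unchanged at "
    r"(?P<replicas>\d+) \(bounds (?P<low>\d+)-(?P<high>\d+)\)"
)


def nodes_at_cap(log_lines) -> set:
    """Return node ids whose scale request was refused at the upper bound."""
    capped = set()
    for line in log_lines:
        match = ACK_AT_BOUND.search(line)
        if match and int(match["replicas"]) == int(match["high"]):
            capped.add(match["node"])
    return capped


logs = [
    "Ack: SCALE_UP for node-0 - replicas unchanged at 6 (bounds 1-6)",
    "some unrelated log line",
]
print(nodes_at_cap(logs))
```

A node that shows up here repeatedly is one the agent keeps trying to scale past `ANTIATROPOS_MAX_REPLICAS`.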

	---

	## 6. Day-2 Operations

	### Scaling Workloads Manually

	```bash
	# Scale a specific deployment
	kubectl scale deployment/payments -n prod-sre --replicas=4

	# Scale all workloads down
	kubectl scale deployment -n prod-sre --replicas=0 --all

	# Scale all workloads back up
	kubectl scale deployment payments -n prod-sre --replicas=2
	kubectl scale deployment checkout -n prod-sre --replicas=1
	kubectl scale deployment catalog -n prod-sre --replicas=1
	kubectl scale deployment cart -n prod-sre --replicas=1
	kubectl scale deployment auth -n prod-sre --replicas=1
	```

	### Pausing Everything (Without Deleting)

	```bash
	# Scale all workloads to 0
	kubectl scale deployment -n prod-sre --replicas=0 --all

	# Note: EKS nodes still run and cost money.
	# For real savings, delete the cluster (Section 7).
	```

	### Monitoring Agent Behavior

	Watch what the SRE agent is doing in real-time:

	```bash
	# Check current replica counts for each workload deployment
	kubectl get deployments -n prod-sre

	# List any HorizontalPodAutoscalers that could also change replica counts
	kubectl get hpa -A

	# Check node pressure
	kubectl top nodes
	```

	### Checking Current Spend

	```bash
	# Current month cost by service
	# (End is exclusive and must be later than Start, so on the 1st use tomorrow's date as End)
	aws ce get-cost-and-usage \
	  --time-period Start=$(date +%Y-%m-01),End=$(date +%Y-%m-%d) \
	  --granularity MONTHLY \
	  --metrics BlendedCost \
	  --group-by Type=DIMENSION,Key=SERVICE
	```
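
The month-to-date window has one edge case: on the first day of the month the start and end dates coincide. A small helper can compute the pair safely; this is a sketch with a hypothetical function name, and it assumes Cost Explorer's rule that the end date is exclusive and must fall after the start date.

```python
from datetime import date, timedelta


def month_to_date_period(today: date) -> tuple:
    """Return (start, end) ISO dates covering the current month so far."""
    start = today.replace(day=1)
    # On the 1st, push End to the 2nd so that Start is strictly before End.
    end = today if today > start else start + timedelta(days=1)
    return start.isoformat(), end.isoformat()


start, end = month_to_date_period(date.today())
print(f"--time-period Start={start},End={end}")
```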

	### Regenerating Kubeconfig

	If the EKS cluster is recreated or credentials expire:

	```bash
	./deploy/aws/generate-kubeconfig.sh
	# Re-upload the base64-encoded content to HF Spaces secret KUBECONFIG_CONTENT
	```

	---

	## 7. Teardown & Cost Recovery

	### Partial Teardown (Keep Cluster, Stop Workloads)

	```bash
	kubectl scale deployment -n prod-sre --replicas=0 --all
	# Still paying for EKS control plane ($73/month) and idle nodes
	```

	### Full Teardown (Stop All Charges)

	```bash
	# Delete workloads
	kubectl delete -f deploy/aws/k8s-workloads.yaml

	# Delete Prometheus agent
	helm uninstall prometheus-agent -n monitoring
	kubectl delete namespace monitoring

	# Delete AMP workspace
	AMP_WS_ID=$(aws amp list-workspaces --alias antiatropos-metrics --region ap-south-1 --query 'workspaces[0].workspaceId' --output text)
	aws amp delete-workspace --workspace-id $AMP_WS_ID --region ap-south-1

	# Delete AMG workspace (look it up by name instead of taking the first one listed)
	AMG_WS_ID=$(aws grafana list-workspaces --region ap-south-1 --query "workspaces[?name=='antiatropos-dashboards'].id | [0]" --output text)
	aws grafana delete-workspace --workspace-id "$AMG_WS_ID" --region ap-south-1

	# Delete IAM role for Grafana
	# (AmazonPrometheusRemoteWriteAccess was attached in Step 4 to the IRSA role, not to this one)
	aws iam detach-role-policy --role-name AntiAtroposGrafanaRole --policy-arn arn:aws:iam::aws:policy/AmazonPrometheusQueryAccess
	aws iam delete-role --role-name AntiAtroposGrafanaRole

	# Delete the EKS cluster (10-15 min)
	eksctl delete cluster --name antiatropos --region ap-south-1

	# Verify nothing is left
	aws eks list-clusters --region ap-south-1
	aws amp list-workspaces --region ap-south-1
	```

	Also remove the `KUBECONFIG_CONTENT` secret and reset `PROMETHEUS_URL` to `mock` in your HF Space.

	---

	## Quick Reference Card

	| Task | Command |
	|---|---|
	| Deploy AWS infra | `./deploy/aws/deploy.sh` |
	| Check workloads | `kubectl get pods -n prod-sre` |
	| Check monitoring | `kubectl get pods -n monitoring` |
	| Scale a workload | `kubectl scale deployment/payments -n prod-sre --replicas=N` |
	| Pause all workloads | `kubectl scale deployment -n prod-sre --replicas=0 --all` |
	| Check AMP data | `awscurl --service aps "https://aps-workspaces.ap-south-1.amazonaws.com/workspaces/WS_ID/api/v1/query?query=up" --region ap-south-1` |
	| Generate kubeconfig | `./deploy/aws/generate-kubeconfig.sh` |
	| Nuke everything | `eksctl delete cluster --name antiatropos --region ap-south-1` |