# AntiAtropos AWS Operations Guide
Everything you need to run the AWS infrastructure for AntiAtropos without blowing up your bill.
**Architecture: FastAPI on Hugging Face Spaces, EKS + AMP + AMG on AWS.**
---
## Table of Contents
1. [Replica Strategy & Caps](#1-replica-strategy--caps)
2. [Autoscaling Configuration](#2-autoscaling-configuration)
3. [Cost Guardrails](#3-cost-guardrails)
4. [Step-by-Step Deployment Walkthrough](#4-step-by-step-deployment-walkthrough)
5. [Configuring HF Spaces to Connect to AWS](#5-configuring-hf-spaces-to-connect-to-aws)
6. [Day-2 Operations](#6-day-2-operations)
7. [Teardown & Cost Recovery](#7-teardown--cost-recovery)
---
## 1. Replica Strategy & Caps
### What Runs Where
| Component | Where | Scaled By | Cost Impact |
|---|---|---|---|
| **AntiAtropos FastAPI server** | HF Spaces | HF auto-scales | $0-5/month (HF billing) |
| **Workload pods** (payments, checkout, etc.) | EKS | SRE agent via `KubernetesExecutor` | **HIGH** — this is where costs spiral |
| **Prometheus Agent** | EKS (monitoring ns) | Static (1 pod) | Low |
| **AMP** | AWS managed | Serverless | Pay per metric sample ingested, plus storage and queries |
| **AMG** | AWS managed | Serverless | Pay per editor |
### Workload Pod Replicas — Where Costs Spiral
The SRE agent's `SCALE_UP` action calls `KubernetesExecutor._scale_deployment()`, which patches `replicas` on real K8s Deployments. A bad agent can scale every deployment to the cap.
The `ANTIATROPOS_MAX_REPLICAS` env var (set on HF Spaces) is the **global** ceiling applied to all deployments. The default in `kubernetes_executor.py` is 20 — with 5 deployments, that's **100 pods** worst case. **Set it to 6.**
**Recommended caps by deployment:**
| Deployment | Min | Max Replicas | Reasoning |
|---|---|---|---|
| `payments` (node-0, VIP) | 2 | 6 | VIP node — needs redundancy, 6 is plenty for the traffic model |
| `checkout` (node-1) | 1 | 5 | Can burst but shouldn't stay high |
| `catalog` (node-2) | 1 | 5 | Same |
| `cart` (node-3) | 1 | 4 | Non-critical, sheddable |
| `auth` (node-4) | 1 | 4 | Non-critical, sheddable |
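**Total worst case with these per-deployment caps: 24 workload pods.** `ANTIATROPOS_MAX_REPLICAS` is a single global ceiling, so with it set to 6 the executor itself still allows up to 5 x 6 = 30 pods; the lower per-deployment maximums are targets, not limits enforced by that variable.
At ~0.25 vCPU / 256 MB per workload pod (nginx containers), 24 pods need ~6 vCPU and ~6 GB RAM. A t3.medium has 2 vCPU and 4 GiB, so if those figures are resource requests, CPU is the binding constraint: the pods need at least 3 nodes, and realistically the full 4-node `maxSize` once system pods are counted.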
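The sizing arithmetic can be checked with a short sketch. It assumes the 0.25 vCPU / 256 MB figures are per-pod resource requests and uses the nominal t3.medium size (2 vCPU, 4 GiB); real allocatable capacity is lower because of system pods and kubelet reservations, so treat `min_nodes` as a lower bound. The function and constant names are illustrative, not part of the project.

```python
import math

# Per-deployment max replicas from the table above.
MAX_REPLICAS = {"payments": 6, "checkout": 5, "catalog": 5, "cart": 4, "auth": 4}

# Assumed per-pod requests (the ~0.25 vCPU / 256 MB figures quoted above).
POD_VCPU = 0.25
POD_MEM_MIB = 256

# Nominal t3.medium size; allocatable capacity is lower in practice.
NODE_VCPU = 2
NODE_MEM_MIB = 4096


def worst_case(max_replicas: dict) -> dict:
    """Return total pods, resource needs, and the minimum node count."""
    pods = sum(max_replicas.values())
    vcpu = pods * POD_VCPU
    mem_mib = pods * POD_MEM_MIB
    # A node count must satisfy both the CPU and the memory requirement.
    min_nodes = max(math.ceil(vcpu / NODE_VCPU), math.ceil(mem_mib / NODE_MEM_MIB))
    return {"pods": pods, "vcpu": vcpu, "mem_gib": mem_mib / 1024, "min_nodes": min_nodes}


print(worst_case(MAX_REPLICAS))
```

With these assumptions CPU, not memory, determines the node count.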
### How the Cap Works
The `KubernetesExecutor._scale_deployment()` method reads `ANTIATROPOS_MAX_REPLICAS` from the environment and refuses to scale above it:
```
Ack: SCALE_UP for node-0 - replicas unchanged at 6 (bounds 1-6)
```
This is enforced in code (`kubernetes_executor.py` line 115):
```python
desired = min(self.max_replicas, current + delta)
```
**Set `ANTIATROPOS_MAX_REPLICAS=6` on your HF Space.**
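A minimal sketch of the clamping behaviour, for intuition only. The upper bound mirrors the `min(self.max_replicas, current + delta)` line quoted above; the lower bound is an assumption inferred from the `bounds 1-6` text in the ack message and the `ANTIATROPOS_MIN_REPLICAS` variable. `clamp_replicas` and `ack_message` are hypothetical helpers, not the executor's real methods, and the wording of the "changed" message is invented for illustration.

```python
def clamp_replicas(current: int, delta: int, min_replicas: int = 1, max_replicas: int = 6) -> int:
    """Apply a scale step and keep the result inside [min_replicas, max_replicas]."""
    return max(min_replicas, min(max_replicas, current + delta))


def ack_message(action: str, node_id: str, current: int, desired: int,
                min_replicas: int, max_replicas: int) -> str:
    """Build an ack string; only the 'unchanged' form is taken from the guide."""
    bounds = f"(bounds {min_replicas}-{max_replicas})"
    if desired == current:
        return f"Ack: {action} for {node_id} - replicas unchanged at {current} {bounds}"
    # Hypothetical format for the case where the replica count actually changes.
    return f"Ack: {action} for {node_id} - replicas {current} -> {desired} {bounds}"


# A deployment already at the cap stays at the cap.
desired = clamp_replicas(current=6, delta=3)
print(ack_message("SCALE_UP", "node-0", 6, desired, 1, 6))
```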
---
## 2. Autoscaling Configuration
### EKS Node Autoscaling
The cluster needs to grow nodes when the agent scales workloads. Install the Cluster Autoscaler:
```bash
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm repo update
helm install cluster-autoscaler autoscaler/cluster-autoscaler \
--namespace kube-system \
-f deploy/aws/cluster-autoscaler-values.yaml
```
**The node group `maxSize` in `eksctl-cluster.yaml` (4) is your ultimate cost ceiling.**
```
4 nodes x $0.0416/hr (t3.medium on-demand) = $0.1664/hr = ~$120/month max
```
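With spot instances, this drops to ~$36/month max. Both figures cover EC2 nodes only: the EKS control plane (~$73/month) is billed on top, and on-demand rates vary by region, so check the current t3.medium price for `ap-south-1`.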
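The node cost ceiling can be reproduced with a few lines. This sketch assumes a 730-hour month, the $0.0416/hr rate quoted above, a 70% spot discount (the upper end of the 60-70% range in the checklist), and a $0.10/hr EKS control plane (the ~$73/month mentioned in Section 7). All names are illustrative.

```python
HOURS_PER_MONTH = 730  # average hours in a month


def monthly_node_cost(nodes: int, hourly_rate: float, spot_discount: float = 0.0) -> float:
    """Monthly EC2 cost for `nodes` instances at `hourly_rate`, optionally discounted."""
    return nodes * hourly_rate * (1 - spot_discount) * HOURS_PER_MONTH


def monthly_total(nodes: int, hourly_rate: float, spot_discount: float = 0.0,
                  control_plane_hourly: float = 0.10) -> float:
    """Node cost plus the EKS control plane, which is billed regardless of node count."""
    return monthly_node_cost(nodes, hourly_rate, spot_discount) + control_plane_hourly * HOURS_PER_MONTH


print(f"on-demand nodes: ${monthly_node_cost(4, 0.0416):.2f}")
print(f"spot nodes:      ${monthly_node_cost(4, 0.0416, spot_discount=0.7):.2f}")
print(f"on-demand total: ${monthly_total(4, 0.0416):.2f}")
```

The totals exclude AMP, AMG, data transfer, and EBS volumes.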
### What Happens When the Agent Scales Workloads
1. Agent on HF Spaces sends `SCALE_UP` action
2. `KubernetesExecutor._scale_deployment()` patches the Deployment's `spec.replicas` via EKS API server
3. Kubernetes scheduler tries to place the new pod
4. If no node has capacity -> pod is `Pending`
5. Cluster Autoscaler sees `Pending` pods -> adds a node (within `maxSize`)
6. If `maxSize` is hit -> pod stays `Pending` (agent action succeeded but pod won't schedule)
**This is why `maxSize` in the node group is your ultimate cost ceiling.**
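The flow above can be sketched as a toy capacity model. It is a simplification, not how the Cluster Autoscaler is implemented: `pods_per_node` is an assumed usable capacity per node, and the model ignores scheduling constraints such as affinity or taints.

```python
import math


def schedule(desired_pods: int, pods_per_node: int, current_nodes: int, max_nodes: int) -> dict:
    """Estimate node count and pending pods once the autoscaler has reacted."""
    needed_nodes = math.ceil(desired_pods / pods_per_node)
    # The autoscaler adds nodes for Pending pods but never exceeds maxSize.
    nodes = min(max(current_nodes, needed_nodes), max_nodes)
    scheduled = min(desired_pods, nodes * pods_per_node)
    return {"nodes": nodes, "scheduled": scheduled, "pending": desired_pods - scheduled}


# 24 pods fit once the cluster grows to 4 nodes (assuming 6 pods per node).
print(schedule(desired_pods=24, pods_per_node=6, current_nodes=2, max_nodes=4))
# 30 pods do not: the extra 6 stay Pending because maxSize is reached.
print(schedule(desired_pods=30, pods_per_node=6, current_nodes=2, max_nodes=4))
```

In the second case the agent's action is acknowledged, but the extra replicas never run and no additional node cost is incurred.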
---
## 3. Cost Guardrails
### Monthly Cost Caps by Tier
| Tier | Max Nodes | Max Workload Pods | Estimated Monthly Cost |
|---|---|---|---|
| **Dev/Testing** | 2 | 10 (2/deployment) | ~$80 |
| **Training** | 3 | 15 (3/deployment) | ~$130 |
| **Benchmark Suite** | 4 | 24 (~5/deployment) | ~$160 |
| **Unlimited (danger)** | unbounded | 100 (20/deployment) | $500+ |
### AWS Budgets — Get Alerts Before You Overspend
```bash
aws budgets create-budget \
--account-id $(aws sts get-caller-identity --query Account --output text) \
--budget '{
"BudgetName": "AntiAtropos-Monthly",
"BudgetLimit": {"Amount": "150", "Unit": "USD"},
"TimeUnit": "MONTHLY",
"BudgetType": "COST",
"CostFilters": {
"TagKeyValue": ["user:Project$AntiAtropos"]
},
"CostTypes": {
"IncludeTax": true,
"IncludeSubscription": true,
"UseBlended": false
}
}'
# Alert at 50%
aws budgets create-notification \
--account-id $(aws sts get-caller-identity --query Account --output text) \
--budget-name "AntiAtropos-Monthly" \
--notification '{"NotificationType":"ACTUAL","ComparisonOperator":"GREATER_THAN","Threshold":50}' \
--subscribers '[{"SubscriptionType":"EMAIL","Address":"your-email@example.com"}]'
# Alert at 80%
aws budgets create-notification \
--account-id $(aws sts get-caller-identity --query Account --output text) \
--budget-name "AntiAtropos-Monthly" \
--notification '{"NotificationType":"ACTUAL","ComparisonOperator":"GREATER_THAN","Threshold":80}' \
--subscribers '[{"SubscriptionType":"EMAIL","Address":"your-email@example.com"}]'
```
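If you prefer to generate the JSON payloads rather than hand-edit them, a sketch like the following builds the same structures and shows the dollar amounts the two thresholds correspond to. The helper names are hypothetical; the field names follow the AWS Budgets payloads used above, with `BudgetType` set to `COST`.

```python
import json


def build_budget(name: str, limit_usd: int, project_tag: str) -> dict:
    """Budget payload suitable for `aws budgets create-budget --budget`."""
    return {
        "BudgetName": name,
        "BudgetLimit": {"Amount": str(limit_usd), "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
        # Tag filters use the form "user:<TagKey>$<TagValue>".
        "CostFilters": {"TagKeyValue": [f"user:Project${project_tag}"]},
    }


def build_notification(threshold_pct: int) -> dict:
    """Notification payload that fires when actual spend exceeds the threshold."""
    return {
        "NotificationType": "ACTUAL",
        "ComparisonOperator": "GREATER_THAN",
        "Threshold": threshold_pct,
    }


def alert_amounts(limit_usd: int, thresholds: list) -> dict:
    """Map each percentage threshold to the dollar amount at which it fires."""
    return {pct: limit_usd * pct / 100 for pct in thresholds}


print(json.dumps(build_budget("AntiAtropos-Monthly", 150, "AntiAtropos"), indent=2))
print(alert_amounts(150, [50, 80]))
```

With a $150 limit, the 50% and 80% alerts fire at $75 and $120 of actual spend.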
### Cost-Saving Checklist
- [ ] Use **spot instances** for node groups (60-70% cheaper, OK for training)
- [ ] Set `ANTIATROPOS_MAX_REPLICAS=6` on HF Spaces (not 20) to prevent agent runaway
- [ ] Cap node group `maxSize` at 4 (in `eksctl-cluster.yaml`)
- [ ] Set AWS Budget alert at $150/month
- [ ] Scale workloads to zero between runs: `kubectl scale deployment -n prod-sre --replicas=0 --all`
- [ ] Delete the cluster for multi-day breaks: `eksctl delete cluster --name antiatropos --region ap-south-1`
- [ ] AMP free tier covers a monthly allowance of ingested samples plus 10 GB of metric storage (ingestion is billed per sample, not per GB)
- [ ] AMG free tier is 1 editor for 30 days — cancel if not needed
---
## 4. Step-by-Step Deployment Walkthrough
### Before You Start
You need:
- AWS account with billing alerts enabled
- AWS CLI v2 installed and configured (`aws configure`)
- eksctl, kubectl, helm installed
- About 20-30 minutes
### Step 1: Create the EKS Cluster (15 min)
```bash
eksctl create cluster -f deploy/aws/eksctl-cluster.yaml
# Verify
aws eks update-kubeconfig --name antiatropos --region ap-south-1
kubectl get nodes
```
### Step 2: Deploy Sample Workloads (1 min)
```bash
kubectl apply -f deploy/aws/k8s-workloads.yaml
kubectl get pods -n prod-sre
```
### Step 3: Create AMP Workspace (1 min)
```bash
aws amp create-workspace --alias antiatropos-metrics --region ap-south-1
# Note the workspace ID
aws amp list-workspaces --alias antiatropos-metrics --region ap-south-1 --query 'workspaces[0].workspaceId' --output text
```
### Step 4: Set Up IRSA (2 min)
```bash
# Prometheus agent needs to write to AMP
eksctl create iamserviceaccount \
--cluster antiatropos \
--region ap-south-1 \
--namespace monitoring \
--name prometheus-sa \
--attach-policy-arn arn:aws:iam::aws:policy/AmazonPrometheusRemoteWriteAccess \
--approve
```
### Step 5: Install Prometheus Agent (2 min)
```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Replace WORKSPACE_ID
helm install prometheus-agent prometheus-community/prometheus \
--namespace monitoring --create-namespace \
-f deploy/aws/prometheus-agent-values.yaml \
--set "server.remoteWrite[0].url=https://aps-workspaces.ap-south-1.amazonaws.com/workspaces/WORKSPACE_ID/api/v1/remote_write"
```
### Step 6: Set Up AMG (5 min)
```bash
# Create IAM role for AMG
aws iam create-role \
--role-name AntiAtroposGrafanaRole \
--assume-role-policy-document file://deploy/aws/grafana-trust-policy.json
aws iam attach-role-policy \
--role-name AntiAtroposGrafanaRole \
--policy-arn arn:aws:iam::aws:policy/AmazonPrometheusQueryAccess
# Create workspace
aws grafana create-workspace \
--workspace-name antiatropos-dashboards \
--account-access-type CURRENT_ACCOUNT \
--authentication-providers AWS_SSO \
--permission-type SERVICE_MANAGED \
--workspace-data-sources PROMETHEUS \
--region ap-south-1
```
Then in the AMG web UI:
1. Sign in with AWS SSO
2. Configuration -> Data Sources -> Add AMP workspace
3. Dashboards -> Import -> Upload JSON from `deploy/grafana/provisioning/dashboards/json/`
4. Select AMP data source when importing
### Step 7: Install Cluster Autoscaler (2 min)
```bash
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm repo update
helm install cluster-autoscaler autoscaler/cluster-autoscaler \
--namespace kube-system \
-f deploy/aws/cluster-autoscaler-values.yaml
```
### Step 8: Generate Kubeconfig for HF Spaces (1 min)
```bash
./deploy/aws/generate-kubeconfig.sh
# Outputs: deploy/aws/kubeconfig-antiatropos.yaml
```
### Step 9: Configure HF Spaces
See [Section 5](#5-configuring-hf-spaces-to-connect-to-aws) below.
---
## 5. Configuring HF Spaces to Connect to AWS
### Secrets (HF Space Settings -> Repository secrets)
| Secret | Value |
|---|---|
| `OPENAI_API_KEY` | Your OpenAI API key |
| `KUBECONFIG_CONTENT` | Base64-encoded content of `kubeconfig-antiatropos.yaml` |
To encode the kubeconfig:
```bash
# GNU coreutils: -w 0 disables line wrapping so the secret is a single line
base64 -w 0 deploy/aws/kubeconfig-antiatropos.yaml
```
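On systems where `base64 -w 0` is unavailable, the same encoding can be done in Python. This is a sketch with illustrative function names; it produces a single-line string and, when decoding, strips any whitespace that may have been introduced when pasting the secret.

```python
import base64


def encode_kubeconfig(raw: bytes) -> str:
    """Return a single-line base64 string suitable for the KUBECONFIG_CONTENT secret."""
    return base64.b64encode(raw).decode("ascii")


def decode_kubeconfig(encoded: str) -> bytes:
    """Decode the secret, ignoring whitespace or newlines added during copy/paste."""
    compact = "".join(encoded.split())
    return base64.b64decode(compact, validate=True)


# Round trip on a small stand-in document; in practice, read the bytes of
# deploy/aws/kubeconfig-antiatropos.yaml instead.
sample = b"apiVersion: v1\nkind: Config\n"
encoded = encode_kubeconfig(sample)
assert decode_kubeconfig(encoded) == sample
print(encoded)
```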
### Environment Variables (HF Space Settings -> Variables)
| Variable | Value |
|---|---|
| `ANTIATROPOS_ENV_MODE` | `live` |
| `ANTIATROPOS_STRICT_REAL` | `false` |
| `PROMETHEUS_URL` | `https://aps-workspaces.ap-south-1.amazonaws.com/workspaces/WORKSPACE_ID` |
| `KUBECONFIG` | `/app/kubeconfig.yaml` |
| `ANTIATROPOS_K8S_NAMESPACE` | `prod-sre` |
| `ANTIATROPOS_DEPLOYMENT_PREFIX` | `` (empty) |
| `ANTIATROPOS_MIN_REPLICAS` | `1` |
| `ANTIATROPOS_MAX_REPLICAS` | `6` |
| `ANTIATROPOS_SCALE_STEP` | `3` |
| `ANTIATROPOS_PROM_TIMEOUT_S` | `5.0` |
| `ANTIATROPOS_METRIC_AGGREGATION` | `sum` |
| `ANTIATROPOS_WORKLOAD_MAP` | See below |
### Workload Map Value
```json
{
"node-0": {"deployment": "payments", "namespace": "prod-sre"},
"node-1": {"deployment": "checkout", "namespace": "prod-sre"},
"node-2": {"deployment": "catalog", "namespace": "prod-sre"},
"node-3": {"deployment": "cart", "namespace": "prod-sre"},
"node-4": {"deployment": "auth", "namespace": "prod-sre"}
}
```
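Because the map is passed as a JSON string in an environment variable, a malformed value is easy to miss. The following sketch validates the value before you paste it into the Space settings. It is not the executor's actual parser (which is not shown here); `parse_workload_map` is a hypothetical helper that only checks the shape used above.

```python
import json


def parse_workload_map(raw: str) -> dict:
    """Parse the workload map and return {node_id: (namespace, deployment)}."""
    data = json.loads(raw)
    resolved = {}
    for node_id, target in data.items():
        missing = {"deployment", "namespace"} - set(target)
        if missing:
            raise ValueError(f"{node_id} is missing keys: {sorted(missing)}")
        resolved[node_id] = (target["namespace"], target["deployment"])
    return resolved


raw_value = """
{
  "node-0": {"deployment": "payments", "namespace": "prod-sre"},
  "node-1": {"deployment": "checkout", "namespace": "prod-sre"}
}
"""
for node_id, (namespace, deployment) in parse_workload_map(raw_value).items():
    print(f"{node_id} -> {namespace}/{deployment}")
```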
### Entrypoint Modification
Add this to `deploy/entrypoint.sh` before the uvicorn line, so the kubeconfig is decoded from the HF secret:
```bash
# Decode kubeconfig from HF Spaces secret
if [ -n "${KUBECONFIG_CONTENT:-}" ]; then
echo "${KUBECONFIG_CONTENT}" | base64 -d > /app/kubeconfig.yaml
export KUBECONFIG=/app/kubeconfig.yaml
fi
```
### Verifying the Connection
After deploying, check from HF Spaces that the server can reach AWS:
1. Check the HF Space logs for `antiatropos_step` events
2. Look for `Ack: SCALE_UP` messages (agent is reaching EKS)
3. Look for non-zero `request_rate` / `cpu_utilization` (PrometheusClient is reaching AMP)
4. If `ANTIATROPOS_STRICT_REAL=false` (recommended), failures fall back to mock data silently, so an absence of errors does not prove the connection works; rely on checks 2 and 3
---
## 6. Day-2 Operations
### Scaling Workloads Manually
```bash
# Scale a specific deployment
kubectl scale deployment/payments -n prod-sre --replicas=4
# Scale all workloads down
kubectl scale deployment -n prod-sre --replicas=0 --all
# Scale all workloads back up
kubectl scale deployment payments -n prod-sre --replicas=2
kubectl scale deployment checkout -n prod-sre --replicas=1
kubectl scale deployment catalog -n prod-sre --replicas=1
kubectl scale deployment cart -n prod-sre --replicas=1
kubectl scale deployment auth -n prod-sre --replicas=1
```
### Pausing Everything (Without Deleting)
```bash
# Scale all workloads to 0
kubectl scale deployment -n prod-sre --replicas=0 --all
# Note: EKS nodes still run and cost money.
# For real savings, delete the cluster (Section 7).
```
### Monitoring Agent Behavior
Watch what the SRE agent is doing in real-time:
```bash
# Check current replica counts per deployment
kubectl get deployments -n prod-sre
# List HPAs, if any are defined
kubectl get hpa -A
# Check node pressure (requires metrics-server)
kubectl top nodes
```
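To spot an agent that is pinning deployments at the ceiling, you can post-process the output of `kubectl get deployments -n prod-sre -o json`. The sketch below works on the parsed JSON; the sample data is made up, and `at_or_over_cap` is an illustrative helper rather than part of the project.

```python
def at_or_over_cap(deployment_list: dict, max_replicas: int) -> list:
    """Return (name, replicas) for deployments whose spec.replicas has reached the cap."""
    flagged = []
    for item in deployment_list.get("items", []):
        name = item["metadata"]["name"]
        replicas = item.get("spec", {}).get("replicas", 0)
        if replicas >= max_replicas:
            flagged.append((name, replicas))
    return flagged


# Made-up sample in the shape kubectl returns for a Deployment list.
sample = {
    "items": [
        {"metadata": {"name": "payments"}, "spec": {"replicas": 6}},
        {"metadata": {"name": "checkout"}, "spec": {"replicas": 2}},
        {"metadata": {"name": "cart"}, "spec": {"replicas": 7}},
    ]
}
print(at_or_over_cap(sample, max_replicas=6))
```

In practice, load the real data with `json.load(sys.stdin)` and pipe the `kubectl` output into the script. A replica count above the cap suggests a manual `kubectl scale`, since the executor refuses to exceed `ANTIATROPOS_MAX_REPLICAS`.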
### Checking Current Spend
```bash
# Current month cost by service
aws ce get-cost-and-usage \
--time-period Start=$(date +%Y-%m-01),End=$(date +%Y-%m-%d) \
--granularity MONTHLY \
--metrics BlendedCost \
--group-by Type=DIMENSION,Key=SERVICE
```
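One edge case in the month-to-date query: Cost Explorer treats `End` as exclusive and requires `Start` to be before `End`, so the command fails on the first day of a month. A sketch of the date logic, with an assumed fallback to the previous full month in that case (`month_to_date_period` is an illustrative name):

```python
from datetime import date, timedelta


def month_to_date_period(today: date) -> tuple:
    """Return (start, end) ISO dates for a month-to-date Cost Explorer query."""
    start = today.replace(day=1)
    if start == today:
        # On the 1st there is no data for the current month yet, and
        # Start == End is rejected, so report the previous month instead.
        start = (today - timedelta(days=1)).replace(day=1)
    return start.isoformat(), today.isoformat()


start, end = month_to_date_period(date.today())
print(f"--time-period Start={start},End={end}")
```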
### Regenerating Kubeconfig
If the EKS cluster is recreated or credentials expire:
```bash
./deploy/aws/generate-kubeconfig.sh
# Re-upload the base64-encoded content to HF Spaces secret KUBECONFIG_CONTENT
```
---
## 7. Teardown & Cost Recovery
### Partial Teardown (Keep Cluster, Stop Workloads)
```bash
kubectl scale deployment -n prod-sre --replicas=0 --all
# Still paying for EKS control plane ($73/month) and idle nodes
```
### Full Teardown (Stop All Charges)
```bash
# Delete workloads
kubectl delete -f deploy/aws/k8s-workloads.yaml
# Delete Prometheus agent
helm uninstall prometheus-agent -n monitoring
kubectl delete namespace monitoring
# Delete AMP workspace
AMP_WS_ID=$(aws amp list-workspaces --alias antiatropos-metrics --region ap-south-1 --query 'workspaces[0].workspaceId' --output text)
aws amp delete-workspace --workspace-id $AMP_WS_ID --region ap-south-1
# Delete AMG workspace
AMG_WS_ID=$(aws grafana list-workspaces --region ap-south-1 --query "workspaces[?name=='antiatropos-dashboards'].id | [0]" --output text)
aws grafana delete-workspace --workspace-id "$AMG_WS_ID" --region ap-south-1
# Delete IAM role for Grafana
aws iam detach-role-policy --role-name AntiAtroposGrafanaRole --policy-arn arn:aws:iam::aws:policy/AmazonPrometheusQueryAccess
aws iam delete-role --role-name AntiAtroposGrafanaRole
# Delete the EKS cluster (10-15 min)
eksctl delete cluster --name antiatropos --region ap-south-1
# Verify nothing is left
aws eks list-clusters --region ap-south-1
aws amp list-workspaces --region ap-south-1
```
Also remove the `KUBECONFIG_CONTENT` secret and reset `PROMETHEUS_URL` to `mock` in your HF Space.
---
## Quick Reference Card
| Task | Command |
|---|---|
| Deploy AWS infra | `./deploy/aws/deploy.sh` |
| Check workloads | `kubectl get pods -n prod-sre` |
| Check monitoring | `kubectl get pods -n monitoring` |
| Scale a workload | `kubectl scale deployment/payments -n prod-sre --replicas=N` |
| Pause all workloads | `kubectl scale deployment -n prod-sre --replicas=0 --all` |
| Check AMP data | `awscurl --service aps "https://aps-workspaces.ap-south-1.amazonaws.com/workspaces/WS_ID/api/v1/query?query=up" --region ap-south-1` |
| Generate kubeconfig | `./deploy/aws/generate-kubeconfig.sh` |
| Nuke everything | `eksctl delete cluster --name antiatropos --region ap-south-1` |