AntiAtropos AWS Operations Guide
Everything you need to run the AWS infrastructure for AntiAtropos without blowing up your bill.
Architecture: FastAPI on Hugging Face Spaces, EKS + AMP + AMG on AWS.
Table of Contents
- Replica Strategy & Caps
- Autoscaling Configuration
- Cost Guardrails
- Step-by-Step Deployment Walkthrough
- Configuring HF Spaces to Connect to AWS
- Day-2 Operations
- Teardown & Cost Recovery
1. Replica Strategy & Caps
What Runs Where
| Component | Where | Scaled By | Cost Impact |
|---|---|---|---|
| AntiAtropos FastAPI server | HF Spaces | HF auto-scales | $0-5/month (HF billing) |
| Workload pods (payments, checkout, etc.) | EKS | SRE agent via KubernetesExecutor |
HIGH — this is where costs spiral |
| Prometheus Agent | EKS (monitoring ns) | Static (1 pod) | Low |
| AMP | AWS managed | Serverless | Pay per GB ingested |
| AMG | AWS managed | Serverless | Pay per editor |
Workload Pod Replicas — Where Costs Spiral
The SRE agent's SCALE_UP action calls KubernetesExecutor._scale_deployment(), which patches replicas on real K8s Deployments. A bad agent can scale every deployment to the cap.
The ANTIATROPOS_MAX_REPLICAS env var (set on HF Spaces) is the global ceiling applied to all deployments. The default in kubernetes_executor.py is 20 — with 5 deployments, that's 100 pods worst case. Set it to 6.
Recommended caps by deployment:
| Deployment | Min | Max Replicas | Reasoning |
|---|---|---|---|
payments (node-0, VIP) |
2 | 6 | VIP node — needs redundancy, 6 is plenty for the traffic model |
checkout (node-1) |
1 | 5 | Can burst but shouldn't stay high |
catalog (node-2) |
1 | 5 | Same |
cart (node-3) |
1 | 4 | Non-critical, sheddable |
auth (node-4) |
1 | 4 | Non-critical, sheddable |
Total worst case: 24 workload pods.
At ~0.25 vCPU / 256MB per workload pod (nginx containers), that's ~6 vCPU and ~6GB RAM — fits on 2x t3.medium nodes with some headroom, or 3 nodes for comfort.
How the Cap Works
The KubernetesExecutor._scale_deployment() method reads ANTIATROPOS_MAX_REPLICAS from the environment and refuses to scale above it:
Ack: SCALE_UP for node-0 - replicas unchanged at 6 (bounds 1-6)
This is enforced in code (kubernetes_executor.py line 115):
desired = min(self.max_replicas, current + delta)
Set ANTIATROPOS_MAX_REPLICAS=6 on your HF Space.
2. Autoscaling Configuration
EKS Node Autoscaling
The cluster needs to grow nodes when the agent scales workloads. Install the Cluster Autoscaler:
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm repo update
helm install cluster-autoscaler autoscaler/cluster-autoscaler \
--namespace kube-system \
-f deploy/aws/cluster-autoscaler-values.yaml
The node group maxSize in eksctl-cluster.yaml (4) is your ultimate cost ceiling.
4 nodes x $0.0416/hr (t3.medium on-demand) = $0.1664/hr = ~$120/month max
With spot instances, this drops to ~$36/month max.
What Happens When the Agent Scales Workloads
- Agent on HF Spaces sends
SCALE_UPaction KubernetesExecutor._scale_deployment()patches the Deployment'sspec.replicasvia EKS API server- Kubernetes scheduler tries to place the new pod
- If no node has capacity -> pod is
Pending - Cluster Autoscaler sees
Pendingpods -> adds a node (withinmaxSize) - If
maxSizeis hit -> pod staysPending(agent action succeeded but pod won't schedule)
This is why maxSize in the node group is your ultimate cost ceiling.
3. Cost Guardrails
Monthly Cost Caps by Tier
| Tier | Max Nodes | Max Workload Pods | Estimated Monthly Cost |
|---|---|---|---|
| Dev/Testing | 2 | 10 (2/deployment) | ~$80 |
| Training | 3 | 15 (3/deployment) | ~$130 |
| Benchmark Suite | 4 | 24 (~5/deployment) | ~$160 |
| Unlimited (danger) | inf | 100 (20/deployment) | $500+ |
AWS Budgets — Get Alerts Before You Overspend
aws budgets create-budget \
--account-id $(aws sts get-caller-identity --query Account --output text) \
--budget '{
"BudgetName": "AntiAtropos-Monthly",
"BudgetLimit": {"Amount": "150", "Unit": "USD"},
"TimeUnit": "MONTHLY",
"CostFilters": {
"TagKeyValue": ["user:Project$AntiAtropos"]
},
"CostTypes": {
"IncludeTax": true,
"IncludeSubscription": true,
"UseBlended": false
}
}'
# Alert at 50%
aws budgets create-notification \
--account-id $(aws sts get-caller-identity --query Account --output text) \
--budget-name "AntiAtropos-Monthly" \
--notification '{"NotificationType":"ACTUAL","ComparisonOperator":"GREATER_THAN","Threshold":50}' \
--subscribers '[{"SubscriptionType":"EMAIL","Address":"your-email@example.com"}]'
# Alert at 80%
aws budgets create-notification \
--account-id $(aws sts get-caller-identity --query Account --output text) \
--budget-name "AntiAtropos-Monthly" \
--notification '{"NotificationType":"ACTUAL","ComparisonOperator":"GREATER_THAN","Threshold":80}' \
--subscribers '[{"SubscriptionType":"EMAIL","Address":"your-email@example.com"}]'
Cost-Saving Checklist
- Use spot instances for node groups (60-70% cheaper, OK for training)
- Set
ANTIATROPOS_MAX_REPLICAS=6on HF Spaces (not 20) to prevent agent runaway - Cap node group
maxSizeat 4 (ineksctl-cluster.yaml) - Set AWS Budget alert at $150/month
- Scale workloads to zero between runs:
kubectl scale deployment -n prod-sre --replicas=0 --all - Delete the cluster for multi-day breaks:
eksctl delete cluster --name antiatropos - AMP free tier covers first 10GB ingest/month
- AMG free tier is 1 editor for 30 days — cancel if not needed
4. Step-by-Step Deployment Walkthrough
Before You Start
You need:
- AWS account with billing alerts enabled
- AWS CLI v2 installed and configured (
aws configure) - eksctl, kubectl, helm installed
- About 20-30 minutes
Step 1: Create the EKS Cluster (15 min)
eksctl create cluster -f deploy/aws/eksctl-cluster.yaml
# Verify
aws eks update-kubeconfig --name antiatropos --region ap-south-1
kubectl get nodes
Step 2: Deploy Sample Workloads (1 min)
kubectl apply -f deploy/aws/k8s-workloads.yaml
kubectl get pods -n prod-sre
Step 3: Create AMP Workspace (1 min)
aws amp create-workspace --alias antiatropos-metrics --region ap-south-1
# Note the workspace ID
aws amp list-workspaces --alias antiatropos-metrics --region ap-south-1 --query 'workspaces[0].workspaceId' --output text
Step 4: Set Up IRSA (2 min)
# Prometheus agent needs to write to AMP
eksctl create iamserviceaccount \
--cluster antiatropos \
--namespace monitoring \
--name prometheus-sa \
--attach-policy-arn arn:aws:iam::aws:policy/AmazonPrometheusRemoteWriteAccess \
--approve
Step 5: Install Prometheus Agent (2 min)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Replace WORKSPACE_ID
helm install prometheus-agent prometheus-community/prometheus \
--namespace monitoring --create-namespace \
-f deploy/aws/prometheus-agent-values.yaml \
--set "prometheus.prometheusSpec.remoteWrite[0].url=https://aps-workspaces.ap-south-1.amazonaws.com/workspaces/WORKSPACE_ID/api/v1/remote_write"
Step 6: Set Up AMG (5 min)
# Create IAM role for AMG
aws iam create-role \
--role-name AntiAtroposGrafanaRole \
--assume-role-policy-document file://deploy/aws/grafana-trust-policy.json
aws iam attach-role-policy \
--role-name AntiAtroposGrafanaRole \
--policy-arn arn:aws:iam::aws:policy/AmazonPrometheusQueryAccess
# Create workspace
aws grafana create-workspace \
--workspace-name antiatropos-dashboards \
--account-access-type CURRENT_ACCOUNT \
--authentication-method AWS_SSO \
--permission-type SERVICE_MANAGED \
--data-sources PROMETHEUS \
--region ap-south-1
Then in the AMG web UI:
- Sign in with AWS SSO
- Configuration -> Data Sources -> Add AMP workspace
- Dashboards -> Import -> Upload JSON from
deploy/grafana/provisioning/dashboards/json/ - Select AMP data source when importing
Step 7: Install Cluster Autoscaler (2 min)
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm repo update
helm install cluster-autoscaler autoscaler/cluster-autoscaler \
--namespace kube-system \
-f deploy/aws/cluster-autoscaler-values.yaml
Step 8: Generate Kubeconfig for HF Spaces (1 min)
./deploy/aws/generate-kubeconfig.sh
# Outputs: deploy/aws/kubeconfig-antiatropos.yaml
Step 9: Configure HF Spaces
See Section 5 below.
5. Configuring HF Spaces to Connect to AWS
Secrets (HF Space Settings -> Repository secrets)
| Secret | Value |
|---|---|
OPENAI_API_KEY |
Your OpenAI API key |
KUBECONFIG_CONTENT |
Base64-encoded content of kubeconfig-antiatropos.yaml |
To encode the kubeconfig:
cat deploy/aws/kubeconfig-antiatropos.yaml | base64 -w 0
Environment Variables (HF Space Settings -> Variables)
| Variable | Value |
|---|---|
ANTIATROPOS_ENV_MODE |
live |
ANTIATROPOS_STRICT_REAL |
false |
PROMETHEUS_URL |
https://aps-workspaces.ap-south-1.amazonaws.com/workspaces/WORKSPACE_ID |
KUBECONFIG |
/app/kubeconfig.yaml |
ANTIATROPOS_K8S_NAMESPACE |
prod-sre |
ANTIATROPOS_DEPLOYMENT_PREFIX |
`` (empty) |
ANTIATROPOS_MIN_REPLICAS |
1 |
ANTIATROPOS_MAX_REPLICAS |
6 |
ANTIATROPOS_SCALE_STEP |
3 |
ANTIATROPOS_PROM_TIMEOUT_S |
5.0 |
ANTIATROPOS_METRIC_AGGREGATION |
sum |
ANTIATROPOS_WORKLOAD_MAP |
See below |
Workload Map Value
{
"node-0": {"deployment": "payments", "namespace": "prod-sre"},
"node-1": {"deployment": "checkout", "namespace": "prod-sre"},
"node-2": {"deployment": "catalog", "namespace": "prod-sre"},
"node-3": {"deployment": "cart", "namespace": "prod-sre"},
"node-4": {"deployment": "auth", "namespace": "prod-sre"}
}
Entrypoint Modification
Add this to deploy/entrypoint.sh before the uvicorn line, so the kubeconfig is decoded from the HF secret:
# Decode kubeconfig from HF Spaces secret
if [ -n "${KUBECONFIG_CONTENT:-}" ]; then
echo "${KUBECONFIG_CONTENT}" | base64 -d > /app/kubeconfig.yaml
export KUBECONFIG=/app/kubeconfig.yaml
fi
Verifying the Connection
After deploying, check from HF Spaces that the server can reach AWS:
- Check the HF Space logs for
antiatropos_stepevents - Look for
Ack: SCALE_UPmessages (agent is reaching EKS) - Look for non-zero
request_rate/cpu_utilization(PrometheusClient is reaching AMP) - If
ANTIATROPOS_STRICT_REAL=false(recommended), failures fall back to mock silently
6. Day-2 Operations
Scaling Workloads Manually
# Scale a specific deployment
kubectl scale deployment/payments -n prod-sre --replicas=4
# Scale all workloads down
kubectl scale deployment -n prod-sre --replicas=0 --all
# Scale all workloads back up
kubectl scale deployment payments -n prod-sre --replicas=2
kubectl scale deployment checkout -n prod-sre --replicas=1
kubectl scale deployment catalog -n prod-sre --replicas=1
kubectl scale deployment cart -n prod-sre --replicas=1
kubectl scale deployment auth -n prod-sre --replicas=1
Pausing Everything (Without Deleting)
# Scale all workloads to 0
kubectl scale deployment -n prod-sre --replicas=0 --all
# Note: EKS nodes still run and cost money.
# For real savings, delete the cluster (Section 7).
Monitoring Agent Behavior
Watch what the SRE agent is doing in real-time:
# Check how many workload pods the agent has created
kubectl get deployments -n prod-sre
# Check current replica counts
kubectl get hpa -A # if any HPAs are defined
# Check node pressure
kubectl top nodes
Checking Current Spend
# Current month cost by service
aws ce get-cost-and-usage \
--time-period Start=$(date -d '1st of this month' +%Y-%m-%d),End=$(date +%Y-%m-%d) \
--granularity MONTHLY \
--metrics BlendedCost \
--group-by Type=DIMENSION,Key=SERVICE
Regenerating Kubeconfig
If the EKS cluster is recreated or credentials expire:
./deploy/aws/generate-kubeconfig.sh
# Re-upload the base64-encoded content to HF Spaces secret KUBECONFIG_CONTENT
7. Teardown & Cost Recovery
Partial Teardown (Keep Cluster, Stop Workloads)
kubectl scale deployment -n prod-sre --replicas=0 --all
# Still paying for EKS control plane ($73/month) and idle nodes
Full Teardown (Stop All Charges)
# Delete workloads
kubectl delete -f deploy/aws/k8s-workloads.yaml
# Delete Prometheus agent
helm uninstall prometheus-agent -n monitoring
kubectl delete namespace monitoring
# Delete AMP workspace
AMP_WS_ID=$(aws amp list-workspaces --alias antiatropos-metrics --region ap-south-1 --query 'workspaces[0].workspaceId' --output text)
aws amp delete-workspace --workspace-id $AMP_WS_ID --region ap-south-1
# Delete AMG workspace
AMG_WS_ID=$(aws grafana list-workspaces --region ap-south-1 --query 'workspaces[0].id' --output text)
aws grafana delete-workspace --workspace-id $AMG_WS_ID
# Delete IAM role for Grafana
aws iam detach-role-policy --role-name AntiAtroposGrafanaRole --policy-arn arn:aws:iam::aws:policy/AmazonPrometheusQueryAccess
aws iam detach-role-policy --role-name AntiAtroposGrafanaRole --policy-arn arn:aws:iam::aws:policy/AmazonPrometheusRemoteWriteAccess
aws iam delete-role --role-name AntiAtroposGrafanaRole
# Delete the EKS cluster (10-15 min)
eksctl delete cluster --name antiatropos --region ap-south-1
# Verify nothing is left
aws eks list-clusters --region ap-south-1
aws amp list-workspaces --region ap-south-1
Also remove the KUBECONFIG_CONTENT secret and reset PROMETHEUS_URL to mock in your HF Space.
Quick Reference Card
| Task | Command |
|---|---|
| Deploy AWS infra | ./deploy/aws/deploy.sh |
| Check workloads | kubectl get pods -n prod-sre |
| Check monitoring | kubectl get pods -n monitoring |
| Scale a workload | kubectl scale deployment/payments -n prod-sre --replicas=N |
| Pause all workloads | kubectl scale deployment -n prod-sre --replicas=0 --all |
| Check AMP data | awscurl --service aps "https://aps-workspaces.ap-south-1.amazonaws.com/workspaces/WS_ID/api/v1/query?query=up" --region ap-south-1 |
| Generate kubeconfig | ./deploy/aws/generate-kubeconfig.sh |
| Nuke everything | eksctl delete cluster --name antiatropos --region ap-south-1 |