AntiAtropos AWS Operations Guide

Everything you need to run the AWS infrastructure for AntiAtropos without blowing up your bill.

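Architecture: the FastAPI server runs on Hugging Face (HF) Spaces; EKS (Kubernetes), AMP (Amazon Managed Service for Prometheus), and AMG (Amazon Managed Grafana) run on AWS.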


Table of Contents

  1. Replica Strategy & Caps
  2. Autoscaling Configuration
  3. Cost Guardrails
  4. Step-by-Step Deployment Walkthrough
  5. Configuring HF Spaces to Connect to AWS
  6. Day-2 Operations
  7. Teardown & Cost Recovery

1. Replica Strategy & Caps

What Runs Where

| Component | Where | Scaled By | Cost Impact |
| --- | --- | --- | --- |
| AntiAtropos FastAPI server | HF Spaces | HF auto-scales | $0-5/month (HF billing) |
| Workload pods (payments, checkout, etc.) | EKS | SRE agent via KubernetesExecutor | HIGH: this is where costs spiral |
| Prometheus Agent | EKS (monitoring ns) | Static (1 pod) | Low |
| AMP | AWS managed | Serverless | Pay per GB ingested |
| AMG | AWS managed | Serverless | Pay per editor |

Workload Pod Replicas — Where Costs Spiral

The SRE agent's SCALE_UP action calls KubernetesExecutor._scale_deployment(), which patches replicas on real K8s Deployments. A bad agent can scale every deployment to the cap.

The ANTIATROPOS_MAX_REPLICAS env var (set on HF Spaces) is the global ceiling applied to all deployments. The default in kubernetes_executor.py is 20 — with 5 deployments, that's 100 pods worst case. Set it to 6.

Recommended caps by deployment:

| Deployment | Min Replicas | Max Replicas | Reasoning |
| --- | --- | --- | --- |
| payments (node-0, VIP) | 2 | 6 | VIP node: needs redundancy; 6 is plenty for the traffic model |
| checkout (node-1) | 1 | 5 | Can burst but shouldn't stay high |
| catalog (node-2) | 1 | 5 | Same |
| cart (node-3) | 1 | 4 | Non-critical, sheddable |
| auth (node-4) | 1 | 4 | Non-critical, sheddable |

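Total worst case with these per-deployment caps: 24 workload pods (6 + 5 + 5 + 4 + 4). ANTIATROPOS_MAX_REPLICAS is a single global ceiling, so with it set to 6 and no per-deployment limits, the hard upper bound is 5 x 6 = 30 pods.

At ~0.25 vCPU / 256MB per workload pod (nginx containers), 24 pods come to ~6 vCPU and ~6GB RAM. A t3.medium provides 2 vCPU and 4 GiB, so if those figures are scheduler requests, memory fits on 2 nodes but CPU needs at least 3, and realistically 4 once system pods are counted. That matches the node group maxSize of 4 used in this guide.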
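To sanity-check the totals, here is a minimal sketch of the arithmetic. The caps are copied from the table above and the per-pod figures are the approximate ones quoted in the text; the names CAPS and worst_case are illustrative and not part of AntiAtropos.

```python
# Per-deployment (min, max) replica caps, copied from the table above.
CAPS = {
    "payments": (2, 6),
    "checkout": (1, 5),
    "catalog": (1, 5),
    "cart": (1, 4),
    "auth": (1, 4),
}

VCPU_PER_POD = 0.25   # approximate figure quoted in this guide
MEM_MB_PER_POD = 256  # approximate figure quoted in this guide


def worst_case(caps):
    """Return (pods, vcpu, mem_gb) if every deployment sits at its max."""
    pods = sum(max_r for _, max_r in caps.values())
    return pods, pods * VCPU_PER_POD, pods * MEM_MB_PER_POD / 1024


pods, vcpu, mem_gb = worst_case(CAPS)
print(f"{pods} pods, {vcpu} vCPU, {mem_gb} GB")
```

Changing a cap in CAPS and re-running shows immediately how much extra node capacity a higher ceiling would demand.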

How the Cap Works

The KubernetesExecutor._scale_deployment() method reads ANTIATROPOS_MAX_REPLICAS from the environment and refuses to scale above it:

Ack: SCALE_UP for node-0 - replicas unchanged at 6 (bounds 1-6)

This is enforced in code (kubernetes_executor.py line 115):

desired = min(self.max_replicas, current + delta)

Set ANTIATROPOS_MAX_REPLICAS=6 on your HF Space.
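The clamping behavior can be sketched as follows. This is not the code from kubernetes_executor.py: clamp_replicas is a hypothetical helper, the lower bound is inferred from the "bounds 1-6" text in the Ack message, and the default of 1 for ANTIATROPOS_MIN_REPLICAS is an assumption (only the max default of 20 is stated in this guide).

```python
import os


def clamp_replicas(current, delta, env=os.environ):
    """Clamp current + delta into [min, max] read from the environment.

    Illustrative only; the real logic lives in kubernetes_executor.py.
    """
    min_r = int(env.get("ANTIATROPOS_MIN_REPLICAS", "1"))   # assumed default
    max_r = int(env.get("ANTIATROPOS_MAX_REPLICAS", "20"))  # default per this guide
    return max(min_r, min(max_r, current + delta))


# A plain dict stands in for the HF Space environment.
env = {"ANTIATROPOS_MIN_REPLICAS": "1", "ANTIATROPOS_MAX_REPLICAS": "6"}
print(clamp_replicas(4, 3, env))   # request exceeds the ceiling
print(clamp_replicas(6, 3, env))   # already at the ceiling
print(clamp_replicas(2, -3, env))  # request drops below the floor
```

With the cap at 6, a SCALE_UP from 6 replicas returns 6 again, which corresponds to the "replicas unchanged" Ack shown above.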


2. Autoscaling Configuration

EKS Node Autoscaling

The cluster needs to grow nodes when the agent scales workloads. Install the Cluster Autoscaler:

helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm repo update

helm install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  -f deploy/aws/cluster-autoscaler-values.yaml

The node group maxSize in eksctl-cluster.yaml (4) is your ultimate cost ceiling.

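4 nodes x $0.0416/hr (t3.medium on-demand) = $0.1664/hr = ~$120/month max

With spot instances, this drops to ~$36/month max. Both figures cover node compute only; the EKS control plane adds ~$73/month on top (see Section 7). On-demand rates vary by region, so check the current t3.medium price for ap-south-1.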
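The node cost ceiling is simple enough to recompute whenever you change maxSize or instance type. A minimal sketch, assuming 730 hours per month and treating the 70% spot discount as an illustrative figure (actual spot prices fluctuate); the function names are hypothetical.

```python
HOURS_PER_MONTH = 730  # 8760 hours per year / 12


def monthly_node_cost(nodes, hourly_rate_usd):
    """Monthly compute cost for `nodes` instances at a given hourly rate."""
    return nodes * hourly_rate_usd * HOURS_PER_MONTH


def spot_cost(on_demand_usd, discount):
    """Apply a fractional spot discount (0.70 means 70% cheaper)."""
    return on_demand_usd * (1 - discount)


ceiling = monthly_node_cost(4, 0.0416)
print(f"on-demand ceiling: ${ceiling:.2f}/month")
print(f"spot at 70% off:   ${spot_cost(ceiling, 0.70):.2f}/month")
```

The on-demand result is $121.47, which the guide rounds to ~$120.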

What Happens When the Agent Scales Workloads

  1. Agent on HF Spaces sends SCALE_UP action
  2. KubernetesExecutor._scale_deployment() patches the Deployment's spec.replicas via EKS API server
  3. Kubernetes scheduler tries to place the new pod
  4. If no node has capacity -> pod is Pending
  5. Cluster Autoscaler sees Pending pods -> adds a node (within maxSize)
  6. If maxSize is hit -> pod stays Pending (agent action succeeded but pod won't schedule)

This is why maxSize in the node group is your ultimate cost ceiling.
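The steps above can be modeled in a few lines to see when pods end up Pending. This is a deliberately simplified sketch: pods_per_node is a hypothetical uniform capacity, scale-down is ignored, and the real scheduler and Cluster Autoscaler consider far more than pod counts.

```python
import math


def schedule(total_pods, pods_per_node, current_nodes, max_nodes):
    """Return (nodes_after, pending_pods) once the autoscaler has reacted.

    Nodes grow to fit the pods but never beyond max_nodes (the node
    group maxSize); anything that still does not fit stays Pending.
    """
    needed = math.ceil(total_pods / pods_per_node)
    nodes_after = max(current_nodes, min(needed, max_nodes))
    pending = max(0, total_pods - nodes_after * pods_per_node)
    return nodes_after, pending


print(schedule(10, 6, 2, 4))  # fits on the existing nodes
print(schedule(24, 6, 2, 4))  # autoscaler grows to maxSize, everything fits
print(schedule(30, 6, 2, 4))  # maxSize hit, the remainder stays Pending
```

In the last case the agent's SCALE_UP calls all succeed, yet 6 pods never run, and the bill stops growing at 4 nodes.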


3. Cost Guardrails

Monthly Cost Caps by Tier

| Tier | Max Nodes | Max Workload Pods | Estimated Monthly Cost |
| --- | --- | --- | --- |
| Dev/Testing | 2 | 10 (2/deployment) | ~$80 |
| Training | 3 | 15 (3/deployment) | ~$130 |
| Benchmark Suite | 4 | 24 (~5/deployment) | ~$160 |
| Unlimited (danger) | unlimited | 100 (20/deployment) | $500+ |

AWS Budgets — Get Alerts Before You Overspend

aws budgets create-budget \
  --account-id $(aws sts get-caller-identity --query Account --output text) \
  --budget '{
    "BudgetName": "AntiAtropos-Monthly",
    "BudgetLimit": {"Amount": "150", "Unit": "USD"},
    "TimeUnit": "MONTHLY",
    "CostFilters": {
      "TagKeyValue": ["user:Project$AntiAtropos"]
    },
    "CostTypes": {
      "IncludeTax": true,
      "IncludeSubscription": true,
      "UseBlended": false
    }
  }'

# Alert at 50%
aws budgets create-notification \
  --account-id $(aws sts get-caller-identity --query Account --output text) \
  --budget-name "AntiAtropos-Monthly" \
  --notification '{"NotificationType":"ACTUAL","ComparisonOperator":"GREATER_THAN","Threshold":50}' \
  --subscribers '[{"SubscriptionType":"EMAIL","Address":"your-email@example.com"}]'

# Alert at 80%
aws budgets create-notification \
  --account-id $(aws sts get-caller-identity --query Account --output text) \
  --budget-name "AntiAtropos-Monthly" \
  --notification '{"NotificationType":"ACTUAL","ComparisonOperator":"GREATER_THAN","Threshold":80}' \
  --subscribers '[{"SubscriptionType":"EMAIL","Address":"your-email@example.com"}]'
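The two notification payloads differ only in the threshold, so they can be generated rather than hand-edited. A small sketch, assuming the thresholds are percentages of the budget (the Budgets default when no threshold type is given); the helper name notification is hypothetical.

```python
import json

BUDGET_USD = 150  # matches BudgetLimit in the budget definition above


def notification(threshold_pct):
    """Build the --notification payload for a percentage threshold."""
    return {
        "NotificationType": "ACTUAL",
        "ComparisonOperator": "GREATER_THAN",
        "Threshold": threshold_pct,
    }


for pct in (50, 80):
    dollars = BUDGET_USD * pct / 100
    # Print the dollar amount each alert corresponds to, plus the JSON
    # that can be pasted after --notification.
    print(f"{pct}% (${dollars:.0f}): {json.dumps(notification(pct))}")
```

With a $150 budget the alerts fire at $75 and $120 of actual spend.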

Cost-Saving Checklist

  • Use spot instances for node groups (60-70% cheaper, OK for training)
  • Set ANTIATROPOS_MAX_REPLICAS=6 on HF Spaces (not 20) to prevent agent runaway
  • Cap node group maxSize at 4 (in eksctl-cluster.yaml)
  • Set AWS Budget alert at $150/month
  • Scale workloads to zero between runs: kubectl scale deployment -n prod-sre --replicas=0 --all
  • Delete the cluster for multi-day breaks: eksctl delete cluster --name antiatropos --region ap-south-1
  • AMP free tier covers first 10GB ingest/month
  • AMG free tier is 1 editor for 30 days — cancel if not needed

4. Step-by-Step Deployment Walkthrough

Before You Start

You need:

  • AWS account with billing alerts enabled
  • AWS CLI v2 installed and configured (aws configure)
  • eksctl, kubectl, helm installed
  • About 20-30 minutes

Step 1: Create the EKS Cluster (15 min)

eksctl create cluster -f deploy/aws/eksctl-cluster.yaml

# Verify
aws eks update-kubeconfig --name antiatropos --region ap-south-1
kubectl get nodes

Step 2: Deploy Sample Workloads (1 min)

kubectl apply -f deploy/aws/k8s-workloads.yaml
kubectl get pods -n prod-sre

Step 3: Create AMP Workspace (1 min)

aws amp create-workspace --alias antiatropos-metrics --region ap-south-1

# Note the workspace ID
aws amp list-workspaces --alias antiatropos-metrics --region ap-south-1 --query 'workspaces[0].workspaceId' --output text

Step 4: Set Up IRSA (2 min)

# Prometheus agent needs to write to AMP
eksctl create iamserviceaccount \
  --cluster antiatropos \
  --namespace monitoring \
  --name prometheus-sa \
  --attach-policy-arn arn:aws:iam::aws:policy/AmazonPrometheusRemoteWriteAccess \
  --approve

Step 5: Install Prometheus Agent (2 min)

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Replace WORKSPACE_ID with the ID from Step 3.
# The prometheus-community/prometheus chart reads remote-write settings from server.remoteWrite.
helm install prometheus-agent prometheus-community/prometheus \
  --namespace monitoring --create-namespace \
  -f deploy/aws/prometheus-agent-values.yaml \
  --set "server.remoteWrite[0].url=https://aps-workspaces.ap-south-1.amazonaws.com/workspaces/WORKSPACE_ID/api/v1/remote_write"
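The workspace ID shows up in several URLs in this guide, and a typo in any of them fails quietly. A minimal sketch that derives them from one place; the helper names are hypothetical and "ws-EXAMPLE" is a placeholder, not a real workspace ID.

```python
def amp_base_url(workspace_id, region="ap-south-1"):
    """Base URL of an AMP workspace (the value used for PROMETHEUS_URL)."""
    return f"https://aps-workspaces.{region}.amazonaws.com/workspaces/{workspace_id}"


def remote_write_url(workspace_id, region="ap-south-1"):
    """Endpoint the Prometheus agent writes samples to."""
    return amp_base_url(workspace_id, region) + "/api/v1/remote_write"


def query_url(workspace_id, region="ap-south-1"):
    """Endpoint used for instant queries, e.g. ?query=up."""
    return amp_base_url(workspace_id, region) + "/api/v1/query"


print(remote_write_url("ws-EXAMPLE"))
print(query_url("ws-EXAMPLE"))
```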

Step 6: Set Up AMG (5 min)

# Create IAM role for AMG
aws iam create-role \
  --role-name AntiAtroposGrafanaRole \
  --assume-role-policy-document file://deploy/aws/grafana-trust-policy.json

aws iam attach-role-policy \
  --role-name AntiAtroposGrafanaRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonPrometheusQueryAccess

# Create workspace
aws grafana create-workspace \
  --workspace-name antiatropos-dashboards \
  --account-access-type CURRENT_ACCOUNT \
  --authentication-providers AWS_SSO \
  --permission-type SERVICE_MANAGED \
  --workspace-data-sources PROMETHEUS \
  --region ap-south-1

Then in the AMG web UI:

  1. Sign in with AWS SSO
  2. Configuration -> Data Sources -> Add AMP workspace
  3. Dashboards -> Import -> Upload JSON from deploy/grafana/provisioning/dashboards/json/
  4. Select AMP data source when importing

Step 7: Install Cluster Autoscaler (2 min)

helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm repo update

helm install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  -f deploy/aws/cluster-autoscaler-values.yaml

Step 8: Generate Kubeconfig for HF Spaces (1 min)

./deploy/aws/generate-kubeconfig.sh
# Outputs: deploy/aws/kubeconfig-antiatropos.yaml

Step 9: Configure HF Spaces

See Section 5 below.


5. Configuring HF Spaces to Connect to AWS

Secrets (HF Space Settings -> Repository secrets)

| Secret | Value |
| --- | --- |
| OPENAI_API_KEY | Your OpenAI API key |
| KUBECONFIG_CONTENT | Base64-encoded content of kubeconfig-antiatropos.yaml |

To encode the kubeconfig:

cat deploy/aws/kubeconfig-antiatropos.yaml | base64 -w 0

Environment Variables (HF Space Settings -> Variables)

| Variable | Value |
| --- | --- |
| ANTIATROPOS_ENV_MODE | live |
| ANTIATROPOS_STRICT_REAL | false |
| PROMETHEUS_URL | https://aps-workspaces.ap-south-1.amazonaws.com/workspaces/WORKSPACE_ID |
| KUBECONFIG | /app/kubeconfig.yaml |
| ANTIATROPOS_K8S_NAMESPACE | prod-sre |
| ANTIATROPOS_DEPLOYMENT_PREFIX | (empty string) |
| ANTIATROPOS_MIN_REPLICAS | 1 |
| ANTIATROPOS_MAX_REPLICAS | 6 |
| ANTIATROPOS_SCALE_STEP | 3 |
| ANTIATROPOS_PROM_TIMEOUT_S | 5.0 |
| ANTIATROPOS_METRIC_AGGREGATION | sum |
| ANTIATROPOS_WORKLOAD_MAP | JSON object; see Workload Map Value below |

Workload Map Value

{
  "node-0": {"deployment": "payments", "namespace": "prod-sre"},
  "node-1": {"deployment": "checkout", "namespace": "prod-sre"},
  "node-2": {"deployment": "catalog", "namespace": "prod-sre"},
  "node-3": {"deployment": "cart", "namespace": "prod-sre"},
  "node-4": {"deployment": "auth", "namespace": "prod-sre"}
}
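Because the map is pasted into an HF Space variable as a JSON string, it is worth validating before deploying. The sketch below shows one way a consumer might resolve a node ID to a deployment; resolve is a hypothetical helper, and how the executor actually reads ANTIATROPOS_WORKLOAD_MAP is not shown in this guide.

```python
import json

# The value of ANTIATROPOS_WORKLOAD_MAP, as given above.
WORKLOAD_MAP_JSON = """
{
  "node-0": {"deployment": "payments", "namespace": "prod-sre"},
  "node-1": {"deployment": "checkout", "namespace": "prod-sre"},
  "node-2": {"deployment": "catalog", "namespace": "prod-sre"},
  "node-3": {"deployment": "cart", "namespace": "prod-sre"},
  "node-4": {"deployment": "auth", "namespace": "prod-sre"}
}
"""


def resolve(node_id, raw=WORKLOAD_MAP_JSON):
    """Map an agent node ID to (namespace, deployment); raise if unmapped."""
    mapping = json.loads(raw)  # raises ValueError on malformed JSON
    if node_id not in mapping:
        raise KeyError(f"no workload mapped for {node_id!r}")
    entry = mapping[node_id]
    return entry["namespace"], entry["deployment"]


print(resolve("node-0"))
```

A malformed paste fails at json.loads, and an unmapped node fails with a clear KeyError instead of scaling the wrong deployment.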

Entrypoint Modification

Add this to deploy/entrypoint.sh before the uvicorn line, so the kubeconfig is decoded from the HF secret:

# Decode kubeconfig from HF Spaces secret
if [ -n "${KUBECONFIG_CONTENT:-}" ]; then
    printf '%s' "${KUBECONFIG_CONTENT}" | base64 -d > /app/kubeconfig.yaml
    chmod 600 /app/kubeconfig.yaml  # holds cluster credentials; restrict to the app user
    export KUBECONFIG=/app/kubeconfig.yaml
fi
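A truncated or line-wrapped secret is a common reason the decode step produces a broken kubeconfig. Before pasting the secret into HF Spaces, the round trip can be checked locally; this sketch uses only the Python standard library, and the two helper names are hypothetical.

```python
import base64


def encode_kubeconfig(text):
    """Return single-line base64, equivalent to `base64 -w 0`."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def decode_kubeconfig(secret):
    """Inverse of encode_kubeconfig; validate=True rejects stray characters."""
    return base64.b64decode(secret, validate=True).decode("utf-8")


# A stand-in for the real file contents.
sample = "apiVersion: v1\nkind: Config\n"
secret = encode_kubeconfig(sample)

assert "\n" not in secret                   # safe to paste as one line
assert decode_kubeconfig(secret) == sample  # decodes to the original text
print("round trip OK,", len(secret), "characters")
```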

Verifying the Connection

After deploying, check from HF Spaces that the server can reach AWS:

  1. Check the HF Space logs for antiatropos_step events
  2. Look for Ack: SCALE_UP messages (agent is reaching EKS)
  3. Look for non-zero request_rate / cpu_utilization (PrometheusClient is reaching AMP)
  4. With ANTIATROPOS_STRICT_REAL=false (recommended), failures fall back to mock silently, so a clean log alone does not prove the connection works; rely on checks 2 and 3

6. Day-2 Operations

Scaling Workloads Manually

# Scale a specific deployment
kubectl scale deployment/payments -n prod-sre --replicas=4

# Scale all workloads down
kubectl scale deployment -n prod-sre --replicas=0 --all

# Scale all workloads back up
kubectl scale deployment payments -n prod-sre --replicas=2
kubectl scale deployment checkout -n prod-sre --replicas=1
kubectl scale deployment catalog -n prod-sre --replicas=1
kubectl scale deployment cart -n prod-sre --replicas=1
kubectl scale deployment auth -n prod-sre --replicas=1
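If the baseline changes often, the restore commands can be generated from a single table instead of edited by hand. A minimal sketch: BASELINE mirrors the minimum replicas recommended in Section 1, and restore_commands is a hypothetical helper that only builds the command strings without running them.

```python
import shlex

# Minimum replicas per deployment, from the caps table in Section 1.
BASELINE = {"payments": 2, "checkout": 1, "catalog": 1, "cart": 1, "auth": 1}


def restore_commands(baseline, namespace="prod-sre"):
    """Return one `kubectl scale` command string per deployment."""
    return [
        shlex.join(
            ["kubectl", "scale", "deployment", name,
             "-n", namespace, f"--replicas={replicas}"]
        )
        for name, replicas in baseline.items()
    ]


for cmd in restore_commands(BASELINE):
    print(cmd)
```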

Pausing Everything (Without Deleting)

# Scale all workloads to 0
kubectl scale deployment -n prod-sre --replicas=0 --all

# Note: EKS nodes still run and cost money.
# For real savings, delete the cluster (Section 7).

Monitoring Agent Behavior

Watch what the SRE agent is doing in real-time:

# Check current replica counts per deployment
kubectl get deployments -n prod-sre

# List HorizontalPodAutoscalers, if any are defined
kubectl get hpa -A

# Check node pressure (requires metrics-server)
kubectl top nodes

Checking Current Spend

# Current month cost by service (GNU date; End is exclusive, so tomorrow's date includes today)
aws ce get-cost-and-usage \
  --time-period Start=$(date +%Y-%m-01),End=$(date -d tomorrow +%Y-%m-%d) \
  --granularity MONTHLY \
  --metrics BlendedCost \
  --group-by Type=DIMENSION,Key=SERVICE
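The --time-period window is easy to get wrong at month boundaries. As a portable alternative to shell date arithmetic, the sketch below computes a month-to-date window in Python; month_to_date is a hypothetical helper, and it assumes Cost Explorer's End date is exclusive.

```python
import datetime


def month_to_date(today=None):
    """Return (start, end) ISO dates covering the current month so far.

    End is treated as exclusive, so it is set to the day after `today`;
    this also keeps Start strictly before End on the 1st of the month.
    """
    today = today or datetime.date.today()
    start = today.replace(day=1)
    end = today + datetime.timedelta(days=1)
    return start.isoformat(), end.isoformat()


start, end = month_to_date()
print(f"--time-period Start={start},End={end}")
```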

Regenerating Kubeconfig

If the EKS cluster is recreated or credentials expire:

./deploy/aws/generate-kubeconfig.sh
# Re-upload the base64-encoded content to HF Spaces secret KUBECONFIG_CONTENT

7. Teardown & Cost Recovery

Partial Teardown (Keep Cluster, Stop Workloads)

kubectl scale deployment -n prod-sre --replicas=0 --all
# Still paying for EKS control plane ($73/month) and idle nodes

Full Teardown (Stop All Charges)

# Delete workloads
kubectl delete -f deploy/aws/k8s-workloads.yaml

# Delete Prometheus agent
helm uninstall prometheus-agent -n monitoring
kubectl delete namespace monitoring

# Delete AMP workspace
AMP_WS_ID=$(aws amp list-workspaces --alias antiatropos-metrics --region ap-south-1 --query 'workspaces[0].workspaceId' --output text)
aws amp delete-workspace --workspace-id $AMP_WS_ID --region ap-south-1

# Delete AMG workspace
AMG_WS_ID=$(aws grafana list-workspaces --region ap-south-1 --query "workspaces[?name=='antiatropos-dashboards'].id | [0]" --output text)
aws grafana delete-workspace --workspace-id "$AMG_WS_ID" --region ap-south-1

# Delete IAM role for Grafana
# (AmazonPrometheusRemoteWriteAccess was attached to the IRSA role in Step 4, not to this role, so only the query policy is detached here)
aws iam detach-role-policy --role-name AntiAtroposGrafanaRole --policy-arn arn:aws:iam::aws:policy/AmazonPrometheusQueryAccess
aws iam delete-role --role-name AntiAtroposGrafanaRole

# Delete the EKS cluster (10-15 min)
eksctl delete cluster --name antiatropos --region ap-south-1

# Verify nothing is left
aws eks list-clusters --region ap-south-1
aws amp list-workspaces --region ap-south-1

Also remove the KUBECONFIG_CONTENT secret and reset PROMETHEUS_URL to mock in your HF Space.


Quick Reference Card

| Task | Command |
| --- | --- |
| Deploy AWS infra | ./deploy/aws/deploy.sh |
| Check workloads | kubectl get pods -n prod-sre |
| Check monitoring | kubectl get pods -n monitoring |
| Scale a workload | kubectl scale deployment/payments -n prod-sre --replicas=N |
| Pause all workloads | kubectl scale deployment -n prod-sre --replicas=0 --all |
| Check AMP data | awscurl --service aps "https://aps-workspaces.ap-south-1.amazonaws.com/workspaces/WS_ID/api/v1/query?query=up" --region ap-south-1 |
| Generate kubeconfig | ./deploy/aws/generate-kubeconfig.sh |
| Nuke everything | eksctl delete cluster --name antiatropos --region ap-south-1 |