AntiAtropos AWS Operations Guide

Everything you need to run the AWS infrastructure for AntiAtropos without blowing up your bill.

Architecture: FastAPI on Hugging Face Spaces, EKS + AMP + AMG on AWS.

Replica Strategy & Caps
Autoscaling Configuration
Cost Guardrails
Step-by-Step Deployment Walkthrough
Configuring HF Spaces to Connect to AWS
Day-2 Operations
Teardown & Cost Recovery

1. Replica Strategy & Caps

What Runs Where

Component	Where	Scaled By	Cost Impact
AntiAtropos FastAPI server	HF Spaces	HF auto-scales	$0-5/month (HF billing)
Workload pods (payments, checkout, etc.)	EKS	SRE agent via `KubernetesExecutor`	HIGH — this is where costs spiral
Prometheus Agent	EKS (monitoring ns)	Static (1 pod)	Low
AMP	AWS managed	Serverless	Pay per metric sample ingested, plus storage and queries
AMG	AWS managed	Serverless	Pay per editor

Workload Pod Replicas — Where Costs Spiral

The SRE agent's SCALE_UP action calls KubernetesExecutor._scale_deployment(), which patches replicas on real K8s Deployments. A bad agent can scale every deployment to the cap.

The ANTIATROPOS_MAX_REPLICAS env var (set on HF Spaces) is the global ceiling applied to all deployments. The default in kubernetes_executor.py is 20 — with 5 deployments, that's 100 pods worst case. Set it to 6.

Recommended caps by deployment:

Deployment	Min	Max Replicas	Reasoning
`payments` (node-0, VIP)	2	6	VIP node — needs redundancy, 6 is plenty for the traffic model
`checkout` (node-1)	1	5	Can burst but shouldn't stay high
`catalog` (node-2)	1	5	Same
`cart` (node-3)	1	4	Non-critical, sheddable
`auth` (node-4)	1	4	Non-critical, sheddable

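Total worst case with these per-deployment caps: 24 workload pods. Note that ANTIATROPOS_MAX_REPLICAS is a single global ceiling, so setting it to 6 still allows up to 6 x 5 = 30 pods; the lower caps for checkout, catalog, cart, and auth are targets unless something else enforces them.

At ~0.25 vCPU / 256MB per workload pod (nginx containers), 24 pods need ~6 vCPU and ~6GB RAM. A t3.medium has 2 vCPU and 4 GiB, so memory fits on 2 nodes, but if those figures are set as resource requests, CPU needs at least 3 nodes, and 4 once system pods and kubelet reservations are counted. That matches the node group maxSize of 4.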
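
The capacity arithmetic above can be checked with a short Python sketch. The caps and per-pod figures come from the table and text; the t3.medium shape (2 vCPU, 4 GiB) is the published instance size. Treating the per-pod figures as scheduling requests is an assumption, and system pods and kubelet reservations are ignored, so real node counts will be somewhat higher.

```python
import math

# Per-deployment max replicas from the table above.
MAX_REPLICAS = {"payments": 6, "checkout": 5, "catalog": 5, "cart": 4, "auth": 4}

POD_VCPU = 0.25      # approximate CPU per workload pod (from the text)
POD_MEM_GIB = 0.25   # 256 MB per workload pod
NODE_VCPU = 2.0      # t3.medium
NODE_MEM_GIB = 4.0   # t3.medium


def worst_case(caps):
    """Return total pods, resource totals, and a lower bound on node count."""
    pods = sum(caps.values())
    vcpu = pods * POD_VCPU
    mem = pods * POD_MEM_GIB
    # The binding resource (CPU or memory) decides the minimum node count.
    nodes = max(math.ceil(vcpu / NODE_VCPU), math.ceil(mem / NODE_MEM_GIB))
    return {"pods": pods, "vcpu": vcpu, "mem_gib": mem, "min_nodes": nodes}
```

With the table's caps, CPU is the binding resource rather than memory.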

How the Cap Works

The KubernetesExecutor._scale_deployment() method reads ANTIATROPOS_MAX_REPLICAS from the environment and refuses to scale above it:

Ack: SCALE_UP for node-0 - replicas unchanged at 6 (bounds 1-6)

This is enforced in code (kubernetes_executor.py line 115):

desired = min(self.max_replicas, current + delta)
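
A minimal sketch of the clamping idea, for illustration only. The function names here are hypothetical and are not the actual contents of kubernetes_executor.py; the environment variable names come from this guide, the fallback of 20 for the maximum mirrors the default described above, and the fallback of 1 for the minimum is an assumption.

```python
import os


def clamp_replicas(current, delta, min_replicas, max_replicas):
    """Apply delta to current and keep the result within [min, max]."""
    return max(min_replicas, min(max_replicas, current + delta))


def bounds_from_env():
    """Read replica bounds from the environment (hypothetical helper)."""
    # Fallback for MIN is an assumption; fallback for MAX mirrors the text.
    lo = int(os.environ.get("ANTIATROPOS_MIN_REPLICAS", "1"))
    hi = int(os.environ.get("ANTIATROPOS_MAX_REPLICAS", "20"))
    return lo, hi
```

A SCALE_UP of 3 on a deployment already at 6 with bounds 1-6 leaves it at 6, which corresponds to the "replicas unchanged" acknowledgement shown above.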

Set ANTIATROPOS_MAX_REPLICAS=6 on your HF Space.

2. Autoscaling Configuration

EKS Node Autoscaling

The cluster needs to grow nodes when the agent scales workloads. Install the Cluster Autoscaler:

helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm repo update

helm install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  -f deploy/aws/cluster-autoscaler-values.yaml

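The node group maxSize in eksctl-cluster.yaml (4) is your ultimate cost ceiling for compute.

4 nodes x $0.0416/hr (t3.medium on-demand) = $0.1664/hr = ~$120/month max

That figure covers nodes only; the EKS control plane adds a flat $73/month (see Section 7). On-demand rates also vary by region, so check the current t3.medium price for ap-south-1.

With spot instances, the node cost drops to roughly $36-48/month max (60-70% cheaper).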
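
The node cost figures can be reproduced with a few lines of Python. The hourly rate is the one quoted above; 730 hours per month is the usual averaging convention (8760 / 12), and the spot discount is an input you supply, since actual spot prices fluctuate.

```python
HOURS_PER_MONTH = 730  # 8760 hours per year / 12


def monthly_node_cost(nodes, hourly_rate, spot_discount=0.0):
    """Estimated monthly cost in USD for a node group (nodes only)."""
    return nodes * hourly_rate * HOURS_PER_MONTH * (1.0 - spot_discount)


on_demand = monthly_node_cost(4, 0.0416)                  # about 121
spot_best = monthly_node_cost(4, 0.0416, spot_discount=0.70)  # about 36
```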

What Happens When the Agent Scales Workloads

1. Agent on HF Spaces sends SCALE_UP action
2. KubernetesExecutor._scale_deployment() patches the Deployment's spec.replicas via EKS API server
3. Kubernetes scheduler tries to place the new pod
4. If no node has capacity -> pod is Pending
5. Cluster Autoscaler sees Pending pods -> adds a node (within maxSize)
6. If maxSize is hit -> pod stays Pending (agent action succeeded but pod won't schedule)
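
The sequence above can be modeled with a deliberately simplified Python sketch. It assumes every node holds the same fixed number of workload pods (the `pods_per_node` value is hypothetical) and ignores scheduling details such as affinity, system pods, and scale-up latency.

```python
import math


def place_pods(desired_pods, pods_per_node, current_nodes, max_nodes):
    """Return (nodes_after_autoscaling, pending_pods) for a toy model."""
    needed = math.ceil(desired_pods / pods_per_node)
    # The autoscaler adds nodes only up to the node group's maxSize.
    nodes = min(max(current_nodes, needed), max_nodes)
    # Anything that does not fit on the capped node count stays Pending.
    pending = max(0, desired_pods - nodes * pods_per_node)
    return nodes, pending
```

For example, with a hypothetical 7 pods per node and maxSize 4, 24 pods fit, while 100 pods leave 72 Pending and cost no more than 4 nodes.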

This is why maxSize in the node group is your ultimate cost ceiling.

3. Cost Guardrails

Monthly Cost Caps by Tier

Tier	Max Nodes	Max Workload Pods	Estimated Monthly Cost
Dev/Testing	2	10 (2/deployment)	~$80
Training	3	15 (3/deployment)	~$130
Benchmark Suite	4	24 (~5/deployment)	~$160
Unlimited (danger)	inf	100 (20/deployment)	$500+

AWS Budgets — Get Alerts Before You Overspend

aws budgets create-budget \
  --account-id $(aws sts get-caller-identity --query Account --output text) \
  --budget '{
    "BudgetName": "AntiAtropos-Monthly",
    "BudgetLimit": {"Amount": "150", "Unit": "USD"},
    "TimeUnit": "MONTHLY",
    "CostFilters": {
      "TagKeyValue": ["user:Project$AntiAtropos"]
    },
    "CostTypes": {
      "IncludeTax": true,
      "IncludeSubscription": true,
      "UseBlended": false
    }
  }'

# Alert at 50%
aws budgets create-notification \
  --account-id $(aws sts get-caller-identity --query Account --output text) \
  --budget-name "AntiAtropos-Monthly" \
  --notification '{"NotificationType":"ACTUAL","ComparisonOperator":"GREATER_THAN","Threshold":50}' \
  --subscribers '[{"SubscriptionType":"EMAIL","Address":"your-email@example.com"}]'

# Alert at 80%
aws budgets create-notification \
  --account-id $(aws sts get-caller-identity --query Account --output text) \
  --budget-name "AntiAtropos-Monthly" \
  --notification '{"NotificationType":"ACTUAL","ComparisonOperator":"GREATER_THAN","Threshold":80}' \
  --subscribers '[{"SubscriptionType":"EMAIL","Address":"your-email@example.com"}]'
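
The two notification commands differ only in the threshold. If you want more alert levels, a small Python helper can generate the argument lists instead of copy-pasting. This is a sketch: the function names are hypothetical, it only builds the arguments for `aws budgets create-notification`, and you would still add `--account-id` and run the CLI yourself.

```python
import json


def notification(threshold_pct):
    """Notification payload matching the shape used in the commands above."""
    return {
        "NotificationType": "ACTUAL",
        "ComparisonOperator": "GREATER_THAN",
        "Threshold": threshold_pct,
    }


def notification_args(budget_name, email, thresholds):
    """Build one CLI argument list per threshold."""
    subscribers = json.dumps([{"SubscriptionType": "EMAIL", "Address": email}])
    return [
        ["--budget-name", budget_name,
         "--notification", json.dumps(notification(t)),
         "--subscribers", subscribers]
        for t in thresholds
    ]
```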

Cost-Saving Checklist

Use spot instances for node groups (60-70% cheaper, OK for training)
Set ANTIATROPOS_MAX_REPLICAS=6 on HF Spaces (not 20) to prevent agent runaway
Cap node group maxSize at 4 (in eksctl-cluster.yaml)
Set AWS Budget alert at $150/month
Scale workloads to zero between runs: kubectl scale deployment -n prod-sre --replicas=0 --all
Delete the cluster for multi-day breaks: eksctl delete cluster --name antiatropos
AMP free tier covers the first 40M metric samples ingested and 10GB of storage per month
AMG free tier is 1 editor for 30 days — cancel if not needed

4. Step-by-Step Deployment Walkthrough

Before You Start

You need:

AWS account with billing alerts enabled
AWS CLI v2 installed and configured (aws configure)
eksctl, kubectl, helm installed
About 20-30 minutes

Step 1: Create the EKS Cluster (15 min)

eksctl create cluster -f deploy/aws/eksctl-cluster.yaml

# Verify
aws eks update-kubeconfig --name antiatropos --region ap-south-1
kubectl get nodes

Step 2: Deploy Sample Workloads (1 min)

kubectl apply -f deploy/aws/k8s-workloads.yaml
kubectl get pods -n prod-sre

Step 3: Create AMP Workspace (1 min)

aws amp create-workspace --alias antiatropos-metrics --region ap-south-1

# Note the workspace ID
aws amp list-workspaces --alias antiatropos-metrics --region ap-south-1 --query 'workspaces[0].workspaceId' --output text

Step 4: Set Up IRSA (2 min)

# Prometheus agent needs to write to AMP
eksctl create iamserviceaccount \
  --cluster antiatropos \
  --namespace monitoring \
  --name prometheus-sa \
  --attach-policy-arn arn:aws:iam::aws:policy/AmazonPrometheusRemoteWriteAccess \
  --approve

Step 5: Install Prometheus Agent (2 min)

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Replace WORKSPACE_ID
helm install prometheus-agent prometheus-community/prometheus \
  --namespace monitoring --create-namespace \
  -f deploy/aws/prometheus-agent-values.yaml \
  --set "server.remoteWrite[0].url=https://aps-workspaces.ap-south-1.amazonaws.com/workspaces/WORKSPACE_ID/api/v1/remote_write"
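
The AMP endpoint follows a fixed pattern, so the URL can be built from the region and workspace ID rather than edited by hand. A small Python sketch (the function names are hypothetical; the URL shapes are the ones used in this guide):

```python
def amp_base_url(region, workspace_id):
    """Base URL of an AMP workspace; also the value used for PROMETHEUS_URL."""
    return f"https://aps-workspaces.{region}.amazonaws.com/workspaces/{workspace_id}"


def remote_write_url(region, workspace_id):
    """Endpoint the Prometheus agent writes to."""
    return amp_base_url(region, workspace_id) + "/api/v1/remote_write"


def query_url(region, workspace_id):
    """Endpoint used for PromQL instant queries."""
    return amp_base_url(region, workspace_id) + "/api/v1/query"
```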

Step 6: Set Up AMG (5 min)

# Create IAM role for AMG
aws iam create-role \
  --role-name AntiAtroposGrafanaRole \
  --assume-role-policy-document file://deploy/aws/grafana-trust-policy.json

aws iam attach-role-policy \
  --role-name AntiAtroposGrafanaRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonPrometheusQueryAccess

# Create workspace
aws grafana create-workspace \
  --workspace-name antiatropos-dashboards \
  --account-access-type CURRENT_ACCOUNT \
  --authentication-providers AWS_SSO \
  --permission-type SERVICE_MANAGED \
  --workspace-data-sources PROMETHEUS \
  --region ap-south-1

Then in the AMG web UI:

Sign in with AWS SSO
Configuration -> Data Sources -> Add AMP workspace
Dashboards -> Import -> Upload JSON from deploy/grafana/provisioning/dashboards/json/
Select AMP data source when importing

Step 7: Install Cluster Autoscaler (2 min)

helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm repo update

helm install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  -f deploy/aws/cluster-autoscaler-values.yaml

Step 8: Generate Kubeconfig for HF Spaces (1 min)

./deploy/aws/generate-kubeconfig.sh
# Outputs: deploy/aws/kubeconfig-antiatropos.yaml

Step 9: Configure HF Spaces

See Section 5 below.

5. Configuring HF Spaces to Connect to AWS

Secrets (HF Space Settings -> Repository secrets)

Secret	Value
`OPENAI_API_KEY`	Your OpenAI API key
`KUBECONFIG_CONTENT`	Base64-encoded content of `kubeconfig-antiatropos.yaml`

To encode the kubeconfig:

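base64 -w 0 deploy/aws/kubeconfig-antiatropos.yaml
# -w 0 (no line wrapping) is a GNU coreutils flag; on macOS use: base64 -i deploy/aws/kubeconfig-antiatropos.yaml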

Environment Variables (HF Space Settings -> Variables)

Variable	Value
`ANTIATROPOS_ENV_MODE`	`live`
`ANTIATROPOS_STRICT_REAL`	`false`
`PROMETHEUS_URL`	`https://aps-workspaces.ap-south-1.amazonaws.com/workspaces/WORKSPACE_ID`
`KUBECONFIG`	`/app/kubeconfig.yaml`
`ANTIATROPOS_K8S_NAMESPACE`	`prod-sre`
`ANTIATROPOS_DEPLOYMENT_PREFIX`	`` (empty)
`ANTIATROPOS_MIN_REPLICAS`	`1`
`ANTIATROPOS_MAX_REPLICAS`	`6`
`ANTIATROPOS_SCALE_STEP`	`3`
`ANTIATROPOS_PROM_TIMEOUT_S`	`5.0`
`ANTIATROPOS_METRIC_AGGREGATION`	`sum`
`ANTIATROPOS_WORKLOAD_MAP`	See below

Workload Map Value

{
  "node-0": {"deployment": "payments", "namespace": "prod-sre"},
  "node-1": {"deployment": "checkout", "namespace": "prod-sre"},
  "node-2": {"deployment": "catalog", "namespace": "prod-sre"},
  "node-3": {"deployment": "cart", "namespace": "prod-sre"},
  "node-4": {"deployment": "auth", "namespace": "prod-sre"}
}
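
For illustration, this is roughly how a map of that shape can be parsed and used to resolve a node ID to a Kubernetes target. It is a sketch, not the project's actual loader; the function name is hypothetical, and only the JSON shape and the variable name ANTIATROPOS_WORKLOAD_MAP come from this guide.

```python
import json
import os


def load_workload_map(raw=None):
    """Parse the workload map into {node_id: (namespace, deployment)}."""
    if raw is None:
        raw = os.environ.get("ANTIATROPOS_WORKLOAD_MAP", "{}")
    parsed = json.loads(raw)
    return {
        node: (entry["namespace"], entry["deployment"])
        for node, entry in parsed.items()
    }


EXAMPLE = '{"node-0": {"deployment": "payments", "namespace": "prod-sre"}}'
```

Parsing the value locally before pasting it into the HF Space settings is a cheap way to catch a missing comma or quote.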

Entrypoint Modification

Add this to deploy/entrypoint.sh before the uvicorn line, so the kubeconfig is decoded from the HF secret:

# Decode kubeconfig from HF Spaces secret
if [ -n "${KUBECONFIG_CONTENT:-}" ]; then
    echo "${KUBECONFIG_CONTENT}" | base64 -d > /app/kubeconfig.yaml
    export KUBECONFIG=/app/kubeconfig.yaml
fi
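
Before uploading the secret, it can help to confirm that the encoded value round-trips cleanly. The Python sketch below mirrors the encode and decode steps; the function names are hypothetical, and the check for a `clusters:` key is only a rough sanity test, not a full kubeconfig validation.

```python
import base64
import binascii


def encode_kubeconfig(text):
    """Equivalent of `base64 -w 0`: a single-line base64 string."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def decode_kubeconfig(secret):
    """Decode the secret and apply a rough sanity check."""
    try:
        raw = base64.b64decode(secret.strip(), validate=True)
    except binascii.Error as exc:
        raise ValueError("KUBECONFIG_CONTENT is not valid base64") from exc
    text = raw.decode("utf-8")
    if "clusters:" not in text:
        raise ValueError("decoded content does not look like a kubeconfig")
    return text
```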

Verifying the Connection

After deploying, check from HF Spaces that the server can reach AWS:

Check the HF Space logs for antiatropos_step events
Look for Ack: SCALE_UP messages (agent is reaching EKS)
Look for non-zero request_rate / cpu_utilization (PrometheusClient is reaching AMP)
If ANTIATROPOS_STRICT_REAL=false (recommended), failures fall back to mock silently, so a log without errors does not by itself prove the connection works
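
If you export the Space logs, a short Python sketch can tally the acknowledgements per node. The regular expression is based only on the single example line shown in Section 1 ("Ack: SCALE_UP for node-0 - ..."), so treat the pattern as an assumption and adjust it if your log lines differ.

```python
import re
from collections import Counter

# Matches e.g. "Ack: SCALE_UP for node-0 - replicas unchanged at 6 (bounds 1-6)"
ACK_RE = re.compile(r"Ack: (?P<action>[A-Z_]+) for (?P<node>[\w-]+)")


def count_acks(lines):
    """Count acknowledgement lines per (action, node) pair."""
    counts = Counter()
    for line in lines:
        match = ACK_RE.search(line)
        if match:
            counts[(match["action"], match["node"])] += 1
    return counts
```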

6. Day-2 Operations

Scaling Workloads Manually

# Scale a specific deployment
kubectl scale deployment/payments -n prod-sre --replicas=4

# Scale all workloads down
kubectl scale deployment -n prod-sre --replicas=0 --all

# Scale all workloads back up
kubectl scale deployment payments -n prod-sre --replicas=2
kubectl scale deployment checkout -n prod-sre --replicas=1
kubectl scale deployment catalog -n prod-sre --replicas=1
kubectl scale deployment cart -n prod-sre --replicas=1
kubectl scale deployment auth -n prod-sre --replicas=1

Pausing Everything (Without Deleting)

# Scale all workloads to 0
kubectl scale deployment -n prod-sre --replicas=0 --all

# Note: EKS nodes still run and cost money.
# For real savings, delete the cluster (Section 7).

Monitoring Agent Behavior

Watch what the SRE agent is doing in real-time:

# Check current replica counts per deployment
kubectl get deployments -n prod-sre

# List HorizontalPodAutoscalers, if any are defined
kubectl get hpa -A

# Check node pressure (requires metrics-server)
kubectl top nodes
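
To spot an agent that has pinned deployments at the ceiling, you can feed the output of `kubectl get deployments -n prod-sre -o json` into a small Python check. This is a sketch with a hypothetical function name; it relies only on the standard `items[].metadata.name` and `items[].spec.replicas` fields of the Deployment list.

```python
import json


def at_or_above_cap(deployments_json, max_replicas):
    """Return {deployment_name: replicas} for deployments at or over the cap."""
    items = json.loads(deployments_json).get("items", [])
    flagged = {}
    for item in items:
        replicas = item.get("spec", {}).get("replicas", 0)
        if replicas >= max_replicas:
            flagged[item["metadata"]["name"]] = replicas
    return flagged
```

Several deployments sitting at the cap at once is a hint to review the agent's recent actions before the node group grows further.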

Checking Current Spend

# Current month cost by service
aws ce get-cost-and-usage \
  --time-period Start=$(date +%Y-%m-01),End=$(date +%Y-%m-%d) \
  --granularity MONTHLY \
  --metrics UnblendedCost \
  --group-by Type=DIMENSION,Key=SERVICE
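
Cost Explorer treats the End date as exclusive and requires Start to be earlier than End, which matters on the first day of a month. The Python sketch below computes a month-to-date period that includes today; the function name is hypothetical, and it assumes the API accepts an End date one day in the future.

```python
from datetime import date, timedelta


def month_to_date_period(today):
    """Return (start, end) ISO dates; end is exclusive, so use tomorrow."""
    start = today.replace(day=1)
    end = today + timedelta(days=1)
    return start.isoformat(), end.isoformat()
```

The two values can be passed as `Start=` and `End=` in the `--time-period` argument.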

Regenerating Kubeconfig

If the EKS cluster is recreated or credentials expire:

./deploy/aws/generate-kubeconfig.sh
# Re-upload the base64-encoded content to HF Spaces secret KUBECONFIG_CONTENT

7. Teardown & Cost Recovery

Partial Teardown (Keep Cluster, Stop Workloads)

kubectl scale deployment -n prod-sre --replicas=0 --all
# Still paying for EKS control plane ($73/month) and idle nodes

Full Teardown (Stop All Charges)

# Delete workloads
kubectl delete -f deploy/aws/k8s-workloads.yaml

# Delete Prometheus agent
helm uninstall prometheus-agent -n monitoring
kubectl delete namespace monitoring

# Delete AMP workspace
AMP_WS_ID=$(aws amp list-workspaces --alias antiatropos-metrics --region ap-south-1 --query 'workspaces[0].workspaceId' --output text)
aws amp delete-workspace --workspace-id $AMP_WS_ID --region ap-south-1

# Delete AMG workspace
AMG_WS_ID=$(aws grafana list-workspaces --region ap-south-1 --query "workspaces[?name=='antiatropos-dashboards'].id | [0]" --output text)
aws grafana delete-workspace --workspace-id "$AMG_WS_ID" --region ap-south-1

# Delete IAM role for Grafana
# (AmazonPrometheusRemoteWriteAccess was attached to the IRSA role from Step 4, not this one)
aws iam detach-role-policy --role-name AntiAtroposGrafanaRole --policy-arn arn:aws:iam::aws:policy/AmazonPrometheusQueryAccess
aws iam delete-role --role-name AntiAtroposGrafanaRole

# Delete the EKS cluster (10-15 min)
eksctl delete cluster --name antiatropos --region ap-south-1

# Verify nothing is left
aws eks list-clusters --region ap-south-1
aws amp list-workspaces --region ap-south-1

Also remove the KUBECONFIG_CONTENT secret and reset PROMETHEUS_URL to mock in your HF Space.

Quick Reference Card

Task	Command
Deploy AWS infra	`./deploy/aws/deploy.sh`
Check workloads	`kubectl get pods -n prod-sre`
Check monitoring	`kubectl get pods -n monitoring`
Scale a workload	`kubectl scale deployment/payments -n prod-sre --replicas=N`
Pause all workloads	`kubectl scale deployment -n prod-sre --replicas=0 --all`
Check AMP data	`awscurl --service aps "https://aps-workspaces.ap-south-1.amazonaws.com/workspaces/WS_ID/api/v1/query?query=up" --region ap-south-1`
Generate kubeconfig	`./deploy/aws/generate-kubeconfig.sh`
Nuke the cluster (AMP, AMG, IAM: see Section 7)	`eksctl delete cluster --name antiatropos --region ap-south-1`