AntiAtropos AWS Deployment Guide

Deploy the AWS infrastructure (EKS + AMP) that AntiAtropos on Hugging Face Spaces connects to.

For FastAPI wiring with aws mode and laptop Grafana, see deploy/aws/FASTAPI_AWS_MODE_GUIDE.md.

Architecture

Hugging Face Spaces                    AWS Region (ap-south-1)
=====================                  ======================
                                       ┌────────────────────────────┐
                                       │ EKS Cluster                │
┌─────────────────┐                    │  ├── Workload pods         │
│ AntiAtropos     │  PROMETHEUS_URL    │  │   (payments, checkout,  │
│ FastAPI Server  │───────────────────>│  │    catalog, cart, auth) │
│ (port 7860)     │  (HTTPS + SigV4)   │  ├── Prometheus Agent      │
│                 │                    │  │   (scrapes workloads,   │
│                 │  KUBECONFIG        │  │    remote-writes AMP)   │
│                 │───────────────────>│  ├── Grafana               │
│                 │  (EKS API server)  │  │   (self-hosted,         │
│                 │                    │  │    dashboards)          │
│                 │                    │  └── Monitoring ns         │
│                 │                    └────────────────────────────┘
│                 │                    ┌────────────────────────────┐
│                 │                    │ Amazon Managed             │
│                 │                    │ Prometheus (AMP)           │
│                 │                    │  Workspace: antiatropos    │
│                 │                    └────────────────────────────┘
└─────────────────┘

Key principle: FastAPI runs on HF Spaces. AWS runs K8s workloads + AMP + self-hosted Grafana.

Phase 0: Prerequisites

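# The installers in this phase assume Windows with Chocolatey.
# Later phases use bash syntax, so run them from Git Bash or WSL.

# AWS CLI v2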
curl "https://awscli.amazonaws.com/AWSCLIV2.msi" -o "AWSCLIV2.msi"
msiexec /i AWSCLIV2.msi

# eksctl
choco install eksctl

# kubectl
choco install kubernetes-cli

# Helm
choco install kubernetes-helm

# Authenticate
aws configure

Phase 1: Create the EKS Cluster (15 min)

eksctl create cluster -f deploy/aws/eksctl-cluster.yaml

# Verify
aws eks update-kubeconfig --name antiatropos --region ap-south-1
kubectl get nodes

Phase 2: Deploy Sample Workloads on EKS

These are the microservice deployments the SRE agent will scale up/down:

kubectl apply -f deploy/aws/k8s-workloads.yaml

This creates 5 deployments in the prod-sre namespace:

payments (node-0, VIP) — 2 replicas
checkout (node-1) — 1 replica
catalog (node-2) — 1 replica
cart (node-3) — 1 replica
auth (node-4) — 1 replica

Verify:

kubectl get pods -n prod-sre
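
The replica counts listed above can also be checked mechanically. The following is a minimal sketch, not part of the repository: `check_replicas` and `EXPECTED` are hypothetical names, and the function compares the desired replica count (`.spec.replicas`) rather than pod readiness.

```shell
# Expected replica counts, taken from the list above (name:replicas).
EXPECTED="payments:2 checkout:1 catalog:1 cart:1 auth:1"

# Reads "NAME REPLICAS" lines on stdin and reports every deployment whose
# replica count differs from EXPECTED. Returns non-zero on any mismatch.
check_replicas() {
    actual=$(cat)
    rc=0
    for pair in $EXPECTED; do
        name=${pair%%:*}
        want=${pair##*:}
        got=$(printf '%s\n' "$actual" | awk -v n="$name" '$1 == n { print $2 }')
        if [ "$got" != "$want" ]; then
            echo "MISMATCH $name: want $want, got ${got:-none}"
            rc=1
        fi
    done
    return $rc
}

# Usage against the live cluster:
#   kubectl get deployments -n prod-sre --no-headers \
#     -o custom-columns=NAME:.metadata.name,REPLICAS:.spec.replicas | check_replicas
```

A silent exit with status 0 means all five deployments match the expected counts.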

Phase 3: Set Up Amazon Managed Prometheus (AMP)

Create AMP Workspace

aws amp create-workspace \
  --alias antiatropos-metrics \
  --region ap-south-1

# Note the workspace ID
aws amp list-workspaces --alias antiatropos-metrics --region ap-south-1
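
The workspace ID appears in several URLs later in this guide, so it may help to capture it once. This is a sketch under stated assumptions: `amp_url` and `REGION` are hypothetical helper names, and the `--query` expression is the same one the Teardown section uses.

```shell
REGION="ap-south-1"

# Builds an AMP endpoint URL for a workspace.
# $1 = workspace ID, $2 = optional API path suffix (for example api/v1/remote_write).
amp_url() {
    base="https://aps-workspaces.${REGION}.amazonaws.com/workspaces/$1"
    if [ -n "${2:-}" ]; then
        printf '%s/%s\n' "$base" "$2"
    else
        printf '%s\n' "$base"
    fi
}

# Usage (requires AWS credentials):
#   AMP_WS_ID=$(aws amp list-workspaces --alias antiatropos-metrics \
#     --region "$REGION" --query 'workspaces[0].workspaceId' --output text)
#   amp_url "$AMP_WS_ID" api/v1/remote_write   # remote-write URL for the Helm install
#   amp_url "$AMP_WS_ID"                       # value for PROMETHEUS_URL
```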

Set Up IRSA for Prometheus Agent

eksctl create iamserviceaccount \
  --cluster antiatropos \
  --namespace monitoring \
  --name prometheus-sa \
  --attach-policy-arn arn:aws:iam::aws:policy/AmazonPrometheusRemoteWriteAccess \
  --approve \
  --override-existing-serviceaccounts

Install Prometheus Agent on EKS

The agent scrapes workload pods and remote-writes metrics to AMP:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Replace WORKSPACE_ID with your AMP workspace ID.
# The prometheus chart reads remote-write settings from server.remoteWrite.
# The --set argument is quoted so the shell does not treat [0] as a glob pattern.
helm install prometheus-agent prometheus-community/prometheus \
  --namespace monitoring --create-namespace \
  -f deploy/aws/prometheus-agent-values.yaml \
  --set "server.remoteWrite[0].url=https://aps-workspaces.ap-south-1.amazonaws.com/workspaces/WORKSPACE_ID/api/v1/remote_write"

Verify AMP is Receiving Data

pip install awscurl
awscurl --service aps "https://aps-workspaces.ap-south-1.amazonaws.com/workspaces/WORKSPACE_ID/api/v1/query?query=up" --region ap-south-1
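
The query returns a Prometheus HTTP API response in JSON. A successful call with an empty result still means no metrics have arrived, so both fields are worth checking. The helper below is a rough sketch: `amp_has_data` is a hypothetical name, and it uses `grep` rather than a JSON parser, which is enough for this response shape but not a general solution.

```shell
# Reads a Prometheus HTTP API response on stdin. Succeeds only if the
# status is "success" and the result array is not empty.
amp_has_data() {
    body=$(cat)
    printf '%s' "$body" | grep -Eq '"status"[[:space:]]*:[[:space:]]*"success"' || return 1
    if printf '%s' "$body" | grep -Eq '"result"[[:space:]]*:[[:space:]]*\[[[:space:]]*\]'; then
        return 1   # query succeeded but returned no series
    fi
    return 0
}

# Usage:
#   awscurl --service aps --region ap-south-1 \
#     "https://aps-workspaces.ap-south-1.amazonaws.com/workspaces/WORKSPACE_ID/api/v1/query?query=up" \
#     | amp_has_data && echo "AMP is receiving data"
```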

Phase 4 (Optional): Set Up Self-Hosted Grafana on EKS

If you are on free-tier nodes, skip this section and run Grafana locally on your laptop.

Install Grafana

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

helm install grafana grafana/grafana \
  --namespace monitoring \
  -f deploy/aws/grafana-values.yaml

Create Dashboard Secret

kubectl create secret generic antiatropos-grafana-dashboards \
  --from-file=antiatropos-overview.json=deploy/grafana/provisioning/dashboards/json/antiatropos-overview.json \
  --from-file=antiatropos-live.json=deploy/grafana/provisioning/dashboards/json/antiatropos-live.json \
  --namespace monitoring \
  --dry-run=client -o yaml | kubectl apply -f -

Access Grafana

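# Forward local port 3000 to the Grafana service. The grafana chart's default
# service port is 80; if grafana-values.yaml sets service.port to 3000, use 3000:3000.
kubectl port-forward svc/grafana 3000:80 -n monitoring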

Open http://localhost:3000 in your browser:

Username: admin
Password: antiatropos

The data source AMP-Local is pre-configured to use the local Prometheus agent, and dashboards are auto-imported from the secret.

Phase 5: Generate Kubeconfig for HF Spaces

The AntiAtropos server on HF Spaces needs a kubeconfig to talk to EKS:

./deploy/aws/generate-kubeconfig.sh

This outputs deploy/aws/kubeconfig-antiatropos.yaml. You'll set this as a secret on HF Spaces.

Phase 6: Configure HF Spaces Environment Variables

Set these in your HF Space (Settings → Variables and secrets):

Secrets

| Secret | Value |
| --- | --- |
| `OPENAI_API_KEY` | Your OpenAI API key |
| `KUBECONFIG_CONTENT` | Full content of `kubeconfig-antiatropos.yaml`, base64-encoded |
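
The guide does not show how to produce the base64 value, so here is one possible way. `encode_kubeconfig` is a hypothetical helper. It strips line wrapping so the secret is a single line, which `base64 -d` in the entrypoint decodes the same way as wrapped input.

```shell
# Prints the file's contents as a single-line base64 string (no wrapping),
# suitable for pasting into the KUBECONFIG_CONTENT secret.
encode_kubeconfig() {
    base64 < "$1" | tr -d '\n'
}

# Usage:
#   encode_kubeconfig deploy/aws/kubeconfig-antiatropos.yaml
```

Piping through `tr` avoids depending on the `-w0` flag, which GNU `base64` accepts but some other implementations do not.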

Environment Variables

| Variable | Value |
| --- | --- |
| `ANTIATROPOS_ENV_MODE` | `aws` |
| `ANTIATROPOS_STRICT_REAL` | `false` |
| `PROMETHEUS_URL` | `https://aps-workspaces.ap-south-1.amazonaws.com/workspaces/WORKSPACE_ID` |
| `KUBECONFIG` | `/app/kubeconfig.yaml` |
| `ANTIATROPOS_K8S_NAMESPACE` | `prod-sre` |
| `ANTIATROPOS_MAX_REPLICAS` | `6` |
| `ANTIATROPOS_MIN_REPLICAS` | `1` |
| `ANTIATROPOS_SCALE_STEP` | `3` |
| `ANTIATROPOS_PROM_TIMEOUT_S` | `5.0` |
| `ANTIATROPOS_METRIC_AGGREGATION` | `sum` |
| `ANTIATROPOS_WORKLOAD_MAP` | See below |

Workload Map

{
  "node-0": {"deployment": "payments", "namespace": "prod-sre"},
  "node-1": {"deployment": "checkout", "namespace": "prod-sre"},
  "node-2": {"deployment": "catalog", "namespace": "prod-sre"},
  "node-3": {"deployment": "cart", "namespace": "prod-sre"},
  "node-4": {"deployment": "auth", "namespace": "prod-sre"}
}
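
Every deployment named in the map should exist in the cluster before the agent tries to scale it. As a rough sketch, the helper below extracts the deployment names so they can be checked with `kubectl`. `workload_map_deployments` and the file name `workload-map.json` are hypothetical, and the `grep`-based extraction assumes the simple layout shown above rather than arbitrary JSON.

```shell
# Prints the deployment names found in a workload-map JSON document on stdin,
# one per line. Relies on grep/cut instead of a JSON parser.
workload_map_deployments() {
    grep -o '"deployment"[[:space:]]*:[[:space:]]*"[^"]*"' | cut -d'"' -f4
}

# Usage (workload-map.json is a local copy of the JSON above):
#   for d in $(workload_map_deployments < workload-map.json); do
#       kubectl get deployment "$d" -n prod-sre >/dev/null || echo "missing: $d"
#   done
```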

Entrypoint Addition

Add this to deploy/entrypoint.sh before starting uvicorn, so the kubeconfig is decoded from the HF secret:

# Decode kubeconfig from HF Spaces secret
if [ -n "${KUBECONFIG_CONTENT:-}" ]; then
    echo "${KUBECONFIG_CONTENT}" | base64 -d > /app/kubeconfig.yaml
    export KUBECONFIG=/app/kubeconfig.yaml
fi

FastAPI Reset Mode

Use mode="aws" on environment reset for AWS-backed execution. If omitted, the server will use ANTIATROPOS_ENV_MODE.

Local Grafana (Recommended on Free Tier)

Grafana is only for observability dashboards. Agent action execution stays in FastAPI + Kubernetes executor.

Start Grafana locally:

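docker run -d --name antiatropos-grafana -p 3000:3000 \
  -e GF_AUTH_SIGV4_AUTH_ENABLED=true \
  grafana/grafana:latest

Then in Grafana:

Add a Prometheus data source using the AMP workspace URL:

https://aps-workspaces.<region>.amazonaws.com/workspaces/<WORKSPACE_ID>

Enable SigV4 auth, set the same AWS region, and supply AWS credentials that are allowed to query the workspace. The SigV4 option only appears when Grafana runs with GF_AUTH_SIGV4_AUTH_ENABLED=true.
Import the dashboards. The JSON files used in Phase 4 are deploy/grafana/provisioning/dashboards/json/antiatropos-overview.json and deploy/grafana/provisioning/dashboards/json/antiatropos-live.json.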
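
Instead of adding the data source by hand, Grafana can load it from a provisioning file at startup. The sketch below is one possible setup, not something the repository ships: `write_amp_datasource`, the data source name `AMP`, and the file name `amp-datasource.yaml` are all placeholders. It assumes AWS credentials are available to the container through the default credential chain.

```shell
# Writes a Grafana data source provisioning file for an AMP workspace.
# $1 = AWS region, $2 = AMP workspace ID, $3 = output file path.
write_amp_datasource() {
    cat > "$3" <<EOF
apiVersion: 1
datasources:
  - name: AMP
    type: prometheus
    access: proxy
    url: https://aps-workspaces.$1.amazonaws.com/workspaces/$2
    isDefault: true
    jsonData:
      httpMethod: POST
      sigV4Auth: true
      sigV4AuthType: default
      sigV4Region: $1
EOF
}

# Usage (passes AWS credentials from the host environment into the container):
#   write_amp_datasource ap-south-1 WORKSPACE_ID ./amp-datasource.yaml
#   docker run -d --name antiatropos-grafana -p 3000:3000 \
#     -e GF_AUTH_SIGV4_AUTH_ENABLED=true \
#     -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY \
#     -v "$PWD/amp-datasource.yaml:/etc/grafana/provisioning/datasources/amp.yaml:ro" \
#     grafana/grafana:latest
```

If the imported dashboards reference a data source by a specific name, the `name` field may need to be changed to match.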

Phase 7: Install Cluster Autoscaler

So EKS can add nodes when the agent scales workloads:

helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm repo update

helm install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  -f deploy/aws/cluster-autoscaler-values.yaml

The node group maxSize: 4 in eksctl-cluster.yaml caps your compute cost.

Cost Estimates

| Resource | Config | Monthly Cost (approx) |
| --- | --- | --- |
| EKS Control Plane | 1 cluster | $73 |
| EKS Nodes | 2x t3.medium | $60 |
| AMP | <10GB ingest | ~$3-5 |
| EBS Volume (Grafana) | 5Gi | ~$0.50 |
| Total | | ~$135-145/month |
| HF Spaces | Free tier or $5/mo | (separate billing) |

No ECR, no ALB, no server pods on AWS — cheaper than running everything on AWS.
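
For reference, the AWS line items in the table can be summed directly. This small sketch uses a hypothetical `monthly_total` helper and takes the AMP estimate as its only input, since that is the one figure given as a range.

```shell
# Sums the monthly AWS line items from the table above (USD):
# control plane 73 + nodes 60 + AMP (argument) + EBS 0.50.
monthly_total() {
    awk -v amp="$1" 'BEGIN { printf "%.2f\n", 73 + 60 + amp + 0.50 }'
}

monthly_total 3   # prints 136.50 (low end of the AMP estimate)
monthly_total 5   # prints 138.50 (high end of the AMP estimate)
```

The listed items come to roughly $137-139, so the table's ~$135-145 total presumably includes some headroom for usage that varies from month to month.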

Cost-Saving Tips

Use spot instances for node groups (60-70% cheaper)
Scale workloads to zero between runs: kubectl scale deployment -n prod-sre --replicas=0 --all
Delete the cluster between training runs: eksctl delete cluster --name antiatropos --region ap-south-1
AMP free tier covers first 10GB ingest/month
Grafana is self-hosted (free, runs on EKS)

Teardown

# Delete workloads
kubectl delete -f deploy/aws/k8s-workloads.yaml

# Delete Grafana (skip if you did not install it in Phase 4)
helm uninstall grafana -n monitoring

# Delete dashboard secret (while the monitoring namespace still exists)
kubectl delete secret antiatropos-grafana-dashboards -n monitoring 2>/dev/null || true

# Delete Prometheus agent, then the monitoring namespace
helm uninstall prometheus-agent -n monitoring
kubectl delete namespace monitoring

# Delete AMP workspace
AMP_WS_ID=$(aws amp list-workspaces --alias antiatropos-metrics --region ap-south-1 --query 'workspaces[0].workspaceId' --output text)
aws amp delete-workspace --workspace-id "$AMP_WS_ID" --region ap-south-1

# Delete the EKS cluster (10-15 min)
eksctl delete cluster --name antiatropos --region ap-south-1

Troubleshooting

HF Spaces can't reach AMP

Verify PROMETHEUS_URL includes the full workspace path
AMP requires SigV4 auth — ensure requests-aws4auth is in your dependencies
Set ANTIATROPOS_PROM_TIMEOUT_S=5.0 (cross-network latency)
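
The first check can be automated with a pattern match. The sketch below assumes `PROMETHEUS_URL` should have exactly the form shown in the environment variable table, with no trailing slash or API path. `is_amp_workspace_url` is a hypothetical helper, and the character set accepted for the workspace ID is an assumption.

```shell
# Succeeds if the URL has the form
#   https://aps-workspaces.<region>.amazonaws.com/workspaces/<workspace-id>
# with no trailing slash or API path.
is_amp_workspace_url() {
    printf '%s\n' "$1" | grep -Eq '^https://aps-workspaces\.[a-z0-9-]+\.amazonaws\.com/workspaces/[A-Za-z0-9-]+$'
}

# Usage:
#   is_amp_workspace_url "$PROMETHEUS_URL" || echo "PROMETHEUS_URL looks wrong"
```

Because underscores are not accepted, the check also fails when the `WORKSPACE_ID` placeholder was never replaced.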

HF Spaces can't reach EKS

Verify KUBECONFIG path and the file is decoded properly
Check the EKS API server endpoint is public (default)
Verify the IAM user in the kubeconfig has EKS access
Test locally: kubectl --kubeconfig=kubeconfig-antiatropos.yaml get nodes

AMP not receiving metrics

kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus

Grafana shows no data

Verify the AMP-Local data source is configured: http://prometheus-agent-server.monitoring.svc.cluster.local:80
Check time range (AMP default retention is 150 days)
Verify PromQL queries match your metric names
Check Grafana logs: kubectl logs -n monitoring -l app.kubernetes.io/name=grafana
Verify dashboards secret exists: kubectl get secret antiatropos-grafana-dashboards -n monitoring