# AntiAtropos AWS Deployment Guide Deploy the AWS infrastructure (EKS + AMP) that AntiAtropos on Hugging Face Spaces connects to. For FastAPI wiring with `aws` mode and laptop Grafana, see [deploy/aws/FASTAPI_AWS_MODE_GUIDE.md](deploy/aws/FASTAPI_AWS_MODE_GUIDE.md). ## Architecture ``` Hugging Face Spaces AWS Region (ap-south-1) ===================== ====================== ┌─────────────────────────┐ │ EKS Cluster │ ┌─────────────────┐ │ ├── Workload pods │ │ AntiAtropos │ PROMETHEUS_URL │ │ (payments, checkout │ │ FastAPI Server │───────────────────>│ │ catalog, cart, auth)│ │ (port 7860) │ (HTTPS + SigV4) │ ├── Prometheus Agent │ │ │ │ │ (scrapes workloads, │ │ │ KUBECONFIG │ │ remote-writes AMP) │ │ │───────────────────>│ ├── Grafana │ │ │ (EKS API server) │ │ (self-hosted, │ │ │ │ │ dashboards) │ │ │ │ └── Monitoring ns │ │ │ └─────────────────────────┘ │ │ ┌─────────────────────────┐ │ │ │ Amazon Managed │ │ │ │ Prometheus (AMP) │ │ │ │ Workspace: antiatropos │ │ │ └─────────────────────────┘ └─────────────────┘ ``` **Key principle: FastAPI runs on HF Spaces. AWS runs K8s workloads + AMP + self-hosted Grafana.** --- ## Phase 0: Prerequisites ```bash # AWS CLI v2 curl "https://awscli.amazonaws.com/AWSCLIV2.msi" -o "AWSCLIV2.msi" msiexec /i AWSCLIV2.msi # eksctl choco install eksctl # kubectl choco install kubernetes-cli # Helm choco install kubernetes-helm # Authenticate aws configure ``` --- ## Phase 1: Create the EKS Cluster (15 min) ```bash eksctl create cluster -f deploy/aws/eksctl-cluster.yaml # Verify aws eks update-kubeconfig --name antiatropos --region ap-south-1 kubectl get nodes ``` --- ## Phase 2: Deploy Sample Workloads on EKS These are the microservice deployments the SRE agent will scale up/down: ```bash kubectl apply -f deploy/aws/k8s-workloads.yaml ``` This creates 5 deployments in the `prod-sre` namespace: - `payments` (node-0, VIP) — 2 replicas - `checkout` (node-1) — 1 replica - `catalog` (node-2) — 1 replica - `cart` (node-3) — 1 replica - `auth` (node-4) — 1 replica Verify: ```bash kubectl get pods -n prod-sre ``` --- ## Phase 3: Set Up Amazon Managed Prometheus (AMP) ### Create AMP Workspace ```bash aws amp create-workspace \ --alias antiatropos-metrics \ --region ap-south-1 # Note the workspace ID aws amp list-workspaces --alias antiatropos-metrics --region ap-south-1 ``` ### Set Up IRSA for Prometheus Agent ```bash eksctl create iamserviceaccount \ --cluster antiatropos \ --namespace monitoring \ --name prometheus-sa \ --attach-policy-arn arn:aws:iam::aws:policy/AmazonPrometheusRemoteWriteAccess \ --approve \ --override-existing-serviceaccounts ``` ### Install Prometheus Agent on EKS The agent scrapes workload pods and remote-writes metrics to AMP: ```bash helm repo add prometheus-community https://prometheus-community.github.io/helm-charts helm repo update # Replace WORKSPACE_ID with your AMP workspace ID helm install prometheus-agent prometheus-community/prometheus \ --namespace monitoring --create-namespace \ -f deploy/aws/prometheus-agent-values.yaml \ --set prometheus.prometheusSpec.remoteWrite[0].url="https://aps-workspaces.ap-south-1.amazonaws.com/workspaces/WORKSPACE_ID/api/v1/remote_write" ``` ### Verify AMP is Receiving Data ```bash pip install awscurl awscurl --service aps "https://aps-workspaces.ap-south-1.amazonaws.com/workspaces/WORKSPACE_ID/api/v1/query?query=up" --region ap-south-1 ``` --- ## Phase 4 (Optional): Set Up Self-Hosted Grafana on EKS If you are on free-tier nodes, skip this section and run Grafana locally on your laptop. ### Install Grafana ```bash helm repo add grafana https://grafana.github.io/helm-charts helm repo update helm install grafana grafana/grafana \ --namespace monitoring \ -f deploy/aws/grafana-values.yaml ``` ### Create Dashboard Secret ```bash kubectl create secret generic antiatropos-grafana-dashboards \ --from-file=antiatropos-overview.json=deploy/grafana/provisioning/dashboards/json/antiatropos-overview.json \ --from-file=antiatropos-live.json=deploy/grafana/provisioning/dashboards/json/antiatropos-live.json \ --namespace monitoring \ --dry-run=client -o yaml | kubectl apply -f - ``` ### Access Grafana ```bash kubectl port-forward svc/grafana 3000 -n monitoring ``` Open `http://localhost:3000` in your browser: - Username: `admin` - Password: `antiatropos` The data source `AMP-Local` is pre-configured to use the local Prometheus agent, and dashboards are auto-imported from the secret. --- ## Phase 5: Generate Kubeconfig for HF Spaces The AntiAtropos server on HF Spaces needs a kubeconfig to talk to EKS: ```bash ./deploy/aws/generate-kubeconfig.sh ``` This outputs `deploy/aws/kubeconfig-antiatropos.yaml`. You'll set this as a secret on HF Spaces. --- ## Phase 6: Configure HF Spaces Environment Variables Set these in your HF Space (Settings → Repository secrets and Variables): ### Secrets | Secret | Value | |---|---| | `OPENAI_API_KEY` | Your OpenAI API key | | `KUBECONFIG_CONTENT` | Full content of `kubeconfig-antiatropos.yaml`, base64-encoded | ### Environment Variables | Variable | Value | |---|---| | `ANTIATROPOS_ENV_MODE` | `aws` | | `ANTIATROPOS_STRICT_REAL` | `false` | | `PROMETHEUS_URL` | `https://aps-workspaces.ap-south-1.amazonaws.com/workspaces/WORKSPACE_ID` | | `KUBECONFIG` | `/app/kubeconfig.yaml` | | `ANTIATROPOS_K8S_NAMESPACE` | `prod-sre` | | `ANTIATROPOS_MAX_REPLICAS` | `6` | | `ANTIATROPOS_MIN_REPLICAS` | `1` | | `ANTIATROPOS_SCALE_STEP` | `3` | | `ANTIATROPOS_PROM_TIMEOUT_S` | `5.0` | | `ANTIATROPOS_METRIC_AGGREGATION` | `sum` | | `ANTIATROPOS_WORKLOAD_MAP` | See below | ### Workload Map ```json { "node-0": {"deployment": "payments", "namespace": "prod-sre"}, "node-1": {"deployment": "checkout", "namespace": "prod-sre"}, "node-2": {"deployment": "catalog", "namespace": "prod-sre"}, "node-3": {"deployment": "cart", "namespace": "prod-sre"}, "node-4": {"deployment": "auth", "namespace": "prod-sre"} } ``` ### Entrypoint Addition Add this to `deploy/entrypoint.sh` before starting uvicorn, so the kubeconfig is decoded from the HF secret: ```bash # Decode kubeconfig from HF Spaces secret if [ -n "${KUBECONFIG_CONTENT:-}" ]; then echo "${KUBECONFIG_CONTENT}" | base64 -d > /app/kubeconfig.yaml export KUBECONFIG=/app/kubeconfig.yaml fi ``` ### FastAPI Reset Mode Use `mode="aws"` on environment reset for AWS-backed execution. If omitted, the server will use `ANTIATROPOS_ENV_MODE`. --- ## Local Grafana (Recommended on Free Tier) Grafana is only for observability dashboards. Agent action execution stays in FastAPI + Kubernetes executor. Start Grafana locally: ```bash docker run -d --name antiatropos-grafana -p 3000:3000 grafana/grafana:latest ``` Then in Grafana: 1. Add Prometheus datasource using AMP workspace URL: - `https://aps-workspaces..amazonaws.com/workspaces/` 2. Enable SigV4 auth and set the same AWS region. 3. Import dashboards: - [deploy/grafana/provisioning/dashboards/json/antiatropos-overview.json](deploy/grafana/provisioning/dashboards/json/antiatropos-overview.json) - [deploy/grafana/provisioning/dashboards/json/antiatropos-live.json](deploy/grafana/provisioning/dashboards/json/antiatropos-live.json) --- ## Phase 7: Install Cluster Autoscaler So EKS can add nodes when the agent scales workloads: ```bash helm repo add autoscaler https://kubernetes.github.io/autoscaler helm repo update helm install cluster-autoscaler autoscaler/cluster-autoscaler \ --namespace kube-system \ -f deploy/aws/cluster-autoscaler-values.yaml ``` The node group `maxSize: 4` in `eksctl-cluster.yaml` caps your compute cost. --- ## Cost Estimates | Resource | Config | Monthly Cost (approx) | |---|---|---| | EKS Control Plane | 1 cluster | $73 | | EKS Nodes | 2x t3.medium | $60 | | AMP | <10GB ingest | ~$3-5 | | EBS Volume (Grafana) | 5Gi | ~$0.50 | | **Total** | | **~$135-145/month** | | HF Spaces | Free tier or $5/mo | (separate billing) | No ECR, no ALB, no server pods on AWS — cheaper than running everything on AWS. ### Cost-Saving Tips - Use spot instances for node groups (60-70% cheaper) - Scale workloads to zero between runs: `kubectl scale deployment -n prod-sre --replicas=0 --all` - Delete the cluster between training runs: `eksctl delete cluster --name antiatropos` - AMP free tier covers first 10GB ingest/month - Grafana is self-hosted (free, runs on EKS) --- ## Teardown ```bash # Delete workloads kubectl delete -f deploy/aws/k8s-workloads.yaml # Delete Grafana helm uninstall grafana -n monitoring # Delete Prometheus agent helm uninstall prometheus-agent -n monitoring kubectl delete namespace monitoring # Delete dashboard secret kubectl delete secret antiatropos-grafana-dashboards -n monitoring 2>/dev/null || true # Delete AMP workspace AMP_WS_ID=$(aws amp list-workspaces --alias antiatropos-metrics --region ap-south-1 --query 'workspaces[0].workspaceId' --output text) aws amp delete-workspace --workspace-id $AMP_WS_ID --region ap-south-1 # Delete the EKS cluster (10-15 min) eksctl delete cluster --name antiatropos --region ap-south-1 ``` --- ## Troubleshooting ### HF Spaces can't reach AMP - Verify `PROMETHEUS_URL` includes the full workspace path - AMP requires SigV4 auth — ensure `requests-aws4auth` is in your dependencies - Set `ANTIATROPOS_PROM_TIMEOUT_S=5.0` (cross-network latency) ### HF Spaces can't reach EKS - Verify `KUBECONFIG` path and the file is decoded properly - Check the EKS API server endpoint is public (default) - Verify the IAM user in the kubeconfig has EKS access - Test locally: `kubectl --kubeconfig=kubeconfig-antiatropos.yaml get nodes` ### AMP not receiving metrics ```bash kubectl logs -n monitoring -l app.kubernetes.io/name=prometheus ``` ### Grafana shows no data 1. Verify the `AMP-Local` data source is configured: `http://prometheus-agent-server.monitoring.svc.cluster.local:80` 2. Check time range (AMP default retention is 30 days) 3. Verify PromQL queries match your metric names 4. Check Grafana logs: `kubectl logs -n monitoring -l app.kubernetes.io/name=grafana` 5. Verify dashboards secret exists: `kubectl get secret antiatropos-grafana-dashboards -n monitoring`