| # AntiAtropos Architecture Guide |
|
|
| A complete explanation of how AntiAtropos works across Hugging Face Spaces and AWS, written for someone who is technically strong but new to Kubernetes. |
|
|
| --- |
|
|
| ## The Big Picture |
|
|
| AntiAtropos trains AI agents to be Site Reliability Engineers (SREs). An SRE agent watches a simulated microservice cluster and decides when to scale services, reroute traffic, or shed load to keep things running smoothly. |
|
|
| The system is split across two platforms: |
|
|
| ``` |
| Hugging Face Spaces AWS |
| ===================== ====================== |
| The "brain" The "muscle" |
| |
| AntiAtropos FastAPI server EKS (Kubernetes cluster) |
| - Runs the simulator - Runs the actual microservice pods |
| - Runs the SRE agent logic - The agent scales these pods |
| - Queries Prometheus for metrics - Prometheus Agent scrapes metrics |
| - Sends scale commands to K8s - Metrics flow to AMP |
| - Grafana (AMG) visualizes it all |
| ``` |
|
|
| Why split? HF Spaces is free/cheap for running the Python server. AWS EKS is where the real infrastructure lives that the agent practices on. |
|
|
| --- |
|
|
| ## Kubernetes Concepts You Need |
|
|
| ### Pod |
|
|
| The smallest unit in Kubernetes. A pod is one or more containers that run together. In our case, each pod runs a single nginx container that simulates a microservice (like "payments" or "checkout"). |
|
|
| Think of it as: one running instance of a service. |
|
|
| ### Deployment |
|
|
| A Deployment is a recipe that tells Kubernetes "keep N copies of this pod running at all times." If a pod dies, the Deployment automatically replaces it. |
|
|
| The key field is `spec.replicas` — this is the number the SRE agent changes when it scales a service up or down. |
|
|
| ``` |
| Deployment: payments |
| replicas: 3 <-- the agent changes this number |
| | |
| +-- Pod: payments-abc123 (running) |
| +-- Pod: payments-def456 (running) |
| +-- Pod: payments-ghi789 (running) |
| ``` |
|
|
| **The agent scales replicas, not pods.** When it sets `replicas: 5`, Kubernetes creates 5 pods. When it sets `replicas: 2`, Kubernetes kills 3 pods. |
|
|
| ### Service |
|
|
| A Service gives pods a stable network name. Instead of connecting to `payments-abc123` directly (which changes when the pod is recreated), you connect to `payments` (the Service), which routes to whichever pods are healthy. |
|
|
| ### Namespace |
|
|
| A namespace is a folder for organizing resources. We use: |
| - `prod-sre` — where the 5 microservice Deployments live |
| - `monitoring` — where the Prometheus Agent pod lives |
| - `kube-system` — where AWS/EKS system pods live |
|
|
| ### Node |
|
|
| A node is one EC2 virtual machine in the EKS cluster. Our cluster has 2-4 nodes. Each node runs multiple pods. When all nodes are full and the agent wants to scale up, Kubernetes adds more nodes (up to `maxSize: 4` in our config). |
|
|
| ``` |
| EKS Cluster |
| Node 1 (t3.medium - 4 vCPU, 8GB RAM) |
| Pod: payments-abc123 |
| Pod: checkout-def456 |
| Pod: catalog-ghi789 |
| Pod: prometheus-agent-xyz |
| Node 2 (t3.medium - 4 vCPU, 8GB RAM) |
| Pod: payments-jkl012 <-- agent scaled payments from 1 to 2 |
| Pod: cart-mno345 |
| Pod: auth-pqr678 |
| ``` |
|
|
| ### ResourceQuota |
|
|
| A hard limit on how many resources a namespace can use. We set one on `prod-sre` that caps total pods at 30. This is a safety net — even if the Python code cap fails, Kubernetes itself will refuse to create more than 30 pods. |
|
|
| --- |
|
|
| ## How the SRE Agent Works |
|
|
| ### The Loop |
|
|
| Every "tick" (one step of the simulation), the agent goes through this cycle: |
|
|
| ``` |
| 1. OBSERVE -- Read telemetry (CPU, latency, queue depth) from Prometheus |
| 2. DECIDE -- Choose an action (SCALE_UP, SCALE_DOWN, REROUTE_TRAFFIC, SHED_LOAD, NO_OP) |
| 3. ACT -- Send the action to KubernetesExecutor |
| 4. REWARD -- Compute Lyapunov stability reward (was the cluster more or less stable?) |
| 5. REPEAT |
| ``` |
|
|
| ### How Each Action Works |
|
|
| | Action | What the Agent Decides | What Happens on EKS | |
| |---|---|---| |
| | `SCALE_UP` | "node-0 needs more capacity" | `KubernetesExecutor` patches `payments` Deployment: `replicas: 2 -> 5` | |
| | `SCALE_DOWN` | "node-3 is over-provisioned" | `KubernetesExecutor` patches `cart` Deployment: `replicas: 4 -> 1` | |
| | `REROUTE_TRAFFIC` | "Move traffic away from node-2" | `KubernetesExecutor` scales DOWN target deployment and redistributes replicas to healthy peer deployments | |
| | `SHED_LOAD` | "Drop 50% of traffic to node-3" | `KubernetesExecutor` scales DOWN target deployment by `parameter * current_replicas` | |
| | `NO_OP` | "Do nothing this tick" | Nothing changes on EKS | |
|
|
| ### The SCALE_UP Flow in Detail |
| |
| Here is exactly what happens when the agent decides to scale up `node-0` (the payments service): |
| |
| ``` |
| HF Spaces AWS EKS |
| ---------- -------- |
| |
| Agent: "SCALE_UP, node-0, parameter=0.5" |
| | |
| v |
| AntiAtroposEnvironment.step() |
| | |
| v |
| KubernetesExecutor.execute_with_metadata() |
| | |
| v |
| _load_node_workload_map() |
| reads: node-0 -> {"deployment": "payments", "namespace": "prod-sre"} |
| | |
| v |
| _scale_deployment("SCALE_UP", "node-0", 0.5) |
| | |
| +-- 1. Read current replicas: apps_v1.read_namespaced_deployment_scale("payments", "prod-sre") |
| | Current replicas = 2 |
| | |
| +-- 2. Calculate delta: max(1, int(0.5 * 3)) = 1 |
| | Desired = min(6, 2 + 1) = 3 <-- max_replicas cap from env var |
| | |
| +-- 3. Patch: apps_v1.patch_namespaced_deployment_scale("payments", "prod-sre", |
| | body={"spec": {"replicas": 3}}) |
| | |
| v +---------------------------+ |
| Returns: "Ack: SCALE_UP for node-0 - | K8s creates 1 new pod: | |
| deployment payments in namespace | payments-newpod-xyz | |
| prod-sre scaled 2->3" +---------------------------+ |
| ``` |
| |
| ### The Telemetry Flow in Detail |
| |
| How the agent reads metrics from the real cluster: |
| |
| ``` |
| EKS Cluster AMP HF Spaces |
| ----------- --- ---------- |
| |
| Workload pods AMP Workspace AntiAtropos |
| (payments, checkout...) stores all metrics PrometheusClient |
| | ^ | |
| | /metrics (scraped every 15s) | | |
| v | | |
| Prometheus Agent | | |
| | | | |
| | remote-write (SigV4 auth) | | |
| +-------------------------------------------> | |
| | | |
| | HTTPS query | |
| +------------------------> |
| (PROMETHEUS_URL env var) |
| | |
| v |
| _fetch_real_metrics() |
| runs PromQL like: |
| sum(rate(http_requests_total[1m])) by (pod) |
| returns: TelemetryRecord for each node |
| ``` |
| |
| --- |
|
|
| ## The Three Layers of Scaling Caps |
|
|
| This is the most important thing to understand for cost control. There are **three** independent limits: |
|
|
| ### Layer 1: Python Code Cap (Soft) |
|
|
| **Where:** `ANTIATROPOS_MAX_REPLICAS` env var on HF Spaces, read by `kubernetes_executor.py` line 18. |
|
|
| **How it works:** The `_scale_deployment()` method calculates `desired = min(self.max_replicas, current + delta)`. If the agent tries to scale above 6, it gets: |
|
|
| ``` |
| Ack: SCALE_UP for node-0 - replicas unchanged at 6 (bounds 1-6) |
| ``` |
|
|
| **Can it be bypassed?** Yes. A bug in the code, or someone running `kubectl scale deployment payments --replicas=50` directly. |
|
|
| **Set to:** `6` on HF Spaces. |
|
|
| ### Layer 2: Kubernetes ResourceQuota (Hard) |
|
|
| **Where:** `k8s-workloads.yaml` — ResourceQuota on the `prod-sre` namespace. |
|
|
| **How it works:** Kubernetes itself refuses to schedule pods that would exceed the quota. If the namespace already has 30 pods and something tries to create a 31st: |
|
|
| ``` |
| Error from server (Forbidden): pods "payments-new" is forbidden: |
| exceeded quota: prod-sre-quota, requested: pods=1, used: pods=30, limited: pods=30 |
| ``` |
|
|
| **Can it be bypassed?** Only by someone with cluster-admin access who deletes or edits the ResourceQuota. |
|
|
| **Set to:** 30 pods total, 8 CPU, 8GB RAM. |
|
|
| ### Layer 3: EKS Node Group Max Size (Hard) |
|
|
| **Where:** `eksctl-cluster.yaml` — `managedNodeGroups[0].maxSize: 4`. |
|
|
| **How it works:** The Cluster Autoscaler will never add more than 4 nodes. Even if there are 100 pending pods, it stops at 4 nodes. Pending pods just wait. |
|
|
| **Can it be bypassed?** Only by someone editing the node group in the AWS console. |
|
|
| **Set to:** 4 nodes (4 x t3.medium = 8 vCPU, 16GB RAM max). |
|
|
| ### How the Three Layers Work Together |
|
|
| ``` |
| Agent wants to scale all 5 deployments to 20 replicas each: |
| |
| Layer 1 (Python cap): 6 replicas max per deployment -> agent gets "unchanged at 6" |
| 5 x 6 = 30 pods maximum |
| |
| Layer 2 (ResourceQuota): 30 pods max in namespace -> 31st pod is Forbidden |
| |
| Layer 3 (Node group): 4 nodes max -> if 30 pods don't fit on 4 nodes, |
| some stay Pending (no cost) |
| |
| Worst case with all caps: 30 pods on 4 nodes = ~$160/month |
| Without any caps: 100 pods on 25 nodes = ~$1,800/month |
| ``` |
|
|
| --- |
|
|
| ## The Mapping: Simulator Nodes to Real Deployments |
|
|
| The simulator has 5 abstract nodes (node-0 through node-4). The `ANTIATROPOS_WORKLOAD_MAP` env var tells the system which K8s Deployment each simulator node maps to: |
|
|
| ``` |
| Simulator Node K8s Deployment Namespace Notes |
| ------------- --------------- --------- ----- |
| node-0 payments prod-sre VIP (4x importance weight) |
| node-1 checkout prod-sre Critical (no SHED_LOAD) |
| node-2 catalog prod-sre Critical (no SHED_LOAD) |
| node-3 cart prod-sre Non-critical (sheddable) |
| node-4 auth prod-sre Non-critical (sheddable) |
| ``` |
|
|
| When the simulator says "SCALE_UP node-0 by 0.5", the system: |
| 1. Looks up node-0 in the workload map -> `payments` in `prod-sre` |
| 2. Calls `patch_namespaced_deployment_scale("payments", "prod-sre", ...)` |
| 3. Kubernetes creates/destroys pods to match the new replica count |
|
|
| --- |
|
|
| ## What Runs Where (Complete List) |
|
|
| ### On Hugging Face Spaces |
|
|
| | Component | What It Does | Port | |
| |---|---|---| |
| | FastAPI server (`server/app.py`) | HTTP API for the agent | 7860 (via NGINX) | |
| | Simulator (`simulator.py`) | 5-node microservice cluster simulation | Internal | |
| | PrometheusClient (`telemetry/prometheus_client.py`) | Queries AMP for real metrics | Outbound HTTPS | |
| | KubernetesExecutor (`control/kubernetes_executor.py`) | Sends scale commands to EKS | Outbound HTTPS | |
| | Prometheus metrics exporter | Serves `/metrics` for HF's monitoring | 8000 | |
| | Grafana + local Prometheus | Local dashboards (from the Dockerfile) | 3000, 9090 | |
|
|
| ### On AWS EKS |
|
|
| | Component | Namespace | What It Does | |
| |---|---|---| |
| | payments Deployment | prod-sre | 2 nginx pods (scales with agent) | |
| | checkout Deployment | prod-sre | 1 nginx pod (scales with agent) | |
| | catalog Deployment | prod-sre | 1 nginx pod (scales with agent) | |
| | cart Deployment | prod-sre | 1 nginx pod (scales with agent) | |
| | auth Deployment | prod-sre | 1 nginx pod (scales with agent) | |
| | Prometheus Agent | monitoring | Scrapes workload pods, remote-writes to AMP | |
| | Cluster Autoscaler | kube-system | Adds/removes EC2 nodes based on demand | |
|
|
| ### On AWS Managed Services |
|
|
| | Service | What It Does | |
| |---|---| |
| | AMP (Amazon Managed Prometheus) | Stores all metrics. Queried by HF Spaces. | |
| | AMG (Amazon Managed Grafana) | Visualizes metrics in dashboards. Accessed via browser. | |
|
|
| --- |
|
|
| ## The Simulator vs Real Cluster |
|
|
| AntiAtropos has three modes controlled by `ANTIATROPOS_ENV_MODE`: |
|
|
| ### Simulated Mode (`simulated`) |
|
|
| Everything is fake. The simulator generates synthetic metrics (random CPU, latency, etc.). No K8s, no Prometheus. The agent practices in a safe sandbox. |
|
|
| This is the default on HF Spaces without AWS configured. |
|
|
| ### Hybrid Mode (`hybrid`) |
|
|
| The simulator runs, but it pulls real metrics from AMP to calibrate itself. If AMP says `payments` pods have 80% CPU, the simulator adjusts its internal model to match. The agent can read real data but actions only affect the simulator, not real pods. |
|
|
| ### Live Mode (`live`) |
|
|
| The real deal. The agent reads real metrics from AMP and sends real scale commands to EKS. When it says `SCALE_UP`, actual pods get created on actual EC2 instances that cost actual money. |
|
|
| **Set `ANTIATROPOS_ENV_MODE=live` on HF Spaces to enable this.** |
|
|
| --- |
|
|
| ## Cost Flow |
|
|
| Every pod on EKS costs money. Here is how costs flow based on the agent's actions: |
|
|
| ``` |
| Agent action: SCALE_UP node-0 |
| -> payments Deployment: replicas 2 -> 5 |
| -> 3 new pods created |
| -> If existing nodes are full, Cluster Autoscaler adds a node |
| -> New node = another t3.medium EC2 instance = ~$0.04/hr |
| -> 3 pods running = 3 x (0.1 CPU + 64MB RAM) from the quota |
| |
| Agent action: SCALE_DOWN node-3 |
| -> cart Deployment: replicas 4 -> 1 |
| -> 3 pods terminated |
| -> If nodes are now underutilized, Cluster Autoscaler removes a node (after 10 min) |
| -> One fewer EC2 instance = saves ~$0.04/hr |
| ``` |
|
|
| The Lyapunov reward function penalizes the agent for both instability AND cost, so a well-trained agent should learn to scale efficiently: |
|
|
| ``` |
| R_t = -(alpha * delta_V + beta * cost + gamma * SLA_violation) |
| ^^^^ |
| beta=0.01 penalizes over-provisioning |
| ``` |
|
|
| --- |
|
|
| ## Quick Reference: Key Files |
|
|
| | File | Purpose | |
| |---|---| |
| | `kubernetes_executor.py` | Translates agent actions to K8s API calls | |
| | `prometheus_client.py` | Queries AMP for real metrics | |
| | `simulator.py` | 5-node fluid-queue simulation | |
| | `stability.py` | Lyapunov reward computation | |
| | `deploy/aws/k8s-workloads.yaml` | The 5 Deployments + ResourceQuota on EKS | |
| | `deploy/aws/eksctl-cluster.yaml` | EKS cluster definition (nodes, caps) | |
| | `deploy/aws/prometheus-agent-values.yaml` | Helm config for Prometheus Agent | |
| | `deploy/aws/generate-kubeconfig.sh` | Creates kubeconfig for HF Spaces | |
|
|